Saturday, November 21, 2015

Personal Data Science

When I graduated with a Master's in Statistics in the summer of 2011, I had never heard of the term data science. Most of the world hadn't really either; it only picked up as a term in the year following. Now I can no longer just say that sort of sentence. I have to show some data.

In these intervening years, I've been working in an engineering firm learning a lot about how buildings work, what sort of mechanical systems use power, how to pick a piece of glass with the right reflectivity and light and heat transmissivity, how to model wind flow, and all sorts of applied knowledge I never conceived of as a student. But I haven't been getting in on this data science action in my day job.

But I did study statistics and I like to play with data. While trying to turn my academic knowledge into something with real world applicability, I've realized how off-base a lot of my education was. Let's start with undergraduate, where I took courses such as Abstract Algebra, Galois Theory, and Complex Analysis. I haven't even come close to using anything there at all. Even with multivariable calculus, one of the foundations of a math major, I can't remember Green's theorem and don't particularly care to. The statistics portion of our learning was certainly more applied. The regression course I took junior year has proved invaluable, and learning to code and model in graduate school has been great. The most valuable lesson I probably learned was to not overfit the data when modelling, to make sure your fundamental processes are right rather than your results. However, the fundamental processes of graduate statistics are flawed in this modern world - courses are taught more like history courses. Painstaking attention is paid to how a theorem was discovered and proved. Professors are convinced these steps are crucial. Sure, I think there is something to be said for understanding the theory behind a model or function, but goodness gracious we never use those proofs again. And we spend very, very little time working with real world data and never any of the large datasets that have become so common. There's some balance between blindly learning how to use a tool and understanding the tool's entire backstory and manufacturing process.

And there's a ton of free data out there, but there's also my own data. There's stuff that's automatically tracked for me, like bank transactions, cell phone data etc. I decided to play with my own Facebook data. Due to Facebook's stringent API, you can't really access their stuff with a scraping algorithm. So I collected my own data myself. Copy and paste. I went through all my statuses and took down the time of posting, the number of likes and comments, and then whether the status was a joke, pun, announcement, topical, language-related, a link, a check-in, a holiday etc. I know, I'm ridiculous. But I really wanted to practice and not lose out on the value of my education. While at the Census Bureau, I read a lot of statistics papers that I no longer recall, but I also attended a very popular talk by Dr. Nathan Yau, at the time a Statistics PhD student who had just published a book on visualizing data. He lectured on the value and techniques behind awesome data visualizations, and I was hooked. I bought his book and follow his blog (flowingdata.com). I still haven't come up with any graphics worthy of his blog, but I tried here. I started with a simple plot of post likes vs time and colored the points differently based on some of the metrics I recorded.

The data is actually quite fun to play with. There are a few motivating variables to analyze. For starters, I want to see how often I pun, and whether these tend to be the most popular posts. It turns out they're not! The graph to the left actually graphs 6 variables. There's time on the x axis and # of likes on the y axis. Blue dots are puns. The size of the dot indicates the number of shares the post has had (most have 0 and a few have 1), and the color of the dots is different for posts judged to be "topical." Posts that are squares are picture posts. However, I feel pretty strongly now that most humans can really only grasp 4 variables. Yeah, if you stare long enough you can try to understand them all, but after 4 the mind really has to work. I redid the graph with a log y-axis, which I think looks a bit better but doesn't make the popular posts look as impressive. For the record, puns make up 18.6% of the last 3 years' posts.
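
For anyone who wants to try the pun question on their own statuses, here is a minimal sketch of the two calculations involved: the share of posts tagged as puns, and the average likes for puns versus everything else. It is written in Python rather than the R used for the plots, and the rows, the `likes` and `is_pun` field names, and both helper functions are made up for illustration. None of the numbers come from my actual data.

```python
from statistics import mean

# Hypothetical rows standing in for hand-copied statuses.
# The field names and values are illustrative, not the real dataset.
statuses = [
    {"likes": 12, "is_pun": True},
    {"likes": 30, "is_pun": False},
    {"likes": 5, "is_pun": False},
    {"likes": 8, "is_pun": True},
    {"likes": 21, "is_pun": False},
]


def pun_share(rows):
    """Fraction of statuses tagged as puns."""
    return sum(r["is_pun"] for r in rows) / len(rows)


def mean_likes(rows, is_pun):
    """Average likes among statuses whose pun flag equals is_pun."""
    return mean(r["likes"] for r in rows if r["is_pun"] == is_pun)


print(f"pun share: {pun_share(statuses):.1%}")
print(f"mean likes, puns: {mean_likes(statuses, True):.1f}")
print(f"mean likes, non-puns: {mean_likes(statuses, False):.1f}")
```

With real data the same two helpers would apply to any of the other tags (topical, picture, link) by swapping the flag.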

I also used the wordcloud package to pick some of my most used words. This is a good package and once I downloaded it, I really didn't have to do much. The results are quite pleasing and cool. Note that <97> represents some Chinese character - I can't get Chinese character display working on my R.
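
Under the hood a word cloud is mostly a word-frequency count with the sizes scaled to match. The sketch below shows that counting step in Python using only the standard library. It is not the wordcloud package's implementation; the sample statuses, the tiny stopword list, and the `top_words` helper are all assumptions for illustration. Note that the regular expression keeps only ASCII letters, so it sidesteps the Chinese character problem rather than solving it.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "i", "to", "and", "is", "of", "it", "in"}


def top_words(texts, n=3):
    """Return the n most frequent non-stopword tokens across all texts."""
    words = []
    for text in texts:
        # Lowercase, then keep runs of ASCII letters and apostrophes only.
        tokens = re.findall(r"[a-z']+", text.lower())
        words.extend(t for t in tokens if t not in STOPWORDS)
    return Counter(words).most_common(n)


# Hypothetical statuses, not taken from the real data.
sample = [
    "Hong Kong is beautiful tonight",
    "Back in Hong Kong",
    "Georgetown won",
]
print(top_words(sample, 2))
```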

Well, those days are pretty even, but Sundays are fun days, aren't they? Sundays have the highest mean likes, but this graph shows that Thursdays have the highest median. The results aren't very drastic though, and just looking at the sample variances one can see that the differences might not stand up to a test of robustness. Add in the fact that these times are all Hong Kong based but not all posts were, and I wouldn't publish an academic paper advising Thursday posts. As a note, I'm a big fan of boxplots, but not everyone learns how to read them, and it seems like they might be a legacy of older statistics that'll seem too clunky in this new age.
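
The mean-versus-median split is easy to reproduce on toy data: one runaway post drags a day's mean up while barely moving its median. Here is a small Python sketch of the grouping, assuming posts are stored as (timestamp, likes) pairs. The timestamps, like counts, and function names are invented for illustration and are not my real numbers.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical (timestamp, likes) pairs, chosen so one Sunday post is an outlier.
posts = [
    ("2015-11-01 20:15", 40),  # Sunday
    ("2015-11-08 09:30", 2),   # Sunday
    ("2015-11-15 13:00", 3),   # Sunday
    ("2015-11-05 18:45", 10),  # Thursday
    ("2015-11-12 21:10", 12),  # Thursday
    ("2015-11-19 08:05", 11),  # Thursday
]


def likes_by_weekday(rows):
    """Group like counts by the weekday name of each timestamp."""
    grouped = defaultdict(list)
    for stamp, likes in rows:
        day = DAYS[datetime.strptime(stamp, "%Y-%m-%d %H:%M").weekday()]
        grouped[day].append(likes)
    return grouped


def summarize(rows):
    """Map each weekday to a (mean likes, median likes) pair."""
    return {day: (mean(v), median(v))
            for day, v in likes_by_weekday(rows).items()}


for day, (avg, mid) in summarize(posts).items():
    print(f"{day}: mean={avg:.1f}, median={mid}")
```

In this toy set Sunday wins on mean and Thursday wins on median, which is the same kind of disagreement the boxplot makes visible.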


If you notice though, none of what I've done has involved any sort of fancy modeling. Maybe my experience has been limited, but it seems most of the value of data science is relatively simple. Most often at work I'm asked "what's the average energy use for a tall office tower?" All I have to do is type in a simple query, but this is a service that simply wasn't available before we had the database. It doesn't involve any explanation to math illiterates, or model validation. With my Facebook example, just having the database set up in the first place to allow these useful queries is the main step. This is primarily why the data science game is shifted towards computer scientists right now. The market demand is to get data in all the right places rather than advanced statistical modeling, so you need programmers to scrape data or design apps or programs that continually feed in usable data and store it. You'll need a few Statistics PhDs scattered around to come up with the original algorithms, but everyone else just has to learn how they work.
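
To show how little is involved once the data is in one place, here is a minimal sketch of that kind of query using Python's built-in sqlite3 module. Everything about it is assumed for illustration: the `buildings` table, its column names, the sample rows, and the choice of 40 floors as the cutoff for "tall" are all hypothetical and have nothing to do with the actual database at work.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE buildings ("
    "name TEXT, building_type TEXT, floors INTEGER, energy_kwh_per_m2 REAL)"
)
conn.executemany(
    "INSERT INTO buildings VALUES (?, ?, ?, ?)",
    [
        ("Tower A", "office", 50, 200.0),
        ("Tower B", "office", 60, 250.0),
        ("Low-rise C", "office", 8, 150.0),
        ("Hotel D", "hotel", 45, 300.0),
    ],
)


def average_energy(conn, building_type, min_floors):
    """Average energy use for one building type at or above a floor count.

    Returns None when no building matches.
    """
    row = conn.execute(
        "SELECT AVG(energy_kwh_per_m2) FROM buildings "
        "WHERE building_type = ? AND floors >= ?",
        (building_type, min_floors),
    ).fetchone()
    return row[0]


# "Tall" is assumed here to mean 40 floors or more.
print(average_energy(conn, "office", 40))
```

The query itself is one line of SQL. The hard part is the part this sketch fakes: getting clean, consistent rows into the table to begin with.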

Here's one more graph with comments and likes together:

But anyway, I want to just post my own favorite statuses. These are not the ones with the most statistical properties, just my own faves:

  • My mom asks me if I want a home sound system for my birthday - I tell her thanks but I'm not the stereo-type.
  • I'm being asked at work to write an "inception report." I'm not sure what that means, but I hope it doesn't involve a report within a report.
  • England is to football as Iraq is to civilization. Sure they might have invented it, but you could argue other countries are doing it better now.
  • The Hong Kong Football Association reportedly paid about $30 million HKD to get the Argentine national team to come play in tonight's friendly against the 164th-ranked Hong Kong team. I guess this is the second time this month that the government has paid a group of men to beat up on Hong Kongers.
  • Once upon a time, Georgetown and Syracuse had a rivalry. Georgetown won. The end.
  • Sometimes I do my own crossword puzzles, which I no longer remember. And then when I solve a particularly clever clue, I feel pumped that I solved the clue and more pumped that I wrote it. #lowselfesteem
  • Reading from a Kindle is not helping my shelf esteem.
  • China doesn't have a Mount Rushmore, it just has a Mao-nt.
  • Sometimes my phone says "Call Failed" and I think it says "Cal Failed" and I'm like come on, I really don't need you to rub it in.

Sunday, November 8, 2015

Running Diary

The last week of October was the strangest week I've had in Hong Kong, moving out from my Kennedy Town apartment to a family farmhouse outside the Aberdeen tunnel isolated from urban convenience. This move was followed two days later by a stay at the hospital for a nasal surgery and then several days of recovery at another uncle's actual apartment. When I emerged from this strange week, I could breathe out my nose, my voice had changed, my sleep schedule had adapted to early morning Aberdeen buses, and I hadn't really hung out with people for over a week, an interminable length for an extrovert.

The following weekend that all changed. It was Tommy and Jana's long-awaited wedding day Saturday night, and Colin Erickson-Sheehy's farewell weekend in Shenzhen. In addition, I had forgotten that this was Georgetown's International Alumni Weekend hosted in Hong Kong. For me, this alumni event was merely a Google Calendar event but others were flying in from all over the world. Paul Tagliabue, Nancy Pelosi, Senator George Mitchell and President John DeGioia were among the appearances. How to prioritize?

I figured that while I would party with Colin the following weekend in Manila, I had really promised to come up to Shenzhen many times. And whenever I'd gone to Shenzhen to party, I'd not been disappointed. I couldn't make the Georgetown weekend's main events anyway because of the wedding, so I made my plan. When I learned I had a meeting Friday night with Swire, necessitating fancy clothes, and needed a followup doctor's appointment, well things got complicated. But I managed, and kept notes as well. I'll take off at the end of the meeting.

Friday, November 6, 2015
5:47pm: The meeting had started at 3:30 and at one point descended into a shitshow, but luckily I was able to leave at this time. My boss has sometimes been stuck in there until 7:30. I navigate out of the Swire buildings' footbridge system, which was incidentally what the meeting was about, and jump onto the MTR towards Central.
6:22pm: Enter Dr. Victor To's office. I'd said I'd be there at 5:45, but they always overbook anyways and turns out showing up late means I have to wait less time.
6:30pm: In fact I wait like 8 minutes. The doctor sticks a scope up my nose and proceeds to poke around in my sinus, and looks confused when I scream in pain.
6:40pm: Doctor decides to end my torture. Says my nose is a lot better but I still have to come back next Monday and Thursday and finish getting dried blood out of my sinus. I currently have undried blood flowing out my sinus. Tells me my sinus will stop bleeding in a few minutes.
6:45pm: Holding a tissue to my bloody nose, I decide to risk venturing out to Central.
7:09pm: Show up at the Watermark Cafe at the central piers. The event is more formal than I was expecting. Staff are tabling at the door handing out nametags. I'm glad I'm dressed in a suit instead of my typical casual Friday clothes, or my Shenzhen party gear. Someone is speaking already, which is frighteningly early for an event that started at 7:00pm. He's saying something about international development and is name dropping countries - Vietnam, Myanmar, Laos and Philippines - like he's announcing ASEAN bingo.
7:12pm: Cool Vincent Ko is here. I last saw him a year ago when he visited Hong Kong. Apparently he's in Ho Chi Minh City now.
7:15pm: Caroline Kwok is here. I last saw her last night. She is very, very excited about this weekend.
7:20pm: Jas Wee is here. I last saw her at the end of December, but she lives in Hong Kong so this is less excusable.
7:22pm: There's a minibar of hors d'oeuvres and I raid the hell out of it. There are Brie & Gorgonzola crackers which are better than anything I ever ate at Georgetown. My university definitely treats its alumni better than its student body.
8:10pm: I've had more glasses of champagne (3) than I've met people not working in finance.
8:40pm: Meet someone who studied full time at the SFS Qatar campus and knows two of my friends who worked there. He speaks Turkish, Arabic, Armenian and English fluently.
8:44pm: For the first time in my life, someone says "Hi this is Cal Lee and he's interested in going to Armenia." I'm not sure how that happened but it was definitely an escalation in conversation.
9:10pm: Find three other people from my year, Susie O'hare, Abby Zhang and Winston Wang, and talk about 27-year-old Hong Kong things.
9:30pm: Realize we've closed out the party. Head on downstairs and wait for the ferry across to TST, cause I mean it's right there.
9:49pm: Man Hong Kong is beautiful tonight. Everyone on the ferry seems to be a tourist.
10:05pm: Was going to meet up with Nick Tsao and go to Shenzhen together, but he seems way behind schedule. Decide I have time to drop off my fancy clothes at work and change. 
10:05pm: There are still 4 coworkers at work who don't seem to hate the fact that they're still working at 10pm on a Friday night. I tell them I'm slightly drunk and on my way to Shenzhen to get more drunk.
10:23pm: I drape my jacket, pants and tie over my chair and bounce down to the light rail line in tshirt and shorts.
11:05pm: I race through the border crossing at Lo Wu and fill my form in expert fashion. I have a new passport now but my China visa is in my old passport, so I expertly hand my passports to immigration open to the right places. The woman seems unfazed by this and stamps me through.
11:07pm: Swarmed by people offering black cab rides. Make the mistake of saying "screw off, I'm taking the metro." Informed the metro is closed. Guy follows me for 100 yards before I tell him he's bothering me and losing money.
11:10pm: Shit, the line for taxis is really long. Damnit China.
11:15pm: I actually offer my destination to a soliciting black cab driver. He asks me to name a price. I say 50 kuai and he laughs me off.
11:14pm: He comes back with 80 kuai and I laugh him off. Two beggars also approach aggressively and initiate physical contact. I have plenty of spare change but I don't appreciate the aggressiveness in a port of entry.
11:35pm: Taxi ride exposes my rusty Mandarin, like when I said diaozhuan instead of diaotou. Taxi driver comes within 100 yards of my destination before turning off into a side street for no clear reason. Drops me off and tells me it should be somewhere near here. Thanks shiji.
11:40pm: The bar is called Hawa. I run the last half block and excitedly see the name. I run up the stairs like a kid on Christmas morning and almost into a bar called Sugar. Apparently Hawa is the basement bar. I awkwardly walk back down the stairs and enter.
11:41pm: Colin is the first person I see. He totally was not expecting me to come and gives me a huge hug. He's like "we're heading off to KTV, but you can order a beer or something. Oh wait, someone ordered a rum and coke and didn't finish it. Perfect, you can have his leftovers."
11:43pm: I realize I've been here before. I've been to like 4 bars in Shenzhen, a city of 12 million, and this is one of them.
11:50pm: Everyone gathers outside to head to KTV. Matt Sexton hands me a bowl of punch with 3 straws.
12:05am: A KTV building is conveniently located across the street. There's a piano in the lobby and drunk people go and excitedly hit keys. Sexton's girlfriend Irene can actually play and starts Für Elise. The keys are hopelessly out of tune like they left the piano through a typhoon. Nonetheless, I beseech to play next and Sexton is like "Irene, let Cal play." I put the bowl of punch off to the side and start playing. It takes 15 bars but Glenn Cornell exclaims, "it's Piano Man!" I don't get far in the song because apparently the black keys are purely ornamental.
12:13am: We head upstairs to the real KTV action and there's another piano there. This one actually works. Colin calls on me to perform and for the first time in my life, I play Piano Man to an adoring audience. I miss half the notes.
12:17am: Colin opens up the KTV performance to Avril Lavigne's Complicated. Hits all the high notes. Bruno Mars follows.
12:20am: There's a proper mic stand and Glenn is all over it.
12:25am: The KTV has like a supermarket aisle to purchase singing help. We scoop up a couple dozen cans of cheap beer.
12:45am: The song choices are not randomized, which means the 3 songs I picked come one after another. I crush 2 of the Chinese songs I know and then One Thing by One Direction, which is like One Squared, which is also One.
12:54am: Finish my songs and decide to upgrade from the cheap beer. I was thinking of a going away present for Colin, so a bottle of whiskey during this session seems appropriate. While it might not make sense to pay 800 kuai for this bottle when I wouldn't pay 80 kuai for a cab ride earlier, it did to me at that point.
12:57am: The cashier is unable to process my credit card. She asks me if I got this credit card in Shenzhen. I race my mind trying to think if I've had other encounters where a Chinese credit card machine only took local credit cards. Would this machine process credit cards from Guangzhou? I don't think that's how credit cards work? The beer challenges this recollection process.
1:05am: Run outside to the nearest ATM. At some point I knew how to say ATM in Mandarin, and that they used a different word in Taiwan, but right now I just use 銀行. Run back with 500 kuai like I just robbed a very poor bank.
1:10am: The whiskey arrives and people are confused like there's been a mistake. Well if there was a mistake, I made it.
1:15am: Nobody is drinking the whiskey. Aggressively pour out 8 glasses of whiskey and coke and hand them to people.
1:16am: Oh God there is not enough coke in this glass.
1:33am: Lots of songs.
1:52am: Make Colin do a shot of whiskey with me. I remind him about that time I first met him and he wouldn't let me play with him.
2:08am: There's been like a run of this female pop star. I can't remember her name but it's all these catchy songs from the last few years. No, not Rihanna, a bit more annoying than her. No, more talented than Miley Cyrus. Less talented than Lady Gaga. Oh it's Katy Perry. My goodness we've sung so many Katy Perry songs.
2:20-2:45am: Complete black hole.
2:45am: Roldy says let's go.
Sometime in the middle of the night: Glenn comes out of his bedroom naked to take a piss. Sees me passed out on the floor and is like "ah who the hell is that?!"
9:30am: Wake up on the carpet of a strange apartment. Complete discombobulation lasts 3 seconds. Pleasantly surprised to find that phone is charged. Piss out about a gallon of processed alcohol from the Pearl River Delta region. Discover there's a mop in the bathroom.
11:50am: Roldy comes out of his room and wakes us up.
11:53am: Celine comes out of her room looking remarkably not dishevelled. Glenn comes out in a bathrobe. I suddenly realize how hungover I am. Drink two bottles of water.
12:50pm: Lunch in a classy cafe that serves quesadillas. Glenn and I act very American. I ask for 3 glasses of water.
1:15pm: Glenn reveals he only just learned that Tommy and Jana were dating.
1:50pm: Oh shit. How did I spend so much time at lunch? Shit I have to be at Hang Hau at 3:30pm to take the shuttle bus to the wedding in Clearwater Bay. And I kinda need to take a dump. No time for that.
2:00pm: Taxi reaches the Huanggang border. Oh I had found 35 kuai in coins while cleaning out my apartment. I pay the driver in coins. He is not amused.
2:02pm: Line to cross immigration is depressingly long. Mentally calculate every minute of my journey time to my office, changing time, and time to Clearwater Bay. I think I need to take some taxis. This is that border crossing that is so long you need to take a bus in between.
2:23pm: Reach the Hong Kong side and get on a Mong Kok bound bus. The hangover is quite real. I begin to have my first regrets of going to Shenzhen. Still need to take a dump.
3:00pm: Still on the bus. Concede shuttle bus defeat.
3:02pm: Man I really need to take a dump. But I realize I need to take out cash for the wedding Lai See present. Balls. Redirect route planning to the HSBC ATMs in Pioneer Centre because my office building doesn't have an ATM except inside the MTR station.
3:05pm: Bus finally lands and I run off. Almost vomit on the streets of Mong Kok but rein it in. Dash to the Pioneer Centre.
3:08pm: There are 15 people in line for the ATM. WTF. Immediate regrets not going into the MTR station ATM.
3:27pm: Finally make it back to my office. Take my clothes into the bathroom. Details will be omitted here but let's just say I took that dump and changed into my suit within 4 minutes.
3:31pm: Put on my American tie for the first time. Tie was given as an award in 2012 and is so ridiculous looking that I've never worn it. Crush a bottle of cranberry juice that I have at work and run down to get a taxi.
4:00pm: Taxi ride to Clearwater Bay Golf & Country Club is very long, and there are lots of winding roads once you leave civilization. The hangover intensifies exponentially with car sickness. Stealthily lower the window slightly.
4:10pm: Reach the country club and stumble out of the taxi. Try to gauge whether I can make it to a bathroom. Can't. I make it as far as a small bush next to the entrance before yakking. The cranberry juice comes right out. Hold my tie against me as I throw up more water.
4:11pm: I'm still throwing up. Food is coming out now. This is a low moment in my life.
4:12pm: Finally done throwing up. Turn around and see Clay Carol and Kim Alexandersen staring at me. "Are you ok? We thought you were an elderly man suffering a heart attack or something."
4:13pm: Wedding time.