When I graduated with a Master's in Statistics in the summer of 2011, I had never heard the term data science. Most of the world hadn't either; it only really picked up as a term in the following year. Now I can no longer just say that sort of sentence. I have to show some data.
In the intervening years, I've been working at an engineering firm, learning a lot about how buildings work: what sort of mechanical systems use power, how to pick a piece of glass with the right reflectivity and light and heat transmissivity, how to model wind flow, and all sorts of applied knowledge I never conceived of as a student. But I haven't been getting in on this data science action in my day job.
But I did study statistics and I like to play with data. While trying to turn my academic knowledge into something with real-world applicability, I've realized how off-base a lot of my education was. Let's start with undergraduate, where I took courses such as Abstract Algebra, Galois Theory, and Complex Analysis. I haven't come close to using anything from them. Even with multivariable calculus, one of the foundations of a math major, I can't remember Green's theorem and don't particularly care to. The statistics portion of my education was certainly more applied. The regression course I took junior year has proved invaluable, and learning to code and model in graduate school has been great. The most valuable lesson I learned was probably not to overfit the data when modelling: make sure your fundamental processes are right rather than your results. However, the fundamental processes of graduate statistics are flawed in this modern world. The courses are taught more like history courses. Painstaking attention is paid to how a theorem was discovered and proved, and professors are convinced these steps are crucial. Sure, I think there is something to be said for understanding the theory behind a model or function, but goodness gracious, we never use those proofs again. And we spent very, very little time working with real-world data, and never with any of the large datasets that have become so common. There's some balance to be struck between blindly learning how to use a tool and understanding the tool's entire backstory and manufacturing process.
And there's a ton of free data out there, but there's also my own data. There's stuff that's automatically tracked for me, like bank transactions, cell phone data, etc. I decided to play with my own Facebook data. Because Facebook's API is so restrictive, you can't really get at their data with a scraping script. So I collected my own data myself. Copy and paste. I went through all my statuses and took down the time of posting, the number of likes and comments, and then whether the status was a joke, pun, announcement, topical, language-related, a link, a check-in, a holiday, etc. I know, I'm ridiculous. But I really wanted to practice and not lose out on the value of my education. While at the Census Bureau, I read a lot of statistics papers that I no longer recall, but I also attended a very popular talk by Dr. Nathan Yau, at the time a Statistics PhD student who had just published a book on visualizing data. He lectured on the value of and techniques behind awesome data visualizations, and I was hooked. I bought his book and follow his blog (flowingdata.com). I still haven't come up with any graphics worthy of his blog, but I tried here. I started with a simple plot of post likes vs. time and colored the points differently based on some of the metrics I recorded.
The data is actually quite fun to play with. There are a few motivating variables to analyze. For starters, I want to see how often I pun, and whether puns tend to be my most popular posts. It turns out they're not! The graph to the left actually shows 6 variables. There's time on the x-axis and number of likes on the y-axis. Blue dots are puns. The size of the dot indicates the number of shares the post has had (most have 0 and a few have 1), and the color of the dots is different for posts judged to be "topical." Posts that are squares are picture posts. However, I feel pretty strongly now that most humans can really only grasp 4 variables at once. Yeah, if you stare long enough you can try to understand them all, but after 4 the mind really has to work. I redid the graph with a log y-axis, which I think looks a bit better but doesn't make the popular posts look as impressive. For the record, puns make up 18.6% of the last 3 years' posts.
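The two questions above (how often do I pun, and do puns get more likes?) come down to a share and a pair of averages. Here is a minimal sketch in Python rather than the R used for this post. The field names `likes` and `is_pun` and the five sample rows are made up for illustration; they are not the real dataset, so the numbers it prints say nothing about my actual posts.

```python
from statistics import mean

# Hypothetical sample rows (illustrative only). The real data was copied
# by hand from Facebook and also recorded time, comments, and more tags.
posts = [
    {"likes": 12, "is_pun": True},
    {"likes": 30, "is_pun": False},
    {"likes": 4, "is_pun": False},
    {"likes": 9, "is_pun": True},
    {"likes": 15, "is_pun": False},
]


def pun_share(rows):
    """Fraction of posts flagged as puns."""
    return sum(r["is_pun"] for r in rows) / len(rows)


def mean_likes(rows, is_pun):
    """Average likes for puns (is_pun=True) or for everything else."""
    return mean(r["likes"] for r in rows if r["is_pun"] == is_pun)


print(f"pun share: {pun_share(posts):.1%}")
print(f"mean likes, puns: {mean_likes(posts, True):.1f}")
print(f"mean likes, others: {mean_likes(posts, False):.1f}")
```

With real data you would load the rows from wherever the copy-and-paste ended up (a CSV, say) instead of typing them in.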
I also used the wordcloud package to pick out some of my most used words. This is a good package, and once I downloaded it I really didn't have to do much. The results are quite pleasing and cool. Note that <97> represents some Chinese character; I can't get Chinese character display working in R.
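A word cloud is just a picture of a word-frequency table, so the counting step is easy to sketch on its own. This is a rough Python illustration, not what the wordcloud package does internally. The `STOP_WORDS` set is a tiny made-up list, and the two sample statuses are adapted from ones quoted later in this post.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "i", "to", "of", "is", "it", "my"}


def word_counts(statuses):
    """Count lowercase words across all statuses, skipping stop words."""
    counts = Counter()
    for text in statuses:
        # Keeps only ASCII letters and apostrophes, so Chinese characters
        # are dropped here rather than displayed.
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts


statuses = [
    "Reading from a Kindle is not helping my shelf esteem.",
    "Georgetown and Syracuse had a rivalry. Georgetown won.",
]
print(word_counts(statuses).most_common(3))
```

The counts are what a word cloud scales its font sizes by; the drawing is the part the package saves you from.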
Well, those days are pretty even, but Sundays are fun days, aren't they? Sundays have the highest mean likes, but this graph shows that Thursdays have the highest median. The differences aren't very drastic though, and just looking at the sample variances one can see that they might not stand up to a test of robustness. Add in the fact that these times are all in Hong Kong time but not all posts were made there, and I wouldn't publish an academic paper advising Thursday posts. As a note, I'm a big fan of boxplots, but not everyone learns how to read them, and it seems like they might be a legacy of older statistics that'll seem too clunky in this new age.
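The mean and median can disagree like this because one very popular post drags the mean up without moving the median. Here is a small Python sketch of the weekday comparison. The six timestamps and like counts are invented to show the effect; they are not my real posts.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical (timestamp, likes) pairs, illustrative only.
posts = [
    ("2013-06-02 21:15", 40),  # Sunday, one big hit
    ("2013-06-09 10:00", 2),   # Sunday
    ("2013-06-16 09:30", 3),   # Sunday
    ("2013-06-06 12:00", 10),  # Thursday
    ("2013-06-13 18:45", 12),  # Thursday
    ("2013-06-20 08:05", 11),  # Thursday
]


def likes_by_weekday(rows):
    """Group like counts by the weekday name of each timestamp."""
    groups = defaultdict(list)
    for stamp, likes in rows:
        day_index = datetime.strptime(stamp, "%Y-%m-%d %H:%M").weekday()
        groups[DAY_NAMES[day_index]].append(likes)
    return groups


for day, likes in likes_by_weekday(posts).items():
    print(f"{day}: mean={mean(likes):.1f}, median={median(likes)}")
```

In this toy sample Sunday wins on the mean and Thursday wins on the median, which is the same pattern a boxplot makes visible at a glance.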
If you notice, though, none of what I've done has involved any sort of fancy modeling. Maybe my experience has been limited, but it seems most of the value of data science comes from relatively simple things. Most often at work I'm asked, "What's the average energy use for a tall office tower?" All I have to do is type in a simple query, but this is a service that simply wasn't available before we had the database. It doesn't involve any explanation to math illiterates, or model validation. As with my Facebook example, just having the database set up in the first place to allow these useful queries is the main step. This is primarily why the data science game is shifted towards computer scientists right now. The market demand is for getting data in all the right places rather than for advanced statistical modeling, so you need programmers to scrape data or design apps or programs that continually feed in usable data and store it. You'll need a few Statistics PhDs scattered around to come up with the original algorithms, but everyone else just has to learn how they work.
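To show how little is involved once the data is in one place, here is a sketch of that kind of "simple query" using Python's built-in `sqlite3` module. Everything about the schema is an assumption for illustration: the `buildings` table, its column names, the 40-floor cutoff for "tall", and the energy figures are all made up and have nothing to do with our actual database.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE buildings ("
    "name TEXT, building_type TEXT, floors INTEGER, energy_kwh_per_m2 REAL)"
)
conn.executemany(
    "INSERT INTO buildings VALUES (?, ?, ?, ?)",
    [
        ("Tower A", "office", 50, 200.0),
        ("Tower B", "office", 60, 300.0),
        ("Low-rise C", "office", 5, 120.0),
        ("Hotel D", "hotel", 45, 400.0),
    ],
)


def average_energy(conn, building_type, min_floors):
    """Average energy use for one building type at or above a floor count.

    Returns None when no building matches.
    """
    row = conn.execute(
        "SELECT AVG(energy_kwh_per_m2) FROM buildings "
        "WHERE building_type = ? AND floors >= ?",
        (building_type, min_floors),
    ).fetchone()
    return row[0]


# "Tall" is assumed here to mean 40 floors or more.
print(average_energy(conn, "office", 40))
```

The query itself is one line of SQL. The hard part was never the average; it was getting every building into the table.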
Here's one more graph with comments and likes together:
But anyway, I want to just post my own favorite statuses. These are not the ones with the most statistical properties, just my own faves:
- My mom asks me if I want a home sound system for my birthday - I tell her thanks but I'm not the stereo-type.
- I'm being asked at work to write an "inception report." I'm not sure what that means, but I hope it doesn't involve a report within a report.
- England is to football as Iraq is to civilization. Sure they might have invented it, but you could argue other countries are doing it better now.
- The Hong Kong Football Association reportedly paid about $30million HKD to get the Argentine National team to come play in tonight's friendly against the 164 ranked Hong Kong team. I guess this is the second time this month that the government has paid a group of men to beat up on Hong Kongers.
- Once upon a time, Georgetown and Syracuse had a rivalry. Georgetown won. The end.
- Sometimes I do my own crossword puzzles, which I no longer remember. And then when I solve a particularly clever clue, I feel pumped that I solved the clue and more pumped that I wrote it. #lowselfesteem
- Reading from a Kindle is not helping my shelf esteem.
- China doesn't have a Mount Rushmore, it just has a Mao-nt.
- Sometimes my phone says "Call Failed" and I think it says "Cal Failed" and I'm like come on, I really don't need you to rub it in.