Tuesday, January 5, 2016

Travels Visualized

If you've read my post on visualizing my Facebook data, you may have learned that I have this odd hobby of playing with personal data. However, you are unlikely to know that I have also kept a spreadsheet of where I have spent every night since 2008. But it's true, and it hasn't even been remotely onerous. I have noted the date on which I entered a city and the date on which I left. I've kept the detail mainly to the city level, so there is no record of meanderings through urban bedrooms, if that had theoretically happened. There is also no record of cities I visited on day trips without spending the night, so sorry Nara, Krebs and Brookline. I did not stress about whether I reached a city before or after midnight: with the exception of red-eye flights, if I flew out of Hong Kong on a Friday night and reached my room in Taipei at 2am Saturday morning, I recorded myself as spending Friday night in Taipei.

I started this list in the margins of my notebook during a geography lecture at University College Dublin in the fall of 2008, mainly because I was bored in class. It was far more interesting to think about the places I'd traveled to in that very epic year of 2008. From memory I filled in most of the dates, and eventually I went through my Gmail archives and completed the whole year. Filling in 2007 and earlier would not be possible for lack of memory and records. But starting from that fall, I created a spreadsheet that has since survived four computers and is now safely in the cloud. I started this blog in 2008, and I feel I became a very different person starting from that year, so this data feels fitting.

For me, trawling back through this data brings pure joy. I had never planned on analyzing it as I am doing now - just scrolling through it had been enough. I'm not sure anyone else would enjoy going through their own life so much, and they certainly wouldn't find it so interesting going over mine. But these data visualizations take me straight back to trips I had forgotten.

In terms of the visualization nitty-gritty, I had a lot of cleaning up to do. I hadn't even heard of R when I began these records, so I had given no thought to how I should format the data - the schema of the database, in technical speak.
Once I fixed name inconsistencies and date formatting, I decided to focus on the chronological part of the data first. I found the ten cities I'd spent the most time in, then realized many cities were tied and expanded the list to the top 15. Without worrying about a y-axis at all, I plotted the dates I'd been in these cities and assigned them a nice color palette.
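
For the record-keeping-inclined, here's roughly what that first plot looks like in code. It's a minimal sketch, assuming the spreadsheet has been expanded into a data frame called stays with one row per night, a Date column date and a character column city (my actual columns are arrival and departure dates, so there's an expansion step I'm glossing over).

    # Minimal sketch of the timeline plot: top 15 cities by nights spent,
    # plotted as dots over time, stacked in order of first appearance.
    library(ggplot2)

    top_cities <- names(sort(table(stays$city), decreasing = TRUE))[1:15]
    top_stays  <- subset(stays, city %in% top_cities)

    # Order the cities by the date I first slept in them (low to high)
    first_seen <- tapply(top_stays$date, top_stays$city, min)
    top_stays$city <- factor(top_stays$city, levels = names(sort(first_seen)))

    ggplot(top_stays, aes(x = date, y = city, colour = city)) +
      geom_point(size = 2) +
      scale_colour_manual(values = rainbow(nlevels(top_stays$city))) +  # any palette will do
      theme(legend.position = "none")
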
And immediately I was pleased. As familiar as I am with my own life story, I could see a lot of stories in those dots. I was at first surprised by which cities made the cut. Civate and Osaka are on there on the strength of one trip each, both for Worlds ultimate tournaments. Ultimate tournament trips to Manila are also clearly regular, spaced out evenly starting in 2012. My move in late 2011 from DC, where trips to New York and Newton (my hometown outside Boston) were frequent, to Hong Kong is also quite obvious. My lengthier stays in Dublin and Beijing, well documented on this blog, are also visible. Irregular trips to Shanghai, Taipei, Shenzhen and Bangkok pop up. Lastly, I may never spend another night in Newton now that we've sold our family home; nights in the city of Boston show up instead.

These cities appear low to high in order of first appearance (starting from 2008), which is really quite arbitrary. I played around with making the order completely random, and I actually liked that better: in the original version, all the long stays sit at the bottom and the top looks very bare, giving the image a sense of imbalance. My friend points out that at this point I've entered data art, because the randomization serves no functional purpose. Here is that graph with even more cities.
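
The randomization itself is essentially a one-liner, continuing from the sketch above:

    # Shuffle the vertical order of the cities -- pure data art, no function
    set.seed(2016)  # any seed, just so the "random" order is repeatable
    top_stays$city <- factor(top_stays$city, levels = sample(levels(top_stays$city)))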

Now this has more little stories, but it might be too cluttered, and I don't think I could possibly fit all of my stops in there. Because I go back to places multiple times, I don't know how to give the cities a sense of chronological order; neither graph really conveys the sequence of trips.

Skipping past that thought, I considered putting latitude on the y-axis. As the data points were all geographic, this was the logical next step, though if you didn't recognize the names of these cities, the graph would lose a lot of meaning. Adding coordinates was tricky, however. I hadn't been recording them along the way, but the maps package in R comes with a nice database of world cities, with population and coordinate information for just about every city of more than 40,000 people in the world (as of 2006). The naming of some of the cities is bizarre, though: full of colonial-era names (Rangoon, Bombay), out-of-vogue spelling choices (Cracow, Soul), and a full pinyin rendering of Macau and Hong Kong, with Hong Kong split into Sai Kung, Kowloon and Hong Kong Island (xigong, jiulong and xianggangdao). Some of my travels had also taken me to places of fewer than 40,000 people. For the names that didn't match, I didn't know of any better solution than editing them individually; if there is a more sensible database out there, please let me know. For the cities not in the database, I googled their coordinates. I'm not quite a good enough programmer to build a scraper to do this automatically, but I aspire to get to that level soon. Here's the same graph re-run with latitude on the y-axis. I added the names by hand where I saw fit, otherwise they'd bleed over each other.
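
In case it helps anyone attempting the same thing, here's a sketch of the lookup against that world.cities database, again assuming the stays data frame from earlier; everything that comes back NA got fixed by hand.

    library(maps)
    data(world.cities)   # columns: name, country.etc, pop, lat, long, capital

    my_cities <- unique(stays$city)
    coords    <- world.cities[match(my_cities, world.cities$name),
                              c("name", "lat", "long")]

    # Names that didn't match (odd spellings, or towns under ~40,000 people)
    my_cities[is.na(coords$name)]

    # Beware: match() takes the first hit for duplicated names, which is
    # exactly how a Littleton in the wrong state can sneak in.
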
This graph doesn't do much for me. It's very cluttered, and it's hard to organize time and space coordinates together in my mind. The points give the illusion of coordinates, but they're only one-dimensional coordinates. It is interesting to see that so many of the places I've been lie on nearly the same latitude (Beijing is almost level with New York), and that places which actually are near each other (Hong Kong and Shenzhen, Newton and Boston) now appear that way. It's also cool to see that these cities range from as far south as Bangkok to as far north as Dublin, a comparison I'd never made on my own. Still, I think there is limited use to this visualization attempt.

Well, with all the coordinates in hand, it was time to put them on a real map. I've learned that with maps it's easy to control the size and color of overlaid points. R has some built-in plotting functions, which I'd been using for five years, and they're pretty great. But people serious about data visualization seem to use a lot of ggplot2 and ggmap - the former developed by the Kiwi statistician Hadley Wickham, who should probably be knighted, and the latter built on top of it by David Kahle and Wickham. I decided it was high time to learn some ggmap. Here's my first attempt:

This map visualization is a much more classic representation and was equally delightful to me. I particularly liked the lack of borders in this version; it requires more deductive effort to identify all the points. I colored the points by the year I first visited each place - blue represents 2008 and takes up a lot of real estate - and set the size to the number of days I spent in each city. Note that the sizes are not scaled linearly, and in fact I haven't figured out how to control this properly. I'm actually rather alright with the outcome, though, as otherwise Hong Kong and Washington, DC would drown out the other cities. I could see more stories here, such as two separate trips to Europe, and the "outlier" travels of 2010 that took me to India and Peru. I saw two points from 2008 in the western US and was confused about what they might represent until I remembered I went to Las Vegas and Los Angeles many months apart that year. This is the power of visualization: the ability to convey so much more in so little space. I could also spot a bug in my data in a black dot in western America - investigating further, I'd entered the coordinates for Littleton, Colorado instead of Littleton, New Hampshire by mistake. Some of the years were also incorrect.
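
For anyone curious, the plot boils down to something like the sketch below. I've drawn it here with ggplot2's built-in world outline rather than ggmap tiles, and it assumes a per-city summary data frame called visits with columns lat, long, days (nights spent) and first_year (year of first visit) - names I've made up for the sketch.

    library(ggplot2)
    library(maps)   # supplies the world outline used by map_data()

    world <- map_data("world")

    ggplot() +
      geom_polygon(data = world, aes(x = long, y = lat, group = group),
                   fill = "grey90", colour = NA) +   # landmass only, no border lines
      geom_point(data = visits,
                 aes(x = long, y = lat, size = days, colour = factor(first_year)),
                 alpha = 0.7) +
      scale_size_area(max_size = 10) +   # point *area* proportional to days spent
      coord_quickmap() +
      labs(colour = "First visit", size = "Nights")

As far as I can tell, scale_size_area() (or scale_size() with an explicit range) is the knob for controlling how point sizes scale, which I hadn't figured out at the time.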

My second go at the map fixed these bugs and added borders just for kicks. I don't really like these heavy borders, which seem to drown out and minimize my destinations, but for the most part I was very happy with this map. All I needed was a legend. That turned out to be way harder than I realized, because I had been using ggmap incorrectly this whole time. Hours of debugging later, I finally converted my years into factors and got a legend to pop up.
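
The fix itself is anticlimactic - essentially one line, using the same made-up column names as the sketch above:

    # A numeric year gives a continuous colour bar; a factor gives one
    # discrete legend entry per year, which is what I wanted.
    visits$first_year <- factor(visits$first_year)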

As a conclusion, these Travels of Cal are about to get a lot more interesting. After four years and three months at Arup in Hong Kong, my last day will be January 8, 2016. I will be adding some data points in Southeast Asia and building on these visualizations. Rest assured, there's a greater purpose here than documenting my own life.


Saturday, November 21, 2015

Personal Data Science

When I graduated with a Master's in Statistics in the summer of 2011, I had never heard of the term data science. Most of the world hadn't really either; it only picked up as a term in the year that followed. But now I can no longer just say that sort of sentence - I have to show some data.

In the intervening years, I've been working at an engineering firm, learning a lot about how buildings work: what sorts of mechanical systems use power, how to pick a piece of glass with the right reflectivity and light and heat transmissivity, how to model wind flow, and all sorts of applied knowledge I never conceived of as a student. But I haven't been getting in on this data science action in my day job.

But I did study statistics and I like to play with data. While trying to turn my academic knowledge into something with real-world applicability, I've realized how off-base a lot of my education was. Let's start with undergrad, where I took courses such as Abstract Algebra, Galois Theory, and Complex Analysis. I haven't come even close to using any of that. Even with multivariable calculus, one of the foundations of a math major, I can't remember Green's theorem and don't particularly care to. The statistics portion of my learning was certainly more applied: the regression course I took junior year has proved invaluable, and learning to code and model in graduate school has been great. The most valuable lesson I learned was probably not to overfit the data when modelling - to make sure your fundamental process is right rather than just your results. Yet the fundamental processes of graduate statistics are flawed in this modern world; courses are taught more like history courses. Painstaking attention is paid to how a theorem was discovered and proved, and professors are convinced these steps are crucial. Sure, there is something to be said for understanding the theory behind a model or function, but goodness gracious, we never use those proofs again. And we spend very, very little time working with real-world data, and never with the large datasets that have become so common. There's a balance to be struck between blindly learning how to use a tool and understanding the tool's entire backstory and manufacturing process.

And there's a ton of free data out there, but there's also my own data - stuff that's automatically kept track of for me, like bank transactions, cell phone records and so on. I decided to play with my own Facebook data. Because of Facebook's stringent API restrictions, you can't really get at your history with a scraping algorithm, so I took my own data myself. Copy and paste. I went through all my statuses and took down the time of posting, the number of likes and comments, and then whether the status was a joke, a pun, an announcement, topical, language-related, a link, a check-in, a holiday post, etc. I know, I'm ridiculous. But I really wanted to practice and not lose the value of my education. While at the Census Bureau, I read a lot of statistics papers that I no longer recall, but I also attended a very popular talk by Dr. Nathan Yau, at the time a Statistics PhD student who had just published a book on visualizing data. He lectured on the value of and techniques behind awesome data visualizations, and I was hooked. I bought his book and follow his blog (flowingdata.com). I still haven't come up with any graphics worthy of his blog, but I tried here. I started with a simple plot of post likes vs. time and colored the points differently based on some of the metrics I recorded.

The data is actually quite fun to play with, and there are a few motivating variables to analyze. For starters, I wanted to see how often I pun, and whether puns tend to be the most popular posts. It turns out they're not! The graph to the left actually encodes six variables: time on the x-axis, number of likes on the y-axis, blue dots for puns, dot size for the number of shares the post has had (most have 0 and a few have 1), a different color for posts judged to be "topical", and squares for picture posts. However, I feel pretty strongly now that most humans can really only grasp four variables. Sure, if you stare long enough you can try to understand them all, but after four the mind really has to work. I redid the graph with a log y-axis, which I think looks a bit better but doesn't make the popular posts look as impressive. For the record, puns make up 18.6% of the last three years' posts.
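
As a rough sketch (with made-up column names - date, likes, shares, is_pun, topical, is_picture - standing in for however you record your own posts), the six-variable version looks something like this:

    library(ggplot2)

    ggplot(posts, aes(x = date, y = likes)) +
      geom_point(aes(size   = shares,                        # most are 0, a few are 1
                     colour = interaction(is_pun, topical),  # pun / topical combinations
                     shape  = is_picture)) +
      scale_shape_manual(values = c(16, 15)) +  # circles normally, filled squares for pictures
      scale_y_log10() +   # the log-axis version; zero-like posts drop off the chart
      labs(colour = "pun / topical", shape = "picture", size = "shares")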

I also used the wordcloud package to pick out some of my most-used words. It's a good package, and once I downloaded it I really didn't have to do much; the results are quite pleasing and cool. Note that <97> represents some Chinese character - I can't get Chinese characters displaying in my R setup.
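
For reference, the whole thing is about three lines; status_text here is an assumed character vector with one status per element.

    library(wordcloud)
    library(tm)            # wordcloud leans on tm to tokenise raw text
    library(RColorBrewer)  # for the brewer.pal colours

    wordcloud(status_text, min.freq = 3, random.order = FALSE,
              colors = brewer.pal(8, "Dark2"))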

Well, those days are pretty even, but Sundays are fun days, aren't they? Sundays have the highest mean likes, but this graph shows that Thursdays have the highest median. The differences aren't very drastic, though, and just looking at the sample variances you can see they might not stand up to a significance test. Add in the fact that these times are all Hong Kong based while not all posts were, and I wouldn't publish an academic paper advising Thursday posts. As a note, I'm a big fan of boxplots, but not everyone learns how to read them, and it seems like they might be a legacy of older statistics that will feel too clunky in this new age.
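
The boxplot itself is plain base R; posts$date is assumed to be a Date column (and weekdays() only spells out English day names if your locale does).

    # Likes per post by day of week
    posts$weekday <- factor(weekdays(posts$date),
                            levels = c("Monday", "Tuesday", "Wednesday",
                                       "Thursday", "Friday", "Saturday", "Sunday"))
    boxplot(likes ~ weekday, data = posts, ylab = "Likes per post")

    # The numbers behind the mean-vs-median quibble above
    tapply(posts$likes, posts$weekday, mean)
    tapply(posts$likes, posts$weekday, median)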


If you notice, though, none of what I've done has involved any sort of fancy modeling. Maybe my experience has been limited, but it seems most of the value of data science is relatively simple. Most often at work I'm asked, "What's the average energy use for a tall office tower?" All I have to do is type a simple query, but this is a service that simply wasn't available before we had the database. It doesn't involve any explanation to the math-illiterate, or any model validation. With my Facebook example, just having the database set up in the first place to allow these useful queries is the main step. This is primarily why the data science game is skewed towards computer scientists right now: the market demand is for getting data into all the right places rather than for advanced statistical modeling, so you need programmers to scrape data or to design apps and programs that continually feed in usable data and store it. You'll need a few Statistics PhDs scattered around to come up with the original algorithms, but everyone else just has to learn how they work.
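
To be concrete about what I mean by a "simple query," here's the flavor of it, against a completely hypothetical buildings table (none of these column names are real):

    # Average energy use intensity for tall office buildings, from an assumed
    # data frame `buildings` with columns type, height_m and annual_kwh_per_m2
    mean(buildings$annual_kwh_per_m2[buildings$type == "office" &
                                       buildings$height_m > 150],
         na.rm = TRUE)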

Here's one more graph with comments and likes together:

But anyway, I want to post a few of my own favorite statuses. These are not the statistically most popular ones, just my own faves:

  • My mom asks me if I want a home sound system for my birthday - I tell her thanks but I'm not the stereo-type.
  • I'm being asked at work to write an "inception report." I'm not sure what that means, but I hope it doesn't involve a report within a report.
  • England is to football as Iraq is to civilization. Sure they might have invented it, but you could argue other countries are doing it better now.
  • The Hong Kong Football Association reportedly paid about $30 million HKD to get the Argentine national team to come play in tonight's friendly against the 164th-ranked Hong Kong team. I guess this is the second time this month that the government has paid a group of men to beat up on Hong Kongers.
  • Once upon a time, Georgetown and Syracuse had a rivalry. Georgetown won. The end.
  • Sometimes I do my own crossword puzzles, which I no longer remember. And then when I solve a particularly clever clue, I feel pumped that I solved the clue and more pumped that I wrote it. #lowselfesteem
  • Reading from a Kindle is not helping my shelf esteem.
  • China doesn't have a Mount Rushmore, it just has a Mao-nt.
  • Sometimes my phone says "Call Failed" and I think it says "Cal Failed" and I'm like come on, I really don't need you to rub it in.