Tuesday, January 5, 2016

Travels Visualized

If you've read my post on my Data visualization of Facebook data, you may have learned that I have this odd hobby of playing with personal data. However, you are unlikely to know that I have kept a spreadsheet of where I have spent every night since 2008. But it's true, and it hasn't even been remotely onerous. I have noted the date on which I entered a city and the date on which I left. I've kept the detail mainly to the city level, so there is no record of meanderings throughout urban bedrooms, if that had theoretically happened. There is no record of cities I've visited on day trips without spending the night, so sorry Nara, Krebs and Brookline. I also did not stress about whether I reached a city before or after midnight. Red-eye flights aside, if I flew from Hong Kong on a Friday night and reached my room in Taipei at 2am Saturday morning, I recorded myself as spending Friday night in Taipei.

I started this list in the margins of my notebook in a Geography lecture at University College Dublin in the fall of 2008, mainly because I was bored in class. It was far more interesting for me to think about the places I'd traveled to in that very epic year of 2008. Through memory I filled in most of the dates, and eventually I went through my Gmail archives and filled in my whole year. It would not be possible for me to fill in 2007 and before because of lack of memory and records. But starting from that fall, I created a spreadsheet and it has survived 4 computers and is now safely in the cloud. I started this blog in 2008 and I feel that I became a very different person starting from that year, so I find this data very fitting.

For me, trawling back through this data just brings me pure joy. I had not planned on ever analyzing the data as I am doing now - just scrolling through it had been enough. I'm not sure if anyone else would enjoy it so much going through their own life, and they certainly wouldn't find it so interesting going over mine. But these data visualizations directly take me back to trips I had forgotten.
In terms of the visualization nitty gritty, I had a lot of cleaning up to do. I hadn't even heard of R when I began these records, so I hadn't really given a thought to how I should format the data, or to the schema of the database in technical speak.
Once I fixed name inconsistencies and date formatting, I decided to focus on the chronological part of my data first. I found the ten cities I'd spent the most time in, then realized many cities were tied and expanded the list to the top 15. Without worrying about a y-axis at all, I plotted the dates I'd been in these cities and assigned a nice color palette to them.
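The counting step described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R used for the plots, and it assumes each spreadsheet row is a (city, date entered, date left) triple. The sample rows and the function names `nights_per_city` and `top_with_ties` are hypothetical. Instead of widening the list to a fixed 15, this variant keeps every city tied with the last place, which is one possible way to handle ties.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows: (city, date entered, date left).
stays = [
    ("Hong Kong", date(2012, 1, 1), date(2012, 1, 20)),
    ("Manila", date(2012, 1, 20), date(2012, 1, 23)),
    ("Hong Kong", date(2012, 1, 23), date(2012, 2, 1)),
    ("Taipei", date(2012, 2, 1), date(2012, 2, 4)),
    ("Bangkok", date(2012, 2, 4), date(2012, 2, 6)),
]

def nights_per_city(rows):
    """Sum the nights of every stay, grouped by city."""
    totals = Counter()
    for city, entered, left in rows:
        # Nights spent = calendar days between arrival and departure.
        totals[city] += (left - entered).days
    return totals

def top_with_ties(totals, n):
    """Return the top n cities by nights, plus any city tied with the nth."""
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]
    return [(city, nights) for city, nights in ranked if nights >= cutoff]

totals = nights_per_city(stays)
for city, nights in top_with_ties(totals, 2):
    print(city, nights)
```

Asking for the top 2 here returns three cities, because Manila and Taipei are tied on nights.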
And immediately I was pleased. As familiar with my own life story as I am, I could see a lot of stories in those dots. I was at first surprised to see which cities made it. Civate and Osaka are on there on the strength of one trip each, both for Worlds ultimate tournaments. Ultimate tournament trips to Manila are also clearly regular, spaced out evenly starting in 2012. My move in late 2011 from DC, from which I made trips to New York and Newton (my hometown outside Boston), to Hong Kong is also quite obvious. My lengthier stays in Dublin and Beijing, which are well-documented in this blog, are also visible. Irregular trips to Shanghai, Taipei, Shenzhen and Bangkok pop up. Lastly, I may never spend another night in Newton after we sold our family home, and nights in the city of Boston show up instead.

These cities appear low to high in order of appearance (starting from 2008), which is really quite arbitrary. I played around with making the order completely random. I actually liked that better, because in the original version, all the long stays are at the bottom and the top seems very bare, giving the image a sense of imbalance.  My friend points out that at this point I enter data art, because the randomization serves no functional purpose. Here is that graph with even more cities.
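The two orderings compared above amount to two ways of assigning each city a row on the y-axis. Here is a minimal sketch, again in Python rather than R. The city order in the list and the helper name `row_positions` are hypothetical, and fixing a seed is an added assumption so that the "random" layout can be reproduced.

```python
import random

# Hypothetical city list, assumed already sorted by date of first appearance.
cities_by_first_visit = ["Dublin", "Beijing", "Washington, DC", "Hong Kong", "Manila"]

def row_positions(cities, shuffle=False, seed=None):
    """Map each city to a y-axis row, optionally in shuffled order."""
    order = list(cities)  # copy so the caller's list is left untouched
    if shuffle:
        # A seeded generator makes the shuffled layout reproducible.
        random.Random(seed).shuffle(order)
    return {city: row for row, city in enumerate(order)}

chronological = row_positions(cities_by_first_visit)
shuffled = row_positions(cities_by_first_visit, shuffle=True, seed=2016)
print(chronological)
print(shuffled)
```

Rerunning with the same seed gives the same picture, and changing the seed is a cheap way to audition different layouts.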

Now this has more little stories and might be too cluttered. I don't think I could possibly fit all of my stops in there. Because I go back to places multiple times, I don't know how to provide a sense of chronological order to the cities. There isn't a sense in either graph really of many sequential trips.

Setting this thought aside, I considered graphing the latitudes on the y-axis. As the data points were all geographic, this was the logical next step. If you don't recognize the names of these cities, the graphs so far lose a lot of meaning. However, adding longitude and latitude coordinates was tricky. I hadn't been recording that data along the way, but the maps package in R comes with a nice database of world cities. The database comes with population and coordinate information for just about every city of more than 40,000 people in the world (as of 2006). However, the naming of some of the cities is bizarre, full of colonial-era names (Rangoon, Bombay), out-of-vogue spelling choices (Cracow, Soul), and a full pinyin rendering of Macau and Hong Kong, with Hong Kong being split into Sai Kung, Kowloon and Hong Kong Island (xigong, jiulong and xianggangdao). Some of my travels had also taken me to places of fewer than 40,000 people. For the names that didn't match, I didn't know of any better solution than editing all the names individually. If there is a more common sense laden database out there, please let me know. For the cities not in the database, I googled their coordinates. I'm not quite a good enough programmer to build a scraper to do this automatically, but I aspire to get to that level soon. Here's the same graph re-run with latitudes on the y-axis. I added in the names by hand where I saw fit, otherwise they'd bleed over each other.
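One alternative to editing each name in place is to keep a small alias table and apply it at lookup time. The following is a minimal sketch of that idea in Python, not the approach taken for these graphs. The `gazetteer` dict is a hypothetical stand-in for the city database, keyed by the spellings mentioned above, with rough illustrative coordinates that are not copied from the real dataset. The names `aliases` and `lookup` are likewise made up for the example.

```python
# Hypothetical stand-in for the city database: name -> (latitude, longitude).
# Coordinates are rough illustrative values, not taken from the real dataset.
gazetteer = {
    "Rangoon": (16.8, 96.2),
    "Bombay": (19.0, 72.8),
    "Cracow": (50.1, 19.9),
    "Soul": (37.6, 127.0),
    "Dublin": (53.3, -6.3),
}

# My spelling -> the database's spelling.
aliases = {
    "Yangon": "Rangoon",
    "Mumbai": "Bombay",
    "Krakow": "Cracow",
    "Seoul": "Soul",
}

def lookup(city):
    """Return (lat, lon) for a city, trying the alias table first."""
    return gazetteer.get(aliases.get(city, city))

my_cities = ["Seoul", "Dublin", "Krakow", "Civate"]
coords = {city: lookup(city) for city in my_cities}

# Cities still unmatched need their coordinates looked up by hand.
missing = [city for city, latlon in coords.items() if latlon is None]
print(coords)
print("Look up manually:", missing)
```

The original spreadsheet stays untouched, and the `missing` list doubles as a to-do list for the small towns the database leaves out.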
This graph doesn't do much for me. It's very cluttered and it's hard to organize time and space coordinates together in my mind. The points give the illusion of being map coordinates, but they're not: they're coordinates in only one dimension. It is interesting to see that so many of the places I've been are on the same latitude (Beijing is nearly the same as New York), and that places that actually are near each other (Hong Kong and Shenzhen, Newton and Boston) now appear so on the graph. And it's also cool to see that these cities range from as far south as Bangkok to as far north as Dublin, a thought I'd never had on my own. Still, I think there is limited use to this visualization attempt.

Well, with all the coordinates in hand, it was time to put them on a real map. I've learned that with maps it's easy to control the size and color of overlaid points. R has some built-in plotting functions, which I'd been using for 5 years, and they're pretty great. But people serious about data visualization seem to use a lot of ggplot2, developed by the Kiwi statistician Hadley Wickham, who should probably be knighted, and ggmap, which he co-authored with David Kahle. I decided it was high time to learn some ggmap. Here's my first attempt:

This map visualization is much more of a classic representation and was equally delightful to me. I particularly liked the lack of borders in this version. It requires more deductive effort to identify all the points. I decided to color the points by the year that I first visited that point. Blue represents 2008 and takes up a lot of real estate. The size was set to the number of days I spent in each city. Note that the sizes are not scaled linearly, and in fact I haven't figured out how to control this properly. I actually am rather alright with the outcome though, otherwise Hong Kong and Washington, DC would drown out other cities. I could see more stories here, such as two separate trips to Europe. I could see the "outlier" travels of 2010 that took me to India and Peru. I saw two points from 2008 in the western US and was confused as to what they might represent until I remembered I went to Las Vegas and Los Angeles many months apart that year. This truly shows the power of visualization: the ability to convey so much more in so little space. I could also see a bug in my data in this black dot in western America. Investigating further, I'd inputted the coordinates for Littleton, Colorado instead of Littleton, New Hampshire by mistake. Some of the years were also incorrect.
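The two values behind each point, the year of first visit and the total days, can be sketched as a per-city summary. This is a minimal Python illustration, not the R code behind the map. The sample rows and the function names `summarize` and `point_radius` are hypothetical. The square-root scaling shown is one common way to stop the longest stays from swamping the rest (it makes a point's area, rather than its radius, proportional to days), and it is an assumption here, not a description of what ggmap did with these sizes.

```python
import math
from datetime import date

# Hypothetical sample rows: (city, date entered, date left).
stays = [
    ("Dublin", date(2008, 9, 1), date(2008, 9, 11)),
    ("Lima", date(2010, 6, 1), date(2010, 6, 8)),
    ("Dublin", date(2010, 3, 1), date(2010, 3, 5)),
]

def summarize(rows):
    """Per city: year of first visit (for color) and total nights (for size)."""
    summary = {}
    for city, entered, left in rows:
        info = summary.setdefault(city, {"first_year": entered.year, "nights": 0})
        info["first_year"] = min(info["first_year"], entered.year)
        info["nights"] += (left - entered).days
    return summary

def point_radius(nights, max_nights, max_radius=10.0):
    """Scale radius by sqrt so that point area is proportional to nights."""
    return max_radius * math.sqrt(nights / max_nights)

summary = summarize(stays)
longest = max(info["nights"] for info in summary.values())
for city, info in summary.items():
    print(city, info["first_year"], round(point_radius(info["nights"], longest), 2))
```

With this scaling a city with a quarter of the nights gets half the radius, so short stops stay visible next to the long stays.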

My second go at the map fixed these bugs and added borders just for kicks. I don't really like these large borders, which seem to drown out and minimize my destinations. But for the most part, I was very happy with this map. All I needed was a legend. It turns out this was way harder than I realized, because I had been using ggmap incorrectly this whole time. Hours of debugging later, I finally converted my years into factors and got the legend to pop up.

As a conclusion, these Travels of Cal are about to get a lot more interesting. After 4 years and 3 months at Arup in Hong Kong, my last day will be January 8, 2016. I will be adding some data points in Southeast Asia and building on these visualizations. There's a greater purpose here than documenting my own life, rest assured.


2 comments:

Ricardo said...

Hi Cal,

Care to share your R code for this?

Cal Lee said...

Hey Ricardo,

Sorry this took embarrassingly long, but I finally commented most of the basic code and added it to a GitHub repository:

https://github.com/cal65/Geography-of-Cal.git