Sunday, December 31, 2017

2017 in Recap

I wrote a recap post in 2016, and while I feel no pressure to maintain any sort of annual tradition, the final days of 2017 find me in a contemplative state. Had I started writing this post at the beginning of December, it likely would have had an entirely different tone. That post would certainly have contained heavy doses of data science, supply chain, and politics. It might have expounded on the breadth of America in my first full stateside year since 2010, one in which I visited 19 states plus the district. As a travel blogger I'd find it fun to recount scrambling up the fiery red rocks of Utah's Arches National Park, viewing a solar eclipse from a private hilltop in the middle of Tennessee, waking up at 5am in the Newark train station, or having high expectations surpassed by the lovely town of Asheville, NC.

Instead, on December 5th, while on a work assignment trying to improve our aircraft engine manufacturing supply chain, I received a tap on my shoulder and was ushered into a video conference room. There I learned that my audit analytics team would be restructured into a centralized group and that I would be laid off. With GE going into full crisis mode, I had plenty of company. Suddenly I was hit with a flurry of emotions and realizations. I was not angry. My job with GE, though far from perfect, was easily the best job I had ever had, and I had not been thinking of leaving. I had been so focused on trying to excel there that I hadn't given much thought to external options. In addition, in my struggle to find a regular rhythm in the US that satisfied me, I had poured most of my energy into work. Now I was suddenly faced with a clean slate, and a mixture of emptiness and possibility beckoned.

I've been lucky in life in many ways, including finding careers I could be passionate about, twice. My first passion, sustainability, proved to be more challenging than rewarding, but my switch to data science has ultimately served me very well. I've found my skills to be in dizzying demand across all industries. They say timing is the most important thing in life, and being a data scientist in Boston in 2017 is definitely good timing.

With the bare necessities of food and shelter met, I nonetheless find myself incapable of simply enjoying this comfort. Inevitably, I gravitate towards higher questions of purpose. I believe this is such a common angst of our generation that any novel professing to capture this zeitgeist must address it. Our generation is more educated about the world than any before it, with videos live-streamed from conflict areas in Syria and Myanmar. At the same time, we are, for obvious reasons, part of the largest population the world has ever had, making each individual relatively less significant. The end result is that the population most aware of large problems is the one least equipped to address them. (Rising tuition costs and stagnant wages are also contributing forces.) In so many of my peers, I see the desire to help other people, the desire to be part of something larger than ourselves, to have a higher sense of purpose and place. With technology finding ways to bubble up stories of great deeds and causes to the surface, it is easier than ever to be inspired - or reminded of our inadequacies. The lives of many people I know revolve around making daily ends meet while occasionally re-calibrating to make sure they are on the path to fulfilling their long-term noble goals.

Repping Chewbacca and Ties Tuesday
Purpose can be a very deceptive motivational force. I had a firsthand lesson in this fallacy over the summer. Motivated by a small research task I was given at work, I ended up writing a 10-page memo on the future of GE and scheduling a meeting with my Senior Vice President and 6 executives to discuss it. The weeks leading up to that meeting were some of my most purposeful of the year: I reached out to interview numerous contacts internally and externally and read as many industry papers on automation and AI as I possibly could. Running from meeting to meeting and editing the memo boosted me with adrenaline, because I really thought what I was doing might matter and influence real change. In the end, though, timing was not my friend, and multiple crises came up the day of the meeting. My hour-long meeting was reduced to 20 minutes and nothing got done. The lesson learned here is that purpose is not enough. If your ultimate goal is effecting positive change, the motivation to do so may not be the factor that matters most. It is easy to delude yourself into thinking your work is significant.

Half of the most international team I've been part of
At the same time I'm very aware that there are plenty of perfectly happy people who live without any delusions of grandeur or goodwill. They know their jobs have, crudely speaking, no higher purpose, but are perfectly content to collect the paycheck, go home and be happy. I have no ill will, and likely some envy, towards people in this category. It is possible that those who have not given enough thought to fleshing out their goals will hit a proverbial mid-life crisis, but I think there are plenty of people who will live through the ups and downs of life blissfully unencumbered by unachievable dreams. For better or worse, I am not able to join this group. It is possible that I was born this way, or raised this way, or influenced this way. My years in Asia definitely contributed - I have seen too much poverty in Myanmar and too much pollution in China to sit quietly day after day at my first-world office. I cannot go from an international workplace where conversations revolve around sharing cultural upbringings to any sort of regionally-limited commerce where conversations revolve around the Patriots and Massachusetts towns.

Thus in 2018 I resolve not to be distracted by a false sense of purpose, but to hold true to my beliefs and core essentials. This balance between pragmatism and idealism is a generational struggle which I believe many of my friends share. It is a struggle that does not get resolved quickly. Our problems are too large to be solved by willpower alone. Years of hard work and luck may not even be enough. In personal and practical terms, my pragmatic side believes I still need technical skills and business experience, and my next job should help provide them. In idealistic terms, I know I need to keep exploring new countries, because in my experience travel inspires the best ideas. So stay tuned, 2018 promises to be very exciting. Happy New Year, everyone!

Thursday, December 21, 2017

The Experience of being a Native English Speaker

The origin of this blog post is me, a native English speaker who loves discussing linguistics, engaging in such conversations with an international crowd and getting peeved at native English speakers who so often put forth the same questions over and over again. Despite its leading global status, English is only the third most common mother tongue, trailing Mandarin and Spanish, with native speakers comprising about 5% of the world. The majority of the people in the world who speak English proficiently learned it as a foreign language - up to 15% of the world. In my annoyance at fellow Americans or Brits, I realized that there are many aspects of being a native English speaker, especially a monolingual one, that result in an experience atypical of most humans on planet Earth.

I expect this to be read, in English, by both native and non-native speakers. Even though native English speakers on the whole are far from misunderstood, hopefully this post will still shed light on just why we ask such silly questions and be interesting to both native and non-native readers alike.

1. English monolinguals don't understand what it is like to have another closely related language
There is no living language closely related to English anywhere near the point of mutual intelligibility. As a result, most English monolinguals have a hard time understanding the mere concept of mutual intelligibility ("What do you mean you can understand it but not speak it?"). Many of the world's languages are part of dialect continuums, where languages vary gradually across a geographic span, with neighboring languages within the continuum mutually intelligible even if the end nodes are not. This is why many Polish speakers can understand Russian, or Thai speakers understand Lao, etc. Even when related languages are not mutually intelligible, there may be so many structural grammatical similarities and shared words that the barrier to language acquisition is not so steep. As a result, a native Italian speaker could reasonably learn French in 100-150 hours*, a number simply not possible for monolingual English speakers (outside of gifted savants). The Foreign Service Institute, which trains US State Department officials, gives estimates for the number of hours it takes to train Americans to proficiency in various languages. The lowest numbers are 575 hours for most Romance and Germanic languages, and the highest are 2200 hours for Arabic, Chinese, Japanese and Korean.

There actually are/were some other "Anglic" languages, including the extinct Yola and the difficult-to-classify Scots. There are also many English accents that might not be easy for other English speakers to comprehend. Learning to understand these accents might take some time, but it is not nearly the same as learning another language.

So why is there no living close relative? Geography, plain and simple. When the Angles crossed over to Britain in the 5th-7th centuries from modern-day Denmark/Germany, they separated themselves from the dialect continuum on mainland Europe. Over a thousand years later, the closest relatives on the mainland, Frisian (a dying language in the Netherlands), Dutch and German, are all quite different from English. The Norman invasion of 1066 and the subsequent centuries of French-speaking rule also dramatically influenced what would become modern English. Despite the plethora of loanwords that came from French (and Latin), English is still a Germanic and not a Romance language, so it's still not simple for English speakers to absorb the genders and conjugations of French.

Within the British Isles there was some further divergent language evolution. Geographic impasses like islands and mountains can, over time, lead to neighbors speaking mutually incomprehensible languages. But English only had 1000 or so years to spread out over a relatively small landmass. Technology may have further reduced variation, as Britain was at the forefront of the printing press and the industrial revolution, propelling the spread and standardization of language. So while there definitely were regional varieties, some of them, like Yola (spoken near Wexford, Ireland), so different they could be classified as separate languages, most of these differences ultimately converged or died out.

While there are other languages around the world without a closely related language, including Japanese, Korean, Hungarian etc., the majority of humans grow up speaking a language with mutually intelligible relatives.

*La tua personale esperienza potrebbe essere diversa (your personal experience may vary)

2. There are so many people learning it. 
Over a billion and growing. You'll see different numbers for something so hard to measure, but there is a consensus that there are far more second-language speakers of English than the 350 million or so native speakers. Spanish, in contrast, has fewer than 100 million second-language speakers. English is not alone in having a high proportion of second-language speakers - French, Swahili and Hindi all have enormous second-language populations - but it's in a rare camp among the world's 6000+ languages. The billion number seems low to me. It seems like everywhere one travels, someone knows a bit of English.

The effect here is that when Americans try to learn a foreign language, even in a foreign country, they often encounter people who want to practice English and who already speak English better than the American speaks their native language. That actually adds a degree of difficulty to the language acquisition process. A Scottish person I met said that while he was learning French, he would make his Scottish accent as thick as possible when speaking English, so that people would prefer to talk to him in French.

3. Native English speakers very rarely engage in conversations where both speakers are speaking in second languages
Ok this header is a mouthful, so reread it and hear me out. This one is best understood in the converse - non-native English speakers are nearly guaranteed to have an experience where they communicate in English to someone else who is also a non-native speaker. French people speaking to Germans, Koreans conversing with Filipinos, Egyptians talking to Kenyans - there is nothing fantastical about these dialogues taking place in English. It is a daily rite of our global economy.

For native English speakers, this is a rarity, even for those who learn other languages. How often does an American who speaks German come across someone who is not a native German speaker but speaks German better than they speak English? Maybe in parts of Central Europe, but it's a rarity. So sure, learners of Spanish may find it useful in Brazil, learners of French in Algeria, and learners of Mandarin in Xinjiang. But from my conversations, many Americans who know a foreign language well have literally never had this experience, whereas every non-native English speaker I've ever met has.

Lastly, I'd add that this is not just a neat bit of trivia. Conversations where both speakers are handicapped can be really interesting - both speakers may find themselves grappling for the right word and journeying together towards it. Humor and sarcasm get tricky when relative cultural markers are thrown off by the neutral language setting. Enough practice in these settings can greatly improve one's general communication skills. This is a significant aspect of the human experience that someone may never have simply by virtue of being born in an English-speaking country.

4. There are many second language varieties of English
Every language spanning more than a village will have its accents, but few languages have the global and political baggage of English. One result of this is that entire populations have gone from speaking their own language to creating a unique variety of English. This occurred with the Celtic languages of Ireland, Wales and Scotland, with Bantu languages across Eastern and Southern Africa, and with the Indo-Aryan and Dravidian languages of South Asia. The English accents spoken in these areas have their roots in second language acquisition, and many grammatical and phonetic traits remain, even amongst modern native English users who do not speak the substrate language. For example, many people in Ireland who do not speak Irish pronounce the "th" in "think" as a dental "t", like "tink". This example embodies an interesting linguistic grey zone - the speaker appears to have phonetic rules from another language interfering with their pronunciation of English, but the speaker does not speak any other language.

You may find similar manifestations of this phenomenon across the Spanish-speaking, Arabic-speaking or Chinese-speaking worlds, but arguably not to the same extent as in English.

---

In conclusion, the prevalence of English throughout the globe today does not make it a boring subject. On the contrary, it has led to some consequences which, I hope you will agree, are absolutely fascinating.

Sunday, September 10, 2017

Emojis, elections, LKF and whatsapplied statistics

Whatsapp is the messaging medium of choice among the ultimate frisbee community of Hong Kong. Since December 2011, an ever-expanding group of players has coordinated practices and social activities, shared news and engaged in raucous discussions over a Whatsapp group. At times this group has exploded like a phone ringing off the hook and been the source of much hilarity for the community. The group has gone through many names, but has most commonly been known as "Party in My Pants" (PIMP).

This analysis was started by my friend, teammate and former colleague Jak Lau. In June 2017 he emailed me this:

Party in my TUANsuit
STATISTICS
Started: 10 December 2011
Age: 5 years 7 months
Messages: 39,700
Group name changes: 19

Messages written (no. / %):
Doona: 3261 (8.0%)
Sam: 2453 (6.0%)
Jak: 2223 (5.4%)
Neil: 2120 (5.2%)
Mikey: 2097 (5.1%)
Kim: 2041 (5.0%)
Will: 1339 (3.3%)
Tommy: 805 (2.0%)
Gio: 655 (1.6%)

This is the email exchange that followed:
Cal: What?!?!?! How did you get these??
Jak: I made them.
Cal: You downloaded the transcript and searched?
Jak: Yeah, just exported the chat into excel and used a few simple analysis tools. Mostly sort and filter.
I was gonna do more, but wouldn't be worth it.
Cal: Haha I might play around with it. I want to see which emojis we use (eggplant)
Jak: Have fun. I don't know if you can export the emojis.

And so I set forth to do some further analysis. Whatsapp's Export Chat feature is nifty (and a feature that separates it from many other messaging apps), allowing me to email my account's data stored on the phone to myself as one raw text file. You can include media as well (images and videos), but I figured that might be an overwhelming amount of data and left the images out. Since leaving Hong Kong in early 2016, I've both remained in this group chat and become a professional data scientist, learning many techniques that would help me work with this text file.

Python is a good tool for text analysis, especially when used through a web application interface like a Jupyter Notebook. The benefit of using a web interface is that the text gets output in your browser, which means different language scripts and emojis, both of which are relevant here, will likely be supported. The pandas package can ingest the .txt file and convert it into a useful dataframe in one step. The raw data contains lots of activity from Whatsapp, including when users entered or left the group and when they sent images; I was mainly interested in the messages. I knew I would eventually do analysis on the emojis, though, and I wasn't sure how that would work in a Python Jupyter Notebook - there was, however, a tutorial on emoji data science in R, which happens to be my strongest programming language. So I exported the raw data from Python and used R to analyze it.

So I imported the data into R and was looking at a 37,351 x 4 dataframe. I had the timestamp of the message, the sender, the message text itself, and whether it was an image.
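
For readers who want to skip the Python step, the parsing can be done directly in R. Below is a minimal sketch, assuming the common "date, time - sender: message" export format and the "<Media omitted>" placeholder for images; both vary by phone OS, and the file name and column names are my own.

```r
# Minimal sketch: parse the exported chat into timestamp / sender / message / image flag.
# The regex and the media placeholder text are assumptions and vary by phone OS.
library(stringr)

raw <- readLines("whatsapp_chat.txt", encoding = "UTF-8")

pattern <- "^(\\d{1,2}/\\d{1,2}/\\d{2}, \\d{1,2}:\\d{2} [AP]M) - ([^:]+): (.*)$"
parts   <- str_match(raw, pattern)   # columns: full match, timestamp, sender, message

chats <- data.frame(
  timestamp = as.POSIXct(parts[, 2], format = "%m/%d/%y, %I:%M %p"),
  sender    = parts[, 3],
  message   = parts[, 4],
  stringsAsFactors = FALSE
)
chats <- chats[!is.na(chats$sender), ]                       # drop continuation/system lines
chats$is_image <- grepl("<Media omitted>", chats$message)    # image placeholder
```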

My first step was to look at a timeline - what was typical activity? What sort of spikes occurred, and when? It's hard to plot a timeline without first grouping the data into buckets, so I grouped the continuous timestamps by the day of the message. Since I was in the US EST time zone when I emailed these messages to myself, the timestamps are also in that time zone - however, the majority of the group is based in Hong Kong, so it made sense to first add 12 hours to all those times. Then I was able to calculate how many messages were sent on any given day in Hong Kong, and eventually create the following plot.
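
A sketch of that aggregation with dplyr and lubridate, continuing from the hypothetical chats dataframe above (the crude +12 hour shift is the same approximation described in the text):

```r
library(dplyr)
library(lubridate)

daily_counts <- chats %>%
  mutate(
    hk_time = timestamp + hours(12),     # rough shift from US EST to Hong Kong time
    day     = as.Date(hk_time),
    weekday = wday(day, label = TRUE)    # kept around for coloring the plot later
  ) %>%
  group_by(day, weekday) %>%
  summarise(messages = n())
```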

The labels that you see there were created semi-manually. After observing the timeline without the labels, I looked up the peak days and dug into the original data to see what people were talking about on those days. For most of these I could find clear events that piqued interest in the group. Some were external events, like the US Presidential Election and the Hong Kong Umbrella Revolution (perhaps the most sustained spike), and some were random internal events, like the time we decided to play a game where people typed entirely in emojis and others guessed what movie they referred to. A few of the spikes didn't really correspond to anything more than a Saturday night. The external events were mostly major news events and live sporting events, including ultimate tournaments. I also felt the chart was lacking color, and after struggling to think of what other variable I could use to color the dots, I decided to just use the day of the week. It doesn't really add any more insight to the chart, but it makes the presentation better.

This plot also shows several periods of no activity, which I can trace to times when I lost my phone and had to restore an earlier backup. Whatsapp backs data up to the cloud, of course, but access to this data on any given device is local. Whenever I lost my phone and had to reset, weeks or months of messages were lost. This explains why my total message count is lower than Jak's, despite my export being done several months later. I also chose to break the plot down by year, called faceting in ggplot, and to keep each year on its own scale - the 469-message behemoth during the first emoji movie game ruins the scale and makes it hard to see other outliers. This is where ggplot really shines - faceting is awful to do in many other plotting packages, and without ggplot I would probably just create 6 plots and piece them together in a photo editor.
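
The faceted timeline might look roughly like this, continuing from the daily_counts above (a sketch; the labels and styling of the original plot are not reproduced):

```r
library(ggplot2)

daily_counts$year <- year(daily_counts$day)

ggplot(daily_counts, aes(x = day, y = messages, color = weekday)) +
  geom_point() +
  facet_wrap(~ year, ncol = 1, scales = "free") +   # one panel per year, each on its own scale
  labs(x = NULL, y = "Messages per day", color = "Day of week")
```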

Next was to repeat Jak's work and look at the most prolific texters. The group has expanded greatly over the years. Once limited by the app itself to 25 users, it has now grown to 91, and I found messages from over 100 unique users, including people who have since left the group. After aggregating the counts by sender, I made the following bar chart of all texters who had sent over 100 messages. In R it was also easy to add a bit of extra information by coloring the message counts by year (which I did with sequential shades of green, making it clear that darker means more recent).
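
A sketch of that bar chart, again using the hypothetical column names from above; the Brewer "Greens" palette stands in for whatever sequential greens the original plot used:

```r
sender_year <- chats %>%
  mutate(year = year(timestamp + hours(12))) %>%
  count(sender, year)

top_senders <- sender_year %>%
  group_by(sender) %>%
  summarise(total = sum(n)) %>%
  filter(total > 100)

ggplot(filter(sender_year, sender %in% top_senders$sender),
       aes(x = reorder(sender, n, sum), y = n, fill = factor(year))) +
  geom_col() +                                   # stacked counts, one segment per year
  scale_fill_brewer(palette = "Greens") +
  coord_flip() +
  labs(x = NULL, y = "Messages", fill = "Year")
```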

So we do see that Donna is far and away the most active user historically, followed by Sam, Mike Ying, Kim, Jak, Neil, me and Tuan. The group's expansion is also visible here, with users Wanda and Jason noticeable for being high volume users with messages only in 2016 and 2017. On the flip side there are users whose volume dropped off over the years, including people like Nickie Wong and Chris Harrison who left Hong Kong.

What was also fun was searching for specific words and then redoing the barplot for just the messages containing those words. As one of the main functions of this group was to organize social activity among people in Hong Kong, several places in Hong Kong appear in hundreds of messages over the years. Chief among these is "LKF", short for Lan Kwai Fong, one of the best party areas in Hong Kong and, in all seriousness, the world. LKF appeared in messages 102 times, led far and away by Tuan Phan.
Ok, I'm #2, but Tuan has me beat by a mile. Along each bar, I included a randomly sampled message from the respective person using LKF, and it so happens that Ruth Chen's message is "Tuan's always in LKF."
As an added bonus, I wanted to see how deep into the socializing these texts typically occurred. It took a bunch of manipulation (I had to extract the time portions of these texts, then set them all to the same arbitrary day) before I was able to graph the frequency of these texts over the course of the day.
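
A rough sketch of the word search and the time-of-day view. Here I sidestep the "same arbitrary day" trick by converting each timestamp to a numeric hour of day, which gets to the same kind of plot:

```r
lkf <- chats %>%
  mutate(hk_time = timestamp + hours(12)) %>%
  filter(grepl("lkf", message, ignore.case = TRUE))

# Who mentions "LKF" the most
ggplot(lkf, aes(x = sender)) +
  geom_bar() +
  coord_flip() +
  labs(x = NULL, y = "Messages mentioning LKF")

# When "LKF" comes up during the day
lkf$hour_of_day <- hour(lkf$hk_time) + minute(lkf$hk_time) / 60
ggplot(lkf, aes(x = hour_of_day)) +
  geom_density() +
  scale_x_continuous(breaks = seq(0, 24, 3)) +
  labs(x = "Hour of day (HK time)")
```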

Hmm, it would appear that texts referring to the party place in Hong Kong within the group "Party in My Pants" really take off between 6pm and 1am. Whodathunk it?

You can repeat the first LKF graph with any other word or regular expression. I'll do one more - be careful if you're reading this at work, because our group is not exactly PG-13.
Interesting, Tuan also has a commanding lead in this category, and his randomly sampled "sex" sentence even includes "lkf." Even if you the reader are not familiar with any of the people mentioned here, you may have an inkling of why this groupchat is now named "Party in my Tuansuit."

Ok, at this point the most data-science-heavy thing I've done is sample a random sentence and plot it on a graph. Surely this is not what I'm paid to do (you'd be surprised). But let's actually apply some text mining to this wonderful dataset. I first do some quick preprocessing steps, reducing everything to lower case and getting rid of pesky punctuation. Using the R package "tm", I also eliminate a healthy group of English-language stopwords (generic words like "an", "me", "who" etc. which don't really provide any insight), and create a corpus and dictionary. Here "dictionary" means that the program creates a vector to store words. It iterates along each word of each message, and every time it comes across a word, if it hasn't seen it before, it adds a new element to the vector and assigns it the value 1. If it has seen the word before, it finds the index corresponding to that word and increases its value by 1. The program separately keeps a vector of the words themselves so that we can match them to the word counts later. This step tells me that we have 17,608 unique words. Considering we have over 37k texts and most texts have multiple words, I was surprised that the unique word count was so low. As it turns out, we repeat words a lot. From this step I can see that we've said "happy" 1509 times and "birthday" 1215 times, the 1st and 3rd most used words respectively.
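
In code, that word-count "dictionary" can be sketched with base R plus tm's stopword list (a simplified stand-in for the corpus machinery, which shows up properly in the next step):

```r
library(tm)

# Lowercase, strip punctuation, split into words, drop stopwords
words <- unlist(strsplit(tolower(chats$message), "[^a-z']+"))
words <- words[words != "" & !(words %in% stopwords("english"))]

word_totals <- sort(table(words), decreasing = TRUE)
length(word_totals)     # number of unique words
head(word_totals)       # "happy", "birthday" etc. near the top
```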

So I want to calculate the overall usage frequency of each word, and the usage frequency for each user. Key to this step is the document term matrix. The document term matrix is essentially a collection of all those word-count vectors, one per document, where each document in this case is an individual text. Each vector must have as many indices as there are unique words, so each vector is 17,608 elements long. Since there are 37,351 texts, we are looking at a 37,351 x 17,608 matrix! That matrix would take up at least 5 GB of RAM on my computer. I say at least because my workstation would crash before it finished creating it.

Luckily computer scientists have figured out a way around this: the sparse matrix. Nearly all the elements in the matrix are 0 - no text has anywhere close to 17k unique words. A sparse matrix stores only the non-zero elements. It is a little bit harder to do operations on, but you still can, and it saves a lot of storage. The sparse matrix for this Whatsapp group was only 5.3 MB. I combined this matrix with a vector containing the sender of each chat, and iterated through each unique sender to find each person's total word vocabulary, or individual dictionary. These individual frequencies could be compared to the overall frequencies in a couple of ways. We could look at the relative difference in values, finding cases where someone used a word 1/100th of the time while overall it was used 1/10000th of the time. However, for words that were only used a couple of times in total, this ratio would be wildly distorted. So I removed from consideration all words which were only used once overall, and created a weighted score where the raw difference in frequencies was also taken into account. The weightings I used here were arbitrary, but I tried a couple of variations until I got words that seemed "interesting."
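
Here is a rough sketch of that step using tm's sparse document term matrix. The preprocessing mirrors the steps above, and the distinctiveness score is a toy stand-in for the arbitrary weighting described in the text; the sender name passed in has to match how it appears in the export.

```r
library(tm)
library(slam)

corpus <- VCorpus(VectorSource(chats$message))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corpus)       # stored as a sparse simple_triplet_matrix

overall_counts <- col_sums(dtm)
overall_freq   <- overall_counts / sum(overall_counts)
keep <- overall_counts > 1              # drop words used only once overall

distinctive_words <- function(person, n = 10) {
  person_counts <- col_sums(dtm[which(chats$sender == person), ])
  person_freq   <- person_counts / max(sum(person_counts), 1)
  # Toy score: relative over-use, damped by how often the person actually used the word
  score <- (person_freq - overall_freq) * log1p(person_counts)
  names(sort(score[keep], decreasing = TRUE))[seq_len(n)]
}

distinctive_words("Tuan Phan")
```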

After successfully iterating through each sender (and not crashing my computer), I saved the 10 most "distinctive" words for each sender, and graphed these words for a bunch of people.

Awesome. There are a lot of interesting words in here, which I'll get to in a bit. But first, what are the u0001---- things? Most of these are unicode escapes for emojis - a couple of them are Chinese characters. And there are lots of emojis, to the extent that this graph is really more distracting than useful until those unicode sequences are converted into weird smileys. And thus I broke out the emoji data science tutorial, written by the affable Hamdan Azhar, who has actually founded a company around emoji analysis.

Turns out emojis are really complicated. The steps involved in making that graph pretty were extensive - I spent a couple of weeks of free time on it. Hamdan's strategy is to create a dictionary mapping each unicode id to the name of the emoji, and to download a bunch of emoji .png images with the same names. His tutorial links to a dictionary and a set of images; unfortunately, the dictionary I used did not contain unicodes and I had to find another one online. This one, for some reason, named some emojis differently. Aggravatingly differently. There isn't exactly one emoji regulatory body (or 👮👉 for short). For example, my dictionary had "grinning face with sweat" and my images used "smiling face with open mouth and cold sweat". Also, whatever regulatory body there is keeps adding new emoji, and neither the dictionary nor the set of images was up to date, so I expanded the repertoire as I came across new items. A new frustration came with the new png files, some of which threw an error when I tried to render them. Turns out I needed to download Windows-specific emoji, some of which look quite different from the browser, Android or Apple versions. Eventually, with enough "manual" work, I was able to redo the plot by removing all the text that matched an emoji, then one by one rendering an image of the matching emoji in its place. The cleaned-up result is below:
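
A very rough sketch of the matching half of that work is below. The lookup file, its column names and the image folder are all hypothetical stand-ins; the real effort went into reconciling the mismatched names between dictionaries.

```r
# Hypothetical two-column lookup: the escaped unicode string as it appears in the
# exported text ("r_encoding") and the emoji's name ("name"), which also names the .png file.
emoji_dict <- read.csv("emoji_dict.csv", stringsAsFactors = FALSE)

swap_emoji_names <- function(tokens) {
  hit <- match(tokens, emoji_dict$r_encoding)
  ifelse(is.na(hit), tokens, emoji_dict$name[hit])
}

swap_emoji_names(distinctive_words("Donna"))

# Rendering then replaces each matched label with its image, e.g. by layering
# grid::rasterGrob(png::readPNG(file.path("emoji_images", paste0(name, ".png"))))
# onto the ggplot via annotation_custom() at the right coordinates.
```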

How sweet is that? Some users almost exclusively communicate in emoji (Kingi, Cat MK, Rie). Mike Ying's first emoji definitely rang a bell, and of course the eggplant appeared on Neil's most distinctive words list. Also notable here are Sam Axelrod's 7th most distinctive word, Lincoln's 4th and 7th, Clay's 9th, Conor talking about football, my love for Tom Brady, and the fact that Donna, the group's most prolific user, apparently just texts various different types of laughs all the time. Note: this emoji chart was done a bit later and with a slightly improved methodology compared to the previous non-emoji chart, hence not all the words match up.

More users:

Of course Wilkie mentions Madonna, Jeremy uses aviation vocab, Quention has a baby named "Marni" and Jason talks about his master's. Ed Lee says "sold" whenever you propose any social event. I'm not sure what it says that Kirk's most distinctive word is "harem", but he's used it twice and no one else has. And yes, there still are some frequency issues here. While I removed words that were used only once overall, some of the words that show up here were used only once by the user and twice or thrice overall. Is a word really distinctive of a person if he or she has only used it once? Perhaps my weighting equation needs some reworking, but there were always going to be some issues, especially with users who haven't sent that many texts.

But Cal! There are still emoji unicodes in here! Yes there are. Like I said, emojis are really, aggravatingly complicated. The basic emojis are all one unicode code point to one emoji - however, they just keep expanding the standard. You know how the face emojis now have adjustable skin tones? That is a combination of two code points - the original face and an additional one signifying the skin tone. All the flag emojis? They are combinations of two code points. And actually, England, Wales and Scotland are all considered subdivisions of a national flag and are somehow represented by a combination of 7 code points. My current methodology breaks up every combination into individual one-code-point tokens. I could redo the process grouping everything into bigrams, but that's not even guaranteed to solve the problem. It's a tricky one that might be best solved with more manual work. The ungraphed emojis in the charts above are mainly Hong Kong/USA/UK/England flags as well as the skin-tone modifier. There is also still some Chinese text left encoded - while I've worked with Chinese text before, for some reason I had trouble getting it to display in this file.

I get the impression that many people find big data and data science very abstract and impersonal. The algorithms crunching massive data behind your targeted ads don't exactly inspire congeniality. But these techniques can be applied to anything, including more personal data. I've already written posts looking at my Facebook data and my travel locations, where data visualization really helped me understand my own past better. Going through this particular dataset was especially fun - I was constantly reminded of hilarious exchanges from years ago with friends on the opposite side of the globe. Does this analysis add any business value? Nope, but I spend plenty of time doing analysis that is supposed to add value, and sometimes it's fun to just see exactly how many times Tuan drunkenly messaged the group.

P.S. If I can do this, Whatsapp (Facebook) is also probably doing this with your data.

Monday, February 20, 2017

Urban Clustering

When you see a picture of a city, does instinct immediately bring you to guess where it was taken? It's an urge I can't quite suppress. I find that even if I can't recognize the city, I can almost always still guess the continent. Even without textual giveaways on signposts or the inhabitants' facial features, many subtle urban features can clue you in. The roof architecture, the street food, the make of the buildings, the road paving - these all help distinguish the continental origins of a city. Though there is incredible urban diversity between countries within a continent, it still seemed to me that cities of one continent had more in common with each other than they did with cities of other continents.


I was staring at an aerial photo of Dhaka when I decided I wanted to test this hypothesis numerically. Surely there could be some city metrics that featured more variation between continents than within them - the old ANOVA test from classical statistics. And so I set about gathering as much data as I could.

The headaches started immediately. This project almost certainly wouldn't have been possible at any scale 5 years ago. It will certainly get easier in 5 years. There isn't really any comprehensive worldwide cities database. There are no standard definitions, and though efforts are being made to correct this (including a proposed ISO standard for cities), nothing has been widely adopted. Even basic figures for area and population are frustratingly inconsistent, with significant discrepancies over where to draw borders. These discrepancies are themselves continental in nature - American cities go by strict legal districting, whereas Asian cities often redefine their borders to match their urban sprawl. I had begun with dreams of finding creative metrics such as the average sidewalk width or the % of restaurants open after midnight, but soon realized I'd have to settle for what I could find.

Accepting that this would not be an exact science, I began with a base dataset from the World Cities Culture Forum (WCCF), which collects such interesting metrics as daily art exhibit visits and the number of rare & secondhand bookshops for 25 core cities. Their data was not without flaws (there was clearly a lack of consistent methodology), resulting in some questionable figures (Berlin has 4 rare bookshops and Johannesburg has 943?). I would fact-check strange results and often manually make changes after vetting the data collection methods. It might seem like modern society is swimming in big data, but estimates for international tourists in Hong Kong ranged from 27 million to 60 million because the government doesn't have a consistent definition of what a tourist is.

Ultimately I created a dataset of 42 cities (10 from Asia, 19 from Europe, 6 from North America, 7 from elsewhere) with 24 metrics. These included Number of Concert Halls and Median Weekly Earnings from the WCCF, supplemented with data I could find on metro systems (length of rail, annual ridership and % usage), the number of Starbucks, CO2 emissions, the number of airport runways and the number of international firms (calculated by McKinsey). There was plenty of missing data, which I imputed with the metric mean. Then, on every possible combination of 6 metrics, I ran a K-means algorithm. I analyzed the resulting clusters and found the combination that best matched reality, putting over 70% of the cities together with other cities from the same continent (a sketch of this search appears after the list of takeaways below). There were a few "best combinations", and the one I've chosen to display in this application is Foreign Born %, Number of Cinemas (per capita), Metro Usage, Number of Restaurants (per capita), Working Age Population (as a % of total population) and Number of International Service Firms (per capita). Some takeaways from this combination:
  • American cities have by far the highest foreign born %, followed by European cities. Most cities in Asia, Africa and South America have close to 0 foreign born %. Singapore, being an exception, was actually clustered with the North American cities by the algorithm.
  • European cities have a lot more cinema screens per capita than other cities.
  • Asian cities have way more restaurants per capita (although this statistic is hard to measure)
  • Asian cities also have a large % of working age population, with American cities at the other extreme. To be honest, this one doesn't quite make sense. I do think you see more elderly working in Asia - often a little octogenarian pushing trash uphill in heartbreakingly public ways - but I don't think that's captured in the accounting methods here. More likely we have vastly different population denominators between methodologies.
  • Predictably, most international service firms are European or American and thus cities from those continents have much higher firms per capita. This statistic is pretty biased but I think it might have some effect on how a city looks and feels, as a proxy for how many familiar logos one sees.
  • It wasn’t clear to me what to do with the cities outside these 3 continents in my database, including 3 South American cities, Istanbul, Mumbai, Johannesburg and 2 Australian cities. I labeled them all as Other, but the algorithm clustered the Australian cities with the Europeans, which meets the eye test. 
The app has an interactive map (built in Leaflet) with all 42 cities plotted and colored by their cluster. The color legend labels each color by the continent most associated with the cluster - this means some cities carry a label that does not match their true continent. Rome is colored as an Asian city - that just means it is closer to the Asian cluster than to any other, even though I am fully aware that Rome is in Europe. No geographic information is included in the clustering algorithm.

The app lets you click on a city to see its data in a popup. You can also see 6 tabs on the left which show density plots for each metric, split by cluster. The idea is that the density plots will look rather distinct for each cluster. An orange line then shows where that exact city falls in the density plot. For cities missing data for a given metric, no orange line is shown.
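
The map half of the Shiny app can be sketched with the leaflet package. The latitude/longitude columns and the cluster_label factor are assumptions about how I would store the clustering output, not the app's actual code:

```r
library(leaflet)

pal <- colorFactor("Set1", domain = cities$cluster_label)

leaflet(cities) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lon, lat = ~lat,
                   color = ~pal(cluster_label),
                   popup = ~paste0(city, ": ", cluster_label)) %>%
  addLegend(pal = pal, values = ~cluster_label, title = "Cluster")
```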

There are plenty of flaws in my methodology and data, but they are flaws that can be improved over time with more and better data. I believe there are many metrics that could reveal interesting urban planning or sociological distinctions between the continents - and essentially, the data I saw helped confirm my thesis. Understanding the underlying reasons behind these distinctions can drive interdisciplinary conversations.

Personally this was also an important project for me. It was the major impetus driving my data science training, giving me a goal to work towards that required me to learn about data merging, language encoding, data standardization, clustering/classification algorithms and web application development. I even talked about the project in my final interview with GE.


The project is hosted on the free shinyapps.io server at https://cal65.shinyapps.io/Cities/ . This minimally viable approach is slow and won't work when my home laptop is turned off (!). For users of R, a better user experience is available by installing the Shiny package and running runGitHub("Cities", "cal65"). All my (sloppy) code is up on Github, and I'm happy to collaborate with people to improve this project. Shoutout to Ivan Peng for helping me on the project and teaching me how to set up a database in Python!