Sunday, September 10, 2017

Emojis, elections, LKF and whatsapplied statistics

Whatsapp is the messaging medium of choice among the ultimate (frisbee) community of Hong Kong. Since December 2011, an ever-expanding group of players has coordinated practices and social activities, shared news and engaged in raucous discussions over a Whatsapp group. At times, this group has blown up like a phone ringing off the hook and been the source of much hilarity for the community. The group has gone through many names, but has most commonly been known as "Party in My Pants" (PIMP).

This analysis was started by my friend, teammate and former colleague Jak Lau. In June 2017 he emailed me this:

Party in my TUANsuit
STATISTICS
Started: 10 December 2011
Age: 5 years 7 months
Messages: 39,700
Group name changes: 19

Messages written      no.      %
Donna               3,261    8.0
Sam                 2,453    6.0
Jak                 2,223    5.4
Neil                2,120    5.2
Mikey               2,097    5.1
Kim                 2,041    5.0
Will                1,339    3.3
Tommy                 805    2.0
Gio                   655    1.6

This is the email exchange that followed:
Cal: What?!?!?! How did you get these??
Jak: I made them.
Cal: You downloaded the transcript and searched?
Jak: Yeah, just exported the chat into excel and used a few simple analysis tools. Mostly sort and filter.
I was gonna do more, but wouldn't be worth it.
Cal: Haha I might play around with it. I want to see which emojis we use (eggplant)
Jak: Have fun. I don't know if you can export the emojis.

And so I set forth to do some further analysis. Whatsapp's Export Chat feature is nifty (and a feature that separates it from many other messaging apps), allowing me to email the chat history stored on my phone to myself as one raw text file. You can include media as well (images and videos), but I figured that might be an overwhelming amount of data and left them out. Since leaving Hong Kong in early 2016, I've both remained in this group chat and become a professional data scientist, learning many techniques that would help me work with this text file.

Python is a good tool for text analysis, especially when used through a web application interface like a Jupyter Notebook. The benefit of using a web interface is that the output gets rendered in your browser, which means different language scripts and emojis, both of which are relevant here, will likely be supported. The pandas package can ingest the .txt file and convert it into a useful dataframe in one step. The raw data contains lots of activity from Whatsapp, including when users entered or left the group and when they sent images, but I was mainly interested in the messages. I knew I would eventually do analysis on the emojis, though, and I wasn't sure how that would work in a Python Jupyter Notebook. There was, however, a tutorial on emoji data science in R, which happens to be my strongest programming language, so I exported the raw data from Python and used R to analyze it.

So I imported the data into R and was looking at a 37,351 × 4 dataframe: the timestamp of each message, the sender, the message text itself, and whether it was an image.
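I actually did the initial parsing in pandas, but to keep all the snippets here in one language, a minimal R sketch of the same step might look like this. It assumes an export format of "12/10/11, 9:15 PM - Sender: message" (the real format varies by phone OS and locale) and a hypothetical file name:

```r
library(stringr)

raw <- readLines("pimp_chat.txt", encoding = "UTF-8")   # hypothetical file name

# Lines that do not start with a timestamp are continuations of the previous message
msg_start <- str_detect(raw, "^\\d{1,2}/\\d{1,2}/\\d{2}, ")
messages  <- sapply(split(raw, cumsum(msg_start)), paste, collapse = " ")

# Pull out timestamp, sender and text; system messages ("X joined", etc.) won't match
parts <- str_match(messages,
                   "^(\\d{1,2}/\\d{1,2}/\\d{2}, \\d{1,2}:\\d{2} [AP]M) - ([^:]+): (.*)$")

chat <- data.frame(
  timestamp = as.POSIXct(parts[, 2], format = "%m/%d/%y, %I:%M %p", tz = "America/New_York"),
  sender    = parts[, 3],
  text      = parts[, 4],
  is_image  = grepl("<Media omitted>", parts[, 4], fixed = TRUE),  # assumed media placeholder
  stringsAsFactors = FALSE
)
chat <- chat[!is.na(chat$sender), ]   # drop the rows that weren't regular messages
```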

My first step was to look at a timeline - what was typical activity? What sort of spikes occurred, and when? It's hard to plot a timeline without first grouping data into buckets, so I grouped the continuous timestamps by the day of the message. Since I was in the US Eastern time zone when I emailed these messages to myself, the timestamps are also in that time zone; however, the majority of the group is based in Hong Kong, so it made sense to first add 12 hours to all the times. Then I was able to calculate how many messages were sent on any given day in Hong Kong, and eventually create the following plot.
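Roughly, the prep for that plot looked like this - a sketch that assumes the hypothetical chat dataframe from above and uses dplyr and lubridate:

```r
library(dplyr)
library(lubridate)

daily <- chat %>%
  mutate(hk_time = timestamp + hours(12)) %>%         # rough US Eastern -> Hong Kong shift
  mutate(day     = floor_date(hk_time, "day"),        # bucket timestamps by day
         weekday = wday(hk_time, label = TRUE)) %>%
  count(day, weekday, name = "messages")              # messages sent per Hong Kong day
```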

The labels that you see there were created semi-manually. After observing the timeline without the labels, I looked up the peak days and dug into the original data to see what people were talking about on them. For most of these I could find some clear event that piqued interest in the group. Some were external events like the US Presidential Election and the Hong Kong Umbrella Revolution (perhaps the most sustained spike), and some were random internal events like the time we decided to play a game where people typed entirely in emojis and others guessed which movie they referred to. A few of the spikes didn't really correspond to anything more than a Saturday night. The external events were mostly major news stories and live sporting events, including ultimate events. I also felt like the chart was lacking color and struggled to think of another variable I could use to color the dots, so I settled on day of the week. It doesn't really add any more insight to the chart, but it makes the presentation better.

This plot also shows several periods of no activity, which I can trace to times when I lost my phone and had to restore an earlier backup. Whatsapp backs data up to the cloud, of course, but the chat history on any given device is stored locally. Whenever I lost my phone and had to reset, weeks or months of messages were lost. This explains why my total message count is lower than Jak's, despite my export being done several months later. I also chose to break the plot down by year, called faceting in ggplot, and to keep each year on its own scale - the 469-message behemoth during the first emoji movie game ruins the scale and makes it hard to see other outliers. This is where ggplot really shines - faceting is awful to do in many other plotting packages, and without ggplot I would probably just create six plots and piece them together in a photo editor.
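In ggplot terms, the faceted timeline is essentially one facet_wrap call (again assuming the daily counts from the sketch above):

```r
library(ggplot2)

daily %>%
  mutate(year = year(day)) %>%
  ggplot(aes(x = day, y = messages, colour = weekday)) +
  geom_point() +
  facet_wrap(~ year, ncol = 1, scales = "free") +    # one panel per year, each on its own scale
  labs(x = NULL, y = "Messages per day", colour = "Day of week")
```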

Next I repeated Jak's work and looked at the most prolific texters. The group has expanded greatly over the years. Once limited by the app itself to 25 users, it has now grown to 91, and I found messages from over 100 unique users, including people who have since left the group. After aggregating the counts by sender, I made the following bar chart of all texters who had sent over 100 messages. In R it was also easy to add a bit of extra information by coloring the message quantities by year (which I did with sequential shades of green, so that darker means more recent).
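A sketch of that aggregation and bar chart, again building on the hypothetical chat dataframe:

```r
sender_counts <- chat %>%
  mutate(year = factor(year(timestamp + hours(12)))) %>%
  count(sender, year, name = "messages") %>%
  group_by(sender) %>%
  filter(sum(messages) > 100) %>%                # keep texters with over 100 messages in total
  ungroup()

ggplot(sender_counts,
       aes(x = reorder(sender, messages, sum), y = messages, fill = year)) +
  geom_col() +                                   # stacked by year
  scale_fill_brewer(palette = "Greens") +        # sequential greens: darker = more recent
  coord_flip() +
  labs(x = NULL, y = "Messages sent", fill = "Year")
```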

So we see that Donna is far and away the most active user historically, followed by Sam, Mike Ying, Kim, Jak, Neil, me and Tuan. The group's expansion is also visible here, with users like Wanda and Jason noticeable as high-volume texters whose messages only appear in 2016 and 2017. On the flip side, there are users whose volume dropped off over the years, including people like Nickie Wong and Chris Harrison, who left Hong Kong.

What was also fun was searching for specific words, and then redoing the barplot for just the messages containing them. As one of the main functions of this group was to organize social activity among people in Hong Kong, several places in Hong Kong appear in hundreds of messages over the years. Chief among these is "LKF", short for Lan Kwai Fong, one of the best party areas in Hong Kong and, in all seriousness, the world. LKF appeared 102 times, led far and away by Tuan Phan.
Ok, I'm #2, but Tuan has me beat by a mile. Along each bar, I included a randomly sampled message from the respective person using LKF, and it so happens that Ruth Chen's message is "Tuan's always in LKF."
Just as an added side bonus, I wanted to see how deep into the socializing these texts typically occurred. It took a bunch of manipulation (I had to extract the time portions of these texts, then set them all to the same arbitrary day) before I was able to graph their frequency over the course of the day.
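A sketch of both steps - the LKF filter with one random example message per sender, and the hour-of-day view (here I simply bucket by hour, which gets at the same thing as mapping everything onto one arbitrary day):

```r
lkf <- chat %>%
  mutate(hk_time = timestamp + hours(12)) %>%
  filter(grepl("lkf", text, ignore.case = TRUE))

lkf_by_sender <- lkf %>%
  group_by(sender) %>%
  summarise(n = n(), example = sample(text, 1)) %>%   # one random LKF message each
  arrange(desc(n))

ggplot(lkf, aes(x = hour(hk_time))) +
  geom_histogram(binwidth = 1) +
  labs(x = "Hour of day (HKT)", y = "Messages mentioning LKF")
```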

Hmm, it would appear that texts referring to the party place in Hong Kong within the group "Party in My Pants" really take off between 6pm and 1am. Whodathunk it?

You can repeat the first LKF graph with any other word, or regular expression. I'll do one more, and be careful if you're reading this at work, because ours is not a PG-13 group.
Interestingly, Tuan also has a commanding lead in this category, and his randomly sampled "sex" sentence even includes "lkf." Even if you, the reader, are not familiar with any of the people mentioned here, you may have an inkling of why this group chat is now named "Party in my Tuansuit."

Ok, at this point the most data-science-heavy thing I've done is sample a random sentence and plot it on a graph. Surely this is not what I'm paid to do (you'd be surprised). But let's actually apply some text mining to this wonderful data set. I first did some quick preprocessing steps, reducing everything to lower case and getting rid of pesky punctuation. Using the R package "tm", I also eliminated a healthy group of English-language stopwords (generic words like "an", "me", "who", etc., which don't really provide any insight), and created a corpus and dictionary. Here dictionary means that the program creates a vector to store words. It iterates along each word of each message, and every time it comes across a word, if it hasn't seen it before, it adds a new element to the vector and assigns it the value 1. If it has seen the word before, it finds the index corresponding to that word and increases its value by 1. The program separately keeps a vector of the words themselves so that we can match them to the counts later. This step tells me that we have 17,608 unique words. Considering we have over 37k texts and most texts have multiple words, I was surprised that the number of unique words was so low. As it turns out, we repeat words a lot. From this step I can see that we've said "happy" 1,509 times and "birthday" 1,215 times, the 1st and 3rd most used words respectively.
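A sketch of that preprocessing and the overall word counts with tm (the specific cleaning choices are assumptions; the pipe comes along with dplyr loaded earlier):

```r
library(tm)

corpus <- VCorpus(VectorSource(chat$text)) %>%
  tm_map(content_transformer(tolower)) %>%        # everything to lower case
  tm_map(removePunctuation) %>%                   # drop the pesky punctuation
  tm_map(removeWords, stopwords("english")) %>%   # generic words like "an", "me", "who"
  tm_map(stripWhitespace)

dtm <- DocumentTermMatrix(corpus)   # one row per text, one column per unique word
dim(dtm)                            # number of texts x number of unique words

word_totals <- sort(slam::col_sums(dtm), decreasing = TRUE)   # stays sparse-friendly
head(word_totals, 10)               # "happy", "birthday", etc. near the top
```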

So I wanted to calculate how frequently each word is used overall, and how frequently each user uses it. Key to this step is the document-term matrix. The document-term matrix is essentially a collection of all those word-count vectors, one per document, where each document in this case is an individual text. Each vector must have as many indices as there are unique words, so each vector is 17,608 elements long. Since there are 37,351 texts, we are looking at a 37,351 × 17,608 matrix! That matrix takes up at least 5 GB of RAM on my computer. I say at least because my workstation would crash before it finished creating it.
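The back-of-the-envelope arithmetic, assuming 8 bytes per numeric cell:

```r
# Dense document-term matrix: rows x columns x 8 bytes per numeric cell
37351 * 17608 * 8 / 1e9
#> [1] 5.261411   # roughly 5.3 GB before R even finishes building it
```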

Luckily, computer scientists have figured out ways around this: the sparse matrix. Nearly all the elements in the matrix are 0 - no text has anywhere close to 17k unique words. A sparse matrix only stores the non-zero elements. It is a little bit harder to do operations on, but you still can, and it saves a lot of storage. The sparse matrix for this Whatsapp group was only 5.3 MB. I combined this matrix with a vector containing the sender of each chat, and iterated through each unique sender to find each person's total word vocabulary, or individual dictionary. These individual frequencies could be compared to the overall frequencies in a couple of ways. We could look at the % difference in values, finding cases where someone used a word 1/100th of the time while overall it was used 1/10000th of the time. However, for words that were only used a couple of times in total, this % value would be wildly distorted. So I removed from consideration all words which were only used once overall, and created a weighted equation where the raw difference in values was also taken into consideration. The weightings I used here were arbitrary, but I tried a couple of variations until I got words that seemed "interesting."
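A sketch of the per-sender loop with a stand-in scoring rule - the weighting I actually used was hand-tuned, so treat the formula below (frequency ratio multiplied by raw frequency difference) as just one plausible combination:

```r
# Per-sender word frequencies from the sparse document-term matrix, plus a
# stand-in "distinctiveness" score (the real weighting was hand-tuned).
overall_counts <- slam::col_sums(dtm)
overall_freq   <- overall_counts / sum(overall_counts)
keep           <- overall_counts > 1               # drop words used only once overall

top_words <- lapply(unique(chat$sender), function(person) {
  person_counts <- slam::col_sums(dtm[chat$sender == person, ])
  person_freq   <- person_counts / max(sum(person_counts), 1)

  # Ratio rewards relatively heavy use; raw difference keeps rare words from dominating
  score <- (person_freq / overall_freq) * (person_freq - overall_freq)
  head(sort(score[keep], decreasing = TRUE), 10)   # 10 most "distinctive" words
})
names(top_words) <- unique(chat$sender)
```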

After successfully iterating through each sender (and not crashing my computer), I saved the 10 most "distinctive" words for each sender, and graphed these words for a bunch of people.

Awesome. There are a lot of interesting words in here, which I'll get to in a bit. But first, what are the u0001---- things? Most of these are unicode codes for emojis; a couple of them are Chinese characters. And there are lots of emojis, to the extent that this graph is really more distracting than useful until those unicode sequences are converted into weird smileys. And thus I broke out the emoji data science tutorial, written by the affable Hamdan Azhar, who has actually founded a company around emoji analysis.

Turns out emojis are really complicated. The steps involved in making that graph pretty were extensive - I spent a couple of weeks of free time on it. Hamdan's strategy is to create a dictionary mapping each unicode ID to the name of the emoji, and to download a bunch of emoji .png images with the same names. His tutorial links to a dictionary and a set of images; unfortunately, the dictionary I used did not contain unicodes and I had to find another one online. This one, for some reason, named some emojis differently. Aggravatingly differently. There isn't exactly one emoji regulatory body (or 👮👉 for short). For example, my dictionary had "grinning face with sweat" and my images used "smiling face with open mouth and cold sweat". Also, whatever regulatory body there is keeps adding new emoji, and the dictionary and set of images were not up to date, so I expanded the repertoire as I came across new items. A new frustration came with the new .png files, some of which threw an error when I tried to render them. Turns out I needed to download Windows-specific emoji, some of which look quite different from the browser, Android or Apple versions. Eventually, with enough "manual" work, I was able to redo the plot by removing all the text that matched emojis, then rendering an image of the matching emoji in its place, one by one. The core of that rendering step, much reduced, looks something like the sketch below; the cleaned-up result follows:
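This is only a gesture at the idea - the emoji dictionary (assumed here to be a data frame with a "code" column and a "name" column), the image folder, and the file naming are all assumptions rather than the exact code behind the chart:

```r
library(png)
library(grid)
library(ggplot2)

# emoji_dict: assumed data frame with a "code" column (e.g. "1f602") and a "name" column;
# img_dir: assumed folder of .png files named after the emoji
add_emoji <- function(plot, code, x, y, emoji_dict, img_dir = "emoji_png", size = 0.4) {
  name <- emoji_dict$name[emoji_dict$code == code]           # e.g. "face with tears of joy"
  img  <- readPNG(file.path(img_dir, paste0(name, ".png")))  # hypothetical file naming
  plot + annotation_custom(rasterGrob(img),
                           xmin = x - size, xmax = x + size,
                           ymin = y - size, ymax = y + size)
}
```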

How sweet is that? Some users almost exclusively communicate in emoji (Kingi, Cat MK, Rie). Mike Ying's first emoji definitely rang a bell, and of course the eggplant appeared on Neil's most distinctive words list. Also notable here are Sam Axelrod's 7th most distinctive word, Lincoln's 4th and 7th, Clay's 9th, Conor talking about football, my love for Tom Brady, and the fact that Donna, the group's most prolific user, apparently just texts various different types of laughs all the time. Note: this emoji chart was done a bit later and with slightly improved methodology compared to the previous non-emoji chart, hence not all the words match up.

More users:

Of course Wilkie mentions Madonna, Jeremy uses aviation vocab, Quentin has a baby named "Marni" and Jason talks about master's. Ed Lee says "sold" whenever you propose any social event. I'm not sure what it says that Kirk's most distinctive word is "harem", but he's used it twice and no one else has. And yes, there still are some frequency issues here. While I removed words that were used only once overall, some of the words that show up here were used only once by the user and twice or thrice overall. Is a word really distinctive of a person if he/she has only used it once? Perhaps my weighting equation needs some reworking, but there were always going to be some issues, especially with users who haven't sent that many texts.

But Cal! There are still emoji unicodes in here! Yes, there are. Like I said, emojis are really, aggravatingly complicated. The basic emojis are all one unicode to one emoji - however, they just keep expanding it. You know how the face emojis now have adjustable skin tones? That is a combination of two unicodes - the original face unicode and an additional one signifying the skin tone. All the flag emojis? They are a combination of two unicodes. And actually, England, Wales and Scotland are all considered subdivisions of a national flag and are somehow represented by a combination of 7 unicodes. My current methodology breaks up every unicode combination into individual words of one unicode. I could redo the process grouping everything into bigrams, but that's not even guaranteed to solve the problem. It's a tricky one that might be best solved with more manual work. The ungraphed emojis in the graphs above are mainly Hong Kong/USA/UK/England flags as well as the skin-tone modifier. There is also still some Chinese text left encoded - while I've worked with Chinese text before, for some reason I had trouble getting it to display in this file.
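You can see the multi-codepoint problem directly in R:

```r
# A skin-toned thumbs up and a flag each decompose into multiple code points,
# so a one-codepoint-per-"word" approach splits them apart.
sprintf("U+%X", utf8ToInt("\U0001F44D\U0001F3FD"))  # thumbs up + medium skin tone modifier
#> [1] "U+1F44D" "U+1F3FD"
sprintf("U+%X", utf8ToInt("\U0001F1ED\U0001F1F0"))  # regional indicators H + K = Hong Kong flag
#> [1] "U+1F1ED" "U+1F1F0"
```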

I get the impression that many people find big data and data science very abstract and impersonal. The algorithms crunching massive data behind your targeted ads don't exactly inspire congeniality. But these techniques can be applied to anything, including more personal data. I've already written posts looking at my Facebook data and my travel locations, where data visualization really helped me understand my own past better. Going through this particular dataset was especially fun - I was constantly reminded of hilarious exchanges from years ago with friends on the opposite side of the globe. Does this analysis add any business value? Nope, but I spend plenty of time doing analysis on data that will add value, and sometimes it's fun to just see exactly how many times Tuan drunkenly messaged the group.

P.S. If I can do this, Whatsapp (Facebook) is also probably doing this with your data.