Milestones Report on the Swiftkey Datasets by Alejandro Salinas

Summary

Below is an exploratory analysis and description of the datasets we received from Swiftkey. As a whole, we have a fairly diverse set of text that is chunked into short lines. While we can glean some of the thematic elements of each dataset, as seen most clearly in the word clouds below, we will need more complex language models to truly tease out context and make predictions from these datasets.

Description of Datasets (Raw Characters and Lines)

We have three datasets from Swiftkey: blogs, news, and twitter. As you would expect, these correspond to text from blogs, news sites, and Twitter. They contain roughly 0.9 million, 1 million, and 2.4 million lines of text, respectively.

Each line is generally short. Across the three datasets, the mean number of characters per line ranges from about 70 (twitter) to about 231 (blogs). However, there are extreme outliers, with character counts as large as 7376 (for blogs). This can be seen in the density plot below: most lines are fairly short and are largely clustered around the mean, but there is a long tail of outliers.

Granular Counts

The more detailed statistics for each dataset’s character counts per line are below:

Blogs

##     numchar      
##  Min.   :   2.0  
##  1st Qu.:  48.0  
##  Median : 157.0  
##  Mean   : 231.4  
##  3rd Qu.: 330.0  
##  Max.   :7376.0

News

##     numchar      
##  Min.   :   2.0  
##  1st Qu.: 111.0  
##  Median : 186.0  
##  Mean   : 202.5  
##  3rd Qu.: 269.0  
##  Max.   :1900.0

Twitter

##     numchar      
##  Min.   :  4.00  
##  1st Qu.: 38.00  
##  Median : 65.00  
##  Mean   : 69.61  
##  3rd Qu.:101.00  
##  Max.   :141.00
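
As a rough illustration of how per-line summaries like the ones above could be computed, here is a minimal Python sketch. It is not the code used to produce this report: the function name `char_count_summary` and the sample lines are hypothetical, and whether the original counts include trailing newline characters is not known, so the sketch strips them as an assumption.

```python
import statistics

def char_count_summary(lines):
    """Return the line count plus min, quartiles, mean, and max of characters per line."""
    # Assumption: trailing newlines are not counted as characters.
    counts = [len(line.rstrip("\n")) for line in lines]
    # method="inclusive" interpolates between data points, treating the
    # data as a full population that includes its own min and max.
    q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return {
        "lines": len(counts),
        "min": min(counts),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(counts),
        "q3": q3,
        "max": max(counts),
    }

# Hypothetical sample; in practice the lines would come from one of the dataset files.
sample = [
    "short line",
    "a somewhat longer line of text",
    "hi",
    "medium length line",
]
summary = char_count_summary(sample)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

For the full datasets, the list of lines could be replaced with an open file handle, since iterating over a file yields one line at a time.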

Dataset Word Content

In order to better understand what we’re working with, we need to take a look at the underlying content (the words) in the datasets. Below, I’ve laid out the top 30 words from each dataset. Note that I’ve removed common “stopwords” like “the”, “and”, etc. that would otherwise completely dominate our picture of the dataset.
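
To make the counting step concrete, here is a minimal Python sketch of a top-words tally with stopwords removed. It is an illustration under assumptions rather than the method used for this report: the stopword list is a tiny hypothetical stand-in, the tokenizer is a simple regular expression, and the name `top_words` is invented for the example.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the list used in the report.
STOPWORDS = {"the", "and", "a", "to", "of", "in", "is", "it", "i"}

def top_words(lines, n=30, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", line.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

# Hypothetical sample lines standing in for a dataset.
sample = [
    "I love the news and the blogs",
    "The news said it is love",
    "Love, love, love",
]
top = top_words(sample, n=3)
print(top)
```

Without the stopword filter, "the" would tie for second place even in this three-line sample, which is the effect the filtering is meant to avoid.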

There are two things of note in the above picture.

The frequency of words drops off quickly. For example, the top 10 words across datasets are far more common than the next 10 (almost double the frequency).
There are many “ambiguous” words among the top few. For example, “just” and “will” may or may not be serving as grammatical constructs (“I just went to the store” vs. “just cause”, or “I will go to the store” vs. “iron will”). This, unsurprisingly, shows that we’ll need more complex language models than simple word frequencies to tease out context. We’ll probably want to look at word pairs at least, and possibly triples, quads, etc., if not full-blown semantics (basically, modeling the grammar of sentences).
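
The idea of looking at word pairs and triples can be sketched as follows. This is a minimal Python illustration, not the model planned for this project: the function name `ngram_counts` and the sample sentences are hypothetical, n-grams are counted within a line only, and stopwords are deliberately kept because words like “just” and “will” are exactly the ones whose neighbors we want to see.

```python
import re
from collections import Counter

def ngram_counts(lines, n=2):
    """Count n-grams (tuples of n consecutive tokens) within each line."""
    counts = Counter()
    for line in lines:
        tokens = re.findall(r"[a-z']+", line.lower())
        # Zipping n shifted copies of the token list yields consecutive n-tuples.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

# Hypothetical sample built from the ambiguous-word examples.
sample = [
    "I just went to the store",
    "It was a just cause",
    "I will go to the store",
    "She has an iron will",
]
bigrams = ngram_counts(sample, n=2)
trigrams = ngram_counts(sample, n=3)

# The neighbors of "just" and "will" separate the two senses.
for pair in [("just", "went"), ("just", "cause"), ("i", "will"), ("iron", "will")]:
    print(pair, bigrams[pair])
print(trigrams.most_common(1))
```

Even on this toy sample, the pairs separate “just went” from “just cause” and “I will” from “iron will”, which single-word frequencies cannot do.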

For a better look, I’ve included word clouds for each dataset below. The size of each word corresponds to how frequently it occurs. Using this view, we can start to see some thematic differences between the datasets. For example, “said” is a fairly prominent word in news, whereas “love” and “thank” stand out in twitter. These clouds show all words that occur at least 1000 times in our datasets.