Data

SwiftKey provided three training data sets as text files. These data sets include text scraped from Twitter, blogs, and news sources, and were read into R using the readLines function, then converted to tibbles for exploratory analysis. Each row in the data sets corresponds to a single tweet or line from a blog or news article.
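A minimal sketch of this step, assuming the standard en_US file names from the SwiftKey corpus:

    library(tibble)

    # Read each corpus file; skipNul guards against embedded nul characters
    twitter <- readLines("en_US.twitter.txt", skipNul = TRUE, warn = FALSE)
    blogs   <- readLines("en_US.blogs.txt", skipNul = TRUE, warn = FALSE)
    news    <- readLines("en_US.news.txt", skipNul = TRUE, warn = FALSE)

    # One row per tweet / blog line / news line
    twitter_df <- tibble(source = "Twitter", text = twitter)
    blogs_df   <- tibble(source = "Blogs", text = blogs)
    news_df    <- tibble(source = "News", text = news)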

Exploratory Analysis

We begin by reporting summary statistics from each of the three data sources, shown in the table below. Length refers to the number of characters in each line from the data source.

Data Source      Lines   Mean Length   Median Length   St Dev Length
Twitter        2360148      68.68045              64        37.22725
Blogs           899288     229.98695             156       258.66081
News           1010242     201.16285             185       133.21714
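These statistics can be reproduced along the following lines, assuming the tibbles built above and the dplyr package:

    library(dplyr)

    bind_rows(twitter_df, blogs_df, news_df) %>%
      mutate(length = nchar(text)) %>%   # characters per line
      group_by(source) %>%
      summarise(lines         = n(),
                mean_length   = mean(length),
                median_length = median(length),
                st_dev_length = sd(length))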

In addition, we can view density plots of the line lengths from each data source. Note that the Twitter data has a maximum line length of 144 characters. For the blog and news sources, we plot the base-ten log of the number of characters, as the maximum line lengths are 40833 and 11384 characters, respectively.
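A sketch of one of these density plots, assuming the blogs_df tibble built above and the ggplot2 package (the news plot is analogous):

    library(ggplot2)

    # Blog line lengths are heavily right-skewed, so plot log10(characters)
    blogs_df %>%
      mutate(length = nchar(text)) %>%
      ggplot(aes(x = log10(length))) +
      geom_density() +
      labs(x = "Base-ten log of line length (characters)", y = "Density")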

To build a predictive text algorithm, we must break down each line by word. After filtering out the most common words (stop words), we also filter out profanity, using a list of banned words originally compiled by Google and available in the user RobertJGabriel’s GitHub repository. We then sort by the most frequent words in each data set, the top fifteen of which are shown in the table below; a code sketch of this processing follows the table.

Fifteen Most Frequent Words by Data Source
Twitter
Word Frequency
just 151115
like 122455
get 112459
love 106721
good 101026
day 91710
can 89847
thanks 89660
rt 89537
now 83986
one 82858
know 79916
u 77531
time 76794
great 76139
Blogs
Word Frequency
one 127287
just 100793
like 100442
can 98420
time 90918
get 71093
know 60496
now 60358
people 59574
also 55366
new 54847
day 52372
even 52174
first 51634
back 51306
News
Word Frequency
said 250418
one 88794
year 76765
new 70773
two 63867
can 58924
also 58786
first 57866
time 57062
just 53350
last 52079
like 50829
state 50095
people 47666
years 46969
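A sketch of this processing for the Twitter data, using the tidytext package; here profanity.txt stands in for the banned-word list downloaded from RobertJGabriel’s repository:

    library(dplyr)
    library(tidytext)

    # Placeholder path for the downloaded profanity list
    profanity <- readLines("profanity.txt")

    twitter_words <- twitter_df %>%
      unnest_tokens(word, text) %>%           # one lowercased word per row
      anti_join(stop_words, by = "word") %>%  # remove common stop words
      filter(!word %in% profanity) %>%        # remove banned words
      count(word, sort = TRUE)

    head(twitter_words, 15)  # the fifteen most frequent words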

We can also examine how frequently unique words appear in each source after filtering. The plot below shows the base-ten log of the frequency of each word in the Twitter data set, with words ranked by frequency along the horizontal axis. We do not show the analogous plots for the other two data sets, as they exhibit the same behavior as the plot for Twitter.
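A sketch of the Twitter plot, assuming the twitter_words counts from the sketch above:

    # Words are already sorted by frequency, so row number gives the rank
    twitter_words %>%
      mutate(rank = row_number()) %>%
      ggplot(aes(x = rank, y = log10(n))) +
      geom_line() +
      labs(x = "Word frequency rank", y = "Base-ten log of frequency")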

For an effective predictive algorithm, we must also be able to find the frequencies of pairs of words in each data set. The table below shows, for each data set, the fifteen most common ordered word pairs after filtering for stop words and profanity; a code sketch follows the table.

Fifteen Most Frequent Pairs of Words by Data Source
Twitter
Word 1 Word 2 Frequency
happy birthday 8389
social media 3886
mother’s day 2874
stay tuned 2657
mothers day 2572
san diego 2232
rt rt 2106
happy friday 1952
1 2 1919
ice cream 1899
happy hour 1859
beautiful day 1813
happy mothers 1769
lol rt 1646
tomorrow night 1605
Blogs
Word 1 Word 2 Frequency
1 2 3976
weeks ago 1606
ice cream 1585
1 4 1469
social media 1342
jesus christ 1314
south africa 1153
real life 1145
3 4 1108
10 minutes 1072
olive oil 1059
feel free 1014
blog post 997
months ago 983
30 minutes 968
News
Word 1 Word 2 Frequency
st louis 9329
los angeles 5333
30 p.m 4493
san francisco 4478
health care 4009
vice president 2906
1 2 2885
san diego 2712
7 p.m 2275
white house 2249
30 a.m 2188
law enforcement 2170
executive director 2156
real estate 2062
supreme court 2052
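A sketch of the bigram counts for the Twitter data, under the same assumptions as the word-count sketch above:

    library(tidyr)

    twitter_bigrams <- twitter_df %>%
      unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
      separate(bigram, into = c("word1", "word2"), sep = " ") %>%
      filter(!is.na(word1)) %>%  # drop lines too short to form a pair
      filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word,
             !word1 %in% profanity, !word2 %in% profanity) %>%
      count(word1, word2, sort = TRUE)

    head(twitter_bigrams, 15)  # the fifteen most frequent pairs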

Our predictive model will use the relative frequencies of ordered word pairs and ordered word triples, computed from the training data set, to predict a word based on the two words preceding it. For input word pairs that do not appear in the training set, the algorithm will fall back to the overall word frequencies of the training set to generate a suggested word.
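A minimal sketch of the prediction step under this design; the trigram counts, the fallback rule, and the predict_word helper below are illustrative, not the final implementation:

    # Ordered word triples from the training text
    trigrams <- twitter_df %>%
      unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
      separate(trigram, into = c("word1", "word2", "word3"), sep = " ") %>%
      filter(!is.na(word1)) %>%
      count(word1, word2, word3, sort = TRUE)

    # Overall word frequencies, used when the input pair was never seen
    unigrams <- twitter_df %>%
      unnest_tokens(word, text) %>%
      count(word, sort = TRUE)

    predict_word <- function(w1, w2) {
      seen <- trigrams %>% filter(word1 == w1, word2 == w2)
      if (nrow(seen) > 0) {
        seen$word3[1]      # most frequent continuation of the pair
      } else {
        unigrams$word[1]   # fall back to the most common word overall
      }
    }

    predict_word("happy", "mothers")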