Data
SwiftKey provided three training data sets as text files. These data sets include text scraped from Twitter, blogs, and news sources, and were read into R using the readLines function, then converted to tibbles for exploratory analysis. Each row in the data sets corresponds to a single tweet or line from a blog or news article.
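A minimal sketch of this loading step, assuming the standard en_US file names from the SwiftKey download (the paths are an assumption):

```r
library(tibble)

# File names assumed from the SwiftKey en_US corpus; adjust paths as needed.
twitter_lines <- readLines("en_US.twitter.txt", skipNul = TRUE)
blog_lines    <- readLines("en_US.blogs.txt",   skipNul = TRUE)
news_lines    <- readLines("en_US.news.txt",    skipNul = TRUE)

# One row per tweet / blog line / news line.
twitter <- tibble(text = twitter_lines)
blogs   <- tibble(text = blog_lines)
news    <- tibble(text = news_lines)
```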
Exploratory Analysis
We begin by reporting summary statistics for each of the three data sources, shown in the table below; a sketch of the computation follows the table. Length refers to the number of characters in each line of the source.
| Data Source | Lines | Mean Length | Median Length | St Dev Length |
|-------------|-------|-------------|---------------|---------------|
| Twitter | 2360148 | 68.68045 | 64 | 37.22725 |
| Blogs | 899288 | 229.98695 | 156 | 258.66081 |
| News | 1010242 | 201.16285 | 185 | 133.21714 |
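These statistics could be computed along the following lines with dplyr, using the tibbles built above (line_summary is a hypothetical helper, not code from the original analysis):

```r
library(dplyr)

# Hypothetical helper: summary statistics of line length for one source.
line_summary <- function(df, src) {
  df %>%
    mutate(len = nchar(text)) %>%
    summarise(
      source        = src,
      lines         = n(),
      mean_length   = mean(len),
      median_length = median(len),
      sd_length     = sd(len)
    )
}

bind_rows(
  line_summary(twitter, "Twitter"),
  line_summary(blogs,   "Blogs"),
  line_summary(news,    "News")
)
```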
In addition, we can view density plots of line length for each data source. Note that the Twitter data has a maximum line length of 144 characters. For the blog and news sources we plot the base-ten log of line length, as the maximum line lengths are 40833 and 11384 characters, respectively.
[Density plots of line length for each data source]
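The density plots above could be produced roughly as follows (a sketch; plot labels are assumptions):

```r
library(ggplot2)

# Twitter lines are short enough to plot on the raw character scale.
ggplot(twitter, aes(x = nchar(text))) +
  geom_density() +
  labs(x = "Characters per line", y = "Density")

# Blog (and, analogously, news) lines span several orders of magnitude,
# so we plot the base-ten log of line length instead.
ggplot(blogs, aes(x = log10(nchar(text)))) +
  geom_density() +
  labs(x = "log10(characters per line)", y = "Density")
```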
To build a predictive text algorithm, we must break each line into words. After removing the most common words (stop words), we filter out profanity using a list of banned words previously developed by Google and found on the GitHub of user RobertJGabriel. We then sort each data set by word frequency; the top fifteen words for each are shown in the table below.
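A sketch of this pipeline for the Twitter tibble using tidytext (the local path to the profanity list is an assumption; the same steps apply to the other sources):

```r
library(dplyr)
library(tidytext)

# Banned-word list from RobertJGabriel's GitHub repository; the local
# file path is an assumption.
profanity <- tibble(word = readLines("google-profanity-words/list.txt"))

word_counts <- twitter %>%
  unnest_tokens(word, text) %>%           # one row per word
  anti_join(stop_words, by = "word") %>%  # drop common stop words
  anti_join(profanity,  by = "word") %>%  # drop banned words
  count(word, sort = TRUE)

head(word_counts, 15)
```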
Fifteen Most Frequent Words by Data Source
Twitter

| Word | Frequency |
|------|-----------|
| just | 151115 |
| like | 122455 |
| get | 112459 |
| love | 106721 |
| good | 101026 |
| day | 91710 |
| can | 89847 |
| thanks | 89660 |
| rt | 89537 |
| now | 83986 |
| one | 82858 |
| know | 79916 |
| u | 77531 |
| time | 76794 |
| great | 76139 |
Blogs

| Word | Frequency |
|------|-----------|
| one | 127287 |
| just | 100793 |
| like | 100442 |
| can | 98420 |
| time | 90918 |
| get | 71093 |
| know | 60496 |
| now | 60358 |
| people | 59574 |
| also | 55366 |
| new | 54847 |
| day | 52372 |
| even | 52174 |
| first | 51634 |
| back | 51306 |
News

| Word | Frequency |
|------|-----------|
| said | 250418 |
| one | 88794 |
| year | 76765 |
| new | 70773 |
| two | 63867 |
| can | 58924 |
| also | 58786 |
| first | 57866 |
| time | 57062 |
| just | 53350 |
| last | 52079 |
| like | 50829 |
| state | 50095 |
| people | 47666 |
| years | 46969 |
We can also examine how frequently unique words appear in each source after filtering. The plot below shows the base-ten log of each word's frequency in the Twitter data set, plotted against word frequency rank. We omit the analogous plots for the other two data sets, as they exhibit the same behavior as the plot for Twitter.
[Log word frequency versus frequency rank, Twitter data]
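The frequency–rank plot can be generated from the word_counts tibble built above (a sketch):

```r
library(dplyr)
library(ggplot2)

# word_counts is already sorted by descending frequency, so row number = rank.
word_counts %>%
  mutate(rank = row_number()) %>%
  ggplot(aes(x = rank, y = log10(n))) +
  geom_line() +
  labs(x = "Word frequency rank", y = "log10(frequency)")
```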
To build an effective predictive algorithm, we must also find the frequencies of ordered word pairs (bigrams) in each data set. The table below shows, for each data set, the fifteen most common ordered word pairs after filtering for stop words and profanity.
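One way to build these counts with tidytext, reusing the profanity tibble from above (a sketch for the Twitter data):

```r
library(dplyr)
library(tidyr)
library(tidytext)

bigram_counts <- twitter %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%   # lines with fewer than two words yield NA
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(
    !word1 %in% stop_words$word, !word2 %in% stop_words$word,
    !word1 %in% profanity$word,  !word2 %in% profanity$word
  ) %>%
  count(word1, word2, sort = TRUE)

head(bigram_counts, 15)
```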
Fifteen Most Frequent Pairs of Words by Data Source
Twitter

| Word 1 | Word 2 | Frequency |
|--------|--------|-----------|
| happy | birthday | 8389 |
| social | media | 3886 |
| mother’s | day | 2874 |
| stay | tuned | 2657 |
| mothers | day | 2572 |
| san | diego | 2232 |
| rt | rt | 2106 |
| happy | friday | 1952 |
| 1 | 2 | 1919 |
| ice | cream | 1899 |
| happy | hour | 1859 |
| beautiful | day | 1813 |
| happy | mothers | 1769 |
| lol | rt | 1646 |
| tomorrow | night | 1605 |
Blogs

| Word 1 | Word 2 | Frequency |
|--------|--------|-----------|
| 1 | 2 | 3976 |
| weeks | ago | 1606 |
| ice | cream | 1585 |
| 1 | 4 | 1469 |
| social | media | 1342 |
| jesus | christ | 1314 |
| south | africa | 1153 |
| real | life | 1145 |
| 3 | 4 | 1108 |
| 10 | minutes | 1072 |
| olive | oil | 1059 |
| feel | free | 1014 |
| blog | post | 997 |
| months | ago | 983 |
| 30 | minutes | 968 |
News

| Word 1 | Word 2 | Frequency |
|--------|--------|-----------|
| st | louis | 9329 |
| los | angeles | 5333 |
| 30 | p.m | 4493 |
| san | francisco | 4478 |
| health | care | 4009 |
| vice | president | 2906 |
| 1 | 2 | 2885 |
| san | diego | 2712 |
| 7 | p.m | 2275 |
| white | house | 2249 |
| 30 | a.m | 2188 |
| law | enforcement | 2170 |
| executive | director | 2156 |
| real | estate | 2062 |
| supreme | court | 2052 |
Our predictive model will use the relative frequencies of ordered word pairs and ordered word triples to predict a word from the two words preceding it in the training data set. For input words that do not appear in the training set, the algorithm will fall back on frequencies aggregated over the whole training set to generate a suggested word.
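As a rough illustration of the intended lookup (not the final model), suppose a trigram_counts tibble has been built analogously to bigram_counts, with columns word1, word2, word3, and n; here the unseen-input fallback is simplified to the single most frequent word overall:

```r
library(dplyr)

# Hypothetical sketch: predict the most likely third word given a pair.
predict_next <- function(w1, w2) {
  hit <- trigram_counts %>%
    filter(word1 == w1, word2 == w2) %>%
    arrange(desc(n)) %>%
    slice(1)
  if (nrow(hit) > 0) return(hit$word3)
  # Simplified fallback for unseen inputs: the most frequent word overall.
  word_counts$word[1]
}

predict_next("happy", "mothers")  # e.g. might return "day" given the counts above
```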