Introduction

This is a preliminary report for the Data Science Specialization Capstone on Coursera. The purpose of this report is to explore the data in preparation for predictive modeling. The R Markdown file that generated this report is available on GitHub at https://github.com/jnd18/capstone-milestone.

Data

The data for the capstone project is available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Inside the zipped archive is a directory containing three English-language text files, which were gathered from public websites by a web crawler.

There are three files, corresponding to three types of text sources: Twitter, news websites, and blogs. The Twitter file is 159 MB, the news file is 196 MB, and the blogs file is 200 MB.

Analysis

First, some basic descriptions. The Twitter file contains 2,360,148 lines and 30,093,369 words; the news file contains 77,259 lines and 2,674,536 words; and the blogs file contains 899,288 lines and 37,546,246 words. Below we have tables displaying the ten most frequent words in each source after removing common so-called “stop words,” so that the lists better illustrate what makes each source unique (a sketch of how such counts can be computed follows the tables). The top words for each source look very different.

Twitter
rank  word     occurrences
1     love     106721
2     day      91710
3     rt       89537
4     time     76794
5     lol      70133
6     3        54940
7     people   52040
8     happy    48998
9     follow   48104
10    2        45515
News
rank  word     occurrences
1     time     4474
2     people   3673
3     1        2994
4     city     2902
5     school   2702
6     percent  2635
7     game     2589
8     day      2477
9     home     2438
10    2        2434
Blogs
rank  word     occurrences
1     time     90918
2     people   59574
3     day      52372
4     love     45230
5     life     41251
6     it’s     38657
7     1        30907
8     2        29561
9     world    29305
10    i’m      29189
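
As an illustration, here is a minimal sketch of how such a frequency table could be built with the readr, dplyr, and tidytext packages. The file path is an assumption about where the unzipped data lives, and this is a sketch rather than the exact code behind the tables above.

    library(readr)
    library(dplyr)
    library(tidytext)

    # Assumed path: the zip archive unpacks to a final/en_US/ directory
    twitter <- tibble(text = read_lines("final/en_US/en_US.twitter.txt"))

    word_counts <- twitter %>%
        unnest_tokens(word, text) %>%            # one lowercase word per row
        anti_join(stop_words, by = "word") %>%   # drop common stop words
        count(word, sort = TRUE)

    head(word_counts, 10)                        # top ten, as in the tables above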

Below, we have histograms of the word frequencies from the different sources, again with the common stop words removed. The histograms show that most words appear only a few times, while a handful of words appear very frequently; the distribution of frequencies is strongly right-skewed.
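
A histogram like the ones described can be drawn directly from that frequency table with ggplot2; again, this is a sketch, not necessarily the exact plotting code used for the report.

    library(ggplot2)

    # word_counts comes from the previous sketch (columns: word, n)
    ggplot(word_counts, aes(x = n)) +
        geom_histogram(bins = 50) +
        labs(x = "number of occurrences", y = "number of distinct words")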

Next we have the bigram frequency tables for the three sources, again with common words removed; a bigram is just a pair of adjacent words. The Twitter table shows two spellings of “mother’s day,” so perhaps the data were collected around Mother’s Day. The news table shows many city names. The blogs table has a few pairs of numbers, which we believe come from fractions used in recipes, like “1/2 cup.” The histograms of bigram frequencies look almost identical to the ones above, except with even more mass on the left; it makes sense that even more pairs of words appear only once. We therefore won’t bother to display them. (A sketch of the bigram extraction follows the tables.)

Twitter
rank  bigram          occurrences
1     happy birthday  8389
2     social media    3886
3     mother’s day    2874
4     stay tuned      2657
5     mothers day     2572
6     san diego       2232
7     rt rt           2102
8     happy friday    1952
9     1 2             1918
10    ice cream       1899
News
rank  bigram          occurrences
1     st louis        701
2     los angeles     436
3     san francisco   381
4     30 p.m          354
5     health care     317
6     1 2             227
7     san diego       219
8     vice president  219
9     white house     179
10    7 p.m           167
Blogs
rank  bigram          occurrences
1     1 2             3974
2     weeks ago       1606
3     ice cream       1585
4     1 4             1465
5     social media    1342
6     jesus christ    1314
7     south africa    1153
8     real life       1145
9     3 4             1109
10    10 minutes      1072
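
For completeness, here is a sketch of how bigram counts like these can be computed; token = "ngrams" is tidytext’s built-in n-gram tokenizer, and the stop-word filtering mirrors the unigram case. Setting n = 3 would yield trigram counts like those in the tables below.

    library(dplyr)
    library(tidyr)
    library(tidytext)

    # twitter is the one-column tibble from the first sketch
    bigram_counts <- twitter %>%
        unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
        filter(!is.na(bigram)) %>%                    # one-word lines yield NA
        separate(bigram, into = c("w1", "w2"), sep = " ") %>%
        filter(!w1 %in% stop_words$word,
               !w2 %in% stop_words$word) %>%          # drop stop-word pairs
        unite(bigram, w1, w2, sep = " ") %>%
        count(bigram, sort = TRUE)

    head(bigram_counts, 10)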

Finally, we have the trigram tables. In the Twitter data we see many holidays, all of which fall in the first half of the year. We also, amusingly, see “cake cake cake.” In the news data we see various times of day as well as names and titles. In the blogs data we see many cooking measurements. Interestingly, “world war ii” appears frequently both in the news and on blogs, but not on Twitter. Again, the histograms just look like one huge spike on the left, so we won’t display them.

Twitter
rank  trigram                  occurrences
1     happy mothers day        1743
2     happy mother’s day       1582
3     cinco de mayo            1002
4     st patrick’s day         414
5     love love love           412
6     ha ha ha                 363
7     cake cake cake           341
8     happy valentine’s day    341
9     ralph waldo emerson      318
10    happy valentines day     298
News
rank  trigram                  occurrences
1     president barack obama   95
2     7 30 p.m                 77
3     st louis county          76
4     gov chris christie       66
5     world war ii             53
6     11 30 a.m                49
7     6 30 p.m                 42
8     1 1 2                    41
9     chief financial officer  40
10    1 2 cup                  39
Blogs
rank  trigram                  occurrences
1     1 2 cup                  710
2     1 4 cup                  462
3     1 1 2                    461
4     amazon services llc      427
5     world war ii             310
6     1 2 tsp                  266
7     2 1 2                    262
8     amp amp amp              250
9     lord jesus christ        219
10    amazon eu associates     213

Plans for the App

Using the basic techniques we’ve developed for exploring the data, we can create a simple n-gram model. To predict the next word in a sequence, we look at the previous few words and pick the word that most commonly follows them in the text data. We would not remove stop words in this context. We could use some kind of smoothing, such as adding 1 to each word count (add-one, or Laplace, smoothing). We could also use a backoff strategy: for example, when predicting the next word from a two-word sequence, if that two-word sequence is common, use its most frequent continuation; if it is uncommon, back off and predict based on the second word alone.
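
To make the backoff idea concrete, here is a minimal sketch of such a predictor. It assumes count tables trigrams (columns w1, w2, w3, n) and bigrams (columns w1, w2, n) built from the full corpus with stop words kept; the column names and the min_count threshold are illustrative, not part of a finished design.

    library(dplyr)

    predict_next <- function(first, second, trigrams, bigrams, min_count = 5) {
        hits <- trigrams %>%
            filter(w1 == first, w2 == second) %>%
            arrange(desc(n))
        if (nrow(hits) > 0 && hits$n[1] >= min_count) {
            return(hits$w3[1])        # the two-word context is common enough
        }
        # Back off: predict from the last word alone
        hits <- bigrams %>%
            filter(w1 == second) %>%
            arrange(desc(n))
        if (nrow(hits) > 0) return(hits$w2[1])
        NA_character_                 # no evidence for this context at all
    }

    # e.g. predict_next("happy", "mothers", trigrams, bigrams)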

With n-gram models it can be tricky to achieve high predictive performance within reasonable computational constraints. For this reason, we may try the alternative approach of training a recurrent neural network to predict the next character in a sequence of characters. This would front-load the computational burden into training time and hopefully deliver a more responsive app once the model is trained. Another benefit of this approach is that predicting on the basis of characters allows us to complete partial words, just like a real predictive keyboard.
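
As a rough sketch of the kind of character-level model we have in mind, using the keras package for R: the layer sizes, vocabulary size, and sequence length below are placeholders rather than tuned values, and the input is assumed to be one-hot encoded characters.

    library(keras)

    vocab_size <- 100   # number of distinct characters (placeholder)
    seq_length <- 40    # characters of context fed to the model (placeholder)

    model <- keras_model_sequential() %>%
        layer_lstm(units = 128, input_shape = c(seq_length, vocab_size)) %>%
        layer_dense(units = vocab_size, activation = "softmax")

    model %>% compile(
        loss = "categorical_crossentropy",
        optimizer = "adam"
    )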