This is a preliminary report for the Data Science Specialization Capstone on Coursera. The purpose of this report is to explore the data in preparation for predictive modeling. The R Markdown file that generated this report is available on GitHub at https://github.com/jnd18/capstone-milestone.
The data for the capstone project is available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Inside the zipped archive is a directory containing three English-language text files, which were gathered from public websites by a web crawler.
The three files correspond to three types of text sources: Twitter, news websites, and blogs. The Twitter file is 159 MB, the news file is 196 MB, and the blogs file is 200 MB.
First, some basic descriptions. The Twitter file contains 2,360,148 lines and 30,093,369 words. The news file contains 77,259 lines and 2,674,536 words. The blogs file contains 899,288 lines and 37,546,246 words. Below are tables displaying the ten most frequent words in each source, with common so-called “stop words” removed so that the lists better illustrate what makes each source distinctive. The top words for each source look very different.
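As a rough illustration of how counts like these might be produced, here is a minimal sketch using the tidytext and dplyr packages; the file name is assumed, and this is not the exact code behind this report.

```r
library(dplyr)
library(tidytext)

# Read one source (file name assumed) and tabulate word frequencies
lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

word_counts <- tibble(text = lines) %>%
  unnest_tokens(word, text) %>%          # one lowercased word per row
  anti_join(stop_words, by = "word") %>% # drop common stop words
  count(word, sort = TRUE)               # frequency table, most frequent first

head(word_counts, 10)                    # top ten words for this source
```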
Twitter:

| rank | word | occurrences |
|---|---|---|
| 1 | love | 106721 |
| 2 | day | 91710 |
| 3 | rt | 89537 |
| 4 | time | 76794 |
| 5 | lol | 70133 |
| 6 | 3 | 54940 |
| 7 | people | 52040 |
| 8 | happy | 48998 |
| 9 | follow | 48104 |
| 10 | 2 | 45515 |
News:

| rank | word | occurrences |
|---|---|---|
| 1 | time | 4474 |
| 2 | people | 3673 |
| 3 | 1 | 2994 |
| 4 | city | 2902 |
| 5 | school | 2702 |
| 6 | percent | 2635 |
| 7 | game | 2589 |
| 8 | day | 2477 |
| 9 | home | 2438 |
| 10 | 2 | 2434 |
Blogs:

| rank | word | occurrences |
|---|---|---|
| 1 | time | 90918 |
| 2 | people | 59574 |
| 3 | day | 52372 |
| 4 | love | 45230 |
| 5 | life | 41251 |
| 6 | it’s | 38657 |
| 7 | 1 | 30907 |
| 8 | 2 | 29561 |
| 9 | world | 29305 |
| 10 | i’m | 29189 |
Below, we have histograms of the word frequencies for each source. The histograms show that most words appear only a few times, while a handful of words appear very frequently. Even with the common words removed, the distribution of frequencies is highly right-skewed.
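A plot along these lines could be produced from the `word_counts` table in the sketch above, for example with ggplot2 (again a sketch, not the exact plotting code used here):

```r
library(ggplot2)

# Word frequencies are heavily right-skewed, so use a log scale on the x-axis
ggplot(word_counts, aes(x = n)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  labs(x = "number of occurrences", y = "number of distinct words")
```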
Next we have the bigram frequency tables for the three sources, again with common words removed. A bigram is just a pair of adjacent words. The Twitter table shows two forms of “Mother’s Day”; perhaps the data were collected around Mother’s Day. The news table shows many city names. The blog table has a few pairs of numbers, which we believe come from fractions used in recipes, like “1/2 cup”. The histograms of bigram frequencies look almost identical to the ones above, except with even more mass on the left; it makes sense that even more pairs of words appear only once. Thus, we won’t bother to display them.
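The bigram counts can be computed much like the word counts, using the n-gram tokenizer in tidytext; setting `n = 3` gives the trigram counts shown later. This is a sketch that assumes the `lines` and `stop_words` objects from the earlier example.

```r
library(tidyr)

# Count bigrams, dropping any bigram that contains a stop word
bigram_counts <- tibble(text = lines) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%                      # short lines yield NA bigrams
  separate(bigram, into = c("w1", "w2"), sep = " ") %>%
  filter(!w1 %in% stop_words$word,
         !w2 %in% stop_words$word) %>%
  unite(bigram, w1, w2, sep = " ") %>%
  count(bigram, sort = TRUE)
```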
Twitter:

| rank | bigram | occurrences |
|---|---|---|
| 1 | happy birthday | 8389 |
| 2 | social media | 3886 |
| 3 | mother’s day | 2874 |
| 4 | stay tuned | 2657 |
| 5 | mothers day | 2572 |
| 6 | san diego | 2232 |
| 7 | rt rt | 2102 |
| 8 | happy friday | 1952 |
| 9 | 1 2 | 1918 |
| 10 | ice cream | 1899 |
News:

| rank | bigram | occurrences |
|---|---|---|
| 1 | st louis | 701 |
| 2 | los angeles | 436 |
| 3 | san francisco | 381 |
| 4 | 30 p.m | 354 |
| 5 | health care | 317 |
| 6 | 1 2 | 227 |
| 7 | san diego | 219 |
| 8 | vice president | 219 |
| 9 | white house | 179 |
| 10 | 7 p.m | 167 |
Blogs:

| rank | bigram | occurrences |
|---|---|---|
| 1 | 1 2 | 3974 |
| 2 | weeks ago | 1606 |
| 3 | ice cream | 1585 |
| 4 | 1 4 | 1465 |
| 5 | social media | 1342 |
| 6 | jesus christ | 1314 |
| 7 | south africa | 1153 |
| 8 | real life | 1145 |
| 9 | 3 4 | 1109 |
| 10 | 10 minutes | 1072 |
Finally, we have the trigram tables. For the Twitter data, we see many holidays, all of which occur early in the year. We also, amusingly, see “cake cake cake.” For the news data, we see different times of day as well as names and titles. For the blog data, we see many cooking measurements. Interestingly, “world war ii” appears frequently both in the news and on blogs, but not on Twitter. Again, the histograms just look like one huge spike on the left, so we won’t display them.
Twitter:

| rank | trigram | occurrences |
|---|---|---|
| 1 | happy mothers day | 1743 |
| 2 | happy mother’s day | 1582 |
| 3 | cinco de mayo | 1002 |
| 4 | st patrick’s day | 414 |
| 5 | love love love | 412 |
| 6 | ha ha ha | 363 |
| 7 | cake cake cake | 341 |
| 8 | happy valentine’s day | 341 |
| 9 | ralph waldo emerson | 318 |
| 10 | happy valentines day | 298 |
News:

| rank | trigram | occurrences |
|---|---|---|
| 1 | president barack obama | 95 |
| 2 | 7 30 p.m | 77 |
| 3 | st louis county | 76 |
| 4 | gov chris christie | 66 |
| 5 | world war ii | 53 |
| 6 | 11 30 a.m | 49 |
| 7 | 6 30 p.m | 42 |
| 8 | 1 1 2 | 41 |
| 9 | chief financial officer | 40 |
| 10 | 1 2 cup | 39 |
Blogs:

| rank | trigram | occurrences |
|---|---|---|
| 1 | 1 2 cup | 710 |
| 2 | 1 4 cup | 462 |
| 3 | 1 1 2 | 461 |
| 4 | amazon services llc | 427 |
| 5 | world war ii | 310 |
| 6 | 1 2 tsp | 266 |
| 7 | 2 1 2 | 262 |
| 8 | amp amp amp | 250 |
| 9 | lord jesus christ | 219 |
| 10 | amazon eu associates | 213 |
Using the basic techniques we’ve developed for exploring the data, we can create a simple n-gram model. To predict the next word in a sequence, we look at the previous few words and pick the word that most commonly appears after them in the text data. We would not remove stop words in this context. We could use some kind of smoothing, like adding 1 to each word count. We could also use a backoff strategy: for example, when predicting the next word from a two-word context, if that context is common, use the most frequent third word; if it is uncommon, back off and predict based on the most recent word alone.
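To make the backoff idea concrete, here is a minimal sketch of a predictor built on top of trigram and bigram count tables (with stop words kept). The table names and column layout are assumptions for illustration, and smoothing is omitted for brevity.

```r
library(dplyr)

# Assumed inputs:
#   trigrams: data frame with columns w1, w2, w3, n (count of each 3-word sequence)
#   bigrams:  data frame with columns w1, w2, n     (count of each 2-word sequence)
predict_next <- function(prev2, prev1, trigrams, bigrams, min_count = 2) {
  # Try the trigram table first: most frequent w3 following (prev2, prev1)
  hit <- trigrams %>%
    filter(w1 == prev2, w2 == prev1, n >= min_count) %>%
    arrange(desc(n)) %>%
    slice(1)
  if (nrow(hit) == 1) return(hit$w3)

  # Back off to the bigram table: most frequent w2 following prev1
  hit <- bigrams %>%
    filter(w1 == prev1) %>%
    arrange(desc(n)) %>%
    slice(1)
  if (nrow(hit) == 1) return(hit$w2)

  "the"  # final fallback: an extremely common word
}

# Example call (with hypothetical tables):
# predict_next("thanks", "for", trigrams, bigrams)
```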
With n-gram models, it can be tricky to achieve high predictive performance within reasonable computational constraints. For this reason, we may try the alternative approach of training a recurrent neural network to predict the next character in a sequence of characters. This would front-load the computational burden into training time and, we hope, deliver a more responsive app once the model is trained. Another benefit of this approach is that predicting on the basis of characters allows us to complete partial words, just like a real predictive keyboard.
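For completeness, here is a rough sketch of what such a character-level model might look like using the keras package for R. The vocabulary size, context length, and layer sizes are placeholder values, not a final architecture.

```r
library(keras)

vocab_size  <- 100  # placeholder: number of distinct characters in the corpus
context_len <- 40   # placeholder: characters of context fed to the model

# Map a window of character ids to a distribution over the next character
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = vocab_size, output_dim = 32,
                  input_length = context_len) %>%
  layer_lstm(units = 128) %>%
  layer_dense(units = vocab_size, activation = "softmax")

model %>% compile(
  loss = "sparse_categorical_crossentropy",  # targets are integer character ids
  optimizer = "adam",
  metrics = "accuracy"
)
```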