The objective of this project is to build a web page using “Shiny” that will predict the next word a user is going to type, given a few words he has already typed into the page. To do this, we have been given three pieces of sample text on which to base our predictions: collections of text from blogs, from news sources, and from Twitter. In this Milestone Report, we look at the raw data from those three sources to get a feel for it.

To start with, here is a bit of information about the three data sources:

| fileName          | lineCount | wordCount | firstLine | lastLine | wordsPerLine |
|-------------------|----------:|----------:|----------:|---------:|-------------:|
| en_US.blogs.txt   |    899288 |  37334131 |         1 |   899288 |        41.52 |
| en_US.news.txt    |   1909530 |  71706661 |    899289 |  1909530 |        37.55 |
| en_US.twitter.txt |   4269678 | 102080244 |   1909531 |  4269678 |        23.91 |
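
For reference, per-file counts along these lines could be gathered roughly as sketched below (this is a sketch, not necessarily the code used for the table above; the firstLine and lastLine columns appear to index each file’s lines within the three files taken together):

```r
library(stringi)

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

# Count lines and words in one file and return a one-row summary.
summarise_file <- function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  words <- sum(stri_count_words(lines))
  data.frame(fileName     = f,
             lineCount    = length(lines),
             wordCount    = words,
             wordsPerLine = round(words / length(lines), 2))
}

do.call(rbind, lapply(files, summarise_file))
```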

And now a “Word Cloud” for a sample from each source (10,000 lines per source), followed by a graph of each sample’s most frequently used words and their frequencies.
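
For one source, the sampling and plotting step might look something like this (a sketch, not necessarily the exact code used here):

```r
library(tm)
library(wordcloud)

# Sample 10,000 lines from one source and tabulate word frequencies.
lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
sample_lines <- sample(lines, 10000)

corpus <- VCorpus(VectorSource(sample_lines))
tdm    <- TermDocumentMatrix(corpus)
freqs  <- sort(slam::row_sums(tdm), decreasing = TRUE)

# Word cloud of the 100 most frequent terms, then a bar chart of the top 20.
wordcloud(names(freqs), freqs, max.words = 100, random.order = FALSE)
barplot(head(freqs, 20), las = 2, main = "Most frequent words (blogs sample)")
```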

They all look very similar: the word frequencies differ, but not dramatically. So let’s clean up the data a bit, removing “and,” “the,” and similar “stop” words and stripping out punctuation, numbers, and extra whitespace, and see what we’ve got.
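
With the tm package, that cleaning step might look something like the sketch below (assuming `corpus` is the VCorpus built from each 10,000-line sample; not necessarily the exact code used here):

```r
# Lower-case everything, drop English stop words, punctuation, and numbers,
# and collapse repeated whitespace.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
```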

Well, now the samples from the input datasets look somewhat different from one another. Their top words are distinct, now that the clutter of the “stop words” has been removed. That could be useful if we knew more about our hypothetical user: for instance, if our user were writing a tweet, we might be better off training our algorithm on the Twitter data. But we don’t know that he’s writing a tweet. So, for the purposes of this project, it may be best to simply combine all of the sources into one big training set, or else to train separate models on each dataset and have them “vote” on what’s coming next.
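
Purely as a hypothetical sketch of that voting idea (assuming each of the three models were wrapped as a function returning its single best suggestion for a phrase):

```r
# Each model casts a vote; the most common suggestion wins.
vote_next_word <- function(phrase, models) {
  votes <- vapply(models, function(m) m(phrase), character(1))
  names(sort(table(votes), decreasing = TRUE))[1]
}
```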

We’ll finish up this exploration with a chart showing how much of the blogs sample (the sample taken from “en_US.blogs.txt”) is accounted for by each word, arranged from the most frequent word in the sample to the least.

##      word freq percentOfDataset cumulativePercentOfDataset
## 1     one 1365         0.645106                   0.645106
## 2    will 1211         0.572325                   1.217431
## 3    just 1142         0.539715                   1.757146
## 4    like 1128         0.533099                   2.290245
## 5     can 1098         0.518921                   2.809166
## 6    time  950         0.448975                   3.258141
## 7      ’s  807         0.381393                   3.639534
## 8     get  752         0.355399                   3.994933
## 9  people  665         0.314283                   4.309216
## 10    new  638         0.301522                   4.610738
## 11   also  617         0.291598                   4.902336
## 12    now  616         0.291125                   5.193461
## 13   know  613         0.289707                   5.483168
## 14  first  606         0.286399                   5.769567
## 15   even  587         0.277419                   6.046986
## 16    day  573         0.270803                   6.317789
## 17   make  573         0.270803                   6.588592
## 18 really  561         0.265132                   6.853724
## 19    see  554         0.261823                   7.115547
## 20   much  551         0.260406                   7.375953
## 21   back  549         0.259460                   7.635413
## 22   love  538         0.254262                   7.889675
## 23 little  514         0.242919                   8.132594
## 24   good  488         0.230631                   8.363225
## 25      –  479         0.226378                   8.589603

It takes 940 words before we have covered half of the dataset.

##            word freq percentOfDataset cumulativePercentOfDataset
## 930         six   39             0.02                      49.80
## 931        snow   39             0.02                      49.82
## 932    standing   39             0.02                      49.84
## 933     mention   39             0.02                      49.86
## 934        race   38             0.02                      49.88
## 935      colors   38             0.02                      49.90
## 936      picked   38             0.02                      49.92
## 937      expect   38             0.02                      49.94
## 938     holiday   38             0.02                      49.96
## 939    inspired   38             0.02                      49.98
## 940      period   38             0.02                      50.00
## 941     culture   38             0.02                      50.02
## 942     october   38             0.02                      50.04
## 943     enjoyed   38             0.02                      50.06
## 944      spread   38             0.02                      50.08
## 945     haven’t   38             0.02                      50.10
## 946 ingredients   38             0.02                      50.12
## 947        mark   38             0.02                      50.14
## 948     details   38             0.02                      50.16
## 949       daily   38             0.02                      50.18
## 950       fight   38             0.02                      50.20
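
For reference, a coverage table like the ones above could be built along these lines (a sketch, assuming `freqs` is the sorted frequency vector rebuilt from the term-document matrix of the cleaned blogs sample):

```r
# Per-word share of the sample, plus a running cumulative total.
coverage <- data.frame(word = names(freqs),
                       freq = as.integer(freqs),
                       row.names = NULL)
coverage$percentOfDataset <- 100 * coverage$freq / sum(coverage$freq)
coverage$cumulativePercentOfDataset <- cumsum(coverage$percentOfDataset)

# The first word at which cumulative coverage reaches 50% of the sample.
which(coverage$cumulativePercentOfDataset >= 50)[1]
```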

But we now have some tools to make things a bit more efficient. We can use only those words which cumulatively account for some percentage of the dataset (say, 60% or 75% to start with) and so reduce the size of our eventual training data. And we have a plan to combine the datasets into one training set, with a backup plan, if that doesn’t work out, to train three models and have them vote on an answer.
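
That trimming step could be as simple as the following (reusing the hypothetical `coverage` table from the sketch above, with 75% as the cutoff):

```r
# Keep only the words that together cover the first 75% of the sample.
dictionary <- coverage$word[coverage$cumulativePercentOfDataset <= 75]
length(dictionary)
```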

As we imagine it now, the Shiny app just has an input box into which the user can type a few words. Suggestions will appear in a menu to the right of the box, and by clicking on one of the suggestions, the user will be able to “paste” that suggestion into his current input location.
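
A first, hypothetical sketch of that interface might look like the following. Everything here is a placeholder, in particular `predict_words()`, which stands in for the eventual prediction model, and only the first suggestion button is wired up:

```r
library(shiny)

# Stand-in for the real model: returns a few canned suggestions.
predict_words <- function(phrase) c("the", "and", "to")

ui <- fluidPage(
  fluidRow(
    column(8, textInput("phrase", "Type a few words:", width = "100%")),
    column(4, uiOutput("suggestions"))
  )
)

server <- function(input, output, session) {
  # Render one button per suggestion for the current input.
  output$suggestions <- renderUI({
    words <- predict_words(input$phrase)
    lapply(seq_along(words),
           function(i) actionButton(paste0("sug", i), words[i]))
  })

  # Clicking the first suggestion "pastes" it onto the end of the input.
  observeEvent(input$sug1, {
    updateTextInput(session, "phrase",
                    value = paste(input$phrase, predict_words(input$phrase)[1]))
  })
}

shinyApp(ui, server)
```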