The objective of this project is to build a web page using “Shiny,” which will predict the next word that a user is going to type, given a few words that he has already typed into the page. In order to do this, we have been given three pieces of sample text on which to base our predictions: collections of text from blogs, from news sources, and from Twitter. In this Milestone Report, we look at the raw data from those three sources to get a feel for the data.
To start with, here is a bit of information about the three data sources (“firstLine” and “lastLine” are each file’s position in the combined corpus):

| fileName | firstLine | lastLine | lineCount | wordCount | wordsPerLine |
|---|---|---|---|---|---|
| en_US.blogs.txt | 1 | 899288 | 899288 | 37334131 | 41.52 |
| en_US.news.txt | 899289 | 1909530 | 1010242 | 34372530 | 34.02 |
| en_US.twitter.txt | 1909531 | 4269678 | 2360148 | 30373583 | 12.87 |
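Counts like these can be gathered with straightforward line and word tallies. A minimal sketch of that bookkeeping (assuming the source files sit in the working directory):

```r
# Count lines and whitespace-separated words in one source file.
countFile <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(lineCount = length(lines),
    wordCount = sum(lengths(strsplit(lines, "\\s+"))))
}

countFile("en_US.blogs.txt")
```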
And now a “Word Cloud” for a sample from each source (10,000 lines each), followed by a graph of the most-used words and their frequencies in each sample.
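A rough sketch of how one such sample and cloud can be produced, using the wordcloud package (the file name and sample size are the ones described above):

```r
library(wordcloud)

set.seed(1234)  # make the 10,000-line sample reproducible
lines   <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
sampled <- sample(lines, 10000)

# Crude word frequencies, before any cleanup, just for the cloud.
words <- unlist(strsplit(tolower(sampled), "\\s+"))
freqs <- sort(table(words), decreasing = TRUE)
wordcloud(names(freqs), as.numeric(freqs), max.words = 100)
```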
They all look very similar: the word frequencies differ, but not in a dramatic way. But let’s clean up the data a bit, removing “and,” “the,” and similar “stop” words, along with punctuation, numbers, and extra whitespace, and see what we’ve got:
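A minimal version of that cleanup, using the tm package on the sample drawn above (`sampled` is the 10,000-line blog sample):

```r
library(tm)

corpus <- VCorpus(VectorSource(sampled))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))  # drop "and", "the", etc.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
```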
Well, now the samples from the input datasets look somewhat different from one another. Their top words are distinct, now that the clutter of the “stop words” has been removed. That could be useful if we knew more about our hypothetical user: for instance, if the user were writing a tweet, we might be better off training our algorithm on the Twitter data. But we don’t know that. So it may be best, for the purposes of this project, to simply combine all of the sources into one big training set, or else to train separate models on each dataset and have them “vote” on what’s coming next.
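If we do go the voting route, the scheme could be as simple as the sketch below, where the three per-source models are dummy stand-ins for whatever we eventually train:

```r
# Hypothetical per-source predictors; each returns its single best guess.
blogModel    <- function(text) "day"
newsModel    <- function(text) "week"
twitterModel <- function(text) "day"

# Majority vote across the three sources (ties go to the first winner).
voteNextWord <- function(text) {
  votes <- c(blogModel(text), newsModel(text), twitterModel(text))
  names(sort(table(votes), decreasing = TRUE))[1]
}

voteNextWord("see you next")  # "day", since two of the three models agree
```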
We’ll finish up this exploration with a table showing how much of the blog sample (the 10,000 lines taken from “en_US.blogs.txt”) is accounted for by each word, ordered from the most frequent word in the sample to the least.
## word freq percentOfDataset cumulativePercentOfDataset
## 1 one 1365 0.645106 0.645106
## 2 will 1211 0.572325 1.217431
## 3 just 1142 0.539715 1.757146
## 4 like 1128 0.533099 2.290245
## 5 can 1098 0.518921 2.809166
## 6 time 950 0.448975 3.258141
## 7 ’s 807 0.381393 3.639534
## 8 get 752 0.355399 3.994933
## 9 people 665 0.314283 4.309216
## 10 new 638 0.301522 4.610738
## 11 also 617 0.291598 4.902336
## 12 now 616 0.291125 5.193461
## 13 know 613 0.289707 5.483168
## 14 first 606 0.286399 5.769567
## 15 even 587 0.277419 6.046986
## 16 day 573 0.270803 6.317789
## 17 make 573 0.270803 6.588592
## 18 really 561 0.265132 6.853724
## 19 see 554 0.261823 7.115547
## 20 much 551 0.260406 7.375953
## 21 back 549 0.259460 7.635413
## 22 love 538 0.254262 7.889675
## 23 little 514 0.242919 8.132594
## 24 good 488 0.230631 8.363225
## 25 – 479 0.226378 8.589603
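A sketch of how a table like the one above can be computed, assuming `corpus` is the cleaned blog sample from the tm step earlier:

```r
library(tm)

dtm   <- DocumentTermMatrix(corpus)
freqs <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

wordFreq <- data.frame(word = names(freqs), freq = as.numeric(freqs),
                       stringsAsFactors = FALSE)
wordFreq$percentOfDataset <- 100 * wordFreq$freq / sum(wordFreq$freq)
wordFreq$cumulativePercentOfDataset <- cumsum(wordFreq$percentOfDataset)

head(wordFreq, 25)
# First row at which cumulative coverage reaches half the sample:
which(wordFreq$cumulativePercentOfDataset >= 50)[1]
```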
It takes 940 words before we have covered half of the dataset.
## word freq percentOfDataset cumulativePercentOfDataset
## 930 six 39 0.02 49.80
## 931 snow 39 0.02 49.82
## 932 standing 39 0.02 49.84
## 933 mention 39 0.02 49.86
## 934 race 38 0.02 49.88
## 935 colors 38 0.02 49.90
## 936 picked 38 0.02 49.92
## 937 expect 38 0.02 49.94
## 938 holiday 38 0.02 49.96
## 939 inspired 38 0.02 49.98
## 940 period 38 0.02 50.00
## 941 culture 38 0.02 50.02
## 942 october 38 0.02 50.04
## 943 enjoyed 38 0.02 50.06
## 944 spread 38 0.02 50.08
## 945 haven’t 38 0.02 50.10
## 946 ingredients 38 0.02 50.12
## 947 mark 38 0.02 50.14
## 948 details 38 0.02 50.16
## 949 daily 38 0.02 50.18
## 950 fight 38 0.02 50.20
But we now have some tools to make things a bit more efficient. We can keep only those words which cumulatively account for some percentage of the dataset (say, 60% or 75% to start with), reducing the size of our eventual training database. And we have a plan to combine the datasets into one training set, with a backup plan, if that doesn’t work out, to train three separate models and have them vote on an answer.
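Trimming the vocabulary to a cumulative-coverage cutoff is then a one-liner on the frequency table built above:

```r
# Keep only the words that together cover the first 75% of the sample.
vocab <- wordFreq$word[wordFreq$cumulativePercentOfDataset <= 75]
length(vocab)  # size of the reduced training vocabulary
```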
The Shiny app we imagine now just has an input box into which the user can type a few words. Suggestions will appear in a menu to the right of the box, and by clicking on one of the suggestions, the user will be able to “paste” that suggestion into his current input location.
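A minimal sketch of that interface, with a dummy `predictNextWords()` standing in for the eventual model (every name here is a placeholder, not a final design):

```r
library(shiny)

# Dummy predictor; the trained model will eventually go here.
predictNextWords <- function(text) {
  if (!nzchar(trimws(text))) return(character(0))
  c("the", "to", "and")  # fixed suggestions, for illustration only
}

ui <- fluidPage(
  textInput("userText", "Type a few words:", width = "400px"),
  uiOutput("suggestions")
)

server <- function(input, output, session) {
  # Show one button per suggested word.
  output$suggestions <- renderUI({
    words <- predictNextWords(input$userText)
    lapply(seq_along(words), function(i) {
      actionButton(paste0("sugg_", i), words[i])
    })
  })

  # Clicking a suggestion "pastes" it onto the end of the input box.
  lapply(1:3, function(i) {
    observeEvent(input[[paste0("sugg_", i)]], {
      words <- predictNextWords(input$userText)
      if (i <= length(words)) {
        updateTextInput(session, "userText",
                        value = paste(input$userText, words[i]))
      }
    })
  })
}

shinyApp(ui, server)
```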