This app makes predictions based on what you have typed, just as SwiftKey does. Type some words, and personal predictions tailored to you will appear. The final version of this app will be presented as a Shiny app. (This image comes from the official description of SwiftKey.)
Details are not explained here; instead, everything is listed with links.
The whole dataset can be downloaded here, and only ‘en_US’ files are used for training.
There are three text files in ‘en_US’: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. Each of them contains a very large number of sentences and words, as shown below:

| | en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt |
|---|---|---|---|
| sentences | 2072941 | 1867522 | 2588551 |
| words | 42840140 | 39918314 | 119478112 |
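Counts like these can be obtained with a single pass over each file. The following is a minimal Python sketch, not the code used for this report. It assumes one sentence per line and whitespace-separated words, and the helper name `count_sentences_and_words` is hypothetical.

```python
import io


def count_sentences_and_words(lines):
    """Count non-empty lines (treated as sentences) and whitespace-separated tokens."""
    n_sentences = 0
    n_words = 0
    for line in lines:
        tokens = line.split()
        if tokens:  # skip blank lines
            n_sentences += 1
            n_words += len(tokens)
    return n_sentences, n_words


# Tiny in-memory stand-in for a file such as en_US.blogs.txt (made-up text).
sample = io.StringIO("I love this day\n\nWe can make time\n")
print(count_sentences_and_words(sample))
```

For a real file, an open file handle (for example `open("en_US.blogs.txt", encoding="utf-8")`) can be passed in place of `sample`, since the function only iterates over lines.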
Since there are so many words, only a brief exploratory analysis is shown here. First, let us take a close look at the 20 most frequent words across the three data sets.
## feature frequency rank docfreq group
## 1 one 307902 1 3 all
## 2 said 305186 2 3 all
## 3 just 304843 3 3 all
## 4 get 301290 4 3 all
## 5 like 301118 5 3 all
## 6 go 266898 6 3 all
## 7 time 258628 7 3 all
## 8 can 248756 8 3 all
## 9 day 222912 9 3 all
## 10 year 214750 10 3 all
## 11 make 206712 11 3 all
## 12 love 203287 12 3 all
## 13 new 194531 13 3 all
## 14 good 185428 14 3 all
## 15 know 184011 15 3 all
## 16 now 180157 16 3 all
## 17 work 176685 17 3 all
## 18 peopl 163635 18 3 all
## 19 say 162207 19 3 all
## 20 want 160958 20 3 all
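A table of this shape can be sketched in a few lines of Python. This is an illustration only, not the code that produced the output above. It assumes that `frequency` is the total count across all files and that `docfreq` is the number of files containing the word; the function name `frequency_table` and the toy documents are hypothetical. Tied counts receive consecutive ranks here, which may differ from the tool used in the report.

```python
from collections import Counter


def frequency_table(docs, top_n=20):
    """Return (feature, frequency, rank, docfreq) rows for the top_n features.

    docs maps a document name to its list of tokens.
    """
    frequency = Counter()  # total occurrences across all documents
    docfreq = Counter()    # number of documents that contain the feature
    for tokens in docs.values():
        frequency.update(tokens)
        docfreq.update(set(tokens))
    rows = []
    for rank, (feature, count) in enumerate(frequency.most_common(top_n), start=1):
        rows.append((feature, count, rank, docfreq[feature]))
    return rows


# Made-up toy documents standing in for the blogs, news, and twitter sets.
docs = {
    "blogs": ["one", "day", "one", "time"],
    "news": ["one", "said", "time"],
    "twitter": ["one", "love", "day"],
}
for row in frequency_table(docs, top_n=3):
    print(row)
```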
In addition, 524600 words appear only once across the three data sets. Even after ignoring these words, most of the remaining words appear no more than 100 times, as the grey curve in the figure shows.
However, some words still appear more than 10000 times, as the salmon curve shows.
A base-10 logarithm is applied to the frequencies so that the curves can be compared on the same scale.
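The two steps described here, setting aside words that occur once and putting the remaining frequencies on a base-10 log scale, can be sketched as follows. This is a minimal Python illustration under the assumption that frequencies are grouped by the integer part of their logarithm; the function name `summarize_frequencies` and the toy tokens are hypothetical and are not taken from the report.

```python
import math
from collections import Counter


def summarize_frequencies(tokens):
    """Count words seen once and bin the other words by floor(log10(frequency))."""
    frequency = Counter(tokens)
    n_once = sum(1 for count in frequency.values() if count == 1)
    bins = Counter()
    for count in frequency.values():
        if count > 1:
            # Bin 0 holds counts 2-9, bin 1 holds 10-99, bin 2 holds 100-999, ...
            bins[math.floor(math.log10(count))] += 1
    return n_once, dict(bins)


# Made-up tokens: "a" occurs 150 times, "b" 12 times, "c" 3 times, "d" and "e" once.
tokens = ["a"] * 150 + ["b"] * 12 + ["c"] * 3 + ["d", "e"]
print(summarize_frequencies(tokens))
```

On a log scale, a word seen 100 times and a word seen 10000 times sit two units apart, which keeps both ends of the distribution visible in one plot.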
Finally, a word cloud of the 100 most frequent words is drawn. The more frequently a word appears, the larger it is drawn.
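The size rule can be made concrete with a small Python sketch. It assumes a simple linear mapping from frequency to font size, which may not be what the plotting tool behind the figure actually does; the function name `font_sizes` and the size limits are hypothetical.

```python
def font_sizes(frequencies, min_size=10.0, max_size=60.0):
    """Map each word's frequency linearly onto the range [min_size, max_size]."""
    low = min(frequencies.values())
    high = max(frequencies.values())
    span = high - low
    sizes = {}
    for word, count in frequencies.items():
        # If all counts are equal, give every word the largest size.
        scale = (count - low) / span if span else 1.0
        sizes[word] = min_size + scale * (max_size - min_size)
    return sizes


# Frequencies taken from the table of most frequent words earlier in this report.
top_words = {"one": 307902, "said": 305186, "want": 160958}
print(font_sizes(top_words))
```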