File Report for Data Science, Capstone

App Goal

This app predicts the next word based on what you have typed, much as SwiftKey does. Type a few words, and personalized predictions will appear. The final version of the app will be published as a Shiny app. (This image comes from the official description of SwiftKey.)

Algorithms

The algorithms are not explained in detail here; each one is only listed with a link.
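For orientation only, here is a minimal sketch of one common way to suggest a next word: count which word follows which in the training text, then offer the most frequent followers of the last word typed. The report does not name its algorithms, so this bigram approach, the function names `train_bigrams` and `predict_next`, and the toy sentences are all assumptions for illustration, not a description of the app.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for each word, the words that directly follow it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            followers[current][following] += 1
    return followers


def predict_next(followers, text, k=3):
    """Return up to k of the most frequent followers of the last word typed."""
    words = text.lower().split()
    if not words:
        return []
    candidates = followers.get(words[-1], Counter())
    return [word for word, _ in candidates.most_common(k)]


# Toy training data (hypothetical, not taken from the en_US files).
toy_sentences = [
    "i love new york",
    "i love you",
    "i love new shoes",
    "you love new york",
]
model = train_bigrams(toy_sentences)
suggestions = predict_next(model, "I love")
```

A real predictor would typically use longer contexts and fall back to shorter ones when a context has never been seen; the sketch keeps only the core counting idea.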

Data for training

The full dataset can be downloaded here; only the ‘en_US’ files are used for training.

There are three text files in ‘en_US’: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. Each contains a very large number of sentences and words, as shown below:

	en_US.blogs.txt	en_US.news.txt	en_US.twitter.txt
sentences	2072941	1867522	2588551
words	42840140	39918314	119478112
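Counts like those in the table can be gathered in a single pass over each file. The sketch below is only an approximation of that idea: it assumes one sentence per line and whitespace-separated words, which may differ from how the report produced its numbers, and `count_lines_and_words` is a hypothetical helper name.

```python
import io


def count_lines_and_words(stream):
    """Count lines and whitespace-separated words in a text stream."""
    lines = 0
    words = 0
    for line in stream:
        lines += 1
        words += len(line.split())
    return lines, words


# Small in-memory sample so the sketch runs without the real files.
sample = io.StringIO("The quick brown fox.\nIt jumps over the lazy dog.\n")
sample_counts = count_lines_and_words(sample)

# With the real data one would pass an open file instead, for example:
# with open("en_US.blogs.txt", encoding="utf-8") as handle:
#     print(count_lines_and_words(handle))
```

Reading line by line keeps memory use low, which matters for files of this size.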

Word frequency

Since there are so many words, only a brief exploratory analysis is shown here. First, let us take a close look at the 20 most frequent words across the three data sets.

##    feature frequency rank docfreq group
## 1      one    307902    1       3   all
## 2     said    305186    2       3   all
## 3     just    304843    3       3   all
## 4      get    301290    4       3   all
## 5     like    301118    5       3   all
## 6       go    266898    6       3   all
## 7     time    258628    7       3   all
## 8      can    248756    8       3   all
## 9      day    222912    9       3   all
## 10    year    214750   10       3   all
## 11    make    206712   11       3   all
## 12    love    203287   12       3   all
## 13     new    194531   13       3   all
## 14    good    185428   14       3   all
## 15    know    184011   15       3   all
## 16     now    180157   16       3   all
## 17    work    176685   17       3   all
## 18   peopl    163635   18       3   all
## 19     say    162207   19       3   all
## 20    want    160958   20       3   all
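The columns of this table (feature, frequency, rank, docfreq) can be reproduced in principle with simple counters. The sketch below assumes the tokens have already been lowercased, stemmed, and stripped of stop words, which entries such as "peopl" suggest but the report does not state. The function name `frequency_table` and the toy tokens are hypothetical.

```python
from collections import Counter


def frequency_table(documents, top_n=20):
    """Build (feature, frequency, rank, docfreq) rows.

    documents maps a document name to its list of tokens.
    frequency is the total count across all documents;
    docfreq is the number of documents containing the feature.
    """
    frequency = Counter()
    docfreq = Counter()
    for tokens in documents.values():
        frequency.update(tokens)
        docfreq.update(set(tokens))  # count each feature once per document
    rows = []
    for rank, (feature, count) in enumerate(frequency.most_common(top_n), start=1):
        rows.append((feature, count, rank, docfreq[feature]))
    return rows


# Toy tokens (hypothetical), one list per source file.
toy_documents = {
    "blogs": "one day one time".split(),
    "news": "said one day".split(),
    "twitter": "just one love".split(),
}
top_rows = frequency_table(toy_documents, top_n=2)
```

A docfreq of 3, as in every row of the table above, means the word occurs in all three files.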

In addition, 524600 words appear only once across the three data sets. Even after ignoring these words, most of the remaining words appear no more than 100 times, as the grey curve in the figure shows.

However, some words still appear more than 10000 times, as the salmon curve shows.

A base-10 logarithm is applied to the frequencies to make them easier to compare.
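The quantities described above (words seen once, words seen no more than 100 times, words seen more than 10000 times, and log-scaled frequencies) can be computed from a word-to-count mapping. This is a minimal sketch with made-up counts; `summarize_frequencies` is a hypothetical name, and the thresholds are simply the ones quoted in the text.

```python
import math


def summarize_frequencies(frequency, low=100, high=10000):
    """Summarize a word -> count mapping.

    Words that occur once are counted separately and then ignored;
    the remaining counts are bucketed and log10-transformed.
    """
    repeated = {word: count for word, count in frequency.items() if count > 1}
    return {
        "once": len(frequency) - len(repeated),
        "at_most_low": sum(1 for count in repeated.values() if count <= low),
        "above_high": sum(1 for count in repeated.values() if count > high),
        "log10": {word: math.log10(count) for word, count in repeated.items()},
    }


# Made-up counts for illustration only.
toy_frequency = {"one": 100000, "said": 1000, "few": 10, "rare": 1, "odd": 1}
summary = summarize_frequencies(toy_frequency)
```

On the log scale, counts of 10, 1000, and 100000 become 1, 3, and 5, so very common and fairly rare words fit on the same axis.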

Finally, a word cloud of the 100 most frequent words is shown.

The more frequent the word, the larger it appears.

Reports on three files in en_US

Jinxi Li

2018/9/16
