The purpose of this project is to create an algorithm capable of predicting the next word that a user might type. For example, if the user had typed “how are you”, the predicted word might be “today”. There is a wide range of applications for such an algorithm; one is to speed data entry on mobile devices, such as cell phones.

To accomplish this, a prediction algorithm is built using three sources of English text data:

  • Blog posts (en_US.blogs.txt)
  • News articles (en_US.news.txt)
  • Twitter messages (en_US.twitter.txt)

These data can be downloaded from the following URL:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The following section provides a brief analysis of these data.

Basic Data Analysis

The three files vary in size and line count, as seen here; the total size and line count are shown as well.

##               files  sizes   lines
## 1   en_US.blogs.txt 210 MB  899288
## 2    en_US.news.txt 206 MB 1010242
## 3 en_US.twitter.txt 167 MB 2360148
## 4             total 583 MB 4269678
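
For reference, figures like these can be computed along the following lines (a minimal sketch, not the exact code used here; it assumes the zip has been extracted so that the files sit in ./final/en_US/):

```r
# Sketch: compute file sizes and line counts for the three corpora.
# Assumes the files were extracted to ./final/en_US/.
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
paths <- file.path("final", "en_US", files)

sizes  <- file.size(paths)                         # bytes
counts <- sapply(paths, function(p)
  length(readLines(p, skipNul = TRUE, warn = FALSE)))

data.frame(files = c(files, "total"),
           sizes = paste(round(c(sizes, sum(sizes)) / 1024^2), "MB"),
           lines = c(counts, sum(counts)))
```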

The individual words range from 1 character to 365 characters. The following table and chart show how word lengths are distributed, giving the number of words at each letter count.

##    Letter Count Number of Words
## 1             2        16825924
## 2             3        20582473
## 3             4        19594998
## 4             5        12118415
## 5             6         8951324
## 6             7         7585415
## 7             8         4825078
## 8             9         3158304
## 9            10         1884810
## 10           11          971720
## 11           12          496435
## 12           13          260724
## 13           14          103211
## 14           15           45007
## 15           16           20933
## 16           17           12649
## 17           18            9220
## 18           19            6320
## 19           20            4948
## 20           21            3606
## 21           22            2833
## 22           23            2232
## 23           24            1593
## 24           25            1182

[Plot: distribution of word lengths]
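
The length distribution can be tabulated in a few lines of base R (a sketch, assuming the corpus has been read into a character vector `lines`; splitting on anything other than a letter or apostrophe is my assumption about the tokenization):

```r
# Sketch: tabulate word lengths, assuming `lines` holds the raw text.
words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
words <- words[nchar(words) > 0]

len_table <- as.data.frame(table(nchar(words)))
names(len_table) <- c("Letter Count", "Number of Words")
```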

I also looked for common words that are not stop words. A stop word is a very common word that carries little meaning on its own; examples include “the”, “a”, and “of”.

##    Occurrences   Word
## 1        83821   come
## 2        84837   made
## 3        85487  being
## 4        86209   take
## 5        88678  those
## 6        89142   many
## 7        89523 before
## 8        90852   down
## 9        91101   life
## 10       93179  years
## 11       94591   very
## 12       97718   need
## 13       98697   here
## 14       98795 thanks
## 15      102866  still
## 16      104439  right
## 17      106306   work
## 18      107353   want
## 19      108433   even
## 20      110833  today
## 21      113026    way
## 22      114389 really
## 23      115805   well
## 24      120718   much
## 25      123363    two
## 26      124473  great
## 27      125599   last
## 28      126412  think
## 29      126936  going
## 30      127398   over
## 31      131333   make
## 32      139117    see
## 33      139686   year
## 34      140925  first
## 35      144875   back
## 36      158816 people
## 37      160854   love
## 38      162089  don't
## 39      163590   know
## 40      175105    day
## 41      180122   good
## 42      180307    now
## 43      194870    new
## 44      216559    i'm
## 45      224086   time
## 46      229880   it's
## 47      243702   more
## 48      298189    one
## 49      301122    out

[Plot: most frequent non-stop words]
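
The counts above can be reproduced along these lines (a sketch; the stop word list shown is a small illustrative sample, not the full list used for the table):

```r
# Sketch: word frequencies after removing stop words.
# The stop word list below is illustrative only.
stop_words <- c("the", "a", "of", "and", "to", "in", "i", "is",
                "it", "that", "for", "you", "on", "with", "was")

words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
words <- words[nchar(words) > 0 & !(words %in% stop_words)]

sort(table(words), decreasing = TRUE)[1:25]   # top 25 remaining words
```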

The plots above gave me an overall picture of the data; most of my analysis did not involve graphs. At this point I am interested in the predictive power of 3-grams, and the next section deals with my research in this area.

Prediction with 3-Grams

To accomplish the original prediction goal, n-grams can be used. An n-gram model looks at the last n words to predict the next word. Consider the following sentence, given by the class quiz.

You’re the reason why I smile everyday. Can you follow me please? It would mean the

Here we consider the last three words (a 3-gram) to predict the word that occurs after “the”.

  • Word 1: would
  • Word 2: mean
  • Word 3: the

Although “the” is a stop word, we still consider it: a word like “the” can be an important predictor of the next word.

I scanned the entire corpus for occurrences of the sequence “would”, “mean”, “the” and counted the word that followed. The top occurrences are shown here.

##   Occurrences   Word
## 1         201  world
## 2           4    end
## 3           3   loss
## 4           2 entire

This gives a clear indication of the next word: “world” follows far more often than any alternative.
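
The scan itself can be expressed as a vectorized lookup in base R (a minimal sketch, assuming the corpus has been tokenized into a single character vector `words`):

```r
# Sketch: find every "would mean the" and tally the following word.
n    <- length(words)
hits <- which(words[1:(n - 3)] == "would" &
              words[2:(n - 2)] == "mean"  &
              words[3:(n - 1)] == "the")

sort(table(words[hits + 3]), decreasing = TRUE)
```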

Next Steps

The n-grams alone were not sufficient to solve all of the quiz problems. I believe it will also be necessary to include some earlier words for context. I will also make use of backoff, falling back to a shorter n-gram when the full n-gram was not present in the training data. This will take some experimentation.
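
As a rough illustration, a simple backoff lookup might work as sketched below (this is my sketch, not a final design; `counts` is a hypothetical list mapping each context string, such as “mean the”, to a table of next-word frequencies):

```r
# Sketch of simple backoff. `counts` is assumed to be a list of named
# numeric vectors: counts[["would mean the"]] maps candidate next
# words to their observed frequencies, and likewise for shorter contexts.
predict_next <- function(prev_words, counts) {
  while (length(prev_words) > 0) {
    key  <- paste(prev_words, collapse = " ")
    cand <- counts[[key]]
    if (!is.null(cand)) {
      return(names(which.max(cand)))   # most frequent continuation
    }
    prev_words <- prev_words[-1]       # back off: drop the earliest word
  }
  NA_character_                        # no context matched at any order
}

# Example with the counts found above: the 4-word context is unseen,
# so the lookup backs off to "would mean the" and returns "world".
counts <- list("would mean the" = c(world = 201, end = 4, loss = 3))
predict_next(c("it", "would", "mean", "the"), counts)  # "world"
```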