The purpose of this project is to create an algorithm capable of predicting the next word that a user might type. For example, if the user had typed “how are you”, the predicted word might be “today”. There is a wide range of applications for such an algorithm; one is to speed data entry on mobile devices, such as cell phones.

To accomplish this, a prediction algorithm is built using three sources of English text data:

  • Blog posts (en_US.blogs.txt)
  • News articles (en_US.news.txt)
  • Twitter messages (en_US.twitter.txt)

These data can be downloaded from the following URL:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The following section provides a brief analysis of these data.

Basic Data Analysis

The three files vary in size and line count, as seen here; the total size and line count are shown as well.

##               files  sizes   lines
## 1   en_US.blogs.txt 210 MB  899288
## 2    en_US.news.txt 206 MB 1010242
## 3 en_US.twitter.txt 167 MB 2360148
## 4             total 583 MB 4269678
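
For reference, figures like these can be computed along the following lines (a minimal sketch, not the exact code used here; it assumes the zip has been extracted so that the files sit in ./final/en_US/):

```r
# Sketch: compute file sizes and line counts for the three corpora.
# Assumes the files were extracted to ./final/en_US/.
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
paths <- file.path("final", "en_US", files)

sizes  <- file.size(paths)                         # bytes
counts <- sapply(paths, function(p)
  length(readLines(p, skipNul = TRUE, warn = FALSE)))

data.frame(files = c(files, "total"),
           sizes = paste(round(c(sizes, sum(sizes)) / 1024^2), "MB"),
           lines = c(counts, sum(counts)))
```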

The individual words range from 1 character to 365 characters. The following table and chart show how word lengths are distributed, giving the number of words at each letter count.

##    Letter Count Number of Words
## 1             2        16825924
## 2             3        20582473
## 3             4        19594998
## 4             5        12118415
## 5             6         8951324
## 6             7         7585415
## 7             8         4825078
## 8             9         3158304
## 9            10         1884810
## 10           11          971720
## 11           12          496435
## 12           13          260724
## 13           14          103211
## 14           15           45007
## 15           16           20933
## 16           17           12649
## 17           18            9220
## 18           19            6320
## 19           20            4948
## 20           21            3606
## 21           22            2833
## 22           23            2232
## 23           24            1593
## 24           25            1182

[Plot: distribution of word lengths]
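
The length distribution can be tabulated in a few lines of base R (a sketch, assuming the corpus has been read into a character vector `lines`; splitting on anything other than a letter or apostrophe is my assumption about the tokenization):

```r
# Sketch: tabulate word lengths, assuming `lines` holds the raw text.
words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
words <- words[nchar(words) > 0]

len_table <- as.data.frame(table(nchar(words)))
names(len_table) <- c("Letter Count", "Number of Words")
```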

I also looked for common words that are not stop words. A stop word is a very common word that carries little meaning on its own; examples include “the”, “a”, and “of”.

##    Occurrences   Word
## 1        83821   come
## 2        84837   made
## 3        85487  being
## 4        86209   take
## 5        88678  those
## 6        89142   many
## 7        89523 before
## 8        90852   down
## 9        91101   life
## 10       93179  years
## 11       94591   very
## 12       97718   need
## 13       98697   here
## 14       98795 thanks
## 15      102866  still
## 16      104439  right
## 17      106306   work
## 18      107353   want
## 19      108433   even
## 20      110833  today
## 21      113026    way
## 22      114389 really
## 23      115805   well
## 24      120718   much
## 25      123363    two
## 26      124473  great
## 27      125599   last
## 28      126412  think
## 29      126936  going
## 30      127398   over
## 31      131333   make
## 32      139117    see
## 33      139686   year
## 34      140925  first
## 35      144875   back
## 36      158816 people
## 37      160854   love
## 38      162089  don't
## 39      163590   know
## 40      175105    day
## 41      180122   good
## 42      180307    now
## 43      194870    new
## 44      216559    i'm
## 45      224086   time
## 46      229880   it's
## 47      243702   more
## 48      298189    one
## 49      301122    out

[Plot: most frequent non-stop words]
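
The counts above can be reproduced along these lines (a sketch; the stop word list shown is a small illustrative sample, not the full list used for the table):

```r
# Sketch: word frequencies after removing stop words.
# The stop word list below is illustrative only.
stop_words <- c("the", "a", "of", "and", "to", "in", "i", "is",
                "it", "that", "for", "you", "on", "with", "was")

words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
words <- words[nchar(words) > 0 & !(words %in% stop_words)]

sort(table(words), decreasing = TRUE)[1:25]   # top 25 remaining words
```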

The plots above gave me an overall picture of the data; most of my analysis did not involve graphs. At this point I am interested in the predictive power of 3-grams, and the next section deals with my research in this area.

Prediction with 3-Grams

To accomplish the original prediction goal, n-grams can be used. An n-gram model looks at the last n words to predict the next word. Consider the following sentence, given by the class quiz.

You’re the reason why I smile everyday. Can you follow me please? It would mean the

Here we consider the last three words (a 3-gram) to predict the word that occurs after “the”.

  • Word 1: would
  • Word 2: mean
  • Word 3: the

Although “the” is a stop word, we still consider it: a word like “the” can be an important predictor of the next word.

I scanned the entire corpus for occurrences of the sequence “would”, “mean”, “the” and counted the word that followed. The top occurrences are shown here.

##   Occurrences   Word
## 1         201  world
## 2           4    end
## 3           3   loss
## 4           2 entire

This gives a clear indication of the next word: “world” follows far more often than any alternative.
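
The scan itself can be expressed as a vectorized lookup in base R (a minimal sketch, assuming the corpus has been tokenized into a single character vector `words`):

```r
# Sketch: find every "would mean the" and tally the following word.
n    <- length(words)
hits <- which(words[1:(n - 3)] == "would" &
              words[2:(n - 2)] == "mean"  &
              words[3:(n - 1)] == "the")

sort(table(words[hits + 3]), decreasing = TRUE)
```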

Next Steps

The n-grams alone were not sufficient to solve all of the quiz problems. I believe it will also be necessary to include some earlier words for context. I will also make use of backoff, falling back to a shorter n-gram when the full n-gram was not present in the training data. This will take some experimentation.
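
As a rough illustration, a simple backoff lookup might work as sketched below (this is my sketch, not a final design; `counts` is a hypothetical list mapping each context string, such as “mean the”, to a table of next-word frequencies):

```r
# Sketch of simple backoff. `counts` is assumed to be a list of named
# numeric vectors: counts[["would mean the"]] maps candidate next
# words to their observed frequencies, and likewise for shorter contexts.
predict_next <- function(prev_words, counts) {
  while (length(prev_words) > 0) {
    key  <- paste(prev_words, collapse = " ")
    cand <- counts[[key]]
    if (!is.null(cand)) {
      return(names(which.max(cand)))   # most frequent continuation
    }
    prev_words <- prev_words[-1]       # back off: drop the earliest word
  }
  NA_character_                        # no context matched at any order
}

# Example with the counts found above: the 4-word context is unseen,
# so the lookup backs off to "would mean the" and returns "world".
counts <- list("would mean the" = c(world = 201, end = 4, loss = 3))
predict_next(c("it", "would", "mean", "the"), counts)  # "world"
```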