The purpose of this project is to create an algorithm capable of predicting the next word a user might type. For example, if the user had typed “how are you”, the predicted next word might be “today”. There is a wide range of applications for such an algorithm; one is to speed data entry on mobile devices, such as cell phones.
To accomplish this, a prediction algorithm is built using three sources of English text: blog posts (en_US.blogs.txt), news articles (en_US.news.txt), and Twitter messages (en_US.twitter.txt). These data can be downloaded from the following URL:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
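Reproducing the download takes only a few lines of R; here is a minimal sketch (the extracted final/en_US/ directory layout is an assumption based on the archive contents):

```r
# Sketch: download and unzip the SwiftKey corpus.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip")  # extracts final/en_US/ among other locales
```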
The following section provides a brief analysis of these data.
The three files vary in size and line count, as shown below; the totals appear in the last row.
##                 file   size    lines
## 1    en_US.blogs.txt 210 MB   899288
## 2     en_US.news.txt 206 MB  1010242
## 3  en_US.twitter.txt 167 MB  2360148
## 4              total 583 MB  4269678
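These figures can be reproduced in base R; a sketch, assuming the unzipped final/en_US/ layout:

```r
# Sketch: file sizes (MB) and line counts for the three English files.
files <- file.path("final", "en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
mb    <- round(file.size(files) / 1024^2)
nline <- sapply(files, function(f) length(readLines(f, skipNul = TRUE)))
data.frame(file = basename(files), size = paste(mb, "MB"), lines = nline)
```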
Individual words range from 1 to 365 characters in length. The table below shows the number of words at each letter count, giving a picture of how common longer words are relative to shorter ones.
##    Letter Count  Number of Words
## 1             2         16825924
## 2             3         20582473
## 3             4         19594998
## 4             5         12118415
## 5             6          8951324
## 6             7          7585415
## 7             8          4825078
## 8             9          3158304
## 9            10          1884810
## 10           11           971720
## 11           12           496435
## 12           13           260724
## 13           14           103211
## 14           15            45007
## 15           16            20933
## 16           17            12649
## 17           18             9220
## 18           19             6320
## 19           20             4948
## 20           21             3606
## 21           22             2833
## 22           23             2232
## 23           24             1593
## 24           25             1182
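The length distribution comes from splitting the text into words and counting characters. A rough sketch, where the simple regex tokenizer is my assumption (the report's actual tokenization may differ):

```r
# Sketch: tokenize every line and tabulate word lengths.
text  <- unlist(lapply(files, readLines, skipNul = TRUE))  # 'files' from above
words <- unlist(strsplit(tolower(text), "[^a-z']+"))       # crude tokenizer
words <- words[nchar(words) > 0]
table(nchar(words))  # number of words at each letter count
```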
I also looked for common words that are not stop words. A stop word is a very common word that conveys little meaning on its own; examples include “the”, “a”, and “of”. The most frequent non-stop words are shown below, with their occurrence counts, in ascending order.
##    Occurrences Word
## 1        83821 come
## 2        84837 made
## 3        85487 being
## 4        86209 take
## 5        88678 those
## 6        89142 many
## 7        89523 before
## 8        90852 down
## 9        91101 life
## 10       93179 years
## 11       94591 very
## 12       97718 need
## 13       98697 here
## 14       98795 thanks
## 15      102866 still
## 16      104439 right
## 17      106306 work
## 18      107353 want
## 19      108433 even
## 20      110833 today
## 21      113026 way
## 22      114389 really
## 23      115805 well
## 24      120718 much
## 25      123363 two
## 26      124473 great
## 27      125599 last
## 28      126412 think
## 29      126936 going
## 30      127398 over
## 31      131333 make
## 32      139117 see
## 33      139686 year
## 34      140925 first
## 35      144875 back
## 36      158816 people
## 37      160854 love
## 38      162089 don't
## 39      163590 know
## 40      175105 day
## 41      180122 good
## 42      180307 now
## 43      194870 new
## 44      216559 i'm
## 45      224086 time
## 46      229880 it's
## 47      243702 more
## 48      298189 one
## 49      301122 out
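A non-stop-word frequency table can be sketched as follows, reusing the words vector from the previous sketch. The English stop-word list from the tm package is an assumption; it is clearly not the exact list used above, since the table keeps contractions such as “i'm” and “don't”:

```r
# Sketch: frequency counts for words outside the stop-word list.
library(tm)  # provides stopwords("en")
freq <- sort(table(words))                      # ascending, as in the table above
freq <- freq[!names(freq) %in% stopwords("en")]
tail(data.frame(Occurrences = as.integer(freq), Word = names(freq)), 49)
```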
These summaries gave me an overall picture of the data; most of my analysis did not involve graphs. At this point I am interested in the predictive power of 3-grams, which the next section explores.
To achieve the prediction goal, n-grams can be used. An n-gram model looks at the last n words to predict the next one. Consider the following sentence, taken from the class quiz:
You’re the reason why I smile everyday. Can you follow me please? It would mean the
Here we consider the last three words, “would mean the” (a 3-gram), to predict the word that follows. Although “the” is a stop word, we still include it: a word like “the” can be an important predictor of the next word.
I scanned the entire corpus for occurrences of the phrase “would mean the” and counted the word that followed each one. The top occurrences are shown here.
## Occurrences Word
## 1 201 world
## 2 4 end
## 3 3 loss
## 4 2 entire
With 201 occurrences of “world” against a handful for any other follower, this gives a strong indication of the next word.
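The scan itself can be sketched with base-R regular expressions (lower-casing and the [a-z'] character class are simplifying assumptions):

```r
# Sketch: count the words that follow a given three-word phrase.
next_words <- function(lines, context = "would mean the") {
  lines <- tolower(lines)
  pat   <- paste0("\\b", context, " ([a-z']+)")
  hits  <- unlist(regmatches(lines, gregexpr(pat, lines)))
  followers <- sub(paste0("^.*", context, " "), "", hits)
  sort(table(followers), decreasing = TRUE)
}
head(next_words(text))  # 'text' from the earlier sketch
```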
The n-grams alone were not sufficient to solve all of the quiz problems. I believe it will also be necessary to include some earlier words for context, and to use back-off when an n-gram was not present in the training data. This will take some experimentation.
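As a starting point for that experimentation, here is a minimal back-off sketch. It assumes pre-built lookup tables: a list of three named lists, where tables[[n]] maps an n-word context string to a vector of follower words ranked by count:

```r
# Sketch of back-off: try the longest available context first, then shorten it.
predict_next <- function(context, tables, default = "the") {
  toks <- strsplit(tolower(context), "\\s+")[[1]]
  for (n in rev(seq_len(min(3, length(toks))))) {
    key <- paste(tail(toks, n), collapse = " ")
    hit <- tables[[n]][[key]]           # NULL if this context was never seen
    if (!is.null(hit)) return(hit[1])   # highest-ranked follower
  }
  default  # no context matched: fall back to a common word
}
# e.g. predict_next("how are you", tables) might return "today",
# given suitable tables built from the corpus.
```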