The goal of the Capstone is to create a predictive model that takes text from a user and predicts the next word. The aim is to improve the efficiency of the user's typing. A large amount of training data was provided from three sources: blogs, news, and Twitter. This data will be used to train the predictive model.
The first step is to load the data, clean it (remove characters and other parts of the data that are not useful for building the model), and explore it. Below is a series of tables and charts that show what the data looks like.
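As a rough illustration of the cleaning step, the sketch below lowercases the text and strips everything except letters, apostrophes, and spaces. The helper name `clean_text` and the exact set of characters kept are assumptions made for this example, not the cleaning rules actually used for the tables that follow.

```r
# Hypothetical helper: normalise raw text lines before tokenising.
# The choice of which characters to keep is an assumption for illustration.
clean_text <- function(lines) {
  lines <- tolower(lines)
  lines <- gsub("[^a-z' ]", " ", lines)  # replace anything but letters, apostrophes, spaces
  lines <- gsub("\\s+", " ", lines)      # collapse runs of whitespace
  trimws(lines)
}

clean_text(c("Hello, World!!  It's 2020.", "RT @user: Can't wait :)"))
# [1] "hello world it's"   "rt user can't wait"
```

Keeping apostrophes preserves contractions such as "can't"; dropping them instead would produce tokens like "dont" and "im".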
Table 1: Number of Characters in each of the 3 files
## Characters File
## 1 899288 USblog
## 2 77259 USnews
## 3 2360148 UStwitter
Plotting the number of words by how frequently they are used shows that most of the distinct words are used infrequently.
Table 2: Top 10 Words by Frequency per Source
## # A tibble: 10 x 6
## `word-blog` `freq-blog` `word-news` `freq-news` `word-twitter` `freq-twitter`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 the 0.0504 the 0.0589 the 0.0318
## 2 and 0.0295 to 0.0269 to 0.0268
## 3 to 0.0289 and 0.0265 i 0.0243
## 4 a 0.0244 a 0.0261 a 0.0207
## 5 of 0.0238 of 0.023 you 0.0185
## 6 i 0.0209 in 0.02 and 0.0148
## 7 in 0.0161 for 0.0105 for 0.0131
## 8 that 0.0125 that 0.0102 in 0.0128
## 9 is 0.0117 is 0.0085 of 0.0122
## 10 it 0.0109 on 0.008 is 0.0122
The top words are words such as "the", "and", and "to". These are referred to as stop words. To get a better look at the data, the stop words can be removed.
Table 3: Top 10 Words by Frequency with Stop Words Removed
## # A tibble: 10 x 6
## `word-blog` `freq-blog` `word-news` `freq-news` `word-twitter` `freq-twitter`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 time 0.0024 time 0.0016 love 0.0036
## 2 people 0.0016 people 0.0014 day 0.0031
## 3 day 0.0014 city 0.0011 rt 0.003
## 4 love 0.0012 percent 0.001 time 0.0025
## 5 life 0.0011 school 0.001 lol 0.0023
## 6 im 8e-04 game 9e-04 people 0.0018
## 7 dont 8e-04 home 9e-04 happy 0.0017
## 8 world 8e-04 million 9e-04 follow 0.0016
## 9 book 7e-04 county 9e-04 tonight 0.0015
## 10 home 7e-04 day 9e-04 night 0.0014
There is a big difference in the types of words used in the different sources. The starkest contrast is between news and Twitter.
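A frequency table with stop words removed could be computed along the following lines. This is a minimal base R sketch: the function name `word_freq`, the tiny stop word list, and the choice to divide by the total word count before stop word removal are all assumptions for illustration.

```r
# Hypothetical helper: relative word frequencies, optionally excluding stop words.
# Assumes `lines` has already been cleaned and lowercased.
word_freq <- function(lines, stop_words = character(0)) {
  words <- unlist(strsplit(lines, " +"))
  words <- words[nzchar(words)]
  total <- length(words)                    # denominator: all words (an assumption)
  words <- words[!words %in% stop_words]    # drop stop words
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(counts),
             freq = round(as.vector(counts) / total, 4),
             stringsAsFactors = FALSE)
}

# Toy input and a toy stop word list, for illustration only.
toy_lines <- c("the cat and the dog", "the dog and the bird")
head(word_freq(toy_lines, stop_words = c("the", "and")), 10)
```

In practice a published stop word list would replace the two-word toy list used here.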
A word cloud gives another view of the frequency of the words used. Again here, stop words are removed.
Another way to view the structure of the data is to look at multi-word sequences such as bigrams (2 words) and trigrams (3 words). These will be used to build the predictive model. Tables 4 and 5 show the most frequently used bigrams and trigrams in the training data. Looking at the tables, you can begin to see how a predictive model can be made. The data will be broken into unigrams (1 word), bigrams (2 words), trigrams (3 words), and quadgrams (4 words). The model will then be trained on these sequences to learn the most frequently used next word.
Table 4: Top 10 Bi-grams by Frequency
## blog_bigram news_bigram twitter_bigram
## 1 of the of the in the
## 2 in the in the for the
## 3 to the to the of the
## 4 on the on the on the
## 5 to be for the to be
## 6 and the at the to the
## 7 for the and the thanks for
## 8 i was in a at the
## 9 and i to be i love
## 10 i have with the going to
Table 5: Top 10 Tri-grams by Frequency
## blog_trigram news_trigram twitter_trigram
## 1 <NA> <NA> <NA>
## 2 one of the one of the thanks for the
## 3 a lot of a lot of looking forward to
## 4 as well as as well as thank you for
## 5 to be a according to the i love you
## 6 it was a in the first for the follow
## 7 some of the going to be going to be
## 8 out of the part of the can't wait to
## 9 the end of the end of i want to
## 10 be able to out of the a lot of
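Bigram and trigram counts like those in Tables 4 and 5 can be produced by sliding a window of n words over each line. The sketch below uses only base R; `make_ngrams` and `count_ngrams` are hypothetical names, and the input is assumed to be cleaned, space-separated text.

```r
# Hypothetical helper: all n-grams of one cleaned line.
make_ngrams <- function(line, n) {
  words <- strsplit(line, " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))  # line too short for this order
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: n-gram counts over many lines, most frequent first.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  sort(table(grams), decreasing = TRUE)
}

make_ngrams("one of the best", 3)
# [1] "one of the"  "of the best"
head(count_ngrams(c("one of the best", "one of the few"), 2), 10)
```

On the full data set a dedicated tokenizing package would likely be faster; this version is only meant to show the idea.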
Lastly, any profanity in the data needs to be dealt with so that it is never suggested to the user. A check shows that 292 different profane words are used. These will need to be removed.
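One simple way to remove such words is to match them as whole words and delete them from the text before the n-grams are built. In the sketch below the function name `remove_banned` and the two-word placeholder list are assumptions; a real list would be loaded from a file, and the words are assumed to contain no regular expression metacharacters.

```r
# Hypothetical placeholder list standing in for a real profanity list.
banned <- c("darn", "heck")

# Hypothetical helper: delete whole-word matches of any banned word.
remove_banned <- function(lines, banned) {
  pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
  lines <- gsub(pattern, "", lines, ignore.case = TRUE, perl = TRUE)
  trimws(gsub("\\s+", " ", lines))  # tidy up the gaps left behind
}

remove_banned("well heck that was a darn shame", banned)
# [1] "well that was a shame"
```

The word boundaries (`\\b`) keep longer words that merely contain a banned word, such as "heckler", intact. An alternative design would leave the training text alone and filter banned words out of the suggestions instead.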
A Stupid Backoff model will be used to make the predictions. The mathematical details of how these models are built can be found at https://www.aclweb.org/anthology/D07-1090.pdf .
The basic concept is that the model will use N-grams such as the bi-grams and tri-grams discussed above to learn the most frequently used next word. The model will be built on quad-grams (4 words) to allow prediction on the previous 3 words entered by the user.
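To make the idea concrete, here is a minimal sketch of Stupid Backoff scoring in base R. A candidate word is scored by the count of the context followed by the word, divided by the count of the context. If that sequence was never seen, the context is shortened by one word and the score is multiplied by a fixed factor (0.4 in the paper linked above). The function names, the toy training lines, and the brute-force scoring of every vocabulary word are assumptions for illustration, not the final implementation.

```r
# Hypothetical helper: count every n-gram of order 1..max_n in cleaned lines.
build_counts <- function(lines, max_n = 4) {
  grams <- character(0)
  for (line in lines) {
    w <- strsplit(line, " +")[[1]]
    w <- w[nzchar(w)]
    for (n in seq_len(max_n)) {
      if (length(w) < n) break
      for (i in seq_len(length(w) - n + 1)) {
        grams <- c(grams, paste(w[i:(i + n - 1)], collapse = " "))
      }
    }
  }
  tab <- table(grams)
  stats::setNames(as.vector(tab), names(tab))  # plain named vector of counts
}

# Stupid Backoff score of `word` given a character vector `context`.
sb_score <- function(word, context, counts, alpha = 0.4) {
  lookup <- function(key) if (key %in% names(counts)) counts[[key]] else 0
  penalty <- 1
  repeat {
    if (length(context) == 0) {
      # Base case: unigram relative frequency.
      n_words <- sum(counts[!grepl(" ", names(counts))])
      return(penalty * lookup(word) / n_words)
    }
    ctx <- paste(context, collapse = " ")
    full <- lookup(paste(ctx, word))
    if (full > 0) return(penalty * full / lookup(ctx))
    context <- context[-1]     # back off: drop the oldest context word
    penalty <- penalty * alpha
  }
}

# Suggest the highest-scoring vocabulary word for the last (max_n - 1) words.
predict_next <- function(text, counts, max_n = 4) {
  w <- strsplit(tolower(text), " +")[[1]]
  w <- w[nzchar(w)]
  context <- utils::tail(w, max_n - 1)
  vocab <- names(counts)[!grepl(" ", names(counts))]
  scores <- vapply(vocab, sb_score, numeric(1),
                   context = context, counts = counts)
  names(scores)[which.max(scores)]
}

# Toy training data, for illustration only.
toy <- c("thanks for the follow", "thanks for the help", "thanks for the follow")
counts <- build_counts(toy)
predict_next("thanks for the", counts)
# [1] "follow"
```

Scoring the whole vocabulary for every prediction is slow on real data. A practical version would look up only the words that actually follow the context in the n-gram tables.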
The programming language R provides several convenient functions for building these models from the training data. The data provided will be split into training and test sets, allowing the model to be tested for accuracy before it is put into the app. The app will be built with Shiny. The user interface will be fairly simple, asking the user to input text. It will then suggest the next word from the output of the prediction model.
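A random split into training and test sets could look like the sketch below. The function name `split_lines`, the 80/20 ratio, and the seed are assumptions chosen for illustration.

```r
# Hypothetical helper: randomly assign lines to a training and a test set.
split_lines <- function(lines, train_frac = 0.8, seed = 42) {
  set.seed(seed)  # fixed seed so the split is reproducible
  idx <- sample(seq_along(lines), size = floor(train_frac * length(lines)))
  in_train <- seq_along(lines) %in% idx
  list(train = lines[in_train], test = lines[!in_train])
}

parts <- split_lines(paste("line", 1:10))
length(parts$train)  # 8
length(parts$test)   # 2
```

Accuracy can then be estimated by hiding the last word of each test line and checking how often the model suggests it.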
Profanity will be removed and a few techniques will be tried to improve the accuracy. Memory usage and prediction time will also be taken into account, with the aim of minimizing both.