The goal of the Capstone is to create a predictive model that takes in text from a user and predicts the next word. The aim is to improve the efficiency of the user's typing. A large amount of training data was provided from three sources: blogs, news, and Twitter. This data will be used to train the predictive model.

Data Exploration

The first step is to load the data, clean it (remove characters and other parts of the data that are not useful for building the model), and explore it. Below is a series of tables and charts that show what the data looks like.
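The cleaning step can be sketched in base R as follows. The exact rules used for this report are not shown, so every rule here (lowercasing, keeping only letters, apostrophes, and spaces) is an assumption, and `clean_text` is a hypothetical helper name.

```r
# Minimal cleaning sketch; each rule below is an assumption, not the report's actual code.
clean_text <- function(lines) {
  lines <- tolower(lines)                 # fold everything to lower case
  lines <- gsub("[^a-z' ]", " ", lines)   # keep only letters, apostrophes, and spaces
  lines <- gsub("\\s+", " ", lines)       # collapse runs of whitespace
  trimws(lines)                           # strip leading and trailing spaces
}

# Two made-up lines standing in for raw input text.
sample_lines <- c("Hello, World!  It's 2020...", "RT @user: LOL #fun")
cleaned <- clean_text(sample_lines)
print(cleaned)
```

Replacing unwanted characters with a space, rather than deleting them, keeps neighbouring words from being fused together.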

Table 1: Number of Characters in each of the 3 files

##   Characters      File
## 1     899288    USblog
## 2      77259    USnews
## 3    2360148 UStwitter

Plotting the number of words against how frequently they are used shows that the largest group of words is the infrequently used ones.

Table 2: Top 10 Words by Frequency per Source

## # A tibble: 10 x 6
##    `word-blog` `freq-blog` `word-news` `freq-news` `word-twitter` `freq-twitter`
##    <chr>       <chr>       <chr>       <chr>       <chr>          <chr>         
##  1 the         0.0504      the         0.0589      the            0.0318        
##  2 and         0.0295      to          0.0269      to             0.0268        
##  3 to          0.0289      and         0.0265      i              0.0243        
##  4 a           0.0244      a           0.0261      a              0.0207        
##  5 of          0.0238      of          0.023       you            0.0185        
##  6 i           0.0209      in          0.02        and            0.0148        
##  7 in          0.0161      for         0.0105      for            0.0131        
##  8 that        0.0125      that        0.0102      in             0.0128        
##  9 is          0.0117      is          0.0085      of             0.0122        
## 10 it          0.0109      on          0.008       is             0.0122

The top words are function words such as "the", "and", and "to". These are referred to as stop words. To get a clearer view of what distinguishes each source, the stop words can be removed.
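As an illustration of how relative word frequencies and stop word removal might be computed, here is a small base R sketch. The toy corpus and the short `stop_words` vector are made up for the example; a real stop word list would be much longer.

```r
# Toy corpus standing in for one cleaned source file (hypothetical data).
lines <- c("the cat sat on the mat", "the dog and the cat")

# Tokenize on spaces and compute each word's share of all tokens.
words  <- unlist(strsplit(lines, " ", fixed = TRUE))
counts <- table(words)
freq   <- setNames(as.numeric(counts), names(counts)) / length(words)
freq   <- sort(freq, decreasing = TRUE)

# A tiny hand-written stop word list (assumption; not the list used in the report).
stop_words   <- c("the", "and", "on", "to", "a", "of")
content_freq <- freq[!names(freq) %in% stop_words]

print(round(freq, 4))
print(round(content_freq, 4))
```

The frequencies here are shares of all tokens, which is how the values in the tables appear to be scaled.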

Table 3: Top 10 Words by Frequency with Stop Words Removed

## # A tibble: 10 x 6
##    `word-blog` `freq-blog` `word-news` `freq-news` `word-twitter` `freq-twitter`
##    <chr>       <chr>       <chr>       <chr>       <chr>          <chr>         
##  1 time        0.0024      time        0.0016      love           0.0036        
##  2 people      0.0016      people      0.0014      day            0.0031        
##  3 day         0.0014      city        0.0011      rt             0.003         
##  4 love        0.0012      percent     0.001       time           0.0025        
##  5 life        0.0011      school      0.001       lol            0.0023        
##  6 im          8e-04       game        9e-04       people         0.0018        
##  7 dont        8e-04       home        9e-04       happy          0.0017        
##  8 world       8e-04       million     9e-04       follow         0.0016        
##  9 book        7e-04       county      9e-04       tonight        0.0015        
## 10 home        7e-04       day         9e-04       night          0.0014

The types of words used differ noticeably between the sources. The contrast is starkest between news (city, percent, school, county) and Twitter (rt, lol, follow).

A word cloud gives another view of the frequency of the words used. Here too, stop words are removed.

Another way to view the structure of the data is to look at multi-word sequences such as bigrams (2 words) and trigrams (3 words). These will be used to build the predictive model. Tables 4 and 5 show the most frequently used bigrams and trigrams in the training data. Looking at the tables, you can begin to see how a predictive model can be made. The data will be broken into unigrams (1 word), bigrams (2 words), trigrams (3 words), and quadgrams (4 words). The model will then be trained on these sequences to learn the most frequently used next word.
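Breaking text into n-grams can be sketched in base R as shown here. `make_ngrams` is a hypothetical helper written for this illustration, and the three input lines are made up; a text-mining package could do the same job.

```r
# Build all n-grams of order n from one line of cleaned, space-separated text.
make_ngrams <- function(line, n) {
  tokens <- strsplit(line, " ", fixed = TRUE)[[1]]
  if (length(tokens) < n) return(character(0))   # line too short for this order
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up lines standing in for cleaned training text.
lines <- c("thanks for the follow", "thanks for the love", "ok")

bigrams  <- unlist(lapply(lines, make_ngrams, n = 2))
trigrams <- unlist(lapply(lines, make_ngrams, n = 3))

# Count how often each n-gram occurs, most frequent first.
print(sort(table(bigrams), decreasing = TRUE))
print(sort(table(trigrams), decreasing = TRUE))
```

Lines with fewer than n words contribute no n-grams in this sketch, so they add no entries to the frequency tables.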

Table 4: Top 10 Bi-grams by Frequency

##    blog_bigram news_bigram twitter_bigram
## 1       of the      of the         in the
## 2       in the      in the        for the
## 3       to the      to the         of the
## 4       on the      on the         on the
## 5        to be     for the          to be
## 6      and the      at the         to the
## 7      for the     and the     thanks for
## 8        i was        in a         at the
## 9        and i       to be         i love
## 10      i have    with the       going to

Table 5: Top 10 Tri-grams by Frequency

##    blog_trigram     news_trigram    twitter_trigram
## 1          <NA>             <NA>               <NA>
## 2    one of the       one of the     thanks for the
## 3      a lot of         a lot of looking forward to
## 4    as well as       as well as      thank you for
## 5       to be a according to the         i love you
## 6      it was a     in the first     for the follow
## 7   some of the      going to be        going to be
## 8    out of the      part of the      can't wait to
## 9    the end of       the end of          i want to
## 10   be able to       out of the           a lot of

Lastly, any profanity in the data needs to be handled so that it is never suggested to the user. A check shows that 292 different profane words are used. These will need to be removed.
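One simple way to keep profanity out of the suggestions is to filter candidate words against a banned list, as sketched here. The `profanity` vector holds placeholder words only (the 292-word list is not reproduced here), and `filter_profanity` is a hypothetical helper name.

```r
# Placeholder banned list; stands in for the real profanity list.
profanity <- c("badword", "worseword")

# Drop any candidate prediction that appears in the banned list.
filter_profanity <- function(candidates, banned) {
  candidates[!tolower(candidates) %in% banned]
}

suggestions <- c("day", "badword", "night")
print(filter_profanity(suggestions, profanity))
```

An alternative is to remove the profane words from the training data before the n-grams are built, which also keeps them out of the stored model.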

Prediction Model

A Stupid Backoff model will be used as the prediction model. Details of how these models are built mathematically can be found here: https://www.aclweb.org/anthology/D07-1090.pdf
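For reference, the scoring rule from the linked paper can be written as below. Here f(.) is the count of an n-gram in the training data, N is the total number of tokens, and α is a fixed back-off factor (the paper uses 0.4). The scores are relative frequencies rather than normalized probabilities.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
  \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
  \alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad\qquad
S(w_i) = \frac{f(w_i)}{N}
```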

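The basic concept is that the model will use N-grams such as the bi-grams and tri-grams discussed above to learn the most frequently used next word. The model will be built on quad-grams (4 words) to allow prediction from the previous 3 words entered by the user. When those 3 words have not been seen in the training data, the model backs off to the previous 2 words, then to the previous word, and finally to the most frequent single words.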
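The back-off idea can be sketched in base R as follows. This is a toy illustration under stated assumptions, not the model built for the app: the `counts` vector is derived by hand from three made-up lines ("i love you" twice and "i love it" once), the function names are hypothetical, and the back-off factor of 0.4 follows the linked paper.

```r
# Hand-built n-gram counts from a toy corpus (hypothetical data).
counts <- c("i" = 3, "love" = 3, "you" = 2, "it" = 1,
            "i love" = 3, "love you" = 2, "love it" = 1,
            "i love you" = 2, "i love it" = 1)
total_tokens <- 9     # sum of the unigram counts
ALPHA <- 0.4          # back-off factor used in the linked paper

# Count of an n-gram, or 0 if it was never seen.
get_count <- function(ngram) {
  if (ngram %in% names(counts)) counts[[ngram]] else 0
}

# Stupid Backoff score of `word` following the words in `context`.
sb_score <- function(word, context) {
  if (length(context) == 0) return(get_count(word) / total_tokens)
  full <- paste(c(context, word), collapse = " ")
  if (get_count(full) > 0) {
    return(get_count(full) / get_count(paste(context, collapse = " ")))
  }
  ALPHA * sb_score(word, context[-1])   # drop the oldest word and back off
}

# Score every known word after the last few words typed; return the best one.
predict_next <- function(text, max_context = 3) {
  tokens  <- strsplit(tolower(text), " ", fixed = TRUE)[[1]]
  context <- tail(tokens, max_context)
  vocab   <- names(counts)[!grepl(" ", names(counts))]   # unigrams only
  scores  <- vapply(vocab, sb_score, numeric(1), context = context)
  names(scores)[which.max(scores)]
}

print(predict_next("i love"))
print(predict_next("we love"))
```

In the second call the word "we" was never seen, so the score for each candidate falls back to the bigram counts after "love" and is scaled by the back-off factor. Scoring the whole vocabulary is fine for a toy example. With real data it would likely be faster to look up only the words that were seen after the given context.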

The programming language R provides several convenient functions for building these models from the training data. The data provided will be split into training and test sets, so the model can be tested for accuracy before it is put into the app. The app will be built with Shiny. The user interface will be fairly simple, asking the user to input text. It will then suggest the next word from the output of the prediction model.
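A train/test split and an accuracy measure might look like the sketch below. The 80/20 ratio, the seed, and the stand-in data are all assumptions chosen for illustration, and `accuracy` is a hypothetical helper.

```r
set.seed(42)   # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the cleaned lines of text.
lines <- paste("line", 1:100)

# Hold out 20% of the lines for testing (the ratio is an assumption).
test_idx  <- sample(length(lines), size = round(0.2 * length(lines)))
test_set  <- lines[test_idx]
train_set <- lines[-test_idx]

# Accuracy: share of held-out cases where the predicted word equals the actual word.
accuracy <- function(predicted, actual) mean(predicted == actual)

print(length(train_set))
print(length(test_set))
print(accuracy(c("you", "the", "to"), c("you", "a", "to")))
```

Splitting by line keeps each held-out sentence fully unseen during training, so the accuracy estimate is not inflated by n-grams from the same sentence.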

Profanity will be removed and a few techniques will be tried to improve the accuracy. Memory usage and prediction time will also be taken into account, with an attempt to minimize both.
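Memory usage and lookup time can be checked with base R tools, as in this rough sketch. The `model` object is a made-up lookup table standing in for the trained model, so the numbers it produces say nothing about the real app.

```r
# Hypothetical lookup table standing in for the trained model.
model <- setNames(sample(1000, 5000, replace = TRUE), paste0("ngram_", 1:5000))

# Memory footprint of the model object.
model_size <- object.size(model)
print(format(model_size, units = "Kb"))

# Wall-clock time for 1,000 lookups; "elapsed" is reported in seconds.
timing <- system.time(for (i in 1:1000) model[["ngram_2500"]])
print(timing[["elapsed"]])
```

Pruning rare n-grams is one common way to trade a little accuracy for a smaller, faster model.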