Summary

This data product takes a word or a sentence as input and predicts the next word. The model is trained on 70% of the dataset and uses stupid backoff over n-grams of order 1 through 5. The application is deployed at https://chrishan.shinyapps.io/finalwordprediction/

Background

Using the user’s input text to predict their next word is best illustrated by the SwiftKey Keyboard. By using machine learning, SwiftKey provides a convenient way to reduce typing and improve the speed of communication. Our goal is to create a crude version of the SwiftKey Keyboard in which the user can input text and the algorithm returns a vector of candidate words along with their predicted probabilities.

This project was originally completed in February 2019 as part of the Data Science Specialization Capstone course offered on Coursera by Johns Hopkins University. This report walks through the process, from the initial data exploration to building the data product as a Shiny application.

Exploratory Analysis

I first loaded three documents, each from a different source: Twitter, blogs, and news. The blogs file contains around 900,000 sentences, the news file a little over a million, and the Twitter file nearly 2.4 million. To reduce computing time, I sampled 10% of each text file and consolidated the samples into a combined corpus. A summary of the combined corpus is shown below.
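The sampling step can be sketched as follows. This is a minimal Python illustration, not the original R code; the function name `sample_lines`, the seed, and the toy data standing in for the three files are all assumptions made for the example.

```python
import random

def sample_lines(lines, fraction=0.10, seed=2019):
    """Return a random subset containing about `fraction` of `lines`."""
    rng = random.Random(seed)        # fixed seed so the sample is reproducible
    k = int(len(lines) * fraction)   # number of lines to keep
    return rng.sample(lines, k)

# Toy stand-ins for the three source files (hypothetical data).
sources = {
    "twitter": [f"tweet {i}" for i in range(1000)],
    "blogs": [f"blog line {i}" for i in range(500)],
    "news": [f"news line {i}" for i in range(800)],
}

# Sample each source separately, then pool the samples into one corpus.
sampled = {name: sample_lines(lines) for name, lines in sources.items()}
combined = [line for lines in sampled.values() for line in lines]

print({name: len(lines) for name, lines in sampled.items()})
print(len(combined))
```

Sampling each source separately keeps the three sources in the same proportions as the full files, which a single sample over the pooled lines would only do approximately.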

## Corpus consisting of 3 documents:
## 
##     Text  Types  Tokens Sentences
##  twitter 139570 3674563    259151
##    blogs 132517 4292354    207572
##     news 131067 3994142    186933
## 
## Source: Concatenation by c.corpus()
## Created: Thu Nov 14 13:04:33 2019
## Notes:

Next, I created document-feature matrices for unigrams, bigrams, and trigrams. For the unigram model, I removed punctuation, symbols, Twitter hashtags, and common stopwords. The most frequent words and their counts are displayed below.

##  said  just   one  like   can   get  time   new  good   now 
## 30415 30212 28942 26939 24563 22785 21443 19613 17841 17560

For bigrams, I also removed punctuation, symbols, and Twitter hashtags, but kept the stopwords.

##   of_the   in_the   to_the  for_the   on_the    to_be   at_the  and_the 
##    43024    40751    21237    20121    19622    16317    14375    12491 
##     in_a with_the 
##    11662    10767

The same preprocessing was applied to the trigrams.

##     one_of_the       a_lot_of thanks_for_the        to_be_a    going_to_be 
##           3406           2894           2393           1881           1718 
##     the_end_of       it_was_a     out_of_the      i_want_to    some_of_the 
##           1536           1463           1444           1444           1381
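The frequency tables above come from counting n-grams after cleaning the text. The sketch below shows the idea in plain Python under simplifying assumptions: the tokenizer, the tiny `STOPWORDS` set, and the three example sentences are illustrative only and are not the preprocessing used in the original R analysis.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the report).
STOPWORDS = {"i", "a", "the", "of", "to", "in", "and", "you"}

def tokenize(text):
    """Lowercase, drop hashtags, and keep only alphabetic word tokens."""
    text = re.sub(r"#\w+", " ", text.lower())
    return re.findall(r"[a-z']+", text)

def ngram_counts(texts, n, remove_stopwords=False):
    """Count n-grams of order n, joining tokens with '_' as in the tables above."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        if remove_stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        for i in range(len(tokens) - n + 1):
            counts["_".join(tokens[i:i + n])] += 1
    return counts

texts = ["One of the best days!", "It was one of the #best", "Thanks for the follow"]
unigrams = ngram_counts(texts, 1, remove_stopwords=True)  # stopwords removed
bigrams = ngram_counts(texts, 2)                          # stopwords kept
trigrams = ngram_counts(texts, 3)                         # stopwords kept

print(unigrams.most_common(3))
print(bigrams["of_the"], trigrams["one_of_the"])
```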

Building an n-gram model

The first predictive model was a basic bigram predictor: the algorithm looks at the preceding word and chooses the highest-frequency bigram that begins with that word. For example, for the sentence “I have a car”, the model looks at all the bigrams that start with “I” and chooses the one with the highest frequency, which happens to be “I have”. Next, the model chooses the bigram starting with “have” that has the highest frequency in our corpus, and so on.

Hence, this model only looks at the word immediately before the one I am trying to predict. The results are not great. I ran the model on two sentences, “I have a beautiful car” and “who let the dogs out”. As seen below, predicting from only the previous word creates sentences that may be grammatically sound but are nonsensical in meaning. Therefore, we need a better way of predicting.

##      [,1]                    
## [1,] "i have a lot day and"  
## [2,] "who is me first and of"
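The greedy bigram approach described in this section can be sketched in a few lines of Python. This is an illustration under assumptions, not the original R implementation: the helper names and the five-sentence toy corpus are hypothetical.

```python
from collections import Counter, defaultdict

def build_bigram_table(sentences):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev][nxt] += 1
    return followers

def most_likely_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]

def greedy_chain(followers, start, length):
    """Repeatedly append the most frequent follower of the last word."""
    words = [start]
    while len(words) < length:
        nxt = most_likely_next(followers, words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)

# Hypothetical toy corpus.
corpus = ["i have a lot", "i have a car", "a lot of time", "a lot of fun", "i want a lot"]
table = build_bigram_table(corpus)
print(greedy_chain(table, "i", 5))  # i have a lot of
```

Each step only sees the last word, so the chain follows whatever pair is most frequent locally and has no memory of what the sentence was about.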

Methodology

The final algorithm uses a stupid backoff model. The model starts by looking for a 5-gram match, that is, the last four words of the input followed by a candidate next word, provided the input is long enough. If there is a match, the probability of the word is calculated from the 5-gram counts. If there is no match, the model moves on to 4-grams, then 3-grams, and so on. The model gives absolute priority to the highest-order match, so it does not check lower-order n-grams once a match is found. For example, if there is a 5-gram match, the model stops there regardless of how accurate that prediction is, and never checks 4-gram or lower-order matches. In most cases this does not cause issues, but it is something to be wary of.
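A minimal Python sketch of the first-match-wins backoff described above is given here. It is not the deployed R code: the function names, the toy corpus, and the relative-frequency score are assumptions. The usual formulation of stupid backoff also multiplies the score by a fixed factor (commonly 0.4) for each backoff step; that factor is omitted because all candidates returned here come from a single n-gram order, so it would not change their ranking.

```python
from collections import Counter

MAX_ORDER = 5  # highest n-gram order, as in the report

def train(sentences, max_order=MAX_ORDER):
    """Count every n-gram of order 1..max_order, keyed by a tuple of tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict(counts, text, top_k=3, max_order=MAX_ORDER):
    """Return up to top_k (word, score) pairs from the longest matching context."""
    tokens = text.lower().split()
    # Try the longest usable context first: 4 words for a 5-gram, then 3, ...
    for ctx_len in range(min(max_order - 1, len(tokens)), 0, -1):
        context = tuple(tokens[-ctx_len:])
        candidates = {
            gram[-1]: c / counts[context]  # relative frequency given the context
            for gram, c in counts.items()
            if len(gram) == ctx_len + 1 and gram[:-1] == context
        }
        if candidates:  # first match wins; lower orders are never consulted
            return sorted(candidates.items(), key=lambda kv: -kv[1])[:top_k]
    # No context matched: fall back to the most frequent single words.
    total = sum(c for g, c in counts.items() if len(g) == 1)
    unigrams = [(g[0], c / total) for g, c in counts.items() if len(g) == 1]
    return sorted(unigrams, key=lambda kv: -kv[1])[:top_k]

# Hypothetical toy corpus.
corpus = ["i want to go home", "i want to go out",
          "i want to go home now", "they want to eat"]
counts = train(corpus)
print(predict(counts, "i want to go"))  # 5-gram match on the full context
print(predict(counts, "we want to"))    # no 4-gram match, backs off to 3-grams
```

The scan over all counts on every query keeps the sketch short; a practical version would index n-grams by context so each lookup is a single dictionary access.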

Stopwords

Stopwords are words that are very common in a language, such as ‘I’, ‘a’, and ‘you’. Removing them can either improve or worsen the prediction, depending on the content of the input. For example, for an input such as ‘I’ll be on my’, keeping the stopwords gives a better result, since the input consists almost entirely of stopwords. In contrast, for a sentence such as ‘Flowers and plants are both very’, it may be better to exclude the stopwords so that the algorithm predicts from the words ‘flowers’ and ‘plants’ rather than ‘are both very’.
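The effect on the two example inputs can be made concrete with a short Python sketch. The `STOPWORDS` set and the `context_tokens` helper are hypothetical and exist only to show what context the model would see in each case.

```python
# Small illustrative stopword list (an assumption, not the list used in the report).
STOPWORDS = {"i", "i'll", "a", "and", "are", "be", "both", "my", "on", "the", "very", "you"}

def context_tokens(text, remove_stopwords, size=4):
    """Return the last `size` tokens the model would use as prediction context."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens[-size:]

print(context_tokens("I'll be on my", remove_stopwords=False))
# ["i'll", 'be', 'on', 'my']
print(context_tokens("I'll be on my", remove_stopwords=True))
# []
print(context_tokens("Flowers and plants are both very", remove_stopwords=True))
# ['flowers', 'plants']
```

With stopwords removed, the first input leaves no context at all, while the second keeps exactly the two content words.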

Performance of the Model

Using the provided benchmark, I measured how the model performs on a test set.

Result                     3-gram       4-gram       5-gram
Overall top-3 score        17.18%       17.57%       17.56%
Overall top-1 precision    12.77%       13.41%       13.45%
Overall top-3 precision    20.92%       21.09%       21.02%
Average runtime            18.20 msec   20.08 msec   23.84 msec
Total memory used          105.32 MB    106.51 MB    106.88 MB

The 5-gram model provides the best overall top-1 precision, predicting the next word correctly on the first try 13.45% of the time, although the 4-gram model is marginally ahead on the top-3 score and top-3 precision. The final deployed application uses the 5-gram model on the basis of its top-1 result.
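For readers unfamiliar with the precision figures, the sketch below shows the plain definition of top-k precision in Python. It is an assumption that the benchmark computes precision this way, and the benchmark's separate "top-3 score" is not reproduced here. The predictor and the test cases are hypothetical stand-ins.

```python
def top_k_precision(predict_fn, test_cases, k):
    """Fraction of cases whose true next word is among the first k predictions."""
    hits = 0
    for context, actual in test_cases:
        if actual in predict_fn(context)[:k]:
            hits += 1
    return hits / len(test_cases)

# Hypothetical stand-in predictor that always returns the same ranked guesses.
def dummy_predict(context):
    return ["the", "to", "a"]

# Hypothetical (context, true next word) pairs.
cases = [("one of", "the"), ("i want", "to"), ("have", "a"),
         ("thanks for", "your"), ("a lot", "of")]

print(top_k_precision(dummy_predict, cases, k=1))  # 0.2
print(top_k_precision(dummy_predict, cases, k=3))  # 0.6
```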

Shiny Application Interface

The Shiny application is available at the following address:

https://chrishan.shinyapps.io/finalwordprediction/

Conclusion

The algorithm I used, the stupid backoff model, achieves a top-3 precision of around 21.09% at its best. In other words, in a SwiftKey-like experience where the application returns three predictions, our model would include the correct word 21.09% of the time, or about 1 out of every 5. For this product to be deployed and used in practice, we would need a different algorithm that drastically improves performance. Possible candidates include smoothing-based models such as Kneser-Ney.

Last Thoughts

The main challenge in this project was the large size of the text files and the even larger corpus that results from tokenizing the dataset. Because of the memory limits of the Shiny application, I had to balance the accuracy of the model against real-life usability and speed. A model with 90% accuracy that takes two minutes to predict is arguably less usable than a model with 20% accuracy that takes two seconds. Nevertheless, this was a helpful introduction to the world of natural language processing and the various challenges associated with building a predictive model. In the future, I plan to revisit this project to try out different algorithms and improve the performance.