This data product takes in a word or a sentence and predicts the next word. The model is trained on 70% of the dataset and uses a stupid backoff model with n-grams ranging from 1 to 5. The application is deployed here: https://chrishan.shinyapps.io/finalwordprediction/
Background
Using the user’s input text to predict their next word is best illustrated by the SwiftKey Keyboard. By using machine learning, SwiftKey provides a convenient way to reduce typing and improve the speed of communication. Our goal is to create a crude version of the SwiftKey Keyboard in which the user inputs text and the algorithm returns a vector of words along with their respective predicted probabilities.
This project was originally completed in February 2019 as part of the Data Science Specialization Capstone course offered on Coursera by Johns Hopkins University. This report explains the process from the initial data exploration to the final data product, a Shiny application.
I first load three documents, each from a different source: Twitter, blogs, and news. The blogs file contains around 900,000 sentences, the news file a little over a million, and the Twitter file nearly 2.4 million. To reduce computing time, I sampled 10% of each text file and consolidated the samples into a combined corpus. The summary of the combined corpus can be seen below.
## Corpus consisting of 3 documents:
##
## Text Types Tokens Sentences
## twitter 139570 3674563 259151
## blogs 132517 4292354 207572
## news 131067 3994142 186933
##
## Source: Concatenation by c.corpus()
## Created: Thu Nov 14 13:04:33 2019
## Notes:
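For reference, here is a minimal sketch of the sampling and corpus-building step. It assumes the quanteda package and the three English-language text files; the file paths and the sample_lines() helper are illustrative rather than the original code, which used an older quanteda release (hence the c.corpus() note in the summary above).

```r
# Sketch: sample 10% of each source file and build one combined corpus.
# Paths and the sample_lines() helper are illustrative assumptions.
library(quanteda)

set.seed(2019)
sample_lines <- function(path, frac = 0.10) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, floor(length(lines) * frac))
}

tweets <- sample_lines("final/en_US/en_US.twitter.txt")
blogs  <- sample_lines("final/en_US/en_US.blogs.txt")
news   <- sample_lines("final/en_US/en_US.news.txt")

# One document per source, named so summary() reports them separately
combined <- corpus(c(twitter = paste(tweets, collapse = " "),
                     blogs   = paste(blogs,  collapse = " "),
                     news    = paste(news,   collapse = " ")))
summary(combined)
```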
Next, I created document-feature matrices for unigrams, bigrams, and trigrams. For the unigram model, I removed punctuation, symbols, Twitter hashtags, and common stopwords. The most frequent words are displayed below.
## said just one like can get time new good now
## 30415 30212 28942 26939 24563 22785 21443 19613 17841 17560
For bigrams, I also removed punctuation, symbols, and Twitter hashtags, but kept the stopwords.
## of_the in_the to_the for_the on_the to_be at_the and_the
## 43024 40751 21237 20121 19622 16317 14375 12491
## in_a with_the
## 11662 10767
The same goes for the trigrams.
## one_of_the a_lot_of thanks_for_the to_be_a going_to_be
## 3406 2894 2393 1881 1718
## the_end_of it_was_a out_of_the i_want_to some_of_the
## 1536 1463 1444 1444 1381
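These frequency tables can be reproduced along the following lines. This is a sketch using the current quanteda tokens/dfm API, so the exact calls differ from the original code; the combined object is the corpus built earlier.

```r
# Sketch: n-gram frequency tables with the current quanteda API.
library(quanteda)

toks <- tokens(combined, remove_punct = TRUE, remove_symbols = TRUE) |>
  tokens_remove("#*")   # drop Twitter hashtags (glob pattern)

# Unigrams: additionally drop common English stopwords
uni_dfm <- dfm(tokens_remove(toks, stopwords("en")))
topfeatures(uni_dfm, 10)

# Bigrams and trigrams: keep the stopwords
bi_dfm  <- dfm(tokens_ngrams(toks, n = 2))
tri_dfm <- dfm(tokens_ngrams(toks, n = 3))
topfeatures(bi_dfm, 10)
topfeatures(tri_dfm, 10)
```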
The first predictive model was a basic bigram predictor: the algorithm looks at the preceding word and chooses the most frequent bigram that begins with it. For example, given the sentence “I have a car”, the model looks at all the bigrams that start with “I” and chooses the one with the highest frequency, which happens to be “I have”. Next, the model chooses the most frequent bigram starting with “have”, and so on.
Hence, this model only looks at the word immediately before the one it is trying to predict. The results are not great. I predicted on two sentences, “I have a beautiful car” and “who let the d0gs out”. As seen below, predicting on only the previous word creates sentences that may be grammatically sound but nonsensical in meaning. Therefore, we need a better way of predicting.
## [,1]
## [1,] "i have a lot day and"
## [2,] "who is me first and of"
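Below is a sketch of the greedy bigram chain described above. The bigram_freq data frame, with columns w1, w2, and count derived from the bigram counts, is an assumption for illustration.

```r
# Sketch: greedily extend a sentence one word at a time using only the
# most frequent bigram that starts with the current word.
predict_chain <- function(first_word, n_words, bigram_freq) {
  current <- tolower(first_word)
  out <- current
  for (i in seq_len(n_words)) {
    candidates <- bigram_freq[bigram_freq$w1 == current, ]
    if (nrow(candidates) == 0) break
    current <- candidates$w2[which.max(candidates$count)]
    out <- c(out, current)
  }
  paste(out, collapse = " ")
}

predict_chain("i", 5, bigram_freq)    # e.g. "i have a lot day and"
```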
The final algorithm uses a stupid backoff model. The model first looks for a 5-gram match, provided the input is long enough. If there is a match, the probability of the word is calculated from the 5-gram counts. If there is no match, the model backs off to 4-grams, then 3-grams, and so on. This model gives absolute priority to the highest-order n-gram match, so it will not check lower-order n-grams once a match is found. For example, if there is a 5-gram match, the model stops there, regardless of the quality of the prediction, and never checks the 4-gram or lower orders. In most cases this does not cause issues, but it is something to be wary of.
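Below is a sketch of the scoring logic, following the standard stupid backoff formulation: an unnormalized score that divides the observed n-gram count by its context count and otherwise backs off with a fixed discount (typically 0.4). The ngram_counts data structure is an assumption for illustration, not the app's actual implementation.

```r
# Sketch of stupid backoff scoring (Brants et al., 2007):
#   S(w | context) = count(context, w) / count(context)  if observed,
#   otherwise lambda * S(w | shorter context), with lambda = 0.4.
# ngram_counts is assumed to be a list of named count vectors, one per
# n-gram order, with tokens joined by "_".
stupid_backoff <- function(context, word, ngram_counts, lambda = 0.4) {
  n <- length(context) + 1
  if (n >= 2) {
    num <- ngram_counts[[n]][paste(c(context, word), collapse = "_")]
    den <- ngram_counts[[n - 1]][paste(context, collapse = "_")]
    if (!is.na(num) && !is.na(den) && den > 0) return(unname(num / den))
    # No match at this order: drop the leftmost context word and discount
    return(lambda * stupid_backoff(context[-1], word, ngram_counts, lambda))
  }
  # Base case: relative unigram frequency
  cnt <- ngram_counts[[1]][word]
  if (is.na(cnt)) cnt <- 0
  unname(cnt / sum(ngram_counts[[1]]))
}
```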
Stopwords
Stopwords are words that are very common in a language, such as ‘I’, ‘a’, and ‘you’. Removing these words can either improve or worsen the prediction, depending on the input. For example, for an input such as ‘I’ll be on my’, the prediction with stopwords included gives a better result, since the input consists almost entirely of stopwords. In contrast, for an input such as ‘Flowers and plants are both very’, it may be better to exclude the stopwords so that the algorithm predicts from ‘flowers’ and ‘plants’ rather than ‘are both very’.
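As a small illustration of these two input-handling modes, consider a preprocessing helper that optionally strips stopwords before the n-gram lookup; the prep_input() helper here is hypothetical, and the deployed app's preprocessing may differ.

```r
# Sketch: clean the input and optionally drop stopwords before prediction.
prep_input <- function(text, drop_stopwords = FALSE) {
  words <- tolower(unlist(strsplit(text, "\\s+")))
  words <- gsub("[^a-z']", "", words)
  if (drop_stopwords) words <- setdiff(words, quanteda::stopwords("en"))
  tail(words, 4)   # keep at most the last four words for a 5-gram lookup
}

prep_input("I'll be on my")                           # stopwords kept
prep_input("Flowers and plants are both very", TRUE)  # "flowers" "plants"
```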
Using the benchmark provided here, I observed how the model performs on a test set.
| Metric | 3-gram | 4-gram | 5-gram |
|---|---|---|---|
| Overall top-3 score | 17.18% | 17.57% | 17.56% |
| Overall top-1 precision | 12.77% | 13.41% | 13.45% |
| Overall top-3 precision | 20.92% | 21.09% | 21.02% |
| Average runtime | 18.20 msec | 20.08 msec | 23.84 msec |
| Total memory used | 105.32 MB | 106.51 MB | 106.88 MB |
The 5-gram model provides the best overall top-1 precision, predicting the next word on the first try 13.45% of the time. The final deployed application uses the 5-gram model on the basis of this result.
The Shiny application consists of the following elements:
The algorithm I used, namely the stupid backoff model, achieves around 21.09% top-3 precision at its best. In other words, in a SwiftKey-like experience where the application returns three predictions, the model would predict correctly 21.09% of the time, or about 1 out of every 5. For this product to be truly useful, we would need a different algorithm that drastically improves performance; possible candidates involve smoothing methods such as Kneser-Ney.
Last Thoughts
The main challenge with this project was the large size of the text files and the even larger objects that result from tokenizing them. Because of the memory limits on the Shiny application, I had to balance the accuracy of the model against real-life usability and speed. A model with 90% accuracy that takes two minutes to predict is arguably worse, in terms of usability, than a model with 20% accuracy that takes two seconds. Nevertheless, this was a helpful introduction to the world of natural language processing and the various challenges associated with building a predictive model. In the future, I plan to revisit this project to try out different algorithms and improve the performance.