In order to choose a strategy for building a prediction model, we will first analyze our sample data. The model will be built from SwiftKey's en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt datasets. The files contain:
| File | Characters | Lines |
|---|---|---|
| en_US.blogs.txt | 206824505 | 899288 |
| en_US.news.txt | 203223159 | 1010242 |
| en_US.twitter.txt | 162096031 | 2360148 |
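These counts can be reproduced with base R. A minimal sketch follows, assuming the three files sit in a `data/` directory (the path is an assumption); note that summing `nchar()` over lines excludes line terminators, so totals may differ slightly from raw file sizes.

```r
# Count characters and lines per file; the data/ paths are assumptions.
files <- c("data/en_US.blogs.txt", "data/en_US.news.txt", "data/en_US.twitter.txt")

stats <- lapply(files, function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(file       = basename(path),
             characters = sum(nchar(lines)),
             lines      = length(lines))
})
do.call(rbind, stats)
```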
The files are then tokenized using the quanteda library. The 20 most common tokens across the files are all common stopwords. After excluding a selection of stopwords using a list from the tidytext package, the 20 most common tokens are recomputed on the filtered corpus.
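The following is a minimal sketch of that tokenization and frequency analysis, assuming the raw lines have been combined (and possibly sampled) into a single character vector `text`.

```r
library(quanteda)
library(tidytext)   # provides the stop_words lexicon used for filtering

# Tokenize, dropping punctuation and numbers, then build a document-feature matrix.
toks    <- tokens(tolower(text), remove_punct = TRUE, remove_numbers = TRUE)
dfm_all <- dfm(toks)

# 20 most common tokens, stopwords included.
topfeatures(dfm_all, 20)

# Exclude stopwords from the tidytext lexicon and recompute the top 20.
data("stop_words", package = "tidytext")
toks_nostop <- tokens_remove(toks, pattern = stop_words$word)
topfeatures(dfm(toks_nostop), 20)
```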
Dictionary coverage
Having loaded the dataset, we will next assess the dictionary size needed to achieve a desired coverage. Dictionary coverage measures how many distinct tokens are required to cover a given percentage of all tokens in the input text; a sketch of the computation follows the table below.
| Dictionary coverage | Distinct tokens required | Distinct tokens required (excl. stopwords) |
|---|---|---|
| 75% | 1418 | 7062 |
| 90% | 7661 | 25241 |
| 95% | 19506 | 59342 |
As demonstrated earlier, stopwords account for a large share of the tokens in the text, and removing them from the corpus requires more than twice as many distinct tokens to achieve the same dictionary coverage.
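A minimal sketch of how the coverage figures above can be computed from token frequencies, reusing `dfm_all` and `toks_nostop` from the earlier sketch:

```r
# Number of distinct tokens needed to reach a target share of all token occurrences.
tokens_for_coverage <- function(freqs, coverage) {
  freqs     <- sort(freqs, decreasing = TRUE)
  cum_share <- cumsum(freqs) / sum(freqs)
  which(cum_share >= coverage)[1]
}

freqs <- colSums(dfm_all)                  # token frequencies, stopwords included
sapply(c(0.75, 0.90, 0.95), tokens_for_coverage, freqs = freqs)

freqs_nostop <- colSums(dfm(toks_nostop))  # stopwords excluded
sapply(c(0.75, 0.90, 0.95), tokens_for_coverage, freqs = freqs_nostop)
```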
N-gram coverage
Repeating the dictionary coverage analysis for n-grams shows that the more tokens an n-gram contains, the larger the dictionary required to achieve the same level of coverage. We can also see that stemming the first token of a bigram (or the first two tokens of a trigram) increases our dictionary coverage. However, achieving even 50% trigram coverage requires over three million distinct n-grams.
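The same coverage function applies to n-grams. Below is a sketch for bigrams, reusing `toks` and `tokens_for_coverage` from the earlier sketches; stemming only the leading token is implemented here by splitting on the n-gram concatenator, which is one possible approach rather than a description of the original code (trigrams work analogously with `n = 3` and stemming the first two parts).

```r
library(quanteda)

# Bigram frequencies.
bigrams     <- tokens_ngrams(toks, n = 2, concatenator = "_")
bigram_freq <- colSums(dfm(bigrams))
tokens_for_coverage(bigram_freq, 0.50)

# Stem the first token of each bigram, then merge counts of bigrams that
# collapse onto the same stemmed key.
parts  <- strsplit(names(bigram_freq), "_", fixed = TRUE)
firsts <- char_wordstem(vapply(parts, `[`, character(1), 1), language = "english")
rests  <- vapply(parts, function(p) paste(p[-1], collapse = "_"), character(1))
stemmed_freq <- tapply(bigram_freq, paste(firsts, rests, sep = "_"), sum)
tokens_for_coverage(as.numeric(stemmed_freq), 0.50)
```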
Having analyzed the dataset, we will next attempt to build a model that predicts the next word given a sequence of text. Specifically, given a sentence, the model will suggest three candidate words to follow it.
The balance between performance and accuracy will need to be weighed carefully. We will not attempt to build the most accurate model possible, but rather the best-performing model within a set of performance and resource-consumption constraints. Hard limits for resource consumption are not defined, but memory consumption in particular will be examined carefully when building the model.
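As one possible way to keep an eye on memory while the model is being built, the in-memory size of the main objects can be inspected directly (the object names below are illustrative, not a fixed list):

```r
# Report the in-memory size of the main model objects; names are illustrative.
for (name in c("dfm_all", "bigram_freq", "stemmed_freq")) {
  if (exists(name)) {
    cat(name, ":", format(object.size(get(name)), units = "MB"), "\n")
  }
}
```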
Model validation
In order to validate the model, the data will be randomly divided into training, validation and test sets. The training set will be used to build the model, and the validation set to tune its performance. Final performance benchmarks will be run against the test set. The model will be tested by having it predict three possible next words given a sentence; the share of cases in which the actual next word appears among the three suggestions determines the model's performance.
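A sketch of the split and the top-3 evaluation, assuming `text` is the combined character vector of input lines from the earlier sketch; `predict_next3()` is a hypothetical stand-in for the eventual model interface, and the 60/20/20 split ratio is an assumption:

```r
set.seed(42)

# Randomly assign each line to train / validation / test (60/20/20 is assumed).
split     <- sample(c("train", "valid", "test"), length(text), replace = TRUE,
                    prob = c(0.6, 0.2, 0.2))
train_set <- text[split == "train"]
valid_set <- text[split == "valid"]
test_set  <- text[split == "test"]

# Top-3 accuracy: share of cases where the true next word is among the three
# suggestions. predict_next3(prefix) is a hypothetical function returning a
# character vector of three candidate words.
top3_accuracy <- function(sentences, predict_next3) {
  hits <- vapply(sentences, function(s) {
    words <- strsplit(trimws(s), "\\s+")[[1]]
    if (length(words) < 2) return(NA)
    prefix <- paste(head(words, -1), collapse = " ")
    tail(words, 1) %in% predict_next3(prefix)
  }, logical(1))
  mean(hits, na.rm = TRUE)
}
```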