Exploratory analysis

In order to choose a strategy for building a prediction model, we will first analyze our sample data. Our prediction model will be built from SwiftKey’s en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt datasets. The files contain:

| File              | Characters | Lines   |
|-------------------|-----------:|--------:|
| en_US.blogs.txt   | 206824505  | 899288  |
| en_US.news.txt    | 203223159  | 1010242 |
| en_US.twitter.txt | 162096031  | 2360148 |

The files are then tokenized using the quanteda library. The 20 most common tokens in the files are:

These are all common stopwords. After excluding a selection of stopwords using a list from the tidytext package, the 20 most common tokens are:
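A rough sketch of the tokenization step described above, assuming the sampled text has been read into a character vector named `lines` (sampling and cleaning details omitted):

```r
library(quanteda)
library(tidytext)  # provides the stop_words data frame

# Tokenize the sampled text: lower-case, drop punctuation and numbers
toks <- tokens(corpus(lines), remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

# 20 most common tokens, stopwords included
topfeatures(dfm(toks), 20)

# Exclude the tidytext stopword list and look at the top tokens again
toks_nostop <- tokens_remove(toks, pattern = stop_words$word)
topfeatures(dfm(toks_nostop), 20)
```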

Dictionary coverage

Having loaded the dataset, we will next assess the size of the dictionary needed to achieve a desired coverage. Dictionary coverage measures how many distinct tokens are needed to account for a given percentage of all token occurrences in the input text.

| Dictionary coverage | # of tokens required | # of tokens required (excl. stop words) |
|---------------------|---------------------:|----------------------------------------:|
| 75%                 | 1418                 | 7062                                    |
| 90%                 | 7661                 | 25241                                   |
| 95%                 | 19506                | 59342                                   |

As demonstrated earlier, stopwords account for a large share of the tokens in the text, and removing them from the corpus requires more than three times as many distinct tokens to achieve the same dictionary coverage.
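The coverage figures above follow from a simple cumulative-frequency calculation, sketched below; `toks` is the tokens object from the previous step, and the same calculation is repeated on the stopword-filtered tokens.

```r
# Token frequencies, most frequent first
freqs <- sort(colSums(dfm(toks)), decreasing = TRUE)

# Smallest dictionary whose cumulative frequency reaches the target
# share of all token occurrences
dictionary_size <- function(target, freqs) {
  which(cumsum(freqs) / sum(freqs) >= target)[1]
}

sapply(c(0.75, 0.90, 0.95), dictionary_size, freqs = freqs)
```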

N-gram coverage

Repeating the dictionary coverage analysis for n-grams shows that the more tokens an n-gram contains, the larger the dictionary required to achieve the same level of coverage. We can also see that stemming the first token of a bigram (or the first two tokens of a trigram) increases our dictionary coverage. However, achieving even 50% trigram coverage requires over three million distinct n-grams.
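The n-gram analysis can be sketched along the same lines, reusing `toks` and the `dictionary_size()` helper from above; the variant that stems only the leading tokens of each n-gram is not shown here.

```r
# Form bigrams and trigrams from the unigram tokens
bigrams  <- tokens_ngrams(toks, n = 2, concatenator = " ")
trigrams <- tokens_ngrams(toks, n = 3, concatenator = " ")

bigram_freqs  <- sort(colSums(dfm(bigrams)), decreasing = TRUE)
trigram_freqs <- sort(colSums(dfm(trigrams)), decreasing = TRUE)

# Number of distinct n-grams needed for 50% coverage
dictionary_size(0.50, bigram_freqs)
dictionary_size(0.50, trigram_freqs)
```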

Model implementation plan

Having analyzed the dataset, we will next build a model that predicts the next word given a sequence of text. Specifically, given a sentence, the model will suggest three candidate words to follow it.
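One possible shape for the prediction interface is sketched below purely as an illustration; the lookup structure (`ngram_table`, a named list keyed by the two preceding words) and the fallback vector of frequent words are assumptions, not the final design.

```r
library(quanteda)

# Return up to `n` candidate next words for a given sentence.
# ngram_table:  named list mapping "previous words" keys to named counts
# top_unigrams: overall most frequent words, used as a fallback
predict_next_words <- function(sentence, ngram_table, top_unigrams, n = 3) {
  words <- as.character(tokens(char_tolower(sentence), remove_punct = TRUE))
  key <- paste(tail(words, 2), collapse = " ")
  candidates <- ngram_table[[key]]
  if (is.null(candidates)) {
    return(head(top_unigrams, n))
  }
  head(names(sort(candidates, decreasing = TRUE)), n)
}

# Toy example of the assumed lookup structure
toy_table <- list("thanks for" = c(the = 120, your = 45, sharing = 30))
predict_next_words("Thanks for", toy_table, c("the", "to", "and"))
```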

The balance between performance and accuracy will need to be carefully weighed. We will not attempt to build the most accurate model possible, but rather the most accurate model that fits within a set of performance and resource-consumption constraints. Hard limits for resource consumption are not defined, but memory consumption in particular will be monitored carefully while building the model.

Model validation

In order to validate the model, the data will be randomly divided into training, validation and test datasets. The training dataset will be used to build the model, while the validation set is used to tune model performance. Final performance benchmarks are run against the test dataset. The model will be tested by having it predict three possible next words for a given sentence; the share of cases in which the actual next word appears among the three suggestions will determine the model's performance.
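The split and the evaluation metric could look roughly as follows; the 60/20/20 proportions are an assumption, as they are not fixed here, and `lines` again stands for the full corpus.

```r
set.seed(1234)

# Randomly assign each line to a partition (proportions are illustrative)
partition <- sample(c("training", "validation", "test"), length(lines),
                    replace = TRUE, prob = c(0.6, 0.2, 0.2))
training_lines   <- lines[partition == "training"]
validation_lines <- lines[partition == "validation"]
test_lines       <- lines[partition == "test"]

# Top-3 accuracy: share of test cases where the true next word appears
# among the model's three suggestions
top3_accuracy <- function(suggestions, actual_words) {
  mean(mapply(function(sugg, word) word %in% sugg, suggestions, actual_words))
}
```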