Word predictor

Dino Budimlija

March 2017

Data Science Specialization capstone project - text prediction application

Description

Word Predictor is a lightweight text prediction application for mobile phones. The user types text in English and the algorithm tries to predict the next word based on the previous input.

It shows up to three predicted words, the first on a red button and the other two on blue buttons. The user can accept a prediction by clicking its button, or simply keep typing in the text window.

Two additional buttons clear the input text or insert a random word from the dictionary.

User interface:

Background

The language model used for this application was developed from a large corpus of English text drawn from three types of documents: blogs, news and Twitter. These were combined, analyzed and then processed with natural language processing (NLP) techniques to:

clean the documents (replace abbreviations, contractions and symbols; remove punctuation, numbers and profanity)
sample the data and perform exploratory analysis
process the corpora and perform additional text mining and exploration
create n-grams of size 1 to 4 (tokenization)
calculate their respective frequencies
build a prediction model using 4-gram linear interpolation
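The tokenization and frequency steps above can be sketched as follows. This is a minimal Python illustration, not the project's actual code; the function names `tokenize` and `count_ngrams` are hypothetical, and the sketch ignores sentence boundaries and the cleaning rules listed above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(tokens, max_n=4):
    """Return a dict mapping n to a Counter of n-gram tuples, for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


tokens = tokenize("The cat sat on the mat. The cat ran!")
counts = count_ngrams(tokens)
print(counts[1][("the",)])        # frequency of the unigram "the"
print(counts[2][("the", "cat")])  # frequency of the bigram "the cat"
```

Each n-gram is stored as a tuple of words, so the same structure works for every order from 1 to 4.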

To reduce the overall table size, n-gram probabilities were stored as integers instead of floats, and data.table indexing on the lookup table was used to speed up data access and keep the app responsive.
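One way to picture the integer storage is the Python sketch below. The scaling factor `SCALE`, the helper names and the sample entries are all assumptions made for illustration; the text does not say how the original probabilities were converted, and a plain dictionary only stands in for a keyed data.table here.

```python
SCALE = 10_000  # hypothetical scaling factor, not taken from the project


def to_int_prob(p, scale=SCALE):
    """Store a probability in [0, 1] as a rounded integer."""
    return round(p * scale)


# Lookup table keyed by the preceding words; values are (next word, integer
# probability) pairs sorted from most to least likely. The entries are made up.
table = {
    ("thanks", "for", "the"): [
        ("follow", to_int_prob(0.1234)),
        ("help", to_int_prob(0.0821)),
    ],
}


def lookup(context):
    """Return the stored candidates for a context, or an empty list."""
    return table.get(tuple(context), [])


print(lookup(["thanks", "for", "the"]))
```

Only the ranking of candidates matters for prediction, so the precision lost by rounding is usually acceptable.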

Algorithm

The prediction model is a quadrigram (4-gram, i.e. four consecutive words) linear interpolation model. It relies on maximum likelihood estimates (MLE) of n-gram probabilities to assign a probability to every possible word sequence.

A major deficiency of pure MLE n-gram models is that they assign a probability of zero to any word sequence not seen in the training set. To overcome this, some kind of smoothing or discounting is usually applied to move part of the probability mass to word combinations (n-grams) that do not occur in the training set.

I used linear interpolation as the smoothing method, combining all four n-gram orders with different parameters (lambdas) as their respective weights. The interpolated estimate for a quadrigram can be expressed like this:

\[ \tiny q(w_{i}|w_{i-3}, w_{i-2}, w_{i-1}) = \lambda_{1} q_{ML}(w_{i}|w_{i-3}, w_{i-2}, w_{i-1}) + \lambda_{2}q_{ML}(w_{i}|w_{i-2}, w_{i-1}) + \lambda_{3}q_{ML}(w_{i}|w_{i-1}) + \lambda_{4}q_{ML}(w_{i}) \]

\[ \tiny \text{where } \lambda_{1} + \lambda_{2} + \lambda_{3} + \lambda_{4} = 1 \text{ and } \lambda_{i} \geq 0 \text{ for all } i \]

The lambdas were estimated on a held-out validation data set.
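The interpolated estimate can be sketched in Python as follows. This is an illustration of the formula on a toy corpus, not the project's implementation; the function names and the lambda values `(0.4, 0.3, 0.2, 0.1)` are hypothetical and were not tuned on any held-out data.

```python
from collections import Counter


def count_ngrams(tokens, max_n=4):
    """Return a dict mapping n to a Counter of n-gram tuples."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def q_ml(counts, total, word, context):
    """Maximum likelihood estimate of `word` given a context tuple."""
    if not context:
        return counts[1][(word,)] / total
    seen = counts[len(context)][context]
    if seen == 0:
        return 0.0  # unseen context: this order contributes nothing
    return counts[len(context) + 1][context + (word,)] / seen


def q_interp(counts, total, word, context, lambdas):
    """Weighted sum of the 4-gram, trigram, bigram and unigram estimates."""
    w3, w2, w1 = context[-3:]  # the three preceding words
    contexts = [(w3, w2, w1), (w2, w1), (w1,), ()]
    return sum(lam * q_ml(counts, total, word, c)
               for lam, c in zip(lambdas, contexts))


tokens = "the cat sat on the mat the cat ran".split()
counts = count_ngrams(tokens)
lambdas = (0.4, 0.3, 0.2, 0.1)  # hypothetical weights that sum to 1

context = ("mat", "the", "cat")
p_seen = q_interp(counts, len(tokens), "ran", context, lambdas)
p_unseen = q_interp(counts, len(tokens), "sat", context, lambdas)
print(p_seen, p_unseen)
```

The 4-gram "mat the cat sat" never occurs in the toy corpus, so its MLE is zero, yet `p_unseen` is still positive because the lower-order terms contribute. This is the smoothing effect described above.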

Final results

One of the main objectives of this text mining project was to produce a web application with a small memory footprint, suitable for mobile phones with limited memory. The other was to minimize the runtime of the algorithm so that the app feels responsive to the user.

To achieve a quick response, all n-grams with fewer than three occurrences were pruned from the model, reducing it from almost 50 million n-grams to around 6 million and just 34 MB in file size.
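The pruning rule can be sketched in a few lines of Python. The threshold of three comes from the text; the function name `prune` and the sample counts are made up for illustration.

```python
from collections import Counter

MIN_COUNT = 3  # n-grams seen fewer than three times are dropped


def prune(counts, min_count=MIN_COUNT):
    """Keep only the n-grams whose frequency is at least `min_count`."""
    return Counter({ng: c for ng, c in counts.items() if c >= min_count})


# Made-up bigram counts for illustration.
bigrams = Counter({("of", "the"): 120, ("the", "cat"): 3, ("cat", "yawned"): 2})
pruned = prune(bigrams)
print(len(bigrams), "->", len(pruned))
```

Rare n-grams make up most of the table but carry little predictive value, which is why a low threshold can shrink the model so sharply.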

An independent benchmark on the provided corpora shows an accuracy of 11.23% for the first predicted word and 19.15% for the top three words, with an average prediction runtime of only 22.02 ms. This gives an almost instantaneous response and an enjoyable user experience while using the Word Predictor app.
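For readers unfamiliar with the two accuracy figures, the Python sketch below shows how top-1 and top-3 accuracy are typically computed. It is not the benchmark used above; `toy_predict` and the test cases are hypothetical stand-ins.

```python
def top_k_accuracy(predict, test_cases, k):
    """Fraction of cases whose true next word is among the first k predictions."""
    hits = sum(1 for context, target in test_cases
               if target in predict(context)[:k])
    return hits / len(test_cases)


def toy_predict(context):
    """Stand-in predictor: a ranked list of candidates keyed by the last word."""
    table = {
        "thank": ["you", "god", "the"],
        "of": ["the", "a", "course"],
    }
    return table.get(context[-1], [])


test_cases = [
    (["thank"], "you"),
    (["thank"], "god"),
    (["of"], "course"),
    (["of"], "them"),
]
top1 = top_k_accuracy(toy_predict, test_cases, k=1)
top3 = top_k_accuracy(toy_predict, test_cases, k=3)
print(top1, top3)
```

Top-3 accuracy can never be lower than top-1 accuracy, since the first prediction is always among the first three.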