Next Word Prediction







Denise Wong, November 2018

Coursera JHU Data Science Capstone Project

Executive Summary

The goal of this Data Science Capstone project was to create a Shiny web application that predicts the next word in a sentence. The prototype uses text-mining techniques as the basis for prediction and was developed in R.

– A corpus of more than 3 million lines of tweets, news articles and blogs is used to calculate n-gram probabilities up to order 5.

– A backoff model is used to rank the next-word candidates.

– The predictor achieves 12.8% top-1 and 20.8% top-3 precision, responding in under 7 ms while using 62 MB of memory.

This prototype can be used on mobile devices to enhance the typing experience through a marked increase in efficiency, accuracy and speed. The approach is simple, scales readily to larger corpora and other languages, and has a fast response time.

The web app features a user-friendly interface that updates the next-word prediction each time the space bar is pressed. Its low memory usage keeps predictions fast and accurate, and the algorithm is easily adaptable to other languages such as German and Finnish.

Data Source

The algorithm uses a dictionary of n-gram probabilities to predict the most likely next word in a sentence.

The English dictionary was built from a corpus sample of more than 3 million lines of text from news, tweets and blogs. The sample data was cleaned by converting to lowercase and removing punctuation, links, profanity, extra white space, numbers and special characters.

| Type   | Lines     | Words      | Characters  |
|--------|-----------|------------|-------------|
| Blogs  | 899,289   | 37,334,690 | 171,926,076 |
| Tweets | 77,259    | 34,372,720 | 13,117,038  |
| News   | 2,360,148 | 30,374,206 | 134,371,036 |
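
A minimal sketch of the cleaning steps described above, in base R. The function name and regular expressions here are illustrative rather than the app's exact code, and profanity filtering against a word list is omitted:

```r
clean_text <- function(lines) {
  lines <- tolower(lines)                           # convert to lowercase
  lines <- gsub("http\\S+|www\\.\\S+", " ", lines)  # remove links
  lines <- gsub("[0-9]+", " ", lines)               # remove numbers
  lines <- gsub("[^a-z' ]", " ", lines)             # remove punctuation/specials
  gsub("\\s+", " ", trimws(lines))                  # collapse white space
}

clean_text("Check http://example.com NOW!! 123 :-)")
#> [1] "check now"
```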

An n-gram is a contiguous sequence of n items from a given text. The n-gram frequencies capture word and phrase usage. The n-gram frequency matrices are generated from the sample, and the most frequent terms are extracted to build a probability-based model.

The algorithm uses n-grams up to order 5; n-grams occurring fewer than 5 times were dropped from the dictionary. The dictionary is stored in an efficient data structure to save memory and enhance performance. More details on the n-gram language model can be found here.
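
As a rough illustration, frequency counts of this kind can be built in base R as below (the function name is illustrative; a dedicated text-mining package may have been used instead):

```r
count_ngrams <- function(lines, n) {
  tokens <- strsplit(lines, " ", fixed = TRUE)
  grams <- unlist(lapply(tokens, function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  freq <- sort(table(grams), decreasing = TRUE)
  freq[freq >= 5]    # drop n-grams that occur fewer than 5 times
}

# e.g. trigram counts over a cleaned corpus file:
# trigrams <- count_ngrams(clean_text(readLines("en_US.blogs.txt")), 3)
```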

Prediction Model

The model uses the n-gram frequencies together with the Stupid Backoff scoring scheme to rank the predicted next-word candidates.
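
For reference, Stupid Backoff (Brants et al., 2007) scores a candidate word recursively from relative counts, with a fixed back-off weight corresponding to the lambda of 0.4 used here:

$$
S(w_i \mid w_{i-k+1}^{\,i-1}) =
\begin{cases}
\dfrac{\mathrm{count}(w_{i-k+1}^{\,i})}{\mathrm{count}(w_{i-k+1}^{\,i-1})} & \text{if } \mathrm{count}(w_{i-k+1}^{\,i}) > 0,\\[1ex]
\lambda \, S(w_i \mid w_{i-k+2}^{\,i-1}) & \text{otherwise, with } \lambda = 0.4.
\end{cases}
$$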

The algorithm uses the last four words of the sentence to look up 5-gram matches. If fewer than 5 matches are found, the search backs off to 4-grams and then to lower-order n-grams until at least 5 candidates have been found.

If no matches are found, the top 5 most likely unigrams are returned according to their Maximum Likelihood Estimate.

Each time the model backs off to a lower-order n-gram, the candidate's score is discounted by a lambda of 0.4. The matches are ranked according to the sum of their discounted scores.
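
A minimal sketch of this backoff ranking in R, assuming a precomputed lookup table `dict` with columns `prefix` (the preceding words), `word` (the candidate) and `mle` (its count ratio), plus a named numeric vector `unigrams`; these names and the structure are illustrative, not the app's actual implementation:

```r
predict_next <- function(history, dict, unigrams, lambda = 0.4, top = 5) {
  words  <- strsplit(tolower(history), "\\s+")[[1]]
  words  <- tail(words[words != ""], 4)
  scores <- numeric(0)
  for (k in rev(seq_len(length(words)))) {     # 4-word prefix, then 3, 2, 1
    prefix  <- paste(tail(words, k), collapse = " ")
    hits    <- dict[dict$prefix == prefix, ]
    penalty <- lambda ^ (length(words) - k)    # 0.4 per backoff step
    for (j in seq_len(nrow(hits))) {
      w <- hits$word[j]
      scores[w] <- sum(scores[w], penalty * hits$mle[j], na.rm = TRUE)
    }
    if (length(scores) >= top) break           # enough candidates found
  }
  if (length(scores) == 0)                     # no match at any order:
    return(head(names(sort(unigrams, decreasing = TRUE)), top))
  head(names(sort(scores, decreasing = TRUE)), top)
}

# predict_next("thanks for the", dict, unigrams)
```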

Amongst the algorithms considered, the Stupid Backoff method produced the best results in terms of the tradeoff between speed and accuracy.

Performance

Model performance using 40% of the data set for dictionary construction was evaluated against an independent benchmark. Probabilities were precalculated to optimise execution time.

Further work to improve performance includes topic cluster analysis and common spelling error checks.
