Next Word Prediction with N-Gram Models

randomwalker2014
04/26/2015

JHU Coursera Capstone Project

Background and Motivation

  • The development of predictive, assistive systems is a growing field; in particular, semantic modeling of text is of great interest to data scientists.

  • Providing editing assistance to a user is a challenging problem, given the varied distribution of words across documents, differences in typing speed and personal preferences, and an ever-expanding vocabulary and range of sentence constructions.

  • Although Natural Language Processing (NLP) has a wide range of applications, this project focuses on one specific problem: predicting the next word given a fragment of text.

  • The W Predictor application builds on the idea of the auto-completion feature found on almost all smartphones and mobile devices.

N-Gram Model

  • Mathematically speaking, the joint probability of an entire sequence of words (a sentence) can be written as a product of conditional probabilities, and predicting the next word requires simplifying that calculation. Applying the Markov assumption makes the calculation tractable with an acceptable loss of precision.

  • The model uses the Markov assumption to calculate Maximum Likelihood Estimates (MLEs) for the last word in the sequence by considering only the previous word (2-Gram), the previous two words (3-Gram), or at most the previous three words (4-Gram); the corresponding formulas are spelled out after this list.

  • To calculate the MLEs used by the model, a linear interpolation technique mixes the probability estimates of all N-Gram orders with pre-calculated weights; the weights are chosen by minimizing the overall perplexity on the holdout sample. A sketch of this mixing step follows this list.

  • Since the final model had to have a small disk and memory footprint, the N-Grams were pruned without sacrificing too much model accuracy. The final model, with about 1.2 million N-Grams, is just under 10 MB on disk.
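
To make the Markov and MLE bullets above concrete, the key formulas are spelled out below (3-Gram case for the MLE; the notation is mine, not taken from the project write-up).

```latex
% Chain rule: the joint probability of a sentence factors into conditionals
P(w_1,\dots,w_n) = \prod_{i=1}^{n} P(w_i \mid w_1,\dots,w_{i-1})

% Markov assumption (3-Gram case): condition only on the previous two words
P(w_i \mid w_1,\dots,w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})

% Maximum Likelihood Estimate from corpus counts
P_{\mathrm{MLE}}(w_i \mid w_{i-2}, w_{i-1})
  = \frac{\mathrm{count}(w_{i-2}\, w_{i-1}\, w_i)}{\mathrm{count}(w_{i-2}\, w_{i-1})}

% Linear interpolation of the 2-, 3- and 4-Gram estimates,
% with \lambda_2 + \lambda_3 + \lambda_4 = 1
\hat{P}(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})
  = \lambda_4\, P_{\mathrm{MLE}}(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})
  + \lambda_3\, P_{\mathrm{MLE}}(w_i \mid w_{i-2}, w_{i-1})
  + \lambda_2\, P_{\mathrm{MLE}}(w_i \mid w_{i-1})
```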
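
A minimal base-R sketch of how the interpolated estimate might be looked up from pre-built count tables is shown below. The counts object, the mle() helper, and the default lambda weights are illustrative assumptions; the actual model stores pre-computed interpolated MLEs rather than raw counts.

```r
# counts[[n]] is assumed to be a named frequency table keyed by the
# space-separated n-gram; counts[[1]] holds unigram counts.

# MLE of `word` given a space-separated `history` of n-1 preceding words.
mle <- function(counts, history, word) {
  if (!nzchar(history)) return(0)
  n   <- length(strsplit(history, " ", fixed = TRUE)[[1]]) + 1
  num <- counts[[n]][paste(history, word)]
  den <- counts[[n - 1]][history]
  if (is.na(num) || is.na(den) || den == 0) return(0)
  unname(num / den)
}

# Interpolated probability of `word` following the last three words of
# `context`. The lambda weights (2-, 3-, 4-Gram order) are assumed to sum
# to 1 and to have been tuned by minimizing perplexity on the holdout set.
interpolated_prob <- function(counts, context, word,
                              lambda = c(0.2, 0.3, 0.5)) {
  tokens <- strsplit(tolower(context), "\\s+")[[1]]
  hist2  <- paste(tail(tokens, 1), collapse = " ")
  hist3  <- paste(tail(tokens, 2), collapse = " ")
  hist4  <- paste(tail(tokens, 3), collapse = " ")
  lambda[1] * mle(counts, hist2, word) +
    lambda[2] * mle(counts, hist3, word) +
    lambda[3] * mle(counts, hist4, word)
}
```

The predicted next word is then simply the vocabulary entry with the highest interpolated probability for the given context.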

Building the N-Gram Model

The following key steps were involved in constructing the N-Gram Model:

  • A random sample of 1.2 million documents was selected from all three data files, merged, and split into a Training set (70%), a Test set (20%), and a Holdout set (10%).

  • During pre-processing, the data was cleaned by removing non-alphabetic characters, repeated word sequences, and short documents (fewer than 4 characters long). It was a conscious decision not to perform stop-word removal, profanity filtering, or stemming. A sketch of the sampling and cleanup steps appears after this list.

  • The sanitized data was tokenized into 2-Gram, 3-Gram, and 4-Gram tokens, and low-frequency 4-Grams were removed prior to building the model (see the tokenization sketch after this list).

  • An initial set of MLEs was calculated for the N-Grams and later optimized via interpolation against the Holdout dataset; the interpolated MLEs were then assigned to all N-Grams.

  • The generated model was tested against the Test set and optimized by improving the pruning and data-cleanup process; overall out-of-sample accuracy improved from 12% to 16%.
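
For illustration, here is a base-R sketch of the sampling, splitting, and cleanup steps described above. The file names, the sampling details, and the regular expressions are assumptions made for this sketch (removal of repeated word sequences is omitted), not the exact code used in the project.

```r
set.seed(2015)

# Read the three source files and pool their documents
# (file names are assumed; they are not given in this write-up).
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
docs  <- unlist(lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE))

# Random sample of 1.2 million documents, then a 70/20/10 split.
docs <- sample(docs, 1.2e6)
idx  <- sample(c("train", "test", "holdout"), length(docs),
               replace = TRUE, prob = c(0.7, 0.2, 0.1))
train   <- docs[idx == "train"]
test    <- docs[idx == "test"]
holdout <- docs[idx == "holdout"]

# Basic cleanup: lower-case, keep only letters and apostrophes,
# collapse whitespace, and drop very short documents (< 4 characters).
clean <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  x <- gsub("\\s+", " ", trimws(x))
  x[nchar(x) >= 4]
}
train <- clean(train)
```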
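
And a sketch of the tokenization and pruning step, again in base R. The project may well have used a text-mining package for this; the pruning threshold below is an illustrative choice.

```r
# Build n-gram frequency tables from the cleaned training documents.
ngram_counts <- function(docs, n) {
  grams <- unlist(lapply(strsplit(docs, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    # Slide a window of n words across the document.
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  table(grams)
}

# Unigram through 4-Gram tables.
counts <- lapply(1:4, function(n) ngram_counts(train, n))

# Prune low-frequency 4-Grams to keep the disk and memory footprint small
# (keeping only those seen more than once is an illustrative threshold).
counts[[4]] <- counts[[4]][counts[[4]] > 1]
```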

Next Word Predictor Application

Instructions for using the Safe R Shiny reactive application:

  1. Type or paste a fragment of text into the text box provided on the left, and the predictor will suggest the top candidate for the next word.
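
A minimal sketch of what such a reactive Shiny front end looks like is shown below; the layout and the predict_next_word() helper are placeholders, not the application's actual code.

```r
library(shiny)

# Placeholder for the real model lookup; assumed to return the single
# most likely next word for the supplied text fragment.
predict_next_word <- function(fragment) {
  if (!nzchar(trimws(fragment))) return("")
  "the"  # dummy prediction for the sketch
}

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("fragment", "Enter a fragment of text:")
    ),
    mainPanel(
      h4("Predicted next word:"),
      textOutput("prediction")
    )
  )
)

server <- function(input, output) {
  # Reactive: the prediction re-runs whenever the text box changes.
  output$prediction <- renderText(predict_next_word(input$fragment))
}

shinyApp(ui = ui, server = server)
```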

Future Work

Through this project, I have barely scratched the surface of Natural Language Processing. I would like to better understand the limitations of N-Gram models and explore the possibility of combining them with other NLP models to achieve better accuracy.

I would also like to explore practical applications of this technique, and in particular of the Markov simplification, in domains outside of NLP.

Finally, a BIG thank you to Coursera and our instructors at JHU!