Next Word Prediction Based on an N-gram Language Model

Maher Harb
December 7, 2014

Presentation for the Coursera/JHU Data Science capstone project

Background & Motivation

Natural language processing (NLP) is a fascinating field of study aimed at extracting meaningful information from human language by computer processing.
NLP has a wide range of applications, notably in information retrieval, translation, speech recognition, and structured prediction.
This project explored one NLP application: building an algorithm to predict the next word, given a phrase of text as input.
Such an application is widely useful on tablets and mobile devices, where offering users alternative methods of text input helps circumvent the limitations of the touch-based keyboard.

The N-gram Model

The implemented n-gram prediction algorithm assumes that one can predict the next word in a phrase based on the previous n-1 words (Markov approximation). The following were key steps in the implementation of the algorithm:

Sample data from ~600 MB of text corpora drawn from blog, news, and Twitter sources (~4% at a time), clean it, and split it into sentences (based on sentence-ending punctuation).
Extract the frequency of occurrence of all n-grams found in the sample (n=1 to 6), then discard the less frequent n-grams.
Re-sample, and aggregate the n-grams obtained from the different samples.
Interpolate prediction results from the different length n-grams.
Perform cross-validation to fine-tune the key model parameters (length of the largest n-gram, values of the interpolation weights) and to assess the out-of-sample performance of the prediction.
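
The steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the project's implementation: the function names (split_sentences, count_ngrams, prune, predict), the min_count threshold, the interpolation weights, and the toy corpus are all hypothetical, and simple linear interpolation is assumed because the slides do not specify the exact scheme.

```python
import re
from collections import Counter, defaultdict

def split_sentences(text):
    """Lowercase the text and split it on sentence-ending punctuation."""
    sentences = re.split(r"[.!?]+", text.lower())
    # Keep runs of letters and apostrophes as tokens; drop empty sentences.
    tokenized = [re.findall(r"[a-z']+", s) for s in sentences]
    return [tokens for tokens in tokenized if tokens]

def count_ngrams(sentences, max_n):
    """Count every n-gram (n = 1..max_n) without crossing sentence boundaries."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, min_count):
    """Discard n-grams seen fewer than min_count times (assumed threshold)."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})

def predict(counts, phrase, weights, top_k=5):
    """Rank next-word candidates by linearly interpolating n-gram estimates.

    weights maps an n-gram order n to its (assumed) interpolation weight.
    """
    sentences = split_sentences(phrase)
    context = sentences[-1] if sentences else []
    scores = defaultdict(float)
    for n, weight in weights.items():
        if n - 1 > len(context):
            continue  # not enough history for this order
        history = tuple(context[len(context) - (n - 1):])
        # All n-grams of this order whose first n-1 words match the history.
        matches = {g[-1]: c for g, c in counts.items()
                   if len(g) == n and g[:-1] == history}
        total = sum(matches.values())
        for word, c in matches.items():
            scores[word] += weight * c / total
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]

corpus = "I love cats. I love dogs. I love cats and dogs. You love cats."
counts = prune(count_ngrams(split_sentences(corpus), max_n=3), min_count=1)
suggestions = predict(counts, "I love", weights={1: 0.1, 2: 0.3, 3: 0.6})
print(suggestions)
```

On this toy corpus, "cats" ranks first because it is the most frequent continuation at every n-gram order. The linear scan over all counts keeps the sketch short; a real table would index n-grams by their history for fast lookup.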

The validation exercise yielded two main findings: an out-of-sample prediction accuracy of ~14%, and no appreciable gain in accuracy from increasing the n-gram length beyond n=4.
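
An out-of-sample accuracy figure of this kind can be obtained by hiding the next word in held-out text and checking the model's guess against it. The sketch below assumes that accuracy means the share of positions where the top-ranked word equals the true next word, which the slides do not spell out. The names train_bigram_predictor and top1_accuracy, the toy bigram predictor, and the data are hypothetical stand-ins for the actual model and corpora.

```python
from collections import Counter, defaultdict

def train_bigram_predictor(sentences):
    """Build a toy predictor: the most frequent follower of the previous word."""
    followers = defaultdict(Counter)
    for tokens in sentences:
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev][nxt] += 1

    def predict(history):
        if not history or history[-1] not in followers:
            return None  # no guess available for an unseen word
        return followers[history[-1]].most_common(1)[0][0]

    return predict

def top1_accuracy(predict, held_out):
    """Share of positions where the predicted word equals the true next word."""
    hits = trials = 0
    for tokens in held_out:
        for i in range(1, len(tokens)):
            trials += 1
            if predict(tokens[:i]) == tokens[i]:
                hits += 1
    return hits / trials if trials else 0.0

training = [["i", "love", "cats"], ["i", "love", "dogs"], ["i", "love", "cats"]]
held_out = [["you", "love", "cats"], ["i", "like", "dogs"]]
model = train_bigram_predictor(training)
accuracy = top1_accuracy(model, held_out)
print(accuracy)
```

Repeating this measurement for different maximum n-gram lengths and weight settings is one way to see where additional n-gram length stops paying off.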

Shiny Application

The model was packaged as a Shiny application, supported by the jQuery UI autocomplete widget, which lets users quickly pick predicted words with the keyboard. The instructions for using the application are:

Type or paste a phrase of text in the input box, then press the space bar to trigger the next-word prediction (see left panel below).
After a slight delay, a drop-down menu with five words appears; the top word in the list is the model's best guess. The menu can be navigated with the arrow keys, and a word can be selected with the Enter or Tab key.
You can also filter down the predicted words by typing the first letter or few letters of the word (see right panel below).
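
The filtering behavior described in the last step can be illustrated with a short Python sketch. It only mimics the described behavior and does not reproduce the autocomplete widget's code; the function name filter_by_prefix, the limit of five, and the ranked word list are assumptions made for the example.

```python
def filter_by_prefix(candidates, typed, limit=5):
    """Keep ranked candidates that start with the typed letters (case-insensitive)."""
    prefix = typed.lower()
    return [w for w in candidates if w.lower().startswith(prefix)][:limit]

# Hypothetical model output, best guess first.
ranked = ["the", "to", "a", "that", "time", "this", "today"]

print(filter_by_prefix(ranked, ""))    # nothing typed yet: top five as ranked
print(filter_by_prefix(ranked, "t"))   # one letter typed
print(filter_by_prefix(ranked, "th"))  # two letters typed
```

Because the filter preserves the model's ranking, the first remaining entry is still the best guess among the words that match what has been typed so far.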

Future Work

The model offers acceptable prediction accuracy as a starting point. Further improvements may focus on two areas:

Adaptive learning: Tailoring the model to a specific user by recording the user's input and adding it to the text corpora used to build the n-gram tables.
Context-based prediction: A possible approach is to categorize the blog and news articles into topics, so that each corpus contains all the text belonging to one category (e.g. sports, news, politics, entertainment). Based on the input phrase, one could then assign a category (or multiple categories) to the typing session and use the n-gram table relevant to each category to produce predictions.
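
As a rough illustration of the adaptive learning idea, the Python sketch below folds one user-typed sentence into an existing n-gram count table. It is a sketch under assumptions rather than a proposed design: the function name update_counts, the user_weight parameter (which counts each user n-gram more than once so that personal habits show up quickly), and the example counts are all hypothetical.

```python
from collections import Counter

def update_counts(counts, user_tokens, max_n=4, user_weight=1):
    """Add the n-grams of one user-typed sentence to an existing count table."""
    for n in range(1, max_n + 1):
        for i in range(len(user_tokens) - n + 1):
            counts[tuple(user_tokens[i:i + n])] += user_weight
    return counts

# Hypothetical table built from the general corpora.
table = Counter({
    ("see", "you"): 10,
    ("see", "you", "later"): 6,
    ("see", "you", "soon"): 4,
})

# The user types "see you soon"; each of its n-grams is counted three times.
update_counts(table, ["see", "you", "soon"], max_n=3, user_weight=3)
print(table[("see", "you", "soon")], table[("see", "you", "later")])
```

After the update, "soon" has a higher count than "later" following "see you", so a count-based predictor would now favor the user's own phrasing.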