'Predict Next Word' Application

Marco Marchetti
November 12, 2017

Johns Hopkins University Capstone Pitch (Coursera)

Overview

The Predict Next Word application predicts the three most probable next words based on the user's text input. The main application goal is performance, while also keeping memory allocation low. This presentation summarizes:

  • The Data Set and main Preprocessing steps
  • The nGram Prediction Algorithm and a new Experimental nBiGram approach
  • The Shiny Application product.

Data and Preprocessing

The predictive model is based on the HC Corpora data set, which includes files containing texts extracted from blogs, news, and Twitter. The English data set contains over 100 million words, and a detailed analysis is available here.

The main objective of the data processing is to create contiguous sequences of words (n-grams) from the text. The preprocessing steps executed on the data are:

  • Sample approximately 50% of the initial data.
  • Remove all non-alphabetic characters (numbers, punctuation, special characters).
  • Convert to lowercase to eliminate case sensitivity.
  • Remove profanity (words banned by Google).
  • Extract word sequences (1-, 2-, and 3-grams) and their frequencies.

To reduce memory allocation, only n-grams with a frequency greater than 1 are stored in the nGram Tables.
To improve performance, some prediction algorithm parameters are precomputed during the nGram Table creation phase; other parameters are calculated on the fly while the user types the text. A minimal sketch of the pipeline is shown below.
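The following R sketch illustrates the preprocessing and n-gram extraction steps described above. It is illustrative only: the function name and implementation details are assumptions, not the application's actual code.

```r
# Minimal sketch of the preprocessing pipeline (illustrative only;
# the real implementation and package choices may differ).
build_ngram_table <- function(lines, n, profanity = character(0)) {
  text  <- tolower(lines)               # eliminate case sensitivity
  text  <- gsub("[^a-z ]", " ", text)   # drop numbers, punctuation, special characters
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[nzchar(words) & !(words %in% profanity)]
  if (length(words) < n) return(data.frame())

  # Slide a window of size n over the word stream to form n-grams
  # (for simplicity this sketch ignores line and sentence boundaries)
  idx    <- seq_len(length(words) - n + 1)
  ngrams <- vapply(idx, function(i) paste(words[i:(i + n - 1)], collapse = " "),
                   character(1))

  freq <- sort(table(ngrams), decreasing = TRUE)
  freq <- freq[freq > 1]                # store only n-grams seen more than once
  data.frame(ngram = names(freq), count = as.integer(freq),
             stringsAsFactors = FALSE)
}

# Example call on a tiny sample:
# build_ngram_table(c("The cat sat.", "the cat sat", "the cat ran"), n = 2)
```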

Prediction Algorithm

A simple Katz back-off model with Good-Turing discounting is used to calculate the conditional probability of a word given its preceding words. If the n-gram has been seen more than k times in training, the conditional probability of a word given its history is proportional to the maximum likelihood estimate of that n-gram. Otherwise, the conditional probability is equal to the back-off conditional probability of the (n-1)-gram (Wikipedia, Katz's back-off model page).
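In symbols, following the standard formulation on the cited Wikipedia page:

```latex
P_{bo}(w_i \mid w_{i-n+1} \dots w_{i-1}) =
\begin{cases}
  d_{w_{i-n+1} \dots w_i} \, \dfrac{C(w_{i-n+1} \dots w_i)}{C(w_{i-n+1} \dots w_{i-1})}
    & \text{if } C(w_{i-n+1} \dots w_i) > k \\[1ex]
  \alpha_{w_{i-n+1} \dots w_{i-1}} \, P_{bo}(w_i \mid w_{i-n+2} \dots w_{i-1})
    & \text{otherwise}
\end{cases}
```

where C(·) is the training count, d is the Good-Turing discount factor, and α is the back-off weight that redistributes the discounted probability mass to the lower-order model.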

An experimental variant of Katz's back-off algorithm has also been tested. The main idea is to treat a couple of words as a single token in the prediction algorithm; the prediction is then based on what we call the nBiGram model ("bi-bigram", "tri-bigram", and so on).
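A sketch of this pairing step is shown below. The helper is hypothetical, and whether the word pairs overlap is an implementation choice; non-overlapping pairs are assumed here.

```r
# Illustrative sketch of the nBiGram idea: fuse consecutive word
# pairs into single tokens, then apply the usual n-gram machinery
# to the fused stream.
to_bigram_tokens <- function(words) {
  if (length(words) < 2) return(character(0))
  starts <- seq(1, length(words) - 1, by = 2)   # non-overlapping pairs
  vapply(starts, function(i) paste(words[i], words[i + 1], sep = "_"),
         character(1))
}

to_bigram_tokens(c("thank", "you", "very", "much"))
# [1] "thank_you" "very_much"
```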

References:
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400-401.

Application

After the nGram Tables loading phase, the user can type text in the text box and the application displays the top three predicted words in a SwiftKey-style prediction bar area. The central word is the most probable; the words on the left and right are the less probable alternatives.

The prediction algorithm runs in under 8 ms, which allows an “instant” version of the application. The application is simple to use and implements three modes (a minimal Shiny wiring sketch follows the list):

  • Instant Mode: the predicted word appears while the user types the text.
  • Submit Mode: the predicted word appears when the Submit button is clicked.
  • Couple Mode: two predicted words appear while the user types the text.
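The sketch below shows how such a reactive front end could be wired in Shiny. It illustrates the "Instant Mode" behavior only; predict_words() is a hypothetical stand-in for the actual back-off predictor, assumed to return the top three candidates, most probable first.

```r
library(shiny)

# predict_words() is a hypothetical stand-in for the Katz back-off
# predictor; it should return the top three candidate words,
# most probable first.
ui <- fluidPage(
  textInput("user_text", label = "Type your text:"),
  fluidRow(
    column(4, textOutput("left")),    # second most probable word
    column(4, textOutput("center")),  # most probable word (central position)
    column(4, textOutput("right"))    # third most probable word
  )
)

server <- function(input, output) {
  # "Instant Mode": the reactive re-runs as the user types
  preds <- reactive(predict_words(input$user_text))
  output$center <- renderText(preds()[1])
  output$left   <- renderText(preds()[2])
  output$right  <- renderText(preds()[3])
}

shinyApp(ui, server)
```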

The Application is available here.