J. Phillips
11 January 2019
Next Word is a working prototype of a text prediction app with a variety of useful applications. It uses text taken from blogs, news feeds and Twitter to build a model that takes a word or phrase as input and attempts to predict the next word. Using this model makes text input faster: the user can select the next word with the click of a button rather than typing it out.
The model for this application was trained on data obtained from the Capstone Dataset. Due to memory and processing-time constraints, a random sample of 10% of the text corpora was used to train the model.
Numbers, symbols and URLs were removed. Emojis and undesirable Unicode characters were either transformed or stripped out. The corpus was split into sentences, and each sentence was split into n-grams of length 1-4. N-grams that appeared fewer than twice, or in fewer than 2 separate documents (sentences), were deleted to filter out obscure or misspelled words. From these n-grams, frequency tables were created, and the last word of each n-gram was separated from the preceding words (the root). At this point, the last words were scanned for profanity and suitable replacements were made with a liberal use of asterisks.
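As a rough illustration of this step, the sketch below tokenizes sentences, counts n-grams and keys each surviving n-gram by its root and last word. It is a simplification: the function name is hypothetical, the regex keeps only lowercase words and apostrophes, and a single count threshold stands in for the combined count-and-document-frequency filter described above.

```python
import re
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams of length n and key each kept entry by (root, last word)."""
    counts = Counter()
    for sentence in sentences:
        # numbers, symbols and URLs fall away; only plain words survive
        words = re.findall(r"[a-z']+", sentence.lower())
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    # drop rare n-grams to filter out obscure or misspelled words
    return {(g[:-1], g[-1]): c for g, c in counts.items() if c >= 2}

table = ngram_counts(["I like to eat pizza.", "I like to eat pasta."], 3)
# {(('i', 'like'), 'to'): 2, (('like', 'to'), 'eat'): 2}
```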
The probability of the next word is calculated from these frequencies. Good-Turing smoothing was used to improve the accuracy for less frequent n-grams, i.e., n-grams that appear fewer than 10 times in the corpus. The Maximum Likelihood Estimate was used for n-grams that appear 10 or more times. This method is suggested by Prof. Olga Veksler in the document Artificial Intelligence II.
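The sketch below shows the rule this describes, assuming the standard Good-Turing adjusted count c* = (c + 1) N(c+1) / N(c), where N(c) is the number of distinct n-grams with frequency c. The function name and the fallback to the raw count when N(c+1) is zero are assumptions, and the probabilities here are simple relative frequencies rather than probabilities conditioned on the root.

```python
from collections import Counter

def smoothed_probs(counts, threshold=10):
    """Good-Turing discounting below the threshold, MLE at or above it."""
    total = sum(counts.values())
    # N_c: how many distinct n-grams occur exactly c times
    n_c = Counter(counts.values())
    probs = {}
    for gram, c in counts.items():
        if c < threshold and n_c[c + 1] > 0:
            # Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c
            c_star = (c + 1) * n_c[c + 1] / n_c[c]
            probs[gram] = c_star / total
        else:
            # Maximum Likelihood Estimate for frequent n-grams
            probs[gram] = c / total
    return probs
```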
The program uses Stupid Backoff to choose the most likely next word. The input text string is cleaned of punctuation and extra whitespace, split into n-grams up to length 4, and checked against the probability table. If the longest possible n-gram is not found, the program downshifts to the next lower n-gram, applying a penalty of 0.4 to the probability. This continues with an alpha of 0.4: 0.4 for the first downshift to a lower n-gram, 0.4 x 0.4 for the next, and so on. The 3 words with the highest probabilities are displayed. If no part of the input has ever appeared in the training corpus, the 3 most frequently used single words are shown.
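A minimal sketch of that lookup, assuming the hypothetical names prob_tables (a dict mapping each root tuple to its next-word probabilities) and top_words (the most frequent unigrams):

```python
def predict_next(words, prob_tables, top_words, alpha=0.4, k=3):
    """Stupid Backoff: try the longest root first, multiplying in an
    extra factor of alpha for each downshift to a shorter n-gram."""
    scores = {}
    penalty = 1.0
    for n in range(min(len(words), 3), 0, -1):  # roots of length 3, 2, 1
        root = tuple(words[-n:])
        for word, p in prob_tables.get(root, {}).items():
            scores.setdefault(word, penalty * p)  # keep the longest-match score
        penalty *= alpha
    if not scores:
        return top_words[:k]  # input never seen: fall back to top unigrams
    return sorted(scores, key=scores.get, reverse=True)[:k]
```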
Either paste some text into the text box or type in a phrase and hit ENTER. The 3 most likely next words will appear in the buttons below, with the most likely in the middle. If you want to select one of these words, click its button and it will be appended to your text. You can keep clicking buttons to build a completely weird, run-on sentence.
The application can be found here: NEXT WORD TEXT PREDICTION APP