Word prediction application

Alexandre Rodichevski
April 12, 2015

Introduction

The goal of this project, part of the Coursera Data Science Specialization, is to develop a predictive data product which, given a text phrase, forecasts the succeeding words.

The project uses data from a corpus of English texts taken from blogs, news feeds and Twitter messages. The corpus contains about 100 million words. A random sample of about 0.5% of this material has been used as the training data set.
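The sampling step can be sketched as follows. This is only an illustration: the source does not say whether the sample was drawn by line, by document or by word, so sampling by line, the function name sample_lines and the fixed seed are all assumptions.

```python
import random

def sample_lines(lines, fraction, seed=0):
    """Keep each line independently with probability `fraction`.

    Assumption: the corpus is sampled line by line. A fixed seed makes
    the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

# Hypothetical stand-in for the corpus: 100,000 short lines.
corpus = [f"line {i}" for i in range(100_000)]
training = sample_lines(corpus, fraction=0.005)
print(len(training))  # roughly 0.5% of the lines; the exact count depends on the seed
```

Because each line is kept independently, the sample size is only approximately 0.5% of the corpus, which matches the "about 0.5%" wording above.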

The main idea of the algorithm is that some chains of successive words in phrases are more probable than others.

Prediction algorithm

The text is converted to lower case and divided into tokens: words, numbers and punctuation symbols. The algorithm does not predict profane (or vulgar) words, but it does recognize them in the input. The prediction algorithm is based on the 9,738 most frequently used words, which account for 97% of the whole training text.
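The preprocessing described above might look like the sketch below. The regular expression and the function names tokenize and vocabulary_for_coverage are assumptions made for illustration, not the application's actual code. The pattern only handles unaccented English letters.

```python
import re
from collections import Counter

# Hypothetical token pattern: a word (optionally with apostrophe parts),
# a number, or a single punctuation symbol.
TOKEN_RE = re.compile(r"[a-z]+(?:['’][a-z]+)*|\d+(?:\.\d+)?|[^\w\s]")

def tokenize(text):
    """Lower-case the text and split it into word, number and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

def vocabulary_for_coverage(tokens, coverage):
    """Return the most frequent tokens that together cover `coverage` of the text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocabulary, covered = [], 0
    for token, count in counts.most_common():
        if covered >= coverage * total:
            break
        vocabulary.append(token)
        covered += count
    return vocabulary

print(tokenize("That can't be, right?"))
# prints: ['that', "can't", 'be', ',', 'right', '?']
```

Called with coverage=0.97 on the training tokens, vocabulary_for_coverage would return a word list of the kind described above. The size of that list depends on the corpus and on the tokenization rules.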

The idea of the algorithm is that the probability of a word in the text depends mainly on the previous words in the same phrase. A word in a phrase is predicted as a function of the two immediately preceding tokens. For this purpose, the most probable following token is determined for every pair of successive tokens in the training data. For example, “that can't” is most likely followed by the verb “be.”
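A minimal sketch of this lookup, assuming the training text is already tokenized. The names train_trigram_model and predict_next and the toy training text are hypothetical, and the real application may store its table differently.

```python
from collections import Counter, defaultdict

def train_trigram_model(tokens):
    """Map each pair of successive tokens to its most frequent following token."""
    followers = defaultdict(Counter)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        followers[(first, second)][third] += 1
    return {pair: counts.most_common(1)[0][0] for pair, counts in followers.items()}

def predict_next(model, tokens):
    """Return the predicted next token, or None if the last pair was never seen."""
    if len(tokens) < 2:
        return None
    return model.get((tokens[-2], tokens[-1]))

# Toy training text, already split into tokens.
training = "that can't be true . that can't be right . that can't happen .".split()
model = train_trigram_model(training)
print(predict_next(model, ["that", "can't"]))  # prints: be
```

Storing only the single most frequent follower for each pair keeps the table small, at the price of never offering a second choice.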

Usage of the application

The web application takes a text phrase as input. Applying the algorithm described above, it yields the most probable next word in the phrase.

[Screenshot of the application page]

The phrase is entered in the form. The application immediately gives the prediction in the right part of the page.

Enhancements of the algorithm

The algorithm predicts better when it uses token chains longer than three tokens, but this requires more memory and computation time.
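One possible way to use longer chains is sketched below. It adds a fallback to shorter contexts when a long one has not been seen. This fallback is an assumption of the sketch and is not described in the source. The names train_ngram_tables and predict_with_backoff are hypothetical.

```python
from collections import Counter, defaultdict

def train_ngram_tables(tokens, max_order):
    """Build one lookup table per context length, from 1 to max_order - 1 tokens."""
    tables = {}
    for context_len in range(1, max_order):
        followers = defaultdict(Counter)
        for i in range(len(tokens) - context_len):
            context = tuple(tokens[i:i + context_len])
            followers[context][tokens[i + context_len]] += 1
        tables[context_len] = {
            context: counts.most_common(1)[0][0]
            for context, counts in followers.items()
        }
    return tables

def predict_with_backoff(tables, tokens):
    """Try the longest available context first, then progressively shorter ones."""
    for context_len in sorted(tables, reverse=True):
        if len(tokens) < context_len:
            continue
        context = tuple(tokens[-context_len:])
        if context in tables[context_len]:
            return tables[context_len][context]
    return None

training = "i think that can't be true . you know that can't happen".split()
tables = train_ngram_tables(training, max_order=4)  # chains of up to four tokens
print(predict_with_backoff(tables, ["know", "that", "can't"]))  # prints: happen
```

Each additional context length adds another table, which is where the extra memory and computation time come from.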

The data compression can be improved. Some rarely used words occur in word chains more often than other rare words. Such words can be kept in the vocabulary alongside the most frequently used words rather than discarded.

The algorithm could be adapted to the Russian language. In Russian, a lot of information in a phrase is carried by the suffixes of the words, so the probability model should include the suffixes. In particular, rarely used words can be converted into special tokens that preserve the suffixes.
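The conversion of rare words into suffix-preserving tokens could be sketched as follows. Taking the last two letters as the "suffix" is a crude assumption standing in for real morphological analysis. The function name encode_rare_words and the "<rare>-" marker are hypothetical.

```python
def encode_rare_words(tokens, vocabulary, suffix_len=2):
    """Replace out-of-vocabulary words with a placeholder that keeps the word ending.

    Assumption: the last `suffix_len` letters approximate the grammatical suffix.
    Numbers and punctuation are passed through unchanged.
    """
    encoded = []
    for token in tokens:
        if token in vocabulary or not token.isalpha():
            encoded.append(token)
        else:
            encoded.append("<rare>-" + token[-suffix_len:])
    return encoded

vocabulary = {"я", "вижу"}
print(encode_rare_words(["я", "вижу", "синюю", "птицу", "."], vocabulary))
# prints: ['я', 'вижу', '<rare>-юю', '<rare>-цу', '.']
```

The n-gram tables can then be trained on the encoded tokens, so that a rare word still contributes the grammatical information carried by its ending.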