Next Word Prediction Algorithm

Tatiana Veremeenko
4/25/2015

Motivation and the main idea

Next word prediction is a complex task that requires balancing accuracy, memory usage and speed. Since the most promising field of application is mobile text input, there are fairly strict limits on all three.

The language model I built is based on the frequencies of bigrams and trigrams extracted from social media corpora. It predicts the three most probable next words. Thanks to thorough optimization, it uses only 60 MB of memory, achieves a top-3 precision of about 20%, and takes only 4-5 ms per prediction.

The algorithm in detail

The algorithm uses the most probable bigrams and trigrams, collected from a pre-processed social media corpus (50% blog posts, 50% tweets). The texts were cleaned of non-alphabetic symbols, split into words and filtered through a dictionary of about 50,000 English words (to exclude typos and profanity). All these steps were performed by Python scripts.
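The pre-processing steps can be illustrated with a minimal Python sketch. This is not the original script: the function names, the toy dictionary and the toy texts are hypothetical, and the choice to break a word sequence at every out-of-dictionary word (rather than simply dropping the word) is an assumption.

```python
import re
from collections import Counter

def clean_runs(text, dictionary):
    """Yield runs of consecutive in-dictionary words from one text."""
    run = []
    # Lowercase the text and keep only alphabetic tokens.
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in dictionary:
            run.append(word)
        elif run:
            # Assumption: an unknown word ends the current run, so no
            # n-gram is counted across a typo or a filtered word.
            yield run
            run = []
    if run:
        yield run

def count_ngrams(texts, dictionary, n):
    """Count n-grams (as tuples of words) over all cleaned runs."""
    counts = Counter()
    for text in texts:
        for run in clean_runs(text, dictionary):
            for i in range(len(run) - n + 1):
                counts[tuple(run[i:i + n])] += 1
    return counts

# Toy data standing in for the SCOWL word list and the corpus.
dictionary = {"i", "love", "my", "dog", "cat"}
texts = ["I love my dog!!!", "I love my cat :)", "I luv my dog"]

bigrams = count_ngrams(texts, dictionary, 2)
trigrams = count_ngrams(texts, dictionary, 3)
```

In this toy run the misspelled "luv" is rejected by the dictionary, so the third text contributes only the bigram ("my", "dog") and no trigram.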

The algorithm takes the phrase entered by the user and looks for the three most probable trigrams that start with the last two words of that phrase. If the phrase is shorter than two words or no such trigrams exist, the algorithm falls back to bigrams that start with the last word of the phrase; if that also fails, it suggests the three most popular single words.
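The lookup order described above can be sketched in Python as follows. It is a simplified illustration under stated assumptions, not the original implementation: `build_index`, `predict` and all counts are hypothetical, the input is normalized only by lowercasing and splitting on whitespace, and a lookup that finds fewer than three continuations returns them as-is instead of topping up from the next level.

```python
from collections import Counter, defaultdict

def build_index(ngram_counts, keep=3):
    """Map each n-gram prefix to its `keep` most frequent next words."""
    by_prefix = defaultdict(Counter)
    for ngram, count in ngram_counts.items():
        by_prefix[ngram[:-1]][ngram[-1]] += count
    return {
        prefix: [word for word, _ in nexts.most_common(keep)]
        for prefix, nexts in by_prefix.items()
    }

def predict(phrase, tri_index, bi_index, top_words):
    """Suggest next words: trigrams first, then bigrams, then unigrams."""
    words = phrase.lower().split()
    if len(words) >= 2:
        hit = tri_index.get(tuple(words[-2:]))
        if hit:
            return hit
    if words:
        hit = bi_index.get(tuple(words[-1:]))
        if hit:
            return hit
    # Nothing matched (or the phrase is empty): most popular single words.
    return top_words

# Hypothetical counts, for illustration only.
trigram_counts = {
    ("i", "love", "you"): 9,
    ("i", "love", "my"): 5,
    ("i", "love", "it"): 3,
    ("i", "love", "the"): 1,
}
bigram_counts = {("love", "you"): 12, ("my", "dog"): 4, ("my", "cat"): 2}
top_words = ["the", "to", "and"]

tri_index = build_index(trigram_counts)
bi_index = build_index(bigram_counts)

from_trigrams = predict("I love", tri_index, bi_index, top_words)
from_bigrams = predict("feed my", tri_index, bi_index, top_words)
from_unigrams = predict("hello", tri_index, bi_index, top_words)
```

Storing only the top three continuations per prefix keeps the lookup tables small and makes each prediction a couple of dictionary lookups; whether the original model is organized this way is an assumption.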

The application

The application that demonstrates the algorithm is simple yet entertaining. The user types any phrase into a text input area, and the application shows the three most probable next words below it. Clicking a suggestion appends that word to the input.

Snapshot

References and technologies

The algorithm:

Data: HC Corpora
Programming languages: R and Python
Dictionary: collection of word lists SCOWL

The application:

Shiny, a web application framework for R

Special thanks to Jan Hagelauer, whose benchmark was a great help.