Cerebro - predictive English language modeler

Data science track, Capstone Project

Fedor Duzhin
Coursera student

What Cerebro does

The app "Cerebro" predicts the most likely next word as the text is being typed. For example, if you type Fedor's app is the, it'll guess that the next word should be best.

Input

Enter a phrase and wait a second or two.

Output

Cerebro produces two kinds of output: the single most probable next word according to its calculations, and the five words most likely to come next, together with their probabilities.

How Cerebro works

For three words A, B, C, we calculate the probability that C follows A and B in a phrase. The calculations are based on counting unigrams (words), bigrams (pairs of words) and trigrams (triples of words) in a large body of text.
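To make the counting step concrete, here is a minimal sketch in Python (the deployed app runs on shinyapps.io, so the production code is presumably R; the tokenizer, the toy corpus, and all function names below are invented for illustration):

    from collections import Counter

    def tokenize(text):
        # Naive whitespace tokenizer; the real app's preprocessing is more involved.
        return text.lower().split()

    def count_ngrams(sentences):
        # Count unigrams, bigrams and trigrams over a list of sentences.
        uni, bi, tri = Counter(), Counter(), Counter()
        for s in sentences:
            w = tokenize(s)
            uni.update(w)
            bi.update(zip(w, w[1:]))
            tri.update(zip(w, w[1:], w[2:]))
        return uni, bi, tri

    def p_trigram(a, b, c, bi, tri):
        # Maximum-likelihood estimate of P(c | a, b) = count(a, b, c) / count(a, b).
        if bi[(a, b)] == 0:
            return 0.0
        return tri[(a, b, c)] / bi[(a, b)]

    uni, bi, tri = count_ngrams(["fedor's app is the best",
                                 "the app is the best app"])
    print(p_trigram("is", "the", "best", bi, tri))   # 1.0 on this toy corpus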

Given a phrase, we identify 60 natural candidates for the next word - the 30 most frequent words overall and the 30 words that most frequently follow the last word of the phrase. We then calculate the probability of each of the 60 candidates and output the five most likely ones.
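Continuing the sketch above, candidate generation and ranking could look roughly like this; the function name and the use of the raw trigram estimate as the score are simplifications, not Cerebro's actual scoring, which uses the smoothed probabilities described below:

    def predict(phrase, uni, bi, tri, n_candidates=30, top=5):
        # Gather the most frequent words overall plus the most frequent words
        # that follow the last word of the phrase, then rank the candidates
        # by the trigram estimate P(candidate | last two words).
        words = tokenize(phrase)
        a, b = words[-2], words[-1]
        overall = [w for w, _ in uni.most_common(n_candidates)]
        followers = Counter({w2: n for (w1, w2), n in bi.items() if w1 == b})
        after_last = [w for w, _ in followers.most_common(n_candidates)]
        candidates = set(overall) | set(after_last)
        scored = [(c, p_trigram(a, b, c, bi, tri)) for c in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top]

    print(predict("fedor's app is the", uni, bi, tri))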

How we can sell Cerebro

There are two big areas for marketable applications of Cerebro.

Selling Cerebro itself

In situations where typing is difficult, Cerebro helps by suggesting a few most likely next words. Possible applications include mobile phone keyboards and keyboard replacements for disabled users.

Selling algorithms / consultancy

Essentially, Cerebro contains a compact model of the English language. Given any phrase, the algorithm that powers Cerebro estimates the probability that the phrase is proper English.

Such language models have a number of applications, e.g., machine translation, speech recognition, and extracting text from a scanned image.

Although we did not develop an image recognition algorithm, we now have the expertise to team up with people experienced in image recognition and, say, create an app that reads text from damaged forensic evidence and sell it to the government.

Maths under the hood

A language model assigns a probability to a phrase. For example, a good English model will assign a high probability to the phrase "I ate a whole orange" and a low probability to "Eye it oh hole a range".
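In standard textbook notation (not code taken from Cerebro), a trigram model factors the probability of a phrase w_1 ... w_n via the chain rule, keeps only the two preceding words as context, and estimates each factor from counts:

    P(w_1 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})
                     \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1}),
    \qquad
    P(w_i \mid w_{i-2}, w_{i-1}) \approx
        \frac{\mathrm{count}(w_{i-2}\, w_{i-1}\, w_i)}{\mathrm{count}(w_{i-2}\, w_{i-1})}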

Language models are based on counting ngrams - consecutive sequences of words: an individual word is a unigram, a pair of words is a bigram, a triple of words is a trigram, and so on. The counts are collected from a large body of natural text; Cerebro has been trained on 800K blog entries, 1M web news articles, and 2.3M Twitter messages.

Simple ngram counts are not sufficient to construct a good language model. Various smoothing methods are applied to estimate the probability of ngrams that never occur in the training set. We have looked at linear interpolation, backoff, Good-Turing, and Kneser-Ney smoothing; according to the literature, the best-performing method is modified Kneser-Ney.
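For reference, the textbook interpolated Kneser-Ney estimate for bigrams (not necessarily the exact variant implemented in Cerebro) discounts every observed count by a constant d and redistributes the saved mass according to a continuation probability:

    P_{\mathrm{KN}}(w_i \mid w_{i-1}) =
        \frac{\max\bigl(c(w_{i-1} w_i) - d,\, 0\bigr)}{c(w_{i-1})}
        + \lambda(w_{i-1})\, P_{\mathrm{cont}}(w_i),
    \qquad
    P_{\mathrm{cont}}(w_i) =
        \frac{\bigl|\{\, w' : c(w' w_i) > 0 \,\}\bigr|}
             {\bigl|\{\, (w', w'') : c(w' w'') > 0 \,\}\bigr|}

where \lambda(w_{i-1}) = d \cdot |\{ w : c(w_{i-1} w) > 0 \}| / c(w_{i-1}) is chosen so that the probabilities sum to one.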

However, the existing literature is scarce on implementation details. In the end we implemented a trigram Kneser-Ney model with elements of backoff smoothing (essentially, Cerebro looks only at the last two words of the input phrase; the rest is not used for prediction), and we had to work out some key details from scratch.
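A rough sketch of how a backoff-style scorer can be wired up, reusing the counters from the Python sketch above, is given below; the fixed penalty alpha and the fallback order are illustrative choices (closer in spirit to "stupid backoff" than to full Kneser-Ney) and are not Cerebro's actual details:

    def backoff_score(a, b, c, uni, bi, tri, alpha=0.4):
        # Use the trigram estimate if the trigram (a, b, c) was seen in training;
        # otherwise back off to the bigram (b, c), then to the unigram c,
        # multiplying by a fixed penalty alpha at each backoff step.
        if tri[(a, b, c)] > 0:
            return tri[(a, b, c)] / bi[(a, b)]
        if bi[(b, c)] > 0:
            return alpha * bi[(b, c)] / uni[b]
        return alpha * alpha * uni[c] / sum(uni.values())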

How Cerebro can be improved

Depending on the area in which we market Cerebro, several improvements could be introduced:

  • Make it distinguish upper/lower case and make use of punctuation for prediction.

  • Use not only trigrams but also 4-grams, 5-grams, and so-called 2-skipgrams - pairs of words that occur in the same sentence but not necessarily next to each other (see the sketch after this list). We will need more computing power to develop such models.

  • Predict the next word not only from the previous words but also from the first few of its letters that have already been typed. In other words, this would add spell-checking-like features to Cerebro.

  • Use parts of speech to improve Cerebro. For instance, "a" is usually followed by a noun, while according to our model the top five words most likely to follow "a" are "many", "lot", "few", "little", and "great".
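As a sketch of the skipgram idea from the second bullet, the pair counts could be collected as follows; the function name and the optional gap limit are invented for illustration, and the definition simply follows the bullet above (pairs of words co-occurring in the same sentence):

    from collections import Counter

    def count_skipgrams(sentences, max_gap=None):
        # Count ordered pairs of words that co-occur in the same sentence,
        # not necessarily next to each other; max_gap optionally limits the
        # number of positions between the two words.
        skip = Counter()
        for s in sentences:
            w = s.lower().split()
            for i in range(len(w)):
                for j in range(i + 1, len(w)):
                    if max_gap is None or j - i <= max_gap:
                        skip[(w[i], w[j])] += 1
        return skip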

Link

https://theodor.shinyapps.io/ngram_prediction