Next Word Prediction

Nicolais Guevara
April 23 2015

Next Word Prediction?

What is next word prediction?

Prediction of the next word in an incomplete sentence

Why do we need a prediction?

It helps users of small devices type text faster and reduces misspellings.

How?

A next word prediction app uses a probabilistic method to suggest the most probable next words in a sentence, which improves the typing experience on small devices.

Data Sources

Public dataset

http://www.corpora.heliohost.org/download.html contains material published on different web pages from 2005 (2,462,000,000 words).
Only the English dataset is used, from three different sources (blogs, newspapers and Twitter)
Personal dataset from Twitter, Facebook or personal blogs

Steps:

1) Load the data

R, Python or Java

2) Cleaning the data

remove numbers
remove extra whitespace
convert all text to lower case
remove punctuation
remove profanity
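The cleaning steps above can be sketched in Python as follows. This is a minimal illustration, not the code behind these slides: the function name `clean_text` and the tiny `PROFANITY` set are assumptions made for the example.

```python
import re
import string

# Hypothetical placeholder list; a real app would load a fuller profanity list.
PROFANITY = {"badword", "worseword"}

def clean_text(text, profanity=PROFANITY):
    """Apply the listed cleaning steps to one line of text."""
    text = text.lower()                       # convert to lower case
    text = re.sub(r"\d+", " ", text)          # remove numbers
    # remove punctuation (this also drops apostrophes: "don't" becomes "dont")
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() discards extra whitespace; then drop profanity words
    words = [w for w in text.split() if w not in profanity]
    return " ".join(words)
```

For example, `clean_text("I  have 2 Cats, and a badword dog!")` returns `"i have cats and a dog"`.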

3) Generate n-grams from our data
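As a rough sketch of this step, n-grams can be generated by sliding a window of n words over each cleaned line and counting the results. The helper names `ngrams` and `count_ngrams` are assumptions for illustration, and whitespace tokenization is assumed.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, n):
    """Count n-grams over an iterable of already-cleaned text lines."""
    counts = Counter()
    for line in lines:
        # n-grams never span two lines, so each line is processed separately
        counts.update(ngrams(line.split(), n))
    return counts
```

Lines shorter than n words simply contribute no n-grams.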

4) Implement the predictor model

Wordcloud of 1-gram from twitter data

[Figure: word cloud of 1-grams from Twitter data, stopwords included]

Stopwords included. Most frequent words:

 the  you  for  and
 368  250  158  157

Wordcloud of 1-gram from twitter data

[Figure: word cloud of 1-grams from Twitter data, stopwords removed]

Stopwords removed. Most frequent words:

just like  day will 
  59   51   48   46
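Frequency tables like the ones above can be produced with a simple word count, with or without a stopword filter. The sketch below is illustrative only: `top_words` is a hypothetical helper, and the small `STOPWORDS` set is an assumption, not the list used for the figures.

```python
from collections import Counter

# Tiny illustrative stopword set; not the list used for the figures.
STOPWORDS = {"the", "you", "for", "and", "in", "to", "a", "i"}

def top_words(lines, k=4, remove_stopwords=False):
    """Return the k most frequent words as (word, count) pairs."""
    counts = Counter(w for line in lines for w in line.split())
    if remove_stopwords:
        for w in STOPWORDS:
            counts.pop(w, None)   # drop the stopword if it was counted
    return counts.most_common(k)
```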

Histogram of 1-gram from twitter data

[Figure: histogram of 1-grams from Twitter data, stopwords kept]

Keeping stopwords

Histogram of 1-gram from twitter data

[Figure: histogram of 1-grams from Twitter data, stopwords removed]

Removing stopwords

Wordcloud of 2-gram from twitter data

[Figure: word cloud of 2-grams from Twitter data]

Most frequent 2-grams:

 in the for the 
     44      35

Reducing n-gram size (vocabulary)

For this sample:

- Total number of 1-grams: 3357

- Percentage of 1-grams with frequency 1 or 2: 82.84%

- Total number of 2-grams: 9179

- Percentage of 2-grams with frequency 1 or 2: 96.39%
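The percentages above, and the pruning they motivate, can be computed from an n-gram count table. This is a minimal sketch assuming the counts are held in a `collections.Counter`; the names `rare_share` and `prune` are hypothetical.

```python
from collections import Counter

def rare_share(counts, max_freq=2):
    """Percentage of distinct n-grams whose frequency is at most max_freq."""
    if not counts:
        return 0.0
    rare = sum(1 for c in counts.values() if c <= max_freq)
    return 100.0 * rare / len(counts)

def prune(counts, min_freq=3):
    """Keep only n-grams seen at least min_freq times (frequency > 2 by default)."""
    return Counter({g: c for g, c in counts.items() if c >= min_freq})
```

Because most n-grams occur only once or twice, pruning them shrinks the vocabulary considerably while keeping the n-grams that carry most of the counts.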

Algorithm

For our model we keep the 1- to 4-grams with frequency greater than 2.

Read the 1- to 4-grams with their corresponding frequency distributions
Look up the last three words of the incomplete sentence in the 4-grams. We report the last word of the matching 4-gram with the largest probability (the most frequent sentence)
If there is no match in the 4-grams, we look in the 3-grams (removing one word from the left of the incomplete sentence) and report the last word of the matching 3-gram with the largest probability
We repeat this process down to the 1-grams: if no match is found, we report the most frequent 1-gram
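The back-off procedure described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the app's actual code: `build_model` and `predict` are hypothetical names, input lines are assumed to be already cleaned, and the model is stored as a mapping from a context (the first n-1 words of an n-gram) to a count of next words.

```python
from collections import Counter, defaultdict

def build_model(lines, max_n=4, min_freq=3):
    """Map each context tuple (0 to max_n-1 words) to a Counter of next words."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    model = defaultdict(Counter)
    for gram, c in counts.items():
        if c >= min_freq:                     # keep only frequency > 2 by default
            model[gram[:-1]][gram[-1]] = c    # context -> next word -> count
    return model

def predict(model, sentence, k=5, max_n=4):
    """Return up to k candidate next words, backing off to shorter contexts."""
    tokens = sentence.lower().split()
    # start from the last max_n-1 words (three words for a 4-gram model)
    context = tuple(tokens[-(max_n - 1):]) if max_n > 1 else ()
    while True:
        if context in model:
            return [w for w, _ in model[context].most_common(k)]
        if not context:
            return []                         # empty model: nothing to suggest
        context = context[1:]                 # drop one word from the left
```

The empty context `()` holds the 1-gram counts, so when no longer context matches, the most frequent single words are returned. Ranking by raw count within a context gives the same order as ranking by conditional probability, since every candidate shares the same context count.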

The Application will provide:

The five most probable next words from public or personal data

Why so simple?

Probability of a sentence (chain rule):
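As a worked form of the chain rule, assume a four-word sentence with words written \( w_1, \dots, w_4 \) (this notation is an assumption introduced here for illustration):

\[ p(w_1\,w_2\,w_3\,w_4) = p(w_1)\,p(w_2|w_1)\,p(w_3|w_1\,w_2)\,p(w_4|w_1\,w_2\,w_3) \]

With a maximum-likelihood estimate from counts, the last term is \( p(w_4|w_1\,w_2\,w_3) = count(w_1\,w_2\,w_3\,w_4)/count(w_1\,w_2\,w_3) \).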

For sentences that are equal except for the last word, the first three terms are identical, so only the last term needs to be compared:

\[ p(car|I\,love\,the) = count(I\,love\,the\,car)/count(I\,love\,the) \]

\[ p(house|I\,love\,the) = count(I\,love\,the\,house)/count(I\,love\,the) \]

The denominators are equal, \( count(I\,love\,the) \), so we only need to compare the numerators: \( count(I\,love\,the\,car) \) and \( count(I\,love\,the\,house) \).

The next word prediction app

The user should provide:

The incomplete sentence whose next word is to be predicted
Whether a personal typing history will be provided (Twitter or Facebook)

The Application will provide:

The five most probable next words from public or personal data