Next Word Prediction

Nicolais Guevara
April 23 2015

Next Word Prediction?

What is next word prediction?

  • Prediction of the next word in an incomplete sentence

Why do we need a prediction?

  • It helps users of small devices type text faster and reduces misspellings.

How?

  • A next word prediction app uses a probabilistic method to suggest the most probable next words in a sentence, which improves the typing experience on small devices.

Data Sources

Public dataset

  • http://www.corpora.heliohost.org/download.html contains material published on different web pages from 2005 (2,462,000,000 words).
  • Only the English dataset is used, covering three different sources (blogs, newspapers and Twitter)
  • Personal dataset from Twitter, Facebook or personal blogs

Steps:

1) Load the data

  • R, Python or Java

2) Cleaning the data

  • remove numbers
  • remove extra whitespace
  • convert all text to lower case
  • remove punctuation
  • remove profanity

3) Generate n-grams from our data

4) Implementation of the predictor model
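Steps 2 and 3 can be sketched in Python as follows. This is a minimal illustration, not the code behind these slides: the function names and the tiny `PROFANITY` set are hypothetical placeholders, and only ASCII punctuation is stripped.

```python
import re
import string
from collections import Counter

# Hypothetical placeholder; a real profanity list would be loaded from a file.
PROFANITY = {"darn", "heck"}

def clean_text(text, profanity=PROFANITY):
    """Apply the cleaning steps and return a list of tokens."""
    text = text.lower()                       # convert to lower case
    text = re.sub(r"\d+", " ", text)          # remove numbers
    # remove (ASCII) punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                     # split() also drops extra whitespace
    return [t for t in tokens if t not in profanity]  # remove profanity

def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, n):
    """Count n-grams over an iterable of raw text lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(clean_text(line), n))
    return counts
```

For example, `count_ngrams(["in the car", "In the house!"], 2)` counts the 2-gram `("in", "the")` twice.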

Wordcloud of 1-gram from twitter data

[Figure: word cloud of 1-grams, stopwords included]

Stopwords included

the you for and 
368 250 158 157 

Wordcloud of 1-gram from twitter data

[Figure: word cloud of 1-grams, stopwords removed]

Stopwords removed. Most frequent words:

just like  day will 
  59   51   48   46 

Histogram of 1-gram from twitter data

[Figure: histogram of 1-gram frequencies]

Keeping stopwords

Histogram of 1-gram from twitter data

[Figure: histogram of 1-gram frequencies]

Removing stopwords

Wordcloud of 2-gram from twitter data

[Figure: word cloud of 2-grams]

Most frequent 2-grams:

  • in the: 44
  • for the: 35

Reducing n-gram size (vocabulary)

For this sample:

  • Total number of 1-grams: 3357
  • Percent of 1-grams with frequency 1 or 2: 82.84%
  • Total number of 2-grams: 9179
  • Percent of 2-grams with frequency 1 or 2: 96.39%
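The figures above suggest that most n-grams are rare, so dropping them shrinks the vocabulary considerably. A minimal sketch of that measurement and pruning step, assuming the counts are held in a `collections.Counter` (the function names `rare_share` and `prune` are hypothetical):

```python
from collections import Counter

def rare_share(counts, max_freq=2):
    """Percent of n-grams whose frequency is at most max_freq."""
    if not counts:
        return 0.0
    rare = sum(1 for c in counts.values() if c <= max_freq)
    return 100.0 * rare / len(counts)

def prune(counts, max_freq=2):
    """Keep only the n-grams seen more than max_freq times."""
    return Counter({g: c for g, c in counts.items() if c > max_freq})
```

With `max_freq=2`, `prune` keeps exactly the n-grams with frequency greater than 2.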

Algorithm

For our model we keep the 1- to 4-grams with frequency greater than 2.

  • Read the 1- to 4-grams with their corresponding distributions
  • Look up the last three words of the incomplete sentence in the 4-grams and report the last word of the matching 4-gram with the largest probability (the most frequent one)
  • If there is no match in the 4-grams, remove the leftmost word from the incomplete sentence, look it up in the 3-grams, and report the last word of the matching 3-gram with the largest probability
  • Repeat this process down to the 1-grams; if no match is found, report the most frequent 1-gram

The Application will provide:

  • The five most probable next words from public or personal data
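The lookup described on this slide can be sketched as below. This is an illustration under assumptions, not the app's actual code: `build_model` and `predict` are hypothetical names, the n-gram counts are assumed to be `Counter` objects keyed by word tuples, and the sketch returns candidates only from the longest matching context (it may return fewer than five words).

```python
from collections import Counter, defaultdict

def build_model(ngram_counts):
    """Index n-gram counts by context.

    ngram_counts maps n -> Counter of n-gram tuples (n = 1..4).
    Returns a dict: context tuple -> Counter of next words.
    The empty context () holds the 1-gram counts.
    """
    model = defaultdict(Counter)
    for counts in ngram_counts.values():
        for gram, c in counts.items():
            model[gram[:-1]][gram[-1]] += c
    return model

def predict(model, tokens, k=5, max_order=4):
    """Return up to k next-word candidates, backing off to shorter contexts."""
    # Use the last (max_order - 1) words: three words for 4-grams.
    context = tuple(tokens[-(max_order - 1):]) if max_order > 1 else ()
    while True:
        candidates = model.get(context)
        if candidates:
            return [w for w, _ in candidates.most_common(k)]
        if not context:
            return []            # nothing found, even among the 1-grams
        context = context[1:]    # drop the leftmost word and retry
```

Because every candidate for a given context shares the same denominator, ranking by raw count gives the same order as ranking by conditional probability.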

Why so simple?

Probability of a sentence (chain rule):

\[ p_1(I\,love\,the\,car) = p(I) p(love|I) p(the|I\,love) p(car|I\,love\,the) \\ p_2(I\,love\,the\,house) = p(I) p(love|I) p(the|I\,love) p(house|I\,love\,the) \]

The two sentences are identical except for the last word, so the first three terms are equal and only the last term needs to be compared:

\[ p(car|I\,love\,the) = count(I\,love\,the\,car)/count(I\,love\,the) \] \[ p(house|I\,love\,the) = count(I\,love\,the\,house)/count(I\,love\,the) \]

Denominators are equal, so we only need to compare the numerators: \( count(I\,love\,the\,car) \) and \( count(I\,love\,the\,house) \).
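
The same argument holds for any context. Writing \( h \) for the preceding words and \( w \) for a candidate next word (both symbols are introduced here for illustration), the predicted word is

\[ \hat{w} = \arg\max_{w} \, p(w|h) = \arg\max_{w} \, count(h\,w)/count(h) = \arg\max_{w} \, count(h\,w) \]

since \( count(h) \) does not depend on \( w \).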

The next word prediction app

User should provide:

  • The incomplete sentence to make the prediction of the next word
  • Whether a personal typing history (Twitter or Facebook) will be provided

The Application will provide:

  • The five most probable next words from public or personal data