Data Science Capstone: Next Word Prediction App

Nicolais Guevara
April 26, 2015

Next Word Prediction?

What is next word prediction?

  • Prediction of the next word in an incomplete sentence

Why do we need it?

  • It helps users of small devices type text faster and make fewer spelling mistakes.

How?

  • A next word prediction app uses a probabilistic method to suggest the most probable next words in a sentence.

Steps to build model:

1) Load the English portion of the public HC Corpora dataset

2) Clean the data

  • remove numbers
  • remove extra whitespace
  • convert all text to lower case
  • remove punctuation
  • remove profanity

3) Generate n-grams from the data

4) Implement the back-off language model predictor
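
Steps 2 and 3 can be sketched in Python as follows. This is a minimal illustration, not the app's actual code: the function names clean_text and count_ngrams, the placeholder PROFANITY set, and the ASCII-only handling of letters are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder; a real app would load a full profanity list.
PROFANITY = {"darn", "heck"}

def clean_text(text, profanity=PROFANITY):
    """Lower-case the text, drop numbers, punctuation and profanity,
    and collapse whitespace. Returns a list of word tokens."""
    text = text.lower()                      # convert to lower case
    text = re.sub(r"\d+", " ", text)         # remove numbers
    # Replace every character that is not a-z or whitespace with a space.
    # Assumption: ASCII letters only, so "don't" becomes "don t".
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()                    # split() also collapses whitespace
    return [t for t in tokens if t not in profanity]

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens as a tuple."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = clean_text("Hello,   World! 42 darn times")
bigrams = count_ngrams(tokens, 2)
```

Calling count_ngrams once for each n from 1 to 4 gives the four frequency tables that the predictor reads.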

Back-off Algorithm:

  • Read the 1- to 4-grams with their corresponding frequency distributions
  • Look up the last three words of the incomplete sentence in the 4-grams and report the last word of the matching 4-gram with the highest probability (the most frequent one)
  • If there is no match in the 4-grams, drop the leftmost word of the incomplete sentence, look up the remaining words in the 3-grams, and report the last word of the matching 3-gram with the highest probability
  • Repeat this process for shorter n-grams; if no match is found at any level, report the most frequent 1-gram
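
The lookup described above can be sketched in Python. This is one possible reading of the steps under stated assumptions, not the app's implementation: build_tables and predict_next are hypothetical names, raw counts stand in for probabilities, the unigram table is assumed to be non-empty, and the linear scan over each table is kept for clarity rather than speed.

```python
from collections import Counter

def build_tables(tokens, max_n=4):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1)
    }

def predict_next(sentence, tables, max_n=4):
    """Back off from the longest n-gram to shorter ones."""
    words = sentence.lower().split()
    for n in range(max_n, 1, -1):
        context = tuple(words[-(n - 1):])    # last n-1 words of the input
        if len(context) < n - 1:
            continue                         # input too short for this order
        # Keep the n-grams whose first n-1 words equal the context.
        matches = {g: c for g, c in tables[n].items() if g[:-1] == context}
        if matches:
            # Report the last word of the most frequent matching n-gram.
            return max(matches, key=matches.get)[-1]
    # No match at any order: fall back to the most frequent single word.
    return max(tables[1], key=tables[1].get)[0]

corpus = "i want to go home i want to go out i want to go home".split()
tables = build_tables(corpus)
guess = predict_next("we want to go", tables)
```

With this toy corpus, "want to go" is followed by "home" twice and "out" once, so the 4-gram table already decides the answer. An input such as "please go" matches nothing in the 4-gram or 3-gram tables and is resolved by the 2-grams.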

For our model we use only the 1- to 4-grams with frequency greater than 2.
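
A frequency cut-off like this can be applied to each n-gram table before prediction. The sketch below assumes the tables are Counter objects keyed by word tuples; the name prune and its min_count parameter are hypothetical.

```python
from collections import Counter

def prune(counts, min_count=2):
    """Keep only n-grams whose frequency is strictly greater than min_count."""
    return Counter({g: c for g, c in counts.items() if c > min_count})

counts = Counter({("a", "b"): 3, ("b", "c"): 2, ("c", "d"): 1})
kept = prune(counts)
```

Dropping rare n-grams keeps the tables small, at the cost of backing off to shorter n-grams more often.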

The Next Word Prediction App

The user provides:

  • The incomplete sentence for which the next word should be predicted

The application returns:

  • The most probable next word

Test our app: Next Word Prediction App