Next Word Prediction

Final Report

 

A model and website for Natural Language Prediction.
Created for the Data Science Capstone project
run by Coursera and John Hopkins University.

Chris Lill
21 January 2016

Objective

To use a collection of 4 million tweets, blogs and news articles to predict the next word that a user will type.

Screenshot of application which has the input of "What a beautiful" and is predicting "day"

http://chrislill.shinyapps.io/next-word

Approach

The text was simplified and a count was made for each instance of 2, 3 or 4 consecutive words. The probability was calculated for the top 5 answers for each phrase.

Model word-3 word-2 word-1 answer probability
Bigram beautiful day 9%
Trigram a beautiful day 22%
Quadgram what a beautiful day 44%
Interpolated what a beautiful day 34%

Optimization of the model to fit the validation set gave:

\[ P_{interpolated} = 0.8 P_{quadgram} + 0.16 P_{trigram} + 0.04P_{bigram} \]

Features

  • Suggestions are returned whilst the user types.
  • Click a suggestion to add it to the text.
  • The model has an accuracy of 18%.
  • Five answers are displayed with an total accuracy of 34%.
  • A probability table shows answers for each model. Probability table with five answers for "What a beautiful"

Improvements

Ideas for improving the accuracy of the model include:

  • Store more than the first five answers for each n-gram.
  • Introduce a Beginning of Sentence <BOS> tag into the model to improve the prediction of the first three words.
  • Use more data from a source such as Google ngrams.
  • Incorporate Part Of Speech tagging into each model. This is the approach that I'm most excited about trying.