Next Word Prediction

Johann Posch
December 2014

Coursera Data Science Capstone Project

A web application that assists a user typing text by predicting the next word for a partial sentence.

a model is constructed to predict the next word for a phrase
a phrase is composed of a prefix followed by a word
formally, the model predicts word Y given prefix X where
- Y .. is the next word (e.g. 'time')
- X .. is zero or more words (e.g. 'at the first')
the probability of Y given X is calculated by:
- prob of Y given X = count of X followed by Y / count of X
N-grams of length 1..4 are used to build a Markov chain
R data tables are used for fast in-memory lookup
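
The counting and the probability estimate described above can be sketched as follows. This is a minimal illustration in Python, not the project's R data.table code; the function names `ngram_counts` and `conditional_prob` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def ngram_counts(sentences, max_n=4):
    """Count every n-gram of length 1..max_n, keyed by a tuple of words."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()  # assumed tokenization: split on whitespace
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def conditional_prob(counts, prefix, word):
    """Estimate P(word | prefix) = count(prefix followed by word) / count(prefix)."""
    prefix = tuple(prefix)
    if prefix:
        denom = counts[prefix]
    else:
        # Empty prefix: fall back to the total number of single words seen.
        denom = sum(c for gram, c in counts.items() if len(gram) == 1)
    return counts[prefix + (word,)] / denom if denom else 0.0

counts = ngram_counts([
    "at the first time",
    "at the first sight",
    "at the first time",
])
# 2 of the 3 occurrences of 'at the first' are followed by 'time'.
print(conditional_prob(counts, ["at", "the", "first"], "time"))
```

A dictionary keyed by word tuples plays the role that the in-memory lookup table plays in the description above.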

Training
- all sentences of a training set partition are used in a step
- the model is trained over training set partitions until a stop criterion is reached
- phrases with low probability are trimmed to keep size reasonable
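
A rough Python sketch of such a training loop is shown below. The stop criterion and the trimming threshold are not specified here, so both are assumptions: the loop stops once a partition adds fewer than `min_new` new phrases, and a phrase is trimmed when its estimated probability given its prefix falls below `min_prob`. The names `count_ngrams`, `trim` and `train` are hypothetical.

```python
from collections import Counter

def count_ngrams(sentences, max_n=4):
    """Count all n-grams of length 1..max_n in one partition."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def trim(counts, min_prob):
    """Keep single words; drop longer phrases with low P(word | prefix).

    Shorter phrases are visited first, so a phrase whose prefix was
    dropped is dropped as well.
    """
    kept = Counter()
    for gram in sorted(counts, key=len):
        if len(gram) == 1:
            kept[gram] = counts[gram]
        elif gram[:-1] in kept and counts[gram] / counts[gram[:-1]] >= min_prob:
            kept[gram] = counts[gram]
    return kept

def train(partitions, max_n=4, min_prob=0.05, min_new=1):
    """Train one partition per step until the assumed stop criterion is met."""
    model, steps = Counter(), 0
    for sentences in partitions:
        steps += 1
        size_before = len(model)
        model.update(count_ngrams(sentences, max_n))  # add this partition's counts
        model = trim(model, min_prob)                 # keep the model size reasonable
        if len(model) - size_before < min_new:        # assumed stop criterion
            break
    return model, steps

partitions = [
    ["at the first time", "at the first sight"],
    ["at the first time"],
    ["at the end"],
]
model, steps = train(partitions)
print(steps, len(model))
```

In this toy run the second partition adds no new phrases, so training stops before the third partition is read.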
Prediction
- for a given prefix (X),
  - the highest-order n-grams are examined first
  - the top N predictions (e.g. three) are used
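
The lookup order can be sketched in Python as follows. The function name `predict` and the toy counts are made up for illustration, and filling any remaining slots from shorter contexts is an assumption about how the top N list is completed.

```python
from collections import Counter

def predict(counts, prefix, top_n=3, max_n=4):
    """Return up to top_n next words, trying the longest usable context first."""
    words = prefix.lower().split()
    predictions = []
    # An n-gram of length max_n uses at most max_n - 1 words of context.
    for k in range(min(len(words), max_n - 1), -1, -1):
        context = tuple(words[len(words) - k:])
        # Linear scan for clarity; a real model would index phrases by context.
        candidates = Counter({
            gram[-1]: c for gram, c in counts.items()
            if len(gram) == k + 1 and gram[:k] == context
        })
        for word, _ in candidates.most_common():
            if word not in predictions:
                predictions.append(word)
            if len(predictions) == top_n:
                return predictions
    return predictions

# Toy counts, invented for the example.
counts = Counter({
    ("at", "the", "first", "time"): 2,
    ("at", "the", "first", "sight"): 1,
    ("the", "first", "step"): 3,
    ("first", "time"): 5,
    ("the",): 10,
})
print(predict(counts, "at the first"))
```

Here the 4-gram table supplies 'time' and 'sight', and the third slot is filled from the 3-gram context 'the first'.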

the web application:
- has a text box to enter phrase (partial sentence)
- shows the top N predicted words for the entered phrase
- shows a plot with top N predicted words
- shows help text to guide the user
future work:
- code was architected with parallel and distributed processing in mind (e.g. on a Spark cluster with SparkR)
- modify code to run on Spark (SparkR)
- explore customer-specific models (e.g. trained on the works of Shakespeare)
- explore higher-order N-grams and minimal semantic analysis

Application
Presentation
Acknowledgement

I sincerely thank the Data Science team at Johns Hopkins, especially Jeff Leek, Roger Peng and Brian Caffo, as well as the Coursera team, for this excellent specialization course series. For me, it has made a new career opportunity possible!