Coursera Capstone: Next Word Prediction Application

16/08/2015

Introduction

The goal of the Coursera Data Science Capstone project is to create a Shiny application for next word prediction.

The project was based on a corpus from [www.corpora.heliohost.org] comprising blog posts, news articles and tweets.

From this dataset we were to create an algorithm to predict the next word of a given sentence or phrase.

Algorithm Development

The model was built using an N-gram language model approach. The Markov assumption was used to limit the context: the model looks at no more than the last three words of the sentence, so the largest N-gram it uses is a quadgram (three context words plus the predicted word).

A simplified back-off approach was used. First the model checks for a full match of the last three words and selects the most frequently occurring quadgram starting with those three words. If no quadgram contains the phrase, the algorithm backs off to tri-grams using the last two words; if there are still no matches, it backs off to bi-grams using the last word; and if there are still no matches, it reverts to the most common uni-gram (“the”).
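The back-off lookup described above can be sketched in a few lines of Python. This is a minimal illustration and not the code behind the application. The function names `build_tables` and `predict_next`, the whitespace tokenisation and the toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_tables(tokens, max_n=4):
    """Count next words for every context of 0 to max_n - 1 preceding words.

    The empty context () holds the uni-gram counts.
    """
    tables = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, next_word = tokens[i:i + n]
            tables[tuple(context)][next_word] += 1
    return tables


def predict_next(phrase, tables, max_context=3):
    """Return the most frequent continuation, backing off to shorter contexts."""
    words = phrase.lower().split()  # assumption: simple whitespace tokenisation
    # Try the last 3 words, then the last 2, then the last 1, then no context.
    for k in range(min(max_context, len(words)), -1, -1):
        context = tuple(words[len(words) - k:])
        candidates = tables.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached if the tables are empty


# Toy corpus (hypothetical) standing in for the blog, news and tweet data.
tokens = "the cat sat on the mat".split() * 2 + ["the", "dog"]
tables = build_tables(tokens)

full_match = predict_next("cat sat on", tables)   # quadgram match
backed_off = predict_next("zzz sat on", tables)   # falls back to the tri-gram table
fallback = predict_next("xyz", tables)            # falls back to the uni-gram table
```

Keeping every context length in one dictionary keyed by tuples keeps the back-off loop short. A real implementation would probably prune rare N-grams to keep the tables small.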

The Shiny Application

The application is simple to operate. It has a single input in the side panel where the user enters their phrase. The main panel contains two outputs: the first echoes back the user's input, and the second shows their phrase with the predicted word appended. The app can be found at [https://sleol.shinyapps.io/NextWordApp].

Future work/Improvements

Due to time constraints, the algorithm developed was kept simple. The following are ideas that I would like to explore in the future to improve the application:

Smoothing functions (Kneser-Ney smoothing, Katz back off, Jelinek-Mercer smoothing etc.)
Codifying text as numbers for improved lookup speed and reduced size requirements
Part of speech tagging
Sentence boundary approaches
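As an illustration of the "codifying text as numbers" idea in the list above, the sketch below maps each word to an integer ID and packs an N-gram into a single integer key. It is one possible approach and has not been implemented in the application. The helper names `build_vocab`, `pack` and `unpack` are hypothetical.

```python
def build_vocab(tokens):
    """Assign each distinct word an integer ID in order of first appearance."""
    vocab = {}
    for word in tokens:
        vocab.setdefault(word, len(vocab))
    return vocab


def pack(words, vocab):
    """Encode a sequence of words as one integer, using base len(vocab)."""
    base = len(vocab)
    key = 0
    for word in words:
        key = key * base + vocab[word]
    return key


def unpack(key, n, vocab):
    """Decode an integer produced by pack() back into its n words."""
    base = len(vocab)
    id_to_word = {i: w for w, i in vocab.items()}
    ids = []
    for _ in range(n):
        key, remainder = divmod(key, base)
        ids.append(remainder)
    return [id_to_word[i] for i in reversed(ids)]


vocab = build_vocab("the cat sat on the mat".split())
key = pack(["cat", "sat"], vocab)
words = unpack(key, 2, vocab)
```

N-grams of different lengths can pack to the same integer, so a separate table per N-gram length would be needed.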