Okoilu Ruth Oluwadamilola
July 16, 2016
The goal of this application is to showcase the prediction algorithm I have built and to provide an interface you can access. The application is built on the Shiny platform. You simply type a phrase into the textbox provided, and the application returns a prediction of the next word. The application can be found here: https://damilolah.shinyapps.io/Next_word_shiny/
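A minimal sketch of such a Shiny interface; the widget IDs and the predict_next_word() helper are illustrative, not the app's actual code:

```r
# Bare-bones Shiny app: a text box in, a predicted word out.
# IDs ("phrase", "prediction") and predict_next_word() are hypothetical.
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    predict_next_word(input$phrase)  # hypothetical: returns the predicted next word
  })
}

shinyApp(ui = ui, server = server)
```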
The data was cleaned to make it ready for analysis: I converted all text to lower case and removed punctuation, numbers, stop words, and unnecessary white space.
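As a rough sketch, these cleaning steps can be carried out with the tm package; the corpus object and raw_text vector are illustrative names, not the project's actual code:

```r
# Minimal sketch of the cleaning pipeline using the tm package;
# raw_text is assumed to be a character vector of the raw documents.
library(tm)

corpus <- VCorpus(VectorSource(raw_text))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                      # strip digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra spaces
```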
I created n-grams of size 3 (trigrams), size 4 (four-grams), and size 5 (five-grams).
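One simple way to build these n-grams in base R, assuming clean_text holds the cleaned text from the previous step:

```r
# Build n-grams of a given size from a vector of words (base R sketch);
# clean_text is an assumed name for the cleaned text from the step above.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words     <- unlist(strsplit(clean_text, "\\s+"))
trigrams  <- make_ngrams(words, 3)
fourgrams <- make_ngrams(words, 4)
fivegrams <- make_ngrams(words, 5)
```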
The application treats the input as a query and filters the n-grams by the query entered, returning all results that contain the query.
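A minimal sketch of that lookup, here matching n-grams that begin with the query; the function and argument names are illustrative, not the app's actual code:

```r
# Filter stored n-grams by the user's query and rank candidate next words.
predict_next <- function(query, ngrams) {
  query <- tolower(trimws(query))
  hits  <- ngrams[startsWith(ngrams, paste0(query, " "))]  # n-grams beginning with the query
  candidates <- sub(".*\\s", "", hits)                     # last word of each match
  sort(table(candidates), decreasing = TRUE)               # most frequent candidates first
}

predict_next("reading", trigrams)
```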
This algorithm makes use of both higher-order (higher-n) and lower-order (lower-n) language models, reallocating some probability mass from 4-grams to simpler n-grams. For example, consider the phrase: I can't see without my reading _____.
A fluent English speaker reading this sentence knows that the word glasses should fill in the blank. But since San Francisco is a common term, absolute-discounting interpolation might declare that Francisco is a better fit: $P_{abs}(\text{Francisco}) > P_{abs}(\text{glasses})$.
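For reference, this is the standard absolute-discounting interpolation of a bigram model with a unigram model, with discount $\delta$ and interpolation weight $\alpha(w_{i-1})$:

$$
P_{abs}(w_i \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1}, w_i) - \delta,\ 0\bigr)}{\sum_{w'} c(w_{i-1}, w')} + \alpha(w_{i-1})\, P_{abs}(w_i)
$$

The lower-order term here is the raw unigram probability, which is exactly what lets a frequent word like Francisco win regardless of context.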
Kneser-Ney fixes this problem by asking a slightly harder question of our lower-order model. Whereas the unigram model simply tells us how likely a word $w_i$ is to appear, Kneser-Ney's second term measures how likely $w_i$ is to appear in an unfamiliar bigram context.
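Concretely, Kneser-Ney replaces the raw unigram probability with a continuation probability: the number of distinct contexts a word completes, normalized by the total number of distinct bigram types:

$$
P_{\text{continuation}}(w_i) = \frac{\bigl|\{\, w_{i-1} : c(w_{i-1}, w_i) > 0 \,\}\bigr|}{\bigl|\{\, (w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0 \,\}\bigr|}
$$

Because Francisco almost always follows San, it completes very few distinct contexts, so its continuation probability is low and glasses correctly wins.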