Presentation

Yasneen Ashroff
Jan 5, 2017

Text Prediction Algorithm

The goal of this exercise was to build an app that takes an input phrase and predicts the next word.

  • The app reads in a corpus of 1,000,000 blog posts, news articles and tweets and builds 1-gram, 2-gram, 3-gram and 4-gram models: it counts the frequency of each individual word in the corpus, each 2-word phrase (bigram), each 3-word phrase and each 4-word phrase. These frequencies are stored in Document-Term Matrices and used to predict the next word after the phrase the user enters (see the counting sketch after this list).

  • The current model uses the Stupid Backoff algorithm with Good-Turing smoothing to score each candidate next word (see the scoring sketch after this list).
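
Below is a minimal sketch of the counting step in base R. The function name `count_ngrams` and the input `docs` (a character vector of cleaned text) are illustrative assumptions, not the app's actual code:

```r
# Count n-grams of order n across a character vector of cleaned documents.
count_ngrams <- function(docs, n) {
  tokens <- strsplit(tolower(docs), "\\s+")
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  freq <- table(grams)
  freq <- freq[freq >= 2]              # keep only n-grams seen 2+ times
  setNames(as.numeric(freq), names(freq))
}

# One named frequency vector per n-gram order (1-grams through 4-grams).
ngram_freq <- lapply(1:4, function(n) count_ngrams(docs, n))
```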
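
And a sketch of Stupid Backoff scoring over those tables: it uses the raw ratio for observed n-grams and otherwise backs off to a shorter context with a fixed discount (0.4 in the original paper). `ngram_freq` is the list built above; the Good-Turing smoothing step is omitted here for brevity:

```r
# Score `word` as a continuation of `context` (a character vector of
# preceding words) using Stupid Backoff.
stupid_backoff <- function(context, word, ngram_freq, alpha = 0.4) {
  n <- length(context) + 1
  if (n == 1) {
    cnt <- ngram_freq[[1]][word]
    if (is.na(cnt)) return(0)          # word never seen in the corpus
    return(unname(cnt) / sum(ngram_freq[[1]]))
  }
  num <- ngram_freq[[n]][paste(c(context, word), collapse = " ")]
  den <- ngram_freq[[n - 1]][paste(context, collapse = " ")]
  if (!is.na(num) && !is.na(den)) {
    unname(num / den)                  # observed n-gram: raw ratio
  } else {
    # unseen n-gram: back off to a shorter context, discounted by alpha
    alpha * stupid_backoff(context[-1], word, ngram_freq, alpha)
  }
}
```

For example, `stupid_backoff(c("new", "york"), "city", ngram_freq)` scores "city" as a continuation of "new york"; the app would rank every candidate word this way and suggest the top scorer.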

Limitations

  • Due to memory limitations on my PC, the current model was built from only 20% of the 1 million input entries and only keeps n-grams appearing 2 or more times. I also removed the profanity filter from the data cleansing step to reduce processing time.

Future Versions

The current accuracy rate is around 15%, which is low compared with SwiftKey's accuracy rate of roughly 35%. To move my algorithm towards the 35% mark, future versions will:

  • Use the Katz Backoff algorithm
  • Revisit the data cleansing steps
  • Consider part-of-speech tagging
  • Increase the input data size, perhaps by storing the n-gram models in a MySQL database

Shiny App

The Shiny app can be found here: https://yashroff.shinyapps.io/text_prediction/
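
For reference, a minimal sketch of how a scorer like the one above might be wired into a Shiny UI; `predict_next` is a hypothetical wrapper that ranks candidate words, not the deployed app's code:

```r
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    words <- strsplit(tolower(input$phrase), "\\s+")[[1]]
    context <- tail(words, 3)     # the 4-gram model uses at most 3 prior words
    predict_next(context)         # hypothetical wrapper around the scorer
  })
}

shinyApp(ui = ui, server = server)
```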