PredictWord

Michael Lee
January 17, 2016

Coursera / Johns Hopkins University Data Science Specialization Capstone Project

Building A Predictive Text Algorithm

Our goal was to create an algorithm that predicts the next word given one or more words as input. A large corpus of more than 4 million documents was loaded and analyzed, N-grams were extracted from it, and those N-grams were then used to build the predictive model. Several methods for improving prediction accuracy and speed were explored.

Designing And Improving The Algorithm

  • An N-gram model with a back-off strategy was used
  • The dataset was cleaned: text was lower-cased, and links, Twitter handles, punctuation, numbers, extra whitespace, etc. were removed (see the sketch after this list)
  • Frequency matrices from 6-grams down to 1-grams were extracted
  • N-grams were sorted by frequency of occurrence
  • Model size was reduced by dropping the least frequent N-grams
  • Speed and memory use were further optimized by also dropping the least frequent 2-grams and 1-grams, since very large 2-gram and 1-gram matrices did not appear to improve accuracy
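
A minimal sketch of the cleaning and extraction steps in base R, assuming the corpus is held as a character vector; the function names (`clean_text`, `extract_ngrams`), the regular expressions, and the pruning threshold are illustrative, not the capstone's exact code.

```r
# Illustrative sketch in base R; the actual capstone code may differ.

clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("http[^[:space:]]+", " ", x)  # remove links
  x <- gsub("@[[:alnum:]_]+", " ", x)     # remove Twitter handles
  x <- gsub("[[:punct:]]+", " ", x)       # remove punctuation
  x <- gsub("[[:digit:]]+", " ", x)       # remove numbers
  x <- gsub("[[:space:]]+", " ", x)       # collapse extra whitespace
  trimws(x)
}

# Count the N-grams of a given order across a vector of cleaned documents,
# sort them by frequency, and prune those at or below min_freq.
extract_ngrams <- function(docs, n, min_freq = 2) {
  grams <- unlist(lapply(strsplit(docs, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  freq <- sort(table(grams), decreasing = TRUE)
  freq[freq > min_freq]
}

docs <- clean_text(c("Check out http://example.com @user Hello world 123!",
                     "hello world once again"))
bigrams <- extract_ngrams(docs, 2, min_freq = 0)  # toy corpus, keep everything
```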

Predictive Algorithm Shiny App

  • Provides a text input box for the user to type a phrase
  • Detects the words typed and predicts the next word reactively
  • Iterates from the longest N-gram (6-gram) down to the shortest (2-gram), as sketched after this list
  • Uses the last word of the matching N-gram as the predicted word
  • Predicts using the longest, most frequent matching N-gram
  • If no match is found using 6-, 5-, 4-, 3-, or 2-grams, randomly selects one of the most frequent words (1-grams)
  • Lets the user configure how many words the app should suggest
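
A minimal sketch of the back-off lookup, reusing the `clean_text` helper from the earlier sketch and assuming the frequency tables built above are stored in a list `ngram_tables` indexed by N-gram order and sorted by decreasing frequency; the prefix-matching approach and the size of the 1-gram fallback pool are assumptions for illustration.

```r
# Illustrative back-off lookup; the table layout is an assumption.
# ngram_tables[[n]] holds the frequency-sorted table of n-grams built above.

predict_word <- function(phrase, ngram_tables, n_suggest = 3) {
  words <- strsplit(clean_text(phrase), " ", fixed = TRUE)[[1]]
  # Try the longest context first: a 6-gram match needs 5 context words.
  for (n in 6:2) {
    ctx_len <- n - 1
    if (length(words) < ctx_len) next
    ctx <- paste(tail(words, ctx_len), collapse = " ")
    hits <- ngram_tables[[n]][startsWith(names(ngram_tables[[n]]),
                                         paste0(ctx, " "))]
    if (length(hits) > 0) {
      # The predicted word is the last word of each matching N-gram,
      # already in most-frequent-first order.
      cands <- sub(".* ", "", names(hits))
      return(head(unique(cands), n_suggest))
    }
  }
  # No match at any order: sample from the most frequent words (1-grams).
  sample(names(head(ngram_tables[[1]], 50)), n_suggest)
}
```

Scanning every table entry with `startsWith` keeps the sketch short; to reach the response times reported below, the deployed app would more likely pre-index N-grams by their context words.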

Application Performance

  • Final model size:
    • 82,672 6-grams with frequency > 2
    • 184,171 5-grams with frequency > 2
    • 474,914 4-grams with frequency > 2
    • 440,851 3-grams with frequency > 3
    • About 200,000 2-grams and about 20,000 words (1-grams)
  • Average response time under 2 seconds
  • Application memory usage under 250 MB
  • Check out the Shiny App here