Word Prediction

Capstone Project
Data Science Specialization
Nidhi Mavani
16th August 2015

Motivation

The goal of the capstone project of the Data Science Specialization was to build a Shiny app that predicts the next word in a sentence. This is a problem that falls under natural language processing.

Data and Pre-processing

The corpus used for this app was taken from news, blog, and Twitter sources (~550 MB).
The data was pre-processed to remove punctuation, numbers, extra whitespace, and profane words so that none of them are ever predicted. From the processed data, n-grams of length 1, 2, and 3 were built using the KfNgram software. A rough sketch of the cleaning step appears after the table below.

Source    Total (MB)    Training (MB)
Blogs     200           119
News      196           116
Twitter   160            92
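
As a rough illustration of the cleaning step described above, the sketch below removes numbers, punctuation, profane words, and extra whitespace using base R. The function name clean_text and the profanity_words list are placeholders; the actual profanity list and the KfNgram n-gram counting step are not shown here.

    # Minimal sketch of the pre-processing step, assuming the raw text is
    # already loaded into a character vector `lines` and a profanity word
    # list `profanity_words` is available (both names are placeholders).
    clean_text <- function(lines, profanity_words) {
      txt <- tolower(lines)
      txt <- gsub("[0-9]+", " ", txt)            # remove numbers
      txt <- gsub("[[:punct:]]+", " ", txt)      # remove punctuation
      pattern <- paste0("\\b(", paste(profanity_words, collapse = "|"), ")\\b")
      txt <- gsub(pattern, " ", txt)             # remove profane words
      txt <- gsub("\\s+", " ", txt)              # collapse whitespace
      trimws(txt)
    }

    # Example:
    clean_text(c("Call me at 555-0123!!", "So   many    spaces"), c("badword"))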

The total size of the RData file used by the app is about 100 MB; all the n-grams are stored in data.table objects. The table below shows the size of each n-gram table before and after pruning, and a sketch of the storage and pruning step follows it.

N-gram     Total size (MB)    Final size (MB)
Unigram    200                 2
Bigram     203                 4
Trigram    740                58
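
The storage and pruning step can be sketched as follows. The input tri_raw, its column names (w1, w2, w3, count), the count threshold, and the top-5 cutoff are illustrative assumptions, not the exact values used in the app.

    library(data.table)

    # Minimal sketch: store trigram counts in a keyed data.table and prune it.
    # `tri_raw` is assumed to be a data frame with columns w1, w2, w3, count.
    tri <- as.data.table(tri_raw)
    tri <- tri[count > 1]                                     # drop rare trigrams
    tri <- tri[order(-count), head(.SD, 5), by = .(w1, w2)]   # keep top 5 per history
    setkey(tri, w1, w2)                                       # fast lookup on the bigram history

    save(tri, file = "ngrams.RData")                          # loaded later by the Shiny app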

Prediction Algorithm

The model combines the trigram, bigram, and unigram tables using the Stupid Backoff algorithm. It takes about 20 ms to return the top 5 most likely next words.
The score of a candidate next word is calculated as follows: if the trigram formed by the last two words of the input and the candidate has been seen, its score is count(w1 w2 w3) / count(w1 w2); otherwise the model backs off to the bigram, scoring 0.4 * count(w2 w3) / count(w2), and finally to 0.4^2 times the unigram relative frequency of the candidate.
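
A minimal R sketch of this lookup with keyed data.tables is shown below. The table layouts, column names, the back-off factor of 0.4, and the way candidates from the different orders are merged are assumptions for illustration rather than the app's actual code; the denominators use the sum of retained continuation counts, which approximates the history count on a pruned table.

    library(data.table)

    # Stupid Backoff sketch: score candidate next words given the last two
    # words of the input. `tri` is keyed on (w1, w2), `bi` on (w1); columns
    # are (w1, w2, w3, count), (w1, w2, count), and (w1, count) respectively.
    predict_next <- function(prev2, prev1, uni, bi, tri, alpha = 0.4, top = 5) {
      cand <- list()

      hits3 <- tri[.(prev2, prev1), nomatch = 0L]          # trigram matches
      if (nrow(hits3) > 0)
        cand$tri <- hits3[, .(word = w3, score = count / sum(count))]

      hits2 <- bi[.(prev1), nomatch = 0L]                  # bigram back-off
      if (nrow(hits2) > 0)
        cand$bi <- hits2[, .(word = w2, score = alpha * count / sum(count))]

      cand$uni <- uni[, .(word = w1, score = alpha^2 * count / sum(count))]

      out <- rbindlist(cand)
      out <- out[, .(score = max(score)), by = word]       # best score per candidate
      head(out[order(-score)], top)                        # top 5 suggestions
    }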

Challenges

  • The tm package was too slow, so a third-party tool (KfNgram) was used instead to generate the n-grams, which took only a few minutes

Future Possibilities

  • Smoothing techniques could be implemented for better model accuracy
    1. Kneser-Ney
    2. Good-Turing
  • Higher-order n-gram models (4-, 5-, 6-grams) could be built
  • Further, more generic data from various sources could be used (e.g., Google N-grams)
  • Punctuation such as apostrophes should be included in the predictions

Links and references