Data Science Capstone Project

A. Stefan
April 24, 2016

The purpose of this project is to develop a prediction algorithm based on a data set from a corpus called HC Corpora (www.corpora.heliohost.org).
The model must (1) accept a phrase as its input and (2) return a prediction for the most likely next word.
This application makes use of principles of Natural Language Processing and Text Mining.

The model is built using 1% of the original data set; the available RAM limited the size of the sample.
N-grams were constructed with n = 1, 2, 3. The small sample size did not justify constructing n-grams of higher order.
Each entry (word combination) in the n-gram tables is assigned a probability, where \( w_i \) is the i-th word:
Trigram: \( P(w_{i}|w_{i-2}w_{i-1}) = \frac{count(w_{i-2}w_{i-1}w_i)}{count(w_{i-2}w_{i-1})} \)
Bigram: \( P(w_{i}|w_{i-1}) = \frac{count(w_{i-1}w_i)}{count(w_{i-1})} \)
Unigram: \( P(w_{i}) = \frac{count(w_i)}{corpus\ size} \)
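
The three formulas above are relative-frequency estimates, and a minimal Python sketch can make them concrete. This is not the application's own code: the function names `ngram_counts` and `mle_probability` and the toy corpus are assumptions made for illustration, and the text is assumed to be already lower-cased and tokenized.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_probability(ngram, counts, lower_counts, corpus_size):
    """Relative frequency: count(n-gram) / count(its (n-1)-word prefix).

    For a unigram the denominator is the corpus size instead.
    """
    denominator = corpus_size if len(ngram) == 1 else lower_counts[ngram[:-1]]
    return counts[ngram] / denominator if denominator else 0.0

# Toy corpus standing in for the 1% sample.
tokens = "the cat sat on the mat and the cat ran".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

p_uni = mle_probability(("the",), unigrams, None, len(tokens))         # 3 / 10
p_bi = mle_probability(("the", "cat"), bigrams, unigrams, len(tokens))  # 2 / 3
p_tri = mle_probability(("the", "cat", "sat"), trigrams, bigrams, len(tokens))  # 1 / 2
```

An n-gram that never occurs in the sample gets probability 0 here, which is why a fallback to lower-order tables is needed at prediction time.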

Given a phrase as input, the model selects the last two words and first looks for matches in the trigram table.
If only one word is given, the bigram table is used.
If this first step returns no result (NA), a simple stupid backoff approach is applied: when the trigram search fails, the last word of the input phrase is looked up in the bigram table; when that search also fails, the unigram probability is used.
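
The fallback order described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the application's implementation: `build_tables` and `predict_next` are hypothetical names, the corpus is a toy example, and no score discounting is applied between orders because the text specifies only the order in which the tables are searched.

```python
from collections import Counter, defaultdict

def build_tables(tokens):
    """Map each context (a tuple of 0, 1 or 2 words) to a Counter of next words."""
    tables = {n: defaultdict(Counter) for n in (1, 2, 3)}
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[n][tuple(context)][word] += 1
    return tables

def predict_next(phrase, tables):
    """Search the trigram table, then the bigram table, then the unigram table."""
    words = phrase.lower().split()
    for n in (3, 2, 1):
        if len(words) < n - 1:
            continue  # not enough input words for this order
        context = tuple(words[len(words) - (n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            # Highest count equals highest probability within one context.
            return candidates.most_common(1)[0][0]
    return None  # only reached when the tables are empty

tokens = "i want to eat now i want to eat later i want to sleep to dream".split()
tables = build_tables(tokens)

print(predict_next("I want to", tables))  # trigram match on ("want", "to")
print(predict_next("need to", tables))    # falls back to the bigram table
print(predict_next("hello", tables))      # falls back to the unigram table
```

Within a single context the denominator of the probability is the same for every candidate, so ranking by raw count gives the same answer as ranking by probability.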

[Figure: screenshot of the application showing an example prediction]

The user types a phrase in the Input box at the left and the predicted word is shown in the Output box.
An example of a prediction is shown in the figure.

The model is expected to have fairly low accuracy for less common words because of the small sample used for training.
The application can be accessed at: https://ais1209.shinyapps.io/SwiftkeyProject/
The code used for this application is available on GitHub at: https://github.com/ais1209/CapstoneProject