Data Science Coursera Capstone Project - Word Prediction

Shalini Ruppa Subramanian
24 January 2016

Overview

After completing the nine courses of the Data Science Specialisation track on Coursera, a capstone project was assigned: build a word prediction Shiny application that predicts the next word from text entered by the user. The corpus comes from HC Corpora and consists of ~500 MB of blog, news and Twitter data.

Screenshot of the word prediction application

Data Cleaning

  • Data sampling was performed to reduce the size of the data from ~500 MB to ~6 MB.
  • The data was cleaned with Quanteda: removing whitespace, punctuation, numbers, English stopwords and profanity, and converting to lowercase.
  • Unigrams, bigrams, trigrams and tetragrams were then generated, together with the frequency of each n-gram. A minimum-frequency criterion was used to further filter the data (see the sketch after this list).
  • User input is also filtered for punctuation, numbers, whitespace and profanity.
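A minimal sketch of this pipeline with quanteda is shown below. The file paths, the sampling rate and the profanity word list are illustrative assumptions, not the project's exact values.

    library(quanteda)

    set.seed(1234)
    raw     <- readLines("data/en_US.blogs.txt", skipNul = TRUE)
    sampled <- raw[rbinom(length(raw), 1, prob = 0.01) == 1]   # keep ~1% of lines

    profanity <- readLines("data/profanity.txt")               # assumed word list

    toks <- tokens(sampled,
                   remove_punct   = TRUE,
                   remove_numbers = TRUE,
                   remove_symbols = TRUE) |>
      tokens_tolower() |>
      tokens_remove(stopwords("english")) |>
      tokens_remove(profanity)

    # Frequency table for each n-gram order (1 = unigram ... 4 = tetragram)
    ngram_freq <- function(toks, n) {
      ng <- tokens_ngrams(toks, n = n, concatenator = " ")
      f  <- colSums(dfm(ng))
      sort(f[f > 1], decreasing = TRUE)    # drop n-grams seen only once
    }
    freqs <- lapply(1:4, ngram_freq, toks = toks)

    # The same filters are applied to the user's input text
    clean_input <- function(txt) {
      tokens(tolower(txt), remove_punct = TRUE, remove_numbers = TRUE) |>
        tokens_remove(profanity) |>
        as.character()
    }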

Prediction Algorithm

  • The application takes less than a minute to load. Once loaded, search results are returned within a few seconds.
  • The Stupid Backoff method with Katz backoff smoothing was used as the prediction algorithm.
  • The prediction function first searches the tetragram data, using the last three words of the input text. If the next word is not found, it backs off to the trigram data, and so on; the last resort is the unigram data.
  • A probability score is computed for each candidate word.
  • The five words with the highest scores are displayed as output (a sketch follows this list).
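A simplified sketch of the backoff lookup is given below. It assumes the n-gram counts have been split into a list `tables`, where `tables[[n]]` is a data frame with columns `prefix` (the leading n-1 words), `word` (the final word) and `freq`; the 0.4 penalty per backoff step follows the Stupid Backoff formulation. The table layout and helper names are illustrative, not the project's actual code.

    predict_next <- function(input_words, tables, lambda = 0.4) {
      n <- min(length(input_words), 3)      # longest usable prefix length
      scores <- numeric(0)
      if (n > 0) {
        for (k in n:1) {                    # tetragram table down to bigram table
          prefix <- paste(tail(input_words, k), collapse = " ")
          hits   <- tables[[k + 1]][tables[[k + 1]]$prefix == prefix, ]
          if (nrow(hits) > 0) {
            s <- lambda^(n - k) * hits$freq / sum(hits$freq)  # relative frequency,
            names(s) <- hits$word                             # penalised per backoff
            scores <- c(scores, s[!names(s) %in% names(scores)])
          }
        }
      }
      if (length(scores) == 0) {            # last resort: most frequent unigrams
        uni    <- tables[[1]]
        scores <- setNames(uni$freq / sum(uni$freq), uni$word)
      }
      head(sort(scores, decreasing = TRUE), 5)
    }

    # Example: predict_next(clean_input("thanks for the"), tables)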

Application

Run the Shiny application.
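To run the app locally, a minimal launch sketch is shown below; the `app/` directory name is an assumption about the project layout.

    library(shiny)
    runApp("app")   # assumes ui.R and server.R live in the app/ directory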

Limitations

  • Stopwords are removed, so the predicted word may sometimes not be grammatically correct.
  • If the user inputs a single word that, due to the limited data size, is not found in the bigram list, the next word cannot be predicted from context. In such cases, the top five unigrams are suggested as the next word.

References

  • Coursera Discussion Board
  • Stanford NLP lectures on Coursera