Data Science Capstone Project: Natural Language Processing - Next Word Prediction App

P.Wang
December 10, 2014

Using natural language processing techniques, this application performs text mining and predicts the next word when a phrase is entered.

Data Source: This app draws on three sets of writing samples: US Twitter (~2.36 M tweets), US Blogs (~0.9 M blog entries), and US News (~1 M news entries).
Data Processing: The Twitter, blog, and news data are first cleaned; the cleaning steps include removing numbers, punctuation, extra whitespace, and profanity, and converting the text to lowercase. The cleaned data are then used to create 3-, 4-, and 5-gram models.
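
The cleaning steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the app's actual code: the function name clean_text and the tiny PROFANITY set are hypothetical placeholders, and dropping every character other than a-z and whitespace is one possible reading of "remove numbers and punctuation".

```python
import re

# Hypothetical placeholder list; the app's real profanity list is not given.
PROFANITY = {"darn", "heck"}


def clean_text(text, banned=PROFANITY):
    """Lowercase the text and strip numbers, punctuation, extra
    whitespace, and banned words (an assumed reading of the steps)."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = re.sub(r"[^a-z\s]", "", text)   # remove punctuation and other symbols
    words = [w for w in text.split() if w not in banned]
    return " ".join(words)                 # collapses runs of whitespace


print(clean_text("Hello, World! 123 darn   it"))
```

Splitting on whitespace and re-joining with single spaces handles the whitespace step without a separate pass.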

The algorithm is based on the N-gram method.
3-, 4-, and 5-gram models are built from the SwiftKey project data (word frequency > 2).
Each 3-, 4-, or 5-gram is split into its first 2, 3, or 4 words and its last word; the model used for lookup is chosen by the number of input words.
If the input is longer than 4 words, only the last 4 words are considered and the prediction uses the 5-gram model.
The 3 most frequent candidates for the next word are returned in order of frequency.
If nothing can be predicted, a “no prediction” message is displayed.
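
The lookup described above can be sketched as follows. This is a minimal Python illustration, not the app's implementation: build_model and predict are hypothetical names, the frequency filter is assumed to apply to n-gram counts, and inputs shorter than 2 words are assumed to yield no prediction because no model shorter than a 3-gram is mentioned. No backoff to a shorter model is attempted, since none is described.

```python
from collections import Counter, defaultdict


def build_model(sentences, n, min_count=3):
    """Map each (n-1)-word prefix to a Counter of last words,
    keeping only n-grams seen more than twice (assumed threshold)."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    model = defaultdict(Counter)
    for gram, count in counts.items():
        if count >= min_count:
            model[gram[:-1]][gram[-1]] = count
    return model


def predict(models, phrase, top=3):
    """Return up to `top` next words, most frequent first."""
    words = phrase.lower().split()[-4:]   # only the last 4 words are used
    model = models.get(len(words) + 1)    # 2 words -> 3-gram, ..., 4 -> 5-gram
    if model is None:                     # fewer than 2 input words
        return []
    candidates = model.get(tuple(words))
    if not candidates:
        return []                         # caller shows the "no prediction" message
    return [word for word, _ in candidates.most_common(top)]


# Toy corpus for illustration only.
corpus = (["thanks for the follow"] * 3 + ["thanks for the support"] * 4
          + ["thanks for the love"] * 5 + ["thanks for the help"] * 2)
models = {n: build_model(corpus, n) for n in (3, 4, 5)}
print(predict(models, "for the"))
print(predict(models, "unknown words"))
```

Storing each prefix with a Counter of last words makes the top-3 selection a single most_common call, and an empty result maps directly onto the "no prediction" case.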

On the app page, type your words in the “Enter your words” box and wait for them to appear in the “You entered” box.
The 3 most frequent predictions for the next word are listed in order in the prediction box.

If nothing can be predicted, “–Sorry no prediction for your word–” will be displayed in the prediction box.