My Crystal Ball - NLP Word Prediction

kakilima
15 April 2016

As part of the Coursera Data Science Specialization Capstone, this Natural Language Processing (NLP) app predicts the next word of a sentence keyed in by the user.

Shiny App Link http://kakilima.shinyapps.io/crystalball
Alternative Shiny App http://seni.shinyapps.io/crystalball
Slides http://rpubs.com/kakilima/crystalball

The Development

The source data was a corpus obtained from HC Corpora.

The actual data used is a polished version of the corpus provided by Coursera
Because of familiarity, only the English data is used. (The other languages are German, Finnish & Russian.)
The data is cleaned (punctuation and numbers removed, text converted to lowercase, etc.)
A subset of the data (about 20%) is taken & converted into a document-term matrix (DTM) using the tm package in R
A prediction model is built using Stupid Backoff
Prediction & other utility functions are built
The interface of the app is built using Shiny
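
The cleaning and n-gram counting steps above can be sketched as follows. This is only an illustration in Python; the app itself was built in R with the tm package, and the function names `clean_text` and `ngram_counts` are hypothetical. The cleaning rule is also a simplification (it keeps only the letters a-z, so apostrophes and accented characters are dropped), not necessarily the exact rule the app uses.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase the text and keep only the letters a-z separated by single spaces."""
    text = text.lower()
    # Replace punctuation, digits and any other non-letter symbol with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def ngram_counts(sentences, n):
    """Count how often each n-gram (a tuple of n words) occurs in the sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = clean_text(sentence).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

For example, `ngram_counts(["The cat sat.", "The cat ran!"], 2)` counts the bigram `("the", "cat")` twice, because both sentences are cleaned before they are split into words.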

The App Interface

[App screenshot] The app consists of 4 screens: Auto Mode, Manual Mode, Settings & About.

How to use this App?

First & foremost, please be patient, as the app might take a while to load. Once you can see the app logo fully loaded, it's good to go.

Auto Mode: type in your text & the app will automatically provide the predicted next word.
Manual Mode: type in your text. When you want the app to predict, click the button.
Settings: you can enable or disable the profanity filter here. When it is enabled and a filtered word is predicted, the word is replaced with '#@?!'. The word is masked rather than removed so that it still gives context for subsequent predictions.
Additional feature: a word count is provided. It counts the number of words & characters typed by the user.
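
The profanity mask and the word count described above can be sketched as follows. This is a hedged illustration in Python rather than the app's actual R code; `mask_prediction`, `count_words_and_chars` and the `blocked_words` set are hypothetical names, and the real filter list is not given in this write-up.

```python
MASK = "#@?!"  # replacement string quoted in the Settings description


def mask_prediction(word, blocked_words, filter_enabled=True):
    """Return the mask instead of the word when the filter is on and the word is blocked."""
    if filter_enabled and word.lower() in blocked_words:
        return MASK
    return word


def count_words_and_chars(text):
    """Return (number of whitespace-separated words, number of characters)."""
    return len(text.split()), len(text)
```

Only the displayed word is masked; the original prediction can still be kept internally as context for the next prediction.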

Behind-the-scenes walkthrough

When the user keys in some text, like 'The Quick Brown FOX…???', the app will:

clean up & standardize the text to 'the quick brown fox'
as a total of 4 words are entered, the app will check the 5-grams for the Maximum Likelihood Estimate (MLE) of 'the quick brown fox *', then proceed with the 4-grams, 3-grams & 2-grams.
if fewer words are entered, it will start with the correspondingly lower-order n-gram (for example, the 3-gram when 2 words are entered)
when no words are entered yet, the unigram is used
an overall score is computed, using \( \alpha = 0.4 \) as the backoff factor
the word with the highest score will be predicted by the app
in cases where more than 1 word shares exactly the same score, a word will be chosen at random
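
The walkthrough above can be sketched as a small Stupid Backoff scorer. This is a minimal Python illustration under stated assumptions, not the app's actual R implementation: the names `build_counts`, `score_candidates` and `predict_next` are hypothetical, the n-gram counts are kept in one in-memory dictionary, and each candidate word takes its score from the highest-order n-gram in which it was seen, multiplied by \( \alpha \) once per backoff step.

```python
import random
from collections import Counter

ALPHA = 0.4      # backoff factor quoted in the walkthrough
MAX_ORDER = 5    # highest n-gram order checked (5-gram)


def build_counts(sentences, max_order=MAX_ORDER):
    """Count every n-gram (n = 1..max_order) in already-cleaned sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def score_candidates(context, counts, max_order=MAX_ORDER, alpha=ALPHA):
    """Score each candidate next word, backing off from long to short histories."""
    keep = max_order - 1
    context = tuple(context)[-keep:] if keep > 0 else ()
    total_unigrams = sum(c for gram, c in counts.items() if len(gram) == 1)
    scores = {}
    weight = 1.0
    for start in range(len(context) + 1):
        history = context[start:]  # shrinks by one word per backoff step
        denom = counts.get(history, 0) if history else total_unigrams
        if denom > 0:
            # Linear scan for clarity; a real model would index n-grams by history.
            for gram, c in counts.items():
                if len(gram) == len(history) + 1 and gram[:-1] == history:
                    # Keep the score from the highest order where the word was seen.
                    scores.setdefault(gram[-1], weight * c / denom)
        weight *= alpha  # each backoff step discounts the score by alpha
    return scores


def predict_next(context, counts, rng=random):
    """Return the top-scoring word, choosing at random among exact ties."""
    scores = score_candidates(context, counts)
    if not scores:
        return None
    best = max(scores.values())
    tied = sorted(word for word, s in scores.items() if s == best)
    return rng.choice(tied)
```

With counts built from a few cleaned sentences, `predict_next("the quick brown fox".split(), counts)` scores candidates from the 5-grams first, while an unseen history such as 'purple fox' falls back to the 2-grams starting with 'fox' and has its scores multiplied by 0.4.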