My Crystal Ball - NLP Word Prediction

kakilima
15 April 2016

Built for the Coursera Data Science Specialization Capstone, this Natural Language Processing (NLP) app predicts the next word of a sentence as the user types it.

The Development

The source data is a corpus obtained from HC Corpora.

  • The data actually used is a polished version of the corpus provided by Coursera
  • Because of familiarity, only English is used. (The other languages are German, Finnish & Russian)
  • The data is cleaned (punctuation & numbers removed, text converted to lowercase, etc.)
  • A subset of the data (about 20%) is taken & converted into a document-term matrix (DTM) using the tm package in R
  • A prediction model is built using Stupid Backoff
  • Prediction & other utility functions are built
  • The interface of the app is built using Shiny
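
The cleaning and counting steps above can be sketched as follows. This is a Python illustration, not the app's R code (the app uses the tm package). The names `clean_text` and `ngram_counts` are hypothetical, and keeping only the letters a-z is an assumption; the app's exact cleaning rules beyond those listed are not known.

```python
import re
from collections import Counter

def clean_text(text):
    """Lowercase the text, drop punctuation and digits, collapse whitespace."""
    text = text.lower()
    # Assumption: anything that is not a-z or whitespace is deleted outright,
    # so "it's" becomes "its" and "word1,word2" becomes "wordword".
    text = re.sub(r"[^a-z\s]", "", text)
    return " ".join(text.split())

def ngram_counts(sentences, n):
    """Count n-grams (as tuples of words) over an iterable of raw sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = clean_text(sentence).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

For example, `clean_text("The Quick Brown FOX…???")` gives `"the quick brown fox"`.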

The App Interface

The app consists of 4 screens: Auto Mode, Manual Mode, Settings & About.

How to use this App?

First & foremost, please be patient, as the app might take a while to load. Once the app logo is fully loaded, it is ready to use.

  • Auto Mode: type in your text & the app will automatically provide the predicted next word.
  • Manual Mode: type in your text. When you want the app to predict, click the button.
  • Settings: you can enable or disable the profanity filter here. When it is enabled and a filtered word is predicted, the word is replaced with '#@?!'. The word is not removed, to give better context for subsequent predictions.
  • Additional feature: a word count is provided. It counts the number of words & characters typed by the user.
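
The profanity filter and the word count described above can be sketched as follows. This is a Python illustration under assumptions, not the app's R code: the names `mask_profanity` and `text_stats` and the `blocked` word set are hypothetical, and counting spaces as characters is a guess about how the app counts.

```python
MASK = "#@?!"  # replacement shown in place of a filtered word

def mask_profanity(word, blocked, enabled=True):
    """Return MASK when the filter is on and `word` is in the blocked set."""
    if enabled and word.lower() in blocked:
        return MASK
    return word

def text_stats(text):
    """Count whitespace-separated words and all characters (spaces included)."""
    return {"words": len(text.split()), "characters": len(text)}
```

Masking the prediction rather than dropping it keeps the word's position in the sentence, which matches the stated goal of preserving context for later predictions.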

Behind-the-scenes walkthrough

When the user keys in some text, like 'The Quick Brown FOX…???', the app will:

  • clean up & standardize the text to 'the quick brown fox'
  • because 4 words were entered, check the 5-grams for the Maximum Likelihood Estimate (MLE) of 'the quick brown fox *', then back off to the 4-grams, 3-grams & 2-grams
  • if fewer words are entered, start with the (n+1)-grams, where n is the number of words entered
  • when no words have been entered yet, use the unigrams
  • compute the overall score, using a backoff factor \( \alpha = 0.4 \)
  • predict the word with the highest score
  • if more than 1 word shares exactly the same top score, choose one of them at random
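
The walkthrough above follows the Stupid Backoff scheme: the score is \( S(w \mid c) = \text{count}(c\,w) / \text{count}(c) \) when the n-gram \( c\,w \) has been seen, and \( \alpha \, S(w \mid c') \) otherwise, where \( c' \) is the context \( c \) without its first word. With no context left, \( S(w) = \text{count}(w) / N \) for \( N \) total words. A minimal Python sketch of this scoring follows. It is an illustration, not the app's R implementation: the names `build_counts`, `score` and `predict` are hypothetical, and scoring every vocabulary word is a simplification that a real app would likely avoid for speed.

```python
import random
from collections import Counter

ALPHA = 0.4  # backoff factor stated in the walkthrough

def build_counts(tokens, max_n=5):
    """Count every n-gram of order 1..max_n in a list of cleaned tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def score(word, context, counts, total, alpha=ALPHA):
    """Stupid Backoff score of `word` following the tuple `context`."""
    factor = 1.0
    while context:
        seen = counts.get(context + (word,), 0)
        if seen > 0:
            # Relative frequency at this order, discounted once per backoff.
            return factor * seen / counts[context]
        context = context[1:]  # back off: drop the oldest context word
        factor *= alpha
    # No context left: fall back to the unigram relative frequency.
    return factor * counts.get((word,), 0) / total

def predict(tokens, counts, vocab, total, max_n=5, rng=random):
    """Return the highest-scoring next word, breaking exact ties at random."""
    context = tuple(tokens[-(max_n - 1):]) if max_n > 1 else ()
    scores = {w: score(w, context, counts, total) for w in vocab}
    best = max(scores.values())
    return rng.choice(sorted(w for w, s in scores.items() if s == best))
```

A small usage example on a toy corpus (the corpus is invented for illustration):

```python
tokens = "the quick brown fox jumps over the lazy dog the quick brown cat".split()
counts = build_counts(tokens)
print(predict(["over", "the"], counts, set(tokens), len(tokens)))
```

Here 'the quick brown' would be followed by 'fox' or 'cat' with equal scores, so either could be returned for that input.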