Capstone Final Project: Shiny App for NLP Word Predictive Model

Sunil Kumar (@sunil4data; sunil_iitb96@yahoo.co.in)
11 Aug 2018

1. Introduction

  • Goals of this Capstone course:

    • Use the Data Science learning and tools acquired so far to solve a next-word prediction modeling problem
    • Learn just enough NLP to build an n-gram language model
    • Assess and attempt to improve accuracy and performance
  • Goal of this final project: build and present a Shiny app that serves next-word predictions from the resulting n-gram language model

2. Building a predictive text model

  • Create an algorithm that predicts the next word, given two or more words of input, using an n-gram language model

  • A large corpus of blog, news, and Twitter data was loaded and analyzed

  • N-grams were extracted from a 10% sample of the corpus and used to build the predictive model (see the sketch after this list)

  • Various methods of improving the prediction accuracy and speed were explored (refer to 'NLP Background study notes & findings' in https://www.kaggle.com/suniliitb96/tryswiftkeyinr?scriptVersionId=5037782)
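
The sketch below illustrates, under stated assumptions, how a 10% corpus sample and the n-gram frequency tables behind the model could be produced in R. The file names, the quanteda-based tokenization, and the sampling seed are illustrative assumptions, not the project's exact code.

    library(quanteda)   # assumed tokenization package; the project may use different tooling

    # Read the three corpus files and keep a 10% random sample of lines
    files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")  # hypothetical paths
    lines <- unlist(lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE))
    set.seed(1234)
    sampled <- sample(lines, round(0.10 * length(lines)))

    # Clean and tokenize: lower-case, drop punctuation, numbers and URLs
    toks <- tokens(char_tolower(sampled),
                   remove_punct = TRUE,
                   remove_numbers = TRUE,
                   remove_url = TRUE)

    # Build 1-, 2- and 3-gram frequency tables for the language model
    ngram_counts <- function(toks, n) {
      ng <- tokens_ngrams(toks, n = n, concatenator = " ")
      colSums(dfm(ng))            # named vector: n-gram -> count
    }
    counts1 <- ngram_counts(toks, 1)
    counts2 <- ngram_counts(toks, 2)
    counts3 <- ngram_counts(toks, 3)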

3. Algorithm

  • Challenges of n-gram language modeling

    • Rarely observed words (~50% of the vocabulary occurred only once)
      • Language model building was attempted with minDocFreq values of 2 and 5
    • Stop words observed most frequently (~25% of corpus tokens were stop words)
    • Missing words in the test sentence
      • Cause: dropped stop words, pruned low-frequency words, or out-of-vocabulary (OOV) words
      • Solution: smoothing and backoff/interpolation
  • n-gram Language Models

    • “UNK” entries for missing 1-grams and “1gramTokens_UNK” entries for missing 2-grams were included in the LM
    • MLE and Add-1 (Laplace) smoothed probabilities were pre-computed for n-grams up to 3-grams
  • Next Word Prediction

    • The Stupid Backoff algorithm was implemented (see the sketch after this list)
      • Observed prediction time of 0.5-0.9 sec
      • Although the code execution report shows low overall accuracy with SBO, a close look at the top 6 predicted words confirms that the predictions are quite good; the expected labels in the test trigrams are often not appropriate targets
    • Predictions using Add-1 Laplace probabilities are quite poor
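
The sketch below outlines, in simplified form, the two scoring schemes compared above. It assumes the n-gram count tables counts1/counts2/counts3 from the earlier sketch; the helper names and flat lookup tables are illustrative assumptions (the actual model stores pre-computed probabilities rather than raw counts).

    # Helper: count of an n-gram, 0 if unseen (pruned or OOV)
    get_count <- function(counts, key) {
      if (key %in% names(counts)) counts[[key]] else 0
    }

    # Add-1 (Laplace) smoothed trigram probability:
    #   P(w | w1 w2) = (count(w1 w2 w) + 1) / (count(w1 w2) + V)
    V <- length(counts1)   # vocabulary size, including the "UNK" pseudo-token
    laplace_prob <- function(w1, w2, w) {
      (get_count(counts3, paste(w1, w2, w)) + 1) /
        (get_count(counts2, paste(w1, w2)) + V)
    }

    # Stupid Backoff score with a fixed lambda of 0.4: use the trigram relative
    # frequency when observed, otherwise back off to the bigram and then to the
    # unigram, multiplying the score by 0.4 at each backoff step
    sbo_score <- function(w1, w2, w, lambda = 0.4) {
      tri_c <- get_count(counts3, paste(w1, w2, w))
      if (tri_c > 0) return(tri_c / get_count(counts2, paste(w1, w2)))
      bi_c <- get_count(counts2, paste(w2, w))
      if (bi_c > 0) return(lambda * bi_c / get_count(counts1, w2))
      lambda^2 * get_count(counts1, w) / sum(counts1)
    }

    # Score every vocabulary word and keep the 6 most probable completions
    predict_next <- function(w1, w2, top = 6) {
      scores <- vapply(names(counts1), function(w) sbo_score(w1, w2, w), numeric(1))
      head(sort(scores, decreasing = TRUE), top)
    }

Scanning the whole vocabulary as above is the conceptual view; in practice the probabilities are pre-computed and only n-grams whose prefix matches the input are looked up, which keeps prediction within the 0.5-0.9 second range reported above.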

4. Application Workflow

  • The pre-computed LM containing 1-, 2-, and 3-gram probabilities is loaded by the Shiny app to serve next-word predictions (a minimal app skeleton is sketched after this list)

  • The user enters an incomplete sentence of two or more words whose next word is to be predicted

  • The same data cleaning and tokenization steps used on the 'training' data are applied to this input sentence

  • Input parameters of prediction algorithms

    • Add-1 Laplace: none
    • Stupid Backoff: a fixed lambda of 0.4
  • Results

    • The cleaned (incomplete) input sentence
    • The 6 most probable completing words, in decreasing order of probability, drawn from matching 3-, 2-, and 1-grams
    • The prefix of the n-gram from which each predicted word was picked
    • Elapsed time
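
The skeleton below sketches this workflow as a minimal Shiny app, assuming the predict_next() helper from the earlier sketch; the widget names and the clean_input() helper are illustrative assumptions, not the deployed app's code.

    library(shiny)

    # Apply the same cleaning used on the training data: lower-case, keep only
    # letters and apostrophes, then split into word tokens
    clean_input <- function(text) {
      txt <- gsub("[^a-z' ]", " ", tolower(text))
      tokens <- unlist(strsplit(txt, "\\s+"))
      tokens[tokens != ""]
    }

    ui <- fluidPage(
      textInput("sentence", "Enter an incomplete sentence (2 or more words):"),
      tableOutput("predictions"),
      textOutput("elapsed")
    )

    server <- function(input, output) {
      result <- reactive({
        tokens <- clean_input(input$sentence)
        req(length(tokens) >= 2)
        ctx <- tail(tokens, 2)                      # last two words form the trigram prefix
        t <- system.time(preds <- predict_next(ctx[1], ctx[2]))
        list(preds = preds, secs = t[["elapsed"]])
      })
      output$predictions <- renderTable({
        data.frame(word = names(result()$preds), score = result()$preds)
      })
      output$elapsed <- renderText(sprintf("Elapsed time: %.2f sec", result()$secs))
    }

    shinyApp(ui, server)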

5. Resources & App Screenshot