Capstone Final Project: Shiny App for NLP Word Predictive Model

Sunil Kumar (@sunil4data; sunil_iitb96@yahoo.co.in)
11 Aug 2018

1. Introduction

  • Goals of this Capstone course:

    • Use the Data Science learning and tools acquired so far to solve a next-word prediction modeling problem
    • Learn just enough NLP to build an n-gram language model
    • Assess and attempt to improve accuracy and performance
  • Goal of this final project: build and present a Shiny app that serves next-word predictions from the resulting n-gram language model

2. Building a predictive text model

  • Create an algorithm that predicts the next word, given two or more words of input, using an n-gram language model

  • A large corpus of blog, news, and Twitter data was loaded and analyzed

  • N-grams were extracted from a 10% sample of the corpus and used to build the predictive model (see the sketch after this list)

  • Various methods of improving the prediction accuracy and speed were explored (refer to 'NLP Background study notes & findings' in https://www.kaggle.com/suniliitb96/tryswiftkeyinr?scriptVersionId=5037782)
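
The sketch below illustrates, under stated assumptions, how a 10% corpus sample and the n-gram frequency tables behind the model could be produced in R. The file names, the quanteda-based tokenization, and the sampling seed are illustrative assumptions, not the project's exact code.

    library(quanteda)   # assumed tokenization package; the project may use different tooling

    # Read the three corpus files and keep a 10% random sample of lines
    files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")  # hypothetical paths
    lines <- unlist(lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE))
    set.seed(1234)
    sampled <- sample(lines, round(0.10 * length(lines)))

    # Clean and tokenize: lower-case, drop punctuation, numbers and URLs
    toks <- tokens(char_tolower(sampled),
                   remove_punct = TRUE,
                   remove_numbers = TRUE,
                   remove_url = TRUE)

    # Build 1-, 2- and 3-gram frequency tables for the language model
    ngram_counts <- function(toks, n) {
      ng <- tokens_ngrams(toks, n = n, concatenator = " ")
      colSums(dfm(ng))            # named vector: n-gram -> count
    }
    counts1 <- ngram_counts(toks, 1)
    counts2 <- ngram_counts(toks, 2)
    counts3 <- ngram_counts(toks, 3)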

3. Algorithm

  • Challenges of n-gram language modeling

    • Rarely observed words (~50% of the vocabulary occurred only once)
      • Language model building was attempted with minDocFreq values of 2 and 5
    • Stop words observed most frequently (~25% of corpus tokens were stop words)
    • Missing words in the test sentence
      • Cause: dropped stop words, pruned low-frequency words, or out-of-vocabulary (OOV) words
      • Solution: smoothing and backoff/interpolation
  • n-gram Language Models

    • “UNK” entries for missing 1-grams and “1gramTokens_UNK” entries for missing 2-grams were included in the LM
    • MLE and Add-1 (Laplace) smoothed probabilities were pre-computed for n-grams up to 3-grams
  • Next Word Prediction

    • The Stupid Backoff algorithm was implemented (see the sketch after this list)
      • Observed prediction time of 0.5-0.9 sec
      • Although the code execution report shows low overall accuracy with SBO, a close look at the top 6 predicted words confirms that the predictions are quite good; the expected labels in the test trigrams are often not appropriate targets
    • Predictions using Add-1 Laplace probabilities are quite poor
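
The sketch below outlines, in simplified form, the two scoring schemes compared above. It assumes the n-gram count tables counts1/counts2/counts3 from the earlier sketch; the helper names and flat lookup tables are illustrative assumptions (the actual model stores pre-computed probabilities rather than raw counts).

    # Helper: count of an n-gram, 0 if unseen (pruned or OOV)
    get_count <- function(counts, key) {
      if (key %in% names(counts)) counts[[key]] else 0
    }

    # Add-1 (Laplace) smoothed trigram probability:
    #   P(w | w1 w2) = (count(w1 w2 w) + 1) / (count(w1 w2) + V)
    V <- length(counts1)   # vocabulary size, including the "UNK" pseudo-token
    laplace_prob <- function(w1, w2, w) {
      (get_count(counts3, paste(w1, w2, w)) + 1) /
        (get_count(counts2, paste(w1, w2)) + V)
    }

    # Stupid Backoff score with a fixed lambda of 0.4: use the trigram relative
    # frequency when observed, otherwise back off to the bigram and then to the
    # unigram, multiplying the score by 0.4 at each backoff step
    sbo_score <- function(w1, w2, w, lambda = 0.4) {
      tri_c <- get_count(counts3, paste(w1, w2, w))
      if (tri_c > 0) return(tri_c / get_count(counts2, paste(w1, w2)))
      bi_c <- get_count(counts2, paste(w2, w))
      if (bi_c > 0) return(lambda * bi_c / get_count(counts1, w2))
      lambda^2 * get_count(counts1, w) / sum(counts1)
    }

    # Score every vocabulary word and keep the 6 most probable completions
    predict_next <- function(w1, w2, top = 6) {
      scores <- vapply(names(counts1), function(w) sbo_score(w1, w2, w), numeric(1))
      head(sort(scores, decreasing = TRUE), top)
    }

Scanning the whole vocabulary as above is the conceptual view; in practice the probabilities are pre-computed and only n-grams whose prefix matches the input are looked up, which keeps prediction within the 0.5-0.9 second range reported above.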

4. Application Workflow

  • The pre-computed LM containing 1-, 2-, and 3-gram probabilities is loaded by the Shiny app to serve next-word predictions (a minimal app skeleton is sketched after this list)

  • The user enters an incomplete sentence of two or more words whose next word is to be predicted

  • The same data cleaning and tokenization steps used on the 'training' data are applied to this input sentence

  • Input parameters of prediction algorithms

    • Add-1 Laplace: none
    • Stupid Backoff: a fixed lambda of 0.4
  • Results

    • The cleaned (incomplete) input sentence
    • The 6 most probable completing words, in decreasing order of probability, drawn from matching 3-, 2-, and 1-grams
    • The prefix of the n-gram from which each predicted word was picked
    • Elapsed time
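
The skeleton below sketches this workflow as a minimal Shiny app, assuming the predict_next() helper from the earlier sketch; the widget names and the clean_input() helper are illustrative assumptions, not the deployed app's code.

    library(shiny)

    # Apply the same cleaning used on the training data: lower-case, keep only
    # letters and apostrophes, then split into word tokens
    clean_input <- function(text) {
      txt <- gsub("[^a-z' ]", " ", tolower(text))
      tokens <- unlist(strsplit(txt, "\\s+"))
      tokens[tokens != ""]
    }

    ui <- fluidPage(
      textInput("sentence", "Enter an incomplete sentence (2 or more words):"),
      tableOutput("predictions"),
      textOutput("elapsed")
    )

    server <- function(input, output) {
      result <- reactive({
        tokens <- clean_input(input$sentence)
        req(length(tokens) >= 2)
        ctx <- tail(tokens, 2)                      # last two words form the trigram prefix
        t <- system.time(preds <- predict_next(ctx[1], ctx[2]))
        list(preds = preds, secs = t[["elapsed"]])
      })
      output$predictions <- renderTable({
        data.frame(word = names(result()$preds), score = result()$preds)
      })
      output$elapsed <- renderText(sprintf("Elapsed time: %.2f sec", result()$secs))
    }

    shinyApp(ui, server)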

5. Resources & App Screenshot