Data Science Capstone Project: Predict Next Word with Natural Language Processing

Parag Sengupta
May 7, 2018

Project Background

Build an English text prediction model using Natural Language Processing and Text Mining.

Model Building - Preprocessing and N-Grams

Preprocessing

  • Preprocessing and cleaning of the loaded data
    • Removing numbers, punctuation, foreign characters, and extra white space; converting to lower case; profanity filtering
  • Tokenization of the text into words, phrases, symbols, etc. for parsing and text mining
  • Processed data stored as a VCorpus, Plain Text Document, and Term Document Matrix, as required for text mining applications (a code sketch follows)
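A minimal sketch of this cleaning pipeline using the tm package; the input file names and the profanity list are placeholders, not the project's actual files:

```r
library(tm)

# Hypothetical inputs: a sample of the corpus and a profanity word list
texts     <- readLines("sample_corpus.txt", encoding = "UTF-8")
texts     <- iconv(texts, "UTF-8", "ASCII", sub = "")   # drop foreign characters
profanity <- readLines("profanity_list.txt")

corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, removeNumbers)                 # remove numbers
corpus <- tm_map(corpus, removePunctuation)             # remove punctuation
corpus <- tm_map(corpus, content_transformer(tolower))  # convert to lower case
corpus <- tm_map(corpus, removeWords, profanity)        # profanity filtering
corpus <- tm_map(corpus, stripWhitespace)               # remove extra white space

tdm <- TermDocumentMatrix(corpus)                       # for text-mining steps
```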

N-Grams

  • Generate N-Grams and their frequencies from the cleaned, processed data
  • Store N-Grams in the respective unigram, bigram, trigram, and quadgram data frames, with each row containing Frequency, Term, and X1, X2 … Xn, where Xi is the ith word of the Term
  • Manage memory by removing N-Grams with occurrence frequency < 5, to cope with Shiny app file upload limits and speed up lookups
  • Calculate N-Gram counts using a term frequency function
  • Assign probabilities to the N-Grams using simple linear interpolation with dummy lambda values (a counting sketch follows this list)
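A base-R sketch of the N-gram counting and data-frame layout described above, assuming `corpus` is the cleaned VCorpus from the preprocessing step; the helper name is illustrative:

```r
cleaned <- sapply(corpus, as.character)   # cleaned sentences as a character vector

# Count n-grams and build a data frame with Frequency, Term, and X1 ... Xn columns
count_ngrams <- function(texts, n) {
  grams <- unlist(lapply(strsplit(texts, "\\s+"), function(words) {
    if (length(words) < n) return(character(0))
    sapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "))
  }))
  freq <- sort(table(grams), decreasing = TRUE)
  df <- data.frame(Frequency = as.integer(freq),
                   Term      = names(freq),
                   stringsAsFactors = FALSE)
  df <- df[df$Frequency >= 5, ]                       # prune sparse entries
  words <- do.call(rbind, strsplit(df$Term, " "))     # split Term into words
  colnames(words) <- paste0("X", seq_len(n))
  cbind(df, words, stringsAsFactors = FALSE)
}

unigrams  <- count_ngrams(cleaned, 1)
bigrams   <- count_ngrams(cleaned, 2)
trigrams  <- count_ngrams(cleaned, 3)
quadgrams <- count_ngrams(cleaned, 4)
```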

Model Building - Probabilities and Prediction

Probability Formulae Used

  • Maximum Likelihood Estimates, computed from the N-Gram counts:

    Quadgram: $P_{ML}(w_4 \mid w_1 w_2 w_3) = \frac{count(w_1 w_2 w_3 w_4)}{count(w_1 w_2 w_3)}$
    Trigram: $P_{ML}(w_3 \mid w_1 w_2) = \frac{count(w_1 w_2 w_3)}{count(w_1 w_2)}$
    Bigram: $P_{ML}(w_2 \mid w_1) = \frac{count(w_1 w_2)}{count(w_1)}$
    Unigram: $P_{ML}(w_1) = \frac{count(w_1)}{\text{total word count}}$

  • Linear Interpolation
    • New estimate = weighted average of the three Maximum Likelihood Estimates at the quadgram, trigram, and bigram orders:

    $\hat{P}(w_4 \mid w_1 w_2 w_3) = \lambda_1 P_{ML}(w_4 \mid w_1 w_2 w_3) + \lambda_2 P_{ML}(w_4 \mid w_2 w_3) + \lambda_3 P_{ML}(w_4 \mid w_3)$, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$ (an R sketch follows)
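A sketch of the interpolated estimate in R, assuming the quadgram/trigram/bigram data frames built above; the lookup helper and the dummy lambda values are illustrative:

```r
lambdas <- c(0.6, 0.3, 0.1)   # dummy lambda values; must sum to 1

# ML estimate: count of (prefix + word) over count of prefix (hypothetical helper)
ml_estimate <- function(df_high, df_low, prefix, word) {
  num <- df_high$Frequency[df_high$Term == paste(prefix, word)]
  den <- df_low$Frequency[df_low$Term == prefix]
  if (length(num) == 0 || length(den) == 0) return(0)
  num / den
}

# Interpolated probability of `word` following a three-word prefix "w1 w2 w3"
interp_prob <- function(prefix, word) {
  w <- strsplit(prefix, " ")[[1]]
  lambdas[1] * ml_estimate(quadgrams, trigrams, paste(w,      collapse = " "), word) +
  lambdas[2] * ml_estimate(trigrams,  bigrams,  paste(w[2:3], collapse = " "), word) +
  lambdas[3] * ml_estimate(bigrams,   unigrams, w[3],                          word)
}
```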

Prediction

  • Input string from the user is preprocessed using the same method as the training data
  • The processed input string is passed to the predict function, where the last 3 words are extracted and matched against the 4-gram table; lookup then backs off as shown in the table below, followed by a code sketch
| If a match | If no match |
| --- | --- |
| 4-gram with maximum probability is returned | Last 2 words extracted and matched against the 3-gram table |
| 3-gram with maximum probability is returned | Last word extracted and matched against the 2-gram table |
| 2-gram with maximum probability is returned | Unigram with maximum probability is returned |
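A sketch of this backoff lookup in R; `clean_input` is a hypothetical stand-in for the project's preprocessing, and the data frames are those built earlier:

```r
predict_next <- function(input) {
  words <- strsplit(clean_input(input), "\\s+")[[1]]  # same preprocessing as training

  if (length(words) >= 3) {                 # last 3 words against the 4-gram table
    w   <- tail(words, 3)
    hit <- subset(quadgrams, X1 == w[1] & X2 == w[2] & X3 == w[3])
    if (nrow(hit) > 0) return(hit$X4[which.max(hit$Frequency)])
  }
  if (length(words) >= 2) {                 # back off: last 2 words, 3-gram table
    w   <- tail(words, 2)
    hit <- subset(trigrams, X1 == w[1] & X2 == w[2])
    if (nrow(hit) > 0) return(hit$X3[which.max(hit$Frequency)])
  }
  if (length(words) >= 1) {                 # back off: last word, 2-gram table
    hit <- subset(bigrams, X1 == tail(words, 1))
    if (nrow(hit) > 0) return(hit$X2[which.max(hit$Frequency)])
  }
  unigrams$X1[which.max(unigrams$Frequency)]  # fall back to most frequent unigram
}
```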

Shiny App and Conclusion

Shiny App Screenshot

*(Two screenshots of the Shiny app.)*

How the App works

  • The app has 3 sections: Predict Input and Output, Wordcloud Select and Output, and Documentation.
  • User inputs a sentence in Predict Input, which generates the predicted set of words in Predict Output.
  • Wordcloud displays one of 3 wordclouds depending on the user's choice of 1-word, 2-word, or 3-word terms.
    • Hovering the mouse over a word in the wordcloud shows its occurrence frequency
  • Documentation displays the Description and Methodology of the project. (A minimal UI sketch follows.)
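A minimal sketch of this three-section layout; the wordcloud2 package is assumed for the hover-to-show-frequency behaviour, and all widget IDs and file names are illustrative:

```r
library(shiny)
library(wordcloud2)   # hovering over a word shows its frequency

ui <- fluidPage(
  tabsetPanel(
    tabPanel("Predict",
             textInput("sentence", "Predict Input"),
             textOutput("prediction")),                # Predict Output
    tabPanel("Wordcloud",
             radioButtons("ngram", "Wordcloud Select",
                          choices = c("1-word" = 1, "2-words" = 2, "3-words" = 3)),
             wordcloud2Output("cloud")),
    tabPanel("Documentation",
             includeMarkdown("documentation.md"))      # description and methodology
  )
)

server <- function(input, output) {
  output$prediction <- renderText(predict_next(input$sentence))
  output$cloud <- renderWordcloud2({
    df <- list(unigrams, bigrams, trigrams)[[as.integer(input$ngram)]]
    wordcloud2(data.frame(word = df$Term, freq = df$Frequency))
  })
}

shinyApp(ui, server)
```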

Performance Notes

  • RAM limitations addressed by using ~10% of the dataset for the model (sampling sketched below)
  • Prediction expedited by removing sparse entries (frequency < 5) from the generated N-Grams
  • Prediction accuracy ~14% using the DSCI benchmark
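A sketch of the ~10% sampling, assuming a line-oriented source file; the file names are placeholders:

```r
set.seed(123)                                           # reproducible sample
lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
keep  <- rbinom(length(lines), 1, 0.10) == 1            # keep ~10% of the lines
writeLines(lines[keep], "sample_corpus.txt")
```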