Data Science Capstone Project: Predict Next Word with Natural Language Processing

Parag Sengupta
May 7, 2018

Project Background

Build an English text prediction model using Natural Language Processing and Text Mining.

Model Building - Preprocessing and N-Grams

Preprocessing

  • Preprocessing and cleaning of the loaded data
    • Removing numbers, punctuation, foreign characters, and extra white space; converting to lower case; profanity filtering
  • Tokenization of the text into words, phrases, symbols, etc. for parsing and text mining
  • Processed data stored as a VCorpus, Plain Text Document, and Term Document Matrix, as required for text mining applications (a code sketch follows)
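A minimal sketch of this cleaning pipeline using the tm package; the input file names and the profanity list are placeholders, not the project's actual files:

```r
library(tm)

# Hypothetical inputs: a sample of the corpus and a profanity word list
texts     <- readLines("sample_corpus.txt", encoding = "UTF-8")
texts     <- iconv(texts, "UTF-8", "ASCII", sub = "")   # drop foreign characters
profanity <- readLines("profanity_list.txt")

corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, removeNumbers)                 # remove numbers
corpus <- tm_map(corpus, removePunctuation)             # remove punctuation
corpus <- tm_map(corpus, content_transformer(tolower))  # convert to lower case
corpus <- tm_map(corpus, removeWords, profanity)        # profanity filtering
corpus <- tm_map(corpus, stripWhitespace)               # remove extra white space

tdm <- TermDocumentMatrix(corpus)                       # for text-mining steps
```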

N-Grams

  • Generate N-Grams and their frequencies from the cleaned, processed data
  • Store N-Grams in the respective unigram, bigram, trigram, and quadgram data frames, with each row containing Frequency, Term, and X1, X2 … Xn, where Xi is the ith word of the Term
  • Manage memory by removing N-Grams with occurrence frequency < 5, to cope with Shiny app file upload limits and speed up lookups
  • Calculate N-Gram counts using a term frequency function
  • Assign probabilities to the N-Grams using simple linear interpolation with dummy lambda values (a counting sketch follows this list)
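A base-R sketch of the N-gram counting and data-frame layout described above, assuming `corpus` is the cleaned VCorpus from the preprocessing step; the helper name is illustrative:

```r
cleaned <- sapply(corpus, as.character)   # cleaned sentences as a character vector

# Count n-grams and build a data frame with Frequency, Term, and X1 ... Xn columns
count_ngrams <- function(texts, n) {
  grams <- unlist(lapply(strsplit(texts, "\\s+"), function(words) {
    if (length(words) < n) return(character(0))
    sapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "))
  }))
  freq <- sort(table(grams), decreasing = TRUE)
  df <- data.frame(Frequency = as.integer(freq),
                   Term      = names(freq),
                   stringsAsFactors = FALSE)
  df <- df[df$Frequency >= 5, ]                       # prune sparse entries
  words <- do.call(rbind, strsplit(df$Term, " "))     # split Term into words
  colnames(words) <- paste0("X", seq_len(n))
  cbind(df, words, stringsAsFactors = FALSE)
}

unigrams  <- count_ngrams(cleaned, 1)
bigrams   <- count_ngrams(cleaned, 2)
trigrams  <- count_ngrams(cleaned, 3)
quadgrams <- count_ngrams(cleaned, 4)
```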

Model Building - Probabilities and Prediction

Probability Formulae Used

  • Maximum Likelihood Estimates, computed from the N-Gram counts:

    Quadgram: $P_{ML}(w_4 \mid w_1 w_2 w_3) = \frac{count(w_1 w_2 w_3 w_4)}{count(w_1 w_2 w_3)}$
    Trigram: $P_{ML}(w_3 \mid w_1 w_2) = \frac{count(w_1 w_2 w_3)}{count(w_1 w_2)}$
    Bigram: $P_{ML}(w_2 \mid w_1) = \frac{count(w_1 w_2)}{count(w_1)}$
    Unigram: $P_{ML}(w_1) = \frac{count(w_1)}{\text{total word count}}$

  • Linear Interpolation
    • New estimate = weighted average of the three Maximum Likelihood Estimates at the quadgram, trigram, and bigram orders:

    $\hat{P}(w_4 \mid w_1 w_2 w_3) = \lambda_1 P_{ML}(w_4 \mid w_1 w_2 w_3) + \lambda_2 P_{ML}(w_4 \mid w_2 w_3) + \lambda_3 P_{ML}(w_4 \mid w_3)$, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$ (an R sketch follows)
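A sketch of the interpolated estimate in R, assuming the quadgram/trigram/bigram data frames built above; the lookup helper and the dummy lambda values are illustrative:

```r
lambdas <- c(0.6, 0.3, 0.1)   # dummy lambda values; must sum to 1

# ML estimate: count of (prefix + word) over count of prefix (hypothetical helper)
ml_estimate <- function(df_high, df_low, prefix, word) {
  num <- df_high$Frequency[df_high$Term == paste(prefix, word)]
  den <- df_low$Frequency[df_low$Term == prefix]
  if (length(num) == 0 || length(den) == 0) return(0)
  num / den
}

# Interpolated probability of `word` following a three-word prefix "w1 w2 w3"
interp_prob <- function(prefix, word) {
  w <- strsplit(prefix, " ")[[1]]
  lambdas[1] * ml_estimate(quadgrams, trigrams, paste(w,      collapse = " "), word) +
  lambdas[2] * ml_estimate(trigrams,  bigrams,  paste(w[2:3], collapse = " "), word) +
  lambdas[3] * ml_estimate(bigrams,   unigrams, w[3],                          word)
}
```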

Prediction

  • Input string from the user is preprocessed using the same method as the training data
  • The processed input string is passed to the predict function, where the last 3 words are extracted and matched against the 4-gram table; lookup then backs off as shown in the table below, followed by a code sketch
| If a match | If no match |
| --- | --- |
| 4-gram with maximum probability is returned | Last 2 words extracted and matched against the 3-gram table |
| 3-gram with maximum probability is returned | Last word extracted and matched against the 2-gram table |
| 2-gram with maximum probability is returned | Unigram with maximum probability is returned |
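A sketch of this backoff lookup in R; `clean_input` is a hypothetical stand-in for the project's preprocessing, and the data frames are those built earlier:

```r
predict_next <- function(input) {
  words <- strsplit(clean_input(input), "\\s+")[[1]]  # same preprocessing as training

  if (length(words) >= 3) {                 # last 3 words against the 4-gram table
    w   <- tail(words, 3)
    hit <- subset(quadgrams, X1 == w[1] & X2 == w[2] & X3 == w[3])
    if (nrow(hit) > 0) return(hit$X4[which.max(hit$Frequency)])
  }
  if (length(words) >= 2) {                 # back off: last 2 words, 3-gram table
    w   <- tail(words, 2)
    hit <- subset(trigrams, X1 == w[1] & X2 == w[2])
    if (nrow(hit) > 0) return(hit$X3[which.max(hit$Frequency)])
  }
  if (length(words) >= 1) {                 # back off: last word, 2-gram table
    hit <- subset(bigrams, X1 == tail(words, 1))
    if (nrow(hit) > 0) return(hit$X2[which.max(hit$Frequency)])
  }
  unigrams$X1[which.max(unigrams$Frequency)]  # fall back to most frequent unigram
}
```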

Shiny App and Conclusion

Shiny App Screenshot

*(Two screenshots of the Shiny app.)*

How the App works

  • The app has 3 sections: Predict Input and Output, Wordcloud Select and Output, and Documentation.
  • User inputs a sentence in Predict Input, which generates the predicted set of words in Predict Output.
  • Wordcloud displays one of 3 wordclouds depending on the user's choice of 1-word, 2-word, or 3-word terms.
    • Hovering the mouse over a word in the wordcloud shows its occurrence frequency
  • Documentation displays the Description and Methodology of the project. (A minimal UI sketch follows.)
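A minimal sketch of this three-section layout; the wordcloud2 package is assumed for the hover-to-show-frequency behaviour, and all widget IDs and file names are illustrative:

```r
library(shiny)
library(wordcloud2)   # hovering over a word shows its frequency

ui <- fluidPage(
  tabsetPanel(
    tabPanel("Predict",
             textInput("sentence", "Predict Input"),
             textOutput("prediction")),                # Predict Output
    tabPanel("Wordcloud",
             radioButtons("ngram", "Wordcloud Select",
                          choices = c("1-word" = 1, "2-words" = 2, "3-words" = 3)),
             wordcloud2Output("cloud")),
    tabPanel("Documentation",
             includeMarkdown("documentation.md"))      # description and methodology
  )
)

server <- function(input, output) {
  output$prediction <- renderText(predict_next(input$sentence))
  output$cloud <- renderWordcloud2({
    df <- list(unigrams, bigrams, trigrams)[[as.integer(input$ngram)]]
    wordcloud2(data.frame(word = df$Term, freq = df$Frequency))
  })
}

shinyApp(ui, server)
```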

Performance Notes

  • RAM limitations addressed by using ~10% of the dataset for the model (sampling sketched below)
  • Prediction expedited by removing sparse entries (frequency < 5) from the generated N-Grams
  • Prediction accuracy ~14% using the DSCI benchmark
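A sketch of the ~10% sampling, assuming a line-oriented source file; the file names are placeholders:

```r
set.seed(123)                                           # reproducible sample
lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
keep  <- rbinom(length(lines), 1, 0.10) == 1            # keep ~10% of the lines
writeLines(lines[keep], "sample_corpus.txt")
```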