NLP Word Prediction - Capstone Project

Aravind Palempati
04-10-2018

Data Science Specialization

Summary

The objective of the app is to implement a predictive model that suggests one or more next words consistent with the sentence typed by the user. The Capstone dataset includes Twitter, news and blog text from HC Corpora. After cleansing, sampling and subsetting, the data is gathered in R data frames. Text mining (tm) and NLP techniques are then applied to create sets of word combinations (n-grams). These n-grams are the basis on which a Katz-style backoff algorithm predicts the next word. Some adaptations and heuristics were developed specifically to enhance this Shiny application.

How the app works
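As a rough illustration of how n-gram tables like the ones described in the summary can be produced, the sketch below counts word combinations in a tiny made-up sample. It is written in Python purely for illustration (the app itself stores its data in R data frames); the function names `tokenize` and `ngram_counts` and the sample sentences are assumptions, not part of the app.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(lines, n):
    """Count every run of n consecutive words within each line."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Tiny made-up sample standing in for the sampled Twitter, news and blog text.
sample = [
    "Thanks for the follow",
    "Thanks for the great day",
    "Looking forward to the weekend",
]

bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
print(bigrams.most_common(3))
print(trigrams.most_common(3))
```

Counting within each line keeps n-grams from spanning two unrelated texts. A real corpus would also need the cleansing and sampling steps mentioned in the summary before counting.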

Methodology

  1. Load four data frames of previously generated n-grams: 4-word, 3-word, 2-word and 1-word combinations.
  2. Read the user input (a word or sentence).
  3. Cleanse the user input: convert it to lowercase, remove profanities, and tokenize it, keeping the last four words.
  4. Call the prediction function, which is essentially the Stupid Backoff algorithm (a simplified alternative to Katz backoff):
  5. Search the fourgram data frame; if matches are found, show the five most probable. Otherwise,
  6. search the trigram data frame in the same way. Otherwise,
  7. search the bigram data frame in the same way.
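The cleansing and backoff steps above can be sketched as follows. This is a minimal illustration in Python, not the app's R code: the names `clean_input`, `predict` and `PROFANITY`, the placeholder profanity list, and the toy counts in `tables` are all assumptions made for the example.

```python
import re

# Hypothetical placeholder list; the app's actual profanity list is not given.
PROFANITY = {"darn", "heck"}


def clean_input(text, keep_last=4):
    """Lowercase, tokenize, drop listed profanities, keep the last few words."""
    words = re.findall(r"[a-z']+", text.lower())
    words = [w for w in words if w not in PROFANITY]
    return words[-keep_last:]


def predict(words, tables, top=5):
    """Back off from the highest-order table to the lowest.

    `tables` maps n-gram order -> {context tuple: {next word: count}}.
    The first order whose table contains the context supplies the answer.
    """
    for n in sorted(tables, reverse=True):  # 4, then 3, then 2
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough input words for this order
        followers = tables[n].get(context)
        if followers:
            # Highest count first; ties broken alphabetically.
            ranked = sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))
            return [word for word, _ in ranked[:top]]
    return []


# Toy counts standing in for the fourgram, trigram and bigram data frames.
tables = {
    4: {("thanks", "for", "the"): {"follow": 3, "support": 2}},
    3: {("for", "the"): {"first": 5, "follow": 3}},
    2: {("the",): {"same": 9, "first": 5}},
}

print(predict(clean_input("Thanks for the"), tables))   # fourgram match
print(predict(clean_input("Waiting for the"), tables))  # backs off to trigram
print(predict(clean_input("Over the"), tables))         # backs off to bigram
```

Because the sketch returns candidates from a single n-gram order, raw counts are enough for ranking. A scoring scheme that discounts lower orders would only matter if candidates from several orders were merged into one list.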

Application Usage

Just type a word, phrase or sentence. The app shows what the user entered, followed by its cleansed form. As the main result, up to five of the most probable n-gram predictions are displayed in a list control. The user can revise or replace the input, and the app updates its predictions accordingly. Another tab offers more extensive documentation.

Application

Link to Shiny App: prediction