9/14/2020

Introduction

  • This final project is part of the Data Science Specialization Capstone offered by Johns Hopkins University via Coursera.
  • The purpose of the project is to build a natural language processing (NLP) model that predicts the next word following a user-specified word or phrase.
  • Three types of text data from the SwiftKey.zip file (blogs, news and Twitter) were used to train the model.
  • Data cleaning and sampling techniques were applied to finalize the training data.
  • Four N-gram sets (unigram, bigram, trigram and quadgram) were then built from the cleaned data, and a Katz back-off algorithm was applied to predict the next word.
  • The final predictive model was optimized to work as a Shiny app.

Data Handling and Cleaning

  • A random sample was drawn from each of the three original data sources, and the samples were merged into a single data set.
  • The data were cleaned by converting text to lower case and removing punctuation, numbers and profanity.
  • The corresponding N-grams (unigram, bigram, trigram and quadgram) were then created.
  • The N-grams were sorted by cumulative frequency in descending order.
  • Finally, the four N-grams were saved as R-Compressed files (.RData files).
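
The cleaning and N-gram steps above can be sketched as follows. This is a minimal illustration in Python, not the project's R code: the names `clean`, `ngram_counts` and `sorted_ngrams` are hypothetical, the profanity list is a placeholder, and the tokenization is an assumption (letters only, split on whitespace).

```python
import re
from collections import Counter

# Placeholder list (assumption); the project's actual profanity list is not given.
PROFANITY = {"badword"}

def clean(text):
    """Lower-case the text, drop punctuation and digits, remove profanity tokens."""
    text = text.lower()
    # Replace anything that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in PROFANITY]

def ngram_counts(lines, n):
    """Count every run of n consecutive tokens within each cleaned line."""
    counts = Counter()
    for line in lines:
        tokens = clean(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def sorted_ngrams(lines, n):
    """Return (ngram, count) pairs, most frequent first."""
    return ngram_counts(lines, n).most_common()

sample = ["I love NLP, I love R!", "I love 2 badword code."]
bigrams = sorted_ngrams(sample, 2)  # the most frequent bigram comes first
```

Calling `sorted_ngrams(sample, n)` for n = 1 to 4 would give the four frequency tables; how they are stored on disk is left out of this sketch.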

Next Word Prediction Model

  • The four compressed data sets were first loaded.
  • The user-specified sequence of words is cleaned with the same techniques applied to the training data sets.
  • First, the quadgrams are searched: the first three words of a quadgram must match the last three words of the user-provided sentence.
  • If no quadgram matches, the model backs off to trigrams: the first two words of a trigram must match the last two words of the sentence.
  • If no trigram matches, the model backs off to bigrams: the first word of a bigram must match the last word of the sentence.
  • Finally, if no bigram matches, the most frequent unigram is used as the next word.
  • If non-English words or phrases are entered, the model reports that no match was found.
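
The back-off sequence above can be sketched as follows. This is a simplified illustration in Python under stated assumptions, not the project's R implementation: `predict_next` and the `tables` layout (a dict mapping N to a dict of N-gram tuples and counts) are hypothetical, and candidates are ranked by raw frequency without the discounting step of a full Katz back-off. The non-English case is not modelled.

```python
import re

def predict_next(phrase, tables):
    """Return (predicted_word, n), where n is the N-gram order that matched.

    tables: dict mapping n (1 to 4) to a dict {ngram tuple: count} (assumed layout).
    """
    # Apply the same basic cleaning as the training data (lower case, letters only).
    tokens = re.sub(r"[^a-z\s]", " ", phrase.lower()).split()
    for n in (4, 3, 2):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # the input is too short for this N-gram order
        # Keep the N-grams whose leading n-1 words equal the context.
        matches = {g: c for g, c in tables[n].items() if g[:-1] == context}
        if matches:
            best = max(matches, key=matches.get)
            return best[-1], n
    # No higher-order match: fall back to the most frequent unigram.
    best = max(tables[1], key=tables[1].get)
    return best[0], 1
```

Returning the matched order alongside the word makes it straightforward to report which N-gram produced the prediction.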

Shiny Application

  • Two pages are presented: a “Home” page showing the main model box and an “About” page detailing the app's features.
  • The user enters a word or phrase in the text box, then presses the “Predict Next Word” button.
  • The predicted next word is displayed with a note indicating which specific N-gram was used for next word prediction.