9/14/2020
Introduction
- This final project is part of the Data Science Specialization Capstone offered by Johns Hopkins University via Coursera.
- The purpose of the project is to build a Natural Language Processing (NLP) model that predicts the next word following a user-specified word or phrase.
- Three types of data from the SwiftKey.zip file (blogs, news and Twitter) were used to train the model.
- Data cleaning and sampling techniques were applied to finalize the training data.
- Four N-gram tables (unigram, bigram, trigram and quadgram) were then created from the cleaned data, and a Katz back-off algorithm was applied to predict the next word.
- The final predictive model was optimized to work as a Shiny app.
Data Handling and Cleaning
- A random sample was drawn from each of the three original data sources and merged into a single data set.
- The data was cleaned by converting it to lower case and removing punctuation, numbers, profanity, etc.
- The corresponding N-grams (unigram, bigram, trigram and quadgram) were then created.
- The N-grams were sorted by cumulative frequency in descending order.
- Finally, the four N-gram tables were saved as compressed R data files (.RData files).
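The steps above can be sketched in a few lines. This is a minimal illustration in Python, not the project's R code: the function names, the sampling fraction, the seed and the tiny `PROFANITY` list are all assumptions made for the example, and saving to .RData is left out.

```python
import random
import re
from collections import Counter

# Hypothetical placeholder; the word list used in the project is not given.
PROFANITY = {"darn", "heck"}

def sample_lines(lines, fraction, seed=0):
    """Draw a reproducible random sample of the input lines."""
    rng = random.Random(seed)
    return rng.sample(lines, int(len(lines) * fraction))

def clean(text):
    """Lower-case, drop apostrophes, digits and punctuation, remove profanity."""
    text = text.lower().replace("'", "")
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return [w for w in text.split() if w not in PROFANITY]

def ngram_counts(lines, n):
    """Count n-grams within each line, sorted by descending frequency."""
    counts = Counter()
    for line in lines:
        words = clean(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common()

corpus = ["I love NLP, 2 much!", "I love R.", "Darn, I love NLP"]
bigrams = ngram_counts(corpus, 2)
print(bigrams[0])  # the most frequent bigram and its count
```

Counting per line keeps N-grams from spanning two unrelated blog posts, news items or tweets.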
Next Word Prediction Model
- The four compressed data sets were first loaded.
- The user-specified sequence of words is cleaned with the same techniques that were applied to the training data.
- The quadgrams are searched first: the first three words of a quadgram must match the last three words of the user-provided sentence.
- If no quadgram matches, the model backs off to the trigrams: the first two words of a trigram must match the last two words of the sentence.
- If no trigram matches, the model backs off to the bigrams: the first word of a bigram must match the last word of the sentence.
- Finally, if no bigram matches, the most frequent word from the unigrams is used as the next word.
- If non-English words or phrases are entered, the model reports that no match was found.
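The lookup order described above can be sketched as follows. This is an illustrative Python sketch with toy counts, not the project's R implementation: `build_table`, `predict_next` and the sample N-grams are hypothetical names and data. It shows only the back-off order from quadgram down to unigram. A full Katz back-off would also use discounted probabilities, and the "no match found" case for non-English input is not modeled here.

```python
from collections import Counter

def build_table(counts):
    """Map each context (all words but the last) to its most frequent next word."""
    best = {}
    for gram, freq in counts.items():
        context, nxt = gram[:-1], gram[-1]
        if context not in best or freq > best[context][1]:
            best[context] = (nxt, freq)
    return {context: word for context, (word, _) in best.items()}

def predict_next(words, tables):
    """Return (predicted word, order of the N-gram table that matched)."""
    for n in (4, 3, 2):
        context = tuple(words[-(n - 1):])
        # Skip this order if the input is shorter than the required context.
        if len(context) == n - 1 and context in tables[n]:
            return tables[n][context], n
    return tables[1], 1  # fall back to the most frequent unigram

# Toy counts, invented for the example.
quad = Counter({("thanks", "for", "the", "follow"): 3,
                ("thanks", "for", "the", "help"): 1})
tri = Counter({("for", "the", "first"): 5})
bi = Counter({("the", "same"): 9})
tables = {4: build_table(quad), 3: build_table(tri), 2: build_table(bi), 1: "the"}

print(predict_next(["thanks", "for", "the"], tables))
print(predict_next(["wait", "for", "the"], tables))
```

Returning the N-gram order alongside the word makes it easy to tell the user which table produced the prediction.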
Shiny Application
- Two pages are presented: a “Home” page showing the main model box and an “About” page detailing the app's features.
- The user enters a word or phrase in the text box, then presses the “Predict Next Word” button.
- The predicted next word is displayed with a note indicating which specific N-gram was used for next word prediction.