Coursera Data Science Capstone - Presentation

Manoj Prasad
24 April, 2020

This is the pitch for the Coursera Data Science Capstone Project offered by Johns Hopkins University.
This presentation provides an overview of the work that went into completing the project.
Preprocessing data
- Before modelling, the data had to be sampled because the raw data was very large. Only a subset of the News, Twitter and Blogs data was sampled for modelling
- The sampled data was tokenized and cleaned to remove profanity
- The cleaned corpus was then used to construct Unigram, Bigram and Trigram datasets
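
The preprocessing steps above can be illustrated with a minimal Python sketch. It is not the project's actual code: the sampling fraction, the tokenizing rule, the tiny `PROFANITY` set and all function names are assumptions made for illustration.

```python
import random
import re
from collections import Counter

# Hypothetical placeholder list; a real profanity list would be loaded from a file.
PROFANITY = {"darn", "heck"}

def sample_lines(lines, fraction, seed=42):
    """Keep a random subset of the lines so the corpus stays manageable."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def tokenize(line):
    """Lowercase a line, split it into word tokens and drop profanity."""
    tokens = re.findall(r"[a-z']+", line.lower())
    return [t for t in tokens if t not in PROFANITY]

def ngram_counts(token_lists, n):
    """Count every run of n consecutive tokens within each line."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus standing in for the sampled News, Twitter and Blogs lines.
corpus = ["I love data science", "I love heck R", "data science is fun"]
cleaned = [tokenize(line) for line in corpus]
unigrams = ngram_counts(cleaned, 1)
bigrams = ngram_counts(cleaned, 2)
trigrams = ngram_counts(cleaned, 3)
```

Note that dropping a profane token joins its neighbours, so `("love", "r")` is counted as a bigram in the second line; whether that is desirable is a design choice the source does not discuss.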
Modelling
- The following modelling techniques were applied to perform predictions
  - Stupid Backoff Model
  - Katz's Backoff Model

Stupid Backoff Model
- This was used as an initial attempt at prediction
- The model looks for a match first among the Trigrams, then the Bigrams, then the Unigrams, ranking candidates by their Maximum Likelihood Estimate (MLE)
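
The lookup order described above can be sketched in Python as follows. This is an illustration under assumptions, not the project's implementation: the function name, the toy counts and the back-off factor of 0.4 (a value commonly used with Stupid Backoff) are not taken from the source.

```python
from collections import Counter

BACKOFF = 0.4  # assumed penalty per back-off step; it does not change ranking within a level

def stupid_backoff(context, unigrams, bigrams, trigrams, top_k=5):
    """Return up to top_k (word, score) pairs for the word that follows context."""
    context = list(context)
    candidates = {}
    # 1. Trigram level: match on the last two words of the context.
    if len(context) >= 2:
        w1, w2 = context[-2:]
        history = bigrams.get((w1, w2), 0)
        if history:
            candidates = {c: n / history
                          for (a, b, c), n in trigrams.items() if (a, b) == (w1, w2)}
    # 2. Bigram level: match on the last word only.
    if not candidates and context:
        last = context[-1]
        history = unigrams.get((last,), 0)
        if history:
            candidates = {c: BACKOFF * n / history
                          for (b, c), n in bigrams.items() if b == last}
    # 3. Unigram level: fall back to the most frequent words overall.
    if not candidates:
        total = sum(unigrams.values())
        candidates = {w: BACKOFF ** 2 * n / total for (w,), n in unigrams.items()}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Toy counts keyed by word tuples (hypothetical data).
unigrams = Counter({("i",): 2, ("love",): 2, ("data",): 2, ("science",): 2,
                    ("r",): 1, ("is",): 1, ("fun",): 1})
bigrams = Counter({("i", "love"): 2, ("love", "data"): 1, ("love", "r"): 1,
                   ("data", "science"): 2, ("science", "is"): 1, ("is", "fun"): 1})
trigrams = Counter({("i", "love", "data"): 1, ("i", "love", "r"): 1,
                    ("love", "data", "science"): 1, ("data", "science", "is"): 1,
                    ("science", "is", "fun"): 1})

print(stupid_backoff(["i", "love"], unigrams, bigrams, trigrams))
```

The linear scan over the count tables keeps the sketch short; a practical version would index the tables by history so that lookups stay fast.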
Katz's Backoff Model (Please scroll down for details)
- This was the next algorithm tried for predictions
- This is a type of N-gram model, which predicts $x_i$ based on $x_{i-(n-1)}, \dots, x_{i-1}$
- The maximum likelihood estimate (MLE) assigns a probability to each word that exists in the corpus. The probability is the number of times an N-gram occurs divided by the total number of N-gram events.
- An independence assumption (the Markov assumption) is made so that each word depends only on the last $n-1$ words, giving $P(x_i \mid x_{i-(n-1)}, \dots, x_{i-1})$
- The model counts the occurrences of each word or combination of words and divides by the total number of occurrences.
- However, under MLE no probability is assigned to unobserved N-grams. Hence part of the probability assigned to observed N-grams is redistributed to unobserved N-grams
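
For reference, the redistribution described above is usually written in the following standard form of Katz's back-off estimate. The count function $C$, discount $d$, back-off weight $\alpha$ and threshold $k$ are textbook symbols introduced here for illustration; the source does not state which discounting scheme or threshold the project used.

```latex
P_{bo}(x_i \mid x_{i-(n-1)}, \dots, x_{i-1}) =
\begin{cases}
d \, \dfrac{C(x_{i-(n-1)}, \dots, x_{i-1}, x_i)}{C(x_{i-(n-1)}, \dots, x_{i-1})}
  & \text{if } C(x_{i-(n-1)}, \dots, x_i) > k \\[2ex]
\alpha \, P_{bo}(x_i \mid x_{i-(n-2)}, \dots, x_{i-1})
  & \text{otherwise}
\end{cases}
```

Here $d < 1$ shaves probability mass off the observed N-grams, and $\alpha$ is chosen per history so that the freed mass is spread over the unobserved continuations and the distribution still sums to 1. A threshold of $k = 0$ is a common choice.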

The performance of each model was measured on two criteria: speed and accuracy.
The sampled data was split further into Training and Test datasets to measure accuracy.
Stupid Backoff Model
- Speed: On average, a prediction took around 1 second
- Accuracy: The average accuracy was around 40%, which is not a great result
Katz's Backoff Model
- Speed: On average, a prediction took around 5 seconds
- Accuracy: The average accuracy was around 70%, which is much better than the Stupid Backoff Model
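
Measurements like the ones above could be gathered with a small harness such as the Python sketch below. The source does not say how accuracy was defined, so counting a hit when the true next word appears among the top `top_k` guesses is an assumption, as are the function names, the 80/20 split and the `dummy_predict` stand-in.

```python
import random
import time

def split_train_test(lines, test_fraction=0.2, seed=1):
    """Shuffle the sampled lines and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def evaluate(predict, test_sentences, top_k=5):
    """Return (accuracy, mean seconds per prediction) for a predict(context) function."""
    hits = trials = 0
    elapsed = 0.0
    for sentence in test_sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            start = time.perf_counter()
            guesses = predict(words[:i])[:top_k]
            elapsed += time.perf_counter() - start
            hits += words[i] in guesses  # True counts as 1
            trials += 1
    if trials == 0:
        return 0.0, 0.0
    return hits / trials, elapsed / trials

def dummy_predict(context):
    """Hypothetical stand-in for a trained model: always guesses the same words."""
    return ["the", "data"]

accuracy, seconds = evaluate(dummy_predict, ["i love data", "the data"])
print(f"accuracy={accuracy:.2f}, mean time per prediction={seconds:.6f}s")
```

Any model that maps a list of context words to a ranked list of guesses can be passed as `predict`, so both back-off models could be compared with the same test sentences.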
Tradeoff Analysis (Please scroll down for details)
- Both models have their own pros and cons: Stupid Backoff is faster, while Katz's Backoff is more accurate
- The more accurate model, Katz's Backoff, was chosen. A worst-case delay of 6 seconds should be acceptable from a user experience standpoint

Katz's Backoff Model was deployed as a Shiny app, available at https://manojprasad.shinyapps.io/WordPredictionShinyApp/
The App has three tabs:
- Predict Next Word App
  - This hosts the prediction model
  - A Text Box is provided for typing in words
  - As text is typed, the prediction results are displayed within 5 seconds. No button click is required to trigger the predictions
  - Up to 5 of the best prediction results (if available) are displayed below the Text Box
  - The prediction results are clickable. Once a prediction result is clicked, the predicted word is filled into the Text Box
- Help (Please scroll down for details)
  - This tab provides more details on how to use the application
- Caveats
  - This tab documents some caveats for the application. Please refer to it for more details