Coursera Data Science Capstone - Presentation

Manoj Prasad
24 April, 2020

Overview

  • This is the pitch for the Coursera Data Science Capstone Project offered by Johns Hopkins University
  • This presentation provides an overview of the work that went into completing this project
  • Preprocessing data
    • Before modelling, the data had to be sampled since the raw data was huge. Only a subset of the News, Twitter, and Blogs data was sampled for modelling
    • The sampled data was tokenized and cleaned to remove profanity
    • The cleaned corpus was then used to construct Unigram, Bigram, and Trigram datasets
  • Modelling
    • The following modelling techniques were applied to perform predictions
      • Stupid Backoff Model
      • Katz's Backoff Model
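
The preprocessing steps above can be sketched roughly as follows. This is a minimal Python illustration, not the project's actual code: the function names (tokenize, clean, ngram_counts), the tokenization rule, and the two-word PROFANITY set are all hypothetical placeholders, and the sampling step is omitted.

```python
import re
from collections import Counter

# Hypothetical placeholder list; the project's real profanity list is not shown.
PROFANITY = {"darn", "heck"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def clean(tokens, banned=PROFANITY):
    """Drop tokens that appear in the banned-word set."""
    return [t for t in tokens if t not in banned]

def ngram_counts(sentences, n):
    """Count n-grams (as tuples) within each sentence separately."""
    counts = Counter()
    for sentence in sentences:
        # Note: removing a banned token makes its neighbours adjacent.
        tokens = clean(tokenize(sentence))
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

sample = ["I love data science", "I love heck coffee", "Data science is fun"]
unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
```

Counting within each sentence keeps N-grams from spanning sentence boundaries, which is one reasonable choice among several.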

Model - Details

  • Stupid Backoff Model
    • This was used as an initial attempt to perform predictions
    • The model looks for a match in the Trigrams first, then the Bigrams, then the Unigrams, and picks the candidate with the highest Maximum Likelihood Estimate (MLE)
  • Katz's Backoff Model (Please scroll down for details)
    • This was the next algorithm tried for predictions
    • This is a type of N-gram model, which predicts x_i based on x_{i−(n−1)}, …, x_{i−1}
    • The Maximum Likelihood Estimate (MLE) assigns a probability to each word that exists in the corpus: the number of times an N-gram occurs divided by the total number of N-gram events
    • An independence assumption (the Markov assumption) was made so that each word depends only on the previous n−1 words, i.e. P(x_i | x_{i−(n−1)}, …, x_{i−1})
    • The model counts the occurrences of a word or combination of words and divides by the total number of occurrences
    • However, MLE assigns no probability to unobserved N-grams. Hence part of the probability assigned to observed N-grams is discounted and redistributed to unobserved N-grams
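
Both back-off ideas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, the 0.4 back-off factor is the value commonly used for Stupid Backoff rather than one given in this presentation, and the Katz-style function is a simplification that covers only the Bigram-to-Unigram step and uses a fixed absolute discount in place of the Good-Turing discounts of Katz's original formulation.

```python
from collections import Counter

def build_counts(tokens, max_n=3):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    return {n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
            for n in range(1, max_n + 1)}

def stupid_backoff_predict(context, counts, k=5, alpha=0.4):
    """Rank next words by relative frequency, backing off to shorter contexts."""
    context = tuple(context)[-2:]  # a trigram model uses at most two context words
    penalty = 1.0
    for start in range(len(context) + 1):
        ctx = context[start:]
        n = len(ctx) + 1
        if ctx:
            denom = counts[n - 1][ctx]
            scores = {g[-1]: penalty * c / denom
                      for g, c in counts[n].items() if g[:-1] == ctx}
        else:
            total = sum(counts[1].values())
            scores = {g[0]: penalty * c / total for g, c in counts[1].items()}
        if scores:
            # Highest score first; ties broken alphabetically.
            return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        penalty *= alpha  # each back-off step multiplies the score by alpha
    return []

def katz_bigram_prob(word, prev, counts, discount=0.5):
    """Simplified Katz-style P(word | prev) with an assumed absolute discount."""
    unigrams, bigrams = counts[1], counts[2]
    prev_count = unigrams[(prev,)]
    if prev_count == 0:
        # Unknown context: fall back to the unigram MLE.
        return unigrams[(word,)] / sum(unigrams.values())
    seen = {g[1]: c for g, c in bigrams.items() if g[0] == prev}
    if word in seen:
        return (seen[word] - discount) / prev_count
    # Probability mass left over after discounting every observed bigram.
    leftover = 1.0 - sum(c - discount for c in seen.values()) / prev_count
    unseen_total = sum(c for (w,), c in unigrams.items() if w not in seen)
    if unseen_total == 0:
        return 0.0
    # Share the leftover mass among unseen words in proportion to unigram counts.
    return leftover * unigrams[(word,)] / unseen_total

counts = build_counts("the cat sat on the mat the cat ran".split())
top_words = [w for w, _ in stupid_backoff_predict(["the", "cat"], counts)]
vocab = [w for (w,) in counts[1]]
total_mass = sum(katz_bigram_prob(w, "the", counts) for w in vocab)
```

In this toy corpus the Katz-style probabilities for the context "the" sum to 1 over the vocabulary, which shows the discounted mass being handed to unobserved Bigrams rather than lost.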

Model - Performance

  • The performance of each model was measured on two criteria: speed and accuracy
  • The sampled data was split further into Training and Test datasets to measure accuracy
  • Stupid Backoff Model
    • Speed: On average, a prediction took around 1 second
    • Accuracy: The average accuracy was around 40%, which is not a great result
  • Katz's Backoff Model
    • Speed: On average, a prediction took around 5 seconds
    • Accuracy: The average accuracy was around 70%, which is much better than the Stupid Backoff Model
  • Tradeoff Analysis (Please scroll down for details)
    • Each model has its own pros and cons: Stupid Backoff is faster, while Katz's Backoff is more accurate
    • Katz's Backoff was chosen because it is the more accurate model. A worst-case delay of 6 seconds should be acceptable from a user experience standpoint
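
A speed and accuracy measurement of this kind could be set up along the following lines. This is a hedged Python sketch: the evaluate function, the toy lookup table, and the held-out examples are hypothetical and serve only to show the mechanics. They do not reproduce the figures reported above.

```python
import time

def evaluate(predict, test_trigrams, k=1):
    """Return (top-k accuracy, mean seconds per prediction) for a predictor.

    predict takes a (w1, w2) context and returns a ranked list of words.
    """
    if not test_trigrams:
        return 0.0, 0.0
    hits = 0
    start = time.perf_counter()
    for w1, w2, target in test_trigrams:
        if target in predict((w1, w2))[:k]:
            hits += 1
    elapsed = time.perf_counter() - start
    n = len(test_trigrams)
    return hits / n, elapsed / n

# Toy stand-in for a trained model: a fixed lookup table of ranked predictions.
table = {("data", "science"): ["is", "course"], ("i", "love"): ["data"]}

def toy_predict(context):
    return table.get(tuple(context), [])

held_out = [("data", "science", "is"), ("i", "love", "coffee"),
            ("data", "science", "course"), ("see", "you", "soon")]
top1_accuracy, mean_latency = evaluate(toy_predict, held_out, k=1)
top2_accuracy, _ = evaluate(toy_predict, held_out, k=2)
```

Passing the predictor in as a function lets the same harness score both models on the same Test dataset, so the speed and accuracy numbers are directly comparable.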

ShinyApp - Model Deployment

  • Katz's Backoff Model was deployed as a ShinyApp: https://manojprasad.shinyapps.io/WordPredictionShinyApp/
  • The App has three tabs:
    • Predict Next Word App
      • This hosts the prediction model
      • A Text Box is provided for typing in words
      • As text is typed, the prediction results are displayed within 5 seconds. No button click is required to trigger the predictions
      • Up to 5 of the best prediction results (if available) are displayed below the Text Box
      • The prediction results are clickable. Once a prediction result is clicked, the predicted word is filled into the Text Box
    • Help (Please scroll down for details)
      • More details about the usage of this application are available under this tab
    • Caveats
      • This tab documents some caveats for this application