Coursera Data Science Capstone - Presentation
Manoj Prasad
24 April, 2020
Overview
- This is the pitch for the Coursera Data Science Capstone Project offered by Johns Hopkins University
- This presentation gives an overview of the work that went into completing the project
- Pre-processing data
- The raw data was too large to model directly, so only a subset of the News, Twitter and Blogs data was sampled before modelling
- The sampled data was tokenized and cleaned to remove profanity
- The cleaned corpus was then used to construct Unigram, Bigram and Trigram datasets
- Modelling
- The following modelling techniques were applied to perform predictions
- Stupid Backoff Model
- Katz's Backoff Model
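The pre-processing steps above (sampling, tokenizing, profanity removal, N-gram construction) can be sketched as follows. This is a minimal illustration in Python, not the project's actual code: the function names, the tiny `PROFANITY` set, the regular expression and the three-line corpus are all assumptions made for the example.

```python
import random
import re
from collections import Counter

# Hypothetical placeholder list; a real profanity list would be loaded from a file.
PROFANITY = {"darn", "heck"}

def sample_lines(lines, fraction, seed=42):
    """Keep a random subset of lines (fraction between 0 and 1)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def tokenize(line):
    """Lowercase a line, split it into word tokens and drop profanity."""
    tokens = re.findall(r"[a-z']+", line.lower())
    return [t for t in tokens if t not in PROFANITY]

def ngram_counts(token_lists, n):
    """Count n-grams (as tuples) without crossing line boundaries."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus standing in for the sampled News, Twitter and Blogs lines.
corpus = ["I love data science", "I love darn coffee", "Data science is fun"]
cleaned = [tokenize(line) for line in sample_lines(corpus, fraction=1.0)]
unigrams = ngram_counts(cleaned, 1)
bigrams = ngram_counts(cleaned, 2)
trigrams = ngram_counts(cleaned, 3)
```

Note that dropping a profane token makes its neighbours adjacent ("love coffee" in the toy corpus). Discarding the whole line instead would be an equally reasonable choice.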
Model - Details
- Stupid Backoff Model
- This was used as an initial attempt to perform predictions
- The model first looks for a match among the Trigrams, then the Bigrams, then the Unigrams, and picks the candidate with the highest Maximum Likelihood Estimate (MLE)
- Katz's Backoff Model (Please scroll down for details)
- This was the next algorithm tried for predictions
- This is a type of N-gram model which predicts xi based on xi−(n−1), …, xi−1
- The Maximum Likelihood Estimate (MLE) assigns a probability to each word that exists in the corpus. The probability is the number of times the N-gram was observed divided by the number of times its preceding (n−1)-word context was observed.
- A Markov (independence) assumption was made, so that each word depends only on the previous n−1 words, which gives P(xi | xi−(n−1), …, xi−1)
- The model counts how often each word or combination of words occurs and divides by the total number of occurrences.
- However, MLE assigns no probability to unobserved N-grams. Hence part of the probability assigned to observed N-grams is discounted and redistributed to unobserved N-grams
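The two back-off ideas described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the deployed model: it backs off from Bigrams to Unigrams only (the project starts at Trigrams), the toy sentence is made up, the back-off factor 0.4 is the value commonly used for Stupid Backoff, and a fixed absolute discount of 0.5 stands in for the Good-Turing discounting used in the original formulation of Katz's model.

```python
from collections import Counter, defaultdict

tokens = "the cat sat on the mat the cat ate".split()
unigrams = Counter(tokens)
followers = defaultdict(Counter)  # followers[prev][word] = bigram count
for prev, word in zip(tokens, tokens[1:]):
    followers[prev][word] += 1

def stupid_backoff_score(prev, word, lam=0.4):
    """Bigram relative frequency if seen, else a penalised unigram frequency.

    The result is a score, not a normalised probability.
    """
    seen = followers.get(prev, {})
    if seen.get(word, 0) > 0:
        return seen[word] / sum(seen.values())
    return lam * unigrams[word] / len(tokens)

def katz_prob(prev, word, discount=0.5):
    """Simplified Katz back-off probability P(word | prev)."""
    seen = followers.get(prev, {})
    total = sum(seen.values())
    if total == 0:  # context never observed: fall back to the unigram MLE
        return unigrams[word] / len(tokens)
    if seen.get(word, 0) > 0:  # observed bigram: discounted MLE
        return (seen[word] - discount) / total
    # Probability mass freed by discounting the observed bigrams.
    left_over = discount * len(seen) / total
    # Share it among unobserved words in proportion to their unigram counts.
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen)
    if unseen_mass == 0:
        return 0.0
    return left_over * unigrams[word] / unseen_mass

def predict(prev, k=3):
    """Return the k most probable next words after `prev`."""
    return sorted(unigrams, key=lambda w: katz_prob(prev, w), reverse=True)[:k]
```

With this scheme the probabilities over the vocabulary still sum to 1 for any observed context, which is the property that distinguishes Katz's model from the unnormalised Stupid Backoff scores.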
Model - Performance
- The performance of each model was measured on two criteria: speed and accuracy
- The sampled data was split further into Training and Test datasets to measure accuracy
- Stupid Backoff Model
- Speed: On average, a prediction took around 1 second
- Accuracy: The average accuracy was around 40%, which is not a great result
- Katz's Backoff Model
- Speed: On average, a prediction took around 5 seconds
- Accuracy: The average accuracy was around 70%, which is much better than the Stupid Backoff Model
- Tradeoff Analysis (Please scroll down for details)
- Both models have their pros and cons: one is faster, the other is more accurate
- The more accurate model, Katz's Backoff, was chosen. A worst-case delay of 6 seconds should be acceptable from a user-experience standpoint
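A speed and accuracy comparison of this kind could be run with a small harness like the one below. It is a sketch only: the presentation does not state how accuracy was defined, so top-k next-word accuracy on held-out sentences is an assumption, and `evaluate`, `dummy_predict` and the two test sentences are hypothetical names and data introduced for illustration.

```python
import time

def evaluate(predict, test_sentences, k=1):
    """Return (top-k accuracy, mean seconds per prediction) for a predictor.

    `predict(context_words)` is assumed to return a ranked list of
    candidate next words for the given list of preceding words.
    """
    hits = trials = 0
    elapsed = 0.0
    for sentence in test_sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            start = time.perf_counter()
            guesses = predict(words[:i])[:k]
            elapsed += time.perf_counter() - start
            hits += words[i] in guesses  # True counts as 1
            trials += 1
    if trials == 0:
        return 0.0, 0.0
    return hits / trials, elapsed / trials

# Toy stand-in for a real model: always guesses the same three words.
def dummy_predict(context):
    return ["the", "of", "and"]

accuracy, seconds = evaluate(dummy_predict, ["in the end", "one of the best"], k=3)
```

Passing each model's prediction function to the same harness keeps the comparison fair, since both are timed and scored on identical test positions.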
ShinyApp - Model Deployment
- Katz's Backoff Model was deployed as a ShinyApp, available at the link below
https://manojprasad.shinyapps.io/WordPredictionShinyApp/
- The App has three tabs:
- Predict Next Word App
- This hosts the prediction model
- A Text Box is provided to type in the words
- As text is typed, the prediction results are displayed within 5 seconds. No button click is required to trigger the predictions
- Up to 5 of the best prediction results (if available) are displayed below the text box
- The prediction results are clickable. Once a prediction result is clicked, the predicted word is filled into the Text Box
- Help (Please scroll down for details)
- More details about the usage of this application can be found under this tab
- Caveats
- This tab documents some caveats for this application