Next Word Prediction Application
Tim Kerins
01/23/2020
Johns Hopkins Data Science Capstone Project
Note: use the keyboard arrow keys to navigate through the slides
Overview
Interactive Word Prediction Application
Takes an input phrase and predicts the next word
Uses a corpus provided by SwiftKey (Twitter posts, blogs, news feeds)
Written in R using NLP & Quanteda natural language packages
Uses 3-gram Stupid Backoff model applied to a cleaned dataset
The Shiny application can be accessed on the web at: https://tkerins24.shinyapps.io/PredictionApp/
Methods/Algorithms
Pre-processing/Cleansing
Downloaded/merged datasets, then took a 10% sample
Numbers, punctuation, URLs, and profanity removed; text converted to lowercase
Data tokenized into 1-, 2-, and 3-grams with frequency counts
Resulting n-gram tables indexed/stored in .RDS files (see the sketch below)
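A minimal sketch of this cleaning/tokenization step with quanteda and data.table. The file names, seed, and profanity list are placeholders, not the exact capstone code:

```r
library(quanteda)
library(data.table)

set.seed(1234)
raw <- c(readLines("en_US.twitter.txt", skipNul = TRUE),
         readLines("en_US.blogs.txt",   skipNul = TRUE),
         readLines("en_US.news.txt",    skipNul = TRUE))
sampled <- sample(raw, floor(0.10 * length(raw)))     # 10% sample

profanity <- readLines("profanity.txt")               # assumed bad-word list

toks <- tokens(sampled,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, profanity)

# Build 1-, 2-, and 3-gram frequency tables and store each as an .RDS file
for (n in 1:3) {
  ng   <- tokens_ngrams(toks, n = n, concatenator = " ")
  freq <- data.table(ngram = unlist(as.list(ng), use.names = FALSE))[
            , .N, by = ngram][order(-N)]
  saveRDS(freq, sprintf("ngram_%d.rds", n))
}
```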
Prediction
3-gram Stupid Backoff algorithm (0.4 discount factor)
If a matching 3-gram is found, return the highest-frequency last word
Else, back off to 2-grams; if a match is found, return the highest-frequency last word
If still no match, back off to 1-grams and return the highest-frequency word (see the sketch below)
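A minimal sketch of that backoff lookup, assuming the .RDS tables above are loaded as data.tables with columns `ngram` and `N`; the function and column names are illustrative:

```r
library(data.table)

uni <- readRDS("ngram_1.rds")
bi  <- readRDS("ngram_2.rds")
tri <- readRDS("ngram_3.rds")

last_word <- function(x) tail(strsplit(x, " ")[[1]], 1)

predict_next <- function(phrase) {
  w <- tolower(unlist(strsplit(phrase, "\\s+")))
  w <- tail(w, 2)                                   # keep the last two input words

  # 1. Trigrams whose first two words match the input tail
  if (length(w) == 2) {
    hits <- tri[startsWith(ngram, paste(w[1], w[2], ""))]
    if (nrow(hits) > 0) return(last_word(hits[which.max(N), ngram]))
  }

  # 2. Back off to bigrams starting with the last input word
  #    (the 0.4 discount only matters when scores from different orders are
  #     compared; for a single top-1 pick per order, raw counts suffice)
  hits <- bi[startsWith(ngram, paste0(tail(w, 1), " "))]
  if (nrow(hits) > 0) return(last_word(hits[which.max(N), ngram]))

  # 3. Back off to the most frequent unigram
  uni[which.max(N), ngram]
}

predict_next("thanks for the")   # returns a single predicted word
```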
Accuracy/Performance/Resource Usage
Accuracy: ~30%; prediction time < 1 second
Model tuning activities tried:
3-gram Katz Backoff (much longer prediction time)
4-gram Stupid & Katz Backoff (little improvement)
Table indexing (significant performance improvement; see the sketch below)
Dropping low-frequency n-grams (little improvement)
Best overall model: 3-gram Stupid Backoff
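One way such table indexing can be done, sketched with data.table keys; splitting each 3-gram into a prefix/word pair is an assumption about how the stored tables are organized:

```r
library(data.table)

tri <- readRDS("ngram_3.rds")

# Split each stored 3-gram into its 2-word prefix and final word
tri[, c("prefix", "word") := .(sub(" \\S+$", "", ngram), sub("^.* ", "", ngram))]
setkey(tri, prefix)                                   # binary-search index on the prefix

tri["thanks for", nomatch = 0L][which.max(N), word]   # keyed lookup instead of a full scan
```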
Application Instructions
Open the app at https://tkerins24.shinyapps.io/PredictionApp/
Type phrase into “Input Phrase” box, then press “Predict”
The prediction will appear in the “Predicted Word” box.
Click on the “Clear Input” button to repeat the process