Next Word Prediction Application

Tim Kerins
01/23/2020

Johns Hopkins Data Science Capstone Project

Note: use the keyboard arrow keys to navigate through the slides

Overview

Interactive Word Prediction Application

Takes an input phrase and predicts the next word
Uses the corpus provided by SwiftKey (Twitter, blogs, news feeds)
Written in R using the NLP and quanteda natural language processing packages
Uses a 3-gram Stupid Backoff model applied to a cleaned dataset
The Shiny application can be accessed on the web at: [https://tkerins24.shinyapps.io/PredictionApp/]

Methods/Algorithms

Pre-processing/Cleansing

Downloaded/merged datasets, then took a 10% sample
Removed numbers, punctuation, URLs, and profanity; converted text to lowercase
Data tokenized into 1-, 2-, and 3-grams, ranked by frequency
Resulting n-gram tables indexed/stored in .RDS files
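
The pre-processing steps above can be sketched roughly as follows. This is an illustration in plain Python, not the R/quanteda code behind the app, which is not shown in these slides. The function names, the regular expressions, the fixed random seed, and the tiny `PROFANITY` set are all assumptions made for the example.

```python
import random
import re
from collections import Counter

# Hypothetical stand-in; the profanity list used by the app is not given.
PROFANITY = {"darn", "heck"}

def sample_lines(lines, fraction=0.10, seed=42):
    """Take a random sample of the lines (10% by default)."""
    k = int(len(lines) * fraction)
    return random.Random(seed).sample(lines, k)

def clean(text):
    """Lowercase, then drop URLs, numbers, punctuation, and listed profanity."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # URLs
    text = re.sub(r"[^a-z'\s]", " ", text)             # numbers, punctuation
    words = (w.strip("'") for w in text.split())
    return [w for w in words if w and w not in PROFANITY]

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def build_tables(lines, max_n=3):
    """Return {n: Counter of n-grams} for n = 1..max_n over all lines."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean(line)
        for n in tables:
            tables[n].update(ngram_counts(tokens, n))
    return tables
```

Calling `build_tables(sample_lines(all_lines))` would give one frequency table per n-gram order, and `tables[3].most_common(10)` would list the ten most frequent 3-grams.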

Prediction

3-gram Stupid Backoff algorithm (0.4 discount factor).
- If the input matches a 3-gram, return the last word of the highest-frequency match.
- Else, back off to 2-grams. If there is a match, return the last word of the highest-frequency match.
- If there is still no match, back off to 1-grams and return the highest-frequency word.
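
The back-off steps above can be sketched as below. This is a minimal Python illustration, not the app's R implementation; `build_lookup`, `predict_next`, and the table layout are hypothetical names chosen for the example. The sketch follows the first-match rule exactly as listed and does not apply the 0.4 factor, which only comes into play when scores from different n-gram orders are compared against each other.

```python
from collections import Counter, defaultdict

def build_lookup(ngram_counts):
    """Map each context (all words but the last) to a Counter of next words."""
    lookup = defaultdict(Counter)
    for ngram, count in ngram_counts.items():
        lookup[ngram[:-1]][ngram[-1]] += count
    return lookup

def predict_next(phrase, lookup, max_order=3):
    """Try the longest context first, then back off to shorter ones."""
    words = phrase.lower().split()
    for order in range(max_order, 0, -1):
        k = order - 1                      # number of context words needed
        if len(words) < k:
            continue
        context = tuple(words[-k:]) if k else ()   # () selects the 1-gram table
        candidates = lookup.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None                            # empty tables: nothing to predict

# Toy counts for illustration only.
counts = Counter({
    ("i", "love", "cats"): 3, ("i", "love", "dogs"): 1,
    ("love", "pizza"): 5, ("love", "cats"): 3,
    ("the",): 10, ("cats",): 4,
})
lookup = build_lookup(counts)
```

With these toy counts, `predict_next("I love", lookup)` is answered from the 3-gram table, `predict_next("we love", lookup)` falls back to the 2-gram table, and `predict_next("hello world", lookup)` falls all the way back to the most frequent single word.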

Accuracy/Performance/Resource Usage

Accuracy: ~30%; prediction time: < 1 second
Model tuning activities tried:
- 3-gram Katz backoff (much slower)
- 4-gram Stupid and Katz backoff (little improvement)
- Table indexing (significant performance improvement)
- Dropping low-frequency n-grams (little improvement)
Best overall model: 3-gram Stupid Backoff
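
The slides do not say how the accuracy figure was measured. One common approach, sketched here in Python as an assumption rather than the author's method, is top-1 accuracy on held-out text: hide each word in turn, predict it from the preceding words, and count exact matches. The `top1_accuracy` name, the `context_size` parameter, and the toy predictor are hypothetical.

```python
def top1_accuracy(predict, sentences, context_size=2):
    """Fraction of words predicted exactly from the preceding context words."""
    hits = total = 0
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            # Up to context_size preceding words form the input phrase.
            context = " ".join(words[max(0, i - context_size):i])
            hits += predict(context) == words[i]
            total += 1
    return hits / total if total else 0.0

# Toy predictor for illustration: any callable taking a phrase works here.
def toy_predict(context):
    return "love" if context.endswith("i") else "cats"

score = top1_accuracy(toy_predict, ["i love cats", "i love dogs"])
```

Any predictor that maps a phrase to a single word can be passed in, so the same harness could compare 3-gram and 4-gram models on identical held-out sentences.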

Application Instructions

Open the app at [https://tkerins24.shinyapps.io/PredictionApp/]
Type a phrase into the “Input Phrase” box, then press “Predict”
The prediction will appear in the “Predicted Word” box
Click the “Clear Input” button to repeat the process