Next Word Prediction Application

Tim Kerins
01/23/2020

Johns Hopkins Data Science Capstone Project

Note: use the keyboard arrow keys to navigate through the slides

Overview

Interactive Word Prediction Application

  • Takes an input phrase and predicts the next word
  • Uses the corpus provided by SwiftKey (Twitter, blogs, news feeds)
  • Written in R using the NLP & Quanteda natural language packages
  • Uses a 3-gram Stupid Backoff model applied to a cleaned dataset
  • The Shiny application can be accessed on the web at: [https://tkerins24.shinyapps.io/PredictionApp/]

Methods/Algorithms

Pre-processing/Cleansing

  • Downloaded/merged datasets, then took a 10% sample
  • Removed numbers, punctuation, URLs, and profanity; converted text to lowercase
  • Tokenized the data into 1-, 2-, and 3-grams, ranked by frequency
  • Resulting n-gram tables indexed/stored in .RDS files
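
The cleaning and tokenizing steps above can be sketched roughly as follows. The app itself is written in R; this illustration uses plain Python only to stay dependency-free, and it is not the author's code. The profanity list, the regular expressions, and the function names clean and ngram_counts are assumptions made for the example. As a simplification, n-grams here may span sentence boundaries.

```python
import re
from collections import Counter

# Hypothetical placeholder list; the slides do not say which profanity list was used.
PROFANITY = {"darn", "heck"}

def clean(text):
    """Lowercase, then strip URLs, digits, punctuation, and listed profanity."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z'\s]", " ", text)             # drop numbers and punctuation
    tokens = [t.strip("'") for t in text.split()]
    return [t for t in tokens if t and t not in PROFANITY]

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = clean("The cat sat. The cat ran!")
unigrams, bigrams, trigrams = (ngram_counts(tokens, n) for n in (1, 2, 3))
print(bigrams.most_common(3))
```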

Prediction

  • 3-gram Stupid Backoff algorithm (discount factor 0.4):
    • If the last two input words match a 3-gram, return the highest-frequency final word.
    • Otherwise, back off to 2-grams; if the last input word matches, return the highest-frequency final word.
    • If there is still no match, back off to 1-grams and return the highest-frequency word.
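
A minimal sketch of the backoff steps above, written in plain Python rather than the R used by the app. The function names, the toy corpus, and the linear table scan are assumptions for illustration; only the 0.4 factor and the order of the fallbacks come from the slide. Each backoff step multiplies the relative-frequency score by 0.4, so a unigram fallback is scaled by 0.4 twice.

```python
from collections import Counter

ALPHA = 0.4  # factor quoted on the slide

def count_ngrams(tokens, n):
    """Frequency table of n-grams, keyed by tuples of words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def best_continuation(table, prefix):
    """Most frequent (word, count) among n-grams in `table` that start with `prefix`."""
    matches = {ng[-1]: c for ng, c in table.items() if ng[:-1] == prefix}
    if not matches:
        return None
    word = max(matches, key=matches.get)
    return word, matches[word]

def predict(tokens, unigrams, bigrams, trigrams):
    """Return (word, score) from the highest-order table that matches the input."""
    if len(tokens) >= 2:                      # try 3-grams on the last two words
        prefix = tuple(tokens[-2:])
        hit = best_continuation(trigrams, prefix)
        if hit:
            return hit[0], hit[1] / bigrams[prefix]
    if tokens:                                # back off to 2-grams on the last word
        prefix = (tokens[-1],)
        hit = best_continuation(bigrams, prefix)
        if hit:
            return hit[0], ALPHA * hit[1] / unigrams[prefix]
    (word,), count = unigrams.most_common(1)[0]   # back off to the top 1-gram
    return word, ALPHA * ALPHA * count / sum(unigrams.values())

corpus = "i like green eggs i like green ham i like red apples".split()
unigrams, bigrams, trigrams = (count_ngrams(corpus, n) for n in (1, 2, 3))
print(predict(["i", "like"], unigrams, bigrams, trigrams))
print(predict(["really", "like"], unigrams, bigrams, trigrams))
```

The sketch assumes all three tables were counted from the same unpruned text, so every matched prefix also appears in the next-lower table.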

Accuracy/Performance/Resource Usage

  • Accuracy: ~30%; prediction time: < 1 second
  • Model tuning activities tried:
    • 3-gram Katz backoff (much longer prediction time)
    • 4-gram Stupid and Katz backoff (little improvement)
    • Table indexing (significant performance improvement)
    • Dropping low-frequency n-grams (little improvement)
  • Best overall model: 3-gram Stupid Backoff
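
To illustrate why indexing helps and what dropping low-frequency n-grams involves, here is a small Python sketch. The slides do not describe how the R tables were actually indexed or which frequency cutoff was tried, so the dictionary-based index, the function name build_index, and the min_count parameter are all assumptions. The idea is that a prefix lookup becomes a single hash access instead of a scan over the whole table.

```python
from collections import Counter

def build_index(ngrams, min_count=1):
    """Map each prefix to its most frequent next word, skipping rare n-grams."""
    best = {}
    for ngram, count in ngrams.items():
        if count < min_count:          # hypothetical cutoff for rare n-grams
            continue
        prefix, word = ngram[:-1], ngram[-1]
        if prefix not in best or count > best[prefix][1]:
            best[prefix] = (word, count)
    return best

# Toy 3-gram table standing in for the real frequency tables.
trigrams = Counter({
    ("i", "like", "green"): 2,
    ("i", "like", "red"): 1,
    ("like", "green", "ham"): 1,
})

index = build_index(trigrams)
pruned = build_index(trigrams, min_count=2)
print(index.get(("i", "like")))
print(len(index), len(pruned))
```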

Application Instructions

  • Open the app at [https://tkerins24.shinyapps.io/PredictionApp/]
  • Type a phrase into the “Input Phrase” box, then press “Predict”
  • The prediction will appear in the “Predicted Word” box
  • Click on the “Clear Input” button to repeat the process