Text Prediction App

Honeydukes
Sat Apr 16 23:16:00 2016

A capstone project for the Coursera Data Science Specialization, in collaboration with Johns Hopkins University and SwiftKey

Overview

Why This App

Typing on mobile = Considerable time = Frustration.
-> Need a smart keyboard to predict the next word.

How This App Was Built

  1. Train model - using n-gram tokenization
  2. Build predictive algorithm - Katz back-off + Good-Turing smoothing
  3. Build Shiny App - clean text + predict next word

Training The Model

  1. Build Corpus
    • Use Random Sampling
    • Split into train(60%), devtest(20%) & test(20%) sets.
  2. Tokenization
    • Train set is cleaned (lower-cased, punctuation removed, etc.)
    • 1-, 2-, 3- and 4-gram tokens are built.
  3. n-Gram Look-up Table - an (n x 6) matrix, one row per 4-gram, holding
    • the 4-gram token (w1-w2-w3-w4) as anchor + its frequency
    • + the corresponding tri-gram (w1-w2-w3), bi-gram (w2-w3) and uni-gram (w3) contexts
    • + the corresponding next word (w4).
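
The training steps above can be sketched in a few lines. This is a minimal Python illustration, not the code behind the app: the function names (`clean`, `split_corpus`, `ngram_counts`, `lookup_table`), the fixed random seed and the exact cleaning rule (keep only letters, apostrophes and spaces) are all assumptions made for the example.

```python
import random
import re
from collections import Counter

def clean(text):
    """Lower-case the text, drop everything except letters, apostrophes and spaces, then tokenize."""
    text = re.sub(r"[^a-z' ]+", " ", text.lower())
    return text.split()

def split_corpus(lines, seed=42):
    """Shuffle the lines and split them 60/20/20 into train, devtest and test sets."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    a, b = n * 6 // 10, n * 8 // 10   # integer arithmetic avoids float rounding
    return shuffled[:a], shuffled[a:b], shuffled[b:]

def ngram_counts(lines, n):
    """Count every n-gram (as a tuple of words) across the cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = clean(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def lookup_table(four_gram_counts):
    """One row per 4-gram: (4-gram, frequency, tri-gram, bi-gram, uni-gram, next word)."""
    rows = []
    for (w1, w2, w3, w4), freq in four_gram_counts.items():
        rows.append((f"{w1} {w2} {w3} {w4}", freq,
                     f"{w1} {w2} {w3}", f"{w2} {w3}", w3, w4))
    return rows
```

Calling `lookup_table(ngram_counts(train, 4))` on the train split gives the six-column table described above, with one row per distinct 4-gram.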

Predictive Algorithm

Katz Back-Off

  1. Look for the seen n-gram in the look-up table.
  2. If found, display the top 3 next words.
  3. If not, back off to the (n-1)-gram and repeat from step 1.
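
The back-off loop above can be sketched as follows. This is a simplified Python illustration under stated assumptions: `build_model` and `predict` are hypothetical names, the model is a plain dictionary from context to next-word counts rather than the look-up table used by the app, and candidates are ranked by raw counts. A full Katz model would rank by discounted probabilities and apply back-off weights.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=4):
    """Map each context (a tuple of 0 to max_n-1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, nxt = tokens[i:i + n]
            model[tuple(context)][nxt] += 1
    return model

def predict(model, words, k=3, max_n=4):
    """Return up to k next words, backing off to shorter contexts while the context is unseen."""
    context = tuple(words[-(max_n - 1):]) if max_n > 1 else ()
    while True:
        if context in model:                  # seen n-gram: stop here
            return [w for w, _ in model[context].most_common(k)]
        if not context:                       # nothing left to back off to
            return []
        context = context[1:]                 # drop the oldest word: (n-1)-gram
```

For example, with a model built from "i like green eggs and i like green ham and i like tea", the unseen context "we like" backs off to "like" and returns "green" and "tea".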

Good-Turing Smoothing

  1. Probability is discounted for seen n-grams with freq of freq <= 5.
  2. The excess probability is redistributed to unseen n-grams.
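
The discounting idea can be illustrated with the basic Good-Turing estimate, r* = (r + 1) * N[r+1] / N[r], where N[r] is the number of n-grams seen exactly r times. The Python sketch below assumes the common convention of discounting only n-grams whose count r is at most k = 5; the function names and the rule of leaving a count unchanged when N[r+1] is zero are assumptions for the example, and the renormalized discount used in full Katz smoothing is not shown.

```python
from collections import Counter

def good_turing_counts(ngram_counts, k=5):
    """Return adjusted counts r* = (r + 1) * N[r+1] / N[r] for n-grams with r <= k."""
    freq_of_freq = Counter(ngram_counts.values())   # N[r]: how many n-grams occur r times
    adjusted = {}
    for ngram, r in ngram_counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq[r + 1]
        if r <= k and n_r1 > 0:
            adjusted[ngram] = (r + 1) * n_r1 / n_r
        else:
            adjusted[ngram] = float(r)              # high or unsupported counts stay as-is
    return adjusted

def unseen_mass(ngram_counts):
    """Probability mass the basic estimate reserves for unseen n-grams: N[1] / N."""
    total = sum(ngram_counts.values())
    singletons = sum(1 for r in ngram_counts.values() if r == 1)
    return singletons / total if total else 0.0
```

With counts {a: 1, b: 1, c: 1, d: 2, e: 2, f: 3}, each singleton is discounted from 1 to 4/3 only if the table supports it; here N[1] = 3 and N[2] = 2, so the adjusted count is 2 * 2 / 3, and 3 / 10 of the probability mass is set aside for unseen n-grams.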

User Interface via Shiny App

What Shiny App Does

  1. Cleans the input text (lower-cases it, removes punctuation).
  2. Runs the predictive algorithm (step 2 of How This App Was Built).
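
The cleaning step applied to the user's phrase can be sketched as below. This Python illustration is an assumption about what such a step might look like, not the app's code: `clean_input` is a hypothetical name, and keeping only the last three words reflects the 4-gram look-up table, which needs at most three words of context.

```python
import re

def clean_input(text, max_context=3):
    """Lower-case, strip punctuation and keep the last max_context words as the context."""
    text = re.sub(r"[^a-z' ]+", " ", text.lower())
    tokens = text.split()
    return tokens[-max_context:]
```

The returned words would then be passed to the predictive algorithm as the starting context.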

What the User Needs To Do

Go to https://honeydukes.shinyapps.io/Project/

  1. Key in any phrase and click the Predict button.
  2. See the top 3 most probable next words in the output.