Capstone: Shiny Text Prediction Application

PW
Aug 23, 2015

Motivation and Key Features

Have you ever been frustrated by having to type out every word letter by letter, especially when using mobile devices with soft keyboards? Enter text prediction software: an effective means of improving typing efficiency and the overall user experience.

This Shiny-based application showcases a lightweight implementation of this technology, featuring:

  • Kneser-Ney N-Gram Prediction Model
  • Simple input interface
  • Intuitive output of rank-ordered recommendations
  • Plots of output statistics and search/prediction history

Kneser-Ney N-Gram Prediction Model

General Characteristics of N-Gram Models:

  • Prediction is based on the frequency of occurrence of 1-, 2- and 3-word combinations (n-grams), measured from a training corpus
  • Implementation relies on the Markov assumption, whereby the conditional probabilities of possible next words in a sequence can be approximated from just the last one or two words (see the sketch after this list)
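
As a minimal illustration of where those frequencies come from, the sketch below tallies 1-, 2- and 3-gram counts from a toy corpus in base R. The table names uniDF, biDF and triDF match those used in the prediction call later in this document, but the real tables are built from a far larger corpus and may carry additional columns.

## Minimal sketch: n-gram frequency tables from a toy corpus.
ngramCounts <- function(sentences, n) {
  grams <- unlist(lapply(strsplit(tolower(sentences), "\\s+"), function(toks) {
    if (length(toks) < n) return(character(0))
    vapply(seq_len(length(toks) - n + 1),
           function(i) paste(toks[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  as.data.frame(table(ngram = grams), stringsAsFactors = FALSE)
}

corpus <- c("happy new year", "happy birthday to you", "happy new start")
uniDF <- ngramCounts(corpus, 1)  # unigram counts
biDF  <- ngramCounts(corpus, 2)  # bigram counts
triDF <- ngramCounts(corpus, 3)  # trigram counts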

Kneser-Ney Smoothing Interpolation

  • Reserves probability mass for n-grams unseen in the training corpus through the application of a standardized discount
  • Interpolates probabilities from 3-, 2- and 1-gram frequencies by measuring the likelihood that the predicted word would appear as a distinct continuation of the input sequence (the bigram form is sketched after this list)
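
In the bigram case, interpolated Kneser-Ney computes P(w | h) = max(c(h w) - d, 0) / c(h) + lambda(h) * Pcont(w), where d is the discount, lambda(h) is the probability mass reserved by discounting, and Pcont(w) measures how many distinct contexts w continues. The sketch below applies this to the toy tables above; pknBigram is a hypothetical helper for illustration, not the application's generatePKN.

## Interpolated Kneser-Ney probability of word w following context h,
## using the toy uniDF/biDF built earlier (rows are distinct n-grams).
pknBigram <- function(w, h, d, uniDF, biDF) {
  cHist <- uniDF$Freq[match(h, uniDF$ngram)]          # count of the context
  cBig  <- biDF$Freq[match(paste(h, w), biDF$ngram)]  # count of the bigram
  if (is.na(cBig)) cBig <- 0
  firsts  <- sub(" .*", "", biDF$ngram)     # first word of each distinct bigram
  seconds <- sub(".* ", "", biDF$ngram)     # second word of each distinct bigram
  lambda <- d / cHist * sum(firsts == h)    # mass reserved by discounting
  pCont  <- sum(seconds == w) / nrow(biDF)  # continuation probability
  max(cBig - d, 0) / cHist + lambda * pCont
}

pknBigram("year", "new", d = 0.75, uniDF, biDF)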

Model Implementation

Using the model on the Shiny platform is simple (a sketch of the corresponding input controls follows the steps below):

  1. Input a word or phrase in the text box to the right
  2. Indicate the number of alternatives to display (1-10)
  3. Click the submit button!
  4. Navigate through the plots to view the word probabilities and counts, or view your session's prediction history
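
For orientation, the controls described above might be declared in a Shiny UI roughly as follows. The layout and the widget IDs ("phrase", "numReturn", "predictions") are assumptions for illustration, not the deployed application's source.

## Hypothetical UI sketch matching the steps above.
library(shiny)
ui <- fluidPage(
  textInput("phrase", "Enter a word or phrase:"),
  numericInput("numReturn", "Number of alternatives to display:",
               value = 4, min = 1, max = 10),
  submitButton("Submit"),
  tableOutput("predictions")  # rank-ordered recommendations
)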

Prediction Screen

[Screenshot of the application's prediction interface]

Behind the Scenes: How Does It Work?

  • The model is highly portable: one primary function outputs a data frame containing the predicted word, its associated probability and its unigram count
  • Additional functions support pre-processing and generation of the standardized discount, as in the call below:
## Rank the most likely next words for the input phrase and return the top 4:
prediction <- generatePKN("happy", "happy new", n = 3, uniDF, biDF, triDF,
                          numReturn = 4, knDiscApprox(uniDF, biDF, triDF))
prediction
    Predicted.Word Word.Probability Unigram.Count
434           year        0.8441687         12533
34        birthday        0.1461212          4376
258        mothers        0.0558350          2003
264            new        0.0338841         25870
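
The standardized discount supplied via knDiscApprox is, in standard Kneser-Ney implementations, estimated from count-of-counts statistics. Whether the application uses exactly the common Ney estimate below is an assumption, but a typical approximation looks like this:

## Common absolute-discount estimate d = n1 / (n1 + 2 * n2), where n1 and
## n2 are the numbers of n-grams seen exactly once and exactly twice.
## Assumed for illustration; not extracted from knDiscApprox().
discountEstimate <- function(countDF) {
  n1 <- sum(countDF$Freq == 1)
  n2 <- sum(countDF$Freq == 2)
  n1 / (n1 + 2 * n2)
}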