Capstone: Shiny Text Prediction Application

PW
Aug 23, 2015

Motivation and Key Features

Have you ever been frustrated by having to type out every word letter by letter, especially when using mobile devices with soft keyboards? Enter text prediction software: an effective means of improving typing efficiency and the overall user experience.

This Shiny-based application showcases a lightweight implementation of this technology, featuring:

  • Kneser-Ney N-Gram Prediction Model
  • Simple input interface
  • Intuitive output of rank-ordered recommendations
  • Plots of output statistics and search/prediction history

Kneser-Ney N-Gram Prediction Model

General Characteristics of N-Gram Models:

  • Prediction is based on the frequency of occurrence of 1-, 2- and 3-word combinations (n-grams), measured from a training corpus
  • Implementation relies on the Markov assumption, whereby the conditional probabilities of possible next words in a sequence can be approximated from just the last one or two words (see the sketch after this list)
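
As a minimal illustration of where those frequencies come from, the sketch below tallies 1-, 2- and 3-gram counts from a toy corpus in base R. The table names uniDF, biDF and triDF match those used in the prediction call later in this document, but the real tables are built from a far larger corpus and may carry additional columns.

## Minimal sketch: n-gram frequency tables from a toy corpus.
ngramCounts <- function(sentences, n) {
  grams <- unlist(lapply(strsplit(tolower(sentences), "\\s+"), function(toks) {
    if (length(toks) < n) return(character(0))
    vapply(seq_len(length(toks) - n + 1),
           function(i) paste(toks[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  as.data.frame(table(ngram = grams), stringsAsFactors = FALSE)
}

corpus <- c("happy new year", "happy birthday to you", "happy new start")
uniDF <- ngramCounts(corpus, 1)  # unigram counts
biDF  <- ngramCounts(corpus, 2)  # bigram counts
triDF <- ngramCounts(corpus, 3)  # trigram counts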

Kneser-Ney Smoothing Interpolation

  • Reserves probability mass for n-grams unseen in the training corpus through the application of a standardized discount
  • Interpolates probabilities from 3-, 2- and 1-gram frequencies by measuring the likelihood that the predicted word would appear as a distinct continuation of the input sequence (the bigram form is sketched after this list)
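
In the bigram case, interpolated Kneser-Ney computes P(w | h) = max(c(h w) - d, 0) / c(h) + lambda(h) * Pcont(w), where d is the discount, lambda(h) is the probability mass reserved by discounting, and Pcont(w) measures how many distinct contexts w continues. The sketch below applies this to the toy tables above; pknBigram is a hypothetical helper for illustration, not the application's generatePKN.

## Interpolated Kneser-Ney probability of word w following context h,
## using the toy uniDF/biDF built earlier (rows are distinct n-grams).
pknBigram <- function(w, h, d, uniDF, biDF) {
  cHist <- uniDF$Freq[match(h, uniDF$ngram)]          # count of the context
  cBig  <- biDF$Freq[match(paste(h, w), biDF$ngram)]  # count of the bigram
  if (is.na(cBig)) cBig <- 0
  firsts  <- sub(" .*", "", biDF$ngram)     # first word of each distinct bigram
  seconds <- sub(".* ", "", biDF$ngram)     # second word of each distinct bigram
  lambda <- d / cHist * sum(firsts == h)    # mass reserved by discounting
  pCont  <- sum(seconds == w) / nrow(biDF)  # continuation probability
  max(cBig - d, 0) / cHist + lambda * pCont
}

pknBigram("year", "new", d = 0.75, uniDF, biDF)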

Model Implementation

Using the model on the Shiny platform is simple (a sketch of the corresponding input controls follows the steps below):

  1. Input a word or phrase in the text box to the right
  2. Indicate the number of alternatives to display (1-10)
  3. Click the submit button!
  4. Navigate through the plots to view the word probabilities and counts, or view your session's prediction history
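
For orientation, the controls described above might be declared in a Shiny UI roughly as follows. The layout and the widget IDs ("phrase", "numReturn", "predictions") are assumptions for illustration, not the deployed application's source.

## Hypothetical UI sketch matching the steps above.
library(shiny)
ui <- fluidPage(
  textInput("phrase", "Enter a word or phrase:"),
  numericInput("numReturn", "Number of alternatives to display:",
               value = 4, min = 1, max = 10),
  submitButton("Submit"),
  tableOutput("predictions")  # rank-ordered recommendations
)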

Prediction Screen

[Screenshot of the application's prediction interface]

Behind the Scenes: How Does It Work?

  • The model is highly portable: one primary function outputs a data frame containing the predicted word, its associated probability and its unigram count
  • Additional functions support pre-processing and generation of the standardized discount, as in the call below:
## Rank the most likely next words for the input phrase and return the top 4:
prediction <- generatePKN("happy", "happy new", n = 3, uniDF, biDF, triDF,
                          numReturn = 4, knDiscApprox(uniDF, biDF, triDF))
prediction
    Predicted.Word Word.Probability Unigram.Count
434           year        0.8441687         12533
34        birthday        0.1461212          4376
258        mothers        0.0558350          2003
264            new        0.0338841         25870
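
The standardized discount supplied via knDiscApprox is, in standard Kneser-Ney implementations, estimated from count-of-counts statistics. Whether the application uses exactly the common Ney estimate below is an assumption, but a typical approximation looks like this:

## Common absolute-discount estimate d = n1 / (n1 + 2 * n2), where n1 and
## n2 are the numbers of n-grams seen exactly once and exactly twice.
## Assumed for illustration; not extracted from knDiscApprox().
discountEstimate <- function(countDF) {
  n1 <- sum(countDF$Freq == 1)
  n2 <- sum(countDF$Freq == 2)
  n1 / (n1 + 2 * n2)
}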