TextSeer - Word Prediction Engine

Peng Zhao
April 3, 2015

Data Science Specialization Capstone Project
-Coursera
-Johns Hopkins Bloomberg School of Public Health
-SwiftKey

The source code for the project is available on GitHub
The app: TextSeer

TextSeer Highlights

Try TextSeer now!

  • Responds to text input in real time
  • Displays the top 30 candidate words with weighted scores
  • WordCloud display for fast visualization of suggested words
  • Handles punctuation and special characters in user input
  • The language model occupies only 99.8 MB (an average of 80.5 bytes per N-gram), keeping load time and prediction latency low

Instructions

The Underlying Algorithm

  • The model was trained on a random 5% sample of the English HC Corpora, containing documents from Twitter, blogs, and news feeds (descriptive statistics).
  • Each N-gram collection (N = 1 to 4) is stored in a hash table, built with R's data.table and keyed on spooky.32 hashes for fast lookup (see the first sketch below).
  • TextSeer employs a Markov process based on a modified “Stupid Backoff” algorithm (Brants et al., 2007). For each order of input N-gram (4-grams down to 1-grams), it computes the maximum likelihood estimate (MLE) of every word that follows; these MLEs are summed across all matching N-gram orders, where k is the highest matching order, to obtain each word's weight W (see the second sketch below). \[ W_{word}=\sum_{n=1}^{k}(MLE_{word})_{n} \]
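
As a rough sketch of what that keyed N-gram storage might look like, assuming R's data.table and the spooky.32 hash from the hashFunction package (the table layout, column names, and values are illustrative assumptions, not the app's actual schema):

```r
library(data.table)
library(hashFunction)   # provides spooky.32()

# Illustrative 4-gram table: each row maps the hash of a three-word
# prefix to a word that follows it and that word's MLE.
# All names and values here are assumptions for illustration.
ngrams4 <- data.table(
  prefix_hash = sapply(c("i want to", "i want a"), spooky.32),
  next_word   = c("go", "cookie"),
  mle         = c(0.31, 0.12)
)
setkey(ngrams4, prefix_hash)   # keyed column enables binary-search lookup

# Retrieve candidate next words for a prefix
ngrams4[J(spooky.32("i want to")), next_word]
```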

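A hypothetical sketch of the weighting step follows, assuming each N-gram lookup returns a small table of candidate words with their MLEs (the function name, column names, and MLE values are invented for illustration):

```r
library(data.table)

# Sum MLE scores per candidate word across all matching N-gram orders,
# following the formula W_word = sum over n of (MLE_word)_n.
score_candidates <- function(candidate_tables) {
  all_cands <- rbindlist(candidate_tables)          # stack all orders
  weights   <- all_cands[, .(W = sum(mle)), by = word]
  head(weights[order(-W)], 30)                      # top 30, as in the app
}

# Toy example: lookup results from a trigram and a bigram table
trigram <- data.table(word = c("go", "see"),       mle = c(0.45, 0.15))
bigram  <- data.table(word = c("go", "be", "see"), mle = c(0.20, 0.10, 0.05))
score_candidates(list(trigram, bigram))
# "go" accumulates W = 0.45 + 0.20 = 0.65 and ranks first
```
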
Further Exploration

  1. Context information is discarded for N-grams where N > 4.
  2. The average bytes per N-gram could be reduced further by storing unique words as integer IDs in the higher-order N-gram tables (see the sketch after this list).
  3. Implementing interpolation (the Kneser-Ney algorithm) may improve prediction accuracy but reduce performance.
  4. A table of known contractions would allow their use in the model.
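
A minimal sketch of the integer-encoding idea in item 2, assuming a single shared vocabulary vector (all names and values are hypothetical):

```r
library(data.table)

# Store each unique word once and refer to it by integer ID,
# instead of repeating strings in every N-gram row.
vocab   <- c("i", "want", "to", "go")           # unique words, stored once
word_id <- setNames(seq_along(vocab), vocab)    # word -> integer ID

# N-gram rows become compact integer columns
ngrams <- data.table(
  w1 = word_id[c("i", "want")],
  w2 = word_id[c("want", "to")],
  w3 = word_id[c("to", "go")]
)

# Decoding an ID back to a word is a cheap vector lookup
vocab[ngrams$w3]   # "to" "go"
```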

The app: TextSeer