Predictive Typer Project

Leif Ulstrup
December 12, 2014

Learn about natural language processing (NLP) modeling techniques
Develop functions (e.g., file splitting, cleanup, n-gram discovery) to build a processing pipeline for “messy” text corpus data to be used in a predictive application
Develop a predictive typing model using the provided text corpus (sources include Twitter, news, and blogs; ~600MB of text)
Apply techniques from the JHU DS classes to assess model accuracy
Deploy a fast, efficient, and accurate web application using the Shiny platform
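
As a rough illustration of the cleanup and n-gram discovery steps named above, here is a minimal base R sketch. The helper names clean_tokens() and count_ngrams() and the sample sentences are assumptions made for this example; they are not the functions used in the project, and a real pipeline for ~600MB of text would need to process the files in chunks.

```r
# Hypothetical helper: normalise one line of raw text into word tokens.
clean_tokens <- function(line) {
  line <- tolower(line)
  line <- gsub("[^a-z' ]", " ", line)  # keep letters, apostrophes, spaces
  tokens <- strsplit(line, " +")[[1]]
  tokens[nzchar(tokens)]               # drop empty strings left by splitting
}

# Hypothetical helper: count every n-gram of size n across a set of lines.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    tokens <- clean_tokens(line)
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Made-up sample input, for illustration only.
sample_lines <- c("I love NLP!!", "i love R, and I love NLP")
bigrams <- count_ngrams(sample_lines, 2)
bigrams[order(-bigrams$freq), ]  # most frequent bigrams first
```

Lower-casing and stripping punctuation before counting keeps variants such as "NLP!!" and "nlp" in the same bucket, at the cost of losing case and sentence-boundary information.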

Benefits:

Faster typing on mobile devices
Potential use in detecting typos in web-based search (including data entry validation for free-text entry fields)

Pitch:

After studying fundamental NLP techniques for predictive typing, I chose to model the corpus using the following elements:

n-grams up to 6 words long
a back-off model: when the full input has no match, drop leading words from the input (input text - N words) and search again until a match is found
results are ordered by frequency of occurrence in the source corpus
partial word prediction from the starting letters, based on 1-gram frequency
test results:
- prediction accuracy of 10% or lower on general text input, counting a prediction as correct when the true word appears among the top 10 predicted words
- the highest-probability predictions are far more likely than the rest and form a small subset of all predictions for each starting word or phrase
- it could be a useful utility for text messaging applications, where phrases are often simple and type-ahead of common words is valuable (i.e., high accuracy on common phrases matters more than high accuracy on all possible combinations of text phrases)
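
The back-off search and partial word prediction described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the application's code: the table layout (context, nxt, freq), the toy counts, the function names predict_next() and complete_word(), and the choice to fall back to 1-grams when nothing else matches are all hypothetical.

```r
# Toy frequency table: `context` holds the preceding words, `nxt` the word
# that follows. 1-grams have an empty context. All counts are made up.
ngrams <- data.frame(
  context = c("", "", "", "the", "the", "of the", "of the"),
  nxt     = c("the", "to", "time", "same", "first", "year", "world"),
  freq    = c(100L, 80L, 20L, 9L, 7L, 5L, 3L),
  stringsAsFactors = FALSE
)

# Back-off search: try the longest context first, then drop leading words.
# max_context = 5 corresponds to n-grams of up to 6 words.
predict_next <- function(input, tbl, max_context = 5, top_n = 10) {
  words <- strsplit(tolower(trimws(input)), " +")[[1]]
  words <- tail(words[nzchar(words)], max_context)
  repeat {
    context <- paste(words, collapse = " ")
    hits <- tbl[tbl$context == context, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$freq, hits$nxt), ]  # frequency, then A-Z
      return(head(hits$nxt, top_n))
    }
    if (length(words) == 0) return(character(0))   # nothing left to drop
    words <- words[-1]                             # lop off the first word
  }
}

# Partial word prediction: rank 1-grams that start with the typed letters.
complete_word <- function(prefix, tbl, top_n = 10) {
  unigrams <- tbl[tbl$context == "", ]
  hits <- unigrams[startsWith(unigrams$nxt, tolower(prefix)), ]
  hits <- hits[order(-hits$freq, hits$nxt), ]
  head(hits$nxt, top_n)
}

predict_next("end of the", ngrams)  # backs off once, to "of the"
complete_word("ti", ngrams)         # candidates starting with "ti"
```

Because matches from the longest available context are returned as soon as any exist, shorter contexts never get a chance to contribute; that keeps the search simple and fast but is one plausible reason accuracy on general text stays low.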

More work and experimentation are needed to improve prediction accuracy (see last slide)
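
For experiments of this kind, a top-10 accuracy score like the one quoted in the test results could be computed roughly as follows. This is a sketch only: top_k_accuracy(), the stand-in toy_predictor(), and the four test cases are hypothetical and do not reproduce the evaluation actually run for the project.

```r
# Hypothetical scorer: fraction of test cases whose true next word is
# among the first k words returned by `predictor`.
top_k_accuracy <- function(predictor, contexts, truths, k = 10) {
  hits <- mapply(function(context, truth) {
    truth %in% head(predictor(context), k)
  }, contexts, truths)
  mean(hits)
}

# Stand-in predictor for illustration: always returns the same three words.
toy_predictor <- function(context) c("the", "to", "and")

# Made-up held-out examples: a context and the word that really followed it.
contexts <- c("going", "of", "salt", "see you")
truths   <- c("to", "the", "pepper", "later")

top_k_accuracy(toy_predictor, contexts, truths, k = 10)  # 0.5
top_k_accuracy(toy_predictor, contexts, truths, k = 1)   # 0.25
```

Scoring at several values of k shows how much of the accuracy comes from the single best guess versus the rest of the list.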

A simple demonstration application has been developed using the Shiny platform and the R programming language. The application is available at https://lulstrup.shinyapps.io/Predictive-Typer-DSCapstone-Project/.

Application architecture choices:
- uses R library(data.table) for fast queries on searches
- reads an n-gram data file (~35MB, a sample of the most frequent n-grams out of ~4GB from the Capstone Project sources) into a data.table at startup; loading takes 5-10s before searching can begin, but queries are very fast once the data is loaded
- the imported n-gram file can be swapped for one built from a smaller or larger corpus (the free Shiny cloud platform allows at most 100MB of files, including R code)
- the protoPredict2() and predictWordFromLetters() R functions perform the recursive search for matches, using a back-off function to lop off starting words until a match is found
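
One way to cut a large n-gram table down to a sample of its most frequent entries, so that it fits a size limit like the one above, is sketched below. The helper prune_ngrams(), the column names, and the counts are assumptions for illustration; the sketch uses a base R data frame to stay dependency-free, whereas the application itself loads its file into a data.table.

```r
# Hypothetical helper: keep only the `max_rows` most frequent n-grams.
prune_ngrams <- function(tbl, max_rows) {
  ranked <- tbl[order(-tbl$freq, tbl$ngram), ]  # frequency, then A-Z
  head(ranked, max_rows)
}

# Made-up full table, for illustration only.
full <- data.frame(
  ngram = c("of the", "in the", "zzzz zzzz", "thanks for the", "at the end of"),
  freq  = c(500L, 420L, 1L, 60L, 12L),
  stringsAsFactors = FALSE
)

small <- prune_ngrams(full, max_rows = 3)
small$ngram  # "of the" "in the" "thanks for the"
```

Ranking by frequency also tends to push one-off noise such as "zzzz zzzz" out of the deployed file, though rare but legitimate phrases are lost along with it.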

Instructions
1. launch app (wait 5-7s on initial load to read n-gram database)
2. begin to enter words, partial words, phrases
3. note the predicted next word, the word completed from partial input, and the list of potential phrases (matches are ordered by frequency, then alphabetically)
4. (optional) adjust the length of the phrase list using the selector

Next steps and future upgrades to work on:

use a pre-loaded, comprehensive cloud-based DB such as MongoDB (e.g., MongoLab) as the n-gram datastore and query it via API to reduce data load time and speed up access
add an autocomplete drop-down menu picker for the next word/phrase (jQuery autocomplete)
enhance domain-specific prediction by limiting the DB to specific sources (and domains such as law, medicine, business, science, etc.), with an expected increase in speed and accuracy
more experimentation with accuracy of predictions and application performance based on various tradeoffs
more work cleaning up the source data; the prediction n-gram model still contains irregular patterns (e.g., sequential 'zzzz…') and inconsistent use of English contractions (apostrophe usage)
make the application usable on a mobile screen
create an API to a cloud-based service that returns predictions in JSON format