Predictive Typer Project

Leif Ulstrup
December 12, 2014

Project Overview

This is the capstone project for the Johns Hopkins University Data Science (DS) Specialization via Coursera. The project goals:

  • Learn about natural language processing (NLP) modeling techniques
  • Develop functions (e.g., file splitting, cleanup, n-gram discovery) to build a processing pipeline for “messy” text corpus data to be used in a predictive application
  • Develop a predictive typing model using the provided text corpus (sources included Twitter, news, and blogs; ~600MB of text)
  • Apply techniques from the JHU DS classes to assess model accuracy
  • Deploy a fast, efficient, and accurate web application using the Shiny platform
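
The n-gram discovery step of such a pipeline can be illustrated with a minimal sketch. It is written in Python for illustration only (the project itself was built in R), and the function names `tokenize` and `count_ngrams`, the tokenization rule, and the sample lines are assumptions rather than the project's actual code.

```python
import re
from collections import Counter

def tokenize(line):
    """Lowercase a line and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", line.lower())

def count_ngrams(lines, max_n=3):
    """Count every n-gram of length 1..max_n, never crossing line boundaries."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical sample lines standing in for the real corpus
sample = ["Thanks for the follow!", "Thanks for the great day"]
ngram_counts = count_ngrams(sample)
```

A real run over ~600MB of text would likely need to process the corpus in chunks and prune rare n-grams, which this sketch does not attempt.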

Benefits:

  • Faster typing on mobile devices
  • Potential use to detect typos in web-based search (including data entry validation for free-text entry fields)

Pitch:

  • Significant progress has been made during this DS Capstone project
  • Additional funding and resources are needed to build a commercially viable version

Modeling Algorithm

After studying fundamental NLP techniques for predictive typing, I chose to model the corpus using the following elements:

  • n-grams up to length 6 words
  • back-off model techniques that drop the first N words of the input text until a match is found
  • results are ordered by frequency of occurrence in the source corpus
  • partial word prediction using starting letters based on 1-gram frequency
  • test results:
    • 10% or lower prediction accuracy on general text input, counting a prediction as correct when the right word appears in the top 10 predicted words
    • the highest-probability predictions are far more likely than the rest and make up a small subset of all predictions for each starting word or phrase
    • it could be a useful utility for text messaging applications, where phrases are often simple and type-ahead of common words is valuable; such applications need high accuracy on common phrases rather than on all possible combinations of text phrases
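
The back-off lookup described in the list above can be sketched as follows. This is a Python illustration under stated assumptions, not the project's R code: `build_index`, `predict_next`, the toy counts, and the alphabetical tie-break are all hypothetical. A context of up to five words corresponds to n-grams of up to six words.

```python
from collections import Counter, defaultdict

def build_index(ngram_counts):
    """Map each context (all words but the last) to a Counter of next words."""
    index = defaultdict(Counter)
    for ngram, count in ngram_counts.items():
        if len(ngram) >= 2:
            index[ngram[:-1]][ngram[-1]] += count
    return index

def predict_next(index, words, max_context=5, top_k=3):
    """Back off by dropping leading words until some context matches."""
    context = tuple(words[-max_context:])
    while context:
        candidates = index.get(context)
        if candidates:
            # Most frequent first; ties broken alphabetically (an assumption).
            ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
            return [word for word, _ in ranked[:top_k]]
        context = context[1:]  # lop off the first word and retry
    return []

# Hypothetical toy counts; real counts would come from the corpus
toy_counts = {
    ("thanks", "for", "the"): 5,
    ("for", "the", "follow"): 3,
    ("for", "the", "day"): 1,
    ("the", "follow"): 3,
    ("the", "day"): 4,
    ("the", "best"): 4,
}
index = build_index(toy_counts)
print(predict_next(index, ["many", "thanks", "for", "the"]))  # ['follow', 'day']
```

Here the four-word and three-word contexts have no match, so the lookup backs off to ("for", "the") and ranks its continuations by count.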

More work and experimentation are needed on ways to improve prediction accuracy (see last slide).
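
A top-10 accuracy figure of the kind reported above could be measured along these lines. The sketch is in Python and is only one plausible evaluation procedure; the source does not describe how its test was run. `top_k_accuracy`, `toy_predict`, and the held-out sentences are hypothetical.

```python
def top_k_accuracy(predict, sentences, k=10):
    """Fraction of positions where the true next word is in the top-k guesses."""
    hits = 0
    total = 0
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            guesses = predict(words[:i])[:k]
            hits += words[i] in guesses
            total += 1
    return hits / total if total else 0.0

def toy_predict(context):
    # Stand-in predictor that ignores the context and always guesses the same words
    return ["the", "to", "a"]

# Hypothetical held-out sentences, not taken from the project corpus
held_out = ["go to the store", "a day at the beach"]
score = top_k_accuracy(toy_predict, held_out)
```

Any predictor that takes a list of preceding words and returns a ranked list of guesses can be passed in place of `toy_predict`.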

Application Overview

A simple demonstration application has been developed using the Shiny platform and the R programming language. The application is available at https://lulstrup.shinyapps.io/Predictive-Typer-DSCapstone-Project/.

  • Application architecture choices:
    • uses R library(data.table) for fast query capability on searches
    • reads n-gram data (a ~35MB file sampling the most frequent n-grams out of ~4GB from the Capstone Project sources) into a data.table at startup; loading takes 5-10s before searching can begin, but queries are very fast once the data is loaded
    • the imported n-gram file can be swapped for a smaller or larger corpus (the free Shiny cloud platform allows at most 100MB of files, including R code)
    • protoPredict2() and predictWordFromLetters() R functions perform the recursive search for matches using a backoff function to lop off starting words until a match is found
  • Instructions
    1. launch app (wait 5-7s on initial load to read n-gram database)
    2. begin to enter words, partial words, phrases
    3. note the predicted next word, the word predicted from partial input, and the list of potential phrases (ordered by frequency and alphabetically for matches)
    4. (optional) adjust the length of the phrase list using the selector
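
Partial-word prediction from 1-gram frequencies, as used by the application, can be sketched like this. It is a Python illustration, not the R function predictWordFromLetters() itself; the name `complete_word`, the toy counts, and the linear scan (in place of a data.table query) are assumptions.

```python
from collections import Counter

def complete_word(unigram_counts, prefix, top_k=5):
    """Return the most frequent known words that start with `prefix`."""
    prefix = prefix.lower()
    matches = [(w, c) for w, c in unigram_counts.items() if w.startswith(prefix)]
    # Most frequent first; ties broken alphabetically (an assumption).
    matches.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in matches[:top_k]]

# Hypothetical 1-gram counts standing in for the loaded n-gram table
unigrams = Counter({"the": 50, "they": 12, "then": 12, "thanks": 7, "to": 40})
print(complete_word(unigrams, "th"))  # ['the', 'then', 'they', 'thanks']
```

A scan over every word is fine for a toy table; a sorted or keyed structure would be the natural choice at the scale the application loads.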

Future Explorations

Next steps and future upgrades to work on:

  • use a pre-loaded and comprehensive cloud-based DB like MongoDB (e.g., MongoLabs) as the n-gram datastore and query it via API to reduce data load time and speed up access
  • add an autocomplete drop-down menu picker for the next word/phrase (jQuery autocomplete)
  • enhance the ability to do domain-specific prediction by limiting the DB to specific sources (and domains such as law, medicine, business, science, etc.), with an expected increase in speed and accuracy
  • more experimentation with accuracy of predictions and application performance based on various tradeoffs
  • more work cleaning up the source data; still finding irregular patterns (e.g., sequential 'zzzz…') and inconsistent use of English contractions (apostrophe usage) in the prediction n-gram model
  • make the application usable on a mobile screen
  • create an API to a cloud-based service that provides predictions in JSON format
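
As a rough idea of what a JSON prediction response from such a service might look like, here is a minimal Python sketch. No API exists yet, so the function `predictions_to_json` and the field names `input`, `next_word`, and `completions` are purely hypothetical.

```python
import json

def predictions_to_json(input_text, next_words, completions):
    """Serialize one prediction result to a JSON string (hypothetical schema)."""
    payload = {
        "input": input_text,          # text the user has typed so far
        "next_word": next_words,      # ranked next-word predictions
        "completions": completions,   # ranked completions of a partial word
    }
    return json.dumps(payload, sort_keys=True)

body = predictions_to_json("thanks for the", ["follow", "day"], [])
```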