Text Prediction with Confabulator

Jason V
April 23, 2016

A capstone project for the Coursera/Johns Hopkins data science specialization

https://jvhub.shinyapps.io/Capstone/

Project Overview

Text prediction is an important subject within data science with many uses. Most people encounter it daily in user interfaces such as phones and search engines.

The purpose of this project is to develop a text prediction algorithm and deploy it as an interactive application. Users enter a phrase and the application predicts the next word.

A key challenge was striking the right balance between prediction accuracy and application responsiveness while factoring in resource constraints.

Data Preparation

The training source was the HC Corpora dataset, which contains millions of lines of text from Twitter feeds, blogs, and news. Several steps were taken to prepare the data:

  • Sample data from the three sources and merge the samples together
  • Clean and standardize the data: convert text case and strip out unwanted characters and profanity
  • Tokenize the data into n-grams (from 1-gram to 5-grams)
  • Summarize to obtain frequency counts
  • Filter to leave only n-grams with more than one occurrence (to ensure adequate performance)
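
The preparation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the project: the function names (`clean`, `ngrams`, `build_tables`), the cleaning rules, and the tiny `PROFANITY` list are all hypothetical placeholders.

```python
import re
from collections import Counter

# Hypothetical placeholder list; the filter used in the project is not specified.
PROFANITY = {"darn", "heck"}

def clean(text):
    """Lowercase, replace anything but letters, apostrophes and spaces, drop listed words."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return [w for w in text.split() if w not in PROFANITY]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_tables(lines, max_n=5, min_count=2):
    """Count n-grams for n = 1..max_n, keeping those seen at least min_count times."""
    tables = {}
    for n in range(1, max_n + 1):
        counts = Counter()
        for line in lines:
            counts.update(ngrams(clean(line), n))
        # min_count=2 mirrors "more than one occurrence" in the list above.
        tables[n] = {gram: c for gram, c in counts.items() if c >= min_count}
    return tables
```

Dropping n-grams that occur only once shrinks the tables considerably, which is what keeps lookups fast at prediction time.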

How the Application Works

The prediction relies on the 'stupid backoff' algorithm, chosen because it is computationally efficient while having reasonable accuracy. Steps taken:

  • Clean the input text from the user
  • Find matches of the last 4 words of the input within the 5-gram data set, the last 3 words within the 4-gram set, etc.
  • For each set, calculate the share of matched rows attributed to each candidate next word using the frequency counts
  • Combine the predicted words and scores from the 5-, 4-, 3-, and 2-gram results, reducing the weight of matches in each successively smaller set by 40%
  • Select the predicted word with the highest score and present to the user
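
The scoring steps above can be sketched as follows. This is a minimal illustration, not the application's actual code. It assumes the n-gram tables are dictionaries mapping word tuples to counts, that "reduce the weight by 40%" means multiplying by 0.6 for each smaller n-gram order, and that results are combined by keeping each word's best score. The names `predict` and `toy_tables` and the toy counts are hypothetical.

```python
def predict(tokens, tables, max_n=5, discount=0.6):
    """Return (word, score) for the best next-word candidate, or None.

    tables maps n -> {n-gram tuple: count}. discount=0.6 reads "reduce the
    weight by 40%" literally; it is an assumption, not a confirmed value.
    """
    scores = {}
    weight = 1.0
    for n in range(max_n, 1, -1):
        if len(tokens) >= n - 1:
            context = tuple(tokens[-(n - 1):])
            # Candidate next words: last word of every n-gram whose prefix matches.
            matches = {g[-1]: c for g, c in tables.get(n, {}).items()
                       if g[:-1] == context}
            total = sum(matches.values())
            for word, count in matches.items():
                score = weight * count / total
                # Assumption: keep the best score seen for each word.
                scores[word] = max(scores.get(word, 0.0), score)
        weight *= discount  # smaller n-gram sets count for less
    if not scores:
        return None
    return max(scores.items(), key=lambda kv: kv[1])

# Hypothetical toy tables, for illustration only.
toy_tables = {
    2: {("love", "data"): 3, ("love", "cats"): 1},
    3: {("i", "love", "data"): 2},
}
print(predict(["i", "love"], toy_tables))
```

In the real application the user's input would first pass through the same cleaning step as the training data, so that its tokens line up with the stored n-grams.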

Results

The resulting application, playfully called the 'Confabulator', can be found here: https://jvhub.shinyapps.io/Capstone/

One design choice made early on was that the application should be as responsive as possible. The goal appears to have been achieved: the application is fast enough that at times it will return a prediction while the user is still typing their word or phrase.