Text Prediction

Patrick Charles
2015.07.21
Coursera/Johns Hopkins Data Science Series
Capstone Project

Approach

A corpus of ~4M sample documents, including tweets, news articles, and blog posts, is loaded and exploratory analysis performed. Sets of n-grams are extracted from the corpus, predictive algorithms are built, and various methods for improving predictive accuracy are explored and refined.
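
As a rough illustration of the extraction step, here is a minimal Python sketch (not the project's actual code; the capstone is conventionally built in R, and `corpus` below is a toy placeholder for the real documents):

  import re
  from collections import Counter

  def tokenize(text):
      # Lowercase, then keep runs of letters/apostrophes as tokens.
      return re.findall(r"[a-z']+", text.lower())

  def ngrams(tokens, n):
      # Slide a window of length n across the token list.
      return zip(*(tokens[i:] for i in range(n)))

  corpus = ["a toy example document", "another toy document"]  # placeholder

  # counts[n] maps each n-gram (a tuple of words) to its corpus frequency.
  counts = {n: Counter() for n in range(1, 6)}
  for doc in corpus:
      tokens = tokenize(doc)
      for n in range(1, 6):
          counts[n].update(ngrams(tokens, n))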

Optimization

  • Corpus reduced from 4M to 1M documents via random sampling
  • The 1M-document sample transformed and further reduced
  • Iterative process of analysis, optimization, and performance testing
  • Document-term matrices generated for {1-5}-grams
  • n-grams organized by frequency of occurrence in corpus
  • Least common n-grams pruned from the final model (see the sketch after this list):
    • 18,936 words occurring more than 10x
    • 199,966 2-grams w/ frequency > 3x
    • 150,489 3-grams w/ frequency > 3x
    • 139,984 4-grams w/ frequency > 2x
    • 43,024 5-grams w/ frequency > 2x
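
A hedged sketch of the pruning step, reusing the `counts` tables from the sketch above; the cutoffs mirror the strictly-greater-than frequencies listed:

  # Frequency cutoffs mirroring the counts above (an n-gram is kept
  # only if its frequency is strictly greater than the cutoff).
  MIN_FREQ = {1: 10, 2: 3, 3: 3, 4: 2, 5: 2}

  pruned = {n: {gram: c for gram, c in counts[n].items() if c > MIN_FREQ[n]}
            for n in range(1, 6)}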

Prediction Algorithm

  • Capture input text, including all preceding words in the phrase
  • Iteratively traverse n-grams (longest to shortest) for matches
  • On a match, use the longest, most frequent n-gram
  • Last word in the matching n-gram is the predicted next word
  • If no match is found in the {5, 4, 3, 2}-grams, fall back to a randomly selected, frequently occurring 1-gram (i.e. a common word); see the sketch after this list
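
A minimal sketch of the backoff lookup, assuming the `pruned` tables and `tokenize` helper from the earlier sketches. The linear scan is for clarity only; a real implementation would index each table by its (n-1)-word prefix:

  import random

  def predict(phrase, tables, top_k=1):
      tokens = tokenize(phrase)
      # Longest context first: an n-gram match needs n-1 preceding words.
      for n in range(5, 1, -1):
          context = tuple(tokens[-(n - 1):])
          if len(context) < n - 1:
              continue
          # Linear scan for n-grams whose first n-1 words match the context.
          matches = sorted(((c, gram[-1]) for gram, c in tables[n].items()
                            if gram[:-1] == context), reverse=True)
          if matches:
              return [word for _, word in matches[:top_k]]
      # No match at any order: pick randomly among frequent 1-grams.
      common = [g[0] for g, _ in Counter(tables[1]).most_common(50)]
      return random.sample(common, min(top_k, len(common)))

  # e.g. predict("thanks for the", pruned, top_k=5)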

Application

text-predictor interactively performs word/phrase completion!

Performance

  • 15% accuracy (using only the first, top-ranked response)
  • 22% accuracy (selecting from the top 5 ranked responses)
  • Mean Response Time: 250ms
  • Memory: 9MB compressed, 104MB in-memory
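
One plausible way accuracy figures like these could be measured (a sketch only; the source does not describe the evaluation protocol, and `held_out` is a hypothetical test set of (phrase, next-word) pairs):

  def top_k_accuracy(test_cases, k):
      # Each test case is a (preceding phrase, actual next word) pair.
      hits = sum(actual in predict(phrase, pruned, top_k=k)
                 for phrase, actual in test_cases)
      return hits / len(test_cases)

  # The reported figures would correspond to roughly:
  #   top_k_accuracy(held_out, 1) ~ 0.15
  #   top_k_accuracy(held_out, 5) ~ 0.22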

Links