Word-a-Tron

Dennis Chandler
4/26/2015

Word-a-Tron

Predictive Text Application

JHU DDS Capstone Project; Spring 2015

  • Built from three unstructured text files
    • Blogs (1,010,243 lines)
    • News (899,289 lines)
    • Twitter (2,360,149 lines)
  • Must predict the next word after a typed word or phrase
  • Must apply a profanity filter to the predicted word

The task is to build a language model from the corpora above

Methodology

  • Combine, clean, and tokenize the files
  • Build n-gram frequency tables for prediction
  • Prune, aggregate, and simplify the tables (see the sketch after this list)
  • Used Domino Data Labs to run concurrent scripts, speeding up iteration
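
A minimal sketch of this pipeline in Python (the file names match the capstone dataset; the tokenizer, pruning threshold, and function names are illustrative assumptions, and the original scripts were run concurrently on Domino Data Labs):

    import re
    from collections import Counter

    def tokenize(line):
        # Lowercase and keep only runs of letters and apostrophes.
        return re.findall(r"[a-z']+", line.lower())

    def ngram_counts(paths, n):
        # Count every n-gram across the corpus files.
        counts = Counter()
        for path in paths:
            with open(path, encoding="utf-8", errors="ignore") as f:
                for line in f:
                    tokens = tokenize(line)
                    for i in range(len(tokens) - n + 1):
                        counts[tuple(tokens[i:i + n])] += 1
        return counts

    def prune(counts, min_count=2):
        # Drop rare n-grams to keep the lookup tables small.
        return Counter({k: v for k, v in counts.items() if v >= min_count})

    files = ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]
    trigrams = prune(ngram_counts(files, 3))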

What didn't work

  • Smoothing (no discernible improvement)
  • Interpolation (no discernible improvement)
  • Tf-idf ranking (n-grams too short)
  • Cosine similarity (n-grams too short)

Keep It Simple, Stupid!

Algorithm Description

Stupid Backoff (a simplified Katz back-off)

  1. Take the last two typed words (at most) and find the most frequent tri-gram that begins with them
  2. If no such tri-gram exists, back off and use the bi-gram table (last word only)
  3. If no such bi-gram exists, back off and use the most frequent uni-gram (a sketch follows this list)
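
A minimal Python sketch of the back-off (the table layout and function names are illustrative assumptions; the real app used pruned, pre-aggregated tables):

    from collections import Counter, defaultdict

    def build_tables(tokenized_lines):
        # Map each context (tuple of preceding words) to a Counter of next words.
        tri, bi = defaultdict(Counter), defaultdict(Counter)
        uni = Counter()
        for tokens in tokenized_lines:
            uni.update(tokens)
            for i in range(len(tokens) - 1):
                bi[(tokens[i],)][tokens[i + 1]] += 1
            for i in range(len(tokens) - 2):
                tri[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
        return tri, bi, uni

    def predict(words, tri, bi, uni):
        context = tuple(words[-2:])
        # 1. Try the last two words as a tri-gram context.
        if len(context) == 2 and tri[context]:
            return tri[context].most_common(1)[0][0]
        # 2. Back off to the last word alone (bi-gram context).
        if context and bi[context[-1:]]:
            return bi[context[-1:]].most_common(1)[0][0]
        # 3. Back off to the most frequent word overall.
        return uni.most_common(1)[0][0]

For example, predict(["one", "of"], tri, bi, uni) returns the word that most often followed "one of" in the training text.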

Predicted words are scanned for profanity; any profane word is displayed as @#$% but is left in the corpus for prediction purposes (a sketch follows)
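
A minimal sketch of that display-time filter (the word list here is a placeholder):

    BAD_WORDS = {"darn", "heck"}  # placeholder; the real list is larger

    def mask(word):
        # Mask a profane prediction at display time only; the word stays
        # in the n-gram tables so its context remains usable.
        return "@#$%" if word.lower() in BAD_WORDS else word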

Application

The application is located at: Word-a-Tron

  1. Enter a word or phrase in the text box on the left
  2. The predicted word appears on the right, following the word/phrase
  3. Use the check boxes to show the top three predicted words or to enable spoken input of the word/phrase

Improvements

  • Adaptive learning from user input over time
  • Integration of information-retrieval techniques for longer phrases

Voice Control powered by annyang!