Word-a-Tron

Dennis Chandler
4/22/2015

Predictive Text Application

JHU DDS Capstone Project; Spring 2015

  • Built from three unstructured text files
    • Blogs (1,010,243 lines)
    • News (899,289 lines)
    • Twitter (2,360,149 lines)
  • Must predict the next word after a typed word or phrase
  • Must filter profanity from the predicted word

The task is language modeling over the corpora above.

Methodology

  • Combine, clean, and tokenize the files
  • Build n-gram frequency tables for prediction
  • Prune, aggregate, and simplify the tables (sketched below)
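
Below is a minimal Python sketch of this pipeline. The file names, cleaning rules, and pruning threshold are illustrative assumptions, not the project's exact choices.

```python
# Sketch: tokenize the corpora and build pruned n-gram frequency tables.
import re
from collections import Counter

FILES = ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]  # assumed names

def tokenize(line):
    # Lowercase and keep only alphabetic tokens (apostrophes allowed)
    return re.findall(r"[a-z']+", line.lower())

def build_ngram_tables(paths, max_n=3, min_count=2):
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens = tokenize(line)
                for n in range(1, max_n + 1):
                    for i in range(len(tokens) - n + 1):
                        tables[n][tuple(tokens[i:i + n])] += 1
    # Prune rare n-grams to shrink the tables (threshold is an assumption)
    for n in tables:
        tables[n] = Counter({g: c for g, c in tables[n].items() if c >= min_count})
    return tables
```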

What didn't work

  • Smoothing (no discernible improvement)
  • Interpolation (no discernible improvement)
  • TF-IDF ranking (n-grams too short)
  • Cosine similarity (n-grams too short)

Keep it Stupid Simple!!!

Algorithm Description

Stupid Backoff (a simplified Katz backoff)

  1. Take (at most) the last trigram typed and find the highest-frequency occurrence
  2. If no matching trigram exists, back off and use the bigram
  3. If no matching bigram exists, back off and use the unigram

Predicted words are scanned for profanity, and any profane word is substituted with @#$% (see the sketch below).
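
A minimal Python sketch of the backoff lookup and profanity mask, reusing the `tables` dict from the previous sketch. The `PROFANITY` set is a placeholder, not the project's actual filter list, and the linear scan over each table is a simplification (a real app would index n-grams by their context).

```python
from collections import Counter

PROFANITY = {"badword1", "badword2"}  # placeholder, not the real filter list

def predict(tables, phrase, top_k=1):
    tokens = phrase.lower().split()
    # Try a two-word context (trigram table), then one word (bigram),
    # then no context at all (unigram).
    for n in (2, 1, 0):
        context = tuple(tokens[-n:]) if n else ()
        candidates = Counter({
            gram[-1]: count
            for gram, count in tables[n + 1].items()
            if gram[:-1] == context
        })
        if candidates:
            words = [w for w, _ in candidates.most_common(top_k)]
            # Mask any profane prediction instead of dropping it
            return ["@#$%" if w in PROFANITY else w for w in words]
    return []
```

With `top_k=3`, the same lookup would back the app's check box that shows the top three predictions.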

Application

The application is located at: Word-a-Tron

  1. Enter a word or phrase in the text box on the left
  2. The predicted word appears on the right, after the word/phrase
  3. Use the check box to see the top three predicted words

Improvements

  • Adaptive learning from user input over time
  • A larger corpus of phrases
  • Integration of information retrieval techniques for longer phrases