Word-a-Tron

Dennis Chandler
4/22/2015

Predictive Text Application

JHU DDS Capstone Project; Spring 2015

  • Built from three unstructured text files
    • Blogs (1,010,243 lines)
    • News (899,289 lines)
    • Twitter (2,360,149 lines)
  • Must predict the next word after a typed word or phrase
  • Must filter profanity from the predicted word

The task is language modeling over the corpora above.

Methodology

  • Combine, clean, and tokenize the files
  • Build n-gram frequency tables for prediction
  • Prune, aggregate, and simplify the tables (sketched below)
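
Below is a minimal Python sketch of this pipeline. The file names, cleaning rules, and pruning threshold are illustrative assumptions, not the project's exact choices.

```python
# Sketch: tokenize the corpora and build pruned n-gram frequency tables.
import re
from collections import Counter

FILES = ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]  # assumed names

def tokenize(line):
    # Lowercase and keep only alphabetic tokens (apostrophes allowed)
    return re.findall(r"[a-z']+", line.lower())

def build_ngram_tables(paths, max_n=3, min_count=2):
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens = tokenize(line)
                for n in range(1, max_n + 1):
                    for i in range(len(tokens) - n + 1):
                        tables[n][tuple(tokens[i:i + n])] += 1
    # Prune rare n-grams to shrink the tables (threshold is an assumption)
    for n in tables:
        tables[n] = Counter({g: c for g, c in tables[n].items() if c >= min_count})
    return tables
```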

What didn't work

  • Smoothing (no discernible improvement)
  • Interpolation (no discernible improvement)
  • TF-IDF ranking (n-grams too short)
  • Cosine similarity (n-grams too short)

Keep it Stupid Simple!!!

Algorithm Description

Stupid Backoff (a simplified Katz backoff)

  1. Take (at most) the last trigram typed and find the highest-frequency occurrence
  2. If no matching trigram exists, back off and use the bigram
  3. If no matching bigram exists, back off and use the unigram

Predicted words are scanned for profanity, and any profane word is substituted with @#$% (see the sketch below).
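
A minimal Python sketch of the backoff lookup and profanity mask, reusing the `tables` dict from the previous sketch. The `PROFANITY` set is a placeholder, not the project's actual filter list, and the linear scan over each table is a simplification (a real app would index n-grams by their context).

```python
from collections import Counter

PROFANITY = {"badword1", "badword2"}  # placeholder, not the real filter list

def predict(tables, phrase, top_k=1):
    tokens = phrase.lower().split()
    # Try a two-word context (trigram table), then one word (bigram),
    # then no context at all (unigram).
    for n in (2, 1, 0):
        context = tuple(tokens[-n:]) if n else ()
        candidates = Counter({
            gram[-1]: count
            for gram, count in tables[n + 1].items()
            if gram[:-1] == context
        })
        if candidates:
            words = [w for w, _ in candidates.most_common(top_k)]
            # Mask any profane prediction instead of dropping it
            return ["@#$%" if w in PROFANITY else w for w in words]
    return []
```

With `top_k=3`, the same lookup would back the app's check box that shows the top three predictions.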

Application

The application is located at: Word-a-Tron

  1. Enter a word or phrase in the text box on the left
  2. The predicted word appears on the right, after the word/phrase
  3. Use the check box to see the top three predicted words

Improvements

  • Adaptive learning from user input over time
  • A larger corpus of phrases
  • Integration of information retrieval techniques for longer phrases