Word-a-Tron

Dennis Chandler
4/26/2015

Word-a-Tron

Predictive Text Application

JHU DDS Capstone Project; Spring 2015

  • Built from three unstructured text files
    • Blogs (1,010,243 lines)
    • News (899,289 lines)
    • Twitter (2,360,149 lines)
  • Must predict the next word after a typed word or phrase
  • Must apply a profanity filter to the predicted word

The task is to build a language model from the corpora above

Methodology

  • Combine, clean, and tokenize the files
  • Build n-gram frequency tables for prediction
  • Prune, aggregate, and simplify the tables (see the sketch after this list)
  • Used Domino Data Labs to run concurrent scripts, speeding up iteration
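
A minimal sketch of this pipeline in Python (the file names match the capstone dataset; the tokenizer, pruning threshold, and function names are illustrative assumptions, and the original scripts were run concurrently on Domino Data Labs):

    import re
    from collections import Counter

    def tokenize(line):
        # Lowercase and keep only runs of letters and apostrophes.
        return re.findall(r"[a-z']+", line.lower())

    def ngram_counts(paths, n):
        # Count every n-gram across the corpus files.
        counts = Counter()
        for path in paths:
            with open(path, encoding="utf-8", errors="ignore") as f:
                for line in f:
                    tokens = tokenize(line)
                    for i in range(len(tokens) - n + 1):
                        counts[tuple(tokens[i:i + n])] += 1
        return counts

    def prune(counts, min_count=2):
        # Drop rare n-grams to keep the lookup tables small.
        return Counter({k: v for k, v in counts.items() if v >= min_count})

    files = ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]
    trigrams = prune(ngram_counts(files, 3))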

What didn't work

  • Smoothing (no discernible improvement)
  • Interpolation (no discernible improvement)
  • Tf-idf ranking (n-grams too short)
  • Cosine similarity (n-grams too short)

Keep It Simple, Stupid!

Algorithm Description

Stupid Backoff (a simplified Katz back-off)

  1. Take the last two typed words (at most) and find the most frequent tri-gram that begins with them
  2. If no such tri-gram exists, back off and use the bi-gram table (last word only)
  3. If no such bi-gram exists, back off and use the most frequent uni-gram (a sketch follows this list)
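
A minimal Python sketch of the back-off (the table layout and function names are illustrative assumptions; the real app used pruned, pre-aggregated tables):

    from collections import Counter, defaultdict

    def build_tables(tokenized_lines):
        # Map each context (tuple of preceding words) to a Counter of next words.
        tri, bi = defaultdict(Counter), defaultdict(Counter)
        uni = Counter()
        for tokens in tokenized_lines:
            uni.update(tokens)
            for i in range(len(tokens) - 1):
                bi[(tokens[i],)][tokens[i + 1]] += 1
            for i in range(len(tokens) - 2):
                tri[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
        return tri, bi, uni

    def predict(words, tri, bi, uni):
        context = tuple(words[-2:])
        # 1. Try the last two words as a tri-gram context.
        if len(context) == 2 and tri[context]:
            return tri[context].most_common(1)[0][0]
        # 2. Back off to the last word alone (bi-gram context).
        if context and bi[context[-1:]]:
            return bi[context[-1:]].most_common(1)[0][0]
        # 3. Back off to the most frequent word overall.
        return uni.most_common(1)[0][0]

For example, predict(["one", "of"], tri, bi, uni) returns the word that most often followed "one of" in the training text.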

Predicted words are scanned for profanity; any profane word is displayed as @#$% but is left in the corpus for prediction purposes (a sketch follows)
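
A minimal sketch of that display-time filter (the word list here is a placeholder):

    BAD_WORDS = {"darn", "heck"}  # placeholder; the real list is larger

    def mask(word):
        # Mask a profane prediction at display time only; the word stays
        # in the n-gram tables so its context remains usable.
        return "@#$%" if word.lower() in BAD_WORDS else word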

Application

The application is located at: Word-a-Tron

  1. Enter a word or phrase in the text box on the left
  2. The predicted word appears on the right, following the word/phrase
  3. Use the check boxes to show the top three predicted words or to enable spoken input of the word/phrase

Improvements

  • Adaptive learning from user input over time
  • Integration of information-retrieval techniques for longer phrases

Voice Control powered by annyang!