Word Prediction Application Using R


Jeff Hedberg

April 2015


Johns Hopkins University
Data Science Specialization
Capstone Presentation

Application Overview

For the Capstone project in my Data Science Specialization through Johns Hopkins University, I was required to develop an application that predicts the next word, in real time, as a user types into a text box. My application features a text box where users enter free-form text, plus an optional "Profanity Filtering" check box to screen the predictions if desired. It automagically updates the prediction in real time!

Below is a screenshot of the application that I developed in R and Shiny (Click here to try it out!!)

[Screenshot: the word prediction application built with R and Shiny]

Application Development

1. I started by processing the provided US English HC Corpora in its entirety (no sampling!). It consisted of raw blog, news, and Twitter text (roughly 100M words and 4.3M records).
2. I then transformed the dataset as follows: converted all text to lowercase and removed punctuation, symbols, numbers, and extra white space (this cleaning and counting pipeline is sketched after this list).
3. I then tokenized ALL of the data in order to facilitate the next step, n-gram creation.
4. Since I planned to utilize Markov chains and a personally customized version of Katz's back-off model, I created n-grams for n = 1 through 6 on the entire dataset!
5. To reduce the amount of data in each post-processed n-gram frequency file, I retained only combinations with counts larger than 25. This cut the overall data size on disk to 4.5 MB total (enabling amazing performance) without dramatically impacting the algorithm's accuracy!
6. I then focused on manually tuning the assigned n-gram probability coefficients \(C_1\) through \(C_6\) to achieve an optimum aggregate effect (a toy version of this scoring follows below): \[ P_{\text{WordAggregate}} = \sum_{n=1}^{6} C_{n} \cdot P_{n\text{-gram}} \]
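
To make steps 2-5 concrete, here is a minimal base-R sketch of the cleaning, tokenization, n-gram counting, and pruning pipeline. Function names like `clean_text` and `count_ngrams` are illustrative placeholders, not my actual code:

```r
# Sketch of the cleaning, tokenization, n-gram counting, and pruning
# steps (2-5). Function names are illustrative placeholders.
clean_text <- function(x) {
  x <- tolower(x)                     # all lowercase
  x <- gsub("[[:punct:]]", " ", x)    # remove punctuation and symbols
  x <- gsub("[[:digit:]]", " ", x)    # remove numbers
  gsub("\\s+", " ", trimws(x))        # collapse extra white space
}

count_ngrams <- function(lines, n) {
  tokens <- strsplit(clean_text(lines), " ", fixed = TRUE)
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Retain only combinations seen more than 25 times (step 5)
prune <- function(freqs, min_count = 25) freqs[freqs > min_count]
```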
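
And a toy version of the step-6 aggregation, assuming each candidate word has one probability per n-gram order; the coefficient values here are placeholders, not my tuned ones:

```r
# Placeholder coefficients C1..C6, NOT the manually tuned values
coefs <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.25)

# probs: length-6 vector of P_1gram..P_6gram for one candidate word
aggregate_prob <- function(probs, C = coefs) sum(C * probs)
```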

Application Development (Continued)

7. I then processed the finalized n-gram datasets to add an additional column for Profanity Filtering. This is simply a boolean (T/F) indicator on every n-gram record so that I can quickly subset the data when needed (in the final step of my prediction algorithm); a rough sketch follows below. I used "Google's list of bad words" as my check when assigning these logical values.
8. I then built the Shiny interface (skeleton sketched below) and tested it to ensure it loaded and functioned in real time with great results.
9. I also embedded Google Analytics (once a friend showed me how) so that I could gather data on all visits and users. :)
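
A rough sketch of the step-7 flagging, assuming `bad_words` holds the words from Google's list (the exact matching in my app may differ):

```r
# Flag any n-gram containing a bad word (assumes bad_words contains
# plain words with no regex metacharacters).
flag_profanity <- function(ngrams, bad_words) {
  pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
  grepl(pattern, ngrams)
}

# With the filter box checked, predict from the clean subset only:
# candidates <- candidates[!candidates$profane, ]
```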
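
And the skeleton of a Shiny app like the one in step 8; `predict_next()` stands in for my prediction function and is hypothetical:

```r
library(shiny)

ui <- fluidPage(
  textInput("text", "Enter text:"),
  checkboxInput("filter", "Profanity Filtering"),
  textOutput("prediction")
)

# Shiny's reactivity re-runs renderText whenever an input changes,
# which is what makes the prediction update in real time.
server <- function(input, output) {
  output$prediction <- renderText({
    predict_next(input$text, filter = input$filter)  # hypothetical function
  })
}

shinyApp(ui, server)
```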

Future Enhancements

There are several enhancements that would add quite a bit of lift to this algorithm (future development opportunities):
1. Partial word completion - The data is already processed and available for this addition, but time would need to be spent implementing an additional character-based token search for the partially typed word, instead of just the word-based token search in the current implementation (a possible starting point is sketched after this list). Additional re-tuning of my customized Katz's back-off model would then be needed.
2. Expansion to include a much larger corpus and additional topic areas (e.g., Wikipedia, Google, etc.).
3. Inclusion of spelling auto-correction for mistyped words.
4. Inclusion of word auto-capitalization for proper nouns.
5. Premium features that can be added at a cost. This would develop a revenue stream ($$) as the user base grows! :)
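
As a possible starting point for enhancement 1, a character-prefix lookup against the unigram table might look like this (assuming `unigram_freqs` is a named vector of word counts, which is an assumption about the stored data):

```r
# Return the k most frequent words starting with the typed prefix.
complete_word <- function(prefix, unigram_freqs, k = 3) {
  hits <- unigram_freqs[startsWith(names(unigram_freqs), prefix)]
  head(names(sort(hits, decreasing = TRUE)), k)
}
```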