Johannes Rebane
April 26, 2015
Overview: This application, submitted as the final capstone project for the JHU Data Science Specialization through Coursera, implements a word prediction algorithm in the spirit of SwiftKey. Using a database of 2-grams, 3-grams, and 4-grams generated from blog, Twitter, and news corpora, the application processes input text and predicts the next word with a customized “Stupid Backoff” model, optimized for speed and web-scale language modeling. The final application can be found on ShinyApps.io.
The application uses Python and NLTK to generate the n-grams. R connects to a SQLite database of these n-grams to implement and train a “Stupid Backoff” algorithm (Brants et al., 2007). Shiny provides a front end for interacting with the algorithm.
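As a rough sketch of this pipeline, the snippet below counts n-grams with NLTK and loads them into SQLite. It is illustrative only: the table layout, column names, and the `corpus_texts` placeholder are assumptions, not the project's actual schema.

```python
import sqlite3
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def build_ngram_table(texts, n, db_path="ngrams.db"):
    """Count n-grams across a corpus and store them in SQLite.

    Table and column names are illustrative; the project's actual
    schema is not documented here.
    """
    counts = Counter()
    for text in texts:
        tokens = word_tokenize(text.lower())
        counts.update(ngrams(tokens, n))

    conn = sqlite3.connect(db_path)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS gram{n} "
        "(context TEXT, word TEXT, count INTEGER)"
    )
    conn.executemany(
        f"INSERT INTO gram{n} VALUES (?, ?, ?)",
        ((" ".join(g[:-1]), g[-1], c) for g, c in counts.items()),
    )
    conn.commit()
    conn.close()

# corpus_texts is a placeholder for the cleaned blog/Twitter/news documents.
for n in (2, 3, 4):
    build_ngram_table(corpus_texts, n)
```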
The customized “Stupid Backoff” algorithm looks up the highest-order n-grams matching the end of the input phrase and, if no match is found, “backs off” with a fixed discount to lower-order n-grams until the highest-scoring matches are found.
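For reference, here is a minimal in-memory sketch of standard Stupid Backoff scoring from Brants et al. (2007), using the conventional 0.4 discount; the `counts` layout is an assumption standing in for the application's SQLite lookups, and the project's customizations may differ.

```python
ALPHA = 0.4  # conventional backoff discount from Brants et al. (2007)

def stupid_backoff(word, context, counts):
    """Score `word` given `context`, a tuple of preceding tokens.

    `counts` maps token tuples (of every order, including unigrams)
    to raw frequencies; it stands in for the app's SQLite tables.
    """
    if not context:
        # Base case: unigram relative frequency.
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts.get((word,), 0) / total
    gram_count = counts.get(context + (word,), 0)
    context_count = counts.get(context, 0)
    if gram_count > 0 and context_count > 0:
        return gram_count / context_count
    # No match at this order: back off to a shorter context with a discount.
    return ALPHA * stupid_backoff(word, context[1:], counts)
```

Candidates are ranked by this value over words observed after the matching context; because the values are never normalized into probabilities, Brants et al. call them scores, which is what makes the method fast.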
The Shiny app provides a reactive input field for the end user and displays the ranked predictions in a ggplot2 chart to the right of the input. Documentation and sources are linked at the top.
To validate the algorithm, models built at different maximum n-gram orders were measured for total file size and prediction accuracy on a small held-out data sample.
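As a hypothetical illustration of the accuracy measurement, the helper below computes top-k accuracy on held-out (context, next-word) pairs; the `predict` function, assumed to return a ranked list of candidate words, is a stand-in for the actual model interface.

```python
def top_k_accuracy(test_pairs, predict, k=3):
    """Fraction of (context, next_word) pairs whose true next word
    appears among the model's top-k predictions."""
    hits = sum(1 for context, word in test_pairs
               if word in predict(context)[:k])
    return hits / len(test_pairs)
```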
A 4-gram model was chosen due to the 100 MB size constraint on shinyapps.io and its performance relative to lower- and higher-order models.