SwiftKey Prediction App Overview

Johannes Rebane
April 26, 2015

Overview: This application, submitted as a final capstone project for the JHU Data Science Specialization through Coursera, implements a word prediction algorithm in the spirit of SwiftKey. Using a database of 2-grams, 3-grams, and 4-grams generated from blog, Twitter, and news corpora, the application processes text and predicts next words using a customized “Stupid Backoff” model, optimized for speed and web-scale language modeling. The final application can be found on ShinyApps.io.

Application Structure and Flow

The application uses Python and NLTK to generate n-grams. R connects to a SQLite database of these n-grams to implement and train a “Stupid Backoff” algorithm (Brants et al., 2007). Shiny provides a front end for interacting with the algorithm.
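As a rough illustration of the n-gram generation step, the sketch below counts 2-, 3-, and 4-grams and stores them in SQLite. It is not the project's code: it uses plain Python instead of NLTK so that it runs without extra packages, and the whitespace tokenizer, the table layout (`n`, `prefix`, `word`, `freq`), and the function names are all assumptions made for this example.

```python
import sqlite3
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_db(lines, orders=(2, 3, 4), path=":memory:"):
    """Count n-grams of the given orders and store them in one SQLite table."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()  # naive whitespace tokenizer (assumption)
        for n in orders:
            counts.update(ngrams(tokens, n))

    conn = sqlite3.connect(path)
    # Hypothetical schema: the context ("prefix") and the final word are stored
    # separately so that a lookup by context is a simple indexed query.
    conn.execute(
        "CREATE TABLE ngrams (n INTEGER, prefix TEXT, word TEXT, freq INTEGER)"
    )
    conn.executemany(
        "INSERT INTO ngrams VALUES (?, ?, ?, ?)",
        [(len(g), " ".join(g[:-1]), g[-1], c) for g, c in counts.items()],
    )
    conn.execute("CREATE INDEX idx_prefix ON ngrams (n, prefix)")
    conn.commit()
    return conn


conn = build_ngram_db(["the cat sat on the mat", "the cat sat on the sofa"])
rows = conn.execute(
    "SELECT word, freq FROM ngrams WHERE n = 3 AND prefix = ? "
    "ORDER BY freq DESC, word",
    ("the cat",),
).fetchall()
print(rows)
```

Splitting each n-gram into a context and a final word keeps the later lookup ("which words follow this context?") to a single query per n-gram order.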

App Process Diagram

Algorithm Overview

The customized “Stupid Backoff” algorithm first looks for the highest-order n-grams matching the end of the input phrase and, if needed, “backs off” with a discount to lower-order n-grams until the highest-scoring match is found.
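The backoff idea can be sketched in a few lines of Python. This is a minimal version of the scheme described by Brants et al. (2007), not the app's customized implementation: the backoff factor of 0.4 is the value suggested in that paper (the discount used by the app is not stated here), the in-memory table stands in for the SQLite database, and `train` and `predict` are hypothetical names. Unlike a strict "stop at the first match" search, this sketch scores candidates from every order and keeps each word's highest-order score.

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # backoff factor from Brants et al. (2007); an assumption for this sketch


def train(lines, max_order=4):
    """Map each context tuple to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for n in range(2, max_order + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                table[context][tokens[i + n - 1]] += 1
    return table


def predict(table, phrase, max_order=4, top_k=3):
    """Rank next-word candidates, multiplying by ALPHA for each order backed off."""
    tokens = phrase.lower().split()
    scores = {}
    discount = 1.0
    # Start with the longest usable context and shorten it one word at a time.
    for ctx_len in range(min(max_order - 1, len(tokens)), 0, -1):
        followers = table.get(tuple(tokens[-ctx_len:]))
        if followers:
            total = sum(followers.values())
            for word, count in followers.items():
                # Keep the score from the highest order at which the word was seen.
                scores.setdefault(word, discount * count / total)
        discount *= ALPHA
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


corpus = ["i love new york city", "i love new york pizza", "we love new shoes"]
table = train(corpus)
result = predict(table, "they love new")
print(result)
```

In the example, the 4-gram context "they love new" is unseen, so the scores come from the 3-gram context "love new" and carry one factor of `ALPHA`.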

Prediction Algorithm Diagram

Shiny App Overview and Functionality

The Shiny App provides reactive input for the end user, and displays the predictions in ranked form in a ggplot2 chart to the right of the input. Documentation and sources are linked accordingly at the top.

Prediction Algorithm Diagram

Performance Assessment & Approach

To validate the algorithm, total file size and prediction accuracy on a small data sample were measured for models of different n-gram orders.
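One way such an accuracy measurement could be set up is sketched below: for every word in a held-out sentence, check whether it appears among the top-k predictions for the words before it. The metric, the function names, and the toy bigram predictor are assumptions for illustration; the original measurement procedure is not described in detail here.

```python
from collections import Counter, defaultdict


def top_k_accuracy(predict_fn, sentences, k=3):
    """Fraction of held-out words found among the top-k predictions for their prefix."""
    hits = total = 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens)):
            total += 1
            if tokens[i] in predict_fn(tokens[:i])[:k]:
                hits += 1
    return hits / total if total else 0.0


def make_bigram_predictor(train_sentences):
    """Toy stand-in for the real model: rank followers of the last word by count."""
    followers = defaultdict(Counter)
    for sentence in train_sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev][nxt] += 1

    def predict_fn(prefix_tokens):
        return [w for w, _ in followers[prefix_tokens[-1]].most_common()]

    return predict_fn


predictor = make_bigram_predictor(
    ["the cat sat on the mat", "the dog sat on the rug"]
)
accuracy = top_k_accuracy(predictor, ["the cat sat on the rug"], k=1)
print(accuracy)
```

Running the same function with predictors built from different maximum n-gram orders would give accuracy figures that can be compared against each model's file size.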

A 4-gram model was chosen because of size constraints (100 MB on shinyapps.io) and its performance relative to lower- and higher-order models.