8/12/2019

Coursera Capstone Project

This presentation is the companion to my Shiny App, a text prediction application based on the SwiftKey dataset. Those data were cleaned, and sets of 2-grams through 6-grams were generated. Using the Modified Kneser-Ney algorithm, a probability was calculated for each terminal word based on the “root” of the n-gram. The database could then be reduced to just the most probable word for each root (since only a single guess is required). Together this makes for a very small and fast application.

The Modified Kneser-Ney Algorithm

The application is a combination of the Kneser-Ney algorithm and a simple back-off model. Each n-gram (from n=2 to n=6) was split into a “root” string and a “terminal” word. For each terminal word, a probability was calculated from the frequency of the terminal word relative to the frequency of the root, with an adjustment for how variable the terminal word could be. This adjustment, also known as the continuation probability, improves the accuracy of the model by accounting for how many different terminal words were seen in the training data for the given root.
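For reference, the interpolated Kneser-Ney estimate on which the modified variant is built can be sketched as follows. The notation here is mine rather than taken from the app's source; the modified form replaces the single discount D with separate discounts chosen by how often the n-gram was seen.

```latex
% Probability of terminal word w given root r (an (n-1)-word prefix).
% c(.) is a training count, D is a fixed discount, N_{1+}(r\,\bullet) is the
% number of distinct terminal words observed after r, and r' is r with its
% first word removed (the back-off root).
P_{KN}(w \mid r) \;=\; \frac{\max\bigl(c(r\,w) - D,\, 0\bigr)}{c(r)}
  \;+\; \frac{D}{c(r)}\, N_{1+}(r\,\bullet)\; P_{KN}(w \mid r')
```

The term N_{1+}(r •), the count of distinct terminal words seen after the root, is the adjustment described above; at the lowest order, a continuation probability based on how many distinct contexts a word completes stands in for the raw unigram frequency.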

In cases where the higher-order n-grams aren’t found, the string is stripped of its first word and run again. If no match is found at any order, the application returns “THE.”
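A minimal sketch of this back-off lookup is below. The object and column names (`ngrams`, `root`, `prediction`) and the helper `predict_next` are illustrative assumptions, not the app's actual code.

```r
# Back-off lookup sketch: try the longest root, then strip the first word
# and try again; fall back to "THE" if nothing matches.
predict_next <- function(phrase, ngrams, max_n = 6) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  # Use at most the last (max_n - 1) words as the root
  if (length(words) > max_n - 1) {
    words <- tail(words, max_n - 1)
  }
  while (length(words) > 0) {
    root <- paste(words, collapse = " ")
    hit <- ngrams$prediction[ngrams$root == root]
    if (length(hit) > 0) {
      return(hit[1])  # single most probable terminal word for this root
    }
    words <- words[-1]  # drop the first word and back off
  }
  "THE"
}
```

Because the table already stores only the single most probable terminal word per root, each lookup is a simple match on the root string rather than a probability calculation at run time.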

Pruning and the Importance of Context

Early versions of the model suffered greatly from trying to over-clean bad data. These early models stripped out numbers (or replaced them with placeholders) and tried to replace non-Latin1 characters with appropriate equivalents. While this might be an appropriate response in some cases, it also runs the risk of breaking context. For example, removing the heart character (♡) would convert the phrase “I♡U” (I love you) to “IU” or “I U”, either of which would decrease the quality of the model. Instead, given the millions of n-grams that could be generated from the dataset, it was more practical to filter out lines that contained problematic characters.
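A sketch of this line-level filtering is shown below. The file name and the exact character pattern are assumptions for illustration; the real cleaning pipeline may use a different rule.

```r
# Read the raw corpus and drop whole lines containing characters outside
# printable ASCII, rather than trying to repair them character by character.
lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
keep  <- !grepl("[^ -~]", lines)  # TRUE only for lines of printable ASCII
clean <- lines[keep]
```

Dropping a whole line costs a handful of n-grams but avoids generating n-grams whose context has been silently altered.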

How It Performs

When compared against a set of 12,656 6-grams generated from 2,000 lines of held-out data, the algorithm correctly guessed the next word 17% of the time.
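A sketch of how such an accuracy figure can be computed is below, reusing the hypothetical `predict_next` helper from earlier and assuming a held-out data frame `test` with columns `root` (the first five words) and `answer` (the sixth); these names are illustrative.

```r
# Score the model's single guess for each held-out 5-word root against
# the actual sixth word, then report the fraction of exact matches.
guesses  <- vapply(test$root, predict_next, character(1), ngrams = ngrams)
accuracy <- mean(guesses == test$answer)
```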

How It Works

The application is very straightforward: type a phrase in the input box, and the word that the model predicts will come next appears below it.
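A minimal Shiny skeleton with this layout might look like the following. It is illustrative only, not the deployed app's source, and reuses the hypothetical `predict_next` and `ngrams` objects from the earlier sketch.

```r
library(shiny)

# Minimal interface: a text input with the predicted next word shown below it.
ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    predict_next(input$phrase, ngrams)
  })
}

shinyApp(ui = ui, server = server)
```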