APTM: Another Predictive Text Model

Irman Bin Zulkeflie
October 18, 2018

Motivation

APTM is the capstone project in the Coursera Johns Hopkins University Data Science Specialization.

Goal

The goal of this capstone is to build a predictive text model, similar to those included in mobile texting apps. SwiftKey has partnered with Coursera for this capstone project and provided a body of text documents from Twitter, news sites, and blogs.

App

Just start typing as if you were using a messaging app. APTM will return the top prediction in context, and a table of up to the top five.

Data Ingest

The source data is 800 MB and consists of over four million lines of text. Of these, 98% was retained for training and 2% for testing. These data were transformed into n-gram frequency tables. Further details can be found in the milestone report.
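The ingest step can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's own code; the function names `split_lines` and `ngram_counts`, the random line-level split, and the lowercase whitespace tokenizer are all assumptions made for illustration.

```python
import random
from collections import Counter


def split_lines(lines, test_fraction=0.02, seed=0):
    """Randomly hold out a fraction of the lines for testing (assumed: split by line)."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def ngram_counts(lines, n):
    """Build a frequency table of n-grams (assumed: lowercase, whitespace tokens)."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


train, test = split_lines(["the cat sat"] * 100)
bigrams = ngram_counts(train, 2)
```

With 100 input lines, the split keeps 98 for training and 2 for testing, matching the proportions above.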

Model

The n-grams in the frequency tables were split into the first n-1 words (the input X), and the last word (the predicted response, y). The responses are scored using a method called “Stupid Backoff”. To save on memory, the training n-grams are pruned to those with count > 8.
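A minimal Python sketch of this scoring scheme follows. It is an illustration under stated assumptions, not the app's implementation: the helper names (`count_ngrams`, `prune`, `split_xy`, `predict`), the trigram maximum order, and the back-off factor of 0.4 (the value commonly used with Stupid Backoff) are not taken from the project.

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # back-off factor; assumed, not stated in the source


def count_ngrams(lines, max_n=3):
    """Count all n-grams of order 1..max_n in one table keyed by word tuple."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, min_count=8):
    """Keep only n-grams whose count exceeds min_count."""
    return Counter({g: c for g, c in counts.items() if c > min_count})


def split_xy(counts):
    """Index each n-gram by its first n-1 words (X) -> {last word (y): count}."""
    index = defaultdict(dict)
    for gram, c in counts.items():
        index[gram[:-1]][gram[-1]] = c
    return index


def predict(counts, index, text, max_n=3, top_k=5, alpha=ALPHA):
    """Rank next-word candidates, longest context first, discounting each back-off."""
    tokens = text.lower().split()
    total_unigrams = sum(c for g, c in counts.items() if len(g) == 1)
    scores = {}
    penalty = 1.0
    for k in range(min(max_n - 1, len(tokens)), -1, -1):  # k = context length
        context = tuple(tokens[len(tokens) - k:])
        denom = counts[context] if k > 0 else total_unigrams
        if denom > 0:
            for word, c in index.get(context, {}).items():
                if word not in scores:  # keep the highest-order score
                    scores[word] = penalty * c / denom
        penalty *= alpha
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


counts = count_ngrams(["the cat sat", "the cat ran", "the cat sat", "a dog sat"])
top = predict(counts, split_xy(counts), "the cat")
```

In this toy corpus no pruning is applied, since every count is far below the threshold; on the full training set, `prune` would be called before `split_xy`.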

One embellishment beyond the base requirements is current-word prediction. If the last character of the input is not a space, the trailing word fragment is split from the input. The preceding input is used to generate predictions, which are then filtered to just those that start with that fragment.
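The fragment handling can be sketched in a few lines of Python. The names `split_fragment` and `predict_current_word` are hypothetical, and `predict_next` stands in for any function that returns a ranked list of candidate words for a given context; none of these come from the project's code.

```python
def split_fragment(text):
    """Return (preceding_text, fragment); fragment is '' if text ends in a space."""
    if text == "" or text[-1].isspace():
        return text, ""
    head, _, fragment = text.rpartition(" ")
    return head, fragment


def predict_current_word(text, predict_next, top_k=5):
    """Predict from the preceding text, then keep candidates that start with the fragment."""
    head, fragment = split_fragment(text)
    candidates = predict_next(head)  # hypothetical: ranked list of words
    matches = [w for w in candidates if w.startswith(fragment.lower())]
    return matches[:top_k]


def toy_predict_next(context):
    """Stand-in predictor with a fixed ranking, for illustration only."""
    return ["sat", "ran", "slept", "sang"]


completions = predict_current_word("the cat s", toy_predict_next)
```

For the filter to be useful, `predict_next` should return more candidates than the final top five, since many of them are discarded once the fragment is applied.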

Using the Application

The application is straightforward to use and could be adapted to educational and commercial uses. In the left picture, the user begins by typing some text, without punctuation, in the supplied input box. As the user types, the prediction is echoed in the field below.

Visit the APTM app at https://musher1720.shinyapps.io/aptm/.

All code supporting this project is available on GitHub at https://github.com/musher1720/milestone_project.