12/3/2017

The Challenge

SwiftKey, the partner for this project, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of that keyboard is its predictive text model. When someone types "I went to the," the keyboard presents three options for what the next word might be.

Using a large corpus of English text scraped from Twitter, blog posts, and news articles, I sought to build an algorithm and app that effectively and efficiently predict the next word in a sentence.

The Tools

Processing power was the primary constraint, so I sampled 25 percent of the corpus to build the algorithm. The next challenge was cleaning extraneous information from the data. Using elements of the {tm} and {quanteda} packages (sketched in code after this list), I then:

  • tokenized the text elements into sentences and removed sentences containing profanity;
  • stripped extra whitespace, symbols, numbers, and special characters; and
  • tokenized the sentences into two-, three-, four-, five-, and six-word n-grams.
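The pipeline looked roughly like the sketch below, which uses {quanteda} alone for brevity and assumes the raw corpus has already been read into a character vector raw_text and that profanity is a character vector of terms to filter out (both hypothetical names):

    library(quanteda)

    # Sample 25 percent of the corpus to keep processing manageable.
    set.seed(2017)
    sampled <- sample(raw_text, floor(0.25 * length(raw_text)))

    # Tokenize into sentences, then drop any sentence containing profanity.
    sents <- corpus_reshape(corpus(sampled), to = "sentences")
    bad_pattern <- paste(profanity, collapse = "|")
    sents <- sents[!grepl(bad_pattern, as.character(sents), ignore.case = TRUE)]

    # Tokenize the sentences, stripping numbers, symbols, and punctuation
    # (quanteda's tokenizer handles extra whitespace on its own).
    toks <- tokens(sents, remove_punct = TRUE, remove_numbers = TRUE,
                   remove_symbols = TRUE)

    # Build the two- through six-word n-grams.
    ngrams <- tokens_ngrams(toks, n = 2:6, concatenator = " ")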

The Method

Once I had the n-grams, I stored them with their frequencies in a data table (from the {data.table} package). To generate predictions, the algorithm looks up the typed phrase in the table and assigns each candidate next word a weighted score: the more frequently a candidate follows the phrase, the higher its score, and each time the search has to back off to a shorter n-gram, the score is multiplied by a fixed penalty. This is known as the Stupid Backoff model.
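A simplified version of the scoring might look like the sketch below. It assumes the n-grams live in a data.table ng_dt with columns prefix (the first n-1 words), word (the final word), and freq (the corpus count), all hypothetical names, and uses the commonly cited back-off penalty of 0.4:

    library(data.table)

    predict_next <- function(phrase, ng_dt, alpha = 0.4, top_n = 3) {
      words   <- strsplit(tolower(phrase), "\\s+")[[1]]
      max_ctx <- min(length(words), 5)   # longest context a 6-gram can match
      scores  <- data.table(word = character(), score = numeric())

      # Try the longest context first, backing off one word at a time.
      for (k in max_ctx:1) {
        ctx  <- paste(tail(words, k), collapse = " ")
        hits <- ng_dt[prefix == ctx]
        if (nrow(hits) > 0) {
          # Each back-off step multiplies the score by the fixed penalty.
          penalty <- alpha ^ (max_ctx - k)
          scores  <- rbind(scores,
                           hits[, .(word, score = penalty * freq / sum(freq))])
        }
      }

      # Keep each candidate's best score and return the top suggestions.
      head(scores[, .(score = max(score)), by = word][order(-score)], top_n)
    }

With a populated table, predict_next("I went to the", ng_dt) would return the three highest-scoring candidates, mirroring the three options the keyboard displays.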

The App

Test it out yourself here. Some sample phrases to try:

  • "Can you follow me please? It would mean the…"
  • "I'll dust them off and be on my…"
  • "I love that film and haven't seen it in quite some…"