Coursera Data Science Capstone

Chris Castle

This presentation outlines our data product, which predicts the next word in a given phrase.

The Coursera Data Science Capstone was created by professors from Johns Hopkins University with data provided by SwiftKey.

Goals

The purpose of this project is to create a data product that predicts the next word in a phrase.

To begin, we randomly sampled over a million entries from the SwiftKey dataset. This was roughly ¼ of our available text data.

We cleaned the data manually using regular expressions. Much of this can be done with R packages, but doing it by hand was a great learning opportunity.
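As a rough illustration of this kind of regex cleaning, here is a minimal sketch in Python. The project itself was written in R, and the specific rules below (lowercasing, dropping URLs, keeping only letters and apostrophes) are assumptions for illustration, not the exact steps we used. The function name clean_text is hypothetical.

```python
import re

def clean_text(line: str) -> str:
    """Normalize one line of raw text (illustrative rules, not the project's)."""
    line = line.lower()
    line = re.sub(r"https?://\S+|www\.\S+", " ", line)  # drop URLs
    line = re.sub(r"[^a-z' ]+", " ", line)               # drop digits, punctuation, symbols
    line = re.sub(r"\s+", " ", line)                     # collapse runs of whitespace
    return line.strip()

cleaned = clean_text("Check THIS out: http://example.com, it's great!!! 123")
```

Keeping apostrophes preserves contractions such as "it's" as single tokens, which matters when the next step splits text into words.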

Model Method Using Ngrams

Using the quanteda package, we converted the sampled data into document-feature matrices for a series of ngrams. We then computed a probability for each unique ngram. These ngram frequencies serve as the basis for the predictive model.
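The idea behind the ngram frequencies can be sketched without quanteda. The Python below is not the code used in the project; it assumes whitespace tokenization and assumes that an ngram's probability is its count divided by the total count of ngrams of the same size. The names ngram_counts and ngram_probabilities are hypothetical.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count every run of n consecutive tokens across the sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def ngram_probabilities(counts):
    """Relative frequency of each ngram among all ngrams of that size."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

sample = ["i love new york", "i love new music"]
bigram_counts = ngram_counts(sample, 2)
bigram_probs = ngram_probabilities(bigram_counts)
```

In this toy sample there are six bigrams in total, so ("i", "love"), which appears twice, gets a probability of 2/6.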

Hash Tables

Searching through a long character vector is slow, so we used the hash package to speed up lookups. Each token from the training set was assigned an integer in a hash table, which the hash package implements as an R environment. These hash table environments are stored as part of the Shiny app. The tokens in our ngram frequency tables were also converted to integers using the same hash table.
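A minimal sketch of the token-to-integer idea, using a Python dictionary in place of the R hash package. The 1-based IDs and the use of 0 for unseen tokens are assumptions made for this example; build_vocabulary and encode are hypothetical names.

```python
def build_vocabulary(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1  # IDs start at 1 (assumption)
    return vocab

def encode(ngram, vocab):
    """Replace each token with its ID; unseen tokens map to 0 (assumption)."""
    return tuple(vocab.get(token, 0) for token in ngram)

vocab = build_vocabulary("i love new york i love new music".split())
encoded = encode(("new", "york"), vocab)
```

A dictionary lookup takes roughly constant time on average, whereas scanning a vector of strings grows with its length. Storing ngrams as integer tuples also keeps the frequency tables compact.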

Using the App

Using the app is easy. Enter a word or a phrase into the input box and the algorithm will quickly retrieve a prediction for the next word.
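One way such a lookup could work is sketched below in Python. This is an assumed simple back-off scheme, not a description of the app's actual algorithm: it tries the longest available context first and falls back to shorter ones. The function predict_next and the toy tables are hypothetical.

```python
def predict_next(phrase, tables, max_context=2):
    """Return the most frequent next word for the longest matching context.

    tables maps a context length k to {context tuple: {next word: count}}.
    """
    tokens = phrase.lower().split()
    for k in range(min(max_context, len(tokens)), 0, -1):
        context = tuple(tokens[-k:])
        candidates = tables.get(k, {}).get(context)
        if candidates:
            return max(candidates, key=candidates.get)
    return None  # no context of any length was seen in training

# Toy frequency tables, invented for this example.
tables = {
    2: {("i", "love"): {"new": 2}, ("love", "new"): {"york": 2, "music": 1}},
    1: {("i",): {"love": 2}, ("love",): {"new": 2}, ("new",): {"york": 3, "music": 1}},
}

full_match = predict_next("I love", tables)
backed_off = predict_next("brand new", tables)
```

Here "I love" matches a two-word context directly, while "brand new" has no two-word match and falls back to the single word "new".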

Please allow a moment for the app to initialize. The application can be found here: https://chriscastle.shinyapps.io/ShinyCapstone/