JHU Data Science Specialization Capstone: Next Word Prediction

Sean Angiolillo
18 Feb 2018

Project Overview

The capstone project for the JHU Data Science Specialization on Coursera calls for a next word prediction Shiny app. As laid out in my milestone report, I broke the assignment into the following four R scripts and the Shiny app itself.

Create smaller files from large text corpus to better handle memory restrictions
Create ngrams (bigrams and trigrams) after cleaning each file of text
Create a data table with up to 3 predictions for each ngram
Create functions to process user input and query data table
Shiny App to take user input and output predictions

Sample Rows of Final Data Table

base	pred
a barrier	to
a bartab	drink
a bartender	at and
a base	of salary hit
a baseball	cap bat game
a bases	loaded
a basic	level understanding income
a basis	for in
a basket	of with and
a basketball	game team player

Starting with a 15% sample of the original data, the final data table had 369,575 unique ngrams.
Predictions saved in one character vector and then split as needed if queried in the app.

Algorithm Summary

The app incorporates a very simple backoff style algorithm. In a table of bigrams and trigrams, the algorithm first processes the user input to accept at most two words.

Once this text has been cleaned in the same manner as the data, the algorithm searches the data table for its accompanying predictions.

If it is not found, it will attempt to locate only the last word given.

If that too is not found, it simply gives the result of the most common unigrams.

Try it Out!

Try the app for yourself at the link below: https://seanangio.shinyapps.io/next_word_app/

The code can be found on Github.