Coursera Data Science Capstone Project

Jim Carleton
07/18/19

A Word Prediction Application

Objective

The objective of the Capstone project is to create a Shiny application for predicting the next word, given a user-provided sequence of words.

My Shiny app can be found at: https://g00dquestion.shinyapps.io/JCShinyApp/

The source code for my app can be found at: https://github.com/JimC99/CapstoneApp

Data Preparation

This application is based on an N-gram algorithm with “stupid backoff”.
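For reference, the scoring rule from the original “stupid backoff” paper can be written as follows. Here f(.) is the observed count of a word sequence, N is the total number of words in the corpus, and alpha is the backoff factor (0.4 in the paper). The notation is the paper's, not necessarily that of this app's code.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
  \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
  \alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}
```

Because S is a relative frequency rather than a normalized probability, the values are called scores.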

The three text files (Twitter, blog, and news) were first combined into a single corpus, which was then cleaned and tokenized into N-gram tables.
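The tokenizing step can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the app's R code; the cleaning rule (lowercase, keep only letters and apostrophes) and the function names tokenize and ngram_counts are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes.
    (Assumed cleaning rule; the app's actual cleaning may differ.)"""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(lines, n):
    """Count every n-gram, stored as a tuple of words, across the lines."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        # A line with fewer than n words contributes no n-grams.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Tiny stand-in for the combined corpus.
corpus = ["I love data science.", "I love data!"]
bigrams = ngram_counts(corpus, 2)
```

Calling ngram_counts once per order (1 through 5) would give the five raw count tables.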

In each N-gram table, the counts of each final (next) word were tabulated and discounted using an alpha value of 0.4, as per the “stupid backoff” methodology, giving every prefix and next-word pair a score S. To limit file size, only sequences occurring 4 or more times were retained.

The prefixes, next words, and associated “S” values were saved as lookup tables in rds format.
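One way such a lookup table could be built is sketched below, in Python for illustration only. It assumes that S is the relative frequency of the next word given its prefix, multiplied by alpha once for every order below the longest one. That placement of the discount, and the names build_table, ALPHA, and MIN_COUNT, are assumptions rather than the app's actual code.

```python
from collections import Counter

ALPHA = 0.4      # backoff factor stated in the text
MIN_COUNT = 4    # frequency cutoff stated in the text

def build_table(counts, order, max_order=5):
    """Return {prefix: {next_word: S}} for one n-gram order.

    Assumption: S = ALPHA ** (max_order - order) * count(ngram) / count(prefix).
    """
    # Total count of each prefix, taken before rare n-grams are dropped.
    prefix_totals = Counter()
    for ngram, count in counts.items():
        prefix_totals[ngram[:-1]] += count

    discount = ALPHA ** (max_order - order)
    table = {}
    for ngram, count in counts.items():
        if count >= MIN_COUNT:  # drop rare sequences to limit file size
            prefix, word = ngram[:-1], ngram[-1]
            table.setdefault(prefix, {})[word] = (
                discount * count / prefix_totals[prefix]
            )
    return table

# Hypothetical bigram counts, treated here as the highest order.
bigram_counts = Counter({("i", "love"): 6, ("i", "like"): 4, ("i", "saw"): 2})
bigram_table = build_table(bigram_counts, order=2, max_order=2)
```

In this toy example “saw” is dropped because it occurs fewer than 4 times, while the scores of “love” and “like” are still computed against all 12 occurrences of the prefix.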

Word Prediction Approach

The model employs a 5-gram approach with “stupid backoff”.

When the app starts, the five N-gram tables are read into memory. When the user enters a text string, the app looks up the string and its shorter suffixes (e.g., the last 1, 2, and 3 words of a 4-word string), each in the table whose order matches the number of words in the string or suffix.

Matches from each table are assembled into one data frame, which is ranked by decreasing S value. The word with the highest S value is returned as the “next word”.
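The lookup and ranking steps can be sketched as follows, again in Python for illustration only. The sketch assumes the tables already hold discounted S values keyed by prefix, and that a word matched in several tables keeps its highest score. The function name predict and the toy S values are hypothetical.

```python
def predict(text, tables, top_n=5):
    """Pool candidates from every matching table and rank them by S.

    `tables` maps n-gram order -> {prefix_tuple: {next_word: S}}.
    Returns up to top_n (word, S) pairs, highest S first.
    """
    words = text.lower().split()
    best = {}
    for order in range(1, max(tables) + 1):
        k = order - 1                 # prefix length for this order
        if k > len(words):
            break                     # not enough input words for this table
        prefix = tuple(words[len(words) - k:])  # last k words; () when k == 0
        for word, s in tables.get(order, {}).get(prefix, {}).items():
            # Assumption: keep the highest score seen for a repeated word.
            best[word] = max(s, best.get(word, 0.0))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy tables with made-up, already discounted S values.
tables = {
    1: {(): {"the": 0.016, "you": 0.008}},
    2: {("love",): {"you": 0.12, "it": 0.08}},
    3: {("i", "love"): {"you": 0.5, "data": 0.3}},
}
ranked = predict("I love", tables)
next_word = ranked[0][0]
```

The first entry of the ranked list plays the role of the predicted next word, and the list as a whole corresponds to the top-ranked candidates shown in the plot.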

The five top-ranked words are also displayed as a bar plot.

Using the Application

To use the app, simply enter one or more words in the box marked “Enter text here:”.

The app responds by displaying the predicted word under “Predicted next word:”. A bar plot also appears, showing the top five next-word choices scaled by their S values.

You can read about “stupid backoff” in the paper by its inventors here: https://www.aclweb.org/anthology/D07-1090