Coursera Data Science: Word Prediction (Capstone Project)

Lyle McMillin
9/29/16

About this project

The overall objective of this project is a Shiny app that takes a phrase (multiple words) entered in a text box and outputs a prediction of the next word.

To build this project we were provided data sets in 4 different languages (Russian, Finnish, Dutch and English). As English is my native language, only the English datasets (Twitter, blogs, news) were used for this project.

Statistics for the English datasets are as follows:

- A Twitter dataset consisting of approx. 2,360,148 data elements
- A blogs dataset consisting of approx. 899,288 data elements
- A news dataset consisting of approx. 77,259 data elements

The datasets were downloaded from the following link: Capstone Dataset

My prediction algorithm

My algorithm uses a group of bigram, trigram and 4-gram data frames that were created with the Quanteda package from 100% of the data.

The n-gram files were then filtered to keep only the entries that had a frequency higher than 4 (i.e. 5 or greater).
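The frequency filter described above can be sketched in base R as follows. This is a minimal illustration on a toy table, not the project's actual code: the data frame `trigrams`, its column names `ngram` and `freq`, and the counts are all assumptions made for the example.

```r
# Toy trigram table; the column names `ngram` and `freq` are assumed.
trigrams <- data.frame(
  ngram = c("one of the", "a lot of", "thanks for the", "going to be"),
  freq  = c(12L, 5L, 4L, 1L),
  stringsAsFactors = FALSE
)

# Keep only n-grams seen more than 4 times (i.e. 5 or more).
min_freq <- 4L
trigrams_kept <- trigrams[trigrams$freq > min_freq, , drop = FALSE]

print(trigrams_kept$ngram)
```

Only the rows with a count of 5 or more survive, so an n-gram seen exactly 4 times is dropped.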

The algorithm used in the Shiny app takes the last three words entered and tries to match them against a 4-gram.

If that fails, the algorithm steps down and repeats the process, taking the highest-frequency trigram match for the last two words.

If that fails, another step down takes place to find a bigram match against the last word in the string.

If that also fails, the default word “the” (the most common unigram) is returned.
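The step-down procedure above can be sketched roughly as below. This is an illustrative base R version under stated assumptions rather than the app's real implementation: the toy tables, the column names `prefix`, `word` and `freq`, and the helper names `best_match` and `predict_next` are all hypothetical.

```r
# Hypothetical toy tables: `prefix` holds the leading words, `word` the next word.
fourgrams <- data.frame(prefix = c("thanks for the", "thanks for the"),
                        word = c("follow", "rt"), freq = c(40L, 25L),
                        stringsAsFactors = FALSE)
trigrams <- data.frame(prefix = c("one of", "one of"),
                       word = c("the", "my"), freq = c(90L, 30L),
                       stringsAsFactors = FALSE)
bigrams <- data.frame(prefix = c("happy", "happy"),
                      word = c("birthday", "new"), freq = c(70L, 20L),
                      stringsAsFactors = FALSE)

# Return the most frequent next word for a prefix, or NA if nothing matches.
best_match <- function(ngram_table, prefix) {
  hits <- ngram_table[ngram_table$prefix == prefix, , drop = FALSE]
  if (nrow(hits) == 0L) return(NA_character_)
  hits$word[which.max(hits$freq)]
}

predict_next <- function(phrase, default = "the") {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Tables indexed by prefix length: 1 word -> bigrams, 2 -> trigrams, 3 -> 4-grams.
  tables <- list(bigrams, trigrams, fourgrams)
  # Try the longest available prefix first, then step down one word at a time.
  for (n in rev(seq_len(min(3L, length(words))))) {
    prefix <- paste(tail(words, n), collapse = " ")
    guess <- best_match(tables[[n]], prefix)
    if (!is.na(guess)) return(guess)
  }
  default  # no n-gram matched: fall back to the most common unigram
}

print(predict_next("Thanks for the"))
print(predict_next("so happy"))
print(predict_next("xyzzy"))
```

With these toy tables, the first call is resolved by the 4-gram table, the second steps down to the bigram table, and the third finds no match and falls back to the default word.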

Results

To make the application run as fast as possible, I implemented two design ideas that were critical to its success:

- First, the bigram, trigram and 4-gram data frames were split into lists based on the first word. This lets the app access only the portion of the data that it needs, which increases speed dramatically.

- Secondly, the bigram, trigram and 4-gram lists were saved as RDS files, which are loaded each time the Shiny app is started. RDS files are compressed and much faster to load than other file types (CSV, text files, etc.).
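Both design ideas can be illustrated with a short base R sketch. The toy data frame, its column names (`prefix`, `word`, `freq`) and the temporary file path are assumptions for the example; the real app's file names and table layout are not shown in this report.

```r
# Toy trigram table; column names are assumed for illustration.
trigrams <- data.frame(
  prefix = c("one of", "one more", "thanks for"),
  word   = c("the", "time", "the"),
  freq   = c(90L, 15L, 60L),
  stringsAsFactors = FALSE
)

# Idea 1: key each row by the first word of its prefix and split into a named list.
first_word <- sub(" .*$", "", trigrams$prefix)
trigram_list <- split(trigrams, first_word)

# Idea 2: save the list as an RDS file and load it back
# (a temp file here; a real app would use a fixed path).
rds_path <- tempfile(fileext = ".rds")
saveRDS(trigram_list, rds_path)
trigram_list <- readRDS(rds_path)

# A lookup now scans only the rows filed under "one", not the whole table.
candidates <- trigram_list[["one"]]
hits <- candidates[candidates$prefix == "one of", , drop = FALSE]
print(hits$word)
```

Indexing the list by name returns `NULL` for a first word that never occurred, so a caller can treat that case as "no match" and step down to a shorter n-gram.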

In tests, the algorithm returned a result in under 4 milliseconds on average. Since the algorithm is executed twice for each prediction (once for the single-word result and once for the data table), the average total execution time should be less than 10 milliseconds for most uses.