Greg Sutcliffe
2019-01-15
The Data Science Specialization Capstone tasks us with building a text prediction application, based on a corpus of text drawn from blogs, news articles, and Twitter tweets.
The premise is to build a prediction model that accepts a string of text and outputs the next word in the sentence. This could potentially be used in a mobile application, so memory usage and speed of execution are considerations.
In this presentation, we'll go over the dataset generation process, the model selection, the prototype application itself, and some potential next steps.
The corpus of text is large, comprising 4,269,678 lines in total (557Mb on disk). This requires a significant amount of memory and CPU time to process. The data processing follows this flow:
The tidytext package is used to extract the n-grams, from 4-word-grams down to n=1. This is time-consuming: 20% of the corpus took approximately 4 hours to parse on my laptop. However, the chosen model depends strongly on the quantity of text available to it, so I chose to spend a large amount of time on this processing step in order to make execution of the model faster later.
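As an illustration of this step, the sketch below shows how the 4-grams might be extracted and counted with tidytext. The `corpus_lines` vector and the `w1`–`w4` column names are assumptions for the example, not the exact code used in the project.

```r
library(dplyr)
library(tidytext)
library(tidyr)

# Assumes the corpus has already been read into a character vector,
# one line per element (the name `corpus_lines` is illustrative)
fourgrams <- tibble(text = corpus_lines) %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 4) %>%   # split into 4-word grams
  count(ngram, sort = TRUE) %>%                             # frequency of each 4-gram
  separate(ngram, into = c("w1", "w2", "w3", "w4"), sep = " ")
```

The same call with `n = 3`, `n = 2`, and `n = 1` produces the lower-order tables the backoff model needs.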
The model selected for this application is a Stupid Backoff model which is defined as:
\[ S(w_i|w_{i-k+1}^{i-1}) = \begin{cases} \frac{count(w_{i-k+1}^{i})}{count(w_{i-k+1}^{i-1})}, & \mbox{if } count(w_{i-k+1}^{i}) > 0 \\ \alpha \, S(w_i|w_{i-k+2}^{i-1}), & \mbox{otherwise} \end{cases} \]
With \( \alpha \) set to the recommended value of 0.4. This model was chosen because it is very simple (i.e. fast), but it requires a large table of n-grams to function well, hence the focus on dataset generation.
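To make the definition concrete, here is a rough sketch of a Stupid Backoff lookup. It assumes pre-computed frequency tables `ngrams[[k]]` (k = 1..4), each a data frame with a space-separated `prefix` column, the following `word`, and its count `n`; these structures and names are illustrative, not the app's exact code.

```r
stupid_backoff <- function(prefix_words, ngrams, alpha = 0.4) {
  top_k <- length(prefix_words) + 1                # order of n-gram to try first
  k <- top_k
  while (k >= 2) {
    prefix <- paste(tail(prefix_words, k - 1), collapse = " ")
    hits   <- ngrams[[k]][ngrams[[k]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      # count(prefix) is the sum of the counts of its observed continuations
      return(data.frame(word  = hits$word,
                        score = alpha^(top_k - k) * hits$n / sum(hits$n)))
    }
    k <- k - 1                                     # back off to a shorter prefix
  }
  # No prefix matched at any order: fall back to raw unigram frequencies
  data.frame(word  = ngrams[[1]]$word,
             score = alpha^(top_k - 1) * ngrams[[1]]$n / sum(ngrams[[1]]$n))
}
```

Sorting the returned data frame by `score` and taking the top row gives the predicted next word; each step down to a shorter prefix multiplies the score by \( \alpha \).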
The final frequency table weighs in at 292Mb, which should fit in the memory of most mobile devices, and it returns predictions in well under a second, which feels responsive to the user.
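A quick way to sanity-check those two numbers is sketched below; the object name `ngrams`, the file name, and the reuse of the `stupid_backoff()` sketch above are assumptions for illustration.

```r
# Ship a compressed table alongside the app, then check its in-memory size
saveRDS(ngrams, "ngram_freq.rds", compress = "xz")
ngrams <- readRDS("ngram_freq.rds")
format(object.size(ngrams), units = "Mb")

# Lookup latency for a sample prefix should be well under a second
system.time(stupid_backoff(c("thanks", "for", "the"), ngrams))
```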
The prototype application can be found here - please be patient as the initial data load can take a few seconds if the app is sleeping. Below is a screenshot.
To use the app:
In order to deliver this prototype, many possible improvements have not yet been evaluated. Below are some areas for further work, in rough order of priority: