English Text Prediction

Damian Brunold
September 28, 2016

Coursera Data Science Specialization Capstone Project

Slide 1 - Overview of the task

Developing and building a data science product that accepts English text and predicts one or more possible next words.

Provided by Coursera: text in three categories (news, blogs, Twitter) and four languages (English, German, Finnish, Russian). The project is limited to English because of knowledge and time constraints.

Any available resources may be used.

Slide 2 - Getting and preparing the data

Tokenization of text, heuristic sentence boundary detection, removal of profanity and punctuation.

Transforming the token list into n-grams for n = 1 to 4.

Removing nonsensical n-grams.

Calculate counts and probabilities.

Order data by descending probabilities.

Prune data: drop 3-grams and 4-grams that are found only once, and limit the number of n-grams that share an identical prefix (e.g. 3-grams with the same 2-gram prefix).
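
The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the code used for the project: the function names, the regular expressions, the one-word `PROFANITY` placeholder, the use of conditional probabilities (count of the n-gram divided by the count of its prefix), and the thresholds `min_count` and `max_per_prefix` are all assumptions. The step that removes nonsensical n-grams is left out because the slides do not say how it was done.

```python
import re
from collections import Counter, defaultdict

# Hypothetical placeholder; the slides do not say which profanity list was used.
PROFANITY = {"darn"}

def split_sentences(text):
    """Heuristic sentence boundary detection: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def tokenize(sentence):
    """Lowercase, drop punctuation (keeping inner apostrophes), remove profanity."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", sentence.lower())
    return [t for t in tokens if t not in PROFANITY]

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_model(text, max_n=4, min_count=2, max_per_prefix=3):
    """Return {n: {prefix: [(next_word, probability), ...]}}, best candidates first."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)  # n-grams never cross sentence boundaries
        for n in counts:
            counts[n].update(ngrams(tokens, n))

    model = {}
    for n, counter in counts.items():
        # Total count per prefix, used to turn counts into probabilities.
        prefix_totals = Counter()
        for gram, c in counter.items():
            prefix_totals[gram[:-1]] += c

        table = defaultdict(list)
        for gram, c in counter.items():
            # Prune 3-grams and 4-grams that were seen fewer than min_count times.
            if n >= 3 and c < min_count:
                continue
            table[gram[:-1]].append((gram[-1], c / prefix_totals[gram[:-1]]))

        for prefix in table:
            # Order by descending probability (ties broken alphabetically),
            # then keep only the best few continuations per prefix.
            table[prefix].sort(key=lambda pair: (-pair[1], pair[0]))
            del table[prefix][max_per_prefix:]
        model[n] = dict(table)
    return model

model = build_model("I like tea. I like tea. I like coffee.")
print(model[3])  # the 3-gram "i like coffee" occurs once and is pruned
```

Storing the candidates per prefix already sorted and truncated means a later lookup only has to read the first few entries, which is one plausible way to keep both model size and lookup time small.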

Slide 3 - Overview of the prediction algorithm

Tokenize input in the same way as the training data.

Start with 4-grams. If a match is found, return its predictions.

Otherwise try 3-grams. If a match is found, return its predictions.

Otherwise try 2-grams. If a match is found, return its predictions.

Otherwise return top 1-grams.

This implements a simple "stupid backoff" strategy that returns a prediction even for unseen sentence starts.
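
As an illustration of this lookup order, here is a minimal sketch. It assumes a hypothetical model layout (a dict that maps n to a dict from prefix tuples to candidate words, best first) and a `predict` function name that do not come from the project itself. It returns the candidates of the first matching order without combining scores across orders, as described on this slide.

```python
def predict(model, tokens, k=3, max_n=4):
    """Back off from the longest usable context to shorter ones.

    `model` is assumed to map n -> {prefix tuple: [next words, best first]}.
    """
    for n in range(max_n, 1, -1):
        prefix = tuple(tokens[-(n - 1):])
        if len(prefix) < n - 1:
            continue  # not enough input words for an n-gram of this order
        candidates = model.get(n, {}).get(prefix)
        if candidates:
            return candidates[:k]
    # Nothing matched: fall back to the most frequent single words.
    return model.get(1, {}).get((), [])[:k]

# Toy model for illustration only.
toy_model = {
    1: {(): ["the", "to", "and"]},
    2: {("like",): ["tea", "coffee"]},
    3: {("i", "like"): ["tea"]},
    4: {},
}

print(predict(toy_model, "i like".split()))       # matched by a 3-gram
print(predict(toy_model, "we like".split()))      # backs off to a 2-gram
print(predict(toy_model, "hello world".split()))  # falls back to top 1-grams
```

In real use the input would be tokenized with the same routine as the training data; `str.split` only stands in for that here.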

Slide 4 - Some quantitative results

I evaluated the model performance against a separate test data set:

Tokens   Accuracy   Accuracy-3   Model size
114m     11%        19%          240 MB
35m      5%         9%           31 MB
14m      3%         7%           15 MB
7m       3%         6%           10 MB
3m       3%         6%           5 MB
1m       3%         5%           3 MB

Accuracy-3 counts a prediction as a match if any of the top three predicted words equals the test word.
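
Both metrics could be computed along the following lines. This is a sketch under assumptions: the `evaluate` function, the shape of the test cases (pairs of context tokens and the true next word), and the dummy predictor are illustrative and not taken from the project.

```python
def evaluate(predict_fn, test_cases):
    """Return (accuracy, accuracy_3) over (context_tokens, next_word) pairs.

    `predict_fn` is assumed to return a list of predicted words, best first.
    """
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    top1 = top3 = 0
    for context, expected in test_cases:
        predictions = predict_fn(context)[:3]
        if predictions[:1] == [expected]:
            top1 += 1  # the first prediction is correct
        if expected in predictions:
            top3 += 1  # any of the top three predictions is correct
    total = len(test_cases)
    return top1 / total, top3 / total

# Dummy predictor that ignores the context, for illustration only.
def always_common_words(context):
    return ["the", "to", "and"]

cases = [(["in"], "the"), (["go"], "to"), (["a"], "cat"), (["this"], "and")]
accuracy, accuracy_3 = evaluate(always_common_words, cases)
print(accuracy, accuracy_3)  # 0.25 0.75
```

Accuracy-3 can never be lower than plain accuracy, because every top-1 match is also a top-3 match.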

More training data increases accuracy substantially (from 3% with 1m tokens to 11% with 114m tokens), at the cost of a much larger model (3 MB versus 240 MB).

As the Shiny server seems to be capable of handling the largest model size, I use the model based on 114m tokens.

Slide 5 - The Shiny app

The model was integrated into a Shiny app that provides a simple user interface for the prediction engine:

Type the start of a sentence and press Enter; the app then suggests three possible next words.

https://damianbrunold.shinyapps.io/English-Text-Prediction/

[Screenshot of the Shiny app]