T Fanselow
26 June 2019
This project addresses the problem of word prediction: Guessing which word will come next in a sequence.
Word prediction systems facilitate quicker and easier typing, and hence improve user productivity and application accessibility.
For example, see SwiftKey's predictive keyboard on smartphones: https://www.microsoft.com/en-us/swiftkey
Algorithmically, the problem is addressed by analysing corpora of natural language text, and capturing statistical relationships between common words and phrases.
The language model is represented as an n-gram tree. For example, given the corpus:
A big dog
A big cat
The following tree would be created for use in prediction:
Node          Level   Frequency
-------------------------------
|-a             1         2
| |-big         2         2
| | |-dog       3         1
| | |-cat       3         1
|-big           1         2
| |-dog         2         1
| |-cat         2         1
|-dog           1         1
|-cat           1         1
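For illustration, a tree of this shape could be built with the data.tree package roughly as sketched below. This is a minimal sketch, not the project's actual code; the function name, parameters, and the choice to lowercase tokens are assumptions. Every suffix of each sentence contributes one path, truncated at a maximum n-gram length:

# Minimal sketch (not the project's actual code): build an n-gram count tree
# with data.tree. Every suffix of each sentence contributes one path,
# truncated at max_n words, matching the layout of the example above.
library(data.tree)

build_ngram_tree <- function(sentences, max_n = 3) {
  root <- Node$new("root")
  for (sentence in sentences) {
    words <- tolower(strsplit(sentence, "\\s+")[[1]])
    for (start in seq_along(words)) {
      node <- root
      last <- min(start + max_n - 1, length(words))
      for (word in words[start:last]) {
        child <- node$children[[word]]
        if (is.null(child)) {
          child <- node$AddChild(word)
          child$frequency <- 0
        }
        child$frequency <- child$frequency + 1
        node <- child
      }
    }
  }
  root
}

ngram_tree <- build_ngram_tree(c("A big dog", "A big cat"))
print(ngram_tree, "frequency")

Printing the result with the frequency field reproduces the Node/Level/Frequency view shown above.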
The prototype prediction app is hosted on shinyapps.io:
https://tim-fan.shinyapps.io/word_prediction/
Try it out!
Type a few words in the text input box, and hit enter. A prediction for the next word will be displayed.
The page also shows the matched sequence (n-gram) from the user input, and a view of which words follow that sequence in the prediction tree.
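One way such a lookup could work (an illustrative sketch using the build_ngram_tree output above, not the app's actual source) is to match the longest available suffix of the input against the tree, backing off to shorter suffixes until a node with children is found, and then return the most frequent continuation:

# Illustrative back-off lookup (assumed behaviour, not the app's actual code):
# match the longest suffix of the input found in the tree, then return the
# most frequent word that follows it.
predict_next_word <- function(tree, input, max_context = 2) {
  words <- tolower(strsplit(trimws(input), "\\s+")[[1]])
  if (length(words) == 0) return(NA_character_)
  for (n in seq(min(max_context, length(words)), 1)) {
    node <- tree
    for (word in tail(words, n)) {
      node <- node$children[[word]]
      if (is.null(node)) break
    }
    if (!is.null(node) && !node$isLeaf) {
      freqs <- sapply(node$children, function(child) child$frequency)
      return(names(which.max(freqs)))
    }
  }
  NA_character_
}

predict_next_word(ngram_tree, "a big")  # returns "dog" (ties broken by child order)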
The predictor uses a tree data structure, as this was expected to provide compact storage of overlapping n-gram prefixes and fast lookup of matching sequences.
In practice, the tree size, and hence the n-gram length and predictive accuracy, was limited by the memory allowance on the shinyapps server (1 GB), combined with the fairly high memory usage of the data.tree library (https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html#memory).
The model currently hosted on shinyapps showed predictive accuracy of 8% on a held-out test set of 2,000 tweets.
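As a rough illustration of how such a held-out check could be run (an assumption, not the project's documented procedure), each test tweet's final word can be predicted from its preceding words and compared against the truth:

# Rough sketch of a held-out accuracy check (assumed procedure, reusing the
# predict_next_word sketch above): predict each tweet's final word from the
# preceding words and count exact matches.
evaluate_accuracy <- function(tree, tweets) {
  hits <- 0
  for (tweet in tweets) {
    words <- tolower(strsplit(trimws(tweet), "\\s+")[[1]])
    if (length(words) < 2) next
    context <- paste(head(words, -1), collapse = " ")
    predicted <- predict_next_word(tree, context)
    if (!is.na(predicted) && predicted == tail(words, 1)) hits <- hits + 1
  }
  hits / length(tweets)
}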
Future work will be directed towards more efficient use of memory, in order to make predictions based on a much more extensive language model.
For full source code, see https://github.com/tim-fan/coursera_datascience_capstone