Word Prediction App

André Tipping

Johns Hopkins Coursera - Data Science Specialization Capstone

Sponsored by SwiftKey

Introduction

The objective of this project was to build a web application that could predict the word most likely to follow a sequence of words.

That sequence can be modelled as an n-gram: a contiguous sequence of n 'items' (here, words). An n-gram model assumes that the probability of the next word depends only on the most recent n-1 tokens, so n-gram counts can be used to estimate that probability (source: Wikipedia).
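As a minimal sketch of that idea, the snippet below counts n-grams in a toy sentence and estimates how often one word follows a two-word context. It is written in Python for illustration only (the project itself used R), and the helper name `ngrams` and the sample sentence are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous n-token tuple, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat and the cat slept".split()
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# Relative frequency of "sat" following the context ("the", "cat"):
# count of the full trigram divided by the count of its two-word prefix.
context = ("the", "cat")
p_sat = trigram_counts[context + ("sat",)] / bigram_counts[context]
```

Here the context "the cat" occurs twice and is followed by "sat" once, so the estimate is one half.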

Model Creation

The model was built from unstructured .txt files, provided by SwiftKey, drawn from three sources: blogs, news, and Twitter.

After cleaning the data, by removing symbols, numbers, and profanity, a 5-gram probabilistic language model was built using the quanteda package.
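The cleaning and counting steps might look roughly like the sketch below. It is Python for illustration, not the quanteda pipeline the project used; the `PROFANITY` set is a hypothetical placeholder (the actual word list is not given), and the ASCII-only character filter is a simplifying assumption.

```python
import re
from collections import Counter

# Hypothetical placeholder list; the project's real profanity list is not given.
PROFANITY = {"darn", "heck"}

def clean(text):
    """Lowercase, replace digits and symbols with spaces, drop listed words."""
    text = text.lower()
    # Keep only ASCII letters, apostrophes and spaces (a simplifying assumption).
    text = re.sub(r"[^a-z' ]+", " ", text)
    return [w for w in text.split() if w not in PROFANITY]

def build_counts(sentences, max_n=5):
    """Count every n-gram of order 1..max_n across the cleaned sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = clean(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = build_counts(["Heck, I won 2 tickets @ the show!!"])
```

Storing all orders from 1 to 5 in one table keeps the lower-order counts available for backing off later.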

Algorithm

Candidate next words are ranked with the Stupid Backoff equation below, which yields scores (relative frequencies) rather than true probabilities for each prediction.

Stupid Backoff equation
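For reference, the standard formulation of Stupid Backoff (Brants et al., 2007), which this slide is assumed to show, is given below. Here f(·) is an n-gram's frequency in the training data, N is the total number of training tokens, and α is the backoff factor.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
  \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
  \alpha \, S(w_i \mid w_{i-k+2}^{i-1})         & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}
```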

This was interpreted like so and applied to each row in the model:

# Score a model row by the order of the n-gram it matches; each step
# down in order multiplies the relative frequency by alpha = 0.4.
if (inputIs5gram) {
    score = matched5gramCount / input4gramCount
} else if (inputIs4gram) {
    score = 0.4 * matched4gramCount / input3gramCount
} else if (inputIs3gram) {
    score = 0.4 * 0.4 * matched3gramCount / input2gramCount
} else if (inputIs2gram) {
    score = 0.4 * 0.4 * 0.4 * matched2gramCount / input1gramCount
}

Note: the backoff factor alpha is set to the recommended value of 0.4.
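The scoring rule above can be sketched as a small function. This is a Python illustration under stated assumptions, not the app's R code: `counts` is assumed to map n-gram tuples to frequencies, the function name and toy counts are hypothetical, and the linear scan over `counts` is chosen for readability rather than speed.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor stated in the text

def stupid_backoff(context, counts, max_n=5, alpha=ALPHA):
    """Score candidate next words for `context` (a list of tokens).

    Returns {word: score}. Each word keeps the score from the longest
    n-gram that matches it; an n-gram of order n is penalised by
    alpha ** (max_n - n), mirroring the if/else chain above.
    """
    scores = {}
    context = tuple(context)[-(max_n - 1):]  # use at most max_n - 1 words
    while context:
        n = len(context) + 1
        denom = counts.get(context, 0)
        if denom:
            penalty = alpha ** (max_n - n)
            for gram, c in counts.items():
                if len(gram) == n and gram[:-1] == context:
                    # setdefault keeps the higher-order score if one exists
                    scores.setdefault(gram[-1], penalty * c / denom)
        context = context[1:]  # back off to a shorter context
    return scores

# Toy counts, invented for illustration only.
toy = Counter({("i",): 3, ("love",): 4, ("i", "love"): 2,
               ("love", "you"): 3, ("love", "cats"): 1,
               ("i", "love", "you"): 2})
scores = stupid_backoff(["i", "love"], toy, max_n=3)
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these toy counts, "you" is matched by the trigram and ranks first, while "cats" is only reachable by backing off to the bigram and so carries one factor of alpha.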

Shiny App

App

After entering one or more words into the input box, wait a few seconds and the server will predict the most likely next word. The top candidates are presented in a table, ordered from most likely at the top to least likely at the bottom, and a word cloud represents the same results visually. You may extend the sequence already entered by adding further words; the results refresh automatically.

Benchmark

The following are the results of a benchmarking tool that uses a sample of blog articles and tweets as its test dataset. The script slides a fixed-size word window over each test sentence and calls the prediction function to predict the word that follows the window. Once the window has covered the entire sentence, it moves on to the next sentence in the dataset. The numbers in parentheses are those of baseline predictions.
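The sliding-window procedure can be sketched as follows. This is an illustrative Python reconstruction, not the benchmarking tool itself: the window size of 3, the `predict` callback signature, and the function name are assumptions, and only top-1 and top-k precision are computed because the tool's exact "score" definition is not given here.

```python
def benchmark(sentences, predict, window=3, k=3):
    """Slide a `window`-word context over each sentence and score `predict`.

    `predict(context)` is assumed to return candidate words, best first.
    Returns (top-1 precision, top-k precision) as fractions of predictions.
    """
    top1 = topk = total = 0
    for sentence in sentences:
        words = sentence.split()
        for i in range(window, len(words)):
            guesses = predict(words[i - window:i])[:k]
            target = words[i]
            top1 += guesses[:1] == [target]  # best guess is the true word
            topk += target in guesses        # true word is among the top k
            total += 1
    if total == 0:
        return 0.0, 0.0
    return top1 / total, topk / total

# A fixed-answer stand-in for a real prediction function.
def always_common(context):
    return ["the", "mat", "on"]

p1, p3 = benchmark(["the cat sat on the mat"], always_common, window=2)
```

A constant predictor such as `always_common` plays the role of a baseline: it ignores the context entirely, which gives a reference point for judging the model's figures.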

Overall top-3 score:     17.96 % (6.64 %)
Overall top-1 precision: 13.44 % (5.42 %)
Overall top-3 precision: 21.89 % (8.11 %)

Average runtime:         27.54 msec (0.09 msec)
Number of predictions:   28464
Total memory used:       307.07 MB (286.76 MB)