Johns Hopkins Data Science Capstone (Word Prediction)

Hafidz Zulkifli
Sat Jan 23 02:50:30 2016

Introduction

This presentation was created as part of the Capstone Assignment for the Johns Hopkins University Data Science Specialization.
The task was to develop an app that predicts the most probable next word given the words a user has typed so far.
This is similar to the word prediction features available in most keyboard apps like SwiftKey.

The app is available at: https://ikanez.shinyapps.io/word_predictor

The App

Word Predictor

The 'Word Predictor' Shiny app predicts the next word a user will type with reasonable accuracy.

It was developed using three sets of English text: news headlines, blogs and tweets.

The screenshot on the left shows what it looks like on a mobile device.

A user can just type a sentence into the input box, and the predicted next word will appear in the grey box below.

The Algorithm

The app uses a simple “stupid-backoff”-like algorithm based on n-grams.

For example, when 3-grams and 2-grams are used, the algorithm first looks for the most frequent 3-gram whose first two words match the last 2 words that have been typed.

If no 3-gram matches, it searches the 2-gram list using only the last word typed.

If this also fails to return a result, it suggests the most frequent single word (1-gram) in the provided corpus.
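The lookup order described above can be sketched in a few lines of Python. This is only an illustration, not the app's actual code: the toy corpus and the function names (`build_ngram_counts`, `best_continuation`, `predict_next`) are assumptions made for the example, and the sketch simply picks the most frequent match at each level without the score discounting used in full Stupid Backoff.

```python
from collections import Counter

def build_ngram_counts(tokens, n):
    """Count every n-gram (stored as a tuple of words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def best_continuation(counts, context):
    """Return the last word of the most frequent n-gram that starts with
    `context`, or None when no n-gram matches."""
    matches = {gram: c for gram, c in counts.items() if gram[:-1] == context}
    if not matches:
        return None
    return max(matches, key=matches.get)[-1]

def predict_next(words, counts_by_order):
    """Try the longest n-grams first, then back off to shorter ones."""
    for n in sorted(counts_by_order, reverse=True):
        k = n - 1                      # number of context words needed
        if len(words) < k:
            continue                   # not enough typed words for this order
        context = tuple(words[len(words) - k:])
        guess = best_continuation(counts_by_order[n], context)
        if guess is not None:
            return guess
    return None

# Hypothetical toy corpus, standing in for the news, blog and tweet text.
corpus = "it must be a joke it must be a dream it must be true to be fair".split()
counts_by_order = {n: build_ngram_counts(corpus, n) for n in (1, 2, 3)}

print(predict_next("it must be".split(), counts_by_order))   # 3-gram match: a
print(predict_next("hello must".split(), counts_by_order))   # backs off to 2-grams: be
print(predict_next("hello world".split(), counts_by_order))  # backs off to 1-grams: be
```

Scanning every n-gram on each lookup keeps the sketch short; a table keyed by context would be the more practical choice on a real corpus.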

Challenges

Let's take the phrase “it must be” from slide #3 earlier.

Since the app uses at most 3-grams, it takes the last 2 words of the sentence and finds the 3-gram with the highest frequency.

In our case, the 3-gram “must be a” is the most frequent. Thus “a” was chosen as the predicted next word.

As you can see, the ability to predict depends heavily on whether we have previously seen the phrase. Without similar sentences to learn from, the app simply can't predict accurately.

A lot of text is therefore needed. However, we also need to consider how much data our hardware can handle.

The approach taken in this app is to use sampled data and to remove infrequent words. This significantly reduces the processing complexity.
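As a rough illustration of these two reductions, the sketch below samples a fraction of the input lines and then drops words that occur fewer than a minimum number of times. The function names, the sampling fraction and the `min_count` threshold are all assumptions for the example; the values actually used by the app are not stated here.

```python
import random
from collections import Counter

def sample_lines(lines, fraction, seed=42):
    """Keep each line with probability `fraction` (seeded for repeatability)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def drop_rare_words(tokens, min_count):
    """Remove every word that appears fewer than `min_count` times."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]

# Hypothetical input lines, standing in for the full corpus files.
lines = ["it must be a joke", "it must be a dream", "what a day"]
sampled = sample_lines(lines, fraction=0.5)

tokens = " ".join(lines).split()
print(drop_rare_words(tokens, min_count=2))
```

Raising `min_count` or lowering `fraction` shrinks the n-gram tables further, at the cost of seeing fewer phrases to predict from.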