Slide Deck for Data Science Capstone Project

Teo Tse Tsong
15th April 2016

Overview

The objective of the project is to build a model that predicts the next word given a “phrase”. The model is built on a corpus of text collected from blogs, news and Twitter posts, available at the following URL:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Building the Model - N-grams

The prediction model is based on N-grams. An N-gram is a contiguous sequence of N tokens (words) in a sentence or phrase. For this model, 2-gram, 3-gram and 4-gram tables are built and stored. Each table is sorted by frequency, which gives the relative probabilities of the different N-grams.
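As a rough illustration of the tables described above, the following Python sketch counts N-grams over a couple of made-up sentences and sorts them by frequency. It is not the code behind the app: the whitespace tokenizer, the function names (tokenize, ngrams, build_table) and the sample sentences are all assumptions chosen for brevity.

```python
from collections import Counter

def tokenize(line):
    """Lowercase a line and split it on whitespace (a deliberately crude tokenizer)."""
    return line.lower().split()

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_table(lines, n):
    """Count n-grams over all lines.

    Returns rows of (ngram, count, relative frequency), most frequent first.
    """
    counts = Counter()
    for line in lines:
        counts.update(ngrams(tokenize(line), n))
    total = sum(counts.values())
    return [(gram, count, count / total) for gram, count in counts.most_common()]

# Hypothetical two-line corpus, used only to show the shape of the tables.
sample = ["the cat sat on the mat", "the cat ate the fish"]
tables = {n: build_table(sample, n) for n in (2, 3, 4)}

for gram, count, freq in tables[2][:3]:
    print(" ".join(gram), count, round(freq, 3))
```

Here the relative frequency is simply the count divided by the total number of N-grams of that order, which is one plausible reading of "relative probabilities"; other normalisations are possible.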

The N-grams were extracted from 50,000 lines of text each from the blogs and news corpora. Memory limitations prevent more lines from being added. This sample constitutes about 5.5% of the available blog lines but about 65% of the available news lines.

Memory limitations also restrict the model to N-grams of order 4 or lower. More accurate prediction should be possible with the inclusion of higher-order N-grams.

The Twitter data set was not used because it tended to contain more colloquial expressions than complete, structured phrases and words.

Building the Model - Backoff

A very simple backoff approach is used, as follows:

Count the number of words the user has entered.
If the user has entered 3 or more words, take the last three and use the 4-gram table.
If fewer than 3 words were entered, use the N-gram table whose order is 1 more than the number of words entered.
Starting with the selected N-gram table, find the list of N-grams matching the words entered by the user. If nothing is found, fall back to the (N-1)-gram table.
If nothing is found in the 2-gram table, report that no match was found.
If matches are found, display the top five in order of probability and stop; lower-order N-grams are not searched.
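The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the app's actual implementation: the table layout (a dictionary from context words to next-word counts), the names build_tables and predict, and the three-sentence corpus are all hypothetical. The sketch also assumes that backing off means dropping the earliest word of the context, which the steps imply but do not state.

```python
from collections import Counter, defaultdict

MAX_ORDER = 4  # highest-order table, as in the text

def build_tables(lines, max_order=MAX_ORDER):
    """Map each order n to {context tuple: Counter of next words}."""
    tables = {n: defaultdict(Counter) for n in range(2, max_order + 1)}
    for line in lines:
        tokens = line.lower().split()
        for n in tables:
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                word = tokens[i + n - 1]
                tables[n][context][word] += 1
    return tables

def predict(phrase, tables, top=5, max_order=MAX_ORDER):
    """Return up to `top` (word, relative frequency) pairs, backing off as needed."""
    words = phrase.lower().split()
    # Three or more words select the 4-gram table; fewer select order len(words) + 1.
    n = min(len(words) + 1, max_order)
    while n >= 2:
        # Keep only the last (n - 1) words as the lookup context.
        context = tuple(words[len(words) - (n - 1):])
        followers = tables[n].get(context)
        if followers:
            total = sum(followers.values())
            # Matches found: return the best candidates and stop searching.
            return [(w, c / total) for w, c in followers.most_common(top)]
        n -= 1  # nothing matched: back off to the (N-1)-gram table
    return []  # nothing found, even in the 2-gram table

# Hypothetical corpus, used only to exercise the lookup.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "a dog sat on the mat",
]
tables = build_tables(corpus)

print(predict("cat sat on the", tables))  # matched in the 4-gram table
print(predict("purple on the", tables))   # backs off to the 3-gram table
print(predict("zzz", tables))             # no match at any order
```

Indexing each table by its context words keeps every lookup to a single dictionary access; a sorted flat table, as described in the slides, would give the same ranking.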

Using the App

The prediction app can be found at

https://tsetsong.shinyapps.io/CapstoneShiny/

[Figure: App Layout and Instructions]

The steps required to use the app are labelled in the figure above. Upon entry of a short phrase, there will be a 1-2 minute wait while the code does the prediction work, and then the results will be displayed.

Conclusions

A model for predicting the next word in an N-gram phrase has been developed and functions reasonably well. Prior to using this method, both the Maximum Entropy model and the Naive Bayesian model were explored but the results were not satisfactory.

While functioning reasonably well, the current model has a number of shortcomings:

Inability to discern contextual word associations, for example “life” and “death”, or “rags” and “riches”.
Large data tables required.
The current implementation is limited in “vocabulary” and has no self-learning ability.