The goal of this capstone project is to build a predictive model for English text using data from a corpus called HC Corpora.
A simple model of 1-grams, 2-grams and 3-grams (n-grams) was constructed from a subset of the provided corpus. An n-gram model is a probabilistic language model that predicts the next item in a sequence using an (n − 1)-order Markov assumption.
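As an illustration (this formula is a standard statement of the model, not drawn from the report itself), a 3-gram model makes a second-order Markov assumption: the probability of the next word depends only on the two preceding words and can be estimated from counts,

$$
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\operatorname{count}(w_{i-2}, w_{i-1}, w_i)}{\operatorname{count}(w_{i-2}, w_{i-1})}.
$$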
A very basic prediction routine is able to predict a sensible-looking next word using the n-gram model. Further work is required to refine the model and to consider an efficient way of including words from earlier parts of a given phrase for improved accuracy.
My goal for this capstone project is to gamify the use of the next word prediction algorithm. I intend to present the app as a game to see whether the user can lead the algorithm to predict a given word (user specified or randomly generated). I am seeking feedback on the idea of gamifying the user experience when assessing the accuracy of my next word prediction algorithm.
The data is from a corpus called HC Corpora. The corpora are collected from publicly available sources by a web crawler. The files have been language filtered but may still contain some foreign text.
| Filename | File size (MB) | Word count | Line count |
|---|---:|---:|---:|
| en_US.blogs.txt | 200.4242 | 37,334,690 | 899,288 |
| en_US.news.txt | 196.2775 | 34,372,720 | 1,010,242 |
| en_US.twitter.txt | 159.3641 | 30,374,206 | 2,360,148 |
The following data cleansing activities were performed (a minimal sketch of these steps is shown after the list):

- remove punctuation
- remove numbers (digits)
- change to lower case
- remove profanity (given the intention to gamify the user interaction for a younger audience)

The list of profane words is drawn from Shutterstock's GitHub-hosted list of bad words.
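As an illustration only (not the actual code used in this project), these cleansing steps can be sketched in base R; `profanity` is assumed to be a character vector loaded from the bad-words list:

```r
# Illustrative cleansing sketch in base R; 'profanity' is assumed to be a
# character vector of profane words loaded separately.
clean_text <- function(lines, profanity) {
  lines <- tolower(lines)                    # change to lower case
  lines <- gsub("[[:punct:]]", " ", lines)   # remove punctuation
  lines <- gsub("[[:digit:]]", " ", lines)   # remove numbers (digits)
  # remove profane words, matched as whole words
  pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  lines <- gsub(pattern, " ", lines)
  gsub("\\s+", " ", trimws(lines))           # collapse leftover whitespace
}
```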
The following summary is drawn from a sample of 30,000 lines of text from each of the data sources. The sample contains a total of 2,627,091 words, of which 83,427 are unique. The table below shows how many of the most frequent unique words are needed to cover a given percentage of all word instances in the sample (a sketch of this calculation follows the table).
| Number of unique words | Coverage (%) |
|---:|---:|
| 141 | 50 |
| 376 | 60 |
| 960 | 70 |
| 2,414 | 80 |
| 7,343 | 90 |
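The coverage figures above can be reproduced from a sorted word-frequency table. A sketch of that calculation (illustrative only; `words` is assumed to be a character vector of all tokens in the sample):

```r
# Number of most-frequent unique words needed to cover a given share of all
# word instances; 'words' is assumed to be a character vector of tokens.
words_for_coverage <- function(words, coverage = 0.9) {
  freq <- sort(table(words), decreasing = TRUE)   # word frequencies, descending
  cum_share <- cumsum(freq) / sum(freq)           # cumulative share of instances
  unname(which(cum_share >= coverage)[1])         # first rank reaching the target
}
```

For example, `words_for_coverage(words, 0.5)` would give the number of unique words needed for 50% coverage.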
The sample is tokenised into 1-grams, 2-grams and 3-grams.
At the time of writing this report, I have created simple 1-gram, 2-gram and 3-gram tables with a frequency count against each of the n-grams from the dataset. Both the 2-gram and 3-gram tables are split into a 'lookup words' column and a 'next word' column, as shown in the sample from the 3-gram table below (a sketch of how such a table can be built follows it).
| corpus.3grams | freq | lookupwords | nextword | lookupwords.freq |
|---|---:|---|---|---:|
| of the year | 217 | of the | year | 4745 |
| of the most | 174 | of the | most | 4745 |
| of the day | 151 | of the | day | 4745 |
| of the best | 144 | of the | best | 4745 |
| of the world | 115 | of the | world | 4745 |
| of the season | 101 | of the | season | 4745 |
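As an illustration of how such a table can be constructed (the column names mirror the table above, but this is not the exact code used), a data.table sketch that builds 3-gram counts from a vector of tokens and splits each 3-gram into lookup words and next word:

```r
library(data.table)

# Illustrative sketch: build a 3-gram frequency table from a character vector
# of tokens and split each 3-gram into 'lookup words' (first two words) and
# 'next word' (last word). Column names mirror the table shown above.
build_3grams <- function(tokens) {
  n <- length(tokens)
  grams <- paste(tokens[1:(n - 2)], tokens[2:(n - 1)], tokens[3:n])
  dt <- data.table(corpus.3grams = grams)[, .(freq = .N), by = corpus.3grams]
  dt[, lookupwords := sub(" [^ ]+$", "", corpus.3grams)]  # first two words
  dt[, nextword := sub("^.* ", "", corpus.3grams)]        # last word
  dt[, lookupwords.freq := sum(freq), by = lookupwords]   # total count of the lookup words
  setorder(dt, -freq)[]
}
```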
A simple prediction routine is used to predict the next word for a provided phrase. The algorithm currently works as follows (a simplified sketch of this lookup logic is shown after the steps):
1. Turn the provided phrase into lower case, remove punctuation and tokenise it into words.
2. If the phrase contains two or more words, take the last two words of the phrase and look them up in the 3-gram table using the 'lookup words' column.
3. Find all entries that contain those lookup words, sort them by frequency count and return the three with the highest counts.
4. If fewer than three entries are found in the 3-gram table, or if the provided phrase contains only one word, the same routine is applied to the 2-gram table.
Step 4 above is a step towards understanding how the ‘Stupid Backoff’ strategy can be implemented in the final algorithm.
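A simplified sketch of this lookup logic, assuming data.table objects `tbl3` and `tbl2` shaped like the 3-gram table shown earlier (illustrative only, not the actual `predictnextword` implementation):

```r
library(data.table)

# Illustrative lookup sketch; 'tbl3' and 'tbl2' are assumed to be data.tables
# with 'lookupwords', 'nextword' and 'freq' columns, as in the table above.
predict_next <- function(phrase, tbl3, tbl2, top_n = 3) {
  words <- strsplit(gsub("[[:punct:]]", " ", tolower(phrase)), "\\s+")[[1]]
  words <- words[words != ""]

  hits <- character(0)
  if (length(words) >= 2) {                          # step 2: try the 3-gram table
    key <- paste(tail(words, 2), collapse = " ")
    hits <- tbl3[lookupwords == key][order(-freq), nextword]
  }
  if (length(hits) < top_n && length(words) >= 1) {  # step 4: fall back to 2-grams
    key <- tail(words, 1)
    hits <- c(hits, tbl2[lookupwords == key][order(-freq), nextword])
  }
  head(unique(hits), top_n)
}
```

A full 'Stupid Backoff' implementation would additionally discount candidates drawn from the lower-order table by a fixed factor (commonly 0.4) rather than simply appending them.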
Trained on 90,000 lines of text taken equally from the blogs, news and twitter sources, this simple routine currently suggests words that appear sensible.
predictnextword("I am going to buy a")
## [1] "new" "car" "house"
However, there are obvious flaws in both the model and the prediction routines:
predictnextword("I am going to the store to buy bacon, eggs and")
## [1] "i" "my" "a"
predictnextword("I am of the ")
## [1] "year" "most" "day"
The next step of exploratory work I have planned is to assess whether there is a simple way to address the flaws in my current work, taking into consideration the time (personal) and resource (machine) constraints of this project. Specifically, my intended exploratory questions are:
- Can the 1-gram, 2-gram and 3-gram tables be collapsed into a simpler model with precalculated probabilities of a suggested word for easier inference at runtime?
- Should stop words be removed when constructing the n-gram tables? If they are, how can the prediction algorithm still provide grammatically correct predictions?
- Can cluster analysis of words be used to inform the probability of a suggested word? The intention of this analysis is to assess whether there is a faster way of taking into consideration words that are not used in the lookup of the 2-gram and 3-gram tables. If it can, when and how should this method be injected into the final prediction routine?
- Given machine constraints, what is the optimal size of the final model?
- If no suggested words are found from the n-gram lookups and the cluster analysis, can the suggested word be randomly drawn from the 1-gram table?
It is my goal to add a fun element to the final presentation of the next word prediction routine by gamifying the interaction between the user and the routine.
The intention is to provide a word (either randomly selected or user supplied) and allow the user to type in phrases. The game is won if the routine suggests the selected word. I am seeking feedback on this concept.