Synopsis

The Data Science Capstone Project asks us to create an algorithm that predicts the next word in a sentence from a user’s input, similar to the SwiftKey keyboard available on both iOS and Android.

The goal of this Milestone report is to explore the HC Corpora (www.corpora.heliohost.org) data set and consider how we can use it to create our prediction algorithm engine. The HC Corpora data set consists of plain text files with blog, news and Twitter data spanning multiple languages. I will be focusing on the English (en_US) data set in this report.

Setting up the R environment

We start by loading a few R libraries, turning echo on for R code chunks, centering figures and suppressing messages. We also set a fixed seed value to ensure reproducibility of this report.

library(knitr); library(ggplot2); library(R.utils); library(stringr)
library(openNLP); library(tm); library(qdap); library(RWeka)
library(wordcloud); library(RColorBrewer); library(stringi)
opts_chunk$set(echo=TRUE, fig.align='center', message=FALSE, cache=TRUE)
set.seed(98765)

Reading the raw data into memory

Prior to running this report we placed the HC Corpora data set in a sub folder named data. Reading these large files is time consuming, so we check whether the raw data is already in memory and only read it from disk if it is not.

if(!exists("blogs.raw")) {blogs.raw <- scan("data/en_US/en_US.blogs.txt", character(0), sep = "\n")}
if(!exists("news.raw")) {news.raw <- scan("data/en_US/en_US.news.txt", character(0), sep = "\n")}
if(!exists("twitter.raw")) {twitter.raw <- scan("data/en_US/en_US.twitter.txt", character(0), sep = "\n")}

Exploring the data

Next we use the stringi package’s stri_stats_latex() function to do a quick word count over our three source files. The table below lists the disk size, row count and word count of each plain text file.

blogs.raw.word.count <- stri_stats_latex(blogs.raw)['Words']
news.raw.word.count <- stri_stats_latex(news.raw)['Words']
twitter.raw.word.count <- stri_stats_latex(twitter.raw)['Words']
File                 Size (MB)        Rows         Words
en_US.blogs.txt          210.2     899,288    37,570,839
en_US.news.txt           205.8   1,010,242    34,494,539
en_US.twitter.txt        167.1   2,360,148    30,451,128
Totals                   583.1   4,269,678   102,516,506
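
For completeness, a minimal sketch of how the size and row figures in the table could be derived, assuming the file paths and .raw vectors from the chunks above:

files <- c("data/en_US/en_US.blogs.txt", "data/en_US/en_US.news.txt",
           "data/en_US/en_US.twitter.txt")
# Disk size in MB and row count per source file
file.sizes.mb <- file.info(files)$size / 1024^2
row.counts <- c(length(blogs.raw), length(news.raw), length(twitter.raw))
word.counts <- c(blogs.raw.word.count, news.raw.word.count, twitter.raw.word.count)
data.frame(File = basename(files), Size.MB = round(file.sizes.mb, 1),
           Rows = row.counts, Words = word.counts)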

Sampling some data

Working with a data set this large is very time consuming. We start off by taking a small (1%) random sample from each of our three data sets. We then combine the three samples into a single data set.

blogs.sample <- blogs.raw[sample(length(blogs.raw), length(blogs.raw) * 0.01)]
news.sample <- news.raw[sample(length(news.raw), length(news.raw) * 0.01)]
twitter.sample <- twitter.raw[sample(length(twitter.raw), length(twitter.raw) * 0.01)]
# Combine the three samples into a single character vector
data.sample <- c(blogs.sample, news.sample, twitter.sample)

Creating a Corpus and cleaning it up

Our n-Grams should not span sentence boundaries. We use sent_detect() from the qdap package to split paragraphs into sentences prior to creating and cleaning our corpus (data.corpus).

# Split paragraphs into sentences
data.sample <- sent_detect(data.sample, language = "en", model = NULL)
# Create Corpus from sample data
data.corpus <- VCorpus(VectorSource(data.sample))
# Initial cleaning of corpus
data.corpus <- tm_map(data.corpus, removeNumbers) 
data.corpus <- tm_map(data.corpus, stripWhitespace) 
data.corpus <- tm_map(data.corpus, content_transformer(tolower))
data.corpus <- tm_map(data.corpus, removePunctuation)
data.corpus <- tm_map(data.corpus, removeWords, stopwords("english"))

Extracting n-Grams from the sample data

We use NGramTokenizer() from the RWeka package to tokenize the corpus text into unigrams (single words) and bi-grams.

# Convert the corpus back to a plain character vector before tokenizing
data.text <- unlist(sapply(data.corpus, as.character))
ngram.1 <- NGramTokenizer(data.text, Weka_control(min=1, max=1, delimiters=" \\r\\n\\t.,;:\"()?!"))
ngram.2 <- NGramTokenizer(data.text, Weka_control(min=2, max=2, delimiters=" \\r\\n\\t.,;:\"()?!"))
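
As a next step we expect to turn these token vectors into frequency tables for exploration and, later, prediction. A minimal sketch (the .freq names are our own):

# Tabulate n-gram frequencies and inspect the most common terms
ngram.1.freq <- sort(table(ngram.1), decreasing = TRUE)
ngram.2.freq <- sort(table(ngram.2), decreasing = TRUE)
head(ngram.1.freq, 10)   # ten most frequent single words
head(ngram.2.freq, 10)   # ten most frequent bi-grams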

Planning for the Final Project

It is still early days in terms of our prediction algorithm design. We see the following tasks taking up most of our time going forward:

Task 1: Getting to grips with the whole data set

For our initial review of the data we have been sampling a mere 1% of the corpus from HC Corpora (www.corpora.heliohost.org). A few Google searches suggest that the tm library is quite slow; running on our small sample already takes a few minutes. We plan on exploring other libraries that offer comparable feature sets at higher speeds (e.g. stylo), or alternatively writing processed data to disk in an effort to speed up future processing. We furthermore plan on splitting the data into 95% training and 5% testing sets, as sketched below.
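
A minimal sketch of the 95% / 5% split we have in mind, applied here to the blog lines only (variable names are placeholders):

# Split one raw data set into 95% training and 5% testing lines
train.index <- sample(length(blogs.raw), round(length(blogs.raw) * 0.95))
blogs.train <- blogs.raw[train.index]
blogs.test  <- blogs.raw[-train.index]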

Task 2: Cleaning the data

We have implemented only very generic data cleaning and have not addressed the issue of profanity at all. We plan on refining the clean-up process. Additionally, we plan on assigning each swear word a unique replacement string (e.g. #@^&*#). This should help the prediction engine predict words following a curse word, should the user opt to use our “clean” replacement word.
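
As a rough illustration of what we have in mind (the word list here is a harmless placeholder, not our actual profanity list):

# Give each profanity its own unique replacement token
profanity.list <- c("badword1", "badword2")                  # placeholder list
replacements   <- paste0("PROFANITY", seq_along(profanity.list))
mask.profanity <- function(text) {
  for (i in seq_along(profanity.list)) {
    text <- gsub(paste0("\\b", profanity.list[i], "\\b"),
                 replacements[i], text, ignore.case = TRUE)
  }
  text
}
mask.profanity("this is badword1 nonsense")   # "this is PROFANITY1 nonsense"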

Prior to Quiz 2 we were convinced that stopwords should be removed from the corpus, but we now think they add a lot of value. We also think the removal of punctuation and the conversion to lower case should be carefully reconsidered.

Task 3: Creating a prediction algorithm

We currently think that 5-grams should suffice for our prediction purposes. We plan on first implementing a function which tokenizes the input string, matches the final n words of the input against our (n+1)-grams for as large an n as possible, and then suggests (predicts) the next word based on the (n+1)-gram frequencies.
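
A first rough sketch of such a lookup, using the bi-gram frequency table (ngram.2.freq) sketched earlier purely for illustration (function and variable names are our own placeholders):

# Predict the next word from a bi-gram frequency table sorted by frequency
predict.next.word <- function(input, bigram.freq) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  last.word <- tail(words, 1)
  # Keep bi-grams whose first word matches the last word of the input
  candidates <- bigram.freq[grepl(paste0("^", last.word, " "), names(bigram.freq))]
  if (length(candidates) == 0) return(NA)
  # The most frequent matching bi-gram supplies the prediction
  strsplit(names(candidates)[1], " ")[[1]][2]
}
predict.next.word("a case of", ngram.2.freq)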

We then plan on implementing a very simple Shiny App that calls said function and is in return rewarded with what we hope is a sensible prediction.
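
A bare-bones sketch of what such a Shiny App might look like, assuming the predict.next.word() function and ngram.2.freq table sketched above (all names are placeholders):

library(shiny)
# Minimal Shiny app wrapping the prediction function
ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    predict.next.word(input$phrase, ngram.2.freq)
  })
}
shinyApp(ui = ui, server = server)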

Task 4: Going the extra mile

The Twitter sentences seem very different from normal text, so it might be worthwhile to implement a dedicated Twitter prediction engine. If time permits, we would like to implement a simple option where you select the type of prediction you wish to have based on the source data (e.g. all, blogs & news, or Twitter).