The Data Science Capstone Project asks us to create an algorithm that predicts the next word a user will type, given the sentence entered so far, similar to the SwiftKey keyboard available on both iOS and Android.
The goal of this Milestone report is to explore the HC Corpora (www.corpora.heliohost.org) data set and consider how we can use it to create our prediction algorithm engine. The HC Corpora data set consists of plain text files with blog, news and Twitter data spanning multiple languages. I will be focusing on the English (en_US) data set in this report.
We start by loading a few R libraries, turning echo on for R code chunks, centering figures and suppressing messages. We hard-code a seed value so that this report remains reproducible.
library(knitr); library(ggplot2); library(R.utils); library(stringr)
library(openNLP); library(tm); library(qdap); library(RWeka)
library(wordcloud); library(RColorBrewer); library(stringi)
opts_chunk$set(echo=TRUE, fig.align='center', message=FALSE, cache=TRUE)
set.seed(98765)
Prior to running this report we placed the HC Corpora data set in a subfolder named data. Reading these large files is time consuming, so we first check whether the raw data is already in memory and only read it from disk if it is not.
if(!exists("blogs.raw")) {blogs.raw <- scan("data/en_US/en_US.blogs.txt", character(0), sep = "\n")}
if(!exists("news.raw")) {news.raw <- scan("data/en_US/en_US.news.txt", character(0), sep = "\n")}
if(!exists("twitter.raw")) {twitter.raw <- scan("data/en_US/en_US.twitter.txt", character(0), sep = "\n")}
Next we use the stringi package’s stri_stats_latex() function to do a quick word count over our three source files. The table below lists the disk size, row count and word count of each plain text file.
blogs.raw.word.count <- stri_stats_latex(blogs.raw)['Words']
news.raw.word.count <- stri_stats_latex(news.raw)['Words']
twitter.raw.word.count <- stri_stats_latex(twitter.raw)['Words']
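The word counts above feed into the table below; the code for the disk sizes and row counts is not shown. As a minimal sketch (the variable names here are placeholders), they could be derived like this:
# File size on disk (MB) and number of rows for each source file
files <- c("data/en_US/en_US.blogs.txt", "data/en_US/en_US.news.txt",
           "data/en_US/en_US.twitter.txt")
sizes.mb <- round(file.info(files)$size / 1024^2, 1)
row.counts <- c(length(blogs.raw), length(news.raw), length(twitter.raw))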
| File | Size (MB) | Rows | Words |
|---|---|---|---|
| en_US.blogs.txt | 210.2 | 899,288 | 37,570,839 |
| en_US.news.txt | 205.8 | 1,010,242 | 34,494,539 |
| en_US.twitter.txt | 167.1 | 2,360,148 | 30,451,128 |
| Totals | 583.1 | 4,269,678 | 102,516,506 |
Working with a data set this large is very time consuming. We start by taking a small (1%) random sample from each of our three data sets. We then combine the three sample data sets into a single data set.
blogs.sample <- blogs.raw[sample(length(blogs.raw), length(blogs.raw) * 0.01)]
news.sample <- news.raw[sample(length(news.raw), length(news.raw) * 0.01)]
twitter.sample <- twitter.raw[sample(length(twitter.raw), length(twitter.raw) * 0.01)]
# Combine the three samples into a single character vector of lines
data.sample <- c(blogs.sample, news.sample, twitter.sample)
Our n-Grams should not span sentence boundaries. We use sent_detect() from the qdap package to split paragraphs into sentences prior to creating and cleaning our corpus (data.corpus).
# Split paragraphs into sentences
data.sample <- sent_detect(data.sample, language = "en", model = NULL)
# Create Corpus from sample data
data.corpus <- VCorpus(VectorSource(data.sample))
# Initial cleaning of corpus
data.corpus <- tm_map(data.corpus, removeNumbers)
data.corpus <- tm_map(data.corpus, stripWhitespace)
data.corpus <- tm_map(data.corpus, content_transformer(tolower))
data.corpus <- tm_map(data.corpus, removePunctuation)
data.corpus <- tm_map(data.corpus, removeWords, stopwords("english"))
We use NGramTokenizer() to tokenize data.corpus into words and bi-grams.
# Flatten the corpus back to a plain character vector before tokenizing
data.corpus.text <- unlist(sapply(data.corpus, as.character))
ngram.1 <- NGramTokenizer(data.corpus.text, Weka_control(min=1, max=1, delimiters=" \\r\\n\\t.,;:\"()?!"))
ngram.2 <- NGramTokenizer(data.corpus.text, Weka_control(min=2, max=2, delimiters=" \\r\\n\\t.,;:\"()?!"))
Next we organize our n-Gram data by frequency to give us a better idea of what is going on. The following table lists the top 10 words (by frequency) found in our sample data.
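The code that builds this frequency table is not retained in this extract; as a minimal sketch (ngram.1.freq is an assumed name), it could be derived from ngram.1 as follows.
# Count each token and keep the ten most frequent ones (ngram.1.freq is an assumed name)
ngram.1.freq <- as.data.frame(table(ngram.1), stringsAsFactors = FALSE)
names(ngram.1.freq) <- c("Word", "Frequency")
ngram.1.freq <- ngram.1.freq[order(ngram.1.freq$Frequency, decreasing = TRUE), ]
head(ngram.1.freq, 10)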
| Word | Frequency |
|---|---|
| said | 1,362 |
| will | 1,172 |
| one | 1,140 |
| just | 984 |
| like | 930 |
| can | 893 |
| time | 831 |
| get | 814 |
| im | 730 |
| new | 651 |
The following bar plot and word cloud show the top 40 words from our processed corpus, as stored in the variable ngram.1.top.
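The figures themselves are not reproduced in this extract; a sketch of how ngram.1.top and the two plots could be generated from the ngram.1.freq data frame sketched above (plot aesthetics are assumptions):
# Top 40 words as a bar plot and a word cloud
ngram.1.top <- head(ngram.1.freq, 40)
ggplot(ngram.1.top, aes(x = reorder(Word, Frequency), y = Frequency)) +
  geom_bar(stat = "identity") + coord_flip() + labs(x = "Word", y = "Frequency")
wordcloud(ngram.1.top$Word, ngram.1.top$Frequency,
          colors = brewer.pal(8, "Dark2"), random.order = FALSE)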
The following bar plot shows the top 30 bi-grams from our processed corpus, as stored in the variable ngram.2.top.
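Analogously, a sketch of how ngram.2.top and its bar plot could be built (ngram.2.freq is an assumed name):
# Top 30 bi-grams by frequency
ngram.2.freq <- as.data.frame(table(ngram.2), stringsAsFactors = FALSE)
names(ngram.2.freq) <- c("Bigram", "Frequency")
ngram.2.top <- head(ngram.2.freq[order(ngram.2.freq$Frequency, decreasing = TRUE), ], 30)
ggplot(ngram.2.top, aes(x = reorder(Bigram, Frequency), y = Frequency)) +
  geom_bar(stat = "identity") + coord_flip() + labs(x = "Bi-gram", y = "Frequency")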
It is still early days in terms of our prediction algorithm design. We expect the following tasks to take up most of our time going forward:
For our initial review of the data we’ve been sampling a mere 1% of the corpus from HC Corpora (www.corpora.heliohost.org). A few Google searches suggest that the tm library is quite slow; even on our small sample a run can take a few minutes. We plan on exploring other libraries that offer comparable feature sets at higher speeds (e.g. stylo), or alternatively writing the processed data to disk in an effort to speed up future processing. We furthermore plan on splitting the data into 95% training and 5% testing.
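As a rough sketch, that split could look like the following (variable names are placeholders):
# Hold out 5% of the sampled sentences for testing
test.idx <- sample(length(data.sample), round(0.05 * length(data.sample)))
data.test <- data.sample[test.idx]
data.train <- data.sample[-test.idx]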
We’ve implemented only very generic data cleaning and have not addressed the issue of profanity at all. We plan on refining the clean-up process. Additionally we plan on assigning each swear word a unique replacement string (e.g. #@^&*#). This should help the prediction engine predict words following a curse word, should the user opt to use our “clean” replacement word.
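A sketch of the planned replacement, assuming a profanity word list is available as a character vector (the word list and token format below are placeholders):
# Give every profane word its own unique placeholder token
profanity <- c("badword1", "badword2")
replacements <- paste0("#curse", seq_along(profanity), "#")
for (i in seq_along(profanity)) {
  data.sample <- gsub(paste0("\\b", profanity[i], "\\b"), replacements[i],
                      data.sample, ignore.case = TRUE, perl = TRUE)
}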
Prior to Quiz 2 we were convinced that stopwords should be removed from the corpus, but we now think they add a lot of value. We also think the removal of punctuation and the conversion to lower case should be carefully reconsidered.
We currently think that 5-grams should suffice for our prediction purposes. We plan on first implementing a function which tokenizes the input string, matches its trailing n-Gram as far as possible against our (n+1)-Grams and, based on the (n+1)-Grams’ frequencies, suggests (predicts) a word.
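A first sketch of that lookup, assuming we have pre-computed frequency tables with columns Prefix, NextWord and Frequency for each n-Gram order (all names here are assumptions, and the simple back-off from longer to shorter contexts is one possible design):
# Sketch: predict the next word from the tail of the input using n-gram frequency tables.
# ngram.freqs is assumed to be a list of data frames (one per n-gram order, 2 = bi-grams,
# 3 = tri-grams, ...) with columns Prefix, NextWord and Frequency.
predict.next.word <- function(input, ngram.freqs, max.order = 5) {
  tokens <- unlist(strsplit(tolower(input), "[^a-z']+"))
  tokens <- tokens[tokens != ""]
  if (length(tokens) == 0) return(NA_character_)
  for (n in min(max.order, length(tokens) + 1):2) {
    prefix <- paste(tail(tokens, n - 1), collapse = " ")
    matches <- ngram.freqs[[n]][ngram.freqs[[n]]$Prefix == prefix, ]
    if (nrow(matches) > 0) {
      return(matches$NextWord[which.max(matches$Frequency)])
    }
  }
  NA_character_
}
Calling predict.next.word("thanks for the", ngram.freqs) would then return the most frequent continuation observed in the sample; if no 5-gram matches, the function backs off to 4-grams, tri-grams and finally bi-grams.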
We then plan on implementing a very simple Shiny App that calls on said function and is in return rewarded with what we hope is a sensible prediction.
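A hypothetical sketch of such an app, reusing the predict.next.word() sketch above (layout and widget names are assumptions):
library(shiny)
# Minimal UI: a text box for the sentence and a text output for the predicted word
ui <- fluidPage(
  textInput("sentence", "Type a sentence:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(input$sentence) == 0) return("")
    predict.next.word(input$sentence, ngram.freqs)
  })
}
shinyApp(ui = ui, server = server)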
The Twitter sentences seem very different from normal text, and it might therefore be worthwhile to implement a dedicated Twitter prediction engine. If time permits we would like to implement a simple option where you select the type of prediction you wish to have based on the source data (e.g. all, blogs & news, or Twitter).