This is the milestone report for Week 2 of the Coursera JHU Data Science Capstone Project, whose goal is to develop an algorithm that predicts the most likely next word in a sequence of words in a sentence.
This report loads and cleans the data, provides exploratory analysis to investigate some features of the data, and uses Natural Language Processing tools in R to tokenize n-grams as the first step toward building a predictive model.
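For reference, the code in this report assumes the following packages are attached. This is a minimal setup sketch; in particular, wordcount() is assumed here to come from the ngram package.
library(ngram)    # wordcount()
library(tm)       # VCorpus, tm_map, TermDocumentMatrix, removeSparseTerms
library(RWeka)    # NGramTokenizer, Weka_control
library(ggplot2)  # bar charts of n-gram frequencies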
blogs <- readLines("/home/dugi/coursera/Course10/data/final/en_US/en_US.blogs.txt", warn=FALSE, encoding="UTF-8")
news <- readLines("/home/dugi/coursera/Course10/data/final/en_US/en_US.news.txt", warn=FALSE, encoding="UTF-8")
twitter <- readLines("/home/dugi/coursera/Course10/data/final/en_US/en_US.twitter.txt", warn=FALSE, encoding="UTF-8")
SummaryOfFiles <- data.frame("File Name" = c("Blogs","News","Twitter"),
"File Size" = sapply(list(blogs, news, twitter), function(x){format(object.size(x),"MB")}),
"Row Count" = sapply(list(blogs, news, twitter), function(x){length(x)}),
"Word Count" = sapply(list(blogs, news, twitter), function(x){wordcount(x)})
)
SummaryOfFiles
## File.Name File.Size Row.Count Word.Count
## 1 Blogs 255.4 Mb 899288 37334131
## 2 News 257.3 Mb 1010242 34372530
## 3 Twitter 319 Mb 2360148 30373543
The files are very large, so I will work with a random sample of 5% of each file. With this subset of the data, I will clean the text by removing all non-ASCII (non-English) characters.
# set seed for reproducibility
set.seed(1234)
blogsSamp <- sample(blogs, length(blogs)*0.05)
newsSamp <- sample(news, length(news)*0.05)
twitterSamp <- sample(twitter, length(twitter)*0.05)
sampData <- c(blogsSamp, newsSamp, twitterSamp)
sampData <- iconv(sampData, "latin1", "ASCII", sub="")
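As an optional sanity check, the combined sample should contain about 5% of the roughly 4.27 million lines counted above.
# The combined sample should be roughly 5% of the total line count
length(sampData)  # about 213,483 lines, matching the document count in the TDMs below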
I use the tm package to build and clean the corpus that will be analyzed: convert everything to lower case, then remove numbers, punctuation, unnecessary white space, and English stopwords.
corpus <- VCorpus(VectorSource(sampData))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
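To confirm the transformations behaved as expected, one cleaned document can be inspected directly; the index below is chosen arbitrarily for illustration.
# Inspect a single cleaned document (illustrative index)
writeLines(as.character(corpus[[1]]))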
Using the RWeka package, I tokenize the sample data and construct three term-document matrices (TDMs): unigrams, bigrams, and trigrams.
unigram <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
unigramTDM <- TermDocumentMatrix(corpus, control=list(tokenize=unigram))
bigramTDM <- TermDocumentMatrix(corpus, control=list(tokenize=bigram))
trigramTDM <- TermDocumentMatrix(corpus, control=list(tokenize=trigram))
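As an optional spot check before examining sparsity, tm's findFreqTerms() lists the terms whose total frequency clears a threshold; the 1,000 cutoff below is purely illustrative.
# Terms appearing at least 1,000 times in the sampled corpus (threshold is illustrative)
findFreqTerms(unigramTDM, lowfreq = 1000)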
We need to look at the sparsity of each of the three TDMs: unigramTDM, bigramTDM, and trigramTDM.
unigramTDM
## <<TermDocumentMatrix (terms: 142580, documents: 213483)>>
## Non-/sparse entries: 2571033/30435835107
## Sparsity : 100%
## Maximal term length: 114
## Weighting : term frequency (tf)
bigramTDM
## <<TermDocumentMatrix (terms: 1814759, documents: 213483)>>
## Non-/sparse entries: 2589014/387417606583
## Sparsity : 100%
## Maximal term length: 119
## Weighting : term frequency (tf)
trigramTDM
## <<TermDocumentMatrix (terms: 2327168, documents: 213483)>>
## Non-/sparse entries: 2396677/496808409467
## Sparsity : 100%
## Maximal term length: 142
## Weighting : term frequency (tf)
All three matrices are extremely sparse, meaning nearly all of their term-document cells are zero. I therefore remove sparse terms from each matrix and build ordered frequency data frames to plot.
unigramDense <- removeSparseTerms(unigramTDM, 0.99)
bigramDense <- removeSparseTerms(bigramTDM, 0.999)
trigramDense <- removeSparseTerms(trigramTDM, 0.9999)
freqUnigram <- rowSums(as.matrix(unigramDense))
freqBigram <- rowSums(as.matrix(bigramDense))
freqTrigram <- rowSums(as.matrix(trigramDense))
orderUnigram <- order(freqUnigram, decreasing=TRUE)
orderBigram <- order(freqBigram, decreasing=TRUE)
orderTrigram <- order(freqTrigram, decreasing=TRUE)
unigramDF <- data.frame("unigram"=names(freqUnigram[orderUnigram]), "freq"=freqUnigram[orderUnigram])
bigramDF <- data.frame("bigram"=names(freqBigram[orderBigram]), "freq"=freqBigram[orderBigram])
trigramDF <- data.frame("trigram"=names(freqTrigram[orderTrigram]), "freq"=freqTrigram[orderTrigram])
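Before plotting, an optional look at the head of each frequency table confirms the decreasing ordering.
# Most frequent entries in each table
head(unigramDF, 5)
head(bigramDF, 5)
head(trigramDF, 5)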
ggplot(unigramDF[1:30,], aes(factor(unigram, levels=unique(unigram)), freq)) +
geom_bar(stat="identity", fill="blue", colour="black", width=0.9) +
theme(axis.text.x=element_text(angle=90)) +
labs(x="Unigram", y="Frequency", title="30 Most Common Single Words")
ggplot(bigramDF[1:30,], aes(factor(bigram, levels=unique(bigram)), freq)) +
geom_bar(stat="identity", fill="green", colour="black", width=0.9) +
theme(axis.text.x=element_text(angle=90)) +
labs(x="Bigram", y="Frequency", title="30 Most Common Word Pairs")
ggplot(trigramDF[1:30,], aes(factor(trigram, levels=unique(trigram)), freq)) +
geom_bar(stat="identity", fill="purple", colour="black", width=0.9) +
theme(axis.text.x=element_text(angle=90)) +
labs(x="Trigram", y="Frequency", title="30 Most Common Word Triples")
Evaluating a 5% sample of the three data files saved time while still giving a good sense of the data and allowing sufficient exploratory analysis. The longer the n-gram, the lower the frequency. The most frequent single word was “will”, occurring 15,911 times; the most frequent word pair was “right now”, occurring 1,211 times; and the most frequent word triple was “cant wait see”, occurring 173 times.
The next step of this project will be to build a predictive algorithm that uses the n-grams to estimate the probability of the next word given the words typed before it. Because the n-gram frequencies are already available as data frames built from the TDMs, this format should work well for predicting the next word in a sequence. The final step will be to develop a Shiny app that uses this algorithm to suggest the next word.
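To illustrate the idea, the sketch below shows the kind of frequency-based lookup the predictor could perform over the data frames built above. predictNextWord() and its simple trigram-then-bigram backoff are hypothetical and only a starting point, not the final algorithm.
# Illustrative sketch only: a naive frequency lookup with a simple
# trigram-then-bigram backoff over the ordered data frames built above.
predictNextWord <- function(phrase, trigrams = trigramDF, bigrams = bigramDF) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  if (length(words) == 2) {
    # rows are already sorted by decreasing frequency, so the first match wins
    hits <- trigrams[grepl(paste0("^", words[1], " ", words[2], " "),
                           as.character(trigrams$trigram)), ]
    if (nrow(hits) > 0) {
      return(tail(unlist(strsplit(as.character(hits$trigram[1]), " ")), 1))
    }
  }
  hits <- bigrams[grepl(paste0("^", tail(words, 1), " "),
                        as.character(bigrams$bigram)), ]
  if (nrow(hits) > 0) {
    return(tail(unlist(strsplit(as.character(hits$bigram[1]), " ")), 1))
  }
  NA_character_
}
predictNextWord("right")  # expected to return "now", the most frequent pair above
In the final model this naive lookup would be replaced with smoothed probabilities and proper backoff weights, but the underlying data-frame format stays the same.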