Data Science Capstone - Milestone Report

WGC, March 29, 2015

0 - Introduction

In this Capstone Milestone Report I detail the current status of my capstone project and present some initial exploratory data analysis. I provide basic summaries of the three files, such as word counts and line counts, along with information on the basic data tables that I have constructed.

The data comes from a corpus called HC Corpora (www.corpora.heliohost.org), but note that I am analyzing only the following English-language files from the full set:

en_US.blogs.txt (~206 MB)
en_US.news.txt (~200 MB)
en_US.twitter.txt (~163 MB)

1 - Loading the files

f <- file("data/en_US/en_US.blogs.txt", "rb")
en.us.blogs <- readLines(f)
close(f)

f <- file("data/en_US/en_US.news.txt", "rb")
en.us.news <- readLines(f)
close(f)

f <- file("data/en_US/en_US.twitter.txt", "rb")
en.us.twitter <- readLines(f)
close(f)
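
The three loading blocks above repeat the same pattern, so they could be wrapped in a small helper. The sketch below is one way to do that. `read_corpus_file` is a hypothetical name that is not used elsewhere in this report, and the `encoding = "UTF-8"` and `skipNul = TRUE` arguments are an assumption about how the files should be read, not something the results in this report depend on.

```r
# Hypothetical helper: read every line of a text file through a binary
# connection, as in the blocks above, and always close the connection.
read_corpus_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # Assumption: the files are UTF-8; skipNul drops embedded NUL bytes.
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small demonstration on a temporary file rather than the real corpus.
demo.path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), demo.path)
demo.lines <- read_corpus_file(demo.path)
length(demo.lines)
```

With such a helper, each file could be loaded in one call, for example `read_corpus_file("data/en_US/en_US.blogs.txt")`.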

2 - Samples from each file and some preliminary data analysis

Below are sample lines from each file. Notice that even though these are English files, they contain some non-English characters that we will need to remove or otherwise handle (see the blog text below):

en.us.blogs[999]

## [1] "Spoon out about 1/3 cup of dough for each shortcake onto the baking sheet, leaving about 3 inches of space between the mounds. Pat each mound down until it is between 3/4 and 1 inch high. (The shortcakes can be made to this point and frozen on the baking sheet, then wrapped airtight and kept in the freezer for up to 2 months. Bake without defrosting â just add at least 5 more minutes to the oven time.)"

en.us.news[999]

## [1] "The next wave of valley stock launches may well be made by less-sexy enterprise software companies like Palo Alto Networks, which filed plans earlier this month for a $175 million offering. The Santa Clara-based maker of network security products reported $119 million in fiscal year 2011 revenues, which would have placed it 142nd on this year's list."

en.us.twitter[999]

## [1] "Art washes from the soul the dust of everyday life. -Pablo Picasso"
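
One possible way to deal with the non-English characters mentioned above is to drop everything that is not plain ASCII. This is only a sketch on two made-up strings, not the cleanup actually applied in this report, and it assumes that losing accented letters and typographic punctuation is acceptable for the prediction task.

```r
# Toy strings containing a typographic dash and an accented letter
# (written with \u escapes so the example does not depend on file encoding).
toy.lines <- c("Bake without defrosting \u2014 just add", "caf\u00e9")

# Convert from UTF-8 to ASCII; bytes that cannot be represented are
# replaced by the empty string, i.e. dropped.
ascii.only <- iconv(toy.lines, from = "UTF-8", to = "ASCII", sub = "")
ascii.only
```

Dropping characters this way is blunt: it also removes legitimate letters such as the "é" in "café", so a gentler mapping may be preferable for the final model.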

Number of lines for each of the three files:

l1 <- length(en.us.blogs)
l1

## [1] 899288

l2 <- length(en.us.news)
l2

## [1] 1010242

l3 <- length(en.us.twitter)
l3

## [1] 2360148

Word counts and white space counts for each of the files:

require(stringi)

## Loading required package: stringi

r1 <- stri_stats_latex(en.us.blogs)
cat( "blogs: ", "word count:" , r1[[4]], ", white space count:", r1[[3]] )

## blogs:  word count: 37865888 , white space count: 43302826

r2 <- stri_stats_latex(en.us.news)
cat( "news: ", "word count:" , r2[[4]], ", white space count:", r2[[3]] )

## news:  word count: 34678691 , white space count: 40491958

r3 <- stri_stats_latex(en.us.twitter)
cat( "twitter: ", "word count:" , r3[[4]], ", white space count:", r3[[3]] )

## twitter:  word count: 30578933 , white space count: 36047952

Words per line:

r1[[4]] / l1 # blogs

## [1] 42.10652

r2[[4]] / l2 # news

## [1] 34.32711

r3[[4]] / l3 # twitter

## [1] 12.95636
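
To make the words-per-line arithmetic concrete, here is a minimal base R sketch on three made-up lines. It assumes that a "word" is simply a run of non-whitespace characters; `stri_stats_latex` applies its own counting rules, so its totals on the real files will not match this simple split exactly.

```r
# Three toy lines standing in for a corpus file.
toy <- c("Art washes from the soul",
         "the dust of everyday life",
         "Pablo Picasso")

# Split each line on whitespace and count the pieces per line.
words.per.line <- lengths(strsplit(trimws(toy), "\\s+"))

# Average words per line = total words / number of lines.
sum(words.per.line) / length(toy)
```

The three lines contain 5, 5 and 2 words, so the average is 12 / 3 = 4 words per line.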

3 - Combining the files

In order to analyze the text from the three sources together, I combined a random sample of 5,000 lines from each file into a single vector.

set.seed(123)
nsize <- 5000
corpus.sample <- rep(NA, 3 * nsize)
s <- 1; f <- nsize
corpus.sample[s : f] <- sample(en.us.blogs, nsize)
rm(en.us.blogs) # cleanup
s <- nsize + 1; f <- nsize * 2
corpus.sample[s : f] <- sample(en.us.news, nsize)
rm(en.us.news) # cleanup
s <- nsize * 2 + 1; f <- nsize * 3
corpus.sample[s : f] <- sample(en.us.twitter, nsize)
rm(en.us.twitter) # cleanup
gc() # report memory use after the cleanup

##           used (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  315494 16.9    4193053 224.0  5241317 280.0
## Vcells 5009649 38.3   77474007 591.1 96800876 738.6
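
The same sampling step can be written more compactly by keeping the sources in a list. The sketch below uses small made-up vectors in place of the real files; `sources`, `nsize.toy` and `corpus.sample.toy` are hypothetical names introduced only for this illustration.

```r
set.seed(123)

# Toy stand-ins for en.us.blogs, en.us.news and en.us.twitter.
sources <- list(
  blogs   = paste("blog line", 1:10),
  news    = paste("news line", 1:10),
  twitter = paste("tweet", 1:10)
)

# Draw the same number of lines from each source (never more than it has)
# and concatenate the draws into one character vector.
nsize.toy <- 3
corpus.sample.toy <- unlist(
  lapply(sources, function(x) sample(x, min(nsize.toy, length(x)))),
  use.names = FALSE
)
length(corpus.sample.toy)
```

Because every source contributes the same number of lines, the short Twitter lines are under-represented by word count, which may matter when interpreting the frequency plots.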

4 - Cleaning up the corpus

For this analysis I decided to clean up the sample corpus somewhat. For the final n-gram file the cleanup will need to be more extensive, but for the purposes of this analysis I have removed several things from the corpus, such as punctuation, numbers and English stop words, as follows:

require(tm)

## Loading required package: tm
## Loading required package: NLP

require(RWeka)

## Loading required package: RWeka

require(gridExtra)

## Loading required package: gridExtra
## Loading required package: grid

require(dplyr)

## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

spacerx <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
cleanup.corpus <- function(corpus){
  cleaned.corpus <- corpus %>%
    tm_map(content_transformer(tolower)) %>%    
    tm_map(spacerx, "/|@|\\|") %>%
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords("english")) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
  return(cleaned.corpus)
}

corpus.clean <- VCorpus(VectorSource(corpus.sample)) %>% cleanup.corpus()
corpus.dtm <- DocumentTermMatrix(corpus.clean) %>% removeSparseTerms(0.99)
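
The sparsity argument passed to `removeSparseTerms` is easier to interpret as a minimum number of documents. The arithmetic below is a sketch based on my understanding that tm keeps a term only if it occurs in more than `n.docs * (1 - sparse)` documents; `min.docs` is a hypothetical helper written for this illustration, not a tm function.

```r
# Number of documents in the sample corpus (3 sources x 5000 lines each).
n.docs <- 3 * 5000

# Hypothetical helper: a term survives removeSparseTerms(sparse) only if it
# occurs in MORE than this many documents (assumption about tm's behaviour).
min.docs <- function(sparse, n.docs) n.docs * (1 - sparse)

round(min.docs(0.99, n.docs), 4)    # threshold used for single words
round(min.docs(0.9999, n.docs), 4)  # threshold used for 2-grams and 3-grams
```

Under that assumption, 0.99 keeps only words found in more than about 150 of the 15,000 sampled lines, while 0.9999 (used for the 2-grams and 3-grams) only drops n-grams that appear in a single line.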

tokenizer.2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
corpus.dtm.2 <- DocumentTermMatrix(corpus.clean, control=list(tokenize = tokenizer.2)) %>% removeSparseTerms(0.9999)

tokenizer.3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
corpus.dtm.3 <- DocumentTermMatrix(corpus.clean, control=list(tokenize = tokenizer.3)) %>% removeSparseTerms(0.9999)
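
To show what an n-gram tokenizer produces, here is a minimal base R sketch that does not depend on RWeka. It is not the tokenizer used above: `ngrams` is a hypothetical function that assumes the text has already been cleaned and that words are separated by whitespace.

```r
# Hypothetical helper: return all n-grams (as space-joined strings) of a
# single, already cleaned line of text.
ngrams <- function(text, n = 2) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # A line with fewer than n words has no n-grams.
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngrams("art washes soul dust everyday life", 2)
ngrams("art washes soul dust everyday life", 3)
```

A line of six words yields five 2-grams and four 3-grams, which is why the n-gram matrices grow so much faster than the single-word matrix.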

5 - Detailed visual analysis

Let’s look at some plots showing the frequency of 1-grams (i.e. words), 2-grams and 3-grams, to get a feel for what the content actually looks like. Note that the 3-gram plot shows that more cleaning is still needed.

require(ggplot2)

## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate

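most.freq <- function(corpus.dtm, n=10){
  # Total count of each term across all documents
  freq <- colSums(as.matrix(corpus.dtm))
  # Keep the n most frequent terms (fewer if the matrix has fewer than n terms)
  result <- head(sort(freq, decreasing=TRUE), n)
  # tibble() replaces the deprecated data_frame(); dplyr re-exports it
  return(tibble(term=names(result), count=unname(result)))
}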
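
The counting logic inside `most.freq` can be checked on a tiny hand-made matrix. The sketch below uses a plain base R matrix with made-up counts instead of a real document-term matrix, and a `data.frame` so that it runs without any packages.

```r
# Toy document-term counts: rows are documents, columns are terms.
toy.counts <- matrix(
  c(1, 0, 2,
    0, 3, 1,
    4, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("art", "soul", "dust"))
)

# Column sums give the total count per term; sort and keep the top two.
freq <- colSums(toy.counts)
top.terms <- head(sort(freq, decreasing = TRUE), 2)
data.frame(term = names(top.terms), count = unname(top.terms))
```

Here "art" occurs 5 times and "dust" 4 times, so those two terms are returned in that order.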

ggplot(most.freq(corpus.dtm), aes(x=reorder(term, -count), y=count)) +
  geom_bar(stat="identity") +
  theme_grey() +
  theme(axis.title.x = element_blank(),
           axis.text.x  = element_text(angle=45, hjust=1)) +
  ggtitle("Most frequent words in the sample corpus")

# 2 gram
ggplot(most.freq(corpus.dtm.2), aes(x=reorder(term, -count), y=count)) +
  geom_bar(stat="identity") +
  theme_grey() +
  theme(axis.title.x = element_blank(),
           axis.text.x  = element_text(angle=45, hjust=1)) +
  ggtitle("Most frequent 2-grams in the sample corpus")

# 3 gram

ggplot(most.freq(corpus.dtm.3), aes(x=reorder(term, -count), y=count)) +
  geom_bar(stat="identity") +
  theme_grey() +
  theme(axis.title.x = element_blank(),
           axis.text.x  = element_text(angle=45, hjust=1)) +
  ggtitle("Most frequent 3-grams in the sample corpus")

6 - Next Steps

Clean the data some more, as it has become obvious that there are still issues.
For the full data set (as opposed to the small sample used here) I will probably not use Weka’s tokenizer because it is quite slow. I have experimented with Maciej Szymkiewicz’s (https://class.coursera.org/dsscapstone-003/forum/profile?user_id=711562) tokenizer (https://github.com/zero323/r-snippets/blob/master/R/ngram_tokenizer.R) and it seems quite fast.
I will also most likely have to implement some type of back-off model, but a “stupid” back-off will probably do (http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/session9-slides.pdf).
Get some version of the word predictor running, probably with a limited data set, and then do several rounds of optimization on both the R code and the final data set. Based on my experience with web apps, I think a file larger than 25 MB will be too big for the final Shiny app, so I need to aim for something smaller than that.
Did I mention optimize? Yes, optimize, optimize, optimize.
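
As a rough illustration of the "stupid" back-off idea mentioned in the steps above, the sketch below scores a candidate word from n-gram counts: if the full n-gram was seen, the score is its relative frequency given the context, otherwise the context is shortened by one word and the score is multiplied by a fixed penalty. Everything here is an assumption made for illustration: the function names are hypothetical, the counts come from the single toy sentence "the cat sat on the mat", and the penalty `alpha = 0.4` is the value commonly quoted for this scheme rather than one tuned for this project.

```r
# Toy n-gram counts from the sentence "the cat sat on the mat".
counts <- c(
  "the" = 2, "cat" = 1, "sat" = 1, "on" = 1, "mat" = 1,
  "the cat" = 1, "cat sat" = 1, "sat on" = 1, "on the" = 1, "the mat" = 1,
  "the cat sat" = 1, "cat sat on" = 1, "sat on the" = 1, "on the mat" = 1
)
total.words <- 6  # number of word tokens in the toy sentence

# Look up a count, returning 0 for unseen n-grams.
ngram_count <- function(counts, key) {
  if (key %in% names(counts)) counts[[key]] else 0
}

# Stupid back-off score of `word` following `context` (a character vector).
stupid_backoff <- function(context, word, counts, total, alpha = 0.4) {
  if (length(context) == 0) {
    # No context left: fall back to the unigram relative frequency.
    return(ngram_count(counts, word) / total)
  }
  full <- ngram_count(counts, paste(c(context, word), collapse = " "))
  if (full > 0) {
    return(full / ngram_count(counts, paste(context, collapse = " ")))
  }
  # Unseen n-gram: drop the oldest context word and apply the penalty.
  alpha * stupid_backoff(context[-1], word, counts, total, alpha)
}

stupid_backoff(c("the", "cat"), "sat", counts, total.words)  # trigram seen
stupid_backoff(c("on", "the"), "cat", counts, total.words)   # backs off once
stupid_backoff(c("sat", "on"), "mat", counts, total.words)   # backs off twice
```

The scores are not probabilities (they do not sum to one), which is what makes the method cheap: no normalization or discounting tables are needed, only the raw counts.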