WGC, March 29, 2015

0 - Introduction

In this Capstone Milestone Report I describe the current status of my capstone project and present some initial exploratory data analysis. I provide basic summaries of the three input files, such as line counts and word counts, along with information on the basic data tables (document-term matrices for 1-, 2- and 3-grams) that I have constructed.

The data comes from a corpus called HC Corpora (www.corpora.heliohost.org). Note that I am analyzing only the three English (en_US) files from the full set: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.

1 - Loading the files

f <- file("data/en_US/en_US.blogs.txt", "rb")
en.us.blogs <- readLines(f)
close(f)

f <- file("data/en_US/en_US.news.txt", "rb")
en.us.news <- readLines(f)
close(f)

f <- file("data/en_US/en_US.twitter.txt", "rb")
en.us.twitter <- readLines(f)
close(f)
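
The three blocks above differ only in the file name, so they could be folded into one small helper. This is a minimal sketch, not the code used for this report: the name read.corpus.file is hypothetical, and the encoding and skipNul arguments are my assumptions about how the files are best read.

```r
# Hypothetical helper (not in the original report): read all lines of a text
# file through a binary-mode connection, as the blocks above do.
read.corpus.file <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))  # close the connection even if readLines() fails
  # Assumptions: mark the text as UTF-8 and skip embedded nul bytes
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}
```

With that in place, each file would be loaded with a single call such as en.us.blogs <- read.corpus.file("data/en_US/en_US.blogs.txt").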

2 - Samples from each file and some preliminary data analysis

Sample lines from the files (line 999 of each). Notice that even though these are English files, they do contain some non-ASCII characters that we will need to remove or otherwise deal with (see the en dash in the blog text below):

en.us.blogs[999]
## [1] "Spoon out about 1/3 cup of dough for each shortcake onto the baking sheet, leaving about 3 inches of space between the mounds. Pat each mound down until it is between 3/4 and 1 inch high. (The shortcakes can be made to this point and frozen on the baking sheet, then wrapped airtight and kept in the freezer for up to 2 months. Bake without defrosting – just add at least 5 more minutes to the oven time.)"
en.us.news[999]
## [1] "The next wave of valley stock launches may well be made by less-sexy enterprise software companies like Palo Alto Networks, which filed plans earlier this month for a $175 million offering. The Santa Clara-based maker of network security products reported $119 million in fiscal year 2011 revenues, which would have placed it 142nd on this year's list."
en.us.twitter[999]
## [1] "Art washes from the soul the dust of everyday life. -Pablo Picasso"
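
One possible way to deal with characters such as the en dash in the blog sample above is simply to drop everything outside the ASCII range. The sketch below is only an illustration under that assumption: strip.non.ascii is a hypothetical helper, it assumes the input is valid UTF-8, and dropping characters outright may be too aggressive for the final model.

```r
# Hypothetical helper (not from the original report): drop every character
# outside the ASCII range. Assumes the input strings are valid UTF-8.
strip.non.ascii <- function(x) {
  vapply(x, function(s) {
    codepoints <- utf8ToInt(s)                # one integer per character
    intToUtf8(codepoints[codepoints < 128L])  # keep ASCII code points only
  }, character(1), USE.NAMES = FALSE)
}

# The en dash (\u2013) is removed; the spaces on either side of it remain
strip.non.ascii("Bake without defrosting \u2013 just add 5 minutes")
```

Replacing the removed characters with a space instead, and then collapsing white space, would be an equally reasonable choice.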

Number of lines for each of the three files:

l1 <- length(en.us.blogs)
l1
## [1] 899288
l2 <- length(en.us.news)
l2
## [1] 1010242
l3 <- length(en.us.twitter)
l3
## [1] 2360148

Word counts and white space counts for each of the files:

require(stringi)
## Loading required package: stringi
r1 <- stri_stats_latex(en.us.blogs)
cat( "blogs: ", "word count:" , r1[[4]], ", white space count:", r1[[3]] )
## blogs:  word count: 37865888 , white space count: 43302826
r2 <- stri_stats_latex(en.us.news)
cat( "news: ", "word count:" , r2[[4]], ", white space count:", r2[[3]] )
## news:  word count: 34678691 , white space count: 40491958
r3 <- stri_stats_latex(en.us.twitter)
cat( "twitter: ", "word count:" , r3[[4]], ", white space count:", r3[[3]] )
## twitter:  word count: 30578933 , white space count: 36047952

Words per line:

r1[[4]] / l1 # blogs
## [1] 42.10652
r2[[4]] / l2 # news
## [1] 34.32711
r3[[4]] / l3 # twitter
## [1] 12.95636
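
The ratios above are file-level averages. To look at the number of words on each individual line, a rough base-R sketch could split every line on white space. Note the assumptions: words.per.line is a hypothetical helper, and splitting on white space will not necessarily give the same counts as stri_stats_latex.

```r
# Hypothetical helper (not from the original report): count the
# white-space-separated tokens on each line.
words.per.line <- function(lines) {
  tokens <- strsplit(trimws(lines), "[[:space:]]+")
  lengths(tokens)  # an empty line yields 0
}

demo <- c("Art washes from the soul the dust of everyday life.", "", "two words")
words.per.line(demo)
# [1] 10  0  2
```

Calling summary() on the result for each file would show the spread around the averages reported above.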

3 - Combining the files

In order to do some analysis on the text from the three sources, I drew a random sample of 5,000 lines from each file and combined the samples into a single character vector of 15,000 lines.

set.seed(123)
nsize <- 5000
corpus.sample <- rep(NA, 3 * nsize)
s <- 1; f <- nsize
corpus.sample[s : f] <- sample(en.us.blogs, nsize)
rm(en.us.blogs) # cleanup
s <- nsize + 1; f <- nsize * 2
corpus.sample[s : f] <- sample(en.us.news, nsize)
rm(en.us.news) # cleanup
s <- nsize * 2 + 1; f <- nsize * 3
corpus.sample[s : f] <- sample(en.us.twitter, nsize)
rm(en.us.twitter) # cleanup
gc() # run garbage collection and report memory use
##           used (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  315494 16.9    4193053 224.0  5241317 280.0
## Vcells 5009649 38.3   77474007 591.1 96800876 738.6
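
The same sampling step can be written more compactly by looping over the sources. This is a sketch on toy data rather than the code used for this report; sample.corpus and the toy vectors are hypothetical names introduced only for illustration.

```r
# Hypothetical helper (not from the original report): draw `nsize` lines
# without replacement from each source and concatenate them in order.
sample.corpus <- function(sources, nsize, seed = 123) {
  set.seed(seed)  # fixed seed so the sample is reproducible
  unlist(lapply(sources, sample, size = nsize), use.names = FALSE)
}

# Toy stand-ins for the three files
toy <- list(blogs   = paste("blog", 1:20),
            news    = paste("news", 1:20),
            twitter = paste("tweet", 1:20))
toy.sample <- sample.corpus(toy, nsize = 5)
length(toy.sample)
# [1] 15
```

Each source must contain at least nsize lines, since sample() draws without replacement by default.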

4 - Cleaning up the corpus

For this analysis I decided to clean up the sample corpus somewhat. For the final n-gram file the cleanup will need to be more extensive, but for the purposes of this analysis I have converted the text to lower case and removed punctuation, numbers, English stop words and extra white space, as follows:

require(tm)
## Loading required package: tm
## Loading required package: NLP
require(RWeka)
## Loading required package: RWeka
require(gridExtra)
## Loading required package: gridExtra
## Loading required package: grid
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
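# Transformer that replaces every match of `pattern` with a single space
spacerx <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Apply the cleaning steps to a tm corpus, in order
cleanup.corpus <- function(corpus){
  cleaned.corpus <- corpus %>%
    tm_map(content_transformer(tolower)) %>%       # lower-case everything
    tm_map(spacerx, "/|@|\\|") %>%                 # turn "/", "@" and "|" into spaces
    tm_map(removeNumbers) %>%                      # drop digits
    tm_map(removeWords, stopwords("english")) %>%  # drop English stop words
    tm_map(removePunctuation) %>%                  # drop punctuation
    tm_map(stripWhitespace)                        # collapse runs of white space
  return(cleaned.corpus)
}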

corpus.clean <- VCorpus(VectorSource(corpus.sample)) %>% cleanup.corpus()
corpus.dtm <- DocumentTermMatrix(corpus.clean) %>% removeSparseTerms(0.99)

tokenizer.2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
corpus.dtm.2 <- DocumentTermMatrix(corpus.clean, control=list(tokenize = tokenizer.2)) %>% removeSparseTerms(0.9999)

tokenizer.3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
corpus.dtm.3 <- DocumentTermMatrix(corpus.clean, control=list(tokenize = tokenizer.3)) %>% removeSparseTerms(0.9999)
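
To make concrete what the 2-gram and 3-gram tokenizers above produce, here is a minimal base-R sketch that slides a window of n words over a line. It is not the RWeka implementation: ngrams.simple is a hypothetical helper, and it assumes the text has already been cleaned so that words are separated by white space.

```r
# Hypothetical helper (not from the original report): return all n-grams
# of a single, already cleaned, white-space-separated string.
ngrams.simple <- function(text, n) {
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  if (length(words) < n) return(character(0))  # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)     # start index of each window
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngrams.simple("art washes soul dust", 2)
# [1] "art washes"  "washes soul" "soul dust"
```

A line with k words yields k - n + 1 n-grams, which is why short tweets contribute few or no 3-grams.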

5 - Detailed visual analysis

Let’s look at some plots showing the frequency of 1-grams (i.e. words), 2-grams and 3-grams, to get a feel for what the content actually looks like. Note that the 3-gram plot shows that some more cleaning still needs to be done.

require(ggplot2)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate
most.freq <- function(corpus.dtm, n=10){
  freq <- colSums(as.matrix(corpus.dtm))
  result <- freq[order(freq, decreasing=TRUE)][1:n]
  return(data_frame(term=names(result), count=result))
}

ggplot(most.freq(corpus.dtm), aes(x=reorder(term, -count), y=count)) +
  geom_bar(stat="identity") +
  theme_grey() +
  theme(axis.title.x = element_blank(),
           axis.text.x  = element_text(angle=45, hjust=1)) +
  ggtitle("Most frequent words in the sample corpus")

# 2 gram
ggplot(most.freq(corpus.dtm.2), aes(x=reorder(term, -count), y=count)) +
  geom_bar(stat="identity") +
  theme_grey() +
  theme(axis.title.x = element_blank(),
           axis.text.x  = element_text(angle=45, hjust=1)) +
  ggtitle("Most frequent 2-grams in the sample corpus")

# 3 gram
ggplot(most.freq(corpus.dtm.3), aes(x=reorder(term, -count), y=count)) +
  geom_bar(stat="identity") +
  theme_grey() +
  theme(axis.title.x = element_blank(),
           axis.text.x  = element_text(angle=45, hjust=1)) +
  ggtitle("Most frequent 3-grams in the sample corpus")

6 - Next Steps

  1. Clean the data further; the 3-gram plot makes it obvious that issues remain.
  2. For the full data set (as opposed to the small sample used here) I will probably not use Weka’s tokenizer, as it is quite slow. I have experimented with Maciej Szymkiewicz’s (https://class.coursera.org/dsscapstone-003/forum/profile?user_id=711562) tokenizer (https://github.com/zero323/r-snippets/blob/master/R/ngram_tokenizer.R) and it seems quite fast.
  3. I will most probably also have to implement some type of back-off model, but a simple one such as “stupid backoff” will probably do (http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/session9-slides.pdf).
  4. Get some version of the word predictor running, probably with a limited data set, and then do several rounds of optimization on both the R code and the final data set. Based on my experience with web apps, I think a data file larger than 25 MB will be too big for the final Shiny app, so I need to aim for something smaller than that.
  5. Did I mention optimize? Yes, optimize, optimize, optimize.
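
To illustrate the back-off idea from step 3, here is a rough sketch of stupid backoff scoring on toy counts. It is not the planned implementation: the counts are made up, the function and variable names are hypothetical, and alpha = 0.4 is the back-off factor commonly quoted for this scheme. The score is the relative frequency of the n-gram if it was seen; otherwise the context is shortened by one word and the score is multiplied by alpha.

```r
# Toy n-gram counts (made-up numbers, for illustration only)
unigrams <- c("i" = 4, "love" = 3, "you" = 2, "cats" = 1)
bigrams  <- c("i love" = 3, "love you" = 2, "love cats" = 1)
trigrams <- c("i love you" = 2, "i love cats" = 1)
tables   <- list(unigrams, bigrams, trigrams)  # tables[[n]] holds n-gram counts

# Look up a count, returning 0 for unseen n-grams
count.of <- function(counts, key) {
  if (key %in% names(counts)) counts[[key]] else 0
}

# Stupid backoff score of `word` given the preceding words in `context`.
# Assumes the tables are consistent: a seen n-gram implies a seen prefix.
stupid.backoff <- function(word, context, tables, alpha = 0.4) {
  # Keep only as much context as the highest-order table supports
  context <- tail(context, length(tables) - 1)
  k <- length(context)
  if (k == 0) {
    return(count.of(tables[[1]], word) / sum(tables[[1]]))  # unigram base case
  }
  full <- count.of(tables[[k + 1]], paste(c(context, word), collapse = " "))
  if (full > 0) {
    return(full / count.of(tables[[k]], paste(context, collapse = " ")))
  }
  # Unseen at this order: drop the oldest context word and discount
  alpha * stupid.backoff(word, context[-1], tables, alpha)
}

stupid.backoff("you", c("i", "love"), tables)     # seen trigram: 2/3
stupid.backoff("cats", c("you", "love"), tables)  # backs off once: 0.4 * 1/3
```

The results are scores rather than probabilities (they do not sum to one), which is acceptable when the only goal is to rank candidate next words.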