In this Capstone Milestone Report I describe the current status of my capstone project and present some initial exploratory data analysis. I give basic summaries of the three input files, such as line counts and word counts, and describe the basic data tables that I have constructed from them.
The data comes from a corpus called HC Corpora (www.corpora.heliohost.org). Note that I am analyzing only the following English-language files from the full set:
f <- file("data/en_US/en_US.blogs.txt", "rb")
en.us.blogs <- readLines(f)
close(f)
f <- file("data/en_US/en_US.news.txt", "rb")
en.us.news <- readLines(f)
close(f)
f <- file("data/en_US/en_US.twitter.txt", "rb")
en.us.twitter <- readLines(f)
close(f)
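The three reads above repeat the same open, read, close pattern. As a minimal sketch, that pattern could be wrapped in a helper. The name `read.lines.utf8` is hypothetical, and marking the text as UTF-8 and skipping embedded NULs are assumptions about the files rather than something verified in this report.

```r
# Hypothetical helper: read a text file in binary mode and return its lines.
read.lines.utf8 <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))  # close the connection even if readLines() fails
  # encoding = "UTF-8" only marks the strings as UTF-8; it does not re-encode them
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo.lines <- read.lines.utf8(tmp)
length(demo.lines)
unlink(tmp)
```

With such a helper, each of the three files could be loaded in one call, for example `read.lines.utf8("data/en_US/en_US.blogs.txt")`.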
Here are sample lines from each file. Notice that even though these are English files, they contain some non-English characters that we will need to remove or otherwise handle (see the "â" in the blog text below):
en.us.blogs[999]
## [1] "Spoon out about 1/3 cup of dough for each shortcake onto the baking sheet, leaving about 3 inches of space between the mounds. Pat each mound down until it is between 3/4 and 1 inch high. (The shortcakes can be made to this point and frozen on the baking sheet, then wrapped airtight and kept in the freezer for up to 2 months. Bake without defrosting â just add at least 5 more minutes to the oven time.)"
en.us.news[999]
## [1] "The next wave of valley stock launches may well be made by less-sexy enterprise software companies like Palo Alto Networks, which filed plans earlier this month for a $175 million offering. The Santa Clara-based maker of network security products reported $119 million in fiscal year 2011 revenues, which would have placed it 142nd on this year's list."
en.us.twitter[999]
## [1] "Art washes from the soul the dust of everyday life. -Pablo Picasso"
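One possible way to deal with the stray non-English characters noted above is to drop everything that has no ASCII equivalent. This is only a sketch of one option, not the approach used in this report: `strip.non.ascii` is a hypothetical helper, the example string is made up to resemble the blog line, and the input is assumed to be UTF-8.

```r
# Hypothetical helper: remove every byte that cannot be represented in ASCII.
# Assumes the input strings are UTF-8 encoded.
strip.non.ascii <- function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

# Made-up example containing the same kind of stray character as the blog line
example.text <- "Bake without defrosting \u00e2 just add more minutes"
cleaned.text <- strip.non.ascii(example.text)
cleaned.text
```

This is a blunt instrument: it also removes legitimate accented letters, so a transliteration step may be preferable for the final model.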
Number of lines for each of the three files:
l1 <- length(en.us.blogs)
l1
## [1] 899288
l2 <- length(en.us.news)
l2
## [1] 1010242
l3 <- length(en.us.twitter)
l3
## [1] 2360148
Word counts and white space counts for each of the files:
require(stringi)
## Loading required package: stringi
r1 <- stri_stats_latex(en.us.blogs)
cat( "blogs: ", "word count:" , r1[[4]], ", white space count:", r1[[3]] )
## blogs: word count: 37865888 , white space count: 43302826
r2 <- stri_stats_latex(en.us.news)
cat( "news: ", "word count:" , r2[[4]], ", white space count:", r2[[3]] )
## news: word count: 34678691 , white space count: 40491958
r3 <- stri_stats_latex(en.us.twitter)
cat( "twitter: ", "word count:" , r3[[4]], ", white space count:", r3[[3]] )
## twitter: word count: 30578933 , white space count: 36047952
Words per line:
r1[[4]] / l1 # blogs
## [1] 42.10652
r2[[4]] / l2 # news
## [1] 34.32711
r3[[4]] / l3 # twitter
## [1] 12.95636
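The per-file figures above can also be collected into one summary table. The following is a rough sketch in base R, under the assumption that a "word" is simply a whitespace-delimited token, so its counts will not match `stri_stats_latex` exactly. The helper names `count.tokens` and `summarise.sources` and the toy data are hypothetical.

```r
# Hypothetical helper: number of whitespace-delimited tokens in each line.
count.tokens <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}

# Hypothetical helper: one summary row per named source.
summarise.sources <- function(sources) {
  lines <- vapply(sources, length, integer(1))
  words <- vapply(sources, function(x) as.numeric(sum(count.tokens(x))), numeric(1))
  data.frame(source = names(sources),
             lines = lines,
             words = words,
             words.per.line = words / lines,
             row.names = NULL,
             stringsAsFactors = FALSE)
}

# Toy input standing in for the real character vectors
toy <- list(blogs = c("a short blog line", "another one"),
            twitter = c("tiny tweet"))
toy.summary <- summarise.sources(toy)
toy.summary
```

On the real data the call would look like `summarise.sources(list(blogs = en.us.blogs, news = en.us.news, twitter = en.us.twitter))`.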
To analyze text from all three sources together, I drew a random sample of 5,000 lines from each file and combined the samples into a single sample corpus.
set.seed(123)
nsize <- 5000
corpus.sample <- rep(NA, 3 * nsize)
s <- 1; f <- nsize
corpus.sample[s : f] <- sample(en.us.blogs, nsize)
rm(en.us.blogs) # cleanup
s <- nsize + 1; f <- nsize * 2
corpus.sample[s : f] <- sample(en.us.news, nsize)
rm(en.us.news) # cleanup
s <- nsize * 2 + 1; f <- nsize * 3
corpus.sample[s : f] <- sample(en.us.twitter, nsize)
rm(en.us.twitter) # cleanup
gc() # report memory usage after cleanup
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 315494 16.9 4193053 224.0 5241317 280.0
## Vcells 5009649 38.3 77474007 591.1 96800876 738.6
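The sampling step above follows the same pattern for each source, so it can be sketched as a small function. `sample.sources` is a hypothetical helper and the toy vectors are made up; capping the sample size at the length of each source is an added assumption that the original code does not need, since every file has far more than 5,000 lines.

```r
# Hypothetical helper: draw up to n random lines from each source and
# concatenate them in the order the sources are given.
sample.sources <- function(sources, n, seed = 123) {
  set.seed(seed)  # make the sample reproducible
  picks <- lapply(sources, function(x) sample(x, min(n, length(x))))
  unlist(picks, use.names = FALSE)
}

# Toy input standing in for the three large character vectors
toy.sources <- list(blogs = paste("blog", 1:20),
                    news = paste("news", 1:20),
                    twitter = paste("tweet", 1:20))
toy.sample <- sample.sources(toy.sources, n = 5)
length(toy.sample)
```

Unlike the code above, this version keeps all three sources in memory until the function returns, which matters for files of this size.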
For this analysis I decided to clean up the sample corpus somewhat. The final n-gram file will need more extensive cleanup, but for the purposes of this analysis I have removed several things from the corpus, such as punctuation, numbers and English stop words, as follows:
require(tm)
## Loading required package: tm
## Loading required package: NLP
require(RWeka)
## Loading required package: RWeka
require(gridExtra)
## Loading required package: gridExtra
## Loading required package: grid
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# replace every match of `pattern` with a single space
spacerx <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
cleanup.corpus <- function(corpus){
  cleaned.corpus <- corpus %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(spacerx, "/|@|\\|") %>%
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords("english")) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
  return(cleaned.corpus)
}
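To make the effect of each cleanup step concrete, here is a rough base R sketch that applies the same sequence of steps to a single string. It is an illustration only and not the `tm` implementation: `clean.text` is a hypothetical helper, the stop word list is a tiny made-up stand-in for `stopwords("english")`, and the final `trimws` is an extra step that `stripWhitespace` does not perform.

```r
# Hypothetical helper mirroring the order of steps in cleanup.corpus().
clean.text <- function(x, stop.words = c("the", "of", "is", "a")) {
  x <- tolower(x)
  x <- gsub("/|@|\\|", " ", x)                  # separators become spaces
  x <- gsub("[[:digit:]]+", "", x)              # drop numbers
  stop.rx <- paste0("\\b(", paste(stop.words, collapse = "|"), ")\\b")
  x <- gsub(stop.rx, "", x, perl = TRUE)        # drop stop words (toy list)
  x <- gsub("[[:punct:]]+", "", x)              # drop punctuation
  x <- gsub("\\s+", " ", x)                     # collapse runs of whitespace
  trimws(x)                                     # extra: trim the ends
}

cleaned.example <- clean.text("The price of Tea/Coffee is 3 dollars!")
cleaned.example
```

Stop words are removed before punctuation, as in the pipeline above, so that contractions in the stop word list can still match while their apostrophes are present.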
corpus.clean <- VCorpus(VectorSource(corpus.sample)) %>% cleanup.corpus()
corpus.dtm <- DocumentTermMatrix(corpus.clean) %>% removeSparseTerms(0.99)
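The sparsity thresholds passed to `removeSparseTerms` are easier to reason about as a minimum number of documents. The sketch below assumes, based on my reading of the `tm` documentation, that a term is kept only when it occurs in more than `n.docs * (1 - sparse)` documents. `min.docs.kept` is a hypothetical helper, and the document count assumes each sampled line becomes one document.

```r
# Hypothetical helper: smallest number of documents a term must occur in
# to survive removeSparseTerms(), assuming the rule
#   keep term  <=>  docs containing term > n.docs * (1 - sparse)
min.docs.kept <- function(n.docs, sparse) {
  floor(n.docs * (1 - sparse)) + 1
}

n.docs <- 3 * 5000               # one document per sampled line (assumption)
min.docs.kept(n.docs, 0.99)      # threshold used for single words
min.docs.kept(n.docs, 0.9999)    # threshold used for 2-grams and 3-grams
```

Under that assumption the word matrix keeps only fairly common terms, while the much looser threshold for 2-grams and 3-grams mainly discards n-grams that occur in a single document.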
tokenizer.2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
corpus.dtm.2 <- DocumentTermMatrix(corpus.clean, control=list(tokenize = tokenizer.2)) %>% removeSparseTerms(0.9999)
tokenizer.3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
corpus.dtm.3 <- DocumentTermMatrix(corpus.clean, control=list(tokenize = tokenizer.3)) %>% removeSparseTerms(0.9999)
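For readers unfamiliar with n-gram tokenization, the idea behind the tokenizers above can be sketched in base R: slide a window of n words across the text and count each window. This is an illustration rather than what `NGramTokenizer` does internally. `make.ngrams` is a hypothetical helper and the sentence is made up.

```r
# Hypothetical helper: all runs of n consecutive tokens, joined by a space.
make.ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up example sentence, already lower-cased and split on spaces
tokens <- strsplit("one of the best one of the worst", " ")[[1]]
bigram.counts <- sort(table(make.ngrams(tokens, 2)), decreasing = TRUE)
head(bigram.counts, 3)
```

The document-term matrices built above hold the same kind of counts, with one row per document and one column per n-gram.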
Let’s look at some plots showing the frequency of 1-grams (i.e. words), 2-grams and 3-grams, to get a feel for what the content actually looks like. Note that the 3-gram plot shows that some more cleaning still needs to be done.
require(ggplot2)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
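most.freq <- function(corpus.dtm, n=10){
  # total count of each term across all documents
  freq <- colSums(as.matrix(corpus.dtm))
  # keep the n most frequent terms
  result <- freq[order(freq, decreasing=TRUE)][1:n]
  # data.frame() is used here because data_frame() is deprecated in current dplyr/tibble
  return(data.frame(term=names(result), count=unname(result), stringsAsFactors=FALSE))
}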
ggplot(most.freq(corpus.dtm), aes(x=reorder(term, -count), y=count)) +
  geom_bar(stat="identity") +
  theme_grey() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle=45, hjust=1)) +
  ggtitle("Most frequent words in the sample corpus")
# 2 gram
ggplot(most.freq(corpus.dtm.2), aes(x=reorder(term, -count), y=count)) +
  geom_bar(stat="identity") +
  theme_grey() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle=45, hjust=1)) +
  ggtitle("Most frequent 2-grams in the sample corpus")
# 3 gram
ggplot(most.freq(corpus.dtm.3), aes(x=reorder(term, -count), y=count)) +
  geom_bar(stat="identity") +
  theme_grey() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle=45, hjust=1)) +
  ggtitle("Most frequent 3-grams in the sample corpus")