Overview

This is an intermediate report describing our exploratory analysis of the HC Corpora. The Corpora is a collection of texts of varying length gathered from many different web sites and split into three categories: Twitter feeds, news snippets and personal blog entries. It is filtered for the English language but may still contain some foreign text. For more details visit http://www.corpora.heliohost.org/aboutcorpus.html.
Our exploratory analysis comprises basic summaries of the three data collections (twitter, news and blogs), such as word and line counts, as well as basic plots illustrating features of the data such as word and phrase frequencies.
The next step after the exploratory analysis will be to use this data to develop an algorithm that takes a text phrase as input and predicts/suggests the next word. The algorithm will then be built into a web application that is free for everyone to use.
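To make that goal more concrete, the sketch below illustrates the kind of lookup such an n-gram based model might perform: given a table of trigram counts like the one built in the Appendix, it returns the most frequent completion of the last two words of a phrase. This is only an illustrative sketch with hypothetical names, not the final algorithm.

predict_next_word <- function(phrase, trigram_counts) {
  ## trigram_counts: a named numeric vector of trigram frequencies,
  ## e.g. an unrestricted version of freq_tri from the Appendix
  words  <- tolower(unlist(strsplit(phrase, "\\s+")))
  prefix <- paste(tail(words, 2), collapse = " ")
  hits   <- trigram_counts[startsWith(names(trigram_counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub(".* ", "", names(hits)[which.max(hits)])   # last word of the best-matching trigram
}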

Summary Statistics

To obtain an overview of the downloaded data from the HC Corpora, we look at the data volume of each category and compare the average number of words per entry (line). We use the Unix “wc” command to compute line and word counts for the three files and combine the results in a table for easy comparison:

          lineCount   wordCount   avgWordsPerLine
twitter   2,360,000  30,374,000                12
blogs       899,000  37,335,000                41
news      1,010,000  34,373,000                34

The most interesting column here is avgWordsPerLine. As expected, Twitter feeds are short, averaging 12 words per line, while blog entries are considerably longer at 41 words on average. News snippets are only slightly shorter than blogs, averaging 34 words.
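For reference, the counts above can also be recomputed directly in R. The sketch below assumes the three source files sit in the working directory under their usual HC Corpora file names; a simple whitespace split will not match “wc” exactly in every edge case.

files <- c(twitter = "en_US.twitter.txt",
           blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt")

count_file <- function(path) {
  lines <- readLines(path, skipNul = TRUE)
  words <- sum(lengths(strsplit(lines, "\\s+")))   # rough word count per line
  c(lineCount = length(lines), wordCount = words)
}

counts <- t(sapply(files, count_file))
cbind(counts, avgWordsPerLine = round(counts[, "wordCount"] / counts[, "lineCount"]))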

More Findings

For a deeper exploration of the data, we draw random samples from each of the three categories and combine them into a new dataset that serves as our training sample.
We would like to find out which words are used most frequently in our sample and display them in a histogram to compare their frequencies. First we clean the data: we remove punctuation and numbers and convert all words to lower case, so that “The” and “the” are recognized as the same word. The next step is tokenization: we break our collection of lines up into a collection of single words. Then we can easily count how often each word appears and build a histogram. We do the same for pairs of words (bigrams) and for three-word sequences (trigrams).
We use R’s tm and RWeka packages for all of this. The code is shown in the Appendix.
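The Appendix starts from a file “en_US.Sample.txt” that already contains the combined sample. A minimal sketch of how such a sample could be drawn is shown below; the file names and the 1% sampling rate are illustrative rather than the exact values we used.

set.seed(42)   # reproducible sample
sample_lines <- function(path, fraction = 0.01) {
  lines <- readLines(path, skipNul = TRUE)
  lines[rbinom(length(lines), 1, fraction) == 1]   # keep each line with probability 'fraction'
}
sampled <- unlist(lapply(c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"),
                         sample_lines))
writeLines(sampled, "en_US.Sample.txt")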

[Figure: histogram of the 25 most frequent words]

The histogram shows the distribution of the 25 most frequently used words. We can see that the most frequent word in the sample is by far “the”, followed by “and”, “for” and “that”.
Let’s look at the histogram for word pairs now:

[Figure: histogram of the 25 most frequent word pairs (bigrams)]

The most frequently appearing word pairs include the word “the”, which is not surprising since this is the most frequently used word. And finally the trigram histogram:

[Figure: histogram of the 25 most frequent trigrams]

These are all very common phrases. Since the sample we used for our exploratory analysis was rather small, we will experiment with other and larger samples to see whether the distributions change.

Appendix

R code:

library(tm)
library(RWeka)
library(slam)

## Data cleaning
sample <- readLines('en_US.Sample.txt')
sample <- iconv(sample, to="ASCII", sub = "")   # drop characters that cannot be converted to ASCII
corpus <- Corpus(VectorSource(sample))
corpus <- tm_map(corpus, content_transformer(removePunctuation))
corpus <- tm_map(corpus, content_transformer(removeNumbers))
corpus <- tm_map(corpus, content_transformer(stripWhitespace))
corpus <- tm_map(corpus, content_transformer(tolower))

## Tokenization and DocumentTermMatrix
options(mc.cores=1)   # run sequentially; RWeka's Java-based tokenizer does not mix well with parallel tm
Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))   # single words (unigrams)
dtm <- DocumentTermMatrix(corpus, control = list(tokenize = Tokenizer))

freq <- rollup(dtm, 1, na.rm=TRUE, FUN = sum)        # sum term counts over all documents
freq <- as.matrix(freq)
freq <- head(sort(freq[1,], decreasing = TRUE), 25)  # 25 most frequent words

## Histogram
par("cex.axis" = .7)
barplot(freq, main = "Histogram of Word Frequencies", ylab="Frequency",xlab="Word",las=2, space = 1, 
        ylim = c(0, 50000))


## Tokenization and DocumentTermMatrix for word pairs
options(mc.cores=1)
Tokenizer2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_pair <- DocumentTermMatrix(corpus, control = list(tokenize = Tokenizer2))
freq_pair <- rollup(dtm_pair, 1, na.rm=TRUE, FUN = sum)
freq_pair <- as.matrix(freq_pair)
freq_pair <- head(sort(freq_pair[1,], decreasing = TRUE), 25)

## Histogram of word pairs
par("cex.axis" = .7)
barplot(freq_pair, main = "Histogram of Word Pair Frequencies", ylab="Frequency",xlab="Word Pair",las=2, space = 1, ylim = c(0, 5000))


## Tokenization and DocumentTermMatrix for trigrams
options(mc.cores=1)
Tokenizer3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm_tri <- DocumentTermMatrix(corpus, control = list(tokenize = Tokenizer3))
freq_tri <- rollup(dtm_tri, 1, na.rm=TRUE, FUN = sum)
freq_tri <- as.matrix(freq_tri)
freq_tri <- head(sort(freq_tri[1,], decreasing = TRUE), 25)

## Histogram of trigrams
par("cex.axis" = .7)
barplot(freq_tri, main = "Histogram of Trigram Frequencies", ylab="Frequency",xlab="Trigram",las=2, space = 1, ylim = c(0, 400))