The following is an exploratory analysis of the textual dataset used in the Coursera Data Science Specialization Capstone Project. Since this is exploratory work, none of the plots or code are particularly polished. To keep the report concise, the data download and setup steps are not shown; the results below should make clear that they have been carried out.
The following statistics are presented as an overview of the files used to build the project corpus.
# Twitter line count
length(count.fields("./data/en_US/en_US.twitter.txt"))
## [1] 2304374
# Blog line count
length(count.fields("./data/en_US/en_US.blogs.txt"))
## [1] 898436
# News line count
length(count.fields("./data/en_US/en_US.news.txt"))
## [1] 77258
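As an aside, count.fields tallies logical records rather than raw lines, and its default quote handling can merge lines that contain quote characters, so these totals may differ slightly from a plain line count (the news corpus built below, for example, ends up with 77,259 documents). A readLines-based count is a useful cross-check; the following is a sketch, with no output shown here.
# Cross-check: raw line count via readLines, which is also how the corpus
# below is constructed; this can differ slightly from count.fields
length(readLines("./data/en_US/en_US.news.txt"))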
A corpus, from which further statistics will be extracted, is constructed using the tm package. For this report I have focused on the News dataset. This choice was made because of its manageable size and because I expect more standard English usage than in, for example, the Twitter dataset. Focusing on one set also keeps this report fairly concise while allowing a deeper look; my plans to expand the analysis beyond this dataset are discussed at the end.
The data is read in such that each line of the data set is considered its own “document”, analogous to an observation in statistical parlance.
# Create source
library(tm)
y2 <- readLines("./data/en_US/en_US.news.txt")
# Create corpus
CorpNe <- Corpus(VectorSource(y2), readerControl = list(reader = readPlain,
                                                        language = "en_US",
                                                        load = FALSE))
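As a quick sanity check (not part of the setup above), the corpus length should match the number of lines read, and individual documents can be inspected; the exact behavior of the accessors depends on the tm version.
# Sanity check: each line of the file should have become one document
length(CorpNe) == length(y2)   # expect TRUE
as.character(CorpNe[[1]])      # the first news line, now its own "document"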
A document-term matrix is constructed. This sparse matrix tallies the occurrences of each term within each document.
# Doc Term Matrix
dtm <- DocumentTermMatrix(CorpNe)
dtmhf <- findFreqTerms(dtm,5000, Inf)
The results displayed below give us a feel for the data set. In the 77,259 news entries, 178,827 distinct terms arise. Of course, no single document contains more than a tiny fraction of this vocabulary, so the resulting matrix is very sparse. For illustration, a small subset of the matrix is displayed, tallying appearances in the first five documents of only those terms that occur at least 5,000 times across the corpus and that also start with “th”.
dtm
## <<DocumentTermMatrix (documents: 77259, terms: 178827)>>
## Non-/sparse entries: 1940160/13814055033
## Sparsity : 100%
## Maximal term length: 123
## Weighting : term frequency (tf)
inspect(dtm[1:5,dtmhf[grep("^th",dtmhf)]])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 6/19
## Sparsity : 76%
## Maximal term length: 5
## Weighting : term frequency (tf)
##
## Terms
## Docs that the their they this
## 1 0 0 0 0 0
## 2 0 3 0 0 0
## 3 0 3 0 0 0
## 4 0 5 0 0 1
## 5 0 1 1 0 0
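For the record, the “Sparsity: 100%” figure above is a rounded value; the exact fraction of empty cells can be recovered from the matrix dimensions, as in this quick sketch.
# Fraction of empty cells in the full matrix (about 99.99%, printed as 100% above)
1 - length(dtm$v) / (nDocs(dtm) * nTerms(dtm))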
A quick histogram of the counts of the highest-frequency terms shows a heavily skewed, power-law-type distribution.
hist(colSums(as.matrix(dtm[,dtmhf])),breaks = 60)
A ranked bar plot of these frequencies presents the same information more clearly. It also allows a gut check: stopwords such as “the”, “and”, and “for” top the list. A third benefit of seeing the actual words is that an error becomes noticeable: the term “said.” has been counted independently of “said” (without the period). This flags punctuation handling as an area of concern for pre-processing; a possible cleaning pass is sketched after the plot code below.
par(las=2)
barplot(colSums(as.matrix(dtm[,dtmhf]))[order(colSums(as.matrix(dtm[,dtmhf])))],
horiz = TRUE,cex.names = 0.7)
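The punctuation issue noted above suggests a cleaning pass before the matrices are rebuilt. The following is a minimal sketch using tm’s standard transformations; it has not yet been applied to the corpus used in this report.
# Candidate pre-processing pass to collapse variants such as "said." and "said"
CorpNeClean <- tm_map(CorpNe, content_transformer(tolower))
CorpNeClean <- tm_map(CorpNeClean, removePunctuation)
CorpNeClean <- tm_map(CorpNeClean, stripWhitespace)
dtmClean <- DocumentTermMatrix(CorpNeClean)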
The distribution is striking, but is it usual or unusual? Zipf’s Law posits that a term’s frequency is roughly inversely proportional to its frequency rank, so that log-frequency falls off linearly with log-rank. tm includes a handy function that plots our data against the Zipf expectation, and we find broadly good agreement (with some minor deviation at the top ranks).
dtmmf <- findFreqTerms(dtm,70,Inf)
Zipf_plot(dtm[,dtmmf])
## (Intercept) x
## 11.6360011 -0.8912093
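The two numbers printed are the intercept and slope of the log-log fit that Zipf_plot performs (a perfect Zipf distribution would have slope -1). As a sanity check, fitting the regression by hand over the same terms should reproduce essentially the same coefficients; a sketch:
# Hand-rolled version of the Zipf fit: log(frequency) against log(rank)
freq <- sort(slam::col_sums(dtm[, dtmmf]), decreasing = TRUE)
rank <- seq_along(freq)
coef(lm(log(freq) ~ log(rank)))  # slope should come out near -0.89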
Following on the above, we note that the corpus contains 2,179,393 word instances but just 178,827 unique terms (the latter reported in the matrix summary above).
# Sum of dtm - note that this is also the word count
sum(dtm)
## [1] 2179393
We build a cumulative-sum vector of term frequencies, sorted from most to least frequent, and use it to determine how many terms are needed to cover 50% and 90% of the total word count.
# Sum the most frequent terms and create a cumulative sum vector
library(slam)
fr <- rollup(dtm, 1, na.rm=TRUE, FUN = sum)[[3]]
cs <- cumsum(fr[order(fr,decreasing = TRUE)])
# Find the number of terms needed to cover 50% and 90% of word count
findInterval(sum(dtm)*c(0.5,0.9), cs)+1
## [1] 692 31319
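The same cumulative-sum vector supports coverage at other thresholds as well; a small helper along these lines (coverage is a name introduced here, not used elsewhere in the analysis) generalizes the calculation.
# How many of the top-ranked terms cover a fraction p of all word instances?
coverage <- function(p) findInterval(sum(dtm) * p, cs) + 1
sapply(c(0.25, 0.5, 0.75, 0.9), coverage)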
The RWeka package is used to create tokenizer functions, which in turn are used to construct document-term matrices for 2-grams and 3-grams, summarized below. Both are very sparse matrices. The number of unique 3-grams is almost twice the number of unique 2-grams (three words admit many more combinations than two). Correspondingly, dividing the total number of instances by the number of distinct terms, the average 2-gram occurs about 2.63 times in the corpus, while the average 3-gram occurs about 1.29 times.
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
ngram2 <- DocumentTermMatrix(CorpNe, control = list(tokenize = BigramTokenizer))
ngram2
## <<DocumentTermMatrix (documents: 77259, terms: 1000544)>>
## Non-/sparse entries: 2584758/77298444138
## Sparsity : 100%
## Maximal term length: 117
## Weighting : term frequency (tf)
sum(ngram2)
## [1] 2631217
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
ngram3 <- DocumentTermMatrix(CorpNe, control = list(tokenize = TrigramTokenizer))
ngram3
## <<DocumentTermMatrix (documents: 77259, terms: 1983611)>>
## Non-/sparse entries: 2543955/153249258294
## Sparsity : 100%
## Maximal term length: 121
## Weighting : term frequency (tf)
sum(ngram3)
## [1] 2554204
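For reference, the per-n-gram averages quoted above are simply the ratio of total instances to distinct terms, e.g.:
# Average number of occurrences per distinct n-gram
sum(ngram2) / nTerms(ngram2)  # roughly 2.63
sum(ngram3) / nTerms(ngram3)  # roughly 1.29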
I briefly explored some properties of the 2-grams ahead of the modeling assignment. The following code pulls out the individual matrix entries in which a 2-gram occurs more than twice within a single document and, for illustration, lists those beginning with the word “when.”
bg <- ngram2$dimnames[[2]][ngram2$j[ngram2$v>2]]
bg2 <- ngram2$v[which(ngram2$v>2)]
names(bg2) <- bg
bg2[grep("^when ",bg)]
## when you when the when i when they when they when your when she
## 3 3 3 3 3 3 3
## when the when he
## 3 3
Two features are immediately noticeable. First, there are several duplicate entries; these arise because the values shown are per-document counts taken straight from the sparse matrix, so the same 2-gram appears once for each document in which it passes the threshold, and the counts still need to be aggregated across the corpus. Second, once the terms are aggregated, some stand out as more common than the rest: “when the” and “when they”. The first is to be expected; based on the results above, “the” is a safe guess for the next word on frequency alone. The second is more illuminating: “they” is only the 17th most common word in the corpus, so its second-place showing here implies that when the preceding word is “when”, the following word is more likely than average to be “they”. Further analysis is required, of course, but this is a promising basis for investigation.
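To address the duplication, the per-document entries can be summed across the corpus before ranking. A minimal sketch, assuming slam::col_sums (slam is already loaded above), might look like this:
# Aggregate bigram counts across all documents, then rank the "when ..." bigrams
when_idx <- grep("^when ", Terms(ngram2))
when_totals <- col_sums(ngram2[, when_idx])
head(sort(when_totals, decreasing = TRUE), 10)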
Much additional analysis could be completed on this data set. To this point, my focus has been on loading and exploring the data while getting a handle on tm and its integrations. Clearly, the n-gram analysis will be key for the next steps of this project. With that in mind, my general goals over the next few weeks are to extend this analysis to the blogs and Twitter datasets, to tighten the pre-processing (particularly punctuation handling and the aggregation of n-gram counts noted above), and to use the resulting n-gram frequencies as the basis for the next-word prediction model.
Review and feedback at this stage will help these goals considerably. Thank you for your time and thoughtful feedback.