Introduction

This paper presents a basic exploratory analysis of the text corpus provided for the JHU data science capstone project. We examine each document separately as well as all three documents combined.

We are given three separate documents. The first contains scraped blog articles, the second contains scraped news articles, and the last one contains scraped Twitter posts.

Data Exploration

Let’s begin by examining the size of the files.

library(dplyr)

file.paths <- list.files('data/en_US/full', full.names = TRUE)

data.info <- file.info(file.paths)
data.info$size <- paste(data.info$size / 1024 / 1024, 'MB')

data.info %>% select(size)
##                                                  size
## data/en_US/full/en_US.blogs.txt   200.424207687378 MB
## data/en_US/full/en_US.news.txt    196.277512550354 MB
## data/en_US/full/en_US.twitter.txt 159.364068984985 MB

These files are relatively large. Let’s see how many lines are in each of the files:

sapply(file.paths, function(path) { lines <- readLines(path, skipNul = TRUE); paste(path, length(lines)) })
##             data/en_US/full/en_US.blogs.txt 
##    "data/en_US/full/en_US.blogs.txt 899288" 
##              data/en_US/full/en_US.news.txt 
##    "data/en_US/full/en_US.news.txt 1010242" 
##           data/en_US/full/en_US.twitter.txt 
## "data/en_US/full/en_US.twitter.txt 2360148"
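
Reading a whole file into memory just to count its lines can be slow for files of this size. A minimal sketch of an alternative is to read the file in chunks through a connection and keep a running total. The helper name `count_lines` and the chunk size are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper: count the lines of a text file without holding
# the whole file in memory at once.
count_lines <- function(path, chunk_size = 10000L) {
    con <- file(path, open = "r")
    on.exit(close(con))          # always release the connection
    total <- 0
    repeat {
        chunk <- readLines(con, n = chunk_size, skipNul = TRUE)
        if (length(chunk) == 0L) break   # nothing left to read
        total <- total + length(chunk)
    }
    total
}

# Small demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(as.character(1:25), tmp)
n_lines <- count_lines(tmp, chunk_size = 10L)
print(n_lines)
```

The same function could be applied to each entry of `file.paths` with `sapply`.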

These files have a large number of lines and may be difficult to work with directly (especially in R). Instead, we can take a random sample of each file and work with that. We will take about 6.7% of the lines of each document, which gives a corpus of approximately 6.7% of the original size.

library(tm)
library(RWeka)

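# Note: the target directory data/en_US/sample must already exist.
for (filePath in file.paths) {
    corpusFile <- readLines(filePath, skipNul = TRUE)
    # Draw 6.7% of the lines at random; floor() makes the sample size a whole number
    sample.corpus <- sample(corpusFile, floor(length(corpusFile) * 0.067))
    sample.filepath <- file.path(dirname(filePath), "..", "sample", basename(filePath))
    
    writeLines(sample.corpus, sample.filepath)
}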
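
As a quick check on the size of the sample, we can apply the same 6.7% rate to the line counts reported earlier. This is only arithmetic on those counts; the variable names are illustrative.

```r
# Line counts as reported by readLines() above
line_counts <- c(blogs = 899288, news = 1010242, twitter = 2360148)

# Number of lines drawn from each file at a 6.7% rate
sample_counts <- floor(line_counts * 0.067)
print(sample_counts)

# Share of the full corpus (by lines) that ends up in the sample
sample_share <- sum(sample_counts) / sum(line_counts)
print(round(sample_share, 3))
```

Because the same rate is applied to every file, the share of the combined corpus equals the per-file rate.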

We will now load our sample.

corpus.path <- "data/en_US/sample"
dirSource <- DirSource(corpus.path, pattern = "\\.txt$")  # pattern is a regular expression, not a glob
corpus <- VCorpus(dirSource)

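In order to perform a proper analysis, we first need to clean up the datasets. We will perform the following cleansing steps: convert the text to lower case; remove numbers; remove URLs; replace every character other than letters, apostrophes, hyphens, and whitespace with a space; remove free-standing hyphens and leading or trailing apostrophes; remove single-character tokens other than "a" and "i"; and collapse repeated whitespace.

We will also attempt to filter out profanities using a list obtained from the following source: Profanity List.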

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
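# content_transformer() keeps each element a text document rather than a bare character vector
corpus <- tm_map(corpus, content_transformer(function(x) {
    # URLs
    x <- gsub("https?://[^ ]+", " ", x)
    # Anything but letters, apostrophes, hyphens, whitespace
    # (perl = TRUE so that \\- and \\s act as escapes inside the brackets)
    x <- gsub("[^A-Za-z'\\-\\s]", " ", x, perl = TRUE)
    # Free-standing hyphens
    x <- gsub("\\s+(\\-+)\\s+", " ", x)
    # Leading or trailing apostrophes
    x <- gsub("(\\s\'|\\'(?=\\s))", " ", x, perl = TRUE)
    # Single characters other than "a" and "i"
    x <- gsub("\\s[^ai]\\s", " ", x)
    
    return(x)
}))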

corpus <- tm_map(corpus, stripWhitespace)

profanityList <- readLines("data/bad-words.txt")
corpus <- tm_map(corpus, removeWords, profanityList)

corpus <- tm_map(corpus, PlainTextDocument)
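
To see what these substitutions do, it can help to run them on a single string outside of tm. The sketch below uses base R only; the function name `clean_text` and the sample sentence are made up for illustration, and the digit and whitespace steps are base R stand-ins for `removeNumbers` and `stripWhitespace`. Profanity filtering is left out.

```r
# Hypothetical stand-alone version of the cleaning steps, for one string
clean_text <- function(x) {
    x <- tolower(x)
    x <- gsub("[[:digit:]]+", "", x)                       # numbers
    x <- gsub("https?://[^ ]+", " ", x)                    # URLs
    x <- gsub("[^A-Za-z'\\-\\s]", " ", x, perl = TRUE)     # other symbols
    x <- gsub("\\s+-+\\s+", " ", x, perl = TRUE)           # free-standing hyphens
    x <- gsub("(\\s'|'(?=\\s))", " ", x, perl = TRUE)      # stray apostrophes
    x <- gsub("\\s[^ai]\\s", " ", x, perl = TRUE)          # single characters
    x <- gsub("[[:space:]]+", " ", x)                      # repeated whitespace
    trimws(x)
}

raw <- "Visit https://example.com NOW - it's 100% 'great', e-mail b soon!"
cleaned <- clean_text(raw)
print(cleaned)
```

Apostrophes and hyphens inside a word ("it's", "e-mail") survive, while the URL, the digits, the punctuation, and the stray letter are dropped.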

We now have a cleaned-up dataset. Let’s now create a term-document matrix. This structure will allow us to perform further analysis on the corpus.

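tdm <- TermDocumentMatrix(corpus,
                          control = list(
                              # keep terms of any length, including one-letter words
                              wordLengths = c(1, Inf),
                              # lower bound on the number of documents a term must appear in
                              bounds = list(global = c(floor(length(corpus) * 0.05), Inf))))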
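
For readers unfamiliar with the structure, a term-document matrix has one row per term, one column per document, and the count of the term in that document in each cell. The following is a minimal base R sketch on three made-up one-line documents. It splits on single spaces, which is a simplification and not the tokenizer tm uses; all names here are illustrative.

```r
# Three toy documents (hypothetical content)
docs <- c(blog    = "the cat sat",
          news    = "the dog sat down",
          twitter = "the cat ran")

# Split each document into words and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document
toy_tdm <- vapply(tokens,
                  function(tok) as.integer(table(factor(tok, levels = terms))),
                  integer(length(terms)))
rownames(toy_tdm) <- terms
print(toy_tdm)
```

Terms that are missing from a document get a count of zero, which is where the sparsity discussed next comes from.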

Some terms occur in one of the documents but rarely, if at all, in the others, which makes the matrix sparse. We will need to clean this up. With only three documents, a sparsity threshold of 0.2 keeps just the terms that appear in all three.

library(tm)

tdm.clean <- removeSparseTerms(tdm, 0.2)
colnames(tdm.clean) <- c("blog", "news", "twitter")

inspect(tdm.clean[20:30, ])
## <<TermDocumentMatrix (terms: 11, documents: 3)>>
## Non-/sparse entries: 33/0
## Sparsity           : 0%
## Maximal term length: 11
## Weighting          : term frequency (tf)
## 
##              Docs
## Terms         blog news twitter
##   abandoning     4    8       3
##   abandonment   12    4       1
##   abba           5    3       1
##   abbey         20    6      14
##   abbie          2    2       1
##   abbot          1    2       1
##   abby          28   21      12
##   abc           29   37      44
##   abc's          6   10       3
##   abdomen       12   10       4
##   abdominal      5    7       5
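
The effect of the 0.2 threshold can be reproduced by hand on a small matrix. The sketch below computes, for each term, the share of documents in which it does not occur and keeps the terms whose share is below the threshold. This reflects our understanding of what `removeSparseTerms` does; the counts are made up for illustration.

```r
# Toy term-document counts (hypothetical values)
counts <- matrix(c(3, 4, 5,
                   1, 0, 2,
                   1, 1, 0,
                   0, 1, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Terms = c("the", "cat", "sat", "dog"),
                                 Docs  = c("blog", "news", "twitter")))

# Share of documents in which each term is absent
sparsity <- rowMeans(counts == 0)
print(sparsity)

# Keep only the terms below the sparsity threshold
kept <- counts[sparsity < 0.2, , drop = FALSE]
print(kept)
```

With three documents the possible sparsity values are 0, 1/3, and 2/3, so only terms present in every document pass a threshold of 0.2. This is consistent with the 0% sparsity reported by `inspect` above.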

Let us have a look at the most frequently occurring words in our corpus. The plot is built from the 60 largest per-document term counts and compares the frequency of each word across the three documents.

library(reshape2)
library(dplyr)
library(ggplot2)

term.matrix <- as.matrix(tdm.clean)
term.matrix <- melt(term.matrix, value.name = "count")

freq.matrix <- term.matrix %>% arrange(-count)
freq.matrix <- freq.matrix[1:60,]

ggplot(freq.matrix, aes(x = Docs, y = Terms, fill = log10(count))) +
    geom_tile(colour = "white") +
    scale_fill_gradient(high="#FF0000" , low="#FFFFFF")+
    ylab("") +
    theme(panel.background = element_blank()) +
    theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
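
Taking the largest rows of the melted matrix ranks term and document pairs, so a very common word can take up several of the 60 slots. An alternative, sketched below under made-up counts, is to rank terms by their total count across documents and then keep all of their per-document counts. The names are illustrative and this is not the selection used for the plot above.

```r
# Toy term-document counts (hypothetical values)
counts <- matrix(c(50, 40, 30,
                    5, 60,  1,
                   20, 20, 20,
                    2,  1,  3),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Terms = c("the", "said", "and", "abbey"),
                                 Docs  = c("blog", "news", "twitter")))

# Ranking term/document pairs: the three largest cells
long <- as.data.frame(as.table(counts), responseName = "count")
top_pairs <- head(long[order(-long$count), ], 3)
print(top_pairs)

# Ranking terms by their total count across documents
totals <- rowSums(counts)
top_terms <- names(sort(totals, decreasing = TRUE))[seq_len(2)]
top_counts <- counts[top_terms, , drop = FALSE]
print(top_counts)
```

In this toy example the three largest cells cover only two distinct terms, whereas ranking by totals always returns the requested number of terms.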

Next Steps