1. Introduction

The purpose of this report is to take a brief exploratory look at the dataset used for the Capstone. The data for this project comes directly from HC Corpora; according to the site, the “corpora have been collected from numerous different webpages, with the aim of getting a varied and comprehensive corpus of current use of the respective language”. The sources are diverse in nature: newspapers, magazines, tweets (Twitter messages), and so on. The website is curated by Hans Christensen.

As suggested by the Specialization professors, the package to be used will be tm, which offers a comprehensive text-mining framework for R. For further information about how this package works, please read these two papers: 1) Text Mining Infrastructure in R (Feinerer et al., 2008) and 2) Introduction to the tm Package (Feinerer, 2015). The latter is less theoretical and follows a “this is what you have to do” approach.

2. Loading, sampling and preprocessing the data

The first step is to download the data and take a look at it. The archive contains four folders, one for each language: German, English, Finnish and Russian. Each folder contains three .txt documents: one for news, one for blogs and one for Twitter. For the purposes of this project we will work with the English-language documents only.
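For reproducibility, the dataset can be fetched and unpacked with a few lines of R. This is only a sketch: the Coursera mirror URL is an assumption, and the zip actually extracts into a final/ folder, which is renamed here to match the data/ paths used below.

# one-time download and extraction of the corpus (URL assumed still live)
zipURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("data/en_US/en_US.blogs.txt")) {
  download.file(zipURL, destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")   # extracts into a final/ folder
  file.rename("final", "data")     # rename to match the paths used below
}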

2.1 Loading the files into R: file sizes and line counts

# reading the three English-language files line by line
blogs <- readLines("data/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("data/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# loading packages
library(ggplot2)
library(tm)    # also loads its NLP dependency, which masks ggplot2::annotate
library(RWeka) # provides the NGramTokenizer used later
# file sizes on disk, converted from bytes to megabytes
blogsSize <- file.info("data/en_US/en_US.blogs.txt")$size / 1024^2
newsSize <- file.info("data/en_US/en_US.news.txt")$size / 1024^2
twitterSize <- file.info("data/en_US/en_US.twitter.txt")$size / 1024^2
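The line counts shown in the summary table are simply the lengths of the character vectors returned by readLines(); a minimal sketch:

# each element of the vector returned by readLines() is one line of the file
blogsLines <- length(blogs)
newsLines <- length(news)
twitterLines <- length(twitter)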

Summary table:

File Name   File Size on Disk (MB)   Line (Document) Count
Blogs       200.42                   899,288
News        196.28                   77,259
Twitter     159.36                   2,360,148

2.2 Data Sampling

Because of the large size of the three R objects, it is best to work with samples to reduce memory usage and execution time. It is worth remembering from the “Statistical Inference” class that, given a random sample of reasonable size, conclusions about a population can be drawn from it with a high degree of certainty.

set.seed(33) # for reproducibility
# taking a 1% sample of each type of document
# Commentary: I started with 50%, then tried 30%, 20%, 10% and 5%,
# and my computer couldn't handle those sizes very well
sampleBlogs <- sample(blogs, round(length(blogs) * 0.01))
sampleNews <- sample(news, round(length(news) * 0.01))
sampleTwitter <- sample(twitter, round(length(twitter) * 0.01))
rm(blogs, news, twitter) # free the memory held by the full files
# putting all 3 sampled documents together
finalText <- c(sampleBlogs, sampleNews, sampleTwitter)

2.3 Preprocessing

Before we can analyze the data further, there are some steps we need to take to standardize and clean it:

1. Converting to lower case.

2. Removing Profanity: this is necessary because we are not interested in predicting or analyzing any form of profane text. To accomplish this we will use a list of “bad words” published by the Carnegie Mellon University School of Computer Science; every word on this list will be deleted (the list can be obtained as shown in the sketch after this list).

3. Removing Punctuation: commas, periods, exclamation points, etc.

4. Removing Stopwords: words that are so common in a language that their information value is almost zero.

5. Removing Numbers.

6. Stripping Whitespace: removing words leaves behind extra spaces, which need to be collapsed.
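For reproducibility, the profanity list can be fetched once with download.file(); a minimal sketch, assuming the list is still hosted at its usual CMU URL:

# one-time download of CMU's "bad words" list (URL is an assumption)
if (!file.exists("bad-words.txt")) {
  download.file("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt",
                destfile = "bad-words.txt")
}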

All these steps can be performed with the tm package. The first step is to take our combined texts (the finalText variable) and turn them into a corpus; then, with the proper transformations, we can clean it. That is done in the following code chunk:

profanity <- readLines("bad-words.txt") # CMU's list of bad words
profanity <- profanity[-1] # drop the list's first entry
corpus <- VCorpus(VectorSource(finalText))

# Helper function that performs all the cleaning transformations on the corpus
transformations <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removeWords, profanity)
  corpus <- tm_map(corpus, removeNumbers)
  # Removing words creates white space, so stripWhitespace has to come last
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}
corpus <- transformations(corpus)

3. Analyzing the corpus

After running the transformation steps to clean the document collection, the next step is to create a Document-Term Matrix (DTM) from the corpus. That can be done with the DocumentTermMatrix function of the tm package.

dtm <- DocumentTermMatrix(corpus)
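A quick sanity check on the resulting object is a good habit; a minimal sketch using tm's own accessors:

# dimensions of the DTM: number of documents x number of distinct terms
dim(dtm)
# a small corner of the matrix, showing per-document term counts
inspect(dtm[1:5, 1:5])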

Once the DTM has been created, it is possible to start working on the n-grams, which Wikipedia defines as follows: “In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.”

To tokenize the corpus effectively, the NGramTokenizer function from the RWeka package will be used.
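As a quick illustration of what this tokenizer produces, here is a sketch run on a toy sentence rather than the corpus:

# bigrams (n = 2) extracted from a short example sentence
NGramTokenizer("this is a short example", Weka_control(min = 2, max = 2))
## should yield: "this is" "is a" "a short" "short example"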

Words that appear at least 1000 times

findFreqTerms(dtm, 1000)
##  [1] "about" "all"   "and"   "are"   "been"  "but"   "can"   "don't"
##  [9] "for"   "from"  "get"   "good"  "had"   "has"   "have"  "her"  
## [17] "his"   "how"   "i'm"   "it's"  "just"  "know"  "like"  "love" 
## [25] "more"  "new"   "not"   "one"   "our"   "out"   "see"   "she"  
## [33] "some"  "that"  "the"   "their" "there" "they"  "this"  "time" 
## [41] "was"   "what"  "when"  "who"   "will"  "with"  "would" "you"  
## [49] "your"

Creating 1-grams:

UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
uniGram <- DocumentTermMatrix(corpus, control = list(tokenize = UnigramTokenizer))
# drop terms absent from more than 99% of the documents
uniGram1 <- removeSparseTerms(uniGram, 0.99)
# total frequency of each remaining term, sorted; keep the top 15
uniFreq <- sort(colSums(as.matrix(uniGram1)), decreasing = TRUE)[1:15]
uniFreqDf <- data.frame(Term = names(uniFreq), Frequency = uniFreq)
print(uniFreqDf)
##      Term Frequency
## the   the     29224
## and   and     15770
## you   you      9084
## for   for      7701
## that that      7585
## with with      4820
## this this      4317
## was   was      3966
## have have      3848
## are   are      3594
## but   but      3369
## not   not      2947
## your your      2731
## all   all      2667
## can   can      2499

Graphs

ggplot(uniFreqDf, aes(x = Term, y = Frequency)) +
  geom_bar(stat = "identity", fill = "salmon") +
  geom_text(aes(label = Frequency), vjust = -0.20) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

What’s next?

  1. There’s still more exploratory data analysis, like bigrams and trigrams to be executed.

  2. Prediction research, and what algorithm will be used to predict. The Naive Bayes classifier and also Markov Chains seem to be solid choices. So more research in this respect is expected.
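A bigram analysis would mirror the unigram code above; a minimal sketch of that next step (the looser 0.999 sparsity threshold is an assumption, since bigrams are much sparser than unigrams):

# bigram version of the unigram pipeline above (sketch)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
biGram <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
biGram1 <- removeSparseTerms(biGram, 0.999) # bigrams are sparser, so a looser cut
biFreq <- sort(colSums(as.matrix(biGram1)), decreasing = TRUE)[1:15]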