This is the first assignment in the Data Science Capstone course offered by Johns Hopkins University on Coursera. As stated in the instructions, the main goal of this report is to demonstrate that the data has been loaded into the R environment and is ready for exploratory data analysis, which will eventually culminate in an N-gram prediction model.

The data consists of three files containing English-language blog, news and Twitter text. The following libraries were used in producing this report and are essential for reproducing it.

Loading libraries

library(dplyr)
library(ggplot2)
library(tm)
library(wordcloud)
library(stringr)
library(DT)
library(RWeka)

Loading the dataset

# skipNul avoids warnings about embedded nul characters present in these files
us_news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
us_blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
us_twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Summarizing the dataset

names <- c('US-News', 'US-Blogs', 'US-Twitter')
lengths <- c(length(us_news), length(us_blogs), length(us_twitter))
nwords <- c(sum(sapply(strsplit(us_news, " "), length)),
            sum(sapply(strsplit(us_blogs, " "), length)),
            sum(sapply(strsplit(us_twitter, " "), length)))
df <- data.frame(Files = names, "No. of Lines" = lengths, "No. of Words" = nwords,
                 check.names = FALSE)  # keep the readable column names in the table
datatable(df)

The data table above shows that each file contains a very large number of lines and words. To keep the computations manageable, we will draw a sample from each document and perform the exploratory analysis on that sample. This saves time while still giving a representative picture of how the data sets are structured.
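
As a quick sanity check on that claim, the size of each file on disk can also be reported (a minimal sketch, assuming the same file paths used above):

file_paths <- c("final/en_US/en_US.news.txt",
                "final/en_US/en_US.blogs.txt",
                "final/en_US/en_US.twitter.txt")
round(file.size(file_paths) / 1024^2, 1)   # file sizes in megabytes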

Sampling the data set

set.seed(1234)  # fix the random seed so the sample drawn below is reproducible on re-runs
sample_news <- us_news[sample(1:length(us_news), 9000)]
sample_blogs <- us_blogs[sample(1:length(us_blogs), 9000)]
sample_twitter <- us_twitter[sample(1:length(us_twitter), 9000)]
sample_final <- c(sample_news, sample_blogs, sample_twitter)
cat('No. of lines in the sample are:', length(sample_final),
    '\nNo. of words in the sample are:', sum(sapply(strsplit(sample_final, " "),
                                                          length)))
## No. of lines in the sample are: 27000 
## No. of words in the sample are: 798453

Cleaning the data set

The data set is fairly unclean. Thus, text cleaning steps such as

  1. Removing punctuation
  2. Removing numbers
  3. Converting text to lower case
  4. Removing URLs
  5. Removing English stop words
  6. Stemming
  7. Stripping extra white space

are performed using the tm package in R. This gives us a cleaner version of the data set that can be used for feature extraction.

To apply these transformations, the text data must first be converted into a corpus, i.e. a collection of documents, which is created with the Corpus function from the tm package.

sample_corpa <- Corpus(VectorSource(sample_final))                             # build the corpus
sample_corpa <- tm_map(sample_corpa, content_transformer(removePunctuation))  # drop punctuation
sample_corpa <- tm_map(sample_corpa, content_transformer(removeNumbers))      # drop digits
sample_corpa <- tm_map(sample_corpa, content_transformer(tolower))            # lower-case everything
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)                      # helper to strip URLs
sample_corpa <- tm_map(sample_corpa, content_transformer(removeURL))
sample_corpa <- tm_map(sample_corpa, removeWords, stopwords('en'))            # remove English stop words
sample_corpa <- tm_map(sample_corpa, stemDocument)                            # stem each word
sample_corpa <- tm_map(sample_corpa, stripWhitespace)                         # collapse extra spaces
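
Before moving on, it can help to spot-check a couple of the cleaned documents and confirm the transformations behaved as expected (a minimal check using the standard tm accessors):

as.character(sample_corpa[[1]])   # first cleaned document
as.character(sample_corpa[[2]])   # second cleaned document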

Formulating the Term Document Matrix

The Term Document Matrix (TDM) is a sparse matrix whose rows are the terms in the corpus and whose columns are the documents, with each cell recording how often a term occurs in a document. It summarizes the corpus in a form that is easy to analyse, so we construct one as follows.

sample_tdm <- TermDocumentMatrix(sample_corpa)
sample_m <- as.matrix(sample_tdm)                 # dense matrix: rows = terms, columns = documents
Term_freq <- rowSums(sample_m)                    # total frequency of each term across the sample
Term_freq <- sort(Term_freq, decreasing = TRUE)
barplot(Term_freq[1:20], las = 2, col = 'brown')  # 20 most frequent terms
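
Converting the sparse TDM to a dense matrix with as.matrix() is memory-hungry for large corpora. A lighter alternative (a sketch, assuming the slam package that tm depends on is available) computes the term frequencies directly on the sparse representation:

Term_freq_sparse <- sort(slam::row_sums(sample_tdm), decreasing = TRUE)  # no dense matrix needed
head(Term_freq_sparse, 10)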

rm(sample_m, sample_tdm)

The bar plot above shows which (stemmed) terms occur most frequently in the sample. We can also visualize these term frequencies as a word cloud.

Word clouds

wc <- data.frame(Words = names(Term_freq),
                 Freq = Term_freq, row.names = 1:length(Term_freq))
wordcloud(words = wc$Words, freq = wc$Freq, max.words = 75,colors = c('darkorange',
                                                                      'firebrick',
                                                                      'gray22'))

N-Grams

N-grams are contiguous sequences of N words taken from a document. A 2-gram (bigram), for example, is any pair of adjacent words, so counting bigrams tells us which word pairs occur together most frequently. This gives us more detail about how words interact with each other in our data set and will eventually help us extract important features from the data.
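
As a toy illustration of what the tokenizer used below produces (the expected output is shown as a comment; it is simply every pair of adjacent words):

NGramTokenizer("the quick brown fox", Weka_control(min = 2, max = 2))
## Expected: "the quick"  "quick brown"  "brown fox"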

We shall construct 2-Gram and 3-Gram models, visualize them and understand how they behave.

Bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))   # tokenizer for word pairs
Trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))  # tokenizer for word triples
bigrams <- data.frame(table(unlist(lapply(sample_corpa, Bigram)))) %>%    # count each bigram
  arrange(desc(Freq))
trigrams <- data.frame(table(unlist(lapply(sample_corpa, Trigram)))) %>%  # count each trigram
  arrange(desc(Freq))
datatable(head(bigrams, 10))
datatable(head(trigrams, 10))
# reorder the factor levels by frequency so ggplot draws the bars in order
bigrams$Var1 <- factor(bigrams$Var1, levels = bigrams$Var1[order(bigrams$Freq)])
trigrams$Var1 <- factor(trigrams$Var1, levels = trigrams$Var1[order(trigrams$Freq)])

Visualizing the N-gram tokens

ggplot(bigrams[1:10, ], aes(Var1, Freq, fill = Freq)) + geom_bar(stat = 'identity') +
  ggtitle("Most Frequent Bigrams") + theme(axis.text.x = element_text(angle = 45, hjust = 0.8))

ggplot(trigrams[1:10, ], aes(Var1, Freq, fill = Freq)) + geom_bar(stat = 'identity') +
  ggtitle("Most Frequent Trigrams") + theme(axis.text.x = element_text(angle = 45, hjust = 0.8))

The graphs above show the 10 most frequent bigrams and trigrams. These counts will form the basis of the features for the prediction model.
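
To hint at how these counts feed into prediction, here is a minimal, illustrative sketch (not the final model) that turns the bigram table into a simple "most frequent next word" lookup, using only the dplyr and stringr packages loaded above:

next_word <- bigrams %>%
  mutate(bigram = as.character(Var1),
         w1 = word(bigram, 1),      # first word of the bigram
         w2 = word(bigram, 2)) %>%  # candidate next word
  group_by(w1) %>%
  slice(which.max(Freq)) %>%        # keep the most frequent continuation for each word
  ungroup() %>%
  select(w1, w2, Freq)

head(next_word[order(-next_word$Freq), ], 5)   # strongest word-to-word pairings in this sample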

Insights

  1. Increasing the order of the N-gram model (from bigrams to trigrams) yields a drastic decrease in the observed counts.

  2. The corpus is very large. Sampling gives a useful picture of the data for exploratory analysis, but we will need a workaround to process the full data set in R if the predictions are to be accurate (see the sketch after this list).

  3. Stemming needs careful handling, since it produces truncated stems such as citi and presid.
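
One possible workaround for point 2 (a sketch only; the helper name, chunk size and per-chunk processing are illustrative assumptions, not part of this report) is to stream each file in fixed-size chunks instead of reading it whole:

process_in_chunks <- function(path, chunk_size = 50000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total_words <- 0
  repeat {
    lines <- readLines(con, n = chunk_size, skipNul = TRUE)
    if (length(lines) == 0) break
    # ...per-chunk cleaning / counting would go here...
    total_words <- total_words + sum(sapply(strsplit(lines, " "), length))
  }
  total_words
}
# process_in_chunks("final/en_US/en_US.twitter.txt")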

Next Steps

Next steps include deciding how to structure the prediction model and how to handle the full data set in R so that the predictions are accurate.