This document is the Milestone Report for Exploratory Data Analysis (EDA) in the Data Science Capstone course. It summarizes the results of EDA on the text data provided by SwiftKey through the Coursera platform for this course. EDA is generally the first step before any data modeling work: it includes understanding the data, identifying its features, creating visuals, and so on. This step builds familiarity with the new data, gives a general idea of its characteristics, and helps avoid some possible errors later in modeling.
This analysis uses functions from several R libraries, so the first step is to load those libraries and set up the environment.
# Load packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tm)
## Loading required package: NLP
library(RWeka)
library(stringi)
library(pryr)
##
## Attaching package: 'pryr'
## The following object is masked from 'package:tm':
##
## inspect
library(RColorBrewer)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
The next step is to import the data. I manually downloaded the “Coursera-SwiftKey.zip” file from the Coursera platform. Because the file is a zip archive, I unzipped it with the unzip function before starting the analysis. After the first run, all the necessary files and folders exist in the current working directory, so there is no need to run the unzip command in later executions.
#unzip("Coursera-SwiftKey.zip")
blogs <- readLines("./final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./final/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
Let’s examine basic information about the three files provided.
stats <- data.frame(
    FileName = c("blogs.txt", "news.txt", "twitter.txt"),
    FileSize = sapply(list(blogs, news, twitter),
                      function(x) { format(object.size(x), "MB") }),
    t(rbind(sapply(list(blogs, news, twitter), stri_stats_general),
            words = sapply(list(blogs, news, twitter), stri_stats_latex)[4, ]))
)
stats
## FileName FileSize Lines LinesNEmpty Chars CharsNWhite words
## 1 blogs.txt 255.4 Mb 899288 899288 206824382 170389539 37570839
## 2 news.txt 257.3 Mb 1010242 1010242 203223154 169860866 34494539
## 3 twitter.txt 319 Mb 2360148 2360148 162096241 134082806 30451170
As the summary shows, the files are large. Processing large files requires a lot of computational power and extra processing time, so I am taking a 1% sample of each of the three input files.
I set a seed value so that the sampling is reproducible.
set.seed(1001)
sampleSize <- 0.01
Let’s sample the data from the three individual files and combine the samples. After combining them, let’s check their sizes.
blogsSample <- sample(blogs, length(blogs) * sampleSize)
newsSample <- sample(news, length(news) * sampleSize)
twitterSample <- sample(twitter, length(twitter) * sampleSize)
sampleAll <- c(blogsSample, newsSample, twitterSample)
sampleStats <- data.frame(
    fileName = c("blogsSample", "newsSample", "twitterSample", "sampleAll"),
    fileSize = sapply(list(blogsSample, newsSample, twitterSample, sampleAll),
                      function(x) { format(object.size(x), "MB") }),
    t(rbind(sapply(list(blogsSample, newsSample, twitterSample, sampleAll), stri_stats_general),
            words = sapply(list(blogsSample, newsSample, twitterSample, sampleAll), stri_stats_latex)[4, ]))
)
sampleStats
## fileName fileSize Lines LinesNEmpty Chars CharsNWhite words
## 1 blogsSample 2.6 Mb 8992 8992 2083795 1717050 377945
## 2 newsSample 2.6 Mb 10102 10102 2026788 1694545 343740
## 3 twitterSample 3.2 Mb 23601 23601 1623115 1342140 305310
## 4 sampleAll 8.4 Mb 42695 42695 5733698 4753735 1026995
Now that the sample is ready, the next step is to build the corpus and check its size.
corpus <- VCorpus(VectorSource(sampleAll))
object_size(corpus)
## 100.32 MB
As shown above, the VCorpus object is quite large even though it was built from only a 1% sample of the data.
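Since the full blogs, news and twitter character vectors are no longer needed once the sample has been drawn and the corpus built, one optional housekeeping step (not part of the original analysis, just a memory-saving suggestion) is to release them:
# Optional: free the full data sets now that only the sample and corpus are used;
# gc() runs garbage collection so the memory can actually be reclaimed
rm(blogs, news, twitter)
gc()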
Next, let’s apply common text-processing steps to the data: convert it to lower case; remove punctuation marks, numbers and extra whitespace; and finally convert it to a plain-text document.
revisedCorpus <- tm_map(corpus, content_transformer(tolower))
revisedCorpus <- tm_map(revisedCorpus, removePunctuation)
revisedCorpus <- tm_map(revisedCorpus, removeNumbers)
revisedCorpus <- tm_map(revisedCorpus, stripWhitespace)
revisedCorpus <- tm_map(revisedCorpus, PlainTextDocument)
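The conclusion below also mentions removing stopwords and filtering out profane words. Here is a minimal sketch of how those optional steps could be slotted into the same pipeline; they are not applied in this report, the cleanedCorpus name is just illustrative, and "profanity.txt" is only a placeholder for whatever word list is used:
# Sketch only (not applied in this report): drop common English stopwords and
# profane words before tokenizing
cleanedCorpus <- tm_map(revisedCorpus, removeWords, stopwords("en"))
# "profanity.txt" is a placeholder file name; any published word list could be substituted
profanity <- readLines("profanity.txt", warn = FALSE, skipNul = TRUE)
cleanedCorpus <- tm_map(cleanedCorpus, removeWords, profanity)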
The next steps are to tokenize the corpus and construct sets of N-grams. I plan to start with the following three: unigrams, bigrams and trigrams.
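As a quick illustration of what RWeka’s NGramTokenizer produces, here is a toy example (the sentence is made up; the output is shown as a comment):
# Tokenize a short example string into bigrams; each element of the result is a
# two-word sequence from the input
NGramTokenizer("this is a short example", Weka_control(min = 2, max = 2))
# expected result: "this is" "is a" "a short" "short example"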
Let’s create the unigram tokenizer and check the frequencies of individual words.
unigram_token <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
unigram_mat <- TermDocumentMatrix(revisedCorpus, control = list(tokenize = unigram_token))
uniCorpus <- findFreqTerms(unigram_mat, lowfreq = 20)
uniCorpusFreq <- rowSums(as.matrix(unigram_mat[uniCorpus,]))
uniCorpusFreq <- data.frame(word = names(uniCorpusFreq), frequency = uniCorpusFreq)
head(uniCorpusFreq)
## word frequency
## “if “if 29
## “it “it 23
## “the “the 95
## “this “this 25
## “we “we 50
## “you “you 22
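Notice that tokens such as “if and “the still begin with a curly opening quote: removePunctuation with its default settings only strips ASCII punctuation, so Unicode quotes and dashes survive. A hedged sketch of an extra cleaning step that could be added to the pipeline above (the character class is illustrative, not exhaustive; if the installed tm version supports it, removePunctuation(x, ucp = TRUE) also handles Unicode punctuation):
# Replace common Unicode quotes and dashes with spaces; this step would sit with
# the other tm_map calls above, before the term-document matrices are built
revisedCorpus <- tm_map(revisedCorpus,
                        content_transformer(function(x) gsub("[“”‘’—–]", " ", x)))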
Let’s plot the most frequent unigrams.
uniCorpusFreqSorted <- arrange(uniCorpusFreq, desc(frequency))
unigram_plt <- ggplot(
    data = uniCorpusFreqSorted[1:25, ],
    aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = "#56B4E9") +
    xlab("Words") +
    ylab("Frequency") +
    ggtitle("Top 25 Unigrams") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1))
unigram_plt
Let’s create the bigram tokenizer and check the frequencies of the associated word pairs.
bigram_token <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
bigram_mat <- TermDocumentMatrix(revisedCorpus, control = list(tokenize = bigram_token))
biCorpus <- findFreqTerms(bigram_mat, lowfreq = 20)
biCorpusFreq <- rowSums(as.matrix(bigram_mat[biCorpus,]))
biCorpusFreq <- data.frame(word = names(biCorpusFreq), frequency = biCorpusFreq)
head(biCorpusFreq)
## word frequency
## – and – and 28
## – the – the 23
## — a — a 22
## — and — and 43
## — the — the 39
## “ i “ i 24
Let’s plot the most frequent bigrams.
biCorpusFreqSorted <- arrange(biCorpusFreq, desc(frequency))
bigram_plt <- ggplot(
    data = biCorpusFreqSorted[1:25, ],
    aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = "#CC79A7") +
    xlab("Words") +
    ylab("Frequency") +
    ggtitle("Top 25 Bigrams") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1))
bigram_plt
Let’s create the trigram tokenizer and check the frequencies of the associated word triples.
trigram_token <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
trigram_mat <- TermDocumentMatrix(revisedCorpus, control = list(tokenize = trigram_token))
triCorpus <- findFreqTerms(trigram_mat, lowfreq = 20)
triCorpusFreq <- rowSums(as.matrix(trigram_mat[triCorpus, ]))
triCorpusFreq <- data.frame(word = names(triCorpusFreq), frequency = triCorpusFreq)
head(triCorpusFreq)
## word frequency
## a bit of a bit of 58
## a bunch of a bunch of 36
## a chance to a chance to 53
## a copy of a copy of 20
## a couple of a couple of 92
## a few days a few days 34
Let’s plot the most frequent trigrams.
triCorpusFreqSorted <- arrange(triCorpusFreq, desc(frequency))
trigram_plt <- ggplot(
    data = triCorpusFreqSorted[1:25, ],
    aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = "#66CC99") +
    xlab("Words") +
    ylab("Frequency") +
    ggtitle("Top 25 Trigrams") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1))
trigram_plt
The section above shows histograms for the unigrams, bigrams and trigrams. Let’s represent similar information with word clouds, which convey the popularity of words or phrases by drawing the most frequently used ones larger and bolder than the words around them.
Word Cloud for Unigrams
unigram_cld <- wordcloud(
uniCorpusFreq$word, uniCorpusFreq$frequency, scale = c(2, 0.5),
max.words = 100, random.order = FALSE,
rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2")
)
Word Cloud for Bigrams
bigram_cld <- wordcloud(
biCorpusFreq$word, biCorpusFreq$frequency,
scale = c(2, 0.5), max.words = 100, random.order = FALSE,
rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2")
)
Word Cloud for Trigrams
trigram_cld <- wordcloud(
triCorpusFreq$word, triCorpusFreq$frequency,
scale = c(2, 0.5), max.words = 100, random.order = FALSE,
rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2")
)
This Exploratory Data Analysis on samples of the three given files provides a general idea of the data they capture. The sample data and the corpus have been pre-processed, and these samples can be used for building and testing different models that predict a related word. A similar analysis could also be run with different sample sizes, keeping computational complexity and run times in mind. Other considerations are removing stopwords during pre-processing and filtering out profane words (sketched briefly after the cleaning steps above). The next step is to build and test different models and use the one that gives appropriate word predictions for building a Shiny app.
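As a preview of that modeling step, here is a minimal, heavily simplified sketch of a frequency-based lookup built on the triCorpusFreq table from above; the predictNextWord name is hypothetical, and a real model would need smoothing and backoff to lower-order N-grams:
# Hypothetical helper: return the most frequent third word that follows the
# last two words typed, using the trigram frequency table built earlier
predictNextWord <- function(lastTwoWords, trigramFreq) {
    prefix <- paste0(lastTwoWords, " ")
    matches <- trigramFreq[startsWith(as.character(trigramFreq$word), prefix), ]
    if (nrow(matches) == 0) return(NA_character_)
    best <- as.character(matches$word[which.max(matches$frequency)])
    # keep only the final word of the best-matching trigram
    tail(strsplit(best, " ")[[1]], 1)
}
# predictNextWord("a couple", triCorpusFreq)   # most likely "of", per the table above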