This document is the Milestone Report for Exploratory Data Analysis (EDA) in the Data Science Capstone course. It summarizes the results of EDA on the text data provided by SwiftKey through the Coursera platform for this course. EDA is generally the first step before any data modeling work: it includes understanding the data, identifying its features, creating visuals, and so on. This step builds familiarity with the new data, gives a general idea of its characteristics, and helps avoid some possible errors later in modeling.
This analysis uses functions from several R libraries, so the first step is to load those libraries and set up the environment.
# Load packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tm)
## Loading required package: NLP
library(RWeka)
library(stringi)
library(pryr)
##
## Attaching package: 'pryr'
## The following object is masked from 'package:tm':
##
## inspect
library(RColorBrewer)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
The next step is to import the data. I manually downloaded the “Coursera-SwiftKey.zip” file from the Coursera platform. Because the file is a zip archive, I unzipped it with the unzip function before starting the analysis. After the first run, all the necessary files and folders exist in the current working directory, so there is no need to run the unzip command in later executions.
#unzip("Coursera-SwiftKey.zip")
blogs <- readLines("./final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./final/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
Let’s examine basic information about the three files provided.
stats <- data.frame(
    FileName = c("blogs.txt", "news.txt", "twitter.txt"),
    FileSize = sapply(list(blogs, news, twitter),
                      function(x) { format(object.size(x), "MB") }),
    t(rbind(sapply(list(blogs, news, twitter), stri_stats_general),
            words = sapply(list(blogs, news, twitter), stri_stats_latex)[4, ]))
)
stats
## FileName FileSize Lines LinesNEmpty Chars CharsNWhite words
## 1 blogs.txt 255.4 Mb 899288 899288 206824382 170389539 37570839
## 2 news.txt 257.3 Mb 1010242 1010242 203223154 169860866 34494539
## 3 twitter.txt 319 Mb 2360148 2360148 162096241 134082806 30451170
As the summary shows, the files are large. Processing large files requires a lot of computational power and extra processing time, so I am taking a 1% sample of each of the three input files.
I set a seed value so that the sampling is reproducible.
set.seed(1001)
sampleSize <- 0.01
Let’s sample the data from the three individual files and combine the samples. After combining them, let’s check their sizes.
blogsSample <- sample(blogs, length(blogs) * sampleSize)
newsSample <- sample(news, length(news) * sampleSize)
twitterSample <- sample(twitter, length(twitter) * sampleSize)
sampleAll <- c(blogsSample, newsSample, twitterSample)
sampleStats <- data.frame(
    fileName = c("blogsSample", "newsSample", "twitterSample", "sampleAll"),
    fileSize = sapply(list(blogsSample, newsSample, twitterSample, sampleAll),
                      function(x) { format(object.size(x), "MB") }),
    t(rbind(sapply(list(blogsSample, newsSample, twitterSample, sampleAll), stri_stats_general),
            words = sapply(list(blogsSample, newsSample, twitterSample, sampleAll), stri_stats_latex)[4, ]))
)
sampleStats
## fileName fileSize Lines LinesNEmpty Chars CharsNWhite words
## 1 blogsSample 2.6 Mb 8992 8992 2083795 1717050 377945
## 2 newsSample 2.6 Mb 10102 10102 2026788 1694545 343740
## 3 twitterSample 3.2 Mb 23601 23601 1623115 1342140 305310
## 4 sampleAll 8.4 Mb 42695 42695 5733698 4753735 1026995
Now that the sample is ready, the next step is to build the corpus and check its size.
corpus <- VCorpus(VectorSource(sampleAll))
object_size(corpus)
## 100.32 MB
As shown above, the VCorpus object is quite large even though it was built from only a 1% sample of the data.
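Since the full blogs, news and twitter character vectors are no longer needed once the sample has been drawn and the corpus built, one optional housekeeping step (not part of the original analysis, just a memory-saving suggestion) is to release them:
# Optional: free the full data sets now that only the sample and corpus are used;
# gc() runs garbage collection so the memory can actually be reclaimed
rm(blogs, news, twitter)
gc()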
Next, let’s apply common text-processing steps to the data: convert it to lower case; remove punctuation marks, numbers and extra whitespace; and finally convert it to a plain-text document.
revisedCorpus <- tm_map(corpus, content_transformer(tolower))
revisedCorpus <- tm_map(revisedCorpus, removePunctuation)
revisedCorpus <- tm_map(revisedCorpus, removeNumbers)
revisedCorpus <- tm_map(revisedCorpus, stripWhitespace)
revisedCorpus <- tm_map(revisedCorpus, PlainTextDocument)
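The conclusion below also mentions removing stopwords and filtering out profane words. Here is a minimal sketch of how those optional steps could be slotted into the same pipeline; they are not applied in this report, the cleanedCorpus name is just illustrative, and "profanity.txt" is only a placeholder for whatever word list is used:
# Sketch only (not applied in this report): drop common English stopwords and
# profane words before tokenizing
cleanedCorpus <- tm_map(revisedCorpus, removeWords, stopwords("en"))
# "profanity.txt" is a placeholder file name; any published word list could be substituted
profanity <- readLines("profanity.txt", warn = FALSE, skipNul = TRUE)
cleanedCorpus <- tm_map(cleanedCorpus, removeWords, profanity)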
The next steps are to tokenize the corpus and construct sets of N-grams. I plan to start with the following three: unigrams, bigrams and trigrams.
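As a quick illustration of what RWeka’s NGramTokenizer produces, here is a toy example (the sentence is made up; the output is shown as a comment):
# Tokenize a short example string into bigrams; each element of the result is a
# two-word sequence from the input
NGramTokenizer("this is a short example", Weka_control(min = 2, max = 2))
# expected result: "this is" "is a" "a short" "short example"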
Let’s create the unigram tokenizer and check the frequencies of individual words.
unigram_token <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
unigram_mat <- TermDocumentMatrix(revisedCorpus, control = list(tokenize = unigram_token))
uniCorpus <- findFreqTerms(unigram_mat, lowfreq = 20)
uniCorpusFreq <- rowSums(as.matrix(unigram_mat[uniCorpus,]))
uniCorpusFreq <- data.frame(word = names(uniCorpusFreq), frequency = uniCorpusFreq)
head(uniCorpusFreq)
## word frequency
## “if “if 29
## “it “it 23
## “the “the 95
## “this “this 25
## “we “we 50
## “you “you 22
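Notice that tokens such as “if and “the still begin with a curly opening quote: removePunctuation with its default settings only strips ASCII punctuation, so Unicode quotes and dashes survive. A hedged sketch of an extra cleaning step that could be added to the pipeline above (the character class is illustrative, not exhaustive; if the installed tm version supports it, removePunctuation(x, ucp = TRUE) also handles Unicode punctuation):
# Replace common Unicode quotes and dashes with spaces; this step would sit with
# the other tm_map calls above, before the term-document matrices are built
revisedCorpus <- tm_map(revisedCorpus,
                        content_transformer(function(x) gsub("[“”‘’—–]", " ", x)))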
Let’s plot the most frequent unigrams.
uniCorpusFreqSorted <- arrange(uniCorpusFreq, desc(frequency))
unigram_plt <- ggplot(
    data = uniCorpusFreqSorted[1:25, ],
    aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = "#56B4E9") +
    xlab("Words") +
    ylab("Frequency") +
    ggtitle("Top 25 Unigrams") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1))
unigram_plt
Let’s create the bigram tokenizer and check the frequencies of the associated word pairs.
bigram_token <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
bigram_mat <- TermDocumentMatrix(revisedCorpus, control = list(tokenize = bigram_token))
biCorpus <- findFreqTerms(bigram_mat, lowfreq = 20)
biCorpusFreq <- rowSums(as.matrix(bigram_mat[biCorpus,]))
biCorpusFreq <- data.frame(word = names(biCorpusFreq), frequency = biCorpusFreq)
head(biCorpusFreq)
## word frequency
## – and – and 28
## – the – the 23
## — a — a 22
## — and — and 43
## — the — the 39
## “ i “ i 24
Let’s plot the most frequent bigrams.
biCorpusFreqSorted <- arrange(biCorpusFreq, desc(frequency))
bigram_plt <- ggplot(
    data = biCorpusFreqSorted[1:25, ],
    aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = "#CC79A7") +
    xlab("Words") +
    ylab("Frequency") +
    ggtitle("Top 25 Bigrams") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1))
bigram_plt
Let’s create the trigram tokenizer and check the frequencies of the associated word triples.
trigram_token <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
trigram_mat <- TermDocumentMatrix(revisedCorpus, control = list(tokenize = trigram_token))
triCorpus <- findFreqTerms(trigram_mat, lowfreq = 20)
triCorpusFreq <- rowSums(as.matrix(trigram_mat[triCorpus, ]))
triCorpusFreq <- data.frame(word = names(triCorpusFreq), frequency = triCorpusFreq)
head(triCorpusFreq)
## word frequency
## a bit of a bit of 58
## a bunch of a bunch of 36
## a chance to a chance to 53
## a copy of a copy of 20
## a couple of a couple of 92
## a few days a few days 34
Let’s plot the most frequent trigrams.
triCorpusFreqSorted <- arrange(triCorpusFreq, desc(frequency))
trigram_plt <- ggplot(
    data = triCorpusFreqSorted[1:25, ],
    aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = "#66CC99") +
    xlab("Words") +
    ylab("Frequency") +
    ggtitle("Top 25 Trigrams") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1))
trigram_plt
The section above shows histograms for the unigrams, bigrams and trigrams. Let’s represent similar information with word clouds, which convey the popularity of words or phrases by drawing the most frequently used ones larger and bolder than the words around them.
Word Cloud for Unigrams
unigram_cld <- wordcloud(
uniCorpusFreq$word, uniCorpusFreq$frequency, scale = c(2, 0.5),
max.words = 100, random.order = FALSE,
rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2")
)
Word Cloud for Bigrams
bigram_cld <- wordcloud(
biCorpusFreq$word, biCorpusFreq$frequency,
scale = c(2, 0.5), max.words = 100, random.order = FALSE,
rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2")
)
Word Cloud for Trigrams
trigram_cld <- wordcloud(
triCorpusFreq$word, triCorpusFreq$frequency,
scale = c(2, 0.5), max.words = 100, random.order = FALSE,
rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2")
)
This Exploratory Data Analysis on samples of the three given files provides a general idea of the data they capture. The sample data and the corpus have been pre-processed, and these samples can be used for building and testing different models that predict a related word. A similar analysis could also be run with different sample sizes, keeping computational complexity and run times in mind. Other considerations are removing stopwords during pre-processing and filtering out profane words (sketched briefly after the cleaning steps above). The next step is to build and test different models and use the one that gives appropriate word predictions for building a Shiny app.
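As a preview of that modeling step, here is a minimal, heavily simplified sketch of a frequency-based lookup built on the triCorpusFreq table from above; the predictNextWord name is hypothetical, and a real model would need smoothing and backoff to lower-order N-grams:
# Hypothetical helper: return the most frequent third word that follows the
# last two words typed, using the trigram frequency table built earlier
predictNextWord <- function(lastTwoWords, trigramFreq) {
    prefix <- paste0(lastTwoWords, " ")
    matches <- trigramFreq[startsWith(as.character(trigramFreq$word), prefix), ]
    if (nrow(matches) == 0) return(NA_character_)
    best <- as.character(matches$word[which.max(matches$frequency)])
    # keep only the final word of the best-matching trigram
    tail(strsplit(best, " ")[[1]], 1)
}
# predictNextWord("a couple", triCorpusFreq)   # most likely "of", per the table above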