This is the Week 2 peer-graded assignment from Coursera's Data Science Specialization Capstone course. The goal of this assignment is to understand the dataset and perform an exploratory data analysis of each of the given files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. We also identify key features of the data and outline the plan for the prediction algorithm that we will develop later. We use plots and summary tables to present our exploratory data analysis.
The R programming language and associated frameworks will be used for all stages of this project: data exploration, data cleaning, data modeling, development of the product, and presentation of findings.
Specifically, we will do the following:
Exploratory Data Analysis
Modeling
Preliminary exploration of the data identified some challenges: the text includes non-printable special characters such as embedded nulls, and lines containing these characters needed to be removed.
The data also included a fair amount of profanity and other objectionable words. Using the list of English swear words from Wiktionary (see: https://en.wiktionary.org/wiki/Category:English_swear_words) as a reference, lines containing those words were filtered out of the dataset; a sketch of this pre-filtering step is shown below.
Once these two steps were completed, the data was loaded into R.
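For reference, a minimal sketch of that pre-filtering step follows. It is illustrative only: profanity_list.txt is a hypothetical one-word-per-line file derived from the Wiktionary category above, and the control-character check stands in for whatever null handling is applied when the raw files are read.
# Illustrative pre-filtering sketch; profanity_list.txt is a hypothetical
# one-word-per-line file built from the Wiktionary category referenced above.
filter_lines <- function(lines, profanity) {
  lines <- lines[!grepl("[[:cntrl:]]", lines)]        # drop lines with remaining control characters
  pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  lines[!grepl(pattern, lines, ignore.case = TRUE)]   # drop lines containing any listed word
}
# Example usage (not run):
# profanity   <- readLines("profanity_list.txt")
# blogs_clean <- filter_lines(readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8"), profanity)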
require(devtools)
## Loading required package: devtools
## Loading required package: usethis
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(stringi)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(RWeka)
library(wordcloud)
## Loading required package: RColorBrewer
capstoneDatasetUrl<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFileName <- "Coursera-SwiftKey.zip"
if (!file.exists(zipFileName))
download.file(capstoneDatasetUrl, zipFileName, method = "auto")
# Define file paths and names
fileblog <- "final/en_US/en_US.blogs.txt"
filetwit <- "final/en_US/en_US.twitter.txt"
filenews <- "final/en_US/en_US.news.txt"
# Unzip the files
if (!file.exists(fileblog) || !file.exists(filetwit) || !file.exists(filenews) )
unzip(zipFileName)
# Load the data into memory
data_blogs <- readLines(fileblog, encoding="UTF-8")
data_news <- readLines(filenews, encoding="UTF-8")
data_twitter <- readLines(filetwit, encoding="UTF-8")
## Warning in readLines(filetwit, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul
## Warning in readLines(filetwit, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul
## Warning in readLines(filetwit, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines(filetwit, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul
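These warnings flag embedded nul characters in en_US.twitter.txt. If preferred, the file can be re-read with skipNul = TRUE, which drops the nul bytes and silences the warnings:
# Optional: re-read the Twitter data, dropping embedded nul bytes
data_twitter <- readLines(filetwit, encoding = "UTF-8", skipNul = TRUE)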
We compute the size of each file in megabytes with file.info() and use the stringi string-processing package to count words, lines, and characters.
data_stats <- data.frame(File_Name = c("US_blogs", "US_news", "US_twitter"),
                         FileSize = c(file.info(fileblog)$size / 1024^2,
                                      file.info(filenews)$size / 1024^2,
                                      file.info(filetwit)$size / 1024^2),   # size in megabytes
                         WordCount = sapply(list(data_blogs, data_news, data_twitter), stri_stats_latex)[4, ],
                         t(sapply(list(data_blogs, data_news, data_twitter), stri_stats_general)[c('Lines', 'Chars'), ]))
head(data_stats)
## File_Name FileSize WordCount Lines Chars
## 1 US_blogs NA 37570839 899288 206824382
## 2 US_news NA 34494539 1010242 203223154
## 3 US_twitter NA 30451128 2360148 162096031
summary <- data.frame('File Name ' = c("data_blogs", "data_news", "data_twitter"),
                      " Size " = sapply(list(data_blogs, data_news, data_twitter), function(x) format(object.size(x), "MB")),
                      'No.of Rows ' = sapply(list(data_blogs, data_news, data_twitter), length),
                      'Total Characters ' = sapply(list(data_blogs, data_news, data_twitter), function(x) sum(nchar(x))),
                      'Longest Row' = sapply(list(data_blogs, data_news, data_twitter), function(x) max(nchar(x))))
summary
## File.Name. X.Size. No.of.Rows. Total.Characters. Longest.Row
## 1 data_blogs 255.4 Mb 899288 206824505 40833
## 2 data_news 257.3 Mb 1010242 203223159 11384
## 3 data_twitter 319 Mb 2360148 162096031 140
Next we build and clean the corpus. Because the full dataset is very large, we train our models on a smaller sample: a 0.5% random sample of each file. Once the data is sampled, we clean it with the tm package, converting everything to lower case and removing extra white space, punctuation, non-ASCII characters, URLs, and numbers.
set.seed(12345)
# Draw a 0.5% random sample from each source
test_data <- c(sample(data_blogs, length(data_blogs) * 0.005),
               sample(data_news, length(data_news) * 0.005),
               sample(data_twitter, length(data_twitter) * 0.005))
# Strip characters that cannot be represented in ASCII
testdata <- iconv(test_data, "UTF-8", "ASCII", sub = "")
# Build the corpus and apply the cleaning transformations
sample_corpus <- VCorpus(VectorSource(testdata))
sample_corpus <- tm_map(sample_corpus, tolower)                 # convert to lower case
sample_corpus <- tm_map(sample_corpus, stripWhitespace)         # collapse extra white space
sample_corpus <- tm_map(sample_corpus, removePunctuation)       # remove punctuation
sample_corpus <- tm_map(sample_corpus, removeNumbers)           # remove numbers
sample_corpus <- tm_map(sample_corpus, PlainTextDocument)       # restore plain-text document objects
sample_corpus <- tm_map(sample_corpus, content_transformer(function(x) gsub("http[[:alnum:]]*", "", x)))  # remove URLs
sample_corpus <- tm_map(sample_corpus, content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = "")))  # remove non-ASCII characters
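As a quick sanity check on the cleaning steps (not part of the original chunk), the content of a couple of processed documents can be printed:
# Spot-check a couple of cleaned documents
content(sample_corpus[[1]])
content(sample_corpus[[2]])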
In this section, we build n-gram models, namely unigram, bigram, and trigram models. Word frequencies are plotted, and word coverage is estimated (see the sketch after the trigram frequencies).
unigram <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
unidtf <- TermDocumentMatrix(sample_corpus, control=list(tokenize=unigram))
bidtf <- TermDocumentMatrix(sample_corpus, control=list(tokenize=bigram))
tridtf <- TermDocumentMatrix(sample_corpus, control=list(tokenize=trigram))
uni_tf <- findFreqTerms(unidtf, lowfreq = 50 )
bi_tf <- findFreqTerms(bidtf, lowfreq = 50 )
tri_tf <- findFreqTerms(tridtf, lowfreq = 10 )
uni_freq <- rowSums(as.matrix(unidtf[uni_tf, ]))
uni_freq <- data.frame(words=names(uni_freq), frequency=uni_freq)
bi_freq <- rowSums(as.matrix(bidtf[bi_tf, ]))
bi_freq <- data.frame(words=names(bi_freq), frequency=bi_freq)
tri_freq <- rowSums(as.matrix(tridtf[tri_tf, ]))
tri_freq <- data.frame(words=names(tri_freq), frequency=tri_freq)
head(tri_freq)
## words frequency
## a bit of a bit of 18
## a bunch of a bunch of 16
## a chance to a chance to 23
## a couple of a couple of 55
## a fan of a fan of 12
## a few days a few days 14
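Besides raw frequencies, word coverage can be estimated from the unigram counts. The following is a minimal sketch (not part of the original analysis) that uses slam::row_sums, available through tm's slam dependency, to find how many unique words account for 50% and 90% of all word occurrences in the sample:
# Sketch: how many unique words cover 50% / 90% of all word occurrences in the sample
# (slam::row_sums avoids converting the sparse term-document matrix to a dense one)
uni_all <- sort(slam::row_sums(unidtf), decreasing = TRUE)
coverage <- cumsum(uni_all) / sum(uni_all)
c(words_for_50pct = which(coverage >= 0.5)[1],
  words_for_90pct = which(coverage >= 0.9)[1])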
wordcloud(words=uni_freq$words, freq=uni_freq$frequency, max.words=100, colors = brewer.pal(8, "Dark2"))
plot_freq <- ggplot(data = uni_freq[order(-uni_freq$frequency),][1:15, ], aes(x = reorder(words, -frequency), y=frequency)) +
geom_bar(stat="identity", fill="blue") +
ggtitle("Top Unigram") + xlab("words") + ylab("frequency")
plot_freq
plot_freq <- ggplot(data = bi_freq[order(-bi_freq$frequency),][1:15, ], aes(x = reorder(words, -frequency), y=frequency)) +
geom_bar(stat="identity", fill="red") + theme(axis.text.x = element_text(angle = 45)) +
ggtitle("Top Bigram") + xlab("words") + ylab("frequency")
plot_freq
plot_freq <- ggplot(data = tri_freq[order(-tri_freq$frequency),][1:15, ], aes(x = reorder(words, -frequency), y=frequency)) +
geom_bar(stat="identity", fill="red") + theme(axis.text.x = element_text(angle = 45)) +
ggtitle("Top Trigram") + xlab("words") + ylab("frequency")
plot_freq
This concludes the initial exploratory analysis. The next step will be to build a predictive algorithm that uses an n-gram model with a frequency lookup similar to the analysis above. The algorithm will then be deployed in a Shiny app that suggests the most likely next word once a phrase has been typed.
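As a rough illustration of that plan, and assuming only the uni_freq, bi_freq, and tri_freq tables built above, a simple frequency lookup with backoff might look like the sketch below: try the trigram table on the last two words of the input, fall back to the bigram table on the last word, and finally return the most frequent unigram.
# Minimal sketch of a frequency-lookup predictor with simple backoff
# (uses the uni_freq, bi_freq, tri_freq data frames built above)
predict_next_word <- function(phrase) {
  tokens <- strsplit(tolower(phrase), "\\s+")[[1]]
  n <- length(tokens)
  # Trigram lookup: match on the last two words
  if (n >= 2) {
    prefix <- paste(tokens[n - 1], tokens[n])
    hits <- tri_freq[startsWith(as.character(tri_freq$words), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$words[which.max(hits$frequency)])
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  # Bigram backoff: match on the last word
  if (n >= 1) {
    hits <- bi_freq[startsWith(as.character(bi_freq$words), paste0(tokens[n], " ")), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$words[which.max(hits$frequency)])
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  # Unigram fallback: the single most frequent word
  as.character(uni_freq$words[which.max(uni_freq$frequency)])
}
# Example usage (not run): predict_next_word("thanks for the")
This sketch is purely illustrative; the actual model will need to handle unseen n-grams (for example with smoothing or a stupid-backoff scheme) and be pruned so it responds quickly inside the Shiny app.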