Introduction

This is the Week 2 peer-graded assignment from Coursera’s Data Science Specialization Capstone course. The goal of this assignment is to understand the dataset and perform an exploratory data analysis of each of the given files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. We also identify key features of the data and outline the plan for the prediction algorithm we will develop later. We use plots and tables to present our exploratory data analysis.

The R programming language and associated frameworks will be used for all stages of this project: data exploration, data cleaning, data modeling, development of the product, and presentation of findings.

Specifically, we will do the following:

Exploratory Data Analysis

  1. Understand the distribution of words and the relationships between words in the corpus
  2. Understand frequencies of words and word pairs

Modeling

  1. Build a basic n-gram model
  2. Build a model to handle unseen n-grams

Exploration of the Data

Preliminary exploration of the data identified some challenges: the text included non-printable special characters such as embedded nulls. Lines containing these characters needed to be removed.

The data also included a fair amount of profanity and other objectionable words. Using the list of English swear words from Wiktionary (see: https://en.wiktionary.org/wiki/Category:English_swear_words) as a reference, lines containing those words were filtered out of the data set.
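One possible way to carry out these two filtering steps in R is sketched below. The clean_file() helper and the swear_words.txt file name are hypothetical; the Wiktionary list would first have to be saved locally as a plain-text file with one word per line.

# A minimal sketch of the two pre-processing steps described above.
# "swear_words.txt" is a hypothetical local copy of the Wiktionary list.
clean_file <- function(infile, outfile, badwords) {
    # skipNul = TRUE drops embedded nul bytes while reading
    lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
    # drop lines that still contain non-printable control characters
    lines <- lines[!grepl("[[:cntrl:]]", lines)]
    # drop lines containing any word from the profanity list
    pattern <- paste0("\\b(", paste(badwords, collapse = "|"), ")\\b")
    lines <- lines[!grepl(pattern, lines, ignore.case = TRUE)]
    writeLines(lines, outfile)
}

badwords <- readLines("swear_words.txt", encoding = "UTF-8")
clean_file("final/en_US/en_US.twitter.txt",
           "final/en_US/en_US.twitter.clean.txt", badwords)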

Once these two steps were completed, the data was loaded into R.

require(devtools)
## Loading required package: devtools
## Loading required package: usethis
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(stringi)
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(RWeka)
library(wordcloud)
## Loading required package: RColorBrewer
capstoneDatasetUrl<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFileName <- "Coursera-SwiftKey.zip"
if (!file.exists(zipFileName))
        download.file(capstoneDatasetUrl, zipFileName, method = "auto")
# Define file paths and names
fileblog <- "final/en_US/en_US.blogs.txt"
filetwit <- "final/en_US/en_US.twitter.txt"
filenews <- "final/en_US/en_US.news.txt"

# Unzip the files
if (!file.exists(fileblog) || !file.exists(filetwit) || !file.exists(filenews) )
    unzip(zipFileName)

# Load the data into memory
data_blogs   <- readLines(fileblog, encoding="UTF-8")
data_news    <- readLines(filenews, encoding="UTF-8")
data_twitter <- readLines(filetwit, encoding="UTF-8")
## Warning in readLines(filetwit, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul
## Warning in readLines(filetwit, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul
## Warning in readLines(filetwit, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines(filetwit, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul

Basic Statistics

We compute basic statistics for each file: its size in megabytes (via file.info) and its word, line, and character counts using the stringi string processing package.

# File sizes are reported in bytes, so divide by 1024^2 to get megabytes;
# word counts come from stri_stats_latex, line/character counts from stri_stats_general
data_stats <- data.frame(File_Name = c("US_blogs", "US_news", "US_twitter"),
                         FileSize = c(file.info(fileblog)$size,
                                      file.info(filenews)$size,
                                      file.info(filetwit)$size) / 1024^2,
                         WordCount = sapply(list(data_blogs, data_news, data_twitter), stri_stats_latex)[4, ],
                         t(sapply(list(data_blogs, data_news, data_twitter), stri_stats_general)[c('Lines', 'Chars'), ]))
head(data_stats)
##    File_Name FileSize WordCount   Lines     Chars
## 1   US_blogs       NA  37570839  899288 206824382
## 2    US_news       NA  34494539 1010242 203223154
## 3 US_twitter       NA  30451128 2360148 162096031
summary <- data.frame('File Name ' = c("data_blogs", "data_news", "data_twitter"),
                      " Size " = sapply(list(data_blogs, data_news, data_twitter),
                                        function(x) format(object.size(x), "MB")),
                      'No.of Rows ' = sapply(list(data_blogs, data_news, data_twitter), length),
                      'Total Characters ' = sapply(list(data_blogs, data_news, data_twitter),
                                                   function(x) sum(nchar(x))),
                      'Longest Row' = sapply(list(data_blogs, data_news, data_twitter),
                                             function(x) max(nchar(x))))
summary
##     File.Name.  X.Size. No.of.Rows. Total.Characters. Longest.Row
## 1   data_blogs 255.4 Mb      899288         206824505       40833
## 2    data_news 257.3 Mb     1010242         203223159       11384
## 3 data_twitter   319 Mb     2360148         162096031         140

Sampling and Cleaning Data

Next we build and clean the corpus. Because the full dataset is large, we sample the data and train our models on the smaller sampled dataset; here we use a 0.5% sample of each file. Once the data is sampled, we clean it using the tm package, converting everything to lower case and removing extra white space, punctuation, non-ASCII characters, URLs, and numbers.

set.seed(12345)
test_data <- c(sample(data_blogs, length(data_blogs) * 0.005),
              sample(data_news, length(data_news) * 0.005),
              sample(data_twitter, length(data_twitter) * 0.005)
          )
          
testdata <- iconv(test_data, "UTF-8", "ASCII", sub="")
sample_corpus <- VCorpus(VectorSource(testdata))
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus <- tm_map(sample_corpus, stripWhitespace)
sample_corpus <- tm_map(sample_corpus, removePunctuation)
sample_corpus <- tm_map(sample_corpus, removeNumbers)
sample_corpus <- tm_map(sample_corpus, PlainTextDocument)
sample_corpus <- tm_map(sample_corpus,content_transformer(function(x) gsub("http[[:alnum:]]*","",x))) # remove url
sample_corpus <- tm_map(sample_corpus,content_transformer(function(x) iconv(x, "latin1", "ASCII", sub=""))) # remove non-ASCII characters

N-Gram Analysis

In this section, we build n-gram models, namely unigram, bigram, and trigram models, using RWeka's NGramTokenizer. Word frequencies and word coverage are also examined below.

unigram <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))

unidtf <- TermDocumentMatrix(sample_corpus, control=list(tokenize=unigram))
bidtf <- TermDocumentMatrix(sample_corpus, control=list(tokenize=bigram))
tridtf <- TermDocumentMatrix(sample_corpus, control=list(tokenize=trigram))
                             
uni_tf <- findFreqTerms(unidtf, lowfreq = 50 )
bi_tf <- findFreqTerms(bidtf, lowfreq = 50 )
tri_tf <- findFreqTerms(tridtf, lowfreq = 10 )

uni_freq <- rowSums(as.matrix(unidtf[uni_tf, ]))
uni_freq <- data.frame(words=names(uni_freq), frequency=uni_freq)

bi_freq <- rowSums(as.matrix(bidtf[bi_tf, ]))
bi_freq <- data.frame(words=names(bi_freq), frequency=bi_freq)

tri_freq <- rowSums(as.matrix(tridtf[tri_tf, ]))
tri_freq <- data.frame(words=names(tri_freq), frequency=tri_freq)

head(tri_freq)
##                   words frequency
## a bit of       a bit of        18
## a bunch of   a bunch of        16
## a chance to a chance to        23
## a couple of a couple of        55
## a fan of       a fan of        12
## a few days   a few days        14
wordcloud(words=uni_freq$words, freq=uni_freq$frequency, max.words=100, colors = brewer.pal(8, "Dark2"))

plot_freq <- ggplot(data = uni_freq[order(-uni_freq$frequency),][1:15, ], aes(x = reorder(words, -frequency), y=frequency)) +
              geom_bar(stat="identity", fill="blue") + 
              ggtitle("Top Unigram") + xlab("words") +  ylab("frequency")

plot_freq

plot_freq <- ggplot(data = bi_freq[order(-bi_freq$frequency),][1:15, ], aes(x = reorder(words, -frequency), y=frequency)) +
  geom_bar(stat="identity", fill="red") + theme(axis.text.x = element_text(angle = 45)) + 
  ggtitle("Top Bigram") + xlab("words") +  ylab("frequency")
  
plot_freq

plot_freq <- ggplot(data = tri_freq[order(-tri_freq$frequency),][1:15, ], aes(x = reorder(words, -frequency), y=frequency)) +
  geom_bar(stat="identity", fill="red") + theme(axis.text.x = element_text(angle = 45)) + 
  ggtitle("Top Trigram") + xlab("words") +  ylab("frequency")

plot_freq
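The word coverage mentioned above can be estimated directly from the unigram term-document matrix. The short sketch below is one way to do this, assuming the slam package (which tm pulls in as a dependency) is available; it counts how many unique words are needed to cover 50% and 90% of all word instances in the sample.

# Estimate word coverage from the full unigram term-document matrix.
# slam::row_sums() works on the sparse matrix, so as.matrix() is not needed.
word_freq <- sort(slam::row_sums(unidtf), decreasing = TRUE)
coverage  <- cumsum(word_freq) / sum(word_freq)

# Unique words needed to cover 50% and 90% of all word instances
n50 <- which(coverage >= 0.5)[1]
n90 <- which(coverage >= 0.9)[1]
c(words_for_50_percent = n50, words_for_90_percent = n90)

In a typical English corpus a relatively small vocabulary covers most word instances, which suggests the eventual model can be pruned without losing much coverage.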

Conclusion and Next Steps

This concludes the initial exploratory analysis. The next step will be to build a predictive algorithm that uses an n-gram model with a frequency lookup similar to the analysis above. The algorithm will then be deployed in a Shiny app, which will suggest the most likely next word as a phrase is typed.
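As a rough illustration of how such a frequency lookup could work, the sketch below applies a simple back-off strategy to the bi_freq and tri_freq tables built earlier: it looks for trigrams that start with the last two words of the input, falls back to bigrams if nothing matches, and returns the most frequent continuation. The predict_next_word() helper is hypothetical and is not the final model.

# A hypothetical sketch of a back-off predictor built on the frequency
# tables above (tri_freq, bi_freq); not the final model.
predict_next_word <- function(phrase, tri_freq, bi_freq) {
    words <- unlist(strsplit(tolower(phrase), "\\s+"))
    n <- length(words)

    # Try trigrams starting with the last two words of the phrase
    if (n >= 2) {
        prefix <- paste(words[n - 1], words[n])
        hits <- tri_freq[startsWith(as.character(tri_freq$words), paste0(prefix, " ")), ]
        if (nrow(hits) > 0) {
            best <- as.character(hits$words[which.max(hits$frequency)])
            return(tail(strsplit(best, " ")[[1]], 1))
        }
    }

    # Back off to bigrams starting with the last word
    hits <- bi_freq[startsWith(as.character(bi_freq$words), paste0(words[n], " ")), ]
    if (nrow(hits) > 0) {
        best <- as.character(hits$words[which.max(hits$frequency)])
        return(tail(strsplit(best, " ")[[1]], 1))
    }
    NA_character_
}

predict_next_word("thanks for the", tri_freq, bi_freq)

The real model will also need a smoothing or discounting scheme (for example Katz back-off or Kneser-Ney) to handle n-grams that never appear in the training sample.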