Part A. Overview

The objective of this report is to develop an understanding of the statistical properties of the data set that can later be used when building the prediction model for the final data product, the Shiny application. In this report, we perform exploratory data analysis with tables and plots to describe the major features of the training data, and then summarize the plans for creating the predictive model and Shiny app.

The data set used in this exploratory data analysis is provided by SwiftKey in the context of the Coursera Data Science Capstone. The data consist of three text files containing text from three different sources: blogs, news, and twitter. The data are provided in four different languages, but we focus only on the English version.

Part B. Setting Environment, Loading Packages & Data

Set working environment

rm(list = ls(all.names = TRUE))
setwd("D:/D_drive/data/OnlineLearning/DataScientistSpecialization/R_Project/DataScienceCaptone-MilestoneReport")

Load packages

library(knitr)
library(utils)
library(kableExtra)    # for data summary
library(stringi)       # for data summary
library(ggplot2)       # for histogram
library(gridExtra)     # for histogram
library(tm)            # for data cleaning
library(wordcloud)     # for data analysis (word frequencies)
library(RColorBrewer)  # for data analysis (word frequencies)
library(RWeka)         # for Tokenizing and N-Gram Generation

Load the Data

# download and unzip the data 
if(!file.exists("Coursera-SwiftKey.zip")){
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", "./Coursera-SwiftKey.zip")
  unzip(zipfile="Coursera-SwiftKey.zip", exdir="./projectData")
}

# blogs
blogsFileName <- "./projectData/final/en_US/en_US.blogs.txt"
con <- file(blogsFileName, open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
# head(blogs)
close(con)

# news
newsFileName <- "./projectData/final/en_US/en_US.news.txt"
con <- file(newsFileName, open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
# head(news)
close(con)

# twitter
twitterFileName <- "./projectData/final/en_US/en_US.twitter.txt"
con <- file(twitterFileName, open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
# head(twitter)
close(con)

rm(con)

Part C. Data Summary

Before cleaning the data, a basic summary of the three data files is provided, showing the file size and number of lines for each source. A statistical summary of words per line (WPL.Min, WPL.Mean, and WPL.Max) is also calculated for each file.

Part C.1 Initial Data Summary

library(kableExtra)
library(stringi)

# assign sample size
sampleSize = 0.01

# file size
fileSizeMB <- round(file.info(c(blogsFileName,
                                newsFileName,
                                twitterFileName))$size / 1024 ^ 2)

# num lines per file
numLines <- sapply(list(blogs, news, twitter), length)

# num characters per file
numChars <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), sum)

# num words per file
numWords <- sapply(list(blogs, news, twitter), stri_stats_latex)[4,]

# words per line
wpl <- lapply(list(blogs, news, twitter), function(x) stri_count_words(x))

# words per line summary
wplSummary = sapply(list(blogs, news, twitter),
             function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(wplSummary) = c('WPL.Min', 'WPL.Mean', 'WPL.Max')

summary <- data.frame(
    File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
    FileSize = paste0(fileSizeMB, " MB"),
    Lines = numLines,
    Characters = numChars,
    Words = numWords,
    t(rbind(round(wplSummary)))
)

kable(summary,
      row.names = FALSE,
      align = c("l", rep("r", 7)),
      caption = "") %>% kable_styling(position = "left")
File               FileSize    Lines  Characters     Words  WPL.Min  WPL.Mean  WPL.Max
en_US.blogs.txt      200 MB   899288   206824505  37570839        0        42     6726
en_US.news.txt       196 MB    77259    15639408   2651432        1        35     1123
en_US.twitter.txt    159 MB  2360148   162096241  30451170        1        13       47

Findings of the initial investigation are listed below:

  • the file sizes of the text data files are fairly large, ranging from 159 MB to 200 MB.

  • although the blogs data file is larger than the twitter data file, it contains fewer lines than the twitter data file.

  • the blogs data file has the largest WPL.Mean and the highest WPL.Max.

Part C.2 Histogram of words per line

library(ggplot2)
library(gridExtra)

plot1 <- qplot(wpl[[1]],
               geom = "histogram",
               main = "US Blogs",
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 5)

plot2 <- qplot(wpl[[2]],
               geom = "histogram",
               main = "US News",
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 5)

plot3 <- qplot(wpl[[3]],
               geom = "histogram",
               main = "US Twitter",
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 1)

plotList = list(plot1, plot2, plot3)
do.call(grid.arrange, c(plotList, list(ncol = 1)))

# free up some memory
rm(plot1, plot2, plot3)

Each of the histograms above shows that the number of words per line is relatively low for each data file. This observation seems to reflect a general trend towards short and concise communication.

Part D. Data Preparation and Data Cleaning

Part D.1 Creating the sample data set

In light of the large file sizes found in the last section, samples are drawn from the three data sets to improve processing time. We randomly choose a 1% sample from the blogs and news data sets and a 0.1% sample from the twitter data set to demonstrate data preprocessing, exploratory data analysis, and the prediction algorithm. The three samples are then combined into a single data set to ease processing in the subsequent analysis of this report.

# subset the data
set.seed(2468)
sample_blogs <- sample(blogs, length(blogs) * 0.01)
sample_news <- sample(news, length(news) * 0.01)
sample_twitter <- sample(twitter, length(twitter) * 0.001)

# remove all non-English characters from the sampled data
sample_blogs <- iconv(sample_blogs, "latin1", "ASCII", sub = "")
sample_news <- iconv(sample_news, "latin1", "ASCII", sub = "")
sample_twitter <- iconv(sample_twitter, "latin1", "ASCII", sub = "")

# combine subsets together
sampleData <- c(sample_blogs, sample_news, sample_twitter)

# get number of lines and words from the sample data set
sampleDataLines <- length(sampleData);
sampleDataWords <- sum(stri_count_words(sampleData))

#head(sampleData, 1)

Display the number of lines of the sample data set:

sampleDataLines
## [1] 12124

Display the number of words of the sample data set:

sampleDataWords
## [1] 428214
# remove variables no longer needed to free up memory
rm(sample_blogs, sample_news, sample_twitter)

Part D.2 Cleaning the sample data set

In the process of data cleaning, we do the following: (1) remove URLs, Twitter handles, and email patterns; (2) remove the common English stop words; (3) remove punctuation; (4) remove numbers; (5) trim white space; (6) transform all characters to lowercase; and (7) convert to plain text documents.

library(tm)

corpus <- VCorpus(VectorSource(sampleData))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# (1) Remove URL, Twitter handles and email patterns
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
#corpus <- tm_map(corpus, toSpace, "\\b[A-Z a-z 0-9._ - ]*[@](.*?)[.]{1,3} \\b")

# (2) Remove common English stop words 
corpus <- tm_map(corpus, removeWords, stopwords("en"))

# (3) Remove punctuation marks
corpus <- tm_map(corpus, removePunctuation)

# (4) Remove numbers
corpus <- tm_map(corpus, removeNumbers)

# (5) Trim white space
corpus <- tm_map(corpus, stripWhitespace)

# (6) Convert all words to lowercase
corpus <- tm_map(corpus, tolower)

# (7) Convert to plain text documents
corpus <- tm_map(corpus, PlainTextDocument)

#Display the 1st row of sample data set:
#corpusResult<-data.frame(text=unlist(sapply(corpus,'[',"content")), stringsAsFactors=FALSE)
#head(corpusResult,5)

Part E. Exploratory Data Analysis

Exploratory data analysis is performed here to fulfill the primary goal of this report. To facilitate the understanding of the training data, we look at the most frequently used words/phrases in the sample data set in two ways: (1) tokenizing and n-gram generation, shown as bar charts, and (2) word clouds.

The predictive model to be developed for the Shiny application is going to handle unigrams, bigrams, and trigrams. In this section, the RWeka package is used to construct functions that tokenize the sample data and construct matrices of unigrams, bigrams, and trigrams.

Tokenize Functions:

library(RColorBrewer)
library(RWeka) 

unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
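
As a quick illustration of what these tokenizers return, they can be applied to a short phrase. The example string below is made up for illustration and is not taken from the corpus; the expected outputs are shown as comments.

# illustrative only: apply the tokenizers to a hypothetical phrase
exampleText <- "thanks for the follow"
unigramTokenizer(exampleText)   # "thanks" "for" "the" "follow"
bigramTokenizer(exampleText)    # "thanks for" "for the" "the follow"
trigramTokenizer(exampleText)   # "thanks for the" "for the follow"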

Part E.1 Unigram Analysis

Bar Chart for Unigram Analysis: (Refer to Appendix G.1 for coding)

Word cloud for Unigram Analysis: (Refer to Appendix G.2 for coding)

Part E.2 Bigram Analysis

Bar Chart for Bigram Analysis: (Refer to Appendix G.3 for coding)

Word cloud for Bigram Analysis: (Refer to Appendix G.4 for coding)

Part E.3 Trigram Analysis

Bar Chart for Trigram Analysis: (Refer to Appendix G.5 for coding)

Word Cloud for Trigram Analysis: (Refer to Appendix G.6 for coding)

Part F. Next Steps

In this report, we focus on a small subset of the actual data to build the corpus because NLP is a resource-intensive process. The corpus is then used to build n-grams through tokenization. Having finished the tokenization process, we obtain unigrams, bigrams, and trigrams.

As a next step, we can do the following:

  • use the sets of n-grams to create predictive models

  • deploy the algorithm in a Shiny app to predict the next word based on observed n-gram frequencies (a minimal sketch is given after this list)

  • use the Shiny app as a user interface to interact with the predictive model

  • prepare a slide deck to present and publish the app to the general public.
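
To make the plan above concrete, a minimal sketch of a frequency-based next-word lookup is given below. It assumes the unigramMatrixFreq, bigramMatrixFreq, and trigramMatrixFreq data frames built in Appendix G.1, G.3, and G.5 are in memory; the predictNextWord function is a hypothetical helper used only to illustrate the back-off idea, not the final model.

# minimal sketch (hypothetical helper, not the final model): look up the most
# frequent n-grams that start with the tail of the input phrase, backing off
# from trigrams to bigrams to unigrams
predictNextWord <- function(phrase, n = 3) {
    # basic normalisation, loosely mirroring the corpus cleaning steps
    tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
    tokens <- tokens[tokens != ""]

    # try the trigram table first: match on the last two words
    if (length(tokens) >= 2) {
        prefix <- paste(tail(tokens, 2), collapse = " ")
        hits <- trigramMatrixFreq[grepl(paste0("^", prefix, " "),
                                        trigramMatrixFreq$word), ]
        if (nrow(hits) > 0)
            return(head(sapply(strsplit(as.character(hits$word), " "),
                               function(w) w[3]), n))
    }

    # back off to the bigram table: match on the last word only
    if (length(tokens) >= 1) {
        prefix <- tail(tokens, 1)
        hits <- bigramMatrixFreq[grepl(paste0("^", prefix, " "),
                                       bigramMatrixFreq$word), ]
        if (nrow(hits) > 0)
            return(head(sapply(strsplit(as.character(hits$word), " "),
                               function(w) w[2]), n))
    }

    # final fallback: the overall most frequent unigrams
    head(as.character(unigramMatrixFreq$word), n)
}

# example call (the result depends on the sampled corpus)
# predictNextWord("thanks for")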

Part G. Appendix

Appendix G.1 Coding for Bar Chart of Unigram Analysis

# create term document matrix for the corpus
unigramMatrix <- TermDocumentMatrix(corpus, control=list(tokenize=unigramTokenizer))

# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
unigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(unigramMatrix, 0.99))), decreasing=TRUE)
unigramMatrixFreq <- data.frame(word=names(unigramMatrixFreq), freq=unigramMatrixFreq)

# generate plot
g <- ggplot(unigramMatrixFreq[1:20,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("20 Most Common Unigrams")
print(g) 

Appendix G.2 Coding for Word Cloud of Unigram Analysis

library(wordcloud)

# construct word cloud
suppressWarnings(
    wordcloud(words = unigramMatrixFreq$word,
              freq = unigramMatrixFreq$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35, 
              colors=brewer.pal(8, "Dark2"))
)

Appendix G.3 Coding for Bar Chart of Bigram Analysis

# create term document matrix for the corpus
bigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))

# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
bigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(bigramMatrix, 0.999))), decreasing = TRUE)
bigramMatrixFreq <- data.frame(word = names(bigramMatrixFreq), freq = bigramMatrixFreq)

# generate plot
g <- ggplot(bigramMatrixFreq[1:20,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("20 Most Common Bigrams")
print(g)

Appendix G.4 Coding for Word Cloud of Bigram Analysis

# construct word cloud
suppressWarnings (
    wordcloud(words = bigramMatrixFreq$word,
              freq = bigramMatrixFreq$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35, 
              colors=brewer.pal(8, "Dark2"))
)

Appendix G.5 Coding for Bar Chart of Trigram Analysis

# create term document matrix for the corpus
trigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))

# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
trigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(trigramMatrix, 0.9999))), decreasing = TRUE)
trigramMatrixFreq <- data.frame(word = names(trigramMatrixFreq), freq = trigramMatrixFreq)

# generate plot
g <- ggplot(trigramMatrixFreq[1:20,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("grey50"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("20 Most Common Trigrams")
print(g)

Appendix G.6 Coding for Word Cloud of Trigram Analysis

# construct word cloud
suppressWarnings (
    wordcloud(words = trigramMatrixFreq$word,
              freq = trigramMatrixFreq$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35, 
              colors=brewer.pal(8, "Dark2"))
)