Coursera Data Science Capstone Project - Milestone Report

Exploratory Data Analysis

Author: S. Wu

This is an exploratory data analysis report focused on understanding the training corpus before implementing an n-gram prediction model for the final product.


# Check for, install if missing, and load the required R packages
pkg<-c("knitr", "ggplot2", "grid", "gridExtra", "RColorBrewer", "tidyr", "dplyr", "tm", "wordcloud")
pkgCheck<-pkg %in% rownames(installed.packages())
for(i in seq_along(pkg)) {
    if(!pkgCheck[i]) {
        install.packages(pkg[i])
    } 
    library(pkg[i], character.only = TRUE)
}

Data Processing

Entire Training Dataset Overview

The dataset was obtained from the course web site. Three English-language corpus documents were used: blogs, news, and tweets. No other files were included.

blog<- readLines('en_US.blogs.txt')
news<-  readLines('en_US.news.txt')
twitter<- readLines('en_US.twitter.txt')

#count lines
lines<- c(length(blog), length(news), length(twitter))
names(lines)<- c('blogs', 'news', 'twitter')

#count vocabulary (number of words separated by whitespace, without any data cleaning)
vCount<- function (x) {
  sum(sapply(x, function(y) length(strsplit(y," ")[[1]])))
}
vocabulary0<- c(vCount(blog), vCount(news), vCount(twitter))
names(vocabulary0)<- c('blogs', 'news', 'twitter')

#count word types (number of unique words, without any data cleaning)
wCount<- function (x) {
  u<- unlist(lapply(x, function(y) unique(strsplit(y, " ")[[1]])))
  length(unique(u))
}
wordType0<- c(wCount(blog), wCount(news), wCount(twitter))
names(wordType0)<- c('blogs', 'news', 'twitter')

ttr0<- wordType0/vocabulary0
diversity0<- wordType0/sqrt(vocabulary0*2)
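
The summary shown in Table 1 below can be assembled from these vectors; a minimal sketch using knitr::kable (the presentation code is not shown in the original report, so the exact formatting arguments are assumptions):

tbl1<- data.frame(Documents = c('Blogs', 'News', 'Twitter'),
                  `Line Count` = lines,
                  `Word Count` = vocabulary0,
                  `Word Types` = wordType0,
                  `Word Type/Count Ratio` = round(ttr0, 3),
                  check.names = FALSE)
kable(tbl1, format.args = list(big.mark = ','), row.names = FALSE)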
Table 1. Line Count, Word Count, and Word Types by Document Source: Entire Data Set

Documents   Line Count   Word Count   Word Types   Word Type/Count Ratio
Blogs          899,288   37,334,131    1,103,548                   0.030
News            77,259    2,643,969      197,858                   0.075
Twitter      2,360,148   30,373,543    1,290,170                   0.042

Table 1 shows the total line count, word count, and word type count from each document source before any data transformation. Word count is the number of words separated by whitespace. Word type count is the number of unique words that occur in the data set. The word type/count ratio is the number of word types divided by the word count; it indicates lexical complexity, with higher values indicating a more complex data set.

Sampling and Data Transformation

20,000 lines from each of the three document sources were randomly selected for data exploration.

The data set was preprocessed with the following transformations:

  1. symbols and foreign characters (such as â) were removed

  2. all numbers were removed

  3. all letters were converted to lower case

  4. stop words were removed, using the built-in list of English stop words from the R tm package

  5. profanity was removed, according to George Carlin's definition, including variations (ed, ing, etc.)

  6. punctuation, except intra-word dashes, was removed

  7. unnecessary whitespace was removed

set.seed(1234)
len<- 20000
blogSample20k<- sample(blog, len, replace=FALSE)
newsSample20k<- sample(news, len, replace=FALSE)
twitterSample20k<- sample(twitter, len, replace=FALSE)
# keep letters, digits, spaces, apostrophes, and dashes; drop all other symbols and foreign characters
removeSymbol<- function(x) {
  gsub("[^a-zA-Z0-9 '-]", "", x)
}
blogSample20k<- removeSymbol(blogSample20k)
newsSample20k<- removeSymbol(newsSample20k)
twitterSample20k<- removeSymbol(twitterSample20k)
# write the cleaned samples to a 'sample20k' directory so they can be read back as a corpus
# (this step is assumed; the original report reads pre-existing files from this directory)
dir.create('sample20k', showWarnings = FALSE)
writeLines(blogSample20k, file.path('sample20k', 'blogs.txt'))
writeLines(newsSample20k, file.path('sample20k', 'news.txt'))
writeLines(twitterSample20k, file.path('sample20k', 'twitter.txt'))
cname<- file.path(getwd(), 'sample20k')
docSample <- Corpus(DirSource(cname))
docSample <- tm_map(docSample, tolower) # convert to lower case
docSample <- tm_map(docSample, PlainTextDocument)
docSample <- tm_map(docSample, removeNumbers) # remove numbers
docSample <- tm_map(docSample, removeWords, stopwords("english")) # remove stop words
docSample <- tm_map(docSample, removePunctuation, preserve_intra_word_dashes=TRUE) # remove punctuation, keep intra-word dashes
# 'profanity' is assumed to be a character vector of profane words (George Carlin's list
# plus ed/ing variations), prepared separately; it is not defined in this report
docSample <- tm_map(docSample, removeWords, profanity) # remove profanity words
docSample <- tm_map(docSample, stripWhitespace) # remove unnecessary whitespace
names(docSample)<- c('blogs', 'news', 'twitter')
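
The before/after counts in Table 2 can be reproduced with the helper functions defined earlier; a minimal sketch, assuming the transformed text can be pulled back out of the tm corpus with content():

samples<- list(blogs = blogSample20k, news = newsSample20k, twitter = twitterSample20k)
sapply(samples, vCount)   # word counts before transformation
sapply(samples, wCount)   # word types before transformation
cleaned<- lapply(seq_along(docSample), function(i) content(docSample[[i]])) # text after transformation
sapply(cleaned, vCount)   # word counts after transformation
sapply(cleaned, wCount)   # word types after transformation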
Table 2 (A). Line Count, Word Count, and Word Types by Document Source: Sample Data Set Before Data Transformation

Documents   Line Count   Word Count   Word Types   Word Type/Count Ratio
Blogs           20,000      834,670       94,319                   0.113
News            20,000      687,092       86,595                   0.126
Twitter         20,000      258,267       47,470                   0.184

Table 2 (B). Line Count, Word Count, and Word Types by Document Source: Sample Data Set After Data Transformation

Documents   Line Count   Word Count   Word Types   Word Type/Count Ratio
Blogs           20,000      419,960       45,589                   0.109
News            20,000      381,597       43,677                   0.114
Twitter         20,000      135,307       23,117                   0.171

Table 2 shows the total line count, word count, and word type count from each document source of the sample corpus, (A) before and (B) after data transformation. Word count is the number of words separated by whitespace. Word type count is the number of unique words that occur in the data set. The word type/count ratio is the number of word types divided by the word count; it indicates lexical complexity, with higher values indicating a more complex data set.

N-Gram Frequency Analysis

Uni-Gram

tdmSample<- TermDocumentMatrix(docSample)
tdmSample
## <<TermDocumentMatrix (terms: 75772, documents: 3)>>
## Non-/sparse entries: 112383/114933
## Sparsity           : 51%
## Maximal term length: 86
## Weighting          : term frequency (tf)
mSample<- as.matrix(tdmSample)
freqSample<- rowSums(mSample) ; mSample<- cbind(mSample, freqSample) # total frequency across the three sources
mSample<- mSample[order(freqSample, decreasing=TRUE),] # sort terms by total frequency
dSample<- data.frame(word=row.names(mSample), as.data.frame(mSample, row.names=NULL), stringsAsFactors = FALSE)
freqSample<- sort(freqSample, decreasing=TRUE)

Word Frequency

Figure 1 Word Cloud - Top 100 Most Frequently Occurring Words
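
The plotting code is not shown in the report; a minimal sketch of how a word cloud like Figure 1 can be produced from the unigram frequencies with the wordcloud and RColorBrewer packages loaded above (the palette, seed, and ordering settings are assumptions):

set.seed(1234)
wordcloud(names(freqSample), freqSample, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))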


Figure 2 Top 30 Most Frequently Occurring Words by Document Source
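
A plot like Figure 2 can be drawn from dSample, which also holds the per-source counts; a minimal sketch with tidyr and ggplot2, assuming the first three count columns of dSample correspond to blogs, news, and twitter in that order (the original plotting code is not shown):

top30<- dSample[1:30, ]
names(top30)[2:4]<- c('blogs', 'news', 'twitter') # assumed column order from the corpus
top30Long<- gather(top30, source, count, blogs:twitter)
ggplot(top30Long, aes(x = reorder(factor(word), count), y = count, fill = source)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~ source, nrow = 1) +
  coord_flip() +
  labs(x = NULL, y = 'Frequency')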

Bi-Gram

# tokenize into bi-grams with ngrams() and words() from the NLP package (attached with tm)
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
tdmSample_Bi<- TermDocumentMatrix(docSample, control = list(tokenize = BigramTokenizer))
tdmSample_Bi
## <<TermDocumentMatrix (terms: 775103, documents: 3)>>
## Non-/sparse entries: 822204/1503105
## Sparsity           : 65%
## Maximal term length: 95
## Weighting          : term frequency (tf)
mSample_Bi<- as.matrix(tdmSample_Bi)
freqSample_Bi<- rowSums(mSample_Bi) ; mSample_Bi<- cbind(mSample_Bi, freqSample_Bi)
mSample_Bi<- mSample_Bi[order(freqSample_Bi, decreasing=TRUE),]
dSample_Bi<- data.frame(word=row.names(mSample_Bi), as.data.frame(mSample_Bi, row.names=NULL), stringsAsFactors = FALSE)
freqSample_Bi<- sort(freqSample_Bi, decreasing=TRUE)

Word Frequency

Figure 3 Top 30 Most Frequently Occurring Bi-gram Words by Document Source


Tri-Gram

# tokenize into tri-grams with ngrams() and words() from the NLP package (attached with tm)
TrigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
tdmSample_Tri<- TermDocumentMatrix(docSample, control = list(tokenize = TrigramTokenizer))
tdmSample_Tri
## <<TermDocumentMatrix (terms: 946211, documents: 3)>>
## Non-/sparse entries: 949026/1889607
## Sparsity           : 67%
## Maximal term length: 142
## Weighting          : term frequency (tf)
mSample_Tri<- as.matrix(tdmSample_Tri)
freqSample_Tri<- rowSums(mSample_Tri) ; mSample_Tri<- cbind(mSample_Tri, freqSample_Tri)
mSample_Tri<- mSample_Tri[order(freqSample_Tri, decreasing=TRUE),]
dSample_Tri<- data.frame(word=row.names(mSample_Tri), as.data.frame(mSample_Tri, row.names=NULL), stringsAsFactors = FALSE)
freqSample_Tri<- sort(freqSample_Tri, decreasing=TRUE)

Word Frequency

Figure 4 Top 30 Most Frequently Occurring Tri-gram Words by Document Source


Distributions of Frequencies

Distributions of frequencies were further examined.

Single Words

Among the 75,772 unique words (word types) in this sample corpus, 40,433 (about 53%) appear only once, and many others appear very infrequently. The 1,095 most frequent word types (roughly the top 1%) cover 50% of the sample corpus, whereas 16,288 word types (the top 21%) are needed to account for 90%.
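
These coverage figures follow directly from the sorted unigram frequency vector; a short sketch of the calculation:

coverage<- cumsum(freqSample)/sum(freqSample) # cumulative share of all word tokens
sum(freqSample == 1)          # word types occurring only once
min(which(coverage >= 0.5))   # word types needed to cover 50% of the corpus
min(which(coverage >= 0.9))   # word types needed to cover 90% of the corpus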

N-Gram

Figure 5 Distributions of Frequencies by N-Gram Models

The x axis represents the unique words of each model sorted by number of occurrences; the y axis represents frequency. Each panel keeps its model's original scale. The blue line marks 50% coverage of the total words in the data set, and the orange line marks 90% coverage. The red line indicates the beginning of the single-occurrence words, i.e., all frequencies to the right of the red line are one: 53% of the types for the uni-gram model, 90% for the bi-gram model, and 99% for the tri-gram model.
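
The single-occurrence shares quoted above can be checked directly from the frequency vectors; a brief sketch:

sapply(list(unigram = freqSample, bigram = freqSample_Bi, trigram = freqSample_Tri),
       function(f) round(mean(f == 1), 2)) # share of n-gram types occurring only once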


Table 3. Number of Word Types (Unique Words) and Highest Frequency by N-Gram Models

N-Gram   Word Types   Highest Frequency
1-Gram       75,772               5,887
2-Gram      775,103                 361
3-Gram      946,211                  43
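
The values in Table 3 come straight from the three frequency vectors computed above; a minimal sketch:

sapply(list(`1-Gram` = freqSample, `2-Gram` = freqSample_Bi, `3-Gram` = freqSample_Tri),
       function(f) c(`Word Types` = length(f), `Highest Frequency` = max(f)))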