This report details the advance of coursera’s data science specialization capstone project during week two of the course. The requirement is to perform an exploratory data analysis of the data, after having being exposed to natural language processing tools and libraries.
I will load the text data from all three files and clean them in the following ways: * Remove numbers * Remove punctuation while preserving the mid-word hyphens * Remove extra whitespaces * Convert all the text to lower case * Remove non-english sentences from the original data.
For the first four items we use the functions of the tm library, to detect the language we use the textcat library. The output will be a dataframe named indf that contains one column with the cleaned-up text.
# a. Environment set-up
basedir <- '~/Coursera_R_programming/Capstone/final'
setwd(basedir)
# b. Libraries load
library(tm); library(RColorBrewer); library(quanteda); library(cluster)
## Loading required package: NLP
## quanteda version 0.9.8
##
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, stopwords
## The following object is masked from 'package:NLP':
##
## ngrams
## The following object is masked from 'package:base':
##
## sample
library(textcat); library(ggplot2); library(stringi); library(xlsx); library(dendextend)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
## Loading required package: rJava
## Loading required package: xlsxjars
##
## ---------------------
## Welcome to dendextend version 1.2.0
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
# c. Data load
news <- readLines("~/Coursera_R_programming/Capstone/final/en_US/en_US.news.txt")
suppressWarnings(twitter <- readLines("~/Coursera_R_programming/Capstone/final/en_US/en_US.twitter.txt"))
blogs <- readLines("~/Coursera_R_programming/Capstone/final/en_US/en_US.blogs.txt")
The datasets are very big, so I am going to sample 10% of the total. The sample is random.
seed <- as.numeric(as.Date("2016-8-18"))
set.seed(seed)
news.sample <- sample(news, length(news)*0.1, replace = FALSE)
twitter.sample <- sample(twitter, length(twitter)*0.1, replace = FALSE)
blogs.sample <- sample(blogs, length(blogs)*0.1, replace = FALSE)
I am pooling the three datasets together for cleaning and machine learning purposes.
alldata <- c(news.sample, twitter.sample, blogs.sample)
remove(news.sample); remove(twitter.sample); remove(blogs.sample)
remove(news); remove(twitter); remove(blogs)
I am creating the corpus that will be the basis for cleaning and text mining.
corpus2 <- corpus(alldata)
remove(alldata)
I am cleaning up the data by using the quanteda package.
# I use quanteda's "tokenize()" because tm's "tm_map" does not recognise the "corpus" class that quanteda uses.
# The following must all be passed in one function because quanteda cannot tokenize an already tokenized object.
corpus2 <- tokenize(corpus2,
removeNumbers = TRUE,
removePunct = TRUE,
removeSymbols = TRUE,
removeTwitter = TRUE,
removeHyphens = TRUE,
removeURL = TRUE)
corpus2 <- removeFeatures(corpus2, stopwords("english"))
corpus2 <- toLower(corpus2)
To proceed, I create a document features matrix. This is the quanteda equivalent of the tm’s package “document terms matrix”. This is what you will be using from this point on.
# a. Create the "document-features matrix"
suppressMessages(dfm <- dfm(corpus2, ignoredFeatures = stopwords("english"), stem = TRUE))
# b. Access a list of the most frequently occurring features
topfeatures(dfm, 100) # 100 most frequent words
## will one just like said get time go can year
## 32501 31408 30502 30369 30347 30040 27110 26865 25072 24359
## day make love new know good now work peopl say
## 22971 20939 20235 19792 18713 18513 18206 18124 16493 16209
## want see think look thank back come need first use
## 16208 16041 15654 15386 15313 15033 14678 14231 14150 13433
## also thing last take way great two today us well
## 13098 13091 13045 12789 12775 12587 12231 12146 12123 12066
## even much right realli follow got start week still game
## 12058 12029 11720 11506 11394 10857 10846 10799 10465 10435
## play call state show tri feel school life rt home
## 10240 9987 9772 9642 9541 9446 9368 9338 9019 8992
## mani night live help made hope never littl best may
## 8947 8727 8675 8567 8536 8380 8223 8060 8040 7989
## u friend give next lol someth run book world long
## 7977 7769 7625 7615 7445 7314 7279 7191 7186 7185
## happi end citi lot place find around alway watch team
## 7163 7109 7039 6911 6910 6850 6802 6773 6719 6715
## man everi keep old anoth better talk put big three
## 6703 6687 6685 6647 6647 6623 6486 6445 6418 6384
top100words <- as.data.frame(topfeatures(dfm, 100)) # convert to dataframe for manipulating
write.xlsx(top100words, file = "topwords.xlsx", sheetName = "top100words")
top100words <- read.xlsx("topwords.xlsx", sheetName = "top100words")
names(top100words) <- c("word", "frequency")
I am creating the following two graphs to offer a preliminary/exploratory visualization of the data analysis I have completed up to this point.
# 1. Create a wordcloud plot
suppressWarnings(wordcloud <- plot(dfm, max.words = 100, colors = brewer.pal(6, "Dark2"), scale = c(8, .5)))
# 2. Create a graph that plots a series of most common words
## First, I sort by frequency in decreasing order, just to make the end plot prettier
top100words$word <- factor(top100words$word, levels = top100words$word[order(top100words$frequency)])
## Then, I create the plot in the gglpot environment
frequencyplot <- ggplot(head(top100words, n = 30), aes(word, frequency)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_minimal() + ggtitle("Histogram of top 25 words") +
xlab("Word") + ylab("Frequency")
frequencyplot
I will now use the quanteda library (RWeka did not work for me, and it’s been a struggle) to create 1, 2 and 3 n-grams over the curated set. Then, I will order the ngram data frames in descending order, so we have the most frequently used terms on top.
# 1. Create ngrams
ngram1 <- ngrams(corpus2, n = 1)
ngram2 <- ngrams(corpus2, n = 2)
ngram3 <- ngrams(corpus2, n = 3)
# 2. Tranform the ngrams into "document-features matrix"
suppressMessages(dfm.ngram1 <- dfm(ngram1, ignoredFeatures = stopwords("english"), stem = TRUE))
suppressMessages(dfm.ngram2 <- dfm(ngram2, ignoredFeatures = stopwords("english"), stem = TRUE))
suppressMessages(dfm.ngram3 <- dfm(ngram3, ignoredFeatures = stopwords("english"), stem = TRUE))
# 3. Access a list of the most frequently occurring features
## a. for unigram
top100ngram1 <- as.data.frame(topfeatures(dfm.ngram1, 100)) # creating a dataframe of 100 most common unigrams
write.xlsx(top100ngram1, file = "topwords.xlsx", sheetName = "top100ngram1", append = TRUE)
top100ngram1 <- read.xlsx("topwords.xlsx", sheetName = "top100ngram1")
names(top100ngram1) <- c("feature", "frequency")
## b. for bigram
top100ngram2 <- as.data.frame(topfeatures(dfm.ngram2, 100)) # creating a dataframe of 100 most common bigrams
write.xlsx(top100ngram2, file = "topwords.xlsx", sheetName = "top100ngram2", append = TRUE)
top100ngram2 <- read.xlsx("topwords.xlsx", sheetName = "top100ngram2")
names(top100ngram2) <- c("feature", "frequency")
## c. for trigram
top100ngram3 <- as.data.frame(topfeatures(dfm.ngram3, 100)) # creating a dataframe of 100 most common trigrams
write.xlsx(top100ngram3, file = "topwords.xlsx", sheetName = "top100ngram3", append = TRUE)
top100ngram3 <- read.xlsx("topwords.xlsx", sheetName = "top100ngram3")
names(top100ngram3) <- c("feature", "frequency")
# 4. Create graphs for unigram, bigram and trigram
# First, I sort by frequency in decreasing order, just to make the end plot prettier
## a. for unigram
top100ngram1$feature <- factor(top100ngram1$feature, levels = top100ngram1$feature[order(top100ngram1$frequency)])
## b. for bigram
top100ngram2$feature <- factor(top100ngram2$feature, levels = top100ngram2$feature[order(top100ngram2$frequency)])
## c. for trigram
top100ngram3$feature <- factor(top100ngram3$feature, levels = top100ngram3$feature[order(top100ngram3$frequency)])
# Then, I create the plot in the gglpot environment
## a. for unigram
frequencyplot.ngram1 <- ggplot(head(top100ngram1, n = 30), aes(feature, frequency)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_minimal() + ggtitle("Unigram") +
xlab("Feature") + ylab("Frequency")
## b. for bigram
frequencyplot.ngram2 <- ggplot(head(top100ngram2, n = 30), aes(feature, frequency)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_minimal() + ggtitle("Bigram") +
xlab("Feature") + ylab("Frequency")
## c. for trigram
frequencyplot.ngram3 <- ggplot(head(top100ngram3, n = 30), aes(feature, frequency)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_minimal() + ggtitle("Trigram") +
xlab("Feature") + ylab("Frequency")
frequencyplot.ngram1
frequencyplot.ngram2
frequencyplot.ngram3
I can use distances to draw comparisons between words and cluster them. Here, I do it for the 30 most common words. Note: Initially, I was trying to plot the dendrogram but couldn’t figure out a way to show the labels of the leaves, i.e. the words. I found the answer here: https://github.com/kbenoit/quanteda/issues/112; or here http://www.kenbenoit.net/courses/nyu2014qta/day7examples.R. God bless the guy.
# hierarchical clustering on words
# weight by relative term frequency
word.dfm <- sort(tf(dfm, "prop")) # sort in decreasing order of total word frequency
word.dfm <- t(word.dfm)[1:30, ] # because it is transposed
word.dist.mat <- dist(word.dfm)
wordcluster <- hclust(word.dist.mat)
# This is a different presentation which I find more appealing
wordcluster <- as.dendrogram(wordcluster)
# Color the branches using color_branches() from 'dendextend' package
colorbranch <- color_branches(wordcluster, k=20)
# To color also labels, plug the result of coloring branches to color_labels()
allcolor <- color_labels(colorbranch, k=20)
# Finally, plot the dendrogram
plot(allcolor, xlab = "", ylab = "Height", main = "Euclidean Distance on Normalized Token Frequency")
# k-means clustering
k_means_dfm <- trim(head(dfm, n = 200), minCount = 5, minDoc = 3)
## Document-feature matrix of: 426,966 documents, 131,641 features.
## (showing first 200 documents and first 6 features)
## features
## docs magic fail allow roster deterior team
## text1 1 1 1 1 1 1
## text2 0 0 0 0 0 0
## text3 0 0 0 0 0 0
## text4 0 0 0 0 0 0
## text5 0 0 0 0 0 1
## text6 0 0 1 0 0 0
## text7 0 0 1 0 0 0
## text8 0 0 0 0 0 0
## text9 0 0 0 0 0 0
## text10 0 0 0 0 0 0
## text11 0 0 0 0 0 0
## text12 0 0 0 0 0 0
## text13 0 0 0 0 0 0
## text14 0 0 0 0 0 0
## text15 0 0 0 0 0 0
## text16 0 0 0 0 0 0
## text17 0 0 0 0 0 0
## text18 0 0 0 0 0 0
## text19 0 0 0 0 0 0
## text20 0 0 0 0 0 0
## text21 0 0 0 0 0 0
## text22 0 0 0 0 0 0
## text23 0 0 0 0 0 0
## text24 0 0 0 0 0 0
## text25 0 0 0 0 0 0
## text26 0 0 0 0 0 0
## text27 0 0 0 0 0 0
## text28 0 0 0 0 0 0
## text29 0 0 0 0 0 0
## text30 0 0 0 0 0 0
## text31 0 0 0 0 0 0
## text32 0 0 0 0 0 0
## text33 0 0 0 0 0 0
## text34 0 0 0 0 0 0
## text35 0 0 0 0 0 0
## text36 0 0 0 0 0 0
## text37 0 0 0 0 0 0
## text38 0 0 0 0 0 0
## text39 0 0 0 0 0 0
## text40 0 0 0 0 0 0
## text41 0 0 0 0 0 0
## text42 0 0 0 0 0 0
## text43 0 0 0 0 0 0
## text44 0 0 0 0 0 0
## text45 0 0 0 0 0 0
## text46 1 0 0 0 0 0
## text47 0 0 0 0 0 0
## text48 0 0 0 0 0 0
## text49 0 0 0 0 0 0
## text50 0 0 0 0 0 0
## text51 0 0 0 0 0 0
## text52 0 0 0 0 0 0
## text53 0 0 0 0 0 0
## text54 0 0 0 0 0 0
## text55 0 0 0 0 0 0
## text56 0 0 0 0 0 0
## text57 0 0 0 0 0 0
## text58 0 0 0 0 0 0
## text59 0 0 0 0 0 0
## text60 0 0 0 0 0 0
## text61 0 0 0 0 0 0
## text62 0 0 0 0 0 0
## text63 0 0 0 0 0 0
## text64 0 0 0 0 0 0
## text65 0 0 0 0 0 0
## text66 0 0 0 0 0 0
## text67 0 0 0 0 0 0
## text68 0 0 0 0 0 0
## text69 0 0 0 0 0 0
## text70 0 0 0 0 0 0
## text71 0 0 0 0 0 0
## text72 0 0 0 0 0 0
## text73 0 0 0 0 0 0
## text74 0 0 0 0 0 0
## text75 0 0 0 0 0 0
## text76 0 0 0 0 0 0
## text77 0 0 0 0 0 0
## text78 0 0 0 0 0 0
## text79 0 0 0 0 0 0
## text80 0 0 0 0 0 0
## text81 0 0 0 0 0 0
## text82 0 0 0 0 0 0
## text83 0 0 0 0 0 0
## text84 0 0 0 0 0 0
## text85 0 0 0 0 0 0
## text86 0 0 0 0 0 0
## text87 0 0 0 0 0 0
## text88 0 0 0 0 0 0
## text89 0 0 0 0 0 0
## text90 0 0 0 0 0 0
## text91 0 0 0 0 0 0
## text92 0 0 0 0 0 0
## text93 0 0 0 0 0 0
## text94 0 0 0 0 0 0
## text95 0 0 0 0 0 0
## text96 0 0 0 0 0 0
## text97 0 0 0 0 0 0
## text98 0 0 0 0 0 0
## text99 0 0 0 0 0 0
## text100 0 0 0 0 0 0
## text101 0 0 0 0 0 0
## text102 0 0 0 0 0 0
## text103 0 0 0 0 0 2
## text104 0 1 0 0 0 0
## text105 0 0 0 0 0 0
## text106 0 0 0 0 0 0
## text107 0 0 0 0 0 1
## text108 0 0 0 0 0 0
## text109 0 0 0 0 0 0
## text110 0 0 0 0 0 0
## text111 0 0 0 0 0 0
## text112 0 0 0 0 0 0
## text113 0 0 0 0 0 0
## text114 0 0 0 0 0 1
## text115 0 0 0 0 0 0
## text116 0 0 0 0 0 0
## text117 0 0 0 0 0 0
## text118 0 0 0 0 0 0
## text119 0 0 0 0 0 0
## text120 0 0 0 0 0 0
## text121 0 0 0 0 0 0
## text122 0 0 0 0 0 0
## text123 0 0 0 0 0 0
## text124 0 0 1 0 0 0
## text125 0 0 0 0 0 0
## text126 0 0 0 0 0 0
## text127 0 0 0 0 0 0
## text128 0 0 0 0 0 0
## text129 0 0 0 0 0 0
## text130 0 0 0 0 0 0
## text131 0 0 0 0 0 1
## text132 0 0 0 0 0 0
## text133 0 0 0 0 0 0
## text134 0 0 0 0 0 0
## text135 0 0 0 0 0 0
## text136 0 0 0 0 0 0
## text137 0 0 0 0 0 0
## text138 0 0 0 0 0 0
## text139 0 0 0 0 0 0
## text140 0 0 0 0 0 0
## text141 0 0 0 0 0 0
## text142 0 0 0 0 0 0
## text143 0 0 0 0 0 0
## text144 0 0 0 0 0 0
## text145 0 0 0 0 0 0
## text146 0 0 0 0 0 0
## text147 0 0 0 0 0 0
## text148 0 0 0 0 0 0
## text149 0 0 0 0 0 0
## text150 0 0 0 0 0 0
## text151 0 0 0 0 0 0
## text152 0 0 0 0 0 0
## text153 0 0 0 0 0 0
## text154 0 0 0 0 0 0
## text155 0 0 0 0 0 0
## text156 0 0 0 0 0 0
## text157 0 0 0 0 0 0
## text158 0 0 0 0 0 0
## text159 0 0 1 0 0 0
## text160 0 0 0 0 0 0
## text161 0 0 0 0 0 0
## text162 0 0 1 0 0 0
## text163 0 0 0 0 0 0
## text164 0 0 0 0 0 0
## text165 0 0 0 0 0 0
## text166 0 0 0 0 0 0
## text167 0 0 0 0 0 0
## text168 0 0 0 0 0 0
## text169 0 0 0 0 0 0
## text170 0 0 0 0 0 0
## text171 0 0 0 0 0 0
## text172 0 0 0 0 0 0
## text173 0 0 0 0 0 0
## text174 0 0 0 0 0 0
## text175 0 0 0 0 0 0
## text176 0 0 0 0 0 0
## text177 0 0 0 0 0 0
## text178 0 0 0 0 0 0
## text179 0 0 0 0 0 0
## text180 0 0 0 0 0 0
## text181 0 0 0 0 0 0
## text182 0 0 0 0 0 0
## text183 0 0 0 0 0 0
## text184 0 0 0 0 0 0
## text185 0 0 0 0 0 0
## text186 0 0 0 0 0 0
## text187 0 0 0 0 0 0
## text188 0 0 0 0 0 0
## text189 0 0 0 0 0 0
## text190 0 0 0 0 0 0
## text191 0 0 0 0 0 0
## text192 0 0 0 0 0 0
## text193 0 0 0 0 0 0
## text194 0 0 0 0 0 0
## text195 0 0 0 0 0 0
## text196 0 0 0 0 0 0
## text197 0 0 0 0 0 0
## text198 0 0 0 0 0 0
## text199 0 0 0 0 0 0
## text200 0 0 0 0 0 0
## Removing features occurring fewer than 5 times: 4
## Removing features occurring in fewer than 3 documents: 4
# try default guidelines
k <- round(sqrt(ndoc(k_means_dfm)/2))
# clusterk5 <- kmeans(tf(k_means_dfm, "prop"), k)
# split(docnames(k_means_dfm), clusterk5$cluster)
clusterk3 <- kmeans(tf(k_means_dfm, "prop"), 3)
split(docnames(k_means_dfm), clusterk3$cluster)
## $`1`
## [1] "text1" "text5" "text103" "text107" "text114" "text131"
##
## $`2`
## [1] "text6" "text7" "text124" "text159" "text162"
##
## $`3`
## [1] "text2" "text3" "text4" "text8" "text9" "text10" "text11"
## [8] "text12" "text13" "text14" "text15" "text16" "text17" "text18"
## [15] "text19" "text20" "text21" "text22" "text23" "text24" "text25"
## [22] "text26" "text27" "text28" "text29" "text30" "text31" "text32"
## [29] "text33" "text34" "text35" "text36" "text37" "text38" "text39"
## [36] "text40" "text41" "text42" "text43" "text44" "text45" "text46"
## [43] "text47" "text48" "text49" "text50" "text51" "text52" "text53"
## [50] "text54" "text55" "text56" "text57" "text58" "text59" "text60"
## [57] "text61" "text62" "text63" "text64" "text65" "text66" "text67"
## [64] "text68" "text69" "text70" "text71" "text72" "text73" "text74"
## [71] "text75" "text76" "text77" "text78" "text79" "text80" "text81"
## [78] "text82" "text83" "text84" "text85" "text86" "text87" "text88"
## [85] "text89" "text90" "text91" "text92" "text93" "text94" "text95"
## [92] "text96" "text97" "text98" "text99" "text100" "text101" "text102"
## [99] "text104" "text105" "text106" "text108" "text109" "text110" "text111"
## [106] "text112" "text113" "text115" "text116" "text117" "text118" "text119"
## [113] "text120" "text121" "text122" "text123" "text125" "text126" "text127"
## [120] "text128" "text129" "text130" "text132" "text133" "text134" "text135"
## [127] "text136" "text137" "text138" "text139" "text140" "text141" "text142"
## [134] "text143" "text144" "text145" "text146" "text147" "text148" "text149"
## [141] "text150" "text151" "text152" "text153" "text154" "text155" "text156"
## [148] "text157" "text158" "text160" "text161" "text163" "text164" "text165"
## [155] "text166" "text167" "text168" "text169" "text170" "text171" "text172"
## [162] "text173" "text174" "text175" "text176" "text177" "text178" "text179"
## [169] "text180" "text181" "text182" "text183" "text184" "text185" "text186"
## [176] "text187" "text188" "text189" "text190" "text191" "text192" "text193"
## [183] "text194" "text195" "text196" "text197" "text198" "text199" "text200"
clusterk2 <- kmeans(tf(k_means_dfm, "prop"), 2)
split(docnames(k_means_dfm), clusterk2$cluster)
## $`1`
## [1] "text1" "text5" "text103" "text107" "text114" "text131"
##
## $`2`
## [1] "text2" "text3" "text4" "text6" "text7" "text8" "text9"
## [8] "text10" "text11" "text12" "text13" "text14" "text15" "text16"
## [15] "text17" "text18" "text19" "text20" "text21" "text22" "text23"
## [22] "text24" "text25" "text26" "text27" "text28" "text29" "text30"
## [29] "text31" "text32" "text33" "text34" "text35" "text36" "text37"
## [36] "text38" "text39" "text40" "text41" "text42" "text43" "text44"
## [43] "text45" "text46" "text47" "text48" "text49" "text50" "text51"
## [50] "text52" "text53" "text54" "text55" "text56" "text57" "text58"
## [57] "text59" "text60" "text61" "text62" "text63" "text64" "text65"
## [64] "text66" "text67" "text68" "text69" "text70" "text71" "text72"
## [71] "text73" "text74" "text75" "text76" "text77" "text78" "text79"
## [78] "text80" "text81" "text82" "text83" "text84" "text85" "text86"
## [85] "text87" "text88" "text89" "text90" "text91" "text92" "text93"
## [92] "text94" "text95" "text96" "text97" "text98" "text99" "text100"
## [99] "text101" "text102" "text104" "text105" "text106" "text108" "text109"
## [106] "text110" "text111" "text112" "text113" "text115" "text116" "text117"
## [113] "text118" "text119" "text120" "text121" "text122" "text123" "text124"
## [120] "text125" "text126" "text127" "text128" "text129" "text130" "text132"
## [127] "text133" "text134" "text135" "text136" "text137" "text138" "text139"
## [134] "text140" "text141" "text142" "text143" "text144" "text145" "text146"
## [141] "text147" "text148" "text149" "text150" "text151" "text152" "text153"
## [148] "text154" "text155" "text156" "text157" "text158" "text159" "text160"
## [155] "text161" "text162" "text163" "text164" "text165" "text166" "text167"
## [162] "text168" "text169" "text170" "text171" "text172" "text173" "text174"
## [169] "text175" "text176" "text177" "text178" "text179" "text180" "text181"
## [176] "text182" "text183" "text184" "text185" "text186" "text187" "text188"
## [183] "text189" "text190" "text191" "text192" "text193" "text194" "text195"
## [190] "text196" "text197" "text198" "text199" "text200"
# Render the cluster plot
clusplot(as.matrix(k_means_dfm), clusterk2$cluster, color=TRUE, shade=TRUE, labels=2, lines=0, main = "K-means clustering")