This document describes the process designed to answer the following challenges:
Perform thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
The data comes from a corpus called HC Corpora (http://www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available.
The data set used in this analysis is available at: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The downloaded file is stored locally in the following location: ./Capstone Project/en_US/
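If the data has not yet been downloaded, a sketch along the following lines could be used to fetch and extract it; the destination file name and extraction directory are assumptions, and the extracted files may need to be moved to match the path above.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("./Capstone Project/en_US/en_US.blogs.txt")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")   # assumed local file name
  unzip("Coursera-SwiftKey.zip", exdir = "./Capstone Project")          # assumed extraction directory
}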
# Install any required CRAN packages that are not yet available locally
list.of.packages <- c("tm", "SnowballC", "RColorBrewer", "ggplot2", "reshape2", "wordcloud", "biclust", "cluster", "igraph", "fpc", "doParallel", "RWeka", "R.utils")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library("tm")
library("SnowballC")
library("cldr")
library("reshape2")
library("ggplot2")
library("R.utils")
library("RWeka")
The code below checks the size (in bytes) and the number of lines of each individual file in the English corpus.
blogs.txt.size <- file.info("./Capstone Project/en_US/en_US.blogs.txt")$size
news.txt.size <- file.info("./Capstone Project/en_US/en_US.news.txt")$size
twitter.txt.size <- file.info("./Capstone Project/en_US/en_US.twitter.txt")$size
blogs.txt.lines <- countLines("./Capstone Project/en_US/en_US.blogs.txt")
news.txt.lines <- countLines("./Capstone Project/en_US/en_US.news.txt")
twitter.txt.lines <- countLines("./Capstone Project/en_US/en_US.twitter.txt")
Files =c("blogs.txt", "news.txt", "twitter.txt")
Size=c(blogs.txt.size,news.txt.size,twitter.txt.size)
Lines=c(blogs.txt.lines,news.txt.lines,twitter.txt.lines)
size_table <- data.frame(Files, Size, Lines)
The results are printed using the kable function from the knitr package.
knitr::kable(size_table)
| Files | Size | Lines |
|---|---|---|
| blogs.txt | 210160014 | 899288 |
| news.txt | 205811889 | 1010242 |
| twitter.txt | 167105338 | 2360148 |
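For readability, the file sizes could also be reported in megabytes; a small optional sketch (the Size_MB column name is a choice of convenience):
size_table$Size_MB <- round(size_table$Size / 2^20, 1)   # 1 MB = 2^20 bytes
knitr::kable(size_table[, c("Files", "Size_MB", "Lines")])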
It is clear from the table above that the full corpus is too large to process comfortably in the memory of a single machine. To reduce the runtime, the decision has been made to take a random sample of roughly 2% of the lines of each file, as shown below.
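Before running the sampling code, the random seed can be fixed so that the sample is reproducible, and the output directory for the sample files created if it does not yet exist; an optional sketch (the seed value is arbitrary):
set.seed(1234)                                                        # arbitrary seed for reproducible sampling
dir.create("./Capstone Project/en_US/sample", showWarnings = FALSE)   # target directory for the sample files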
connection <- file("./Capstone Project/en_US/en_US.news.txt")
news <- readLines(connection, encoding="UTF-8", skipNul=T)
close(connection)
news_sample <- sample(news, length(news)/50)
connection <- file("./Capstone Project/en_US/en_US.blogs.txt")
blogs <- readLines(connection, encoding="UTF-8", skipNul=T)
close(connection)
blogs_sample <- sample(blogs, length(blogs)/50)
connection <- file("./Capstone Project/en_US/en_US.twitter.txt")
twitter <- readLines(connection, encoding="UTF-8", skipNul=T)
close(connection)
twitter_sample <- sample(twitter, length(twitter)/50)
connection <- file("./Capstone Project/en_US/sample/news_sample.txt")
writeLines(news_sample,connection)
close(connection)
connection <- file("./Capstone Project/en_US/sample/blogs_sample.txt")
writeLines(blogs_sample,connection)
close(connection)
connection <- file("./Capstone Project/en_US/sample/twitter_sample.txt")
writeLines(twitter_sample,connection)
close(connection)
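As an optional sanity check, the number of lines in the written sample files can be counted in the same way as for the original files:
sapply(list.files("./Capstone Project/en_US/sample", full.names = TRUE), countLines)   # expect roughly 2% of the original line counts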
One of the quickest ways to detect lines written in foreign languages is to use the "cldr" package (http://cran.r-project.org/web/packages/cldr/), which brings Google Chrome's language detection into R. To install the package, make sure you have devtools installed and execute the following command:
devtools::install_version("cldr",version="1.1.0")
library(cldr)
non_english_blogs<-which(detectLanguage(blogs_sample)$detectedLanguage != "ENGLISH")
length(non_english_blogs)
non_english_news<-which(detectLanguage(news_sample)$detectedLanguage != "ENGLISH")
length(non_english_news)
non_english_twitter<-which(detectLanguage(twitter_sample)$detectedLanguage != "ENGLISH")
length(non_english_twitter)
Non_English <- c(length(non_english_blogs), length(non_english_news),length(non_english_twitter ))
Non_English_table <- data.frame(Files, Non_English)
knitr::kable(Non_English_table)
| Files | Non_English |
|---|---|
| blogs.txt | 608 |
| news.txt | 444 |
| twitter.txt | 1101 |
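If desired, the lines detected as non-English could be dropped from the in-memory samples before the corpus is built; a minimal optional sketch (the filtered samples would then need to be written out again with writeLines, as above):
if (length(non_english_blogs) > 0) blogs_sample <- blogs_sample[-non_english_blogs]
if (length(non_english_news) > 0) news_sample <- news_sample[-non_english_news]
if (length(non_english_twitter) > 0) twitter_sample <- twitter_sample[-non_english_twitter]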
cat_name <- file.path("./Capstone Project/en_US/sample")
texts <- tm::Corpus(DirSource(cat_name))
This step removes numbers and punctuation, strips common English stop words and converts the text to lower case, to better prepare the text for analysis. It can be time- and processing-intensive, but it greatly improves the overall quality of the analysis.
texts <- tm_map(texts, removePunctuation)
texts <- tm_map(texts, removeNumbers)
# Note: stopwords("english") contains lower-case words only, so capitalised forms such as "The"
# survive this step and are only lower-cased afterwards (which is why "the" still appears in the
# frequency tables further below)
texts <- tm_map(texts, removeWords, stopwords("english"))
texts <- tm_map(texts, content_transformer(tolower))
We stem the documents so that different inflected forms of a word are treated as a single term, regardless of the endings they carry in the original text. In practice this means removing common word endings (e.g. "ing", "es", "s").
texts <- tm_map(texts, stemDocument)
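To illustrate what stemming does, the SnowballC stemmer used by stemDocument maps inflected forms onto a common stem; a small optional check (the example words are arbitrary):
SnowballC::wordStem(c("running", "happy", "cities", "people"))
# expected stems: "run", "happi", "citi", "peopl" - the latter three are visible in the frequency tables further below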
All of the above cleaning steps leave the corpus with many unnecessary white spaces, which are simply leftovers from the words and characters we removed.
texts <- tm_map(texts, stripWhitespace)
A term-document matrix is a matrix that describes the frequency with which terms occur in a collection of documents; here each of the three sample files is treated as one document.
texts_tdm <- TermDocumentMatrix(texts, control=list(wordLengths=c(3,Inf)))
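A quick optional sanity check on the resulting object: its rows correspond to distinct terms and its columns to the three sample files.
dim(texts_tdm)              # number of distinct terms x number of documents
inspect(texts_tdm[1:10, ])  # term counts for the first ten terms in each sample file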
In this step we generate three TermDocumentMatrix objects which represent the original corpus in the form of n-grams, i.e. sequences of n consecutive words.
require(RWeka)
options(mc.cores = 1)
UnigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 1))}
tdm_unigram <- TermDocumentMatrix(texts, control = list(tokenize = UnigramTokenizer)) # create tdm from 1-grams
#note that in theory texts_tdm = tdm_unigram
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
tdm_bigram <- TermDocumentMatrix(texts, control = list(tokenize = BigramTokenizer)) # create tdm from 2-grams
ThreegramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}
tdm_threegram <- TermDocumentMatrix(texts, control = list(tokenize = ThreegramTokenizer)) # create tdm from 3-grams
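The three tokenizers above differ only in n, so the repetition could be avoided with a small helper function (a sketch; ngram_tokenizer is a hypothetical name):
ngram_tokenizer <- function(n) {
  function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = n, max = n))
}
# e.g. the bigram matrix could equivalently be built with:
# tdm_bigram <- TermDocumentMatrix(texts, control = list(tokenize = ngram_tokenizer(2)))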
tdm_unigram.matrix <- as.matrix(tdm_unigram)
topwords_uni_gram<- rowSums(tdm_unigram.matrix)
head(sort(topwords_uni_gram,decreasing = TRUE), 25)
## the will one get said just like time can day year make
## 10688 6366 6213 6033 6028 5945 5945 5111 4779 4426 4357 4175
## love new good know now dont work think say see want peopl
## 3911 3857 3674 3597 3537 3424 3387 3291 3276 3236 3224 3213
## thank
## 3048
tdm_bigram.matrix <- as.matrix(tdm_bigram)
topwords_bigram<- rowSums(tdm_bigram.matrix)
head(sort(topwords_bigram,decreasing = TRUE), 25)
## i think i dont i love i can i know i just
## 1211 1107 999 752 743 735
## i want i will i cant i need last year right now
## 688 630 509 483 475 467
## i like i didnt i feel look like i got i get
## 453 412 405 404 401 399
## new york cant wait i thought i hope dont know year ago
## 383 364 364 358 346 344
## last night
## 341
tdm_threegram.matrix <- as.matrix(tdm_threegram)
topwords_threegram<- rowSums(tdm_threegram.matrix)
head(sort(topwords_threegram,decreasing = TRUE), 25)
## i dont know i dont think i think i
## 187 162 147
## i feel like i know i i dont want
## 128 121 94
## i cant wait i wish i cant wait see
## 90 87 73
## happi mother day i thought i feel like i
## 62 61 60
## i dont like i just want let us know
## 56 52 50
## i think im happi new year new york citi
## 45 44 44
## i didnt know presid barack obama i realli want
## 43 42 41
## i can get i cant believ i dont even
## 38 38 38
## dont know i
## 36
unigram_freq <- rowSums(tdm_unigram.matrix)
unigram_freq_ord <- order(unigram_freq, decreasing = TRUE)
unigram_freq_top25 <- unigram_freq[head(unigram_freq_ord, 25)]
unigram_freq_top25 <- melt(unigram_freq_top25)
unigram_freq_top25$words <- rownames(unigram_freq_top25)
#Generating the word cloud for unigram
wordcloud::wordcloud(names(unigram_freq), unigram_freq,max.words=200, scale = c(5, .1))
# plotting the top 25 words
p <- ggplot(unigram_freq_top25, aes(x=words, y=value))
p <- p + geom_bar(stat = "identity", colour="red", fill="navy", width = 0.5)
p <- p + geom_text( aes (label = value ) , vjust = - 0.20, size = 3 )
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
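By default ggplot2 orders a character x axis alphabetically; if the bars should instead appear in decreasing order of frequency, the words column can be converted to a factor with frequency-ordered levels before the plot is built. An optional tweak (the same idea applies to the bigram and trigram plots below):
unigram_freq_top25$words <- factor(unigram_freq_top25$words,
                                   levels = unigram_freq_top25$words[order(-unigram_freq_top25$value)])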
bigram_freq <- rowSums(tdm_bigram.matrix)
bigram_freq_ord <- order(bigram_freq, decreasing = TRUE)
bigram_freq_top25 <- bigram_freq[head(bigram_freq_ord, 25)]
bigram_freq_top25 <- melt(bigram_freq_top25)
bigram_freq_top25$words <- rownames(bigram_freq_top25)
#Generating the word cloud for bigram
wordcloud::wordcloud(names(bigram_freq), bigram_freq,max.words=100, scale = c(5, .1))
# plotting the top 25 word pairs
b <- ggplot(bigram_freq_top25, aes(x=words, y=value))
b <- b + geom_bar(stat = "identity", colour="yellow", fill="red", width = 0.5)
b <- b + geom_text( aes (label = value ) , vjust = - 0.20, size = 3 )
b <- b + theme(axis.text.x=element_text(angle=45, hjust=1))
b
threegram_freq <- rowSums(tdm_threegram.matrix)
threegram_freq_ord <- order(threegram_freq, decreasing = TRUE)
threegram_freq_top25 <- threegram_freq[head(threegram_freq_ord, 25)]
threegram_freq_top25 <- melt(threegram_freq_top25)
threegram_freq_top25$words <- rownames(threegram_freq_top25)
#Generating the word cloud for threegram
wordcloud::wordcloud(names(threegram_freq), threegram_freq,max.words=25)
# plotting the top 25 three-word sequences
t <- ggplot(threegram_freq_top25, aes(x=words, y=value))
t <- t + geom_bar(stat = "identity", colour="yellow", fill="green", width = 0.5)
t <- t + geom_text( aes (label = value ) , vjust = - 0.20, size = 3 )
t <- t + theme(axis.text.x=element_text(angle=45, hjust=1))
t
all_uniwords <- sort(topwords_uni_gram,decreasing = TRUE)
all_uniwords <- melt(all_uniwords)
# Function which calculates how many of the frequency-sorted words are needed to cover a given fraction of all 1-gram word instances
words_percentage_sum <- function(x, y){
  # x: word frequencies sorted in decreasing order; y: target coverage fraction (e.g. 0.5)
  for(i in 1:length(x)) {
    if (sum(x[1:i]) > sum(x)*y) return(i)
  }
  length(x)
}
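The exact calls are not shown in the output below; presumably the function is applied to the frequency-sorted unigram counts along these lines:
words_percentage_sum(all_uniwords$value, 0.5)   # number of top words covering 50% of word instances
words_percentage_sum(all_uniwords$value, 0.9)   # number of top words covering 90% of word instances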
To cover 50 percent of all word instances we need the following number of frequency sorted top words:
## [1] 5
To cover 90 percent of all word instances we need the following number of frequency sorted top words:
## [1] 9
The above exploratory analysis has delivered the following insights:
The input data sets represent quite different registers of English: the way people express themselves on Twitter differs from the language used in news and blogs.
The input data set is large and, given the memory and runtime limitations of R on a single machine, sampling of the source data was necessary. Further analysis may require packages that enable multi-threaded, parallel processing (e.g. doParallel).
Using a sample of the input data may impact the prediction accuracy of the final algorithm. The modelling phase of the project will have to deliver a proper analysis of the trade-off between the size of the sample used for building the model and its accuracy.