This is the Week 2 milestone report for the Data Science Specialization Capstone project. The aim of this report is to provide an initial exploratory data analysis of the dataset. The overall goal of the project is to build a predictive model that suggests the next word or phrase based on the text typed before it. This kind of prediction task falls under Natural Language Processing (NLP).
## Preparing packages
knitr::opts_chunk$set(echo = TRUE)
set.seed(100)
library(knitr)
library(stringi)
library(tm)
## Loading required package: NLP
library(RWeka)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(SnowballC)
Only the English-language datasets were used for this report.
blogs<-readLines(con = "../en_US.blogs.txt",encoding = "UTF-8")
twitter<-readLines(con = "../en_US.twitter.txt",encoding = "UTF-8")
## Warning in readLines(con = "../en_US.twitter.txt", encoding = "UTF-8"):
## line 167155 appears to contain an embedded nul
## Warning in readLines(con = "../en_US.twitter.txt", encoding = "UTF-8"):
## line 268547 appears to contain an embedded nul
## Warning in readLines(con = "../en_US.twitter.txt", encoding = "UTF-8"):
## line 1274086 appears to contain an embedded nul
## Warning in readLines(con = "../en_US.twitter.txt", encoding = "UTF-8"):
## line 1759032 appears to contain an embedded nul
news<-readLines(con = "../en_US.news.txt",encoding = "UTF-8")
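The embedded nul warnings above come from stray nul bytes in the twitter file; the affected lines are still read. If desired, the warnings can be avoided by skipping nul characters. A minimal sketch (twitter_nonul is a hypothetical name; this alternative read is not used for the results below):
## Optional sketch: re-read the twitter file, skipping embedded nul characters
twitter_nonul <- readLines(con = "../en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)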
Summary statistics were calculated for each of the three datasets (blogs, news, and twitter): the number of lines, the number of characters, and the number of words. The minimum, mean, and maximum number of words per line were also computed.
WPL=sapply(list(blogs,news,twitter),function(x)summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(WPL)=c('wplMIN','wplMEAN','wplMAX')
stats=data.frame(
Dataset=c("blogs","news","twitter"),
t(rbind(
sapply(list(blogs,news,twitter),stri_stats_general)[c('Lines','Chars'),],
Words=sapply(list(blogs,news,twitter),stri_stats_latex)['Words',],
WPL)
))
head(stats)
## Dataset Lines Chars Words wplMIN wplMEAN wplMAX
## 1 blogs 899288 206824382 37570839 0 41.75 6726
## 2 news 1010242 203223154 34494539 1 34.41 1796
## 3 twitter 2360148 162096031 30451128 1 12.75 47
The summary statistics show that the full datasets are too large to process comfortably, so sampling is essential. Each file was first cleaned by removing non-ASCII characters, and 1% of the lines in each of the blogs, news, and twitter files was then sampled. The three cleaned samples were combined into a single dataset called sample_total.
## Clean the data
blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")
## Sampling the data
sample_blogs <- sample(blogs,size =1/100*length(blogs))
sample_news<- sample(news,size =1/100*length(news))
sample_twitter <- sample(twitter,size =1/100*length(twitter))
## Combine all the subsample into one sample
sample_total <- c(sample_blogs,sample_news,sample_twitter)
summary(sample_total)
## Length Class Mode
## 42695 character character
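As a rough check on why sampling was necessary, object.size() shows the memory footprint of each full dataset. A quick sketch (assuming the full blogs, news, and twitter objects are still in memory; the exact numbers depend on the machine and R version):
## Approximate in-memory size of each full dataset (sketch)
sapply(list(blogs = blogs, news = news, twitter = twitter),
       function(x) format(object.size(x), units = "MB"))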
Next, a corpus was built with the tm package, which was also used to clean the corpus before analysis. The pre-processing steps applied were: converting the text to lower case, removing punctuation, removing numbers, stripping extra whitespace, removing English stop words, stemming, and converting the documents back to plain text.
corpus <- VCorpus(VectorSource(sample_total))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)
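To verify that the transformations behaved as expected, the first few cleaned documents can be inspected. A small sketch (output depends on the random sample):
## Inspect a few cleaned documents (sketch; output varies with the sample)
lapply(corpus[1:3], as.character)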
The corpus was then converted into Term Document Matrices (TDMs) of n-grams. An n-gram is a contiguous sequence of n words: unigrams are single words, bigrams are pairs of consecutive words, and trigrams are triples of consecutive words.
The TDMs store the frequencies of the n-grams. To create them, the RWeka package, which links the Weka data mining software to R, was used: its NGramTokenizer() function tokenizes the corpus into n-grams of a chosen length.
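For illustration, applying NGramTokenizer() to a toy sentence shows what the bigram tokenizer defined below produces (a sketch, not part of the analysis):
## Example: bigram tokens of a toy sentence (sketch)
NGramTokenizer("this is a simple test", Weka_control(min = 2, max = 2))
## Expected: "this is" "is a" "a simple" "simple test"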
unigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
unigrams <- TermDocumentMatrix(corpus, control = list(tokenize = unigram_tokenizer))
bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))
unigrams
## <<TermDocumentMatrix (terms: 41125, documents: 42695)>>
## Non-/sparse entries: 503672/1755328203
## Sparsity : 100%
## Maximal term length: 85
## Weighting : term frequency (tf)
bigrams
## <<TermDocumentMatrix (terms: 405384, documents: 42695)>>
## Non-/sparse entries: 515358/17307354522
## Sparsity : 100%
## Maximal term length: 93
## Weighting : term frequency (tf)
trigrams
## <<TermDocumentMatrix (terms: 471235, documents: 42695)>>
## Non-/sparse entries: 477373/20118900952
## Sparsity : 100%
## Maximal term length: 101
## Weighting : term frequency (tf)
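The matrices above are extremely sparse. If memory becomes a constraint, one option (a sketch, not applied in this report) is tm's removeSparseTerms(), which drops terms that appear in very few documents; unigrams_small is a hypothetical name:
## Optional sketch: shrink the unigram TDM by dropping very sparse terms
unigrams_small <- removeSparseTerms(unigrams, sparse = 0.999)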
With the n-grams tokenized, the frequent terms in each TDM were identified with the findFreqTerms() function from the tm package, and their frequencies were computed with rowSums(). The frequency table for each n-gram size was stored in a separate data frame for plotting later.
unigrams_freqTerm <- findFreqTerms(unigrams,lowfreq = 50)
bigrams_freqTerm <- findFreqTerms(bigrams,lowfreq=50)
trigrams_freqTerm <- findFreqTerms(trigrams,lowfreq=8)
## Unigram frequency dataframe
unigrams_freq <- rowSums(as.matrix(unigrams[unigrams_freqTerm,]))
unigrams_freq <- data.frame(word=names(unigrams_freq), frequency=unigrams_freq)
head(unigrams_freq)
## word frequency
## abil abil 100
## abl abl 308
## absolut absolut 114
## abus abus 67
## accept accept 150
## access access 106
## Bigram frequency dataframe
bigrams_freq <- rowSums(as.matrix(bigrams[bigrams_freqTerm,]))
bigrams_freq <- data.frame(word=names(bigrams_freq), frequency=bigrams_freq)
head(bigrams_freq)
## word frequency
## can get can get 107
## can help can help 52
## can make can make 68
## can see can see 74
## cant wait cant wait 188
## come back come back 84
## Trigram frequency dataframe
trigrams_freq <- rowSums(as.matrix(trigrams[trigrams_freqTerm,]))
trigrams_freq <- data.frame(word=names(trigrams_freq), frequency=trigrams_freq)
head(trigrams_freq)
## word frequency
## blah blah blah blah blah blah 8
## cant wait get cant wait get 13
## cant wait see cant wait see 40
## cinco de mayo cinco de mayo 20
## coupl week ago coupl week ago 8
## didnt even know didnt even know 9
With the n-gram frequency data frames created, the next step is to visualize them as bar charts. A plotting function is defined below to make this easier.
## Function for visualization of n-grams
plot_n_grams <- function(df_gram, title, num, barC) {
  ## Keep the `num` most frequent n-grams
  df_sort <- df_gram[order(-df_gram$frequency), ][1:num, ]
  ggplot(data = df_sort, aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = barC, colour = "black") +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    theme(axis.text.x = element_text(angle = 90))
}
plot_n_grams(unigrams_freq,"Top 10 Unigrams",10,"red")
plot_n_grams(bigrams_freq,"Top 10 Bigrams",10,"green")
plot_n_grams(trigrams_freq,"Top 10 Trigrams",10,"blue")
This report presented the initial exploratory data analysis of the English news, blogs, and twitter datasets. Unigram, bigram, and trigram word frequencies were extracted from a 1% sample of each dataset. The analysis showed that term frequencies decrease as n increases. The next step is to use these n-gram frequency tables to build the predictive model described in the introduction, which will suggest the next word based on the words typed before it.
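As a rough illustration of how such a model might use these tables, the sketch below implements a simple frequency-based backoff over the trigram and bigram frequency data frames built above (predict_next_word is a hypothetical helper, not part of this report; a real model would need smoothing and unstemmed n-grams):
## Hypothetical sketch of a frequency-based backoff predictor (not part of this report)
predict_next_word <- function(phrase, tri_df = trigrams_freq, bi_df = bigrams_freq) {
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(tokens)
  ## Try trigrams whose first two words match the last two typed words
  if (n >= 2) {
    prefix <- paste(tokens[n - 1], tokens[n])
    hits <- tri_df[startsWith(as.character(tri_df$word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$word[which.max(hits$frequency)])
      return(tail(unlist(strsplit(best, " ")), 1))
    }
  }
  ## Back off to bigrams whose first word matches the last typed word
  hits <- bi_df[startsWith(as.character(bi_df$word), paste0(tokens[n], " ")), ]
  if (nrow(hits) > 0) {
    best <- as.character(hits$word[which.max(hits$frequency)])
    return(tail(unlist(strsplit(best, " ")), 1))
  }
  NA_character_
}
predict_next_word("cant wait")  # e.g. "see", given the sample frequencies shown above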