This is a milestone report demonstrating preparation for developing a prediction algorithm for the word prediction problem. The report summarizes the basic steps used to preprocess and conduct exploratory data analysis (EDA) of a U.S. English language corpus comprising three documents: blogs, news, and Twitter text. The preprocessing steps comprise sampling 20% of the lines of each document, removing a short list of profanities, converting the text to lower case, and removing English stop words, punctuation, and numbers.
Because the TermDocumentMatrix becomes very large for the three documents in our corpus, we apply removeSparseTerms(), which reduces the matrix substantially without losing significant relations. Based on the computed TermDocumentMatrix and the reduction in sparseness, the report presents preliminary statistics on the following:
Word clouds for single words, bigrams, and trigrams, visually demonstrating that a small proportion of the n-grams provides most of the coverage
Exploration of Markov chains for understanding the prediction model (see the sketch after this list)
Correlations between tokens from the term document matrix, visualized with the Rgraphviz package; these provide good visualization of token correlations for EDA purposes
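As a minimal sketch of the Markov-chain idea behind the planned prediction model, the bigram counts computed later in this report (the topBigrams data frame, with word = "w1 w2" and freq = count) can be turned into a simple next-word lookup. predictNextWord is a hypothetical helper name used only for illustration.
# Treat each bigram "w1 w2" as a transition w1 -> w2 and rank candidate next
# words by observed transition frequency (a maximum-likelihood Markov chain).
predictNextWord <- function(bigramDF, currentWord, topN = 3) {
  parts  <- strsplit(bigramDF$word, " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  idx <- which(first == tolower(currentWord))
  if (length(idx) == 0) return(character(0))
  cand <- data.frame(nextWord = second[idx], freq = bigramDF$freq[idx],
                     stringsAsFactors = FALSE)
  head(cand[order(-cand$freq), "nextWord"], topN)
}
# e.g. predictNextWord(topBigrams, "right") should rank "now" first, since
# "right now" is the most frequent bigram in the corpus sample.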
The Twitter document has relatively short lines (measured in characters), with a tight spread between the median and 95th percentile lengths (a median of 64 characters and a 95th percentile of 132 characters). The blogs and news documents have much higher median line lengths and a wider spread between the median and the 95th percentile.
The word cloud for single-word tokens visually demonstrates that a small percentage of tokens predominates.
As part of data exploration, the Rgraphviz package can be used to visualize token correlations inherent in the term document matrix, with edge weights controlled by the correlation between nodes in the graph. This report compares the original single-word term document matrix with the matrix obtained after removing sparse terms, using a sample of about 10 tokens from the full vocabulary of the term document matrix. Correlation thresholds ranging from 0.2 to 0.9 are examined and plotted.
The overall frequency distribution of bigrams (n = 2) across the three documents in the corpus is presented later in the report. It too reflects the tightness of the vocabulary coverage: a small percentage of bigrams dominates overall language use across blogs, news, and Twitter feeds.
The word cloud for bigrams visually demonstrates the predominance of a relatively small percentage of bigrams.
The overall frequency distribution of trigrams (n = 3) across the three documents in the corpus is likewise presented later in the report. Again, a small percentage of trigrams dominates overall language use across blogs, news, and Twitter feeds.
The word cloud for trigrams visually demonstrates the predominance of a relatively small percentage of trigrams.
The TermDocumentMatrix of single-word tokens across all the documents in the corpus shows 54% sparseness, i.e., more than half of the term-document cells are zero, so a typical token is absent from at least one of the three documents. When sparse terms are removed with tm::removeSparseTerms() at a sparsity threshold of 0.66 (keeping only terms that appear in at least 2 of the 3 documents), the resulting sparseness drops to 16%, meaning the surviving single-word tokens appear in at least two, and often all three, of the documents.
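For concreteness, the sparsity figures quoted above are simply the share of zero cells in the term-document matrix. A minimal sketch of the computation, using the tdm and tdmDense objects built later in the report:
sparsity <- function(x) { m <- as.matrix(x); sum(m == 0) / length(m) }
sparsity(tdm)       # roughly 0.54 for the full single-word tdm
sparsity(tdmDense)  # roughly 0.16 after removeSparseTerms(tdm, 0.66)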
Using the Rgraphviz package, one can visualize correlations between tokens from the term document matrix. Some examples of these visualizations are included in the report for EDA purposes. A sample of about 10 terms is drawn from the full vocabulary of the tdm, and plots are captured for correlation thresholds of 0.2, 0.6, and 0.9.
The following analysis presents the knitr weaving of the imported datasets, the R code, and the accompanying analysis notes.
opts_chunk$set(echo=TRUE,cache.lazy=FALSE)
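# The packages used throughout this report are assumed to be attached here or
# earlier (a sketch; the original list of library() calls is not shown):
library(knitr)        # opts_chunk, kable
library(tm)           # VCorpus, tm_map, TermDocumentMatrix, removeSparseTerms
library(NLP)          # ngrams() and words() used by the n-gram tokenizers
library(ggplot2)      # line-length and cumulative-distribution plots
library(wordcloud)    # word clouds
library(RColorBrewer) # brewer.pal color palettes
library(doParallel)   # registerDoParallel
library(Rgraphviz)    # plot() method for term correlation graphs (Bioconductor)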
To mitigate the long computation time of training with a large number of predictors, let us use the doParallel package and register multiple cores for parallelism:
registerDoParallel(cores=3)
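Registering the cores only helps when downstream computations actually use the parallel backend, for example via foreach's %dopar% operator or caret model training. A quick, hypothetical sanity check that the backend is active:
library(foreach)   # provides %dopar% and getDoParWorkers()
getDoParWorkers()  # should report 3 after the registration above
foreach(i = 1:6, .combine = c) %dopar% sqrt(i)  # trivial task run on the workers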
Sampling of the documents to reduce dimensionality and computation time
setwd("C:\\Users\\pr9115.OPS\\Documents\\R_Practice\\Capstone")
dataFeed <- getwd()
## Delete sample files for blogs, news, and twitter feeds
##sapply(list.files(path="./SampleFeeds",pattern="*.sample.txt",full.names=TRUE),
## function(x) unlink(x, force=TRUE))
#################### Sample the twitter feed text feed ########################
con <- file("en_US.twitter.txt", "r") ### read binary mode
content <- readLines(con, encoding = "UTF-8") ## Read the first line of text
close(con) ## It's important to close the connection when you are done
## Distribution of line lengths in the twitter feed
numLines <- length(content)
## Percent of the lines chosen randomly from the text file with no replacement
samplePercent <- 20.0
sampleSize <- ceiling(numLines*samplePercent/100)
set.seed(1234)
randLineNum <- sample(seq(1,numLines), sampleSize, replace=FALSE)
sample.twitter.txt <- content[randLineNum]
con <- file("./SampleFeeds/en_US.twitter.sample.txt", "w")
writeLines(sample.twitter.txt, con)
close(con)
twitterdf <- as.data.frame(content, stringsAsFactors=FALSE)
twitterdf$Length <- nchar(content)
medianLength <- median(twitterdf$Length)
perc90Length <- quantile(twitterdf$Length,probs=0.90)[[1]]
perc95Length <- quantile(twitterdf$Length,probs=0.95)[[1]]
maxLength <- max(twitterdf$Length)
tp1 <- sprintf("Median = %f", medianLength)
tp2 <- sprintf("95 Percentile = %f", perc95Length)
ggplot(twitterdf) +
geom_freqpoly(aes(Length),color="darkblue",size=2.0,binwidth=5.0) +
geom_vline(aes(xintercept=medianLength),color="orange",size=3.0) +
geom_vline(aes(xintercept=perc95Length),color="red",size=3.0) +
ggtitle("Distribution of line lengths of Twitter Feed") +
theme(axis.title.y=element_text(size=20),axis.title.x=element_text(size=20),
plot.title=element_text(size=25),
axis.text.x=element_text(angle=90,size=20),
axis.text.y=element_text(size=20),
legend.position="none") +
xlab("Number of characters in line") +
annotate(geom="text", x=30, y=60000, label=tp1, color="orange", size=7) +
annotate(geom="text", x=30, y=50000, label=tp2, color="red", size=7)
con <- file("en_US.blogs.txt", "r")
content <- readLines(con,encoding = "UTF-8") ## Read all lines of text
close(con) ## It's important to close the connection when you are done
numLines <- length(content)
sampleSize <- ceiling(numLines*samplePercent/100)
set.seed(1234)
randLineNum <- sample(seq(1,numLines), sampleSize, replace=FALSE)
sample.blogs.txt <- content[randLineNum]
con <- file("./SampleFeeds/en_US.blogs.sample.txt", "w")
writeLines(sample.blogs.txt, con)
close(con)
blogsdf <- as.data.frame(content, stringsAsFactors=FALSE)
blogsdf$Length <- nchar(content)
medianLength <- median(blogsdf$Length)
perc90Length <- quantile(blogsdf$Length,probs=0.90)[[1]]
perc95Length <- quantile(blogsdf$Length,probs=0.95)[[1]]
maxLength <- max(blogsdf$Length)
tp1 <- sprintf("Median = %f", medianLength)
tp2 <- sprintf("95 Percentile = %f", perc95Length)
ggplot(blogsdf) +
geom_freqpoly(aes(Length),color="darkblue",size=2.0,binwidth=5.0) +
geom_vline(aes(xintercept=medianLength),color="orange",size=3.0) +
geom_vline(aes(xintercept=perc95Length),color="red",size=3.0) +
ggtitle("Distribution of line lengths of Blogs Feed") +
theme(axis.title.y=element_text(size=20),axis.title.x=element_text(size=20),
plot.title=element_text(size=25),
axis.text.x=element_text(angle=90,size=20),
axis.text.y=element_text(size=20),
legend.position="none") +
xlab("Number of characters in line") +
coord_cartesian(xlim =c(0,1600)) +
scale_x_continuous(breaks=seq(0,1600,100)) +
annotate(geom="text", x=800, y=20000, label=tp1, color="orange", size=7) +
annotate(geom="text", x=800, y=17000, label=tp2, color="red", size=7)
con <- file("en_US.news.txt", "r")
content <- readLines(con,encoding = "UTF-8") ## Read all lines of text
close(con) ## It's important to close the connection when you are done
numLines <- length(content)
sampleSize <- ceiling(numLines*samplePercent/100)
set.seed(1234)
randLineNum <- sample(seq(1,numLines), sampleSize, replace=FALSE)
sample.news.txt <- content[randLineNum]
con <- file("./SampleFeeds/en_US.news.sample.txt", "w")
writeLines(sample.news.txt, con)
close(con)
newsdf <- as.data.frame(content, stringsAsFactors=FALSE)
newsdf$Length <- nchar(content)
medianLength <- median(newsdf$Length)
perc90Length <- quantile(newsdf$Length,probs=0.90)[[1]]
perc95Length <- quantile(newsdf$Length,probs=0.95)[[1]]
maxLength <- max(newsdf$Length)
tp1 <- sprintf("Median = %f", medianLength)
tp2 <- sprintf("95 Percentile = %f", perc95Length)
ggplot(newsdf) +
geom_freqpoly(aes(Length),color="darkblue",size=2.0,binwidth=5.0) +
geom_vline(aes(xintercept=medianLength),color="orange",size=3.0) +
geom_vline(aes(xintercept=perc95Length),color="red",size=3.0) +
ggtitle("Distribution of line lengths of News Feed") +
theme(axis.title.y=element_text(size=20),axis.title.x=element_text(size=20),
plot.title=element_text(size=25),
axis.text.x=element_text(angle=90,size=20),
axis.text.y=element_text(size=20),
legend.position="none") +
xlab("Number of characters in line") +
coord_cartesian(xlim =c(0,1000)) +
scale_x_continuous(breaks=seq(0,1000,100)) +
annotate(geom="text", x=600, y=10000, label=tp1, color="orange", size=7) +
annotate(geom="text", x=600, y=8000, label=tp2, color="red", size=7)
sampleFeed <- paste(getwd(),"SampleFeeds",sep="/")
usEnCorpus <- VCorpus(DirSource(sampleFeed),
readerControl = list(reader=readPlain))
inspect(usEnCorpus[1:3])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 20647966
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 20370163
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 16271658
## Inspect non-UTF-8 characters by mapping them to byte codes (byteCorpus is used for inspection only)
byteCorpus <- tm_map(usEnCorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
# Replace certain distracting patterns to reduce the computing load
f <- content_transformer(function(x, pattern, replacement) gsub(pattern, " ", x))
profanities <- c("fuck", "fucking", "fucker", "fuckin", "cocksucker")
# usEnCorpus <- tm_map(usEnCorpus, f, "<.+>")
usEnCorpus <- tm_map(usEnCorpus, removeWords, profanities)
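# A fuller profanity list could be loaded from an external file and applied the
# same way (a sketch; "profanity_list.txt" is a hypothetical one-word-per-line file):
# badWords <- readLines("profanity_list.txt", encoding = "UTF-8")
# usEnCorpus <- tm_map(usEnCorpus, removeWords, badWords)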
usEnCorpus <- tm_map(usEnCorpus, stripWhitespace)
usEnCorpus <- tm_map(usEnCorpus, content_transformer(tolower))
usEnCorpus <- tm_map(usEnCorpus, removeWords, stopwords("english"))
usEnCorpus <- tm_map(usEnCorpus, removePunctuation)
usEnCorpus <- tm_map(usEnCorpus, removeNumbers)
# Stemming is not working out very well. Commenting out.
#usEnCorpus <- tm_map(usEnCorpus, stemDocument, language="english")
#corpus.final <- usEnCorpus
inspect(usEnCorpus[1:3])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 14946453
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 15411191
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 11830202
#Compute and save as rda for subsequent loading to reduce the markdown generation time
# tdm <- TermDocumentMatrix(usEnCorpus)
# save(tdm,file="./tdm.rda")
load(file="./tdm.rda")
## Remove sparse terms, i.e., keep only terms that appear in at least 2 of the 3 documents
## This improves coverage with fewer words by dropping words that occur mostly in a single document
tdmDense <- removeSparseTerms(tdm, 0.66)
# Sparseness before
tdm
## <<TermDocumentMatrix (terms: 342982, documents: 3)>>
## Non-/sparse entries: 474805/554141
## Sparsity : 54%
## Maximal term length: 454
## Weighting : term frequency (tf)
#Sparseness after
tdmDense
## <<TermDocumentMatrix (terms: 86825, documents: 3)>>
## Non-/sparse entries: 218648/41827
## Sparsity : 16%
## Maximal term length: 34
## Weighting : term frequency (tf)
# Extract the term frequency matrix from the reduced term document matrix
mat <- as.matrix(tdmDense)
# Aggregate the word frequency from each document and arrange the words in descending order
# of frequency (highest to lowest)
freqSingle <- sort(rowSums(mat),decreasing=TRUE)
# Formulate a data frame of the word frequency and word name for further processing
topWords <- data.frame(word=names(freqSingle),freq=freqSingle,stringsAsFactors = FALSE)
# List the most and least frequent words in the denser tdm
head(topWords, 30)
## word freq
## will will 63005
## said said 61152
## just just 60465
## one one 58153
## like like 53661
## can can 49480
## get get 45212
## time time 43299
## new new 38703
## good good 35586
## now now 35521
## day day 33871
## know know 32262
## love love 31889
## people people 31332
## back back 28210
## see see 27536
## first first 26911
## also also 26182
## make make 26146
## going going 25338
## think think 25217
## last last 24755
## great great 24618
## much much 23756
## year year 22948
## two two 22838
## really really 22645
## way way 22263
## today today 21921
tail(topWords, 25)
## word freq
## ziff ziff 2
## zimmers zimmers 2
## zings zings 2
## zither zither 2
## zitos zitos 2
## ziyi ziyi 2
## zloty zloty 2
## zodin zodin 2
## zofia zofia 2
## zoie zoie 2
## zoneread zoneread 2
## zoological zoological 2
## zoomzoom zoomzoom 2
## zork zork 2
## zte zte 2
## zucchet zucchet 2
## zud zud 2
## zumbad zumbad 2
## zumbathon zumbathon 2
## zupan zupan 2
## zur zur 2
## zygmunt zygmunt 2
## zygotes zygotes 2
## zylstra zylstra 2
## zyrtec zyrtec 2
require(wordcloud)
set.seed(1234)
wordcloud(words=topWords$word, freq=topWords$freq, min.freq=1,
max.words=500, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8,"Dark2"))
Token Correlations in TermDocumentMatrix (tdm)
Using the Rgraphviz package, one can visualize correlations between tokens from the term document matrix. Here are some examples of these visualizations for EDA purposes. A sample of about 10 terms is drawn from the full vocabulary of the tdm, and the plots are captured for correlation thresholds of 0.2, 0.6, and 0.9.
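As a non-graphical complement (a sketch; the term and threshold are illustrative), tm's findAssocs() lists the terms whose frequencies across the documents correlate with a given term above a threshold:
findAssocs(tdm, "said", 0.9)  # terms correlated with "said" at r >= 0.9
With only three documents the correlations are coarse, but the listing is a quick cross-check of the correlation graphs below.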
set.seed(12347)
plot(tdm, terms=as.vector(topWords$word[200:210]),corThreshold = 0.2, weighting = TRUE)
set.seed(123478)
plot(tdm, terms=as.vector(topWords$word[200:210]),corThreshold = 0.6, weighting = TRUE)
set.seed(12364)
plot(tdm, terms=as.vector(topWords$word[200:210]),corThreshold = 0.9, weighting = TRUE)
Token Correlations in Reduced TermDocumentMatrix (tdmDense)
A sample of about 10 terms is drawn from the full vocabulary of tdmDense, and the plots are captured for correlation thresholds of 0.6 and 0.9.
set.seed(12345)
plot(tdmDense, terms=as.vector(topWords$word[300:310]),corThreshold = 0.6, weighting = TRUE)
set.seed(12345)
plot(tdmDense, terms=as.vector(topWords$word[300:310]),corThreshold = 0.9, weighting = TRUE)
Cumulative Distribution of Word frequencies from tdm
topWords$Percent <- topWords$freq / sum(topWords$freq)
topWords$CumPercent <- cumsum(topWords$Percent)
topWords$RowNum=seq(1,nrow(topWords),1)
# Check the frequency of the most frequent words from the word cloud
inspect(tdm[as.vector(topWords[topWords$RowNum <= 30,]$word),])
## <<TermDocumentMatrix (terms: 30, documents: 3)>>
## Non-/sparse entries: 90/0
## Sparsity : 0%
## Maximal term length: 6
## Weighting : term frequency (tf)
##
## Docs
## Terms en_US.blogs.sample.txt en_US.news.sample.txt
## will 11263 10926
## said 3672 25417
## just 10000 5285
## one 12321 8501
## like 9723 4853
## can 9940 5857
## get 7058 4393
## time 8771 5213
## new 5222 6939
## now 5908 3626
## good 4752 2876
## day 4901 2877
## know 5802 2422
## love 4435 876
## people 5760 4812
## back 5101 3317
## see 4961 2110
## first 5029 5223
## also 5675 5816
## make 5036 3247
## going 4006 3177
## think 4819 2504
## last 3446 5178
## great 2963 1578
## much 4860 2599
## two 3890 5808
## year 3056 5656
## really 4845 1866
## today 2253 1520
## way 4641 2743
## Docs
## Terms en_US.twitter.sample.txt
## will 9507
## said 1677
## just 15087
## one 8261
## like 12191
## can 9019
## get 11196
## time 7633
## new 7090
## now 8332
## good 10147
## day 8978
## know 8010
## love 10540
## people 5096
## back 5714
## see 6668
## first 3141
## also 1558
## make 4675
## going 5553
## think 5341
## last 3711
## great 7627
## much 4386
## two 1732
## year 2683
## really 4517
## today 7272
## way 3644
# Compare the counts of the most frequent words in the reduced tdm (tdmDense) with those from the original tdm above
inspect(tdmDense[as.vector(topWords[topWords$RowNum <= 30,]$word),])
## <<TermDocumentMatrix (terms: 30, documents: 3)>>
## Non-/sparse entries: 90/0
## Sparsity : 0%
## Maximal term length: 6
## Weighting : term frequency (tf)
##
## Docs
## Terms en_US.blogs.sample.txt en_US.news.sample.txt
## will 11263 10926
## said 3672 25417
## just 10000 5285
## one 12321 8501
## like 9723 4853
## can 9940 5857
## get 7058 4393
## time 8771 5213
## new 5222 6939
## now 5908 3626
## good 4752 2876
## day 4901 2877
## know 5802 2422
## love 4435 876
## people 5760 4812
## back 5101 3317
## see 4961 2110
## first 5029 5223
## also 5675 5816
## make 5036 3247
## going 4006 3177
## think 4819 2504
## last 3446 5178
## great 2963 1578
## much 4860 2599
## two 3890 5808
## year 3056 5656
## really 4845 1866
## today 2253 1520
## way 4641 2743
## Docs
## Terms en_US.twitter.sample.txt
## will 9507
## said 1677
## just 15087
## one 8261
## like 12191
## can 9019
## get 11196
## time 7633
## new 7090
## now 8332
## good 10147
## day 8978
## know 8010
## love 10540
## people 5096
## back 5714
## see 6668
## first 3141
## also 1558
## make 4675
## going 5553
## think 5341
## last 3711
## great 7627
## much 4386
## two 1732
## year 2683
## really 4517
## today 7272
## way 3644
Summary stats of Word frequencies from tdm
picsDir <- "./Charts"
xx <- topWords[topWords$CumPercent < 0.990001,]
wcRef <- data.frame(medianWC = max(xx[xx$CumPercent < 0.5001,]$RowNum),
perc90WC = max(xx[xx$CumPercent < 0.9001,]$RowNum),
perc95WC = max(xx[xx$CumPercent < 0.9501,]$RowNum),
stringsAsFactors = FALSE)
kable(wcRef, format="markdown")
| medianWC | perc90WC | perc95WC |
|---|---|---|
| 810 | 8767 | 14084 |
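The same coverage question (how many of the most frequent tokens cover a given share of all token instances) can be computed directly with a small helper; coverageCount is a hypothetical name used only for illustration.
coverageCount <- function(df, p) min(which(df$CumPercent >= p))
coverageCount(topWords, 0.50)   # about 810 words cover half of all word instances
coverageCount(topWords, 0.90)   # about 8767 words cover 90%
coverageCount(topWords, 0.95)   # about 14084 words cover 95%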
Plot of Cumulative Distribution of Word frequencies from tdm
picsDir <- "./Charts"
# Plot single word frequency distribution
jpeg(filename = paste(picsDir,"TopNWords.jpeg",sep="/"),pointsize=6,
width = 1700, height = 1080,quality=100)
ggplot(xx,aes(y=CumPercent,x=RowNum)) +
geom_bar(stat="identity") +
geom_vline(data=wcRef, aes(xintercept=medianWC),color="darkgreen",size=3.0) +
geom_vline(data=wcRef, aes(xintercept=perc90WC),color="orange",size=3.0) +
geom_vline(data=wcRef, aes(xintercept=perc95WC),color="red",size=3.0) +
ggtitle("Cumulative Distribution of Top N words from all feeds") +
theme(axis.title.y=element_text(size=25),axis.title.x=element_text(size=25),
plot.title=element_text(size=35),
axis.text.x=element_text(angle=90,size=20),
axis.text.y=element_text(size=20),
legend.position="none") +
ylab("Cumulative Probability")+xlab("Top N word Count") +
#coord_cartesian(ylim =c(0,1.0),xlim=c(1,max(xx$RowNum))) +
coord_cartesian(ylim =c(0,1.0),xlim=c(0,25000)) +
scale_y_continuous(breaks=seq(0,1.0,0.05)) +
#scale_x_continuous(breaks=seq(1,max(xx$RowNum),100))
scale_x_continuous(breaks=seq(0,25000,1000)) +
geom_text(data=wcRef, aes(x=medianWC+500, y=0.975, label="Median"),size=10) +
geom_text(data=wcRef, aes(x=perc90WC+500, y=0.975, label="90 Percentile"),size=10) +
geom_text(data=wcRef, aes(x=perc95WC+500, y=0.975, label="95 Percentile"),size=10)
dev.off()
## png
## 2
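Because the chart is written to a jpeg file rather than printed inline, it can be pulled back into the knitted report with knitr::include_graphics (a sketch; the path is assumed to be relative to the Rmd file):
knitr::include_graphics(paste(picsDir, "TopNWords.jpeg", sep = "/"))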
####################################################################################
## Bigram tokenization and bigram distribution
####################################################################################
BigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
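# Hypothetical spot check of the tokenizer on a short, made-up line
# (PlainTextDocument gives words() a document class it has a method for):
# BigramTokenizer(PlainTextDocument("thanks for the follow"))
# expected: "thanks for" "for the" "the follow"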
#Compute and save as rda for subsequent loading to reduce the markdown generation time
# tdmBigram <- TermDocumentMatrix(usEnCorpus, control = list(tokenize = BigramTokenizer))
# save(tdmBigram,file="./tdmBigram.rda")
load(file="./tdmBigram.rda")
dim(tdmBigram)
## [1] 6385410 3
## Remove sparse terms, i.e., keep only terms that appear in at least 2 of the 3 documents
## This improves coverage with fewer bigrams by dropping bigrams that occur mostly in a single document
tdmBigramDense <- removeSparseTerms(tdmBigram, 0.66)
dim(tdmBigramDense)
## [1] 679926 3
# Bigram tdm sparseness before removeSparseTerms()
tdmBigram
## <<TermDocumentMatrix (terms: 6385410, documents: 3)>>
## Non-/sparse entries: 7225725/11930505
## Sparsity : 62%
## Maximal term length: 465
## Weighting : term frequency (tf)
# Bigram sparseness after removeSparseTerms()
tdmBigramDense
## <<TermDocumentMatrix (terms: 679926, documents: 3)>>
## Non-/sparse entries: 1520241/519537
## Sparsity : 25%
## Maximal term length: 33
## Weighting : term frequency (tf)
matBigram <- as.matrix(tdmBigramDense)
# Aggregate the bigram frequency from each document and arrange the bigrams in descending order
# of frequency (highest to lowest)
freqBigram <- sort(rowSums(matBigram),decreasing=TRUE)
# Formulate a data frame of the word frequency and word name for further processing
topBigrams <- data.frame(word=names(freqBigram),freq=freqBigram,stringsAsFactors = FALSE)
# List the most and least frequent bigrams in the denser tdm
head(topBigrams, 30)
## word freq
## right now right now 4902
## new york new york 3825
## last year last year 3680
## last night last night 3067
## high school high school 2829
## years ago years ago 2697
## last week last week 2516
## feel like feel like 2507
## first time first time 2378
## looking forward looking forward 2265
## can get can get 2221
## make sure make sure 2095
## looks like looks like 1970
## st louis st louis 1943
## even though even though 1833
## happy birthday happy birthday 1758
## just got just got 1727
## good morning good morning 1673
## new jersey new jersey 1664
## let know let know 1650
## every day every day 1583
## one day one day 1566
## next week next week 1536
## united states united states 1535
## can see can see 1490
## good luck good luck 1478
## look like look like 1423
## said “ said “ 1417
## los angeles los angeles 1384
## two years two years 1364
tail(topBigrams, 25)
## word freq
## zoo safari zoo safari 2
## zoo said zoo said 2
## zoo sent zoo sent 2
## zoo st zoo st 2
## zoo u zoo u 2
## zoo want zoo want 2
## zoo work zoo work 2
## zooey deschanels zooey deschanels 2
## zoology degree zoology degree 2
## zoom across zoom across 2
## zoom feature zoom feature 2
## zoom lense zoom lense 2
## zoomed away zoomed away 2
## zoomed past zoomed past 2
## zoos animals zoos animals 2
## zou bisou zou bisou 2
## zubrus devils zubrus devils 2
## zucchini also zucchini also 2
## zucchini onion zucchini onion 2
## zucchini year zucchini year 2
## zuckerberg company zuckerberg company 2
## zuckerberg will zuckerberg will 2
## zuckerberg worth zuckerberg worth 2
## zumba feel zumba feel 2
## zumba get zumba get 2
require(wordcloud)
set.seed(1234)
wordcloud(words = topBigrams$word, freq = topBigrams$freq, min.freq=1,
max.words=240, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8,"Dark2"))
topBigrams$Percent <- topBigrams$freq / sum(topBigrams$freq)
topBigrams$CumPercent <- cumsum(topBigrams$Percent)
topBigrams$RowNum=seq(1,nrow(topBigrams),1)
# Check the frequency of the most frequent bigrams from the word cloud for comparison with the original tdm
inspect(tdmBigramDense[as.vector(topBigrams[topBigrams$RowNum <= 30,]$word),])
## <<TermDocumentMatrix (terms: 30, documents: 3)>>
## Non-/sparse entries: 89/1
## Sparsity : 1%
## Maximal term length: 15
## Weighting : term frequency (tf)
##
## Docs
## Terms en_US.blogs.sample.txt en_US.news.sample.txt
## right now 974 621
## new york 961 2285
## last year 752 2535
## last night 601 189
## high school 557 1759
## years ago 1027 1300
## last week 705 1343
## feel like 852 326
## first time 805 820
## looking forward 346 160
## can get 616 430
## make sure 771 552
## looks like 502 247
## st louis 57 1794
## even though 952 559
## happy birthday 84 8
## just got 159 92
## good morning 64 20
## new jersey 95 1497
## let know 389 64
## every day 669 472
## one day 634 282
## next week 335 274
## united states 512 942
## can see 886 184
## good luck 122 44
## look like 507 236
## said “ 443 974
## los angeles 162 1034
## two years 337 946
## Docs
## Terms en_US.twitter.sample.txt
## right now 3307
## new york 579
## last year 393
## last night 2277
## high school 513
## years ago 370
## last week 468
## feel like 1329
## first time 753
## looking forward 1759
## can get 1175
## make sure 772
## looks like 1221
## st louis 92
## even though 322
## happy birthday 1666
## just got 1476
## good morning 1589
## new jersey 72
## let know 1197
## every day 442
## one day 650
## next week 927
## united states 81
## can see 420
## good luck 1312
## look like 680
## said “ 0
## los angeles 188
## two years 81
xx <- topBigrams[topBigrams$CumPercent < 0.990001,]
wcRef <- data.frame(medianWC = max(xx[xx$CumPercent < 0.5001,]$RowNum),
perc90WC = max(xx[xx$CumPercent < 0.9001,]$RowNum),
perc95WC = max(xx[xx$CumPercent < 0.9501,]$RowNum),
stringsAsFactors = FALSE)
kable(wcRef, format="markdown")
| medianWC | perc90WC | perc95WC |
|---|---|---|
| 64755 | 452316 | 566235 |
picsDir <- "./Charts"
jpeg(filename = paste(picsDir,"TopNBigrams.jpeg",sep="/"),pointsize=6,
width = 1700, height = 1080,quality=100)
ggplot(xx,aes(y=CumPercent,x=RowNum)) +
geom_bar(stat="identity") +
geom_vline(data=wcRef, aes(xintercept=medianWC),color="darkgreen",size=3.0) +
geom_vline(data=wcRef, aes(xintercept=perc90WC),color="orange",size=3.0) +
geom_vline(data=wcRef, aes(xintercept=perc95WC),color="red",size=3.0) +
ggtitle("Cumulative Distribution of Top N Bigrams from all feeds") +
theme(axis.title.y=element_text(size=25),axis.title.x=element_text(size=25),
plot.title=element_text(size=35),
axis.text.x=element_text(angle=90,size=20),
axis.text.y=element_text(size=20),
legend.position="none") +
ylab("Cumulative Probability")+xlab("Top N Bigram Count") +
#coord_cartesian(ylim =c(0,1.0),xlim=c(1,max(xx$RowNum))) +
coord_cartesian(ylim =c(0,1.0),xlim=c(0,240000)) +
scale_y_continuous(breaks=seq(0,1.0,0.1)) +
#scale_x_continuous(breaks=seq(1,max(xx$RowNum),100))
scale_x_continuous(breaks=seq(0,240000,24000)) +
geom_text(data=wcRef, aes(x=medianWC+500, y=0.975, label="Median"),size=10) +
geom_text(data=wcRef, aes(x=perc90WC+500, y=0.975, label="90 Percentile"),size=10)
dev.off()
## png
## 2
###################################################################################
## Trigram tokenization and trigram distribution
###################################################################################
TrigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
#Compute and save as rda for subsequent loading to reduce the markdown generation time
#tdmTrigram <- TermDocumentMatrix(usEnCorpus, control = list(tokenize = TrigramTokenizer))
#save(tdmTrigram,file="./tdmTrigram.rda")
load(file="./tdmTrigram.rda")
dim(tdmTrigram)
## [1] 10555777 3
## Remove sparse terms, i.e., keep only terms that appear in at least 2 of the 3 documents
## This improves coverage with fewer trigrams by dropping trigrams that occur mostly in a single document
tdmTrigramDense <- removeSparseTerms(tdmTrigram, 0.66)
dim(tdmTrigramDense)
## [1] 107713 3
# Trigram tdm sparseness before removeSparseTerms()
tdmTrigram
## <<TermDocumentMatrix (terms: 10555777, documents: 3)>>
## Non-/sparse entries: 10674917/20992414
## Sparsity : 66%
## Maximal term length: 471
## Weighting : term frequency (tf)
# Trigram tdm sparseness after removeSparseTerms()
tdmTrigramDense
## <<TermDocumentMatrix (terms: 107713, documents: 3)>>
## Non-/sparse entries: 226853/96286
## Sparsity : 30%
## Maximal term length: 39
## Weighting : term frequency (tf)
matTrigram <- as.matrix(tdmTrigramDense)
# Aggregate the trigram frequency from each document and arrange the trigrams in descending order
# of frequency (highest to lowest)
freqTrigram <- sort(rowSums(matTrigram),decreasing=TRUE)
# Formulate a data frame of the word frequency and word name for further processing
topTrigrams <- data.frame(word=names(freqTrigram),freq=freqTrigram,stringsAsFactors = FALSE)
# List the most and least frequent trigrams in the denser tdm
head(topTrigrams, 30)
## word freq
## happy mothers day happy mothers day 680
## new york city new york city 541
## let us know let us know 491
## happy new year happy new year 389
## president barack obama president barack obama 301
## two years ago two years ago 291
## new york times new york times 265
## cinco de mayo cinco de mayo 255
## looking forward seeing looking forward seeing 216
## world war ii world war ii 209
## gov chris christie gov chris christie 184
## new years eve new years eve 181
## first time since first time since 164
## will take place will take place 162
## three years ago three years ago 159
## st patricks day st patricks day 150
## just got back just got back 147
## love love love love love love 141
## please please please please please please 139
## five years ago five years ago 133
## two weeks ago two weeks ago 133
## happy valentines day happy valentines day 132
## martin luther king martin luther king 130
## osama bin laden osama bin laden 126
## four years ago four years ago 125
## long time ago long time ago 121
## really looking forward really looking forward 117
## past two years past two years 114
## look forward seeing look forward seeing 113
## couple weeks ago couple weeks ago 112
tail(topTrigrams, 30)
## word freq
## younger brother th younger brother th 2
## younger get free younger get free 2
## younger kids may younger kids may 2
## younger years still younger years still 2
## youngest daughter currently youngest daughter currently 2
## youth commission can youth commission can 2
## youth family center youth family center 2
## youth human services youth human services 2
## youth soccer game youth soccer game 2
## youth sports league youth sports league 2
## youth system contemporaries youth system contemporaries 2
## youth wrong message youth wrong message 2
## youtube music videos youtube music videos 2
## youtube soon get youtube soon get 2
## yr old lady yr old lady 2
## yr old think yr old think 2
## yr old thought yr old thought 2
## yrs yrs old yrs yrs old 2
## yummy yummy yummy yummy yummy yummy 2
## yums asian bistro yums asian bistro 2
## yup just got yup just got 2
## yves st laurent yves st laurent 2
## zest lemon juice zest lemon juice 2
## zimmerman stalked son zimmerman stalked son 2
## zip lock bags zip lock bags 2
## zone across street zone across street 2
## zone sports bar zone sports bar 2
## zoo safari park zoo safari park 2
## zou bisou bisou zou bisou bisou 2
## zuckerberg worth billion zuckerberg worth billion 2
require(wordcloud)
set.seed(1234)
wordcloud(words = topTrigrams$word, freq = topTrigrams$freq, min.freq=1,
max.words=180, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8,"Dark2"))
topTrigrams$Percent <- topTrigrams$freq / sum(topTrigrams$freq)
topTrigrams$CumPercent <- cumsum(topTrigrams$Percent)
topTrigrams$RowNum=seq(1,nrow(topTrigrams),1)
# Check the frequency of the most frequent trigrams from the word cloud for comparison with the original tdm
inspect(tdmTrigramDense[as.vector(topTrigrams[topTrigrams$RowNum <= 30,]$word),])
## <<TermDocumentMatrix (terms: 30, documents: 3)>>
## Non-/sparse entries: 87/3
## Sparsity : 3%
## Maximal term length: 22
## Weighting : term frequency (tf)
##
## Docs
## Terms en_US.blogs.sample.txt en_US.news.sample.txt
## happy mothers day 17 0
## new york city 180 288
## let us know 45 16
## happy new year 35 2
## president barack obama 18 278
## two years ago 68 209
## new york times 109 122
## cinco de mayo 27 37
## looking forward seeing 40 10
## world war ii 57 147
## gov chris christie 0 183
## new years eve 44 48
## first time since 20 120
## will take place 28 117
## three years ago 30 123
## st patricks day 28 22
## just got back 19 1
## love love love 38 1
## please please please 10 4
## five years ago 32 93
## two weeks ago 33 94
## happy valentines day 6 0
## martin luther king 17 49
## osama bin laden 39 65
## four years ago 23 99
## long time ago 64 23
## really looking forward 50 7
## past two years 18 93
## look forward seeing 19 1
## couple weeks ago 89 11
## Docs
## Terms en_US.twitter.sample.txt
## happy mothers day 663
## new york city 73
## let us know 430
## happy new year 352
## president barack obama 5
## two years ago 14
## new york times 34
## cinco de mayo 191
## looking forward seeing 166
## world war ii 5
## gov chris christie 1
## new years eve 89
## first time since 24
## will take place 17
## three years ago 6
## st patricks day 100
## just got back 127
## love love love 102
## please please please 125
## five years ago 8
## two weeks ago 6
## happy valentines day 126
## martin luther king 64
## osama bin laden 22
## four years ago 3
## long time ago 34
## really looking forward 60
## past two years 3
## look forward seeing 93
## couple weeks ago 12
xx <- topTrigrams[topTrigrams$CumPercent < 0.990001,]
wcRef <- data.frame(medianWC = max(xx[xx$CumPercent < 0.5001,]$RowNum),
perc90WC = max(xx[xx$CumPercent < 0.9001,]$RowNum),
perc95WC = max(xx[xx$CumPercent < 0.9501,]$RowNum),
stringsAsFactors = FALSE)
kable(wcRef, format="markdown")
| medianWC | perc90WC | perc95WC |
|---|---|---|
| 22971 | 88222 | 97977 |
picsDir <- "./Charts"
jpeg(filename = paste(picsDir,"TopNTrigrams.jpeg",sep="/"),pointsize=6,
width = 1700, height = 1080,quality=100)
ggplot(xx,aes(y=CumPercent,x=RowNum)) +
geom_bar(stat="identity") +
geom_vline(data=wcRef, aes(xintercept=medianWC),color="darkgreen",size=3.0) +
geom_vline(data=wcRef, aes(xintercept=perc90WC),color="orange",size=3.0) +
geom_vline(data=wcRef, aes(xintercept=perc95WC),color="red",size=3.0) +
ggtitle("Cumulative Distribution of Top N Trigrams from all feeds") +
theme(axis.title.y=element_text(size=25),axis.title.x=element_text(size=25),
plot.title=element_text(size=35),
axis.text.x=element_text(angle=90,size=20),
axis.text.y=element_text(size=20),
legend.position="none") +
ylab("Cumulative Probability")+xlab("Top N Trigram Count") +
#coord_cartesian(ylim =c(0,1.0),xlim=c(1,max(xx$RowNum))) +
coord_cartesian(ylim =c(0,1.0),xlim=c(0,40000)) +
scale_y_continuous(breaks=seq(0,1.0,0.05)) +
#scale_x_continuous(breaks=seq(1,max(xx$RowNum),100))
scale_x_continuous(breaks=seq(0,40000,4000)) +
geom_text(data=wcRef, aes(x=medianWC+500, y=0.975, label="Median"),size=10) +
geom_text(data=wcRef, aes(x=perc90WC+500, y=0.975, label="90 Percentile"),size=10) +
geom_text(data=wcRef, aes(x=perc95WC+500, y=0.975, label="95 Percentile"),size=10)
dev.off()
## png
## 2