Overview:

This is a milestone report demonstrating preparation for developing a prediction algorithm for the word prediction problem. The report summarizes the basic steps used to preprocess and conduct exploratory data analysis (EDA) of a U.S. English language corpus comprising three documents: blogs, news, and Twitter text. The preprocessing steps comprise sampling each document, converting problematic characters, removing profanity, stop words, punctuation, and numbers, lower-casing, stripping extra whitespace, and building term-document matrices.

Because the TermDocumentMatrix becomes very large for the three documents in our corpus, the removeSparseTerms() method is applied. This tends to reduce the matrix substantially without losing significant relations in the matrix. Based on the computation of the TermDocumentMatrix and the reduction of its sparsity, the report presents preliminary statistics on the following:

  1. Distribution of line lengths for the three documents in the corpus, along with other basic statistics
  2. Document coverage versus number of key words
  3. n-gram frequency distributions (single-word tokens, bigrams, and trigrams) across all three documents in the corpus
  4. Word clouds for single-word tokens, bigrams, and trigrams, visually demonstrating the coverage achieved by a small proportion of the n-grams
  5. Exploration of the markovchain package for understanding the prediction model (a minimal sketch of the underlying idea follows this list)
  6. Correlations between tokens from the term-document matrix, visualized with the Rgraphviz package; these provide good visualization of token correlations for EDA purposes
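
The prediction model being explored is, at its core, a first-order Markov chain over words: given the current word, the next word is chosen from the conditional frequencies observed in the bigram counts. The following is a minimal base-R sketch of that idea on hypothetical tokens; it does not use the markovchain package API and only illustrates the kind of chain that package would estimate.

# Minimal sketch (hypothetical tokens): next-word prediction from bigram counts
tokens <- c("thanks", "for", "the", "follow", "thanks", "for", "the", "support")
bigrams <- data.frame(w1 = head(tokens, -1), w2 = tail(tokens, -1))
transitions <- table(bigrams$w1, bigrams$w2)   # rows: current word, cols: next word
predictNext <- function(word) names(which.max(transitions[word, ]))
predictNext("for")   # "the"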

Preliminary Observations

  1. The Twitter document has relatively shorter lines (in number of characters), with a tight spread between the median and 95th percentile line lengths (median of 64 characters and a 95th percentile length of 132 characters). The blogs and news documents have much higher median line lengths and a wide spread between the median and the 95th percentile lengths.

  2. The overall length of each document in the corpus is summarized in terms of UTF-8 characters, words, and lines (a sketch of how these counts can be computed appears after this list).

  3. The overall frequency distribution of single-word tokens across the three documents in the corpus reflects the tightness of the vocabulary coverage: a relatively small percentage of words dominates overall language use across blogs, news, and Twitter feeds.

  4. The word cloud for single-word tokens visually demonstrates this predominance of a small percentage of tokens.

  5. As part of data exploration, the Rgraphviz package can be used to visualize the token correlations inherent in the term-document matrix. The package allows the edge weights to be controlled by the correlation between the nodes in the graph. This report compares the original single-word term-document matrix with the one obtained after removing sparse terms, using a subset of about 10 tokens out of all the tokens enumerated in the term-document matrix. Correlation thresholds between 0.2 and 0.9 are examined and plotted.

  6. The overall frequency distribution of bigrams (n = 2) across the three documents in the corpus also reflects the tightness of the vocabulary coverage: a small percentage of bigrams dominates overall language use across blogs, news, and Twitter feeds.

  7. The word cloud for bigrams visually demonstrates the predominance of a relatively small percentage of bigrams.

  8. The overall frequency distribution of trigrams (n = 3) across the three documents in the corpus shows the same pattern: a small percentage of trigrams dominates overall language use across blogs, news, and Twitter feeds.

  9. The word cloud for trigrams visually demonstrates the predominance of a relatively small percentage of trigrams.

  10. The TermDocumentMatrix of single-word tokens across all the documents in the corpus shows 54% sparsity (54% of the term/document cells are zero). When sparse terms are removed with tm::removeSparseTerms() at a threshold of 0.66, keeping only terms that appear in at least 2 of the 3 documents, the resulting sparsity drops to 16%.

  11. Using the Rgraphviz package, one can visualize correlations between tokens from the term-document matrix. Some examples of these visualizations are included in the report for EDA purposes. A subset of about 10 tokens is chosen from the whole term-document matrix, and the plots are captured for correlation thresholds of 0.2, 0.6, and 0.9.
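
The per-document character, word, and line counts referenced in item 2 can be obtained along the following lines. This is a minimal sketch that assumes the three raw feed files sit in the working directory; the word counts come from a simple whitespace split rather than the tm tokenizer.

# Minimal sketch: basic size statistics for each raw feed (assumes the files are
# in the working directory; word counts use a simple whitespace split)
feeds <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
sizeStats <- t(sapply(feeds, function(f) {
  lines <- readLines(f, encoding = "UTF-8", warn = FALSE)
  c(Lines = length(lines),
    Words = sum(lengths(strsplit(lines, "\\s+"))),
    Chars = sum(nchar(lines, type = "chars")))
}))
sizeStats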

References:

  1. tm: Frequently Asked Questions
  2. How does removeSparseTerms in R work?
  3. CRAN Task View: Natural Language Processing
  4. Rgraphviz package (Bioconductor)
  5. Introduction to the tm Package: Text Mining in R
  6. Jurafsky, D. & Martin, J. H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (chapter on N-grams)

Analysis:

The following analysis presents the knitr-woven output: the imported datasets, the R code, and accompanying analysis notes.

library(knitr)   # provides opts_chunk() and kable()
opts_chunk$set(echo=TRUE, cache.lazy=FALSE)

To mitigate the long computation time when training with a large number of predictors, the doParallel package is used to register multiple cores for parallelism:

library(doParallel)   # provides registerDoParallel()
registerDoParallel(cores=3)
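
The registered backend is consumed through the foreach package. The EDA code in this report does not call foreach directly; the following is only a hypothetical illustration of how the registered workers could be used, here counting the lines of the sampled feeds in parallel.

# Hypothetical use of the registered workers (not required by the EDA below)
library(foreach)
lineCounts <- foreach(f = list.files("./SampleFeeds", full.names = TRUE),
                      .combine = c) %dopar% {
  length(readLines(f, encoding = "UTF-8", warn = FALSE))
}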

Data Acquisition, Dimensionality Reduction and Cleaning

Sampling of the documents to reduce dimensionality and computation time

setwd("C:\\Users\\pr9115.OPS\\Documents\\R_Practice\\Capstone")

dataFeed <- getwd()

## Delete sample files for blogs, news, and twitter feeds
##sapply(list.files(path="./SampleFeeds",pattern="*.sample.txt",full.names=TRUE),
##       function(x) unlink(x, force=TRUE))
#################### Sample the twitter feed text feed ########################
con <- file("en_US.twitter.txt", "r") ### read binary mode
content <- readLines(con, encoding = "UTF-8") ## Read the first line of text 
close(con) ## It's important to close the connection when you are done

## Distribution of line lengths in the twitter feed
numLines <- length(content)

## Percent of the lines chosen randomly from the text file with no replacement
samplePercent <- 20.0

sampleSize <- ceiling(numLines*samplePercent/100)

set.seed(1234)
randLineNum <- sample(seq(1,numLines), sampleSize, replace=FALSE)
sample.twitter.txt <- content[randLineNum]

con <- file("./SampleFeeds/en_US.twitter.sample.txt", "w") 
writeLines(sample.twitter.txt, con)
close(con)
twitterdf <- as.data.frame(content, stringsAsFactors=FALSE)
twitterdf$Length <- nchar(content)

medianLength <- median(twitterdf$Length)
perc90Length <- quantile(twitterdf$Length,probs=0.90)[[1]]
perc95Length <- quantile(twitterdf$Length,probs=0.95)[[1]]
maxLength <- max(twitterdf$Length)

tp1 <- sprintf("Median = %f", medianLength)
tp2 <- sprintf("95 Percentile = %f", perc95Length)
ggplot(twitterdf) + 
  geom_freqpoly(aes(Length),color="darkblue",size=2.0,binwidth=5.0) +
  geom_vline(aes(xintercept=medianLength),color="orange",size=3.0) +
  geom_vline(aes(xintercept=perc95Length),color="red",size=3.0) +
  ggtitle("Distribution of line lengths of Twitter Feed") +
  theme(axis.title.y=element_text(size=20),axis.title.x=element_text(size=20),
        plot.title=element_text(size=25),
        axis.text.x=element_text(angle=90,size=20),
        axis.text.y=element_text(size=20),
        legend.position="none") +
    xlab("Number of characters in line") +
    annotate(geom="text", x=30, y=60000, label=tp1, color="orange", size=7) +
    annotate(geom="text", x=30, y=50000, label=tp2, color="red", size=7)

con <- file("en_US.blogs.txt", "r") 
content <- readLines(con,encoding = "UTF-8") ## Read the first line of text 
close(con) ## It's important to close the connection when you are done

numLines <- length(content)

sampleSize <- ceiling(numLines*samplePercent/100)

set.seed(1234)
randLineNum <- sample(seq(1,numLines), sampleSize, replace=FALSE)
sample.blogs.txt <- content[randLineNum]

con <- file("./SampleFeeds/en_US.blogs.sample.txt", "w") 
writeLines(sample.blogs.txt, con)
close(con)
blogsdf <- as.data.frame(content, stringsAsFactors=FALSE)
blogsdf$Length <- nchar(content)

medianLength <- median(blogsdf$Length)
perc90Length <- quantile(blogsdf$Length,probs=0.90)[[1]]
perc95Length <- quantile(blogsdf$Length,probs=0.95)[[1]]
maxLength <- max(blogsdf$Length)

tp1 <- sprintf("Median = %f", medianLength)
tp2 <- sprintf("95 Percentile = %f", perc95Length)
ggplot(blogsdf) + 
  geom_freqpoly(aes(Length),color="darkblue",size=2.0,binwidth=5.0) +
  geom_vline(aes(xintercept=medianLength),color="orange",size=3.0) +
  geom_vline(aes(xintercept=perc95Length),color="red",size=3.0) +
  ggtitle("Distribution of line lengths of Blogs Feed") +
  theme(axis.title.y=element_text(size=20),axis.title.x=element_text(size=20),
        plot.title=element_text(size=25),
        axis.text.x=element_text(angle=90,size=20),
        axis.text.y=element_text(size=20),
        legend.position="none") +
  xlab("Number of characters in line") +
  coord_cartesian(xlim =c(0,1600)) +
  scale_x_continuous(breaks=seq(0,1600,100)) +
  annotate(geom="text", x=800, y=20000, label=tp1, color="orange", size=7) +
  annotate(geom="text", x=800, y=17000, label=tp2, color="red", size=7)

con <- file("en_US.news.txt", "r") 
content <- readLines(con,encoding = "UTF-8") ## Read the first line of text 
close(con) ## It's important to close the connection when you are done

numLines <- length(content)

sampleSize <- ceiling(numLines*samplePercent/100)

set.seed(1234)

randLineNum <- sample(seq(1,numLines), sampleSize, replace=FALSE)
sample.news.txt <- content[randLineNum]

con <- file("./SampleFeeds/en_US.news.sample.txt", "w") 
writeLines(sample.news.txt, con)
close(con)
newsdf <- as.data.frame(content, stringsAsFactors=FALSE)
newsdf$Length <- nchar(content)

medianLength <- median(newsdf$Length)
perc90Length <- quantile(newsdf$Length,probs=0.90)[[1]]
perc95Length <- quantile(newsdf$Length,probs=0.95)[[1]]
maxLength <- max(newsdf$Length)

tp1 <- sprintf("Median = %f", medianLength)
tp2 <- sprintf("95 Percentile = %f", perc95Length)
ggplot(newsdf) + 
  geom_freqpoly(aes(Length),color="darkblue",size=2.0,binwidth=5.0) +
  geom_vline(aes(xintercept=medianLength),color="orange",size=3.0) +
  geom_vline(aes(xintercept=perc95Length),color="red",size=3.0) +
  ggtitle("Distribution of line lengths of News Feed") +
  theme(axis.title.y=element_text(size=20),axis.title.x=element_text(size=20),
        plot.title=element_text(size=25),
        axis.text.x=element_text(angle=90,size=20),
        axis.text.y=element_text(size=20),
        legend.position="none") +
  xlab("Number of characters in line") +
  coord_cartesian(xlim =c(0,1000)) +
  scale_x_continuous(breaks=seq(0,1000,100)) +
  annotate(geom="text", x=600, y=10000, label=tp1, color="orange", size=7) +
  annotate(geom="text", x=600, y=8000, label=tp2, color="red", size=7)

sampleFeed <- paste(getwd(),"SampleFeeds",sep="/")

usEnCorpus <- VCorpus(DirSource(sampleFeed),
                      readerControl = list(reader=readPlain))

inspect(usEnCorpus[1:3])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 20647966
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 20370163
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 16271658
## Create a byte-escaped copy of the corpus to inspect non-UTF-8 characters in the documents
byteCorpus <- tm_map(usEnCorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
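# As a hypothetical illustration of the 'byte' substitution: an invalid UTF-8
# byte is rendered as a hex escape instead of breaking downstream transformations.
# iconv("caf\xe9", from = "UTF-8", to = "UTF-8", sub = "byte")   # returns "caf<e9>"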

# Transformer that replaces a given pattern with a space, to remove distracting
# patterns (e.g., markup) and reduce the computing load
f <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

profanities <- c("fuck", "fucking", "fucker", "fuckin", "cocksucker")

# usEnCorpus <- tm_map(usEnCorpus, f, "<.+>")
usEnCorpus <- tm_map(usEnCorpus, removeWords, profanities)
usEnCorpus <- tm_map(usEnCorpus, stripWhitespace)
usEnCorpus <- tm_map(usEnCorpus, content_transformer(tolower))
usEnCorpus <- tm_map(usEnCorpus, removeWords, stopwords("english"))
usEnCorpus <- tm_map(usEnCorpus, removePunctuation)
usEnCorpus <- tm_map(usEnCorpus, removeNumbers)

# Stemming is not working out very well. Commenting out.
#usEnCorpus <- tm_map(usEnCorpus, stemDocument, language="english")

#corpus.final <- usEnCorpus
inspect(usEnCorpus[1:3])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 14946453
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 15411191
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 11830202
#Compute and save as rda for subsequent loading to reduce the markdown generation time
# tdm <- TermDocumentMatrix(usEnCorpus)
# save(tdm,file="./tdm.rda")

load(file="./tdm.rda")

## Remove sparse terms, i.e., keep only terms present in at least 2 of the 3 documents.
## This improves coverage with fewer words by dropping terms that occur mostly in a single document.
tdmDense <- removeSparseTerms(tdm, 0.66)

# Sparseness before
tdm
## <<TermDocumentMatrix (terms: 342982, documents: 3)>>
## Non-/sparse entries: 474805/554141
## Sparsity           : 54%
## Maximal term length: 454
## Weighting          : term frequency (tf)
#Sparseness after
tdmDense
## <<TermDocumentMatrix (terms: 86825, documents: 3)>>
## Non-/sparse entries: 218648/41827
## Sparsity           : 16%
## Maximal term length: 34
## Weighting          : term frequency (tf)
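# Toy illustration (hypothetical mini-corpus, not part of the analysis above) of
# what removeSparseTerms() does: with 3 documents and sparse = 0.66, a term is
# kept only when it appears in more than 3 * (1 - 0.66) documents, i.e. in at
# least 2 of the 3 documents.
# toyCorpus <- VCorpus(VectorSource(c("alpha beta", "alpha gamma", "alpha beta delta")))
# toyTdm    <- TermDocumentMatrix(toyCorpus)
# inspect(removeSparseTerms(toyTdm, 0.66))   # keeps "alpha", "beta"; drops "gamma", "delta"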
# Convert the reduced term-document matrix into a dense matrix of term frequencies
mat <- as.matrix(tdmDense)

# Aggregate the word frequencies across documents and sort them in descending order
# (highest frequency to lowest frequency)
freqSingle <- sort(rowSums(mat),decreasing=TRUE)

# Formulate a data frame of the word frequency and word name for further processing
topWords <- data.frame(word=names(freqSingle),freq=freqSingle,stringsAsFactors = FALSE)

# List the most and least frequent words in the denser tdm
head(topWords, 30)
##          word  freq
## will     will 63005
## said     said 61152
## just     just 60465
## one       one 58153
## like     like 53661
## can       can 49480
## get       get 45212
## time     time 43299
## new       new 38703
## good     good 35586
## now       now 35521
## day       day 33871
## know     know 32262
## love     love 31889
## people people 31332
## back     back 28210
## see       see 27536
## first   first 26911
## also     also 26182
## make     make 26146
## going   going 25338
## think   think 25217
## last     last 24755
## great   great 24618
## much     much 23756
## year     year 22948
## two       two 22838
## really really 22645
## way       way 22263
## today   today 21921
tail(topWords, 25)
##                  word freq
## ziff             ziff    2
## zimmers       zimmers    2
## zings           zings    2
## zither         zither    2
## zitos           zitos    2
## ziyi             ziyi    2
## zloty           zloty    2
## zodin           zodin    2
## zofia           zofia    2
## zoie             zoie    2
## zoneread     zoneread    2
## zoological zoological    2
## zoomzoom     zoomzoom    2
## zork             zork    2
## zte               zte    2
## zucchet       zucchet    2
## zud               zud    2
## zumbad         zumbad    2
## zumbathon   zumbathon    2
## zupan           zupan    2
## zur               zur    2
## zygmunt       zygmunt    2
## zygotes       zygotes    2
## zylstra       zylstra    2
## zyrtec         zyrtec    2
require(wordcloud)
set.seed(1234)
wordcloud(words=topWords$word, freq=topWords$freq, min.freq=1,
          max.words=500, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8,"Dark2"))

Token Correlations in TermDocumentMatrix (tdm)

Using the Rgraphviz package, one can visualize correlations between tokens from the term-document matrix. Here are some examples of these visualizations for EDA purposes. A subset of about 10 tokens (terms ranked 200-210 by overall frequency) is chosen from the full tdm. The plots are captured for correlation thresholds of 0.2, 0.6, and 0.9.

set.seed(12347)
plot(tdm, terms=as.vector(topWords$word[200:210]),corThreshold = 0.2, weighting = TRUE)

set.seed(123478)
plot(tdm, terms=as.vector(topWords$word[200:210]),corThreshold = 0.6, weighting = TRUE)

set.seed(12364)
plot(tdm, terms=as.vector(topWords$word[200:210]),corThreshold = 0.9, weighting = TRUE)
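
Besides the Rgraphviz plots, the tm package's findAssocs() function provides a tabular view of the terms whose correlation with a chosen term exceeds a threshold. The term "time" below is only an illustrative pick from the frequent-word list; with just three documents the correlations are coarse, so a high limit and head() are used to keep the output short.

# Tabular complement to the correlation graphs (illustrative term choice)
head(findAssocs(tdmDense, "time", corlimit = 0.99)[["time"]], 10)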

Token Correlations in Reduced TermDocumentMatrix (tdmDense)

A subset of about 10 tokens (terms ranked 300-310 by overall frequency) is chosen from tdmDense. The plots are captured for correlation thresholds of 0.6 and 0.9.

set.seed(12345)
plot(tdmDense, terms=as.vector(topWords$word[300:310]),corThreshold = 0.6, weighting = TRUE)

set.seed(12345)
plot(tdmDense, terms=as.vector(topWords$word[300:310]),corThreshold = 0.6, weighting = TRUE)

set.seed(12345)
plot(tdmDense, terms=as.vector(topWords$word[300:310]),corThreshold = 0.9, weighting = TRUE)

Cumulative Distribution of Word frequencies from tdm

topWords$Percent <- topWords$freq / sum(topWords$freq)
topWords$CumPercent <- cumsum(topWords$Percent)
topWords$RowNum=seq(1,nrow(topWords),1)

# Check the frequency of the most frequent words from the word cloud
inspect(tdm[as.vector(topWords[topWords$RowNum <= 30,]$word),])
## <<TermDocumentMatrix (terms: 30, documents: 3)>>
## Non-/sparse entries: 90/0
## Sparsity           : 0%
## Maximal term length: 6
## Weighting          : term frequency (tf)
## 
##         Docs
## Terms    en_US.blogs.sample.txt en_US.news.sample.txt
##   will                    11263                 10926
##   said                     3672                 25417
##   just                    10000                  5285
##   one                     12321                  8501
##   like                     9723                  4853
##   can                      9940                  5857
##   get                      7058                  4393
##   time                     8771                  5213
##   new                      5222                  6939
##   now                      5908                  3626
##   good                     4752                  2876
##   day                      4901                  2877
##   know                     5802                  2422
##   love                     4435                   876
##   people                   5760                  4812
##   back                     5101                  3317
##   see                      4961                  2110
##   first                    5029                  5223
##   also                     5675                  5816
##   make                     5036                  3247
##   going                    4006                  3177
##   think                    4819                  2504
##   last                     3446                  5178
##   great                    2963                  1578
##   much                     4860                  2599
##   two                      3890                  5808
##   year                     3056                  5656
##   really                   4845                  1866
##   today                    2253                  1520
##   way                      4641                  2743
##         Docs
## Terms    en_US.twitter.sample.txt
##   will                       9507
##   said                       1677
##   just                      15087
##   one                        8261
##   like                      12191
##   can                        9019
##   get                       11196
##   time                       7633
##   new                        7090
##   now                        8332
##   good                      10147
##   day                        8978
##   know                       8010
##   love                      10540
##   people                     5096
##   back                       5714
##   see                        6668
##   first                      3141
##   also                       1558
##   make                       4675
##   going                      5553
##   think                      5341
##   last                       3711
##   great                      7627
##   much                       4386
##   two                        1732
##   year                       2683
##   really                     4517
##   today                      7272
##   way                        3644
# Inspect the same words in tdmDense, for comparison with the original (sparser) tdm
inspect(tdmDense[as.vector(topWords[topWords$RowNum <= 30,]$word),])
## <<TermDocumentMatrix (terms: 30, documents: 3)>>
## Non-/sparse entries: 90/0
## Sparsity           : 0%
## Maximal term length: 6
## Weighting          : term frequency (tf)
## 
##         Docs
## Terms    en_US.blogs.sample.txt en_US.news.sample.txt
##   will                    11263                 10926
##   said                     3672                 25417
##   just                    10000                  5285
##   one                     12321                  8501
##   like                     9723                  4853
##   can                      9940                  5857
##   get                      7058                  4393
##   time                     8771                  5213
##   new                      5222                  6939
##   now                      5908                  3626
##   good                     4752                  2876
##   day                      4901                  2877
##   know                     5802                  2422
##   love                     4435                   876
##   people                   5760                  4812
##   back                     5101                  3317
##   see                      4961                  2110
##   first                    5029                  5223
##   also                     5675                  5816
##   make                     5036                  3247
##   going                    4006                  3177
##   think                    4819                  2504
##   last                     3446                  5178
##   great                    2963                  1578
##   much                     4860                  2599
##   two                      3890                  5808
##   year                     3056                  5656
##   really                   4845                  1866
##   today                    2253                  1520
##   way                      4641                  2743
##         Docs
## Terms    en_US.twitter.sample.txt
##   will                       9507
##   said                       1677
##   just                      15087
##   one                        8261
##   like                      12191
##   can                        9019
##   get                       11196
##   time                       7633
##   new                        7090
##   now                        8332
##   good                      10147
##   day                        8978
##   know                       8010
##   love                      10540
##   people                     5096
##   back                       5714
##   see                        6668
##   first                      3141
##   also                       1558
##   make                       4675
##   going                      5553
##   think                      5341
##   last                       3711
##   great                      7627
##   much                       4386
##   two                        1732
##   year                       2683
##   really                     4517
##   today                      7272
##   way                        3644

Summary stats of Word frequencies from tdm

picsDir <- "./Charts"

# Restrict attention to the words that account for 99% of all token occurrences
xx <- topWords[topWords$CumPercent < 0.990001,]

# Number of top-ranked words needed to cover 50%, 90%, and 95% of all token occurrences
wcRef <- data.frame(medianWC = max(xx[xx$CumPercent < 0.5001,]$RowNum),
                    perc90WC = max(xx[xx$CumPercent < 0.9001,]$RowNum),
                    perc95WC = max(xx[xx$CumPercent < 0.9501,]$RowNum),
                    stringsAsFactors = FALSE)

kable(wcRef, format="markdown")
| medianWC| perc90WC| perc95WC|
|--------:|--------:|--------:|
|      810|     8767|    14084|
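
The same coverage question can be answered for an arbitrary coverage level with a small helper. This is a hypothetical convenience function, not part of the analysis above; it simply finds the first rank at which the cumulative share of token occurrences reaches the requested level.

# Hypothetical helper: number of distinct tokens needed for a given coverage level
coverageCount <- function(freqTable, p) {
  cumShare <- cumsum(freqTable$freq) / sum(freqTable$freq)
  which(cumShare >= p)[1]
}
sapply(c(0.50, 0.90, 0.95), coverageCount, freqTable = topWords)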

Plot of Cumulative Distribution of Word frequencies from tdm

picsDir <- "./Charts"
# Plot single word frequency distribution
jpeg(filename = paste(picsDir,"TopNWords.jpeg",sep="/"),pointsize=6,
     width = 1700, height = 1080,quality=100)

ggplot(xx,aes(y=CumPercent,x=RowNum)) + 
  geom_bar(stat="identity") +
  geom_vline(data=wcRef, aes(xintercept=medianWC),color="darkgreen",size=3.0) +
  geom_vline(data=wcRef, aes(xintercept=perc90WC),color="orange",size=3.0) +
  geom_vline(data=wcRef, aes(xintercept=perc95WC),color="red",size=3.0) +
  ggtitle("Cumulative Distribution of Top N words from all feeds") +
  theme(axis.title.y=element_text(size=25),axis.title.x=element_text(size=25),
        plot.title=element_text(size=35),
        axis.text.x=element_text(angle=90,size=20),
        axis.text.y=element_text(size=20),
        legend.position="none") +
  ylab("Cumulative Probability")+xlab("Top N word Count") +
  #coord_cartesian(ylim =c(0,1.0),xlim=c(1,max(xx$RowNum))) +
  coord_cartesian(ylim =c(0,1.0),xlim=c(0,25000)) +
  scale_y_continuous(breaks=seq(0,1.0,0.05)) + 
  #scale_x_continuous(breaks=seq(1,max(xx$RowNum),100))
  scale_x_continuous(breaks=seq(0,25000,1000)) +
  geom_text(data=wcRef, aes(x=medianWC+500, y=0.975, label="Median"),size=10) +
  geom_text(data=wcRef, aes(x=perc90WC+500, y=0.975, label="90 Percentile"),size=10) +
  geom_text(data=wcRef, aes(x=perc95WC+500, y=0.975, label="95 Percentile"),size=10)

dev.off()
## png 
##   2

####################################################################################
## Bigram tokenization and bigram distribution
####################################################################################

# Bigram tokenizer built on words() and ngrams() from the NLP package (attached with tm)
BigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
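# For intuition (hypothetical tokens, outside the tm pipeline): the tokenizer
# slides a window of two words across the token sequence, e.g.
#   toy <- c("thanks", "for", "the", "follow")
#   paste(head(toy, -1), tail(toy, -1))   # "thanks for" "for the" "the follow"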

#Compute and save as rda for subsequent loading to reduce the markdown generation time
# tdmBigram <- TermDocumentMatrix(usEnCorpus, control = list(tokenize = BigramTokenizer))
# save(tdmBigram,file="./tdmBigram.rda")

load(file="./tdmBigram.rda")

dim(tdmBigram)
## [1] 6385410       3
## Remove sparse terms, i.e., keep only bigrams present in at least 2 of the 3 documents.
## This improves coverage with fewer terms by dropping bigrams occurring mostly in a single document.
tdmBigramDense <- removeSparseTerms(tdmBigram, 0.66)

dim(tdmBigramDense)
## [1] 679926      3
# Bigram tdm sparseness before removeSparseTerms()
tdmBigram
## <<TermDocumentMatrix (terms: 6385410, documents: 3)>>
## Non-/sparse entries: 7225725/11930505
## Sparsity           : 62%
## Maximal term length: 465
## Weighting          : term frequency (tf)
# Bigram sparseness after removeSparseTerms()
tdmBigramDense
## <<TermDocumentMatrix (terms: 679926, documents: 3)>>
## Non-/sparse entries: 1520241/519537
## Sparsity           : 25%
## Maximal term length: 33
## Weighting          : term frequency (tf)
matBigram <- as.matrix(tdmBigramDense)

# Aggregate the bigram frequencies across documents and sort them in descending order
# (highest frequency to lowest frequency)
freqBigram <- sort(rowSums(matBigram),decreasing=TRUE)

# Formulate a data frame of the word frequency and word name for further processing
topBigrams <- data.frame(word=names(freqBigram),freq=freqBigram,stringsAsFactors = FALSE)

# List the most and least frequent bigrams in the denser tdm
head(topBigrams, 30)
##                            word freq
## right now             right now 4902
## new york               new york 3825
## last year             last year 3680
## last night           last night 3067
## high school         high school 2829
## years ago             years ago 2697
## last week             last week 2516
## feel like             feel like 2507
## first time           first time 2378
## looking forward looking forward 2265
## can get                 can get 2221
## make sure             make sure 2095
## looks like           looks like 1970
## st louis               st louis 1943
## even though         even though 1833
## happy birthday   happy birthday 1758
## just got               just got 1727
## good morning       good morning 1673
## new jersey           new jersey 1664
## let know               let know 1650
## every day             every day 1583
## one day                 one day 1566
## next week             next week 1536
## united states     united states 1535
## can see                 can see 1490
## good luck             good luck 1478
## look like             look like 1423
## said “                   said “ 1417
## los angeles         los angeles 1384
## two years             two years 1364
tail(topBigrams, 25)
##                                  word freq
## zoo safari                 zoo safari    2
## zoo said                     zoo said    2
## zoo sent                     zoo sent    2
## zoo st                         zoo st    2
## zoo u                           zoo u    2
## zoo want                     zoo want    2
## zoo work                     zoo work    2
## zooey deschanels     zooey deschanels    2
## zoology degree         zoology degree    2
## zoom across               zoom across    2
## zoom feature             zoom feature    2
## zoom lense                 zoom lense    2
## zoomed away               zoomed away    2
## zoomed past               zoomed past    2
## zoos animals             zoos animals    2
## zou bisou                   zou bisou    2
## zubrus devils           zubrus devils    2
## zucchini also           zucchini also    2
## zucchini onion         zucchini onion    2
## zucchini year           zucchini year    2
## zuckerberg company zuckerberg company    2
## zuckerberg will       zuckerberg will    2
## zuckerberg worth     zuckerberg worth    2
## zumba feel                 zumba feel    2
## zumba get                   zumba get    2
require(wordcloud)
set.seed(1234)
wordcloud(words = topBigrams$word, freq = topBigrams$freq, min.freq=1,
          max.words=240, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8,"Dark2"))

topBigrams$Percent <- topBigrams$freq / sum(topBigrams$freq)
topBigrams$CumPercent <- cumsum(topBigrams$Percent)
topBigrams$RowNum=seq(1,nrow(topBigrams),1)

# Check the frequency of the most frequent bigrams from the word cloud, for comparison with the original tdm
inspect(tdmBigramDense[as.vector(topBigrams[topBigrams$RowNum <= 30,]$word),])
## <<TermDocumentMatrix (terms: 30, documents: 3)>>
## Non-/sparse entries: 89/1
## Sparsity           : 1%
## Maximal term length: 15
## Weighting          : term frequency (tf)
## 
##                  Docs
## Terms             en_US.blogs.sample.txt en_US.news.sample.txt
##   right now                          974                   621
##   new york                           961                  2285
##   last year                          752                  2535
##   last night                         601                   189
##   high school                        557                  1759
##   years ago                         1027                  1300
##   last week                          705                  1343
##   feel like                          852                   326
##   first time                         805                   820
##   looking forward                    346                   160
##   can get                            616                   430
##   make sure                          771                   552
##   looks like                         502                   247
##   st louis                            57                  1794
##   even though                        952                   559
##   happy birthday                      84                     8
##   just got                           159                    92
##   good morning                        64                    20
##   new jersey                          95                  1497
##   let know                           389                    64
##   every day                          669                   472
##   one day                            634                   282
##   next week                          335                   274
##   united states                      512                   942
##   can see                            886                   184
##   good luck                          122                    44
##   look like                          507                   236
##   said “                             443                   974
##   los angeles                        162                  1034
##   two years                          337                   946
##                  Docs
## Terms             en_US.twitter.sample.txt
##   right now                           3307
##   new york                             579
##   last year                            393
##   last night                          2277
##   high school                          513
##   years ago                            370
##   last week                            468
##   feel like                           1329
##   first time                           753
##   looking forward                     1759
##   can get                             1175
##   make sure                            772
##   looks like                          1221
##   st louis                              92
##   even though                          322
##   happy birthday                      1666
##   just got                            1476
##   good morning                        1589
##   new jersey                            72
##   let know                            1197
##   every day                            442
##   one day                              650
##   next week                            927
##   united states                         81
##   can see                              420
##   good luck                           1312
##   look like                            680
##   said “                                 0
##   los angeles                          188
##   two years                             81
xx <- topBigrams[topBigrams$CumPercent < 0.990001,]

wcRef <- data.frame(medianWC = max(xx[xx$CumPercent < 0.5001,]$RowNum),
                    perc90WC = max(xx[xx$CumPercent < 0.9001,]$RowNum),
                    perc95WC = max(xx[xx$CumPercent < 0.9501,]$RowNum),
                    stringsAsFactors = FALSE)

kable(wcRef, format="markdown")
| medianWC| perc90WC| perc95WC|
|--------:|--------:|--------:|
|    64755|   452316|   566235|
picsDir <- "./Charts"
jpeg(filename = paste(picsDir,"TopNBigrams.jpeg",sep="/"),pointsize=6,
     width = 1700, height = 1080,quality=100)

ggplot(xx,aes(y=CumPercent,x=RowNum)) + 
  geom_bar(stat="identity") +
  geom_vline(data=wcRef, aes(xintercept=medianWC),color="darkgreen",size=3.0) +
  geom_vline(data=wcRef, aes(xintercept=perc90WC),color="orange",size=3.0) +
  geom_vline(data=wcRef, aes(xintercept=perc95WC),color="red",size=3.0) +
  ggtitle("Cumulative Distribution of Top N Bigrams from all feeds") +
  theme(axis.title.y=element_text(size=25),axis.title.x=element_text(size=25),
        plot.title=element_text(size=35),
        axis.text.x=element_text(angle=90,size=20),
        axis.text.y=element_text(size=20),
        legend.position="none") +
  ylab("Cumulative Probability")+xlab("Top N Bigram Count") +
  #coord_cartesian(ylim =c(0,1.0),xlim=c(1,max(xx$RowNum))) +
  coord_cartesian(ylim =c(0,1.0),xlim=c(0,240000)) +
  scale_y_continuous(breaks=seq(0,1.0,0.1)) + 
  #scale_x_continuous(breaks=seq(1,max(xx$RowNum),100))
  scale_x_continuous(breaks=seq(0,240000,24000)) +
  geom_text(data=wcRef, aes(x=medianWC+500, y=0.975, label="Median"),size=10) +
  geom_text(data=wcRef, aes(x=perc90WC+500, y=0.975, label="90 Percentile"),size=10)

dev.off()
## png 
##   2

###################################################################################
## Trigram tokenization and trigram distribution
###################################################################################

# Trigram tokenizer built on words() and ngrams() from the NLP package (attached with tm)
TrigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

#Compute and save as rda for subsequent loading to reduce the markdown generation time
#tdmTrigram <- TermDocumentMatrix(usEnCorpus, control = list(tokenize = TrigramTokenizer))
#save(tdmTrigram,file="./tdmTrigram.rda")

load(file="./tdmTrigram.rda")

dim(tdmTrigram)
## [1] 10555777        3
## Remove sparse terms, i.e., keep only trigrams present in at least 2 of the 3 documents.
## This improves coverage with fewer terms by dropping trigrams occurring mostly in a single document.
tdmTrigramDense <- removeSparseTerms(tdmTrigram, 0.66)

dim(tdmTrigramDense)
## [1] 107713      3
# Trigram tdm sparseness before removeSparseTerms()
tdmTrigram
## <<TermDocumentMatrix (terms: 10555777, documents: 3)>>
## Non-/sparse entries: 10674917/20992414
## Sparsity           : 66%
## Maximal term length: 471
## Weighting          : term frequency (tf)
# Trigram tdm sparseness after removeSparseTerms()
tdmTrigramDense
## <<TermDocumentMatrix (terms: 107713, documents: 3)>>
## Non-/sparse entries: 226853/96286
## Sparsity           : 30%
## Maximal term length: 39
## Weighting          : term frequency (tf)
matTrigram <- as.matrix(tdmTrigramDense)

# Aggregate the trigram frequencies across documents and sort them in descending order
# (highest frequency to lowest frequency)
freqTrigram <- sort(rowSums(matTrigram),decreasing=TRUE)

# Formulate a data frame of the word frequency and word name for further processing
topTrigrams <- data.frame(word=names(freqTrigram),freq=freqTrigram,stringsAsFactors = FALSE)

# List the most and least frequent trigrams in the denser tdm
head(topTrigrams, 30)
##                                          word freq
## happy mothers day           happy mothers day  680
## new york city                   new york city  541
## let us know                       let us know  491
## happy new year                 happy new year  389
## president barack obama president barack obama  301
## two years ago                   two years ago  291
## new york times                 new york times  265
## cinco de mayo                   cinco de mayo  255
## looking forward seeing looking forward seeing  216
## world war ii                     world war ii  209
## gov chris christie         gov chris christie  184
## new years eve                   new years eve  181
## first time since             first time since  164
## will take place               will take place  162
## three years ago               three years ago  159
## st patricks day               st patricks day  150
## just got back                   just got back  147
## love love love                 love love love  141
## please please please     please please please  139
## five years ago                 five years ago  133
## two weeks ago                   two weeks ago  133
## happy valentines day     happy valentines day  132
## martin luther king         martin luther king  130
## osama bin laden               osama bin laden  126
## four years ago                 four years ago  125
## long time ago                   long time ago  121
## really looking forward really looking forward  117
## past two years                 past two years  114
## look forward seeing       look forward seeing  113
## couple weeks ago             couple weeks ago  112
tail(topTrigrams, 30)
##                                                    word freq
## younger brother th                   younger brother th    2
## younger get free                       younger get free    2
## younger kids may                       younger kids may    2
## younger years still                 younger years still    2
## youngest daughter currently youngest daughter currently    2
## youth commission can               youth commission can    2
## youth family center                 youth family center    2
## youth human services               youth human services    2
## youth soccer game                     youth soccer game    2
## youth sports league                 youth sports league    2
## youth system contemporaries youth system contemporaries    2
## youth wrong message                 youth wrong message    2
## youtube music videos               youtube music videos    2
## youtube soon get                       youtube soon get    2
## yr old lady                                 yr old lady    2
## yr old think                               yr old think    2
## yr old thought                           yr old thought    2
## yrs yrs old                                 yrs yrs old    2
## yummy yummy yummy                     yummy yummy yummy    2
## yums asian bistro                     yums asian bistro    2
## yup just got                               yup just got    2
## yves st laurent                         yves st laurent    2
## zest lemon juice                       zest lemon juice    2
## zimmerman stalked son             zimmerman stalked son    2
## zip lock bags                             zip lock bags    2
## zone across street                   zone across street    2
## zone sports bar                         zone sports bar    2
## zoo safari park                         zoo safari park    2
## zou bisou bisou                         zou bisou bisou    2
## zuckerberg worth billion       zuckerberg worth billion    2
require(wordcloud)
set.seed(1234)
wordcloud(words = topTrigrams$word, freq = topTrigrams$freq, min.freq=1,
          max.words=180, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8,"Dark2"))

topTrigrams$Percent <- topTrigrams$freq / sum(topTrigrams$freq)
topTrigrams$CumPercent <- cumsum(topTrigrams$Percent)
topTrigrams$RowNum=seq(1,nrow(topTrigrams),1)

# Check the frequency of the most frequent trigrams from the word cloud, for comparison with the original tdm
inspect(tdmTrigramDense[as.vector(topTrigrams[topTrigrams$RowNum <= 30,]$word),])
## <<TermDocumentMatrix (terms: 30, documents: 3)>>
## Non-/sparse entries: 87/3
## Sparsity           : 3%
## Maximal term length: 22
## Weighting          : term frequency (tf)
## 
##                         Docs
## Terms                    en_US.blogs.sample.txt en_US.news.sample.txt
##   happy mothers day                          17                     0
##   new york city                             180                   288
##   let us know                                45                    16
##   happy new year                             35                     2
##   president barack obama                     18                   278
##   two years ago                              68                   209
##   new york times                            109                   122
##   cinco de mayo                              27                    37
##   looking forward seeing                     40                    10
##   world war ii                               57                   147
##   gov chris christie                          0                   183
##   new years eve                              44                    48
##   first time since                           20                   120
##   will take place                            28                   117
##   three years ago                            30                   123
##   st patricks day                            28                    22
##   just got back                              19                     1
##   love love love                             38                     1
##   please please please                       10                     4
##   five years ago                             32                    93
##   two weeks ago                              33                    94
##   happy valentines day                        6                     0
##   martin luther king                         17                    49
##   osama bin laden                            39                    65
##   four years ago                             23                    99
##   long time ago                              64                    23
##   really looking forward                     50                     7
##   past two years                             18                    93
##   look forward seeing                        19                     1
##   couple weeks ago                           89                    11
##                         Docs
## Terms                    en_US.twitter.sample.txt
##   happy mothers day                           663
##   new york city                                73
##   let us know                                 430
##   happy new year                              352
##   president barack obama                        5
##   two years ago                                14
##   new york times                               34
##   cinco de mayo                               191
##   looking forward seeing                      166
##   world war ii                                  5
##   gov chris christie                            1
##   new years eve                                89
##   first time since                             24
##   will take place                              17
##   three years ago                               6
##   st patricks day                             100
##   just got back                               127
##   love love love                              102
##   please please please                        125
##   five years ago                                8
##   two weeks ago                                 6
##   happy valentines day                        126
##   martin luther king                           64
##   osama bin laden                              22
##   four years ago                                3
##   long time ago                                34
##   really looking forward                       60
##   past two years                                3
##   look forward seeing                          93
##   couple weeks ago                             12
xx <- topTrigrams[topTrigrams$CumPercent < 0.990001,]

wcRef <- data.frame(medianWC = max(xx[xx$CumPercent < 0.5001,]$RowNum),
                    perc90WC = max(xx[xx$CumPercent < 0.9001,]$RowNum),
                    perc95WC = max(xx[xx$CumPercent < 0.9501,]$RowNum),
                    stringsAsFactors = FALSE)

kable(wcRef, format="markdown")
| medianWC| perc90WC| perc95WC|
|--------:|--------:|--------:|
|    22971|    88222|    97977|
picsDir <- "./Charts"
jpeg(filename = paste(picsDir,"TopNTrigrams.jpeg",sep="/"),pointsize=6,
     width = 1700, height = 1080,quality=100)

ggplot(xx,aes(y=CumPercent,x=RowNum)) + 
  geom_bar(stat="identity") +
  geom_vline(data=wcRef, aes(xintercept=medianWC),color="darkgreen",size=3.0) +
  geom_vline(data=wcRef, aes(xintercept=perc90WC),color="orange",size=3.0) +
  geom_vline(data=wcRef, aes(xintercept=perc95WC),color="red",size=3.0) +
  ggtitle("Cumulative Distribution of Top N Trigrams from all feeds") +
  theme(axis.title.y=element_text(size=25),axis.title.x=element_text(size=25),
        plot.title=element_text(size=35),
        axis.text.x=element_text(angle=90,size=20),
        axis.text.y=element_text(size=20),
        legend.position="none") +
  ylab("Cumulative Probability")+xlab("Top N Trigram Count") +
  #coord_cartesian(ylim =c(0,1.0),xlim=c(1,max(xx$RowNum))) +
  coord_cartesian(ylim =c(0,1.0),xlim=c(0,40000)) +
  scale_y_continuous(breaks=seq(0,1.0,0.05)) + 
  #scale_x_continuous(breaks=seq(1,max(xx$RowNum),100))
  scale_x_continuous(breaks=seq(0,40000,4000)) +
  geom_text(data=wcRef, aes(x=medianWC+500, y=0.975, label="Median"),size=10) +
  geom_text(data=wcRef, aes(x=perc90WC+500, y=0.975, label="90 Percentile"),size=10) +
  geom_text(data=wcRef, aes(x=perc95WC+500, y=0.975, label="95 Percentile"),size=10)

dev.off()
## png 
##   2