The 3 data sets together contain 104,936,031 words, which is far too much to accommodate in memory at once. We therefore need a detailed word analysis to help us search for a good solution: a prediction algorithm that performs well and fits in the memory available to the Shiny server. First we apply some transformations and cleansing to the data.
Data Transformations and Cleansing
We adopted an approach of marking selected punctuation (period, comma, exclamation mark, smileys, etc.), because we think it denotes the end of a thought: the next word usually has no relationship with the word before the punctuation. Punctuation also marks the beginning of a new phrase, so it is important to the flow of words. For these reasons we treat selected punctuation as a special kind of word. This is reflected in the cleansing and transformation function below.
cleanData <- function(data) {
  library(tm)
  data <- tolower(data) # convert to lowercase
  data <- removeNumbers(data) # remove numbers
  # regex of the selected punctuation and smileys ("pontuacao" = punctuation)
  pontuacao <- '[.,!:;?]|:-\\)|:-\\(|:\\)|:\\(|:D|=D|8\\)|:\\*|=\\*|:x|:X|:o|:O|:~\\(|T\\.T|Y\\.Y|S2|<3|:B|=B|=3|:3'
  data <- gsub(pontuacao," END ",data) # replace selected punctuation (including smileys) with the word END
  data <- gsub("$"," END",data) # make sure every line ends with an END
  data <- gsub("\\b(\\w+)\\s+\\1\\b","\\1",data) # remove duplicate words in sequence (e.g. "that that")
  data <- gsub("\\b(\\w+)\\s+\\1\\b","\\1",data) # applied again to collapse longer runs of repeated words
  data <- gsub("\\b(\\w+)\\s+\\1\\b","\\1",data) # and a third time, for even longer runs
  data <- removePunctuation(data) # remove all other punctuation
  data <- stripWhitespace(data) # collapse excess white space
  data <- gsub("^[[:space:]]","",data) # make sure lines don't begin with a space
  data <- gsub("[[:space:]]$","",data) # make sure lines don't end with a space
  data # return the cleaned text
}
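For illustration, a quick sanity check of the function on a made-up sentence (the example string below is ours, not taken from the corpus) shows how punctuation becomes the END token and repeated words are collapsed:

cleanData("I really really loved it! Great price: only 10 dollars.")
# should return roughly: "i really loved it END great price END only dollars END"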
blogsUS <- cleanData(blogsUS)
save(file="blogsUS-clean.rdata",blogsUS)
newsUS <- cleanData(newsUS)
save(file="newsUS-clean.rdata",newsUS)
twitterUS <- cleanData(twitterUS)
save(file="twitterUS-clean.rdata",twitterUS)
Still as part of the transformation step, we split each data set into a vector of words.
library(stringr) # str_split comes from stringr
blogsUS <- unlist(str_split(blogsUS,"\\W+"))
newsUS <- unlist(str_split(newsUS,"\\W+"))
twitterUS <- unlist(str_split(twitterUS,"\\W+"))
save(file="blogsUS-words.rdata",blogsUS)
save(file="newsUS-words.rdata",newsUS)
save(file="twitterUS-words.rdata",twitterUS)
Next we compute statistics on the words in each file, and also on the concatenation of all of them.
words.blogsUS <- sort(table(blogsUS),decreasing=TRUE) # table with blogsUS word freq
words.newsUS <- sort(table(newsUS),decreasing=TRUE) # table with newsUS word freq
words.twitterUS <- sort(table(twitterUS),decreasing=TRUE) # table with twitterUS word freq
words.all <- sort(table(c(blogsUS,newsUS,twitterUS)),decreasing=TRUE) # all word freq
q.blogsUS <- quantile(words.blogsUS,probs=c(0,25,50,75,80,95,99,100)/100,type=3)
q.newsUS <- quantile(words.newsUS,probs=c(0,25,50,75,80,95,99,100)/100,type=3)
q.twitterUS <- quantile(words.twitterUS,probs=c(0,25,50,75,80,95,99,100)/100,type=3)
q.all <- quantile(words.all,probs=c(0,25,50,75,80,95,99,100)/100,type=3)
print(q.blogsUS)
## 0% 25% 50% 75% 80% 95% 99% 100%
## 1 1 1 4 6 78 824 4451634
print(q.newsUS)
## 0% 25% 50% 75% 80% 95% 99% 100%
## 1 1 2 6 9 114 1179 4499678
print(q.twitterUS)
## 0% 25% 50% 75% 80% 95% 99% 100%
## 1 1 1 3 4 46 572 5186943
print(q.all)
## 0% 25% 50% 75% 80% 95% 99% 100%
## 1 1 1 3 4 59 869 14138255
qqnorm(words.blogsUS,main="Normal Q-Q plot of words in blogsUS")
qqline(words.blogsUS)

qqnorm(words.newsUS,main="Normal Q-Q plot of words in newsUS")
qqline(words.newsUS)

qqnorm(words.twitterUS,main="Normal Q-Q plot of words in twitterUS")
qqline(words.twitterUS)

qqnorm(words.all,main="Normal Q-Q plot of words in all files")
qqline(words.all)

hist(words.all)

head(words.all)
##
## END the to and a of
## 14138255 4763777 2753690 2409871 2402467 2005693
tail(words.all)
##
## энергетику юге южного южной я як
## 1 1 1 1 1 1
words99 <- words.all[words.all>=q.all['99%']] # all word freq above the 99% quantile
hist(words99)

head(words99)
##
## END the to and a of
## 14138255 4763777 2753690 2409871 2402467 2005693
tail(words99)
##
## disc executed implications precise satisfy
## 869 869 869 869 869
## sweeney
## 869
sum(words99)/sum(words.all)
## [1] 0.914251
total.words99 <- length(words99) # total of unique words in all 3 data sets above 99% quantile
total.words <- length(words.all) # total of unique words in all 3 data sets
stotal.words99 <- format(total.words99,big.mark=",",small.mark=",",small.interval=3)
stotal.words <- format(total.words,big.mark=",",small.mark=",",small.interval=3)
p99 <- total.words99/total.words
stotal.per <- format(p99,digits=3,big.mark=",",small.mark=",",small.interval=3)
As we can see, the word frequencies are heavily right-skewed. With a vocabulary consisting only of the words above the 99% quantile (7,889 unique words), we cover 91.4251022% of all the word occurrences in the text. In other words :-) , using only 1% of the unique words we can cover about 91% of the text. This reduced vocabulary could be the difference between success and failure in fitting the prediction model into the available memory.
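To explore this further, here is a minimal sketch (the coverage helper and the cutoffs below are ours, assuming words.all sorted by decreasing frequency as computed above) of how text coverage grows with vocabulary size, which is what ultimately determines how small the prediction model can be:

coverage <- function(word.freq, n) {
  # fraction of all word occurrences covered by the n most frequent words
  sum(head(word.freq, n)) / sum(word.freq)
}
sapply(c(1000, 10000, 50000), function(n) coverage(words.all, n)) # coverage of the top 1k, 10k and 50k words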