NEWS dataset: Warning message: In readLines(news) : incomplete final line found on 'en_US.news.txt'
TWITTER dataset: Warning messages: 1: In readLines(twitter) : line 167155 appears to contain an embedded nul (similar warnings were raised for various other lines)
Let us start by loading the data and selecting a part of it for processing, as the full dataset is too large to process at once.
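# BLOGS dataset
library(tm) # provides VCorpus() and VectorSource()
# filepath must end with a path separator, since the file name is appended directly
blogs <- readLines(paste0(filepath, "en_US.blogs.txt"))
# keep only the first 9000 lines of blogs (hard-coded sample size)
blogs <- blogs[1:9000]
blogs_corpus <- VCorpus(VectorSource(blogs))
rm(blogs) # remove variable no longer needed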
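The readLines() warnings reported for the news and twitter files (incomplete final line, embedded nul) can usually be avoided by reading through a binary-mode connection with skipNul = TRUE. The sketch below shows one way to do that on a small temporary file; it is not necessarily how the original analysis read the data, and read_lines_safely and the demo file are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: read a text file without the "embedded nul" and
# "incomplete final line" warnings.
read_lines_safely <- function(path) {
  con <- file(path, open = "rb")   # binary mode avoids text-mode surprises
  on.exit(close(con))              # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Demo file (an assumption, not part of the dataset): it contains an
# embedded nul byte and has no newline after the last line.
tmp <- tempfile(fileext = ".txt")
out <- file(tmp, open = "wb")
writeBin(c(charToRaw("first line\nsec"), as.raw(0),
           charToRaw("ond line\nlast line")), out)
close(out)

demo_lines <- read_lines_safely(tmp)
print(demo_lines)
```

With the real data, the call would look something like read_lines_safely(paste0(filepath, "en_US.twitter.txt")).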
This is the information regarding the files in the ZIP folder:
##
## FILE             blogs    news  twitter
## FILE_SIZE          200     196      159
## LENGTH          899288   77259  2360148
## LONGEST_LINE    483415   14556  1484357
## TOTAL_WORDS   37334441 2643971 30373792
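Summary figures like the ones in this table can be computed with base R alone. The following is a minimal sketch, assuming that the file size is reported in MB, LENGTH is the number of lines, LONGEST_LINE is the character count of the longest line, and words are whitespace-separated tokens; the original report may have used different definitions. file_stats and the demo file are hypothetical.

```r
# Hypothetical helper: one row of summary statistics for a text file.
file_stats <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # count runs of non-whitespace characters on each line
  words_per_line <- lengths(regmatches(lines, gregexpr("\\S+", lines)))
  data.frame(
    FILE_SIZE    = file.size(path) / 1024^2,  # size in MB (assumed unit)
    LENGTH       = length(lines),             # number of lines
    LONGEST_LINE = max(nchar(lines)),         # characters in the longest line
    TOTAL_WORDS  = sum(words_per_line)
  )
}

# Small demo file standing in for en_US.blogs.txt and friends
demo <- tempfile(fileext = ".txt")
writeLines(c("a short line", "a somewhat longer line of text", ""), demo)
stats <- file_stats(demo)
print(stats)
```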
The loading code above creates the corpus from the sample data and removes variables that are no longer needed from the environment, freeing up memory for later operations, which are very memory-intensive.
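Taking the first 9000 lines may over-represent whatever happens to sit at the start of a file. A reproducible random sample is one possible alternative; the sketch below is only a suggestion, with sample_lines, the seed value, and the toy input all being assumptions made for illustration.

```r
# Hypothetical helper: draw n lines at random, keeping their original order.
sample_lines <- function(lines, n, seed = 1234) {
  set.seed(seed)                      # fixed seed so the sample is reproducible
  n <- min(n, length(lines))          # never ask for more lines than exist
  lines[sort(sample.int(length(lines), n))]
}

all_lines <- paste("line", 1:100)     # toy stand-in for the full blogs vector
sampled <- sample_lines(all_lines, 10)
rm(all_lines)                         # free the full vector once sampled
print(sampled)
```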
The various plots and their word clouds follow.
Function used:
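corpusToDF <- function(theCorpus) {
  # theCorpus is presumably a term-document matrix (terms in rows)
  m <- as.matrix(theCorpus)
  # total frequency of each term, most frequent first
  v <- sort(rowSums(m), decreasing = TRUE)
  return(data.frame(word = names(v), freq = v))
}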
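Note that as.matrix() converts the sparse term-document matrix into a dense one, which for large vocabularies can require many gigabytes and fail with a memory allocation error. A possible workaround, sketched below, is to sum the rows directly on the sparse representation with slam::row_sums (slam is installed as a dependency of tm). This assumes the input is a tm TermDocumentMatrix; corpusToDF_sparse and the toy documents are hypothetical.

```r
library(tm)

# Hypothetical variant of corpusToDF that never builds a dense matrix.
corpusToDF_sparse <- function(tdm) {
  # row_sums works on the sparse triplet form used by TermDocumentMatrix
  v <- sort(slam::row_sums(tdm), decreasing = TRUE)
  data.frame(word = names(v), freq = v,
             row.names = NULL, stringsAsFactors = FALSE)
}

# Toy corpus standing in for the real blogs sample
docs <- c("the cat sat", "the dog sat", "the cat ran")
toy_tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))
toy_df <- corpusToDF_sparse(toy_tdm)
print(toy_df)
```

If memory is still tight, tm::removeSparseTerms() can be applied to the matrix first to drop very rare terms.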
d1 <- corpusToDF(blogs_1)
barplot(d1[1:10, ]$freq, las = 2, names.arg = d1[1:10, ]$word, col ="lightblue", main = "Most frequent words", ylab = "Word frequencies")
wordcloud(words = d1$word, freq = d1$freq, min.freq = 40, max.words = 200, random.order = TRUE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
Error: cannot allocate vector of size 11.1 Gb Execution halted
d2 <- corpusToDF(blogs_2)
barplot(d2[1:10, ]$freq, las = 2, names.arg = d2[1:10, ]$word, col ="lightblue", main = "Most frequent words", ylab = "Word frequencies")
wordcloud(words = d2$word, freq = d2$freq, min.freq = 40, max.words = 200, random.order = TRUE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
Error: cannot allocate vector of size 11.8 Gb Execution halted
d3 <- corpusToDF(blogs_3)
barplot(d3[1:10, ]$freq, las = 2, names.arg = d3[1:10, ]$word, col ="lightblue", main = "Most frequent words", ylab = "Word frequencies")
wordcloud(words = d3$word, freq = d3$freq, min.freq = 40, max.words = 200, random.order = TRUE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
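Since the same barplot call is repeated for each data frame, a small helper can reduce copy-paste mistakes. The sketch below wraps the barplot settings used above in one function; plotTopWords and the toy data frame are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: bar chart of the n most frequent words in df,
# where df has the columns word and freq (as returned by corpusToDF).
plotTopWords <- function(df, n = 10, title = "Most frequent words") {
  top <- head(df[order(df$freq, decreasing = TRUE), ], n)
  barplot(top$freq, las = 2, names.arg = top$word, col = "lightblue",
          main = title, ylab = "Word frequencies")
  invisible(top)   # return the plotted rows for inspection
}

# Toy data standing in for d1, d2 or d3
toy <- data.frame(word = c("the", "cat", "sat", "dog"),
                  freq = c(3, 2, 2, 1), stringsAsFactors = FALSE)
top2 <- plotTopWords(toy, n = 2)
```

With the real data, the calls would be plotTopWords(d1), plotTopWords(d2) and plotTopWords(d3).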
I have posted the equivalent code on my GitHub repo.
For any suggestions or comments, please comment there.