Peer Graded Assignment 1

Rubric for the Assignment

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm.
Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.
The motivation for this project is to:
  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

A Few Notes on the Data

I have used the data in the en_US folder, and only the blogs file for this project.
The news and twitter files produced warnings when read with readLines():
NEWS dataset
Warning message:
In readLines(news) : incomplete final line found on 'en_US.news.txt'
TWITTER dataset
Warning messages:
1: In readLines(twitter) : line 167155 appears to contain an embedded nul
(similar warnings appear for several other lines)
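
Both warnings can be worked around: reading the news file through a binary-mode connection avoids the incomplete-final-line warning, and readLines() has a skipNul argument that drops embedded nuls. A minimal sketch (using the same filepath prefix as the code below):

# Read news through a binary connection to avoid the incomplete-final-line warning
con  <- file(paste0(filepath, "en_US.news.txt"), open = "rb")
news <- readLines(con, encoding = "UTF-8")
close(con)
# Skip the embedded nul characters in the twitter file
twitter <- readLines(paste0(filepath, "en_US.twitter.txt"), skipNul = TRUE)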

Load the Data

Let us start by loading the data and selecting a subset of it for processing, as the full file is too large to process at once.

library(tm)  # provides VCorpus() and VectorSource()

# BLOGS dataset
blogs <- readLines(paste0(filepath, "en_US.blogs.txt"))
blogs <- blogs[1:9000]  # hard-coded to keep only the first 9000 lines
blogs_corpus <- VCorpus(VectorSource(blogs))
rm(blogs)  # remove the raw vector, which is no longer needed
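
Note that taking the first 9000 lines may bias the sample toward whatever happens to sit at the start of the file. A random sample would be more representative; a sketch, with the 1% rate as an arbitrary choice:

set.seed(1234)  # make the sample reproducible
blogs <- readLines(paste0(filepath, "en_US.blogs.txt"))
blogs_sample <- sample(blogs, size = round(0.01 * length(blogs)))  # ~1% of the lines
blogs_corpus <- VCorpus(VectorSource(blogs_sample))
rm(blogs, blogs_sample)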

First Part of the Assignment

This is summary information for the three files in the ZIP archive:

## FILE                   blogs     news   twitter
## FILE_SIZE (MB)           200      196       159
## LENGTH (lines)         899288    77259   2360148
## LONGEST_LINE           483415    14556   1484357
## TOTAL_WORDS          37334441  2643971  30373792
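
The code that produced this table is not shown here; a minimal sketch of how such a summary can be computed in base R (file_stats is a hypothetical helper, and LONGEST_LINE is assumed to be the index of the longest line):

file_stats <- function(path) {
    lines <- readLines(path, skipNul = TRUE, warn = FALSE)
    c(FILE_SIZE_MB = round(file.size(path) / 1024^2),   # size on disk in megabytes
      LENGTH       = length(lines),                     # number of lines
      LONGEST_LINE = which.max(nchar(lines)),           # index of the longest line (assumed meaning)
      TOTAL_WORDS  = sum(lengths(strsplit(lines, "\\s+"))))
}
sapply(paste0(filepath, c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")), file_stats)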

Sample Corpus

The code (posted in the GitHub repo) creates the corpus from the sample data and removes variables that are no longer needed from the environment, to free up memory for the later operations, which are very memory-intensive.
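
A minimal sketch of a typical tm cleaning pipeline, under the assumption that the standard transformations were applied:

# Assumed reconstruction of the cleaning step, not the exact code
blogs_corpus <- tm_map(blogs_corpus, content_transformer(tolower))  # lower-case everything
blogs_corpus <- tm_map(blogs_corpus, removePunctuation)
blogs_corpus <- tm_map(blogs_corpus, removeNumbers)
blogs_corpus <- tm_map(blogs_corpus, stripWhitespace)
blogs_corpus <- tm_map(blogs_corpus, removeWords, stopwords("english"))  # optional: drop stopwords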

Finally

The plots and word clouds for each tokenization follow. The helper function used to build a word-frequency data frame from a term-document matrix:

# Build a word-frequency data frame from a term-document matrix
corpusToDF <- function(theCorpus) {
    m <- as.matrix(theCorpus)                 # densify the TDM (memory-heavy for large inputs)
    v <- sort(rowSums(m), decreasing = TRUE)  # total frequency of each term
    data.frame(word = names(v), freq = v)
}
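
The term-document matrices blogs_1, blogs_2, and blogs_3 used below are built in the GitHub code; a plausible construction with tm and RWeka (the tokenizer choice here is an assumption):

library(RWeka)  # provides NGramTokenizer() and Weka_control()
biTok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triTok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
blogs_1 <- TermDocumentMatrix(blogs_corpus)                                     # unigrams (default tokenizer)
blogs_2 <- TermDocumentMatrix(blogs_corpus, control = list(tokenize = biTok))   # bigrams
blogs_3 <- TermDocumentMatrix(blogs_corpus, control = list(tokenize = triTok))  # trigrams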

Unigram Tokenization

library(wordcloud)     # for wordcloud()
library(RColorBrewer)  # for brewer.pal()
d1 <- corpusToDF(blogs_1)
# Bar plot of the ten most frequent unigrams
barplot(d1[1:10, ]$freq, las = 2, names.arg = d1[1:10, ]$word, col = "lightblue", main = "Most frequent words", ylab = "Word frequencies")
# Word cloud of up to 200 words that appear at least 40 times
wordcloud(words = d1$word, freq = d1$freq, min.freq = 40, max.words = 200, random.order = TRUE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

Bigram Tokenization

This step fails with an out-of-memory error, so I did not execute it:
Error: cannot allocate vector of size 11.1 Gb
Execution halted
d2 <- corpusToDF(blogs_2)
barplot(d2[1:10, ]$freq, las = 2, names.arg = d2[1:10, ]$word, col = "lightblue", main = "Most frequent bigrams", ylab = "Bigram frequencies")
wordcloud(words = d2$word, freq = d2$freq, min.freq = 40, max.words = 200, random.order = TRUE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
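
The allocation failure comes from as.matrix() inside corpusToDF(), which turns the sparse term-document matrix into a dense one. A sketch of a variant that stays sparse, using slam::row_sums() (slam is installed as a dependency of tm):

library(slam)  # sparse-matrix arithmetic
# Variant of corpusToDF() that never densifies the matrix
corpusToDF_sparse <- function(theTDM) {
    v <- sort(slam::row_sums(theTDM), decreasing = TRUE)
    data.frame(word = names(v), freq = v)
}
d2 <- corpusToDF_sparse(blogs_2)  # should fit in memory even for large matrices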

Trigram Tokenization

This snippet fails the same way, so I did not execute it either:
Error: cannot allocate vector of size 11.8 Gb
Execution halted
d3 <- corpusToDF(blogs_3)
barplot(d3[1:10, ]$freq, las = 2, names.arg = d3[1:10, ]$word, col = "lightblue", main = "Most frequent trigrams", ylab = "Trigram frequencies")
wordcloud(words = d3$word, freq = d3$freq, min.freq = 40, max.words = 200, random.order = TRUE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

I have posted the full code in my GitHub repo.
Please leave any suggestions or comments there.