This is a milestone report for the Data Science Capstone, part of the Johns Hopkins University Data Science Specialization on Coursera. It covers tokenization, profanity filtering, and exploratory data analysis (EDA) of the English-language training data (blogs, news, and tweets).

Motivation for this Project:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you have amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Criteria:

  1. Does the link lead to an HTML page describing the exploratory analysis of the training data set?
  2. Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
  3. Has the data scientist made basic plots, such as histograms to illustrate features of the data?
  4. Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?

Setting the working directory, unzipping the downloaded data, listing the files, and loading the required libraries:

setwd("D:/R/Class/10Capstone")
##unzip("./dataset.zip")
list.files("./final/en_US/")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
library(stringi)
library(tm)
## Loading required package: NLP
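For completeness, a minimal sketch of how the raw data set could be downloaded and unzipped before the steps above. The URL is assumed to be the standard Coursera-SwiftKey link for this course, and the destination file name (./dataset.zip) matches the commented unzip() call above.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("./dataset.zip")) {
  download.file(url, destfile = "./dataset.zip", mode = "wb")   # binary mode keeps the zip intact on Windows
}
if (!dir.exists("./final")) {
  unzip("./dataset.zip")                                        # extracts the final/ directory used below
}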

Reading each file into R. The news file is opened in binary mode ("rb") because an embedded control character stops readLines() early when that file is read in text mode.

conn <- file("./final/en_US/en_US.blogs.txt","r")
blogs <- readLines(conn,skipNul = TRUE)
close(conn)

conn <- file("./final/en_US/en_US.news.txt","rb")
news <- readLines(conn,skipNul = TRUE)
close(conn)

conn <- file("./final/en_US/en_US.twitter.txt","r")
twitter <- readLines(conn,skipNul = TRUE)
close(conn)

Calculating the file sizes (in megabytes):

blogsize <- file.info("./final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
blogsize
## [1] 200.4242
newssize <- file.info("./final/en_US/en_US.news.txt")$size / 1024 ^ 2
newssize
## [1] 196.2775
tweetsize <- file.info("./final/en_US/en_US.twitter.txt")$size/1024 ^ 2
tweetsize
## [1] 159.3641

Calculating the line counts and word counts:

bloglen <- length(blogs)
bloglen
## [1] 899288
newslen <- length(news)
newslen
## [1] 1010242
tweetlen <- length(twitter)
tweetlen
## [1] 2360148
wordsblog <- stri_count_words(blogs)
wordsnews <- stri_count_words(news)
wordstweets <- stri_count_words(twitter)

Summary of the Data:

Combining the file sizes (MB), line counts, and word counts into a single summary table:

sumryofdata <- data.frame(source = c("Blogs","NEWS","Tweets"),
                          size = c(blogsize,newssize,tweetsize),
                          NumberOfLines = c(bloglen,newslen,tweetlen),
                          NumberOfWords = c(sum(wordsblog),sum(wordsnews),sum(wordstweets)),
                          MeanOfNumberOfWordsPerLine = c(mean(wordsblog),mean(wordsnews),mean(wordstweets)))
sumryofdata
##   source     size NumberOfLines NumberOfWords MeanOfNumberOfWordsPerLine
## 1  Blogs 200.4242        899288      38154238                   42.42716
## 2   NEWS 196.2775       1010242      35010782                   34.65584
## 3 Tweets 159.3641       2360148      30218166                   12.80350
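Criterion 3 mentions histograms; a minimal sketch of one such plot, using the words-per-line vectors computed above. Only the blogs data is shown, and the cap of 200 words per line is an arbitrary choice to keep the long tail from dominating the plot; the same call works for wordsnews and wordstweets.

hist(wordsblog[wordsblog <= 200],          # drop the long tail of very long lines
     breaks = 50,
     main = "Words per line: Blogs",
     xlab = "Words per line")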

Sampling Data:

Because the full corpus is large, we draw a random sample of 1,000 lines from each source for the exploratory analysis.

sample.data <- c(sample(blogs,1000),
                 sample(news,1000),
                 sample(twitter,1000))
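Note that this sample changes on every run. Setting a seed before the sample() calls above would make the report reproducible; the value 1234 is arbitrary.

set.seed(1234)   # arbitrary seed, called before sampling, so the same lines are drawn each run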

Cleaning the Data:

Before performing EDA or building any model, the data should be cleaned; in particular, profanity should be removed. We read badwords.txt to obtain the list of words to filter out.

conn <- file("./badwords.txt","r")
badwords <- readLines(conn,skipNul = TRUE)
close(conn)

Performing the cleaning process (lower-casing and removing numbers, profanity, punctuation, and extra whitespace):

corpus <- VCorpus(VectorSource(sample.data))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))  # replace a pattern with a space
corpus <- tm_map(corpus, toSpace, "/")
corpus <- tm_map(corpus, toSpace, "@")
corpus <- tm_map(corpus, toSpace, "\\|")
corpus <- tm_map(corpus, content_transformer(tolower))   # wrap tolower so the result stays a tm corpus
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, badwords)          # profanity filtering
##corpus <- tm_map(corpus,removeWords,stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
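As a quick sanity check of the cleaning, the content of any document in the corpus can be inspected as plain text; index 1 is arbitrary.

as.character(corpus[[1]])   # show the first cleaned document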

Loading additional libraries for n-gram tokenization and word clouds:

library(RWeka)
library(rJava)
library(wordcloud)
## Loading required package: RColorBrewer

Tokenization and Exploratory Data Analysis:
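The uni-, bi-, and tri-gram analyses below all follow the same pattern: tokenize, tabulate, and sort by frequency. A minimal sketch of a helper that factors this out; ngram_freq() is a hypothetical name, and the sketch assumes each document in the cleaned corpus holds a single line of text.

cleaned.text <- sapply(corpus, as.character)            # collapse the corpus to a character vector
ngram_freq <- function(text, n) {
  tokens <- NGramTokenizer(text, Weka_control(min = n, max = n))
  freq <- data.frame(table(tokens))                     # count each distinct n-gram
  freq[order(freq$Freq, decreasing = TRUE), ]           # most frequent first
}
head(ngram_freq(cleaned.text, 2))                       # e.g. the most frequent bi-grams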

One-Gram:

onegram <- NGramTokenizer(corpus,Weka_control(min = 1,max = 1))
oneGram <- data.frame(table(onegram))
head(oneGram)
##                      onegram Freq
## 1                          a 2038
## 2                         â–    9
## 3                         ã—    3
## 4   â\\200\\230â\\200\\230he    1
## 5 â\\200\\230â\\200\\230that    1
## 6 â\\200\\230â\\200\\230went    1
oneGram <- oneGram[order(oneGram$Freq,decreasing = T),]
orderedonegram <- oneGram[1:60,]
barplot(orderedonegram$Freq,names.arg = orderedonegram$onegram,cex.names=1,col = terrain.colors(60),las=2,main="One Gram")

wordcloud(oneGram$onegram,freq = oneGram$Freq,max.words = 50,random.order = F,colors=brewer.pal(4, "Set1"),scale = c(2,1))
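Some of the most frequent one-gram tokens above are mis-encoded characters (for example â–) rather than English words. A minimal sketch of one way to drop them, assuming the conversion is applied to sample.data before the corpus is rebuilt.

sample.data <- iconv(sample.data, from = "latin1", to = "ASCII", sub = "")   # drop characters that cannot be represented in ASCII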

Bi-Gram:

bigram <- NGramTokenizer(corpus,Weka_control(min=2,max=2))
biGram <- data.frame(table(bigram))
head(biGram)
##        bigram Freq
## 1  â– chicken    1
## 2     â– even    1
## 3      â– ive    1
## 4 â– monterey    1
## 5       â– or    1
## 6 â– probably    1
biGram <- biGram[order(biGram$Freq,decreasing = T),]
orderedbigram <- biGram[1:60,]
barplot(orderedbigram$Freq,names.arg = orderedbigram$bigram,col=terrain.colors(60),las=2,main="Bi-Gram")

wordcloud(biGram$bigram,freq = biGram$Freq,max.words = 50,random.order = F,colors = brewer.pal(8,"Set1"),scale = c(2,1))

Tri-Gram:

trigram <- NGramTokenizer(corpus,Weka_control(min=3,max=3))
triGram <- data.frame(table(trigram))
triGram <- triGram[order(triGram$Freq,decreasing = T),]
orderedtrigram <- triGram[1:60,]
barplot(orderedtrigram$Freq,names.arg = orderedtrigram$trigram,col=terrain.colors(60),las=2,main="Tri-Gram",cex.names = 0.7)

wordcloud(triGram$trigram,freq = triGram$Freq,max.words = 60,random.order = F,colors = brewer.pal(8,"Dark2"),scale = c(2,1))