Introduction

The goal of this report is to demonstrate familiarity with the data as a first step toward building a prediction algorithm. It presents an exploratory analysis and outlines the goals for the eventual app and algorithm in a way that is understandable to a non-data-scientist manager.

Required environment

library(tm)
library(tidyr)
library(dplyr)
library(RWeka)
library(NLP)
library(Matrix)
library(knitr)
library(lattice)

Data download

The data was downloaded from the Capstone Dataset and read in as shown below:

#Blogs
b1<-file("en_US.blogs.txt")
blogs <- readLines(b1,encoding = 'UTF-8',skipNul = T)
close(b1)
#news
n1<-file("en_US.news.txt")
news1 <- readLines(n1,encoding = 'UTF-8',skipNul = T)
close(n1)
#twitter
t1<-file("en_US.twitter.txt")
twitter<- readLines(t1,encoding = 'UTF-8',skipNul = T)
close(t1)

About the dataset

The dataset consists of files in four languages, one directory per language: German, Russian, English, and Finnish. Each directory contains three files: blogs, news, and twitter. For the purposes of this project, we will explore the following English files (a quick way to inspect the directory layout is sketched after this list):

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt
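For reference, the directory layout can be listed directly from R. This is a minimal sketch that assumes the downloaded zip was extracted into a folder named "final" in the working directory (a hypothetical path; adjust it to your setup).

# List all files in the extracted dataset; "final" is an assumed folder name
list.files("final", recursive = TRUE)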

Basic statistical analysis

We analyzed the three English files with respect to their size and length (number of lines and characters).

col1<-c(length(blogs),length(news1),length(twitter))
col2<-sapply(list(blogs,news1,twitter),function(x){format(object.size(x),'MB')})
col3<-sapply(list(blogs,news1,twitter),function(x){sum(nchar(x))})
dat_overview<-cbind(col1,col2,col3)
colnames(dat_overview)<-c('number of rows','size in MB','number of characters')
row.names(dat_overview)<-c('blogs','news','twitter')
kable(dat_overview)
          number of rows   size in MB   number of characters
blogs             899288     255.4 Mb              206824505
news               77259      19.8 Mb               15639408
twitter          2360148       319 Mb              162096241

As seen in the table above, this is a large dataset.
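For additional context on the scale, the snippet below estimates the number of words per file by splitting each line on whitespace. This is only a rough approximation, not a proper tokenization.

# Approximate word counts: tokens are defined simply as whitespace-separated
# strings, so these numbers are estimates only
word_counts <- sapply(list(blogs, news1, twitter),
                      function(x) sum(lengths(strsplit(x, "\\s+"))))
names(word_counts) <- c('blogs', 'news', 'twitter')
word_counts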

Tidying / Preprocessing the Dataset

To make the analysis more accurate and the data more compact, we cleaned up the dataset in a few steps:

Sampling

Given the large size of the dataset, it is practical to work with a sample. We studied a 5% sample of each file.

# Sampling
set.seed(987)
blogs_s<-sample(blogs,size = length(blogs)*0.05,replace = F)
news1_s<-sample(news1,size = length(news1)*0.05,replace = F)
twitter_s<-sample(twitter,size = length(twitter)*0.05,replace = F)

total_sample<-iconv(c(blogs_s,news1_s,twitter_s),"UTF-8","ASCII",sub = "")
total_sample<-VCorpus(VectorSource(total_sample))

Cleaning up

We converted all documents to lower case. In addition, we removed punctuation, extra whitespace, and numbers, as is standard in Natural Language Processing (NLP).

total_sample<-tm_map(total_sample,content_transformer(tolower))
total_sample<-tm_map(total_sample,removePunctuation)
total_sample<-tm_map(total_sample,stripWhitespace)
total_sample<-tm_map(total_sample,removeNumbers)
print(total_sample)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 166833

N-grams for dataset

For this, we built uni-grams, bi-grams, and tri-grams using the RWeka package.

#Tokenization
unigram<-function(x) NGramTokenizer(x,Weka_control(min=1,max=1))
bigram<-function(x)NGramTokenizer(x,Weka_control(min=2,max=2))
trigram<-function(x)NGramTokenizer(x,Weka_control(min=3,max=3))

unigram1<-TermDocumentMatrix(total_sample,control=list(tokenize=unigram))
bigram1<-TermDocumentMatrix(total_sample,control=list(tokenize=bigram))
trigram1<-TermDocumentMatrix(total_sample,control=list(tokenize=trigram))

print(unigram1)
## <<TermDocumentMatrix (terms: 114098, documents: 166833)>>
## Non-/sparse entries: 2348564/19032963070
## Sparsity           : 100%
## Maximal term length: 110
## Weighting          : term frequency (tf)
print(bigram1)
## <<TermDocumentMatrix (terms: 1116140, documents: 166833)>>
## Non-/sparse entries: 3204243/186205780377
## Sparsity           : 100%
## Maximal term length: 117
## Weighting          : term frequency (tf)
print(trigram1)
## <<TermDocumentMatrix (terms: 2299875, documents: 166833)>>
## Non-/sparse entries: 3088454/383691957421
## Sparsity           : 100%
## Maximal term length: 123
## Weighting          : term frequency (tf)
uni<-findFreqTerms(unigram1,lowfreq = 100)
bi<-findFreqTerms(bigram1,lowfreq = 100)
tri<-findFreqTerms(trigram1,lowfreq = 100)

#Sorted by decreasing order of frequency
unif<-rowSums(as.matrix(unigram1[uni,]))
sorted_uni<-sort(unif,decreasing = T)
bif<-rowSums(as.matrix(bigram1[bi,]))
sorted_bi<-sort(bif,decreasing = T)
trif<-rowSums(as.matrix(trigram1[tri,]))
sorted_tri<-sort(trif,decreasing = T)

Exploratory Data Analysis

With the n-gram information, we were able to plot the top 5 most frequent terms of each n-gram type.

sorted_uni<-data.frame(word=names(sorted_uni),frequency=sorted_uni)
sorted_bi<-data.frame(word=names(sorted_bi),frequency=sorted_bi)
sorted_tri<-data.frame(word=names(sorted_tri),frequency=sorted_tri)
                       
sorted_uni<-sorted_uni[1:5,]
sorted_bi<-sorted_bi[1:5,]
sorted_tri<-sorted_tri[1:5,]

barchart(frequency~word,xlab = "word",data = sorted_uni,main="Top 5 most frequent words - Unigram")

barchart(frequency~word,xlab = "word",data = sorted_bi,main="Top 5 most frequent words - Bigram")

barchart(frequency~word,xlab = "word",data = sorted_tri,main="Top 5 most frequent words - Trigram")

Conclusion

In this preliminary analysis, we found that "the" is the most frequent uni-gram, "of the" the most frequent bi-gram, and "thanks for the" the most frequent tri-gram. We are now able to move forward with creating the predictive model.
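As an illustration of where this is heading (not the final algorithm), the sketch below shows one way the trigram counts computed above could back a naive next-word lookup. Here `predict_next` is a hypothetical helper; a real model would also need backoff to bigrams and unigrams and smoothing for unseen phrases.

# Naive next-word lookup over the frequent trigrams (`trif`) computed above.
# `predict_next` is a hypothetical helper, not the final model.
predict_next <- function(w1, w2, counts = trif) {
  prefix <- paste(w1, w2, "")             # e.g. "thanks for "
  matches <- counts[startsWith(names(counts), prefix)]
  if (length(matches) == 0) return(NA_character_)
  best <- names(which.max(matches))       # most frequent matching trigram
  tail(strsplit(best, " ")[[1]], 1)       # return its third word
}

predict_next("thanks", "for")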