The goal of this project is to demonstrate familiarity with the data as a first step toward building a prediction algorithm. This report explains the exploratory analysis and the goals for the eventual app and algorithm in a way that is understandable to a non-data-scientist manager.
library(tm)       # corpus handling and text cleaning
library(tidyr)    # data tidying
library(dplyr)    # data manipulation
library(RWeka)    # NGramTokenizer for building n-grams
library(NLP)      # NLP infrastructure used by tm
library(Matrix)   # sparse matrix support
library(knitr)    # kable() for tables
library(lattice)  # barchart() for plots
The data was downloaded from the Capstone Dataset and read in as below:
#Blogs
b1<-file("en_US.blogs.txt")
blogs <- readLines(b1,encoding = 'UTF-8',skipNul = T)
close(b1)
#news
n1<-file("en_US.news.txt")
news1 <- readLines(n1,encoding = 'UTF-8',skipNul = T)
close(n1)
#twitter
t1<-file("en_US.twitter.txt")
twitter<- readLines(t1,encoding = 'UTF-8',skipNul = T)
close(t1)
The dataset is composed of files in 4 different languages (directories): German, Russian, English, and Finnish. Each directory has three files: blogs, news, and twitter. For the purpose of this project, we explored the English dataset, analyzing the size and length of each file.
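As a quick cross-check of the in-memory figures below, the raw files can also be sized directly on disk. This is only a minimal sketch; the paths assume the three English files sit in the working directory, matching the readLines() calls above.
# Raw file sizes on disk, in MB (paths assume the working directory,
# as in the readLines() calls above)
english_files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
round(file.size(english_files) / 1024^2, 1)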
col1<-c(length(blogs),length(news1),length(twitter))
col2<-sapply(list(blogs,news1,twitter),function(x){format(object.size(x),'MB')})
col3<-sapply(list(blogs,news1,twitter),function(x){sum(nchar(x))})
dat_overview<-cbind(col1,col2,col3)
colnames(dat_overview)<-c('number of rows','size in MB','number of characters')
row.names(dat_overview)<-c('blogs','news','twitter')
kable(dat_overview)
|  | number of rows | size in MB | number of characters |
|---|---|---|---|
| blogs | 899288 | 255.4 Mb | 206824505 |
| news | 77259 | 19.8 Mb | 15639408 |
| twitter | 2360148 | 319 Mb | 162096241 |
As the table above shows, this is a large dataset. To make the analysis more accurate and more manageable, we cleaned it up in a few steps: sampling, removal of non-ASCII characters, conversion to lower case, and removal of punctuation, extra whitespace, and numbers.
Given the large size of the dataset, sampling is sufficient for exploratory analysis. We studied a 5% sample of each file.
# Sampling
set.seed(987)
blogs_s<-sample(blogs,size = length(blogs)*0.05,replace = F)
news1_s<-sample(news1,size = length(news1)*0.05,replace = F)
twitter_s<-sample(twitter,size = length(twitter)*0.05,replace = F)
total_sample<-iconv(c(blogs_s,news1_s,twitter_s),"UTF-8","ASCII",sub = "")
total_sample<-VCorpus(VectorSource(total_sample))
We converted all documents to lower case. In addition, we removed extra whitespace, numbers, and punctuation, as is standard practice in Natural Language Processing (NLP).
total_sample<-tm_map(total_sample,content_transformer(tolower))
total_sample<-tm_map(total_sample,removePunctuation)
total_sample<-tm_map(total_sample,stripWhitespace)
total_sample<-tm_map(total_sample,removeNumbers)
print(total_sample)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 166833
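As a quick sanity check, we can inspect one cleaned document to confirm that the transformations took effect (a sketch only; document 1 is an arbitrary example).
# Look at the first cleaned document; it should be lower case with no
# punctuation or digits (document 1 is an arbitrary example)
as.character(total_sample[[1]])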
To explore word and phrase frequencies, we built uni-grams, bi-grams, and tri-grams using the RWeka package.
#Tokenization
unigram<-function(x) NGramTokenizer(x,Weka_control(min=1,max=1))
bigram<-function(x)NGramTokenizer(x,Weka_control(min=2,max=2))
trigram<-function(x)NGramTokenizer(x,Weka_control(min=3,max=3))
unigram1<-TermDocumentMatrix(total_sample,control=list(tokenize=unigram))
bigram1<-TermDocumentMatrix(total_sample,control=list(tokenize=bigram))
trigram1<-TermDocumentMatrix(total_sample,control=list(tokenize=trigram))
print(unigram1)
## <<TermDocumentMatrix (terms: 114098, documents: 166833)>>
## Non-/sparse entries: 2348564/19032963070
## Sparsity : 100%
## Maximal term length: 110
## Weighting : term frequency (tf)
print(bigram1)
## <<TermDocumentMatrix (terms: 1116140, documents: 166833)>>
## Non-/sparse entries: 3204243/186205780377
## Sparsity : 100%
## Maximal term length: 117
## Weighting : term frequency (tf)
print(trigram1)
## <<TermDocumentMatrix (terms: 2299875, documents: 166833)>>
## Non-/sparse entries: 3088454/383691957421
## Sparsity : 100%
## Maximal term length: 123
## Weighting : term frequency (tf)
uni<-findFreqTerms(unigram1,lowfreq = 100)
bi<-findFreqTerms(bigram1,lowfreq = 100)
tri<-findFreqTerms(trigram1,lowfreq = 100)
#Sorted by decreasing order of frequency
unif<-rowSums(as.matrix(unigram1[uni,]))
sorted_uni<-sort(unif,decreasing = T)
bif<-rowSums(as.matrix(bigram1[bi,]))
sorted_bi<-sort(bif,decreasing = T)
trif<-rowSums(as.matrix(trigram1[tri,]))
sorted_tri<-sort(trif,decreasing = T)
With the n-gram counts, we plotted the five most frequent terms for each n-gram type.
sorted_uni<-data.frame(word=names(sorted_uni),frequency=sorted_uni)
sorted_bi<-data.frame(word=names(sorted_bi),frequency=sorted_bi)
sorted_tri<-data.frame(word=names(sorted_tri),frequency=sorted_tri)
sorted_uni<-sorted_uni[1:5,]
sorted_bi<-sorted_bi[1:5,]
sorted_tri<-sorted_tri[1:5,]
barchart(frequency~word,xlab = "word",data = sorted_uni,main="Top 5 most frequent words - Unigram")
barchart(frequency~word,xlab = "word",data = sorted_bi,main="Top 5 most frequent bigrams")
barchart(frequency~word,xlab = "word",data = sorted_tri,main="Top 5 most frequent trigrams")
In this preliminary analysis, we found that "the" is the most frequent unigram, "of the" the most frequent bigram, and "thanks for the" the most frequent trigram. We are now ready to move forward with creating the predictive model.
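As a first illustration of where this is heading, the n-gram counts computed above can already support a naive next-word lookup. This is only a sketch, not the final algorithm: the helper predict_next() and its greedy back-off from trigrams to bigrams are our own assumptions, built on the bif and trif frequency vectors (terms appearing at least 100 times).
# Naive back-off predictor built on the frequent n-gram counts above
# (sketch only; the real model will need smoothing and a better back-off)
predict_next <- function(last_two_words) {
  # Look for frequent trigrams that start with the last two typed words
  hits <- trif[grepl(paste0("^", last_two_words, " "), names(trif))]
  if (length(hits) > 0) {
    best <- names(sort(hits, decreasing = TRUE))[1]
    return(tail(strsplit(best, " ")[[1]], 1))  # last word of the best trigram
  }
  # Back off to frequent bigrams starting with the last typed word
  last_word <- tail(strsplit(last_two_words, " ")[[1]], 1)
  hits <- bif[grepl(paste0("^", last_word, " "), names(bif))]
  if (length(hits) > 0) {
    best <- names(sort(hits, decreasing = TRUE))[1]
    return(tail(strsplit(best, " ")[[1]], 1))
  }
  NA_character_  # no match among the frequent n-grams
}
predict_next("thanks for")  # expected to return "the" given the counts above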