This work corresponds to the Peer-graded Assignment: Milestone Report. The aim is to present the data exploration and text mining done so far in preparation for the final project of the Data Science Capstone. The work has three parts. First, the English text data coming from blogs, news and Twitter is read and a 5% sample is selected. Second, we create the corpus, clean the data and analyze the most frequent words and the most frequent combinations of two and three words (n-grams). Finally, we make some plots to show the results of the exploratory analysis and, at the end, state the approach for the next step in the project.
In this section, the files are read and a sample is selected and saved, so that the large files do not have to be read on every execution.
The original files contain English-language text from blogs, news and Twitter. They are large: on the order of a million lines each and tens of millions of characters in total, as summarized in the output below. In this part we read the files and randomly select 5% of the lines.
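The reading and sampling code is not shown in this section; below is a minimal sketch of how it could look. The directory ./final/en_US/, the helper name sample_file and the saveRDS file name are assumptions made for illustration; the per-file summary printed below comes from the original reading routine, which is not reproduced here.

set.seed(1234)
# hypothetical helper: read one file and keep a random 5% of its lines
sample_file <- function(path, fraction = 0.05) {
  # skipNul avoids problems with embedded NUL characters in the news file
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[sample(length(lines), size = round(fraction * length(lines)))]
}
files <- file.path("./final/en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
final_data <- unlist(lapply(files, sample_file))
# save the sample so later runs do not need to re-read the large files
saveRDS(final_data, "final_data_sample.rds")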
## File: en_US.blogs.txt Lines: 0.90M Selected: 0.03M Chars: 38.37M Selected: 1.06M MaxLine 40833
## File: en_US.news.txt Lines: 1.01M Selected: 0.07M Chars: 35.78M Selected: 2.48M MaxLine 11384
## File: en_US.twitter.txt Lines: 0.61M Selected: 0.03M Chars: 8.10M Selected: 0.31M MaxLine 140
##
## [1] "General stats for sample final data"
## Lines LinesNEmpty Chars CharsNWhite
## 118359 118359 21443890 17839865
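The general statistics above have the same layout as the output of stringi::stri_stats_general(); a sketch of how they can be reproduced on the combined sample (assuming the sample is stored in final_data):

library(stringi)

# lines, non-empty lines, characters and non-whitespace characters of the combined sample
stri_stats_general(final_data)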
We explore and clean the sample data using the tm package, a widely used text-mining toolkit. To keep memory usage low, 5,000 lines are selected randomly from the sample.
library(tm)  # text-mining infrastructure: VCorpus, tm_map, DocumentTermMatrix

set.seed(1234)
final_data_sample <- sample(final_data, 5000, replace = FALSE)
#creating corpus object to use tm functions
final_data_cleaned <- VCorpus(VectorSource(final_data_sample))
#cleaning operations on sample text: lower, remove punctuation, remove common words, etc
final_data_cleaned <- tm_map(final_data_cleaned, content_transformer(tolower))
final_data_cleaned <- tm_map(final_data_cleaned, content_transformer(removePunctuation))
final_data_cleaned <- tm_map(final_data_cleaned, stripWhitespace)
final_data_cleaned <- tm_map(final_data_cleaned, removeWords, stopwords("english"))
final_data_cleaned <- tm_map(final_data_cleaned, removeNumbers)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
final_data_cleaned <- tm_map(final_data_cleaned, content_transformer(removeURL))
#remove profanity words
profanityWords <- read.table("./full-list-of-bad-words-text-file_2018_03_26.txt", header = FALSE)
final_data_cleaned <- tm_map(final_data_cleaned, removeWords, unlist(profanityWords))
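Before building the term matrices it is worth spot-checking a few cleaned documents; a quick sketch:

# print the cleaned text of the first three sampled documents
for (i in 1:3) {
  cat(content(final_data_cleaned[[i]]), "\n")
}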
We create document-term matrices to calculate the frequencies of single words and of two- and three-word combinations in the selected sample.
library(RWeka)  # provides NGramTokenizer for two- and three-word tokens

#functions to create tokens of two and three words using the RWeka package
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
#creating matrix of terms for one,two and three words and frequency vectors
#one word (1-grams)
dtm<-DocumentTermMatrix(final_data_cleaned)
dtm_bigram <- DocumentTermMatrix(final_data_cleaned, control = list(tokenize = BigramTokenizer, stemming = TRUE))
dtm_trigram <- DocumentTermMatrix(final_data_cleaned, control = list(tokenize = TrigramTokenizer, stemming = TRUE))
dtmmat_unigram<-as.matrix(dtm)
freq<-colSums(dtmmat_unigram)
freq<-sort(freq,decreasing = TRUE)
#two words (2-grams)
dtmmat_bigram<-as.matrix(dtm_bigram)
freq_bigram<-colSums(dtmmat_bigram)
freq_bigram<-sort(freq_bigram,decreasing = TRUE)
#three words (3-grams)
dtmmat_trigram<-as.matrix(dtm_trigram)
freq_trigram<-colSums(dtmmat_trigram)
freq_trigram<-sort(freq_trigram,decreasing = TRUE)
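A quick look at the top of each frequency vector helps sanity-check the tokenization before plotting:

# ten most frequent unigrams, bigrams and trigrams in the sample
head(freq, 10)
head(freq_bigram, 10)
head(freq_trigram, 10)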
A word cloud and a histogram are plotted for the most frequent single words, two-word and three-word combinations in the selected sample.
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()

words<-names(freq)
wordcloud(words[1:100],freq[1:100],random.order = F,random.color = F, colors = brewer.pal(9,"Blues"))
words_bigram<-names(freq_bigram)
wordcloud(words_bigram[1:100],freq_bigram[1:100],random.order = F,random.color = F, colors = brewer.pal(9,"Blues"))
words_trigram<-names(freq_trigram)
wordcloud(words_trigram[1:100],freq_trigram[1:100],random.order = F,random.color = F,colors = brewer.pal(9,"Blues"))
library(slam)  # rowapply_simple_triplet_matrix() works on the sparse matrices directly

#total number of n-grams per document (row sums of the document-term matrices)
unigram_freq <- rowapply_simple_triplet_matrix(dtm,sum)
bigram_freq <- rowapply_simple_triplet_matrix(dtm_bigram,sum)
trigram_freq <- rowapply_simple_triplet_matrix(dtm_trigram,sum)
par(mfrow = c(1,3), oma=c(0,0,3,0))
hist(unigram_freq, breaks = 50, main = 'unigram freq', xlab='frequency')
hist(bigram_freq, breaks = 50, main = 'bigram freq', xlab='frequency')
hist(trigram_freq, breaks = 50, main = 'trigram freq', xlab='frequency')
title("NGram Histograms",outer=T)
The histograms show that the distributions are skewed. We then analyze how many of the most frequent words are needed to cover 50% of all word occurrences.
#Words in freq
length(freq)
## [1] 19774
#sum of frequencies
freqtot<-sum(freq)
#sum freq of first 150 words
freq150<-sum(freq[1:150])
calcfreq <- function(freq, i) {
  freqtot <- sum(freq)
  freqi <- sum(freq[1:i])
  ratio <- i/length(freq)
  coverage <- freqi/freqtot
  cat(sprintf("Tot words: %d Analyzed (top frequency) %d words Ratio=%.2f Coverage %.2f\n", length(freq), i, ratio, coverage))
}
for (i in seq(100,1000,100)) {
  calcfreq(freq,i)
}
## Tot words: 19774 Analyzed (top frequency) 100 words Ratio=0.01 Coverage 0.17
## Tot words: 19774 Analyzed (top frequency) 200 words Ratio=0.01 Coverage 0.23
## Tot words: 19774 Analyzed (top frequency) 300 words Ratio=0.02 Coverage 0.28
## Tot words: 19774 Analyzed (top frequency) 400 words Ratio=0.02 Coverage 0.32
## Tot words: 19774 Analyzed (top frequency) 500 words Ratio=0.03 Coverage 0.36
## Tot words: 19774 Analyzed (top frequency) 600 words Ratio=0.03 Coverage 0.39
## Tot words: 19774 Analyzed (top frequency) 700 words Ratio=0.04 Coverage 0.41
## Tot words: 19774 Analyzed (top frequency) 800 words Ratio=0.04 Coverage 0.43
## Tot words: 19774 Analyzed (top frequency) 900 words Ratio=0.05 Coverage 0.45
## Tot words: 19774 Analyzed (top frequency) 1000 words Ratio=0.05 Coverage 0.47
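The same question can be answered exactly with a cumulative sum over the sorted frequency vector; words_for_coverage below is a hypothetical helper written for this sketch (the 90% threshold is included only for comparison):

# smallest number of top-frequency words whose occurrences reach the requested coverage
words_for_coverage <- function(freq, coverage) {
  min(which(cumsum(freq) / sum(freq) >= coverage))
}
words_for_coverage(freq, 0.5)
words_for_coverage(freq, 0.9)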
Finally, we plot the 30 most frequent unigrams, bigrams and trigrams in the sample.
num <- 30
unigram_df <- head(data.frame(terms=names(freq), freq=freq),n=num)
bigram_df <- head(data.frame(terms=names(freq_bigram), freq=freq_bigram),n=num)
trigram_df <- head(data.frame(terms=names(freq_trigram), freq=freq_trigram),n=num)
library(ggplot2)

#Plot 1 - Unigram
plot_unigram <- ggplot(unigram_df, aes(terms, freq))
plot_unigram <- plot_unigram + geom_bar(fill="white",colour=unigram_df$freq,stat="identity") + scale_x_discrete(limits=unigram_df$terms)
plot_unigram <- plot_unigram + theme(axis.text.x=element_text(angle=45, hjust=1))
plot_unigram <- plot_unigram + labs(x = "Words", y="Frequency", title="30 most frequent words in Sample")
plot_unigram
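#Plot 2 - Bigram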
plot_bigram <-ggplot(bigram_df,aes(terms, freq))
plot_bigram <-plot_bigram + geom_bar(fill="white", colour=bigram_df$freq,stat="identity") + scale_x_discrete(limits=bigram_df$terms)
plot_bigram <- plot_bigram + theme(axis.text.x=element_text(angle=45, hjust=1))
plot_bigram <- plot_bigram + labs(x = "Words", y="Frequency", title="30 most frequent bigrams in Sample")
plot_bigram
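#Plot 3 - Trigram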
plot_trigram <-ggplot(trigram_df,aes(terms, freq))
plot_trigram <- plot_trigram + geom_bar(fill="white",colour=trigram_df$freq,stat="identity") + scale_x_discrete(limits=trigram_df$terms)
plot_trigram <- plot_trigram + theme(axis.text.x=element_text(angle=45, hjust=1))
plot_trigram <- plot_trigram + labs(x = "Words", y="Frequency", title="Top 30 Trigrams in Sample")
plot_trigram
The next step is to build the predictive algorithm that suggests the next word as a sentence is being typed. The n-gram frequency tables built in this project will be used to train and test the algorithm. The algorithm has to be optimized to run with low memory and CPU usage, so resource consumption will be measured.
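As a first illustration of this approach (a naive sketch, not the final algorithm; predict_next_word and the simple backoff rule are assumptions made here), the frequency tables built above can already be used to look up a likely next word:

# naive next-word prediction: match the last two (then last one) words of the phrase
# against the trigram/bigram tables and return the most frequent continuations
predict_next_word <- function(phrase, n = 3) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  last_two <- paste(tail(words, 2), collapse = " ")
  last_one <- tail(words, 1)
  # trigrams that start with the last two words of the phrase
  hits <- freq_trigram[grepl(paste0("^", last_two, " "), names(freq_trigram))]
  if (length(hits) == 0) {
    # back off to bigrams that start with the last word
    hits <- freq_bigram[grepl(paste0("^", last_one, " "), names(freq_bigram))]
  }
  if (length(hits) == 0) return(names(head(freq, n)))  # fall back to the most common words
  # keep only the predicted (last) word of the best-matching n-grams
  sapply(strsplit(names(head(hits, n)), " "), tail, 1)
}

predict_next_word("new york")

For the final application the n-gram tables would be pre-split into prefix/next-word columns rather than scanned with regular expressions, to keep lookups fast and memory use low.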