Introduction

This is the Milestone Report for the Data Science Capstone project. Its aim is to present the exploratory data analysis and text mining done so far in preparation for the final project. The work contains three parts:

  1. In the first part, English text data from blogs, news and Twitter is read, and a random 10% of it is selected.

  2. In the second part, a corpus is created, the data is cleaned, and the most frequent words and combinations of two and three words (n-grams) are analysed.

  3. In the third part, plots are built that show the results of the frequency analysis.

The conclusion sets out the approach to the next stage of the project.

Reading and sampling input data

In this section, the files are read and a sample is selected and saved, so that the large original files do not have to be re-read on every execution.

The original files contain English text coming from blogs, news and Twitter. The files are quite large: 8.9, 10.1 and 23.6 million records for blogs, news and Twitter respectively, and each of them contains more than 30 million words. In this part we read the files and randomly select 10% of the lines.
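
The sampling code itself is not shown in this report; the following is a minimal sketch of how it can be done, assuming the standard en_US.* file names in a ./data directory (the helper name sample_file and the cache file name are illustrative):

#read each source file, keep a random 10% of its lines, and cache the combined sample
set.seed(1)
sample_file <- function(path, rate = 0.1) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = floor(length(lines) * rate))
}
final_data <- c(sample_file("./data/en_US.blogs.txt"),
                sample_file("./data/en_US.news.txt"),
                sample_file("./data/en_US.twitter.txt"))
#save the sample so later executions do not need to re-read the large files
saveRDS(final_data, "./data/final_data_sample.rds")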

## [1] "General stats for sample final data"
##       Lines LinesNEmpty       Chars CharsNWhite 
##      195000      195000    27694328    22852820
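
The summary above matches the format returned by stri_stats_general() from the stringi package; a sketch of how it can be produced, assuming the sampled vector final_data from the previous step:

library(stringi)
#line and character counts (Lines, LinesNEmpty, Chars, CharsNWhite) for the sample
print("General stats for sample final data")
stri_stats_general(final_data)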

Exploring and cleaning final input data

Using the tm package, we create a Corpus object and clean the sample data by converting the text to lower case and removing punctuation, extra whitespace, English stopwords, numbers, URLs and profanity words.

A further sample of 5,000 lines is chosen so that the document-term matrices can be generated within the available memory.

#required package for corpus handling and cleaning
library(tm)
set.seed(1)
final_data_sample <- sample(final_data, 5000, replace = FALSE)
#creating corpus object to use tm functions
final_data_cleaned <- VCorpus(VectorSource(final_data_sample))
#cleaning operations on sample text: lower case, remove punctuation, remove common words, etc.
final_data_cleaned <- tm_map(final_data_cleaned, content_transformer(tolower))
final_data_cleaned <- tm_map(final_data_cleaned, content_transformer(removePunctuation))
final_data_cleaned <- tm_map(final_data_cleaned, stripWhitespace)
final_data_cleaned <- tm_map(final_data_cleaned, removeWords, stopwords("english"))
final_data_cleaned <- tm_map(final_data_cleaned, removeNumbers)
#remove URLs
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
final_data_cleaned <- tm_map(final_data_cleaned, content_transformer(removeURL))
#remove profanity words
profanityWords <- read.table("./full-list-of-bad-words-text-file_2019_02_20.txt", header = FALSE)
final_data_cleaned <- tm_map(final_data_cleaned, removeWords, unlist(profanityWords))

We create document-term matrices to calculate the frequencies of single words and of two- and three-word combinations.

#functions to create tokens of two and three words using the RWeka package
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
#creating document-term matrices for one, two and three words and the corresponding frequency vectors
#one word (1-grams)
dtm <- DocumentTermMatrix(final_data_cleaned)
dtm_bigram <- DocumentTermMatrix(final_data_cleaned, control = list(tokenize = BigramTokenizer, stemming = TRUE))
dtm_trigram <- DocumentTermMatrix(final_data_cleaned, control = list(tokenize = TrigramTokenizer, stemming = TRUE))
dtmmat_unigram <- as.matrix(dtm)
freq <- colSums(dtmmat_unigram)
freq <- sort(freq, decreasing = TRUE)
#two words (2-grams)
dtmmat_bigram <- as.matrix(dtm_bigram)
freq_bigram <- colSums(dtmmat_bigram)
freq_bigram <- sort(freq_bigram, decreasing = TRUE)
#three words (3-grams)
dtmmat_trigram <- as.matrix(dtm_trigram)
freq_trigram <- colSums(dtmmat_trigram)
freq_trigram <- sort(freq_trigram, decreasing = TRUE)

A wordcloud and a histogram are plotted for each n-gram size.

Creating wordclouds of n-grams

#packages for wordclouds and colour palettes
library(wordcloud)
library(RColorBrewer)
words <- names(freq)
par(bg = "grey30")
wordcloud(words[1:100], freq[1:100], random.order = F, random.color = F, colors = brewer.pal(11, "RdBu"))
Figure 1. Unigram wordcloud

words_bigram<-names(freq_bigram)
par(bg="grey30")
wordcloud(words_bigram[1:1000],freq_bigram[1:1000],random.order = F,random.color = F,colors = brewer.pal(9, "Greens"), rot.per = 0.3)
Figure 2. Bigram wordcloud

words_trigram<-names(freq_trigram)
par(bg="grey30")
pal <- brewer.pal(9, "Set1")
wordcloud(words_trigram[1:100], freq_trigram[1:100], scale = c(5, 0.3), min.freq = 1, max.words = 2400, random.order = FALSE, random.color = TRUE, rot.per = 0, colors = pal)
Figure 3. Trigram wordcloud

Creating histograms of frequency

#slam provides rowapply_simple_triplet_matrix; ggplot2 is used for the histograms
library(slam)
library(ggplot2)
#total number of n-gram tokens per document (row sums of each document-term matrix)
unigram_freq <- rowapply_simple_triplet_matrix(dtm, sum)
bigram_freq <- rowapply_simple_triplet_matrix(dtm_bigram, sum)
trigram_freq <- rowapply_simple_triplet_matrix(dtm_trigram, sum)
require(data.table)
uf <- as.data.frame(unigram_freq)
g4 <- ggplot(data = uf, aes(x = unigram_freq))
g4 + geom_histogram(fill = "green", color = "darkgreen")
Figure 4. Histogram of unigram frequencies

bf <- as.data.frame(bigram_freq)
g5 <- ggplot(data = bf, aes(x = bigram_freq))
g5 + geom_histogram(fill = "red", color = "darkred")
Figure 5. Histogram of bigram frequencies

tf <- as.data.frame(trigram_freq)
g6 <- ggplot(data = tf, aes(x = trigram_freq))
g6 + geom_histogram(fill = "lightblue", color = "darkblue")
Figure 6. Histogram of trigram frequencies

The histograms show that the distributions are heavily skewed. We analyse how many of the most frequent words are needed to cover 50% of all word occurrences.

#number of distinct words in the frequency vector
length(freq)
## [1] 16084
#sum of all frequencies
freqtot <- sum(freq)
#sum of the frequencies of the 150 most frequent words
freq150 <- sum(freq[1:150])
calcfreq <- function(freq, i) {
    freqtot <- sum(freq)
    freqi <- sum(freq[1:i])
    ratio <- i/length(freq)
    coverage <- freqi/freqtot
    cat(sprintf("Tot words: %d Analyzed (top frequency) %d words Ratio=%.2f Coverage %.2f\n", length(freq), i, ratio, coverage))
}
for (i in seq(100,1000,100)) {
  calcfreq(freq,i) 
}
## Tot words: 16084 Analyzed (top frequency) 100 words Ratio=0.01 Coverage 0.21
## Tot words: 16084 Analyzed (top frequency) 200 words Ratio=0.01 Coverage 0.29
## Tot words: 16084 Analyzed (top frequency) 300 words Ratio=0.02 Coverage 0.34
## Tot words: 16084 Analyzed (top frequency) 400 words Ratio=0.02 Coverage 0.38
## Tot words: 16084 Analyzed (top frequency) 500 words Ratio=0.03 Coverage 0.42
## Tot words: 16084 Analyzed (top frequency) 600 words Ratio=0.04 Coverage 0.45
## Tot words: 16084 Analyzed (top frequency) 700 words Ratio=0.04 Coverage 0.47
## Tot words: 16084 Analyzed (top frequency) 800 words Ratio=0.05 Coverage 0.49
## Tot words: 16084 Analyzed (top frequency) 900 words Ratio=0.06 Coverage 0.51
## Tot words: 16084 Analyzed (top frequency) 1000 words Ratio=0.06 Coverage 0.53
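
The table above brackets the answer between 800 and 900 words; the exact number can be obtained with a cumulative sum over the sorted frequencies. A short sketch, assuming the freq vector built earlier:

#smallest number of top-frequency words whose cumulative frequency reaches 50% of all occurrences
coverage <- cumsum(freq) / sum(freq)
n50 <- which(coverage >= 0.5)[1]
n50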

Plotting the results

Finally, we plot the 50 most frequent unigrams, bigrams and trigrams in the sample.

num <- 50
unigram_df <- head(data.frame(terms=names(freq), freq=freq),n=num) 
bigram_df <- head(data.frame(terms=names(freq_bigram), freq=freq_bigram),n=num)
trigram_df <- head(data.frame(terms=names(freq_trigram), freq=freq_trigram),n=num)

#plot unigram
plot_unigram <- ggplot(unigram_df, aes(terms, freq))
plot_unigram <- plot_unigram + geom_bar(fill = "yellow", colour = "salmon", stat = "identity") + scale_x_discrete(limits = unigram_df$terms) + coord_polar()
plot_unigram <- plot_unigram + theme(axis.text.x = element_text(angle = 0, hjust = 1))
plot_unigram <- plot_unigram + labs(x = "words", y = "frequency")
plot_unigram
Figure 7. Fifty most frequent words in the sample

#plot bigram
plot_bigram <- ggplot(bigram_df, aes(terms, freq))
plot_bigram <- plot_bigram + geom_bar(fill = "lightblue", colour = "blue", stat = "identity") + scale_x_discrete(limits = bigram_df$terms) + coord_polar()
plot_bigram <- plot_bigram + theme(axis.text.x = element_text(angle = 0, hjust = 1))
plot_bigram <- plot_bigram + labs(x = "words", y = "frequency")
plot_bigram
Figure 8. Fifty most frequent bigrams in the sample

#plot trigram
plot_trigram <- ggplot(trigram_df, aes(terms, freq))
plot_trigram <- plot_trigram + geom_bar(fill = "red", colour = "darkred", stat = "identity") + scale_x_discrete(limits = trigram_df$terms) + coord_polar()
plot_trigram <- plot_trigram + theme(axis.text.x = element_text(angle = 0, hjust = 1))
plot_trigram <- plot_trigram + labs(x = "words", y = "frequency")
plot_trigram
Figure 9. Fifty most frequent trigrams in the sample

Conclusions and next steps

The next step is to build an algorithm that predicts the next word the user will type in a sentence. The n-grams built in this project will be used as the input data for that prediction. The algorithm has to be optimized to run with low memory and CPU usage, so resource consumption will have to be measured.
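
As a first, heavily simplified illustration of that idea (not the final model), the trigram frequencies built in this report can already drive a naive next-word lookup. The helper predict_next below is only a sketch and assumes the freq_trigram vector from the previous sections:

#naive lookup: given two words, return the most frequent third words seen after them in the trigram table
predict_next <- function(w1, w2, freq_trigram, n = 3) {
  prefix <- paste(w1, w2, "")                     #e.g. "happy new "
  hits <- freq_trigram[startsWith(names(freq_trigram), prefix)]
  if (length(hits) == 0) return(character(0))     #a real model would back off to bigrams/unigrams here
  top <- head(sort(hits, decreasing = TRUE), n)
  substring(names(top), nchar(prefix) + 1)        #keep only the predicted word
}
predict_next("happy", "new", freq_trigram)
#memory use of the n-gram tables can be tracked with object.size() while optimizing
object.size(freq_trigram)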