Introduction

The goal of this project is to show familiarity with the data and that the work is on track toward the prediction algorithm. The document is kept concise: it explains only the major features identified in the data and briefly summarizes the plans for the prediction algorithm and Shiny app in a way that a non-data-scientist manager can understand. Tables and plots are used to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that the data has been downloaded and successfully loaded.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Get feedback on the plans for creating a prediction algorithm and Shiny app.

Loading and reading data

Here I read the files and select a training sample in order to avoid working with the full, very large files.

The original files contain English-language text from blogs, news and Twitter. The files are quite large: roughly 0.9, 1.0 and 2.4 million lines for blogs, news and Twitter respectively, each containing more than 30 million words. In this part we read the files and randomly select 10% of the lines.

## [1] "General stats for sample final data"
##       Lines LinesNEmpty       Chars CharsNWhite 
##      180000      180000    18729383    15475253
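
The code that reads the files and draws the sample is not shown in this report. The snippet below is a minimal sketch of one way it could be done; the file paths, the random seed and the exact sampling scheme are assumptions, and stri_stats_general from the stringi package is used to produce summary statistics in the format shown above.

library(stringi)

#file paths and seed are assumptions; adjust to the local copy of the corpus
set.seed(1234)
paths <- c("./final/en_US/en_US.blogs.txt",
           "./final/en_US/en_US.news.txt",
           "./final/en_US/en_US.twitter.txt")

#read one file and keep a random 10% of its lines
read_sample <- function(path, rate = 0.1) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = round(length(lines) * rate))
}

final_data <- unlist(lapply(paths, read_sample))

print("General stats for sample final data")
stri_stats_general(final_data)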

Data exploration and data cleanup

Using the tm package, we create the Corpus object and clean the sample data by performing the following steps:

1. convert to lower case
2. remove punctuation
3. remove numbers
4. remove extra whitespace
5. remove common English stop words
6. remove URLs
7. remove profanity words

A sample of 5,000 lines is chosen so that the document-term matrices can be generated in memory.

library(tm)

set.seed(1234)
final_data_sample <- sample(final_data, 5000, replace = FALSE)

#creating corpus object to use tm functions
final_data_cleaned <- VCorpus(VectorSource(final_data_sample))

#cleaning operations on the sample text: lower case, punctuation, whitespace, stop words, numbers, URLs
final_data_cleaned <- tm_map(final_data_cleaned, content_transformer(tolower))
final_data_cleaned <- tm_map(final_data_cleaned, content_transformer(removePunctuation))
final_data_cleaned <- tm_map(final_data_cleaned, stripWhitespace)
final_data_cleaned <- tm_map(final_data_cleaned, removeWords, stopwords("english"))
final_data_cleaned <- tm_map(final_data_cleaned, removeNumbers)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
final_data_cleaned <- tm_map(final_data_cleaned, content_transformer(removeURL))

#remove profanity words listed in an external file
profanityWords <- read.table("./profanity_words.txt", header = FALSE)
final_data_cleaned <- tm_map(final_data_cleaned, removeWords, unlist(profanityWords))

Now I create document-term matrices to calculate the frequencies of one-, two- and three-word terms (unigrams, bigrams and trigrams).

library(RWeka)

#First we need functions to create tokens of two and three words using the RWeka package
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

#Then we create the document-term matrices for one-, two- and three-word terms and the frequency vectors

dtm <- DocumentTermMatrix(final_data_cleaned)
dtm_bigram <- DocumentTermMatrix(final_data_cleaned, control = list(tokenize = BigramTokenizer, stemming = TRUE))
dtm_trigram <- DocumentTermMatrix(final_data_cleaned, control = list(tokenize = TrigramTokenizer, stemming = TRUE))

#one word (1-grams): frequency vector
dtmmat_unigram <- as.matrix(dtm)
freq <- colSums(dtmmat_unigram)
freq <- sort(freq,decreasing = TRUE)

#two words (2-grams)
dtmmat_bigram <- as.matrix(dtm_bigram)
freq_bigram <- colSums(dtmmat_bigram)
freq_bigram <- sort(freq_bigram,decreasing = TRUE)

#three words (3-grams)
dtmmat_trigram <- as.matrix(dtm_trigram)
freq_trigram <- colSums(dtmmat_trigram)
freq_trigram <- sort(freq_trigram,decreasing = TRUE)

Let's plot some graphs for each n-gram type: word clouds of the 100 most frequent terms, followed by histograms of the n-gram counts per line.

library(wordcloud)
library(RColorBrewer)

#word cloud of the 100 most frequent unigrams
words <- names(freq)
wordcloud(words[1:100], freq[1:100], random.order = F, random.color = F, colors = brewer.pal(9,"Blues"))

#word clouds of the 100 most frequent bigrams and trigrams
words_bigram <- names(freq_bigram)
wordcloud(words_bigram[1:100], freq_bigram[1:100], random.order = F, random.color = F, colors = brewer.pal(9,"Blues"))

words_trigram <- names(freq_trigram)
wordcloud(words_trigram[1:100], freq_trigram[1:100], random.order = F, random.color = F, colors = brewer.pal(9,"Blues"))

library(slam)

#per-line n-gram counts (row sums of the sparse document-term matrices)
unigram_freq <- rowapply_simple_triplet_matrix(dtm, sum)
bigram_freq <- rowapply_simple_triplet_matrix(dtm_bigram, sum)
trigram_freq <- rowapply_simple_triplet_matrix(dtm_trigram, sum)
par(mfrow = c(1,3), oma=c(0,0,3,0))
hist(unigram_freq, breaks = 50, main = 'Unigram Frequency', xlab='frequency')
hist(bigram_freq, breaks = 50, main = 'Bigram Frequency', xlab='frequency')
hist(trigram_freq, breaks = 50, main = 'Trigram Frequency', xlab='frequency')
title("NGram Histograms",outer=T)

The histograms show that the distributions are concentrated at low counts with a long right tail: most lines contain only a few n-grams. Next we analyze how many of the most frequent words are needed to cover 50% of all word occurrences.

#number of distinct words in the unigram frequency vector
length(freq)
## [1] 13073
#coverage of the i most frequent words over all word occurrences
calcfreq <- function(freq, i) {
    freqtot <- sum(freq)
    freqi <- sum(freq[1:i])
    ratio <- i / length(freq)
    coverage <- freqi / freqtot
    cat(sprintf("Tot words: %d Analyzed (top frequency) %d words Ratio=%.2f Coverage %.2f\n", length(freq), i, ratio, coverage))
}
for (i in seq(100, 1000, 100)) {
  calcfreq(freq, i)
}
## Tot words: 13073 Analyzed (top frequency) 100 words Ratio=0.01 Coverage 0.22
## Tot words: 13073 Analyzed (top frequency) 200 words Ratio=0.02 Coverage 0.30
## Tot words: 13073 Analyzed (top frequency) 300 words Ratio=0.02 Coverage 0.36
## Tot words: 13073 Analyzed (top frequency) 400 words Ratio=0.03 Coverage 0.40
## Tot words: 13073 Analyzed (top frequency) 500 words Ratio=0.04 Coverage 0.44
## Tot words: 13073 Analyzed (top frequency) 600 words Ratio=0.05 Coverage 0.46
## Tot words: 13073 Analyzed (top frequency) 700 words Ratio=0.05 Coverage 0.49
## Tot words: 13073 Analyzed (top frequency) 800 words Ratio=0.06 Coverage 0.51
## Tot words: 13073 Analyzed (top frequency) 900 words Ratio=0.07 Coverage 0.53
## Tot words: 13073 Analyzed (top frequency) 1000 words Ratio=0.08 Coverage 0.55
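
To pin down the 50% point exactly, rather than in steps of 100 words, the cumulative sum of the sorted frequency vector can be queried directly; a small sketch using the freq vector built above:

#smallest number of top-frequency words whose combined frequency reaches 50% of all occurrences
coverage_curve <- cumsum(freq) / sum(freq)
which(coverage_curve >= 0.5)[1]

From the table above, this number lies between 700 and 800 words, i.e. roughly 6% of the distinct words in the sample cover half of all word occurrences.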

Plotting the results

Here I plot the top 30 most frequent unigrams, bigrams and trigrams in the sample.

library(ggplot2)

#data frames with the 30 most frequent uni-, bi- and trigrams
num <- 30
unigram_df <- head(data.frame(terms=names(freq), freq=freq), n=num)
bigram_df <- head(data.frame(terms=names(freq_bigram), freq=freq_bigram), n=num)
trigram_df <- head(data.frame(terms=names(freq_trigram), freq=freq_trigram), n=num)

#Plot 1 - Unigrams
plot_unigram <- ggplot(unigram_df, aes(terms, freq))
plot_unigram <- plot_unigram + geom_bar(aes(colour = freq), fill = "white", stat = "identity") + scale_x_discrete(limits = unigram_df$terms)
plot_unigram <- plot_unigram + theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot_unigram <- plot_unigram + labs(x = "Words", y = "Frequency", title = "Top 30 unigrams in sample")
plot_unigram

#Plot 2 - Bigrams
plot_bigram <- ggplot(bigram_df, aes(terms, freq))
plot_bigram <- plot_bigram + geom_bar(aes(colour = freq), fill = "white", stat = "identity") + scale_x_discrete(limits = bigram_df$terms)
plot_bigram <- plot_bigram + theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot_bigram <- plot_bigram + labs(x = "Words", y = "Frequency", title = "Top 30 bigrams in sample")
plot_bigram

#Plot 3 - Trigrams
plot_trigram <- ggplot(trigram_df, aes(terms, freq))
plot_trigram <- plot_trigram + geom_bar(aes(colour = freq), fill = "white", stat = "identity") + scale_x_discrete(limits = trigram_df$terms)
plot_trigram <- plot_trigram + theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot_trigram <- plot_trigram + labs(x = "Words", y = "Frequency", title = "Top 30 trigrams in sample")
plot_trigram

Conclusions and next steps

The next step is to build the prediction algorithm, which will suggest the next word to be typed in a sentence. The n-gram frequency tables built in this project will be used to train and test the algorithm. The algorithm has to be optimized to run with low memory and CPU usage, so resource consumption will be measured.
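
As an illustration of the planned approach, and not the final implementation, the sketch below shows how the freq_bigram and freq_trigram vectors built above could be used for a simple longest-match lookup with backoff; the function name and details are assumptions.

#minimal sketch of a next-word lookup using the n-gram frequency vectors built above
#(simple backoff from trigrams to bigrams; the final algorithm may differ)
predict_next <- function(phrase, freq_bigram, freq_trigram) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  #try trigrams whose first two words match the last two words typed
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- freq_trigram[startsWith(names(freq_trigram), paste0(prefix, " "))]
    if (length(hits) > 0) return(sub(".* ", "", names(hits)[1]))
  }
  #back off to bigrams whose first word matches the last word typed
  hits <- freq_bigram[startsWith(names(freq_bigram), paste0(words[n], " "))]
  if (length(hits) > 0) return(sub(".* ", "", names(hits)[1]))
  NA_character_
}

#example call: predict_next("thanks for the", freq_bigram, freq_trigram)

Because the frequency vectors are already sorted in decreasing order, the first match returned is also the most frequent one.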