The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. This document should be concise and explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
Here I read the files and select a training sample in order to avoid working with the full, very large files.
The original files contain English-language text from blogs, news and Twitter. The files are quite large: roughly 0.9, 1.0 and 2.4 million lines for blogs, news and Twitter respectively, and each contains more than 30 million words. In this part we read the files and randomly select 10% of the lines.
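A minimal sketch of the reading and sampling step is shown below; the file names, paths and the use of stringi::stri_stats_general for the summary are assumptions based on the standard capstone dataset layout.
library(stringi)
#read the three source files (paths are assumed)
blogs   <- readLines("./final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("./final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
#randomly keep 10% of the lines of each file
set.seed(1234)
final_data <- c(sample(blogs,   round(length(blogs)   * 0.1)),
                sample(news,    round(length(news)    * 0.1)),
                sample(twitter, round(length(twitter) * 0.1)))
#general statistics of the sampled data (output shown below)
print("General stats for sample final data")
stri_stats_general(final_data)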
## [1] "General stats for sample final data"
## Lines LinesNEmpty Chars CharsNWhite
## 180000 180000 18729383 15475253
Using the tm package, we will create the Corpus object and clean the sample data by performing the following steps:
1. convert to lower case
2. remove punctuation
3. remove numbers
4. remove extra whitespace
5. remove common English words (stop words)
6. remove profanity words
A sample of 5,000 lines is chosen so that the document-term matrices can be generated in memory.
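The cleaning and analysis chunks below rely on several packages, presumably loaded in a setup chunk that is not shown; for completeness, these are the ones assumed:
library(tm)
library(RWeka)
library(wordcloud)
library(RColorBrewer)
library(slam)
library(ggplot2)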
set.seed(1234)
final_data_sample = sample(final_data,5000, replace = FALSE)
#creating corpus object to use tm functions
final_data_cleaned <- VCorpus(VectorSource(final_data_sample))
#cleaning operations on sample text: lower, remove punctuation, remove common words, etc
final_data_cleaned <- tm_map(final_data_cleaned, content_transformer(tolower))
final_data_cleaned <- tm_map(final_data_cleaned, content_transformer(removePunctuation))
final_data_cleaned <- tm_map(final_data_cleaned, stripWhitespace)
final_data_cleaned <- tm_map(final_data_cleaned, removeWords, stopwords("english"))
final_data_cleaned <- tm_map(final_data_cleaned, removeNumbers)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
final_data_cleaned <- tm_map(final_data_cleaned, content_transformer(removeURL))
#remove profanity words
profanityWords <- read.table("./profanity_words.txt", header = FALSE)
final_data_cleaned <- tm_map(final_data_cleaned, removeWords, unlist(profanityWords))
Now I will create document-term matrices to calculate the frequencies of one-, two- and three-word terms (unigrams, bigrams and trigrams).
#First we need functions to create tokens of two and three words using the RWeka package
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
#Then create the document-term matrices for one, two and three words and the corresponding frequency vectors
dtm <- DocumentTermMatrix(final_data_cleaned)
dtm_bigram <- DocumentTermMatrix(final_data_cleaned, control = list(tokenize = BigramTokenizer, stemming = TRUE))
dtm_trigram <- DocumentTermMatrix(final_data_cleaned, control = list(tokenize = TrigramTokenizer, stemming = TRUE))
#one word (1-grams)
dtmmat_unigram <- as.matrix(dtm)
freq <- colSums(dtmmat_unigram)
freq <- sort(freq, decreasing = TRUE)
#two words (2-grams)
dtmmat_bigram <- as.matrix(dtm_bigram)
freq_bigram <- colSums(dtmmat_bigram)
freq_bigram <- sort(freq_bigram,decreasing = TRUE)
#three words (3-grams)
dtmmat_trigram <- as.matrix(dtm_trigram)
freq_trigram <- colSums(dtmmat_trigram)
freq_trigram <- sort(freq_trigram,decreasing = TRUE)
Let's plot some graphs for each n-gram: word clouds of the 100 most frequent terms, followed by histograms of term counts per document.
words <- names(freq)
wordcloud(words[1:100], freq[1:100], random.order = F, random.color = F, colors = brewer.pal(9,"Blues"))
words_bigram <- names(freq_bigram)
wordcloud(words_bigram[1:100], freq_bigram[1:100], random.order = F, random.color = F, colors = brewer.pal(9,"Blues"))
words_trigram <- names(freq_trigram)
wordcloud(words_trigram[1:100], freq_trigram[1:100], random.order = F, random.color = F, colors = brewer.pal(9,"Blues"))
#per-document term counts, computed directly on the sparse matrices with slam's rowapply_simple_triplet_matrix
unigram_freq <- rowapply_simple_triplet_matrix(dtm, sum)
bigram_freq <- rowapply_simple_triplet_matrix(dtm_bigram, sum)
trigram_freq <- rowapply_simple_triplet_matrix(dtm_trigram, sum)
par(mfrow = c(1,3), oma=c(0,0,3,0))
hist(unigram_freq, breaks = 50, main = 'Unigram Frequency', xlab='frequency')
hist(bigram_freq, breaks = 50, main = 'Bigram Frequency', xlab='frequency')
hist(trigram_freq, breaks = 50, main = 'Trigram Frequency', xlab='frequency')
title("NGram Histograms",outer=T)
The histograms show that the distributions are heavily skewed: most documents contain only a few terms, with a long tail of longer documents. Next we analyze how many of the most frequent words are needed to cover 50% of the total word occurrences.
#number of distinct words in the sample
length(freq)
## [1] 13073
#sum of frequencies
freqtot <- sum(freq)
#sum of the frequencies of the 150 most frequent words
freq150 <- sum(freq[1:150])
calcfreq <- function(freq, i) {
  freqtot <- sum(freq)
  freqi <- sum(freq[1:i])
  ratio <- i / length(freq)
  coverage <- freqi / freqtot
  cat(sprintf("Tot words: %d Analyzed (top frequency) %d words Ratio=%.2f Coverage %.2f\n", length(freq), i, ratio, coverage))
}
for (i in seq(100, 1000, 100)) {
  calcfreq(freq, i)
}
## Tot words: 13073 Analyzed (top frequency) 100 words Ratio=0.01 Coverage 0.22
## Tot words: 13073 Analyzed (top frequency) 200 words Ratio=0.02 Coverage 0.30
## Tot words: 13073 Analyzed (top frequency) 300 words Ratio=0.02 Coverage 0.36
## Tot words: 13073 Analyzed (top frequency) 400 words Ratio=0.03 Coverage 0.40
## Tot words: 13073 Analyzed (top frequency) 500 words Ratio=0.04 Coverage 0.44
## Tot words: 13073 Analyzed (top frequency) 600 words Ratio=0.05 Coverage 0.46
## Tot words: 13073 Analyzed (top frequency) 700 words Ratio=0.05 Coverage 0.49
## Tot words: 13073 Analyzed (top frequency) 800 words Ratio=0.06 Coverage 0.51
## Tot words: 13073 Analyzed (top frequency) 900 words Ratio=0.07 Coverage 0.53
## Tot words: 13073 Analyzed (top frequency) 1000 words Ratio=0.08 Coverage 0.55
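A more direct way to find the exact cutoff, assuming freq is the sorted unigram frequency vector computed above, is to use the cumulative sum of the frequencies:
#number of top-frequency words needed to cover 50% and 90% of all word occurrences
coverage <- cumsum(freq) / sum(freq)
min(which(coverage >= 0.5))
min(which(coverage >= 0.9))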
Here I plot the 30 most frequent unigrams, bigrams and trigrams in the sample.
num <- 30
unigram_df <- head(data.frame(terms=names(freq), freq=freq), n=num)
bigram_df <- head(data.frame(terms=names(freq_bigram), freq=freq_bigram), n=num)
trigram_df <- head(data.frame(terms=names(freq_trigram), freq=freq_trigram), n=num)
#Plot 1 - Unigram
plot_unigram <- ggplot(unigram_df,aes(terms, freq))
plot_unigram <- plot_unigram + geom_bar(fill="white", colour=unigram_df$freq, stat="identity") + scale_x_discrete(limits=unigram_df$terms)
plot_unigram <- plot_unigram + theme(axis.text.x=element_text(angle=45, hjust=1))
plot_unigram <- plot_unigram + labs(x = "Words", y="Frequency", title="30 most frequent words in Sample")
plot_unigram
plot_bigram <- ggplot(bigram_df, aes(terms, freq))
plot_bigram <- plot_bigram + geom_bar(fill="white", colour=bigram_df$freq, stat="identity") + scale_x_discrete(limits=bigram_df$terms)
plot_bigram <- plot_bigram + theme(axis.text.x=element_text(angle=45, hjust=1))
plot_bigram <- plot_bigram + labs(x = "Words", y="Frequency", title="30 most frequent bigrams in Sample")
plot_bigram
plot_trigram <- ggplot(trigram_df, aes(terms, freq))
plot_trigram <- plot_trigram + geom_bar(fill="white", colour=trigram_df$freq, stat="identity") + scale_x_discrete(limits=trigram_df$terms)
plot_trigram <- plot_trigram + theme(axis.text.x=element_text(angle=45, hjust=1))
plot_trigram <- plot_trigram + labs(x = "Words", y="Frequency", title="Top 30 Trigrams in Sample")
plot_trigram
The next step is to build the predictive algorithm that suggests the next word to be written in a sentence. The n-grams built in this project will be used as the training and testing data for the algorithm. The algorithm has to be optimized to run with low memory and CPU usage, so resource consumption will be measured. Finally, the algorithm will be exposed through a Shiny app where the user types a phrase and the predicted next word is shown.
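As an illustration of the intended direction (not the final implementation), a very simple backoff lookup over the frequency vectors built above could look as follows; predict_next_word is a hypothetical helper name and the logic is only a sketch.
#sketch of a simple n-gram backoff predictor over the freq_bigram and freq_trigram vectors built above
predict_next_word <- function(phrase, freq_bigram, freq_trigram) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n == 0) return(NA_character_)
  #try trigrams first: match the last two typed words as a prefix
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- freq_trigram[startsWith(names(freq_trigram), paste0(prefix, " "))]
    if (length(hits) > 0) {
      #the frequency vectors are already sorted, so the first hit is the most frequent
      return(tail(unlist(strsplit(names(hits)[1], " ")), 1))
    }
  }
  #back off to bigrams: match only the last typed word
  hits <- freq_bigram[startsWith(names(freq_bigram), paste0(words[n], " "))]
  if (length(hits) > 0) {
    return(tail(unlist(strsplit(names(hits)[1], " ")), 1))
  }
  NA_character_
}
#example usage
predict_next_word("thanks for the", freq_bigram, freq_trigram)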