The goal of this project is to present an exploratory analysis of the data that will be used to create the prediction algorithm.
More specifically, this project performs the following steps:
1. Load the original dataset
2. Provide a summary of the original dataset with basic statistics
3. Take a random sample of the original dataset to allow faster computation
4. Clean the data
5. Create plots and word clouds for word frequencies
6. Summarise findings
The data are first downloaded from the internet and stored in a local folder. From the downloaded data, I chose to import into R the three files (blogs, news and twitter for English US) that constitute my original dataset.
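The download step and the package loading are not shown in the original code; a minimal sketch is given here. The source URL (the Coursera SwiftKey archive) and the package list (inferred from the functions called later in this report) are assumptions, not part of the original write-up.
# Assumed source: the Coursera SwiftKey archive (URL not stated in the original report)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip")
    unzip("Coursera-SwiftKey.zip")  # extracts a final/ folder containing en_US/
}
# The three en_US files are assumed to have been moved to ~/en_US/ to match the paths below.
# Packages inferred from the functions used in the rest of this report
library(qdap)          # wc()
library(tm)            # VCorpus, tm_map, TermDocumentMatrix, findFreqTerms
library(RWeka)         # NGramTokenizer, Weka_control
library(ggplot2)       # ggplot
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()
The three files are then read into R: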
enUS_blogs_con <- file("~/en_US/en_US.blogs.txt")
enUS_news_con <- file("~/en_US/en_US.news.txt")
enUS_twit_con <- file("~/en_US/en_US.twitter.txt")
enUS_blogs_db <- readLines(enUS_blogs_con)
enUS_news_db <- readLines(enUS_news_con)
enUS_twit_db <- readLines(enUS_twit_con)
close(enUS_blogs_con)
close(enUS_news_con)
close(enUS_twit_con)
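If readLines() warns about embedded nul characters (this can happen with the twitter file), the read can be repeated with skipNul = TRUE; this is an alternative to the call above, not part of the original analysis:
enUS_twit_con <- file("~/en_US/en_US.twitter.txt")
enUS_twit_db <- readLines(enUS_twit_con, skipNul = TRUE)  # skip embedded nuls instead of warning
close(enUS_twit_con)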
To understand my original dataset, I ran some basic statistics on the text contained in the three files and summarised the results. I computed the total number of lines, the minimum and maximum number of characters per line, and the total word count for each file.
db_names <- c("Blogs", "News", "Twitter")
db_len <- c(length(enUS_blogs_db), length(enUS_news_db), length(enUS_twit_db))
db_nchar_max <- c(max(nchar(enUS_blogs_db)), max(nchar(enUS_news_db)), max(nchar(enUS_twit_db)))
db_nchar_min <- c(min(nchar(enUS_blogs_db)), min(nchar(enUS_news_db)), min(nchar(enUS_twit_db)))
# wc() from the qdap package returns the word count of each text element
db_nword <- c(sum(wc(enUS_blogs_db), na.rm = TRUE), sum(wc(enUS_news_db), na.rm = TRUE), sum(wc(enUS_twit_db), na.rm = TRUE))
db_summary <- data.frame("Name" = db_names, "N Lines" = db_len, "Max N Char" = db_nchar_max,
"Min N Char" = db_nchar_min, "N Words" = db_nword)
The result is shown below:
db_summary
## Name N.Lines Max.N.Char Min.N.Char N.Words
## 1 Blogs 899288 40833 1 36825518
## 2 News 1010242 11384 1 33482314
## 3 Twitter 2360148 140 2 29379638
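The sizes of the raw files on disk also illustrate why working with the full dataset is slow. This is a quick check added here (not part of the original analysis), assuming the same file paths as above:
# Approximate size of each raw file in megabytes
round(file.info(c("~/en_US/en_US.blogs.txt",
                  "~/en_US/en_US.news.txt",
                  "~/en_US/en_US.twitter.txt"))$size / 1024^2, 1)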
I took a random sample equal to 2% of each file. This allowed me to avoid computational problems due to the huge initial dataset and to run the analysis faster.
set.seed(3)
blogs_sample <- sample(enUS_blogs_db, length(enUS_blogs_db)*0.02)
news_sample <- sample(enUS_news_db, length(enUS_news_db)*0.02)
twit_sample <- sample(enUS_twit_db, length(enUS_twit_db)*0.02)
data <- c(twit_sample, blogs_sample, news_sample)
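As a quick sanity check (not in the original write-up), the combined sample should contain roughly 2% of the total number of lines:
# Fraction of the original lines kept in the combined sample; should be close to 0.02
length(data) / (length(enUS_blogs_db) + length(enUS_news_db) + length(enUS_twit_db))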
I cleaned the data by removing punctuation and extra whitespace, converting all text to lower case, and removing numbers and stop words (like “and”, “or”, “to”, …).
data <- VCorpus(VectorSource(data))                       # build a tm corpus from the sampled text
removePunct <- function(x) gsub("[[:punct:]]", "", x)
data <- tm_map(data, content_transformer(removePunct))    # remove punctuation
data <- tm_map(data, stripWhitespace)                     # collapse repeated whitespace
data <- tm_map(data, content_transformer(tolower))        # convert to lower case
data <- tm_map(data, removeNumbers)                       # remove digits
data <- tm_map(data, removeWords, stopwords("en"))        # remove English stop words
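To verify that the transformations behaved as intended, a few documents of the cleaned corpus can be inspected; this is a quick check added here, not part of the original analysis:
# Show the metadata and content of the first three cleaned documents
inspect(data[1:3])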
Some words are more frequent than others in the combined text. I performed an analysis to understand which words are most frequent and what the distribution of word frequencies looks like. I also analysed the frequencies of 2-grams and 3-grams in the sampled dataset.
# n-gram tokenizers built with RWeka's NGramTokenizer
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unigram_tdm <- TermDocumentMatrix(data, control = list(tokenize = unigram))
unigram_freqTerm <- findFreqTerms(unigram_tdm, lowfreq = 50)
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_tdm <- TermDocumentMatrix(data, control = list(tokenize = bigram))
bigram_freqTerm <- findFreqTerms(bigram_tdm, lowfreq = 30)
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram_tdm <- TermDocumentMatrix(data, control = list(tokenize = trigram))
trigram_freqTerm <- findFreqTerms(trigram_tdm, lowfreq = 10)
# For each frequent term, sum its counts across all documents, then sort by frequency
unigram_freq <- rowSums(as.matrix(unigram_tdm[unigram_freqTerm,]))
unigram_ord <- order(unigram_freq, decreasing = TRUE)
unigram_freq <- data.frame(word=names(unigram_freq[unigram_ord]), frequency=unigram_freq[unigram_ord])
bigram_freq <- rowSums(as.matrix(bigram_tdm[bigram_freqTerm,]))
bigram_ord <- order(bigram_freq, decreasing = TRUE)
bigram_freq <- data.frame(word=names(bigram_freq[bigram_ord]), frequency=bigram_freq[bigram_ord])
trigram_freq <- rowSums(as.matrix(trigram_tdm[trigram_freqTerm,]))
trigram_ord <- order(trigram_freq, decreasing = TRUE)
trigram_freq <- data.frame(word=names(trigram_freq[trigram_ord]), frequency=trigram_freq[trigram_ord])
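Before plotting, the top entries of each frequency table can be examined directly, for example:
# Most frequent unigrams, bigrams and trigrams in the sample
head(unigram_freq, 10)
head(bigram_freq, 10)
head(trigram_freq, 10)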
To examine the results, I produced histograms and word clouds showing the most frequently used terms.
The histogram of word (unigram) frequencies is shown below:
unigram_plot <- ggplot(unigram_freq[1:25,], aes(factor(word, levels = unique(word)), frequency))
unigram_plot+
geom_bar(stat = 'identity', fill = "steelblue4")+
theme(axis.text.x=element_text(angle=90))+
xlab('Unigram')+
ylab('Frequency')
The corresponding word cloud is shown below:
wordcloud(unigram_freq$word, unigram_freq$frequency, max.words=40, colors=colorRampPalette(brewer.pal(9,"Blues"))(32), scale=c(3, .3))
Note that the word cloud generally shows the top words with size varying by frequency.
The histogram for 2-grams is shown below:
bigram_plot <- ggplot(bigram_freq[1:20,], aes(factor(word, levels = unique(word)), frequency))
bigram_plot +
geom_bar(stat = 'identity', fill = "steelblue4")+
theme(axis.text.x=element_text(angle=90))+
xlab('Bigram')+
ylab('Frequency')
The word cloud for 2-grams is shown below:
wordcloud(bigram_freq$word, bigram_freq$frequency, max.words=30, colors=colorRampPalette(brewer.pal(9,"Blues"))(32), scale=c(3, .3))
The histogram for 3-grams is shown below:
trigram_plot <- ggplot(trigram_freq[1:15,], aes(factor(word, levels = unique(word)), frequency))
trigram_plot +
geom_bar(stat = 'identity', fill = "steelblue4")+
theme(axis.text.x=element_text(angle=90))+
xlab('Trigram')+
ylab('Frequency')
The word cloud for 3-grams is shown below:
wordcloud(trigram_freq$word, trigram_freq$frequency, max.words=20, colors=colorRampPalette(brewer.pal(9,"Blues"))(32), scale=c(3, .3))
As expected, the longer the N-gram, the less frequent it becomes.
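A quick way to see this in the sampled data (a check added here, not part of the original analysis) is to compare the highest observed frequency for each n-gram order:
# Highest observed frequency for each n-gram order
sapply(list(unigram = unigram_freq, bigram = bigram_freq, trigram = trigram_freq),
       function(x) max(x$frequency))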
These results give me a good starting point for the rest of the course. I will probably perform further analysis on the data before the next class and before building the predictive model.