The goal of this project is to present an exploratory analysis of the data that will be used to create the prediction algorithm.
More specifically, this project performs the following steps:
1. Load the original dataset
2. Provide a summary of the original dataset with basic statistics
3. Take a random sample of the original dataset to allow faster computation
4. Clean the data
5. Create plots and word clouds for word frequencies
6. Summarise findings
The data are first downloaded from the internet and stored in a local folder. From the downloaded data, I chose to import into R the three files (blogs, news and twitter for English US) that constitute my original dataset.
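The download step and the package loading are not shown in the original code; a minimal sketch is given here. The source URL (the Coursera SwiftKey archive) and the package list (inferred from the functions called later in this report) are assumptions, not part of the original write-up.
# Assumed source: the Coursera SwiftKey archive (URL not stated in the original report)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip")
    unzip("Coursera-SwiftKey.zip")  # extracts a final/ folder containing en_US/
}
# The three en_US files are assumed to have been moved to ~/en_US/ to match the paths below.
# Packages inferred from the functions used in the rest of this report
library(qdap)          # wc()
library(tm)            # VCorpus, tm_map, TermDocumentMatrix, findFreqTerms
library(RWeka)         # NGramTokenizer, Weka_control
library(ggplot2)       # ggplot
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()
The three files are then read into R: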
enUS_blogs_con <- file("~/en_US/en_US.blogs.txt")
enUS_news_con <- file("~/en_US/en_US.news.txt")
enUS_twit_con <- file("~/en_US/en_US.twitter.txt")
enUS_blogs_db <- readLines(enUS_blogs_con)
enUS_news_db <- readLines(enUS_news_con)
enUS_twit_db <- readLines(enUS_twit_con)
close(enUS_blogs_con)
close(enUS_news_con)
close(enUS_twit_con)
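If readLines() warns about embedded nul characters (this can happen with the twitter file), the read can be repeated with skipNul = TRUE; this is an alternative to the call above, not part of the original analysis:
enUS_twit_con <- file("~/en_US/en_US.twitter.txt")
enUS_twit_db <- readLines(enUS_twit_con, skipNul = TRUE)  # skip embedded nuls instead of warning
close(enUS_twit_con)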
To understand my original dataset, I ran some basic statistics on the text contained in the three files and summarised the results. I computed the total number of lines, the minimum and maximum number of characters per line, and the total word count for each file.
db_names <- c("Blogs", "News", "Twitter")
db_len <- c(length(enUS_blogs_db), length(enUS_news_db), length(enUS_twit_db))
db_nchar_max <- c(max(nchar(enUS_blogs_db)), max(nchar(enUS_news_db)), max(nchar(enUS_twit_db)))
db_nchar_min <- c(min(nchar(enUS_blogs_db)), min(nchar(enUS_news_db)), min(nchar(enUS_twit_db)))
# wc() from the qdap package returns the word count of each text element
db_nword <- c(sum(wc(enUS_blogs_db), na.rm = TRUE), sum(wc(enUS_news_db), na.rm = TRUE), sum(wc(enUS_twit_db), na.rm = TRUE))
db_summary <- data.frame("Name" = db_names, "N Lines" = db_len, "Max N Char" = db_nchar_max,
"Min N Char" = db_nchar_min, "N Words" = db_nword)
The result is shown below:
db_summary
## Name N.Lines Max.N.Char Min.N.Char N.Words
## 1 Blogs 899288 40833 1 36825518
## 2 News 1010242 11384 1 33482314
## 3 Twitter 2360148 140 2 29379638
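The sizes of the raw files on disk also illustrate why working with the full dataset is slow. This is a quick check added here (not part of the original analysis), assuming the same file paths as above:
# Approximate size of each raw file in megabytes
round(file.info(c("~/en_US/en_US.blogs.txt",
                  "~/en_US/en_US.news.txt",
                  "~/en_US/en_US.twitter.txt"))$size / 1024^2, 1)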
I took a random sample equal to 2% of each file. This allowed me to avoid computational problems due to the huge initial dataset and to run the analysis faster.
set.seed(3)
blogs_sample <- sample(enUS_blogs_db, length(enUS_blogs_db)*0.02)
news_sample <- sample(enUS_news_db, length(enUS_news_db)*0.02)
twit_sample <- sample(enUS_twit_db, length(enUS_twit_db)*0.02)
data <- c(twit_sample, blogs_sample, news_sample)
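As a quick sanity check (not in the original write-up), the combined sample should contain roughly 2% of the total number of lines:
# Fraction of the original lines kept in the combined sample; should be close to 0.02
length(data) / (length(enUS_blogs_db) + length(enUS_news_db) + length(enUS_twit_db))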
I cleaned the data by removing punctuation and extra whitespace, converting all text to lower case, and removing numbers and stop words (like “and”, “or”, “to”, …).
data <- VCorpus(VectorSource(data))                       # build a tm corpus from the sampled text
removePunct <- function(x) gsub("[[:punct:]]", "", x)
data <- tm_map(data, content_transformer(removePunct))    # remove punctuation
data <- tm_map(data, stripWhitespace)                     # collapse repeated whitespace
data <- tm_map(data, content_transformer(tolower))        # convert to lower case
data <- tm_map(data, removeNumbers)                       # remove digits
data <- tm_map(data, removeWords, stopwords("en"))        # remove English stop words
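To verify that the transformations behaved as intended, a few documents of the cleaned corpus can be inspected; this is a quick check added here, not part of the original analysis:
# Show the metadata and content of the first three cleaned documents
inspect(data[1:3])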
Some words are more frequent than others in the combined text. I performed an analysis to understand which words are most frequent and what the distribution of word frequencies looks like. I also analysed the frequencies of 2-grams and 3-grams in the sampled dataset.
# n-gram tokenizers built with RWeka's NGramTokenizer
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unigram_tdm <- TermDocumentMatrix(data, control = list(tokenize = unigram))
unigram_freqTerm <- findFreqTerms(unigram_tdm, lowfreq = 50)
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_tdm <- TermDocumentMatrix(data, control = list(tokenize = bigram))
bigram_freqTerm <- findFreqTerms(bigram_tdm, lowfreq = 30)
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram_tdm <- TermDocumentMatrix(data, control = list(tokenize = trigram))
trigram_freqTerm <- findFreqTerms(trigram_tdm, lowfreq = 10)
# For each frequent term, sum its counts across all documents, then sort by frequency
unigram_freq <- rowSums(as.matrix(unigram_tdm[unigram_freqTerm,]))
unigram_ord <- order(unigram_freq, decreasing = TRUE)
unigram_freq <- data.frame(word=names(unigram_freq[unigram_ord]), frequency=unigram_freq[unigram_ord])
bigram_freq <- rowSums(as.matrix(bigram_tdm[bigram_freqTerm,]))
bigram_ord <- order(bigram_freq, decreasing = TRUE)
bigram_freq <- data.frame(word=names(bigram_freq[bigram_ord]), frequency=bigram_freq[bigram_ord])
trigram_freq <- rowSums(as.matrix(trigram_tdm[trigram_freqTerm,]))
trigram_ord <- order(trigram_freq, decreasing = TRUE)
trigram_freq <- data.frame(word=names(trigram_freq[trigram_ord]), frequency=trigram_freq[trigram_ord])
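Before plotting, the top entries of each frequency table can be examined directly, for example:
# Most frequent unigrams, bigrams and trigrams in the sample
head(unigram_freq, 10)
head(bigram_freq, 10)
head(trigram_freq, 10)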
To examine the results, I produced histograms and word clouds showing the most frequently used terms.
The histogram of word (unigram) frequencies is shown below:
unigram_plot <- ggplot(unigram_freq[1:25,], aes(factor(word, levels = unique(word)), frequency))
unigram_plot+
geom_bar(stat = 'identity', fill = "steelblue4")+
theme(axis.text.x=element_text(angle=90))+
xlab('Unigram')+
ylab('Frequency')
The corresponding word cloud is shown below:
wordcloud(unigram_freq$word, unigram_freq$frequency, max.words=40, colors=colorRampPalette(brewer.pal(9,"Blues"))(32), scale=c(3, .3))
Note that the word cloud generally shows the top words with size varying by frequency.
The histogram for 2-grams is shown below:
bigram_plot <- ggplot(bigram_freq[1:20,], aes(factor(word, levels = unique(word)), frequency))
bigram_plot +
geom_bar(stat = 'identity', fill = "steelblue4")+
theme(axis.text.x=element_text(angle=90))+
xlab('Bigram')+
ylab('Frequency')
The word cloud for 2-grams is shown below:
wordcloud(bigram_freq$word, bigram_freq$frequency, max.words=30, colors=colorRampPalette(brewer.pal(9,"Blues"))(32), scale=c(3, .3))
The histogram for 3-grams is shown below:
trigram_plot <- ggplot(trigram_freq[1:15,], aes(factor(word, levels = unique(word)), frequency))
trigram_plot +
geom_bar(stat = 'identity', fill = "steelblue4")+
theme(axis.text.x=element_text(angle=90))+
xlab('Trigram')+
ylab('Frequency')
The word cloud for 3-grams is shown below:
wordcloud(trigram_freq$word, trigram_freq$frequency, max.words=20, colors=colorRampPalette(brewer.pal(9,"Blues"))(32), scale=c(3, .3))
As expected, the longer the N-gram, the less frequent it becomes.
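A quick way to see this in the sampled data (a check added here, not part of the original analysis) is to compare the highest observed frequency for each n-gram order:
# Highest observed frequency for each n-gram order
sapply(list(unigram = unigram_freq, bigram = bigram_freq, trigram = trigram_freq),
       function(x) max(x$frequency))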
These results give me a good starting point for the rest of the course. I will probably perform further analysis on the data before the next class and before building the predictive model.