Summary

The goal of this project is to show that we have become familiar with the data and that we are on track to create the prediction algorithm.

This is the Week 2 milestone report for the peer-graded assignment of the Coursera Data Science Capstone course. The objectives of this document are as follows:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you have amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

This report will also serve as a basis for the next assignment report, so it should be as clear and concise as possible. Its content is structured into sections that follow the objectives listed above.

Loading the libraries

library(ggplot2)    # plotting
library(tm)         # corpus building and text cleaning
library(stringi)    # fast string statistics and word counts
library(textmineR)  # additional text-mining utilities
library(NLP)        # ngrams() and words() used by the tokenizers

Downloading the dataset

if (!file.exists("Coursera-SwiftKey.zip")){
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip","Coursera-SwiftKey.zip",
                method = "auto")
}

Unzipping the zip file

blog_file<- "final/en_US/en_US.blogs.txt"
twit_file <- "final/en_US/en_US.twitter.txt"
news_file <- "final/en_US/en_US.news.txt"

if (!file.exists(blog_file) || !file.exists(twit_file) || !file.exists(news_file) ){
  unzip("Coursera-SwiftKey.zip")}

Loading the text files

We are going to work with the English dataset. There are three files, one each for blogs, news, and Twitter.

blogs   <- readLines(blog_file, encoding="UTF-8")
twitter <- readLines(twit_file, encoding="UTF-8")
news    <- readLines(news_file, encoding="UTF-8")
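
readLines() may warn about embedded nulls or an incomplete final line in en_US.news.txt. If that becomes a problem, the file can alternatively be read through a binary connection, as in the sketch below (the object name news_alt is only illustrative; this alternative is not used for the counts that follow).

# Alternative way to read the news file (not used for the summary below):
# a binary connection plus skipNul = TRUE avoids the embedded-null warning.
con <- file(news_file, open = "rb")
news_alt <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)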

Creating the summary

The table below shows basic details for the three data sets: the number of words, the number of lines, the number of characters, and the size of each file in memory.

# Compute the general string statistics once per file and assemble the summary table
stats <- sapply(list(blogs, news, twitter), stri_stats_general)

summary <- data.frame('File Name ' = c("Blogs", "News", "Twitter"),
                      'words_counts' = sapply(list(blogs, news, twitter),
                                              function(x) sum(stri_count_words(x))),
                      'line_counts'  = stats[1, ],
                      'Chars'        = stats[3, ],
                      'Size'         = sapply(list(blogs, news, twitter),
                                              function(x) format(object.size(x), "Mb")))
summary
##   File.Name. words_counts line_counts     Chars     Size
## 1      Blogs     37546239      899288 206824382 255.4 Mb
## 2       News      2674536       77259  15639408  19.8 Mb
## 3    Twitter     30093372     2360148 162096031   319 Mb

Data sample

Because the full dataset is large, we create a subset with 1% of the lines from each file.

set.seed(1234)       # fix the random seed for reproducible sampling
sample_size <- 0.01  # keep 1% of the lines in each file

data_sample <- c(sample(blogs, length(blogs) * sample_size),
                 sample(news, length(news) * sample_size),
                 sample(twitter, length(twitter) * sample_size))
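
As a quick, purely illustrative check, we can look at how many lines the subset contains and how much memory it occupies:

# Size of the sampled subset (illustrative check only)
length(data_sample)
format(object.size(data_sample), "Mb")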

Building the Corpus

corpus <- VCorpus(VectorSource(data_sample))
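
To confirm the corpus was built as expected, one of its documents can be inspected (shown here only as an illustration):

# Peek at the content of the first document in the corpus
as.character(corpus[[1]])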

Data cleaning and preprocessing

We are going to:

  • Convert every letter to lowercase
  • Strip extra whitespace
  • Remove punctuation
  • Remove numbers
  • Remove standard English stop words

corpus<-tm_map(corpus, content_transformer(tolower))
corpus<-tm_map(corpus, stripWhitespace)
corpus<-tm_map(corpus, removePunctuation)
corpus<-tm_map(corpus, removeNumbers)
corpus<-tm_map(corpus, removeWords, stopwords('english'))
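
tm_map() also accepts custom transformations through content_transformer(). As an optional extra cleaning step that is not applied to the counts below, URLs could be stripped with a small custom transformer (removeURL is only an illustrative name):

# Optional extra step (not applied above): strip URLs with a custom transformer
removeURL <- content_transformer(function(x) gsub("http\\S+|www\\.\\S+", "", x))
# corpus <- tm_map(corpus, removeURL)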

Creating N-grams for our data

We have sampled, cleaned, and preprocessed the data, so we can now build basic unigram, bigram, and trigram frequency tables. For each term-document matrix we call removeSparseTerms() with a sparsity threshold of 0.9999 (a value between 0 and 1), which drops terms that are missing from more than 99.99% of the sampled documents and keeps only the most frequent n-grams.

# Tokenizers that split each document into 2-grams and 3-grams
BigramTokenizer  <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

# Turn a term-document matrix into a data frame of terms sorted by frequency
freq_df <- function(tdm){
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  freq_df <- data.frame(word = names(freq), freq = freq)
  return(freq_df)
}

unigram <- removeSparseTerms(TermDocumentMatrix(corpus), 0.9999)
unigram_freq <- freq_df(unigram)

bigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer)), 0.9999)
bigram_freq <- freq_df(bigram)

trigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer)), 0.9999)
trigram_freq <- freq_df(trigram)
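
Before plotting, the top of each frequency table can be inspected:

# Most frequent terms in each table
head(unigram_freq, 5)
head(bigram_freq, 5)
head(trigram_freq, 5)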

Plotting the most common unigrams, bigrams and trigrams

ggplot(unigram_freq[1:10,], aes(reorder(word,-freq),freq, fill=word)) + geom_bar(stat="identity") +
  labs(x="words", y="Frequency") + ggtitle("10 most common Unigrams") + 
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, size = 12, hjust = 1)) 

ggplot(bigram_freq[1:10,], aes(reorder(word,-freq),freq, fill=word)) + geom_bar(stat="identity") +
  labs(x="words", y="Frequency") + ggtitle("10 most common Bigrams") + 
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, size = 12, hjust = 1)) 

ggplot(trigram_freq[1:10,], aes(reorder(word,-freq),freq, fill=word)) + geom_bar(stat="identity") +
  labs(x="words", y="Frequency") + ggtitle("10 most common Trigrams") + 
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, size = 12, hjust = 1))

Conclusion and Next Steps

We have just finished the initial exploratory analysis. The next step will be to build a predictive model that suggests the next word based on the most frequent n-gram continuations. The algorithm will then be deployed in a Shiny app that suggests the most likely next word after a phrase is typed.
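
To make the plan concrete, the sketch below shows the basic lookup idea using only the frequency tables built above: given the last one or two words typed, return the most frequent trigram or bigram continuation, falling back to the overall most frequent unigram. This is only a minimal illustration (predict_next is an illustrative name); the real model will need smoothing, backoff weights, and better handling of unseen prefixes.

# Minimal sketch of a frequency-based next-word lookup (not the final model)
predict_next <- function(phrase) {
  words_in <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words_in)

  # Try the most frequent trigram that starts with the last two words
  if (n >= 2) {
    prefix <- paste(words_in[n - 1], words_in[n])
    hits <- trigram_freq[startsWith(as.character(trigram_freq$word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0)
      return(strsplit(as.character(hits$word[1]), " ")[[1]][3])
  }

  # Fall back to the most frequent bigram that starts with the last word
  if (n >= 1) {
    prefix <- words_in[n]
    hits <- bigram_freq[startsWith(as.character(bigram_freq$word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0)
      return(strsplit(as.character(hits$word[1]), " ")[[1]][2])
  }

  # Last resort: the single most frequent unigram
  as.character(unigram_freq$word[1])
}

predict_next("new york")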