Project Introduction

The goal of this project is to show that I have become comfortable working with the data and that I am on track to create my prediction algorithm. This report explains my exploratory analysis and my goals for the eventual app and algorithm.

This document describes the major features of the data identified so far and briefly summarizes my plans for creating the prediction algorithm and Shiny app, in a way that should be understandable to a non-data-scientist manager. Tables and plots are used to illustrate important summaries of the data set.

The motivation for this project is to:

1. Demonstrate that I have downloaded the data and successfully loaded it.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Get feedback on my plans for creating a prediction algorithm and Shiny app.

Download the data

The download URL of the training dataset is here. The data set is a corpus called HC Corpora. It consists of three files containing text scraped by a web crawler from blogs, news articles, and social media (specifically, tweets from Twitter) in the en_US locale (US English).

# The R code for downloading the data is shown here but not evaluated
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(data_url, destfile = "Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")

Loading the data

The three files were loaded with the goal of reporting the number of lines and characters in each data set. After loading the files, a data frame was built to show the totals.

# Read each file line by line; skipNul = TRUE drops embedded NUL characters
blogs    <- readLines("final/en_US/en_US.blogs.txt",   skipNul = TRUE)
news     <- readLines("final/en_US/en_US.news.txt",    skipNul = TRUE)
twitters <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)
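One caveat commonly reported for this data set (though not an issue in every environment): readLines() can stop early on en_US.news.txt when the file is opened in text mode, because the file contains embedded control characters. If the reported line count looks too low, reading through a binary-mode connection is a common workaround:

# Optional: read the news file via a binary-mode connection so embedded
# control characters do not truncate the read (mainly an issue on Windows)
con <- file("final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)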

Summary of the datasets

This is a basic summary of the data features, including the size of each file and the total number of lines and characters in each file.

# Size of each file in megabytes (MB)

size_of_blogs <- file.info("final/en_US/en_US.blogs.txt")$size/1024^2
size_of_news <- file.info("final/en_US/en_US.news.txt")$size/1024^2
size_of_twitters <- file.info("final/en_US/en_US.twitter.txt")$size/1024^2

summary_size <- c(size_of_blogs,size_of_news,size_of_twitters)

# Total lines in each file
total_lines_blogs <- length(blogs)
total_lines_news <- length(news)
total_lines_twitters <- length(twitters)

summary_lines <- c(total_lines_blogs,total_lines_news,total_lines_twitters)

# Total characters in each file (nchar() counts characters, not words)
total_chars_blogs    <- sum(nchar(blogs))
total_chars_news     <- sum(nchar(news))
total_chars_twitters <- sum(nchar(twitters))

summary_chars <- c(total_chars_blogs, total_chars_news, total_chars_twitters)

# Combine the summaries into a data frame
summary_table <- data.frame(Size_in_MB = summary_size,
                            Total_lines = summary_lines,
                            Total_characters = summary_chars)

row.names(summary_table) <- c("Blogs","News","Twitters")
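Since nchar() gives character counts rather than true word counts, an approximate word count could still be obtained by splitting each line on whitespace. The count_words() helper below is only an illustrative sketch and is not part of the summary table above:

# Approximate word count: split each line on whitespace and count the pieces
# (a rough heuristic that does not handle punctuation specially)
count_words <- function(lines) sum(vapply(strsplit(lines, "\\s+"), length, integer(1)))

approx_words <- c(Blogs    = count_words(blogs),
                  News     = count_words(news),
                  Twitters = count_words(twitters))
approx_words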

Summary Statistics of the Training Dataset

The following table provides a basic summary of the three files in terms of file size, line count, and character count.

knitr::kable(summary_table)
           Size_in_MB   Total_lines   Total_characters
Blogs        200.4242        899288          206824505
News         196.2775       1010242          203223159
Twitters     159.3641       2360148          162096241

Exploratory analysis

This section reports the interesting findings gathered so far. The analysis focuses on the frequency of words in a sample of the training corpus.

library(tm)            # text mining framework (corpus, transformations, DTM)
library(NLP)           # NLP infrastructure used by tm
library(wordcloud)     # word-cloud plotting
library(RColorBrewer)  # colour palettes used by the word cloud
library(ggplot2)       # plotting
library(fpc)           # clustering utilities (loaded but not used below)

# The full training data set is too large, so 10,000 lines are sampled from
# each source and combined into a new corpus
set.seed(1000)  # for reproducible sampling
corpus_blogs    <- sample(blogs, 10000)
corpus_news     <- sample(news, 10000)
corpus_twitters <- sample(twitters, 10000)

sample_corpus <- c(corpus_blogs, corpus_news, corpus_twitters)

training_corpus <- Corpus(VectorSource(sample_corpus))

# Clean up the training corpus
corpus_cleaner <- function(corpus) {
    corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case the text
    corpus <- tm_map(corpus, removePunctuation)                   # remove punctuation
    corpus <- tm_map(corpus, removeNumbers)                       # remove digits
    corpus <- tm_map(corpus, removeWords, stopwords("english"))   # remove common stop words
    corpus <- tm_map(corpus, stripWhitespace)                     # collapse extra whitespace
    corpus <- tm_map(corpus, stemDocument)                        # stem words (requires SnowballC)

    return(corpus)
}

cleaned_training_corpus <- corpus_cleaner(training_corpus)

# Save the cleaned corpus
saveRDS(cleaned_training_corpus, file = "cleaned_training_corpus.RDS")
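To spot-check the cleaning, a few documents can be inspected before moving on. This is a quick sanity check, not part of the original analysis:

# Look at the first three cleaned documents to confirm the transformations
inspect(cleaned_training_corpus[1:3])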

Creating the document-term matrix

# Build the document-term matrix with the tm package
# (rows = documents, columns = terms)
dtm <- DocumentTermMatrix(cleaned_training_corpus)
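Converting a large document-term matrix to a dense matrix (as done below) can be memory intensive. If that becomes a problem, very sparse terms could be dropped first; this is an optional step with an assumed sparsity threshold of 0.999:

# Remove terms whose sparsity exceeds 99.9%, i.e. keep only terms that
# appear in at least ~0.1% of the sampled documents
dtm_small <- removeSparseTerms(dtm, 0.999)
dim(dtm_small)  # rows = documents, columns = remaining terms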

Checking the term frequencies

# Total frequency of each term across all sampled documents
word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# List the 20 most frequent terms
head(word_freq,20)
##  said   one  will  like   get  just  time   can  year  make   day   new 
##  2843  2780  2772  2421  2303  2284  2164  2088  1991  1744  1653  1557 
##   say peopl  work  love  know   now  want  good 
##  1484  1464  1455  1438  1421  1362  1358  1342
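The tm package also provides findFreqTerms() as a quick way to list terms above a frequency threshold, shown here only as an alternative check (the 1,000 cut-off is illustrative):

# Terms that occur at least 1,000 times in the sampled corpus
findFreqTerms(dtm, lowfreq = 1000)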

Plotting the 20 most frequently appearing words

word_count <- data.frame(word=names(word_freq), freq=word_freq)
top_20_terms <- head(word_count,20)


qplot(word, freq, data = top_20_terms,
      colour = word,
      xlab = "Words",
      ylab = "Freqency",
      main = "Top 20 words which appeared most frequently")

Plot the Word Cloud

The cleaned corpus can now be used to construct a word cloud, which plots the words that appeared at least 500 times.

set.seed(1000)   
wordcloud(names(word_freq), word_freq, min.freq=500, scale=c(5, .2), colors=brewer.pal(6, "Dark2"))

Conclusion

This exploratory analysis shows that all three data sets are very large: the blogs, news, and Twitter files contain roughly 0.9 million, 1.0 million, and 2.4 million lines respectively, for a combined total of about 4.3 million lines and over 570 million characters. To build the prediction model, only 30,000 sampled lines (10,000 from each source) will be used in the prediction algorithm, since the original data sets are too large to be processed on PCs with limited computing resources. The samples will be treated as the training data set. The proposed algorithm will build a table of the unique words in the training data and keep track of their frequencies.
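As a rough illustration of that idea (a sketch only, using the 30,000-line sample_corpus created earlier; the real algorithm is still to be designed), a frequency lookup table of unique words might look like this:

# Tokenise the sampled lines into lower-case words and tabulate their counts
tokens <- unlist(strsplit(tolower(sample_corpus), "[^a-z']+"))
tokens <- tokens[tokens != ""]
word_table <- sort(table(tokens), decreasing = TRUE)
head(word_table, 10)   # the most common words and their frequencies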