Coursera Capstone Milestone Project

Alejandro Fraga November, 2016

Objective

The goal of this report is to explain the exploratory analysis and the goals for the eventual app and algorithm. It covers only the major features identified in the data and briefly summarizes the plans for creating the prediction algorithm and Shiny app.

Goals:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Preparation
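
The variables blogs, news, and twitter used throughout this report hold the raw lines of the three “en_US” files. A minimal sketch of that loading step, assuming the dataset/final/en_US paths used in the statistics below:

# Read the raw files line by line; skipNul avoids warnings about embedded NUL characters
blogs   <- readLines("dataset/final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("dataset/final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("dataset/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)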

Basic Statistics

The following table describes a set of basic statistics for the “en_US” data files:

# Calculate the file sizes in megabytes
file_blogs <- file.info("dataset/final/en_US/en_US.blogs.txt")$size / 1024.0 / 1024.0
file_news <- file.info("dataset/final/en_US/en_US.news.txt")$size / 1024.0 / 1024.0
file_twitter <- file.info("dataset/final/en_US/en_US.twitter.txt")$size / 1024.0 / 1024.0


# Calculate the length (number of lines) of each file
length_blogs <- length(blogs)
length_news <- length(news)
length_twitter <- length(twitter)

# Count words as runs of non-whitespace characters
words_blogs <- sum(sapply(gregexpr("\\S+", blogs), length))
words_news <- sum(sapply(gregexpr("\\S+", news), length))
words_twitter <- sum(sapply(gregexpr("\\S+", twitter), length))

dataset_overview <- data.frame(
  file_name = c("Blogs","News","Twitter"),
  file_size = c(round(file_blogs, digits=2), 
               round(file_news,digits=2), 
               round(file_twitter, digits=2)),
  line_count = c(length_blogs, length_news, length_twitter),
  word_count = c(words_blogs, words_news, words_twitter)                  
)

colnames(dataset_overview) <- c("File Name", "File Size [MB]", "Line Count", "Word Count")
# Building the table
knitr::kable(dataset_overview)
File Name   File Size [MB]   Line Count   Word Count
---------   --------------   ----------   ----------
Blogs       200.42           899288       37334147
News        196.28           1010242      34372530
Twitter     159.36           2360148      30373603

Defining a Sample

Next, I create a random sample from each of the dataset files for the initial training:

# Creating a reproducible sample from the data sources
set.seed(1234)  # fix the RNG seed so the sample can be reproduced
sample_twitter <- twitter[sample(1:length(twitter), 10000)]
sample_news <- news[sample(1:length(news), 10000)]
sample_blogs <- blogs[sample(1:length(blogs), 10000)]
sample_dataset <- c(sample_twitter, sample_news, sample_blogs)
# Save the sample (create the output directory if it does not exist)
dir.create("output", showWarnings = FALSE)
writeLines(sample_dataset, "output/sample_dataset.txt")

# Getting sample metrics 
sample_file <- file.info("output/sample_dataset.txt")$size/1024.0/1024.0
sample_length <- length(sample_dataset)
sample_words <- sum(sapply(gregexpr("\\S+", sample_dataset), length))

sample_summary <- data.frame(
  file_name  = c("Generated Sample"),
  file_size  = c(round(sample_file, digits = 2)),
  lines_count = c(sample_length),
  word_count = c(sample_words)                  
)

colnames(sample_summary) <- c("File Name", "File Size [MB]", "Line Count", "Word Count")

# Display the table
knitr::kable(sample_summary)
File Name          File Size [MB]   Line Count   Word Count
----------------   --------------   ----------   ----------
Generated Sample   4.83             30000        886798

Top 10 n-Grams

After analyzing the extracted sample and generating a corpus with the “tm” package, I was able to identify the top 10 n-grams (unigrams, bigrams, and trigrams). The following plots result from the tokenization analysis of the sample:
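
The n-gram tables read below were computed from the sample and saved as RDS files with String and Count columns. The actual analysis tokenized a tm corpus; the base-R sketch below (top_ngrams is a hypothetical helper, a simplified stand-in for that step) produces equivalent top-10 tables:

# Count the most frequent n-grams in a character vector of text lines
top_ngrams <- function(lines, n, top = 10) {
  clean  <- gsub("[^a-z' ]+", " ", tolower(lines))  # keep letters, apostrophes, spaces
  tokens <- unlist(strsplit(clean, " +"))
  tokens <- tokens[nzchar(tokens)]                  # drop empty tokens
  stopifnot(length(tokens) >= n)
  # Build n-grams by pasting n consecutive tokens
  # (n-grams may span line boundaries in this simplified version)
  idx   <- seq_len(length(tokens) - n + 1)
  grams <- tokens[idx]
  if (n > 1) for (k in 1:(n - 1)) grams <- paste(grams, tokens[idx + k])
  counts <- head(sort(table(grams), decreasing = TRUE), top)
  data.frame(String = names(counts), Count = as.integer(counts))
}

saveRDS(top_ngrams(sample_dataset, 1), "output/unigram.RDS")
saveRDS(top_ngrams(sample_dataset, 2), "output/bigram.RDS")
saveRDS(top_ngrams(sample_dataset, 3), "output/trigram.RDS")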

## unigram plot
library(googleVis)                      # gvisColumnChart comes from the googleVis package
op <- options(gvis.plot.tag = "chart")  # render the charts inside the knitr document
unigram <- readRDS("output/unigram.RDS")
unigramPlot <- gvisColumnChart(unigram, "String", "Count", options=list(seriesType='bars', title='Top 10 unigrams', legend="none"))
plot(unigramPlot)
## bigram plot
bigram <- readRDS("output/bigram.RDS")
bigramPlot <- gvisColumnChart(bigram, "String", "Count", options=list(seriesType='bars', title='Top 10 bigrams', legend="none"))
plot(bigramPlot)
## trigram plot
trigram <- readRDS("output/trigram.RDS")
trigramPlot <- gvisColumnChart(trigram, "String", "Count", options=list(seriesType='bars', title='Top 10 trigrams',legend="none"))
plot(trigramPlot)
# Setting the options back to the original settings
options(op)

Remarks

  • Loading the data for analysis requires a significant amount of memory and time. I found that saving intermediate objects as RDS files saves time and helps when generating new samples.
  • While performing tokenization, I ran into issues removing white space; my regular expression needs further investigation (see the sketch below).
  • The text mining algorithm will need to be tuned further to enhance the prediction.
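
For the whitespace issue noted above, a minimal sketch of the normalization I plan to try before tokenizing (normalize_ws is a hypothetical helper, not part of the current code):

# Collapse any run of whitespace (tabs, newlines, repeated spaces) into one space
normalize_ws <- function(x) {
  trimws(gsub("\\s+", " ", x))  # also strip leading/trailing whitespace
}
normalize_ws("  too   many\tspaces ")  # returns "too many spaces"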

Next Steps

The goal of the Capstone project is to create a reliable prediction algorithm that is not only accurate but also makes efficient use of the available resources. The requested Shiny application will take advantage of the enhancements made to the prediction algorithm.