Coursera Data Science Capstone Milestone Report

Eric Dolan - July 25, 2015

INTRODUCTION

This is the milestone report for the Coursera Data Science Specialization capstone project. The goals of the report are to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you have amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

The data were downloaded, cleaned, and collated into a corpus as detailed below. Exploratory analysis confirmed Zipf’s law: the frequency of a term is roughly inversely proportional to its rank, so a small number of words accounts for most of the corpus. Frequencies of bigrams (pairs of consecutive words) and trigrams (trios of consecutive words) were also analyzed and found to reflect common phrases in the English language.
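As a rough illustration of the rank-frequency and bigram counts described above, the following is a minimal sketch on a toy sample rather than the actual corpus. The names `sample_lines`, `zipf`, `bigrams_in_line`, and `bigram_freq` are hypothetical and are not taken from the original analysis.

```r
library(stringi)

# Toy stand-in for the corpus (hypothetical sample, not the course data)
sample_lines <- c(
  "the cat sat on the mat",
  "the dog sat on the log",
  "a cat and a dog"
)

# Split every line into lowercase word tokens
words <- unlist(stri_extract_all_words(stri_trans_tolower(sample_lines)))

# Count each word and sort by decreasing frequency; rank 1 is the most common
freq <- sort(table(words), decreasing = TRUE)
zipf <- data.frame(
  word = names(freq),
  freq = as.integer(freq),
  rank = seq_along(freq),
  stringsAsFactors = FALSE
)

# Under Zipf's law, log(freq) falls roughly linearly with log(rank):
# plot(log(zipf$rank), log(zipf$freq))

# Bigrams: pair each word with its successor within the same line
bigrams_in_line <- function(line) {
  w <- stri_extract_all_words(stri_trans_tolower(line))[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}
bigram_freq <- sort(table(unlist(lapply(sample_lines, bigrams_in_line))),
                    decreasing = TRUE)
```

On a sample this small the rank-frequency curve is only suggestive; the same steps applied to a large sample of the corpus would be needed to see the Zipf pattern clearly.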

us.news <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8")
us.blog <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8")
us.twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8")
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
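The warnings above indicate that some Twitter lines contain embedded nul bytes, at which `readLines` truncates the line by default. One possible way to handle this, shown here as a sketch on a small temporary file rather than the real data, is the `skipNul = TRUE` argument of `readLines`. The file contents and the names `tmp` and `lines_clean` are made up for illustration.

```r
# Write a small temporary file whose second line contains an embedded nul byte
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, "wb")
writeBin(c(charToRaw("first line\n"),
           charToRaw("bad"), as.raw(0), charToRaw("line\n"),
           charToRaw("last line\n")), con)
close(con)

# skipNul = TRUE drops the nul bytes instead of truncating the line with a warning
lines_clean <- readLines(tmp, encoding = "UTF-8", skipNul = TRUE)

unlink(tmp)
```

Whether the original analysis needed this is not stated; it only affects the four lines named in the warnings.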
file.info("./final/en_US/en_US.news.txt")$size / (1024^2) # leading "." added; the original call omitted it and returned NA
file.info("./final/en_US/en_US.blogs.txt")$size / (1024^2)
## [1] 200.4
file.info("./final/en_US/en_US.twitter.txt")$size / (1024^2)
## [1] 159.4
library(stringi) #load stringi for string summaries
## Warning: package 'stringi' was built under R version 3.1.2
stri_stats_general(us.news)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223154   169860866
stri_stats_general(us.blog)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
stri_stats_general(us.twitter)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096031   134082634
library(ggplot2) #load ggplot2 for graphing

summary(stri_count_words(us.news))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    19.0    32.0    34.4    46.0  1800.0
qplot(stri_count_words(us.news))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

[Figure: histogram of words per line in us.news]

summary(stri_count_words(us.blog))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       9      28      42      60    6730
qplot(stri_count_words(us.blog))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

[Figure: histogram of words per line in us.blog]

SHINY APP AND ALGORITHM

Having performed the necessary exploratory analysis, we can now use these observations to build a predictive model based on n-grams, test it, and build a Shiny app to demo the model.
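To make the n-gram idea concrete, here is a minimal sketch of a bigram-based next-word predictor in base R. It is an assumption about how such a model could look, not the model planned for the app; `train_lines`, `tokenize`, `pairs`, and `predict_next` are hypothetical names, and the training text is a toy sample.

```r
# Toy training text (hypothetical sample, not the course corpus)
train_lines <- c(
  "i love new york",
  "i love new music",
  "i love new york pizza",
  "new york is big"
)

# Lowercase a string and extract its word tokens
tokenize <- function(line) {
  lower <- tolower(line)
  regmatches(lower, gregexpr("[a-z']+", lower))[[1]]
}

# Collect (previous word, next word) pairs from every line
pairs <- do.call(rbind, lapply(train_lines, function(line) {
  w <- tokenize(line)
  if (length(w) < 2) return(NULL)
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Predict the most frequent follower of the last word typed;
# returns NA when the input is empty or the last word was never seen
predict_next <- function(text, pairs) {
  w <- tokenize(text)
  if (length(w) == 0) return(NA_character_)
  followers <- pairs$nxt[pairs$prev == w[length(w)]]
  if (length(followers) == 0) return(NA_character_)
  names(sort(table(followers), decreasing = TRUE))[1]
}

predict_next("I moved to New", pairs)  # "york" (3 of the 4 followers of "new")
```

A model built on the full corpus would likely also use trigrams and need a rule for unseen words; the sketch leaves both out to keep the lookup logic visible.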