Eric Dolan - July 25, 2015
This is the milestone report for the Coursera data science specialization. The project motivation is:
This data was downloaded, cleaned, and collated into a corpus as detailed below. Exploratory analysis confirmed Zipf’s law: the frequency of a term is inversely related to its rank, so a small number of words accounts for most of the corpus. The frequencies of bigrams (pairs of consecutive words) and trigrams (trios of consecutive words) were also analyzed and found to reflect common phrases in the English language.
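The rank-frequency relationship described above can be illustrated with a minimal base R sketch. The toy text, the whitespace tokenizer, and the variable names (`toy_text`, `zipf`, `slope`) are assumptions made for illustration and are not the code used for this report.

```r
# Toy text standing in for the corpus (illustrative assumption only)
toy_text <- c("the cat sat on the mat",
              "the dog sat on the log",
              "a cat and a dog")

# Lowercase, split on whitespace, and tabulate term frequencies
tokens <- unlist(strsplit(tolower(toy_text), "\\s+"))
freq <- sort(table(tokens), decreasing = TRUE)

# Rank 1 is the most frequent term
zipf <- data.frame(term = names(freq),
                   frequency = as.integer(freq),
                   rank = seq_along(freq),
                   stringsAsFactors = FALSE)

# Under Zipf's law, log(frequency) falls roughly linearly with log(rank),
# so the fitted slope should be negative
fit <- lm(log(frequency) ~ log(rank), data = zipf)
slope <- unname(coef(fit)[2])

# Share of all tokens covered by the top-ranked terms
coverage <- cumsum(zipf$frequency) / sum(zipf$frequency)
```

On a real corpus, plotting `frequency` against `rank` on log-log axes is the usual visual check; the toy text is far too small to say anything about the exponent.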
us.news <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8")
us.blog <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8")
us.twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8")
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
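The embedded-nul warnings above can likely be avoided by reading through a binary connection and asking `readLines` to skip nul bytes. The sketch below demonstrates this on a small temporary file rather than the Twitter file itself; the file contents and the name `clean` are illustrative assumptions.

```r
# Build a small temporary file that contains an embedded nul byte
# (a stand-in for the Twitter file; contents are illustrative)
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("abc"), as.raw(0), charToRaw("def\nsecond line\n")), tmp)

# Open in binary mode so the platform does not reinterpret control
# characters, and skip nul bytes instead of truncating the line
con <- file(tmp, open = "rb")
clean <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
unlink(tmp)
```

With `skipNul = TRUE` the nul byte is dropped and the rest of the line is kept, whereas the default behavior ends the line at the nul and issues a warning.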
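file.info("./final/en_US/en_US.news.txt")$size / (1024^2) #leading "." restored; the original path "/final/..." did not resolve
## [1] NA (output of the original call with the unresolved path; the size was not recorded)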
file.info("./final/en_US/en_US.blogs.txt")$size / (1024^2)
## [1] 200.4
file.info("./final/en_US/en_US.twitter.txt")$size / (1024^2)
## [1] 159.4
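Because `file.info` returns `NA` for a path that does not exist instead of raising an error, a mistyped path is easy to miss. One possible guard is a small wrapper that checks the path first; `file_size_mb` is a hypothetical helper name, not part of the original analysis.

```r
# Hypothetical helper: file size in MiB, failing loudly on a bad path
file_size_mb <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  file.info(path)$size / 1024^2
}

# Usage on a temporary 2048-byte file (illustrative only)
tmp <- tempfile()
writeBin(raw(2048), tmp)
size_mb <- file_size_mb(tmp)
unlink(tmp)
```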
library(stringi) #load stringi for string summaries
## Warning: package 'stringi' was built under R version 3.1.2
stri_stats_general(us.news)
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
stri_stats_general(us.blog)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(us.twitter)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096031 134082634
library(ggplot2) #load ggplot2 for graphing
summary(stri_count_words(us.news))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 19.0 32.0 34.4 46.0 1800.0
qplot(stri_count_words(us.news))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
summary(stri_count_words(us.blog))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 9 28 42 60 6730
qplot(stri_count_words(us.blog))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
With the exploratory analysis complete, the next steps are to use these observations to build an n-gram-based predictive model, test it, and build a Shiny app to demo the model.
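As a rough idea of what an n-gram-based predictor could look like, the following base R sketch counts bigrams in a toy text and predicts the most frequent follower of a word. The toy sentences, the whitespace tokenizer, and the helper names `tokenize` and `predict_next` are assumptions for illustration; the planned model may differ substantially (for example, by using trigrams, smoothing, or backoff).

```r
# Toy text standing in for a cleaned sample of the corpus (illustrative)
toy_text <- c("i love new york", "i love new shoes", "new york is big")

# Tokenize each line separately so bigrams never span two lines
tokenize <- function(line) strsplit(tolower(line), "\\s+")[[1]]

# Collect (first, second) pairs of consecutive words
bigrams <- do.call(rbind, lapply(toy_text, function(line) {
  w <- tokenize(line)
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Hypothetical helper: return the most frequent word seen after `word`,
# or NA when the word never appears as the first element of a bigram
predict_next <- function(word, bigrams) {
  followers <- bigrams$second[bigrams$first == tolower(word)]
  if (length(followers) == 0) return(NA_character_)
  names(which.max(table(followers)))
}

prediction <- predict_next("new", bigrams)
```

Here "new" is followed by "york" twice and "shoes" once, so the sketch predicts "york". A word that was never seen as the first element of a bigram yields `NA`, which is the case a backoff strategy would need to handle.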