This Milestone Report is for the Coursera/Johns Hopkins University Data Science Capstone Project (http://www.coursera.org/course/dsscapstone). It provides a basic exploratory analysis, mainly covering Tasks 0-2 of the project.
The data come from HC Corpora. The zip file is available here: http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
us.news <- readLines("./Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8")
us.blog <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
us.twitter <- readLines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
file.info("./Coursera-SwiftKey/final/en_US/en_US.news.txt")$size / (1024^2)
file.info("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / (1024^2)
file.info("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size / (1024^2)
The news file is 196.3 MB, the blog file is 200.4 MB, and the Twitter file is 159.4 MB.
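The three `file.info()` calls above can be folded into one vectorised call. The following is a minimal sketch; the helper name `file_size_mb` is hypothetical and not part of the original report, and the demonstration uses a temporary file so it runs without the corpus.

```r
# Hypothetical helper: report file sizes in MB (bytes / 1024^2) for a vector of paths.
file_size_mb <- function(paths) {
  sizes <- file.info(paths)$size / 1024^2
  names(sizes) <- basename(paths)
  round(sizes, 1)
}

# Demonstration on a temporary file, so the sketch does not need the corpus.
tmp <- tempfile(fileext = ".txt")
writeBin(raw(2 * 1024^2), tmp)   # write exactly 2 * 1024^2 zero bytes
demo_sizes <- file_size_mb(tmp)
print(demo_sizes)
```

With the corpus unpacked, the same helper could be called as `file_size_mb(file.path("./Coursera-SwiftKey/final/en_US", c("en_US.news.txt", "en_US.blogs.txt", "en_US.twitter.txt")))`.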
library(stringi) #load stringi for string summaries
stri_stats_general(us.news)
## Lines LinesNEmpty Chars CharsNWhite
## 77259 77259 15639408 13072698
stri_stats_general(us.blog)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(us.twitter)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096031 134082634
The output above shows the line counts and character counts (not word counts) for the three files. Per-line word counts are summarized next.
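The news figures above look small relative to the file size: 15,639,408 characters for a 196.3 MB file, whereas the blog file has 206,824,382 characters for 200.4 MB. One possible explanation, which is an assumption and not something this report confirms, is that the text-mode read stopped early at a control byte such as SUB (0x1A), which some platforms treat as end of file. A minimal sketch of reading through a binary-mode connection follows; the helper name `read_lines_binary` is hypothetical.

```r
# Hypothetical helper: read all lines through a binary-mode connection,
# so control bytes inside the file are not interpreted as end of file.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration: a small temporary file with a SUB (0x1A) byte inside the second line.
tmp <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("first line\nsecond"), as.raw(0x1A), charToRaw(" line\nthird line\n")),
  tmp
)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

If the line count for the news file changes when it is read this way, the summaries that follow would need to be recomputed.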
library(ggplot2) #load ggplot2 for graphing
summary(stri_count_words(us.news))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.62 46.00 1123.00
qplot(stri_count_words(us.news))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
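The message above appears because no bin width was supplied. A minimal sketch of the same kind of histogram with an explicit bin width is shown below. The data frame `toy_counts` is a hypothetical stand-in for `stri_count_words(us.news)`, and the bin width of 10 words is an arbitrary choice for illustration.

```r
library(ggplot2)

# Toy per-line word counts; a hypothetical stand-in for stri_count_words(us.news).
toy_counts <- data.frame(words = c(5, 12, 19, 32, 34, 46, 51, 120))

# An explicit binwidth (10 words per bar, chosen arbitrarily) avoids the default-bin message.
p <- ggplot(toy_counts, aes(x = words)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Words per line", y = "Number of lines")
print(p)
```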
summary(stri_count_words(us.blog))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
qplot(stri_count_words(us.blog))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
summary(stri_count_words(us.twitter))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
qplot(stri_count_words(us.twitter))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
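The Twitter histogram looks rather different from the other two, probably because of Twitter's limit on tweet length. That limit is on characters (140 per tweet at the time) rather than on words, which is consistent with the maximum of 47 words per line seen here.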
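One way to probe this explanation is to look at characters per line alongside words per line: if a character cap shapes the distribution, the character counts should stop at a hard maximum. The sketch below uses a toy vector, `toy_tweets`, as a hypothetical stand-in for `us.twitter`.

```r
library(stringi)

# Toy data; a hypothetical stand-in for us.twitter.
toy_tweets <- c("short one", "a slightly longer example tweet", "ok")

chars_per_line <- stri_length(toy_tweets)        # characters in each line
words_per_line <- stri_count_words(toy_tweets)   # words in each line

summary(chars_per_line)
max(chars_per_line)   # on real tweets, this would show whether a hard cap exists
```

Applied to the corpus, `max(stri_length(us.twitter))` would show whether the line lengths are bounded in the way the explanation suggests.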