Introduction

This milestone report is for the Coursera/Johns Hopkins University Data Science Capstone Project (http://www.coursera.org/course/dsscapstone). It presents a basic exploratory analysis, covering mainly Tasks 0-2 of the project.

Data Acquisition and Basic Summaries

The data come from HC Corpora. The zip file is available here: http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
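For reproducibility, the download and extraction step can be scripted. The following is only a minimal sketch: the function name fetch_corpus, the local zip file name, and the extraction directory are assumptions chosen to match the paths used in the rest of this report, and it assumes the archive unpacks into a top-level final/ folder.

```r
# Sketch: fetch and unpack the corpus once.
# fetch_corpus, zip.file and data.dir are illustrative names, not part of the course material.
fetch_corpus <- function(
    url      = "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
    zip.file = "Coursera-SwiftKey.zip",  # assumed local file name
    data.dir = "./Coursera-SwiftKey"     # assumed extraction directory
) {
  # Download only if the archive is not already on disk
  if (!file.exists(zip.file)) {
    download.file(url, destfile = zip.file, mode = "wb")  # binary mode keeps the zip intact
  }
  # Unpack only if the target directory does not exist yet
  if (!dir.exists(data.dir)) {
    unzip(zip.file, exdir = data.dir)
  }
  invisible(data.dir)
}

# fetch_corpus()  # uncomment to run; the archive is large
```

Both steps are skipped when their output already exists, so re-running the report does not repeat the download.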

# Read each English (en_US) file into a character vector, one element per line
us.news <- readLines("./Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8")
us.blog <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
us.twitter <- readLines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
# File sizes on disk, converted from bytes to MB
file.info("./Coursera-SwiftKey/final/en_US/en_US.news.txt")$size / (1024^2)
file.info("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / (1024^2)
file.info("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size / (1024^2)

The news file is 196.3 MB, the blogs file is 200.4 MB, and the Twitter file is 159.4 MB.
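One sanity check worth making is to compare these sizes with the character counts reported further down: the news file is about as large as the blogs file on disk, yet far fewer characters are read from it, which hints that the read may have stopped early. A possible cause, stated here as an assumption rather than a diagnosis, is a control character in the file that ends a text-mode read on some platforms (notably Windows). The sketch below reads through a binary-mode connection instead; read_lines_binary is a hypothetical helper name, not something the original analysis used.

```r
# Sketch: read a text file through a binary connection so that embedded
# control characters are not treated as end-of-file.
# read_lines_binary is an illustrative helper name.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")   # binary mode: no platform-specific EOF handling
  on.exit(close(con))              # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Possible usage (path as in this report):
# us.news.bin <- read_lines_binary("./Coursera-SwiftKey/final/en_US/en_US.news.txt")
# length(us.news.bin)  # compare with length(us.news)
```

If the two lengths agree, the text-mode read was complete and nothing needs to change.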

library(stringi) #load stringi for string summaries

stri_stats_general(us.news)
##       Lines LinesNEmpty       Chars CharsNWhite 
##       77259       77259    15639408    13072698
stri_stats_general(us.blog)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
stri_stats_general(us.twitter)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096031   134082634

The output above gives the line counts and character counts (with and without whitespace) for the three files. Word counts per line are summarized in the next section.
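To see lines, characters, and total words side by side, the three vectors can be collected into one small table. This is a sketch only; corpus_summary is a hypothetical helper, and it relies on the stringi functions stri_length and stri_count_words.

```r
library(stringi)

# Sketch: one row per source with total lines, characters and words.
# corpus_summary is an illustrative helper name.
corpus_summary <- function(texts) {
  data.frame(
    source = names(texts),
    lines  = vapply(texts, length, integer(1)),
    # as.numeric avoids integer overflow when summing over large corpora
    chars  = vapply(texts, function(x) sum(as.numeric(stri_length(x))), numeric(1)),
    words  = vapply(texts, function(x) sum(as.numeric(stri_count_words(x))), numeric(1)),
    row.names = NULL
  )
}

# Possible usage with the vectors read earlier:
# corpus_summary(list(news = us.news, blog = us.blog, twitter = us.twitter))
```

The argument is a named list so that the source column is filled from the list names.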

Basic Tables and Plots

library(ggplot2) #load ggplot2 for graphing

summary(stri_count_words(us.news))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.62   46.00 1123.00
qplot(stri_count_words(us.news))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

summary(stri_count_words(us.blog))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
qplot(stri_count_words(us.blog))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

summary(stri_count_words(us.twitter))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00
qplot(stri_count_words(us.twitter))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

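The Twitter histogram looks quite different from the other two, probably because tweets are capped in length. The cap is on characters rather than words (140 characters at the time), which keeps the number of words per line low: at most 47 here, against maxima of 1123 for news and 6726 for blogs.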
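The stat_bin messages above indicate that the bin width was chosen automatically, which makes the three plots harder to compare. A sketch of a histogram with an explicit bin width follows; words_histogram is a hypothetical helper, and the axis labels are illustrative choices.

```r
library(stringi)
library(ggplot2)

# Sketch: histogram of words per line with an explicit bin width.
# words_histogram is an illustrative helper name.
words_histogram <- function(lines, binwidth = 1) {
  counts <- data.frame(words = stri_count_words(lines))
  ggplot(counts, aes(x = words)) +
    geom_histogram(binwidth = binwidth) +
    labs(x = "Words per line", y = "Number of lines")
}

# Possible usage with the vectors read earlier:
# words_histogram(us.twitter, binwidth = 1)
# max(stri_length(us.twitter))  # longest tweet in characters, to check the length cap
```

For news and blogs, which have very long tails, a larger bin width or a restricted x range may give a more readable picture.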