In this document we carry out some basic exploratory analysis, as we would like to understand the data better before we move on to working with it.
# Ensure the download directory exists before fetching the archive
dir.create('data/raw_data', recursive = TRUE, showWarnings = FALSE)
download.file(url = 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip', method = 'curl', destfile = 'data/raw_data/Coursera-SwiftKey.zip')
unzip('data/raw_data/Coursera-SwiftKey.zip', exdir = 'data/raw_data/')
First we examine some basic statistics on the raw data files:
Blog - 200.42 MB, 899,288 lines
News - 196.28 MB, 1,010,242 lines
Twitter - 159.36 MB, 2,360,148 lines
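A minimal sketch of how these figures could be obtained, assuming the standard file layout produced by the unzip step above; it also defines the blog_line_count, news_line_count and twitter_line_count variables used when sampling below:

blog_path    <- 'data/raw_data/final/en_US/en_US.blogs.txt'
news_path    <- 'data/raw_data/final/en_US/en_US.news.txt'
twitter_path <- 'data/raw_data/final/en_US/en_US.twitter.txt'

# File sizes in MB
round(file.size(c(blog_path, news_path, twitter_path)) / 1024^2, 2)

# Line counts, reused by the sampling code further down
blog_line_count    <- length(readLines(blog_path, skipNul = TRUE))
news_line_count    <- length(readLines(news_path, skipNul = TRUE))
twitter_line_count <- length(readLines(twitter_path, skipNul = TRUE))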
As the data set is fairly large, we analyse a random sample and then infer our findings back onto the full data set. We take at most 2% of each source, with the sampling proportions chosen via a confidence-interval calculation so that each sample can be considered representative.
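As an illustration of the kind of confidence-interval reasoning referred to above (a sketch only, assuming a 95% confidence level, a 1% margin of error and worst-case variability; the exact proportions used below may have been derived with different parameters):

z <- qnorm(0.975)  # ~1.96, critical value for a 95% confidence level
e <- 0.01          # assumed margin of error
p <- 0.5           # worst-case proportion (maximises the required sample size)
n_required <- ceiling(z^2 * p * (1 - p) / e^2)  # ~9604 lines per source
n_required / c(blog = 899288, news = 1010242, twitter = 2360148)  # fraction of each corpus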
setwd('data/raw_data/final/en_US/')
library(caret)  # provides createDataPartition()
# Blog: sample ~2% of the line indices
blog_sample_index <- createDataPartition(y = 1:blog_line_count, times = 1, p = 0.02, list = FALSE)
blog_sample_index_bool <- (1:blog_line_count) %in% blog_sample_index
blog_con <- file('en_US.blogs.txt', open = 'rt')
blog_lines <- readLines(blog_con)
close(blog_con)
blog_sample <- blog_lines[blog_sample_index]
remove(blog_lines)
writeLines(blog_sample, 'en_US.blogs.sample.txt')
# Twitter: sample ~0.7% of the line indices
twitter_sample_index <- createDataPartition(y = 1:twitter_line_count, times = 1, p = 0.007, list = FALSE)
twitter_sample_index_bool <- 1:twitter_line_count %in% twitter_sample_index
twitter_con <- file('en_US.twitter.txt', open = 'rt')
twitter_lines <- readLines(twitter_con, skipNul = TRUE)  # skip embedded NUL characters
close(twitter_con)
twitter_sample <- twitter_lines[twitter_sample_index]
remove(twitter_lines)
writeLines(twitter_sample, 'en_US.twitter.sample.txt')
# News: sample ~1.62% of the line indices
news_sample_index <- createDataPartition(y = 1:news_line_count, times = 1, p = 0.0162, list = FALSE)
news_sample_index_bool <- 1:news_line_count %in% news_sample_index
news_con <- file('en_US.news.txt', open = 'rt')
news_lines <- readLines(news_con)
close(news_con)
news_sample <- news_lines[news_sample_index]
remove(news_lines)
writeLines(news_sample, 'en_US.news.sample.txt')
representative_sample <- c(blog_sample, news_sample, twitter_sample)
writeLines(representative_sample, "en_US.all.sample.txt")
Now that we have a sample we are confident is representative, we can analyse it and infer our findings onto the general population. We take a look at each sample's word counts to see how the different sources differ.
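The wc helper used below is not defined in this section; a minimal whitespace-based word counter (an assumed implementation, returning NA for empty lines, which would be consistent with the NA's in the summaries further down) might look like:

wc <- function(x) {
  # Count whitespace-separated tokens; treat missing or empty lines as NA
  if (is.na(x) || !nzchar(trimws(x))) return(NA_integer_)
  length(strsplit(trimws(x), "\\s+")[[1]])
}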
blog_sample_wc <- sapply(blog_sample, wc)
news_sample_wc <- sapply(news_sample, wc)
twitter_sample_wc <- sapply(twitter_sample, wc)
hist(blog_sample_wc, xlab = "Word count", main = "Blog sample word counts")
hist(news_sample_wc, xlab = "Word count", main = "News sample word counts")
hist(twitter_sample_wc, xlab = "Word count", main = "Twitter sample word counts")
The histograms above show that all three data sets are heavily skewed towards shorter texts. The blog sample contains the longest entries at roughly 600 words, the news sample comes second with a maximum of just over 200 words, and the Twitter sample is shortest with around 30 words at most.
Finally, we look at a summary of the word counts to get more precise figures for each sample.
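The summaries below (shown in order for the blog, news and Twitter samples) presumably come from calls along these lines:

summary(blog_sample_wc)
summary(news_sample_wc)
summary(twitter_sample_wc)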
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 8.00 28.00 40.98 59.00 569.00 11
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 18.00 30.00 32.81 44.00 215.00 13
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.44 18.00 32.00
After exploring our data at this basic level, we begin to get a feel for what our sources are like. Although blogs and news articles are generally thought of as longer-form writing, and the maximum lengths confirm this, their length distributions are still so heavily skewed towards short texts that they end up resembling the Twitter source, which is restricted in the number of characters per message.
Our next steps are to clean the data further, removing stop words, and to build n-grams, which will ultimately help us create a predictive model.
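As a rough illustration of the n-gram step we have in mind (a base-R sketch only; build_ngrams is a hypothetical helper, and the eventual implementation may well use a dedicated text-mining package instead):

build_ngrams <- function(texts, n = 2) {
  # Lower-case, split on anything that is not a letter or apostrophe,
  # then paste every run of n consecutive tokens into an n-gram
  tokens <- strsplit(tolower(texts), "[^a-z']+")
  unlist(lapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]
    if (length(tok) < n) return(character(0))
    sapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "))
  }))
}

# Example: the ten most frequent bigrams in the combined sample
bigrams <- build_ngrams(representative_sample, n = 2)
head(sort(table(bigrams), decreasing = TRUE), 10)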