Summary

In this document we perform some basic exploratory analysis, as we would like to understand the data better before we move on to working with it further.

Get raw data

  # Create the destination directory if needed, then download and unzip the raw data
  dir.create('data/raw_data', recursive = TRUE, showWarnings = FALSE)
  download.file(url = 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip', destfile = 'data/raw_data/Coursera-SwiftKey.zip', method = 'curl')
  unzip('data/raw_data/Coursera-SwiftKey.zip', exdir = 'data/raw_data/')

Explore raw data

First, we examine some basic statistics on the raw data files; a short sketch of how these figures can be computed follows below.

Data file sizes

Blog - 200.42 MB

News - 196.28 MB

Twitter - 159.36 MB

Data file line counts

Blog - 899288 lines

News - 1010242 lines

Twitter - 2360148 lines
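
The figures above can be reproduced with base R; below is a minimal sketch, assuming the archive was unzipped into data/raw_data/ as shown earlier.

  # Sketch: file sizes (in MB) and line counts for the three source files
  files <- c(blogs   = 'data/raw_data/final/en_US/en_US.blogs.txt',
             news    = 'data/raw_data/final/en_US/en_US.news.txt',
             twitter = 'data/raw_data/final/en_US/en_US.twitter.txt')
  
  file.size(files) / 1024^2                                        # sizes in MB
  sapply(files, function(f) length(readLines(f, skipNul = TRUE)))  # line counts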

Create data sample

As the data we are dealing with is fairly large, we use random sampling for the analysis and then infer our findings onto the full data set. We take at most 2% of each of the data sets, with the sampling fractions chosen, based on a confidence interval calculation, so that each sample can be considered representative.
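
As a rough sanity check for these fractions, a standard sample size calculation for estimating a proportion (the exact parameters used originally are not recorded here; this sketch assumes a 95% confidence level, maximum variability and a 1% margin of error) gives a required sample size well below 2% of even the smallest file:

  # Sketch: required sample size for estimating a proportion
  z <- qnorm(0.975)                  # 95% confidence level
  p <- 0.5                           # maximum variability (worst case)
  e <- 0.01                          # 1% margin of error
  n_required <- ceiling(z^2 * p * (1 - p) / e^2)
  n_required                         # ~9604 lines
  n_required / 899288                # ~1.1% of the smallest file (blogs)

With the fractions used below (2% for blogs, 0.7% for Twitter and 1.62% for news), each sample comes to roughly 16,000-18,000 lines, comfortably above this threshold.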

  library(caret)  # for createDataPartition()
  
  setwd('data/raw_data/final/en_US/')
  
  # Line counts from the exploration above
  blog_line_count    <- 899288
  news_line_count    <- 1010242
  twitter_line_count <- 2360148
  
  # Blog: sample ~2% of the lines
  blog_sample_index <- createDataPartition(y = 1:blog_line_count, times = 1, p = 0.02, list = FALSE)
  
  blog_con <- file('en_US.blogs.txt', open = 'rt')
  blog_lines <- readLines(blog_con)
  close(blog_con)
  
  blog_sample <- blog_lines[blog_sample_index]
  remove(blog_lines)
  
  writeLines(blog_sample, 'en_US.blogs.sample.txt')
  
  # Twitter: sample ~0.7% of the lines (skipNul = TRUE skips embedded nul characters)
  twitter_sample_index <- createDataPartition(y = 1:twitter_line_count, times = 1, p = 0.007, list = FALSE)
  
  twitter_con <- file('en_US.twitter.txt', open = 'rt')
  twitter_lines <- readLines(twitter_con, skipNul = TRUE)
  close(twitter_con)
  
  twitter_sample <- twitter_lines[twitter_sample_index]
  remove(twitter_lines)
  
  writeLines(twitter_sample, 'en_US.twitter.sample.txt')
  
  # News: sample ~1.62% of the lines
  news_sample_index <- createDataPartition(y = 1:news_line_count, times = 1, p = 0.0162, list = FALSE)
  
  news_con <- file('en_US.news.txt', open = 'rt')
  news_lines <- readLines(news_con)
  close(news_con)
  
  news_sample <- news_lines[news_sample_index]
  remove(news_lines)
  
  writeLines(news_sample, 'en_US.news.sample.txt')
  
  # Combine the three samples into one representative sample
  representative_sample <- c(blog_sample, news_sample, twitter_sample)
  
  writeLines(representative_sample, "en_US.all.sample.txt")

Explore raw representative sample data

Now that we have a sample that we are confident is representative of the full data set, we take a look at each sample's word counts to see how the different sources differ.

  # Simple word-count helper (the original wc() definition is not shown;
  # this assumes counting whitespace-separated tokens per line)
  wc <- function(x) length(strsplit(x, "\\s+")[[1]])
  
  blog_sample_wc    <- sapply(blog_sample, wc)
  news_sample_wc    <- sapply(news_sample, wc)
  twitter_sample_wc <- sapply(twitter_sample, wc)
  
  # Histograms of words per line for each source
  hist(blog_sample_wc, xlab = "Word counts", main = "Blog sample wc")
  hist(news_sample_wc, xlab = "Word counts", main = "News sample wc")
  hist(twitter_sample_wc, xlab = "Word counts", main = "Twitter sample wc")

The histograms above show that all three distributions are heavily right-skewed, with most lines containing relatively few words. The blog sample has the longest lines, reaching roughly 600 words; the news sample comes second with a maximum of just over 200 words; and the Twitter sample tops out at just over 30 words.

Finally, we look at numerical summaries of the word counts to get more precise information about each sample.
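
The summaries below can be produced with summary() on the word count vectors computed above:

  summary(blog_sample_wc)
  summary(news_sample_wc)
  summary(twitter_sample_wc)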

Blog word count summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    8.00   28.00   40.98   59.00  569.00      11

News word count summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   18.00   30.00   32.81   44.00  215.00      13

Twitter word count summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.44   18.00   32.00

Conclusion

After exploring the data at a basic level, we begin to get a feel for what our sources are like. Although blog and news entries would generally be thought of as longer pieces, and this remains true, their length distributions are still heavily skewed towards shorter entries, which makes them look similar to the Twitter data source, where length is restricted by a character limit.

Our next steps are to further clean the data, remove stop words, and create n-grams, which will ultimately help us build a predictive model.
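
As a preview of that step, here is a minimal sketch (the stop word list and example sentence are purely illustrative) of removing stop words and building bigrams from a single line using base R:

  # Sketch: strip stop words and build bigrams from one line of text
  stop_words <- c("the", "a", "an", "and", "of", "to", "in")   # illustrative subset
  line   <- "the quick brown fox jumps over the lazy dog"
  tokens <- strsplit(tolower(line), "\\s+")[[1]]
  tokens <- tokens[!tokens %in% stop_words]
  paste(head(tokens, -1), tail(tokens, -1))
  ## "quick brown" "brown fox"  "fox jumps"  "jumps over" "over lazy"  "lazy dog"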