Introduction

This milestone report presents an exploratory data analysis of the SwiftKey data provided in the Coursera Data Science Capstone. The data consist of three text files, each drawn from a different source (Twitter, blogs and news). The data can be downloaded from the link below:

Link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
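
As a minimal sketch, the archive could be fetched from within R, assuming the link above is still reachable. The helper name fetch_archive and the local file name are hypothetical and not part of the original analysis; download.file and unzip are base R functions. The download is skipped when the file already exists, so the sketch can be re-run safely.

```r
# Hypothetical helper: download the archive only if it is not already present
fetch_archive <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the zip file intact on Windows
    download.file(url, destfile, mode = "wb")
  }
  destfile
}

data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Usage (not run here because it needs network access):
# zip_path <- fetch_archive(data_url, "Coursera-SwiftKey.zip")
# unzip(zip_path, list = TRUE)   # inspect the archive contents
# unzip(zip_path)                # extract into the working directory
```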

Load Data & Sample

Next, we need to load the data into R so we can start the analysis. We use readLines to read blogs, news and twitter.

# Set the correct working directory
setwd("~/Desktop/Data Scientist/Capstone/data/en_US")
# Read each of the 3 files using the readLines function,
# closing every connection once its file has been read

con <- file("en_US.twitter.txt", encoding = "UTF-8")
twitter <- readLines(con, skipNul = TRUE)
close(con)
con <- file("en_US.blogs.txt", encoding = "UTF-8")
blog <- readLines(con, skipNul = TRUE)
close(con)
con <- file("en_US.news.txt", encoding = "UTF-8")
news <- readLines(con, skipNul = TRUE)
close(con)

Before we start the analysis, we clean up any characters that cannot be represented in ASCII.

cleanedTwitter <- iconv(twitter, from = "UTF-8", to = "ASCII", sub = "byte")
cleanedBlog <- iconv(blog, from = "UTF-8", to = "ASCII", sub = "byte")
cleanedNews <- iconv(news, from = "UTF-8", to = "ASCII", sub = "byte")
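
To make the effect of this conversion concrete, here is a small sketch on toy strings (the values are illustrative and not taken from the corpus). With sub = "byte", each byte that cannot be converted is replaced by its hex code in angle brackets, so the cleaned text can be longer than the original; sub = "" is shown as an alternative that drops such bytes instead. The exact result may vary with the platform's iconv implementation.

```r
# Toy input: one plain ASCII string and one with an accented character
x <- c("plain text", "caf\u00e9")

# sub = "byte": each unconvertible byte is replaced by its hex code, e.g. <c3>
as_bytes <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")

# sub = "": unconvertible bytes are dropped instead
dropped <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

print(as_bytes)
print(dropped)
print(nchar(x))         # lengths before conversion
print(nchar(as_bytes))  # lengths after conversion with sub = "byte"
```

Because the substituted hex codes become ordinary text, character and word counts computed on the cleaned vectors can differ from counts on the raw lines.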

Because the files are large and running the calculations on the full data would be slow, we take a random sample of each file for the analysis. We decided to take 10000 lines from each file.

sampleTwitter<-sample(cleanedTwitter, 10000)
sampleBlog<-sample(cleanedBlog, 10000)
sampleNews<-sample(cleanedNews, 10000)
# Create folder for the sample
dir.create("sample", showWarnings = FALSE)
# Save file into the folder
write(sampleBlog, "sample/sample.blogs.txt")
write(sampleNews, "sample/sample.news.txt")
write(sampleTwitter, "sample/sample.twitter.txt")
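
The sampling above gives different lines on every run. A minimal sketch of a reproducible variant follows; the helper name sample_lines and the seed value are hypothetical and were not used in the original analysis.

```r
# Hypothetical helper: draw a reproducible sample of at most n lines
sample_lines <- function(x, n, seed = 1234) {
  set.seed(seed)                  # fix the random number generator state
  sample(x, min(n, length(x)))    # never ask for more lines than exist
}

toy_lines <- paste("line", 1:50)
first  <- sample_lines(toy_lines, 10)
second <- sample_lines(toy_lines, 10)
print(identical(first, second))   # TRUE: same seed, same sample
```

Fixing the seed means the saved sample files can be regenerated exactly, which helps when results need to be compared across runs.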

Basic Statistics

Before we analyze the files, we look at their sizes (presented in megabytes, MB).

# file size (in MegaBytes/MB)
file.info("en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
file.info("en_US.blogs.txt")$size   / 1024^2
## [1] 200.4242
file.info("en_US.news.txt")$size    / 1024^2
## [1] 196.2775
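
The three calls can also be folded into one, and the reported sizes added up. This is a sketch only: the helper name size_mb is hypothetical, and the totals use the values printed above.

```r
# Hypothetical helper: sizes of several files in MB (1 MB = 1024^2 bytes here)
size_mb <- function(paths) {
  sizes <- file.size(paths) / 1024^2
  names(sizes) <- basename(paths)
  sizes
}

# Usage with the three corpus files (not run here, the files must exist):
# size_mb(c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"))

# Total of the three sizes reported above
reported <- c(twitter = 159.3641, blogs = 200.4242, news = 196.2775)
total_mb <- sum(reported)
print(total_mb)  # about 556 MB
```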

For our analysis we need two libraries.

# library for character string analysis
library(stringi)
# library for plotting
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3

We count the lines and characters in each file.

stri_stats_general(twitter)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096241   134082806
stri_stats_general(blog)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
stri_stats_general(news)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223154   169860866
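
To clarify what the four columns mean, here is a base R approximation on a toy vector. The function name basic_stats is hypothetical, and the regular expressions only approximate stringi's Unicode-aware definition of white space, so results on real text may differ slightly from stri_stats_general.

```r
# Base R approximation of the four counts reported by stri_stats_general
basic_stats <- function(x) {
  c(
    Lines       = length(x),                      # number of items
    LinesNEmpty = sum(grepl("\\S", x)),           # items with a non-white character
    Chars       = sum(nchar(x)),                  # all characters
    CharsNWhite = sum(nchar(gsub("\\s", "", x)))  # characters that are not white space
  )
}

toy <- c("hello world", "", "  a ")
print(basic_stats(toy))
# Lines = 3, LinesNEmpty = 2, Chars = 15, CharsNWhite = 11
```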

Next we count the words per item (line). We summarise the distribution of these counts per corpus, using summary statistics and a distribution plot. First, the Twitter corpus.

words_twitter <- stri_count_words(twitter)
summary(words_twitter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00
qplot(words_twitter)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
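
The message above says that a default of 30 bins was used. A sketch of setting the bin width explicitly is shown below, assuming ggplot2 is installed; it uses ggplot() with geom_histogram() rather than qplot(), and toy counts stand in for words_twitter. The bin width of 1 word is an illustrative choice, not the one used in this report.

```r
# Toy word counts standing in for words_twitter (illustrative values only)
toy_counts <- c(3, 7, 7, 12, 12, 12, 18, 18, 25, 47)
plot_data <- data.frame(words = toy_counts)

# Draw the histogram only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(plot_data, ggplot2::aes(x = words)) +
    ggplot2::geom_histogram(binwidth = 1) +   # one bar per word count
    ggplot2::labs(x = "Words per item", y = "Number of items")
  print(p)
}
```

A bin width of 1 suits the short Twitter items; for the blogs and news corpora, with maxima in the thousands, a wider bin or a log-scaled axis would probably be easier to read.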

Next we analyze the blogs corpus.

words_blog <- stri_count_words(blog)
summary(words_blog)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
qplot(words_blog)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Lastly we analyze the news corpus.

words_news <- stri_count_words(news)
summary(words_news)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00
qplot(words_news)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
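
To compare the three corpora at a glance, the summaries can be collected in one table. The sketch below uses toy counts in place of words_twitter, words_blog and words_news, so the printed numbers are illustrative only.

```r
# Toy word counts standing in for words_twitter, words_blog and words_news
counts <- list(
  twitter = c(1, 7, 12, 18, 47),
  blogs   = c(0, 9, 28, 60, 200),
  news    = c(1, 19, 32, 46, 150)
)

# One row per corpus, one column per summary statistic
comparison <- t(sapply(counts, function(x) {
  c(min = min(x), median = median(x), mean = mean(x), max = max(x))
}))
print(comparison)
```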

Conclusions

We have analyzed three data sets (Twitter, blogs and news) with a total file size of about 556 megabytes (MB). The individual files range from about 159 MB (Twitter) to about 200 MB (blogs).

From the analysis, we found that:

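  1. The Twitter corpus has the most items, with over 2.3 million lines.
  2. The blogs and news corpora consist of about 0.9 million and 1 million items respectively.
  3. Twitter messages have a character limit of 140, and no message in the data exceeds 47 words.
  4. The blogs and news corpora have broadly similar word-count distributions (medians of 28 and 32 words per item), whereas the Twitter distribution differs (median of 12 words) because of the character limit.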
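
The character limit can be checked directly against the data. The sketch below runs on toy messages; applying it to the raw twitter vector is left as an assumption, since that check was not part of the analysis above. It should be run on the raw lines rather than the cleaned ones, because the byte substitution in the cleaning step can lengthen a message.

```r
# Toy messages standing in for the raw twitter vector
toy_tweets <- c(
  "short message",
  paste(rep("a", 140), collapse = ""),   # a message exactly at the limit
  "another short one"
)

char_lengths <- nchar(toy_tweets, type = "chars")
print(max(char_lengths))          # longest message, in characters
print(all(char_lengths <= 140))   # TRUE if every message respects the limit
```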