Coursera Data Science Capstone

The goal of this capstone is to mimic the experience of being a data scientist by applying the techniques learned across all nine courses of the specialization to create a data product and a presentation for SwiftKey.

For Week 1, the main objective is to understand the problem, acquire the data, and understand the type of data we are dealing with. The data can be downloaded from:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Three working files are extracted from the zip file:

  1. “en_US.blogs.txt”
  2. “en_US.news.txt”
  3. “en_US.twitter.txt”
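
The download and extraction step can be sketched as follows. This is a minimal sketch rather than the code used for this report: the helper name `fetch_swiftkey`, the local zip file name and the `data` directory are assumptions, and the text files are located by pattern because the folder layout inside the archive is not stated here.

```r
# Hypothetical helper: download the archive once, unzip it, and return
# the paths of the English text files found under `exdir`.
fetch_swiftkey <- function(url, zipfile = "Coursera-SwiftKey.zip", exdir = "data") {
  if (!file.exists(zipfile)) {
    # mode = "wb" keeps the zip file intact on Windows
    download.file(url, destfile = zipfile, mode = "wb")
  }
  unzip(zipfile, exdir = exdir)
  # Search recursively instead of assuming a folder layout inside the zip
  list.files(exdir, pattern = "^en_US\\..*\\.txt$",
             recursive = TRUE, full.names = TRUE)
}

# Usage (not run here because it downloads a large file):
# files <- fetch_swiftkey("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip")
```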

Data preparation

The following libraries are loaded to begin with:

library(magrittr)
library(stringi)
library(ggplot2)

The data is read and stored line by line:

blogfile <- "en_US.blogs.txt"
newsfile <- "en_US.news.txt"
twitterfile <- "en_US.twitter.txt"

blog.line <- readLines(blogfile, encoding = "UTF-8", skipNul = TRUE)
news.line <- readLines(newsfile, encoding = "UTF-8", skipNul = TRUE)
twitter.line <- readLines(twitterfile, encoding = "UTF-8", skipNul = TRUE)
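
One caveat when reading large text files: on some platforms, notably Windows, a text-mode read can stop at an embedded end-of-file control character and silently return fewer lines than the file contains. If a line count looks lower than expected, reading through a binary-mode connection is a common workaround. The sketch below illustrates this on a small temporary file; the helper name `read_lines_binary` is hypothetical and is not part of the code used for this report.

```r
# Hypothetical helper: read all lines through a binary-mode connection
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Toy file whose second line contains a 0x1A control byte
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first\nsecond"), as.raw(0x1a),
           charToRaw("still second\nthird\n")), tmp)

toy.line <- read_lines_binary(tmp)
length(toy.line)
```

Comparing `length()` from both reading methods on the same file is a quick way to check whether any lines were lost.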

Understanding the data (preliminary)

Count the words on each line of the data:

blog.word.count<-stri_count_words(blog.line)
news.word.count<-stri_count_words(news.line)
twitter.word.count<-stri_count_words(twitter.line)
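
The per-source figures reported below (number of lines, number of words, and a summary of words per line) could be obtained along these lines. This is a sketch on toy data; the helper name `summarise_lines` is an assumption and does not appear in the original code.

```r
library(stringi)

# Hypothetical helper: line count, total word count and per-line summary
summarise_lines <- function(lines) {
  word.count <- stri_count_words(lines)
  list(
    n.lines  = length(lines),
    n.words  = sum(word.count),
    per.line = summary(word.count)
  )
}

toy.line <- c("The quick brown fox", "jumps over", "the lazy dog today")
toy.summary <- summarise_lines(toy.line)

toy.summary$n.lines
toy.summary$n.words
toy.summary$per.line
```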

Blog

Preliminary summary of the blog data:

Number of lines:

## [1] 899288

Number of words:

## [1] 37546246

Summary of word counts per line:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00

News

Preliminary summary of the news data:

Number of lines:

## [1] 77259

Number of words:

## [1] 2674536

Summary of word counts per line:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.62   46.00 1123.00

Twitter

Preliminary summary of the Twitter data:

Number of lines:

## [1] 2360148

Number of words:

## [1] 30093369

Summary of word counts per line:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00

Data Processing

Split the lines into words:

blog.word<-unlist(strsplit(blog.line," "))
news.word<-unlist(strsplit(news.line," "))
twitter.word<-unlist(strsplit(twitter.line," "))
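
Splitting on a single space and counting with `stri_count_words()` follow different rules, so the two approaches need not give the same number of words. The toy example below illustrates this; the input string is made up for the purpose.

```r
library(stringi)

toy.line <- "Wait -- what?  OK"

# Splitting on a single space keeps punctuation-only tokens and yields an
# empty string between two consecutive spaces
toy.word <- unlist(strsplit(toy.line, " "))
length(toy.word)

# stri_count_words() applies word-boundary rules and skips such tokens
stri_count_words(toy.line)
```

This is one possible reason why word totals computed from the split tokens can differ from totals computed with `stri_count_words()`.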

Count the whitespace characters, punctuation characters, non-ASCII words, and words containing digits:

# Whitespace characters per source
blog.blankspace<-sum(stri_count(blog.line,regex="\\p{Space}"))
news.blankspace<-sum(stri_count(news.line,regex="\\p{Space}"))
twitter.blankspace<-sum(stri_count(twitter.line,regex="\\p{Space}"))

# Punctuation characters per source
blog.punc<-sum(stri_count(blog.line,regex="\\p{Punct}"))
news.punc<-sum(stri_count(news.line,regex="\\p{Punct}"))
twitter.punc<-sum(stri_count(twitter.line,regex="\\p{Punct}"))

# Words containing at least one non-ASCII character
blog.nonEnglish <- length(blog.word[stri_enc_isascii(unlist(blog.word))==FALSE])
news.nonEnglish <- length(news.word[stri_enc_isascii(unlist(news.word))==FALSE])
twitter.nonEnglish <- length(twitter.word[stri_enc_isascii(unlist(twitter.word))==FALSE])

# Words containing at least one digit
blog.number<-length(blog.word[stri_detect_regex(blog.word,"[:digit:]")==TRUE])
news.number<-length(news.word[stri_detect_regex(news.word,"[:digit:]")==TRUE])
twitter.number<-length(twitter.word[stri_detect_regex(twitter.word,"[:digit:]")==TRUE])
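
To make clear what each of these counts measures, here is the same idea applied to a single made-up sentence. This is a sketch only: the `toy.*` and `n.*` names are illustrative, and `\\d` is swapped in as the digit pattern for the example.

```r
library(stringi)

toy.line <- "Hi, there! Caf\u00e9 opens at 9"
toy.word <- unlist(strsplit(toy.line, " "))

# Characters counted over the whole line
n.space <- sum(stri_count_regex(toy.line, "\\p{Space}"))
n.punct <- sum(stri_count_regex(toy.line, "\\p{Punct}"))

# Words counted over the split tokens
n.nonascii <- sum(!stri_enc_isascii(toy.word))       # tokens with a non-ASCII byte
n.digit    <- sum(stri_detect_regex(toy.word, "\\d")) # tokens containing a digit

c(space = n.space, punct = n.punct, nonascii = n.nonascii, digit = n.digit)
```

Note that the first two figures count characters, while the last two count words.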

Understanding of the Data

Blog

Analysis of information for blog

From the plot above, the 25th, 50th and 95th percentiles of the number of words per line are:

## 25% 50% 95% 
##   9  28 126
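
Percentiles in this form can be computed with base R's `quantile()`. The sketch below uses a small toy vector in place of the per-line word counts; `toy.word.count` and `toy.q` are illustrative names.

```r
# Toy word counts standing in for the per-line word counts
toy.word.count <- c(1, 2, 3, 4, 5)

# 25th, 50th and 95th percentiles (default interpolation, type 7)
toy.q <- quantile(toy.word.count, probs = c(0.25, 0.50, 0.95))
toy.q
```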

Number of lines:

## [1] 899288

Number of words:

## [1] 37334131

Top 10 words:

## 
##     the      to     and      of       a       I      in    that      is 
## 1659151 1043878 1015714  862906  857102  738534  540436  421628  412438 
##     for 
##  337156
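
A frequency table like the one above can be built with `table()` and `sort()`. This is a minimal sketch on toy tokens, not necessarily the exact code behind the output shown.

```r
# Toy tokens standing in for the split words
toy.word <- c("the", "the", "the", "to", "to", "and")

# Count each distinct token, sort by frequency, keep the most frequent
toy.top <- head(sort(table(toy.word), decreasing = TRUE), 2)
toy.top
```

Because `table()` is case-sensitive, tokens such as "The" and "the" would be counted separately unless the text is lowercased first.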

Note that the most frequent words are all common English stop words.

Number of whitespace characters:

## [1] 36434843

Number of punctuation characters:

## [1] 6536746

Number of non-ASCII words:

## [1] 716174

Number of words containing digits:

## [1] 411373

News

Analysis of information for news

From the plot above, the 25th, 50th and 95th percentiles of the number of words per line are:

## 25% 50% 95% 
##  19  32  74

Number of lines:

## [1] 77259

Number of words:

## [1] 2643969

Top 10 words:

## 
##    the     to    and      a     of     in    for   that     is     on 
## 131810  68417  65167  63401  58675  47526  25498  23916  21232  19198

Note that the most frequent words are all common English stop words.

Number of whitespace characters:

## [1] 2566710

Number of punctuation characters:

## [1] 533196

Number of non-ASCII words:

## [1] 22587

Number of words containing digits:

## [1] 64181

Twitter

Analysis of information for Twitter

From the plot above, the 25th, 50th and 95th percentiles of the number of words per line are:

## 25% 50% 95% 
##   7  12  25

Number of lines:

## [1] 2360148

Number of words:

## [1] 30373583

Top 10 words:

## 
##    the     to      I      a    you    and    for     of     in     is 
## 837023 761902 604531 572691 416377 397642 368422 349367 348815 329396

Note that the most frequent words are all common English stop words.

Number of whitespace characters:

## [1] 28013435

Number of punctuation characters:

## [1] 7877048

Number of non-ASCII words:

## [1] 114774

Number of words containing digits:

## [1] 505709

This concludes Tasks 0 and 1, whose aim was to understand the data in the blog, news and Twitter text files. The next step is to build the corpus based on the understanding gained here.