The goal of this capstone is to mimic the experience of a working data scientist by applying data science techniques learned across all nine specialization courses to create a data product and presentation for SwiftKey.
For Week 1, the main objectives are to understand the problem, acquire the data, and understand the type of data we are dealing with. The data can be downloaded from
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Three working files are extracted from the zip archive: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
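A minimal sketch of how the archive can be fetched and the three files extracted (the final/en_US/ paths inside the zip are assumed, not confirmed by this report):
zipfile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipfile)) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = zipfile, mode = "wb")
}
# Extract only the English files; junkpaths = TRUE drops the directory
# components so the files land in the working directory
unzip(zipfile,
      files = c("final/en_US/en_US.blogs.txt",
                "final/en_US/en_US.news.txt",
                "final/en_US/en_US.twitter.txt"),
      junkpaths = TRUE)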
The libraries chosen to begin with are:
library(magrittr)  # pipe operator (%>%)
library(stringi)   # fast string processing and counting
library(ggplot2)   # plotting
The data is read into memory:
blogfile    <- "en_US.blogs.txt"
newsfile    <- "en_US.news.txt"
twitterfile <- "en_US.twitter.txt"
blog.line    <- readLines(blogfile, encoding = "UTF-8", skipNul = TRUE)
news.line    <- readLines(newsfile, encoding = "UTF-8", skipNul = TRUE)
twitter.line <- readLines(twitterfile, encoding = "UTF-8", skipNul = TRUE)
Count the words on each line of the data:
blog.word.count    <- stri_count_words(blog.line)
news.word.count    <- stri_count_words(news.line)
twitter.word.count <- stri_count_words(twitter.line)
Produce a preliminary summary of the blog data.
Number of lines:
## [1] 899288
Number of words:
## [1] 37546246
Summary of words per line:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
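These figures can be reproduced directly from the objects defined above; the same three calls apply to the news and Twitter data:
length(blog.line)         # number of lines
sum(blog.word.count)      # total number of words
summary(blog.word.count)  # distribution of words per line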
Produce a preliminary summary of the news data.
Number of lines:
## [1] 77259
Number of words:
## [1] 2674536
Summary of words per line:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.62 46.00 1123.00
Produce a preliminary summary of the Twitter data.
Number of lines:
## [1] 2360148
Number of words:
## [1] 30093369
Summary of words per line:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
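For easier side-by-side comparison, the three summaries can also be collected into a single data frame; a small sketch using the objects defined above:
file.summary <- data.frame(
  file  = c("blogs", "news", "twitter"),
  lines = c(length(blog.line), length(news.line), length(twitter.line)),
  words = c(sum(blog.word.count), sum(news.word.count), sum(twitter.word.count))
)
file.summary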
Split the lines into words:
# Simple tokenization: split each line on single spaces
blog.word    <- unlist(strsplit(blog.line, " "))
news.word    <- unlist(strsplit(news.line, " "))
twitter.word <- unlist(strsplit(twitter.line, " "))
Count the whitespace and punctuation characters, the non-ASCII words, and the words containing digits:
# Whitespace characters
blog.blankspace    <- sum(stri_count(blog.line, regex = "\\p{Space}"))
news.blankspace    <- sum(stri_count(news.line, regex = "\\p{Space}"))
twitter.blankspace <- sum(stri_count(twitter.line, regex = "\\p{Space}"))
# Punctuation characters
blog.punc    <- sum(stri_count(blog.line, regex = "\\p{Punct}"))
news.punc    <- sum(stri_count(news.line, regex = "\\p{Punct}"))
twitter.punc <- sum(stri_count(twitter.line, regex = "\\p{Punct}"))
# Words containing non-ASCII characters
blog.nonEnglish    <- length(blog.word[!stri_enc_isascii(blog.word)])
news.nonEnglish    <- length(news.word[!stri_enc_isascii(news.word)])
twitter.nonEnglish <- length(twitter.word[!stri_enc_isascii(twitter.word)])
# Words containing digits
blog.number    <- length(blog.word[stri_detect_regex(blog.word, "[[:digit:]]")])
news.number    <- length(news.word[stri_detect_regex(news.word, "[[:digit:]]")])
twitter.number <- length(twitter.word[stri_detect_regex(twitter.word, "[[:digit:]]")])
Analysis of the blog data
From the words-per-line distribution, the 25th, 50th, and 95th percentiles are as below:
## 25% 50% 95%
## 9 28 126
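These percentiles, and the words-per-line plot the text refers to, can be produced as follows; a sketch that assumes the original plot was a histogram (ggplot2 is loaded above):
quantile(blog.word.count, probs = c(0.25, 0.50, 0.95))
ggplot(data.frame(words = blog.word.count), aes(x = words)) +
  geom_histogram(binwidth = 5) +
  coord_cartesian(xlim = c(0, 250)) +  # most lines fall well under 250 words
  labs(title = "Words per line: blogs", x = "words per line", y = "lines")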
Number of lines:
## [1] 899288
Number of words (space-split tokens):
## [1] 37334131
Top 10 words:
##
## the to and of a I in that is
## 1659151 1043878 1015714 862906 857102 738534 540436 421628 412438
## for
## 337156
Note that the most frequent words are common English stop words.
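A sketch of how the top-10 table can be obtained from the word vector built earlier (the same call applies to the news and Twitter vectors):
# Ten most frequent tokens in the blog data
head(sort(table(blog.word), decreasing = TRUE), 10)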
Number of whitespace characters:
## [1] 36434843
Number of punctuation characters:
## [1] 6536746
Number of non-ASCII words:
## [1] 716174
Number of words containing digits:
## [1] 411373
Analysis of the news data
From the words-per-line distribution, the 25th, 50th, and 95th percentiles are as below:
## 25% 50% 95%
## 19 32 74
Number of lines:
## [1] 77259
Number of words (space-split tokens):
## [1] 2643969
Top 10 words:
##
## the to and a of in for that is on
## 131810 68417 65167 63401 58675 47526 25498 23916 21232 19198
Note that the most frequent words are common English stop words.
Number of whitespace characters:
## [1] 2566710
Number of punctuation characters:
## [1] 533196
Number of non-ASCII words:
## [1] 22587
Number of words containing digits:
## [1] 64181
Analysis of the Twitter data
From the words-per-line distribution, the 25th, 50th, and 95th percentiles are as below:
## 25% 50% 95%
## 7 12 25
Number of lines:
## [1] 2360148
Number of words (space-split tokens):
## [1] 30373583
Top 10 words:
##
## the to I a you and for of in is
## 837023 761902 604531 572691 416377 397642 368422 349367 348815 329396
Note that the most frequent words are common English stop words.
Number of whitespace characters:
## [1] 28013435
Number of punctuation characters:
## [1] 7877048
Number of non-ASCII words:
## [1] 114774
Number of words containing digits:
## [1] 505709
This concludes Tasks 0 and 1: understanding the data in the blogs, news, and Twitter text files. The next step is to build the corpus based on the understanding acquired in this part.
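As a pointer to that next step: given the size of the files, one practical approach is to build the corpus from a random sample of lines. A minimal sketch, assuming a 1% sampling rate (the rate is an assumption, not a project requirement):
set.seed(1234)       # for a reproducible sample
sample.rate <- 0.01  # assumed 1% sample; tune to available memory
blog.sample    <- sample(blog.line,    round(length(blog.line)    * sample.rate))
news.sample    <- sample(news.line,    round(length(news.line)    * sample.rate))
twitter.sample <- sample(twitter.line, round(length(twitter.line) * sample.rate))
corpus.sample  <- c(blog.sample, news.sample, twitter.sample)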