This milestone report presents an exploratory data analysis of the SwiftKey data provided in the Coursera Data Science Capstone. The data consist of three text files, each drawn from a different source (Twitter, blogs and news). The data can be downloaded from the link below:
Link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
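The download step can also be scripted. The following is a minimal sketch only: the helper name `fetch_corpus` and the `dest_dir` default are assumptions made for illustration and are not part of the original report.

```r
# Hypothetical helper: download the archive once and unpack it.
# The function name and the dest_dir default are illustrative assumptions.
fetch_corpus <- function(url, dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  zip_path <- file.path(dest_dir, basename(url))
  if (!file.exists(zip_path)) {
    # mode = "wb" keeps the zip archive intact on Windows
    download.file(url, destfile = zip_path, mode = "wb")
  }
  # unzip() returns (invisibly) the paths of the extracted files
  unzip(zip_path, exdir = dest_dir)
}
```

Calling `fetch_corpus()` with the link above would place the archive and its contents under `data/`; the working directory then has to point at the folder that holds the en_US files.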
Next, we load the data into R so we can start the analysis. We use readLines to read the blogs, news and Twitter files.
# Set the correct working directory
setwd("~/Desktop/Data Scientist/Capstone/data/en_US")
# Read each of the 3 files with readLines, closing every connection after use
con <- file("en_US.twitter.txt", encoding = "UTF-8")
twitter <- readLines(con, skipNul = TRUE)
close(con)
con <- file("en_US.blogs.txt", encoding = "UTF-8")
blog <- readLines(con, skipNul = TRUE)
close(con)
con <- file("en_US.news.txt", encoding = "UTF-8")
news <- readLines(con, skipNul = TRUE)
close(con)
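# Convert to ASCII; bytes that cannot be converted are replaced by their hex codes, e.g. <e2>
cleanedTwitter <- iconv(twitter, "UTF-8", "ASCII", sub = "byte")
cleanedBlog <- iconv(blog, "UTF-8", "ASCII", sub = "byte")
cleanedNews <- iconv(news, "UTF-8", "ASCII", sub = "byte")
# Draw a random sample of 10,000 lines from each source
sampleTwitter <- sample(cleanedTwitter, 10000)
sampleBlog <- sample(cleanedBlog, 10000)
sampleNews <- sample(cleanedNews, 10000)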
# Create a folder for the samples
dir.create("sample", showWarnings = FALSE)
# Save the samples into the folder
write(sampleBlog, "sample/sample.blogs.txt")
write(sampleNews, "sample/sample.news.txt")
write(sampleTwitter, "sample/sample.twitter.txt")
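The reading and sampling steps above can be wrapped in small helpers so that every connection is released and the sample can be reproduced. This is only a sketch: the names `read_corpus` and `sample_lines` and the seed value are hypothetical, and the demonstration runs on a small temporary file rather than the real corpus.

```r
# Hypothetical helper: read a UTF-8 text file and always release the connection.
read_corpus <- function(path) {
  con <- file(path, encoding = "UTF-8")
  on.exit(close(con))  # runs even if readLines() fails
  readLines(con, skipNul = TRUE)
}

# Hypothetical helper: draw a reproducible sample of at most n lines.
sample_lines <- function(lines, n, seed = 1234) {
  set.seed(seed)  # assumed seed; any fixed value gives a repeatable sample
  sample(lines, min(n, length(lines)))
}

# Demonstration on a small temporary file instead of the real corpus.
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:50), tmp)
demo_lines <- read_corpus(tmp)
demo_sample <- sample_lines(demo_lines, 10)
```

With the real data, `sample_lines(read_corpus("en_US.twitter.txt"), 10000)` would mirror the Twitter sample drawn above, with the difference that rerunning it returns the same lines.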
Before we analyze the files, we look at their sizes in megabytes (MB).
# File size in megabytes (MB)
file.info("en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
file.info("en_US.blogs.txt")$size / 1024^2
## [1] 200.4242
file.info("en_US.news.txt")$size / 1024^2
## [1] 196.2775
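The three size checks above can be collected into a single table. The sketch below assumes a hypothetical helper `file_size_mb` and uses the same convention as above (1 MB taken as 1024^2 bytes); it is demonstrated on a temporary file of known size.

```r
# Hypothetical helper: sizes of several files in MB (1 MB = 1024^2 bytes here).
file_size_mb <- function(paths) {
  sizes <- file.info(paths)$size / 1024^2
  data.frame(file = basename(paths), size_mb = round(sizes, 2))
}

# Demonstration with a temporary file of exactly 2 * 1024^2 bytes.
tmp <- tempfile(fileext = ".bin")
writeBin(raw(2 * 1024^2), tmp)
demo_sizes <- file_size_mb(tmp)
```

For the report, `file_size_mb(c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"))` would give one row per file.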
For our analysis we need two libraries.
# library for character string analysis
library(stringi)
# library for plotting
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
Next, we count the lines and characters in each corpus.
stri_stats_general(twitter)
##       Lines LinesNEmpty       Chars CharsNWhite
##     2360148     2360148   162096241   134082806
stri_stats_general(blog)
##       Lines LinesNEmpty       Chars CharsNWhite
##      899288      899288   206824382   170389539
stri_stats_general(news)
##       Lines LinesNEmpty       Chars CharsNWhite
##     1010242     1010242   203223154   169860866
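To compare the corpora side by side, the per-corpus statistics can be gathered into one matrix. The sketch below uses tiny stand-in vectors (an assumption made so the example runs on its own); in the report, the list would hold `twitter`, `blog` and `news`.

```r
library(stringi)

# Small stand-in corpora; the real vectors are twitter, blog and news.
corpora <- list(
  twitter = c("short tweet", "another one"),
  blog    = c("a somewhat longer blog entry", "", "second entry"),
  news    = c("a news item")
)

# sapply() returns one column per corpus; t() turns that into one row per corpus.
stats_table <- t(sapply(corpora, stri_stats_general))
```

Each row of `stats_table` then carries the Lines, LinesNEmpty, Chars and CharsNWhite values for one corpus.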
Next, we count the words per item (line). We summarise the distribution of these counts for each corpus using summary statistics and a histogram. First, the Twitter corpus.
words_twitter <- stri_count_words(twitter)
summary(words_twitter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    7.00   12.00   12.75   18.00   47.00
qplot(words_twitter)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Next, we analyze the blogs corpus.
words_blog <- stri_count_words(blog)
summary(words_blog)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00    9.00   28.00   41.75   60.00 6726.00
qplot(words_blog)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Lastly we analyze the news corpus.
words_news <- stri_count_words(news)
summary(words_news)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00   19.00   32.00   34.41   46.00 1796.00
qplot(words_news)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
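The `stat_bin()` message above asks for an explicit bin width. One way to supply it is sketched below with `ggplot()` and `geom_histogram()`; the stand-in counts and the value `binwidth = 5` are assumptions chosen for illustration, not values from the report.

```r
library(ggplot2)

# Stand-in word counts; in the report this would be e.g. words_twitter.
word_counts <- c(1, 3, 7, 7, 12, 12, 13, 18, 22, 47)

# binwidth = 5 is an assumed value; it should be tuned for each corpus.
p <- ggplot(data.frame(words = word_counts), aes(x = words)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Words per line", y = "Number of lines")
```

Printing `p` draws the histogram. For the blogs and news corpora, whose maxima lie far above the third quartile, a wider bin or a restricted x range may be easier to read.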
We analyzed three data sets (Twitter, blogs and news) with a combined file size of about 556 megabytes (MB). The individual files range from roughly 159 MB (Twitter) to roughly 200 MB (blogs).
From the analysis, we found that: