This is the first project of the Capstone course in Data Science. The goal of this assignment is to get a feel for the data through exploratory data analysis and summarization. The remainder of this document presents some basic statistics about the data set and plots that give a sense of what the data contains.
The data itself is a set of recorded communications from three sources:
- Blogs
- Twitter
- News
The idea of the capstone project is, given a word (or sequence of words) from such a data source, to predict the next word. Applications of such predictive algorithms are seen every day on mobile devices such as Android phones and iPhones, where the keyboard suggests the next word while the user is typing a text message, greatly reducing the typing effort.
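To illustrate the idea (this is only a toy sketch, not the model that will be built for the project; the training sentences and the predict_next() helper are hypothetical):

# toy bigram model: count word pairs and suggest the most frequent follower
train <- c("i love data science", "i love coffee", "data science is fun")
bigrams <- unlist(lapply(strsplit(train, " "),
                         function(w) paste(head(w, -1), tail(w, -1))))
counts <- table(bigrams)
predict_next <- function(word) {
  cand <- counts[grepl(paste0("^", word, " "), names(counts))]
  if (length(cand) == 0) return(NA)
  # second word of the most frequent bigram starting with 'word'
  strsplit(names(which.max(cand)), " ")[[1]][2]
}
predict_next("data")  # should return "science" for this toy corpus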
# download data
destination_file <- "Coursera-SwiftKey.zip"
source_file <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(source_file, destination_file)
unzip(destination_file)
unzip(destination_file, list = TRUE )
## Name Length Date
## 1 final/ 0 2014-07-22 10:10:00
## 2 final/de_DE/ 0 2014-07-22 10:10:00
## 3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
## 4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
## 5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
## 6 final/ru_RU/ 0 2014-07-22 10:10:00
## 7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10 final/en_US/ 0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14 final/fi_FI/ 0 2014-07-22 10:10:00
## 15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
## 16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
list.files("final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
list.files("final/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
The corpora (input data) is in the following files: - blogs - twitter - news
# get the blogs data
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
# get the twitter data
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 167155 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 268547 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1274086 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1759032 appears to contain an embedded nul
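If these embedded nul characters are not needed for the analysis (an assumption, since they carry no readable text), the warnings can be avoided by dropping the nuls while reading:

# re-read the twitter data, silently skipping embedded nul characters
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)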
# get the news data
# open the file as a binary connection so readLines is not cut short by embedded control characters
con <- file("final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
File sizes.
# file sizes in megabytes (MB)
file.info("final/en_US/en_US.blogs.txt")$size / 1024^2
## [1] 200.4242
file.info("final/en_US/en_US.news.txt")$size / 1024^2
## [1] 196.2775
file.info("final/en_US/en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
# library for character string analysis
library(stringi)
# library for plotting
library(ggplot2)
Determine the number of lines and characters in each data source.
stri_stats_general( blogs )
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general( news )
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
stri_stats_general( twitter )
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096031 134082634
Blogs
words_blogs <- stri_count_words(blogs)
summary( words_blogs )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
qplot( words_blogs )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Twitter
words_twitter <- stri_count_words(twitter)
summary( words_twitter )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
qplot( words_twitter )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
News
words_news <- stri_count_words(news)
summary( words_news )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.41 46.00 1796.00
qplot( words_news )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
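The stat_bin messages above suggest choosing an explicit bin width instead of the default 30 bins; for example (the bin width of 5 words is only an illustrative choice):

# histogram of words per news item with an explicit bin width
qplot(words_news, binwidth = 5, xlab = "words per item")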
File sizes range from roughly 160 MB (Twitter) to about 200 MB (blogs and news).
- Blogs: approximately 0.9 million lines
- News: approximately 1 million lines
- Twitter: approximately 2.4 million lines
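These figures can also be collected into a single summary table, using the objects already created above (a sketch, shown here only to consolidate the numbers):

# combine the basic statistics for the three corpora into one data frame
data.frame(
  source  = c("blogs", "news", "twitter"),
  size_MB = round(file.info(c("final/en_US/en_US.blogs.txt",
                              "final/en_US/en_US.news.txt",
                              "final/en_US/en_US.twitter.txt"))$size / 1024^2, 1),
  lines   = c(length(blogs), length(news), length(twitter)),
  words   = c(sum(words_blogs), sum(words_news), sum(words_twitter))
)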