Executive Summary

This is the first project of the Data Science Capstone course. Its purpose is to get a feel for the data through some exploratory data analysis and summarization. The remainder of this document presents basic statistics about the data set and a few plots that give a sense of what the data contain.

The data are a set of recorded communications from three sources:

- Blogs
- Twitter
- News

The idea of the capstone project is, given a word (or sequence of words) from such a data source, to predict the next word. Applications of such a predictive algorithm are seen every day on mobile devices such as Android phones and iPhones, where text-messaging tools suggest the next word to the user and greatly reduce the amount of typing required.
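As a rough illustration of that idea (not the model that will eventually be built for the capstone), next-word prediction can be sketched as a lookup in a table of adjacent word pairs (bigrams) counted from the text. The function and example below are purely illustrative:

# minimal, illustrative bigram lookup: predict the most frequent follower of `word`
predict_next_word <- function(word, text) {
  words   <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words   <- words[words != ""]
  bigrams <- paste(head(words, -1), tail(words, -1))
  counts  <- table(bigrams[startsWith(bigrams, paste0(tolower(word), " "))])
  if (length(counts) == 0) return(NA_character_)
  # keep only the second word of the most frequent matching bigram
  sub("^\\S+ ", "", names(which.max(counts)))
}
predict_next_word("thank", c("thank you very much", "thank you for coming"))  # "you"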

Summarization of Data

# download data
destination_file <- "Coursera-SwiftKey.zip"
source_file <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(source_file, destination_file)

unzip(destination_file)
unzip(destination_file, list = TRUE )
##                             Name    Length                Date
## 1                         final/         0 2014-07-22 10:10:00
## 2                   final/de_DE/         0 2014-07-22 10:10:00
## 3  final/de_DE/de_DE.twitter.txt  75578341 2014-07-22 10:11:00
## 4    final/de_DE/de_DE.blogs.txt  85459666 2014-07-22 10:11:00
## 5     final/de_DE/de_DE.news.txt  95591959 2014-07-22 10:11:00
## 6                   final/ru_RU/         0 2014-07-22 10:10:00
## 7    final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8     final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9  final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10                  final/en_US/         0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12    final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13   final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14                  final/fi_FI/         0 2014-07-22 10:10:00
## 15    final/fi_FI/fi_FI.news.txt  94234350 2014-07-22 10:11:00
## 16   final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt  25331142 2014-07-22 10:10:00
list.files("final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
list.files("final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

The corpora (input data) are in the following files:

- blogs
- twitter
- news

# get the blogs data
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
# get the twitter data
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 167155 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 268547 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1274086 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1759032 appears to contain an embedded nul
# get the news data
# (open a binary connection so readLines reads the whole file; it contains an
#  embedded control character that can otherwise cut the read short)
con <- file("final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
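The warnings above come from a handful of tweets that contain embedded nul characters; if desired, readLines can drop those characters via its skipNul argument, for example:

# alternative: silently drop embedded nul characters while reading the tweets
twitter <- readLines("final/en_US/en_US.twitter.txt",
                     encoding = "UTF-8", skipNul = TRUE)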

Exploratory Data Analysis

File sizes.

# file sizes in megabytes (MB)
file.info("final/en_US/en_US.blogs.txt")$size   / 1024^2
## [1] 200.4242
file.info("final/en_US/en_US.news.txt")$size    / 1024^2
## [1] 196.2775
file.info("final/en_US/en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
# library for character string analysis
library(stringi)

# library for plotting
library(ggplot2)

Determine the number of lines and characters in each data source.

stri_stats_general( blogs )
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
stri_stats_general( news )
##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223154   169860866
stri_stats_general( twitter )
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096031   134082634
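For a side-by-side comparison, the same statistics can be collected into one table; a small sketch, assuming blogs, news and twitter are loaded as above:

# gather the stringi summaries into a single matrix
corpus_stats <- rbind(
  blogs   = stri_stats_general(blogs),
  news    = stri_stats_general(news),
  twitter = stri_stats_general(twitter)
)
corpus_stats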

Blogs

words_blogs   <- stri_count_words(blogs)
summary( words_blogs )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
qplot(   words_blogs )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
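As the stat_bin() message suggests, the binwidth can also be set explicitly; for example, the blog histogram could be re-drawn with a binwidth of 10 words:

# histogram of blog entry lengths with an explicit binwidth
qplot(words_blogs, binwidth = 10, xlab = "words per blog entry")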

Twitter

words_twitter <- stri_count_words(twitter)
summary( words_twitter )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00
qplot(   words_twitter )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

News

words_news    <- stri_count_words(news)
summary( words_news )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00
qplot(   words_news )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Summarized Observations

File sizes range from roughly 160 MB (Twitter) to about 200 MB (blogs and news).

- Blogs: approximately 0.9 million lines
- News: approximately 1 million lines
- Twitter: approximately 2.4 million lines