Summary

The purpose of this report is to explore the data that will be used as training data for the development of a text prediction application. The data analyzed in this report originate from the HC Corpora collection (www.corpora.heliohost.org) and include three corpora of US English text: a set of internet blog posts, a set of internet news articles, and a set of Twitter messages. The following parameters were explored: file sizes, line counts, numbers of non-empty lines, word and character counts, and numbers of non-whitespace characters. From this exploratory analysis, the twitter corpus appears to differ from the blogs and news corpora in the parameters mentioned above. A possible explanation for this difference is the 140-character limit on Twitter messages. These findings must be kept in mind throughout the workflow of developing the application and its text prediction algorithm.

Loading the Data

# the download source and destination are specified
file_destination <- "Coursera-SwiftKey.zip"
file_source <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# download is executed
download.file(file_source, file_destination)
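
On repeated runs, the download can be skipped when the archive is already present. This guard is an addition for convenience, not part of the original workflow; mode = "wb" ensures a binary-safe transfer on Windows.

# optionally, the download is skipped if the archive already exists
if (!file.exists(file_destination)) {
  download.file(file_source, file_destination, mode = "wb")
}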

# files are extracted from the downloaded zip file 
unzip(file_destination)
# unzipped files are confirmed
unzip(file_destination, list = TRUE )
##                             Name    Length                Date
## 1                         final/         0 2014-07-22 10:10:00
## 2                   final/de_DE/         0 2014-07-22 10:10:00
## 3  final/de_DE/de_DE.twitter.txt  75578341 2014-07-22 10:11:00
## 4    final/de_DE/de_DE.blogs.txt  85459666 2014-07-22 10:11:00
## 5     final/de_DE/de_DE.news.txt  95591959 2014-07-22 10:11:00
## 6                   final/ru_RU/         0 2014-07-22 10:10:00
## 7    final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8     final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9  final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10                  final/en_US/         0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12    final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13   final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14                  final/fi_FI/         0 2014-07-22 10:10:00
## 15    final/fi_FI/fi_FI.news.txt  94234350 2014-07-22 10:11:00
## 16   final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt  25331142 2014-07-22 10:10:00
# data files are inspected 
list.files("final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
list.files("final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

From the above inspection, the en_US data consist of three separate plain-text files. It is important to note, however, that the news file contains control characters that can prematurely end a text-mode read on some platforms.

Based on these inspections, the blogs and twitter data are imported in text mode, while the news data are imported in binary mode.

# the blogs and twitter datasets are imported in text mode
blogstext <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
twittertext <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 167155 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 268547 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1274086 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1759032 appears to contain an embedded nul
# the news dataset is imported in binary mode
con <- file("final/en_US/en_US.news.txt", open="rb")
newstext <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
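
The embedded-nul warnings above are harmless for this analysis, but they can also be avoided at read time: readLines() accepts a skipNul argument that silently drops nul bytes. A minimal sketch of this alternative (twittertext_alt is a name introduced here for illustration):

# alternative import that silently drops the embedded nul bytes
twittertext_alt <- readLines("final/en_US/en_US.twitter.txt",
                             encoding = "UTF-8", skipNul = TRUE)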

Basic Statistics of the Imported Files

# the file sizes are calculated in megabytes (MB)
file.info("final/en_US/en_US.blogs.txt")$size   / 1024^2
## [1] 200.4242
file.info("final/en_US/en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
file.info("final/en_US/en_US.news.txt")$size    / 1024^2
## [1] 196.2775

The libraries used in the remaining basic statistical analyses are loaded.

# library for character string analysis
library(stringi)

# library for plotting
library(ggplot2)

The line and character counts are evaluated.

stri_stats_general(blogstext)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
stri_stats_general(twittertext)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096031   134082634
stri_stats_general(newstext)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223154   169860866
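
For an easier side-by-side comparison, the three sets of statistics can be bound into a single table. This summary table is an addition to the analysis above (corpus_stats is a name introduced here).

# the per-file statistics are combined into one comparison table
corpus_stats <- rbind(
  blogs   = stri_stats_general(blogstext),
  twitter = stri_stats_general(twittertext),
  news    = stri_stats_general(newstext)
)
corpus_stats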

In the remaining code chunks, the summary statistics of the per-line word counts of each of the three files are evaluated, along with a histogram of each distribution. The files are analyzed in the following order: 1) blogs, 2) twitter, 3) news.

blogstext_words   <- stri_count_words(blogstext)
summary(blogstext_words)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
qplot(blogstext_words, main = "Blogs File Word Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

twittertext_words <- stri_count_words(twittertext)
summary(twittertext_words)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00
qplot(twittertext_words, main = "Twitter File Word Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

newstext_words    <- stri_count_words(newstext)
summary(newstext_words)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00
qplot(newstext_words, main = "News File Word Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
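
The conclusion below suggests that the blogs and news word-count distributions are approximately log-normal; plotting the counts on a log10 axis makes this easier to judge. The following sketch is an added check (shown for the blogs file), with zero-word lines excluded because log10(0) is undefined:

# blog word counts plotted on a log10 x-axis
qplot(blogstext_words[blogstext_words > 0],
      main = "Blogs File Word Count (log scale)") + scale_x_log10()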

Conclusion

This exploratory data report examines three corpora of US English text (blogs, twitter, news). The three files are comparable in size, ranging from about 159 MB (twitter) to 200 MB (blogs). Nevertheless, the blogs and news files contain similar line counts (roughly 0.9 and 1.0 million lines, respectively), while the twitter file contains over twice as many lines (about 2.4 million). This larger line count may be due to the 140-character limit on Twitter messages. The difference is not observed in total character counts, which range from about 162 to 207 million characters across the three files. Finally, the word-count distribution of the twitter file differs from those of the blogs and news files, with the latter two appearing approximately log-normal.