The purpose of this report is to explore the data to be used as training data for the development of a text prediction application. The data analyzed in this report originate from the HC Corpora collection (www.corpora.heliohost.org). These data include three corpora of US English text: a set of internet blog posts, a set of internet news articles, and a set of messages from Twitter. The following parameters were explored: file size, line count, number of non-empty lines, word and character counts, and number of non-white-space characters. From this exploratory analysis, the twitter corpus appears to differ in the parameters mentioned above when compared to the blogs and news corpora. A possible explanation for this difference is the character limit (i.e., 140 characters) imposed on Twitter messages. These findings must be kept in mind throughout the workflow of developing the application and text prediction algorithm.
# the download source and destination file are specified
file_destination <- "Coursera-SwiftKey.zip"
file_source <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# the download is executed
download.file(file_source, file_destination)
# files are extracted from the downloaded zip file
unzip(file_destination)
# unzipped files are confirmed
unzip(file_destination, list = TRUE )
## Name Length Date
## 1 final/ 0 2014-07-22 10:10:00
## 2 final/de_DE/ 0 2014-07-22 10:10:00
## 3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
## 4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
## 5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
## 6 final/ru_RU/ 0 2014-07-22 10:10:00
## 7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10 final/en_US/ 0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14 final/fi_FI/ 0 2014-07-22 10:10:00
## 15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
## 16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
# data files are inspected
list.files("final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
list.files("final/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
From the above inspection, the en_US data consist of three separate plain-text files. It is important to note that the news file appears to contain control characters that can prematurely terminate text-mode reading on some platforms.
Based on the above inspections, the blogs and twitter data are imported in text mode, while the news data is imported in binary mode.
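One quick way to check this behavior (a sketch, not part of the original analysis) is to compare the number of lines obtained in text mode versus binary mode; on platforms where an embedded control byte such as SUB (0x1A) ends text-mode reading early, the two counts will disagree.
# line counts in text mode vs. binary mode are compared; on some platforms
# (notably Windows) an embedded SUB byte can end text-mode reading early
con_text <- file("final/en_US/en_US.news.txt", open = "r")
n_text <- length(readLines(con_text, encoding = "UTF-8", skipNul = TRUE))
close(con_text)
con_bin <- file("final/en_US/en_US.news.txt", open = "rb")
n_bin <- length(readLines(con_bin, encoding = "UTF-8", skipNul = TRUE))
close(con_bin)
c(text_mode = n_text, binary_mode = n_bin)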
# the blogs and twitter datasets are imported in text mode
blogstext <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
twittertext <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 167155 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 268547 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1274086 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1759032 appears to contain an embedded nul
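The warnings above indicate embedded nul characters in a handful of tweets. If desired, these can be dropped at read time via the skipNul argument of readLines(), which also suppresses the warnings (shown here as an optional alternative, not a step taken in this report):
# optional: re-import the twitter data while skipping embedded nuls
twittertext <- readLines("final/en_US/en_US.twitter.txt",
                         encoding = "UTF-8", skipNul = TRUE)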
# the news dataset is imported in binary mode
con <- file("final/en_US/en_US.news.txt", open="rb")
newstext <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
# file sizes are calculated in megabytes (MB)
file.info("final/en_US/en_US.blogs.txt")$size / 1024^2
## [1] 200.4242
file.info("final/en_US/en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
file.info("final/en_US/en_US.news.txt")$size / 1024^2
## [1] 196.2775
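For side-by-side comparison, the same calculation can be wrapped in a single call (a small convenience sketch; the file paths are those already used above):
# the three file sizes are assembled into one named vector, in MB
files <- c(blogs   = "final/en_US/en_US.blogs.txt",
           twitter = "final/en_US/en_US.twitter.txt",
           news    = "final/en_US/en_US.news.txt")
round(sapply(files, function(f) file.info(f)$size / 1024^2), 1)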
The libraries used in the following basic statistical analyses are loaded.
# library for character string analysis
library(stringi)
# library for plotting
library(ggplot2)
The line and character counts are evaluated.
stri_stats_general(blogstext)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(twittertext)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096031 134082634
stri_stats_general(newstext)
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
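These per-file statistics can also be stacked into a single table for easier comparison (a sketch using the objects defined above):
# the three sets of general statistics are combined row-wise into a matrix
rbind(blogs   = stri_stats_general(blogstext),
      twitter = stri_stats_general(twittertext),
      news    = stri_stats_general(newstext))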
In the remaining code chunks, summary statistics for each of the three files are evaluated, along with a histogram of the word counts. The files are analyzed in the following order: 1) blogs, 2) twitter, 3) news.
blogstext_words <- stri_count_words(blogstext)
summary(blogstext_words)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
qplot(blogstext_words, main = "Blogs File Word Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
twittertext_words <- stri_count_words(twittertext)
summary(twittertext_words)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
qplot(twittertext_words, main = "Twitter File Word Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
newstext_words <- stri_count_words(newstext)
summary(newstext_words)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.41 46.00 1796.00
qplot(newstext_words, main = "News File Word Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
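Because the conclusion below suggests the blogs and news word counts may be approximately log-normal, plotting a histogram on a log10 x-axis is one quick way to judge this (a sketch; zero-word entries are dropped since they cannot be shown on a log scale, and the default binning is an arbitrary choice):
# word counts are plotted on a log10 x-axis; zero counts are removed first
qplot(blogstext_words[blogstext_words > 0], log = "x",
      main = "Blogs File Word Count (log scale)")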
This exploratory data report examines three corpora of US English text (blogs, twitter, news). All three files are approximately 200 MB in size. Nevertheless, the blogs and news files contain similar line counts (~1 million each), while the twitter count is more than twice as large (~2.4 million). This larger line count may be due to the 140-character limit on twitter items. The difference is not observed in the character counts, as all three files contain roughly 200 million characters each. Finally, the word-count distribution of the twitter corpus differs from those of the blogs and news corpora, with the latter two appearing approximately log-normal.