We look at the three data sets of US English text from a corpus called HC Corpora. The three data sets are: a set of internet blog posts, a set of internet news articles, and a set of Twitter messages.
The following information is collected from the data sets. In the next section we describe the data collection process; the section after that presents the results of the data exploration; finally, we draw conclusions and give references.
For our analysis we use the R computing environment, together with the stringi and ggplot2 libraries. To make the code more readable we use the pipe operator from the magrittr library. This report is compiled with the rmarkdown library, and it was written in the RStudio IDE.
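For readers unfamiliar with the pipe, here is a minimal sketch (the toy string is ours, not part of the data): x %>% f(y) is equivalent to f(x, y).
# count the words of a toy string using the magrittr pipe
library(magrittr)
library(stringi)
"to be or not to be" %>% stri_count_words()  # same as stri_count_words("to be or not to be"); returns 6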
Data

The data is presented as a ZIP-compressed archive, which is freely downloadable from www.corpora.heliohost.org. We have already downloaded the file, given its large size.
# specify the destination of the (already downloaded) archive
destination_file <- "Coursera-SwiftKey.zip"
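# a sketch, not part of the original code: fetch the archive only if it
# is not already present; the exact URL below is an assumption, adjust it
# to the download page mentioned above
source_url <- "https://www.corpora.heliohost.org/Coursera-SwiftKey.zip"  # hypothetical URL
if (!file.exists(destination_file)) {
  download.file(source_url, destfile = destination_file, mode = "wb")
}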
# extract the files from the zip file
unzip(destination_file)
We inspect the unzipped files.
# find out which files were unzipped
unzip(destination_file, list = TRUE)
##                               Name    Length                Date
## 1                          final/         0 2014-07-22 10:10:00
## 2                    final/de_DE/         0 2014-07-22 10:10:00
## 3   final/de_DE/de_DE.twitter.txt  75578341 2014-07-22 10:11:00
## 4     final/de_DE/de_DE.blogs.txt  85459666 2014-07-22 10:11:00
## 5      final/de_DE/de_DE.news.txt  95591959 2014-07-22 10:11:00
## 6                    final/ru_RU/         0 2014-07-22 10:10:00
## 7     final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8      final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9   final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10                   final/en_US/         0 2014-07-22 10:10:00
## 11  final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12     final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13    final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14                   final/fi_FI/         0 2014-07-22 10:10:00
## 15     final/fi_FI/fi_FI.news.txt  94234350 2014-07-22 10:11:00
## 16    final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17  final/fi_FI/fi_FI.twitter.txt  25331142 2014-07-22 10:10:00
# inspect the data
list.files("final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
list.files("final/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
The corpora are contained in three separate plain-text files. The news file contains an embedded control character that stops text-mode reading on some platforms, so we read it through a binary connection; for more information on this, see the references. We import these files as follows.
# import the blogs and twitter datasets in text mode
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
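# read the news dataset through a binary connection: in text mode the
# read stops early at an embedded control character on some platforms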
con <- file("final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
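As an aside, readLines can also skip embedded NUL characters directly via its skipNul argument; this is a sketch of an alternative, not the approach used above, and the binary connection remains the more robust route across platforms.
# alternative: read the news file in text mode, skipping embedded NULs
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)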
Before we analyse the files, we look at their sizes (presented in megabytes, MB).
# file size (in MegaBytes/MB)
file.info("final/en_US/en_US.blogs.txt")$size / 1024^2
## [1] 200.4242
file.info("final/en_US/en_US.news.txt")$size / 1024^2
## [1] 196.2775
file.info("final/en_US/en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
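The same figures can be computed in one call; the following is a compact equivalent, shown as a sketch.
# file sizes in MB for all three corpora at once
files <- file.path("final/en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
sapply(files, function(f) file.info(f)$size / 1024^2)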
For our analysis we need two libraries.
# library for character string analysis
library(stringi)
# library for plotting
library(ggplot2)
We first compute general statistics on the lines and characters of each corpus.
stri_stats_general(blogs)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(news)
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
stri_stats_general(twitter)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096031 134082634
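For easier comparison, the three sets of statistics can also be combined into a single table; this is a sketch using the objects defined above.
# one row of general statistics per corpus
rbind(blogs   = stri_stats_general(blogs),
      news    = stri_stats_general(news),
      twitter = stri_stats_general(twitter))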
Next we count the words per item (line). We summarise the distribution of these counts per corpus, using summary statistics and a distribution plot. We start with the blogs corpus.
words_blogs <- stri_count_words(blogs)
summary(words_blogs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
qplot(words_blogs)
Next we analyse the news corpus.
words_news <- stri_count_words(news)
summary(words_news)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.41 46.00 1796.00
qplot(words_news)
Finally we analyse the Twitter corpus. Note that the 140-character limit on tweets caps the number of words per message, which explains the much smaller maximum below.
words_twitter <- stri_count_words(twitter)
summary(words_twitter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
qplot(words_twitter)
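To compare the three distributions side by side, the word counts can be gathered into a single data frame and plotted with facets. This is a sketch; the word_counts object is introduced here for illustration only.
# combine all word counts, labelled by corpus
word_counts <- data.frame(
  corpus = rep(c("blogs", "news", "twitter"),
               times = c(length(words_blogs), length(words_news), length(words_twitter))),
  words = c(words_blogs, words_news, words_twitter)
)
# one histogram per corpus; free scales because the corpora differ in size
ggplot(word_counts, aes(x = words)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ corpus, scales = "free")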