This document provides a brief exploratory analysis of the US English dataset provided within the Coursera Data Science Capstone Project.
The dataset contains three text files: “blogs” (blog posts), “news” (articles from a news website), and “twitter” (tweet messages).
A first look at the three files shows that the “twitter” file is quite different from the other two, probably because of the maximum-character restriction applied to tweets.
library(knitr)
library(ggplot2)
library(stringi)
# Download and unpack the dataset; mode = "wb" avoids corrupting the zip file on Windows
download.file("http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
              "capdata", mode = "wb")
unzip("capdata")
list.files("final/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
# Read the news and twitter files through binary connections, so that reading
# does not stop at embedded control characters; skipNul drops embedded nuls.
tmp <- file("final/en_US/en_US.news.txt", open="rb")
news <- readLines(tmp, encoding="UTF-8")
close(tmp)
tmp <- file("final/en_US/en_US.twitter.txt", open="rb")
twitter <- readLines(tmp, encoding="UTF-8", skipNul = TRUE)
close(tmp)
rm(tmp)
Each connection opened for the binary reads is closed right after use, and the temporary variable is then removed to free up memory.
# File sizes on disk, in megabytes
file.info("final/en_US/en_US.blogs.txt")$size/1024^2
## [1] 200.4242
file.info("final/en_US/en_US.news.txt")$size/1024^2
## [1] 196.2775
file.info("final/en_US/en_US.twitter.txt")$size/1024^2
## [1] 159.3641
# General line and character statistics for each file
stri_stats_general(blogs)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(news)
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
stri_stats_general(twitter)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096241 134082806
Please note: the LinesNEmpty column displays the number of lines containing at least one non-whitespace character, while CharsNWhite displays how many non-whitespace characters the file contains.
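As a quick illustration of how these columns behave, here is a toy example on a small made-up character vector (purely illustrative, not part of the dataset):
stri_stats_general(c("hello world", "", "   "))
# should give: Lines = 3, LinesNEmpty = 1 (only the first element contains
# non-whitespace characters), Chars = 14, CharsNWhite = 10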
# Number of words per line in each file
blogs_words_per_line <- stri_count_words(blogs)
news_words_per_line <- stri_count_words(news)
twitter_words_per_line <- stri_count_words(twitter)
summary(blogs_words_per_line)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
summary(news_words_per_line)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.41 46.00 1796.00
summary(twitter_words_per_line)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
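To make the comparison across the three files easier to read, the summaries can also be collected into a single table. This is just a convenience sketch: the choice of columns is mine and is not part of the original output.
data.frame(file = c("blogs", "news", "twitter"),
           lines = c(length(blogs), length(news), length(twitter)),
           mean_words = c(mean(blogs_words_per_line),
                          mean(news_words_per_line),
                          mean(twitter_words_per_line)),
           max_words = c(max(blogs_words_per_line),
                         max(news_words_per_line),
                         max(twitter_words_per_line)))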
qplot(blogs_words_per_line, binwidth=5, fill=I("red"))
qplot(news_words_per_line, binwidth=5, fill=I("dark green"))
qplot(twitter_words_per_line, binwidth=1, colour=I("black"), fill=I("light blue"))
# Number of characters per line in each file
blogs_chars_per_line <- stri_count_boundaries(blogs, type="character")
news_chars_per_line <- stri_count_boundaries(news, type="character")
twitter_chars_per_line <- stri_count_boundaries(twitter, type="character")
qplot(blogs_chars_per_line, binwidth=5, fill=I("red"))
qplot(news_chars_per_line, binwidth=5, fill=I("dark green"))
qplot(twitter_chars_per_line, binwidth=1, colour=I("black"), fill=I("light blue"))
From this first exploratory analysis, and particularly from the plots above, it’s interesting to notice that in the “blogs” file the line lengths (measured both in words and in characters) are more concentrated towards low values than in the “news” file.
We could therefore deduce that people who write for news outlets tend to use longer sentences than people who post on blogs.
Another interesting finding concerns the “twitter” file. Looking at the histogram of line lengths measured in characters (the last plot), we see a clear concentration of messages both at the shortest lengths and at the maximum allowed length (the last bar on the right of the plot).
We could conclude that people tend either to tweet short messages or to use up the maximum allowed length, and are not much inclined towards intermediate lengths.
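As a rough numeric check of this observation, we can compute the share of tweets close to the maximum length and the share of very short ones. The 135- and 40-character cut-offs below are arbitrary choices of mine (assuming the 140-character limit in force when the data were collected) and are not part of the original analysis.
mean(twitter_chars_per_line >= 135)  # share of tweets at or near the maximum length
mean(twitter_chars_per_line <= 40)   # share of very short tweets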
While I’m writing this report (at the end of the first Capstone week), I haven’t yet defined a detailed plan for my prediction algorithm.
I will probably use the “stringi” package for the text manipulation and cleaning tasks.
Then I will use the “tm” package to create the corpora and build the Term-Document Matrices.
I will then probably use the “RWeka” package to define the n-gram tokenizers and, with the help of the “dplyr” package (and maybe other packages), build the n-gram frequency tables (see the sketch at the end of this section).
Finally, I will decide when to apply the profanity filter; I’ll probably apply a “censoring” filter ex post to the outcome of the prediction function, in order to keep as much data as possible for training the prediction algorithm, but I haven’t yet taken a definitive decision about that.
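Just to give an idea of the pipeline described above, here is a minimal sketch built on my own assumptions (a small random sample of the twitter data, bigrams only, and an arbitrary set of cleaning steps); it is a rough outline rather than a definitive implementation.
library(tm)
library(RWeka)
set.seed(1234)
# work on a small random sample of the twitter lines, just to keep the sketch fast
sample_text <- sample(twitter, 1000)
# build and clean a corpus with tm
corpus <- VCorpus(VectorSource(sample_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
# bigram tokenizer defined with RWeka
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
# frequency table of the most frequent bigrams
bigram_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(bigram_freq, 10)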