This document provides a brief exploratory analysis of the US English dataset provided within the Coursera Data Science Capstone Project.

The dataset contains 3 txt files: “blogs” (posts from blogs), “news” (articles from news websites), and “twitter” (tweet messages).

A first look at the 3 files shows that the “twitter” file is quite different from the other two, probably due to the maximum-character restriction applied to tweet messages.

Basic summaries for the three files

Calling the required libraries, downloading the files (packed into a zip file), and checking the US English dataset:

library(knitr)
library(ggplot2)
library(stringi)
download.file("http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
        "capdata")
unzip("capdata")
list.files("final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Reading txt files into R (please note: binary mode for “news” and “twitter” files):

blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
tmp <- file("final/en_US/en_US.news.txt", open="rb")
news <- readLines(tmp, encoding="UTF-8")
close(tmp)
tmp <- file("final/en_US/en_US.twitter.txt", open="rb")
twitter <- readLines(tmp, encoding="UTF-8", skipNul = TRUE)
close(tmp)
rm(tmp)

The close() calls in the above R code close the connections opened in binary mode for the “news” and “twitter” files, and rm() removes the temporary connection object to clean up memory.

Displaying file sizes in megabytes:

file.info("final/en_US/en_US.blogs.txt")$size/1024^2
## [1] 200.4242
file.info("final/en_US/en_US.news.txt")$size/1024^2
## [1] 196.2775
file.info("final/en_US/en_US.twitter.txt")$size/1024^2
## [1] 159.3641

Displaying the number of lines and characters contained in each file:

stri_stats_general(blogs)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
stri_stats_general(news)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223154   169860866
stri_stats_general(twitter)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096241   134082806

Please note: the LinesNEmpty column displays the number of lines containing at least one non-whitespace character, while CharsNWhite displays how many characters other than whitespace are contained in the file.
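
As a small illustration of these counters (using a toy character vector, not part of the dataset):

# Toy example (not from the dataset): lines made only of whitespace do not
# count towards LinesNEmpty, and spaces do not count towards CharsNWhite.
# Expected result: Lines = 3, LinesNEmpty = 1, Chars = 14, CharsNWhite = 10
stri_stats_general(c("hello world", "", "   "))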

Displaying a summary of the word counts for each line in each file:

blogs_words_per_line <- stri_count_words(blogs)
news_words_per_line <- stri_count_words(news)
twitter_words_per_line <- stri_count_words(twitter)
summary(blogs_words_per_line)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
summary(news_words_per_line)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00
summary(twitter_words_per_line)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00
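
As an optional convenience (not part of the original output), the per-file word-count statistics could be combined into a single comparison table, for example with kable from the already-loaded knitr package; the table below is a sketch and its column choices are arbitrary:

# Optional sketch: one comparison table for the three files
files_summary <- data.frame(
  file         = c("blogs", "news", "twitter"),
  lines        = c(length(blogs), length(news), length(twitter)),
  mean_words   = c(mean(blogs_words_per_line), mean(news_words_per_line), mean(twitter_words_per_line)),
  median_words = c(median(blogs_words_per_line), median(news_words_per_line), median(twitter_words_per_line)),
  max_words    = c(max(blogs_words_per_line), max(news_words_per_line), max(twitter_words_per_line))
)
kable(files_summary)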

Basic Plots

Histograms of the distribution of words per line in each file:

qplot(blogs_words_per_line, binwidth=5, fill=I("red"))

qplot(news_words_per_line, binwidth=5, fill=I("dark green"))

qplot(twitter_words_per_line, binwidth=1, colour=I("black"), fill=I("light blue"))

Histograms of the distribution of characters per line in each file:

blogs_chars_per_line <- stri_count_boundaries(blogs, type="character")
news_chars_per_line <- stri_count_boundaries(news, type="character")
twitter_chars_per_line <- stri_count_boundaries(twitter, type="character")
qplot(blogs_chars_per_line, binwidth=5, fill=I("red"))

qplot(news_chars_per_line, binwidth=5, fill=I("dark green"))

qplot(twitter_chars_per_line, binwidth=1, colour=I("black"), fill=I("light blue"))

Findings

From this first exploratory analysis, and particularly from the plots above, it is interesting to note that in the “blogs” file the line lengths (measured both in words and in characters) are more concentrated at low values than in the “news” file.

So we could deduce that people who write for the news world tend to use longer sentences than people who post on blogs.

Another interesting finding concerns the “twitter” file. When we examine the histogram of line lengths measured in characters (the last plot), we see a clear concentration of messages both at the shortest lengths and at the maximum allowed length (the last bar on the right of the plot).

We could conclude that people tend to tweet either short messages or messages at the maximum allowed length, and are not much inclined toward intermediate lengths.

Plans for the prediction algorithm

As I write this report (at the end of the first Capstone week), I have not yet defined a detailed plan for my prediction algorithm.

I will probably apply the “stringi” package for the text manipulation and cleaning tasks.

Then I will use the “tm” package to create the corpora and build the Term-Document Matrices.

I will then probably apply the “RWeka” package to define the n-gram tokenizers, and with the help of the “dplyr” package (and maybe other packages) I will build the n-gram frequency tables, roughly as sketched below.
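
The following is a minimal sketch of that pipeline, not the final implementation; the sample size, the choice of the “blogs” file, and the cleaning steps are placeholder choices for illustration only:

library(tm)
library(RWeka)

# Sketch: build a small corpus from a random sample of the blogs lines
set.seed(1234)
sample_text <- sample(blogs, 1000)
corpus <- VCorpus(VectorSource(sample_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Bigram tokenizer defined through RWeka
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))

# Bigram frequency table (slam::row_sums keeps the term-document matrix sparse)
bigram_freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(bigram_freq, 10)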

Finally, I will decide when to apply the profanity filter; I will probably apply a “censoring” filter ex-post on the output of the prediction function, in order to keep as much data as possible for training the prediction algorithm, but I have not yet made a definitive decision about that.
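
A possible shape for such an ex-post filter is sketched below; it is purely hypothetical, and the profanity word list itself still has to be chosen:

# Hypothetical ex-post censoring: drop predicted words that appear in a
# profanity list (profanity_words is a placeholder character vector)
censor_predictions <- function(predictions, profanity_words) {
  predictions[!tolower(predictions) %in% tolower(profanity_words)]
}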