The goal of this project is to perform an exploratory analysis of the data that will serve as training data for a text prediction algorithm and the application built around it. This document summarizes the findings and the plans for creating the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager. The data comprise three corpora of US English text: a set of internet blog posts, a set of internet news articles, and a set of Twitter messages. The following parameters were explored: file sizes, line counts, numbers of non-empty lines, word and character counts, and numbers of non-whitespace characters. The twitter corpus differs from the blogs and news corpora in the parameters mentioned above; a possible explanation for this difference is the character limit (140 characters) imposed on Twitter messages. These findings must be kept in mind throughout the workflow of developing the application and text prediction algorithm. The source of the data is: “http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip”
The blogs and twitter data are imported in text mode; the news data are imported in binary mode.
# The blogs and twitter datasets are imported in text mode
blogsData <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
twitterData <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# The news dataset is imported in binary mode
connection <- file("final/en_US/en_US.news.txt", open = "rb")
newsData <- readLines(connection, encoding = "UTF-8", skipNul = TRUE)
close(connection)
rm(connection)
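Because the news file contains characters that can prematurely end a text-mode read on some platforms, a binary connection is a safe default for all three files. Below is a minimal sketch of such a helper; the readCorpus name is illustrative and not part of the original code.
# Hypothetical helper: read a corpus file through a binary connection
# so embedded control characters do not truncate the input
readCorpus <- function(path) {
  connection <- file(path, open = "rb")
  on.exit(close(connection))
  readLines(connection, encoding = "UTF-8", skipNul = TRUE)
}
# Example usage with the same path as above:
# newsData <- readCorpus("final/en_US/en_US.news.txt")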
# The file sizes are calculated in megabytes (MB)
file.info("final/en_US/en_US.blogs.txt")$size / 1024^2
## [1] 200.4242
file.info("final/en_US/en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
file.info("final/en_US/en_US.news.txt")$size / 1024^2
## [1] 196.2775
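The three sizes can also be collected in a single named vector for easier comparison; the following sketch assumes the same file paths used above.
# Compute all three file sizes (in MB) in one step
files <- c(blogs   = "final/en_US/en_US.blogs.txt",
           twitter = "final/en_US/en_US.twitter.txt",
           news    = "final/en_US/en_US.news.txt")
sapply(files, function(f) file.info(f)$size / 1024^2)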
The libraries to be used in the subsequent basic statistical analyses are loaded.
# library for character string analysis
library(stringi)
# library for plotting
library(ggplot2, warn.conflicts = FALSE)
The line and character counts are evaluated.
stri_stats_general(blogsData)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(twitterData)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096241 134082806
stri_stats_general(newsData)
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
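For a side-by-side comparison, the three sets of statistics can be stacked into one matrix; this sketch assumes the objects created above.
# Combine the per-corpus statistics into a single table (rows = corpora)
rbind(blogs   = stri_stats_general(blogsData),
      twitter = stri_stats_general(twitterData),
      news    = stri_stats_general(newsData))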
In the remaining code chunks, the summary statistics of each of the three files are evaluated, along with a histogram of the per-line word counts. The files are analyzed in the following order: 1) blogs, 2) twitter, 3) news.
blogsDataWords <- stri_count_words(blogsData)
summary(blogsDataWords)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
qplot(blogsDataWords, main = "Blogs File Word Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
twitterDataWords <- stri_count_words(twitterData)
summary(twitterDataWords)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
qplot(twitterDataWords, main = "Twitter File Word Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
newsDataWords <- stri_count_words(newsData)
summary(newsDataWords)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.41 46.00 1796.00
qplot(newsDataWords, main = "News File Word Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
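The stat_bin() messages above indicate that ggplot2 fell back to its default of 30 bins; since these are integer word counts, an explicit integer binwidth gives a cleaner picture. A sketch with illustrative binwidth values:
# Replot with explicit binwidths instead of the default 30 bins
qplot(twitterDataWords, binwidth = 1, main = "Twitter File Word Count")
qplot(blogsDataWords, binwidth = 5, main = "Blogs File Word Count")
qplot(newsDataWords, binwidth = 2, main = "News File Word Count")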
Three corpora of US English text (blogs, twitter, news) were analysed here. Each file is roughly 160-200 MB in size. While the twitter file has the largest line count (about 2.4 million lines), the blogs and news files contain similar line counts (about 1 million each). This difference is not observed in the character counts, as all three files contain on the order of 160-210 million characters each. Finally, the distribution of words per line in the twitter file differs from those of the blogs and news files, with the latter two appearing to be approximately log-normal.
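The apparent log-normality of the blogs and news distributions can be checked quickly by histogramming the word counts on a logarithmic scale; a roughly symmetric bell shape is consistent with a log-normal distribution. This sketch assumes the word-count vectors computed above and drops zero-word lines before taking logarithms.
# Log-transformed word counts per line; a symmetric, bell-shaped
# histogram here suggests an approximately log-normal distribution
qplot(log(blogsDataWords[blogsDataWords > 0]), main = "Blogs: log(words per line)")
qplot(log(newsDataWords[newsDataWords > 0]), main = "News: log(words per line)")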