The goal of this project is to perform an exploratory analysis of the data that will serve as training data for a text prediction algorithm and the application built around it. This document summarizes the findings and the plans for creating the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager. The data comprise three corpora of US English text: a set of internet blog posts, a set of internet news articles, and a set of Twitter messages. The following parameters were explored: file sizes, line counts, numbers of non-empty lines, word and character counts, and numbers of non-whitespace characters. The twitter corpus differs from the blogs and news corpora in the parameters mentioned above; a possible explanation for this difference is the character limit (140 characters) imposed on Twitter messages. These findings must be kept in mind throughout the workflow of developing the application and text prediction algorithm. The source of the data is: “http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip”
The blogs and twitter data are imported in text mode; the news data are imported in binary mode.
# The blogs and twitter datasets are imported in text mode
blogsData <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
twitterData <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# The news dataset is imported in binary mode
connection <- file("final/en_US/en_US.news.txt", open = "rb")
newsData <- readLines(connection, encoding = "UTF-8", skipNul = TRUE)
close(connection)
rm(connection)
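Because the news file contains characters that can prematurely end a text-mode read on some platforms, a binary connection is a safe default for all three files. Below is a minimal sketch of such a helper; the readCorpus name is illustrative and not part of the original code.
# Hypothetical helper: read a corpus file through a binary connection
# so embedded control characters do not truncate the input
readCorpus <- function(path) {
  connection <- file(path, open = "rb")
  on.exit(close(connection))
  readLines(connection, encoding = "UTF-8", skipNul = TRUE)
}
# Example usage with the same path as above:
# newsData <- readCorpus("final/en_US/en_US.news.txt")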
# The file sizes are calculated in megabytes (MB)
file.info("final/en_US/en_US.blogs.txt")$size / 1024^2
## [1] 200.4242
file.info("final/en_US/en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
file.info("final/en_US/en_US.news.txt")$size / 1024^2
## [1] 196.2775
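The three sizes can also be collected in a single named vector for easier comparison; the following sketch assumes the same file paths used above.
# Compute all three file sizes (in MB) in one step
files <- c(blogs   = "final/en_US/en_US.blogs.txt",
           twitter = "final/en_US/en_US.twitter.txt",
           news    = "final/en_US/en_US.news.txt")
sapply(files, function(f) file.info(f)$size / 1024^2)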
The libraries to be used in the subsequent basic statistical analyses are loaded.
# library for character string analysis
library(stringi)
# library for plotting
library(ggplot2, warn.conflicts = FALSE)
The line and character counts are evaluated.
stri_stats_general(blogsData)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(twitterData)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096241 134082806
stri_stats_general(newsData)
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
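For a side-by-side comparison, the three sets of statistics can be stacked into one matrix; this sketch assumes the objects created above.
# Combine the per-corpus statistics into a single table (rows = corpora)
rbind(blogs   = stri_stats_general(blogsData),
      twitter = stri_stats_general(twitterData),
      news    = stri_stats_general(newsData))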
In the remaining code chunks, the summary statistics of each of the three files are evaluated, along with a histogram of the per-line word counts. The files are analyzed in the following order: 1) blogs, 2) twitter, 3) news.
blogsDataWords <- stri_count_words(blogsData)
summary(blogsDataWords)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
qplot(blogsDataWords, main = "Blogs File Word Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
twitterDataWords <- stri_count_words(twitterData)
summary(twitterDataWords)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
qplot(twitterDataWords, main = "Twitter File Word Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
newsDataWords <- stri_count_words(newsData)
summary(newsDataWords)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.41 46.00 1796.00
qplot(newsDataWords, main = "News File Word Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
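The stat_bin() messages above indicate that ggplot2 fell back to its default of 30 bins; since these are integer word counts, an explicit integer binwidth gives a cleaner picture. A sketch with illustrative binwidth values:
# Replot with explicit binwidths instead of the default 30 bins
qplot(twitterDataWords, binwidth = 1, main = "Twitter File Word Count")
qplot(blogsDataWords, binwidth = 5, main = "Blogs File Word Count")
qplot(newsDataWords, binwidth = 2, main = "News File Word Count")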
Three corpora of US English text (blogs, twitter, news) were analysed here. Each file is roughly 160-200 MB in size. While the twitter file has the largest line count (about 2.4 million lines), the blogs and news files contain similar line counts (about 1 million each). This difference is not observed in the character counts, as all three files contain on the order of 160-210 million characters each. Finally, the distribution of words per line in the twitter file differs from those of the blogs and news files, with the latter two appearing to be approximately log-normal.
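The apparent log-normality of the blogs and news distributions can be checked quickly by histogramming the word counts on a logarithmic scale; a roughly symmetric bell shape is consistent with a log-normal distribution. This sketch assumes the word-count vectors computed above and drops zero-word lines before taking logarithms.
# Log-transformed word counts per line; a symmetric, bell-shaped
# histogram here suggests an approximately log-normal distribution
qplot(log(blogsDataWords[blogsDataWords > 0]), main = "Blogs: log(words per line)")
qplot(log(newsDataWords[newsDataWords > 0]), main = "News: log(words per line)")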