Milestone Report

Md.Rajib Hossain

March 15, 2016

Abstract

We analyze three corpora of US English text provided by Coursera. We find that the blogs and news corpora are similar, while the twitter corpus differs in its summary statistics. We propose that this is a result of the 140-character limit on Twitter messages.

Introduction

In this report we look at three corpora of US English text: a set of internet blog posts, a set of internet news articles, and a set of Twitter messages.

We collect the following forms of information:

Total file size.
Total number of lines.
Total number of non-empty lines.
Total number of words.
Distribution of word counts per line.
Total number of characters.
Total number of non-whitespace characters.

In the following section we describe the data collection process. After that we present the results of the data exploration. Finally, we present our conclusions.
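The quantities listed above can be approximated in base R without extra packages. The sketch below is only an illustration: the helper name basic_stats and the toy input are assumptions, not part of the original analysis, and splitting on whitespace will not count words exactly as stringi does later in this report.

```r
# Hypothetical helper (not from the original report): basic text statistics
# for a character vector with one element per line.
basic_stats <- function(lines) {
  trimmed <- trimws(lines)
  non_empty <- trimmed[nzchar(trimmed)]
  # Approximate word count: split each non-empty line on runs of whitespace
  words_per_line <- lengths(strsplit(non_empty, "[[:space:]]+"))
  c(
    Lines       = length(lines),
    LinesNEmpty = length(non_empty),
    Words       = sum(words_per_line),
    Chars       = sum(nchar(lines)),
    CharsNWhite = sum(nchar(gsub("[[:space:]]", "", lines)))
  )
}

# Toy input standing in for a corpus
toy <- c("Hello world", "", "  three short words ")
toy_stats <- basic_stats(toy)
toy_stats
```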

Data file path insertion.

# inspect the data
list.files("H:/Coursera-SwiftKey/final")

## [1] "CaptionsProject.Rmd" "de_DE"               "en_US"              
## [4] "fi_FI"               "Quiz1.R"             "rcode"              
## [7] "rcode.zip"           "ru_RU"

list.files("H:/Coursera-SwiftKey/final/en_US")

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

The corpora are contained in three separate plain-text files, one of which (the news file) has to be read in binary mode; for more information on this see [@newtest]. We import these files as follows.

# import the blogs and twitter datasets in text mode
blogs <- readLines("H:/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter <- readLines("H:/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding="UTF-8")

## Warning in readLines("H:/Coursera-SwiftKey/final/en_US/
## en_US.twitter.txt", : line 167155 appears to contain an embedded nul

## Warning in readLines("H:/Coursera-SwiftKey/final/en_US/
## en_US.twitter.txt", : line 268547 appears to contain an embedded nul

## Warning in readLines("H:/Coursera-SwiftKey/final/en_US/
## en_US.twitter.txt", : line 1274086 appears to contain an embedded nul

## Warning in readLines("H:/Coursera-SwiftKey/final/en_US/
## en_US.twitter.txt", : line 1759032 appears to contain an embedded nul

# import the news dataset in binary mode
con <- file("H:/Coursera-SwiftKey/final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
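The embedded nul warnings shown above can be avoided by asking readLines to skip nul bytes through its skipNul argument. A minimal sketch, assuming a hypothetical helper read_corpus (not part of the original report) and a small temporary file in place of the real corpora:

```r
# Hypothetical helper: read a text file in binary mode and drop embedded
# nul bytes instead of truncating the line with a warning.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Build a tiny demo file whose second line contains a nul byte
tmp <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("good line\n"), charToRaw("bad"), as.raw(0), charToRaw("line\n")),
  tmp
)

demo_lines <- read_corpus(tmp)
demo_lines
```

With skipNul = TRUE the nul byte is dropped and the rest of the line is kept, so no text after the nul is lost.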

Basic Statistics

Before we analyse the files, we look at their sizes (presented in megabytes, MB).

# file size (in MegaBytes/MB)
file.info("H:/Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size   / 1024^2

## [1] 200.4242

file.info("H:/Coursera-SwiftKey/final/en_US/en_US.news.txt")$size    / 1024^2

## [1] 196.2775

file.info("H:/Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size / 1024^2

## [1] 159.3641
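The three calls above can also be collapsed into one, since file.info accepts a vector of paths. A minimal sketch, assuming a hypothetical helper corpus_sizes_mb and a temporary directory standing in for the en_US folder:

```r
# Hypothetical helper: sizes (in MB, i.e. bytes / 1024^2) of all .txt files
# in a directory, named by file.
corpus_sizes_mb <- function(dir) {
  paths <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  sizes <- file.info(paths)$size / 1024^2
  names(sizes) <- basename(paths)
  sizes
}

# Demo on a temporary directory with two files of known size
demo_dir <- tempfile("corpus")
dir.create(demo_dir)
writeBin(raw(1024), file.path(demo_dir, "a.txt"))
writeBin(raw(2048), file.path(demo_dir, "b.txt"))

demo_sizes <- corpus_sizes_mb(demo_dir)
demo_sizes
```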

For our analysis we need two libraries.

library('stringi')

## Warning: package 'stringi' was built under R version 3.1.3

library('ggplot2')

## Warning: package 'ggplot2' was built under R version 3.1.3

We analyse the lines and characters.

stri_stats_general( blogs )

##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539

stri_stats_general( news )

##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223154   169860866

stri_stats_general( twitter )

##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096031   134082634
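The figures above can be turned into an average number of characters per line for each corpus. A minimal sketch using only the values printed above; the variable names are illustrative and not part of the original analysis:

```r
# Lines and Chars as reported by stri_stats_general() above
n_lines <- c(blogs = 899288, news = 1010242, twitter = 2360148)
n_chars <- c(blogs = 206824382, news = 203223154, twitter = 162096031)

# Average number of characters per line for each corpus
chars_per_line <- n_chars / n_lines
round(chars_per_line, 1)
```

On these figures a twitter line averages roughly 69 characters, against roughly 230 for blogs and 201 for news.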

Next we count the words per line. We summarise the distribution of these counts per corpus, using summary statistics and a distribution plot. We start with the blogs corpus.

words_blogs   <- stri_count_words(blogs)
summary( words_blogs )

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00

qplot(   words_blogs )

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Next we analyse the news corpus.

words_news    <- stri_count_words(news)
summary( words_news )

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00

qplot(   words_news )

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Finally we analyse the twitter corpus.

words_twitter <- stri_count_words(twitter)
summary( words_twitter )

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00

qplot(   words_twitter )

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
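The stat_bin() messages above appear because no bin width was given to the histograms. A hedged sketch of setting one explicitly with ggplot2 follows. The vector words_demo is simulated stand-in data, not the corpus, and a bin width of one word is an assumption that suits counts bounded below 50, as in the twitter summary.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for per-line word counts (NOT the real corpus data)
words_demo <- sample(1:47, size = 1000, replace = TRUE)

# One bin per word count, so the default of 30 bins is not used
p_hist <- ggplot(data.frame(words = words_demo), aes(x = words)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Words per line", y = "Number of lines")

# Call print(p_hist) to draw the plot
```

For the blogs and news corpora, whose maxima run into the thousands, a wider bin or a logarithmic x axis would likely be a better choice.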

Conclusions

We analyse three corpora of US English text. The file sizes range from about 160 to 200 megabytes (MB) per file.

We find that the blogs and news corpora consist of about 1 million items each, and the twitter corpus consists of over 2 million items. Twitter messages have a character limit of 140 (with exceptions for links), which explains why there are so many more items in a corpus of comparable size.

This explanation is further supported by the fact that the number of characters is of the same order for all three corpora (roughly 160 to 210 million each), even though the twitter corpus has more than twice as many lines.

Finally, we find that the distributions of words per line for the blogs and news corpora are similar (both appear to be log-normal). The distribution for the twitter corpus is again different, which we attribute to the character limit.
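One way to probe the log-normal impression is to look at the logarithm of the word counts, which should be roughly symmetric if the counts are log-normal. The sketch below runs on simulated data only; the parameters meanlog = 3.4 and sdlog = 0.8 are arbitrary assumptions chosen for illustration, not estimates from the corpora.

```r
set.seed(42)
# Simulated word counts (NOT the real corpus data)
words_sim <- pmax(1, round(rlnorm(10000, meanlog = 3.4, sdlog = 0.8)))

# Raw counts are right-skewed: the mean sits above the median
summary(words_sim)

# After a log transform the mean and median should nearly coincide
log_words <- log(words_sim)
c(mean = mean(log_words), median = median(log_words))

# A normal Q-Q plot of log_words would then lie close to a straight line:
# qqnorm(log_words); qqline(log_words)
```

Applied to the real data, lines with zero words (the blogs summary reports a minimum of 0) would have to be dropped before taking logarithms.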