Introduction

The task is to carry out exploratory data analysis. The goal is to develop an understanding of the statistical properties of the data set so that a good text prediction model can be built later on.

Text files: reading and sampling

Three text files are provided:

## [1] "../Data/en_US/en_US.blogs.txt"   "../Data/en_US/en_US.news.txt"   
## [3] "../Data/en_US/en_US.twitter.txt"

The character vectors obtained by reading the full files are too large for comfortable analysis.
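A sketch of the reading step and of how the per-file statistics below might be computed (the exact code is not shown in the report; the word count here is approximated by splitting on whitespace, so it may differ slightly from the original figures):

```r
files <- c(blogs   = "../Data/en_US/en_US.blogs.txt",
           news    = "../Data/en_US/en_US.news.txt",
           twitter = "../Data/en_US/en_US.twitter.txt")

# Read each file into a character vector, one element per line.
texts <- lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE)

# Per-file statistics: object size in MB, number of lines, characters and
# words (words approximated by splitting on whitespace).
stats <- sapply(texts, function(x) c(
  `Object size (MB)`     = as.numeric(object.size(x)) / 2^20,
  `Number of lines`      = length(x),
  `Number of characters` = sum(nchar(x)),
  `Number of words`      = sum(lengths(strsplit(x, "\\s+")))
))
round(stats, 2)
```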

                          blogs       news    twitter
Object size (MB)         255.35     257.34     318.99
Number of lines          899288    1010242    2360148
Number of characters  206824505  203223159  162096241
Number of words        37334131   34372530   30373583

Sampling is applied to keep just 5% of the lines of each file.
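A minimal sketch of the sampling step, assuming a fixed seed (the seed value is an assumption) and sample() to draw 5% of the lines of each file:

```r
set.seed(1234)  # assumed seed; any fixed value makes the sample reproducible

# Keep a random 5% of the lines of each character vector.
sample_lines <- function(x, p = 0.05) sample(x, size = floor(p * length(x)))
texts_sample <- lapply(texts, sample_lines)
```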

                         blogs      news   twitter
Object size (MB)         12.75     12.83     16.12
Number of lines          44964     50512    118007
Number of characters  10319220  10125870   8104746
Number of words        1863466   1712581   1518963

Finally, non-ASCII characters are removed.
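One way to drop the non-ASCII characters is iconv with an empty substitution string (whether this is the method used in the report is an assumption):

```r
# Convert to ASCII; characters that cannot be represented are replaced by ""
# (i.e. removed).
texts_ascii <- lapply(texts_sample, iconv, from = "UTF-8", to = "ASCII", sub = "")
```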

                         blogs      news   twitter
Object size (MB)         12.64     12.79     16.10
Number of lines          44964     50512    118007
Number of characters  10280826  10110703   8098000
Number of words        1860294   1709555   1517300

Corpus object

The character vectors are loaded into a Corpus object. A Corpus is a collection of documents; in this case there are three documents, one per sampled file (blogs, news, twitter).
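A sketch of how the corpus could be built with the tm package's VectorSource (the exact call, and how the blogs/news/twitter document labels were attached, is not shown in the report):

```r
library(tm)

# Each sampled, ASCII-only character vector becomes one PlainTextDocument
# in a volatile (in-memory) corpus.
corpus <- VCorpus(VectorSource(texts_ascii))

inspect(corpus)  # prints the corpus summary shown below
```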

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## $blogs
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 10280826
## 
## $news
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 10110703
## 
## $twitter
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 8098000

Pre-processing

The tm package provides several transformations, mainly intended for cleaning the data:

## [1] "removeNumbers"     "removePunctuation" "removeWords"      
## [4] "stemDocument"      "stripWhitespace"

All of them except stemming (reducing every word to its root) are applied, together with conversion to lower case.
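A sketch of the cleaning step with tm_map. Judging by the term frequencies later on, the standard English stopword list was apparently not used, so the word list passed to removeWords is only a hypothetical placeholder here:

```r
# Word list for removeWords; the actual list used is not shown in the report
# (a profanity filter would be a typical choice).
words_to_drop <- c("exampleword1", "exampleword2")  # hypothetical placeholder

corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case conversion
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, words_to_drop)
corpus <- tm_map(corpus, stripWhitespace)
# stemDocument is deliberately not applied.

inspect(corpus)  # corpus summary after cleaning, shown below
```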

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## $blogs
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 6576348
## 
## $news
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 6944478
## 
## $twitter
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 5307692

Document term matrix

A document-term matrix (DTM) is created from the transformed corpus. In a DTM each row corresponds to a document, each column to a term, and each cell holds the number of occurrences of that term in that document. Summary information on the matrix follows.
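The matrix can be built directly from the cleaned corpus; printing it gives the summary below:

```r
# Rows correspond to the three documents, columns to terms, and each cell
# holds the term's frequency in that document.
dtm <- DocumentTermMatrix(corpus)
dtm
```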

## <<DocumentTermMatrix (documents: 3, terms: 135224)>>
## Non-/sparse entries: 202557/203115
## Sparsity           : 50%
## Maximal term length: 95
## Weighting          : term frequency (tf)

Summary of the total frequency of each word across the three documents.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.00     1.00     1.00    19.84     4.00 16032.00

The twenty most frequently occurring terms.

##   will   said   just    one   like    can    get   time    new    now   good 
##  16032  15261  14910  14354  13418  12229  11383  10817   9618   9044   8842 
##    day   know   love people   back    see  first   also   make 
##   8484   8237   8109   7869   7101   6907   6773   6504   6475

Word cloud with the 100 most frequent terms.
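The frequency figures and the word cloud can be obtained along these lines (a sketch; the use of the wordcloud package, the colour palette and the seed are assumptions):

```r
library(wordcloud)
library(RColorBrewer)

# Total frequency of each term, summed over the three documents.
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

summary(freq)   # distribution of word frequencies across the corpus
head(freq, 20)  # the twenty most frequent terms

# Word cloud of the 100 most frequent terms.
set.seed(1234)  # assumed seed, for a reproducible layout
wordcloud(names(freq), freq, max.words = 100, colors = brewer.pal(8, "Dark2"))
```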

N-Grams

An n-gram is an ordered sequence of n “words” taken from a body of text. N-grams form the basis of a next-word prediction model. The ngram package provides fast n-gram tokenization, among other useful utilities. A sketch of the tokenization follows.
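The sketch below collapses the cleaned corpus into a single string first, since ngram() works on strings; the exact code used in the report is not shown:

```r
library(ngram)

# Collapse the cleaned documents into one string.
full_text <- paste(unlist(lapply(content(corpus), content)), collapse = " ")

# Bigram tokenization; get.phrasetable() returns the ngrams/freq/prop table.
bi       <- ngram(full_text, n = 2)
bi_table <- get.phrasetable(bi)
summary(bi_table$freq)   # distribution of bigram frequencies
head(bi_table, 10)       # ten most frequent bigrams

# The same calls with n = 3 and n = 4 produce the trigram and 4-gram tables.
```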

Bigrams

Sequences of two words.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.000    1.000    1.414    1.000 1295.000

##          ngrams freq         prop
## 1    right now  1295 0.0004661402
## 2     new york   922 0.0003318774
## 3    last year   892 0.0003210788
## 4   last night   809 0.0002912027
## 5    years ago   645 0.0002321702
## 6  high school   641 0.0002307304
## 7    feel like   611 0.0002199318
## 8    last week   605 0.0002177721
## 9      can get   598 0.0002152524
## 10  first time   594 0.0002138126

Trigrams

Sequences of three words.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.026   1.000 152.000

##                     ngrams freq         prop
## 1       happy mothers day   152 5.471300e-05
## 2             let us know   144 5.183337e-05
## 3           new york city   124 4.463429e-05
## 4          happy new year    96 3.455558e-05
## 5           two years ago    78 2.807641e-05
## 6           cinco de mayo    69 2.483682e-05
## 7          new york times    69 2.483682e-05
## 8  looking forward seeing    63 2.267710e-05
## 9  president barack obama    60 2.159724e-05
## 10           world war ii    56 2.015742e-05

4-grams

Sequences of four words.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.004   1.000  25.000

##                         ngrams freq         prop
## 1           g fat g saturated    25 8.998852e-06
## 2  amazon services llc amazon    24 8.638898e-06
## 3      services llc amazon eu    24 8.638898e-06
## 4    g protein g carbohydrate    23 8.278944e-06
## 5             let us know can    21 7.559036e-06
## 6    protein g carbohydrate g    20 7.199082e-06
## 7      incorporated item c pp    18 6.479174e-06
## 8        just finished mi run    17 6.119220e-06
## 9       martin luther king jr    17 6.119220e-06
## 10       g carbohydrate g fat    16 5.759266e-06

Conclusions