Introduction

The task is to carry out exploratory data analysis. The goal is to develop an understanding of the statistical properties of the data set so that a good text prediction model can be built later on.

Text files: reading and sampling

Three text files are provided:

## [1] "../Data/en_US/en_US.blogs.txt"   "../Data/en_US/en_US.news.txt"   
## [3] "../Data/en_US/en_US.twitter.txt"

The character vectors obtained by reading the full files are too large for comfortable analysis.
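A sketch of the reading step and of how the per-file statistics below might be computed (the exact code is not shown in the report; the word count here is approximated by splitting on whitespace, so it may differ slightly from the original figures):

```r
files <- c(blogs   = "../Data/en_US/en_US.blogs.txt",
           news    = "../Data/en_US/en_US.news.txt",
           twitter = "../Data/en_US/en_US.twitter.txt")

# Read each file into a character vector, one element per line.
texts <- lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE)

# Per-file statistics: object size in MB, number of lines, characters and
# words (words approximated by splitting on whitespace).
stats <- sapply(texts, function(x) c(
  `Object size (MB)`     = as.numeric(object.size(x)) / 2^20,
  `Number of lines`      = length(x),
  `Number of characters` = sum(nchar(x)),
  `Number of words`      = sum(lengths(strsplit(x, "\\s+")))
))
round(stats, 2)
```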

                          blogs       news    twitter
Object size (MB)         255.35     257.34     318.99
Number of lines          899288    1010242    2360148
Number of characters  206824505  203223159  162096241
Number of words        37334131   34372530   30373583

Sampling is applied to keep just 5% of the lines of each file.
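A minimal sketch of the sampling step, assuming a fixed seed (the seed value is an assumption) and sample() to draw 5% of the lines of each file:

```r
set.seed(1234)  # assumed seed; any fixed value makes the sample reproducible

# Keep a random 5% of the lines of each character vector.
sample_lines <- function(x, p = 0.05) sample(x, size = floor(p * length(x)))
texts_sample <- lapply(texts, sample_lines)
```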

                         blogs      news   twitter
Object size (MB)         12.75     12.83     16.12
Number of lines          44964     50512    118007
Number of characters  10319220  10125870   8104746
Number of words        1863466   1712581   1518963

Finally, non-ASCII characters are removed.
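One way to drop the non-ASCII characters is iconv with an empty substitution string (whether this is the method used in the report is an assumption):

```r
# Convert to ASCII; characters that cannot be represented are replaced by ""
# (i.e. removed).
texts_ascii <- lapply(texts_sample, iconv, from = "UTF-8", to = "ASCII", sub = "")
```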

                         blogs      news   twitter
Object size (MB)         12.64     12.79     16.10
Number of lines          44964     50512    118007
Number of characters  10280826  10110703   8098000
Number of words        1860294   1709555   1517300

Corpus object

The character vectors are loaded into a Corpus object. A Corpus is a collection of documents; in this case there are three documents, one per sampled file (blogs, news, twitter).
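A sketch of how the corpus could be built with the tm package's VectorSource (the exact call, and how the blogs/news/twitter document labels were attached, is not shown in the report):

```r
library(tm)

# Each sampled, ASCII-only character vector becomes one PlainTextDocument
# in a volatile (in-memory) corpus.
corpus <- VCorpus(VectorSource(texts_ascii))

inspect(corpus)  # prints the corpus summary shown below
```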

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## $blogs
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 10280826
## 
## $news
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 10110703
## 
## $twitter
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 8098000

Pre-processing

The tm package provides several transformations, mainly intended for cleaning the data:

## [1] "removeNumbers"     "removePunctuation" "removeWords"      
## [4] "stemDocument"      "stripWhitespace"

All of them except stemming (reducing every word to its root) are applied, together with conversion to lower case.
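A sketch of the cleaning step with tm_map. Judging by the term frequencies later on, the standard English stopword list was apparently not used, so the word list passed to removeWords is only a hypothetical placeholder here:

```r
# Word list for removeWords; the actual list used is not shown in the report
# (a profanity filter would be a typical choice).
words_to_drop <- c("exampleword1", "exampleword2")  # hypothetical placeholder

corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case conversion
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, words_to_drop)
corpus <- tm_map(corpus, stripWhitespace)
# stemDocument is deliberately not applied.

inspect(corpus)  # corpus summary after cleaning, shown below
```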

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## $blogs
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 6576348
## 
## $news
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 6944478
## 
## $twitter
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 5307692

Document term matrix

A document-term matrix (DTM) is created from the transformed corpus. In a DTM each row corresponds to a document, each column to a term, and each cell holds the number of occurrences of that term in that document. Summary information on the matrix follows.
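The matrix can be built directly from the cleaned corpus; printing it gives the summary below:

```r
# Rows correspond to the three documents, columns to terms, and each cell
# holds the term's frequency in that document.
dtm <- DocumentTermMatrix(corpus)
dtm
```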

## <<DocumentTermMatrix (documents: 3, terms: 135224)>>
## Non-/sparse entries: 202557/203115
## Sparsity           : 50%
## Maximal term length: 95
## Weighting          : term frequency (tf)

Summary of the total frequency of each word across the three documents.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.00     1.00     1.00    19.84     4.00 16032.00

The twenty most frequently occurring terms.

##   will   said   just    one   like    can    get   time    new    now   good 
##  16032  15261  14910  14354  13418  12229  11383  10817   9618   9044   8842 
##    day   know   love people   back    see  first   also   make 
##   8484   8237   8109   7869   7101   6907   6773   6504   6475

Word cloud with the 100 most frequent terms.
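The frequency figures and the word cloud can be obtained along these lines (a sketch; the use of the wordcloud package, the colour palette and the seed are assumptions):

```r
library(wordcloud)
library(RColorBrewer)

# Total frequency of each term, summed over the three documents.
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

summary(freq)   # distribution of word frequencies across the corpus
head(freq, 20)  # the twenty most frequent terms

# Word cloud of the 100 most frequent terms.
set.seed(1234)  # assumed seed, for a reproducible layout
wordcloud(names(freq), freq, max.words = 100, colors = brewer.pal(8, "Dark2"))
```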

N-Grams

An n-gram is an ordered sequence of n “words” taken from a body of text. N-grams form the basis of a next-word prediction model. The ngram package provides fast n-gram tokenization, among other useful utilities. A sketch of the tokenization follows.
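The sketch below collapses the cleaned corpus into a single string first, since ngram() works on strings; the exact code used in the report is not shown:

```r
library(ngram)

# Collapse the cleaned documents into one string.
full_text <- paste(unlist(lapply(content(corpus), content)), collapse = " ")

# Bigram tokenization; get.phrasetable() returns the ngrams/freq/prop table.
bi       <- ngram(full_text, n = 2)
bi_table <- get.phrasetable(bi)
summary(bi_table$freq)   # distribution of bigram frequencies
head(bi_table, 10)       # ten most frequent bigrams

# The same calls with n = 3 and n = 4 produce the trigram and 4-gram tables.
```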

Bigrams

Sequences of two words.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.000    1.000    1.414    1.000 1295.000

##          ngrams freq         prop
## 1    right now  1295 0.0004661402
## 2     new york   922 0.0003318774
## 3    last year   892 0.0003210788
## 4   last night   809 0.0002912027
## 5    years ago   645 0.0002321702
## 6  high school   641 0.0002307304
## 7    feel like   611 0.0002199318
## 8    last week   605 0.0002177721
## 9      can get   598 0.0002152524
## 10  first time   594 0.0002138126

Trigrams

Sequences of three words.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.026   1.000 152.000

##                     ngrams freq         prop
## 1       happy mothers day   152 5.471300e-05
## 2             let us know   144 5.183337e-05
## 3           new york city   124 4.463429e-05
## 4          happy new year    96 3.455558e-05
## 5           two years ago    78 2.807641e-05
## 6           cinco de mayo    69 2.483682e-05
## 7          new york times    69 2.483682e-05
## 8  looking forward seeing    63 2.267710e-05
## 9  president barack obama    60 2.159724e-05
## 10           world war ii    56 2.015742e-05

4-grams

Sequences of four words.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.004   1.000  25.000

##                         ngrams freq         prop
## 1           g fat g saturated    25 8.998852e-06
## 2  amazon services llc amazon    24 8.638898e-06
## 3      services llc amazon eu    24 8.638898e-06
## 4    g protein g carbohydrate    23 8.278944e-06
## 5             let us know can    21 7.559036e-06
## 6    protein g carbohydrate g    20 7.199082e-06
## 7      incorporated item c pp    18 6.479174e-06
## 8        just finished mi run    17 6.119220e-06
## 9       martin luther king jr    17 6.119220e-06
## 10       g carbohydrate g fat    16 5.759266e-06

Conclusions