The task is to perform exploratory data analysis on the provided text data. The goal is to develop an understanding of the statistical properties of the data set so that a good prediction model can be built later on.
Three text files are provided:
## [1] "../Data/en_US/en_US.blogs.txt" "../Data/en_US/en_US.news.txt"
## [3] "../Data/en_US/en_US.twitter.txt"
The character vectors obtained from reading the files are too large for direct analysis.
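A minimal sketch of how the files might be read and summarized (the helper names are illustrative; `skipNul` guards against embedded NUL bytes, and the stringi package is assumed for word counting):

```r
library(stringi)

# Read a file as a character vector, one element per line.
read_file <- function(path) {
  readLines(path, encoding = "UTF-8", skipNul = TRUE)
}

blogs   <- read_file("../Data/en_US/en_US.blogs.txt")
news    <- read_file("../Data/en_US/en_US.news.txt")
twitter <- read_file("../Data/en_US/en_US.twitter.txt")

# Basic size statistics for one character vector.
text_stats <- function(x) {
  c("Object size (MB)"     = as.numeric(object.size(x)) / 1024^2,
    "Number of lines"      = length(x),
    "Number of characters" = sum(nchar(x)),
    "Number of words"      = sum(stri_count_words(x)))
}

sapply(list(blogs = blogs, news = news, twitter = twitter), text_stats)
```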
| | blogs | news | twitter |
|---|---|---|---|
| Object size (MB) | 255.35 | 257.34 | 318.99 |
| Number of lines | 899288.00 | 1010242.00 | 2360148.00 |
| Number of characters | 206824505.00 | 203223159.00 | 162096241.00 |
| Number of words | 37334131.00 | 34372530.00 | 30373583.00 |
Sampling is applied to keep just 5% of the lines in each file.
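A sketch of one way the 5% sample could be drawn (the seed is illustrative, chosen only for reproducibility):

```r
set.seed(1234)  # illustrative seed, for reproducibility only

# Keep a random 5% of the lines of a character vector.
sample_lines <- function(x, fraction = 0.05) {
  x[sample(length(x), size = round(fraction * length(x)))]
}

blogs_sample   <- sample_lines(blogs)
news_sample    <- sample_lines(news)
twitter_sample <- sample_lines(twitter)
```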
| | blogs | news | twitter |
|---|---|---|---|
| Object size (MB) | 12.75 | 12.83 | 16.12 |
| Number of lines | 44964.00 | 50512.00 | 118007.00 |
| Number of characters | 10319220.00 | 10125870.00 | 8104746.00 |
| Number of words | 1863466.00 | 1712581.00 | 1518963.00 |
Finally, non-ASCII characters are removed.
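A sketch of the non-ASCII clean-up; one way to do it is `iconv` with `sub = ""`, which simply deletes characters that cannot be converted:

```r
# Convert to ASCII; characters that cannot be converted are dropped.
to_ascii <- function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

blogs_sample   <- to_ascii(blogs_sample)
news_sample    <- to_ascii(news_sample)
twitter_sample <- to_ascii(twitter_sample)
```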
| | blogs | news | twitter |
|---|---|---|---|
| Object size (MB) | 12.64 | 12.79 | 16.10 |
| Number of lines | 44964.00 | 50512.00 | 118007.00 |
| Number of characters | 10280826.00 | 10110703.00 | 8098000.00 |
| Number of words | 1860294.00 | 1709555.00 | 1517300.00 |
The character vectors are loaded into a Corpus object. A Corpus is a collection of documents, in this case the three documents obtained after sampling.
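A sketch of building the corpus with tm from the sampled, cleaned vectors (each vector is collapsed into a single document; by default `VCorpus` numbers the documents, so the blogs/news/twitter labels are illustrative):

```r
library(tm)

# One document per source, built from the sampled lines.
docs <- c(blogs   = paste(blogs_sample,   collapse = " "),
          news    = paste(news_sample,    collapse = " "),
          twitter = paste(twitter_sample, collapse = " "))

corpus <- VCorpus(VectorSource(docs))
corpus
```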
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## $blogs
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 10280826
##
## $news
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 10110703
##
## $twitter
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 8098000
The tm package provides several transformations, mainly intended for cleaning the data.
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
All of them except stemming (reducing every word to its root) are applied, along with conversion to lower case.
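A sketch of the cleaning pipeline; the assumption that `removeWords` is fed the English stop word list, and the ordering of the steps, are choices made here rather than requirements:

```r
corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words (assumed list)
corpus <- tm_map(corpus, stripWhitespace)                    # collapse repeated whitespace
corpus
```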
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## $blogs
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 6576348
##
## $news
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 6944478
##
## $twitter
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 5307692
A document-term matrix (DTM) is created from the transformed corpus. The DTM has one row per document and one column per term, and each entry counts the occurrences of that term in that document. Summary information on the matrix follows.
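A sketch of building the DTM from the cleaned corpus:

```r
# Rows are the three documents, columns are terms,
# and entries are raw term-frequency counts.
dtm <- DocumentTermMatrix(corpus)
dtm
```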
## <<DocumentTermMatrix (documents: 3, terms: 135224)>>
## Non-/sparse entries: 202557/203115
## Sparsity : 50%
## Maximal term length: 95
## Weighting : term frequency (tf)
Summary of the total frequency of each term, summed across the three documents.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 1.00 19.84 4.00 16032.00
The twenty most frequently occurring terms.
## will said just one like can get time new now good
## 16032 15261 14910 14354 13418 12229 11383 10817 9618 9044 8842
## day know love people back see first also make
## 8484 8237 8109 7869 7101 6907 6773 6504 6475
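A sketch of how the frequency summary and the top terms above might be obtained (`slam::col_sums` sums the sparse DTM without converting it to a dense matrix; the object name `freq` is illustrative):

```r
library(slam)

# Total frequency of each term, summed over the three documents.
freq <- sort(col_sums(dtm), decreasing = TRUE)

summary(freq)   # distribution of term frequencies
head(freq, 20)  # the twenty most frequent terms
```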
A word cloud of the 100 most frequent terms follows.
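A sketch of the word cloud, assuming the wordcloud package and the `freq` vector from the previous snippet:

```r
library(wordcloud)

# Plot the 100 most frequent terms; word size is proportional to frequency.
wordcloud(words = names(freq)[1:100],
          freq = freq[1:100],
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```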
An n-gram is an ordered sequence of n “words” taken from a body of text. N-grams are the basis of a next-word prediction model. The ngram package provides fast n-gram tokenization, among other useful utilities.
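A sketch of bigram tokenization with the ngram package; `ngram()` expects a single string, so the cleaned documents are collapsed first (trigrams and four-grams follow by setting `n = 3` or `n = 4`):

```r
library(ngram)

# Collapse the three cleaned documents into one long string.
full_text <- paste(c(as.character(corpus[[1]]),
                     as.character(corpus[[2]]),
                     as.character(corpus[[3]])),
                   collapse = " ")

bi       <- ngram(full_text, n = 2)   # bigram tokenization
bi_table <- get.phrasetable(bi)       # data frame with columns ngrams, freq, prop

summary(bi_table$freq)                # distribution of bigram counts
head(bi_table, 10)                    # ten most frequent bigrams
```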
Sequences of two words (bigrams).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.414 1.000 1295.000
## ngrams freq prop
## 1 right now 1295 0.0004661402
## 2 new york 922 0.0003318774
## 3 last year 892 0.0003210788
## 4 last night 809 0.0002912027
## 5 years ago 645 0.0002321702
## 6 high school 641 0.0002307304
## 7 feel like 611 0.0002199318
## 8 last week 605 0.0002177721
## 9 can get 598 0.0002152524
## 10 first time 594 0.0002138126
Sequences of three words (trigrams).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.026 1.000 152.000
## ngrams freq prop
## 1 happy mothers day 152 5.471300e-05
## 2 let us know 144 5.183337e-05
## 3 new york city 124 4.463429e-05
## 4 happy new year 96 3.455558e-05
## 5 two years ago 78 2.807641e-05
## 6 cinco de mayo 69 2.483682e-05
## 7 new york times 69 2.483682e-05
## 8 looking forward seeing 63 2.267710e-05
## 9 president barack obama 60 2.159724e-05
## 10 world war ii 56 2.015742e-05
Sequences of four words (four-grams).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.004 1.000 25.000
## ngrams freq prop
## 1 g fat g saturated 25 8.998852e-06
## 2 amazon services llc amazon 24 8.638898e-06
## 3 services llc amazon eu 24 8.638898e-06
## 4 g protein g carbohydrate 23 8.278944e-06
## 5 let us know can 21 7.559036e-06
## 6 protein g carbohydrate g 20 7.199082e-06
## 7 incorporated item c pp 18 6.479174e-06
## 8 just finished mi run 17 6.119220e-06
## 9 martin luther king jr 17 6.119220e-06
## 10 g carbohydrate g fat 16 5.759266e-06