This report covers the text and corpus properties of the data that will be used to build an NLP predictive model.

Data is imported from three files: a blogs file, a news file, and a Twitter file.

The text analysis will:

* examine the basic properties of the three source files
* build a corpus from a sample of the data and clean it
* build document-term matrices and report unigram, bigram, and trigram frequencies

Initial data manipulation entails reading the three files and computing their size, line, character, and word counts.

Basic source file information is as follows:
| File | Size (MB) | Lines | Characters | Words |
|---|---|---|---|---|
| Blogs | 205.23 | 899288 | 206824505 | 37334131 |
| News | 200.99 | 77259 | 15639408 | 2643969 |
| Twitter | 163.19 | 2305923 | 160656274 | 30094580 |
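As an illustration, file statistics like those above can be gathered directly in R. The sketch below is an assumption about the workflow, not code taken from the report; in particular the `data/` paths, the file names, and the use of the `stringi` package for word counts are hypothetical.

```r
library(stringi)

# Hypothetical locations of the raw files; adjust to the actual paths.
files <- c(Blogs   = "data/en_US.blogs.txt",
           News    = "data/en_US.news.txt",
           Twitter = "data/en_US.twitter.txt")

file_stats <- do.call(rbind, lapply(names(files), function(name) {
  path  <- files[[name]]
  lines <- readLines(path, encoding = "UTF-8")
  data.frame(
    File       = name,
    SizeMB     = round(file.size(path) / 1024^2, 2),
    Lines      = length(lines),
    Characters = sum(nchar(lines)),
    Words      = sum(stri_count_words(lines))
  )
}))

file_stats
```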
A corpus is built from a 10% sample of the data.
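The sampling and corpus construction could look like the sketch below, which uses the `tm` package. The random seed, the line-level sampling approach, and the intermediate file names are assumptions; `blogs`, `news`, and `twitter` are assumed to hold the lines read from the three source files. The sampled file names match the document names that appear in later output.

```r
library(tm)

set.seed(1234)  # assumed seed, for reproducibility only

# Keep roughly 10% of the lines from each source.
sample_lines <- function(lines, p = 0.10) lines[rbinom(length(lines), 1, p) == 1]

dir.create("sample", showWarnings = FALSE)
writeLines(sample_lines(blogs),   "sample/blo.txt")
writeLines(sample_lines(news),    "sample/nws.txt")
writeLines(sample_lines(twitter), "sample/twi.txt")

# Volatile corpus over the sampled files; document names become blo.txt, nws.txt, twi.txt.
corpus <- VCorpus(DirSource("sample", encoding = "UTF-8"),
                  readerControl = list(language = "en"))
print(corpus)
```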
The corpus information is as follows:

    ## <<VCorpus>>
    ## Metadata: corpus specific: 0, document level (indexed): 0
    ## Content: documents: 3
    ##
    ## [[1]]
    ## <<PlainTextDocument>>
    ## Metadata: 7
    ## Content: chars: 20790578
    ##
    ## [[2]]
    ## <<PlainTextDocument>>
    ## Metadata: 7
    ## Content: chars: 1585061
    ##
    ## [[3]]
    ## <<PlainTextDocument>>
    ## Metadata: 7
    ## Content: chars: 16138513
The following transformations were applied to the corpus: conversion to lower case, removal of punctuation, and removal of English stopwords. When the predictive model is built, stopwords will be left in.
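A sketch of how these transformations and the document-term matrix might be produced with `tm` is shown below. The exact set and order of `tm_map` calls is an assumption inferred from the cleaned terms in the frequency tables, not code from the report.

```r
library(tm)

# Assumed cleaning pipeline (order and exact steps are inferred).
corpus <- tm_map(corpus, content_transformer(tolower))      # "New York"  -> "new york"
corpus <- tm_map(corpus, removePunctuation)                 # "mother's"  -> "mothers"
corpus <- tm_map(corpus, removeWords, stopwords("english")) # drops "the", "to", "me", ...
corpus <- tm_map(corpus, stripWhitespace)                   # collapse the gaps left behind

# Document-term matrix of single terms; term frequency (tf) is tm's default weighting.
dtm <- DocumentTermMatrix(corpus)
dtm
```

The resulting document-term matrix is summarized as follows: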

    ## <<DocumentTermMatrix (documents: 3, terms: 44431)>>
    ## Non-/sparse entries: 106257/27036
    ## Sparsity           : 20%
    ## Maximal term length: 23
    ## Weighting          : term frequency (tf)
    ## Sample             :
    ##           Terms
    ## Docs         can    get   good   just   like   love    now    one   time   will
    ##   blo.txt   9635   7052   4792   9988   9800   4445   5932  12391   8896  11331
    ##   nws.txt    453    346    223    405    417    102    269    635    406    837
    ##   twi.txt   8749  10966   9707  14856  12184  10293   8092   8126   7389   9450
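The term-frequency tables below can be obtained by summing the columns of this matrix. The sketch assumes the `dtm` object from above; the `top_terms` helper is a hypothetical convenience, not part of the report.

```r
# Total frequency of each term across the three documents, highest first.
# (For a large matrix, slam::col_sums(dtm) avoids building a dense matrix.)
term_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# Hypothetical helper: the n most frequent terms as a small data frame.
top_terms <- function(freq, n = 10) {
  data.frame(Term = names(freq)[1:n], Frequency = unname(freq[1:n]))
}

top_terms(term_freq)
```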
The ten most frequent single terms (unigrams) in the sample are:

| Term | Frequency |
|---|---|
| just | 25249 |
| like | 22401 |
| will | 21618 |
| one | 21152 |
| can | 18837 |
| get | 18364 |
| time | 16691 |
| love | 14840 |
| good | 14722 |
| now | 14293 |
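The bigram and trigram counts in the next two tables require an n-gram tokenizer when the document-term matrices are built. A minimal sketch assuming the `RWeka` package is shown below; other tokenizers (for example from `quanteda` or `tokenizers`) would work equally well.

```r
library(RWeka)

# Tokenizers that emit two- and three-word sequences instead of single terms.
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

dtm2 <- DocumentTermMatrix(corpus, control = list(tokenize = bigram_tokenizer))
dtm3 <- DocumentTermMatrix(corpus, control = list(tokenize = trigram_tokenizer))

# Frequencies are extracted the same way as for unigrams.
bigram_freq  <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)
trigram_freq <- sort(colSums(as.matrix(dtm3)), decreasing = TRUE)
```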
The ten most frequent two-word sequences (bigrams) are:

| Term | Frequency |
|---|---|
| right now | 2216 |
| last night | 1510 |
| feel like | 1165 |
| looking forward | 1071 |
| new york | 925 |
| looks like | 866 |
| can get | 863 |
| just got | 796 |
| let know | 790 |
| first time | 773 |
The ten most frequent three-word sequences (trigrams) are:

| Term | Frequency |
|---|---|
| happy mothers day | 301 |
| let us know | 231 |
| happy new year | 169 |
| new york city | 136 |
| cinco de mayo | 92 |
| looking forward seeing | 87 |
| new york times | 78 |
| just got back | 71 |
| st patricks day | 71 |
| happy valentines day | 66 |
Graphs showing the most frequently used terms in each of the N-grams generated conclude the report.
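Such graphs can be drawn from the frequency vectors computed above; the sketch below uses `ggplot2`, and the `plot_top_terms` helper and all styling choices are assumptions rather than the report's own plotting code.

```r
library(ggplot2)

# Horizontal bar chart of the n most frequent terms in a named frequency vector.
plot_top_terms <- function(freq, n = 10, title = "Most frequent terms") {
  df <- data.frame(Term = names(freq)[1:n], Frequency = unname(freq[1:n]))
  ggplot(df, aes(x = reorder(Term, Frequency), y = Frequency)) +
    geom_col() +
    coord_flip() +
    labs(title = title, x = NULL, y = "Frequency")
}

plot_top_terms(term_freq,    title = "Top unigrams")
plot_top_terms(bigram_freq,  title = "Top bigrams")
plot_top_terms(trigram_freq, title = "Top trigrams")
```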