Introduction

This report covers text and corpus properties for natural language processing (NLP).

Data is imported from three files: Blogs, News, and Twitter.

The text analysis will:

  1. Show basic file information on the source files
  2. Show frequently used terms for 1-gram, 2-gram, and 3-gram N-grams
  3. Graphically detail the frequency of top terms in the N-grams

Data Manipulation

Initial data manipulation entails summarizing the source files, building a sampled corpus, and transforming that corpus.

Text File Summary

Basic source file information is as follows:

FileName  Size (MB)    Lines  Characters     Words
Blogs        205.23   899288   206824505  37334131
News         200.99    77259    15639408   2643969
Twitter      163.19  2305923   160656274  30094580
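
The summary columns above can be reproduced with a short script. The following is a minimal sketch in Python, not the code used to produce the report. It assumes that characters are counted without the trailing newline, that words are whitespace-separated tokens, and that one MB is 1024 × 1024 bytes; the report does not state its counting rules. The function name `summarize_file` and the temporary demo file are illustrative only.

```python
import os
import tempfile

def summarize_file(path):
    """Return size in MB, line count, character count, and word count.

    Assumptions (not confirmed by the report): characters exclude the
    newline, and words are whitespace-separated tokens.
    """
    lines = chars = words = 0
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            text = line.rstrip("\n")
            lines += 1
            chars += len(text)
            words += len(text.split())
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return {"size_mb": round(size_mb, 2), "lines": lines,
            "chars": chars, "words": words}

# Demonstration on a small throwaway file (hypothetical content).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("hello world\nthis is a test\n")
    demo_path = tmp.name

summary = summarize_file(demo_path)
os.remove(demo_path)
print(summary)
```

Reading line by line keeps memory use flat, which matters for source files of roughly 200 MB each.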

Corpus

The corpus is built from a 10% sample of each source file.
The corpus summary is as follows:

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 20790578
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1585061
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 16138513
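
A 10% sample can be drawn in several ways, and the report does not say which one was used. The sketch below, in Python, keeps each line independently with probability 0.10, so the sample size is approximately (not exactly) 10% of the input. The function name `sample_lines` and the seed value are assumptions made for illustration.

```python
import random

def sample_lines(lines, fraction=0.10, seed=1234):
    """Keep each line independently with probability `fraction`.

    A fixed seed makes the sample reproducible. The seed value here
    is arbitrary and not taken from the report.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

# Demonstration on synthetic lines (hypothetical data).
demo_lines = [f"line {i}" for i in range(10_000)]
sampled = sample_lines(demo_lines)
print(len(sampled))
```

Because the decision is made per line, the same function works on a file handle without loading the whole file into memory.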

Corpus Transformation

The following transformations were applied to the corpus:

  • Strip whitespace
  • Remove numbers
  • Convert all text to lower case
  • Remove English stopwords
  • Remove punctuation
  • Remove profanity

When the predictive model is built, stopwords will be left in.
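
The transformations listed above can be sketched as a single cleaning function. This is an illustration in Python, not the pipeline that produced the report. The stopword and profanity sets are tiny hypothetical placeholders, and the order of the steps is chosen for the sketch rather than taken from the report. The `remove_stopwords` flag reflects the note that stopwords will be kept for the predictive model.

```python
import re
import string

# Small illustrative subsets; a real run would use full word lists.
STOPWORDS = {"the", "a", "an", "and", "is", "to", "of", "in", "i", "it"}
PROFANITY = {"darn"}  # hypothetical placeholder entry

def clean_text(text, remove_stopwords=True):
    """Lower-case, drop digits and punctuation, filter words, squeeze spaces."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)  # remove numbers
    # Delete ASCII punctuation outright (so "mother's" becomes "mothers").
    text = text.translate(str.maketrans("", "", string.punctuation))
    blocked = PROFANITY | (STOPWORDS if remove_stopwords else set())
    tokens = [tok for tok in text.split() if tok not in blocked]
    return " ".join(tokens)  # joining on one space strips extra whitespace

print(clean_text("The 3 cats   are, darn it, in the Garden!"))
print(clean_text("The 3 cats   are, darn it, in the Garden!",
                 remove_stopwords=False))
```

Deleting punctuation rather than replacing it with a space is consistent with terms such as "mothers" and "patricks" in the frequency tables, though the report does not describe the exact rule.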

## <<DocumentTermMatrix (documents: 3, terms: 44431)>>
## Non-/sparse entries: 106257/27036
## Sparsity           : 20%
## Maximal term length: 23
## Weighting          : term frequency (tf)
## Sample             :
##          Terms
## Docs       can   get good  just  like  love  now   one time  will
##   blo.txt 9635  7052 4792  9988  9800  4445 5932 12391 8896 11331
##   nws.txt  453   346  223   405   417   102  269   635  406   837
##   twi.txt 8749 10966 9707 14856 12184 10293 8092  8126 7389  9450

1-gram Frequent Terms

Term  Frequency
just      25249
like      22401
will      21618
one       21152
can       18837
get       18364
time      16691
love      14840
good      14722
now       14293

2-gram Frequent Terms

Term             Frequency
right now             2216
last night            1510
feel like             1165
looking forward       1071
new york               925
looks like             866
can get                863
just got               796
let know               790
first time             773

3-gram Frequent Terms

Term                    Frequency
happy mothers day             301
let us know                   231
happy new year                169
new york city                 136
cinco de mayo                  92
looking forward seeing         87
new york times                 78
just got back                  71
st patricks day                71
happy valentines day           66
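
Frequency tables like those above come from counting every run of N consecutive tokens. The following is a minimal sketch in Python using only the standard library. It is not the tokenizer used for the report. The helper names `ngrams` and `top_ngrams` and the two demo strings are illustrative, and the sketch assumes the text has already been cleaned and that N-grams do not span separate documents.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield each run of n consecutive tokens joined by a space."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def top_ngrams(documents, n, k=10):
    """Count n-grams across all documents and return the k most frequent."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams(doc.split(), n))
    return counts.most_common(k)

# Demonstration on two short cleaned strings (hypothetical data).
docs = ["happy new year new york city", "new york city new york times"]
for n in (1, 2, 3):
    print(n, top_ngrams(docs, n, k=3))
```

A document shorter than N tokens simply contributes no N-grams, because the range in `ngrams` is empty.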

Graphing Text Analysis

The graphs show the most frequently used terms in each set of generated N-grams.
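
As a stand-in for a plotted bar chart, term frequencies can be rendered as text bars. The sketch below, in Python, scales each bar to the largest count. The function name `text_bar_chart` and the bar width are assumptions for illustration, and the report's own graphs were presumably produced with a plotting library rather than this approach. The input values are the top five rows of the 1-gram table.

```python
def text_bar_chart(pairs, width=40):
    """Return one line per term, with a bar scaled to the largest count."""
    largest = max(count for _, count in pairs)
    label_width = max(len(term) for term, _ in pairs)
    rows = []
    for term, count in pairs:
        # Every term gets at least one mark so small counts stay visible.
        bar = "#" * max(1, round(width * count / largest))
        rows.append(f"{term:<{label_width}} | {bar} {count}")
    return "\n".join(rows)

# Top five 1-gram terms and frequencies from the report's table.
top_unigrams = [("just", 25249), ("like", 22401), ("will", 21618),
                ("one", 21152), ("can", 18837)]
chart = text_bar_chart(top_unigrams)
print(chart)
```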