This report covers the text and corpus properties of the data that will be used to build an NLP predictive model.

Data is imported from three files: a blogs file, a news file, and a Twitter file.

The text analysis will:

* examine the basic properties of the three source files
* build a corpus from a sample of the data and clean it
* build document-term matrices and report unigram, bigram, and trigram frequencies

Initial data manipulation entails reading the three files and computing their size, line, character, and word counts.

Basic source file information is as follows:
| File | Size (MB) | Lines | Characters | Words |
|---|---|---|---|---|
| Blogs | 205.23 | 899288 | 206824505 | 37334131 |
| News | 200.99 | 77259 | 15639408 | 2643969 |
| Twitter | 163.19 | 2305923 | 160656274 | 30094580 |
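As an illustration, file statistics like those above can be gathered directly in R. The sketch below is an assumption about the workflow, not code taken from the report; in particular the `data/` paths, the file names, and the use of the `stringi` package for word counts are hypothetical.

```r
library(stringi)

# Hypothetical locations of the raw files; adjust to the actual paths.
files <- c(Blogs   = "data/en_US.blogs.txt",
           News    = "data/en_US.news.txt",
           Twitter = "data/en_US.twitter.txt")

file_stats <- do.call(rbind, lapply(names(files), function(name) {
  path  <- files[[name]]
  lines <- readLines(path, encoding = "UTF-8")
  data.frame(
    File       = name,
    SizeMB     = round(file.size(path) / 1024^2, 2),
    Lines      = length(lines),
    Characters = sum(nchar(lines)),
    Words      = sum(stri_count_words(lines))
  )
}))

file_stats
```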
A corpus is built from a 10% sample of the data.
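The sampling and corpus construction could look like the sketch below, which uses the `tm` package. The random seed, the line-level sampling approach, and the intermediate file names are assumptions; `blogs`, `news`, and `twitter` are assumed to hold the lines read from the three source files. The sampled file names match the document names that appear in later output.

```r
library(tm)

set.seed(1234)  # assumed seed, for reproducibility only

# Keep roughly 10% of the lines from each source.
sample_lines <- function(lines, p = 0.10) lines[rbinom(length(lines), 1, p) == 1]

dir.create("sample", showWarnings = FALSE)
writeLines(sample_lines(blogs),   "sample/blo.txt")
writeLines(sample_lines(news),    "sample/nws.txt")
writeLines(sample_lines(twitter), "sample/twi.txt")

# Volatile corpus over the sampled files; document names become blo.txt, nws.txt, twi.txt.
corpus <- VCorpus(DirSource("sample", encoding = "UTF-8"),
                  readerControl = list(language = "en"))
print(corpus)
```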
The corpus information is as follows:

    ## <<VCorpus>>
    ## Metadata: corpus specific: 0, document level (indexed): 0
    ## Content: documents: 3
    ##
    ## [[1]]
    ## <<PlainTextDocument>>
    ## Metadata: 7
    ## Content: chars: 20790578
    ##
    ## [[2]]
    ## <<PlainTextDocument>>
    ## Metadata: 7
    ## Content: chars: 1585061
    ##
    ## [[3]]
    ## <<PlainTextDocument>>
    ## Metadata: 7
    ## Content: chars: 16138513
The following transformations were applied to the corpus: conversion to lower case, removal of punctuation, and removal of English stopwords. When the predictive model is built, stopwords will be left in.
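A sketch of how these transformations and the document-term matrix might be produced with `tm` is shown below. The exact set and order of `tm_map` calls is an assumption inferred from the cleaned terms in the frequency tables, not code from the report.

```r
library(tm)

# Assumed cleaning pipeline (order and exact steps are inferred).
corpus <- tm_map(corpus, content_transformer(tolower))      # "New York"  -> "new york"
corpus <- tm_map(corpus, removePunctuation)                 # "mother's"  -> "mothers"
corpus <- tm_map(corpus, removeWords, stopwords("english")) # drops "the", "to", "me", ...
corpus <- tm_map(corpus, stripWhitespace)                   # collapse the gaps left behind

# Document-term matrix of single terms; term frequency (tf) is tm's default weighting.
dtm <- DocumentTermMatrix(corpus)
dtm
```

The resulting document-term matrix is summarized as follows: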

    ## <<DocumentTermMatrix (documents: 3, terms: 44431)>>
    ## Non-/sparse entries: 106257/27036
    ## Sparsity           : 20%
    ## Maximal term length: 23
    ## Weighting          : term frequency (tf)
    ## Sample             :
    ##           Terms
    ## Docs         can    get   good   just   like   love    now    one   time   will
    ##   blo.txt   9635   7052   4792   9988   9800   4445   5932  12391   8896  11331
    ##   nws.txt    453    346    223    405    417    102    269    635    406    837
    ##   twi.txt   8749  10966   9707  14856  12184  10293   8092   8126   7389   9450
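The term-frequency tables below can be obtained by summing the columns of this matrix. The sketch assumes the `dtm` object from above; the `top_terms` helper is a hypothetical convenience, not part of the report.

```r
# Total frequency of each term across the three documents, highest first.
# (For a large matrix, slam::col_sums(dtm) avoids building a dense matrix.)
term_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# Hypothetical helper: the n most frequent terms as a small data frame.
top_terms <- function(freq, n = 10) {
  data.frame(Term = names(freq)[1:n], Frequency = unname(freq[1:n]))
}

top_terms(term_freq)
```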
The ten most frequent single terms (unigrams) in the sample are:

| Term | Frequency |
|---|---|
| just | 25249 |
| like | 22401 |
| will | 21618 |
| one | 21152 |
| can | 18837 |
| get | 18364 |
| time | 16691 |
| love | 14840 |
| good | 14722 |
| now | 14293 |
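The bigram and trigram counts in the next two tables require an n-gram tokenizer when the document-term matrices are built. A minimal sketch assuming the `RWeka` package is shown below; other tokenizers (for example from `quanteda` or `tokenizers`) would work equally well.

```r
library(RWeka)

# Tokenizers that emit two- and three-word sequences instead of single terms.
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

dtm2 <- DocumentTermMatrix(corpus, control = list(tokenize = bigram_tokenizer))
dtm3 <- DocumentTermMatrix(corpus, control = list(tokenize = trigram_tokenizer))

# Frequencies are extracted the same way as for unigrams.
bigram_freq  <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)
trigram_freq <- sort(colSums(as.matrix(dtm3)), decreasing = TRUE)
```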
The ten most frequent two-word sequences (bigrams) are:

| Term | Frequency |
|---|---|
| right now | 2216 |
| last night | 1510 |
| feel like | 1165 |
| looking forward | 1071 |
| new york | 925 |
| looks like | 866 |
| can get | 863 |
| just got | 796 |
| let know | 790 |
| first time | 773 |
The ten most frequent three-word sequences (trigrams) are:

| Term | Frequency |
|---|---|
| happy mothers day | 301 |
| let us know | 231 |
| happy new year | 169 |
| new york city | 136 |
| cinco de mayo | 92 |
| looking forward seeing | 87 |
| new york times | 78 |
| just got back | 71 |
| st patricks day | 71 |
| happy valentines day | 66 |
Graphs showing the most frequently used terms in each of the N-grams generated conclude the report.
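Such graphs can be drawn from the frequency vectors computed above; the sketch below uses `ggplot2`, and the `plot_top_terms` helper and all styling choices are assumptions rather than the report's own plotting code.

```r
library(ggplot2)

# Horizontal bar chart of the n most frequent terms in a named frequency vector.
plot_top_terms <- function(freq, n = 10, title = "Most frequent terms") {
  df <- data.frame(Term = names(freq)[1:n], Frequency = unname(freq[1:n]))
  ggplot(df, aes(x = reorder(Term, Frequency), y = Frequency)) +
    geom_col() +
    coord_flip() +
    labs(title = title, x = NULL, y = "Frequency")
}

plot_top_terms(term_freq,    title = "Top unigrams")
plot_top_terms(bigram_freq,  title = "Top bigrams")
plot_top_terms(trigram_freq, title = "Top trigrams")
```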