What does the text look like?

Blogs

blogs[1]
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."

News

news[1]
## [1] "He wasn't home alone, apparently."

Twitter

twitter[1]
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."

General Statistics

File Length and Memory Usage

This is an Exploratory Data Analysis (EDA) of three files containing English text extracted from (1) blogs, (2) news, and (3) Twitter. The number of text entries (lines) in each “.txt” file, and the memory occupied after reading it into R, are:

Summary of file length and memory usage
File Name            Number of Lines    Memory [MB]
en_US.blogs.txt      899,288            267.759
en_US.news.txt       77,259             20.729
en_US.twitter.txt    2,360,148          334.485

The Twitter file contains by far the most lines of text, which contributes to its larger memory usage.
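As a rough illustration of how such a summary could be produced, the sketch below counts lines and measures in-memory size with base R. The helper name `file_summary` is hypothetical, and converting bytes to MB by dividing by 1024^2 is an assumption; the report does not state which convention it used.

```r
# Hypothetical helper: summarise a character vector of lines
# (such as the result of readLines) by line count and in-memory size.
file_summary <- function(lines) {
  data.frame(
    n_lines   = length(lines),
    # Assumption: 1 MB = 1024^2 bytes
    memory_mb = as.numeric(object.size(lines)) / 1024^2
  )
}

# Toy stand-in for a file. For the real data one might read it with, e.g.,
# lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
toy <- c("He wasn't home alone, apparently.", "How are you?")
file_summary(toy)
```

Memory reported this way reflects the R object, not the size of the file on disk, so the two numbers can differ.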

Length of text


Number of words per line (a word is any run of characters delimited by spaces)

Number of words statistics
File Name            Min.    Median    Mean    Max
en_US.blogs.txt      1       28        41      6,630
en_US.news.txt       1       31        34      1,031
en_US.twitter.txt    1       12        12      47

Number of characters per line (including spaces and punctuation)

Number of characters statistics
File Name            Min.    Median    Mean    Max
en_US.blogs.txt      1       157       231     40,835
en_US.news.txt       2       186       203     5,760
en_US.twitter.txt    2       64        68      213

These statistics show that line length in the blogs file varies widely and is positively skewed: the mean is well above the median (41 vs. 28 words; 231 vs. 157 characters). In the Twitter file, by contrast, the mean and median are nearly equal (12 vs. 12 words; 68 vs. 64 characters).
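The per-line statistics can be sketched in base R as follows. This is a minimal sketch that assumes words are split on runs of spaces, as the definition above suggests; the function names are hypothetical and not taken from the original analysis.

```r
# Words per line: split on runs of spaces and count the pieces.
# trimws() avoids an empty first piece when a line starts with a space.
words_per_line <- function(lines) {
  lengths(strsplit(trimws(lines), " +"))
}

# Characters per line, counting spaces and punctuation.
chars_per_line <- function(lines) nchar(lines, type = "chars")

# Min, median, mean and max of a numeric vector.
line_stats <- function(x) {
  c(Min = min(x), Median = median(x), Mean = mean(x), Max = max(x))
}

toy <- c("He wasn't home alone, apparently.",
         "How are you? Btw thanks for the RT.",
         "ok")
line_stats(words_per_line(toy))
line_stats(chars_per_line(toy))
```

A mean noticeably larger than the median, as in the blogs file, signals that a few very long lines pull the average up.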

Word Frequency Analysis

[Figures: most frequent words in the Blogs, News, and Twitter files]

Number of unique words and number of words needed to cover 50% and 90% of each corpus
File Name            No. Unique Words    Words to Cover 50%    Words to Cover 90%
en_US.blogs.txt      358,953             113                   6,879
en_US.news.txt       80,529              195                   7,699
en_US.twitter.txt    346,498             115                   5,142

As expected, the most frequent words are the most common words in the English language. In NLP these are known as “stop words” (e.g., the, and, a, an, in, at). Because they occur so often, a small number of words (between 113 and 195, depending on the file) is enough to cover 50% of each corpus.
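One way to compute the number of words needed for a given coverage is to sort words by frequency and find where the cumulative share first reaches the target. The sketch below assumes the text has already been split into a vector of words; `words_to_cover` is a hypothetical name, and the original analysis may have tokenized or counted differently.

```r
# Smallest number of distinct words whose combined frequency
# reaches at least `coverage` (a fraction between 0 and 1).
words_to_cover <- function(words, coverage) {
  freq <- sort(table(words), decreasing = TRUE)
  cum_share <- cumsum(as.numeric(freq)) / length(words)
  which(cum_share >= coverage)[1]
}

# Toy corpus of 10 words: "the" x4, "a" x2, and four words seen once.
toy <- c("the", "the", "the", "the", "a", "a", "cat", "dog", "sat", "mat")
words_to_cover(toy, 0.50)  # "the" and "a" together cover 60%
words_to_cover(toy, 0.85)
```

Because the frequency distribution has a long tail, moving from 50% to 90% coverage takes many more words than the first 50% did.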

N-grams

For the 2-gram and 3-gram assessment, a random sample of 10% of each original corpus was used.
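A 10% random sample could be drawn along these lines. The sketch assumes sampling is done per line without replacement; the seed value and the name `sample_lines` are illustrative choices, not details from the original analysis.

```r
# Draw a random subset of lines without replacement.
sample_lines <- function(lines, fraction = 0.10) {
  n <- max(1, round(length(lines) * fraction))
  lines[sample.int(length(lines), n)]
}

set.seed(1234)  # arbitrary seed (an assumption) so the sample is reproducible
toy <- paste("line", 1:1000)
subset_lines <- sample_lines(toy, 0.10)
length(subset_lines)
```

Fixing the seed makes the n-gram counts repeatable across runs, at the cost of tying the results to one particular sample.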

Top 10 2-grams

Blogs

##      ngrams  freq        prop
## 1   of the  18869 0.005048991
## 2   in the  15330 0.004102021
## 3   to the   8504 0.002275511
## 4   on the   7560 0.002022915
## 5    to be   6723 0.001798949
## 6  for the   5859 0.001567759
## 7  and the   5834 0.001561069
## 8    and i   5433 0.001453769
## 9    i was   4959 0.001326936
## 10  it was   4842 0.001295629

News

##      ngrams freq        prop
## 1   of the  1433 0.005352247
## 2   in the  1342 0.005012363
## 3   to the   645 0.002409072
## 4  for the   544 0.002031837
## 5   on the   523 0.001953402
## 6   at the   461 0.001721833
## 7  and the   424 0.001583638
## 8     in a   410 0.001531348
## 9     it s   377 0.001408093
## 10   to be   374 0.001396888

Twitter

##      ngrams  freq        prop
## 1      i m  13222 0.004585928
## 2     it s   8428 0.002923173
## 3   in the   7935 0.002752181
## 4    don t   7594 0.002633908
## 5  for the   7436 0.002579107
## 6   of the   5686 0.001972136
## 7   on the   4833 0.001676281
## 8    can t   4602 0.001596161
## 9    to be   4581 0.001588877
## 10  to the   4370 0.001515694

Top 10 3-grams

Blogs

##          ngrams freq         prop
## 1   one of the  1476 0.0004068174
## 2      i don t  1246 0.0003434245
## 3     a lot of  1131 0.0003117280
## 4      to be a   755 0.0002080943
## 5   as well as   740 0.0002039600
## 6     it was a   714 0.0001967938
## 7   out of the   672 0.0001852177
## 8   the end of   662 0.0001824615
## 9  a couple of   646 0.0001780515
## 10 some of the   635 0.0001750197

News

##               ngrams freq         prop
## 1        one of the   111 0.0004256902
## 2           the u s    93 0.0003566594
## 3          a lot of    82 0.0003144739
## 4            it s a    63 0.0002416080
## 5           i don t    51 0.0001955874
## 6       part of the    50 0.0001917523
## 7        as well as    50 0.0001917523
## 8  according to the    49 0.0001879173
## 9              it s    44 0.0001687421
## 10     in the first    43 0.0001649070

Twitter

##                 ngrams freq         prop
## 1             i don t  2428 0.0009165443
## 2      thanks for the  2365 0.0008927624
## 3             i can t  1408 0.0005315051
## 4          can t wait  1378 0.0005201804
## 5             i m not   977 0.0003688071
## 6       thank you for   847 0.0003197335
## 7  looking forward to   823 0.0003106738
## 8              it s a   799 0.0003016140
## 9          i love you   794 0.0002997266
## 10     for the follow   784 0.0002959517

As expected, the most common 2-grams and 3-grams also contain the most frequent words (e.g., the, a, an).
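For readers who want to reproduce tables of this shape, here is a minimal base R sketch that builds an n-gram frequency table with `ngrams`, `freq`, and `prop` columns. The tokenization is an assumption inferred from entries such as "don t" and "i m" above (lowercasing and replacing non-letters with spaces); the functions `tokenize` and `ngram_table` are hypothetical and do not come from the original analysis.

```r
# Assumed cleaning: lowercase, turn every non-letter run into one space,
# then split on spaces. "I don't" becomes c("i", "don", "t").
tokenize <- function(lines) {
  cleaned <- gsub("[^a-z]+", " ", tolower(lines))
  strsplit(trimws(cleaned), " +")
}

# Count n-grams within each line (n-grams never span two lines).
ngram_table <- function(lines, n = 2) {
  grams <- unlist(lapply(tokenize(lines), function(tok) {
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts,
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(
    ngrams = names(freq),
    freq   = as.integer(freq),
    prop   = as.numeric(freq) / length(grams),  # share of all n-grams
    stringsAsFactors = FALSE
  )
}

toy <- c("I don't know", "I don't care")
tab <- ngram_table(toy, n = 2)
head(tab, 10)
```

In this sketch, `nrow(tab)` is the number of distinct n-grams and `sum(tab$freq)` is the total number of n-gram occurrences.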

Total Number of 2-grams and 3-grams

Number of n-grams for n=2 and n=3
File Name            2-grams      3-grams
en_US.blogs.txt      1,199,545    2,623,067
en_US.news.txt       158,383      236,802
en_US.twitter.txt    939,412      1,867,452