EDA_Text

What does the text look like?

Blogs

blogs[1]

## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."

News

news[1]

## [1] "He wasn't home alone, apparently."

Twitter

twitter[1]

## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."

General Statistics

File Length and Memory Usage

This is an exploratory data analysis (EDA) of three files of English text extracted from (1) blogs, (2) news, and (3) Twitter. The number of text entries (lines) in each “.txt” file, i.e., its length, and the memory each file occupies once read into R are:

Summary of files’ length and memory
File Name	Number of Lines	Memory [MB]
en_US.blogs.txt	899,288	267.759
en_US.news.txt	77,259	20.729
en_US.twitter.txt	2,360,148	334.485

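The Twitter file has by far the most lines, and it also uses the most memory. Memory does not scale with line count alone, though: the blogs file has fewer than half as many lines (899,288 vs. 2,360,148) yet occupies about 80% as much memory (267.759 MB vs. 334.485 MB), which indicates that blog entries are much longer on average.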
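The report does not show how the line counts and memory figures were obtained. A minimal sketch in base R, assuming the files were read with `readLines()` and measured with `object.size()`, might look like the following. The temporary file is a hypothetical stand-in for one of the corpus files.

```r
# Hypothetical stand-in for a corpus file such as en_US.news.txt.
path <- tempfile(fileext = ".txt")
writeLines(c("He wasn't home alone, apparently.",
             "How are you? Btw thanks for the RT.",
             "Been way, way too long."), path)

# Read one element per line. Declaring UTF-8 helps avoid mis-decoded
# curly quotes; skipNul guards against embedded NUL bytes.
lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)

n_lines <- length(lines)                            # number of text entries
mem_mb  <- as.numeric(object.size(lines)) / 1024^2  # in-memory size in MB

data.frame(file = basename(path),
           lines = n_lines,
           memory_mb = round(mem_mb, 3))
```

Note that `object.size()` reports the size of the R object, which is generally not the same as the file size on disk.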

Length of Text

## Warning: package 'ngram' was built under R version 3.5.2

Number of words per line (i.e., words defined by a space)

Number of Words statistics
File Name	Min.	Median	Mean	Max
en_US.blogs.txt	1	28	41	6,630
en_US.news.txt	1	31	34	1,031
en_US.twitter.txt	1	12	12	47

Number of characters per line (including spaces and punctuation)

Number of Characters statistics
File Name	Min.	Median	Mean	Max
en_US.blogs.txt	1	157	231	40,835
en_US.news.txt	2	186	203	5,760
en_US.twitter.txt	2	64	68	213

These statistics show that line length in the blogs file is highly variable and positively skewed: the mean is well above the median (41 vs. 28 words; 231 vs. 157 characters). In the Twitter file, by contrast, the mean and median nearly coincide (12 vs. 12 words; 68 vs. 64 characters).
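As an illustration of how such per-line statistics can be produced, here is a small base R sketch. It assumes words are runs of non-space characters (matching the "words defined by a space" definition) and that characters are counted with `nchar()`. The three example lines are made up for the demonstration.

```r
# Made-up example lines; the report computes this over each full file.
lines <- c("He wasn't home alone, apparently.",
           "How are you? Btw thanks for the RT.",
           "ok")

# Words per line: split on whitespace after trimming the ends.
words_per_line <- lengths(strsplit(trimws(lines), "[[:space:]]+"))

# Characters per line, spaces and punctuation included.
chars_per_line <- nchar(lines, type = "chars")

# Collect the same four statistics the tables report.
line_stats <- function(x) {
  c(min = min(x), median = median(x), mean = mean(x), max = max(x))
}

rbind(words = line_stats(words_per_line),
      chars = line_stats(chars_per_line))

# A mean clearly above the median is a quick sign of positive skew.
mean(words_per_line) > median(words_per_line)
```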

Word Frequency Analysis

Blogs

News

Twitter

Number of unique words and number of words needed to cover 50% and 90% of each corpus
File Name	No. Unique Words	Words to Cover 50%	Words to Cover 90%
en_US.blogs.txt	358,953	113	6,879
en_US.news.txt	80,529	195	7,699
en_US.twitter.txt	346,498	115	5,142

As expected, the most frequent words are the most common words in the English language. In NLP these are known as “stop words” (e.g., the, and, a, an, in, at). Because they account for so many of the word occurrences, only a few words (between 113 and 195 here) are needed to cover 50% of each corpus.
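The coverage figures can be read as follows: sort words by frequency, accumulate their counts, and find the first rank at which the running total reaches the target share of all word occurrences. The sketch below shows this in base R on a toy corpus. The tokenisation (lowercasing and splitting on non-letters) and the helper name `words_to_cover` are assumptions for illustration, not necessarily what the report used.

```r
# Toy corpus standing in for one of the text files.
lines <- c("the cat sat on the mat",
           "the dog and the cat",
           "a dog sat")

# Assumed tokenisation: lowercase, split on anything that is not a letter.
tokens <- unlist(strsplit(tolower(lines), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]

# Word frequencies, most frequent first.
freq <- sort(table(tokens), decreasing = TRUE)

# Cumulative share of all word occurrences covered by the top-k words.
coverage <- cumsum(as.numeric(freq)) / sum(freq)

# Hypothetical helper: smallest k whose top-k words cover at least share p.
words_to_cover <- function(p) which(coverage >= p)[1]

c(unique_words = length(freq),
  cover_50 = words_to_cover(0.5),
  cover_90 = words_to_cover(0.9))
```

In this toy corpus "the" alone accounts for 4 of the 14 tokens, which mirrors on a small scale why stop words dominate the 50% coverage figure.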

N-grams

For the 2-gram and 3-gram assessment, a random sample of 10% of each corpus was used rather than the full files.
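The sampling and counting steps can be sketched in base R as below. This is only an illustration under stated assumptions: the seed, the toy corpus, and the helpers `tokenise` and `ngrams_of` are hypothetical, n-grams are built within each line, and non-letters are replaced by spaces. The report itself may have used a dedicated n-gram package with different cleaning rules.

```r
set.seed(42)  # arbitrary seed, only so the sample is reproducible

# Toy corpus of 300 lines standing in for a real file.
corpus <- rep(c("thanks for the follow",
                "i can't wait for the game",
                "one of the best days of the year"), times = 100)

# Draw a 10% random sample of lines, without replacement.
idx <- sample(length(corpus), size = ceiling(0.10 * length(corpus)))
sampled <- corpus[idx]

# Assumed cleaning: lowercase and turn non-letters into spaces,
# so a contraction such as "can't" is split into "can" and "t".
tokenise <- function(x) {
  x <- gsub("[^a-z]+", " ", tolower(x))
  strsplit(trimws(x), " +")
}

# All n-grams of one token vector; none span two lines.
ngrams_of <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- unlist(lapply(tokenise(sampled), ngrams_of, n = 2))
counts  <- sort(table(bigrams), decreasing = TRUE)

# Same columns as the tables in this section: n-gram, count, proportion.
top <- data.frame(ngrams = names(counts),
                  freq = as.integer(counts),
                  prop = as.numeric(counts) / sum(counts))
head(top, 10)
```

Calling `ngrams_of(tokens, n = 3)` in the same pipeline gives the corresponding 3-gram counts.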

Top 10 2-grams

Blogs

##      ngrams  freq        prop
## 1   of the  18869 0.005048991
## 2   in the  15330 0.004102021
## 3   to the   8504 0.002275511
## 4   on the   7560 0.002022915
## 5    to be   6723 0.001798949
## 6  for the   5859 0.001567759
## 7  and the   5834 0.001561069
## 8    and i   5433 0.001453769
## 9    i was   4959 0.001326936
## 10  it was   4842 0.001295629

News

##      ngrams freq        prop
## 1   of the  1433 0.005352247
## 2   in the  1342 0.005012363
## 3   to the   645 0.002409072
## 4  for the   544 0.002031837
## 5   on the   523 0.001953402
## 6   at the   461 0.001721833
## 7  and the   424 0.001583638
## 8     in a   410 0.001531348
## 9     it s   377 0.001408093
## 10   to be   374 0.001396888

Twitter

##      ngrams  freq        prop
## 1      i m  13222 0.004585928
## 2     it s   8428 0.002923173
## 3   in the   7935 0.002752181
## 4    don t   7594 0.002633908
## 5  for the   7436 0.002579107
## 6   of the   5686 0.001972136
## 7   on the   4833 0.001676281
## 8    can t   4602 0.001596161
## 9    to be   4581 0.001588877
## 10  to the   4370 0.001515694

Top 10 3-grams

Blogs

##          ngrams freq         prop
## 1   one of the  1476 0.0004068174
## 2      i don t  1246 0.0003434245
## 3     a lot of  1131 0.0003117280
## 4      to be a   755 0.0002080943
## 5   as well as   740 0.0002039600
## 6     it was a   714 0.0001967938
## 7   out of the   672 0.0001852177
## 8   the end of   662 0.0001824615
## 9  a couple of   646 0.0001780515
## 10 some of the   635 0.0001750197

News

##               ngrams freq         prop
## 1        one of the   111 0.0004256902
## 2           the u s    93 0.0003566594
## 3          a lot of    82 0.0003144739
## 4            it s a    63 0.0002416080
## 5           i don t    51 0.0001955874
## 6       part of the    50 0.0001917523
## 7        as well as    50 0.0001917523
## 8  according to the    49 0.0001879173
## 9              it s    44 0.0001687421
## 10     in the first    43 0.0001649070

Twitter

##                 ngrams freq         prop
## 1             i don t  2428 0.0009165443
## 2      thanks for the  2365 0.0008927624
## 3             i can t  1408 0.0005315051
## 4          can t wait  1378 0.0005201804
## 5             i m not   977 0.0003688071
## 6       thank you for   847 0.0003197335
## 7  looking forward to   823 0.0003106738
## 8              it s a   799 0.0003016140
## 9          i love you   794 0.0002997266
## 10     for the follow   784 0.0002959517

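As expected, the most common 2-grams and 3-grams also contain the most frequent words (e.g., the, a, an). Entries such as “i m”, “it s”, and “don t” appear to be contractions (I'm, it's, don't) that were split where the apostrophe was removed during cleaning.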

Total Number of 2-grams and 3-grams

Number of n-grams for n=2 and n=3
File Name	2-grams	3-grams
en_US.blogs.txt	1,199,545	2,623,067
en_US.news.txt	158,383	236,802
en_US.twitter.txt	939,412	1,867,452

EDA_Text_data

Christian

April 2, 2019
