What does the text look like?
Blogs
blogs[1]
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
News
news[1]
## [1] "He wasn't home alone, apparently."
Twitter
twitter[1]
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
This is an exploratory data analysis (EDA) of files containing English text extracted from (1) blogs, (2) news, and (3) Twitter. For each “.txt” file, the number of text entries (lines) and the memory occupied after reading it into R are:
| File Name | Number of Lines | Memory [MB] |
|---|---|---|
| en_US.blogs.txt | 899,288 | 267.759 |
| en_US.news.txt | 77,259 | 20.729 |
| en_US.twitter.txt | 2,360,148 | 334.485 |
The Twitter file clearly contains the most lines of text, which results in the greatest memory usage.
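As a rough illustration of how such line counts and memory figures can be obtained, the sketch below reads a text file and reports both numbers. It is only a minimal example under stated assumptions: the temporary file and its three lines are hypothetical stand-ins for a corpus file, and reading through a binary-mode connection is a common precaution, not necessarily what was done for this report.

```r
# Hypothetical stand-in for one of the corpus files: a small temporary text file.
path <- tempfile(fileext = ".txt")
writeLines(c("He wasn't home alone, apparently.",
             "How are you? Btw thanks for the RT.",
             "Been way, way too long."), path)

# Open the connection in binary mode (an assumption, used here as a precaution
# against control characters being interpreted while reading).
con <- file(path, open = "rb")
lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

n_lines <- length(lines)                            # number of text entries
mem_mb  <- as.numeric(object.size(lines)) / 1024^2  # in-memory size in MB

summary_row <- data.frame(file = basename(path),
                          lines = n_lines,
                          memory_mb = round(mem_mb, 3))
print(summary_row)
```

Because `object.size()` measures the R object and not the file on disk, the reported memory can differ from the file size.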
| File Name | Min. | Median | Mean | Max |
|---|---|---|---|---|
| en_US.blogs.txt | 1 | 28 | 41 | 6,630 |
| en_US.news.txt | 1 | 31 | 34 | 1,031 |
| en_US.twitter.txt | 1 | 12 | 12 | 47 |
| File Name | Min. | Median | Mean | Max |
|---|---|---|---|---|
| en_US.blogs.txt | 1 | 157 | 231 | 40,835 |
| en_US.news.txt | 2 | 186 | 203 | 5,760 |
| en_US.twitter.txt | 2 | 64 | 68 | 213 |
Based on these statistics, it is clear that the lines in the blogs file vary widely in length and are positively skewed (i.e., the mean is well above the median), whereas the lines in the Twitter file are short and their mean and median nearly coincide.
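Per-line length statistics of this kind can be sketched in base R as follows. The example lines are made up, the whitespace-based word split is an assumption (the report's own tokenizer may count words differently), and `line_stats` is a hypothetical helper name.

```r
# Hypothetical example lines standing in for a corpus.
lines <- c("He wasn't home alone, apparently.",
           "How are you? Btw thanks for the RT.",
           "Been way, way too long.",
           "one")

# Words per line: split on runs of whitespace and count the pieces.
words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))

# Characters per line.
chars_per_line <- nchar(lines, type = "chars")

# Hypothetical helper returning the four statistics used in the tables.
line_stats <- function(x) {
  c(Min = min(x), Median = median(x), Mean = mean(x), Max = max(x))
}

stats <- rbind(words = line_stats(words_per_line),
               chars = line_stats(chars_per_line))
print(stats)
```

A mean noticeably larger than the median, as for the blogs file, suggests a long right tail of unusually long lines.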
| File Name | No. Unique Words | Words to Cover 50% | Words to Cover 90% |
|---|---|---|---|
| en_US.blogs.txt | 358,953 | 113 | 6,879 |
| en_US.news.txt | 80,529 | 195 | 7,699 |
| en_US.twitter.txt | 346,498 | 115 | 5,142 |
As expected, the most frequent words are the most common words in the English language. In NLP these are known as “stop words” (e.g., the, and, a, an, in, at). Because these words occur so often, roughly 100 to 200 distinct words are enough to cover 50% of the word occurrences in each file.
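The coverage figures can be understood as a cumulative sum over the sorted word frequencies. The sketch below shows the idea on a tiny made-up text; the tokenization rule and the helper name `words_to_cover` are assumptions for illustration only.

```r
# Tiny hypothetical corpus.
text <- c("the cat sat on the mat", "the dog and the cat", "a dog sat")

# Lowercase and split on anything that is not a letter or an apostrophe.
tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
tokens <- tokens[nzchar(tokens)]

# Word frequencies, most frequent first.
freq <- sort(table(tokens), decreasing = TRUE)

# Fraction of all word occurrences covered by the k most frequent words.
coverage <- cumsum(as.numeric(freq)) / sum(freq)

# Smallest number of distinct words whose coverage reaches the target.
words_to_cover <- function(target) which(coverage >= target)[1]

n_unique <- length(freq)
n50 <- words_to_cover(0.5)
n90 <- words_to_cover(0.9)
print(c(unique = n_unique, cover50 = n50, cover90 = n90))
```

Since word frequencies are heavily concentrated in a few very common words, the 50% threshold is reached far sooner than the 90% threshold.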
For the 2-gram and 3-gram assessment, a random sample of 10% of the lines of each corpus was used.
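One way to draw such a sample is shown below. This is a sketch, not the report's actual code: the seed, the stand-in corpus, and the choice of sampling lines without replacement are all assumptions.

```r
set.seed(1234)  # hypothetical seed, only to make the sketch reproducible

# Stand-in for a corpus: 1,000 dummy lines.
lines <- sprintf("line %d", 1:1000)

# Draw 10% of the line indices without replacement.
sample_fraction <- 0.10
idx <- sample(length(lines), size = round(sample_fraction * length(lines)))
lines_sample <- lines[idx]

length(lines_sample)
```

Sampling indices instead of the lines themselves keeps the link to the original positions, which helps when the same subset needs to be inspected again.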
Blogs
## ngrams freq prop
## 1 of the 18869 0.005048991
## 2 in the 15330 0.004102021
## 3 to the 8504 0.002275511
## 4 on the 7560 0.002022915
## 5 to be 6723 0.001798949
## 6 for the 5859 0.001567759
## 7 and the 5834 0.001561069
## 8 and i 5433 0.001453769
## 9 i was 4959 0.001326936
## 10 it was 4842 0.001295629
News
## ngrams freq prop
## 1 of the 1433 0.005352247
## 2 in the 1342 0.005012363
## 3 to the 645 0.002409072
## 4 for the 544 0.002031837
## 5 on the 523 0.001953402
## 6 at the 461 0.001721833
## 7 and the 424 0.001583638
## 8 in a 410 0.001531348
## 9 it s 377 0.001408093
## 10 to be 374 0.001396888
Twitter
## ngrams freq prop
## 1 i m 13222 0.004585928
## 2 it s 8428 0.002923173
## 3 in the 7935 0.002752181
## 4 don t 7594 0.002633908
## 5 for the 7436 0.002579107
## 6 of the 5686 0.001972136
## 7 on the 4833 0.001676281
## 8 can t 4602 0.001596161
## 9 to be 4581 0.001588877
## 10 to the 4370 0.001515694
Blogs
## ngrams freq prop
## 1 one of the 1476 0.0004068174
## 2 i don t 1246 0.0003434245
## 3 a lot of 1131 0.0003117280
## 4 to be a 755 0.0002080943
## 5 as well as 740 0.0002039600
## 6 it was a 714 0.0001967938
## 7 out of the 672 0.0001852177
## 8 the end of 662 0.0001824615
## 9 a couple of 646 0.0001780515
## 10 some of the 635 0.0001750197
News
## ngrams freq prop
## 1 one of the 111 0.0004256902
## 2 the u s 93 0.0003566594
## 3 a lot of 82 0.0003144739
## 4 it s a 63 0.0002416080
## 5 i don t 51 0.0001955874
## 6 part of the 50 0.0001917523
## 7 as well as 50 0.0001917523
## 8 according to the 49 0.0001879173
## 9 it s 44 0.0001687421
## 10 in the first 43 0.0001649070
Twitter
## ngrams freq prop
## 1 i don t 2428 0.0009165443
## 2 thanks for the 2365 0.0008927624
## 3 i can t 1408 0.0005315051
## 4 can t wait 1378 0.0005201804
## 5 i m not 977 0.0003688071
## 6 thank you for 847 0.0003197335
## 7 looking forward to 823 0.0003106738
## 8 it s a 799 0.0003016140
## 9 i love you 794 0.0002997266
## 10 for the follow 784 0.0002959517
As expected, the most common 2-grams and 3-grams also contain the most frequent words (e.g., the, a, an,…).
| File Name | 2-grams | 3-grams |
|---|---|---|
| en_US.blogs.txt | 1,199,545 | 2,623,067 |
| en_US.news.txt | 158,383 | 236,802 |
| en_US.twitter.txt | 939,412 | 1,867,452 |
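For readers who want to reproduce n-gram tables of this shape, the following is a minimal base R sketch that counts n-grams within each line and reports frequency and proportion. It is written without a dedicated n-gram package, `count_ngrams` is a hypothetical helper, and the cleaning rule (lowercase, every non-letter replaced by a space, so that "don't" becomes "don t") is an assumption inferred from the entries in the tables above.

```r
# Hypothetical helper: count n-grams of size n over a character vector of lines.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    # Assumed cleaning: lowercase, then turn every run of non-letters into a space.
    cleaned <- gsub("[^a-z]+", " ", tolower(line))
    tokens <- strsplit(cleaned, " ", fixed = TRUE)[[1]]
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < n) return(character(0))
    # Slide a window of n tokens across the line (n-grams do not span lines).
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(ngrams = names(freq),
             freq = as.integer(freq),
             prop = as.numeric(freq) / sum(freq),
             stringsAsFactors = FALSE)
}

# Made-up example lines.
lines <- c("I don't know one of the rules",
           "one of the best",
           "it is one of the few")

bigrams  <- count_ngrams(lines, 2)
trigrams <- count_ngrams(lines, 3)
print(head(trigrams, 3))
```

The number of rows in each result is the number of distinct n-grams, and `prop` is each n-gram's share of all n-gram occurrences.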