Natural Language Processing

Farid Tayari

4/22/2020

Exploratory analysis

File size in MB

[1] "en_US.blogs.txt size: 210.16 MB"
[1] "en_US.news.txt size: 205.81 MB"
[1] "en_US.twitter.txt size: 167.11 MB"
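
The report shows only the printed sizes, not the code that produced them. The console output looks like R, so the following Python is a minimal sketch of the same measurement, not the author's code. It assumes "MB" means 10^6 bytes, which the report does not state; the helper names format_size and file_size_label are hypothetical.

```python
import os

# Assumption: decimal megabytes. Use 1024 ** 2 instead to report MiB.
BYTES_PER_MB = 1_000_000


def format_size(name, n_bytes, bytes_per_mb=BYTES_PER_MB):
    """Format a size label in the same shape as the report's output lines."""
    return f"{name} size: {n_bytes / bytes_per_mb:.2f} MB"


def file_size_label(path):
    """Read the size of a file on disk and format it."""
    return format_size(os.path.basename(path), os.path.getsize(path))


# Usage with a made-up byte count (not taken from the report):
print(format_size("example.txt", 2_500_000))
```

Calling file_size_label on each of the three corpus files would print one line per file in the layout used above.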

Number of lines in each file

[1] "en_US.blogs.txt has  444439  lines"
[1] "en_US.news.txt has  39643  lines"
[1] "en_US.twitter.txt has  1195015  lines"

Maximum line length in each file

[1] "en_US.blogs.txt max length is  295404 characters"
[1] "en_US.news.txt max length is  99035 characters"
[1] "en_US.twitter.txt max length is  208452 characters"
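
Line counts and maximum line lengths can both be collected in a single pass over a file. The sketch below is one way to do it in Python, assuming UTF-8 input and measuring length in characters with the line terminator excluded; the report does not say how its figures were computed, and the function names are hypothetical.

```python
def line_stats(lines):
    """Return (number of lines, length of the longest line) for an iterable of lines."""
    n_lines = 0
    max_len = 0
    for line in lines:
        n_lines += 1
        # Exclude the trailing line terminator from the character count.
        max_len = max(max_len, len(line.rstrip("\r\n")))
    return n_lines, max_len


def file_line_stats(path, encoding="utf-8"):
    """Stream a file from disk so large corpora need not fit in memory."""
    with open(path, encoding=encoding, errors="replace") as fh:
        return line_stats(fh)


# Usage on a small in-memory sample (made-up text, not from the corpus):
sample = ["short line\n", "a somewhat longer line\n", "end"]
n_lines, max_len = line_stats(sample)
print(f"sample has {n_lines} lines, max length is {max_len} characters")
```

Streaming line by line keeps memory use flat, which matters for files of a few hundred megabytes.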

Word Frequency

Number of words in each file

[1] "en_US.blogs.txt has  38593875 words"
[1] "en_US.news.txt has  2733220 words"
[1] "en_US.twitter.txt has  30603178 words"

Most frequently used words in each file

word      blog     news   twitter
it      328749    13065    233732
for     341991    25809    369089
is      424751    21728    331483
that    441420    24966    201672
in      550382    47966    353476
a       861735    64632    574165
of      865829    58813    348985
and    1027761    65670    398397
to     1050560    68709    764099
the    1664779   132184    834237
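
Word totals and per-word frequencies like the ones above come from tokenizing each line and tallying the tokens. The following Python sketch shows the idea under stated assumptions: text is lowercased and a word is a run of ASCII letters with optional internal apostrophes. The report does not describe its tokenizer, so counts from this sketch need not match the figures above; word_counts and top_words are hypothetical names.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters, optionally joined by apostrophes.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def word_counts(lines):
    """Tally lowercase word tokens across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line.lower()))
    return counts


def top_words(counts, n=10):
    """Return the n most frequent words, least frequent first, as in the table."""
    return sorted(counts.most_common(n), key=lambda pair: pair[1])


# Usage on a made-up sample:
sample = ["The cat and the dog.", "Don't stop the music"]
counts = word_counts(sample)
total_words = sum(counts.values())
print(total_words, top_words(counts, 3))
```

Summing the counter gives the total number of words in a file, and top_words gives the rows of a frequency table.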

Removing the stopwords

Most frequently used words in each file after removing the stopwords

word_blog  freq_blog  word_news  freq_news  word_twitter  freq_twitter
great          27791  team            1946  tonight              32316
year           29989  season          2026  follow               32319
years          33348  work            2026  work                 35905
work           35772  game            2060  people               43662
life           36002  good            2157  great                52753
love           39210  percent         2614  today                53072
day            43335  people          3392  time                 60714
good           44508  time            3860  day                  64056
people         55514  years           3908  good                 68100
time           84481  year            4332  love                 77051
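
Stopword removal amounts to dropping every word that appears in a fixed list before ranking what is left. The sketch below illustrates this in Python. The report does not say which stopword list was used, so STOPWORDS here is only an illustrative set built from the ten words in the earlier frequency table; a real analysis would use a much longer list. The function name is hypothetical.

```python
from collections import Counter

# Illustrative only: the ten most frequent words from the earlier table.
STOPWORDS = frozenset(
    {"the", "to", "and", "of", "a", "in", "that", "is", "for", "it"}
)


def remove_stopwords(counts, stopwords=STOPWORDS):
    """Return a new Counter without the words listed in stopwords."""
    return Counter(
        {word: n for word, n in counts.items() if word not in stopwords}
    )


# Usage with made-up counts:
raw = Counter({"the": 50, "time": 12, "of": 30, "love": 9})
filtered = remove_stopwords(raw)
print(filtered.most_common(2))
```

Filtering the tallies rather than the raw text means the corpus only has to be tokenized once, whichever stopword list is tried.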
