Overview

The aim of this milestone report is to load the data and produce some general summaries of the data sets. Furthermore, the goal is to perform some exploratory data analysis.

Loading the Data and the Needed Libraries

options(warn = -1)  # suppress warnings; readLines may warn about incomplete final lines or embedded nuls
## base R only; no additional libraries are needed for this report
twitter_en <- readLines("en_US/en_US.twitter.txt")
news_en <- readLines("en_US/en_US.news.txt")
blogs_en <- readLines("en_US/en_US.blogs.txt")

Summaries of the Loaded Files

length(twitter_en)
## [1] 2360148
length(blogs_en)
## [1] 899288
length(news_en)
## [1] 1010242
print(object.size(twitter_en), units = "Mb")
## 319 Mb
print(object.size(blogs_en), units = "Mb")
## 255.4 Mb
print(object.size(news_en), units = "Mb")
## 257.3 Mb

The blogs and news files are similar in their number of elements (lines), while the Twitter file has more than double that number. This may differ when comparing the number of words, considering that news items and blog posts likely contain more words than tweets. The blogs and news data sets each take up roughly a quarter of a gigabyte of memory, while the Twitter data set takes about a third.
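One way to check this intuition before splitting everything is to estimate the average number of words per line from a random sample. The sketch below is supplementary and not part of the original analysis; the sample size of 10,000 lines and the seed are arbitrary choices.

set.seed(42)  # arbitrary seed, for reproducibility of the sample
avg_words <- function(lines, n = 10000) {
  sampled <- sample(lines, min(n, length(lines)))  # sample lines to keep this cheap
  mean(lengths(strsplit(sampled, " ")))            # mean words per line in the sample
}
avg_words(twitter_en)  # expected to be much lower than for blogs or news
avg_words(blogs_en)
avg_words(news_en)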

Splitting Lines into Words

# split each line on single spaces; note this naive split leaves punctuation
# attached ("said." vs "said") and treats differently cased words as distinct
twitter_en <- unlist(strsplit(twitter_en, " "))
blogs_en <- unlist(strsplit(blogs_en, " "))
news_en <- unlist(strsplit(news_en, " "))
length(twitter_en)
## [1] 30373543
length(blogs_en)
## [1] 37334131
length(news_en)
## [1] 34372530

All three files contain roughly 30-37 million words: about 30.4 million for Twitter, 37.3 million for blogs, and 34.4 million for news. As anticipated, Twitter has by far the most lines but the fewest words.
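These counts are based on the naive space split above, so they include punctuation-laden and differently cased tokens. A more careful tokenization would lowercase the text and strip punctuation before counting. The minimal base-R sketch below is illustrative only; the regular expressions are assumptions, not part of the original processing.

tokenize <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "\\s+"))  # split on any whitespace run
  words <- gsub("[^a-z']", "", words)                # keep letters and apostrophes only
  words[nzchar(words)]                               # drop empty tokens
}
# e.g. tokenize("Said. THE the") yields "said" "the" "the"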

What words are common?

# tabulate word frequencies, rename the columns, and sort by decreasing frequency
twitter_en <- as.data.frame(table(twitter_en))
names(twitter_en) <- c("word", "freq")
twitter_en <- twitter_en[order(-twitter_en$freq),]
twitter_en[1:50,]
##          word   freq
## 1142715   the 837023
## 1159650    to 761901
## 688085      I 604530
## 268062      a 572690
## 1280026   you 416376
## 297922    and 397641
## 589894    for 368422
## 887935     of 349367
## 696940     in 348814
## 712090     is 329396
## 894380     on 253558
## 853808     my 248739
## 714136     it 192437
## 1141501  that 190844
## 341923     be 172886
## 316666     at 171759
## 1250849  with 163808
## 1282874  your 157112
## 654585   have 149376
## 811896     me 143522
## 308599    are 142169
## 1148418  this 125655
## 1077770    so 117820
## 688794    I'm 117050
## 733207   just 111291
## 1228332   was 110425
## 771125   like 109644
## 392850    but 101832
## 615323    get  99198
## 877256    not  97799
## 688084      i  97485
## 289669    all  95943
## 903949    out  90814
## 90415       &  90595
## 1201157    up  86707
## 271275  about  84091
## 1231967    we  83883
## 1246825  will  82624
## 1017934    RT  80092
## 8256       :)  80053
## 508968     do  79989
## 599858   from  78736
## 1142720   The  78178
## 400802    can  77050
## 785815   love  74248
## 476         -  71285
## 1240496  what  71247
## 7575        :  67799
## 692039     if  66372
## 313018     as  64404
par(mfrow=c(2,2))  # four panels covering different rank ranges of the frequency table
plot(twitter_en$freq[1:100])
plot(twitter_en$freq[1000:10000])
plot(twitter_en$freq[10000:100000])
plot(twitter_en$freq[100000:200000])

The 50 most commonly used words in the Twitter data are printed in the data frame above. Note that the counting is case sensitive ("the" and "The" appear as separate entries) and that non-word tokens such as "RT", ":)" and "&" survive the naive split. The plots also make it clear that word frequency falls off very sharply as we move down the ranking, the heavy-tailed decay described by Zipf's law.
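One way to check whether the decay is Zipf-like (frequency roughly proportional to 1/rank) is a log-log plot of frequency against rank, which should be approximately linear. This is a supplementary sketch, not a figure from the original report.

# log-log rank-frequency plot; an approximately straight line suggests Zipf's law
rank <- seq_len(nrow(twitter_en))
plot(log10(rank), log10(twitter_en$freq),
     xlab = "log10(rank)", ylab = "log10(frequency)",
     main = "Twitter rank-frequency (Zipf check)")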

blogs_en <- as.data.frame(table(blogs_en))
names(blogs_en) <- c("word", "freq")
blogs_en <- blogs_en[order(-blogs_en$freq),]
blogs_en[1:50,]
##          word    freq
## 992657    the 1659151
## 1006149    to 1043878
## 214941    and 1015714
## 749533     of  862906
## 187174      a  857102
## 571868      I  738534
## 579690     in  540436
## 991933   that  421628
## 595351     is  412438
## 485392    for  337156
## 1061783   was  271439
## 1079723  with  271302
## 597263     it  270280
## 754441     on  252275
## 718882     my  239952
## 1096302   you  238652
## 541781   have  210982
## 254300     be  198728
## 230804     as  196879
## 997858   this  188536
## 226537    are  184566
## 992664    The  177241
## 234305     at  158291
## 298678    but  152386
## 740384    not  151561
## 494362   from  140397
## 1064675    we  135411
## 759142     or  129477
## 207443    all  118284
## 932582     so  116031
## 300192     by  114214
## 995847   they  110146
## 189974  about  108487
## 1076345  will  107587
## 214021     an  105224
## 532507    had  104410
## 543079     he  100711
## 755272    one   98284
## 682165     me   98168
## 555133    his   98130
## 1097427  your   93723
## 305727    can   92664
## 763319    out   92620
## 540654    has   91621
## 548903    her   90973
## 611267   just   90637
## 1039765    up   89791
## 646873   like   88438
## 993237  their   86728
## 1071061  what   83762
par(mfrow=c(2,2))
plot(blogs_en$freq[1:100])
plot(blogs_en$freq[1000:10000])
plot(blogs_en$freq[10000:100000])
plot(blogs_en$freq[100000:200000])

The same pattern appears with the blogs data set.

news_en <- as.data.frame(table(news_en))
names(news_en) <- c("word", "freq")
news_en <- news_en[order(-news_en$freq),]
news_en[1:50,]
##         word    freq
## 796750   the 1712435
## 805615    to  889242
## 202694   and  845234
## 182678     a  826741
## 610581    of  766165
## 477309    in  621931
## 405789   for  332321
## 796522  that  314029
## 488134    is  275121
## 614686    on  249819
## 860125  with  242665
## 796754   The  225944
## 846867   was  225558
## 215586    at  198246
## 213214    as  170988
## 451130    he  166961
## 231140    be  148139
## 460177   his  147552
## 412981  from  147350
## 489086    it  143300
## 450071  have  140626
## 713267  said  140304
## 210186   are  134629
## 266642    by  126618
## 449169   has  119336
## 472364     I  118289
## 202139    an  117290
## 855427   who  105939
## 857276  will  105114
## 603417   not  103229
## 618212    or   94178
## 799739  this   91950
## 265757   but   91725
## 713297 said.   91113
## 798700  they   86182
## 184499 about   84622
## 797170 their   83660
## 442698   had   81862
## 579932  more   79504
## 872036   you   73746
## 852518  were   72673
## 864048 would   70453
## 615324   one   67248
## 233304  been   65421
## 830089    up   62948
## 622018   out   61738
## 455433   her   60933
## 854100  when   60715
## 848889    we   60656
## 854275 which   58361
par(mfrow=c(2,2))
plot(news_en$freq[1:100])
plot(news_en$freq[1000:10000])
plot(news_en$freq[10000:100000])
plot(news_en$freq[100000:200000])

The same pattern appears with the news data set. One additional artifact is visible here: the token "said." ranks in the top 50 alongside "said", a direct consequence of splitting on spaces without stripping punctuation.
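Since the three frequency tables share many of the same top words, a natural next step is to compare them side by side. The sketch below merges the three tables on the word column; it is a supplementary example reusing the data frames built above, not part of the original report.

# inner-join the three frequency tables on the shared "word" column
common <- Reduce(function(x, y) merge(x, y, by = "word"),
                 list(twitter_en, blogs_en, news_en))
names(common) <- c("word", "twitter", "blogs", "news")
head(common[order(-common$twitter), ], 10)  # top Twitter words present in all three sources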