The aim of this milestone report is to load the data and produce some general summaries of the data sets. A further goal is to carry out some exploratory data analysis.
options(warn=-1)
##libraries
twitter_en <- readLines("en_US/en_US.twitter.txt")
news_en <- readLines("en_US/en_US.news.txt")
blogs_en <- readLines("en_US/en_US.blogs.txt")
length(twitter_en)
## [1] 2360148
length(blogs_en)
## [1] 899288
length(news_en)
## [1] 1010242
print(object.size(twitter_en), units = "Mb")
## 319 Mb
print(object.size(blogs_en), units = "Mb")
## 255.4 Mb
print(object.size(news_en), units = "Mb")
## 257.3 Mb
The blogs and news files contain a similar number of lines (about 0.9 million and 1.0 million), while the Twitter file has more than twice as many (about 2.4 million). This may differ when comparing the number of words, considering that news and blog entries may contain more words than tweets. In memory, the blogs and news data sets each take up about a quarter of a gigabyte, while the Twitter data set takes up about a third of a gigabyte.
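The line counts and memory sizes above can also be collected into a single table, which makes the three sources easier to compare side by side. The following is only a minimal sketch: `summarise_corpus` is a hypothetical helper that is not part of the original analysis, and the `toy` vectors stand in for the real files.

```r
# Hypothetical helper (not in the original report):
# returns one summary row for a character vector of lines.
summarise_corpus <- function(lines) {
  data.frame(
    lines   = length(lines),
    # object.size() reports bytes; divide by 1024^2 to match units = "Mb"
    size_mb = round(as.numeric(object.size(lines)) / 1024^2, 3)
  )
}

# Toy stand-ins for the real corpora (assumption, for illustration only)
toy <- list(
  twitter = c("short tweet", "another one :)"),
  blogs   = c("a somewhat longer blog paragraph with more words in it")
)

# One row per source
summary_table <- do.call(rbind, lapply(toy, summarise_corpus))
print(summary_table)
```

With the real data, the list would hold the three vectors returned by `readLines()` instead of the toy strings.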
twitter_en <- unlist(strsplit(twitter_en, " "))
blogs_en <- unlist(strsplit(blogs_en, " "))
news_en <- unlist(strsplit(news_en, " "))
length(twitter_en)
## [1] 30373543
length(blogs_en)
## [1] 37334131
length(news_en)
## [1] 34372530
All three files contain roughly 30 to 37 million words (space-separated tokens), so they are much closer in word count than in line count.
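Dividing the token counts by the line counts reported earlier gives a rough average number of words per line for each source. The sketch below only re-uses the numbers printed in this report; the second part shows, on made-up example strings, how `lengths()` could give per-line word counts without flattening the data first.

```r
# Line and token counts copied from the output in this report
lines_n  <- c(twitter = 2360148,  blogs = 899288,   news = 1010242)
tokens_n <- c(twitter = 30373543, blogs = 37334131, news = 34372530)

# Average number of space-separated tokens per line
per_line <- round(tokens_n / lines_n, 1)
print(per_line)

# Per-line word counts on toy strings (assumption: illustrative data only)
toy_lines <- c("just a short tweet", "a longer line with quite a few more words")
words_per_line <- lengths(strsplit(toy_lines, " "))
print(words_per_line)
```

On these figures tweets average roughly 13 tokens per line, against roughly 34 for news and 42 for blogs, which is consistent with the earlier guess that news and blog entries are longer than tweets.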
twitter_en <- as.data.frame(table(twitter_en))
names(twitter_en) <- c("word", "freq")
twitter_en <- twitter_en[order(-twitter_en$freq),]
twitter_en[1:50,]
## word freq
## 1142715 the 837023
## 1159650 to 761901
## 688085 I 604530
## 268062 a 572690
## 1280026 you 416376
## 297922 and 397641
## 589894 for 368422
## 887935 of 349367
## 696940 in 348814
## 712090 is 329396
## 894380 on 253558
## 853808 my 248739
## 714136 it 192437
## 1141501 that 190844
## 341923 be 172886
## 316666 at 171759
## 1250849 with 163808
## 1282874 your 157112
## 654585 have 149376
## 811896 me 143522
## 308599 are 142169
## 1148418 this 125655
## 1077770 so 117820
## 688794 I'm 117050
## 733207 just 111291
## 1228332 was 110425
## 771125 like 109644
## 392850 but 101832
## 615323 get 99198
## 877256 not 97799
## 688084 i 97485
## 289669 all 95943
## 903949 out 90814
## 90415 & 90595
## 1201157 up 86707
## 271275 about 84091
## 1231967 we 83883
## 1246825 will 82624
## 1017934 RT 80092
## 8256 :) 80053
## 508968 do 79989
## 599858 from 78736
## 1142720 The 78178
## 400802 can 77050
## 785815 love 74248
## 476 - 71285
## 1240496 what 71247
## 7575 : 67799
## 692039 if 66372
## 313018 as 64404
par(mfrow=c(2,2))
plot(twitter_en$freq[1:100])
plot(twitter_en$freq[1000:10000])
plot(twitter_en$freq[10000:100000])
plot(twitter_en$freq[100000:200000])
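The data frame above lists the 50 most common tokens in the Twitter data. Because the text was split on spaces only, differently capitalised forms such as "the" and "The" are counted separately, and symbols such as "&" and ":)" appear as words. The plots also show that word frequency falls off steeply, in what looks like an exponential decay, as the rank increases.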
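One way to examine the shape of this decay more closely is to plot frequency against rank on log-log axes. An approximately straight line there would suggest a power-law relationship, whereas a true exponential decay would instead look straight on semi-log axes. The sketch below uses only the ten largest Twitter counts from the table above, which is far too few points to draw conclusions; with the full data, the whole `freq` column would be used instead.

```r
# Ten largest Twitter token counts, copied from the table above
freq <- c(837023, 761901, 604530, 572690, 416376,
          397641, 368422, 349367, 348814, 329396)
rank <- seq_along(freq)

# Log-log rank-frequency plot
plot(rank, freq, log = "xy",
     xlab = "rank", ylab = "frequency",
     main = "Rank vs. frequency (log-log)")

# Slope of a straight-line fit in log-log space; a rough diagnostic only
fit   <- lm(log(freq) ~ log(rank))
slope <- unname(coef(fit)[2])
print(slope)
```

The fitted slope is only a rough summary of how quickly frequency drops with rank, and it depends on which range of ranks is included.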
blogs_en <- as.data.frame(table(blogs_en))
names(blogs_en) <- c("word", "freq")
blogs_en <- blogs_en[order(-blogs_en$freq),]
blogs_en[1:50,]
## word freq
## 992657 the 1659151
## 1006149 to 1043878
## 214941 and 1015714
## 749533 of 862906
## 187174 a 857102
## 571868 I 738534
## 579690 in 540436
## 991933 that 421628
## 595351 is 412438
## 485392 for 337156
## 1061783 was 271439
## 1079723 with 271302
## 597263 it 270280
## 754441 on 252275
## 718882 my 239952
## 1096302 you 238652
## 541781 have 210982
## 254300 be 198728
## 230804 as 196879
## 997858 this 188536
## 226537 are 184566
## 992664 The 177241
## 234305 at 158291
## 298678 but 152386
## 740384 not 151561
## 494362 from 140397
## 1064675 we 135411
## 759142 or 129477
## 207443 all 118284
## 932582 so 116031
## 300192 by 114214
## 995847 they 110146
## 189974 about 108487
## 1076345 will 107587
## 214021 an 105224
## 532507 had 104410
## 543079 he 100711
## 755272 one 98284
## 682165 me 98168
## 555133 his 98130
## 1097427 your 93723
## 305727 can 92664
## 763319 out 92620
## 540654 has 91621
## 548903 her 90973
## 611267 just 90637
## 1039765 up 89791
## 646873 like 88438
## 993237 their 86728
## 1071061 what 83762
par(mfrow=c(2,2))
plot(blogs_en$freq[1:100])
plot(blogs_en$freq[1000:10000])
plot(blogs_en$freq[10000:100000])
plot(blogs_en$freq[100000:200000])
The same pattern appears with the blogs dataset.
news_en <- as.data.frame(table(news_en))
names(news_en) <- c("word", "freq")
news_en <- news_en[order(-news_en$freq),]
news_en[1:50,]
## word freq
## 796750 the 1712435
## 805615 to 889242
## 202694 and 845234
## 182678 a 826741
## 610581 of 766165
## 477309 in 621931
## 405789 for 332321
## 796522 that 314029
## 488134 is 275121
## 614686 on 249819
## 860125 with 242665
## 796754 The 225944
## 846867 was 225558
## 215586 at 198246
## 213214 as 170988
## 451130 he 166961
## 231140 be 148139
## 460177 his 147552
## 412981 from 147350
## 489086 it 143300
## 450071 have 140626
## 713267 said 140304
## 210186 are 134629
## 266642 by 126618
## 449169 has 119336
## 472364 I 118289
## 202139 an 117290
## 855427 who 105939
## 857276 will 105114
## 603417 not 103229
## 618212 or 94178
## 799739 this 91950
## 265757 but 91725
## 713297 said. 91113
## 798700 they 86182
## 184499 about 84622
## 797170 their 83660
## 442698 had 81862
## 579932 more 79504
## 872036 you 73746
## 852518 were 72673
## 864048 would 70453
## 615324 one 67248
## 233304 been 65421
## 830089 up 62948
## 622018 out 61738
## 455433 her 60933
## 854100 when 60715
## 848889 we 60656
## 854275 which 58361
par(mfrow=c(2,2))
plot(news_en$freq[1:100])
plot(news_en$freq[1000:10000])
plot(news_en$freq[10000:100000])
plot(news_en$freq[100000:200000])
The same pattern appears with the news dataset.
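The three frequency tables above are built with the same steps, and they count tokens such as "the" and "The", or "said" and "said.", separately. A possible refinement, sketched here under assumptions, is a single helper that lowercases the text and strips leading and trailing punctuation before counting. `word_freq` is a hypothetical function that is not part of the original analysis, and the `toy` lines are made up for illustration.

```r
# Hypothetical helper (not in the original report):
# builds a sorted word-frequency data frame from a vector of lines.
word_freq <- function(lines) {
  # Lowercase, then split on runs of whitespace
  tokens <- unlist(strsplit(tolower(lines), "\\s+"))
  # Strip leading and trailing punctuation; note that this also removes
  # tokens that are purely punctuation, such as "&" or ":)"
  tokens <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens)
  tokens <- tokens[tokens != ""]
  counts <- table(tokens)
  out <- data.frame(word = names(counts),
                    freq = as.integer(counts),
                    stringsAsFactors = FALSE)
  # Most frequent words first
  out <- out[order(-out$freq), ]
  rownames(out) <- NULL
  out
}

# Toy input (assumption, for illustration only)
toy <- c("The cat said. the dog said", "I said: the END")
result <- word_freq(toy)
print(result)
```

Whether to drop emoticons and symbols is a design choice: for Twitter data they may carry meaning, so a real pipeline might keep them as separate tokens rather than discard them. On the full files, `tolower()` may also need the text to be read with a consistent encoding.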