This document contains the Milestone Report for the Capstone Project of the Coursera Data Science Specialization.
The project goal is to build an n-gram based predictive language model using SwiftKey datasets. This report contains the summary statistics for these datasets and describes the proposed steps for the implementation of the predictive model.
The SwiftKey datasets can be downloaded as a single archive (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). The archive contains three datasets (news articles, blogs and tweets) for each of four languages: English, German, Finnish and Russian. In this project we use only the English datasets.
The obscene words are censored out of the text corpus. The list of obscene words and expressions is loaded from https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
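The censoring step could look roughly like the sketch below. It is an illustration only: `censor_words` and the `_censored_` placeholder are hypothetical names, the word list is a toy one, and only single-word entries are handled (multi-word expressions from the list would need extra work).

```r
# Sketch: mask single-word entries of a word list in a character vector.
# `censor_words` and the "_censored_" placeholder are hypothetical names.
censor_words <- function(text, bad_words, placeholder = "_censored_") {
  bad_words <- tolower(bad_words)
  vapply(text, function(line) {
    tokens <- strsplit(line, " ", fixed = TRUE)[[1]]
    # Compare case-insensitively; multi-word expressions are not handled here.
    tokens[tolower(tokens) %in% bad_words] <- placeholder
    paste(tokens, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Toy list for illustration; the real list would be read with readLines()
# from the URL given above.
censor_words(c("this darn thing", "Darn it"), bad_words = "darn")
```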
Document and word counts for the full English datasets:
##         file_size document_count word_count
## news     196.3 MB          77259    2753559
## blogs    200.4 MB         899288   38371704
## twitter  159.4 MB        2360148   31149354
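A table like this can be produced with a few lines of base R. The sketch below is an assumption about how the counts might be computed, not the code behind this report: `summarise_text_file` is a hypothetical helper, one line is treated as one document, words are approximated as whitespace-separated tokens, and a megabyte is taken as 1024^2 bytes. The connection is opened in binary mode because a text-mode read can stop early at embedded control characters on some platforms, which might explain a document count that looks low for the file size.

```r
# Sketch: file size, document (line) count and word count for one text file.
# `summarise_text_file` is a hypothetical helper name.
summarise_text_file <- function(path, n = -1L) {
  con <- file(path, open = "rb")          # binary mode: read the whole file
  on.exit(close(con))
  docs <- readLines(con, n = n, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Approximate words as whitespace-separated tokens; empty lines count as 0.
  words_per_doc <- lengths(strsplit(trimws(docs), "[[:space:]]+"))
  data.frame(
    file_size = sprintf("%.1f MB", file.size(path) / 1024^2),
    document_count = length(docs),
    word_count = sum(words_per_doc),
    stringsAsFactors = FALSE
  )
}

# Usage on a small temporary file; pass n = 100000 to read only the
# first 100000 documents of a real dataset.
tmp <- tempfile(fileext = ".txt")
writeLines(c("one two three", "", "four  five"), tmp)
summarise_text_file(tmp)
```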
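Document and word counts for the first 100000 documents of each English dataset (the news dataset has fewer than 100000 documents, so all of them are kept; file_size refers to the full file):
##         file_size document_count word_count
## news     196.3 MB          77259    2753559
## blogs    200.4 MB         100000    4248895
## twitter  159.4 MB         100000    1317343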
Before building the n-gram model, the text data are cleaned. The following transformations are applied:
We won’t try to carry prediction across phrase borders, so
Phrase and word counts for the cleaned English datasets (up to the first 100000 documents of each dataset, split into phrases):
##         phrase_count word_count
## news          171602    2680231
## blogs         277921    4167722
## twitter       174065    1307120
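The cleaning and phrase splitting can be sketched as follows. The exact transformations used for this report are not reproduced here, so every step is an assumption inferred from the output tables: text is lower-cased, numbers become a `_number_` token, apostrophes are dropped (compare "youve"), and documents are split into phrases at sentence punctuation. `clean_and_split` is a hypothetical helper; it does not expand contractions and it discards non-ASCII letters.

```r
# Sketch: clean raw documents and split them into phrases.
# `clean_and_split` is a hypothetical helper; the steps are assumptions.
clean_and_split <- function(docs) {
  docs <- tolower(docs)
  # Replace numbers (including decimals such as 3.50) with a placeholder token.
  docs <- gsub("[0-9]+([.,][0-9]+)*", " _number_ ", docs)
  # Split documents into phrases at sentence-level punctuation.
  phrases <- unlist(strsplit(docs, "[.!?;:]+"))
  # Drop apostrophes, then turn every other non-letter into a space.
  phrases <- gsub("'", "", phrases, fixed = TRUE)
  phrases <- gsub("[^a-z_ ]+", " ", phrases)
  # Collapse repeated spaces and remove empty phrases.
  phrases <- trimws(gsub(" +", " ", phrases))
  phrases[nzchar(phrases)]
}

clean_and_split(c("I paid 3.50 dollars. It's fine!", "Hello, world"))
```

Because each phrase becomes its own unit, n-grams built later never span a phrase border.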
Let’s build a corpus from the first 200000 blog documents.
c <- read_corpus(n=200000)
c
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 556547
The corresponding full term-document matrix:
#td <- TermDocumentMatrix(c, control=list(removePunctuation=FALSE))
#td
The full matrix is too big to work with, so we keep only terms that occur at least twice, up to a cumulative coverage of 99% of all word occurrences.
#df1 <- freq_table(c, 1, 2, 0.99)
#saveRDS(df1,"df1.rds")
df1 <- readRDS("df1.rds")
str(df1)
## 'data.frame': 73424 obs. of 5 variables:
## $ term : Factor w/ 157054 levels "@@@","@_@","@adambonin",..: 138686 5002 138613 240 50482 155960 153568 150665 95710 139322 ...
## $ freq : num 411281 245411 104050 102975 80415 ...
## $ rel_freq : num 0.0632 0.0377 0.016 0.0158 0.0124 ...
## $ log_freq : num 3.98 4.73 5.97 5.98 6.34 ...
## $ total_freq: num 0.0632 0.101 0.1169 0.1328 0.1451 ...
head(df1, 20)
## term freq rel_freq log_freq total_freq
## the the 411281 0.063225530 3.983349 0.06322553
## and and 245411 0.037726617 4.728273 0.10095215
## that that 104050 0.015995430 5.966196 0.11694758
## _number_ _number_ 102975 0.015830172 5.981179 0.13277775
## for for 80415 0.012362062 6.337937 0.14513981
## you you 68502 0.010530696 6.569255 0.15567051
## with with 63284 0.009728542 6.683561 0.16539905
## was was 63256 0.009724238 6.684199 0.17512329
## not not 57998 0.008915934 6.809398 0.18403922
## this this 57474 0.008835381 6.822492 0.19287460
## have have 54039 0.008307324 6.911400 0.20118192
## are are 46419 0.007135914 7.130686 0.20831784
## but but 45424 0.006982954 7.161947 0.21530079
## from from 32510 0.004997707 7.644518 0.22029850
## they they 31991 0.004917922 7.667735 0.22521642
## all all 31932 0.004908852 7.670399 0.23012527
## will will 29332 0.004509159 7.792926 0.23463443
## one one 27614 0.004245053 7.880002 0.23887949
## about about 25692 0.003949588 7.984082 0.24282907
## what what 24997 0.003842746 8.023647 0.24667182
tail(df1, 20)
## term freq rel_freq log_freq total_freq
## zouave zouave 2 3.074566e-07 21.63311 0.9871379
## zray zray 2 3.074566e-07 21.63311 0.9871382
## zub zub 2 3.074566e-07 21.63311 0.9871385
## zubir zubir 2 3.074566e-07 21.63311 0.9871388
## zucker zucker 2 3.074566e-07 21.63311 0.9871391
## zuckerman zuckerman 2 3.074566e-07 21.63311 0.9871394
## zukofsky zukofsky 2 3.074566e-07 21.63311 0.9871397
## zululand zululand 2 3.074566e-07 21.63311 0.9871400
## zuni zuni 2 3.074566e-07 21.63311 0.9871403
## zustandes zustandes 2 3.074566e-07 21.63311 0.9871406
## zut zut 2 3.074566e-07 21.63311 0.9871409
## zuzana zuzana 2 3.074566e-07 21.63311 0.9871412
## zuzus zuzus 2 3.074566e-07 21.63311 0.9871415
## zvominir zvominir 2 3.074566e-07 21.63311 0.9871419
## zwei zwei 2 3.074566e-07 21.63311 0.9871422
## zwelinzima zwelinzima 2 3.074566e-07 21.63311 0.9871425
## zwickl zwickl 2 3.074566e-07 21.63311 0.9871428
## zyrtec zyrtec 2 3.074566e-07 21.63311 0.9871431
## zzubs zzubs 2 3.074566e-07 21.63311 0.9871434
## zzzzzzz zzzzzzz 2 3.074566e-07 21.63311 0.9871437
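The columns of `df1` look consistent with these definitions: `rel_freq` is `freq` divided by the total number of word occurrences, `log_freq` is `-log2(rel_freq)` (for "the", -log2(0.0632) is about 3.98), and `total_freq` is the running sum of `rel_freq`. The last `total_freq` value is about 0.987, so here the "at least twice" filter cuts the table before the 99% limit is reached. The sketch below rebuilds such a table from a vector of tokens under those assumptions. `freq_table_sketch` is a hypothetical stand-in, not the `freq_table` function used above, whose code is not shown in this report.

```r
# Sketch: frequency table with the same columns as df1.
# `freq_table_sketch` is a hypothetical name; column meanings are inferred.
freq_table_sketch <- function(tokens, min_count = 2, coverage = 0.99) {
  counts <- table(tokens)
  df <- data.frame(term = names(counts), freq = as.numeric(counts),
                   stringsAsFactors = FALSE)
  # Most frequent first; ties are ordered alphabetically.
  df <- df[order(-df$freq, df$term), ]
  df$rel_freq <- df$freq / sum(df$freq)   # share of all occurrences
  df$log_freq <- -log2(df$rel_freq)       # information content in bits
  df$total_freq <- cumsum(df$rel_freq)    # cumulative coverage
  # Coverage already reached before each term is added.
  reached <- c(0, head(df$total_freq, -1))
  df[df$freq >= min_count & reached < coverage, ]
}

tokens <- c(rep("the", 5), rep("cat", 3), rep("sat", 2), "mat")
freq_table_sketch(tokens)
```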
Now we can plot a histogram of the 20 most frequent words in the corpus.
We also visualize the 150 most frequent words as a word cloud.
As expected, the so-called stop words like “the”, “for” or “that” are among the most common words in the corpus.
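A minimal sketch of how the bar chart of top terms could be drawn with base graphics. `plot_top_terms` is a hypothetical helper, not the plotting code used for this report; it only assumes a data frame with `term` and `freq` columns, like `df1`. A word cloud would need an add-on package and is left out.

```r
# Sketch: bar chart of the n most frequent terms using base graphics.
# `plot_top_terms` is a hypothetical helper name.
plot_top_terms <- function(df, n = 20) {
  top <- head(df[order(-df$freq), ], n)
  barplot(top$freq, names.arg = as.character(top$term), las = 2,
          ylab = "Frequency", main = paste("Top", nrow(top), "terms"))
  invisible(top)  # return the plotted rows for inspection
}

# Usage with the table from this report:
# plot_top_terms(df1, 20)
```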
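Next we collect and show the bigrams that occur at least twice. Note that the distribution is flatter than for single words: the most frequent bigram accounts for about 0.5% of all bigram occurrences, versus about 6% for the most frequent word.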
## 'data.frame': 508511 obs. of 5 variables:
## $ term : Factor w/ 1927256 levels "@ @hazeleyedkell",..: 1160616 826406 872722 1725833 799285 1178344 1624 1714457 800776 121472 ...
## $ freq : num 41443 33820 19855 19096 17756 ...
## $ rel_freq : num 0.00532 0.00434 0.00255 0.00245 0.00228 ...
## $ log_freq : num 7.55 7.85 8.62 8.67 8.78 ...
## $ total_freq: num 0.00532 0.00967 0.01222 0.01467 0.01695 ...
## term freq rel_freq log_freq total_freq
## of the of the 41443 0.005322141 7.553778 0.005322141
## in the in the 33820 0.004343190 7.847029 0.009665330
## it is it is 19855 0.002549794 8.615404 0.012215124
## to the to the 19096 0.002452323 8.671636 0.014667447
## i am i am 17756 0.002280239 8.776599 0.016947685
## on the on the 16734 0.002148993 8.862124 0.019096678
## _number_ _number_ _number_ _number_ 15131 0.001943134 9.007399 0.021039812
## to be to be 14984 0.001924256 9.021483 0.022964069
## i have i have 14836 0.001905250 9.035804 0.024869319
## and the and the 12983 0.001667287 9.228282 0.026536606
## for the for the 12664 0.001626320 9.264173 0.028162926
## and i and i 12004 0.001541563 9.341391 0.029704488
## is a is a 11775 0.001512154 9.369179 0.031216643
## i was i was 11436 0.001468620 9.411324 0.032685262
## it was it was 10821 0.001389641 9.491072 0.034074903
## at the at the 10543 0.001353940 9.528621 0.035428843
## in a in a 9912 0.001272906 9.617658 0.036701750
## with the with the 9657 0.001240159 9.655259 0.037941909
## that i that i 8931 0.001146926 9.768012 0.039088834
## from the from the 8324 0.001068974 9.869557 0.040157809
## term freq rel_freq log_freq total_freq
## zoomed out zoomed out 2 2.568415e-07 21.89262 0.8177988
## zooming out zooming out 2 2.568415e-07 21.89262 0.8177991
## zora neale zora neale 2 2.568415e-07 21.89262 0.8177993
## zp yeah zp yeah 2 2.568415e-07 21.89262 0.8177996
## zub and zub and 2 2.568415e-07 21.89262 0.8177999
## zucchini cut zucchini cut 2 2.568415e-07 21.89262 0.8178001
## zucchini for zucchini for 2 2.568415e-07 21.89262 0.8178004
## zucchini i zucchini i 2 2.568415e-07 21.89262 0.8178006
## zucchini pancakes zucchini pancakes 2 2.568415e-07 21.89262 0.8178009
## zucchini sweet zucchini sweet 2 2.568415e-07 21.89262 0.8178011
## zuckerberg was zuckerberg was 2 2.568415e-07 21.89262 0.8178014
## zulu inkatha zulu inkatha 2 2.568415e-07 21.89262 0.8178017
## zuma was zuma was 2 2.568415e-07 21.89262 0.8178019
## zuma would zuma would 2 2.568415e-07 21.89262 0.8178022
## zumba is zumba is 2 2.568415e-07 21.89262 0.8178024
## zut alors zut alors 2 2.568415e-07 21.89262 0.8178027
## zva bling zva bling 2 2.568415e-07 21.89262 0.8178029
## zva creative zva creative 2 2.568415e-07 21.89262 0.8178032
## zwelinzima vavi zwelinzima vavi 2 2.568415e-07 21.89262 0.8178035
## zyl a zyl a 2 2.568415e-07 21.89262 0.8178037
We also visualize the most frequent bigrams as a word cloud.
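For reference, n-grams can be generated phrase by phrase with base R alone. The sketch below is not the code used for this report; `make_ngrams` and `phrase_ngrams` are hypothetical helpers that assume phrases are already cleaned and separated by single spaces. Working per phrase means no n-gram crosses a phrase border.

```r
# Sketch: build n-grams inside each phrase.
# `make_ngrams` and `phrase_ngrams` are hypothetical helper names.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))  # phrase too short
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

phrase_ngrams <- function(phrases, n) {
  token_lists <- strsplit(phrases, " ", fixed = TRUE)
  as.character(unlist(lapply(token_lists, make_ngrams, n = n)))
}

phrases <- c("i am here", "you are")
phrase_ngrams(phrases, 2)
table(phrase_ngrams(phrases, 2))  # counts per bigram
```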
Let’s do the same for trigrams.
## 'data.frame': 602484 obs. of 5 variables:
## $ term : Factor w/ 4579333 levels "@ @hazeleyedkell _number_",..: 2761411 2974 1819224 91766 2045007 1825290 1811813 1818480 2056741 3397038 ...
## $ freq : num 3255 3121 3019 2757 2179 ...
## $ rel_freq : num 0.000449 0.00043 0.000416 0.00038 0.0003 ...
## $ log_freq : num 11.1 11.2 11.2 11.4 11.7 ...
## $ total_freq: num 0.000449 0.000879 0.001295 0.001675 0.001976 ...
##                            term                       freq     rel_freq log_freq   total_freq
## one of the                 one of the                 3255 0.0004487407 11.12183 0.0004487407
## _number_ _number_ _number_ _number_ _number_ _number_ 3121 0.0004302672 11.18248 0.0008790079
## i do not                   i do not                   3019 0.0004162053 11.23042 0.0012952132
## a lot of                   a lot of                   2757 0.0003800855 11.36139 0.0016752987
## it is a                    it is a                    2179 0.0003004012 11.70082 0.0019756999
## i have been                i have been                2032 0.0002801355 11.80159 0.0022558354
## i am not                   i am not                   1948 0.0002685551 11.86249 0.0025243905
## i did not                  i did not                  1563 0.0002154783 12.18017 0.0027398688
## it was a                   it was a                   1536 0.0002117560 12.20531 0.0029516248
## some of the                some of the                1475 0.0002033464 12.26377 0.0031549712
## the end of                 the end of                 1472 0.0002029328 12.26671 0.0033579040
## to be a                    to be a                    1466 0.0002021056 12.27260 0.0035600097
## out of the                 out of the                 1436 0.0001979698 12.30243 0.0037579795
## it is not                  it is not                  1429 0.0001970048 12.30948 0.0039549842
## as well as                 as well as                 1423 0.0001961776 12.31555 0.0041511618
## there is a                 there is a                 1405 0.0001936961 12.33392 0.0043448579
## be able to                 be able to                 1389 0.0001914903 12.35044 0.0045363481
## a couple of                a couple of                1347 0.0001857001 12.39474 0.0047220482
## if you are                 if you are                 1335 0.0001840457 12.40765 0.0049060939
## i want to                  i want to                  1321 0.0001821157 12.42286 0.0050882096
##                         term                    freq     rel_freq log_freq total_freq
## zitouna university must zitouna university must    2 2.757239e-07 21.79027  0.4517385
## zlatan ibrahimovic who  zlatan ibrahimovic who     2 2.757239e-07 21.79027  0.4517388
## zombie and i            zombie and i               2 2.757239e-07 21.79027  0.4517391
## zombie movie with       zombie movie with          2 2.757239e-07 21.79027  0.4517393
## zombie tag is           zombie tag is              2 2.757239e-07 21.79027  0.4517396
## zombies and vampires    zombies and vampires       2 2.757239e-07 21.79027  0.4517399
## zombies in black        zombies in black           2 2.757239e-07 21.79027  0.4517402
## zombies on the          zombies on the             2 2.757239e-07 21.79027  0.4517405
## zone and embrace        zone and embrace           2 2.757239e-07 21.79027  0.4517407
## zone and then           zone and then              2 2.757239e-07 21.79027  0.4517410
## zone and travel         zone and travel            2 2.757239e-07 21.79027  0.4517413
## zone and trying         zone and trying            2 2.757239e-07 21.79027  0.4517416
## zone and you            zone and you               2 2.757239e-07 21.79027  0.4517418
## zoo farm said           zoo farm said              2 2.757239e-07 21.79027  0.4517421
## zoom _number_ and       zoom _number_ and          2 2.757239e-07 21.79027  0.4517424
## zoomed through the      zoomed through the         2 2.757239e-07 21.79027  0.4517427
## zub and i               zub and i                  2 2.757239e-07 21.79027  0.4517429
## zucchini and yellow     zucchini and yellow        2 2.757239e-07 21.79027  0.4517432
## zucchini bread a        zucchini bread a           2 2.757239e-07 21.79027  0.4517435
## zulu inkatha supporters zulu inkatha supporters    2 2.757239e-07 21.79027  0.4517438
We also visualize the most frequent trigrams as a word cloud.
Finally, let’s do the same for 4-grams.
## 'data.frame': 313655 obs. of 5 variables:
## $ term : Factor w/ 5925240 levels "@ @hazeleyedkell _number_ the",..: 3206 2249943 4670282 4809019 723379 2270467 726027 5622269 2583839 5104494 ...
## $ freq : num 786 725 718 681 666 610 464 442 432 428 ...
## $ rel_freq : num 1.17e-04 1.08e-04 1.07e-04 1.01e-04 9.88e-05 ...
## $ log_freq : num 13.1 13.2 13.2 13.3 13.3 ...
## $ total_freq: num 0.000117 0.000224 0.000331 0.000432 0.000531 ...
##                                     term                                freq     rel_freq log_freq   total_freq
## _number_ _number_ _number_ _number_ _number_ _number_ _number_ _number_  786 1.166367e-04 13.06569 0.0001166367
## i am going to                       i am going to                        725 1.075847e-04 13.18224 0.0002242214
## the end of the                      the end of the                       718 1.065460e-04 13.19624 0.0003307673
## the rest of the                     the rest of the                      681 1.010554e-04 13.27257 0.0004318228
## at the end of                       at the end of                        666 9.882954e-05 13.30470 0.0005306523
## i do not know                       i do not know                        610 9.051955e-05 13.43141 0.0006211718
## at the same time                    at the same time                     464 6.885421e-05 13.82610 0.0006900261
## when it comes to                    when it comes to                     442 6.558957e-05 13.89617 0.0007556156
## is one of the                       is one of the                        432 6.410565e-05 13.92919 0.0008197213
## to be able to                       to be able to                        428 6.351208e-05 13.94261 0.0008832333
## i would like to                     i would like to                      425 6.306690e-05 13.95276 0.0009463002
## one of the most                     one of the most                      409 6.069261e-05 14.00812 0.0010069929
## for the first time                  for the first time                   408 6.054422e-05 14.01165 0.0010675371
## in the middle of                    in the middle of                     397 5.891190e-05 14.05108 0.0011264490
## in the _number_ s                   in the _number_ s                    379 5.624083e-05 14.11802 0.0011826898
## i do not have                       i do not have                        350 5.193744e-05 14.23287 0.0012346273
## i am not sure                       i am not sure                        348 5.164066e-05 14.24113 0.0012862679
## is going to be                      is going to be                       335 4.971155e-05 14.29606 0.0013359795
## if you want to                      if you want to                       331 4.911798e-05 14.31339 0.0013850975
## do not want to                      do not want to                       325 4.822763e-05 14.33978 0.0014333251
##                                    term                               freq     rel_freq log_freq total_freq
## youve been reading my              youve been reading my                 2 2.967854e-07 21.68408  0.1672761
## youve by no means                  youve by no means                     2 2.967854e-07 21.68408  0.1672764
## youve come to the                  youve come to the                     2 2.967854e-07 21.68408  0.1672767
## youve got to get                   youve got to get                      2 2.967854e-07 21.68408  0.1672770
## youve never seen it                youve never seen it                   2 2.967854e-07 21.68408  0.1672773
## youve probably been to             youve probably been to                2 2.967854e-07 21.68408  0.1672776
## youve probably heard that          youve probably heard that             2 2.967854e-07 21.68408  0.1672779
## youve recently discovered or       youve recently discovered or          2 2.967854e-07 21.68408  0.1672782
## youve walked a mile                youve walked a mile                   2 2.967854e-07 21.68408  0.1672785
## yoyogi olympic pool tokyo          yoyogi olympic pool tokyo             2 2.967854e-07 21.68408  0.1672788
## yusufali o ye who                  yusufali o ye who                     2 2.967854e-07 21.68408  0.1672791
## zain said najib should             zain said najib should                2 2.967854e-07 21.68408  0.1672794
## zerotolerance at _number_ _number_ zerotolerance at _number_ _number_    2 2.967854e-07 21.68408  0.1672797
## zest of _number_ _number_          zest of _number_ _number_             2 2.967854e-07 21.68408  0.1672800
## zionist dream of a                 zionist dream of a                    2 2.967854e-07 21.68408  0.1672803
## zitouna university must be         zitouna university must be            2 2.967854e-07 21.68408  0.1672806
## zombies in black pajamas           zombies in black pajamas              2 2.967854e-07 21.68408  0.1672809
## zoom _number_ and iacthc           zoom _number_ and iacthc              2 2.967854e-07 21.68408  0.1672812
## zucchini and yellow squash         zucchini and yellow squash            2 2.967854e-07 21.68408  0.1672815
## zulu inkatha supporters were       zulu inkatha supporters were          2 2.967854e-07 21.68408  0.1672818
We also visualize the most frequent 4-grams as a word cloud.
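Looking ahead to the predictive model, frequency tables like these could be queried directly: take the last words typed, find the n-grams that start with them, and propose the last word of the most frequent matches. The sketch below only illustrates that lookup and is not the planned implementation. `predict_next` is a hypothetical helper, and it assumes a table holding n-grams of a single order n with a prefix of exactly n - 1 words.

```r
# Sketch: suggest next words from an n-gram frequency table.
# `predict_next` is a hypothetical helper; the table must hold n-grams of one
# order n, and `prefix` must consist of exactly n - 1 space-separated words.
predict_next <- function(ngram_df, prefix, k = 3) {
  terms <- as.character(ngram_df$term)
  hits <- ngram_df[startsWith(terms, paste0(prefix, " ")), ]
  hits <- hits[order(-hits$freq), ]
  # The suggested word is the last word of each matching n-gram.
  head(sub(".* ", "", as.character(hits$term)), k)
}

# A few bigram counts copied from the table in this report.
bigrams <- data.frame(
  term = c("it is", "i am", "i have", "i was"),
  freq = c(19855, 17756, 14836, 11436),
  stringsAsFactors = FALSE
)
predict_next(bigrams, "i", k = 2)
```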
### Interesting findings