Overview

This document contains the Milestone Report for the Capstone Project of the Coursera Data Science Specialization.

The project goal is to build an n-gram based predictive language model using SwiftKey datasets. This report contains the summary statistics for these datasets and describes the proposed steps for the implementation of the predictive model.

Data sources

The SwiftKey datasets can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The archive contains three datasets (news articles, blogs and tweets) for each of four languages: English, German, Finnish and Russian. This project uses only the English datasets.
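
A minimal sketch of how the archive might be fetched and the English files extracted. The helper names (`select_english`, `fetch_swiftkey`), the `data` directory and the assumption that English files carry `en_US` in their names are illustrative and not taken from the report's code.

```r
# Pick the English text files from an archive listing.
# ASSUMPTION: English files contain "en_US" in their path and end in ".txt".
select_english <- function(archive_names) {
  grep("en_US.*\\.txt$", archive_names, value = TRUE)
}

# Download the archive once, then extract only the English files.
fetch_swiftkey <- function(
    url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
    dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  zip_path <- file.path(dest_dir, basename(url))
  if (!file.exists(zip_path)) {
    download.file(url, destfile = zip_path, mode = "wb")
  }
  # list = TRUE returns the archive listing without extracting anything
  listing <- unzip(zip_path, list = TRUE)
  english <- select_english(listing$Name)
  unzip(zip_path, files = english, exdir = dest_dir)
  file.path(dest_dir, english)
}
```

Calling `fetch_swiftkey()` would return the paths of the extracted files; the download is skipped when the archive is already present.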

Obscene words are censored out of the text corpus. The list of obscene words and expressions is loaded from https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
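
One way such censoring could be done in base R is sketched below. The function name `censor_words` and the `_censored_` placeholder are assumptions; the report does not say whether matched words are replaced or simply dropped.

```r
# Replace every whole-word occurrence of a listed word or expression.
# ASSUMPTION: matches are replaced by a placeholder token, not deleted.
censor_words <- function(text, bad_words, replacement = "_censored_") {
  if (length(bad_words) == 0) return(text)
  # Try longer expressions first so multi-word entries win over their parts.
  bad_words <- bad_words[order(-nchar(bad_words))]
  # \Q...\E makes PCRE treat each entry literally, so punctuation inside
  # an entry is not interpreted as regex syntax.
  quoted <- paste0("\\Q", bad_words, "\\E", collapse = "|")
  pattern <- paste0("\\b(?:", quoted, ")\\b")
  gsub(pattern, replacement, text, ignore.case = TRUE, perl = TRUE)
}

# Usage with the list above (one entry per line):
# bad_words <- readLines("https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en")
# clean <- censor_words(raw_text, bad_words)
```

The word-boundary anchors keep harmless words that merely contain a listed word from being censored.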

Summary statistics

Document and word counts for the full English datasets:

##         file_size document_count word_count
## news     196.3 MB          77259    2753559
## blogs    200.4 MB         899288   38371704
## twitter  159.4 MB        2360148   31149354

Document and word counts for the first 100000 documents of each English dataset (the news dataset contains only 77259 documents, so it is included in full):

##         file_size document_count word_count
## news     196.3 MB          77259    2753559
## blogs    200.4 MB         100000    4248895
## twitter  159.4 MB         100000    1317343
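
Counts like those in the tables above could be computed along the following lines. This is a sketch under stated assumptions: `corpus_stats` is a hypothetical helper, a document is taken to be one line of the file, and a word is any whitespace-separated token, so the numbers need not match the report exactly.

```r
# Summary statistics for one text file.
# n = -1 reads the whole file; n = 100000 reads only the first 100000 lines.
corpus_stats <- function(path, n = -1L) {
  lines <- readLines(path, n = n, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Split each line on runs of whitespace; empty lines contribute zero words.
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    file_size_mb   = file.size(path) / 2^20,  # size of the whole file on disk
    document_count = length(lines),
    word_count     = sum(words_per_line)
  )
}
```

Note that `file_size_mb` always describes the whole file, which is why the file sizes are identical in both tables.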

Data cleaning

Before building the n-gram model, the text data are cleaned. The following transformations are applied:

We won’t try to carry prediction across phrase borders, so each document is split into phrases.

Phrase and word counts for the cleaned English datasets (first 100000 documents):

##         phrase_count word_count
## news          171602    2680231
## blogs         277921    4167722
## twitter       174065    1307120
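
The exact cleaning rules are not listed here, so the sketch below only illustrates the kind of transformations suggested by the output later in this report: lower-case text, a `_number_` placeholder and phrase-level units. The function `split_phrases` and each of its rules are assumptions, not the report's actual code.

```r
# Turn raw documents into cleaned phrases.
# ASSUMPTIONS: phrases end at . ! ? ; or : and only a-z letters are kept.
split_phrases <- function(docs) {
  docs <- tolower(docs)
  # Replace each run of digits with a placeholder token.
  docs <- gsub("[0-9]+", " _number_ ", docs)
  # Split every document into phrases at sentence-like punctuation.
  phrases <- unlist(strsplit(docs, "[.!?;:]+"))
  # Normalise whitespace, then drop everything except letters, "_" and spaces.
  phrases <- gsub("[[:space:]]+", " ", phrases)
  phrases <- gsub("[^a-z_ ]", "", phrases)
  phrases <- trimws(gsub(" +", " ", phrases))
  phrases[nzchar(phrases)]
}

split_phrases("I have 2 cats. You've seen them!")
# returns c("i have _number_ cats", "youve seen them")
```

A real pipeline would need more care, for example with decimal numbers, abbreviations and non-ASCII letters, all of which this sketch handles crudely.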

Corpus

Let’s build a corpus from the first 200000 blog entries.

c <- read_corpus(n=200000)
c
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 556547

The corresponding full term-document matrix:

#td <- TermDocumentMatrix(c, control=list(removePunctuation=FALSE))
#td

The full matrix is too big to work with, so we keep only the terms that occur at least twice and together account for about 99% of all word occurrences.

#df1 <- freq_table(c, 1, 2, 0.99)
#saveRDS(df1,"df1.rds")
df1 <- readRDS("df1.rds")
str(df1)
## 'data.frame':    73424 obs. of  5 variables:
##  $ term      : Factor w/ 157054 levels "@@@","@_@","@adambonin",..: 138686 5002 138613 240 50482 155960 153568 150665 95710 139322 ...
##  $ freq      : num  411281 245411 104050 102975 80415 ...
##  $ rel_freq  : num  0.0632 0.0377 0.016 0.0158 0.0124 ...
##  $ log_freq  : num  3.98 4.73 5.97 5.98 6.34 ...
##  $ total_freq: num  0.0632 0.101 0.1169 0.1328 0.1451 ...
head(df1, 20)
##              term   freq    rel_freq log_freq total_freq
## the           the 411281 0.063225530 3.983349 0.06322553
## and           and 245411 0.037726617 4.728273 0.10095215
## that         that 104050 0.015995430 5.966196 0.11694758
## _number_ _number_ 102975 0.015830172 5.981179 0.13277775
## for           for  80415 0.012362062 6.337937 0.14513981
## you           you  68502 0.010530696 6.569255 0.15567051
## with         with  63284 0.009728542 6.683561 0.16539905
## was           was  63256 0.009724238 6.684199 0.17512329
## not           not  57998 0.008915934 6.809398 0.18403922
## this         this  57474 0.008835381 6.822492 0.19287460
## have         have  54039 0.008307324 6.911400 0.20118192
## are           are  46419 0.007135914 7.130686 0.20831784
## but           but  45424 0.006982954 7.161947 0.21530079
## from         from  32510 0.004997707 7.644518 0.22029850
## they         they  31991 0.004917922 7.667735 0.22521642
## all           all  31932 0.004908852 7.670399 0.23012527
## will         will  29332 0.004509159 7.792926 0.23463443
## one           one  27614 0.004245053 7.880002 0.23887949
## about       about  25692 0.003949588 7.984082 0.24282907
## what         what  24997 0.003842746 8.023647 0.24667182
tail(df1, 20)
##                  term freq     rel_freq log_freq total_freq
## zouave         zouave    2 3.074566e-07 21.63311  0.9871379
## zray             zray    2 3.074566e-07 21.63311  0.9871382
## zub               zub    2 3.074566e-07 21.63311  0.9871385
## zubir           zubir    2 3.074566e-07 21.63311  0.9871388
## zucker         zucker    2 3.074566e-07 21.63311  0.9871391
## zuckerman   zuckerman    2 3.074566e-07 21.63311  0.9871394
## zukofsky     zukofsky    2 3.074566e-07 21.63311  0.9871397
## zululand     zululand    2 3.074566e-07 21.63311  0.9871400
## zuni             zuni    2 3.074566e-07 21.63311  0.9871403
## zustandes   zustandes    2 3.074566e-07 21.63311  0.9871406
## zut               zut    2 3.074566e-07 21.63311  0.9871409
## zuzana         zuzana    2 3.074566e-07 21.63311  0.9871412
## zuzus           zuzus    2 3.074566e-07 21.63311  0.9871415
## zvominir     zvominir    2 3.074566e-07 21.63311  0.9871419
## zwei             zwei    2 3.074566e-07 21.63311  0.9871422
## zwelinzima zwelinzima    2 3.074566e-07 21.63311  0.9871425
## zwickl         zwickl    2 3.074566e-07 21.63311  0.9871428
## zyrtec         zyrtec    2 3.074566e-07 21.63311  0.9871431
## zzubs           zzubs    2 3.074566e-07 21.63311  0.9871434
## zzzzzzz       zzzzzzz    2 3.074566e-07 21.63311  0.9871437
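
The columns above are consistent with `rel_freq = freq / sum(freq)`, `log_freq = -log2(rel_freq)` and `total_freq` being the running sum of `rel_freq` (for example, -log2(0.0632) is about 3.98). A base-R sketch that produces a table of this shape follows. `ngram_freq_table` is a hypothetical stand-in for the `freq_table` helper used above, and the meaning of its arguments (n-gram order, minimum frequency, coverage limit) is an assumption.

```r
# Frequency table of n-grams built phrase by phrase, so that no n-gram
# crosses a phrase border.
ngram_freq_table <- function(phrases, n = 1, min_freq = 2, coverage = 0.99) {
  tokens <- strsplit(trimws(phrases), "\\s+")
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) stop("no n-grams of the requested order found")
  counts <- table(grams)
  df <- data.frame(term = names(counts), freq = as.numeric(counts),
                   stringsAsFactors = FALSE)
  df <- df[order(-df$freq, df$term), ]
  df$rel_freq   <- df$freq / sum(df$freq)  # share of all n-gram occurrences
  df$log_freq   <- -log2(df$rel_freq)      # information content in bits
  df$total_freq <- cumsum(df$rel_freq)     # cumulative coverage
  # Keep frequent terms until the requested coverage has been reached.
  keep <- df$freq >= min_freq & (df$total_freq - df$rel_freq) < coverage
  df <- df[keep, ]
  rownames(df) <- df$term
  df
}
```

Because the relative frequencies are computed before filtering, `total_freq` in the last row shows how much of the corpus the retained terms cover.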

Term frequency plots

Now we can plot a histogram of the 20 most frequent words in the corpus.
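
A plot of this kind could be drawn with base graphics as sketched below. `plot_top_terms` is a hypothetical helper; it only assumes a data frame with `term` and `freq` columns like `df1` above.

```r
# Bar plot of the k most frequent terms in a frequency table.
plot_top_terms <- function(df, k = 20) {
  top <- head(df[order(-df$freq), ], k)
  barplot(height = top$freq,
          names.arg = as.character(top$term),
          las = 2,                      # rotate labels so terms stay readable
          ylab = "Frequency",
          main = paste("Top", nrow(top), "terms"))
  invisible(top)                        # return the plotted rows for reuse
}

# Usage: plot_top_terms(df1, 20)
```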

And visualize the 150 most frequent words as a word cloud.
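
A sketch of how such a cloud might be produced, assuming the `wordcloud` package is installed. The report does not state which package was used, and `draw_word_cloud` is a hypothetical helper.

```r
# Word cloud of the k most frequent terms in a frequency table.
# ASSUMPTION: the 'wordcloud' package is available.
draw_word_cloud <- function(df, k = 150) {
  if (!requireNamespace("wordcloud", quietly = TRUE)) {
    stop("package 'wordcloud' is required for this sketch")
  }
  top <- head(df[order(-df$freq), ], k)
  wordcloud::wordcloud(words = as.character(top$term),
                       freq = top$freq,
                       min.freq = 1,        # do not drop rare terms silently
                       max.words = k,
                       random.order = FALSE) # most frequent terms in the centre
  invisible(top)
}

# Usage: draw_word_cloud(df1, 150)
```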

As expected, the so-called stop words like “the”, “for” or “that” are among the most common words in the corpus.

Now we collect and show the bigrams that occur at least twice. Note that the distribution is flatter.

## 'data.frame':    508511 obs. of  5 variables:
##  $ term      : Factor w/ 1927256 levels "@ @hazeleyedkell",..: 1160616 826406 872722 1725833 799285 1178344 1624 1714457 800776 121472 ...
##  $ freq      : num  41443 33820 19855 19096 17756 ...
##  $ rel_freq  : num  0.00532 0.00434 0.00255 0.00245 0.00228 ...
##  $ log_freq  : num  7.55 7.85 8.62 8.67 8.78 ...
##  $ total_freq: num  0.00532 0.00967 0.01222 0.01467 0.01695 ...
##                                term  freq    rel_freq log_freq  total_freq
## of the                       of the 41443 0.005322141 7.553778 0.005322141
## in the                       in the 33820 0.004343190 7.847029 0.009665330
## it is                         it is 19855 0.002549794 8.615404 0.012215124
## to the                       to the 19096 0.002452323 8.671636 0.014667447
## i am                           i am 17756 0.002280239 8.776599 0.016947685
## on the                       on the 16734 0.002148993 8.862124 0.019096678
## _number_ _number_ _number_ _number_ 15131 0.001943134 9.007399 0.021039812
## to be                         to be 14984 0.001924256 9.021483 0.022964069
## i have                       i have 14836 0.001905250 9.035804 0.024869319
## and the                     and the 12983 0.001667287 9.228282 0.026536606
## for the                     for the 12664 0.001626320 9.264173 0.028162926
## and i                         and i 12004 0.001541563 9.341391 0.029704488
## is a                           is a 11775 0.001512154 9.369179 0.031216643
## i was                         i was 11436 0.001468620 9.411324 0.032685262
## it was                       it was 10821 0.001389641 9.491072 0.034074903
## at the                       at the 10543 0.001353940 9.528621 0.035428843
## in a                           in a  9912 0.001272906 9.617658 0.036701750
## with the                   with the  9657 0.001240159 9.655259 0.037941909
## that i                       that i  8931 0.001146926 9.768012 0.039088834
## from the                   from the  8324 0.001068974 9.869557 0.040157809
##                                term freq     rel_freq log_freq total_freq
## zoomed out               zoomed out    2 2.568415e-07 21.89262  0.8177988
## zooming out             zooming out    2 2.568415e-07 21.89262  0.8177991
## zora neale               zora neale    2 2.568415e-07 21.89262  0.8177993
## zp yeah                     zp yeah    2 2.568415e-07 21.89262  0.8177996
## zub and                     zub and    2 2.568415e-07 21.89262  0.8177999
## zucchini cut           zucchini cut    2 2.568415e-07 21.89262  0.8178001
## zucchini for           zucchini for    2 2.568415e-07 21.89262  0.8178004
## zucchini i               zucchini i    2 2.568415e-07 21.89262  0.8178006
## zucchini pancakes zucchini pancakes    2 2.568415e-07 21.89262  0.8178009
## zucchini sweet       zucchini sweet    2 2.568415e-07 21.89262  0.8178011
## zuckerberg was       zuckerberg was    2 2.568415e-07 21.89262  0.8178014
## zulu inkatha           zulu inkatha    2 2.568415e-07 21.89262  0.8178017
## zuma was                   zuma was    2 2.568415e-07 21.89262  0.8178019
## zuma would               zuma would    2 2.568415e-07 21.89262  0.8178022
## zumba is                   zumba is    2 2.568415e-07 21.89262  0.8178024
## zut alors                 zut alors    2 2.568415e-07 21.89262  0.8178027
## zva bling                 zva bling    2 2.568415e-07 21.89262  0.8178029
## zva creative           zva creative    2 2.568415e-07 21.89262  0.8178032
## zwelinzima vavi     zwelinzima vavi    2 2.568415e-07 21.89262  0.8178035
## zyl a                         zyl a    2 2.568415e-07 21.89262  0.8178037

And visualize them as a word cloud.

Let’s do the same for trigrams.

## 'data.frame':    602484 obs. of  5 variables:
##  $ term      : Factor w/ 4579333 levels "@ @hazeleyedkell _number_",..: 2761411 2974 1819224 91766 2045007 1825290 1811813 1818480 2056741 3397038 ...
##  $ freq      : num  3255 3121 3019 2757 2179 ...
##  $ rel_freq  : num  0.000449 0.00043 0.000416 0.00038 0.0003 ...
##  $ log_freq  : num  11.1 11.2 11.2 11.4 11.7 ...
##  $ total_freq: num  0.000449 0.000879 0.001295 0.001675 0.001976 ...
##                                                  term freq     rel_freq
## one of the                                 one of the 3255 0.0004487407
## _number_ _number_ _number_ _number_ _number_ _number_ 3121 0.0004302672
## i do not                                     i do not 3019 0.0004162053
## a lot of                                     a lot of 2757 0.0003800855
## it is a                                       it is a 2179 0.0003004012
## i have been                               i have been 2032 0.0002801355
## i am not                                     i am not 1948 0.0002685551
## i did not                                   i did not 1563 0.0002154783
## it was a                                     it was a 1536 0.0002117560
## some of the                               some of the 1475 0.0002033464
## the end of                                 the end of 1472 0.0002029328
## to be a                                       to be a 1466 0.0002021056
## out of the                                 out of the 1436 0.0001979698
## it is not                                   it is not 1429 0.0001970048
## as well as                                 as well as 1423 0.0001961776
## there is a                                 there is a 1405 0.0001936961
## be able to                                 be able to 1389 0.0001914903
## a couple of                               a couple of 1347 0.0001857001
## if you are                                 if you are 1335 0.0001840457
## i want to                                   i want to 1321 0.0001821157
##                            log_freq   total_freq
## one of the                 11.12183 0.0004487407
## _number_ _number_ _number_ 11.18248 0.0008790079
## i do not                   11.23042 0.0012952132
## a lot of                   11.36139 0.0016752987
## it is a                    11.70082 0.0019756999
## i have been                11.80159 0.0022558354
## i am not                   11.86249 0.0025243905
## i did not                  12.18017 0.0027398688
## it was a                   12.20531 0.0029516248
## some of the                12.26377 0.0031549712
## the end of                 12.26671 0.0033579040
## to be a                    12.27260 0.0035600097
## out of the                 12.30243 0.0037579795
## it is not                  12.30948 0.0039549842
## as well as                 12.31555 0.0041511618
## there is a                 12.33392 0.0043448579
## be able to                 12.35044 0.0045363481
## a couple of                12.39474 0.0047220482
## if you are                 12.40765 0.0049060939
## i want to                  12.42286 0.0050882096
##                                            term freq     rel_freq log_freq
## zitouna university must zitouna university must    2 2.757239e-07 21.79027
## zlatan ibrahimovic who   zlatan ibrahimovic who    2 2.757239e-07 21.79027
## zombie and i                       zombie and i    2 2.757239e-07 21.79027
## zombie movie with             zombie movie with    2 2.757239e-07 21.79027
## zombie tag is                     zombie tag is    2 2.757239e-07 21.79027
## zombies and vampires       zombies and vampires    2 2.757239e-07 21.79027
## zombies in black               zombies in black    2 2.757239e-07 21.79027
## zombies on the                   zombies on the    2 2.757239e-07 21.79027
## zone and embrace               zone and embrace    2 2.757239e-07 21.79027
## zone and then                     zone and then    2 2.757239e-07 21.79027
## zone and travel                 zone and travel    2 2.757239e-07 21.79027
## zone and trying                 zone and trying    2 2.757239e-07 21.79027
## zone and you                       zone and you    2 2.757239e-07 21.79027
## zoo farm said                     zoo farm said    2 2.757239e-07 21.79027
## zoom _number_ and             zoom _number_ and    2 2.757239e-07 21.79027
## zoomed through the           zoomed through the    2 2.757239e-07 21.79027
## zub and i                             zub and i    2 2.757239e-07 21.79027
## zucchini and yellow         zucchini and yellow    2 2.757239e-07 21.79027
## zucchini bread a               zucchini bread a    2 2.757239e-07 21.79027
## zulu inkatha supporters zulu inkatha supporters    2 2.757239e-07 21.79027
##                         total_freq
## zitouna university must  0.4517385
## zlatan ibrahimovic who   0.4517388
## zombie and i             0.4517391
## zombie movie with        0.4517393
## zombie tag is            0.4517396
## zombies and vampires     0.4517399
## zombies in black         0.4517402
## zombies on the           0.4517405
## zone and embrace         0.4517407
## zone and then            0.4517410
## zone and travel          0.4517413
## zone and trying          0.4517416
## zone and you             0.4517418
## zoo farm said            0.4517421
## zoom _number_ and        0.4517424
## zoomed through the       0.4517427
## zub and i                0.4517429
## zucchini and yellow      0.4517432
## zucchini bread a         0.4517435
## zulu inkatha supporters  0.4517438

And visualize them as a word cloud.

Let’s do the same for 4-grams.

## 'data.frame':    313655 obs. of  5 variables:
##  $ term      : Factor w/ 5925240 levels "@ @hazeleyedkell _number_ the",..: 3206 2249943 4670282 4809019 723379 2270467 726027 5622269 2583839 5104494 ...
##  $ freq      : num  786 725 718 681 666 610 464 442 432 428 ...
##  $ rel_freq  : num  1.17e-04 1.08e-04 1.07e-04 1.01e-04 9.88e-05 ...
##  $ log_freq  : num  13.1 13.2 13.2 13.3 13.3 ...
##  $ total_freq: num  0.000117 0.000224 0.000331 0.000432 0.000531 ...
##                                                                    term
## _number_ _number_ _number_ _number_ _number_ _number_ _number_ _number_
## i am going to                                             i am going to
## the end of the                                           the end of the
## the rest of the                                         the rest of the
## at the end of                                             at the end of
## i do not know                                             i do not know
## at the same time                                       at the same time
## when it comes to                                       when it comes to
## is one of the                                             is one of the
## to be able to                                             to be able to
## i would like to                                         i would like to
## one of the most                                         one of the most
## for the first time                                   for the first time
## in the middle of                                       in the middle of
## in the _number_ s                                     in the _number_ s
## i do not have                                             i do not have
## i am not sure                                             i am not sure
## is going to be                                           is going to be
## if you want to                                           if you want to
## do not want to                                           do not want to
##                                     freq     rel_freq log_freq
## _number_ _number_ _number_ _number_  786 1.166367e-04 13.06569
## i am going to                        725 1.075847e-04 13.18224
## the end of the                       718 1.065460e-04 13.19624
## the rest of the                      681 1.010554e-04 13.27257
## at the end of                        666 9.882954e-05 13.30470
## i do not know                        610 9.051955e-05 13.43141
## at the same time                     464 6.885421e-05 13.82610
## when it comes to                     442 6.558957e-05 13.89617
## is one of the                        432 6.410565e-05 13.92919
## to be able to                        428 6.351208e-05 13.94261
## i would like to                      425 6.306690e-05 13.95276
## one of the most                      409 6.069261e-05 14.00812
## for the first time                   408 6.054422e-05 14.01165
## in the middle of                     397 5.891190e-05 14.05108
## in the _number_ s                    379 5.624083e-05 14.11802
## i do not have                        350 5.193744e-05 14.23287
## i am not sure                        348 5.164066e-05 14.24113
## is going to be                       335 4.971155e-05 14.29606
## if you want to                       331 4.911798e-05 14.31339
## do not want to                       325 4.822763e-05 14.33978
##                                       total_freq
## _number_ _number_ _number_ _number_ 0.0001166367
## i am going to                       0.0002242214
## the end of the                      0.0003307673
## the rest of the                     0.0004318228
## at the end of                       0.0005306523
## i do not know                       0.0006211718
## at the same time                    0.0006900261
## when it comes to                    0.0007556156
## is one of the                       0.0008197213
## to be able to                       0.0008832333
## i would like to                     0.0009463002
## one of the most                     0.0010069929
## for the first time                  0.0010675371
## in the middle of                    0.0011264490
## in the _number_ s                   0.0011826898
## i do not have                       0.0012346273
## i am not sure                       0.0012862679
## is going to be                      0.0013359795
## if you want to                      0.0013850975
## do not want to                      0.0014333251
##                                                                  term freq
## youve been reading my                           youve been reading my    2
## youve by no means                                   youve by no means    2
## youve come to the                                   youve come to the    2
## youve got to get                                     youve got to get    2
## youve never seen it                               youve never seen it    2
## youve probably been to                         youve probably been to    2
## youve probably heard that                   youve probably heard that    2
## youve recently discovered or             youve recently discovered or    2
## youve walked a mile                               youve walked a mile    2
## yoyogi olympic pool tokyo                   yoyogi olympic pool tokyo    2
## yusufali o ye who                                   yusufali o ye who    2
## zain said najib should                         zain said najib should    2
## zerotolerance at _number_ _number_ zerotolerance at _number_ _number_    2
## zest of _number_ _number_                   zest of _number_ _number_    2
## zionist dream of a                                 zionist dream of a    2
## zitouna university must be                 zitouna university must be    2
## zombies in black pajamas                     zombies in black pajamas    2
## zoom _number_ and iacthc                     zoom _number_ and iacthc    2
## zucchini and yellow squash                 zucchini and yellow squash    2
## zulu inkatha supporters were             zulu inkatha supporters were    2
##                                        rel_freq log_freq total_freq
## youve been reading my              2.967854e-07 21.68408  0.1672761
## youve by no means                  2.967854e-07 21.68408  0.1672764
## youve come to the                  2.967854e-07 21.68408  0.1672767
## youve got to get                   2.967854e-07 21.68408  0.1672770
## youve never seen it                2.967854e-07 21.68408  0.1672773
## youve probably been to             2.967854e-07 21.68408  0.1672776
## youve probably heard that          2.967854e-07 21.68408  0.1672779
## youve recently discovered or       2.967854e-07 21.68408  0.1672782
## youve walked a mile                2.967854e-07 21.68408  0.1672785
## yoyogi olympic pool tokyo          2.967854e-07 21.68408  0.1672788
## yusufali o ye who                  2.967854e-07 21.68408  0.1672791
## zain said najib should             2.967854e-07 21.68408  0.1672794
## zerotolerance at _number_ _number_ 2.967854e-07 21.68408  0.1672797
## zest of _number_ _number_          2.967854e-07 21.68408  0.1672800
## zionist dream of a                 2.967854e-07 21.68408  0.1672803
## zitouna university must be         2.967854e-07 21.68408  0.1672806
## zombies in black pajamas           2.967854e-07 21.68408  0.1672809
## zoom _number_ and iacthc           2.967854e-07 21.68408  0.1672812
## zucchini and yellow squash         2.967854e-07 21.68408  0.1672815
## zulu inkatha supporters were       2.967854e-07 21.68408  0.1672818

And visualize them as a word cloud.
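
To put numbers on how much flatter the distributions become as n grows, the sketch below collects two values of `total_freq` from each table above: the value in row 20 (coverage of the 20 most frequent n-grams) and the value in the last row (coverage of all n-grams that occur at least twice). The numbers are copied from the output in this report; only the variable names are new.

```r
# Cumulative coverage read off the frequency tables in this report.
coverage <- data.frame(
  n        = 1:4,
  top20    = c(0.24667182, 0.040157809, 0.0050882096, 0.0014333251),
  retained = c(0.9871437, 0.8178037, 0.4517438, 0.1672818)
)

# Express both columns as percentages for easier reading.
coverage$top20_pct    <- round(100 * coverage$top20, 2)
coverage$retained_pct <- round(100 * coverage$retained, 1)
print(coverage[, c("n", "top20_pct", "retained_pct")])
```

The 20 most frequent words cover about a quarter of all word occurrences, while the 20 most frequent 4-grams cover well under 1%, and 4-grams seen at least twice cover only about 17% of all 4-gram occurrences.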

Interesting findings

To do next