Introduction

Presentation

Following the discussion on whether code should appear in a final report intended for non-data scientists, the code chunks in this report are hidden by default (echo=FALSE).
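For reference, a default of this kind is typically set once in a setup chunk at the top of an R Markdown document. The snippet below is a minimal sketch and assumes the report is knitted with knitr; it is not taken from the report's own source.

```r
# Assumed setup chunk: hide code in the knitted output unless a chunk overrides it.
# The guard keeps the snippet harmless when knitr is not installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(echo = FALSE)
}
```

Individual chunks can still show their code by setting echo=TRUE in the chunk header.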

Data files

This document provides the exploratory analysis for the Data Science Capstone. The code assumes that the required files are available in the data folder. If needed, the files can be downloaded and unpacked with the download.file and unzip functions in R. The report summarises a descriptive analysis of a set of files with the characteristics detailed in the table below. Line lengths are given in characters:

Provided files, summary characteristics
File MBs Lines Characters_Millions LongestLine ShortestLine Average
data/final/de_DE/de_DE.blogs.txt 81.5 371440 83.2 10599 1 224
data/final/de_DE/de_DE.news.txt 91.16 244743 93.4 3949 1 382
data/final/de_DE/de_DE.twitter.txt 72.08 947774 72.8 140 2 77
data/final/en_US/en_US.blogs.txt 200.4 899288 206.8 40833 1 230
data/final/en_US/en_US.news.txt 196.3 1010242 203.2 11384 1 201
data/final/en_US/en_US.twitter.txt 159.4 2360148 162.1 140 2 69
data/final/fi_FI/fi_FI.blogs.txt 103.5 439785 102.9 18299 1 234
data/final/fi_FI/fi_FI.news.txt 89.87 485758 89.6 3820 1 184
data/final/fi_FI/fi_FI.twitter.txt 24.16 285214 23.7 140 1 83
data/final/ru_RU/ru_RU.blogs.txt 111.4 337100 64.1 7806 1 190
data/final/ru_RU/ru_RU.news.txt 113.5 196360 65 9540 1 331
data/final/ru_RU/ru_RU.twitter.txt 100.3 881414 58 140 3 66
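
The download step mentioned under Data files and the summary characteristics in the table above could be produced along the following lines. This is a sketch under stated assumptions, not the report's actual code: the helper names fetch_data and summarise_file are hypothetical, the archive URL is left as a parameter because the report does not state it, and converting bytes to MBs by dividing by 1024^2 is a guess at how the MBs column was computed.

```r
# Hypothetical helper: download and unpack the archive only if the data are missing.
# archive_url is an assumption; the report does not give the address.
fetch_data <- function(archive_url, data_dir = "data") {
  if (dir.exists(file.path(data_dir, "final"))) {
    return(invisible(FALSE))  # files already present, nothing to download
  }
  dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)
  archive <- tempfile(fileext = ".zip")
  on.exit(unlink(archive))
  download.file(archive_url, destfile = archive, mode = "wb")
  unzip(archive, exdir = data_dir)
  invisible(TRUE)
}

# Hypothetical helper: one row of summary characteristics for a text file.
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  n_chars <- nchar(lines, type = "chars", allowNA = TRUE)
  data.frame(
    File = path,
    MBs = round(file.size(path) / 1024^2, 2),  # assumed unit conversion
    Lines = length(lines),
    Characters_Millions = round(sum(n_chars, na.rm = TRUE) / 1e6, 1),
    LongestLine = max(n_chars, na.rm = TRUE),
    ShortestLine = min(n_chars, na.rm = TRUE),
    Average = round(mean(n_chars, na.rm = TRUE)),
    stringsAsFactors = FALSE
  )
}

# Possible usage once the files are in place:
# files <- list.files("data/final", recursive = TRUE, full.names = TRUE)
# summary_table <- do.call(rbind, lapply(files, summarise_file))
```

Reading each file fully into memory keeps the sketch short; for the largest files a chunked read may be preferable.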

Exploratory analysis

The following section presents an exploratory analysis of the provided data.

Data subsets

Following the discussion on Coursera, the analysis below is undertaken on subsets of the data, with the characteristics of each subset detailed in the table below. The sample sizes correspond to approximately 5% of the lines from the original files, as can be seen by comparing the line counts in the two tables.

Characteristics - Utilised subsets
Subset_For Lines Characters_Millions LongestLine ShortestLine Average
data/final/de_DE/de_DE.blogs.txt 18572 4.2 10599 2 225
data/final/de_DE/de_DE.news.txt 12237 4.7 2518 3 382
data/final/de_DE/de_DE.twitter.txt 47388 3.6 140 3 77
data/final/en_US/en_US.blogs.txt 44964 10.3 4714 2 229
data/final/en_US/en_US.news.txt 50512 10.2 3885 1 201
data/final/en_US/en_US.twitter.txt 118007 8.1 140 3 69
data/final/fi_FI/fi_FI.blogs.txt 21989 5.2 3207 2 235
data/final/fi_FI/fi_FI.news.txt 24287 4.5 1847 3 183
data/final/fi_FI/fi_FI.twitter.txt 14260 1.2 140 5 83
data/final/ru_RU/ru_RU.blogs.txt 16855 3.2 4777 1 189
data/final/ru_RU/ru_RU.news.txt 9818 3.3 2600 2 331
data/final/ru_RU/ru_RU.twitter.txt 44070 2.9 140 3 66
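
A random subset of this kind could be drawn as sketched below. The function name sample_lines, the seed, and the use of floor are assumptions rather than the report's own code, although floor(0.05 * n) does reproduce the line counts in the table above (for example, 947774 lines give 47388).

```r
# Hypothetical helper: keep a random fraction of the lines, preserving file order.
sample_lines <- function(lines, fraction = 0.05, seed = 1) {
  stopifnot(fraction > 0, fraction <= 1)
  set.seed(seed)  # fixed seed so the subset is reproducible
  n_keep <- floor(length(lines) * fraction)
  keep <- sort(sample.int(length(lines), n_keep))
  lines[keep]
}

# Usage on toy data
all_lines <- sprintf("line %d", 1:200)
lines_subset <- sample_lines(all_lines)
length(lines_subset)  # 10, i.e. floor(0.05 * 200)
```

Sampling without replacement means no line is counted twice, so word counts on the subset scale roughly with the sampling fraction.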

Word Frequencies Analysis

As a first step, the data are tokenised into separate words and a word frequency analysis is conducted. The tables below provide the initial word frequencies for each file in the data set. Note that the tables are based on the randomly selected samples described above, not on the full files.

Most frequent words for: data/final/de_DE/de_DE.blogs.txt
Word Occurrences
und 19120
die 18222
der 13904
ich 11507
in 8824
das 8322
zu 7090
ist 6871
es 6628
nicht 6448
den 6108
mit 5497
ein 5287
von 4855
auch 4834
sie 4748
auf 4692
sich 4300
eine 3910
für 3908
so 3625
im 3517
dem 3336
aber 3259
dass 3193
Most frequent words for: data/final/de_DE/de_DE.news.txt
Word Occurrences
die 24326
der 22992
und 14360
in 12345
den 7986
das 7938
von 6369
zu 6208
mit 6130
für 5322
auf 5228
im 5208
ist 5070
nicht 5036
sich 4883
ein 4799
es 4695
auch 4256
eine 4239
dem 4237
des 3998
sie 3591
als 3531
an 3194
er 3150
Most frequent words for: data/final/de_DE/de_DE.twitter.txt
Word Occurrences
ich 13352
die 11650
und 10499
der 9408
das 8403
ist 8153
in 6955
nicht 6767
es 4749
mit 4720
auf 4716
auch 4529
ein 4467
zu 4338
für 4262
den 4243
so 4059
noch 3809
mal 3580
im 3422
aber 3420
ja 3234
von 3140
was 3118
du 3023
Most frequent words for: data/final/en_US/en_US.blogs.txt
Word Occurrences
the 93155
and 54783
to 53497
i 45354
a 45187
of 43739
in 29823
that 24266
it 24202
is 21284
for 17834
you 16351
s 15973
with 14388
was 14057
on 13920
my 13415
this 12976
as 11201
have 11071
be 10479
but 10179
we 10060
t 9565
are 9558
Most frequent words for: data/final/en_US/en_US.news.txt
Word Occurrences
the 98469
to 45438
a 44658
and 44453
of 38810
in 33903
s 22384
that 18455
for 17948
it 14342
is 14053
on 13261
he 12869
with 12788
said 12657
was 11588
at 10571
i 9738
as 9257
his 7843
from 7726
be 7710
but 7564
have 7138
are 7013
Most frequent words for: data/final/en_US/en_US.twitter.txt
Word Occurrences
the 46636
i 45609
to 39898
a 30689
you 29965
and 21947
it 19219
for 19035
in 18936
of 18191
is 18034
s 15812
my 14591
on 13876
that 13550
t 10912
me 10287
at 9305
be 9243
your 8630
with 8616
have 8427
this 8359
we 8175
so 8061
Most frequent words for: data/final/fi_FI/fi_FI.blogs.txt
Word Occurrences
ja 21613
on 15739
että 7667
ei 7432
se 4692
mutta 4358
oli 3769
kun 3490
niin 3251
ole 3130
kuin 2742
tai 2569
sen 2428
myös 2410
jos 2374
ovat 2339
en 2228
joka 1975
nyt 1960
voi 1873
sitten 1844
sitä 1824
olen 1757
vain 1714
jo 1603
Most frequent words for: data/final/fi_FI/fi_FI.news.txt
Word Occurrences
ja 15890
on 13814
ei 4819
että 4701
oli 3172
myös 2737
ovat 2507
mutta 2466
se 2190
kun 2018
ole 2000
mukaan 1855
hän 1790
kuin 1785
tai 1403
jo 1390
sen 1363
joka 1287
niin 1276
voi 1245
nyt 1230
jos 1207
n 1120
vain 987
ollut 983
Most frequent words for: data/final/fi_FI/fi_FI.twitter.txt
Word Occurrences
ja 4111
on 3698
ei 2123
se 1148
nyt 1071
d 998
että 980
mutta 814
niin 771
kun 749
oli 745
en 664
tänään 594
jos 591
jo 583
ole 510
ihan 497
olla 461
vielä 456
hyvä 452
kyllä 449
voi 439
sitten 426
n 407
sitä 404
Most frequent words for: data/final/ru_RU/ru_RU.blogs.txt
Word Occurrences
и 16850
в 14750
на 8013
не 7082
с 5349
что 5264
я 4108
по 4019
а 3636
как 2984
для 2668
то 2629
это 2597
но 2479
из 2269
к 2134
все 1965
у 1944
за 1859
так 1812
от 1758
о 1753
он 1354
его 1292
или 1235
Most frequent words for: data/final/ru_RU/ru_RU.news.txt
Word Occurrences
в 17672
и 12639
на 8307
не 6137
что 4999
с 4874
по 4341
а 2722
как 2296
из 2182
к 2121
за 2079
это 2070
но 1968
для 1821
о 1693
то 1676
у 1532
от 1401
он 1392
его 1380
все 1301
я 1212
года 1190
мы 1138
Most frequent words for: data/final/ru_RU/ru_RU.twitter.txt
Word Occurrences
в 13598
и 11529
не 10920
на 8014
я 6806
что 6048
а 5976
с 5340
это 3873
как 3683
то 3526
у 3357
по 2663
за 2593
все 2573
так 2415
меня 2193
но 2152
ты 2031
да 1810
мне 1786
уже 1597
ну 1573
из 1562
только 1562
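
The tokenisation and counting behind tables like those above could be sketched as follows. This is an illustration, not the report's own code: the function name word_frequencies is hypothetical, and splitting lower-cased text on anything that is not a letter is an assumption, chosen because it would explain single-letter tokens such as "s" and "t" in the English tables (from "it's", "don't").

```r
# Hypothetical helper: top_n most frequent words in a character vector of lines.
word_frequencies <- function(lines, top_n = 25) {
  # Lower-case, then split on runs of non-letter characters (Unicode aware).
  tokens <- unlist(strsplit(tolower(lines), "[^\\p{L}]+", perl = TRUE))
  tokens <- tokens[nzchar(tokens)]  # drop empty strings left by leading separators
  tab <- table(tokens)
  counts <- sort(setNames(as.integer(tab), names(tab)), decreasing = TRUE)
  top <- head(counts, top_n)
  data.frame(Word = names(top), Occurrences = unname(top),
             stringsAsFactors = FALSE)
}

# Usage on toy data
word_frequencies(c("It's the end.", "The END!"))
```

A dedicated text-mining package would handle apostrophes, numbers and stop words more carefully; base R is used here only to keep the sketch self-contained.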

Relative frequencies

When expressed as relative frequencies within each text, the same words appear in the following order. Because the relative values are small, the figures are given in parts per thousand (per mille, ‰), which should not be confused with percentages (%).
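
As a brief aside, the conversion to per mille could be sketched as below. The function name per_mille is hypothetical, and the sketch assumes that the denominator is the total number of tokens in the sample, so the input should be the counts for the full vocabulary rather than only the most frequent words.

```r
# Hypothetical helper: relative frequencies in parts per thousand.
# counts must cover every word in the sample, not just the top entries.
per_mille <- function(counts, digits = 2) {
  stopifnot(is.numeric(counts), sum(counts) > 0)
  round(1000 * counts / sum(counts), digits)
}

# Usage on toy data
per_mille(c(the = 5, cat = 3, sat = 2))  # the = 500, cat = 300, sat = 200
```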

Most frequent words as ‰ for: data/final/de_DE/de_DE.blogs.txt
Word Occurrences as ‰ (per mille)
und 29.91
die 28.51
der 21.75
ich 18
in 13.81
das 13.02
zu 11.09
ist 10.75
es 10.37
nicht 10.09
den 9.56
mit 8.6
ein 8.27
von 7.6
auch 7.56
sie 7.43
auf 7.34
sich 6.73
eine 6.12
für 6.11
so 5.67
im 5.5
dem 5.22
aber 5.1
dass 5
Most frequent words as ‰ for: data/final/de_DE/de_DE.news.txt
Word Occurrences as ‰ (per mille)
die 36.23
der 34.24
und 21.39
in 18.39
den 11.89
das 11.82
von 9.49
zu 9.25
mit 9.13
für 7.93
auf 7.79
im 7.76
ist 7.55
nicht 7.5
sich 7.27
ein 7.15
es 6.99
auch 6.34
dem 6.31
eine 6.31
des 5.95
sie 5.35
als 5.26
an 4.76
er 4.69
Most frequent words as ‰ for: data/final/de_DE/de_DE.twitter.txt
Word Occurrences as ‰ (per mille)
ich 22.72
die 19.82
und 17.86
der 16.01
das 14.3
ist 13.87
in 11.83
nicht 11.51
es 8.08
mit 8.03
auf 8.02
auch 7.71
ein 7.6
zu 7.38
für 7.25
den 7.22
so 6.91
noch 6.48
mal 6.09
aber 5.82
im 5.82
ja 5.5
von 5.34
was 5.31
du 5.14
Most frequent words as ‰ for: data/final/en_US/en_US.blogs.txt
Word Occurrences as ‰ (per mille)
the 48.77
and 28.68
to 28.01
i 23.74
a 23.66
of 22.9
in 15.61
that 12.7
it 12.67
is 11.14
for 9.34
you 8.56
s 8.36
with 7.53
was 7.36
on 7.29
my 7.02
this 6.79
as 5.86
have 5.8
be 5.49
but 5.33
we 5.27
t 5.01
are 5
Most frequent words as ‰ for: data/final/en_US/en_US.news.txt
Word Occurrences as ‰ (per mille)
the 55.25
to 25.5
a 25.06
and 24.94
of 21.78
in 19.02
s 12.56
that 10.36
for 10.07
it 8.05
is 7.89
on 7.44
he 7.22
with 7.18
said 7.1
was 6.5
at 5.93
i 5.46
as 5.19
his 4.4
from 4.34
be 4.33
but 4.24
have 4.01
are 3.94
Most frequent words as ‰ for: data/final/en_US/en_US.twitter.txt
Word Occurrences as ‰ (per mille)
the 30.09
i 29.42
to 25.74
a 19.8
you 19.33
and 14.16
it 12.4
for 12.28
in 12.22
of 11.74
is 11.63
s 10.2
my 9.41
on 8.95
that 8.74
t 7.04
me 6.64
at 6
be 5.96
your 5.57
with 5.56
have 5.44
this 5.39
we 5.27
so 5.2
Most frequent words as ‰ for: data/final/fi_FI/fi_FI.blogs.txt
Word Occurrences as ‰ (per mille)
ja 33.48
on 24.38
että 11.88
ei 11.51
se 7.27
mutta 6.75
oli 5.84
kun 5.41
niin 5.04
ole 4.85
kuin 4.25
tai 3.98
sen 3.76
myös 3.73
jos 3.68
ovat 3.62
en 3.45
joka 3.06
nyt 3.04
voi 2.9
sitten 2.86
sitä 2.83
olen 2.72
vain 2.66
jo 2.48
Most frequent words as ‰ for: data/final/fi_FI/fi_FI.news.txt
Word Occurrences as ‰ (per mille)
ja 30.17
on 26.22
ei 9.15
että 8.92
oli 6.02
myös 5.2
ovat 4.76
mutta 4.68
se 4.16
kun 3.83
ole 3.8
mukaan 3.52
hän 3.4
kuin 3.39
tai 2.66
jo 2.64
sen 2.59
joka 2.44
niin 2.42
voi 2.36
nyt 2.34
jos 2.29
n 2.13
ollut 1.87
vain 1.87
Most frequent words as ‰ for: data/final/fi_FI/fi_FI.twitter.txt
Word Occurrences as ‰ (per mille)
ja 25.81
on 23.22
ei 13.33
se 7.21
nyt 6.72
d 6.27
että 6.15
mutta 5.11
niin 4.84
kun 4.7
oli 4.68
en 4.17
tänään 3.73
jos 3.71
jo 3.66
ole 3.2
ihan 3.12
olla 2.89
vielä 2.86
hyvä 2.84
kyllä 2.82
voi 2.76
sitten 2.67
n 2.56
sitä 2.54
Most frequent words as ‰ for: data/final/ru_RU/ru_RU.blogs.txt
Word Occurrences as ‰ (per mille)
и 35.85
в 31.38
на 17.05
не 15.07
с 11.38
что 11.2
я 8.74
по 8.55
а 7.74
как 6.35
для 5.68
то 5.59
это 5.53
но 5.27
из 4.83
к 4.54
все 4.18
у 4.14
за 3.96
так 3.86
от 3.74
о 3.73
он 2.88
его 2.75
или 2.63
Most frequent words as ‰ for: data/final/ru_RU/ru_RU.news.txt
Word Occurrences as ‰ (per mille)
в 38.84
и 27.78
на 18.26
не 13.49
что 10.99
с 10.71
по 9.54
а 5.98
как 5.05
из 4.8
к 4.66
за 4.57
это 4.55
но 4.32
для 4
о 3.72
то 3.68
у 3.37
от 3.08
он 3.06
его 3.03
все 2.86
я 2.66
года 2.62
мы 2.5
Most frequent words as ‰ for: data/final/ru_RU/ru_RU.twitter.txt
Word Occurrences as ‰ (per mille)
в 29.2
и 24.76
не 23.45
на 17.21
я 14.61
что 12.99
а 12.83
с 11.47
это 8.32
как 7.91
то 7.57
у 7.21
по 5.72
за 5.57
все 5.52
так 5.19
меня 4.71
но 4.62
ты 4.36
да 3.89
мне 3.84
уже 3.43
ну 3.38
из 3.35
только 3.35

Histograms

The word subsets are additionally analysed with the use of the histograms below.
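
A plot of the most frequent words could be drawn with base R graphics as sketched below. This assumes the plots are bar charts of word counts, which the text does not confirm; the function name plot_top_words is hypothetical, and the three counts in the usage example are taken from the de_DE blogs table earlier in this report.

```r
# Hypothetical helper: bar chart of word counts from a named numeric vector.
plot_top_words <- function(counts, title = "Most frequent words") {
  stopifnot(is.numeric(counts), !is.null(names(counts)))
  # las = 2 rotates the word labels so that longer words stay readable.
  mids <- barplot(counts, las = 2, main = title, ylab = "Occurrences")
  invisible(mids)  # bar midpoints, useful for adding labels later
}

# Usage with the first three rows of the de_DE blogs table
plot_top_words(c(und = 19120, die = 18222, der = 13904),
               title = "data/final/de_DE/de_DE.blogs.txt")
```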