Introduction
Presentation
Following the discussion on whether code should be shown in a final report intended for non-data scientists, the code chunks in this report are hidden by default (echo=FALSE).
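As an illustration, hiding code by default is usually done once in a setup chunk with knitr. The snippet below is a minimal sketch that assumes the report is knitted with the knitr package; where it sits in the source document is an assumption.

```r
# Assumed setup chunk: hide the source code of every chunk in the rendered
# report. An individual chunk can still opt in with echo=TRUE in its header.
knitr::opts_chunk$set(echo = FALSE)
```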
Data files
The following code assumes that the required files are available in the data folder. If needed, the files can be downloaded and extracted with the download.file and unzip commands in R. This document provides the exploratory analysis for the Data Science Capstone. The report summarises the descriptive analysis carried out on a set of files with the characteristics detailed in the table below:
Provided files, summary characteristics
| File | Size (MB) | Lines | Characters (millions) | Longest line (chars) | Shortest line (chars) | Mean line length (chars) |
|---|---|---|---|---|---|---|
| data/final/de_DE/de_DE.blogs.txt | 81.5 | 371440 | 83.2 | 10599 | 1 | 224 |
| data/final/de_DE/de_DE.news.txt | 91.16 | 244743 | 93.4 | 3949 | 1 | 382 |
| data/final/de_DE/de_DE.twitter.txt | 72.08 | 947774 | 72.8 | 140 | 2 | 77 |
| data/final/en_US/en_US.blogs.txt | 200.4 | 899288 | 206.8 | 40833 | 1 | 230 |
| data/final/en_US/en_US.news.txt | 196.3 | 1010242 | 203.2 | 11384 | 1 | 201 |
| data/final/en_US/en_US.twitter.txt | 159.4 | 2360148 | 162.1 | 140 | 2 | 69 |
| data/final/fi_FI/fi_FI.blogs.txt | 103.5 | 439785 | 102.9 | 18299 | 1 | 234 |
| data/final/fi_FI/fi_FI.news.txt | 89.87 | 485758 | 89.6 | 3820 | 1 | 184 |
| data/final/fi_FI/fi_FI.twitter.txt | 24.16 | 285214 | 23.7 | 140 | 1 | 83 |
| data/final/ru_RU/ru_RU.blogs.txt | 111.4 | 337100 | 64.1 | 7806 | 1 | 190 |
| data/final/ru_RU/ru_RU.news.txt | 113.5 | 196360 | 65 | 9540 | 1 | 331 |
| data/final/ru_RU/ru_RU.twitter.txt | 100.3 | 881414 | 58 | 140 | 3 | 66 |
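The steps described above (obtaining the files, then summarising each one) could be sketched in base R as follows. This is a minimal sketch, not the code behind the report: the function names `fetch_corpus` and `summarise_file` are hypothetical, the archive URL is left as a parameter because the report does not give one, and the reading of the columns (size in MB, line count, characters in millions, longest, shortest and mean line length) is inferred from the figures in the table.

```r
# Hypothetical helper: download and extract the corpus if the archive is missing.
# `zip_url` must be supplied by the caller; no URL is given in this report.
fetch_corpus <- function(zip_url, data_dir = "data") {
  if (!dir.exists(data_dir)) dir.create(data_dir, recursive = TRUE)
  zip_path <- file.path(data_dir, basename(zip_url))
  if (!file.exists(zip_path)) download.file(zip_url, zip_path, mode = "wb")
  unzip(zip_path, exdir = data_dir)
}

# Hypothetical helper: one row of summary characteristics for a text file.
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  n_chars <- nchar(lines, type = "chars", allowNA = TRUE)
  data.frame(
    file           = path,
    size_mb        = file.size(path) / 1024^2,          # size on disk
    lines          = length(lines),                     # number of lines
    chars_millions = sum(n_chars, na.rm = TRUE) / 1e6,  # total characters
    max_line       = max(n_chars, na.rm = TRUE),        # longest line
    min_line       = min(n_chars, na.rm = TRUE),        # shortest line
    mean_line      = mean(n_chars, na.rm = TRUE),       # mean line length
    stringsAsFactors = FALSE
  )
}

# Example usage (assumes the files already sit under data/final):
# files <- list.files("data/final", pattern = "\\.txt$",
#                     recursive = TRUE, full.names = TRUE)
# do.call(rbind, lapply(files, summarise_file))
```

Reading a whole file with `readLines` keeps the sketch short; for the largest files a chunked reader may be preferable if memory is limited.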
Exploratory analysis
The following section presents an exploratory analysis of the provided data.
Data subsets
Following the discussion on Coursera, the analysis below is undertaken on data subsets, with each subset corresponding to the characteristics detailed in the table below. The sample sizes correspond to approximately 5% of the lines from the original files (for example, 18572 of the 371440 lines in de_DE.blogs.txt).
Characteristics - Utilised subsets
| File | Lines | Characters (millions) | Longest line (chars) | Shortest line (chars) | Mean line length (chars) |
|---|---|---|---|---|---|
| data/final/de_DE/de_DE.blogs.txt | 18572 | 4.2 | 10599 | 2 | 225 |
| data/final/de_DE/de_DE.news.txt | 12237 | 4.7 | 2518 | 3 | 382 |
| data/final/de_DE/de_DE.twitter.txt | 47388 | 3.6 | 140 | 3 | 77 |
| data/final/en_US/en_US.blogs.txt | 44964 | 10.3 | 4714 | 2 | 229 |
| data/final/en_US/en_US.news.txt | 50512 | 10.2 | 3885 | 1 | 201 |
| data/final/en_US/en_US.twitter.txt | 118007 | 8.1 | 140 | 3 | 69 |
| data/final/fi_FI/fi_FI.blogs.txt | 21989 | 5.2 | 3207 | 2 | 235 |
| data/final/fi_FI/fi_FI.news.txt | 24287 | 4.5 | 1847 | 3 | 183 |
| data/final/fi_FI/fi_FI.twitter.txt | 14260 | 1.2 | 140 | 5 | 83 |
| data/final/ru_RU/ru_RU.blogs.txt | 16855 | 3.2 | 4777 | 1 | 189 |
| data/final/ru_RU/ru_RU.news.txt | 9818 | 3.3 | 2600 | 2 | 331 |
| data/final/ru_RU/ru_RU.twitter.txt | 44070 | 2.9 | 140 | 3 | 66 |
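A random subset of lines like the ones summarised above could be drawn as in the sketch below. The function name `sample_lines`, the seed, and the choice to keep the sampled lines in their original order are assumptions. The default fraction of 0.05 is inferred from the line counts in the two tables (for instance, 18572 of 371440 lines for de_DE.blogs.txt), and rounding down appears to match the reported subset sizes.

```r
# Hypothetical helper: draw a reproducible random subset of lines.
sample_lines <- function(lines, fraction = 0.05, seed = 2024) {
  set.seed(seed)                                 # fixed seed for reproducibility
  n <- floor(length(lines) * fraction)           # subset size, rounded down
  keep <- sort(sample.int(length(lines), n))     # sample without replacement
  lines[keep]                                    # preserve original line order
}

# Example usage (assumes the file exists locally):
# blogs  <- readLines("data/final/en_US/en_US.blogs.txt",
#                     encoding = "UTF-8", skipNul = TRUE)
# subset <- sample_lines(blogs)
```

Sampling line indices rather than the lines themselves makes it easy to keep the subset in file order, which helps when comparing a sample against its source.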
Word Frequencies Analysis
As a first step, the data is tokenized into separate words and a word frequency analysis is conducted. The tables below provide the initial word frequency analysis for each of the files in the data set. It should be noted that the tables correspond to randomly selected samples of the files rather than to the full files.
Most frequent words for: data/final/de_DE/de_DE.blogs.txt
| Word | Frequency |
|---|---|
| und | 19120 |
| die | 18222 |
| der | 13904 |
| ich | 11507 |
| in | 8824 |
| das | 8322 |
| zu | 7090 |
| ist | 6871 |
| es | 6628 |
| nicht | 6448 |
| den | 6108 |
| mit | 5497 |
| ein | 5287 |
| von | 4855 |
| auch | 4834 |
| sie | 4748 |
| auf | 4692 |
| sich | 4300 |
| eine | 3910 |
| für | 3908 |
| so | 3625 |
| im | 3517 |
| dem | 3336 |
| aber | 3259 |
| dass | 3193 |
Most frequent words for: data/final/de_DE/de_DE.news.txt
| Word | Frequency |
|---|---|
| die | 24326 |
| der | 22992 |
| und | 14360 |
| in | 12345 |
| den | 7986 |
| das | 7938 |
| von | 6369 |
| zu | 6208 |
| mit | 6130 |
| für | 5322 |
| auf | 5228 |
| im | 5208 |
| ist | 5070 |
| nicht | 5036 |
| sich | 4883 |
| ein | 4799 |
| es | 4695 |
| auch | 4256 |
| eine | 4239 |
| dem | 4237 |
| des | 3998 |
| sie | 3591 |
| als | 3531 |
| an | 3194 |
| er | 3150 |
Most frequent words for: data/final/de_DE/de_DE.twitter.txt
| Word | Frequency |
|---|---|
| ich | 13352 |
| die | 11650 |
| und | 10499 |
| der | 9408 |
| das | 8403 |
| ist | 8153 |
| in | 6955 |
| nicht | 6767 |
| es | 4749 |
| mit | 4720 |
| auf | 4716 |
| auch | 4529 |
| ein | 4467 |
| zu | 4338 |
| für | 4262 |
| den | 4243 |
| so | 4059 |
| noch | 3809 |
| mal | 3580 |
| im | 3422 |
| aber | 3420 |
| ja | 3234 |
| von | 3140 |
| was | 3118 |
| du | 3023 |
Most frequent words for: data/final/en_US/en_US.blogs.txt
| Word | Frequency |
|---|---|
| the | 93155 |
| and | 54783 |
| to | 53497 |
| i | 45354 |
| a | 45187 |
| of | 43739 |
| in | 29823 |
| that | 24266 |
| it | 24202 |
| is | 21284 |
| for | 17834 |
| you | 16351 |
| s | 15973 |
| with | 14388 |
| was | 14057 |
| on | 13920 |
| my | 13415 |
| this | 12976 |
| as | 11201 |
| have | 11071 |
| be | 10479 |
| but | 10179 |
| we | 10060 |
| t | 9565 |
| are | 9558 |
Most frequent words for: data/final/en_US/en_US.news.txt
| Word | Frequency |
|---|---|
| the | 98469 |
| to | 45438 |
| a | 44658 |
| and | 44453 |
| of | 38810 |
| in | 33903 |
| s | 22384 |
| that | 18455 |
| for | 17948 |
| it | 14342 |
| is | 14053 |
| on | 13261 |
| he | 12869 |
| with | 12788 |
| said | 12657 |
| was | 11588 |
| at | 10571 |
| i | 9738 |
| as | 9257 |
| his | 7843 |
| from | 7726 |
| be | 7710 |
| but | 7564 |
| have | 7138 |
| are | 7013 |
Most frequent words for: data/final/en_US/en_US.twitter.txt
| Word | Frequency |
|---|---|
| the | 46636 |
| i | 45609 |
| to | 39898 |
| a | 30689 |
| you | 29965 |
| and | 21947 |
| it | 19219 |
| for | 19035 |
| in | 18936 |
| of | 18191 |
| is | 18034 |
| s | 15812 |
| my | 14591 |
| on | 13876 |
| that | 13550 |
| t | 10912 |
| me | 10287 |
| at | 9305 |
| be | 9243 |
| your | 8630 |
| with | 8616 |
| have | 8427 |
| this | 8359 |
| we | 8175 |
| so | 8061 |
Most frequent words for: data/final/fi_FI/fi_FI.blogs.txt
| Word | Frequency |
|---|---|
| ja | 21613 |
| on | 15739 |
| että | 7667 |
| ei | 7432 |
| se | 4692 |
| mutta | 4358 |
| oli | 3769 |
| kun | 3490 |
| niin | 3251 |
| ole | 3130 |
| kuin | 2742 |
| tai | 2569 |
| sen | 2428 |
| myös | 2410 |
| jos | 2374 |
| ovat | 2339 |
| en | 2228 |
| joka | 1975 |
| nyt | 1960 |
| voi | 1873 |
| sitten | 1844 |
| sitä | 1824 |
| olen | 1757 |
| vain | 1714 |
| jo | 1603 |
Most frequent words for: data/final/fi_FI/fi_FI.news.txt
| Word | Frequency |
|---|---|
| ja | 15890 |
| on | 13814 |
| ei | 4819 |
| että | 4701 |
| oli | 3172 |
| myös | 2737 |
| ovat | 2507 |
| mutta | 2466 |
| se | 2190 |
| kun | 2018 |
| ole | 2000 |
| mukaan | 1855 |
| hän | 1790 |
| kuin | 1785 |
| tai | 1403 |
| jo | 1390 |
| sen | 1363 |
| joka | 1287 |
| niin | 1276 |
| voi | 1245 |
| nyt | 1230 |
| jos | 1207 |
| n | 1120 |
| vain | 987 |
| ollut | 983 |
Most frequent words for: data/final/fi_FI/fi_FI.twitter.txt
| Word | Frequency |
|---|---|
| ja | 4111 |
| on | 3698 |
| ei | 2123 |
| se | 1148 |
| nyt | 1071 |
| d | 998 |
| että | 980 |
| mutta | 814 |
| niin | 771 |
| kun | 749 |
| oli | 745 |
| en | 664 |
| tänään | 594 |
| jos | 591 |
| jo | 583 |
| ole | 510 |
| ihan | 497 |
| olla | 461 |
| vielä | 456 |
| hyvä | 452 |
| kyllä | 449 |
| voi | 439 |
| sitten | 426 |
| n | 407 |
| sitä | 404 |
Most frequent words for: data/final/ru_RU/ru_RU.blogs.txt
| Word | Frequency |
|---|---|
| и | 16850 |
| в | 14750 |
| на | 8013 |
| не | 7082 |
| с | 5349 |
| что | 5264 |
| я | 4108 |
| по | 4019 |
| а | 3636 |
| как | 2984 |
| для | 2668 |
| то | 2629 |
| это | 2597 |
| но | 2479 |
| из | 2269 |
| к | 2134 |
| все | 1965 |
| у | 1944 |
| за | 1859 |
| так | 1812 |
| от | 1758 |
| о | 1753 |
| он | 1354 |
| его | 1292 |
| или | 1235 |
Most frequent words for: data/final/ru_RU/ru_RU.news.txt
| Word | Frequency |
|---|---|
| в | 17672 |
| и | 12639 |
| на | 8307 |
| не | 6137 |
| что | 4999 |
| с | 4874 |
| по | 4341 |
| а | 2722 |
| как | 2296 |
| из | 2182 |
| к | 2121 |
| за | 2079 |
| это | 2070 |
| но | 1968 |
| для | 1821 |
| о | 1693 |
| то | 1676 |
| у | 1532 |
| от | 1401 |
| он | 1392 |
| его | 1380 |
| все | 1301 |
| я | 1212 |
| года | 1190 |
| мы | 1138 |
Most frequent words for: data/final/ru_RU/ru_RU.twitter.txt
| Word | Frequency |
|---|---|
| в | 13598 |
| и | 11529 |
| не | 10920 |
| на | 8014 |
| я | 6806 |
| что | 6048 |
| а | 5976 |
| с | 5340 |
| это | 3873 |
| как | 3683 |
| то | 3526 |
| у | 3357 |
| по | 2663 |
| за | 2593 |
| все | 2573 |
| так | 2415 |
| меня | 2193 |
| но | 2152 |
| ты | 2031 |
| да | 1810 |
| мне | 1786 |
| уже | 1597 |
| ну | 1573 |
| из | 1562 |
| только | 1562 |
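Word frequency tables like those above could be produced with a short base R function such as the sketch below. The function name `word_frequencies` and the tokenization rule are assumptions: single-letter tokens such as "s" and "t" in the English tables suggest that the text was lower-cased and split on non-letter characters, but the report does not state the rule. The default of 25 rows matches the length of the tables.

```r
# Hypothetical helper: top-n word counts for a character vector of lines.
word_frequencies <- function(lines, top_n = 25) {
  text <- tolower(lines)
  # Split on runs of non-letter characters; \p{L} matches any Unicode letter,
  # so accented and Cyrillic words stay intact.
  tokens <- unlist(strsplit(text, "[^\\p{L}]+", perl = TRUE))
  tokens <- tokens[nzchar(tokens)]               # drop empty strings
  counts <- sort(table(tokens), decreasing = TRUE)
  freq <- data.frame(
    word      = names(counts),
    frequency = as.integer(counts),
    stringsAsFactors = FALSE
  )
  head(freq, top_n)
}

# Example usage on two short lines:
# word_frequencies(c("The cat and the dog", "the Cat's toy"), top_n = 2)
```

Splitting on non-letters also discards digits and punctuation, which is why apostrophes leave fragments such as "s" behind; a dedicated tokenizer would be needed to keep contractions whole.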