The aim of this report is to present an exploratory data analysis of the Coursera SwiftKey corpus (composed of three files: news, blogs, and Twitter samples, available here) and to state the goals for the next steps regarding the prediction algorithm(s) to use and the resulting Shiny application.
Here, I present a basic summary of the three source files (downloaded from the Coursera website for the capstone project) with the following rubrics:
| filename | file_size (bytes) | num_lines | num_words | num_chars | max_len_chars | time_elapsed (s) |
|---|---|---|---|---|---|---|
| en_US.blogs.txt | 210,160,014 | 899,288 | 37,334,131 | 206,824,505 | 40,833 | 44.753 |
| en_US.news.txt | 205,811,889 | 1,010,242 | 34,372,530 | 203,223,159 | 11,384 | 41.189 |
| en_US.twitter.txt | 167,105,338 | 2,360,148 | 30,373,583 | 162,096,241 | 140 | 77.038 |
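As a hedged illustration of how these rubrics can be computed with core R functions, here is a minimal sketch; the helper name summarize_file() and its internals are mine, not the report's actual code:

```r
# Minimal sketch, assuming plain-text input with one entry per line.
# summarize_file() is an illustrative helper, not the code actually used.
summarize_file <- function(path) {
  t <- system.time({
    lines     <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
    num_words <- sum(lengths(strsplit(lines, "\\s+")))
    num_chars <- sum(nchar(lines))
    max_len   <- max(nchar(lines))
  })
  data.frame(
    filename      = basename(path),
    file_size     = file.size(path),      # bytes
    num_lines     = length(lines),
    num_words     = num_words,
    num_chars     = num_chars,
    max_len_chars = max_len,
    time_elapsed  = unname(t["elapsed"])  # seconds
  )
}

# do.call(rbind, lapply(c("en_US.blogs.txt", "en_US.news.txt",
#                         "en_US.twitter.txt"), summarize_file))
```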
For this step, I used the following ordered sequence of steps (written with core R functions):
Remarks:
The bad-word list contains 1,265 entries (note: some entries span multiple words per line, and some are not ‘bad’ words per se but depend on context; not being a native English speaker nor an expert on bad words, I kept almost all the entries).

Below is the same summary for the cleaned-up files:

| filename | file_size (bytes) | num_lines | num_words | num_chars | max_len_chars | time_elapsed (s) |
|---|---|---|---|---|---|---|
| en_US.blogs.clean_100p.txt | 183,208,188 | 2,341,953 | 33,709,011 | 180,866,235 | 2,941 | 80.285 |
| en_US.news.clean_100p.txt | 180,285,309 | 2,056,668 | 30,976,197 | 178,228,641 | 3,213 | 71.603 |
| en_US.twitter.clean_100p.txt | 141,623,805 | 3,794,753 | 27,022,842 | 137,829,052 | 140 | 126.605 |
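Purely as a rough illustration (not the actual succession of steps used), a core-R cleanup pass could look like the sketch below; the normalization steps and the file name bad_words.txt are assumptions, only the filtering against the 1,265-entry bad-word list being mentioned above:

```r
# Rough sketch only; the normalization steps are assumptions, not the steps actually used.
# Assumption: bad_words.txt holds the 1,265-entry list, one (possibly multi-word) entry per line.
clean_file <- function(in_path, out_path, bad_words_path = "bad_words.txt") {
  lines <- readLines(in_path, encoding = "UTF-8", skipNul = TRUE)

  # Assumed normalization: lower-case and strip non-ASCII characters.
  lines <- tolower(iconv(lines, from = "UTF-8", to = "ASCII", sub = ""))

  # Naive profanity removal: delete every occurrence of each list entry
  # (crude substring matching; good enough for a sketch).
  bad <- readLines(bad_words_path, encoding = "UTF-8")
  for (w in bad) lines <- gsub(w, "", lines, fixed = TRUE)

  writeLines(lines, out_path)
}

# clean_file("en_US.blogs.txt", "en_US.blogs.clean_100p.txt")
```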
Comparison with the original data, expressed as the percentage removed from the original file, reported only for the file size, the number of words, and the number of characters. For example, the first row tells us that for en_US.blogs.txt the file size, word count, and character count were reduced by 12.82%, 9.71%, and 12.55%, respectively.

| filename | diff_size | diff_words | diff_chars |
|---|---|---|---|
| en_US.blogs.clean_100p.txt | 12.82% | 9.71% | 12.55% |
| en_US.news.clean_100p.txt | 12.4% | 9.88% | 12.3% |
| en_US.twitter.clean_100p.txt | 15.25% | 11.03% | 14.97% |
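For reference, the diff columns are simply (original - cleaned) / original * 100; with the two summary tables held in data frames named, say, orig and clean (names of my choosing), they can be reproduced as follows:

```r
# Sketch: orig and clean are assumed to hold the two summary tables above,
# with their rows in the same file order.
diff_pct <- function(before, after) round(100 * (before - after) / before, 2)

data.frame(
  filename   = clean$filename,
  diff_size  = diff_pct(orig$file_size, clean$file_size),
  diff_words = diff_pct(orig$num_words, clean$num_words),
  diff_chars = diff_pct(orig$num_chars, clean$num_chars)
)
```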
I created a corpus based on the three cleaned-up files obtained in the previous stage, using the R quanteda package (quanteda: Quantitative Analysis of Textual Data). Then I defined a document-feature matrix from which I built my 5-gram model (Wikipedia: n-gram), all the way down to uni-grams.
I then computed the frequencies for my 5-gram model (down to uni-grams), using the textstat_frequency() function (from the R quanteda package).
In the following sections, I present results from a 20% sample of the three cleaned-up files (the corpus), which is itself split into three subsets: training (80%), development (10%), and testing (10%). The results below come from the training set.
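A condensed sketch of this pipeline (20% sample, rough 80/10/10 split, n-gram tokenization, and frequency counts) is given below; the object names are mine, and in recent quanteda releases textstat_frequency() is provided by the companion package quanteda.textstats:

```r
library(quanteda)
library(quanteda.textstats)  # textstat_frequency() lives here in recent releases

set.seed(1234)  # arbitrary seed for reproducibility

# Assumed file names: sample 20% of the lines of the three cleaned files.
files <- c("en_US.blogs.clean_100p.txt",
           "en_US.news.clean_100p.txt",
           "en_US.twitter.clean_100p.txt")
lines <- unlist(lapply(files, readLines, encoding = "UTF-8"))
lines <- sample(lines, size = round(0.20 * length(lines)))

# Roughly 80/10/10 split into training / development / testing.
part  <- sample(c("train", "dev", "test"), length(lines),
                replace = TRUE, prob = c(0.8, 0.1, 0.1))
train <- lines[part == "train"]

# Corpus -> tokens -> n-grams -> document-feature matrix -> frequencies.
corp  <- corpus(train)
toks  <- tokens(corp, remove_punct = TRUE)
freqs <- lapply(1:5, function(n) {
  ng <- tokens_ngrams(toks, n = n, concatenator = " ")
  textstat_frequency(dfm(ng))
})

head(freqs[[5]], 50)  # top 50 5-grams, as reported below
```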
| n-grams | number of instances |
|---|---|
| 1-grams | 249,801 |
| 2-grams | 3,188,407 |
| 3-grams | 7,619,595 |
| 4-grams | 9,666,425 |
| 5-grams | 9,608,035 |
Top 50 5-grams for our model.
feature frequency rank
1 at the end of the 590 1
2 for the first time in 256 2
3 in the middle of the 255 3
4 the end of the day 194 4
5 by the end of the 185 5
6 thank you so much for 175 6
7 for the rest of the 171 7
8 is going to be a 154 8
9 there are a lot of 144 9
10 it's going to be a 142 10
11 to be a part of 133 11
12 the other side of the 129 12
13 thanks for the shout out 126 13
14 i can't wait to see 121 14
15 is one of the most 120 15
16 the end of the year 118 16
17 at the top of the 116 17
18 can't wait to see you 116 18
19 this is going to be 113 19
20 on the other side of 110 20
21 let me know if you 107 21
22 for the first time since 103 22
23 i love you so much 99 23
24 and the rest of the 94 24
25 for those of you who 93 25
26 the end of the month 93 26
27 in the middle of a 92 27
28 keep up the good work 92 28
29 thanks so much for the 90 29
30 at the bottom of the 89 30
31 thank you for the follow 88 31
32 hope you have a great 83 32
33 but at the same time 82 33
34 i thought it would be 82 34
35 to figure out how to 80 35
36 the rest of the day 80 36
37 let me know what you 79 37
38 to be one of the 78 38
39 if you would like to 78 39
40 in the bottom of the 77 40
41 happy mother's day to all 75 41
42 this is the first time 73 42
43 let us know if you 73 43
44 for a chance to win 72 44
45 at the beginning of the 70 45
46 to find a way to 69 46
47 there is a lot of 68 47
48 going to be a great 68 48
49 going to be a good 68 49
50 it was going to be 67 50
Most frequent 5-grams
Word cloud of most frequent 5-grams
Top 50 4-grams for our model.
feature frequency rank
1 the end of the 1215 1
2 at the end of 1042 2
3 the rest of the 987 3
4 for the first time 948 4
5 thanks for the follow 940 5
6 at the same time 746 6
7 is going to be 654 7
8 is one of the 610 8
9 one of the most 582 9
10 when it comes to 572 10
11 going to be a 560 11
12 in the middle of 557 12
13 to be able to 538 13
14 thanks for the rt 534 14
15 if you want to 509 15
16 thank you for the 483 16
17 can't wait to see 478 17
18 one of the best 461 18
19 thank you so much 445 19
20 i don't want to 441 20
21 i am going to 388 21
22 in the united states 377 22
23 i would like to 376 23
24 i can't wait to 356 24
25 the top of the 354 25
26 it's going to be 351 26
27 by the end of 346 27
28 i wish i could 336 28
29 one of my favorite 332 29
30 the middle of the 328 30
31 i was going to 328 31
32 for the rest of 328 32
33 a lot of people 320 33
34 in front of the 304 34
35 on the other hand 299 35
36 a bit of a 298 36
37 what do you think 297 37
38 was one of the 291 38
39 the first time in 289 39
40 the bottom of the 286 40
41 as well as the 280 41
42 i just want to 273 42
43 i was able to 271 43
44 said in a statement 269 44
45 you don't have to 267 45
46 have a lot of 267 46
47 i don't know what 267 47
48 to go to the 264 48
49 have a great day 261 49
50 hope to see you 261 50
Most frequent 4-grams
Word cloud of most frequent 4-grams
Top 50 3-grams for our model.
feature frequency rank
1 one of the 5052 1
2 a lot of 4620 2
3 thanks for the 3788 3
4 going to be 2633 4
5 to be a 2615 5
6 the end of 2307 6
7 i want to 2243 7
8 it was a 2223 8
9 out of the 2213 9
10 some of the 2037 10
11 as well as 2029 11
12 be able to 1956 12
13 part of the 1883 13
14 i have a 1801 14
15 i have to 1700 15
16 looking forward to 1678 16
17 the rest of 1667 17
18 i don't know 1657 18
19 thank you for 1600 19
20 the first time 1554 20
21 is going to 1514 21
22 a couple of 1499 22
23 i love you 1480 23
24 this is a 1474 24
25 end of the 1441 25
26 i need to 1437 26
27 you have to 1415 27
28 you want to 1407 28
29 i'm going to 1361 29
30 there is a 1351 30
31 can't wait to 1315 31
32 in the world 1310 32
33 the fact that 1305 33
34 this is the 1291 34
35 at the end 1290 35
36 it would be 1279 36
37 to go to 1279 37
38 one of my 1277 38
39 there is no 1224 39
40 for the first 1222 40
41 it is a 1216 41
42 i don't think 1195 42
43 is one of 1191 43
44 for the follow 1177 44
45 in the first 1156 45
46 to have a 1151 46
47 most of the 1123 47
48 all of the 1109 48
49 in front of 1093 49
50 of the year 1079 50
Most frequent 3-grams
Word cloud of most frequent 3-grams
Top 50 bi-grams for our model.
feature frequency rank
1 of the 63376 1
2 in the 59531 2
3 to the 31422 3
4 for the 30487 4
5 on the 28911 5
6 to be 23992 6
7 at the 21017 7
8 and the 18175 8
9 in a 17112 9
10 with the 15784 10
11 is a 14763 11
12 it was 14557 12
13 for a 14070 13
14 i have 13221 14
15 i was 12793 15
16 from the 12774 16
17 will be 12398 17
18 and i 12129 18
19 it is 12053 19
20 with a 11854 20
21 going to 11830 21
22 i am 11732 22
23 of a 11365 23
24 have a 11150 24
25 if you 10941 25
26 is the 10935 26
27 one of 10643 27
28 to get 10398 28
29 as a 9862 29
30 want to 9624 30
31 have to 9311 31
32 this is 8994 32
33 by the 8862 33
34 i think 8815 34
35 to do 8714 35
36 that the 8655 36
37 the first 8473 37
38 and a 8435 38
39 i don't 8408 39
40 to see 8205 40
41 to a 8156 41
42 out of 8145 42
43 was a 8144 43
44 on a 7917 44
45 that i 7802 45
46 but i 7800 46
47 i love 7754 47
48 all the 7484 48
49 you can 7376 49
50 to make 7371 50
Most frequent 2-grams
Word cloud of most frequent 2-grams
Top 50 uni-grams for our model.
feature frequency rank
1 the 701491 1
2 to 409185 2
3 and 353071 3
4 a 350075 4
5 of 293013 5
6 i 248945 6
7 in 241229 7
8 for 165302 8
9 is 159340 9
10 that 152626 10
11 you 141593 11
12 it 136952 12
13 on 121136 13
14 with 105561 14
15 was 91207 15
16 my 89682 16
17 at 84793 17
18 be 81832 18
19 this 81189 19
20 have 80262 20
21 are 73441 21
22 but 72151 22
23 as 70751 23
24 we 64082 24
25 he 63103 25
26 not 60751 26
27 so 57590 27
28 from 56703 28
29 me 54238 29
30 all 49032 30
31 will 47779 31
32 they 47378 32
33 by 45835 33
34 just 45508 34
35 or 45262 35
36 your 44972 36
37 said 44948 37
38 an 43901 38
39 out 43876 39
40 about 43796 40
41 his 42975 41
42 up 42880 42
43 one 42763 43
44 what 41936 44
45 if 41199 45
46 like 39378 46
47 when 38221 47
48 has 38216 48
49 can 37149 49
50 more 36578 50
Most frequent unigrams
Word cloud of most frequent unigrams
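The word clouds above can be drawn with quanteda's plotting companion package; a minimal sketch, assuming the 5-gram document-feature matrix has been kept in an object named dfm_5 (a name of my choosing):

```r
library(quanteda.textplots)  # textplot_wordcloud() (formerly in quanteda itself)

# Assumption: dfm_5 is the 5-gram document-feature matrix built earlier.
textplot_wordcloud(dfm_5, max_words = 50)
```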
At this stage, I still need to spend some time on NLP resources to gain a better understanding of this topic.
I am planning to use the following algorithms (depending on the time I have) on the 5-gram model I built for the word prediction application:
There is also work to be done on making the application as resource-efficient as possible without compromising accuracy too much (a trade-off). I am not yet ready to tackle this, but I am considering options.
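The candidate algorithms are not settled yet; purely as an illustration of how the frequency tables built earlier could feed a predictor, and not as one of those candidates, a naive highest-frequency lookup might look like this:

```r
# Naive illustration only; not one of the algorithms under consideration.
# Assumption: freqs is the list of textstat_frequency() tables built earlier,
# where freqs[[n]] holds the n-gram frequencies with space-separated features.
predict_next <- function(input, freqs, max_order = 5) {
  words <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  for (n in max_order:2) {
    if (length(words) < n - 1) next
    context <- paste(tail(words, n - 1), collapse = " ")
    tab  <- freqs[[n]]
    hits <- tab[startsWith(tab$feature, paste0(context, " ")), ]
    if (nrow(hits) > 0) {
      best <- hits$feature[which.max(hits$frequency)]
      return(tail(strsplit(best, " ")[[1]], 1))  # last word of the best match
    }
  }
  freqs[[1]]$feature[1]  # fall back to the most frequent unigram
}

# predict_next("at the end of", freqs)  # likely "the"
```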
These are the resources I am using for NLP and this project: