library(tidyverse)
library(readtext)
library(quanteda)
library(quanteda.textstats)
x <- list.files(pattern = "docx|pdf", recursive = TRUE) %>%
  readtext(ignore_missing_files = TRUE) %>%
  corpus() %>%
  tokens()
ndoc(x)
[1] 1257
ntoken(x) %>% sum
[1] 461180
x %>% textstat_collocations(size=5) %>% nrow
[1] 9590
fivegrams <- x %>% textstat_collocations(size = 5) %>%
  filter(count < 23) %>%
  arrange(-count, collocation) %>%
  select(collocation, count)
nrow(fivegrams)
[1] 8893
tengrams <- x %>% textstat_collocations(size = 10) %>%
  filter(count < 23) %>%
  arrange(-count, collocation) %>%
  select(collocation, count)
nrow(tengrams)
[1] 1392
fifteengrams <- x %>% textstat_collocations(size = 15) %>%
  filter(count < 23) %>%
  arrange(-count, collocation) %>%
  select(collocation, count)
nrow(fifteengrams)
[1] 350
twentygrams <- x %>% textstat_collocations(size = 20) %>%
  filter(count < 23) %>%
  arrange(-count, collocation) %>%
  select(collocation, count)
nrow(twentygrams)
[1] 90
twentyfivegrams <- x %>% textstat_collocations(size = 25) %>%
  filter(count < 23) %>%
  arrange(-count, collocation) %>%
  select(collocation, count)
nrow(twentyfivegrams)
[1] 23
thirtygrams <- x %>% textstat_collocations(size = 30) %>%
  filter(count < 23) %>%
  arrange(-count, collocation) %>%
  select(collocation, count)
nrow(thirtygrams)
[1] 0
Comment: No collocation in the corpus was as long as thirty words.
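Comment: The six blocks above differ only in the value of size, so the same counts could be produced by one loop. The following is a minimal sketch under stated assumptions, not the code that produced the figures above: the toy texts and the helper count_collocations() are hypothetical, and the helper only counts rows instead of keeping the sorted table.

```r
library(quanteda)
library(quanteda.textstats)

# Toy stand-in for the corpus (invented texts): two identical answers
# and one unrelated text, each long enough for the sizes tried below.
toy <- tokens(c(
  s1 = "to have a second language is to have a second soul",
  s2 = "to have a second language is to have a second soul",
  s3 = "i think that it is important to read novels every day"
))

# Hypothetical helper: the number of collocations of a given size
# that occur fewer than `max_count` times.
count_collocations <- function(toks, size, max_count = 23) {
  coll <- textstat_collocations(toks, size = size)
  sum(coll$count < max_count)
}

sizes <- c(3, 5, 10)
counts <- sapply(sizes, function(n) count_collocations(toy, n))
names(counts) <- paste0(sizes, "-grams")
counts
```

With the real corpus, sizes would be seq(5, 30, by = 5) and toy would be replaced by x.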
kwic_25gram_df <- kwic(x, phrase("he had no idea what he wanted to do with his life and no idea how college was going to help him figure it out"))
kwic_20gram_df <- kwic(x, phrase("beauty of linguistic diversity is that it reveals to us just how ingenious and how flexible the human mind is"))
Comment: Two students used the twenty-word phrase above in their week seven class assignment. It is rather long: perhaps it was memorized, but it may also have been copied from the transcript.
kwic_15gram_df <- kwic(x, phrase("makeup to celebrate who you are instead of trying to hide what you look like"))
Comment: Four students used the above collocation in their week four class assignment. If they memorized it, good for them.
kwic_11gram_df <- kwic(x, phrase("to have a second language is to have a second soul"))
Comment: Seventeen students gave this quotation, attributed to Charlemagne, Holy Roman Emperor, as an answer to a question about language. I think that it is easy to memorize and wise not to paraphrase.
kwic_5gram_df <- kwic(x, phrase("i think that it is"))
Comment: This is a generic collocation. It was used by seventeen students in ten different assignments.
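Comment: Counts such as "seventeen students in ten different assignments" can be read off a kwic result by counting distinct document names. The sketch below assumes a hypothetical file-naming scheme of the form student_assignment; the real corpus may name its files differently, and the toy texts are invented.

```r
library(quanteda)

# Toy documents; the "student_assignment" naming scheme is an assumption.
toy <- tokens(c(
  s01_week04 = "I think that it is a good idea",
  s01_week05 = "Honestly, I think that it is difficult",
  s02_week04 = "I think that it is important",
  s03_week04 = "She disagrees with the speaker"
))

# One row per match, with the document name in the docname column.
hits <- as.data.frame(kwic(toy, phrase("i think that it is")))

# Documents containing the phrase, split into student and assignment parts.
docs <- as.character(unique(hits$docname))
students <- unique(sub("_.*$", "", docs))
assignments <- unique(sub("^.*_", "", docs))

length(students)     # distinct students who used the phrase
length(assignments)  # distinct assignments in which it appears
```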
kwic_5gram_df_specific <- kwic(x, phrase("she was diagnosed with asperger"))
opinion <- kwic(x, phrase("am surprised that barack obama also said that reading novels is good to understand citizens"))
Comment: A student submitted her assignment twice by mistake. Duplicate submissions like this one may be the source of some of the longer collocations in the corpus.
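Comment: Duplicate submissions could be flagged before tokenising. The following base-R sketch was not part of the original analysis; the document names and texts are invented, and it assumes that copies differ at most in case and whitespace.

```r
# Toy texts; the names and contents are invented for illustration.
texts <- c(
  s01_week07      = "I am surprised that reading novels helps.",
  s01_week07_copy = "I am surprised that  reading novels helps. ",
  s02_week07      = "Reading fiction builds empathy."
)

# Normalise case and whitespace so trivially different copies still match.
normalised <- tolower(trimws(gsub("\\s+", " ", texts)))

# TRUE for every text that repeats an earlier one.
is_duplicate <- duplicated(normalised)
names(texts)[is_duplicate]

# Keep only the first copy of each text.
deduplicated <- texts[!is_duplicate]
```

Near-duplicates, such as a resubmission with one sentence changed, would not be caught by this exact comparison.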
x %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm() %>%
  textstat_frequency() %>%
  filter(frequency >= 100) %>%
  arrange(-frequency, feature) %>%
  select(word = feature, frequency)
word frequency
1 people 3295
2 can 2126
3 think 1941
4 write 1637
5 opinion 1351
6 summary 1317
7 many 1290
8 make 1260
9 five 1166
10 words 1139
11 world 1026
12 questions 952
13 story 944
14 talk 894
15 one 843
16 also 841
17 time 818
18 use 810
19 speaker 746
20 however 736
21 like 729
22 us 727
23 life 726
24 important 717
25 way 712
26 change 669
27 answer 661
28 women 659
29 want 647
30 thought 644
31 language 634
32 thinking 621
33 need 594
34 work 564
35 things 547
36 first 545
37 good 538
38 children 533
39 single 533
40 get 526
41 using 521
42 live 506
43 malaria 500
44 take 491
45 virus 489
46 listen 484
47 now 483
48 something 481
49 know 477
50 poor 467
51 notes 460
52 speech 457
53 even 455
54 countries 438
55 different 430
56 years 429
57 english 427
58 feel 402
59 lot 393
60 new 391
61 future 390
62 makeup 389
63 give 383
64 without 380
65 much 378
66 said 374
67 society 374
68 languages 373
69 japan 364
70 every 363
71 made 363
72 problem 360
73 color 359
74 example 350
75 read 350
76 humans 348
77 sentence 338
78 say 337
79 idea 335
80 japanese 335
81 used 334
82 find 333
83 may 331
84 thing 329
85 three 328
86 paragraph 325
87 difficult 316
88 just 315
89 lecture 314
90 school 312
91 human 311
92 see 311
93 response 305
94 person 303
95 second 301
96 system 301
97 kind 299
98 robots 291
99 living 290
100 show 287
101 go 286
102 become 285
103 found 284
104 skin 284
105 country 282
106 coronavirus 274
107 learned 273
108 next 273
109 day 271
110 understand 267
111 meaning 266
112 pandemic 266
113 social 266
114 whether 265
115 health 261
116 lives 260
117 problems 260
118 long 259
119 situation 256
120 least 255
121 bad 253
122 end 253
123 poverty 253
124 water 251
125 short 249
126 books 247
127 around 244
128 men 244
129 easily 238
130 never 237
131 university 237
132 help 235
133 ted 235
134 two 234
135 experience 232
136 necessary 232
137 issue 231
138 learn 231
139 money 231
140 ways 228
141 try 227
142 great 226
143 others 221
144 family 220
145 look 220
146 education 219
147 jobs 218
148 always 215
149 stories 215
150 parents 214
151 similar 213
152 better 211
153 black 211
154 makes 209
155 everyone 208
156 correctly 206
157 according 205
158 year 205
159 book 204
160 mother 203
161 well 203
162 create 201
163 power 200
164 stop 200
165 talks 200
166 project 199
167 right 198
168 start 198
169 deeply 197
170 still 196
171 best 195
172 says 194
173 spread 194
174 singular 193
175 christmas 192
176 child 191
177 coherent 191
178 love 189
179 media 189
180 since 188
181 able 187
182 energy 186
183 college 185
184 various 184
185 order 183
186 little 181
187 movie 181
188 reason 181
189 solve 181
190 climate 179
191 realized 179
192 big 178
193 mental 178
194 points 178
195 come 177
196 confidence 177
197 gender 175
198 happy 175
199 often 175
200 started 175
201 felt 174
202 body 172
203 million 172
204 agree 171
205 changes 171
206 high 170
207 number 170
208 environment 169
209 wearing 169
210 means 167
211 action 166
212 old 166
213 surprised 166
214 third 166
215 guerrillas 165
216 possible 165
217 therefore 165
218 video 165
219 reasons 164
220 making 163
221 must 163
222 cause 162
223 white 162
224 friends 161
225 called 159
226 covid-19 159
227 speak 159
228 students 159
229 vaccine 159
230 information 158
231 instead 158
232 entrepreneur 157
233 study 156
234 discrimination 155
235 someone 155
236 believe 154
237 name 154
238 seven 154
239 development 152
240 less 152
241 mind 152
242 place 152
243 american 151
244 online 151
245 really 150
246 death 149
247 fact 149
248 away 148
249 disease 148
250 interesting 148
251 research 147
252 method 146
253 technology 145
254 class 144
255 going 144
256 man 144
257 opportunity 144
258 together 144
259 true 144
260 understood 144
261 build 143
262 care 143
263 home 143
264 viruses 142
265 got 141
266 everything 140
267 days 139
268 difference 139
269 epidemic 139
270 actually 138
271 clean 138
272 rna 138
273 government 137
274 creative 136
275 economic 136
276 might 136
277 sometimes 136
278 wanted 136
279 last 135
280 tell 135
281 fiction 134
282 bill 133
283 eat 133
284 learning 133
285 part 133
286 reading 133
287 describe 132
288 keep 132
289 air 130
290 although 130
291 hard 130
292 job 130
293 put 130
294 working 130
295 young 130
296 native 128
297 america 127
298 enough 127
299 beautiful 126
300 developed 126
301 people's 126
302 depression 125
303 especially 125
304 food 125
305 north 125
306 sleep 125
307 trying 125
308 face 124
309 gratefulness 124
310 speaker's 124
311 became 123
312 guerrilla 123
313 student 123
314 due 122
315 act 121
316 grateful 121
317 told 121
318 decided 120
319 medical 120
320 question 120
321 watch 118
322 easy 117
323 heard 117
324 support 117
325 ability 116
326 global 116
327 real 116
328 times 116
329 africa 115
330 gates 115
331 hope 114
332 lack 114
333 purpose 114
334 small 114
335 thinks 114
336 another 113
337 brain 113
338 causes 113
339 back 112
340 ebola 112
341 rules 112
342 today 112
343 vaccines 112
344 colombia 111
345 community 110
346 die 110
347 strong 110
348 effects 109
349 adoptive 108
350 affect 108
351 empowerment 108
352 realize 108
353 african 107
354 colors 107
355 hand 107
356 importance 107
357 point 107
358 straightforward 107
359 addition 106
360 amount 106
361 company 106
362 result 106
363 examples 105
364 famous 105
365 father 105
366 tend 105
367 given 104
368 prosthetic 104
369 developing 103
370 lost 103
371 test 103
372 half 102
373 stress 102
374 basic 101
375 lead 101
376 nature 101
377 rather 101
378 illness 100
379 looking 100
380 woman 100
url <- "https://www.gutenberg.org/files/1400/1400-0.txt"
y <- readtext(url)$text
# Keep only the novel itself, from "Chapter I." to its last sentence.
start <- str_locate(y, "Chapter I\\.\\n\\n")[1]
end <- str_locate(y, "another parting\\s+from her\\.")[2]
ge_novel <- str_sub(y, start, end)
ge_fivegrams <- ge_novel %>% textstat_collocations(size = 5) %>%
  arrange(-count) %>%
  select(collocation, count)
head(ge_fivegrams, 5)
collocation count
1 it appeared to me that 7
2 as if he had been 7
3 as if it were a 6
4 his hands in his pockets 6
5 bad side of human nature 5
ntoken(ge_novel)
text1
224434
sum(ge_fivegrams$count)/224434
[1] 0.01070693
sum(fivegrams$count)/461180
[1] 0.05003469
Comment: Relative to its size, my corpus contains nearly five times as many five-word collocation occurrences as Great Expectations: about 5.0 per 100 tokens against about 1.1. However, the corpus comprises 39 assignments, 24 of which were classwork on the same topic and included questions expecting the same answer. The longer collocations tended to be answers, which, unlike summaries, students in Japan are not expected to paraphrase. The long collocations that were not answers were often memorable phrases. According to Pennycook (1996), good learners mimic native speakers in order to master the language, and this is what my students may have been doing.
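Comment: The comparison rests on the two proportions printed above. As a worked check, the sketch below re-derives the rates per 100 tokens and their ratio from those printed figures only; the variable names are illustrative. Note that fivegrams was filtered to count < 23 while ge_fivegrams was not, so the two rates are not strictly like for like.

```r
# Figures reported in the session above.
corpus_rate <- 0.05003469  # sum(fivegrams$count) / 461180
novel_rate  <- 0.01070693  # sum(ge_fivegrams$count) / 224434

# Five-word collocation occurrences per 100 tokens.
round(100 * c(corpus = corpus_rate, novel = novel_rate), 2)

# Corpus rate relative to the novel's rate.
ratio <- corpus_rate / novel_rate
round(ratio, 2)
```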
Reference
Pennycook, A. (1996). Borrowing others’ words: Text, ownership, memory and plagiarism. TESOL Quarterly, 30(2), 201-230. https://doi.org/10.2307/3588141
Comment: The corpus consists of weekly classwork and homework assignments set during the 2020 academic year for twenty-three sophomore students at a Japanese university.