Comment: The corpus consists of weekly classwork and homework assignments set in the 2020 academic year for twenty-three sophomore students at a Japanese university.

Set the location of the working directory.

Load libraries into R.

library(tidyverse)
library(readtext)
library(quanteda)
library(quanteda.textstats)

Read the assignments into R and convert them to a tokens object.

# corpus() replaces texts(), which is deprecated in quanteda 3.
x <- list.files(pattern = "docx|pdf", recursive = TRUE) %>%
  readtext(ignore_missing_files = TRUE) %>% corpus() %>% tokens()

Students had one classwork assignment and one homework assignment per week. How many documents are there in the corpus?

ndoc(x)
[1] 1257

How many words does the corpus contain?

ntoken(x) %>% sum
[1] 461180

How many 5-grams are in the corpus, including the teacher’s questions?

x %>% textstat_collocations(size=5) %>% nrow
[1] 9590

Filter out collocations that come from the teacher’s questions by keeping only n-grams that occur fewer than 23 times, 23 being the number of students in the class.

Collect and store the 5-grams and count the number of them.

fivegrams <- x %>% textstat_collocations(size = 5) %>%
  filter(count < 23) %>%                  # fewer occurrences than students
  arrange(desc(count), collocation) %>%
  select(collocation, count)              # keep only these two columns
nrow(fivegrams)
[1] 8893
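
The thresholding step can be illustrated on a toy table. This is a minimal base-R sketch: the phrases, the counts, and the names `toy` and `class_size` are invented for illustration and are not taken from the corpus. It assumes that a phrase from the teacher's prompt is repeated in at least as many documents as there are students.

```r
# Toy collocation table; phrases and counts are invented for illustration.
toy <- data.frame(
  collocation = c("phrase from the teacher prompt",
                  "phrase shared by some students",
                  "phrase shared by a pair"),
  count = c(46, 17, 2),
  stringsAsFactors = FALSE
)
class_size <- 23  # number of students stated in the text

# Keep only phrases that occur fewer times than there are students.
kept <- toy[toy$count < class_size, ]
# Sort by descending count, then alphabetically, as in the pipeline above.
kept <- kept[order(-kept$count, kept$collocation), ]
print(kept)
```

A count threshold is a heuristic: it would also drop a student phrase repeated 23 or more times, and it keeps any prompt text that appears in fewer than 23 documents.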

10-grams

tengrams <- x %>% textstat_collocations(size = 10) %>%
  filter(count < 23) %>%
  arrange(desc(count), collocation) %>%
  select(collocation, count)
nrow(tengrams)
[1] 1392

15-grams

fifteengrams <- x %>% textstat_collocations(size = 15) %>%
  filter(count < 23) %>%
  arrange(desc(count), collocation) %>%
  select(collocation, count)
nrow(fifteengrams)
[1] 350

20-grams

twentygrams <- x %>% textstat_collocations(size = 20) %>%
  filter(count < 23) %>%
  arrange(desc(count), collocation) %>%
  select(collocation, count)
nrow(twentygrams)
[1] 90

25-grams

twentyfivegrams <- x %>% textstat_collocations(size = 25) %>%
  filter(count < 23) %>%
  arrange(desc(count), collocation) %>%
  select(collocation, count)
nrow(twentyfivegrams)
[1] 23

30-grams

thirtygrams <- x %>% textstat_collocations(size = 30) %>%
  filter(count < 23) %>%
  arrange(desc(count), collocation) %>%
  select(collocation, count)
nrow(thirtygrams)
[1] 0

Comment: No collocation in the corpus was as long as thirty words.
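
The way the number of repeated n-grams shrinks as n grows can be reproduced on a few sentences. The sketch below is plain frequency counting in base R. It is not what `textstat_collocations()` does internally (that function also scores associations), and `count_ngrams`, `docs`, and `repeated` are hypothetical names introduced here.

```r
# Count word n-grams that occur at least `min_count` times across documents.
# Simple frequency counting only; no association scores are computed.
count_ngrams <- function(docs, n, min_count = 2) {
  grams <- unlist(lapply(docs, function(doc) {
    # Lowercase and split on anything that is not a letter or apostrophe.
    words <- strsplit(tolower(doc), "[^a-z']+")[[1]]
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), count = integer(0)))
  }
  counts <- table(grams)
  keep <- as.vector(counts >= min_count)
  out <- data.frame(ngram = names(counts)[keep],
                    count = as.integer(counts)[keep],
                    stringsAsFactors = FALSE)
  out[order(-out$count, out$ngram), , drop = FALSE]
}

docs <- c(
  "To have a second language is to have a second soul.",
  "He said that to have a second language is to have a second soul.",
  "Language learning takes time."
)

# Number of repeated n-grams for n = 5, 11 and 12.
repeated <- sapply(c(5, 11, 12), function(n) nrow(count_ngrams(docs, n)))
print(repeated)
```

One shared 11-word sentence produces seven repeated 5-grams but only one repeated 11-gram and no repeated 12-gram, which is the same pattern as the falling counts from 5-grams to 30-grams above.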

Inspect the most common 25-gram.

kwic_25gram_df <- kwic(x, phrase("he had no idea what he wanted to do with his life and no idea how college was going to help him figure it out"))

Comment: Three students shared this 25-gram. I think it is acceptable for students to use this long phrase because it answers a question and is memorable. A few may also have memorized Steve Jobs’s 2005 Stanford commencement address in high school.
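
A count such as "three students" can be derived from the `docname` column of a `kwic()` result. The sketch below uses a hand-made stand-in for that result. The file-naming scheme (`weekNN_studentNN.docx`) is an assumption made for illustration, since the actual file names are not shown here.

```r
# Stand-in for a kwic() result: one row per match. Only the docname column
# is used; the file names are hypothetical.
hits <- data.frame(
  docname = c("week02_student05.docx", "week02_student11.docx",
              "week02_student11.docx", "week02_student19.docx"),
  stringsAsFactors = FALSE
)

# Pull the student identifier out of each (hypothetical) file name.
student <- sub("^week[0-9]+_(student[0-9]+)\\.docx$", "\\1", hits$docname)

n_matches  <- nrow(hits)               # how many times the phrase occurs
n_students <- length(unique(student))  # how many different students used it
print(c(matches = n_matches, students = n_students))
```

Keeping the two numbers apart matters because one student can contribute more than one match.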

Inspect the most common 20-gram.

kwic_20gram_df <- kwic(x, phrase("beauty of linguistic diversity is that it reveals to us just how ingenious and how flexible the human mind is"))

Comment: Two students had this phrase in their week seven class assignment. It’s a bit long. Perhaps it was memorized, but it may also have been copied from the transcript.

Inspect the most common 15-gram.

kwic_15gram_df <- kwic(x, phrase("makeup to celebrate who you are instead of trying to hide what you look like"))

Comment: Four students had the above collocation in their week four class assignment. If they had memorized it, good for them.

Inspect the most common 10-gram (actually, it’s an 11-gram).

kwic_11gram_df <- kwic(x, phrase("to have a second language is to have a second soul"))

Comment: Seventeen students had this quotation, attributed to Charlemagne, Holy Roman Emperor, as an answer to a question about language. I think that it is easy to memorize and wise not to paraphrase.

Inspect the most common 5-gram.

kwic_5gram_df <- kwic(x, phrase("i think that it is"))

Comment: This is a generic collocation. It was used by seventeen students in ten different assignments.

Find a 5-gram which is specific to a text.

kwic_5gram_df_specific <- kwic(x, phrase("she was diagnosed with asperger"))

Comment: Fifteen students shared the same 5-gram about climate activist Greta Thunberg.

Investigate an opinion. How could a non-generic opinion appear more than once in the corpus?

opinion <- kwic(x, phrase("am surprised that barack obama also said that reading novels is good to understand citizens"))

Comment: A student had submitted her assignment twice by mistake. Perhaps this is the source of some of the longer collocations within the corpus.
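
Duplicate submissions like this one can be flagged before counting collocations. The following is a minimal base-R sketch on invented texts. The document names are hypothetical, and in the real workflow the character vector would come from the files read by `readtext()`.

```r
# Toy documents; names and contents are invented for illustration.
docs <- c(
  "week05_student03.docx"      = "I am surprised that reading novels helps.",
  "week05_student03_copy.docx" = "I am  surprised that reading novels helps.\n",
  "week05_student08.docx"      = "Reading fiction builds empathy."
)

# Normalise case and whitespace so trivial differences do not hide a duplicate.
normalised <- tolower(trimws(gsub("[[:space:]]+", " ", docs)))

# duplicated() marks the second and later copies of an identical text.
is_dup <- duplicated(normalised)
dup_names <- names(docs)[is_dup]
print(dup_names)
```

This only catches exact copies after normalisation. A resubmission with edits would need a similarity measure instead.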

Which words appeared in the corpus at least 100 times?

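# Clean at the token level first: passing these options to dfm() is
# deprecated in quanteda 3.
x %>% tokens(remove_punct = TRUE,
             remove_numbers = TRUE,
             remove_symbols = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm() %>%
  textstat_frequency() %>%
  filter(frequency >= 100) %>%
  arrange(desc(frequency), feature) %>%
  select(word = feature, frequency)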
               word frequency
1            people      3295
2               can      2126
3             think      1941
4             write      1637
5           opinion      1351
6           summary      1317
7              many      1290
8              make      1260
9              five      1166
10            words      1139
11            world      1026
12        questions       952
13            story       944
14             talk       894
15              one       843
16             also       841
17             time       818
18              use       810
19          speaker       746
20          however       736
21             like       729
22               us       727
23             life       726
24        important       717
25              way       712
26           change       669
27           answer       661
28            women       659
29             want       647
30          thought       644
31         language       634
32         thinking       621
33             need       594
34             work       564
35           things       547
36            first       545
37             good       538
38         children       533
39           single       533
40              get       526
41            using       521
42             live       506
43          malaria       500
44             take       491
45            virus       489
46           listen       484
47              now       483
48        something       481
49             know       477
50             poor       467
51            notes       460
52           speech       457
53             even       455
54        countries       438
55        different       430
56            years       429
57          english       427
58             feel       402
59              lot       393
60              new       391
61           future       390
62           makeup       389
63             give       383
64          without       380
65             much       378
66             said       374
67          society       374
68        languages       373
69            japan       364
70            every       363
71             made       363
72          problem       360
73            color       359
74          example       350
75             read       350
76           humans       348
77         sentence       338
78              say       337
79             idea       335
80         japanese       335
81             used       334
82             find       333
83              may       331
84            thing       329
85            three       328
86        paragraph       325
87        difficult       316
88             just       315
89          lecture       314
90           school       312
91            human       311
92              see       311
93         response       305
94           person       303
95           second       301
96           system       301
97             kind       299
98           robots       291
99           living       290
100            show       287
101              go       286
102          become       285
103           found       284
104            skin       284
105         country       282
106     coronavirus       274
107         learned       273
108            next       273
109             day       271
110      understand       267
111         meaning       266
112        pandemic       266
113          social       266
114         whether       265
115          health       261
116           lives       260
117        problems       260
118            long       259
119       situation       256
120           least       255
121             bad       253
122             end       253
123         poverty       253
124           water       251
125           short       249
126           books       247
127          around       244
128             men       244
129          easily       238
130           never       237
131      university       237
132            help       235
133             ted       235
134             two       234
135      experience       232
136       necessary       232
137           issue       231
138           learn       231
139           money       231
140            ways       228
141             try       227
142           great       226
143          others       221
144          family       220
145            look       220
146       education       219
147            jobs       218
148          always       215
149         stories       215
150         parents       214
151         similar       213
152          better       211
153           black       211
154           makes       209
155        everyone       208
156       correctly       206
157       according       205
158            year       205
159            book       204
160          mother       203
161            well       203
162          create       201
163           power       200
164            stop       200
165           talks       200
166         project       199
167           right       198
168           start       198
169          deeply       197
170           still       196
171            best       195
172            says       194
173          spread       194
174        singular       193
175       christmas       192
176           child       191
177        coherent       191
178            love       189
179           media       189
180           since       188
181            able       187
182          energy       186
183         college       185
184         various       184
185           order       183
186          little       181
187           movie       181
188          reason       181
189           solve       181
190         climate       179
191        realized       179
192             big       178
193          mental       178
194          points       178
195            come       177
196      confidence       177
197          gender       175
198           happy       175
199           often       175
200         started       175
201            felt       174
202            body       172
203         million       172
204           agree       171
205         changes       171
206            high       170
207          number       170
208     environment       169
209         wearing       169
210           means       167
211          action       166
212             old       166
213       surprised       166
214           third       166
215      guerrillas       165
216        possible       165
217       therefore       165
218           video       165
219         reasons       164
220          making       163
221            must       163
222           cause       162
223           white       162
224         friends       161
225          called       159
226        covid-19       159
227           speak       159
228        students       159
229         vaccine       159
230     information       158
231         instead       158
232    entrepreneur       157
233           study       156
234  discrimination       155
235         someone       155
236         believe       154
237            name       154
238           seven       154
239     development       152
240            less       152
241            mind       152
242           place       152
243        american       151
244          online       151
245          really       150
246           death       149
247            fact       149
248            away       148
249         disease       148
250     interesting       148
251        research       147
252          method       146
253      technology       145
254           class       144
255           going       144
256             man       144
257     opportunity       144
258        together       144
259            true       144
260      understood       144
261           build       143
262            care       143
263            home       143
264         viruses       142
265             got       141
266      everything       140
267            days       139
268      difference       139
269        epidemic       139
270        actually       138
271           clean       138
272             rna       138
273      government       137
274        creative       136
275        economic       136
276           might       136
277       sometimes       136
278          wanted       136
279            last       135
280            tell       135
281         fiction       134
282            bill       133
283             eat       133
284        learning       133
285            part       133
286         reading       133
287        describe       132
288            keep       132
289             air       130
290        although       130
291            hard       130
292             job       130
293             put       130
294         working       130
295           young       130
296          native       128
297         america       127
298          enough       127
299       beautiful       126
300       developed       126
301        people's       126
302      depression       125
303      especially       125
304            food       125
305           north       125
306           sleep       125
307          trying       125
308            face       124
309    gratefulness       124
310       speaker's       124
311          became       123
312       guerrilla       123
313         student       123
314             due       122
315             act       121
316        grateful       121
317            told       121
318         decided       120
319         medical       120
320        question       120
321           watch       118
322            easy       117
323           heard       117
324         support       117
325         ability       116
326          global       116
327            real       116
328           times       116
329          africa       115
330           gates       115
331            hope       114
332            lack       114
333         purpose       114
334           small       114
335          thinks       114
336         another       113
337           brain       113
338          causes       113
339            back       112
340           ebola       112
341           rules       112
342           today       112
343        vaccines       112
344        colombia       111
345       community       110
346             die       110
347          strong       110
348         effects       109
349        adoptive       108
350          affect       108
351     empowerment       108
352         realize       108
353         african       107
354          colors       107
355            hand       107
356      importance       107
357           point       107
358 straightforward       107
359        addition       106
360          amount       106
361         company       106
362          result       106
363        examples       105
364          famous       105
365          father       105
366            tend       105
367           given       104
368      prosthetic       104
369      developing       103
370            lost       103
371            test       103
372            half       102
373          stress       102
374           basic       101
375            lead       101
376          nature       101
377          rather       101
378         illness       100
379         looking       100
380           woman       100
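
The logic behind the frequency table (drop stop words, count, keep words at or above a threshold) can be shown on two toy sentences. This is a base-R sketch with a two-word stop list made up for illustration. It is not quanteda's `stopwords("en")` list, and the threshold of 2 stands in for the 100 used above.

```r
texts <- c("People can think.", "Many people think that the people can change.")
stop_list <- c("that", "the")  # tiny hypothetical stop list

# Lowercase, split into words, and drop empty strings and stop words.
words <- unlist(strsplit(tolower(texts), "[^a-z']+"))
words <- words[words != "" & !(words %in% stop_list)]

# Count each word and keep those that reach the threshold.
freq <- sort(table(words), decreasing = TRUE)
frequent <- freq[as.vector(freq >= 2)]
print(frequent)
```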

Build a corpus of Great Expectations.

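url <- "https://www.gutenberg.org/files/1400/1400-0.txt"
y <- readtext(url)$text  # full file as a single character string
# Start at the first chapter heading that is followed by a blank line.
ge_start <- str_locate(y, "Chapter I\\.\\n\\n")[1]
# End at the novel's final words; \\s+ allows the line break between them.
ge_end <- str_locate(y, "another parting\\s+from her\\.")[2]
ge_novel <- str_sub(y, ge_start, ge_end)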
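
The trimming step keeps only the text between a start marker and an end marker. The sketch below shows the same idea in base R on a made-up string that imitates the layout of a Project Gutenberg file. `raw`, `start_pos`, `end_pos`, and `novel` are hypothetical names, and the toy text is not the real file.

```r
# Made-up text: front matter with a contents list, a body, and a footer.
raw <- paste0(
  "Title page\nContents\nChapter I.\nChapter II.\n\n",
  "Chapter I.\n\nFirst line of the story.\n",
  "The last line ends with another parting\nfrom her.\n\n",
  "*** END OF THE EBOOK ***"
)

# The heading followed by a blank line skips the entry in the contents list.
start_pos <- as.integer(regexpr("Chapter I\\.\n\n", raw))

# [[:space:]]+ tolerates either a space or a line break between the words.
end_match <- regexpr("another parting[[:space:]]+from her\\.", raw)
end_pos <- as.integer(end_match) + attr(end_match, "match.length") - 1L

novel <- substr(raw, start_pos, end_pos)
cat(novel)
```

Matching the end marker with a whitespace class avoids depending on where the source file happens to wrap its lines.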

Extract 5-grams from Great Expectations.

ge_fivegrams <- ge_novel %>% textstat_collocations(size = 5) %>%
  arrange(desc(count)) %>%
  select(collocation, count)

What are the most common 5-gram collocations?

head(ge_fivegrams, 5)
               collocation count
1   it appeared to me that     7
2        as if he had been     7
3          as if it were a     6
4 his hands in his pockets     6
5 bad side of human nature     5

How many words are there in Great Expectations?

ntoken(ge_novel)
 text1 
224434 

How many 5-gram collocation occurrences are there per word in Great Expectations?

sum(ge_fivegrams$count)/224434
[1] 0.01070693

How many 5-gram collocation occurrences are there per word in my corpus?

sum(fivegrams$count)/461180
[1] 0.05003469
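
The two rates can be compared directly. The short calculation below uses only the two values printed above. The variable names are introduced here for illustration.

```r
rate_corpus <- 0.05003469  # sum(fivegrams$count) / 461180, as printed above
rate_novel  <- 0.01070693  # sum(ge_fivegrams$count) / 224434, as printed above

# How many times higher the rate is in the student corpus.
ratio <- rate_corpus / rate_novel

# The same rates expressed per 1,000 words.
per_thousand <- round(1000 * c(corpus = rate_corpus, novel = rate_novel), 1)

print(round(ratio, 1))
print(per_thousand)
```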

Comment: At about five percent, the rate of 5-gram collocations in my corpus is nearly five times that of Great Expectations, which is about one percent. However, the corpus is made up of 39 assignments, of which 24 were classwork about the same topic and included questions expecting the same answer. The longer collocations tended to be answers, which, unlike summaries, students in Japan are not expected to paraphrase. The long collocations that were not answers were often memorable phrases. According to Pennycook (1996), good learners mimic native speakers in order to master the language, and this is what my students may have been doing.

Reference

Pennycook, A. (1996). Borrowing others’ words: Text, ownership, memory and plagiarism. TESOL Quarterly, 30(2), 201-230. https://doi.org/10.2307/3588141