Load libraries.

library(tidyverse)
library(readtext)
library(quanteda)
library(quanteda.textstats)
library(caTools)

Build a corpus of the texts.

x <- list.files(pattern = "html", recursive = TRUE) %>% 
  readtext(ignore_missing_files = TRUE) %>% texts %>% corpus

Extract family and given names from doc names.

family_names <- docnames(x) %>% word(2)
family_names <- gsub(",","", family_names)
given_names <- docnames(x) %>% word(3)
given_names <- gsub(",","", given_names)
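
The word-then-strip-commas steps above can be wrapped in one helper. This is only a sketch: the doc name format shown is a guess based on the word(2) and word(3) calls, and name_part() and the sample names are hypothetical.

```r
library(stringr)

# Hypothetical doc names; the real ones come from docnames(x).
doc_names <- c("s01 Sato, Yuki, report.html",
               "s02 Tanaka, Ken, report.html")

# Take the n-th whitespace-separated word and strip any commas from it.
name_part <- function(doc_names, n) {
  str_remove_all(word(doc_names, n), ",")
}

family_names <- name_part(doc_names, 2)
given_names  <- name_part(doc_names, 3)
```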

Make a data frame of the texts.

df <- data.frame(x)

Add the real names to the data frame.

df$given_name <- given_names
df$family_name <- family_names

Anonymize the students.

anon <- read.csv("../anon.csv")
set.seed(1066)
row.names(df) <- sample(anon$x, nrow(df), replace = FALSE)
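
A guarded variant of the anonymization step, as a sketch. The pseudonym pool and the stand-in data frame are invented here; the real pool is read from ../anon.csv and its contents are not shown in this notebook.

```r
# Hypothetical pseudonym pool; the real one comes from ../anon.csv.
anon <- data.frame(x = sprintf("student_%03d", 1:150),
                   stringsAsFactors = FALSE)

# Stand-in for the data frame of 101 texts.
df <- data.frame(text = rep("placeholder", 101), stringsAsFactors = FALSE)

# Fail early if there are too few pseudonyms or if any are repeated.
stopifnot(nrow(anon) >= nrow(df), !anyDuplicated(anon$x))

# Fixed seed so the same pseudonyms are assigned on every run.
set.seed(1066)
row.names(df) <- sample(anon$x, nrow(df), replace = FALSE)
```

Sampling without replacement gives every student a distinct pseudonym, and the seed keeps the mapping reproducible.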

Create a variable for class group.

df$class <- c(rep("class_3", 29), rep("class_4", 36), rep("class_5", 36)) %>% as.factor
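
The same labels can be built from a named vector of class sizes, which keeps the sizes in one place and makes a length check easy. A sketch; class_sizes is a name introduced here.

```r
# Class sizes as entered in the rep() calls above.
class_sizes <- c(class_3 = 29, class_4 = 36, class_5 = 36)

# One label per student, in the same order as the rows of the data frame.
class <- factor(rep(names(class_sizes), times = class_sizes))

# Guard against a mismatch with the number of texts (101 in this corpus).
stopifnot(length(class) == 101)

table(class)
```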

Type in the similarity scores and add them to the data frame as a variable.

p3 <- c(32,0,39,39,31,0,64,67,99,19,59,24,0,66,0,66,NA,54,0,12,0,88,0,93,81,17,0,0,65)
p4 <- c(0,89,0,85,10,70,41,52,44,66,30,19,34,67,0,28,52,51,0,37, 74,45,0,14,36,58,14,49,0,0,29,82,65,0,75,42)
p5 <- c(0,0,0,39,75,0,0,15,81,0,30,18,0,0,0,0,0,46,0,0,39,0,80, 0,0,0,0,0,24,0,32,0,0,40,18,41)
df$similarity <- c(p3, p4, p5)

Inspect the corpus.

ntoken(x) %>% min
[1] 91
ntoken(x) %>% max
[1] 182
ntoken(x) %>% mean
[1] 117.4158
nsentence(x) %>% min
[1] 3
nsentence(x) %>% max
[1] 13
nsentence(x) %>% mean
[1] 7.623762

Inspect the similarity scores.

fivenum(df$similarity)
[1]  0  0 24 52 99
mean(df$similarity, na.rm = TRUE)
[1] 29.51
median(df$similarity, na.rm = TRUE)
[1] 24

Draw boxplots of the similarity scores for each class.

df %>% ggplot(aes(x=class, y=similarity)) +
  geom_boxplot()

Comment: Class_5 has a median similarity score of zero, which is where I want the other classes to be. There is no clear reason for the difference at the moment. For reference, the TOEIC scores of each class are:

Class_3: TOEIC avg=492 max=509 min=475

Class_4: TOEIC avg=581 max=608 min=552

Class_5: TOEIC avg=488 max=516 min=458
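
The medians behind the boxplot comment can be checked numerically. The sketch below re-enters the scores from above in a standalone data frame; scores, class_median and zero_share are names introduced here.

```r
p3 <- c(32,0,39,39,31,0,64,67,99,19,59,24,0,66,0,66,NA,54,0,12,0,88,0,93,81,17,0,0,65)
p4 <- c(0,89,0,85,10,70,41,52,44,66,30,19,34,67,0,28,52,51,0,37,74,45,0,14,36,58,14,49,0,0,29,82,65,0,75,42)
p5 <- c(0,0,0,39,75,0,0,15,81,0,30,18,0,0,0,0,0,46,0,0,39,0,80,0,0,0,0,0,24,0,32,0,0,40,18,41)

scores <- data.frame(
  class = factor(rep(c("class_3", "class_4", "class_5"), times = c(29, 36, 36))),
  similarity = c(p3, p4, p5)
)

# Median similarity per class, ignoring the one missing score.
class_median <- tapply(scores$similarity, scores$class, median, na.rm = TRUE)

# Share of students in each class with a similarity score of exactly zero.
zero_share <- tapply(scores$similarity == 0, scores$class, mean, na.rm = TRUE)

class_median
round(zero_share, 2)
```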

Run a one-way ANOVA to test whether mean similarity differs between the classes.

mod <- aov(similarity ~ class, data = df)
summary(mod)
            Df Sum Sq Mean Sq F value  Pr(>F)   
class        2  10217    5108   6.234 0.00284 **
Residuals   97  79488     819                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1 observation deleted due to missingness

Comment: The difference between the classes is statistically significant (F(2, 97) = 6.23, p = 0.003).
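
The ANOVA table also allows a rough effect size. The sketch below computes eta squared (between-class sum of squares over the total) from the rounded values printed above, so the result is approximate.

```r
# Sums of squares copied from the summary(mod) output above.
ss_class    <- 10217
ss_residual <- 79488

# Eta squared: share of the total variation in similarity explained by class.
eta_sq <- ss_class / (ss_class + ss_residual)
round(eta_sq, 3)
```

On these figures, class accounts for roughly 11% of the variation in similarity scores.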

Run Tukey's HSD test to see which classes differ.

TukeyHSD(mod)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = similarity ~ class, data = df)

$class
                      diff       lwr       upr     p adj
class_4-class_3   1.472222 -15.69670 18.641141 0.9773020
class_5-class_3 -20.194444 -37.36336 -3.025526 0.0168428
class_5-class_4 -21.666667 -37.72672 -5.606614 0.0050580

Comment: There is a statistically significant difference between Class_5 and Class_3 (p = 0.017) and between Class_5 and Class_4 (p = 0.005), but not between Class_4 and Class_3 (p = 0.977).
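
The significant pairs can also be pulled out of the Tukey result programmatically rather than read off the table. A sketch that refits the model on a standalone copy of the scores; scores, tukey and sig are names introduced here, and the 0.05 cut-off is an assumption.

```r
p3 <- c(32,0,39,39,31,0,64,67,99,19,59,24,0,66,0,66,NA,54,0,12,0,88,0,93,81,17,0,0,65)
p4 <- c(0,89,0,85,10,70,41,52,44,66,30,19,34,67,0,28,52,51,0,37,74,45,0,14,36,58,14,49,0,0,29,82,65,0,75,42)
p5 <- c(0,0,0,39,75,0,0,15,81,0,30,18,0,0,0,0,0,46,0,0,39,0,80,0,0,0,0,0,24,0,32,0,0,40,18,41)

scores <- data.frame(
  class = factor(rep(c("class_3", "class_4", "class_5"), times = c(29, 36, 36))),
  similarity = c(p3, p4, p5)
)

mod <- aov(similarity ~ class, data = scores)

# TukeyHSD() returns one matrix per factor; rows are the pairwise comparisons.
tukey <- TukeyHSD(mod)$class

# Keep only the comparisons whose adjusted p-value is below 0.05.
sig <- tukey[tukey[, "p adj"] < 0.05, , drop = FALSE]
rownames(sig)
```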

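Identify the n-grams that occur at least twice in the corpus (the default min_count of textstat_collocations is 2) and count how many there are at each size.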

x %>% textstat_collocations(size= 2) %>% nrow
[1] 924
x %>% textstat_collocations(size= 4) %>% nrow
[1] 627
x %>% textstat_collocations(size= 6) %>% nrow
[1] 336
x %>% textstat_collocations(size= 8) %>% nrow
[1] 208
x %>% textstat_collocations(size= 10) %>% nrow
[1] 127
x %>% textstat_collocations(size= 12) %>% nrow
[1] 82
x %>% textstat_collocations(size= 14) %>% nrow
[1] 51
x %>% textstat_collocations(size= 16) %>% nrow
[1] 27
x %>% textstat_collocations(size= 18) %>% nrow
[1] 11
x %>% textstat_collocations(size= 20) %>% nrow
[1] 4
x %>% textstat_collocations(size= 22) %>% nrow
[1] 2
x %>% textstat_collocations(size= 24) %>% nrow
[1] 0
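
The repeated calls above can be collapsed into one loop over the sizes. A sketch on a toy corpus with invented texts, so the counts will not match the ones above; toy, sizes and ngram_counts are names introduced here.

```r
library(quanteda)
library(quanteda.textstats)

# Toy corpus standing in for x; the texts are invented for illustration.
toy <- corpus(c(
  d1 = "Accuracy is how close a measurement is to the correct value.",
  d2 = "We say accuracy is how close a measurement is to the true value.",
  d3 = "Precision is about repeated measurements."
))

sizes <- c(2, 4, 6)

# Number of distinct n-grams that occur at least twice, for each size.
ngram_counts <- sapply(sizes, function(n) {
  nrow(textstat_collocations(toy, size = n, min_count = 2))
})
names(ngram_counts) <- paste0(sizes, "-grams")

ngram_counts
```

For the real corpus, replacing toy with x and sizes with seq(2, 24, by = 2) should reproduce the sequence of calls above.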

Comment: I’m going to take a look at the six-grams, since six words may be the shortest contiguous string that could be construed as plagiarism.
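
For plagiarism, the number of different texts that share a six-gram may matter more than its total count. The base R sketch below is a stand-in technique, not what textstat_collocations does: it counts, for each six-gram, how many texts contain it. The texts and the helper ngrams_of() are invented for illustration.

```r
# Split a text into lowercase words and return its distinct n-grams.
ngrams_of <- function(text, n = 6) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(starts, function(i) {
    paste(words[i:(i + n - 1)], collapse = " ")
  }, character(1))
  unique(grams)
}

# Invented texts standing in for the student submissions.
texts <- c(
  a = "Accuracy is how close a measurement is to the correct value.",
  b = "We say accuracy is how close a measurement is to the true value.",
  c = "Precision is about repeated measurements."
)

# For each six-gram, the number of texts that contain it at least once.
doc_freq <- table(unlist(lapply(texts, ngrams_of, n = 6)))

# Six-grams that appear in two or more texts.
shared <- names(doc_freq)[doc_freq >= 2]
shared
```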

Six-grams

sixgrams <- x %>% textstat_collocations(size = 6) %>% arrange(desc(count)) %>%
  select(collocation, count)
sixgrams
                                           collocation count
1                  measurement is to the correct value    26
2                 the degree of accuracy and precision    24
3                        how close a measurement is to    23
4                        close a measurement is to the    22
5                      a measurement is to the correct    22
6                   how close the agreement is between    21
7                        is how close a measurement is    21
8                          is to the correct value for    20
9                  accuracy is how close a measurement    20
10             close the agreement is between repeated    19
11                    is a quantitative measure of how    18
12                   from a standard or expected value    18
13            uncertainty is a quantitative measure of    17
14                       to the correct value for that    16
15              the correct value for that measurement    16
16      the agreement is between repeated measurements    15
17                 degree of accuracy and precision of    14
18                  a quantitative measure of how much    14
19                 science is based on observation and    13
20                   refers to how close the agreement    12
21                      of accuracy and precision of a    12
22                       to how close the agreement is    12
23                   related to the uncertainty in the    11
24                   are related to the uncertainty in    11
25                      the skill of the person making    11
26                      skill of the person making the    11
27                of the person making the measurement    11
28             measured values deviate from a standard    10
29                 and precision of a measuring system    10
30               accuracy and precision of a measuring    10
31               accuracy and precision are related to    10
32                      system refers to how close the    10
33                 deviate from a standard or expected    10
34              is based on observation and experiment    10
35               the precision of a measurement system    10
36                      close the measured value is to     9
37                   related to the uncertainty of the     9
38                          must be based on a careful     9
39              measurement system refers to how close     9
40                   of a measurement system refers to     9
41              much your measured values deviate from     9
42                  a measurement system refers to how     9
43                 your measured values deviate from a     9
44               system are related to the uncertainty     9
45              to the uncertainty in the measurements     9
46               how much your measured values deviate     9
47                        the measured value is to the     9
48                 precision of a measuring system are     9
49                     how close the measured value is     9
50                   values deviate from a standard or     9
51            precision of a measurement system refers     8
52                 measuring system are related to the     8
53                 based on a careful consideration of     8
54                   of a measuring system are related     8
55                 be based on a careful consideration     8
56                   measure of how much your measured     8
57                   a measuring system are related to     8
58            careful consideration of all the factors     8
59                   on a careful consideration of all     8
60               quantitative measure of how much your     8
61           measurements are accurate but not precise     8
62             factors that might contribute and their     7
63                  a careful consideration of all the     7
64            that might contribute and their possible     7
65               all the factors that might contribute     7
66                    of how much your measured values     7
67            factors contributing to uncertainty in a     7
68               the factors that might contribute and     7
69                       of all the factors that might     7
70               of accuracy and precision are related     6
71                       value is to the correct value     6
72         irregularities in the object being measured     6
73        contributing to uncertainty in a measurement     6
74                       is how close the agreement is     6
75         might contribute and their possible effects     6
76               consideration of all the factors that     6
77  factors that contribute to measurement uncertainty     6
78              are repeated under the same conditions     5
79                    is related to the uncertainty of     5
80                system is related to the uncertainty     5
81                      way to analyze the accuracy of     5
82                measurement system is related to the     5
83            and precision are related to uncertainty     5
84                     one way to analyze the accuracy     5
85                   how close the measured values are     5
86                        to analyze the accuracy of a     5
87                    is how close the measured values     5
88                degree of accuracy and precision are     5
89               and precision of a measurement system     5
90                    the accuracy of a measurement is     5
91             accuracy and precision of a measurement     5
92            precision are related to the uncertainty     4
93                    close the measured values are to     4
94                      a measurement must be based on     4
95                  to the uncertainty of the measured     4
96                     accuracy of a measurement is to     4
97                  a measurement system is related to     4
98              deviates from the standard or expected     4
99                uncertainty in a measurement must be     4
100                       to the correct value for the     4
101                   measured value is to the correct     4
102                     uncertainty must be based on a     4
103                 of a measurement system is related     4
104         the factors contributing to uncertainty in     4
105              analyze the accuracy of a measurement     4
106                from the standard or expected value     4
107                     in a measurement must be based     4
108       uncertainty is a quantitative measurement of     4
109                  which are repeated under the same     4
110                  they are precise but not accurate     4
111                  any other factors that affect the     4
112              the correct value for the measurement     4
113               is a quantitative measurement of how     4
114      determined after careful consideration of all     4
115                   and precision are related to the     4
116            measured value deviates from a standard     4
117        after careful consideration of all possible     4
118                  precision refers to how close the     4
119               deviates from a standard or expected     4
120              other factors that affect the outcome     4
121                  value deviates from a standard or     4
122                value deviates from the standard or     4
123              the uncertainty in a measurement must     4
124       be determined after careful consideration of     4
125          measured value deviates from the standard     4
126            all measurements contain some amount of     4
127                     measurement must be based on a     4
128                        or they are precise but not     4
129      the agreement is between repeated measurement     4
130              to the uncertainty of the measurement     4
131                   a measured value deviates from a     4
132               the measured value deviates from the     3
133                  a measurement is to determine the     3
134                 accuracy is how close the measured     3
135                     how close to the correct value     3
136               the irregularity of the object being     3
137   should be determined after careful consideration     3
138               the degree of precision and accuracy     3
139            to uncertainty in a measurement include     3
140          irregularity of the object being measured     3
141                   of a measurement is to determine     3
142              accuracy is how close the measurement     3
143    measurements contain some degree of uncertainty     3
144             to the uncertainty of the measurements     3
145              the uncertainty of the measured value     3
146                degree of precision and accuracy of     3
147                         value is to the true value     3
148 careful consideration of all possible contributing     3
149                       of how much a measured value     3
150 consideration of all possible contributing factors     3
151               precision of a measurement system is     3
152              measurement is to determine the range     3
153                 how much a measured value deviates     3
154                     measure of how much a measured     3
155          accuracy and precision are very important     3
156         accuracy and precision of measuring system     3
157                  are related to the uncertainty of     3
158                         is to the correct value of     3
159            all measurements contain some degree of     3
160           of all possible contributing factors and     3
161                degree of accuracy and precision is     3
162        all possible contributing factors and their     3
163             all possible factors and their effects     3
164               precision is how close the agreement     3
165                      measure of how far a measured     3
166                    the accuracy and precision of a     3
167                        of how far a measured value     3
168                   means how close the agreement is     3
169             of accuracy and precision of measuring     3
170    measurements contain some amount of uncertainty     3
171                  how close measurements are to the     3
172                   accuracy refers to how close the     3
173                much a measured value deviates from     3
174                    how close the measurement is to     3
175                    close the measurement is to the     3
176               the accuracy of a measurement system     3
177                  a quantitative measure of how far     3
178                  quantitative measure of how far a     3
179                     of precision and accuracy of a     3
180                 quantitative measure of how much a     3
181                      measured value is to the true     3
182                         can be caused by the skill     2
183                   the uncertainty must be based on     2
184               a measurement system means how close     2
185                     of how much the measured value     2
186              between the lowest and highest values     2
187                related to the uncertainty which is     2
188            explains the difference of accuracy and     2
189          accuracy and precision of the measurement     2
190                         second is the skill of the     2
191               range between the lowest and highest     2
192                    spread of the measured value is     2
193                      a given measurement is to the     2
194                  the accuracy and precision of the     2
195                    of the measuring device and the     2
196                determine the range of the measured     2
197                         caused by the skill of the     2
198             indicates how close the measured value     2
199                  the precision refers to how close     2
200                        refers to the spread of the     2
201                to the uncertainty of a measurement     2
202            limitations of the measuring device and     2
203               quantitative measure of how much the     2
204               is related to accuracy and precision     2
205               such as limitations of the measuring     2
206                  measured values are to each other     2
207                   uncertainty can be caused by the     2
208                  and any other factors that affect     2
209         the difference between the measured values     2
210                    can be accurate but not precise     2
211        the limitations of the measuring instrument     2
212                          be caused by the skill of     2
213               the person making the measurement is     2
214                      both of them are important to     2
215                how far the measured value deviates     2
216     uncertainty should be determined after careful     2
217                 to determine the range between the     2
218                        device and the skill of the     2
219                  measured value is from a standard     2
220                    can say that the measurement is     2
221               far the measured value deviates from     2
222                    close a given measurement is to     2
223                shows how much your measured values     2
224                how close the measurement result is     2
225                    is how much the measured values     2
226                      to the uncertainty which is a     2
227             as limitations of the measuring device     2
228             based on observation and experiment on     2
229        is based on observation and experimentation     2
230              that contribute to the uncertainty of     2
231             a quantitative measurement of how much     2
232                  measure of how close the measured     2
233                 means how close the obtained value     2
234          consideration of all possible factors and     2
235            measurement is accurate but not precise     2
236           and the causes of uncertain measurements     2
237             between the minimum and maximum values     2
238                  minimum and maximum values of the     2
239      the difference between accuracy and precision     2
240                measure of how much measured values     2
241                         close it is to the correct     2
242                             how close it is to the     2
243                    how close the obtained value is     2
244              and maximum values of the measurement     2
245                       value is to the actual value     2
246             are various factors that contribute to     2
247                 shows how close the measured value     2
248                   so it doesn't consider how close     2
249                      refers to how close the match     2
250            the limitations of the measuring device     2
251             and precision are important in science     2
252               how much the measured values deviate     2
253                   how close a given measurement is     2
254                    the measured values are to each     2
255 measurement uncertainty should be determined after     2
256         measurements are both accurate and precise     2
257                    measured value is to the actual     2
258               both accuracy and precision are high     2
259                  measuring device and the skill of     2
260                   refers to how close the measured     2
261            quantitative measurement of how far the     2
262               of the measurement system is related     2
263                       to the correct value of that     2
264            precision of a measurement system means     2
265                  the minimum and maximum values of     2
266    measurements contain some amount of uncertainly     2
267                  the spread of the measured values     2
268        uncertainty which is a quantitative measure     2
269                    it is necessary to consider all     2
270         precision are related to uncertainty which     2
271                  the object being measured and any     2
272            precision are related to uncertainty in     2
273                     the measured values are to the     2
274               being measured and any other factors     2
275           the difference of accuracy and precision     2
276                        and the skill of the person     2
277              to the uncertainty in the measurement     2
278              a quantitative measurement of how far     2
279                      to determine the range of the     2
280            repeated measurements are to each other     2
281     repeated measurements which are repeated under     2
282                  is to determine the range between     2
283                 the measuring device and the skill     2
284                measured and any other factors that     2
285                    and precision is related to the     2
286                 far a measured value deviates from     2
287            there are various causes of uncertainty     2
288              the correct value of that measurement     2
289                  how far a measured value deviates     2
290           cases that measurements are accurate but     2
291               the measurement system is related to     2
292                   is how close measurements are to     2
293           of accuracy and precision when measuring     2
294          there are various factors that contribute     2
295         factors that contribute to the uncertainty     2
296                   uncertainty can be thought of as     2
297    contributing factors and their possible effects     2
298              close measurements are to the correct     2
299      on observation and experiment on measurements     2
300                   in the object being measured and     2
301               accuracy and precision is related to     2
302              accuracy shows how close the measured     2
303                    the range of the measured value     2
304           there are various factors of uncertainty     2
305                   the spread of the measured value     2
306                accuracy of a measurement system is     2
307          accuracy indicates how close the measured     2
308                say that the measurement is precise     2
309                             can be thought of as a     2
310                 and the irregularity of the object     2
311                  of a measurement system means how     2
312                     is from a standard or expected     2
313                       is to determine the range of     2
314                    is how close the measurement is     2
315                         a measure of how close the     2
316              much the measured values deviate from     2
317         is between repeated measurements which are     2
318               and accuracy of a measurement system     2
319                object being measured and any other     2
320   agreement is between repeated measurements which     2
321             that measurements are accurate but not     2
322                given measurement is to the correct     2
323               of accuracy and precision is related     2
324                      be thought of as a disclaimer     2
325          how close between measurements and target     2
326            precision and accuracy of a measurement     2
327          measurements which are repeated under the     2
328                         a measured value is from a     2
329   between repeated measurements which are repeated     2
330            accuracy and precision are important in     2
331                  of all possible factors and their     2
332   possible contributing factors and their possible     2
333                precision is how close the measured     2
334                  the measurement is to the correct     2
335                  the precision of a measurement is     2
336      careful consideration of all possible factors     2
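
To see which texts contain a particular six-gram from the table, a keyword-in-context search is one option. A sketch using kwic() and phrase() from quanteda on a toy corpus; the texts are invented, and the window size is an arbitrary choice.

```r
library(quanteda)

# Toy corpus standing in for x; the texts are invented for illustration.
toy <- corpus(c(
  d1 = "Accuracy is how close a measurement is to the correct value.",
  d2 = "Precision is about repeated measurements."
))

# Find every occurrence of one six-gram, with three tokens of context.
hits <- kwic(tokens(toy),
             pattern = phrase("measurement is to the correct value"),
             window = 3)

# Names of the documents that contain the six-gram.
unique(as.data.frame(hits)$docname)
```

With the anonymized row names in place, the document names returned here could be matched back to pseudonyms rather than real names.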

Ten-grams

tengrams <- x %>% textstat_collocations(size = 10) %>% arrange(desc(count)) %>%
  select(collocation, count)
tengrams
                                                                        collocation
1                                is how close a measurement is to the correct value
2                             accuracy is how close a measurement is to the correct
3                               how close a measurement is to the correct value for
4                              close a measurement is to the correct value for that
5                        a measurement is to the correct value for that measurement
6                        the degree of accuracy and precision of a measuring system
7                        degree of accuracy and precision of a measuring system are
8                          how much your measured values deviate from a standard or
9                refers to how close the agreement is between repeated measurements
10                    much your measured values deviate from a standard or expected
11                   your measured values deviate from a standard or expected value
12                     system refers to how close the agreement is between repeated
13                        precision of a measurement system refers to how close the
14                        the precision of a measurement system refers to how close
15                      accuracy and precision of a measuring system are related to
16                  measurement system refers to how close the agreement is between
17                        a measurement system refers to how close the agreement is
18                              must be based on a careful consideration of all the
19                       is a quantitative measure of how much your measured values
20                           be based on a careful consideration of all the factors
21                        of a measurement system refers to how close the agreement
22                      of accuracy and precision of a measuring system are related
23                           and precision of a measuring system are related to the
24                   precision of a measuring system are related to the uncertainty
25                      of all the factors that might contribute and their possible
26                 all the factors that might contribute and their possible effects
27                         based on a careful consideration of all the factors that
28                  uncertainty is a quantitative measure of how much your measured
29                  a quantitative measure of how much your measured values deviate
30                          measure of how much your measured values deviate from a
31                         of how much your measured values deviate from a standard
32               careful consideration of all the factors that might contribute and
33                 a careful consideration of all the factors that might contribute
34                         on a careful consideration of all the factors that might
35               quantitative measure of how much your measured values deviate from
36                 consideration of all the factors that might contribute and their
37                          of a measuring system are related to the uncertainty in
38                         a measuring system are related to the uncertainty in the
39                               way to analyze the accuracy of a measurement is to
40                        a measurement system is related to the uncertainty of the
41                          uncertainty in a measurement must be based on a careful
42                             how close the measured value is to the correct value
43                        a measurement must be based on a careful consideration of
44                         of a measurement system is related to the uncertainty of
45                        in a measurement must be based on a careful consideration
46              measuring system are related to the uncertainty in the measurements
47                              one way to analyze the accuracy of a measurement is
48                      measurement must be based on a careful consideration of all
49                              the uncertainty in a measurement must be based on a
50                      a measured value deviates from a standard or expected value
51                  the measured value deviates from the standard or expected value
52                                how close the measured value is to the true value
53                              close a measurement is to the correct value for the
54                  the degree of accuracy and precision are related to uncertainty
55                refers to how close the agreement is between repeated measurement
56                        a measurement is to the correct value for the measurement
57                      uncertainty is a quantitative measure of how far a measured
58                            is a quantitative measure of how far a measured value
59                  precision refers to how close the agreement is between repeated
60                         to analyze the accuracy of a measurement is to determine
61                 measurement system is related to the uncertainty of the measured
62                       system is related to the uncertainty of the measured value
63                        analyze the accuracy of a measurement is to determine the
64            be determined after careful consideration of all possible factors and
65  determined after careful consideration of all possible contributing factors and
66                      shows how much your measured values deviate from a standard
67                     the degree of accuracy and precision of a measurement system
68                       the precision refers to how close the agreement is between
69                            of how much a measured value deviates from a standard
70                        far a measured value deviates from a standard or expected
71                              how close the measured value is to the actual value
72                is how close the agreement is between repeated measurements which
73       after careful consideration of all possible contributing factors and their
74                            how close a given measurement is to the correct value
75   be determined after careful consideration of all possible contributing factors
76         should be determined after careful consideration of all possible factors
77                             how far a measured value deviates from a standard or
78                     being measured and any other factors that affect the outcome
79                         how far the measured value deviates from the standard or
80                     uncertainty is a quantitative measure of how much a measured
81                           and accuracy of a measurement system is related to the
82                      object being measured and any other factors that affect the
83    careful consideration of all possible contributing factors and their possible
84            precision is how close the agreement is between repeated measurements
85    consideration of all possible contributing factors and their possible effects
86                    far the measured value deviates from the standard or expected
87                          in the object being measured and any other factors that
88                      uncertainty must be based on a careful consideration of all
89                            how much a measured value deviates from a standard or
90  measurement uncertainty should be determined after careful consideration of all
91                irregularities in the object being measured and any other factors
92          the agreement is between repeated measurements which are repeated under
93                      a quantitative measure of how far a measured value deviates
94     uncertainty should be determined after careful consideration of all possible
95                              measure of how far a measured value deviates from a
96          close the agreement is between repeated measurements which are repeated
97                degree of accuracy and precision are related to uncertainty which
98                              of the measuring device and the skill of the person
99               how close the agreement is between repeated measurements which are
100                    precision and accuracy of a measurement system is related to
101                 quantitative measure of how much a measured value deviates from
102                            accuracy is how close the measured values are to the
103                         and precision of a measurement system is related to the
104                    the degree of precision and accuracy of a measurement system
105                           of a measurement system refers to how close the match
106                     the object being measured and any other factors that affect
107        determined after careful consideration of all possible factors and their
108                            of how far a measured value deviates from a standard
109                    a quantitative measure of how much a measured value deviates
110                            measure of how much a measured value deviates from a
111                          is a quantitative measure of how much a measured value
112                          the degree of accuracy and precision is related to the
113                    accuracy and precision of a measurement system is related to
114                         the accuracy of a measurement is to determine the range
115           after careful consideration of all possible factors and their effects
116                  quantitative measure of how far a measured value deviates from
117                      much a measured value deviates from a standard or expected
118      between repeated measurements which are repeated under the same conditions
119             measurement system is related to the uncertainty of the measurement
120                     degree of precision and accuracy of a measurement system is
121                  accuracy of a measurement system is related to the uncertainty
122                 precision of a measurement system is related to the uncertainty
123              is between repeated measurements which are repeated under the same
124         agreement is between repeated measurements which are repeated under the
125                       accuracy indicates how close the measured value is to the
126                           accuracy shows how close the measured value is to the
127                    of precision and accuracy of a measurement system is related
    count
1      19
2      18
3      16
4      13
5      13
6       9
7       9
8       8
9       8
10      8
11      8
12      8
13      8
14      7
15      7
16      7
17      7
18      7
19      7
20      7
21      7
22      7
23      7
24      7
25      7
26      6
27      6
28      6
29      6
30      6
31      6
32      6
33      6
34      6
35      6
36      6
37      5
38      5
39      4
40      4
41      4
42      4
43      4
44      4
45      4
46      4
47      4
48      4
49      4
50      4
51      3
52      3
53      3
54      3
55      3
56      3
57      3
58      3
59      3
60      3
61      3
62      3
63      3
64      2
65      2
66      2
67      2
68      2
69      2
70      2
71      2
72      2
73      2
74      2
75      2
76      2
77      2
78      2
79      2
80      2
81      2
82      2
83      2
84      2
85      2
86      2
87      2
88      2
89      2
90      2
91      2
92      2
93      2
94      2
95      2
96      2
97      2
98      2
99      2
100     2
101     2
102     2
103     2
104     2
105     2
106     2
107     2
108     2
109     2
110     2
111     2
112     2
113     2
114     2
115     2
116     2
117     2
118     2
119     2
120     2
121     2
122     2
123     2
124     2
125     2
126     2
127     2

Fourteen-grams

fourteengrams <- x %>% textstat_collocations(size = 14) %>% arrange(-count) %>%
  summarise(collocation, count)
fourteengrams
                                                                                                  collocation
1                            accuracy is how close a measurement is to the correct value for that measurement
2                               the degree of accuracy and precision of a measuring system are related to the
3                     precision of a measurement system refers to how close the agreement is between repeated
4                       degree of accuracy and precision of a measuring system are related to the uncertainty
5                      on a careful consideration of all the factors that might contribute and their possible
6                          the precision of a measurement system refers to how close the agreement is between
7                          is a quantitative measure of how much your measured values deviate from a standard
8                           must be based on a careful consideration of all the factors that might contribute
9                            be based on a careful consideration of all the factors that might contribute and
10                a careful consideration of all the factors that might contribute and their possible effects
11                        based on a careful consideration of all the factors that might contribute and their
12                         accuracy and precision of a measuring system are related to the uncertainty in the
13                         measure of how much your measured values deviate from a standard or expected value
14                         a quantitative measure of how much your measured values deviate from a standard or
15                  quantitative measure of how much your measured values deviate from a standard or expected
16                          of accuracy and precision of a measuring system are related to the uncertainty in
17                           uncertainty in a measurement must be based on a careful consideration of all the
18                     and precision of a measuring system are related to the uncertainty in the measurements
19                 of a measurement system refers to how close the agreement is between repeated measurements
20                             a measurement must be based on a careful consideration of all the factors that
21                      uncertainty is a quantitative measure of how much your measured values deviate from a
22                               in a measurement must be based on a careful consideration of all the factors
23                         measurement must be based on a careful consideration of all the factors that might
24                           the uncertainty in a measurement must be based on a careful consideration of all
25                  of a measurement system refers to how close the agreement is between repeated measurement
26                                 one way to analyze the accuracy of a measurement is to determine the range
27 measurement uncertainty should be determined after careful consideration of all possible factors and their
28                        and precision of a measurement system is related to the uncertainty of the measured
29                             a quantitative measure of how far a measured value deviates from a standard or
30                 how close the agreement is between repeated measurements which are repeated under the same
31     uncertainty should be determined after careful consideration of all possible factors and their effects
32                      quantitative measure of how far a measured value deviates from a standard or expected
33                        precision and accuracy of a measurement system is related to the uncertainty of the
34                   is how close the agreement is between repeated measurements which are repeated under the
35                         uncertainty is a quantitative measure of how much a measured value deviates from a
36                            accuracy is how close a measurement is to the correct value for the measurement
37                             is a quantitative measure of how far a measured value deviates from a standard
38                     quantitative measure of how much a measured value deviates from a standard or expected
39                             measure of how far a measured value deviates from a standard or expected value
40                            is a quantitative measure of how much a measured value deviates from a standard
41                         uncertainty must be based on a careful consideration of all the factors that might
42                        accuracy and precision of a measurement system is related to the uncertainty of the
43                            measure of how much a measured value deviates from a standard or expected value
44                  irregularities in the object being measured and any other factors that affect the outcome
45          close the agreement is between repeated measurements which are repeated under the same conditions
46                          uncertainty is a quantitative measure of how far a measured value deviates from a
47                             the degree of precision and accuracy of a measurement system is related to the
48                      precision of a measurement system is related to the uncertainty of the measured value
49                            a quantitative measure of how much a measured value deviates from a standard or
50                     degree of precision and accuracy of a measurement system is related to the uncertainty
51                         of precision and accuracy of a measurement system is related to the uncertainty of
   count
1     12
2      7
3      7
4      7
5      6
6      6
7      6
8      6
9      6
10     6
11     6
12     5
13     5
14     5
15     5
16     5
17     4
18     4
19     4
20     4
21     4
22     4
23     4
24     4
25     3
26     2
27     2
28     2
29     2
30     2
31     2
32     2
33     2
34     2
35     2
36     2
37     2
38     2
39     2
40     2
41     2
42     2
43     2
44     2
45     2
46     2
47     2
48     2
49     2
50     2
51     2

Eighteen-grams

eighteengrams <- x %>% textstat_collocations(size = 18) %>% arrange(-count) %>%
  summarise(collocation, count)
eighteengrams
                                                                                                        collocation
1      must be based on a careful consideration of all the factors that might contribute and their possible effects
2    uncertainty in a measurement must be based on a careful consideration of all the factors that might contribute
3     the degree of accuracy and precision of a measuring system are related to the uncertainty in the measurements
4         a measurement must be based on a careful consideration of all the factors that might contribute and their
5            in a measurement must be based on a careful consideration of all the factors that might contribute and
6  measurement must be based on a careful consideration of all the factors that might contribute and their possible
7  uncertainty is a quantitative measure of how much your measured values deviate from a standard or expected value
8           the uncertainty in a measurement must be based on a careful consideration of all the factors that might
9     uncertainty is a quantitative measure of how much a measured value deviates from a standard or expected value
10 uncertainty must be based on a careful consideration of all the factors that might contribute and their possible
11     uncertainty is a quantitative measure of how far a measured value deviates from a standard or expected value
   count
1      6
2      4
3      4
4      4
5      4
6      4
7      4
8      4
9      2
10     2
11     2

Twenty-two-grams

twentytwograms <- x %>% textstat_collocations(size = 22) %>% arrange(-count) %>%
  summarise(collocation, count)
twentytwograms
                                                                                                                                collocation
1 uncertainty in a measurement must be based on a careful consideration of all the factors that might contribute and their possible effects
2     the uncertainty in a measurement must be based on a careful consideration of all the factors that might contribute and their possible
  count
1     4
2     4

Loop the sixgrams through the students’ essays and count each sixgram found.

l <- list()
for(i in 1:nrow(sixgrams)){
  l[[i]] <- ifelse(grepl(sixgrams$collocation[i], df$x),1,0)
}
df1 <- data.frame(matrix(unlist(l), nrow=length(l), byrow=TRUE))
colSums(df1)
  X1   X2   X3   X4   X5   X6   X7   X8   X9  X10  X11  X12  X13  X14  X15  X16 
   5    0    9   23    6    0   18   17   45    8   10    2    2   25    2   16 
 X17  X18  X19  X20  X21  X22  X23  X24  X25  X26  X27  X28  X29  X30  X31  X32 
   2   21    1    0    2   35    0   31   38    3    2    7   21    5   39    0 
 X33  X34  X35  X36  X37  X38  X39  X40  X41  X42  X43  X44  X45  X46  X47  X48 
  11    1   41   25   19   16    9    9    7    9   17    6    9   14   13    1 
 X49  X50  X51  X52  X53  X54  X55  X56  X57  X58  X59  X60  X61  X62  X63  X64 
   4   18   15    0    3    3   32    6   14    7    1    5   66   22    1   29 
 X65  X66  X67  X68  X69  X70  X71  X72  X73  X74  X75  X76  X77  X78  X79  X80 
   0    0    0    1   16   38    0    0    6   17    2    7   10    3    0    7 
 X81  X82  X83  X84  X85  X86  X87  X88  X89  X90  X91  X92  X93  X94  X95  X96 
   0    2   21   20    2    9    2   30    0    0    1    4    9    3    7   12 
 X97  X98  X99 X100 X101 
   5    0   17   10   15 

Add the sixgram counts to the students’ data frame.

df$sixgrams <- colSums(df1)
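
The loop above passes each collocation to grepl() as a regular expression and matches case-sensitively, while the collocations printed earlier are all lowercase, so a shared phrase that opens a sentence may go uncounted. Below is a minimal sketch of a literal, case-insensitive count. The helper count_shared_ngrams() and the toy inputs are hypothetical and not part of the original analysis, and counts produced this way could differ from those reported here.

```r
# Hypothetical helper (not part of the original analysis): for each essay,
# count how many of the shared n-grams it contains.
count_shared_ngrams <- function(ngrams, texts) {
  texts  <- tolower(as.character(texts))   # match the lowercase collocations
  ngrams <- tolower(as.character(ngrams))
  vapply(
    texts,
    # fixed = TRUE treats the n-gram as a literal string, not a regex
    function(txt) sum(vapply(ngrams, function(g) grepl(g, txt, fixed = TRUE), logical(1))),
    integer(1),
    USE.NAMES = FALSE
  )
}

# Toy inputs for illustration only.
toy_ngrams <- c("how close the measured value is", "a standard or expected value")
toy_texts <- c(
  "Accuracy is how close the measured value is to the true value.",
  "How close the measured value is matters; it may deviate from a standard or expected value.",
  "Nothing shared here."
)
count_shared_ngrams(toy_ngrams, toy_texts)
```

The second toy essay starts with a capitalised "How", which a case-sensitive match would miss.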

Comment: Likewise, add the ten-gram, fourteen-gram, eighteen-gram and 22-gram counts to the data frame.

Ten-grams

tengrams <- x %>% textstat_collocations(size=10) %>% arrange(-count)
l <- list()
for(i in 1:nrow(tengrams)){
  l[[i]] <- ifelse(grepl(tengrams$collocation[i], df$x),1,0)
}
df1 <- data.frame(matrix(unlist(l), nrow=length(l), byrow=TRUE))
df$tengrams <- colSums(df1)

Fourteen-grams

fourteengrams <- x %>% textstat_collocations(size=14) %>% arrange(-count)
l <- list()
for(i in 1:nrow(fourteengrams)){
  l[[i]] <- ifelse(grepl(fourteengrams$collocation[i], df$x),1,0)
}
df1 <- data.frame(matrix(unlist(l), nrow=length(l), byrow=TRUE))
df$fourteengrams <- colSums(df1)

Eighteen-grams

eighteengrams <- x %>% textstat_collocations(size=18) %>% arrange(-count)
l <- list()
for(i in 1:nrow(eighteengrams)){
  l[[i]] <- ifelse(grepl(eighteengrams$collocation[i], df$x),1,0)
}
df1 <- data.frame(matrix(unlist(l), nrow=length(l), byrow=TRUE))
df$eighteengrams <- colSums(df1)

Twenty-two-grams

twentytwograms <- x %>% textstat_collocations(size=22) %>% arrange(-count)
l <- list()
for(i in 1:nrow(twentytwograms)){
  l[[i]] <- ifelse(grepl(twentytwograms$collocation[i], df$x),1,0)
}
df1 <- data.frame(matrix(unlist(l), nrow=length(l), byrow=TRUE))
df$twentytwograms <- colSums(df1)

Take a glimpse at the data frame.

glimpse(df)
Rows: 101
Columns: 10
$ x              <corpus> "There are two types of indices to evaluate an exper…
$ given_name     <chr> "Shou", "Ryo", "Masatsugu", "Haruya", "Shunta", "Hayate…
$ family_name    <chr> "ANDO", "INOUE", "UMETA", "OMORI", "OKA", "ONO", "KIMUR…
$ class          <fct> class_3, class_3, class_3, class_3, class_3, class_3, c…
$ similarity     <dbl> 32, 0, 39, 39, 31, 0, 64, 67, 99, 19, 59, 24, 0, 66, 0,…
$ sixgrams       <dbl> 5, 0, 9, 23, 6, 0, 18, 17, 45, 8, 10, 2, 2, 25, 2, 16, …
$ tengrams       <dbl> 1, 0, 0, 10, 0, 0, 2, 0, 15, 1, 1, 0, 0, 9, 0, 6, 0, 2,…
$ fourteengrams  <dbl> 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0…
$ eighteengrams  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ twentytwograms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Plot the similarity scores against six-grams.

df %>% ggplot(aes(x=sixgrams, y=similarity)) +
  geom_point() +
  stat_smooth(method = "lm") +
  labs(title = "Similarity versus Learners' 6-gram Duplicate Counts in Corpus")

Comment: There are several students who populate the x-axis having scored zero for similarity. Interestingly, one student even had 20 six-grams in her text.

Plot the similarity scores against ten-grams.

df %>% ggplot(aes(x=tengrams, y=similarity)) +
  geom_point() +
  stat_smooth(method = "lm") +
  labs(title = "Similarity versus Learners' 10-gram Duplicate Counts in Corpus")

Comment: Now the y-axis is starting to populate with students who had no shared ten-gram collocations within their texts yet still scored higher than zero for similarity. These students must be borrowing from alternate sources.

Plot the similarity scores against fourteen-grams.

df %>% ggplot(aes(x=fourteengrams, y=similarity)) +
  geom_point() +
  stat_smooth(method = "lm") +
  labs(title = "Similarity versus Learners' 14-gram Duplicate Counts in Corpus")

Comment: Most students with a fourteen-gram collocation in their text scored near to fifty for similarity or higher.

Plot the similarity scores against eighteen-grams.

df %>% ggplot(aes(x=eighteengrams, y=similarity)) +
  geom_point() +
  stat_smooth(method = "lm") +
  labs(title = "Similarity versus Learners' 18-gram Duplicate Counts in Corpus")

Comment: Few students have an eighteen-gram collocation in their texts.

Plot the similarity scores against twenty-two-grams.

df %>% ggplot(aes(x=twentytwograms, y=similarity)) +
  geom_point() +
  stat_smooth(method = "lm") +
  labs(title = "Similarity versus Learners' 22-gram Duplicate Counts in Corpus")

Comment: Only three students had a 22-gram in their text.

The average similarity score for these students was about 64 (their scores were 40, 70 and 82).

What is the correlation between six-grams and similarity?

cor.test(df$similarity, df$sixgrams)

    Pearson's product-moment correlation

data:  df$similarity and df$sixgrams
t = 14.077, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7405369 0.8739767
sample estimates:
      cor 
0.8179764 

Comment: The correlation between six-grams and similarity is very high (r = 0.82).
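
The df = 98 in the output reflects 100 complete pairs out of 101 students: cor.test() drops the student without a similarity score by itself, so no extra argument is needed to handle the missing value. A small sketch with hypothetical toy values illustrates this behaviour.

```r
# Toy data (hypothetical values): one student lacks a similarity score.
toy <- data.frame(
  similarity = c(32, 0, 39, NA, 31, 64, 67, 99),
  sixgrams   = c(5, 0, 9, 23, 6, 18, 17, 45)
)

# cor.test() keeps only the complete pairs before testing.
res <- cor.test(toy$similarity, toy$sixgrams)
res$parameter   # degrees of freedom = number of complete pairs - 2

# The estimate agrees with a correlation computed on the complete cases only.
complete <- complete.cases(toy)
manual <- cor(toy$similarity[complete], toy$sixgrams[complete])
all.equal(unname(res$estimate), manual)
```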

What is the correlation between ten-grams and similarity?

cor.test(df$similarity, df$tengrams)

    Pearson's product-moment correlation

data:  df$similarity and df$tengrams
t = 7.3455, df = 98, p-value = 6.141e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4524209 0.7092875
sample estimates:
      cor 
0.5958864 

Comment: The correlation between ten-grams and similarity is moderately high (r = 0.60).

What is the correlation between fourteen-grams and similarity?

cor.test(df$similarity, df$fourteengrams)

    Pearson's product-moment correlation

data:  df$similarity and df$fourteengrams
t = 4.5181, df = 98, p-value = 1.744e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2382085 0.5655001
sample estimates:
      cor 
0.4152001 

Comment: The correlation between fourteen-grams and similarity is lower (r = 0.42) but still statistically significant.

One student didn’t have a similarity score. Give her the average score before running linear regression.

df[is.na(df$similarity),]$similarity <- mean(df$similarity, na.rm = TRUE)

Split the data into train and test sets.

set.seed(1066)
spl <- sample.split(df$similarity, SplitRatio = 0.75)
train <- df[spl==T,]
test <- df[spl==F,]

Run a linear regression model for similarity against n-gram counts.

mod <- glm(similarity ~ sixgrams + tengrams + fourteengrams + eighteengrams + twentytwograms, data = train)
summary(mod)

Call:
glm(formula = similarity ~ sixgrams + tengrams + fourteengrams + 
    eighteengrams + twentytwograms, data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-51.890   -9.557   -3.374    9.279   46.054  

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      3.3743     2.6451   1.276   0.2056    
sixgrams         3.2338     0.3414   9.473 7.38e-15 ***
tengrams        -3.2319     1.3323  -2.426   0.0174 *  
fourteengrams   -0.1715     2.7885  -0.061   0.9511    
eighteengrams    7.8773     7.4765   1.054   0.2951    
twentytwograms -26.6014    21.5944  -1.232   0.2215    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 249.972)

    Null deviance: 78878  on 88  degrees of freedom
Residual deviance: 20748  on 83  degrees of freedom
AIC: 751.76

Number of Fisher Scoring iterations: 2

Comment: The six-gram and ten-gram counts are significant.

Simplify the model for just these two variables.

mod2 <- glm(similarity ~ sixgrams + tengrams, data = train)
summary(mod2)

Call:
glm(formula = similarity ~ sixgrams + tengrams, data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-52.083  -10.007   -3.860    9.421   46.568  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.8598     2.5651   1.505    0.136    
sixgrams      3.1429     0.2818  11.154  < 2e-16 ***
tengrams     -2.9271     0.5901  -4.960 3.51e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 246.5069)

    Null deviance: 78878  on 88  degrees of freedom
Residual deviance: 21200  on 86  degrees of freedom
AIC: 747.68

Number of Fisher Scoring iterations: 2

Comment: The two-predictor model has a lower AIC (747.68 versus 751.76), so it may be the better one to use.
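
As a rough illustration of this kind of comparison, the sketch below fits a larger and a smaller Gaussian glm() to simulated data and places their AIC values side by side. The data are invented for the example (the column names merely echo this analysis), and the added F-test is an assumption about how one might double-check the AIC result, not a step taken in the original.

```r
# Simulated data for illustration only; not the students' data.
set.seed(1066)
n <- 90
toy <- data.frame(sixgrams = rpois(n, 12))
toy$tengrams      <- rbinom(n, size = toy$sixgrams, prob = 0.3)
toy$fourteengrams <- rbinom(n, size = toy$tengrams, prob = 0.3)
toy$similarity    <- 3 + 3 * toy$sixgrams - 3 * toy$tengrams + rnorm(n, sd = 15)

# Larger model and the simplified model nested inside it.
full  <- glm(similarity ~ sixgrams + tengrams + fourteengrams, data = toy)
small <- glm(similarity ~ sixgrams + tengrams, data = toy)

# Lower AIC is preferred; each extra parameter costs 2 units of penalty.
aic_table <- AIC(full, small)
aic_table

# Optional check: does the extra term reduce the residual deviance significantly?
anova(small, full, test = "F")
```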

Run the model against the test set and compute its out-of-sample R-squared.

preds <- predict(mod2, newdata = test)
SSE <- sum((test$similarity - preds)^2)
SST <- sum((test$similarity - mean(train$similarity))^2)
Rsquared1 <- 1 - SSE/SST
Rsquared1
[1] 0.789131

Comment: About 79 percent of the variance in the test-set similarity scores can be accounted for by the counts of six-grams and ten-grams.
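
The calculation above compares the model's squared error on the test set with the squared error of always predicting the training mean. It can be wrapped in a small function, sketched here with toy numbers; the name oos_r_squared() is hypothetical.

```r
# Hypothetical helper: out-of-sample R-squared relative to a baseline prediction
# (in the analysis above the baseline is the mean of the training scores).
oos_r_squared <- function(actual, predicted, baseline) {
  sse <- sum((actual - predicted)^2)  # squared error of the model
  sst <- sum((actual - baseline)^2)   # squared error of the baseline
  1 - sse / sst
}

# Toy numbers for illustration only.
actual    <- c(10, 20, 30)
predicted <- c(12, 18, 33)
oos_r_squared(actual, predicted, baseline = 20)   # 1 - 17/200 = 0.915
```

Unlike the in-sample R-squared, this value can be negative when the model predicts worse than the baseline.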

Run the model on the entire dataset and compute predictions for all students.

mod3 <- glm(similarity ~ sixgrams + tengrams, data = df)
summary(mod3)

Call:
glm(formula = similarity ~ sixgrams + tengrams, data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-52.047   -9.036   -2.607    9.471   47.033  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.6068     2.2998   1.133     0.26    
sixgrams      3.2146     0.2689  11.952  < 2e-16 ***
tengrams     -2.9702     0.5742  -5.173 1.22e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 240.5035)

    Null deviance: 89705  on 100  degrees of freedom
Residual deviance: 23569  on  98  degrees of freedom
AIC: 845.34

Number of Fisher Scoring iterations: 2

Predict similarity scores for the students based on their counts for six-grams and ten-grams.

df$predictions <- predict(mod3)

Run a scatterplot of the students’ similarity scores against their predictions.

df %>% ggplot(aes(x=predictions, y=similarity)) +
  geom_point() +
  stat_smooth(method="lm") +
  labs(title = "Similarity Versus Linear Regression Predictions Based on Corpus 6-gram and 10-gram Counts")

Comment: The model is quite a good fit but it predicts a high similarity for one student who had a score of zero for it. Perhaps that was the student who had twenty six-grams in her summary.

Investigate the student who had a high prediction for similarity but zero score for it.

df[df$predictions>50 & df$similarity==0,]$x %>%
  str_squish
[1] "When we measure some of objects using measurements, accuracy and precision is very important. First, accuracy is how close a measurement is to the correct value. If we use high accuracy , measurements are very close to the collect value. In contrast, low accuracy is very far. Next is precision. The precision refers to how close the agreement is between repeated measurements. If the spread of the measured value is small, the precision is high. On the other hand, when the spread of the measured value is large, it is low. In conclusion, The degree of accuracy and precision are related to uncertainty in the measurements."

Comment: His answer is clear and concise. The commercial algorithm did not highlight any of his text, even the bigrams for high and low accuracy which were highlighted for others. It would be unfair to penalize this student based on the model’s wild prediction, even if the model works well generally and has a fairly high out-of-sample R-squared of 0.79.
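
One way to find such cases systematically is to rank students by the size of their residual (actual score minus prediction), so that the largest disagreements can be read by hand rather than acted on automatically. The sketch below uses a hypothetical toy data frame, not the students' data.

```r
# Toy data for illustration only; student labels and values are hypothetical.
toy <- data.frame(
  student     = c("A", "B", "C", "D", "E"),
  similarity  = c(0, 32, 64, 0, 99),
  predictions = c(2.6, 15.7, 58.1, 61.3, 90.4)
)

# Residual: how far the commercial score sits from the model's prediction.
toy$residual <- toy$similarity - toy$predictions

# Largest absolute disagreements first.
flagged <- toy[order(abs(toy$residual), decreasing = TRUE), ]
head(flagged, 3)
```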

Take another glimpse at the data frame.

glimpse(df)
Rows: 101
Columns: 11
$ x              <corpus> "There are two types of indices to evaluate an exper…
$ given_name     <chr> "Shou", "Ryo", "Masatsugu", "Haruya", "Shunta", "Hayate…
$ family_name    <chr> "ANDO", "INOUE", "UMETA", "OMORI", "OKA", "ONO", "KIMUR…
$ class          <fct> class_3, class_3, class_3, class_3, class_3, class_3, c…
$ similarity     <dbl> 32.00, 0.00, 39.00, 39.00, 31.00, 0.00, 64.00, 67.00, 9…
$ sixgrams       <dbl> 5, 0, 9, 23, 6, 0, 18, 17, 45, 8, 10, 2, 2, 25, 2, 16, …
$ tengrams       <dbl> 1, 0, 0, 10, 0, 0, 2, 0, 15, 1, 1, 0, 0, 9, 0, 6, 0, 2,…
$ fourteengrams  <dbl> 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0…
$ eighteengrams  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ twentytwograms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ predictions    <dbl> 15.709452, 2.606848, 31.537922, 46.839669, 21.894231, 2…

Subset and inspect a data frame for students who scored more than 30 for similarity.

df %>% filter(similarity > 30) %>% select(4:10) %>% arrange(similarity)
                       class similarity sixgrams tengrams fourteengrams
Miyuu Kuramoto       class_3         31        6        0             0
Kiara Sato           class_3         32        5        1             0
Ami Yumoto           class_5         32       12        1             0
Manami Ichikawa      class_4         34        9        0             0
Chiyori Nakanishi    class_4         36        3        0             0
Yukishige Chikamatsu class_4         37        4        0             0
Takeyuki Kurahashi   class_3         39        9        0             0
Narumi Kuranoo       class_3         39       23       10             2
Saki Michieda        class_5         39       16        1             0
Sayaka Takahashi     class_5         39        9        1             0
Ayami Nagatomo       class_5         40       17       13             9
Miu Shitao           class_4         41       25        7             3
Hitomi Honda         class_5         41       15        9             5
Sadayuki Kimura      class_4         42        0        0             0
Takanori Ushioda     class_4         44       16        2             0
Haruna Saito         class_4         45       15        5             0
Kanehide Okano       class_5         46       21       10             2
Satone Kubo          class_4         49       14        0             0
Rei Nishikawa        class_4         51       13        1             0
Tokunaga Remi        class_4         52       19        4             0
Tadao Ōtaka          class_4         52       14        0             0
Mototoki Hara        class_3         54       21        2             0
Taketaka Katsuta     class_4         58       32       12             3
Shigekata Akabane    class_3         59       10        1             0
Miho Miyazaki        class_3         64       18        2             0
Kamachi Yukina       class_3         65       21        2             0
Manaka Taguchi       class_4         65       22        4             0
Mai Homma            class_3         66       25        9             1
Nanase Yoshikawa     class_3         66       16        6             2
Maho Omori           class_4         66        9        0             0
Nobukiyo Ōishi       class_3         67       17        0             0
Momoka Hasegawa      class_4         67       17        1             0
Mitsutada Senba      class_4         70       41       21            10
Masaaki Mase         class_4         74       18        2             0
Kaoru Takaoka        class_4         75       29        7             0
Ayu Yamabe           class_5         75       38       17             7
Nobuyuki Terasaka    class_5         80       30       15             7
Erii Chiba           class_3         81       38       20             7
Kyoka Tada           class_5         81       17        1             0
Serika Nagano        class_4         82       66       35            17
Momoka Onishi        class_4         85       11        0             0
Ayaka Hidaritomo     class_3         88       35        5             0
Kanamaru Horibe      class_4         89       39       17             5
Kana Yasuda          class_3         93       31        7             0
Saki Kitazawa        class_3         99       45       15             1
                     eighteengrams twentytwograms
Miyuu Kuramoto                   0              0
Kiara Sato                       0              0
Ami Yumoto                       0              0
Manami Ichikawa                  0              0
Chiyori Nakanishi                0              0
Yukishige Chikamatsu             0              0
Takeyuki Kurahashi               0              0
Narumi Kuranoo                   0              0
Saki Michieda                    0              0
Sayaka Takahashi                 0              0
Ayami Nagatomo                   5              1
Miu Shitao                       0              0
Hitomi Honda                     1              0
Sadayuki Kimura                  0              0
Takanori Ushioda                 0              0
Haruna Saito                     0              0
Kanehide Okano                   0              0
Satone Kubo                      0              0
Rei Nishikawa                    0              0
Tokunaga Remi                    0              0
Tadao Ōtaka                      0              0
Mototoki Hara                    0              0
Taketaka Katsuta                 0              0
Shigekata Akabane                0              0
Miho Miyazaki                    0              0
Kamachi Yukina                   0              0
Manaka Taguchi                   0              0
Mai Homma                        0              0
Nanase Yoshikawa                 0              0
Maho Omori                       0              0
Nobukiyo Ōishi                   0              0
Momoka Hasegawa                  0              0
Mitsutada Senba                  6              2
Masaaki Mase                     0              0
Kaoru Takaoka                    0              0
Ayu Yamabe                       0              0
Nobuyuki Terasaka                2              0
Erii Chiba                       0              0
Kyoka Tada                       0              0
Serika Nagano                    5              1
Momoka Onishi                    0              0
Ayaka Hidaritomo                 0              0
Kanamaru Horibe                  0              0
Kana Yasuda                      0              0
Saki Kitazawa                    0              0

Inspect Miyuu’s essay.

df["Miyuu Kuramoto",]$x %>% str_squish
[1] "Measurement is important in the observations and experiments that make up science. Accuracy refers to how close the measured result is to the true value. Precision is the spread of measured values. The difference between these two causes situations such as low precision and high accuracy and high precision and low accuracy. Accuracy and precision are largely related to the uncertainty in measurement. Uncertainty is a quantitative measurement of how far the measured value deviates from the standard or expected value. There are various factors of uncertainty, and experiments and measurements must be carried out carefully considering the causes."

Comment: Miyuu had four essential bigrams highlighted in her text (high and low accuracy and precision), three natural quadrigrams, and a 10-gram collocation spanning two sentences according to the commercial algorithm. In my corpus, the student shared just six 6-gram collocations with her peers.

Check the next student.

df["Kiara Sato",]$x %>% str_squish
[1] "There are two types of indices to evaluate an experiment, accuracy, and precision. Accuracy means how close the measured value is to the true value. Precision means how close each of the measured values is. Not only accurate and precise measurements but there are high accuracy and low precise measurements. Moreover, there is another index, uncertainly. Uncertainty means how far the measured value is from the true value. So, high accuracy and high precision measurement of uncertainly is low, and low accuracy and low precision measurement of uncertainly would be high. In conclusion, careful consideration should be given to how much uncertainty the measurement has because any measurement has uncertainly."

Comment: Kiara had one long 20-gram string highlighted by the commercial algorithm, among other shorter strings. Within my corpus, she had one ten-gram duplicated with others, which also accounts for her five 6-grams (a ten-word string contains five overlapping six-word windows).
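
Why one shared ten-gram shows up as five 6-grams: a string of n words contains n - k + 1 overlapping windows of k words. The following is a minimal base-R sketch, not the counting code used earlier in this script. The helper word_windows is hypothetical, and the ten-word stretch is taken from Kiara's essay only as an illustration, since which ten words were actually shared is not recorded here.

```r
# Hypothetical helper: return every overlapping k-word window in a string.
word_windows <- function(text, k) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  n <- length(words)
  if (n < k) return(character(0))
  vapply(seq_len(n - k + 1),
         function(i) paste(words[i:(i + k - 1)], collapse = " "),
         character(1))
}

# An arbitrary ten-word stretch, chosen only for illustration.
ten_gram <- "how close the measured value is to the true value"
six_grams <- word_windows(ten_gram, 6)
length(six_grams)  # 10 - 6 + 1 = 5
```

The same arithmetic explains why the 6-gram counts in the tables are always the largest: every longer shared string contributes several shorter ones.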

Subset and inspect a data frame for students who scored more than 50 for similarity.

df %>% filter(similarity > 50) %>% select(4:10)
                    class similarity sixgrams tengrams fourteengrams
Miho Miyazaki     class_3         64       18        2             0
Nobukiyo Ōishi    class_3         67       17        0             0
Saki Kitazawa     class_3         99       45       15             1
Shigekata Akabane class_3         59       10        1             0
Mai Homma         class_3         66       25        9             1
Nanase Yoshikawa  class_3         66       16        6             2
Mototoki Hara     class_3         54       21        2             0
Ayaka Hidaritomo  class_3         88       35        5             0
Kana Yasuda       class_3         93       31        7             0
Erii Chiba        class_3         81       38       20             7
Kamachi Yukina    class_3         65       21        2             0
Kanamaru Horibe   class_4         89       39       17             5
Momoka Onishi     class_4         85       11        0             0
Mitsutada Senba   class_4         70       41       21            10
Tokunaga Remi     class_4         52       19        4             0
Maho Omori        class_4         66        9        0             0
Momoka Hasegawa   class_4         67       17        1             0
Tadao Ōtaka       class_4         52       14        0             0
Rei Nishikawa     class_4         51       13        1             0
Masaaki Mase      class_4         74       18        2             0
Taketaka Katsuta  class_4         58       32       12             3
Serika Nagano     class_4         82       66       35            17
Manaka Taguchi    class_4         65       22        4             0
Kaoru Takaoka     class_4         75       29        7             0
Ayu Yamabe        class_5         75       38       17             7
Kyoka Tada        class_5         81       17        1             0
Nobuyuki Terasaka class_5         80       30       15             7
                  eighteengrams twentytwograms
Miho Miyazaki                 0              0
Nobukiyo Ōishi                0              0
Saki Kitazawa                 0              0
Shigekata Akabane             0              0
Mai Homma                     0              0
Nanase Yoshikawa              0              0
Mototoki Hara                 0              0
Ayaka Hidaritomo              0              0
Kana Yasuda                   0              0
Erii Chiba                    0              0
Kamachi Yukina                0              0
Kanamaru Horibe               0              0
Momoka Onishi                 0              0
Mitsutada Senba               6              2
Tokunaga Remi                 0              0
Maho Omori                    0              0
Momoka Hasegawa               0              0
Tadao Ōtaka                   0              0
Rei Nishikawa                 0              0
Masaaki Mase                  0              0
Taketaka Katsuta              0              0
Serika Nagano                 5              1
Manaka Taguchi                0              0
Kaoru Takaoka                 0              0
Ayu Yamabe                    0              0
Kyoka Tada                    0              0
Nobuyuki Terasaka             2              0

Comment: I need to make a decision about the copying, and being proactive is better than doing nothing. A threshold of over 50 seems a sensible choice. I will run this script for successive assignments to check that it is still working.

What is the average number of n-grams that a student with a similarity score of over 50 would have?

df %>% filter(similarity > 50) %>%  
  summarise(avg6 = mean(sixgrams),
            avg10 = mean(tengrams),
            avg14 = mean(fourteengrams),
            avg18 = mean(eighteengrams),
            avg22 = mean(twentytwograms))
      avg6   avg10    avg14     avg18     avg22
1 25.62963 7.62963 2.222222 0.4814815 0.1111111

Sir, can you explain to me why I lost points for this assignment?

df["Rei Nishikawa",]$x %>% str_squish
[1] "Accuracy and precision are important in science. The difference between accuracy and precision is also important. Accuracy is how close a measurement is to the correct value for that measurement. On the other hand, precision is how close how close the agreement is between repeated measurements. The degree of accuracy and precision affects the uncertainty in the measurements. The skill of the person making the measurement is one of the examples of the factor of the uncertainty in the measurement. We have to take it into consideration when measuring something."

Comment: Certainly, Rei. You had a similarity score of 51, which means that over half of your essay was highlighted as unoriginal by a commercial similarity website. Your summary was also similar to other submissions for this assignment among my students this semester. In particular, you had one 10-word string which was the same as someone else’s, and thirteen 6-word strings in common with other students. Try not to copy directly from the text or other sources. Please use your own words.

***********************************************************************************

Read in the similarity scores for the second intensive reading assignment (3.3, June 7th, 2021).

ir <- read.csv("../ir.csv")
df$similarity2 <- ir$similarity2

Plot boxplots of the students’ similarity scores for assignment II.

df %>% ggplot(aes(x=class, y=similarity2)) +
  geom_boxplot()

Comment: Even though the median score for class 5 has increased, the overall trend is downward.

What is the difference between the two similarity scores?

mean(df$similarity) - mean(df$similarity2, na.rm=T)
[1] 4.97875

Comment: The similarity scores have come down five points per person on average.

What proportion of similarity scores for assignment I were no greater than 50?

mean(df$similarity<=50)
[1] 0.7326733

What proportion of similarity scores for assignment II were no greater than 50?

mean(df$similarity2<=50, na.rm = T)
[1] 0.8541667

Comment: The proportion of similarity scores no greater than 50 increased from 73 to 85 percent.

Draw boxplots for the third intensive reading assignment (4.3, June 21st, 2021).

df$similarity3 <- ir$similarity3
df %>% ggplot(aes(x=class, y=similarity3)) +
  geom_boxplot()

What proportion of similarity scores for assignment III were 50 points or less?

mean(df$similarity3<=50, na.rm = T)
[1] 0.9157895

Comment: The proportion of ‘over 50’ similarity scores has fallen to 8 percent.
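
The three proportion checks can also be run in one pass. This is a minimal sketch, assuming the scores sit in columns named similarity, similarity2, and similarity3 as above. The toy data frame and the helper prop_at_or_below are invented for illustration and are not the class data.

```r
# Toy scores (invented), with NA standing in for a missing submission.
toy <- data.frame(
  similarity  = c(0, 24, 51, 88, 12),
  similarity2 = c(0, 20, 40, NA, 60),
  similarity3 = c(0, 13, 30, 45, NA)
)

# Hypothetical helper: share of non-missing scores at or below a threshold.
prop_at_or_below <- function(scores, threshold = 50) {
  mean(scores <= threshold, na.rm = TRUE)
}

# One proportion per assignment column.
props <- sapply(toy, prop_at_or_below)
round(props, 2)

# The share over the threshold is the complement.
round(1 - props, 2)
```

Keeping the threshold as an argument makes it easy to see how sensitive the proportions are to the choice of 50.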

What were the median scores for each task?

median(df$similarity)
[1] 24
median(df$similarity2, na.rm = T)
[1] 20
median(df$similarity3, na.rm = T)
[1] 13

Comment: Over the three assignments, the median for similarity has fallen by 11 points.

Compare the three assignments using boxplots.

df %>% select(4:5, 12:13) %>%
  gather(task, "scores", 2:4) %>%
  ggplot(aes(x=task, y=scores)) +
  geom_boxplot()

Comment: The trend over time is for lower similarity scores.

Run a t-test of assignment I and assignment III.

t.test(df$similarity, df$similarity3, paired = T)

    Paired t-test

data:  df$similarity and df$similarity3
t = 4.4163, df = 94, p-value = 2.683e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  7.193066 18.943987
sample estimates:
mean of the differences 
               13.06853 

Comment: The difference between the similarity scores of assignments I and III was statistically significant (t = 4.42, df = 94, p < .001), with a mean drop of about 13 points.

Find the size of the difference.

library(effsize)
cohen.d(df$similarity, df$similarity3, conf.level=0.95, na.rm=T)

Cohen's d

d estimate: 0.5320765 (medium)
95 percent confidence interval:
    lower     upper 
0.2452521 0.8189008 

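Comment: Cohen’s d was 0.53, a medium effect (95 percent confidence interval 0.25 to 0.82). Note that cohen.d treats the two assignments as independent samples by default, whereas the t-test above was paired.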
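
Because the t-test above was paired, a paired effect size can be recovered from its printed output. This is a hedged sketch: for a paired t-test, the standardized mean difference (often written d_z) equals t divided by the square root of the number of pairs, which is df + 1. The values below are copied from the t-test output above. d_z is a different quantity from the d printed by cohen.d, so the two need not match.

```r
# Values copied from the paired t-test output above.
t_stat <- 4.4163
df_t <- 94
mean_diff <- 13.06853

n_pairs <- df_t + 1              # 95 complete pairs
d_z <- t_stat / sqrt(n_pairs)    # paired standardized mean difference
sd_diff <- mean_diff / d_z       # implied SD of the paired differences

round(d_z, 2)      # about 0.45
round(sd_diff, 1)
```

On this reading, the paired effect is slightly smaller than the reported 0.53 but still in the small-to-medium range.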

Final comment: Calibration didn’t quite work, but the students were asked to borrow less and they complied. To be continued…