library(tidyverse)
library(readtext)
library(quanteda)
library(quanteda.textstats)
library(caTools)
# corpus() accepts a readtext object directly; texts() is deprecated since quanteda 3.0
x <- list.files(pattern = "html", recursive = TRUE) %>%
readtext(ignore_missing_files = TRUE) %>% corpus()
family_names <- docnames(x) %>% word(2)
family_names <- gsub(",","", family_names)
given_names <- docnames(x) %>% word(3)
given_names <- gsub(",","", given_names)
df <- data.frame(x)
df$given_name <- given_names
df$family_name <- family_names
anon <- read.csv("../anon.csv")
set.seed(1066)
row.names(df) <- sample(anon$x, 101, replace=F)
df$class <- c(rep("class_3", 29), rep("class_4", 36), rep("class_5", 36)) %>% as.factor
p3 <- c(32,0,39,39,31,0,64,67,99,19,59,24,0,66,0,66,NA,54,0,12,0,88,0,93,81,17,0,0,65)
p4 <- c(0,89,0,85,10,70,41,52,44,66,30,19,34,67,0,28,52,51,0,37, 74,45,0,14,36,58,14,49,0,0,29,82,65,0,75,42)
p5 <- c(0,0,0,39,75,0,0,15,81,0,30,18,0,0,0,0,0,46,0,0,39,0,80, 0,0,0,0,0,24,0,32,0,0,40,18,41)
df$similarity <- c(p3, p4, p5)
ntoken(x) %>% min
[1] 91
ntoken(x) %>% max
[1] 182
ntoken(x) %>% mean
[1] 117.4158
nsentence(x) %>% min
[1] 3
nsentence(x) %>% max
[1] 13
nsentence(x) %>% mean
[1] 7.623762
fivenum(df$similarity)
[1] 0 0 24 52 99
mean(df$similarity, na.rm=T)
[1] 29.51
median(df$similarity, na.rm=T)
[1] 24
df %>% ggplot(aes(x=class, y=similarity)) +
geom_boxplot()
Class_3: TOEIC avg=492 max=509 min=475
Class_4: TOEIC avg=581 max=608 min=552
Class_5: TOEIC avg=488 max=516 min=458
mod <- aov(similarity ~ class, data = df)
summary(mod)
Df Sum Sq Mean Sq F value Pr(>F)
class 2 10217 5108 6.234 0.00284 **
Residuals 97 79488 819
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1 observation deleted due to missingness
Comment: The one-way ANOVA shows a statistically significant difference in mean similarity among the three classes (F(2, 97) = 6.23, p = 0.003).
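Comment: The F-test only says that the class means differ somewhere; it does not say how much of the variation in similarity is associated with class. As a rough supplement, the sketch below refits the same one-way ANOVA on the similarity scores listed earlier and computes eta-squared (between-class sum of squares divided by the total). The names `scores`, `fit` and `eta_sq` are illustrative and not part of the original analysis.

```r
# Similarity scores copied from the vectors p3, p4 and p5 defined earlier
p3 <- c(32,0,39,39,31,0,64,67,99,19,59,24,0,66,0,66,NA,54,0,12,0,88,0,93,81,17,0,0,65)
p4 <- c(0,89,0,85,10,70,41,52,44,66,30,19,34,67,0,28,52,51,0,37,74,45,0,14,36,58,14,49,0,0,29,82,65,0,75,42)
p5 <- c(0,0,0,39,75,0,0,15,81,0,30,18,0,0,0,0,0,46,0,0,39,0,80,0,0,0,0,0,24,0,32,0,0,40,18,41)

scores <- data.frame(
  similarity = c(p3, p4, p5),
  class = factor(rep(c("class_3", "class_4", "class_5"), times = c(29, 36, 36)))
)

# aov() drops the one missing score by default, as in the summary above
fit <- aov(similarity ~ class, data = scores)

# Sum Sq column: first entry is between classes, second is residual
ss <- summary(fit)[[1]][["Sum Sq"]]
eta_sq <- ss[1] / sum(ss)
eta_sq  # about 0.11, i.e. 10217 / (10217 + 79488) from the table above
```

On this reading, class membership is associated with roughly a tenth of the variance in similarity scores.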
TukeyHSD(mod)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = similarity ~ class, data = df)
$class
diff lwr upr p adj
class_4-class_3 1.472222 -15.69670 18.641141 0.9773020
class_5-class_3 -20.194444 -37.36336 -3.025526 0.0168428
class_5-class_4 -21.666667 -37.72672 -5.606614 0.0050580
Comment: Class_5 scores significantly lower than both Class_3 (mean difference -20.2, p = 0.017) and Class_4 (mean difference -21.7, p = 0.005). Class_3 and Class_4 do not differ significantly (p = 0.977).
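Comment: The `diff` column in the Tukey output is a difference of class means, so it can help to look at those means directly. A minimal sketch, reusing the score vectors from earlier (the name `class_means` is illustrative):

```r
# Similarity scores copied from the vectors p3, p4 and p5 defined earlier
p3 <- c(32,0,39,39,31,0,64,67,99,19,59,24,0,66,0,66,NA,54,0,12,0,88,0,93,81,17,0,0,65)
p4 <- c(0,89,0,85,10,70,41,52,44,66,30,19,34,67,0,28,52,51,0,37,74,45,0,14,36,58,14,49,0,0,29,82,65,0,75,42)
p5 <- c(0,0,0,39,75,0,0,15,81,0,30,18,0,0,0,0,0,46,0,0,39,0,80,0,0,0,0,0,24,0,32,0,0,40,18,41)

# Mean similarity per class, ignoring the one missing score in class_3
class_means <- sapply(list(class_3 = p3, class_4 = p4, class_5 = p5),
                      mean, na.rm = TRUE)
round(class_means, 2)
# class_3 class_4 class_5
#   36.25   37.72   16.06

# A difference of means reproduces the corresponding Tukey "diff" entry
unname(class_means["class_5"] - class_means["class_3"])  # about -20.19
```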
x %>% textstat_collocations(size= 2) %>% nrow
[1] 924
x %>% textstat_collocations(size= 4) %>% nrow
[1] 627
x %>% textstat_collocations(size= 6) %>% nrow
[1] 336
x %>% textstat_collocations(size= 8) %>% nrow
[1] 208
x %>% textstat_collocations(size= 10) %>% nrow
[1] 127
x %>% textstat_collocations(size= 12) %>% nrow
[1] 82
x %>% textstat_collocations(size= 14) %>% nrow
[1] 51
x %>% textstat_collocations(size= 16) %>% nrow
[1] 27
x %>% textstat_collocations(size= 18) %>% nrow
[1] 11
x %>% textstat_collocations(size= 20) %>% nrow
[1] 4
x %>% textstat_collocations(size= 22) %>% nrow
[1] 2
x %>% textstat_collocations(size= 24) %>% nrow
[1] 0
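Comment: The twelve calls above differ only in `size`, so the same counts could be collected in one pass. The sketch below shows the pattern on a small made-up corpus (`toy` and its three sentences are invented for illustration); with the real corpus one would pass `x` and `seq(2, 24, by = 2)` instead.

```r
library(quanteda)
library(quanteda.textstats)

# Toy corpus, invented for illustration only
toy <- corpus(c(
  d1 = "accuracy is how close a measurement is to the correct value",
  d2 = "accuracy is how close a measurement is to the true value",
  d3 = "precision is how close repeated measurements are to each other"
))

# Number of collocations found at each n-gram size
sizes <- 2:4
counts <- sapply(sizes, function(n) nrow(textstat_collocations(toy, size = n)))
names(counts) <- sizes
counts
```

Note that `textstat_collocations()` only reports sequences that occur at least `min_count` times (2 by default), which is why the counts shrink as `size` grows.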
Comment: I start with the six-grams, since six consecutive shared words may be the shortest contiguous string that could reasonably be construed as plagiarism.
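Comment: To make "contiguous string" concrete, here is a base-R sketch that lists every run of six consecutive words in each text and keeps the runs that appear in more than one document. It is not how `textstat_collocations()` works internally (that function also scores each sequence); the helper `word_ngrams` and the three example sentences are invented for illustration.

```r
# Hypothetical helper: all contiguous word n-grams in one text, lowercased
word_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Example texts, invented for illustration
docs <- c(
  a = "Accuracy is how close a measurement is to the correct value.",
  b = "Accuracy is how close a measurement is to the true value.",
  c = "Precision is about repeated measurements."
)

grams <- lapply(docs, word_ngrams, n = 6)

# In how many documents does each six-gram occur?
doc_freq <- table(unlist(lapply(grams, unique)))
shared <- names(doc_freq)[doc_freq >= 2]
shared
```

Here documents `a` and `b` share four six-grams (from "accuracy is how close a measurement" through "close a measurement is to the"), while `c` is too short to contain any.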
sixgrams <- x %>% textstat_collocations(size = 6) %>% arrange(-count) %>%
select(collocation, count)
sixgrams
collocation count
1 measurement is to the correct value 26
2 the degree of accuracy and precision 24
3 how close a measurement is to 23
4 close a measurement is to the 22
5 a measurement is to the correct 22
6 how close the agreement is between 21
7 is how close a measurement is 21
8 is to the correct value for 20
9 accuracy is how close a measurement 20
10 close the agreement is between repeated 19
11 is a quantitative measure of how 18
12 from a standard or expected value 18
13 uncertainty is a quantitative measure of 17
14 to the correct value for that 16
15 the correct value for that measurement 16
16 the agreement is between repeated measurements 15
17 degree of accuracy and precision of 14
18 a quantitative measure of how much 14
19 science is based on observation and 13
20 refers to how close the agreement 12
21 of accuracy and precision of a 12
22 to how close the agreement is 12
23 related to the uncertainty in the 11
24 are related to the uncertainty in 11
25 the skill of the person making 11
26 skill of the person making the 11
27 of the person making the measurement 11
28 measured values deviate from a standard 10
29 and precision of a measuring system 10
30 accuracy and precision of a measuring 10
31 accuracy and precision are related to 10
32 system refers to how close the 10
33 deviate from a standard or expected 10
34 is based on observation and experiment 10
35 the precision of a measurement system 10
36 close the measured value is to 9
37 related to the uncertainty of the 9
38 must be based on a careful 9
39 measurement system refers to how close 9
40 of a measurement system refers to 9
41 much your measured values deviate from 9
42 a measurement system refers to how 9
43 your measured values deviate from a 9
44 system are related to the uncertainty 9
45 to the uncertainty in the measurements 9
46 how much your measured values deviate 9
47 the measured value is to the 9
48 precision of a measuring system are 9
49 how close the measured value is 9
50 values deviate from a standard or 9
51 precision of a measurement system refers 8
52 measuring system are related to the 8
53 based on a careful consideration of 8
54 of a measuring system are related 8
55 be based on a careful consideration 8
56 measure of how much your measured 8
57 a measuring system are related to 8
58 careful consideration of all the factors 8
59 on a careful consideration of all 8
60 quantitative measure of how much your 8
61 measurements are accurate but not precise 8
62 factors that might contribute and their 7
63 a careful consideration of all the 7
64 that might contribute and their possible 7
65 all the factors that might contribute 7
66 of how much your measured values 7
67 factors contributing to uncertainty in a 7
68 the factors that might contribute and 7
69 of all the factors that might 7
70 of accuracy and precision are related 6
71 value is to the correct value 6
72 irregularities in the object being measured 6
73 contributing to uncertainty in a measurement 6
74 is how close the agreement is 6
75 might contribute and their possible effects 6
76 consideration of all the factors that 6
77 factors that contribute to measurement uncertainty 6
78 are repeated under the same conditions 5
79 is related to the uncertainty of 5
80 system is related to the uncertainty 5
81 way to analyze the accuracy of 5
82 measurement system is related to the 5
83 and precision are related to uncertainty 5
84 one way to analyze the accuracy 5
85 how close the measured values are 5
86 to analyze the accuracy of a 5
87 is how close the measured values 5
88 degree of accuracy and precision are 5
89 and precision of a measurement system 5
90 the accuracy of a measurement is 5
91 accuracy and precision of a measurement 5
92 precision are related to the uncertainty 4
93 close the measured values are to 4
94 a measurement must be based on 4
95 to the uncertainty of the measured 4
96 accuracy of a measurement is to 4
97 a measurement system is related to 4
98 deviates from the standard or expected 4
99 uncertainty in a measurement must be 4
100 to the correct value for the 4
101 measured value is to the correct 4
102 uncertainty must be based on a 4
103 of a measurement system is related 4
104 the factors contributing to uncertainty in 4
105 analyze the accuracy of a measurement 4
106 from the standard or expected value 4
107 in a measurement must be based 4
108 uncertainty is a quantitative measurement of 4
109 which are repeated under the same 4
110 they are precise but not accurate 4
111 any other factors that affect the 4
112 the correct value for the measurement 4
113 is a quantitative measurement of how 4
114 determined after careful consideration of all 4
115 and precision are related to the 4
116 measured value deviates from a standard 4
117 after careful consideration of all possible 4
118 precision refers to how close the 4
119 deviates from a standard or expected 4
120 other factors that affect the outcome 4
121 value deviates from a standard or 4
122 value deviates from the standard or 4
123 the uncertainty in a measurement must 4
124 be determined after careful consideration of 4
125 measured value deviates from the standard 4
126 all measurements contain some amount of 4
127 measurement must be based on a 4
128 or they are precise but not 4
129 the agreement is between repeated measurement 4
130 to the uncertainty of the measurement 4
131 a measured value deviates from a 4
132 the measured value deviates from the 3
133 a measurement is to determine the 3
134 accuracy is how close the measured 3
135 how close to the correct value 3
136 the irregularity of the object being 3
137 should be determined after careful consideration 3
138 the degree of precision and accuracy 3
139 to uncertainty in a measurement include 3
140 irregularity of the object being measured 3
141 of a measurement is to determine 3
142 accuracy is how close the measurement 3
143 measurements contain some degree of uncertainty 3
144 to the uncertainty of the measurements 3
145 the uncertainty of the measured value 3
146 degree of precision and accuracy of 3
147 value is to the true value 3
148 careful consideration of all possible contributing 3
149 of how much a measured value 3
150 consideration of all possible contributing factors 3
151 precision of a measurement system is 3
152 measurement is to determine the range 3
153 how much a measured value deviates 3
154 measure of how much a measured 3
155 accuracy and precision are very important 3
156 accuracy and precision of measuring system 3
157 are related to the uncertainty of 3
158 is to the correct value of 3
159 all measurements contain some degree of 3
160 of all possible contributing factors and 3
161 degree of accuracy and precision is 3
162 all possible contributing factors and their 3
163 all possible factors and their effects 3
164 precision is how close the agreement 3
165 measure of how far a measured 3
166 the accuracy and precision of a 3
167 of how far a measured value 3
168 means how close the agreement is 3
169 of accuracy and precision of measuring 3
170 measurements contain some amount of uncertainty 3
171 how close measurements are to the 3
172 accuracy refers to how close the 3
173 much a measured value deviates from 3
174 how close the measurement is to 3
175 close the measurement is to the 3
176 the accuracy of a measurement system 3
177 a quantitative measure of how far 3
178 quantitative measure of how far a 3
179 of precision and accuracy of a 3
180 quantitative measure of how much a 3
181 measured value is to the true 3
182 can be caused by the skill 2
183 the uncertainty must be based on 2
184 a measurement system means how close 2
185 of how much the measured value 2
186 between the lowest and highest values 2
187 related to the uncertainty which is 2
188 explains the difference of accuracy and 2
189 accuracy and precision of the measurement 2
190 second is the skill of the 2
191 range between the lowest and highest 2
192 spread of the measured value is 2
193 a given measurement is to the 2
194 the accuracy and precision of the 2
195 of the measuring device and the 2
196 determine the range of the measured 2
197 caused by the skill of the 2
198 indicates how close the measured value 2
199 the precision refers to how close 2
200 refers to the spread of the 2
201 to the uncertainty of a measurement 2
202 limitations of the measuring device and 2
203 quantitative measure of how much the 2
204 is related to accuracy and precision 2
205 such as limitations of the measuring 2
206 measured values are to each other 2
207 uncertainty can be caused by the 2
208 and any other factors that affect 2
209 the difference between the measured values 2
210 can be accurate but not precise 2
211 the limitations of the measuring instrument 2
212 be caused by the skill of 2
213 the person making the measurement is 2
214 both of them are important to 2
215 how far the measured value deviates 2
216 uncertainty should be determined after careful 2
217 to determine the range between the 2
218 device and the skill of the 2
219 measured value is from a standard 2
220 can say that the measurement is 2
221 far the measured value deviates from 2
222 close a given measurement is to 2
223 shows how much your measured values 2
224 how close the measurement result is 2
225 is how much the measured values 2
226 to the uncertainty which is a 2
227 as limitations of the measuring device 2
228 based on observation and experiment on 2
229 is based on observation and experimentation 2
230 that contribute to the uncertainty of 2
231 a quantitative measurement of how much 2
232 measure of how close the measured 2
233 means how close the obtained value 2
234 consideration of all possible factors and 2
235 measurement is accurate but not precise 2
236 and the causes of uncertain measurements 2
237 between the minimum and maximum values 2
238 minimum and maximum values of the 2
239 the difference between accuracy and precision 2
240 measure of how much measured values 2
241 close it is to the correct 2
242 how close it is to the 2
243 how close the obtained value is 2
244 and maximum values of the measurement 2
245 value is to the actual value 2
246 are various factors that contribute to 2
247 shows how close the measured value 2
248 so it doesn't consider how close 2
249 refers to how close the match 2
250 the limitations of the measuring device 2
251 and precision are important in science 2
252 how much the measured values deviate 2
253 how close a given measurement is 2
254 the measured values are to each 2
255 measurement uncertainty should be determined after 2
256 measurements are both accurate and precise 2
257 measured value is to the actual 2
258 both accuracy and precision are high 2
259 measuring device and the skill of 2
260 refers to how close the measured 2
261 quantitative measurement of how far the 2
262 of the measurement system is related 2
263 to the correct value of that 2
264 precision of a measurement system means 2
265 the minimum and maximum values of 2
266 measurements contain some amount of uncertainly 2
267 the spread of the measured values 2
268 uncertainty which is a quantitative measure 2
269 it is necessary to consider all 2
270 precision are related to uncertainty which 2
271 the object being measured and any 2
272 precision are related to uncertainty in 2
273 the measured values are to the 2
274 being measured and any other factors 2
275 the difference of accuracy and precision 2
276 and the skill of the person 2
277 to the uncertainty in the measurement 2
278 a quantitative measurement of how far 2
279 to determine the range of the 2
280 repeated measurements are to each other 2
281 repeated measurements which are repeated under 2
282 is to determine the range between 2
283 the measuring device and the skill 2
284 measured and any other factors that 2
285 and precision is related to the 2
286 far a measured value deviates from 2
287 there are various causes of uncertainty 2
288 the correct value of that measurement 2
289 how far a measured value deviates 2
290 cases that measurements are accurate but 2
291 the measurement system is related to 2
292 is how close measurements are to 2
293 of accuracy and precision when measuring 2
294 there are various factors that contribute 2
295 factors that contribute to the uncertainty 2
296 uncertainty can be thought of as 2
297 contributing factors and their possible effects 2
298 close measurements are to the correct 2
299 on observation and experiment on measurements 2
300 in the object being measured and 2
301 accuracy and precision is related to 2
302 accuracy shows how close the measured 2
303 the range of the measured value 2
304 there are various factors of uncertainty 2
305 the spread of the measured value 2
306 accuracy of a measurement system is 2
307 accuracy indicates how close the measured 2
308 say that the measurement is precise 2
309 can be thought of as a 2
310 and the irregularity of the object 2
311 of a measurement system means how 2
312 is from a standard or expected 2
313 is to determine the range of 2
314 is how close the measurement is 2
315 a measure of how close the 2
316 much the measured values deviate from 2
317 is between repeated measurements which are 2
318 and accuracy of a measurement system 2
319 object being measured and any other 2
320 agreement is between repeated measurements which 2
321 that measurements are accurate but not 2
322 given measurement is to the correct 2
323 of accuracy and precision is related 2
324 be thought of as a disclaimer 2
325 how close between measurements and target 2
326 precision and accuracy of a measurement 2
327 measurements which are repeated under the 2
328 a measured value is from a 2
329 between repeated measurements which are repeated 2
330 accuracy and precision are important in 2
331 of all possible factors and their 2
332 possible contributing factors and their possible 2
333 precision is how close the measured 2
334 the measurement is to the correct 2
335 the precision of a measurement is 2
336 careful consideration of all possible factors 2
tengrams <- x %>% textstat_collocations(size = 10) %>% arrange(-count) %>%
select(collocation, count)
tengrams
collocation count
1 is how close a measurement is to the correct value 19
2 accuracy is how close a measurement is to the correct 18
3 how close a measurement is to the correct value for 16
4 close a measurement is to the correct value for that 13
5 a measurement is to the correct value for that measurement 13
6 the degree of accuracy and precision of a measuring system 9
7 degree of accuracy and precision of a measuring system are 9
8 how much your measured values deviate from a standard or 8
9 refers to how close the agreement is between repeated measurements 8
10 much your measured values deviate from a standard or expected 8
11 your measured values deviate from a standard or expected value 8
12 system refers to how close the agreement is between repeated 8
13 precision of a measurement system refers to how close the 8
14 the precision of a measurement system refers to how close 7
15 accuracy and precision of a measuring system are related to 7
16 measurement system refers to how close the agreement is between 7
17 a measurement system refers to how close the agreement is 7
18 must be based on a careful consideration of all the 7
19 is a quantitative measure of how much your measured values 7
20 be based on a careful consideration of all the factors 7
21 of a measurement system refers to how close the agreement 7
22 of accuracy and precision of a measuring system are related 7
23 and precision of a measuring system are related to the 7
24 precision of a measuring system are related to the uncertainty 7
25 of all the factors that might contribute and their possible 7
26 all the factors that might contribute and their possible effects 6
27 based on a careful consideration of all the factors that 6
28 uncertainty is a quantitative measure of how much your measured 6
29 a quantitative measure of how much your measured values deviate 6
30 measure of how much your measured values deviate from a 6
31 of how much your measured values deviate from a standard 6
32 careful consideration of all the factors that might contribute and 6
33 a careful consideration of all the factors that might contribute 6
34 on a careful consideration of all the factors that might 6
35 quantitative measure of how much your measured values deviate from 6
36 consideration of all the factors that might contribute and their 6
37 of a measuring system are related to the uncertainty in 5
38 a measuring system are related to the uncertainty in the 5
39 way to analyze the accuracy of a measurement is to 4
40 a measurement system is related to the uncertainty of the 4
41 uncertainty in a measurement must be based on a careful 4
42 how close the measured value is to the correct value 4
43 a measurement must be based on a careful consideration of 4
44 of a measurement system is related to the uncertainty of 4
45 in a measurement must be based on a careful consideration 4
46 measuring system are related to the uncertainty in the measurements 4
47 one way to analyze the accuracy of a measurement is 4
48 measurement must be based on a careful consideration of all 4
49 the uncertainty in a measurement must be based on a 4
50 a measured value deviates from a standard or expected value 4
51 the measured value deviates from the standard or expected value 3
52 how close the measured value is to the true value 3
53 close a measurement is to the correct value for the 3
54 the degree of accuracy and precision are related to uncertainty 3
55 refers to how close the agreement is between repeated measurement 3
56 a measurement is to the correct value for the measurement 3
57 uncertainty is a quantitative measure of how far a measured 3
58 is a quantitative measure of how far a measured value 3
59 precision refers to how close the agreement is between repeated 3
60 to analyze the accuracy of a measurement is to determine 3
61 measurement system is related to the uncertainty of the measured 3
62 system is related to the uncertainty of the measured value 3
63 analyze the accuracy of a measurement is to determine the 3
64 be determined after careful consideration of all possible factors and 2
65 determined after careful consideration of all possible contributing factors and 2
66 shows how much your measured values deviate from a standard 2
67 the degree of accuracy and precision of a measurement system 2
68 the precision refers to how close the agreement is between 2
69 of how much a measured value deviates from a standard 2
70 far a measured value deviates from a standard or expected 2
71 how close the measured value is to the actual value 2
72 is how close the agreement is between repeated measurements which 2
73 after careful consideration of all possible contributing factors and their 2
74 how close a given measurement is to the correct value 2
75 be determined after careful consideration of all possible contributing factors 2
76 should be determined after careful consideration of all possible factors 2
77 how far a measured value deviates from a standard or 2
78 being measured and any other factors that affect the outcome 2
79 how far the measured value deviates from the standard or 2
80 uncertainty is a quantitative measure of how much a measured 2
81 and accuracy of a measurement system is related to the 2
82 object being measured and any other factors that affect the 2
83 careful consideration of all possible contributing factors and their possible 2
84 precision is how close the agreement is between repeated measurements 2
85 consideration of all possible contributing factors and their possible effects 2
86 far the measured value deviates from the standard or expected 2
87 in the object being measured and any other factors that 2
88 uncertainty must be based on a careful consideration of all 2
89 how much a measured value deviates from a standard or 2
90 measurement uncertainty should be determined after careful consideration of all 2
91 irregularities in the object being measured and any other factors 2
92 the agreement is between repeated measurements which are repeated under 2
93 a quantitative measure of how far a measured value deviates 2
94 uncertainty should be determined after careful consideration of all possible 2
95 measure of how far a measured value deviates from a 2
96 close the agreement is between repeated measurements which are repeated 2
97 degree of accuracy and precision are related to uncertainty which 2
98 of the measuring device and the skill of the person 2
99 how close the agreement is between repeated measurements which are 2
100 precision and accuracy of a measurement system is related to 2
101 quantitative measure of how much a measured value deviates from 2
102 accuracy is how close the measured values are to the 2
103 and precision of a measurement system is related to the 2
104 the degree of precision and accuracy of a measurement system 2
105 of a measurement system refers to how close the match 2
106 the object being measured and any other factors that affect 2
107 determined after careful consideration of all possible factors and their 2
108 of how far a measured value deviates from a standard 2
109 a quantitative measure of how much a measured value deviates 2
110 measure of how much a measured value deviates from a 2
111 is a quantitative measure of how much a measured value 2
112 the degree of accuracy and precision is related to the 2
113 accuracy and precision of a measurement system is related to 2
114 the accuracy of a measurement is to determine the range 2
115 after careful consideration of all possible factors and their effects 2
116 quantitative measure of how far a measured value deviates from 2
117 much a measured value deviates from a standard or expected 2
118 between repeated measurements which are repeated under the same conditions 2
119 measurement system is related to the uncertainty of the measurement 2
120 degree of precision and accuracy of a measurement system is 2
121 accuracy of a measurement system is related to the uncertainty 2
122 precision of a measurement system is related to the uncertainty 2
123 is between repeated measurements which are repeated under the same 2
124 agreement is between repeated measurements which are repeated under the 2
125 accuracy indicates how close the measured value is to the 2
126 accuracy shows how close the measured value is to the 2
127 of precision and accuracy of a measurement system is related 2
fourteengrams <- x %>% textstat_collocations(size = 14) %>% arrange(-count) %>%
select(collocation, count)
fourteengrams
collocation count
1 accuracy is how close a measurement is to the correct value for that measurement 12
2 the degree of accuracy and precision of a measuring system are related to the 7
3 precision of a measurement system refers to how close the agreement is between repeated 7
4 degree of accuracy and precision of a measuring system are related to the uncertainty 7
5 on a careful consideration of all the factors that might contribute and their possible 6
6 the precision of a measurement system refers to how close the agreement is between 6
7 is a quantitative measure of how much your measured values deviate from a standard 6
8 must be based on a careful consideration of all the factors that might contribute 6
9 be based on a careful consideration of all the factors that might contribute and 6
10 a careful consideration of all the factors that might contribute and their possible effects 6
11 based on a careful consideration of all the factors that might contribute and their 6
12 accuracy and precision of a measuring system are related to the uncertainty in the 5
13 measure of how much your measured values deviate from a standard or expected value 5
14 a quantitative measure of how much your measured values deviate from a standard or 5
15 quantitative measure of how much your measured values deviate from a standard or expected 5
16 of accuracy and precision of a measuring system are related to the uncertainty in 5
17 uncertainty in a measurement must be based on a careful consideration of all the 4
18 and precision of a measuring system are related to the uncertainty in the measurements 4
19 of a measurement system refers to how close the agreement is between repeated measurements 4
20 a measurement must be based on a careful consideration of all the factors that 4
21 uncertainty is a quantitative measure of how much your measured values deviate from a 4
22 in a measurement must be based on a careful consideration of all the factors 4
23 measurement must be based on a careful consideration of all the factors that might 4
24 the uncertainty in a measurement must be based on a careful consideration of all 4
25 of a measurement system refers to how close the agreement is between repeated measurement 3
26 one way to analyze the accuracy of a measurement is to determine the range 2
27 measurement uncertainty should be determined after careful consideration of all possible factors and their 2
28 and precision of a measurement system is related to the uncertainty of the measured 2
29 a quantitative measure of how far a measured value deviates from a standard or 2
30 how close the agreement is between repeated measurements which are repeated under the same 2
31 uncertainty should be determined after careful consideration of all possible factors and their effects 2
32 quantitative measure of how far a measured value deviates from a standard or expected 2
33 precision and accuracy of a measurement system is related to the uncertainty of the 2
34 is how close the agreement is between repeated measurements which are repeated under the 2
35 uncertainty is a quantitative measure of how much a measured value deviates from a 2
36 accuracy is how close a measurement is to the correct value for the measurement 2
37 is a quantitative measure of how far a measured value deviates from a standard 2
38 quantitative measure of how much a measured value deviates from a standard or expected 2
39 measure of how far a measured value deviates from a standard or expected value 2
40 is a quantitative measure of how much a measured value deviates from a standard 2
41 uncertainty must be based on a careful consideration of all the factors that might 2
42 accuracy and precision of a measurement system is related to the uncertainty of the 2
43 measure of how much a measured value deviates from a standard or expected value 2
44 irregularities in the object being measured and any other factors that affect the outcome 2
45 close the agreement is between repeated measurements which are repeated under the same conditions 2
46 uncertainty is a quantitative measure of how far a measured value deviates from a 2
47 the degree of precision and accuracy of a measurement system is related to the 2
48 precision of a measurement system is related to the uncertainty of the measured value 2
49 a quantitative measure of how much a measured value deviates from a standard or 2
50 degree of precision and accuracy of a measurement system is related to the uncertainty 2
51 of precision and accuracy of a measurement system is related to the uncertainty of 2
eighteengrams <- x %>% textstat_collocations(size = 18) %>% arrange(-count) %>%
select(collocation, count)
eighteengrams
collocation count
1 must be based on a careful consideration of all the factors that might contribute and their possible effects 6
2 uncertainty in a measurement must be based on a careful consideration of all the factors that might contribute 4
3 the degree of accuracy and precision of a measuring system are related to the uncertainty in the measurements 4
4 a measurement must be based on a careful consideration of all the factors that might contribute and their 4
5 in a measurement must be based on a careful consideration of all the factors that might contribute and 4
6 measurement must be based on a careful consideration of all the factors that might contribute and their possible 4
7 uncertainty is a quantitative measure of how much your measured values deviate from a standard or expected value 4
8 the uncertainty in a measurement must be based on a careful consideration of all the factors that might 4
9 uncertainty is a quantitative measure of how much a measured value deviates from a standard or expected value 2
10 uncertainty must be based on a careful consideration of all the factors that might contribute and their possible 2
11 uncertainty is a quantitative measure of how far a measured value deviates from a standard or expected value 2
twentytwograms <- x %>% textstat_collocations(size = 22) %>% arrange(-count) %>%
select(collocation, count)
twentytwograms
collocation
1 uncertainty in a measurement must be based on a careful consideration of all the factors that might contribute and their possible effects
2 the uncertainty in a measurement must be based on a careful consideration of all the factors that might contribute and their possible
count
1 4
2 4
l <- list()
for(i in 1:nrow(sixgrams)){
l[[i]] <- ifelse(grepl(sixgrams$collocation[i], df$x),1,0)
}
df1 <- data.frame(matrix(unlist(l), nrow=length(l), byrow=TRUE))
colSums(df1)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16
5 0 9 23 6 0 18 17 45 8 10 2 2 25 2 16
X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32
2 21 1 0 2 35 0 31 38 3 2 7 21 5 39 0
X33 X34 X35 X36 X37 X38 X39 X40 X41 X42 X43 X44 X45 X46 X47 X48
11 1 41 25 19 16 9 9 7 9 17 6 9 14 13 1
X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60 X61 X62 X63 X64
4 18 15 0 3 3 32 6 14 7 1 5 66 22 1 29
X65 X66 X67 X68 X69 X70 X71 X72 X73 X74 X75 X76 X77 X78 X79 X80
0 0 0 1 16 38 0 0 6 17 2 7 10 3 0 7
X81 X82 X83 X84 X85 X86 X87 X88 X89 X90 X91 X92 X93 X94 X95 X96
0 2 21 20 2 9 2 30 0 0 1 4 9 3 7 12
X97 X98 X99 X100 X101
5 0 17 10 15
df$sixgrams <- colSums(df1)
Comment: Likewise, add the ten-gram, fourteen-gram, eighteen-gram and 22-gram counts to the data frame.
tengrams <- x %>% textstat_collocations(size=10) %>% arrange(-count)
l <- list()
for(i in 1:nrow(tengrams)){
l[[i]] <- ifelse(grepl(tengrams$collocation[i], df$x),1,0)
}
df1 <- data.frame(matrix(unlist(l), nrow=length(l), byrow=TRUE))
df$tengrams <- colSums(df1)
fourteengrams <- x %>% textstat_collocations(size=14) %>% arrange(-count)
l <- list()
for(i in 1:nrow(fourteengrams)){
l[[i]] <- ifelse(grepl(fourteengrams$collocation[i], df$x),1,0)
}
df1 <- data.frame(matrix(unlist(l), nrow=length(l), byrow=TRUE))
df$fourteengrams <- colSums(df1)
eighteengrams <- x %>% textstat_collocations(size=18) %>% arrange(-count)
l <- list()
for(i in 1:nrow(eighteengrams)){
l[[i]] <- ifelse(grepl(eighteengrams$collocation[i], df$x),1,0)
}
df1 <- data.frame(matrix(unlist(l), nrow=length(l), byrow=TRUE))
df$eighteengrams <- colSums(df1)
twentytwograms <- x %>% textstat_collocations(size=22) %>% arrange(-count)
l <- list()
for(i in 1:nrow(twentytwograms)){
l[[i]] <- ifelse(grepl(twentytwograms$collocation[i], df$x),1,0)
}
df1 <- data.frame(matrix(unlist(l), nrow=length(l), byrow=TRUE))
df$twentytwograms <- colSums(df1)
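Comment: The five loops above differ only in the n-gram table they read, so they could be folded into one helper. The following is a minimal sketch under stated assumptions, not the code that produced the counts above: `count_shared_ngrams` is a hypothetical name, the toy texts and n-grams are invented, and `fixed = TRUE` (literal rather than regular-expression matching) is my assumption.

```r
# Hypothetical helper (not in the original script): for each text, count how
# many of the supplied n-gram strings occur in it.
count_shared_ngrams <- function(patterns, texts) {
  texts <- as.character(texts)
  # One column per pattern, one row per text; TRUE where the pattern occurs.
  hits <- vapply(
    patterns,
    function(p) grepl(p, texts, fixed = TRUE),  # literal match (assumption)
    FUN.VALUE = logical(length(texts))
  )
  # matrix() keeps the texts-by-patterns shape even when there is one text.
  rowSums(matrix(hits, nrow = length(texts)))
}

# Toy data standing in for df$x and sixgrams$collocation (invented values)
toy_texts <- c("the cat sat on the mat",
               "a dog sat on the mat today",
               "nothing shared here")
toy_ngrams <- c("sat on the mat", "the cat sat")
toy_counts <- count_shared_ngrams(toy_ngrams, toy_texts)
toy_counts
```

With such a helper, each column could be filled in one line, for example `df$sixgrams <- count_shared_ngrams(sixgrams$collocation, df$x)`. Like the loops, the match is case-sensitive.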
glimpse(df)
Rows: 101
Columns: 10
$ x <corpus> "There are two types of indices to evaluate an exper…
$ given_name <chr> "Shou", "Ryo", "Masatsugu", "Haruya", "Shunta", "Hayate…
$ family_name <chr> "ANDO", "INOUE", "UMETA", "OMORI", "OKA", "ONO", "KIMUR…
$ class <fct> class_3, class_3, class_3, class_3, class_3, class_3, c…
$ similarity <dbl> 32, 0, 39, 39, 31, 0, 64, 67, 99, 19, 59, 24, 0, 66, 0,…
$ sixgrams <dbl> 5, 0, 9, 23, 6, 0, 18, 17, 45, 8, 10, 2, 2, 25, 2, 16, …
$ tengrams <dbl> 1, 0, 0, 10, 0, 0, 2, 0, 15, 1, 1, 0, 0, 9, 0, 6, 0, 2,…
$ fourteengrams <dbl> 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0…
$ eighteengrams <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ twentytwograms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
df %>% ggplot(aes(x=sixgrams, y=similarity)) +
geom_point() +
stat_smooth(method = "lm") +
labs(title = "Similarity versus Learners' 6-gram Duplicate Counts in Corpus")
Comment: Several students sit along the x-axis, having scored zero for similarity. Interestingly, one of them even had 20 six-grams in their text.
df %>% ggplot(aes(x=tengrams, y=similarity)) +
geom_point() +
stat_smooth(method = "lm") +
labs(title = "Similarity versus Learners' 10-gram Duplicate Counts in Corpus")
df %>% ggplot(aes(x=fourteengrams, y=similarity)) +
geom_point() +
stat_smooth(method = "lm") +
labs(title = "Similarity versus Learners' 14-gram Duplicate Counts in Corpus")
Comment: Most students with a fourteen-gram collocation in their text scored close to fifty or higher for similarity.
df %>% ggplot(aes(x=eighteengrams, y=similarity)) +
geom_point() +
stat_smooth(method = "lm") +
labs(title = "Similarity versus Learners' 18-gram Duplicate Counts in Corpus")
Comment: Few students have an eighteen-gram collocation in their texts.
df %>% ggplot(aes(x=twentytwograms, y=similarity)) +
geom_point() +
stat_smooth(method = "lm") +
labs(title = "Similarity versus Learners' 22-gram Duplicate Counts in Corpus")
Comment: Only three students had a 22-gram in their text.
The average similarity score for these students was about 50.
cor.test(df$similarity, df$sixgrams, na.rm=T)
Pearson's product-moment correlation
data: df$similarity and df$sixgrams
t = 14.077, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7405369 0.8739767
sample estimates:
cor
0.8179764
Comment: The correlation between six-grams and similarity is very high (r = 0.82).
cor.test(df$similarity, df$tengrams, na.rm=T)
Pearson's product-moment correlation
data: df$similarity and df$tengrams
t = 7.3455, df = 98, p-value = 6.141e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4524209 0.7092875
sample estimates:
cor
0.5958864
Comment: The correlation between ten-grams and similarity is quite high (r = 0.60).
cor.test(df$similarity, df$fourteengrams, na.rm=T)
Pearson's product-moment correlation
data: df$similarity and df$fourteengrams
t = 4.5181, df = 98, p-value = 1.744e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2382085 0.5655001
sample estimates:
cor
0.4152001
Comment: The correlation between fourteen-grams and similarity is moderate (r = 0.42), weaker than for the shorter n-grams.
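Comment: The three correlations can also be computed in one pass. This is a minimal sketch on an invented toy data frame whose column names mirror the script. It reports only the Pearson coefficients, not the full `cor.test` output, and `use = "complete.obs"` is my choice for dropping incomplete pairs.

```r
# Toy data frame; the column names mirror the script, the values are invented.
toy <- data.frame(
  similarity    = c(0, 10, 25, 40, 60, 80),
  sixgrams      = c(0, 2, 6, 11, 18, 30),
  tengrams      = c(0, 0, 1, 3, 2, 9),
  fourteengrams = c(0, 0, 0, 1, 0, 2)
)
predictors <- c("sixgrams", "tengrams", "fourteengrams")

# Pearson r between similarity and each n-gram column;
# pairs with a missing value are dropped.
r_by_size <- sapply(predictors, function(v) {
  cor(toy$similarity, toy[[v]], use = "complete.obs")
})
round(r_by_size, 2)
```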
df[is.na(df$similarity),]$similarity <- mean(df$similarity, na.rm = TRUE)
set.seed(1066)
spl <- sample.split(df$similarity, SplitRatio = 0.75)
train <- df[spl==T,]
test <-df[spl==F,]
mod <- glm(similarity ~ sixgrams + tengrams + fourteengrams + eighteengrams + twentytwograms, data = train)
summary(mod)
Call:
glm(formula = similarity ~ sixgrams + tengrams + fourteengrams +
eighteengrams + twentytwograms, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-51.890 -9.557 -3.374 9.279 46.054
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.3743 2.6451 1.276 0.2056
sixgrams 3.2338 0.3414 9.473 7.38e-15 ***
tengrams -3.2319 1.3323 -2.426 0.0174 *
fourteengrams -0.1715 2.7885 -0.061 0.9511
eighteengrams 7.8773 7.4765 1.054 0.2951
twentytwograms -26.6014 21.5944 -1.232 0.2215
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 249.972)
Null deviance: 78878 on 88 degrees of freedom
Residual deviance: 20748 on 83 degrees of freedom
AIC: 751.76
Number of Fisher Scoring iterations: 2
Comment: Only the six-gram and ten-gram counts are significant predictors.
mod2 <- glm(similarity ~ sixgrams + tengrams, data = train)
summary(mod2)
Call:
glm(formula = similarity ~ sixgrams + tengrams, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-52.083 -10.007 -3.860 9.421 46.568
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.8598 2.5651 1.505 0.136
sixgrams 3.1429 0.2818 11.154 < 2e-16 ***
tengrams -2.9271 0.5901 -4.960 3.51e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 246.5069)
Null deviance: 78878 on 88 degrees of freedom
Residual deviance: 21200 on 86 degrees of freedom
AIC: 747.68
Number of Fisher Scoring iterations: 2
Comment: The two-predictor model has a lower AIC (747.68 versus 751.76), so it may be the better one to use.
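Comment: The AIC values of two fitted models can be placed side by side with `AIC()`. This is a minimal sketch using the built-in `mtcars` data as a stand-in for the training set. The variables and the model names `full` and `reduced` are illustrative only.

```r
# mtcars ships with R; it stands in for the training data here.
full    <- glm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- glm(mpg ~ wt + hp, data = mtcars)

# One row per model: number of estimated parameters (df) and AIC.
cmp <- AIC(full, reduced)
cmp

# Name of the model with the lowest AIC
best <- rownames(cmp)[which.min(cmp$AIC)]
best
```

A lower AIC favours a model only relative to the others fitted to the same data. It does not say how well the model predicts unseen cases.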
preds <- predict(mod2, newdata = test)
SSE <- sum((test$similarity - preds)^2)
SST <- sum((test$similarity - mean(train$similarity))^2)
Rsquared1 <- 1 - SSE/SST
Rsquared1
[1] 0.789131
Comment: The six-gram and ten-gram counts account for 79 percent of the variance in the test-set similarity scores.
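Comment: The calculation above is an out-of-sample R-squared: one minus the ratio of the model's squared error on the test set to the squared error of a baseline that always predicts the training mean. Here it is wrapped as a small function. This is a sketch with a hypothetical name, `oos_r_squared`, and invented toy numbers.

```r
# Hypothetical helper: out-of-sample R-squared against a constant baseline.
oos_r_squared <- function(actual, predicted, baseline) {
  sse <- sum((actual - predicted)^2)  # error of the model on the test set
  sst <- sum((actual - baseline)^2)   # error of always predicting the baseline
  1 - sse / sst
}

# Invented toy values: SSE = 4 + 4 + 9 = 17, SST = 100 + 0 + 100 = 200
toy_r2 <- oos_r_squared(actual    = c(10, 20, 30),
                        predicted = c(12, 18, 33),
                        baseline  = 20)
toy_r2  # 1 - 17/200 = 0.915
```

In the script the equivalent call would be `oos_r_squared(test$similarity, preds, mean(train$similarity))`.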
mod3 <- glm(similarity ~ sixgrams + tengrams, data = df)
summary(mod3)
Call:
glm(formula = similarity ~ sixgrams + tengrams, data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-52.047 -9.036 -2.607 9.471 47.033
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.6068 2.2998 1.133 0.26
sixgrams 3.2146 0.2689 11.952 < 2e-16 ***
tengrams -2.9702 0.5742 -5.173 1.22e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 240.5035)
Null deviance: 89705 on 100 degrees of freedom
Residual deviance: 23569 on 98 degrees of freedom
AIC: 845.34
Number of Fisher Scoring iterations: 2
df$predictions <- predict(mod3)
df %>% ggplot(aes(x=predictions, y=similarity)) +
geom_point() +
stat_smooth(method="lm") +
labs(title = "Similarity Versus Linear Regression Predictions Based on Corpus 6-gram and 10-gram Counts")
Comment: The model fits quite well, but it predicts a high similarity for one student whose actual score was zero. Perhaps that was the student who had twenty six-grams in their summary.
df[df$predictions>50 & df$similarity==0,]$x %>%
str_squish
[1] "When we measure some of objects using measurements, accuracy and precision is very important. First, accuracy is how close a measurement is to the correct value. If we use high accuracy , measurements are very close to the collect value. In contrast, low accuracy is very far. Next is precision. The precision refers to how close the agreement is between repeated measurements. If the spread of the measured value is small, the precision is high. On the other hand, when the spread of the measured value is large, it is low. In conclusion, The degree of accuracy and precision are related to uncertainty in the measurements."
Comment: His answer is clear and concise. The commercial algorithm did not highlight any of his text, not even the bigrams "high accuracy" and "low accuracy", which were highlighted for others. It would be unfair to penalize this student based on the model’s wild prediction, even if the model works well generally and has a fairly high out-of-sample R-squared of 0.79.
glimpse(df)
Rows: 101
Columns: 11
$ x <corpus> "There are two types of indices to evaluate an exper…
$ given_name <chr> "Shou", "Ryo", "Masatsugu", "Haruya", "Shunta", "Hayate…
$ family_name <chr> "ANDO", "INOUE", "UMETA", "OMORI", "OKA", "ONO", "KIMUR…
$ class <fct> class_3, class_3, class_3, class_3, class_3, class_3, c…
$ similarity <dbl> 32.00, 0.00, 39.00, 39.00, 31.00, 0.00, 64.00, 67.00, 9…
$ sixgrams <dbl> 5, 0, 9, 23, 6, 0, 18, 17, 45, 8, 10, 2, 2, 25, 2, 16, …
$ tengrams <dbl> 1, 0, 0, 10, 0, 0, 2, 0, 15, 1, 1, 0, 0, 9, 0, 6, 0, 2,…
$ fourteengrams <dbl> 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0…
$ eighteengrams <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ twentytwograms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ predictions <dbl> 15.709452, 2.606848, 31.537922, 46.839669, 21.894231, 2…
df %>% filter(similarity > 30) %>% select(4:10) %>% arrange(similarity)
class similarity sixgrams tengrams fourteengrams
Miyuu Kuramoto class_3 31 6 0 0
Kiara Sato class_3 32 5 1 0
Ami Yumoto class_5 32 12 1 0
Manami Ichikawa class_4 34 9 0 0
Chiyori Nakanishi class_4 36 3 0 0
Yukishige Chikamatsu class_4 37 4 0 0
Takeyuki Kurahashi class_3 39 9 0 0
Narumi Kuranoo class_3 39 23 10 2
Saki Michieda class_5 39 16 1 0
Sayaka Takahashi class_5 39 9 1 0
Ayami Nagatomo class_5 40 17 13 9
Miu Shitao class_4 41 25 7 3
Hitomi Honda class_5 41 15 9 5
Sadayuki Kimura class_4 42 0 0 0
Takanori Ushioda class_4 44 16 2 0
Haruna Saito class_4 45 15 5 0
Kanehide Okano class_5 46 21 10 2
Satone Kubo class_4 49 14 0 0
Rei Nishikawa class_4 51 13 1 0
Tokunaga Remi class_4 52 19 4 0
Tadao Ōtaka class_4 52 14 0 0
Mototoki Hara class_3 54 21 2 0
Taketaka Katsuta class_4 58 32 12 3
Shigekata Akabane class_3 59 10 1 0
Miho Miyazaki class_3 64 18 2 0
Kamachi Yukina class_3 65 21 2 0
Manaka Taguchi class_4 65 22 4 0
Mai Homma class_3 66 25 9 1
Nanase Yoshikawa class_3 66 16 6 2
Maho Omori class_4 66 9 0 0
Nobukiyo Ōishi class_3 67 17 0 0
Momoka Hasegawa class_4 67 17 1 0
Mitsutada Senba class_4 70 41 21 10
Masaaki Mase class_4 74 18 2 0
Kaoru Takaoka class_4 75 29 7 0
Ayu Yamabe class_5 75 38 17 7
Nobuyuki Terasaka class_5 80 30 15 7
Erii Chiba class_3 81 38 20 7
Kyoka Tada class_5 81 17 1 0
Serika Nagano class_4 82 66 35 17
Momoka Onishi class_4 85 11 0 0
Ayaka Hidaritomo class_3 88 35 5 0
Kanamaru Horibe class_4 89 39 17 5
Kana Yasuda class_3 93 31 7 0
Saki Kitazawa class_3 99 45 15 1
eighteengrams twentytwograms
Miyuu Kuramoto 0 0
Kiara Sato 0 0
Ami Yumoto 0 0
Manami Ichikawa 0 0
Chiyori Nakanishi 0 0
Yukishige Chikamatsu 0 0
Takeyuki Kurahashi 0 0
Narumi Kuranoo 0 0
Saki Michieda 0 0
Sayaka Takahashi 0 0
Ayami Nagatomo 5 1
Miu Shitao 0 0
Hitomi Honda 1 0
Sadayuki Kimura 0 0
Takanori Ushioda 0 0
Haruna Saito 0 0
Kanehide Okano 0 0
Satone Kubo 0 0
Rei Nishikawa 0 0
Tokunaga Remi 0 0
Tadao Ōtaka 0 0
Mototoki Hara 0 0
Taketaka Katsuta 0 0
Shigekata Akabane 0 0
Miho Miyazaki 0 0
Kamachi Yukina 0 0
Manaka Taguchi 0 0
Mai Homma 0 0
Nanase Yoshikawa 0 0
Maho Omori 0 0
Nobukiyo Ōishi 0 0
Momoka Hasegawa 0 0
Mitsutada Senba 6 2
Masaaki Mase 0 0
Kaoru Takaoka 0 0
Ayu Yamabe 0 0
Nobuyuki Terasaka 2 0
Erii Chiba 0 0
Kyoka Tada 0 0
Serika Nagano 5 1
Momoka Onishi 0 0
Ayaka Hidaritomo 0 0
Kanamaru Horibe 0 0
Kana Yasuda 0 0
Saki Kitazawa 0 0
df["Miyuu Kuramoto",]$x %>% str_squish
[1] "Measurement is important in the observations and experiments that make up science. Accuracy refers to how close the measured result is to the true value. Precision is the spread of measured values. The difference between these two causes situations such as low precision and high accuracy and high precision and low accuracy. Accuracy and precision are largely related to the uncertainty in measurement. Uncertainty is a quantitative measurement of how far the measured value deviates from the standard or expected value. There are various factors of uncertainty, and experiments and measurements must be carried out carefully considering the causes."
df["Kiara Sato",]$x %>% str_squish
[1] "There are two types of indices to evaluate an experiment, accuracy, and precision. Accuracy means how close the measured value is to the true value. Precision means how close each of the measured values is. Not only accurate and precise measurements but there are high accuracy and low precise measurements. Moreover, there is another index, uncertainly. Uncertainty means how far the measured value is from the true value. So, high accuracy and high precision measurement of uncertainly is low, and low accuracy and low precision measurement of uncertainly would be high. In conclusion, careful consideration should be given to how much uncertainty the measurement has because any measurement has uncertainly."
Comment: Kiara had one long 20-gram string highlighted by the commercial algorithm, among other shorter strings. Within my corpus, she shared one ten-gram with other students (which also accounts for her five six-grams).
df %>% filter(similarity > 50) %>% select(4:10)
class similarity sixgrams tengrams fourteengrams
Miho Miyazaki class_3 64 18 2 0
Nobukiyo Ōishi class_3 67 17 0 0
Saki Kitazawa class_3 99 45 15 1
Shigekata Akabane class_3 59 10 1 0
Mai Homma class_3 66 25 9 1
Nanase Yoshikawa class_3 66 16 6 2
Mototoki Hara class_3 54 21 2 0
Ayaka Hidaritomo class_3 88 35 5 0
Kana Yasuda class_3 93 31 7 0
Erii Chiba class_3 81 38 20 7
Kamachi Yukina class_3 65 21 2 0
Kanamaru Horibe class_4 89 39 17 5
Momoka Onishi class_4 85 11 0 0
Mitsutada Senba class_4 70 41 21 10
Tokunaga Remi class_4 52 19 4 0
Maho Omori class_4 66 9 0 0
Momoka Hasegawa class_4 67 17 1 0
Tadao Ōtaka class_4 52 14 0 0
Rei Nishikawa class_4 51 13 1 0
Masaaki Mase class_4 74 18 2 0
Taketaka Katsuta class_4 58 32 12 3
Serika Nagano class_4 82 66 35 17
Manaka Taguchi class_4 65 22 4 0
Kaoru Takaoka class_4 75 29 7 0
Ayu Yamabe class_5 75 38 17 7
Kyoka Tada class_5 81 17 1 0
Nobuyuki Terasaka class_5 80 30 15 7
eighteengrams twentytwograms
Miho Miyazaki 0 0
Nobukiyo Ōishi 0 0
Saki Kitazawa 0 0
Shigekata Akabane 0 0
Mai Homma 0 0
Nanase Yoshikawa 0 0
Mototoki Hara 0 0
Ayaka Hidaritomo 0 0
Kana Yasuda 0 0
Erii Chiba 0 0
Kamachi Yukina 0 0
Kanamaru Horibe 0 0
Momoka Onishi 0 0
Mitsutada Senba 6 2
Tokunaga Remi 0 0
Maho Omori 0 0
Momoka Hasegawa 0 0
Tadao Ōtaka 0 0
Rei Nishikawa 0 0
Masaaki Mase 0 0
Taketaka Katsuta 0 0
Serika Nagano 5 1
Manaka Taguchi 0 0
Kaoru Takaoka 0 0
Ayu Yamabe 0 0
Kyoka Tada 0 0
Nobuyuki Terasaka 2 0
Comment: I must make a decision about the copying, and being proactive is better than doing nothing. A similarity threshold of over 50 seems a sensible choice. I will run this script for successive assignments to check that it is still working.
df %>% filter(similarity > 50) %>%
summarise(avg6 = mean(sixgrams),
avg10 = mean(tengrams),
avg14 = mean(fourteengrams),
avg18 = mean(eighteengrams),
avg22 = mean(twentytwograms))
avg6 avg10 avg14 avg18 avg22
1 25.62963 7.62963 2.222222 0.4814815 0.1111111
df["Rei Nishikawa",]$x %>% str_squish
[1] "Accuracy and precision are important in science. The difference between accuracy and precision is also important. Accuracy is how close a measurement is to the correct value for that measurement. On the other hand, precision is how close how close the agreement is between repeated measurements. The degree of accuracy and precision affects the uncertainty in the measurements. The skill of the person making the measurement is one of the examples of the factor of the uncertainty in the measurement. We have to take it into consideration when measuring something."
Comment: Certainly Rei. You had a similarity score of 51 which means that over half of your essay was unoriginal and has been highlighted here by a commercial similarity website. Your summary was similar to other submissions for this assignment among my students this semester. In particular, you had one 10-word string which was the same as someone else’s. You also had thirteen 6-word strings. Try not to copy directly from the text or other sources. Please use your own words.
ir <- read.csv("../ir.csv")
df$similarity2 <- ir$similarity2
df %>% ggplot(aes(x=class, y=similarity2)) +
geom_boxplot()
Comment: Even though the median score for class 5 has increased, the overall trend is downward.
mean(df$similarity) - mean(df$similarity2, na.rm=T)
[1] 4.97875
Comment: The similarity scores have come down five points per person on average.
mean(df$similarity<=50)
[1] 0.7326733
mean(df$similarity2<=50, na.rm = T)
[1] 0.8541667
Comment: The share of similarity scores no greater than 50 has increased from 73 to 85 percent.
df$similarity3 <- ir$similarity3
df %>% ggplot(aes(x=class, y=similarity3)) +
geom_boxplot()
mean(df$similarity3<=50, na.rm = T)
[1] 0.9157895
Comment: The share of similarity scores over 50 has fallen to 8 percent.
median(df$similarity)
[1] 24
median(df$similarity2, na.rm = T)
[1] 20
median(df$similarity3, na.rm = T)
[1] 13
Comment: Over the three assignments, the median for similarity has fallen by 11 points.
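Comment: The median and the share of scores at or below the threshold can be tabulated for all three assignments at once. This is a minimal sketch on an invented toy data frame. The column names mirror the script, and the threshold of 50 comes from the decision above.

```r
# Invented toy scores; NA marks a student who did not submit.
toy_scores <- data.frame(
  similarity  = c(0, 24, 52, 70, 99),
  similarity2 = c(0, 20, 45, 60, NA),
  similarity3 = c(0, 13, 30, NA, 55)
)

# One column per assignment, one row per summary statistic.
summary_by_task <- sapply(toy_scores, function(s) {
  c(median           = median(s, na.rm = TRUE),
    share_at_most_50 = mean(s <= 50, na.rm = TRUE))
})
summary_by_task
```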
df %>% select(4:5, 12:13) %>%
gather(task, "scores", 2:4) %>%
ggplot(aes(x=task, y=scores)) +
geom_boxplot()
Comment: The trend over time is for lower similarity scores.
t.test(df$similarity, df$similarity3, paired = T)
Paired t-test
data: df$similarity and df$similarity3
t = 4.4163, df = 94, p-value = 2.683e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
7.193066 18.943987
sample estimates:
mean of the differences
13.06853
Comment: The difference between the similarity scores of assignments I and III was statistically significant (t = 4.42, df = 94, p < 0.001).
library(effsize)
cohen.d(df$similarity, df$similarity3, conf.level=0.95, na.rm=T)
Cohen's d
d estimate: 0.5320765 (medium)
95 percent confidence interval:
lower upper
0.2452521 0.8189008
Comment: The effect size was medium (Cohen’s d = 0.53).
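Comment: For reference, Cohen’s d with a pooled standard deviation can be computed by hand. This is a minimal sketch with a hypothetical function name and invented toy values. It treats the two score vectors as independent groups, which I assume is what the default `cohen.d` call above does as well.

```r
# Hypothetical helper: Cohen's d using the pooled standard deviation.
pooled_cohens_d <- function(a, b) {
  a <- a[!is.na(a)]
  b <- b[!is.na(b)]
  # Pooled variance: the two sample variances weighted by degrees of freedom.
  pooled_var <- ((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
    (length(a) + length(b) - 2)
  (mean(a) - mean(b)) / sqrt(pooled_var)
}

# Invented toy values; the NA is dropped before the calculation.
toy_d <- pooled_cohens_d(c(2, 4, 6, 8), c(1, 3, 5, 7, NA))
toy_d
```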
Final comment: Calibration didn’t quite work but the students were asked to borrow less and they complied. To be continued…
Comment: Class_5 has a median of zero. This is where I want the other classes to be. There is no clear reason for this at the moment.