Who are we?

We are a small research group of assessment experts and computer scientists at Ege University in İzmir, but our network extends beyond our immediate team. We are in touch with a colleague from the University of Alberta, and together we study automated scoring of written responses in Turkish, a language often considered low-resource. The project started only eight months ago, and we aim to compare several different scoring procedures.

How did we hear about CoGrader?

Automated scoring is a highly active research topic. We try to follow developments by setting up publication alerts in several databases, but we also enjoy chatting with Microsoft Edge’s Copilot. That is how we found out about CoGrader. We tried it and wanted to share our experience with the company. Aleph was helpful, and Gil was excited to collaborate.

How did we use CoGrader?

This short experience summary is not a validation attempt, but we are conducting research that compares CoGrader’s scores with scores from multiple raters and from other automated scoring (AS) procedures.

To get acquainted with CoGrader, we scored 590 essays. Each essay was written by an individual interested in learning Turkish as a second language who visited a well-known website to test their writing skills. Participation was free of charge and voluntary; participants responded to a prompt asking them to write, in 20 minutes and in at least 125 words, an e-mail complaining about a tour. Each essay was then scored by a single rater using a scoring rubric based on the Common European Framework of Reference for Languages. There were 20 raters in total, all trained professionals, and the rubric (also available in Turkish) has substantial validity evidence. The maximum score is 10, and there are six criteria: task completion, coherence/cohesion, grammatical accuracy, lexis, spelling/punctuation, and format.

We put our essays in a CSV file and uploaded them into the CoGrader interface. Uploading all 590 essays took around five minutes, and transferring our rubric into CoGrader was easy.
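For readers curious about the preparation step, here is a minimal sketch of how such an upload file can be assembled in R. The file name and column names are placeholders of our own choosing, not something CoGrader prescribes.

# Hypothetical assembly of the upload CSV; "essays_raw.txt" and the column
# names essay_id / essay_text are illustrative placeholders.
essay_texts <- readLines("essays_raw.txt")
essays <- data.frame(
  essay_id   = seq_along(essay_texts),
  essay_text = essay_texts,
  stringsAsFactors = FALSE
)
write.csv(essays, "essays_for_cograder.csv", row.names = FALSE)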

Preliminary findings and graphs

Here are the score distributions, with rater scores in red and CoGrader scores in blue. The plot is a clear invitation for improvement, but for a generic engine that received no training on our data, it still looks promising.
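The exact plotting code is not essential, but for readers who want to reproduce this kind of figure, here is a minimal sketch using ggplot2 and the same covdat columns used in the code below (Hmn_Essay1sum for the rater totals, AES1_Essay1sum for CoGrader).

library(ggplot2)

# Stack the two total-score columns into long format (base R, no reshaping package needed)
scores_long <- data.frame(
  total  = c(covdat$Hmn_Essay1sum, covdat$AES1_Essay1sum),
  source = rep(c("Rater", "CoGrader"), each = nrow(covdat))
)

# Overlaid densities: rater scores in red, CoGrader scores in blue
ggplot(scores_long, aes(x = total, fill = source)) +
  geom_density(alpha = 0.4) +
  scale_fill_manual(values = c(Rater = "red", CoGrader = "blue")) +
  labs(x = "Total score (0-10)", y = "Density", fill = NULL)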

Let’s check the correlations using the R package corrplot:
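A minimal sketch of the corrplot call is below. The six criterion-score column names are placeholders (only the total-score columns Hmn_Essay1sum and AES1_Essay1sum appear elsewhere in this post), so adjust them to your own data frame.

library(corrplot)

# Hypothetical column names for the six criterion scores assigned by the
# raters (Hmn_*) and by CoGrader (AES1_*)
criteria <- c("task", "coherence", "grammar", "lexis", "spelling", "format")
cols <- c(paste0("Hmn_", criteria), paste0("AES1_", criteria))

# Pearson correlations among all criterion scores, shown as a correlogram
cormat <- cor(covdat[, cols], use = "pairwise.complete.obs")
corrplot(cormat, method = "circle", type = "upper", tl.cex = 0.8)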

The correlation between the total scores assigned by the raters and by CoGrader is .69.
The last two criteria seem to cause the most disagreement, so attempts to revise the rubric might start with these two.

Quadratic Weighted Kappa (QWK) is a common measure for comparing rater scores with automated scores, and it is common practice to use a threshold value of .65; in other words, QWK > .65 can be interpreted as a promising result. Here we can use the Metrics package.

library(Metrics)
# QWK between the rater totals and the CoGrader totals (possible scores: 0-10)
ScoreQuadraticWeightedKappa(covdat$Hmn_Essay1sum,
                            covdat$AES1_Essay1sum,
                            min.rating = 0, max.rating = 10)
## [1] 0.6752977

The QWK is .68, arguably meeting the minimum standard of .65.

When two raters score an essay, a third rater might be invited if the disagreement between them is large. Using 2.5 points as the maximum tolerable disagreement, we can compute the proportion of cases where a third rater would be needed:

covdat$disag1 <- abs(covdat$Hmn_Essay1sum - covdat$AES1_Essay1sum)
sum(covdat$disag1 > 2.5)
## [1] 127
127 / 590
## [1] 0.2152542

With this somewhat arbitrary but helpful cut score of 2.5 out of 10, 22% of the cases would need a third score. Readers can judge whether this is acceptable or not.

Our final analysis asks whether there are subgroups of raters for which CoGrader assigns closer scores. The dependent variable is the absolute difference between the rater’s score and CoGrader’s, where lower values indicate better agreement. The independent variables are grading experience and whether the raters, when forced to pick one, consider themselves a dove or a hawk. We can use the glmertree package:
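Here is a minimal sketch of the model we have in mind, using an intercept-only linear mixed-effects model tree. The column names rater_id, experience, and dove_hawk are hypothetical; disag1 is the absolute difference computed above.

library(glmertree)

# Intercept-only node model, random intercept for the rater who scored each
# essay, and experience + dove/hawk self-classification as partitioning variables
tree_fit <- lmertree(
  disag1 ~ 1 | rater_id | experience + dove_hawk,
  data = covdat
)

# Inspect where the tree splits on experience and the dove/hawk variable
plot(tree_fit, which = "tree")
summary(tree_fit)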

Our preliminary finding is that agreement between the raters and CoGrader is relatively high, except when raters consider themselves doves and have less than six years of experience, or have more than ten years of experience and consider themselves hawks.

The average disagreement is lowest for raters with at least eight years of experience who consider themselves doves (when forced to choose between dove and hawk). Overall, the amount of disagreement varies among raters, as expected and as depicted below:
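A sketch of this kind of rater-level plot, again assuming the hypothetical rater_id column from the lmertree sketch above:

library(ggplot2)

# Absolute rater-CoGrader disagreement per essay, grouped by rater
ggplot(covdat, aes(x = factor(rater_id), y = disag1)) +
  geom_boxplot() +
  labs(x = "Rater", y = "Absolute difference from CoGrader score")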

Conclusion

We are optimistic about CoGrader’s future and await their version upgrade.