Comparing Automated Essay Scoring Approaches and Human Raters: An Empirical Study on Turkish Foreign Language Learners

EPOD 2024

Authors

Outline

  • Introduction (3 mins.)
  • Method (3 mins.)
  • Preliminary findings (4 mins.)
  • Q&A

Introduction - I

  • Despite ongoing controversy (see Shermis & Wilson, 2024), the use of AES in operational testing is inevitable.

  • Wrestling with AI is drawing more measurement professionals into examining the utility of AES (see Dorsey, Michaels, & Ferrara, 2024).

  • This presentation provides preliminary findings from an empirical study in which we used three approaches to score the same essays: (a) zero-shot prompting with the CoGrader-beta version, (b) zero-shot prompt automation via ChatGPT, and (c) supervised learning via a multilingual BERT model with a bidirectional LSTM.

Method - I

  • We scored 590 Turkish L2 essays. Participants were invited to respond voluntarily to a prompt that required writing an e-mail complaining about a tour. The task had to be completed in 20 minutes and in at least 125 words.
  • Each essay was then scored by a single human rater using a scoring rubric based on the Common European Framework of Reference for Languages.
  • There were 20 raters in total, all trained professionals, and the rubric had substantial validity evidence.
  • The maximum score was 10, and the rubric had six criteria.

Method - II

  • We compared the score distributions from AES and human raters, computed quadratic weighted kappa (QWK), and calculated the proportion of tertium quid cases using a criterion of 2.5 points out of 10.
  • We also examined whether there were subgroups of raters to whom AES assigned closer scores. The dependent variable was the absolute difference between the rater's score and the AES score (lower values indicate better agreement); the independent variables were grading experience and whether the raters, when forced to pick one, considered themselves a dove or a hawk.
  • We wrote computationally reproducible R code using the Metrics and glmertree packages; a sketch of the agreement indices is shown below.
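
The sketch below illustrates the two agreement indices with the Metrics package. The data frame `scores` and its columns `human` and `aes` are hypothetical placeholders, and the simulated integer scores merely stand in for the actual 0-10 ratings.

```r
## Minimal sketch of the agreement indices (QWK and tertium quid proportion).
## `scores`, `human`, and `aes` are hypothetical placeholders; the simulated
## integers stand in for the actual 0-10 ratings.
library(Metrics)

set.seed(1)
scores <- data.frame(
  human = sample(0:10, 590, replace = TRUE),
  aes   = sample(0:10, 590, replace = TRUE)
)

## Quadratic weighted kappa between the human and AES scores
qwk <- ScoreQuadraticWeightedKappa(scores$human, scores$aes,
                                   min.rating = 0, max.rating = 10)

## Proportion of tertium quid cases: absolute disagreement larger than 2.5 points
tertium_quid <- mean(abs(scores$human - scores$aes) > 2.5)

qwk
tertium_quid
```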

Results I

  • The lowest QWK, .60, was obtained with the BERT+LSTM approach; it was .68 with CoGrader-beta and .79 with the ChatGPT prompt automation.

  • CoGrader-beta and ChatGPT 3.5 thus meet the arbitrary minimum threshold of .65.

  • Tertium quid proportions were 22% for the former (CoGrader-beta) and 19% for the latter (ChatGPT), which is promising.

Results II

Results III

  • Overlap was 32.2% vs. 55.9% for the former and the latter, respectively.
  • Treating scores as nested within human raters and computing a disagreement score (AES - Human), the ICC was .20 for the former and .07 for the latter.
  • That is, 20% and 7% of the variance in disagreement was due to rater differences; a sketch of the ICC computation follows below.
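
A minimal sketch of the ICC computation, assuming one disagreement score per essay and the ID of the human rater who scored it; the data frame and column names are hypothetical placeholders and the values are simulated.

```r
## Minimal sketch of the ICC for disagreement scores nested within raters.
## `dat`, `disagreement`, and `rater` are hypothetical placeholders.
library(lme4)

set.seed(2)
dat <- data.frame(
  rater        = factor(rep(1:20, length.out = 590)),
  disagreement = rnorm(590, mean = 0, sd = 1.5)
)

## Intercept-only model with random intercepts for human raters
m0 <- lmer(disagreement ~ 1 + (1 | rater), data = dat)

## ICC = rater variance / (rater variance + residual variance)
vc  <- as.data.frame(VarCorr(m0))
icc <- vc$vcov[vc$grp == "rater"] / sum(vc$vcov)
icc
```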

Results IV

  • Mixed-effects model tree (Fokkema et al., 2018) results for the disagreement scores of the former (a sketch of the model call is given below):
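
A minimal sketch of such a model tree with glmertree's lmertree(), under the assumption that the partitioning variables are grading experience and the self-reported dove/hawk type; all column names are hypothetical and the data are simulated.

```r
## Minimal sketch of a mixed-effects model tree (Fokkema et al., 2018) for the
## disagreement scores. Column names (`disagreement`, `experience`, `self_type`,
## `rater`) are hypothetical placeholders; the data are simulated.
library(glmertree)

set.seed(3)
dat <- data.frame(
  rater        = factor(rep(1:20, length.out = 590)),
  experience   = rep(sample(1:25, 20, replace = TRUE), length.out = 590),
  self_type    = factor(rep(sample(c("dove", "hawk"), 20, replace = TRUE),
                            length.out = 590)),
  disagreement = rnorm(590)
)

## Formula parts: node model | random effects (random intercept per rater) |
## partitioning variables
tree <- lmertree(disagreement ~ 1 | rater | experience + self_type, data = dat)

plot(tree)     # tree structure with node-specific intercepts
summary(tree)
```

An intercept-only node model with a random intercept per rater mirrors the disagreement-score setup; the tree then searches for splits on the rater-level covariates.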

Next?

  • Closer look at disagreement
  • Questions?
  • RStudio - revealjs: Hattab and Allaire (2017)
  • Thanks