Despite ongoing controversy (see Shermis & Wilson, 2024), the use of automated essay scoring (AES) in operational testing is inevitable.
As the field wrestles with AI, more measurement professionals are being drawn to examine the utility of AES (see Dorsey, Michaels, & Ferrara, 2024).
This presentation provides preliminary findings from an empirical study in which we used three approaches to score the same essays: (a) zero-shot prompting with the CoGrader-beta version, (b) zero-shot prompt automation via ChatGPT, and (c) supervised learning via a multilingual BERT model with a bidirectional LSTM.
The lowest quadratic weighted kappa (QWK) was .60, obtained with the BERT+LSTM approach; QWK was .68 with CoGrader-beta and .79 with prompt automation for ChatGPT.
CoGrader-beta and ChatGPT 3.5 meet the arbitrary minimum threshold of .65.
The tertium quid proportions, 22% for the former and 19% for the latter, are promising.