Intro

The Algorithm Grading Your Words

As universities turn to AI detectors to police academic integrity, the tools are not failing evenly. Evidence from human-written essays shows non-native English writers are far more likely to be wrongly flagged as using AI. The result is not just a technical error. It is an academic integrity system that can mistake careful second-language writing for machine-generated text.

RowNarrativeOne

Chart 1

The detector is not neutral

The first warning sign is the size of the gap: human-written TOEFL essays are flagged as AI at more than ten times the rate of essays by native English-speaking students. Both groups wrote their work themselves.

RowChartOne

AfterChartOne

The gap is not marginal. Non-native writers are flagged at more than ten times the rate of native writers, despite both groups producing entirely human-written work. For a student facing an academic misconduct accusation, that asymmetry is not just a statistic. It can affect trust, stress and academic standing.

What makes this finding striking is not just the size of the gap but who sits on the wrong side of it. TOEFL writers are not careless students submitting rushed work. They are writers preparing for a rigorous English proficiency test that many universities require of international applicants. For students like them, a detector flag does not just risk a grade. It can trigger an academic integrity investigation and consequences that may affect their academic record.
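The "more than ten times" comparison is a ratio of two false-positive rates, each computed only over essays known to be human-written. A minimal Python sketch of that calculation follows. The function name, group labels and toy counts are illustrative assumptions, not the study's data or code.

```python
from collections import defaultdict


def false_positive_rates(records):
    """Return the share of human-written essays flagged as AI, per group.

    `records` is an iterable of (group, flagged_as_ai) pairs. Every essay is
    assumed to be human-written, so each flag counts as a false positive.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for group, is_flagged in records:
        total[group] += 1
        flagged[group] += int(is_flagged)
    return {group: flagged[group] / total[group] for group in total}


# Hypothetical toy data, NOT the study's figures:
# 6 of 10 non-native essays flagged, 1 of 20 native essays flagged.
records = (
    [("non_native", True)] * 6
    + [("non_native", False)] * 4
    + [("native", True)] * 1
    + [("native", False)] * 19
)

rates = false_positive_rates(records)
ratio = rates["non_native"] / rates["native"]
print(f"non-native: {rates['non_native']:.1%}, native: {rates['native']:.1%}")
print(f"ratio: {ratio:.1f}x")
```

Because every essay in the input is human-written, no separate ground-truth column is needed: any flag is, by construction, a wrong one.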

RowNarrativeTwo

Chart 2

The bias appears across every tool tested

This is not one unreliable detector. All seven detectors tested flagged non-native essays at a higher false-positive rate than native essays.

RowChartTwo

AfterChartTwo

No single tool is an outlier. Originality.AI flags 75.8% of the human-written non-native essays. GPTZero flags 51.6%. ZeroGPT flags 48.4%. The pattern holds across every detector tested, suggesting the problem is structural rather than the fault of any one product.

RowNarrativeThree

Chart 3

The problem is fixable, but the tools remain deployed

When the same TOEFL essays are rewritten into more native-like English, false positives collapse by 81%. The problem is not inevitable. That makes its persistence harder to justify.

RowChartThree

AfterChartThree

A GPT-4 paraphrase, essentially making the writing sound more native, cuts false positives by 81%. The bias is not a technical inevitability. It is a correctable flaw that has not been corrected. The same essays, written by the same students, are treated very differently depending on how they are phrased.

This is the detail that should give universities pause. If a GPT-4 rewrite can reduce false positives by 81%, then the detector is not catching AI. It is catching writing style. And writing style is not misconduct. Penalising students for how they write, rather than whether they used AI, is not academic integrity. It is something closer to the opposite.
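An "81% cut" is a relative reduction: the drop in the false-positive rate divided by the rate before rewriting. The sketch below shows the arithmetic in Python. The function name and the before/after rates are hypothetical placeholders chosen for round numbers, not figures from the study.

```python
def relative_reduction(rate_before, rate_after):
    """Fraction by which a rate fell, relative to where it started."""
    if rate_before <= 0:
        raise ValueError("rate_before must be positive")
    return (rate_before - rate_after) / rate_before


# Hypothetical rates, NOT the study's figures.
before, after = 0.50, 0.10

print(f"relative reduction: {relative_reduction(before, after):.0%}")
print(f"percentage-point drop: {(before - after) * 100:.0f}")
```

The two printed numbers answer different questions. The relative figure says how much of the original problem went away, while the percentage-point figure says how many fewer essays in every hundred are flagged.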

RowNarrativeFour

Chart 4

The model mistakes careful second-language writing for machine text

AI detectors often treat predictable, low-perplexity text as suspicious, but non-native academic writing can also be more predictable, especially when writers use careful, structured sentences in a second language. The predictability gap between native and non-native authors persists even after controlling for paper quality, which rules out the idea that non-native authors simply write worse papers.

RowChartFour

AfterChartFour

The gap persists whether the paper is highly rated or not. Among higher-rated papers, non-native authors still score lower on text unpredictability than native authors. This rules out the possibility that non-native writers simply produce lower-quality work. The difference is in writing style, not ability.

Perplexity is not a measure of quality or intelligence. It is a measure of how predictable text is to a language model trained predominantly on native English writing. Non-native writers are not writing worse. They are writing differently. The problem is that these tools can treat that difference as suspicious, and institutions may not fully see how unevenly that suspicion is distributed.
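Perplexity is conventionally computed as the exponential of the average negative log-probability a language model assigns to each token. The following Python sketch illustrates that definition only. The per-token probabilities are invented for illustration and do not come from any detector, and real detectors may combine perplexity with other signals.

```python
import math


def perplexity(token_probs):
    """Perplexity of a text, given the model's probability for each token.

    Computed as exp(mean negative log-probability). Lower values mean the
    model found the text more predictable.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Invented probabilities, for illustration only.
predictable_text = [0.5, 0.5, 0.5, 0.5]  # the model guesses each word easily
surprising_text = [0.1, 0.1, 0.1, 0.1]   # the model is often surprised

print(f"predictable text perplexity: {perplexity(predictable_text):.1f}")
print(f"surprising text perplexity: {perplexity(surprising_text):.1f}")
```

Nothing in this calculation looks at who wrote the text or how good it is. It measures only how closely the wording matches what the model expects, which is why careful, conventional phrasing can score as more "machine-like".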

RowNarrativeFive

Chart 5

Regulators and universities are catching up

Institutions and regulators have slowly begun responding, but only after students had already been exposed to unreliable detection systems. Australia’s higher education regulator, TEQSA, has warned against over-reliance on these tools. The long gap after the 2023 bias study is part of the story: the evidence emerged early, but major regulatory warnings arrived much later.

RowChartFive

AfterChartFive

The institutional response has been slow, but it is arriving. The question is how many students faced misconduct proceedings before it did and whether the universities that still rely on these tools have considered the asymmetry in who gets accused.

The timeline reveals something uncomfortable. The research warning universities about this bias was published in April 2023. TEQSA’s first major warning came in November 2024, about nineteen months later. That gap is not just a slow institutional response. It is about nineteen months in which some students may have faced academic integrity scrutiny while major regulatory guidance was still catching up. The question of accountability has not yet been answered.

Closing

AI detection is often presented as an academic integrity safeguard, but if these systems wrongly accuse some students more than others, they do not just detect misconduct. They redistribute suspicion. For universities, the lesson is clear: unreliable AI detection should not be used as decisive evidence against students.

References

Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). GPT detectors are biased against non-native English writers. Patterns, 4(7). https://doi.org/10.1016/j.patter.2023.100779

Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). ChatGPT-Detector-Bias (v1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7893958

TEQSA. (2024). Gen AI strategies for Australian higher education: Emerging practice. Australian Government. https://www.teqsa.gov.au/sites/default/files/2024-11/Gen-AI-strategies-emerging-practice-toolkit.pdf

TEQSA. (2025). Enacting assessment reform in a time of artificial intelligence. Australian Government. https://www.teqsa.gov.au/guides-resources/resources/corporate-publications/enacting-assessment-reform-time-artificial-intelligence

Yi, J. S., Kang, Y. A., Stasko, J. T., & Jacko, J. A. (2007). Toward a deeper understanding of the role of interaction in information visualization. IEEE Transactions on Visualization and Computer Graphics, 13(6), 1224–1231. https://doi.org/10.1109/TVCG.2007.70515

GenAI Acknowledgement: Generative AI tools such as ChatGPT (OpenAI) were used to support research planning, code troubleshooting, wording refinement and review. Topic selection, data validation, chart testing in RStudio, visual design decisions and final submission preparation were completed by the student.