Explainable and Multimodal AI for High-Stakes Educational Assessment

Rabin Thapa

Research Proposal

A PhD proposal submitted to King’s College London (Computer Science)

Author: Rabin Thapa

Affiliation: Data Science | AI | Physics | Education

Published: 05/23/2025 02:32:36 AM +0545

1 Abstract

Recent advances in artificial intelligence (AI), particularly large language models (LLMs) and multimodal systems, present transformative opportunities for educational assessment. This PhD research proposes the development of an explainable, multimodal AI framework for automatically evaluating high-stakes student examinations, such as UK A-levels, which often contain complex textual responses, diagrams, and structured problem-solving steps. The project addresses key limitations of current automated assessment systems—including lack of transparency, inability to process visual content, and misalignment with pedagogical goals—by integrating state-of-the-art natural language processing, computer vision, and interpretability techniques. Developed in collaboration with AQA, the UK’s leading examination board, this research will produce an AI system capable of accurate scoring, personalized feedback generation, and human-understandable explanations of its decision-making process.

The study employs a mixed-methods approach, combining technical development (e.g., fine-tuned LLMs, multimodal fusion architectures) with rigorous educational validation (e.g., expert reviews, fairness audits). Key innovations include novel methods for aligning AI assessments with curriculum standards, quantifying model explainability in educational contexts, and generating actionable feedback to support student learning. By bridging AI and education research, this work aims to establish best practices for trustworthy, pedagogically sound AI assessment systems while contributing open-source tools and datasets to the research community. The outcomes will provide critical insights for policymakers and educators seeking to leverage AI’s potential in high-stakes assessment environments—ensuring both technological advancement and educational equity.

2 Introduction

Advances in artificial intelligence (AI), particularly large language models (LLMs), have revolutionized how we interact with information. From personalized learning environments to intelligent tutoring systems, AI has found numerous applications within the education sector. One particularly promising application is the use of AI for assessing student work, especially in high-stakes examinations such as A-levels in the United Kingdom (Leaton Gray and Kucirkova 2021). These assessments often require students to write detailed responses that may include diagrams, charts, or structured problem-solving steps. The aim of this research is to build an explainable and multimodal AI system that can assess such responses reliably, fairly, and in line with pedagogical goals (Joshi, Walambe, and Kotecha 2021).

This PhD project will form part of a wider EPSRC-funded initiative led by Professor Yulan He, in collaboration with AQA, the UK’s largest examination board. The project seeks to design AI systems that are not only accurate but also transparent and pedagogically sound. This proposal outlines how the student will contribute to this broader goal by developing AI models capable of processing both text and visual information, explaining their reasoning, and generating feedback tailored to students’ learning needs (BAİDOO-ANU and OWUSU ANSAH 2023).

3 Background and Motivation

Traditional automated assessment systems, such as multiple-choice graders, are limited in scope. As Bernard et al. (2025) illustrate, these methods fail to address the complexity of open-ended responses, where students must demonstrate deep understanding, apply knowledge to new situations, or provide structured reasoning. Human marking, while richer, suffers from limited scalability, subjectivity, and time constraints. The challenge lies in building an AI system that combines the breadth and depth of human judgment with the efficiency of automation.

Recent developments in large language models (LLMs), such as GPT-4, have shown promise in understanding and generating human-like text. Some models can now also process images, making it feasible to analyse diagrams and other visual elements in student responses. However, these models are often criticized for being opaque, providing answers without clear justification, and they are not yet tailored for educational contexts (Kasneci et al. 2023). There is a clear need for AI systems that are explainable, aligned with curricular goals, and capable of handling multimodal input.

Khosravi et al. (2022) argue that explainability is particularly important in education. Teachers, students, and examination boards must trust AI decisions, understand the reasoning behind them, and be able to contest them if necessary. The system must not only perform well but also be seen to perform fairly and transparently. In high-stakes exams the consequences of error are significant, which makes reliability and interpretability even more pressing, as required, for example, within the AQA framework in the United Kingdom (Smith and Fey 2000).

4 Aims and Objectives

The main aim of this PhD project is to develop an explainable, multimodal AI system for automated assessment of open-ended student responses within the AQA framework in the UK. The system built through this project is expected to:

Accurately interpret both written and visual elements of student answers.
Provide interpretable explanations for its assessments.
Align with AQA’s marking criteria and broader pedagogical goals.
Offer constructive, personalised feedback to support student learning.
Be robust, fair, and able to operate at scale in real-world settings.

To achieve these aims, the research will be organised into the following objectives:

Investigate the boundaries of LLM knowledge and their application to educational assessment.
Develop techniques to improve model interpretability using methods such as counterfactuals and causal interventions.
Enable multimodal reasoning in LLMs by combining image and text processing capabilities.
Align AI-generated assessments with subject-specific rubrics and cognitive models of learning.
Design and evaluate a framework for personalised feedback generation.
Establish robust metrics and evaluation methods to assess reliability, fairness, and pedagogical value.

5 Research Questions

This research will explore the following key questions:

How can we define and operationalise explainability in the context of AI-based educational assessment?
What are the limits of current LLMs when applied to structured student responses in science subjects?
How can visual content such as diagrams be integrated effectively into AI assessment frameworks?
What methods can ensure alignment between AI-generated feedback and curricular expectations?
How can we evaluate the effectiveness of personalised feedback in improving student outcomes?

6 Methodology

This PhD will adopt a mixed-methods approach combining computational, experimental, and user-centred design techniques. The methodology will include the following phases:

6.0.1 Data Collection and Annotation

The project will begin with the collection of anonymised student responses provided by AQA (Rybinski 2021). These will include scanned handwritten answers, typed submissions, and associated mark schemes. Following Liang et al. (2022), a subset of this data will be annotated with expert feedback and scores to serve as training and evaluation data for the AI models.
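
One way to keep each annotated response together with its expert mark is a small validated record. The sketch below is only an illustration: the class `AnnotatedResponse`, its field names, and the example identifiers and path are hypothetical and do not reflect any actual AQA data format.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedResponse:
    """Hypothetical record for one annotated student response."""
    response_id: str      # anonymised identifier, never the candidate's name
    question_id: str
    image_path: str       # scanned or typed submission
    expert_mark: int
    max_mark: int
    expert_feedback: str = ""

    def __post_init__(self):
        # Reject marks outside the range allowed by the mark scheme
        if not 0 <= self.expert_mark <= self.max_mark:
            raise ValueError("expert_mark must lie between 0 and max_mark")

# Placeholder values for illustration only
record = AnnotatedResponse("r-0001", "q-01", "scans/r-0001.png", expert_mark=4, max_mark=6)
print(record.expert_mark / record.max_mark)
```

Validating the mark at construction time means a mistyped annotation fails early rather than silently entering the training data.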

6.0.2 Model Development

The core AI model will be based on a fine-tuned large language model with visual understanding capabilities. This model will be trained to process both text and diagrams. Key tasks include:

Preprocessing scanned student responses to extract both text (via OCR) and diagrams (via image segmentation)
Encoding multimodal inputs into a unified format
Fine-tuning the model to perform both classification (e.g., scoring) and generation (e.g., feedback)
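
The second task above, encoding multimodal inputs into a unified format, can be sketched as a simple late-fusion step: each modality's embedding is L2-normalised and the two are concatenated into one feature vector for a downstream scoring head. This is a minimal illustration under assumed names (`l2_normalise` and `fuse_embeddings` are hypothetical helpers, and the toy vectors stand in for real model embeddings); the fusion architecture eventually used in the project may well differ.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse_embeddings(text_emb, image_emb):
    """Late fusion by concatenation: normalise each modality, then join them."""
    return l2_normalise(text_emb) + l2_normalise(image_emb)

# Toy 3-dimensional embeddings standing in for real text and image features
fused = fuse_embeddings([3.0, 4.0, 0.0], [0.0, 0.0, 2.0])
print(len(fused))  # 6: text dimensions followed by image dimensions
```

Normalising before concatenation keeps one modality from dominating the fused vector purely because its embedding has a larger magnitude.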

We begin by importing the tools required for model development.

Code

import pytesseract
import cv2
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import torch
from transformers import CLIPProcessor, CLIPModel, AutoTokenizer, AutoModelForSequenceClassification
import shap
import lime
from lime.lime_text import LimeTextExplainer
from sklearn.metrics import cohen_kappa_score
import openai
import ipywidgets as widgets
from IPython.display import display
import os

6.0.3 Explainability and Interpretability

To ensure transparency, the system will incorporate explainability techniques such as:

Feature attribution (SHAP or LIME) to highlight key parts of the answer influencing scores (Linardatos, Papastefanopoulos, and Kotsiantis 2020).
Counterfactual reasoning to explore how different answers might receive different marks.
Visual overlays showing which words or image regions contributed to model decisions.
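
The idea behind feature attribution and counterfactual probing can be illustrated with a leave-one-out test: remove one word at a time and record how much the score drops. The sketch below is a deliberately simplified stand-in, not SHAP or LIME themselves; `keyword_score` is a toy scorer and `rubric_terms` is a hypothetical rubric, both assumed here only for illustration.

```python
def keyword_score(text, rubric_terms):
    """Toy stand-in for a marking model: counts rubric terms present in the answer."""
    tokens = {t.lower().strip(".,()") for t in text.split()}
    return sum(1 for term in rubric_terms if term in tokens)

def leave_one_out_attribution(text, score_fn):
    """Attribute the score to each word as the drop in score when that word is removed."""
    words = text.split()
    base = score_fn(" ".join(words))
    attributions = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        attributions.append((word, base - score_fn(reduced)))
    return attributions

rubric_terms = {"arduino", "sensor", "servo"}  # hypothetical rubric terms
answer = "Servo Motor Jumpers Ultrasonic sensor Arduino UNO"
scores = leave_one_out_attribution(answer, lambda t: keyword_score(t, rubric_terms))
for word, contribution in scores:
    print(f"{word}: {contribution}")
```

Each removal is a small counterfactual ("what mark would this answer get without this word?"), and the same loop would work with any scoring function in place of the toy one.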

The next step is to import the scan or image of the student’s answer. For this prototype, we used an image of a student’s answer to the question, “How can we design Smart bamboo dustbins using Arduino and sensor.”

Code

# Path to the Tesseract binary (Windows install location used for this prototype)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def preprocess_image(image_path):
    # Load the scanned student response
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Binarise with a fixed threshold (inverted: ink becomes white on black)
    _, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY_INV)
    return thresh, img

def extract_text_from_image(image_path):
    # Run Tesseract OCR on the preprocessed image
    thresh, orig_img = preprocess_image(image_path)
    # Invert back to dark text on a light background, which Tesseract generally expects
    inverted = cv2.bitwise_not(thresh)
    text = pytesseract.image_to_string(inverted, lang='eng')
    return text, orig_img, thresh

# Example usage with a sample student answer
sample_img_path = r'C:\Users\Dell\OneDrive\Pictures\Screenshots\Screenshot 2023-10-05 152941.png'
text, orig_img, thresh_img = extract_text_from_image(sample_img_path)

print("Extracted Text:\n", text)

# visualizing the original and processed images
fig, axs = plt.subplots(1, 2, figsize=(12,6))
axs[0].imshow(cv2.cvtColor(orig_img, cv2.COLOR_BGR2RGB))
axs[0].set_title("Original Image")
axs[1].imshow(thresh_img, cmap='gray')
axs[1].set_title("Preprocessed for OCR")
plt.show()

Extracted Text:
 Servo Motor

Jumpers

Ultrasonic sensor

Arduino UNO

Power Source(50000 MaH)

Fig.1.1. Smart bamboo Dustbin

A multimodal machine learning pipeline is then applied. It uses CLIP (Contrastive Language-Image Pretraining) from OpenAI to embed text and images in a shared vector space, because the assessment task requires comparing text and images together and CLIP offers a pretrained way to do so with minimal setup (Liu et al. 2025).

Code

device = "cuda" if torch.cuda.is_available() else "cpu"

clip_model_name = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(clip_model_name)
model = CLIPModel.from_pretrained(clip_model_name).to(device)

def encode_multimodal(text, image_path):
    image = Image.open(image_path).convert("RGB")
    # Truncate long OCR text to CLIP's maximum text length
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True, truncation=True).to(device)
    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(**inputs)
    # Projected text and image embeddings in CLIP's shared space
    text_emb = outputs.text_embeds.cpu().numpy()
    image_emb = outputs.image_embeds.cpu().numpy()
    return text_emb, image_emb

# Example
text_emb, image_emb = encode_multimodal(text, sample_img_path)
print("Text embedding shape:", text_emb.shape)
print("Image embedding shape:", image_emb.shape)

Text embedding shape: (1, 512)
Image embedding shape: (1, 512)
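
Because both embeddings live in the same 512-dimensional space, the usual way to compare them is cosine similarity. The helper below is a minimal sketch; `cosine_similarity` is a name introduced here for illustration, and the toy vectors stand in for the real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors standing in for the (1, 512) CLIP embeddings
text_vec = [1.0, 0.0, 1.0]
image_vec = [1.0, 1.0, 0.0]
print(round(cosine_similarity(text_vec, image_vec), 2))  # 0.5
```

With the arrays returned by `encode_multimodal`, the same function could be called as `cosine_similarity(text_emb[0], image_emb[0])`.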

6.0.4 Personalized Feedback

The final system will generate tailored feedback for students, including suggestions for improvement and recognition of strengths. The feedback will be grounded in student profiles, previous performance, and learning trajectories.
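
Grounding feedback in a student profile could be as simple as adding a short history summary to the prompt sent to the language model. The sketch below assumes a hypothetical `StudentProfile` record and a hypothetical `build_feedback_prompt` helper; the fields shown are illustrative, not a specification of what the final system will store.

```python
from dataclasses import dataclass, field

@dataclass
class StudentProfile:
    """Hypothetical summary of a learner's history used to tailor feedback."""
    student_id: str
    previous_scores: list = field(default_factory=list)
    recurring_weaknesses: list = field(default_factory=list)

def build_feedback_prompt(answer, rubric, profile):
    """Assemble a prompt that grounds feedback in the rubric and the learner's history."""
    if profile.previous_scores:
        avg = sum(profile.previous_scores) / len(profile.previous_scores)
        history = f"Average of previous scores: {avg:.1f}."
    else:
        history = "No previous scores available."
    weaknesses = ", ".join(profile.recurring_weaknesses) or "none recorded"
    return (
        f"Rubric:\n{rubric}\n\n"
        f"Student answer:\n{answer}\n\n"
        f"{history} Recurring weaknesses: {weaknesses}.\n"
        "Give feedback that names one strength and one specific next step."
    )

# Placeholder profile and rubric for illustration only
profile = StudentProfile("S001", previous_scores=[6, 7], recurring_weaknesses=["explaining mechanisms"])
prompt = build_feedback_prompt("Servo Motor, Ultrasonic sensor", "Explain each component.", profile)
print(prompt)
```

Keeping prompt construction separate from the API call makes the personalisation logic testable without any network access.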

Next, the extracted answer is passed to a language model to generate feedback; in the full system, this step will compare the answer against the AQA marking criteria:

Code

import cv2
import pytesseract
import matplotlib.pyplot as plt
from openai import OpenAI
import os

# Sample application only: read the API key from the environment rather than hard-coding it
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Tesseract OCR path setup
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def preprocess_image(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY_INV)
    return thresh, img

def extract_text_from_image(image_path):
    thresh, orig_img = preprocess_image(image_path)
    inverted = cv2.bitwise_not(thresh)
    text = pytesseract.image_to_string(inverted, lang='eng')
    return text.strip(), orig_img, thresh

def generate_feedback(text):
    response = client.chat.completions.create(
        model="gpt-4",  # or use "gpt-3.5-turbo"
        messages=[
            {"role": "system", "content": "You are an educational expert. Provide constructive feedback on student answers."},
            {"role": "user", "content": f"Here is a student answer: \n\n{text}\n\nPlease give helpful feedback."}
        ]
    )
    return response.choices[0].message.content

# path to the image of the student answer
image_path = r'C:\Users\Dell\OneDrive\Pictures\Screenshots\Screenshot 2023-10-05 152941.png'

# extraction of the text
text, original_img, thresh_img = extract_text_from_image(image_path)
print("Extracted Student Answer:\n", text)

# generating the feedback (alignment with the AQA framework is hypothetical at this stage)
feedback = generate_feedback(text)
print("\nAI Feedback:\n", feedback)

# visualization of the images
fig, axs = plt.subplots(1, 2, figsize=(12, 6))
axs[0].imshow(cv2.cvtColor(original_img, cv2.COLOR_BGR2RGB))
axs[0].set_title("Original Image")
axs[0].axis("off")

axs[1].imshow(thresh_img, cmap='gray')
axs[1].set_title("Preprocessed Image for OCR")
axs[1].axis("off")

plt.tight_layout()
plt.show()

Extracted Student Answer:
 Servo Motor

Jumpers

Ultrasonic sensor

Arduino UNO

Power Source(50000 MaH)

Fig.1.1. Smart bamboo Dustbin

AI Feedback:
 While your list of components for the "Smart Bamboo Dustbin" gives a good understanding of the mechanical side, your answer could benefit from additional explanation and detail. For example, how do each of these components function in relation to the overall operation of the dustbin? It would also be beneficial if you could provide context on the usage of Fig.1.1. Smart bamboo Dustbin. Is it a diagram, a photograph, or an illustration? What does it depict? Always remember, clarity and detail are key when answering such technical questions.

Additionally, the answer is checked for key words and ideas related to the topic or the AQA marking criteria.

Code

def visualize_explanation_on_text(text, important_words):
    words = text.split()
    highlighted_text = ""
    for w in words:
        if w.lower().strip('.,') in important_words:
            highlighted_text += f"\033[93m{w}\033[0m "  # Yellow highlight in console
        else:
            highlighted_text += w + " "
    print(highlighted_text)

# Demo important words (from explainability or rubric)
important_words = ['force', 'acceleration', 'diagram']
visualize_explanation_on_text(text, important_words)

Servo Motor Jumpers Ultrasonic sensor Arduino UNO Power Source(50000 MaH) Fig.1.1. Smart bamboo Dustbin
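
Beyond highlighting, the same idea can be turned into a simple coverage measure: the fraction of rubric terms that appear in the answer, together with the terms that are missing. The sketch below uses plain substring matching, which is crude, and the term list is a hypothetical rubric chosen only to illustrate the idea.

```python
def rubric_coverage(text, rubric_terms):
    """Return the fraction of rubric terms found in the answer and the missing terms."""
    lowered = text.lower()
    found = [term for term in rubric_terms if term.lower() in lowered]
    missing = [term for term in rubric_terms if term.lower() not in lowered]
    coverage = len(found) / len(rubric_terms) if rubric_terms else 0.0
    return coverage, missing

# Hypothetical rubric terms for the dustbin question
terms = ["arduino", "ultrasonic sensor", "servo motor", "working mechanism"]
answer = "Servo Motor Jumpers Ultrasonic sensor Arduino UNO Power Source(50000 MaH)"
coverage, missing = rubric_coverage(answer, terms)
print(coverage, missing)  # 0.75 ['working mechanism']
```

The list of missing terms gives the feedback step something concrete to point the student towards.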

6.0.5 Evaluation

Evaluation will be multifaceted and include:

Accuracy and agreement with human markers (using metrics like Cohen’s kappa) (Figueroa, Ghosh, and Aragon 2023).
Qualitative user studies with teachers and students to assess feedback quality.
Fairness audits to detect and mitigate potential bias.
Stress-testing the model under adversarial or unusual inputs.
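
A first-pass fairness audit could compare how far the AI's marks deviate from human marks across groups of candidates. The sketch below computes the mean signed difference per group; the function name `mean_error_by_group`, the scores, and the group labels are all illustrative placeholders, and a real audit would need proper statistical testing on real data.

```python
from collections import defaultdict

def mean_error_by_group(human_scores, ai_scores, groups):
    """Mean signed difference (AI minus human) for each group of candidates."""
    diffs = defaultdict(list)
    for human, ai, group in zip(human_scores, ai_scores, groups):
        diffs[group].append(ai - human)
    return {group: sum(d) / len(d) for group, d in diffs.items()}

# Illustrative data only; group labels are placeholders, not real candidate attributes
human = [7, 8, 6, 5, 9, 4]
ai = [7, 8, 6, 4, 8, 4]
groups = ["A", "A", "A", "B", "B", "B"]
result = mean_error_by_group(human, ai, groups)
print(result)
```

A mean difference close to zero for one group and clearly negative for another, as in this toy data, would flag that the model may be systematically under-marking the second group.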

Code

# === Imports ===
import pytesseract
import cv2
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import torch
from transformers import CLIPProcessor, CLIPModel
from sklearn.metrics import cohen_kappa_score
from lime.lime_text import LimeTextExplainer
import openai
import ipywidgets as widgets
from IPython.display import display
import os

# === Configurations ===
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
device = "cuda" if torch.cuda.is_available() else "cpu"

clip_model_name = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(clip_model_name)
clip_model = CLIPModel.from_pretrained(clip_model_name).to(device)

# OpenAI API client (key is read from the environment, never hard-coded)
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# === Functions ===
def preprocess_image(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY_INV)
    return thresh, img

def extract_text_from_image(image_path):
    thresh, orig_img = preprocess_image(image_path)
    inverted = cv2.bitwise_not(thresh)
    text = pytesseract.image_to_string(inverted, lang='eng')
    return text.strip(), orig_img, thresh

def encode_multimodal(text, image_path):
    image = Image.open(image_path)
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True).to(device)
    outputs = clip_model(**inputs)
    text_emb = outputs.text_embeds.detach().cpu().numpy()
    image_emb = outputs.image_embeds.detach().cpu().numpy()
    return text_emb, image_emb

def generate_feedback(student_answer, score, rubric):
    prompt = f"""
You are an educational assistant. A student answered:\n{student_answer}\n
The score given is {score} based on this rubric:\n{rubric}\n
Provide constructive, personalized feedback highlighting strengths and areas for improvement.
"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful educational assistant."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=150,
        temperature=0.7
    )
    return response.choices[0].message.content.strip()

def visualize_explanation_on_text(text, important_words):
    words = text.split()
    highlighted_text = ""
    for w in words:
        if w.lower().strip('.,') in important_words:
            highlighted_text += f"\033[93m{w}\033[0m "  # Yellow highlight
        else:
            highlighted_text += w + " "
    print(highlighted_text)

# === Step 1: OCR ===
sample_img_path = r"C:\Users\Dell\OneDrive\Pictures\Screenshots\Screenshot 2023-10-05 152941.png"
text, orig_img, thresh_img = extract_text_from_image(sample_img_path)
print("Extracted Text:\n", text)

# Show images
fig, axs = plt.subplots(1, 2, figsize=(12,6))
axs[0].imshow(cv2.cvtColor(orig_img, cv2.COLOR_BGR2RGB))
axs[0].set_title("Original Image")
axs[1].imshow(thresh_img, cmap='gray')
axs[1].set_title("Preprocessed for OCR")
plt.show()

# === Step 2: CLIP Embeddings ===
text_emb, image_emb = encode_multimodal(text, sample_img_path)
print("Text embedding shape:", text_emb.shape)
print("Image embedding shape:", image_emb.shape)

# === Step 3: AI Feedback Generation ===
rubric_text = """
The answer should clearly explain the components used (e.g., Arduino, ultrasonic sensor, servo motor), the working mechanism, power source, and any coding involved. Clarity, completeness, and correct terminology are essential.
"""
score = 8
feedback = generate_feedback(text, score, rubric_text)
print("\nGenerated Feedback:\n", feedback)

# === Step 4: Evaluation (Cohen's Kappa) ===
human_scores = [7, 8, 6, 5, 9]
ai_scores =    [7, 7, 6, 4, 9]
kappa = cohen_kappa_score(human_scores, ai_scores)
print(f"\nCohen's Kappa score between human and AI: {kappa:.2f}")

# === Step 5: Word Highlighting ===
important_words = ['force', 'acceleration', 'diagram']
visualize_explanation_on_text(text, important_words)

# === Step 6: Interactive UI for Feedback ===
def on_button_clicked(b):
    answer = answer_text.value
    score = int(score_text.value)
    feedback = generate_feedback(answer, score, rubric_text)
    feedback_out.value = feedback

answer_text = widgets.Textarea(value=text, description='Student Answer:', layout=widgets.Layout(width='600px', height='150px'))
score_text = widgets.BoundedIntText(value=7, min=0, max=10, description='Score:')
button = widgets.Button(description="Generate Feedback")
feedback_out = widgets.Textarea(value="", description='Feedback:', layout=widgets.Layout(width='600px', height='150px'))

button.on_click(on_button_clicked)
display(answer_text, score_text, button, feedback_out)

Extracted Text:
 Servo Motor

Jumpers

Ultrasonic sensor

Arduino UNO

Power Source(50000 MaH)

Fig.1.1. Smart bamboo Dustbin

Text embedding shape: (1, 512)
Image embedding shape: (1, 512)

Generated Feedback:
 Great job on listing the components used in the smart bamboo dustbin project - Servo Motor, Jumpers, Ultrasonic sensor, Arduino UNO, and a Power Source (50000 mAh). You have covered a good range of components that are essential for this type of project.

Strengths:
1. You have correctly identified important components used in the project.
2. The list is well-organized and easy to follow.
3. The mention of the Power Source shows an understanding of the project's power requirements.

Areas for improvement:
1. It would be helpful to provide a brief explanation of how each component contributes to the functionality of the smart bamboo dustbin. For example, you could mention that the servo motor controls the lid of the dust

Cohen's Kappa score between human and AI: 0.52
Servo Motor Jumpers Ultrasonic sensor Arduino UNO Power Source(50000 MaH) Fig.1.1. Smart bamboo Dustbin

6.0.6 Educational Alignment

To ensure pedagogical validity, the model will be aligned with AQA’s official marking rubrics and cognitive theories such as Bloom’s taxonomy. The project will work closely with domain experts and educators to validate that AI outputs reflect genuine learning objectives.

Fig 6.1. Generative feedback as per the AQA framework

To measure agreement in grading, Cohen’s kappa quantifies the level of agreement between two raters (here, a human marker and an AI model) beyond what would be expected by chance (Kolesnyk and Khairova 2022). In the full project, this will be computed on real data, comparing genuine human marks with the AI’s marks for the same answers.

Code

# Sample human vs AI scores
human_scores = [7, 8, 6, 5, 9]
ai_scores =    [7, 7, 6, 4, 9]

kappa = cohen_kappa_score(human_scores, ai_scores)
print(f"Cohen's Kappa score between human and AI: {kappa:.2f}")

Cohen's Kappa score between human and AI: 0.52
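
Unweighted kappa treats marks as unordered categories, so a one-mark disagreement counts the same as a five-mark one. For ordinal marks, a quadratically weighted kappa is a common alternative; scikit-learn exposes it as `cohen_kappa_score(..., weights="quadratic")`. The dependency-free sketch below is offered as an illustration of the calculation rather than a replacement for the library call; the function name `cohen_kappa` is introduced here.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b, weights=None):
    """Cohen's kappa for two raters; weights=None (nominal) or "quadratic" (ordinal)."""
    labels = sorted(set(rater_a) | set(rater_b))
    index = {label: i for i, label in enumerate(labels)}
    n = len(rater_a)
    count_a, count_b = Counter(rater_a), Counter(rater_b)

    def penalty(i, j):
        if weights == "quadratic":
            return (i - j) ** 2
        return 0 if i == j else 1

    # Observed disagreement, averaged over the rated items
    observed = sum(penalty(index[a], index[b]) for a, b in zip(rater_a, rater_b)) / n
    # Disagreement expected by chance from the two raters' marginal distributions
    expected = sum(
        penalty(index[a], index[b]) * count_a[a] * count_b[b]
        for a in labels for b in labels
    ) / (n * n)
    if expected == 0:
        return float("nan")  # undefined when chance disagreement is zero
    return 1.0 - observed / expected

human_scores = [7, 8, 6, 5, 9]
ai_scores = [7, 7, 6, 4, 9]
print(f"{cohen_kappa(human_scores, ai_scores):.2f}")               # 0.52
print(f"{cohen_kappa(human_scores, ai_scores, 'quadratic'):.2f}")  # 0.92
```

On the sample scores, both disagreements are by a single mark, which is why the weighted value is much higher than the unweighted one.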

7 Ethical Considerations

Given the high-stakes nature of educational assessment, ethical considerations will be paramount. All data will be handled in accordance with GDPR and institutional ethical guidelines (Murchan and Siddiq 2021). Bias detection and mitigation will be an ongoing priority, and transparency will be maintained throughout the development and deployment process. Students and educators will be given clear explanations of how the AI system works, and a mechanism for appeal or review will be considered in the final design.

8 Expected Contributions

This research will make the following contributions to both academia and practice:

A novel AI framework capable of multimodal, explainable assessment of student work.
Methods for aligning AI-generated assessments with educational goals.
Empirical studies demonstrating the effectiveness and limitations of such systems.
Open-source tools and datasets to support future research.
Practical insights for examination boards on deploying AI in high-stakes settings.

9 Timeline

As illustrated in Fig 9.1, this PhD project is planned over four years, starting in 2025. The first year focuses on preparing the project proposal and reviewing existing research. In the second year, the main work includes getting ethical approval, collecting data, and starting data management. The third year is dedicated to developing and testing the model, including selecting the right methods and improving the model’s performance. In the final year, the focus shifts to writing and reviewing the dissertation, preparing for submission and defence, and sharing the research through presentations or publications. The timeline is designed to keep the project on track from start to finish.

Fig 9.1. Timeline for the research project

10 Budget

The estimated budget for this PhD research project is approximately £7,470 over four years. This includes essential items such as a high-performance laptop, electronic components like sensors and microcontrollers, and cloud computing services. Additional costs cover software tools, API subscriptions, data storage, and participation in conferences. Funds are also set aside for unexpected expenses and the final printing and binding of the dissertation. This budget ensures that all necessary resources are available to successfully carry out the research and present the findings.

Code

import pandas as pd
from IPython.display import display, Markdown

# budget line items, one row per cost
data = [
    ["Hardware & Devices", "High-performance laptop/workstation", "£2,000", "One-time", "Year 1 (2025)", "£2,000"],
    ["", "Arduino, sensors, microcontrollers", "£500", "One-time", "Year 2 (2026)", "£500"],
    ["Software & Licences", "Cloud computing credits (AWS/GCP)", "£300/year", "Annual (4 years)", "2025-2028", "£1,200"],
    ["", "OCR plugins/utilities (Tesseract add-ons)", "£100", "One-time", "Year 1 (2025)", "£100"],
    ["APIs & Subscriptions", "OpenAI API usage (GPT for feedback)", "£20/month", "Monthly (3 years)", "2026-2028", "£720"],
    ["Data & Storage", "Cloud/external data storage", "£100/year", "Annual (4 years)", "2025-2028", "£400"],
    ["Conferences & Publications", "Conference registration, travel, paper fees", "£800/year", "Annual (2 years)", "2027-2028", "£1,600"],
    ["Miscellaneous & Contingency", "Unexpected repairs, tools, components", "£200/year", "Annual (4 years)", "2025-2028", "£800"],
    ["Printing & Binding", "Dissertation printing and binding", "£150", "One-time", "Year 4 (2028)", "£150"],
    ["TOTAL", "", "", "", "", "£7,470"]
]

# DataFrame creation
columns = ["Category", "Item Description", "Unit Cost (GBP)", "Frequency", "Year(s)", "Total Cost (GBP)"]
df = pd.DataFrame(data, columns=columns)

# visualizing with the markdown
display(Markdown("PhD Project Budget Table (2025-2028)"))
display(df.style
       .set_properties(**{'text-align': 'left'})
       .set_table_styles([{
           'selector': 'th',
           'props': [('text-align', 'left')]
       }])
       .hide(axis='index')
       .format(precision=0))

PhD Project Budget Table (2025-2028)

Category	Item Description	Unit Cost (GBP)	Frequency	Year(s)	Total Cost (GBP)
Hardware & Devices	High-performance laptop/workstation	£2,000	One-time	Year 1 (2025)	£2,000
	Arduino, sensors, microcontrollers	£500	One-time	Year 2 (2026)	£500
Software & Licences	Cloud computing credits (AWS/GCP)	£300/year	Annual (4 years)	2025-2028	£1,200
	OCR plugins/utilities (Tesseract add-ons)	£100	One-time	Year 1 (2025)	£100
APIs & Subscriptions	OpenAI API usage (GPT for feedback)	£20/month	Monthly (3 years)	2026-2028	£720
Data & Storage	Cloud/external data storage	£100/year	Annual (4 years)	2025-2028	£400
Conferences & Publications	Conference registration, travel, paper fees	£800/year	Annual (2 years)	2027-2028	£1,600
Miscellaneous & Contingency	Unexpected repairs, tools, components	£200/year	Annual (4 years)	2025-2028	£800
Printing & Binding	Dissertation printing and binding	£150	One-time	Year 4 (2028)	£150
TOTAL					£7,470
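
As a quick check on the table, the total can be recomputed from the unit costs and frequencies rather than typed in by hand. The short labels below are abbreviations introduced for this sketch; the amounts are the ones listed in the table.

```python
# Line items as (unit cost in GBP, number of units), taken from the budget table
items = {
    "Laptop/workstation": (2000, 1),
    "Arduino, sensors, microcontrollers": (500, 1),
    "Cloud computing credits": (300, 4),       # per year, 4 years
    "OCR plugins/utilities": (100, 1),
    "OpenAI API usage": (20, 36),              # per month, 3 years
    "Data storage": (100, 4),                  # per year, 4 years
    "Conferences and publications": (800, 2),  # per year, 2 years
    "Contingency": (200, 4),                   # per year, 4 years
    "Printing and binding": (150, 1),
}
total = sum(unit_cost * units for unit_cost, units in items.values())
print(f"£{total:,}")  # £7,470
```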

11 Supervision and Environment

This PhD will be supervised by Professor Yulan He at King’s College London, a leading researcher in natural language processing, explainable AI, and machine learning. The student will also collaborate with AQA and benefit from interdisciplinary expertise in education, cognitive science, and AI ethics. The Faculty of Natural, Mathematical & Engineering Sciences offers a vibrant research environment, access to state-of-the-art facilities, and regular opportunities for training and professional development.

Once the data is combined from 16 yearly data sets for each index, it is normalized and cleaned as per the model (Nayak, Misra, and Behera 2012).

12 Conclusion

This PhD project aims to revolutionize high-stakes educational assessment by developing an explainable, multimodal AI system that evaluates student responses, both text and visuals, with fairness, accuracy, and pedagogical alignment. Leveraging large language models (LLMs), computer vision, and interpretability techniques (SHAP, LIME), the research addresses critical gaps in automated assessment, including transparency, multimodal reasoning, and alignment with institutional rubrics. By collaborating with AQA, the UK’s largest examination board, the project ensures real-world applicability while advancing AI’s role in education. The system’s ability to generate personalized, constructive feedback further bridges the gap between AI efficiency and human-like educational support, offering scalable solutions without compromising quality or equity.

Beyond technical innovation, this work contributes to the responsible adoption of AI in education by prioritizing ethical considerations, bias mitigation, and stakeholder trust. The research outcomes, including open-source tools, empirical validations, and frameworks for explainable AI assessment, will empower educators, policymakers, and technologists to deploy AI systems that enhance learning outcomes while maintaining rigor and fairness. By uniting cutting-edge AI with educational expertise, this project lays the foundation for a future where technology and pedagogy coexist to create more accessible, transparent, and effective assessment systems worldwide.

13 Bibliography

BAİDOO-ANU, David, and Leticia OWUSU ANSAH. 2023. “Education in the Era of Generative Artificial Intelligence (AI): Understanding the Potential Benefits of ChatGPT in Promoting Teaching and Learning.” Journal of AI 7 (1): 52–62. https://doi.org/10.61969/jai.1337500.

Bernard, Raymond, Shaina Raza, Subhabrata Das, and Rahul Murugan. 2025. “EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # V1.0.0-Beta.” https://doi.org/10.48550/ARXIV.2501.00257.

Figueroa, Andrea, Sourojit Ghosh, and Cecilia Aragon. 2023. “Generalized Cohen’s Kappa: A Novel Inter-Rater Reliability Metric for Non-Mutually Exclusive Categories.” In, 19–34. Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-35132-7_2.

Joshi, Gargi, Rahee Walambe, and Ketan Kotecha. 2021. “A Review on Explainability in Multimodal Deep Neural Nets.” IEEE Access 9: 59800–59821. https://doi.org/10.1109/access.2021.3070212.

Kasneci, Enkelejda, Kathrin Sessler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, et al. 2023. “ChatGPT for Good? On Opportunities and Challenges of Large Language Models for Education.” Learning and Individual Differences 103 (April): 102274. https://doi.org/10.1016/j.lindif.2023.102274.

Khosravi, Hassan, Simon Buckingham Shum, Guanliang Chen, Cristina Conati, Yi-Shan Tsai, Judy Kay, Simon Knight, Roberto Martinez-Maldonado, Shazia Sadiq, and Dragan Gašević. 2022. “Explainable Artificial Intelligence in Education.” Computers and Education: Artificial Intelligence 3: 100074. https://doi.org/10.1016/j.caeai.2022.100074.

Kolesnyk, A. S., and N. F. Khairova. 2022. “Justification for the Use of Cohen’s Kappa Statistic in Experimental Studies of NLP and Text Mining.” Cybernetics and Systems Analysis 58 (2): 280–88. https://doi.org/10.1007/s10559-022-00460-3.

Leaton Gray, Sandra, and Natalia Kucirkova. 2021. “AI and the Human in Education: Editorial.” London Review of Education 19 (1). https://doi.org/10.14324/lre.19.1.10.

Liang, Weixin, Girmaw Abebe Tadesse, Daniel Ho, L. Fei-Fei, Matei Zaharia, Ce Zhang, and James Zou. 2022. “Author Correction: Advances, Challenges and Opportunities in Creating Data for Trustworthy AI.” Nature Machine Intelligence 4 (10): 904–4. https://doi.org/10.1038/s42256-022-00548-7.

Linardatos, Pantelis, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. 2020. “Explainable AI: A Review of Machine Learning Interpretability Methods.” Entropy 23 (1): 18. https://doi.org/10.3390/e23010018.

Liu, Yufei, Hua Cheng, Yiquan Fang, Yiming Pan, Zehong Qian, and Xiaoning Chen. 2025. “Multimodal Coupling Prompt Learning for Image Classification Tasks.” The European Journal on Artificial Intelligence, May. https://doi.org/10.1177/30504554251335569.

Murchan, Damian, and Fazilat Siddiq. 2021. “A Call to Action: A Systematic Review of Ethical and Regulatory Issues in Using Process Data in Educational Assessment.” Large-Scale Assessments in Education 9 (1). https://doi.org/10.1186/s40536-021-00115-3.

Nayak, S. C., B. B. Misra, and H. S. Behera. 2012. “Evaluation of Normalization Methods on Neuro-Genetic Models for Stock Index Forecasting.” 2012 World Congress on Information and Communication Technologies, October, 602–7. https://doi.org/10.1109/wict.2012.6409147.

Rybinski, Krzysztof. 2021. “Assessing How QAA Accreditation Reflects Student Experience.” Higher Education Research & Development 41 (3): 898–918. https://doi.org/10.1080/07294360.2021.1872058.

Smith, Mary Lee, and Patricia Fey. 2000. “Validity and Accountability in High-Stakes Testing.” Journal of Teacher Education 51 (5): 334–44. https://doi.org/10.1177/0022487100051005002.

Citation

BibTeX citation:

@online{thapa2025,
  author = {Thapa, Rabin},
  title = {Explainable and {Multimodal} {AI} for {High-Stakes}
    {Educational} {Assessment}},
  date = {2025-05-23},
  url = {https://www.researchgate.net/profile/Rabin-Thapa-8},
  langid = {en}
}

For attribution, please cite this work as:

Thapa, Rabin. 2025. “Explainable and Multimodal AI for High-Stakes Educational Assessment.” May 23, 2025. https://www.researchgate.net/profile/Rabin-Thapa-8.