EIPM Strand 2 Inter-Rater Reliability Analysis

Technical Analysis with Interpretations

Author

Dr. Lucas Sempé

Published

June 23, 2025

Executive Summary

Key Findings

Dataset Overview:

89 applications evaluated by 11 reviewers
Christine Kelly assigned as Reviewer_3 in 93.3% of applications
Systematic reviewer differences found in 5/5 criteria

Reliability Summary:

ICC Range: -1.286 to 0.75
Acceptable reliability: 11/39 valid measurements

⚠️ Major Concern: Only 11 of 39 valid measurements (28%) show acceptable reliability

Reviewer Assignment Issues

Christine Kelly appears as Reviewer_3 in 93.3% of applications, so the Reviewer_3 position is almost entirely confounded with a single person. This creates systematic bias and makes it impossible to separate individual reviewer effects from position effects.
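The concentration described above can be checked with a short tally. The following is a minimal sketch, not the report's own code: the record layout (application id, position label, reviewer name), the function name `position_share`, and the sample rows are all hypothetical. For reference, 93.3% of 89 applications corresponds to 83 applications.

```python
from collections import Counter

def position_share(assignments, position):
    """Return {reviewer: share of applications in which they hold `position`}.

    `assignments` is an iterable of (application_id, position, reviewer) tuples.
    This layout is an assumption; the report does not describe its data format.
    """
    # One holder per application for the requested position.
    holders = {app: name for app, pos, name in assignments if pos == position}
    if not holders:
        return {}
    counts = Counter(holders.values())
    return {name: n / len(holders) for name, n in counts.items()}

# Hypothetical records: reviewer "C" holds Reviewer_3 in 2 of 3 applications.
sample = [
    (1, "Reviewer_1", "A"), (1, "Reviewer_2", "B"), (1, "Reviewer_3", "C"),
    (2, "Reviewer_1", "B"), (2, "Reviewer_2", "D"), (2, "Reviewer_3", "C"),
    (3, "Reviewer_1", "A"), (3, "Reviewer_2", "C"), (3, "Reviewer_3", "D"),
]
shares = position_share(sample, "Reviewer_3")

# Share reported in this document: 83 of 89 applications.
reported_share = round(83 / 89 * 100, 1)
```

A share close to 1.0 for any single reviewer means that position and person vary together, so a model cannot tell a "Reviewer_3 effect" from that reviewer's personal scoring tendency.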

Individual Reviewer Analysis

Reviewer Patterns

Most Active Reviewer: Christine Kelly (380 applications)

Highest Average Scores: Paul Jackson (4.2 average)

Lowest Average Scores: Ezana Weldeghebrael (3.1 average)

⚠️ Large scoring differences: A 1.1-point spread between the highest (4.2) and lowest (3.1) reviewer averages suggests inconsistent scoring standards.

Statistical Significance

⚠️ All criteria show significant reviewer differences. This indicates systematic bias: some reviewers consistently score higher or lower than others.

Implication: When reviewers differ systematically, application scores depend partly on which reviewers are assigned, compromising fairness.

Reliability by Reviewer Combination

Combination Performance Summary

Good reliability (ICC ≥ 0.75): 0 measurements
Moderate reliability (ICC 0.50-0.74): 11 measurements
Poor reliability (ICC < 0.50): 28 measurements

🚨 Critical: No combinations achieve good reliability standards.

Best Performance:

Abdelrahman Nagy, Christine Kelly, Dina Kiwan (Equity): ICC = 0.75 (counted in the moderate band above)

Note: 1 measurement is marked as “Not computable*” due to insufficient sample size or extreme disagreement.

Reliability by Criterion

Criterion-Specific Findings

Most Reliable Criterion: Equity (Mean ICC: 0.469)

Least Reliable Criterion: Overall (Mean ICC: -0.027)

Criteria Needing Attention:

Equity: Mean ICC = 0.469 (5/8 combinations acceptable)
Understanding: Mean ICC = 0.316 (2/8 combinations acceptable)
Methodology: Mean ICC = 0.037 (1/7 combinations acceptable)
Team: Mean ICC = 0.001 (1/8 combinations acceptable)
Overall: Mean ICC = -0.027 (2/8 combinations acceptable)

🚨 Overall Assessment: Poor reliability requiring immediate intervention.

Visual Analysis

Technical Notes

Data Quality

Methodology

Analysis Approach:

Identified reviewers by name rather than by generic position (e.g. Reviewer_3)
Calculated ICC using a two-way random-effects model with the consistency definition
Applied the average-measures ICC for combination reliability
Analyzed only combinations with ≥5 applications for statistical stability
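To make the approach above concrete, here is a minimal sketch of a two-way, consistency, average-measures ICC, often written ICC(C,k) = (MSR - MSE) / MSR, where MSR is the between-application mean square and MSE the residual mean square. It is an illustration under stated assumptions, not the code behind this report: the function name, the complete-matrix input (one row per application, one column per reviewer, no missing scores), and the rule for returning `None` are all assumptions.

```python
def icc_consistency_average(scores, min_applications=5):
    """ICC(C,k): two-way model, consistency definition, average measures.

    `scores` is a list of rows, one per application; each row holds one score
    per reviewer, in the same reviewer order. Returns None when the value is
    treated as not computable (too few applications, or no variance between
    application means). These cut-offs are assumptions for this sketch.
    """
    n = len(scores)                                   # number of applications
    if n < min_applications:
        return None
    k = len(scores[0])                                # number of reviewers
    if k < 2 or any(len(row) != k for row in scores):
        return None

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA decomposition without replication.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # applications
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # reviewers
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    if ms_rows <= 1e-12:
        return None                                   # division by zero
    return (ms_rows - ms_error) / ms_rows

# Reviewer 2 always scores one point higher: perfect consistency.
offset = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
icc_offset = icc_consistency_average(offset)          # 1.0

# Reviewers rank applications in nearly opposite order: negative ICC.
opposed = [[1, 5], [2, 4], [3, 3], [4, 2], [5, 2]]
icc_opposed = icc_consistency_average(opposed)
```

Because the consistency definition removes the reviewer (column) term, a constant offset between reviewers does not lower the ICC. That is why systematic leniency or severity has to be checked separately from the ICC itself.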

ICC Interpretation:

≥ 0.75: Good reliability (acceptable for high-stakes decisions)
0.50-0.74: Moderate reliability (may be acceptable with caveats)
0.25-0.49: Poor reliability (problematic for fair evaluation)
< 0.25: Unacceptable reliability
Negative: Systematic disagreement (error variance exceeds between-application variance; worse than random)
“Not computable”: Insufficient data or extreme variance
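The bands above can be applied mechanically. The sketch below is one possible encoding, with a hypothetical function name; it assumes that `None` stands for a non-computable ICC and that values between 0.74 and 0.75 fall in the moderate band. The mean ICCs used in the example are the per-criterion values reported in this document.

```python
def classify_icc(icc):
    """Map an ICC value to the interpretation bands used in this report."""
    if icc is None:
        return "Not computable"
    if icc < 0:
        return "Negative"
    if icc >= 0.75:
        return "Good"
    if icc >= 0.50:
        return "Moderate"
    if icc >= 0.25:
        return "Poor"
    return "Unacceptable"

# Mean ICC per criterion, as reported in this document.
mean_icc = {
    "Equity": 0.469,
    "Understanding": 0.316,
    "Methodology": 0.037,
    "Team": 0.001,
    "Overall": -0.027,
}
bands = {criterion: classify_icc(value) for criterion, value in mean_icc.items()}
```

On these means, Equity and Understanding land in the poor band, Methodology and Team in the unacceptable band, and Overall is negative.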

Notes:

*Non-computable ICCs occur when sample sizes are too small or when reviewers show extreme disagreement
All calculations are performed automatically and update when new data are added
ANOVA tests assess whether reviewers differ systematically in their scoring patterns
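The report does not state which ANOVA design was used. As a minimal sketch, assuming a one-way ANOVA with reviewer as the grouping factor, the F statistic can be computed as below; the function name and the sample scores are hypothetical. A p-value additionally requires the F distribution, for example via `scipy.stats.f_oneway`, which returns both the statistic and the p-value.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a {reviewer: [scores]} mapping.

    F compares the variance of reviewer means with the variance of scores
    within each reviewer; a large F suggests reviewers differ systematically.
    """
    samples = [s for s in groups.values() if s]       # drop empty groups
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and spare degrees of freedom")

    grand = sum(sum(s) for s in samples) / n_total
    means = [sum(s) / len(s) for s in samples]

    ss_between = sum(len(s) * (m - grand) ** 2 for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical scores for three reviewers on a single criterion.
example = {"A": [4, 5, 4, 5], "B": [2, 3, 2, 3], "C": [3, 4, 3, 4]}
f_stat, df_b, df_w = one_way_anova_f(example)
```

A one-way design treats every score as independent and ignores which application was scored, so differences in the applications each reviewer happened to receive can also inflate F.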

Generated on 2025-06-23 | Automated Inter-Rater Reliability Analysis