EIPM Strand 2 Inter-Rater Reliability Analysis

Technical Analysis with Interpretations

Author

Dr. Lucas Sempé

Published

June 23, 2025

Executive Summary

Key Findings

Dataset Overview:

  • 89 applications evaluated by 11 reviewers

  • Christine Kelly assigned as Reviewer_3 in 93.3% of applications

  • Systematic reviewer differences found in 5 of 5 criteria

Reliability Summary:

  • ICC Range: -1.286 to 0.75

  • Acceptable reliability: 11 of 39 valid measurements

⚠️ Major Concern: Only 11 of 39 valid measurements (28%) show acceptable reliability

Reviewer Assignment Issues

Christine Kelly appears as Reviewer_3 in 93.3% of applications, so reviewer identity and reviewer position are almost completely confounded for this slot. As a result, her individual scoring effect cannot be separated from any effect of the Reviewer_3 position.
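The share quoted above can be checked with a few lines of code. This is a minimal sketch, not the report's own implementation: the `position_share` helper, the dictionary layout, and the placeholder names "Reviewer A" and "Reviewer B" are assumptions made for illustration. The counts (83 of 89) are chosen only because they reproduce the 93.3% figure.

```python
def position_share(assignments, reviewer, position):
    """Return the fraction of applications in which `reviewer` fills `position`.

    `assignments` is a list with one dict per application, mapping a
    position label (e.g. "Reviewer_3") to a reviewer name. This layout is
    an assumption for illustration.
    """
    if not assignments:
        return 0.0
    hits = sum(1 for a in assignments if a.get(position) == reviewer)
    return hits / len(assignments)


# Illustrative data: 83 of 89 applications share the same Reviewer_3.
assignments = [{"Reviewer_3": "Reviewer A"}] * 83 + [{"Reviewer_3": "Reviewer B"}] * 6
share = position_share(assignments, "Reviewer A", "Reviewer_3")
print(f"{share:.1%}")
```

When one name fills a position this often, a position effect and a person effect are estimated from nearly the same rows, which is why they cannot be told apart.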

Individual Reviewer Analysis

Reviewer Patterns

Most Active Reviewer: Christine Kelly (380 applications)

Highest Average Scores: Paul Jackson (4.2 average)

Lowest Average Scores: Ezana Weldeghebrael (3.1 average)

⚠️ Large scoring differences: The 1.1-point spread between the highest (4.2) and lowest (3.1) reviewer averages suggests that reviewers are applying inconsistent standards.

Statistical Significance

⚠️ All criteria show significant reviewer differences. This indicates systematic bias: some reviewers consistently score higher or lower than others.

Implication: When reviewers differ systematically, application scores depend partly on which reviewers are assigned, compromising fairness.
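A test for systematic reviewer differences can be sketched as a one-way ANOVA on the scores grouped by reviewer. The report states that ANOVA tests were used but does not say which design, so the one-way form, the `one_way_anova_f` helper, and the scores below are all assumptions for illustration.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` maps a reviewer name to the list of scores that reviewer gave.
    Returns F as infinity when there is no within-reviewer variation.
    """
    all_scores = [x for scores in groups.values() for x in scores]
    n_total = len(all_scores)
    k = len(groups)
    grand_mean = sum(all_scores) / n_total

    ss_between = 0.0  # variation of reviewer means around the grand mean
    ss_within = 0.0   # variation of scores around each reviewer's own mean
    for scores in groups.values():
        mean = sum(scores) / len(scores)
        ss_between += len(scores) * (mean - grand_mean) ** 2
        ss_within += sum((x - mean) ** 2 for x in scores)

    df_between = k - 1
    df_within = n_total - k
    if ss_within == 0:
        return float("inf"), df_between, df_within
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical scores for three reviewers on one criterion.
scores_by_reviewer = {
    "Reviewer A": [4, 5, 4, 5],
    "Reviewer B": [3, 3, 2, 4],
    "Reviewer C": [4, 4, 5, 3],
}
f_stat, df1, df2 = one_way_anova_f(scores_by_reviewer)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

A large F relative to the F distribution with these degrees of freedom indicates that reviewer means differ by more than within-reviewer noise would explain. The sketch stops at the F statistic; a p-value would need the F distribution, for example from `scipy.stats.f_oneway`.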

Reliability by Reviewer Combination

Combination Performance Summary

  • Good reliability (ICC ≥ 0.75): 0 measurements

  • Moderate reliability (ICC 0.50-0.74): 11 measurements

  • Poor reliability (ICC < 0.50): 28 measurements

🚨 Critical: No combinations achieve good reliability standards.

Best Performance:

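  • Abdelrahman Nagy, Christine Kelly, Dina Kiwan (Equity): ICC = 0.75 (rounded; counted as moderate above, so the unrounded value is presumably just below the 0.75 threshold)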

Note: 1 measurement is marked as “Not computable*” due to insufficient sample size or extreme disagreement.

Reliability by Criterion

Criterion-Specific Findings

Most Reliable Criterion: Equity (Mean ICC: 0.469)

Least Reliable Criterion: Overall (Mean ICC: -0.027)

Criteria Needing Attention:

  • Equity: Mean ICC = 0.469 (5 of 8 combinations acceptable)

  • Understanding: Mean ICC = 0.316 (2 of 8 combinations acceptable)

  • Methodology: Mean ICC = 0.037 (1 of 7 combinations acceptable)

  • Team: Mean ICC = 0.001 (1 of 8 combinations acceptable)

  • Overall: Mean ICC = -0.027 (2 of 8 combinations acceptable)

🚨 Overall Assessment: Poor reliability requiring immediate intervention.
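The per-criterion figures above are a mean ICC and a count of acceptable combinations among the valid ones. The following sketch shows one way to compute them. It assumes that "acceptable" means ICC ≥ 0.50, which is consistent with the 11 moderate and 0 good measurements reported but is not stated explicitly. The `summarize_by_criterion` helper and the sample values are hypothetical.

```python
def summarize_by_criterion(results, threshold=0.50):
    """Aggregate ICCs per criterion.

    `results` is a list of (criterion, icc) pairs, one per reviewer
    combination; `icc` is None when the value was not computable.
    Non-computable entries are excluded from the mean and the counts.
    """
    by_criterion = {}
    for criterion, icc in results:
        if icc is None:
            continue
        by_criterion.setdefault(criterion, []).append(icc)

    return {
        criterion: {
            "mean_icc": sum(values) / len(values),
            "acceptable": sum(1 for v in values if v >= threshold),
            "valid": len(values),
        }
        for criterion, values in by_criterion.items()
    }


# Hypothetical ICCs, including one non-computable combination.
results = [
    ("Equity", 0.6),
    ("Equity", 0.4),
    ("Methodology", None),
    ("Methodology", -0.2),
]
summary = summarize_by_criterion(results)
for criterion, stats in summary.items():
    print(criterion, stats)
```

Dropping non-computable entries before averaging would explain why one criterion can have fewer valid combinations than the others.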

Visual Analysis

[Figure: Reliability Analysis Visualizations]

Technical Notes

Methodology

Analysis Approach:

  • Used actual reviewer names instead of generic positions

  • Calculated ICC using a two-way random-effects model with the consistency definition

  • Applied average-measures ICC (the reliability of the mean score across a combination's reviewers) for combination reliability

  • Analyzed only combinations with ≥5 applications for statistical stability
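The approach above can be sketched from the two-way ANOVA mean squares. With MS_R the between-application mean square and MS_E the residual mean square, the consistency, average-measures ICC is (MS_R − MS_E) / MS_R. The code below is a minimal illustration under stated assumptions, not the report's actual implementation: it assumes complete data (every reviewer scores every application in the combination), and the function name and the `min_applications` parameter are hypothetical.

```python
def icc_consistency_average(ratings, min_applications=5):
    """Consistency, average-measures ICC from a two-way layout.

    `ratings` is a list of rows, one per application, each holding one
    score per reviewer with no missing values. Returns None when there
    are too few applications or when applications do not differ at all
    (MS_R is zero), since the ratio is then undefined.
    """
    n = len(ratings)                      # applications
    if n < min_applications:
        return None
    k = len(ratings[0])                   # reviewers
    if k < 2:
        return None

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    if ms_rows == 0:
        return None
    return (ms_rows - ms_error) / ms_rows


# Reviewers differ by a constant offset but rank applications identically.
consistent = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7]]
# Reviewers rank applications in nearly opposite order.
opposed = [[1, 4], [4, 1], [2, 4], [4, 2], [3, 3]]

print(icc_consistency_average(consistent))
print(icc_consistency_average(opposed))
```

Under the consistency definition, constant offsets between reviewers do not lower the ICC, so the first example is perfectly reliable even though the reviewers' averages differ. The formula equals 1 − MS_E / MS_R, so it has no lower bound of −1, and strongly opposed rankings can produce values well below it.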

ICC Interpretation:

  • ≥ 0.75: Good reliability (acceptable for high-stakes decisions)

  • 0.50-0.74: Moderate reliability (may be acceptable with caveats)

  • 0.25-0.49: Poor reliability (problematic for fair evaluation)

  • < 0.25: Unacceptable reliability

  • Negative: Systematic disagreement (reviewers vary more within an application than applications vary from one another; worse than chance agreement)

  • “Not computable”: Insufficient data or extreme variance
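The bands above map directly onto a small lookup. In this sketch the ranges are treated as half-open intervals, so a value of 0.745 falls in the moderate band rather than in the gap between 0.74 and 0.75. That reading, the `interpret_icc` name, and the use of None for non-computable values are assumptions.

```python
def interpret_icc(icc):
    """Map an ICC value (or None) to the report's interpretation bands."""
    if icc is None:
        return "Not computable"
    if icc < 0:
        return "Negative (systematic disagreement)"
    if icc < 0.25:
        return "Unacceptable"
    if icc < 0.50:
        return "Poor"
    if icc < 0.75:
        return "Moderate"
    return "Good"


for value in (0.80, 0.469, 0.001, -1.286, None):
    print(value, "->", interpret_icc(value))
```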

Notes:

  • *Non-computable ICCs occur when sample sizes are too small or when reviewers show extreme disagreement

  • All calculations are performed automatically and update when new data are added

  • ANOVA tests assess whether reviewers differ systematically in their scoring patterns


Generated on 2025-06-23 | Automated Inter-Rater Reliability Analysis