EIPM Strand 2 Inter-Rater Reliability Analysis
Technical Analysis with Interpretations
Executive Summary
Key Findings
Dataset Overview:
89 applications evaluated by 11 reviewers
Christine Kelly assigned as Reviewer_3 in 93.3% of applications
Systematic reviewer differences found in 5/5 criteria
Reliability Summary:
ICC Range: -1.286 to 0.75
Acceptable reliability: 11/39 valid measurements
⚠️ Major Concern: Fewer than a third of valid measurements (11/39) reach acceptable reliability (ICC ≥ 0.50)
Reviewer Assignment Issues
Christine Kelly appears as Reviewer_3 in 93.3% of applications. This confounds the design: individual reviewer effects cannot be separated from position effects.
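One way to surface this kind of position confound is a reviewer-by-position crosstab. A minimal sketch in pandas, assuming long-format assignment data; the column names (`application`, `position`, `reviewer`) and the example rows are hypothetical, not the actual schema:

```python
import pandas as pd

# Hypothetical assignment table: one row per (application, position slot).
assignments = pd.DataFrame({
    "application": ["A1", "A1", "A1", "A2", "A2", "A2"],
    "position":   ["Reviewer_1", "Reviewer_2", "Reviewer_3"] * 2,
    "reviewer":   ["Dina Kiwan", "Paul Jackson", "Christine Kelly",
                   "Abdelrahman Nagy", "Dina Kiwan", "Christine Kelly"],
})

# Share of applications in which each reviewer fills each position slot.
rates = pd.crosstab(assignments["reviewer"], assignments["position"],
                    normalize="columns")
print(rates)  # a column dominated by one name signals a position confound
```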
Individual Reviewer Analysis
Reviewer Patterns
Most Active Reviewer: Christine Kelly (380 ratings)
Highest Average Scores: Paul Jackson (4.2 average)
Lowest Average Scores: Ezana Weldeghebrael (3.1 average)
⚠️ Large scoring differences: a 1.1-point spread between the highest and lowest reviewer averages suggests inconsistent standards.
Statistical Significance
⚠️ All criteria show significant reviewer differences: some reviewers consistently score higher or lower than others, indicating systematic bias.
Implication: When reviewers differ systematically, application scores depend partly on which reviewers are assigned, compromising fairness.
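The per-criterion significance tests can be reproduced with a one-way ANOVA of scores grouped by reviewer (see also the Notes section). A minimal sketch, assuming long-format score data with hypothetical `criterion`, `reviewer`, and `score` columns:

```python
import pandas as pd
from scipy import stats

def reviewer_anova(scores: pd.DataFrame, criterion: str):
    """One-way ANOVA: do mean scores differ systematically across reviewers?"""
    sub = scores[scores["criterion"] == criterion]
    groups = [g["score"].to_numpy() for _, g in sub.groupby("reviewer")]
    return stats.f_oneway(*groups)  # returns (F statistic, p-value)

# A p-value below 0.05 for every criterion is what "significant reviewer
# differences in 5/5 criteria" means above.
```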
Reliability by Reviewer Combination
Combination Performance Summary
Good reliability (ICC ≥ 0.75): 0 measurements
Moderate reliability (ICC 0.50-0.74): 11 measurements
Poor reliability (ICC < 0.50): 28 measurements
🚨 Critical: No combinations achieve good reliability standards.
Best Performance:
- Abdelrahman Nagy, Christine Kelly, Dina Kiwan (Equity): ICC = 0.75
Note: 1 measurement marked as 'Not computable*' due to insufficient sample size or extreme disagreement.
Reliability by Criterion
Criterion-Specific Findings
Most Reliable Criterion: Equity (Mean ICC: 0.469)
Least Reliable Criterion: Overall (Mean ICC: -0.027)
Criteria Needing Attention (an aggregation sketch follows this list):
Equity: Mean ICC = 0.469 (5/8 combinations acceptable)
Understanding: Mean ICC = 0.316 (2/8 combinations acceptable)
Methodology: Mean ICC = 0.037 (1/7 combinations acceptable)
Team: Mean ICC = 0.001 (1/8 combinations acceptable)
Overall: Mean ICC = -0.027 (2/8 combinations acceptable)
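The criterion-level summaries above are plain aggregates of the combination-level ICCs. A sketch of that roll-up, assuming a hypothetical results table (`criterion`, `combination`, `icc`) with illustrative values:

```python
import pandas as pd

# Hypothetical combination-level results; values are illustrative only.
results = pd.DataFrame({
    "criterion": ["Equity", "Equity", "Team", "Overall"],
    "combination": ["c1", "c2", "c1", "c1"],
    "icc": [0.75, 0.41, 0.10, -0.20],
})

summary = results.groupby("criterion")["icc"].agg(
    mean_icc="mean",
    acceptable=lambda s: int((s >= 0.50).sum()),  # moderate or better
    combinations="count",
)
print(summary.sort_values("mean_icc", ascending=False))
```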
🚨 Overall Assessment: Poor reliability requiring immediate intervention.
Technical Notes
Methodology
Analysis Approach:
Used actual reviewer names instead of generic positions
Calculated ICC using two-way random effects model with consistency definition
Applied average measures ICC for combination reliability
Analyzed only combinations with ≥5 applications for statistical stability (a code sketch follows this list)
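A minimal sketch of the combination-level ICC computation described above, using pingouin; the long-format column names (`application`, `reviewer`, `criterion`, `score`) are assumptions, not the actual schema. Note that the two-way, consistency, average-measures coefficient ICC(C,k) is computed by the same formula pingouin reports under the label ICC3k:

```python
import pandas as pd
import pingouin as pg

def combination_icc(scores: pd.DataFrame, criterion: str, min_apps: int = 5):
    """Average-measures consistency ICC for one reviewer combination.

    Expects one row per (application, reviewer) with a numeric score.
    Returns None below the 5-application threshold, mirroring the
    "Not computable*" entries in this report.
    """
    sub = scores[scores["criterion"] == criterion]
    if sub["application"].nunique() < min_apps:
        return None
    icc = pg.intraclass_corr(data=sub, targets="application",
                             raters="reviewer", ratings="score")
    # ICC(C,k): two-way model, consistency definition, average measures.
    return float(icc.loc[icc["Type"] == "ICC3k", "ICC"].iloc[0])
```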
ICC Interpretation:
≥ 0.75: Good reliability (acceptable for high-stakes decisions)
0.50-0.74: Moderate reliability (may be acceptable with caveats)
0.25-0.49: Poor reliability (problematic for fair evaluation)
< 0.25: Unacceptable reliability
Negative: Systematic disagreement (worse than random)
"Not computable": Insufficient data or extreme variance (a banding sketch follows this list)
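These bands translate directly into a small lookup; a sketch of the labelling, matching the thresholds listed above:

```python
from typing import Optional

def icc_band(icc: Optional[float]) -> str:
    """Map an ICC estimate to the reliability bands used in this report."""
    if icc is None:
        return "Not computable"
    if icc < 0:
        return "Systematic disagreement"
    if icc < 0.25:
        return "Unacceptable"
    if icc < 0.50:
        return "Poor"
    if icc < 0.75:
        return "Moderate"
    return "Good"
```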
Notes:
*Non-computable ICCs occur when sample sizes are too small or when reviewers show extreme disagreement
All calculations are performed automatically and update with new data
ANOVA tests assess whether reviewers differ systematically in their scoring patterns
Generated on 2025-06-23 | Automated Inter-Rater Reliability Analysis