Study Overview

Purpose

The purpose of this study was to determine the minimum percentage of items from an 80-item test that Angoff raters need to evaluate to produce a reliable cut score. The goal was to balance accuracy with rater effort, potentially reducing the workload without compromising the integrity of the standard-setting process.

Research Question

“What is the minimum percentage of items from an 80-item test that Angoff raters need to evaluate in order to produce a cut score that is within 5% of the full-set cut score, with 95% confidence, while also maintaining a standard error no greater than 1.5 times that of the full-set standard error?”

Methodology

Simulation Parameters

  • Test length: 80 items
  • Number of raters: 5
  • Rater means: Randomly generated between 60 and 85
  • Rater bias: Normally distributed with mean 0 and SD 0.10
  • Item-level variability: SD 0.10
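
To make the setup concrete, the sketch below shows how ratings with these properties could be generated. It assumes the ratings are expressed on a 0–100 percentage scale and that both SDs are interpreted on that same scale; the seed, the clipping step, and the name `simulate_ratings` are illustrative choices, not details taken from the study's actual code.

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed for reproducibility

N_ITEMS = 80   # test length
N_RATERS = 5   # number of Angoff raters

def simulate_ratings(n_items=N_ITEMS, n_raters=N_RATERS):
    """Simulate an n_raters x n_items matrix of Angoff ratings on a 0-100 scale."""
    # Each rater's overall tendency: a mean rating drawn uniformly between 60 and 85
    rater_means = rng.uniform(60, 85, size=n_raters)
    # Rater bias: normally distributed with mean 0 and SD 0.10
    rater_bias = rng.normal(0, 0.10, size=n_raters)
    # Item-level variability: independent noise with SD 0.10 for each rater-item pair
    item_noise = rng.normal(0, 0.10, size=(n_raters, n_items))
    ratings = rater_means[:, None] + rater_bias[:, None] + item_noise
    return np.clip(ratings, 0, 100)  # keep ratings within the percentage scale

ratings = simulate_ratings()
```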

Analysis Procedure

  1. Generated a full set of simulated Angoff ratings
  2. Calculated the full-set cut score and standard error as benchmarks
  3. Conducted bootstrap analyses for sampling rates from 10% to 100% in 10% increments
  4. For each sampling rate:
    • Estimated the cut score
    • Calculated the standard error and 95% confidence interval
    • Determined whether the results met the specified criteria
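
A minimal sketch of the per-rate estimation step follows. It reuses `rng` and `ratings` from the previous sketch, treats the cut score as the grand mean of all ratings, draws each item subsample without replacement, and bootstraps items within that subsample; the 1,000 bootstrap replications and the percentile confidence interval are assumptions, since the study does not report those details.

```python
def bootstrap_cut_score(ratings, sample_frac, n_boot=1000):
    """Cut score, SE, and 95% CI when only a fraction of items is rated."""
    n_raters, n_items = ratings.shape
    n_sampled = int(round(sample_frac * n_items))
    # The subset of items the raters would actually evaluate
    sampled_items = rng.choice(n_items, size=n_sampled, replace=False)
    subset = ratings[:, sampled_items]
    # Bootstrap over items: resample columns with replacement and recompute the mean
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        boot_items = rng.choice(n_sampled, size=n_sampled, replace=True)
        boot_means[b] = subset[:, boot_items].mean()
    cut_score = subset.mean()                               # point estimate: grand mean
    se = boot_means.std(ddof=1)                             # bootstrap standard error
    lower, upper = np.percentile(boot_means, [2.5, 97.5])   # percentile 95% CI
    return cut_score, se, lower, upper

# Example: a single run at a 50% sampling rate (40 of 80 items)
cut, se, lower, upper = bootstrap_cut_score(ratings, 0.5)
```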

Criteria for Acceptability

  • Estimated cut score within 5% of the full-set cut score with 95% confidence
  • Standard error no greater than 1.5 times the full-set standard error
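
One reasonable reading of these criteria, consistent with the results reported below, is that the entire 95% confidence interval must fall within ±5% of the full-set cut score and that the standard error must not exceed 1.5 times the full-set value. A small check along those lines might look like this:

```python
def meets_criteria(lower, upper, se, full_cut, full_se):
    """Apply both acceptability criteria to one sampling rate's results."""
    # Criterion 1: the whole 95% CI lies within +/-5% of the full-set cut score
    within_tolerance = (lower >= 0.95 * full_cut) and (upper <= 1.05 * full_cut)
    # Criterion 2: SE no greater than 1.5 times the full-set SE
    precise_enough = se <= 1.5 * full_se
    return within_tolerance and precise_enough
```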

Results

Full Dataset Benchmark

  • Full-set cut score: 76.53
  • Full-set standard error: 0.33

Results by Sampling Rate

Sampling Rate (%)  Items Rated  Estimated Cut Score  Standard Error  Lower 95% CI  Upper 95% CI  Meets Criteria
               10            8                77.52            1.05         75.57         79.71           FALSE
               20           16                76.75            0.76         75.24         78.18           FALSE
               30           25                76.06            0.60         74.89         77.34           FALSE
               40           32                76.55            0.51         75.48         77.49           FALSE
               50           40                76.19            0.47         75.35         77.21            TRUE
               60           48                76.56            0.44         75.66         77.39            TRUE
               70           56                76.54            0.39         75.75         77.28            TRUE
               80           64                76.69            0.38         75.96         77.46            TRUE
               90           72                76.60            0.35         75.93         77.29            TRUE
              100           80                76.53            0.33         75.88         77.14            TRUE

Minimum Acceptable Sampling Rate

  • Minimum sampling rate meeting all criteria: 50%
  • Corresponding number of items: 40
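
The selection of this minimum can be expressed programmatically as the smallest sampling rate whose bootstrap results satisfy both criteria; the sketch below simply ties together the illustrative functions from the earlier sections.

```python
# Benchmark values from rating the full item set (sampling fraction = 1.0)
full_cut, full_se, _, _ = bootstrap_cut_score(ratings, 1.0)

# Smallest sampling rate (10%, 20%, ..., 100%) whose results satisfy both criteria
minimum_rate = None
for frac in np.arange(0.1, 1.01, 0.1):
    _, se, lower, upper = bootstrap_cut_score(ratings, frac)
    if meets_criteria(lower, upper, se, full_cut, full_se):
        minimum_rate = frac
        break

print(f"Minimum acceptable sampling rate: {minimum_rate:.0%}")
```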

Visualization

(Figure not reproduced here.)

Interpretation of Results

  1. Minimum Sampling Rate: Our analysis indicates that a sampling rate of 50% is the minimum that meets all specified criteria. This corresponds to rating 40 items out of the full 80-item test.

  2. Stability of Estimates: As the sampling rate increases, we observe a gradual stabilization of the cut score estimates, with narrowing confidence intervals.

  3. Precision vs. Effort Trade-off: While higher sampling rates generally provide more precise estimates, the improvements beyond a 50% sampling rate appear marginal, suggesting a point of diminishing returns.

  4. Practical Implications: These results suggest that Angoff raters could potentially evaluate as few as 40 items and still produce a cut score that satisfies the prespecified accuracy and precision criteria relative to one produced by rating all 80 items.

Conclusions

  1. Feasibility of Reduced Rating Burden: This study demonstrates that it is possible to significantly reduce the number of items rated in the Angoff method while maintaining a high level of accuracy and confidence in the resulting cut score.

  2. Optimal Sampling Strategy: Based on our criteria, a sampling rate of 50% (corresponding to 40 items) appears to be the optimal balance between minimizing rater effort and maintaining statistical rigor.

  3. Potential Benefits:

    • Reduced time and cognitive load for raters
    • Potential for involving more raters or conducting more frequent standard setting
    • Maintained integrity of the cut score determination process

Limitations and Future Directions

  1. Simulation-Based Study: These results are based on simulated data. Validation with real-world data is necessary to confirm these findings.

  2. Generalizability: The optimal sampling rate may vary based on factors such as test content, item difficulty distribution, and rater characteristics. Further research could explore how these factors influence the required sample size.

  3. Alternative Approaches: Future studies could compare this method with other efficiency-improving approaches, such as iterative rating processes or stratified item sampling.

  4. Rater Dynamics: Investigation into how reduced item sets affect rater behavior and decision-making processes could provide additional insights.

In conclusion, this study provides evidence-based guidance for optimizing the Angoff standard setting process, potentially leading to more efficient and sustainable practices in educational and professional certification contexts.