Attribute Agreement Analysis: Appraiser Agreement and Detection Performance

iQualityR MSA Team

2026-05-02

1. Business Context

This example simulates a visual inspection study for injection-molded connector housings. The quality team wants to validate whether three inspectors can consistently classify each part as Accept or Reject, and whether those classifications agree with the known engineering standard.

The study is deliberately built as a complete attribute agreement study. The required output is not only an overall Kappa number: a usable report must include between-appraiser agreement, each appraiser versus the standard, within-appraiser repeatability, response-level agreement, detection performance, and the accompanying plots.

2. Method

Kappa adjusts observed agreement for chance agreement:

\[ \kappa = \frac{P_o - P_e}{1 - P_e} \]

Here \(P_o\) is the observed proportion of agreement and \(P_e\) is the proportion of agreement expected by chance. For measurement-system release decisions, Kappa should be read together with the false negative risk: high overall agreement can still be unacceptable if defective parts are frequently classified as acceptable.
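
As a quick numeric illustration of the formula (a minimal base-R sketch with hypothetical counts, not part of the iQualityR API), Kappa can be computed by hand from a two-rater agreement table:

tab <- matrix(c(40, 5, 3, 52), nrow = 2,
              dimnames = list(RaterA = c("Accept", "Reject"),
                              RaterB = c("Accept", "Reject")))
n  <- sum(tab)
Po <- sum(diag(tab)) / n                      # observed agreement
Pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
round(c(Po = Po, Pe = Pe, kappa = (Po - Pe) / (1 - Pe)), 4)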

3. Complete Study Data

set.seed(20260508)

n_parts <- 50
parts <- sprintf("C%03d", seq_len(n_parts))
latent_score <- sort(rnorm(n_parts, mean = 0, sd = 1.1))
standard <- ifelse(latent_score > 0.18, "Reject", "Accept")

appraisers <- c("A", "B", "C")
trials <- 1:2

appraiser_bias <- c(A = -0.03, B = 0.10, C = -0.08)
appraiser_sd <- c(A = 0.34, B = 0.38, C = 0.42)

attr_data <- do.call(rbind, lapply(appraisers, function(app) {
  do.call(rbind, lapply(trials, function(trial) {
    observed_score <- latent_score + appraiser_bias[[app]] + rnorm(n_parts, 0, appraiser_sd[[app]])
    rating <- ifelse(observed_score > 0.18, "Reject", "Accept")
    data.frame(
      Sample = parts,
      Appraiser = app,
      Trial = trial,
      Standard = standard,
      Rating = rating,
      stringsAsFactors = FALSE
    )
  }))
}))

knitr::kable(head(attr_data, 12), caption = "Complete attribute agreement data")
Complete attribute agreement data
Sample Appraiser Trial Standard Rating
C001 A 1 Accept Accept
C002 A 1 Accept Accept
C003 A 1 Accept Accept
C004 A 1 Accept Accept
C005 A 1 Accept Accept
C006 A 1 Accept Accept
C007 A 1 Accept Accept
C008 A 1 Accept Accept
C009 A 1 Accept Accept
C010 A 1 Accept Accept
C011 A 1 Accept Accept
C012 A 1 Accept Accept
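
Before running the workflow, it can help to confirm that the simulated standard contains both classes in reasonable proportions; a minimal check on the standard vector defined above:

table(standard)
prop.table(table(standard))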

4. Complete Attribute Agreement Workflow

plan <- AttrGagePlan$new(
  plan_name = "Connector_Housing_Visual_Inspection",
  objectives = "Evaluate appraiser agreement and defect detection risk",
  appraisers = appraisers,
  samples = parts,
  standards = standard,
  categories = c("Accept", "Reject"),
  trials = 2,
  comparison_mode = "one_way",
  kappa_method = "fleiss",
  conf_level = 0.95
)

plan$set_meta(
  "data",
  sample_col = "Sample",
  rater_col = "Appraiser",
  rating_col = "Rating",
  standard_col = "Standard",
  trial_col = "Trial",
  reference_col = "Standard",
  test_col = "Rating",
  positive_category = "Reject",
  negative_category = "Accept"
)

attr_task <- AttrGageTask$new(
  data = attr_data,
  plan = plan,
  theme = "academic",
  mode = "all"
)
attr_task$compute()
attr_task$summary()
#> 
#> ========== Attribute Agreement Summary ==========
#> 
#> --- Kappa Statistics ---
#>  Statistic          Value           
#>  Method             Fleiss' Kappa   
#>  Kappa              0.7039          
#>  SE Kappa           0.1038          
#>  Z                  6.7815          
#>  P(vs > 0)          <0.0001         
#>  95% CI             [0.5004, 0.9073]
#>  Observed agreement 86.00%          
#>  Expected agreement 52.72%          
#>  Interpretation     Good            
#> 
#> --- Between Appraisers ---
#>  Comparison N   Percent_Agreement Kappa  SE_Kappa Z       P_vs_gt_0
#>  A vs B     100 84                0.6637 0.0771    8.6142 0        
#>  A vs C     100 86                0.7030 0.0736    9.5508 0        
#>  B vs C     100 90                0.7879 0.0636   12.3800 0        
#>  Interpretation
#>  Good          
#>  Good          
#>  Strong        
#> 
#> --- Each Appraiser vs Standard ---
#>  Comparison                 N   Percent_Agreement Kappa  SE_Kappa Z      
#>  A vs Standard              100 91                0.8100 0.0604   13.4041
#>  B vs Standard              100 87                0.7255 0.0710   10.2170
#>  C vs Standard              100 89                0.7654 0.0667   11.4673
#>  All Appraisers vs Standard 300 89                0.7669 0.0383   20.0391
#>  P_vs_gt_0 Interpretation
#>  0         Strong        
#>  0         Good          
#>  0         Strong        
#>  0         Strong        
#> 
#> --- Within Appraiser Repeatability ---
#>  Comparison     N  Percent_Agreement Kappa  SE_Kappa Z      P_vs_gt_0
#>  A Trial 1 vs 2 50 90                0.7899 0.0891   8.8624 0        
#>  B Trial 1 vs 2 50 78                0.5470 0.1206   4.5337 0        
#>  C Trial 1 vs 2 50 82                0.6141 0.1165   5.2713 0        
#>  Interpretation
#>  Strong        
#>  Moderate      
#>  Good          
#> 
#> --- Response-Level Agreement ---
#>  Response Ratings Standard_Count Matches_to_Standard Percent_Match_to_Standard
#>  Accept   185     186            169                 90.8602                  
#>  Reject   115     114             98                 85.9649                  
#> 
#> --- Detection Performance ---
#>  Metric              Estimate CI              
#>  Sensitivity         85.96%   [78.21%, 91.76%]
#>  False negative rate 14.04%                   
#>  Specificity         90.86%   [85.77%, 94.59%]
#>  False positive rate 9.14%                    
#>  Youden index        0.7683                   
#>  LR+                 9.4056                   
#>  LR-                 0.1545                   
#>  Prevalence          38.00%                   
#> =================================================

5. Overall Kappa
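
The table-building code in this and the following sections uses small formatting helpers (fmt_num, fmt_p, fmt_ci, fmt_pct, fmt_chr, round_df). These are assumed to be defined in the vignette's setup chunk or provided by iQualityR; if not, minimal local equivalents along these lines would be enough to reproduce the tables:

fmt_num <- function(x, digits = 4) formatC(x, digits = digits, format = "f")
fmt_pct <- function(x, digits = 2) {
  # assumes x is a proportion in [0, 1]
  paste0(formatC(100 * x, digits = digits, format = "f"), "%")
}
fmt_p   <- function(p, digits = 4) ifelse(p < 1e-4, "<0.0001", formatC(p, digits = digits, format = "f"))
fmt_ci  <- function(ci, digits = 4) sprintf("[%s, %s]",
                                            formatC(ci[1], digits = digits, format = "f"),
                                            formatC(ci[2], digits = digits, format = "f"))
fmt_chr <- function(x) as.character(x)
round_df <- function(df, digits = 4) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], round, digits = digits)
  df
}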

kr <- attr_task$kappa_results$raw_output
kappa_summary <- data.frame(
  Statistic = c("Method", "Kappa", "SE Kappa", "Z", "P(vs > 0)", "95% CI", "Observed agreement", "Expected agreement", "Interpretation"),
  Value = c(
    kr$method,
    fmt_num(kr$kappa, 4),
    fmt_num(kr$se, 4),
    fmt_num(kr$z, 4),
    fmt_p(kr$p_value, 4),
    fmt_ci(kr$ci, 4),
    fmt_pct(kr$Po, 2),
    fmt_pct(kr$Pe, 2),
    fmt_chr(kr$interpretation)
  )
)
knitr::kable(kappa_summary, align = c("l", "r"), caption = "Overall Kappa statistics")
Overall Kappa statistics
Statistic Value
Method Fleiss’ Kappa
Kappa 0.7039
SE Kappa 0.1038
Z 6.7815
P(vs > 0) <0.0001
95% CI [0.5004, 0.9073]
Observed agreement 86.00%
Expected agreement 52.72%
Interpretation Good
attr_task$plot("kappa_benchmark")

6. Between Appraisers

knitr::kable(
  round_df(kr$pairwise_appraisers, 4),
  align = c("l", "r", "r", "r", "r", "r", "r", "l"),
  caption = "Pairwise appraiser agreement"
)
Pairwise appraiser agreement
Comparison N Percent_Agreement Kappa SE_Kappa Z P_vs_gt_0 Interpretation
A vs B 100 84 0.6637 0.0771 8.6142 0 Good
A vs C 100 86 0.7030 0.0736 9.5508 0 Good
B vs C 100 90 0.7879 0.0636 12.3800 0 Strong
attr_task$plot("pairwise_agreement")

7. Each Appraiser vs Standard

knitr::kable(
  round_df(kr$appraiser_vs_standard, 4),
  align = c("l", "r", "r", "r", "r", "r", "r", "l"),
  caption = "Each appraiser compared with the known standard"
)
Each appraiser compared with the known standard
Comparison N Percent_Agreement Kappa SE_Kappa Z P_vs_gt_0 Interpretation
A vs Standard 100 91 0.8100 0.0604 13.4041 0 Strong
B vs Standard 100 87 0.7255 0.0710 10.2170 0 Good
C vs Standard 100 89 0.7654 0.0667 11.4673 0 Strong
All Appraisers vs Standard 300 89 0.7669 0.0383 20.0391 0 Strong
attr_task$plot("appraiser_standard")

8. Within Appraiser Repeatability

knitr::kable(
  round_df(kr$within_appraiser, 4),
  align = c("l", "r", "r", "r", "r", "r", "r", "l"),
  caption = "Within-appraiser repeatability across repeated trials"
)
Within-appraiser repeatability across repeated trials
Comparison N Percent_Agreement Kappa SE_Kappa Z P_vs_gt_0 Interpretation
A Trial 1 vs 2 50 90 0.7899 0.0891 8.8624 0 Strong
B Trial 1 vs 2 50 78 0.5470 0.1206 4.5337 0 Moderate
C Trial 1 vs 2 50 82 0.6141 0.1165 5.2713 0 Good

9. Response-Level Agreement

knitr::kable(
  round_df(kr$response_table, 4),
  align = c("l", "r", "r", "r", "r"),
  caption = "Response-level agreement summary"
)
Response-level agreement summary
Response Ratings Standard_Count Matches_to_Standard Percent_Match_to_Standard
Accept 185 186 169 90.8602
Reject 115 114 98 85.9649
attr_task$plot("response_agreement")

10. Sample-Level Disagreement Diagnostics

When many samples are present, the report does not use a full sample-by-rating matrix as the main graphic: with 50 parts, that heatmap would compress too many sample labels onto the y-axis. The practical diagnostic is to focus on the samples with the highest disagreement, as in the cross-check and table below.
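
As a cross-check on the package table, the number of ratings per sample that miss the standard can be computed directly from attr_data (a minimal base-R sketch, independent of the report pipeline):

mismatch <- with(attr_data, tapply(Rating != Standard, Sample, sum))
head(sort(mismatch, decreasing = TRUE), 6)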

knitr::kable(
  round_df(head(kr$sample_disagreement, 12), 4),
  align = c("l", "l", "r", "l", "r", "r", "r", "r", "r"),
  caption = "Top samples by disagreement"
)
Top samples by disagreement
Sample Standard N_Ratings Most_Common_Rating Modal_Count Percent_Modal Standard_Matches Percent_Match_to_Standard Discordant_Ratings
32 C032 Reject 6 Accept 3 50.0000 3 50.0000 3
28 C028 Accept 6 Reject 4 66.6667 2 33.3333 2
31 C031 Accept 6 Reject 4 66.6667 2 33.3333 2
33 C033 Reject 6 Accept 4 66.6667 2 33.3333 2
34 C034 Reject 6 Accept 4 66.6667 2 33.3333 2
24 C024 Accept 6 Accept 4 66.6667 4 66.6667 2
30 C030 Accept 6 Accept 4 66.6667 4 66.6667 2
36 C036 Reject 6 Reject 4 66.6667 4 66.6667 2
23 C023 Accept 6 Accept 5 83.3333 5 83.3333 1
25 C025 Accept 6 Accept 5 83.3333 5 83.3333 1
26 C026 Accept 6 Accept 5 83.3333 5 83.3333 1
27 C027 Accept 6 Accept 5 83.3333 5 83.3333 1
attr_task$plot("sample_disagreement")

11. Detection Performance

dr <- attr_task$detection_results$raw_output
detection_summary <- data.frame(
  Metric = c("Sensitivity", "False negative rate", "Specificity", "False positive rate", "Youden index", "LR+", "LR-", "Prevalence"),
  Estimate = c(
    fmt_pct(dr$detection_rate, 2),
    fmt_pct(dr$false_negative_rate, 2),
    fmt_pct(dr$specificity, 2),
    fmt_pct(dr$false_positive_rate, 2),
    fmt_num(dr$youden_index, 4),
    fmt_num(dr$LR_positive, 4),
    fmt_num(dr$LR_negative, 4),
    fmt_pct(dr$prevalence, 2)
  )
)
knitr::kable(detection_summary, align = c("l", "r"), caption = "Detection performance")
Detection performance
Metric Estimate
Sensitivity 85.96%
False negative rate 14.04%
Specificity 90.86%
False positive rate 9.14%
Youden index 0.7683
LR+ 9.4056
LR- 0.1545
Prevalence 38.00%
knitr::kable(dr$confusion_matrix_table, caption = "Detection confusion matrix")
Detection confusion matrix
Reference Test_Positive Test_Negative Total
Reference: Reject 98 16 114
Reference: Accept 17 169 186
knitr::kable(round_df(dr$risk_table, 4), caption = "Classification risk summary")
Classification risk summary
Group N Correct Incorrect Percent_Correct
Positive reference 114 98 16 85.9649
Negative reference 186 169 17 90.8602
All samples 300 267 33 89.0000
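
The pooled detection metrics can also be reproduced directly from the raw ratings as a sanity check (a minimal base-R sketch, not part of the report pipeline); the sensitivity and specificity below should match the pooled estimates in the table above:

cm <- table(Reference = attr_data$Standard, Rating = attr_data$Rating)
sens <- cm["Reject", "Reject"] / sum(cm["Reject", ])  # detected rejects / all reject-standard ratings
spec <- cm["Accept", "Accept"] / sum(cm["Accept", ])  # accepted accepts / all accept-standard ratings
round(c(sensitivity = sens, specificity = spec), 4)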
attr_task$plot("detection_ci")

attr_task$plot("detection_metrics")

12. Quick Entry: Cohen's Kappa

Use Cohen's Kappa when two appraisers, two systems, or two repeated classification streams are compared directly.

trial1 <- attr_data[attr_data$Trial == 1, c("Sample", "Appraiser", "Rating")]
cohen_wide <- reshape(trial1, idvar = "Sample", timevar = "Appraiser", direction = "wide")
names(cohen_wide) <- sub("^Rating\\.", "", names(cohen_wide))

cohen_task <- iqr_cohen_kappa(
  cohen_wide,
  eval1_col = "A",
  eval2_col = "B",
  theme = "academic"
)

cohen_task$summary()
#> 
#> ========== Attribute Agreement Summary ==========
#> 
#> --- Kappa Statistics ---
#>  Statistic          Value           
#>  Method             Cohen's Kappa   
#>  Kappa              0.6552          
#>  SE Kappa           0.1117          
#>  Z                  5.8635          
#>  P(vs > 0)          <0.0001         
#>  95% CI             [0.4362, 0.8742]
#>  Observed agreement 84.00%          
#>  Expected agreement 53.60%          
#>  Interpretation     Good            
#> 
#> --- Response-Level Agreement ---
#>  Response N  Matches Percent_Agreement
#>  Accept   36 28      77.7778          
#>  Reject   22 14      63.6364          
#> =================================================
cohen_task$plot("kappa_benchmark")
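
The same quick entry can compare any other appraiser pair from the wide table built above; for example, A versus C (assuming iqr_cohen_kappa accepts the columns created by the reshape):

cohen_AC <- iqr_cohen_kappa(cohen_wide, eval1_col = "A", eval2_col = "C", theme = "academic")
cohen_AC$summary()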

13. Quick Entry: Fleiss' Kappa Without a Standard

Use Fleiss' Kappa when three or more appraisers classify the same samples and no known standard is available.

fleiss_no_standard <- attr_data[attr_data$Trial == 1, c("Sample", "Appraiser", "Rating")]

fleiss_task <- iqr_fleiss_kappa(
  fleiss_no_standard,
  sample_col = "Sample",
  rater_col = "Appraiser",
  rating_col = "Rating",
  theme = "academic"
)

fleiss_task$summary()
#> 
#> ========== Attribute Agreement Summary ==========
#> 
#> --- Kappa Statistics ---
#>  Statistic          Value           
#>  Method             Fleiss' Kappa   
#>  Kappa              0.7416          
#>  SE Kappa           0.0989          
#>  Z                  7.4950          
#>  P(vs > 0)          <0.0001         
#>  95% CI             [0.5477, 0.9356]
#>  Observed agreement 88.00%          
#>  Expected agreement 53.56%          
#>  Interpretation     Good            
#> 
#> --- Between Appraisers ---
#>  Comparison N  Percent_Agreement Kappa  SE_Kappa Z       P_vs_gt_0
#>  A vs B     50 84                0.6552 0.1117    5.8635 0        
#>  A vs C     50 86                0.7059 0.1031    6.8472 0        
#>  B vs C     50 94                0.8687 0.0735   11.8145 0        
#>  Interpretation
#>  Good          
#>  Good          
#>  Strong        
#> 
#> --- Response-Level Agreement ---
#>  Response Ratings Standard_Count Matches_to_Standard Percent_Match_to_Standard
#>  Accept   95      NA             NA                  NA                       
#>  Reject   55      NA             NA                  NA                       
#> =================================================
fleiss_task$plot("pairwise_agreement")

14. Quick Entry: Detection-Only Study

Use detection-only analysis when the question is binary classification performance against a reference, such as automated vision inspection versus final disposition.

set.seed(20260509)
vision_data <- unique(attr_data[, c("Sample", "Standard")])
vision_data$Vision_System <- vision_data$Standard
reject_idx <- which(vision_data$Standard == "Reject")
accept_idx <- which(vision_data$Standard == "Accept")
vision_data$Vision_System[sample(reject_idx, 3)] <- "Accept"
vision_data$Vision_System[sample(accept_idx, 4)] <- "Reject"

detection_task <- iqr_detection(
  vision_data,
  reference_col = "Standard",
  test_col = "Vision_System",
  positive_category = "Reject",
  negative_category = "Accept",
  theme = "academic"
)

detection_task$summary()
#> 
#> ========== Attribute Agreement Summary ==========
#> 
#> --- Detection Performance ---
#>  Metric              Estimate CI              
#>  Sensitivity         84.21%   [60.42%, 96.62%]
#>  False negative rate 15.79%                   
#>  Specificity         87.10%   [70.17%, 96.37%]
#>  False positive rate 12.90%                   
#>  Youden index        0.7131                   
#>  LR+                 6.5263                   
#>  LR-                 0.1813                   
#>  Prevalence          38.00%                   
#> =================================================
detection_task$plot("detection_ci")

detection_task$plot("detection_metrics")

15. Plot Catalogue

plot_catalogue <- data.frame(
  Plot_Type = c(
    "summary",
    "kappa_benchmark",
    "pairwise_agreement",
    "appraiser_standard",
    "response_agreement",
    "sample_disagreement",
    "detection_ci",
    "detection_metrics"
  ),
  Purpose = c(
    "Combined overview for available Kappa and detection graphics",
    "Overall Kappa value against agreement bands",
    "Between-appraiser pairwise Kappa and percent agreement",
    "Each appraiser compared with the known standard",
    "Response-level match-to-standard rate",
    "Top samples with unstable classification",
    "Sensitivity and specificity confidence intervals",
    "Aligned detection metrics and risk rates"
  )
)
knitr::kable(plot_catalogue, caption = "Attribute agreement plot types")
Attribute agreement plot types
Plot_Type Purpose
summary Combined overview for available Kappa and detection graphics
kappa_benchmark Overall Kappa value against agreement bands
pairwise_agreement Between-appraiser pairwise Kappa and percent agreement
appraiser_standard Each appraiser compared with the known standard
response_agreement Response-level match-to-standard rate
sample_disagreement Top samples with unstable classification
detection_ci Sensitivity and specificity confidence intervals
detection_metrics Aligned detection metrics and risk rates
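
If every available graphic is needed, for example when assembling a review deck, the catalogue can be iterated directly with the attr_task object from the complete study (a minimal sketch):

for (p in plot_catalogue$Plot_Type) {
  attr_task$plot(p)  # wrap in print() if the plot object is returned rather than drawn
}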

16. Coverage Checklist

This vignette exercises the full attribute agreement surface:

  1. Complete multi-appraiser study with standard, repeated trials, overall Kappa, pairwise appraiser agreement, appraiser-versus-standard, within-appraiser repeatability, response agreement, sample disagreement, and detection performance.
  2. Quick-entry Cohen Kappa, Fleiss Kappa without standard, and detection-only workflows.
  3. Plot catalogue and report export for the complete study.

17. Report Output

The report export below uses the same complete attr_task, so the generated HTML and Excel outputs include Kappa, between-appraiser comparisons, appraiser-versus-standard comparisons, within-appraiser repeatability, response-level agreement, and detection performance.

xlsx_path <- tempfile("attribute_agreement_complete_", fileext = ".xlsx")
html_path <- tempfile("attribute_agreement_complete_", fileext = ".html")

attr_task$report("excel", path = xlsx_path)
attr_task$report("html", path = html_path)

data.frame(
  File = c("Excel", "HTML"),
  Exists = c(file.exists(xlsx_path), file.exists(html_path)),
  Size_KB = round(c(file.info(xlsx_path)$size, file.info(html_path)$size) / 1024, 1)
)
#>    File Exists Size_KB
#> 1 Excel   TRUE    30.5
#> 2  HTML   TRUE  1369.2
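
In a real workflow the paths would usually point at a project reports directory rather than tempfile(); when the temporary files are no longer needed they can be removed:

unlink(c(xlsx_path, html_path))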

18. Interpretation Template

  1. Start with the design: number of parts, appraisers, repeated trials, response categories, and whether standards are available.
  2. Report overall Kappa with SE, Z, p-value, confidence interval, observed agreement, and expected agreement.
  3. Report between-appraiser agreement to identify inconsistent appraiser pairs.
  4. Report each appraiser versus standard to identify training or calibration needs.
  5. Report within-appraiser repeatability to detect unstable individual judgement.
  6. Report sensitivity and false negative rate for the positive defect class; this is often the business-critical risk.
