Attribute Agreement Analysis: Appraiser Agreement and Detection Performance

iQualityR MSA Team

2026-05-02

1. Business Context

This example simulates a visual inspection study for injection-molded connector housings. The quality team wants to validate whether three inspectors can consistently classify each part as Accept or Reject, and whether those classifications agree with the known engineering standard.

The study is deliberately built as a complete attribute agreement study. The required output is not only an overall Kappa number: a usable report must include between-appraiser agreement, each appraiser versus the standard, within-appraiser repeatability, response-level agreement, detection performance, and the accompanying plots.

2. Method

Kappa adjusts observed agreement for chance agreement:

\[ \kappa = \frac{P_o - P_e}{1 - P_e} \]

Here \(P_o\) is the observed proportion of agreement and \(P_e\) is the proportion of agreement expected by chance. For measurement-system release decisions, Kappa should be read together with the false negative risk: high overall agreement can still be unacceptable if defective parts are frequently classified as acceptable.
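
As a quick numeric illustration of the formula (a minimal base-R sketch with hypothetical counts, not part of the iQualityR API), Kappa can be computed by hand from a two-rater agreement table:

tab <- matrix(c(40, 5, 3, 52), nrow = 2,
              dimnames = list(RaterA = c("Accept", "Reject"),
                              RaterB = c("Accept", "Reject")))
n  <- sum(tab)
Po <- sum(diag(tab)) / n                      # observed agreement
Pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
round(c(Po = Po, Pe = Pe, kappa = (Po - Pe) / (1 - Pe)), 4)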

3. Complete Study Data

set.seed(20260508)

n_parts <- 50
parts <- sprintf("C%03d", seq_len(n_parts))
latent_score <- sort(rnorm(n_parts, mean = 0, sd = 1.1))
standard <- ifelse(latent_score > 0.18, "Reject", "Accept")

appraisers <- c("A", "B", "C")
trials <- 1:2

appraiser_bias <- c(A = -0.03, B = 0.10, C = -0.08)
appraiser_sd <- c(A = 0.34, B = 0.38, C = 0.42)

attr_data <- do.call(rbind, lapply(appraisers, function(app) {
  do.call(rbind, lapply(trials, function(trial) {
    observed_score <- latent_score + appraiser_bias[[app]] + rnorm(n_parts, 0, appraiser_sd[[app]])
    rating <- ifelse(observed_score > 0.18, "Reject", "Accept")
    data.frame(
      Sample = parts,
      Appraiser = app,
      Trial = trial,
      Standard = standard,
      Rating = rating,
      stringsAsFactors = FALSE
    )
  }))
}))

knitr::kable(head(attr_data, 12), caption = "Complete attribute agreement data")
Complete attribute agreement data
Sample Appraiser Trial Standard Rating
C001 A 1 Accept Accept
C002 A 1 Accept Accept
C003 A 1 Accept Accept
C004 A 1 Accept Accept
C005 A 1 Accept Accept
C006 A 1 Accept Accept
C007 A 1 Accept Accept
C008 A 1 Accept Accept
C009 A 1 Accept Accept
C010 A 1 Accept Accept
C011 A 1 Accept Accept
C012 A 1 Accept Accept
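
Before running the workflow, it can help to confirm that the simulated standard contains both classes in reasonable proportions; a minimal check on the standard vector defined above:

table(standard)
prop.table(table(standard))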

4. Complete Attribute Agreement Workflow

plan <- AttrGagePlan$new(
  plan_name = "Connector_Housing_Visual_Inspection",
  objectives = "Evaluate appraiser agreement and defect detection risk",
  appraisers = appraisers,
  samples = parts,
  standards = standard,
  categories = c("Accept", "Reject"),
  trials = 2,
  comparison_mode = "one_way",
  kappa_method = "fleiss",
  conf_level = 0.95
)

plan$set_meta(
  "data",
  sample_col = "Sample",
  rater_col = "Appraiser",
  rating_col = "Rating",
  standard_col = "Standard",
  trial_col = "Trial",
  reference_col = "Standard",
  test_col = "Rating",
  positive_category = "Reject",
  negative_category = "Accept"
)

attr_task <- AttrGageTask$new(
  data = attr_data,
  plan = plan,
  theme = "academic",
  mode = "all"
)
attr_task$compute()
attr_task$summary()
#> 
#> ========== Attribute Agreement Summary ==========
#> 
#> --- Kappa Statistics ---
#>  Statistic          Value           
#>  Method             Fleiss' Kappa   
#>  Kappa              0.7039          
#>  SE Kappa           0.1038          
#>  Z                  6.7815          
#>  P(vs > 0)          <0.0001         
#>  95% CI             [0.5004, 0.9073]
#>  Observed agreement 86.00%          
#>  Expected agreement 52.72%          
#>  Interpretation     Good            
#> 
#> --- Between Appraisers ---
#>  Comparison N   Percent_Agreement Kappa  SE_Kappa Z       P_vs_gt_0
#>  A vs B     100 84                0.6637 0.0771    8.6142 0        
#>  A vs C     100 86                0.7030 0.0736    9.5508 0        
#>  B vs C     100 90                0.7879 0.0636   12.3800 0        
#>  Interpretation
#>  Good          
#>  Good          
#>  Strong        
#> 
#> --- Each Appraiser vs Standard ---
#>  Comparison                 N   Percent_Agreement Kappa  SE_Kappa Z      
#>  A vs Standard              100 91                0.8100 0.0604   13.4041
#>  B vs Standard              100 87                0.7255 0.0710   10.2170
#>  C vs Standard              100 89                0.7654 0.0667   11.4673
#>  All Appraisers vs Standard 300 89                0.7669 0.0383   20.0391
#>  P_vs_gt_0 Interpretation
#>  0         Strong        
#>  0         Good          
#>  0         Strong        
#>  0         Strong        
#> 
#> --- Within Appraiser Repeatability ---
#>  Comparison     N  Percent_Agreement Kappa  SE_Kappa Z      P_vs_gt_0
#>  A Trial 1 vs 2 50 90                0.7899 0.0891   8.8624 0        
#>  B Trial 1 vs 2 50 78                0.5470 0.1206   4.5337 0        
#>  C Trial 1 vs 2 50 82                0.6141 0.1165   5.2713 0        
#>  Interpretation
#>  Strong        
#>  Moderate      
#>  Good          
#> 
#> --- Response-Level Agreement ---
#>  Response Ratings Standard_Count Matches_to_Standard Percent_Match_to_Standard
#>  Accept   185     186            169                 90.8602                  
#>  Reject   115     114             98                 85.9649                  
#> 
#> --- Detection Performance ---
#>  Metric              Estimate CI              
#>  Sensitivity         85.96%   [78.21%, 91.76%]
#>  False negative rate 14.04%                   
#>  Specificity         90.86%   [85.77%, 94.59%]
#>  False positive rate 9.14%                    
#>  Youden index        0.7683                   
#>  LR+                 9.4056                   
#>  LR-                 0.1545                   
#>  Prevalence          38.00%                   
#> =================================================

5. Overall Kappa
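
The table-building code in this and the following sections uses small formatting helpers (fmt_num, fmt_p, fmt_ci, fmt_pct, fmt_chr, round_df). These are assumed to be defined in the vignette's setup chunk or provided by iQualityR; if not, minimal local equivalents along these lines would be enough to reproduce the tables:

fmt_num <- function(x, digits = 4) formatC(x, digits = digits, format = "f")
fmt_pct <- function(x, digits = 2) {
  # assumes x is a proportion in [0, 1]
  paste0(formatC(100 * x, digits = digits, format = "f"), "%")
}
fmt_p   <- function(p, digits = 4) ifelse(p < 1e-4, "<0.0001", formatC(p, digits = digits, format = "f"))
fmt_ci  <- function(ci, digits = 4) sprintf("[%s, %s]",
                                            formatC(ci[1], digits = digits, format = "f"),
                                            formatC(ci[2], digits = digits, format = "f"))
fmt_chr <- function(x) as.character(x)
round_df <- function(df, digits = 4) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], round, digits = digits)
  df
}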

kr <- attr_task$kappa_results$raw_output
kappa_summary <- data.frame(
  Statistic = c("Method", "Kappa", "SE Kappa", "Z", "P(vs > 0)", "95% CI", "Observed agreement", "Expected agreement", "Interpretation"),
  Value = c(
    kr$method,
    fmt_num(kr$kappa, 4),
    fmt_num(kr$se, 4),
    fmt_num(kr$z, 4),
    fmt_p(kr$p_value, 4),
    fmt_ci(kr$ci, 4),
    fmt_pct(kr$Po, 2),
    fmt_pct(kr$Pe, 2),
    fmt_chr(kr$interpretation)
  )
)
knitr::kable(kappa_summary, align = c("l", "r"), caption = "Overall Kappa statistics")
Overall Kappa statistics
Statistic Value
Method Fleiss’ Kappa
Kappa 0.7039
SE Kappa 0.1038
Z 6.7815
P(vs > 0) <0.0001
95% CI [0.5004, 0.9073]
Observed agreement 86.00%
Expected agreement 52.72%
Interpretation Good
attr_task$plot("kappa_benchmark")

6. Between Appraisers

knitr::kable(
  round_df(kr$pairwise_appraisers, 4),
  align = c("l", "r", "r", "r", "r", "r", "r", "l"),
  caption = "Pairwise appraiser agreement"
)
Pairwise appraiser agreement
Comparison N Percent_Agreement Kappa SE_Kappa Z P_vs_gt_0 Interpretation
A vs B 100 84 0.6637 0.0771 8.6142 0 Good
A vs C 100 86 0.7030 0.0736 9.5508 0 Good
B vs C 100 90 0.7879 0.0636 12.3800 0 Strong
attr_task$plot("pairwise_agreement")

7. Each Appraiser vs Standard

knitr::kable(
  round_df(kr$appraiser_vs_standard, 4),
  align = c("l", "r", "r", "r", "r", "r", "r", "l"),
  caption = "Each appraiser compared with the known standard"
)
Each appraiser compared with the known standard
Comparison N Percent_Agreement Kappa SE_Kappa Z P_vs_gt_0 Interpretation
A vs Standard 100 91 0.8100 0.0604 13.4041 0 Strong
B vs Standard 100 87 0.7255 0.0710 10.2170 0 Good
C vs Standard 100 89 0.7654 0.0667 11.4673 0 Strong
All Appraisers vs Standard 300 89 0.7669 0.0383 20.0391 0 Strong
attr_task$plot("appraiser_standard")

8. Within Appraiser Repeatability

knitr::kable(
  round_df(kr$within_appraiser, 4),
  align = c("l", "r", "r", "r", "r", "r", "r", "l"),
  caption = "Within-appraiser repeatability across repeated trials"
)
Within-appraiser repeatability across repeated trials
Comparison N Percent_Agreement Kappa SE_Kappa Z P_vs_gt_0 Interpretation
A Trial 1 vs 2 50 90 0.7899 0.0891 8.8624 0 Strong
B Trial 1 vs 2 50 78 0.5470 0.1206 4.5337 0 Moderate
C Trial 1 vs 2 50 82 0.6141 0.1165 5.2713 0 Good

9. Response-Level Agreement

knitr::kable(
  round_df(kr$response_table, 4),
  align = c("l", "r", "r", "r", "r"),
  caption = "Response-level agreement summary"
)
Response-level agreement summary
Response Ratings Standard_Count Matches_to_Standard Percent_Match_to_Standard
Accept 185 186 169 90.8602
Reject 115 114 98 85.9649
attr_task$plot("response_agreement")

10. Sample-Level Disagreement Diagnostics

When many samples are present, the report does not use a full sample-by-rating matrix as the main graphic: with 50 parts, that heatmap would compress too many sample labels onto the y-axis. The practical diagnostic is to focus on the samples with the highest disagreement, as in the cross-check and table below.
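
As a cross-check on the package table, the number of ratings per sample that miss the standard can be computed directly from attr_data (a minimal base-R sketch, independent of the report pipeline):

mismatch <- with(attr_data, tapply(Rating != Standard, Sample, sum))
head(sort(mismatch, decreasing = TRUE), 6)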

knitr::kable(
  round_df(head(kr$sample_disagreement, 12), 4),
  align = c("l", "l", "r", "l", "r", "r", "r", "r", "r"),
  caption = "Top samples by disagreement"
)
Top samples by disagreement
Sample Standard N_Ratings Most_Common_Rating Modal_Count Percent_Modal Standard_Matches Percent_Match_to_Standard Discordant_Ratings
32 C032 Reject 6 Accept 3 50.0000 3 50.0000 3
28 C028 Accept 6 Reject 4 66.6667 2 33.3333 2
31 C031 Accept 6 Reject 4 66.6667 2 33.3333 2
33 C033 Reject 6 Accept 4 66.6667 2 33.3333 2
34 C034 Reject 6 Accept 4 66.6667 2 33.3333 2
24 C024 Accept 6 Accept 4 66.6667 4 66.6667 2
30 C030 Accept 6 Accept 4 66.6667 4 66.6667 2
36 C036 Reject 6 Reject 4 66.6667 4 66.6667 2
23 C023 Accept 6 Accept 5 83.3333 5 83.3333 1
25 C025 Accept 6 Accept 5 83.3333 5 83.3333 1
26 C026 Accept 6 Accept 5 83.3333 5 83.3333 1
27 C027 Accept 6 Accept 5 83.3333 5 83.3333 1
attr_task$plot("sample_disagreement")

11. Detection Performance

dr <- attr_task$detection_results$raw_output
detection_summary <- data.frame(
  Metric = c("Sensitivity", "False negative rate", "Specificity", "False positive rate", "Youden index", "LR+", "LR-", "Prevalence"),
  Estimate = c(
    fmt_pct(dr$detection_rate, 2),
    fmt_pct(dr$false_negative_rate, 2),
    fmt_pct(dr$specificity, 2),
    fmt_pct(dr$false_positive_rate, 2),
    fmt_num(dr$youden_index, 4),
    fmt_num(dr$LR_positive, 4),
    fmt_num(dr$LR_negative, 4),
    fmt_pct(dr$prevalence, 2)
  )
)
knitr::kable(detection_summary, align = c("l", "r"), caption = "Detection performance")
Detection performance
Metric Estimate
Sensitivity 85.96%
False negative rate 14.04%
Specificity 90.86%
False positive rate 9.14%
Youden index 0.7683
LR+ 9.4056
LR- 0.1545
Prevalence 38.00%
knitr::kable(dr$confusion_matrix_table, caption = "Detection confusion matrix")
Detection confusion matrix
Reference Test_Positive Test_Negative Total
Reference: Reject 98 16 114
Reference: Accept 17 169 186
knitr::kable(round_df(dr$risk_table, 4), caption = "Classification risk summary")
Classification risk summary
Group N Correct Incorrect Percent_Correct
Positive reference 114 98 16 85.9649
Negative reference 186 169 17 90.8602
All samples 300 267 33 89.0000
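
The pooled detection metrics can also be reproduced directly from the raw ratings as a sanity check (a minimal base-R sketch, not part of the report pipeline); the sensitivity and specificity below should match the pooled estimates in the table above:

cm <- table(Reference = attr_data$Standard, Rating = attr_data$Rating)
sens <- cm["Reject", "Reject"] / sum(cm["Reject", ])  # detected rejects / all reject-standard ratings
spec <- cm["Accept", "Accept"] / sum(cm["Accept", ])  # accepted accepts / all accept-standard ratings
round(c(sensitivity = sens, specificity = spec), 4)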
attr_task$plot("detection_ci")

attr_task$plot("detection_metrics")

12. Quick Entry: Cohen's Kappa

Use Cohen's Kappa when two appraisers, two systems, or two repeated classification streams are compared directly.

trial1 <- attr_data[attr_data$Trial == 1, c("Sample", "Appraiser", "Rating")]
cohen_wide <- reshape(trial1, idvar = "Sample", timevar = "Appraiser", direction = "wide")
names(cohen_wide) <- sub("^Rating\\.", "", names(cohen_wide))

cohen_task <- iqr_cohen_kappa(
  cohen_wide,
  eval1_col = "A",
  eval2_col = "B",
  theme = "academic"
)

cohen_task$summary()
#> 
#> ========== Attribute Agreement Summary ==========
#> 
#> --- Kappa Statistics ---
#>  Statistic          Value           
#>  Method             Cohen's Kappa   
#>  Kappa              0.6552          
#>  SE Kappa           0.1117          
#>  Z                  5.8635          
#>  P(vs > 0)          <0.0001         
#>  95% CI             [0.4362, 0.8742]
#>  Observed agreement 84.00%          
#>  Expected agreement 53.60%          
#>  Interpretation     Good            
#> 
#> --- Response-Level Agreement ---
#>  Response N  Matches Percent_Agreement
#>  Accept   36 28      77.7778          
#>  Reject   22 14      63.6364          
#> =================================================
cohen_task$plot("kappa_benchmark")
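
The same quick entry can compare any other appraiser pair from the wide table built above; for example, A versus C (assuming iqr_cohen_kappa accepts the columns created by the reshape):

cohen_AC <- iqr_cohen_kappa(cohen_wide, eval1_col = "A", eval2_col = "C", theme = "academic")
cohen_AC$summary()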

13. Quick Entry: Fleiss' Kappa Without a Standard

Use Fleiss' Kappa when three or more appraisers classify the same samples and no known standard is available.

fleiss_no_standard <- attr_data[attr_data$Trial == 1, c("Sample", "Appraiser", "Rating")]

fleiss_task <- iqr_fleiss_kappa(
  fleiss_no_standard,
  sample_col = "Sample",
  rater_col = "Appraiser",
  rating_col = "Rating",
  theme = "academic"
)

fleiss_task$summary()
#> 
#> ========== Attribute Agreement Summary ==========
#> 
#> --- Kappa Statistics ---
#>  Statistic          Value           
#>  Method             Fleiss' Kappa   
#>  Kappa              0.7416          
#>  SE Kappa           0.0989          
#>  Z                  7.4950          
#>  P(vs > 0)          <0.0001         
#>  95% CI             [0.5477, 0.9356]
#>  Observed agreement 88.00%          
#>  Expected agreement 53.56%          
#>  Interpretation     Good            
#> 
#> --- Between Appraisers ---
#>  Comparison N  Percent_Agreement Kappa  SE_Kappa Z       P_vs_gt_0
#>  A vs B     50 84                0.6552 0.1117    5.8635 0        
#>  A vs C     50 86                0.7059 0.1031    6.8472 0        
#>  B vs C     50 94                0.8687 0.0735   11.8145 0        
#>  Interpretation
#>  Good          
#>  Good          
#>  Strong        
#> 
#> --- Response-Level Agreement ---
#>  Response Ratings Standard_Count Matches_to_Standard Percent_Match_to_Standard
#>  Accept   95      NA             NA                  NA                       
#>  Reject   55      NA             NA                  NA                       
#> =================================================
fleiss_task$plot("pairwise_agreement")

14. Quick Entry: Detection-Only Study

Use detection-only analysis when the question is binary classification performance against a reference, such as automated vision inspection versus final disposition.

set.seed(20260509)
vision_data <- unique(attr_data[, c("Sample", "Standard")])
vision_data$Vision_System <- vision_data$Standard
reject_idx <- which(vision_data$Standard == "Reject")
accept_idx <- which(vision_data$Standard == "Accept")
vision_data$Vision_System[sample(reject_idx, 3)] <- "Accept"
vision_data$Vision_System[sample(accept_idx, 4)] <- "Reject"

detection_task <- iqr_detection(
  vision_data,
  reference_col = "Standard",
  test_col = "Vision_System",
  positive_category = "Reject",
  negative_category = "Accept",
  theme = "academic"
)

detection_task$summary()
#> 
#> ========== Attribute Agreement Summary ==========
#> 
#> --- Detection Performance ---
#>  Metric              Estimate CI              
#>  Sensitivity         84.21%   [60.42%, 96.62%]
#>  False negative rate 15.79%                   
#>  Specificity         87.10%   [70.17%, 96.37%]
#>  False positive rate 12.90%                   
#>  Youden index        0.7131                   
#>  LR+                 6.5263                   
#>  LR-                 0.1813                   
#>  Prevalence          38.00%                   
#> =================================================
detection_task$plot("detection_ci")

detection_task$plot("detection_metrics")

15. Plot Catalogue

plot_catalogue <- data.frame(
  Plot_Type = c(
    "summary",
    "kappa_benchmark",
    "pairwise_agreement",
    "appraiser_standard",
    "response_agreement",
    "sample_disagreement",
    "detection_ci",
    "detection_metrics"
  ),
  Purpose = c(
    "Combined overview for available Kappa and detection graphics",
    "Overall Kappa value against agreement bands",
    "Between-appraiser pairwise Kappa and percent agreement",
    "Each appraiser compared with the known standard",
    "Response-level match-to-standard rate",
    "Top samples with unstable classification",
    "Sensitivity and specificity confidence intervals",
    "Aligned detection metrics and risk rates"
  )
)
knitr::kable(plot_catalogue, caption = "Attribute agreement plot types")
Attribute agreement plot types
Plot_Type Purpose
summary Combined overview for available Kappa and detection graphics
kappa_benchmark Overall Kappa value against agreement bands
pairwise_agreement Between-appraiser pairwise Kappa and percent agreement
appraiser_standard Each appraiser compared with the known standard
response_agreement Response-level match-to-standard rate
sample_disagreement Top samples with unstable classification
detection_ci Sensitivity and specificity confidence intervals
detection_metrics Aligned detection metrics and risk rates
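
If every available graphic is needed, for example when assembling a review deck, the catalogue can be iterated directly with the attr_task object from the complete study (a minimal sketch):

for (p in plot_catalogue$Plot_Type) {
  attr_task$plot(p)  # wrap in print() if the plot object is returned rather than drawn
}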

16. Coverage Checklist

This vignette exercises the full attribute agreement surface:

  1. Complete multi-appraiser study with standard, repeated trials, overall Kappa, pairwise appraiser agreement, appraiser-versus-standard, within-appraiser repeatability, response agreement, sample disagreement, and detection performance.
  2. Quick-entry Cohen Kappa, Fleiss Kappa without standard, and detection-only workflows.
  3. Plot catalogue and report export for the complete study.

17. Report Output

The report export below uses the same complete attr_task, so the generated HTML and Excel outputs include Kappa, between-appraiser comparisons, appraiser-versus-standard comparisons, within-appraiser repeatability, response-level agreement, and detection performance.

xlsx_path <- tempfile("attribute_agreement_complete_", fileext = ".xlsx")
html_path <- tempfile("attribute_agreement_complete_", fileext = ".html")

attr_task$report("excel", path = xlsx_path)
attr_task$report("html", path = html_path)

data.frame(
  File = c("Excel", "HTML"),
  Exists = c(file.exists(xlsx_path), file.exists(html_path)),
  Size_KB = round(c(file.info(xlsx_path)$size, file.info(html_path)$size) / 1024, 1)
)
#>    File Exists Size_KB
#> 1 Excel   TRUE    30.5
#> 2  HTML   TRUE  1369.2
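
In a real workflow the paths would usually point at a project reports directory rather than tempfile(); when the temporary files are no longer needed they can be removed:

unlink(c(xlsx_path, html_path))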

18. Interpretation Template

  1. Start with the design: number of parts, appraisers, repeated trials, response categories, and whether standards are available.
  2. Report overall Kappa with SE, Z, p-value, confidence interval, observed agreement, and expected agreement.
  3. Report between-appraiser agreement to identify inconsistent appraiser pairs.
  4. Report each appraiser versus standard to identify training or calibration needs.
  5. Report within-appraiser repeatability to detect unstable individual judgement.
  6. Report sensitivity and false negative rate for the positive defect class; this is often the business-critical risk.
