This example simulates a visual inspection study for injection-molded
connector housings. The quality team wants to validate whether three
inspectors can consistently classify each part as Accept or
Reject, and whether those classifications agree with the
known engineering standard.
The study is deliberately built as a complete attribute agreement study: every part carries a known reference classification of Accept or Reject, so the same data can also evaluate detection risk.

The required output is not only an overall Kappa number. A usable report must include between-appraiser agreement, each appraiser versus the standard, within-appraiser repeatability, response-level agreement, detection performance, and the related plots.
Kappa adjusts observed agreement for chance agreement, where \(P_o\) is the observed proportion of agreement and \(P_e\) is the proportion of agreement expected by chance:
\[ \kappa = \frac{P_o - P_e}{1 - P_e} \]
For measurement-system release decisions, Kappa should be read together with false negative risk. A high overall agreement can still be unacceptable if defect parts are frequently classified as acceptable.
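To make the formula concrete, the short sketch below computes \(P_o\), \(P_e\), and Kappa by hand for an illustrative two-rater table; the counts are invented for demonstration and are not taken from the study data.

# Hand computation of Kappa for an illustrative 2x2 agreement table
# (rows = rater 1, columns = rater 2; counts are made up for demonstration)
tab <- matrix(c(30, 5,
                4, 11),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Accept", "Reject"), c("Accept", "Reject")))
n   <- sum(tab)
p_o <- sum(diag(tab)) / n                        # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2    # agreement expected by chance
(p_o - p_e) / (1 - p_e)                          # Kappa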
# Simulate a balanced attribute agreement study: 50 parts, 3 appraisers, 2 trials
set.seed(20260508)
n_parts <- 50
parts <- sprintf("C%03d", seq_len(n_parts))

# Latent quality score per part; parts above the cutoff are true Rejects
latent_score <- sort(rnorm(n_parts, mean = 0, sd = 1.1))
standard <- ifelse(latent_score > 0.18, "Reject", "Accept")

appraisers <- c("A", "B", "C")
trials <- 1:2

# Appraiser-specific bias and noise drive the disagreement structure
appraiser_bias <- c(A = -0.03, B = 0.10, C = -0.08)
appraiser_sd <- c(A = 0.34, B = 0.38, C = 0.42)

attr_data <- do.call(rbind, lapply(appraisers, function(app) {
  do.call(rbind, lapply(trials, function(trial) {
    observed_score <- latent_score + appraiser_bias[[app]] +
      rnorm(n_parts, 0, appraiser_sd[[app]])
    rating <- ifelse(observed_score > 0.18, "Reject", "Accept")
    data.frame(
      Sample = parts,
      Appraiser = app,
      Trial = trial,
      Standard = standard,
      Rating = rating,
      stringsAsFactors = FALSE
    )
  }))
}))
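Before fitting anything, it is worth confirming the mix of Accept and Reject parts in the simulated standard; with this seed the Reject share should line up with the 38% prevalence reported in the detection results later on.

# Distribution of the known standard across the 50 simulated parts
table(standard)
prop.table(table(standard))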
knitr::kable(head(attr_data, 12), caption = "Complete attribute agreement data")

| Sample | Appraiser | Trial | Standard | Rating |
|---|---|---|---|---|
| C001 | A | 1 | Accept | Accept |
| C002 | A | 1 | Accept | Accept |
| C003 | A | 1 | Accept | Accept |
| C004 | A | 1 | Accept | Accept |
| C005 | A | 1 | Accept | Accept |
| C006 | A | 1 | Accept | Accept |
| C007 | A | 1 | Accept | Accept |
| C008 | A | 1 | Accept | Accept |
| C009 | A | 1 | Accept | Accept |
| C010 | A | 1 | Accept | Accept |
| C011 | A | 1 | Accept | Accept |
| C012 | A | 1 | Accept | Accept |
# Define the study plan: appraisers, samples, known standards, and analysis options
plan <- AttrGagePlan$new(
  plan_name = "Connector_Housing_Visual_Inspection",
  objectives = "Evaluate appraiser agreement and defect detection risk",
  appraisers = appraisers,
  samples = parts,
  standards = standard,
  categories = c("Accept", "Reject"),
  trials = 2,
  comparison_mode = "one_way",
  kappa_method = "fleiss",
  conf_level = 0.95
)

# Map the data-frame columns to the roles the analysis expects
plan$set_meta(
  "data",
  sample_col = "Sample",
  rater_col = "Appraiser",
  rating_col = "Rating",
  standard_col = "Standard",
  trial_col = "Trial",
  reference_col = "Standard",
  test_col = "Rating",
  positive_category = "Reject",
  negative_category = "Accept"
)

# Build the task with every analysis mode enabled
attr_task <- AttrGageTask$new(
  data = attr_data,
  plan = plan,
  theme = "academic",
  mode = "all"
)
attr_task$compute()
attr_task$summary()
#>
#> ========== Attribute Agreement Summary ==========
#>
#> --- Kappa Statistics ---
#> Statistic Value
#> Method Fleiss' Kappa
#> Kappa 0.7039
#> SE Kappa 0.1038
#> Z 6.7815
#> P(vs > 0) <0.0001
#> 95% CI [0.5004, 0.9073]
#> Observed agreement 86.00%
#> Expected agreement 52.72%
#> Interpretation Good
#>
#> --- Between Appraisers ---
#> Comparison N Percent_Agreement Kappa SE_Kappa Z P_vs_gt_0
#> A vs B 100 84 0.6637 0.0771 8.6142 0
#> A vs C 100 86 0.7030 0.0736 9.5508 0
#> B vs C 100 90 0.7879 0.0636 12.3800 0
#> Interpretation
#> Good
#> Good
#> Strong
#>
#> --- Each Appraiser vs Standard ---
#> Comparison N Percent_Agreement Kappa SE_Kappa Z
#> A vs Standard 100 91 0.8100 0.0604 13.4041
#> B vs Standard 100 87 0.7255 0.0710 10.2170
#> C vs Standard 100 89 0.7654 0.0667 11.4673
#> All Appraisers vs Standard 300 89 0.7669 0.0383 20.0391
#> P_vs_gt_0 Interpretation
#> 0 Strong
#> 0 Good
#> 0 Strong
#> 0 Strong
#>
#> --- Within Appraiser Repeatability ---
#> Comparison N Percent_Agreement Kappa SE_Kappa Z P_vs_gt_0
#> A Trial 1 vs 2 50 90 0.7899 0.0891 8.8624 0
#> B Trial 1 vs 2 50 78 0.5470 0.1206 4.5337 0
#> C Trial 1 vs 2 50 82 0.6141 0.1165 5.2713 0
#> Interpretation
#> Strong
#> Moderate
#> Good
#>
#> --- Response-Level Agreement ---
#> Response Ratings Standard_Count Matches_to_Standard Percent_Match_to_Standard
#> Accept 185 186 169 90.8602
#> Reject 115 114 98 85.9649
#>
#> --- Detection Performance ---
#> Metric Estimate CI
#> Sensitivity 85.96% [78.21%, 91.76%]
#> False negative rate 14.04%
#> Specificity 90.86% [85.77%, 94.59%]
#> False positive rate 9.14%
#> Youden index 0.7683
#> LR+ 9.4056
#> LR- 0.1545
#> Prevalence 38.00%
#> =================================================

kr <- attr_task$kappa_results$raw_output
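# fmt_num(), fmt_p(), fmt_ci(), fmt_pct(), fmt_chr(), and round_df() are assumed to be
# small formatting helpers defined earlier in the vignette. If this section is run on
# its own, hypothetical stand-ins (not the package's own definitions) could look like:
#   fmt_num  <- function(x, digits) formatC(x, digits = digits, format = "f")
#   fmt_pct  <- function(x, digits) paste0(formatC(100 * x, digits = digits, format = "f"), "%")
#   fmt_p    <- function(x, digits) ifelse(x < 10^(-digits), sprintf("<%.*f", digits, 10^(-digits)), fmt_num(x, digits))
#   fmt_ci   <- function(ci, digits) sprintf("[%s, %s]", fmt_num(ci[1], digits), fmt_num(ci[2], digits))
#   fmt_chr  <- function(x) as.character(x)
#   round_df <- function(df, digits) { num <- vapply(df, is.numeric, logical(1)); df[num] <- lapply(df[num], round, digits = digits); df }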
kappa_summary <- data.frame(
  Statistic = c("Method", "Kappa", "SE Kappa", "Z", "P(vs > 0)", "95% CI",
                "Observed agreement", "Expected agreement", "Interpretation"),
  Value = c(
    kr$method,
    fmt_num(kr$kappa, 4),
    fmt_num(kr$se, 4),
    fmt_num(kr$z, 4),
    fmt_p(kr$p_value, 4),
    fmt_ci(kr$ci, 4),
    fmt_pct(kr$Po, 2),
    fmt_pct(kr$Pe, 2),
    fmt_chr(kr$interpretation)
  )
)
knitr::kable(kappa_summary, align = c("l", "r"), caption = "Overall Kappa statistics")

| Statistic | Value |
|---|---|
| Method | Fleiss’ Kappa |
| Kappa | 0.7039 |
| SE Kappa | 0.1038 |
| Z | 6.7815 |
| P(vs > 0) | <0.0001 |
| 95% CI | [0.5004, 0.9073] |
| Observed agreement | 86.00% |
| Expected agreement | 52.72% |
| Interpretation | Good |
knitr::kable(
  round_df(kr$pairwise_appraisers, 4),
  align = c("l", "r", "r", "r", "r", "r", "r", "l"),
  caption = "Pairwise appraiser agreement"
)

| Comparison | N | Percent_Agreement | Kappa | SE_Kappa | Z | P_vs_gt_0 | Interpretation |
|---|---|---|---|---|---|---|---|
| A vs B | 100 | 84 | 0.6637 | 0.0771 | 8.6142 | 0 | Good |
| A vs C | 100 | 86 | 0.7030 | 0.0736 | 9.5508 | 0 | Good |
| B vs C | 100 | 90 | 0.7879 | 0.0636 | 12.3800 | 0 | Strong |
knitr::kable(
  round_df(kr$appraiser_vs_standard, 4),
  align = c("l", "r", "r", "r", "r", "r", "r", "l"),
  caption = "Each appraiser compared with the known standard"
)

| Comparison | N | Percent_Agreement | Kappa | SE_Kappa | Z | P_vs_gt_0 | Interpretation |
|---|---|---|---|---|---|---|---|
| A vs Standard | 100 | 91 | 0.8100 | 0.0604 | 13.4041 | 0 | Strong |
| B vs Standard | 100 | 87 | 0.7255 | 0.0710 | 10.2170 | 0 | Good |
| C vs Standard | 100 | 89 | 0.7654 | 0.0667 | 11.4673 | 0 | Strong |
| All Appraisers vs Standard | 300 | 89 | 0.7669 | 0.0383 | 20.0391 | 0 | Strong |
knitr::kable(
  round_df(kr$within_appraiser, 4),
  align = c("l", "r", "r", "r", "r", "r", "r", "l"),
  caption = "Within-appraiser repeatability across repeated trials"
)

| Comparison | N | Percent_Agreement | Kappa | SE_Kappa | Z | P_vs_gt_0 | Interpretation |
|---|---|---|---|---|---|---|---|
| A Trial 1 vs 2 | 50 | 90 | 0.7899 | 0.0891 | 8.8624 | 0 | Strong |
| B Trial 1 vs 2 | 50 | 78 | 0.5470 | 0.1206 | 4.5337 | 0 | Moderate |
| C Trial 1 vs 2 | 50 | 82 | 0.6141 | 0.1165 | 5.2713 | 0 | Good |
knitr::kable(
  round_df(kr$response_table, 4),
  align = c("l", "r", "r", "r", "r"),
  caption = "Response-level agreement summary"
)

| Response | Ratings | Standard_Count | Matches_to_Standard | Percent_Match_to_Standard |
|---|---|---|---|---|
| Accept | 185 | 186 | 169 | 90.8602 |
| Reject | 115 | 114 | 98 | 85.9649 |
The report does not use a full sample-by-response rating matrix as its main graphic when many samples are present: with 50 parts, such a heatmap compresses too many sample labels onto the y-axis. The practical diagnostic is to focus on the samples with the highest disagreement.
knitr::kable(
  round_df(head(kr$sample_disagreement, 12), 4),
  align = c("l", "l", "r", "l", "r", "r", "r", "r", "r"),
  caption = "Top samples by disagreement"
)

|    | Sample | Standard | N_Ratings | Most_Common_Rating | Modal_Count | Percent_Modal | Standard_Matches | Percent_Match_to_Standard | Discordant_Ratings |
|---|---|---|---|---|---|---|---|---|---|
| 32 | C032 | Reject | 6 | Accept | 3 | 50.0000 | 3 | 50.0000 | 3 |
| 28 | C028 | Accept | 6 | Reject | 4 | 66.6667 | 2 | 33.3333 | 2 |
| 31 | C031 | Accept | 6 | Reject | 4 | 66.6667 | 2 | 33.3333 | 2 |
| 33 | C033 | Reject | 6 | Accept | 4 | 66.6667 | 2 | 33.3333 | 2 |
| 34 | C034 | Reject | 6 | Accept | 4 | 66.6667 | 2 | 33.3333 | 2 |
| 24 | C024 | Accept | 6 | Accept | 4 | 66.6667 | 4 | 66.6667 | 2 |
| 30 | C030 | Accept | 6 | Accept | 4 | 66.6667 | 4 | 66.6667 | 2 |
| 36 | C036 | Reject | 6 | Reject | 4 | 66.6667 | 4 | 66.6667 | 2 |
| 23 | C023 | Accept | 6 | Accept | 5 | 83.3333 | 5 | 83.3333 | 1 |
| 25 | C025 | Accept | 6 | Accept | 5 | 83.3333 | 5 | 83.3333 | 1 |
| 26 | C026 | Accept | 6 | Accept | 5 | 83.3333 | 5 | 83.3333 | 1 |
| 27 | C027 | Accept | 6 | Accept | 5 | 83.3333 | 5 | 83.3333 | 1 |
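The same ranking can be approximated directly from the raw data. The base-R sketch below counts, for each sample, how many of its six ratings disagree with the standard; the task's own table may order tied samples differently.

# Per-sample count of ratings that disagree with the standard (base-R approximation)
disagree <- aggregate(
  cbind(Discordant = Rating != Standard) ~ Sample,
  data = attr_data, FUN = sum
)
head(disagree[order(-disagree$Discordant), ], 12)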
dr <- attr_task$detection_results$raw_output
detection_summary <- data.frame(
  Metric = c("Sensitivity", "False negative rate", "Specificity",
             "False positive rate", "Youden index", "LR+", "LR-", "Prevalence"),
  Estimate = c(
    fmt_pct(dr$detection_rate, 2),
    fmt_pct(dr$false_negative_rate, 2),
    fmt_pct(dr$specificity, 2),
    fmt_pct(dr$false_positive_rate, 2),
    fmt_num(dr$youden_index, 4),
    fmt_num(dr$LR_positive, 4),
    fmt_num(dr$LR_negative, 4),
    fmt_pct(dr$prevalence, 2)
  )
)
knitr::kable(detection_summary, align = c("l", "r"), caption = "Detection performance")

| Metric | Estimate |
|---|---|
| Sensitivity | 85.96% |
| False negative rate | 14.04% |
| Specificity | 90.86% |
| False positive rate | 9.14% |
| Youden index | 0.7683 |
| LR+ | 9.4056 |
| LR- | 0.1545 |
| Prevalence | 38.00% |
Two supporting tables accompany the detection summary: the confusion matrix against the reference standard, and classification accuracy by reference group.

| Reference | Test_Positive | Test_Negative | Total |
|---|---|---|---|
| Reference: Reject | 98 | 16 | 114 |
| Reference: Accept | 17 | 169 | 186 |
| Group | N | Correct | Incorrect | Percent_Correct |
|---|---|---|---|---|
| Positive reference | 114 | 98 | 16 | 85.9649 |
| Negative reference | 186 | 169 | 17 | 90.8602 |
| All samples | 300 | 267 | 33 | 89.0000 |
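These rates can be verified by hand from the confusion matrix above: the sketch below recomputes sensitivity, specificity, the Youden index, and both likelihood ratios from the four cell counts.

# Recompute the detection metrics from the confusion matrix cell counts above
tp <- 98; fn <- 16    # reference Reject classified Reject / Accept
fp <- 17; tn <- 169   # reference Accept classified Reject / Accept
sensitivity <- tp / (tp + fn)                     # 98 / 114  ~= 0.8596
specificity <- tn / (tn + fp)                     # 169 / 186 ~= 0.9086
c(
  Youden = sensitivity + specificity - 1,
  LR_pos = sensitivity / (1 - specificity),
  LR_neg = (1 - sensitivity) / specificity,
  Prevalence = (tp + fn) / (tp + fn + fp + tn)
)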
Use Cohen's Kappa when two appraisers, two systems, or two repeated classification streams are compared directly.
# Keep trial-1 ratings and reshape to one column of ratings per appraiser
trial1 <- attr_data[attr_data$Trial == 1, c("Sample", "Appraiser", "Rating")]
cohen_wide <- reshape(trial1, idvar = "Sample", timevar = "Appraiser", direction = "wide")
names(cohen_wide) <- sub("^Rating\\.", "", names(cohen_wide))

cohen_task <- iqr_cohen_kappa(
  cohen_wide,
  eval1_col = "A",
  eval2_col = "B",
  theme = "academic"
)
cohen_task$summary()
#>
#> ========== Attribute Agreement Summary ==========
#>
#> --- Kappa Statistics ---
#> Statistic Value
#> Method Cohen's Kappa
#> Kappa 0.6552
#> SE Kappa 0.1117
#> Z 5.8635
#> P(vs > 0) <0.0001
#> 95% CI [0.4362, 0.8742]
#> Observed agreement 84.00%
#> Expected agreement 53.60%
#> Interpretation Good
#>
#> --- Response-Level Agreement ---
#> Response N Matches Percent_Agreement
#> Accept 36 28 77.7778
#> Reject 22 14 63.6364
#> =================================================
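As a quick cross-check, Cohen's Kappa for the same two columns can be reproduced with the irr package (an optional dependency not used elsewhere here); the point estimate should agree with the summary above.

# Optional cross-check of Cohen's Kappa with the irr package
if (requireNamespace("irr", quietly = TRUE)) {
  irr::kappa2(cohen_wide[, c("A", "B")])
}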
cohen_task$plot("kappa_benchmark")

Use Fleiss' Kappa when three or more appraisers classify the same samples and no known standard is available.
fleiss_no_standard <- attr_data[attr_data$Trial == 1, c("Sample", "Appraiser", "Rating")]
fleiss_task <- iqr_fleiss_kappa(
  fleiss_no_standard,
  sample_col = "Sample",
  rater_col = "Appraiser",
  rating_col = "Rating",
  theme = "academic"
)
fleiss_task$summary()
#>
#> ========== Attribute Agreement Summary ==========
#>
#> --- Kappa Statistics ---
#> Statistic Value
#> Method Fleiss' Kappa
#> Kappa 0.7416
#> SE Kappa 0.0989
#> Z 7.4950
#> P(vs > 0) <0.0001
#> 95% CI [0.5477, 0.9356]
#> Observed agreement 88.00%
#> Expected agreement 53.56%
#> Interpretation Good
#>
#> --- Between Appraisers ---
#> Comparison N Percent_Agreement Kappa SE_Kappa Z P_vs_gt_0
#> A vs B 50 84 0.6552 0.1117 5.8635 0
#> A vs C 50 86 0.7059 0.1031 6.8472 0
#> B vs C 50 94 0.8687 0.0735 11.8145 0
#> Interpretation
#> Good
#> Good
#> Strong
#>
#> --- Response-Level Agreement ---
#> Response Ratings Standard_Count Matches_to_Standard Percent_Match_to_Standard
#> Accept 95 NA NA NA
#> Reject 55 NA NA NA
#> =================================================
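A similar cross-check is possible for the no-standard case: irr::kappam.fleiss expects a subjects-by-raters matrix, so the long data must be reshaped first. The estimate should agree with the Fleiss' Kappa reported above if the same formula is used.

# Optional cross-check of Fleiss' Kappa with the irr package
if (requireNamespace("irr", quietly = TRUE)) {
  fleiss_wide <- reshape(fleiss_no_standard, idvar = "Sample",
                         timevar = "Appraiser", direction = "wide")
  irr::kappam.fleiss(fleiss_wide[, -1])
}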
fleiss_task$plot("pairwise_agreement")

Use detection-only analysis when the question is binary classification performance against a reference, such as automated vision inspection versus final disposition.
set.seed(20260509)

# One record per part: the known standard plus a simulated vision-system call
vision_data <- unique(attr_data[, c("Sample", "Standard")])
vision_data$Vision_System <- vision_data$Standard

# Inject misclassifications: 3 missed rejects and 4 false alarms
reject_idx <- which(vision_data$Standard == "Reject")
accept_idx <- which(vision_data$Standard == "Accept")
vision_data$Vision_System[sample(reject_idx, 3)] <- "Accept"
vision_data$Vision_System[sample(accept_idx, 4)] <- "Reject"

detection_task <- iqr_detection(
  vision_data,
  reference_col = "Standard",
  test_col = "Vision_System",
  positive_category = "Reject",
  negative_category = "Accept",
  theme = "academic"
)
detection_task$summary()
#>
#> ========== Attribute Agreement Summary ==========
#>
#> --- Detection Performance ---
#> Metric Estimate CI
#> Sensitivity 84.21% [60.42%, 96.62%]
#> False negative rate 15.79%
#> Specificity 87.10% [70.17%, 96.37%]
#> False positive rate 12.90%
#> Youden index 0.7131
#> LR+ 6.5263
#> LR- 0.1813
#> Prevalence 38.00%
#> =================================================
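With only 19 true rejects and 31 true accepts, these intervals are wide. An exact binomial interval for the sensitivity estimate (16 of 19 rejects detected) gives a comparable range, assuming the task reports exact Clopper-Pearson-style limits.

# Exact binomial CI for sensitivity: 16 detected out of 19 true rejects
binom.test(16, 19)$conf.int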
detection_task$plot("detection_ci")

The task objects expose a small catalogue of plot types for attribute agreement and detection studies.

plot_catalogue <- data.frame(
  Plot_Type = c(
    "summary",
    "kappa_benchmark",
    "pairwise_agreement",
    "appraiser_standard",
    "response_agreement",
    "sample_disagreement",
    "detection_ci",
    "detection_metrics"
  ),
  Purpose = c(
    "Combined overview for available Kappa and detection graphics",
    "Overall Kappa value against agreement bands",
    "Between-appraiser pairwise Kappa and percent agreement",
    "Each appraiser compared with the known standard",
    "Response-level match-to-standard rate",
    "Top samples with unstable classification",
    "Sensitivity and specificity confidence intervals",
    "Aligned detection metrics and risk rates"
  )
)
knitr::kable(plot_catalogue, caption = "Attribute agreement plot types")

| Plot_Type | Purpose |
|---|---|
| summary | Combined overview for available Kappa and detection graphics |
| kappa_benchmark | Overall Kappa value against agreement bands |
| pairwise_agreement | Between-appraiser pairwise Kappa and percent agreement |
| appraiser_standard | Each appraiser compared with the known standard |
| response_agreement | Response-level match-to-standard rate |
| sample_disagreement | Top samples with unstable classification |
| detection_ci | Sensitivity and specificity confidence intervals |
| detection_metrics | Aligned detection metrics and risk rates |
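If every graphic is wanted at once, the catalogue can drive a simple loop; this sketch assumes attr_task accepts each listed type and that plot() returns a printable object.

# Render every catalogued plot for the complete task (assumes all types apply)
for (p in plot_catalogue$Plot_Type) {
  print(attr_task$plot(p))
}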
This vignette exercises the full attribute agreement surface. The report export below uses the same complete attr_task object, so the generated HTML and Excel outputs include Kappa, between-appraiser comparisons, appraiser-versus-standard comparisons, within-appraiser repeatability, response-level agreement, and detection performance.
xlsx_path <- tempfile("attribute_agreement_complete_", fileext = ".xlsx")
html_path <- tempfile("attribute_agreement_complete_", fileext = ".html")
attr_task$report("excel", path = xlsx_path)
attr_task$report("html", path = html_path)
data.frame(
  File = c("Excel", "HTML"),
  Exists = c(file.exists(xlsx_path), file.exists(html_path)),
  Size_KB = round(c(file.info(xlsx_path)$size, file.info(html_path)$size) / 1024, 1)
)
#> File Exists Size_KB
#> 1 Excel TRUE 30.5
#> 2 HTML TRUE 1369.2
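Because both files were written to tempfile() paths, they can be removed once inspected.

# Clean up the temporary report files
unlink(c(xlsx_path, html_path))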