Detecting Inattentive Respondents: Complexity–RT Correlation

Can item complexity × response time predict attention?

Author

Jamie C. Lee

Published

April 23, 2026

Overview

This document evaluates two complementary RT-based features for detecting inattentive survey respondents:

  1. Spearman ρ — the rank correlation between item complexity and per-item response time. Attentive respondents take longer on harder items; non-attenders’ RTs are driven by clicking rhythm, not difficulty.
  2. MAR (mean absolute residual) — the average unexplained log-RT variance after fitting a within-person regression of log(RT) on item complexity alone. High MAR means RT varies in ways that complexity cannot explain, which is a signal of non-systematic (inattentive) responding.

Both features treat RT solely as a function of item complexity — no reverse-coding term is included. The question addressed here is whether combining these two features in a mixture model improves classification of attentive vs. inattentive respondents compared to using Spearman ρ alone.

In this simulation, item complexity is a fixed property of the instrument (instrument[[s]]$complexity), and per-item RTs are available as RT_s{s}_i{j}. The complexity effect on RT is built into the data-generating process for attenders: at full attention (\(A = 1\)), each SD of complexity multiplies per-item RT by \(\exp(\texttt{complexity\_rt\_weight} \times A)\). At \(A = 0\) the effect vanishes. Non-attenders are entirely unaffected by complexity.

The simulation was generated with the following complexity parameters:

  • item_complexity_mean = 0, item_complexity_sd = 1
  • complexity_rt_weight = 0.3

Approach summary. Spearman ρ and MAR are both derived from the complexity–RT relationship but capture different aspects of it: ρ captures whether complexity and RT co-vary in rank order; MAR captures how much unexplained residual variance remains after accounting for complexity. Their combination in a 2-feature mixture model is evaluated against ρ alone (Kneedle threshold) and against the existing Laz.R-based and Zh-based approaches in the final comparison section.


Simulation Ground Truth

Show code
class_tab <- data.frame(
  Group = c("Attenders", "Non-attenders"),
  N     = c(sum(dat$is_non_attender == 0), sum(dat$is_non_attender == 1)),
  Pct   = round(100 * c(mean(dat$is_non_attender == 0),
                         mean(dat$is_non_attender == 1)), 1)
)

na_dat  <- dat[dat$is_non_attender == 1, ]
sub_tab <- as.data.frame(table(
  Style   = na_dat$non_attender_style,
  Pattern = na_dat$non_attender_pattern_type
))
sub_tab <- sub_tab[sub_tab$Freq > 0, ]

kable(class_tab, col.names = c("Group", "N", "%"),
      caption = "Sample composition") |>
  kable_styling(full_width = FALSE)
Sample composition
Group N %
Attenders 4459 89.2
Non-attenders 541 10.8
Show code
kable(sub_tab, caption = "Non-attender subtypes") |>
  kable_styling(full_width = FALSE)
Non-attender subtypes
Style Pattern Freq
2 with_pattern acquiescent 54
4 with_pattern alternating 17
6 with_pattern biased_random 36
8 with_pattern diagonal 17
10 with_pattern disacquiescent 25
12 with_pattern midpoint_default 53
14 with_pattern near_straightline 57
16 with_pattern straightline 114
18 with_pattern uniform_random 25

Benchmarking Established Attention-Checking Methods

Before evaluating our proposed complexity–RT features in detail, we benchmark six established measures of inattentive responding against one another and against the proposed Spearman ρ and weighted τ. All measures are computed across all 40 items (both scales pooled), so that comparisons are made on an equal footing with our own features.

The six established measures are:

  1. LongString — the length of the longest uninterrupted run of identical responses in a person’s item sequence. Patterned non-attenders (straightliners, alternators) produce unusually long runs; attenders and random responders do not.
  2. IRV (Intra-individual Response Variability) — the standard deviation of a person’s raw item responses across all items. Non-attenders who straightline or give near-constant responses show very low IRV; random or biased-random responders may show elevated IRV.
  3. Mahalanobis’ Distance on RTs — the multivariate distance between a respondent’s per-item RT vector and the sample mean RT vector, scaled by the sample covariance of RTs. This identifies respondents whose RT profile across items is a multivariate outlier relative to the group — flagging both uniformly fast and unusually patterned RT vectors.
  4. Laz.R — sequential transitional predictability of item responses (defined in the main analysis below). High Laz.R signals mechanically regular response sequences.
  5. Zh — the standardised person-fit statistic from a Graded Response Model (Drasgow et al., 1987). Strongly negative values indicate responses inconsistent with any plausible latent trait.
  6. Even-Odd Consistency (EO) — the correlation between a person’s responses on even-numbered and odd-numbered items. Truly attentive respondents show moderate-to-high even–odd consistency if the scale has reasonable internal structure; random responders show near-zero consistency.

Our two proposed features — Spearman ρ and weighted τ — are included as the final two columns in all comparison tables and highlighted in green for easy identification.

Note on item pooling. All measures below are computed over all 40 items simultaneously (both scales pooled into one item string per respondent), matching the full-instrument condition against which we evaluate sensitivity in subsequent sections.

Show code
# ── Compute all benchmark measures across all items ───────────────────────────
ni_total  <- ns * ni   # total items across both scales

# All item response columns
y_cols_all  <- unlist(lapply(seq_len(ns), function(s)
  paste0("Y_s", s, "_i", seq_len(ni))))
rt_cols_all <- unlist(lapply(seq_len(ns), function(s)
  paste0("RT_s", s, "_i", seq_len(ni))))
cx_all      <- unlist(lapply(seq_len(ns), function(s)
  instrument[[s]]$complexity))

resp_mat <- as.matrix(dat[, y_cols_all])
rt_mat   <- as.matrix(dat[, rt_cols_all])

# ── 1. LongString ─────────────────────────────────────────────────────────────
longstring <- function(x) {
  if (all(is.na(x))) return(NA_integer_)
  x   <- x[!is.na(x)]
  rle_x <- rle(x)
  max(rle_x$lengths)
}
results$longstring <- apply(resp_mat, 1, longstring)

# ── 2. IRV ────────────────────────────────────────────────────────────────────
results$irv <- apply(resp_mat, 1, sd, na.rm = TRUE)

# ── 3. Mahalanobis Distance on RTs ───────────────────────────────────────────
# Compute on log-RT to reduce extreme-RT leverage; use robust estimate for
# the covariance to guard against non-attender contamination.
log_rt_mat <- log(pmax(rt_mat, 1e-6))

# Mean vector and covariance across respondents
rt_mean <- colMeans(log_rt_mat, na.rm = TRUE)
rt_cov  <- tryCatch(
  cov(log_rt_mat, use = "pairwise.complete.obs"),
  error = function(e) diag(ncol(log_rt_mat))
)
# Regularise: add small ridge in case of near-singularity
rt_cov_reg <- rt_cov + diag(1e-4, nrow(rt_cov))

results$mahal_rt <- tryCatch({
  mahal_vals <- mahalanobis(log_rt_mat, center = rt_mean, cov = rt_cov_reg)
  mahal_vals
}, error = function(e) {
  # Fallback: Euclidean distance from mean (scaled)
  apply(sweep(log_rt_mat, 2, rt_mean)^2, 1, sum)
})

# ── 4. Laz.R (pooled across both scales) ─────────────────────────────────────
lazr_pooled <- apply(resp_mat, 1, Laz.R)
results$lazr_pooled <- lazr_pooled

# ── 5. Zh (iterative purification GRM, computed here for benchmarking) ────────
# NOTE: moved to chunk benchmark-zh below (cache: true) to avoid re-running
#       on every render. Placeholder so downstream summary cat() doesn't error.
results$zh <- NA_real_   # will be overwritten by benchmark-zh chunk

# ── 6. Even-Odd Consistency ───────────────────────────────────────────────────
even_idx <- seq(2, ni_total, by = 2)
odd_idx  <- seq(1, ni_total, by = 2)
# Trim to same length if ni_total is odd
min_len   <- min(length(even_idx), length(odd_idx))
even_idx  <- even_idx[seq_len(min_len)]
odd_idx   <- odd_idx[seq_len(min_len)]

results$even_odd <- apply(resp_mat, 1, function(r) {
  e <- r[even_idx]; o <- r[odd_idx]
  complete <- is.finite(e) & is.finite(o)
  if (sum(complete) < 3) return(NA_real_)
  cor(e[complete], o[complete], method = "pearson")
})

# ── Spearman rho and weighted tau pooled (all items) ─────────────────────────
cx_rank_all <- rank(cx_all)

results$spearman_pooled <- apply(rt_mat, 1, function(rt_i) {
  cor(cx_rank_all, rank(rt_i), method = "spearman")
})

# Weighted tau pooled
pairs_all <- combn(ni_total, 2)
cx_diff_all <- cx_all[pairs_all[1,]] - cx_all[pairs_all[2,]]
w_all       <- abs(cx_diff_all)

results$wtau_pooled <- apply(rt_mat, 1, function(rt_i) {
  rt_diff     <- rt_i[pairs_all[1,]] - rt_i[pairs_all[2,]]
  concordance <- sign(cx_diff_all) * sign(rt_diff)
  sum(w_all * concordance) / sum(w_all)
})

cat("All benchmark measures computed.\n")
All benchmark measures computed.
Show code
cat(sprintf("  LongString  : attender mean = %.2f  |  non-attender mean = %.2f\n",
  mean(results$longstring[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$longstring[results$is_non_attender == 1], na.rm=TRUE)))
  LongString  : attender mean = 3.96  |  non-attender mean = 10.69
Show code
cat(sprintf("  IRV         : attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$irv[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$irv[results$is_non_attender == 1], na.rm=TRUE)))
  IRV         : attender mean = 1.056  |  non-attender mean = 0.916
Show code
cat(sprintf("  Mahal RT    : attender mean = %.2f  |  non-attender mean = %.2f\n",
  mean(results$mahal_rt[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$mahal_rt[results$is_non_attender == 1], na.rm=TRUE)))
  Mahal RT    : attender mean = 41.27  |  non-attender mean = 29.13
Show code
cat(sprintf("  Laz.R (pool): attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$lazr_pooled[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$lazr_pooled[results$is_non_attender == 1], na.rm=TRUE)))
  Laz.R (pool): attender mean = 0.359  |  non-attender mean = 0.628
Show code
cat(sprintf("  Zh          : attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$zh[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$zh[results$is_non_attender == 1], na.rm=TRUE)))
  Zh          : attender mean = NaN  |  non-attender mean = NaN
Show code
cat(sprintf("  Even-Odd    : attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$even_odd[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$even_odd[results$is_non_attender == 1], na.rm=TRUE)))
  Even-Odd    : attender mean = 0.095  |  non-attender mean = 0.320
Show code
cat(sprintf("  Spearman ρ  : attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$spearman_pooled[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$spearman_pooled[results$is_non_attender == 1], na.rm=TRUE)))
  Spearman ρ  : attender mean = 0.440  |  non-attender mean = 0.007
Show code
cat(sprintf("  Weighted τ  : attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$wtau_pooled[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$wtau_pooled[results$is_non_attender == 1], na.rm=TRUE)))
  Weighted τ  : attender mean = 0.472  |  non-attender mean = 0.006
Show code
# ── Shared lookup objects used by all benchmark chunks ────────────────────────
# Direction of each measure: high_bad = TRUE means HIGH scores → flag inattentive
bench_direction <- list(
  longstring      = list(col = "longstring",      high_bad = TRUE),
  irv             = list(col = "irv",             high_bad = FALSE),
  mahal_rt        = list(col = "mahal_rt",        high_bad = TRUE),
  lazr_pooled     = list(col = "lazr_pooled",     high_bad = TRUE),
  zh              = list(col = "zh",              high_bad = FALSE),
  even_odd        = list(col = "even_odd",        high_bad = FALSE),
  spearman_pooled = list(col = "spearman_pooled", high_bad = FALSE),
  wtau_pooled     = list(col = "wtau_pooled",     high_bad = FALSE)
)

bench_labels <- c(
  longstring      = "LongString",
  irv             = "IRV",
  mahal_rt        = "Mahalanobis D² (RT)",
  lazr_pooled     = "Laz.R",
  zh              = "Zh",
  even_odd        = "Even-Odd Consistency",
  spearman_pooled = "Spearman ρ ★",
  wtau_pooled     = "Weighted τ ★"
)
Show code
# Zh via iterative-purification GRM. Cached so re-renders are instant.
compute_zh_purified <- function(dat, scale_idx) {
  ni_s      <- params$items_per_scale
  K_s       <- params$n_categories
  y_cols_s  <- paste0("Y_s", scale_idx, "_i", seq_len(ni_s))
  rev_items <- instrument[[scale_idx]]$reverse
  resp_s              <- as.matrix(dat[, y_cols_s])
  resp_scored         <- resp_s
  resp_scored[, rev_items] <- K_s + 1 - resp_s[, rev_items]
  resp_df   <- as.data.frame(resp_scored)

  fit1      <- mirt(resp_df, model = 1, itemtype = "graded", verbose = FALSE)
  pf1       <- personfit(fit1, method = "ML")
  keep_mask <- !is.finite(pf1$Zh) | pf1$Zh > -2
  cat(sprintf("Scale %d — Round 1: %d flagged, refitting on %d of %d\n",
              scale_idx, sum(!keep_mask), sum(keep_mask), nrow(dat)))

  fit2           <- mirt(resp_df[keep_mask, ], model = 1,
                         itemtype = "graded", verbose = FALSE)
  pars_fixed     <- mod2values(fit2)
  pars_fixed$est <- FALSE
  fit2_all       <- mirt(resp_df, model = 1, itemtype = "graded",
                         verbose = FALSE, pars = pars_fixed, TOL = NA)
  pf2 <- personfit(fit2_all, method = "ML")
  cat(sprintf("Scale %d — Round 2: Zh range [%.2f, %.2f], mean = %.3f\n",
              scale_idx, min(pf2$Zh, na.rm=TRUE),
              max(pf2$Zh, na.rm=TRUE), mean(pf2$Zh, na.rm=TRUE)))
  pf2$Zh
}

zh_list    <- lapply(seq_len(ns), compute_zh_purified, dat = dat)
Scale 1 — Round 1: 449 flagged, refitting on 4551 of 5000
Scale 1 — Round 2: Zh range [-12.99, 3.58], mean = -0.294
Scale 2 — Round 1: 438 flagged, refitting on 4562 of 5000
Scale 2 — Round 2: Zh range [-11.03, 3.58], mean = -0.251
Show code
results$zh <- rowMeans(do.call(cbind, zh_list), na.rm = TRUE)
cat("Zh computed and stored in results$zh\n")
Zh computed and stored in results$zh

Distributions by True Class

Each panel below shows the attender (blue) and non-attender (red) density distributions for a single measure. Better measures show greater separation between the two distributions.

Note the direction of each measure’s signal: LongString, Mahal RT, and Laz.R are high → inattentive; IRV is low → inattentive for straightliners but high → inattentive for random responders (bimodal non-attender distribution expected); Zh is low → inattentive; Even-Odd is low → inattentive; Spearman ρ and weighted τ are low → inattentive.

Show code
bench_measures <- list(
  list(col = "longstring",     label = "LongString",
       xlab = "Max run length",           ref = NULL,    high_bad = TRUE),
  list(col = "irv",            label = "IRV",
       xlab = "SD of item responses",     ref = NULL,    high_bad = FALSE),
  list(col = "mahal_rt",       label = "Mahalanobis Distance (RT)",
       xlab = "Mahalanobis D²",           ref = NULL,    high_bad = TRUE),
  list(col = "lazr_pooled",    label = "Laz.R (pooled)",
       xlab = "Sequential predictability", ref = 1/K,   high_bad = TRUE),
  list(col = "zh",             label = "Zh (person-fit)",
       xlab = "Zh",                        ref = -2,    high_bad = FALSE),
  list(col = "even_odd",       label = "Even-Odd Consistency",
       xlab = "Even–odd correlation",      ref = 0,     high_bad = FALSE)
)

proposed_measures <- list(
  list(col = "spearman_pooled", label = "Spearman ρ (proposed)",
       xlab = "Spearman ρ",  ref = 0, high_bad = FALSE),
  list(col = "wtau_pooled",    label = "Weighted τ (proposed)",
       xlab = "Weighted τ",  ref = 0, high_bad = FALSE)
)

make_bench_dist_plot <- function(m, proposed = FALSE) {
  p <- ggplot(results, aes(x = .data[[m$col]], fill = class_label)) +
    geom_density(alpha = 0.5, colour = NA) +
    scale_fill_manual(
      values = c("Attender (true)" = "steelblue",
                 "Non-attender (true)" = "tomato"), name = NULL) +
    labs(title = m$label, x = m$xlab, y = "Density") +
    theme(legend.position = "none")
  if (!is.null(m$ref)) {
    p <- p + geom_vline(xintercept = m$ref, linetype = "dashed",
                        colour = "grey30", linewidth = 0.7)
  }
  if (proposed) {
    p <- p + theme(
      panel.border     = element_rect(colour = "#2e7d32", fill = NA,
                                      linewidth = 1.8),
      plot.background  = element_rect(fill = "#f1f8f1", colour = NA)
    )
  }
  p
}

all_plots <- c(
  lapply(bench_measures,    make_bench_dist_plot, proposed = FALSE),
  lapply(proposed_measures, make_bench_dist_plot, proposed = TRUE)
)

wrap_plots(all_plots, ncol = 2) +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

Attender (blue) vs. non-attender (red) distributions for all eight measures, computed across all 40 items. Dashed reference lines mark conventional flagging thresholds where applicable. The bottom two panels (green borders) show the two proposed complexity–RT features.

AUC and Descriptive Statistics

Show code
# bench_direction and bench_labels are defined in benchmark-setup above

bench_auc_tab <- do.call(rbind, lapply(names(bench_direction), function(nm) {
  m      <- bench_direction[[nm]]
  scores <- if (m$high_bad) -results[[m$col]] else results[[m$col]]
  scores[!is.finite(scores)] <- NA
  auc_val <- roc_auc(scores, results$true_inattentive)
  data.frame(
    Measure  = bench_labels[[nm]],
    AUC      = round(auc_val, 3),
    Mean_att = round(mean(results[[m$col]][results$is_non_attender == 0], na.rm=TRUE), 3),
    SD_att   = round(sd(results[[m$col]][results$is_non_attender == 0],   na.rm=TRUE), 3),
    Mean_non = round(mean(results[[m$col]][results$is_non_attender == 1], na.rm=TRUE), 3),
    SD_non   = round(sd(results[[m$col]][results$is_non_attender == 1],   na.rm=TRUE), 3),
    is_proposed = nm %in% c("spearman_pooled", "wtau_pooled")
  )
}))

proposed_rows <- which(bench_auc_tab$is_proposed)

kable(
  bench_auc_tab[, !names(bench_auc_tab) %in% "is_proposed"],
  col.names = c("Measure", "AUC",
                "Mean (attenders)", "SD (attenders)",
                "Mean (non-attenders)", "SD (non-attenders)"),
  caption = "Discriminability of all eight measures (all 40 items). ★ = proposed complexity–RT feature."
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which.max(bench_auc_tab$AUC), bold = TRUE, background = "#eef6ee") |>
  row_spec(proposed_rows, background = "#e8f5e9", color = "#1b5e20")
Discriminability of all eight measures (all 40 items). ★ = proposed complexity–RT feature.
Measure AUC Mean (attenders) SD (attenders) Mean (non-attenders) SD (non-attenders)
LongString 0.723 3.959 1.300 10.693 9.267
IRV 0.567 1.056 0.185 0.916 0.475
Mahalanobis D² (RT) 0.125 41.273 9.251 29.126 5.546
Laz.R 0.766 0.359 0.049 0.628 0.259
Zh 0.707 -0.079 1.142 -1.866 2.899
Even-Odd Consistency 0.378 0.095 0.244 0.320 0.460
Spearman ρ ★ 0.964 0.440 0.164 0.007 0.157
Weighted τ ★ 0.968 0.472 0.167 0.006 0.166

Cutoff Selection and Classification Performance

We flag respondents using three cutoff strategies for each measure:

  • Kneedle — the elbow point on the sorted score curve (data-driven, no ground-truth required). Used for all measures except Mahalanobis D².
  • χ²(0.975) — for Mahalanobis D² only: the 97.5th percentile of a chi-squared distribution with p = 40 degrees of freedom (one per item). This is the standard outlier threshold from the multivariate statistics literature — under multivariate normality, only 2.5% of attenders should exceed it.
  • Median split — the sample median (a simple, robust alternative)
  • Oracle — the attender 5th (or 95th) percentile, chosen with knowledge of ground truth, providing an upper bound on performance

All three are compared in the table below. For the proposed ρ and τ, counts and percentages are reported for both false positives (FP: attenders incorrectly flagged) and false negatives (FN: non-attenders missed). Proposed-feature rows are highlighted in green.

Show code
# Mahalanobis D² ~ chi-squared(p) under normality; flag above 97.5th percentile
mahal_chisq_thresh <- qchisq(0.975, df = ni_total)

# Apply Kneedle, median, and oracle cutoffs for each measure.
# Mahalanobis uses a chi-squared threshold instead of Kneedle.
apply_bench_cutoffs <- function(col, high_bad) {
  scores <- results[[col]]
  truth  <- results$true_inattentive

  signal <- if (high_bad) scores else -scores
  signal[!is.finite(signal)] <- NA

  # Mahalanobis: chi-squared outlier threshold (replaces Kneedle)
  if (col == "mahal_rt") {
    flag_kn  <- scores > mahal_chisq_thresh   # "Kneedle" slot holds chi-sq flag
    t_kn     <- mahal_chisq_thresh
  } else {
    sorted_sig <- sort(signal, na.last = NA)
    k_idx      <- kneedle_fn(sorted_sig, sign = -1)
    t_kn       <- sorted_sig[k_idx]
    flag_kn    <- signal > t_kn
  }

  # Median on signal
  t_med      <- median(signal, na.rm = TRUE)
  flag_med   <- signal > t_med

  # Oracle: attender 95th percentile of signal
  t_oracle    <- quantile(signal[!truth], 0.95, na.rm = TRUE)
  flag_oracle <- signal > t_oracle

  list(flag_kn = flag_kn, flag_med = flag_med, flag_oracle = flag_oracle,
       t_kn = t_kn, t_med = t_med, t_oracle = t_oracle)
}

bench_cuts <- lapply(names(bench_direction), function(nm) {
  m <- bench_direction[[nm]]
  apply_bench_cutoffs(m$col, m$high_bad)
})
names(bench_cuts) <- names(bench_direction)

# Build performance table
make_bench_perf_row <- function(nm, cutoff_name, flag) {
  truth <- results$true_inattentive
  n_att <- sum(!truth)
  n_non <- sum(truth)
  cs    <- confusion_stats(flag, truth)
  fp_n  <- cs$FP;  fp_pct <- round(100 * cs$FP / n_att, 1)
  fn_n  <- cs$FN;  fn_pct <- round(100 * cs$FN / n_non, 1)
  data.frame(
    Measure  = bench_labels[[nm]],
    Cutoff   = cutoff_name,
    Sens     = round(cs$Sensitivity, 3),
    Spec     = round(cs$Specificity, 3),
    FP_n     = fp_n,  FP_pct = fp_pct,
    FN_n     = fn_n,  FN_pct = fn_pct,
    Flagged  = round(cs$Flagged_pct, 1),
    is_proposed = nm %in% c("spearman_pooled", "wtau_pooled")
  )
}

perf_rows <- do.call(rbind, lapply(names(bench_direction), function(nm) {
  rbind(
    make_bench_perf_row(nm, "Kneedle", bench_cuts[[nm]]$flag_kn),
    make_bench_perf_row(nm, "Median",  bench_cuts[[nm]]$flag_med),
    make_bench_perf_row(nm, "Oracle",  bench_cuts[[nm]]$flag_oracle)
  )
}))

proposed_perf_rows <- which(perf_rows$is_proposed)

kable(
  perf_rows[, !names(perf_rows) %in% "is_proposed"],
  col.names = c("Measure", "Cutoff",
                "Sensitivity", "Specificity",
                "FP (N)", "FP (%)",
                "FN (N)", "FN (%)",
                "% Flagged"),
  caption = paste0(
    "Classification performance for all measures. ",
    "Mahalanobis D\u00b2 uses a \u03c7\u00b2(0.975) outlier threshold instead of Kneedle. ",
    "FP = attenders incorrectly flagged; FN = non-attenders missed. \u2605 = proposed feature."
  )
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(proposed_perf_rows, background = "#e8f5e9", color = "#1b5e20") |>
  row_spec(which(perf_rows$Cutoff == "Oracle"), italic = TRUE)
Classification performance for all measures. Mahalanobis D² uses a χ²(0.975) outlier threshold instead of Kneedle. FP = attenders incorrectly flagged; FN = non-attenders missed. ★ = proposed feature.
Measure Cutoff Sensitivity Specificity FP (N) FP (%) FN (N) FN (%) % Flagged
LongString Kneedle 0.525 0.963 167 3.7 257 47.5 9.0
LongString Median 0.632 0.738 1168 26.2 199 36.8 30.2
LongString Oracle 0.525 0.963 167 3.7 257 47.5 9.0
IRV Kneedle 0.401 0.974 115 2.6 324 59.9 6.6
IRV Median 0.558 0.510 2187 49.0 239 44.2 49.8
IRV Oracle 0.405 0.952 214 4.8 322 59.5 8.7
Mahalanobis D² (RT) Kneedle 0.000 0.967 146 3.3 541 100.0 2.9
Mahalanobis D² (RT) Median 0.044 0.445 2476 55.5 517 95.6 50.0
Mahalanobis D² (RT) Oracle 0.002 0.950 223 5.0 540 99.8 4.5
Laz.R Kneedle 0.630 0.953 209 4.7 200 37.0 11.0
Laz.R Median 0.739 0.529 2100 47.1 141 26.1 50.0
Laz.R Oracle 0.630 0.950 223 5.0 200 37.0 11.3
Zh Kneedle 0.477 0.919 363 8.1 283 52.3 12.4
Zh Median 0.726 0.527 2107 47.3 148 27.4 50.0
Zh Oracle 0.458 0.950 223 5.0 293 54.2 9.4
Even-Odd Consistency Kneedle 0.050 0.957 192 4.3 474 87.6 NA
Even-Odd Consistency Median 0.415 0.490 2272 51.0 292 54.0 NA
Even-Odd Consistency Oracle 0.050 0.950 223 5.0 474 87.6 NA
Spearman ρ ★ Kneedle 0.906 0.894 473 10.6 51 9.4 19.3
Spearman ρ ★ Median 0.996 0.560 1961 44.0 2 0.4 50.0
Spearman ρ ★ Oracle 0.787 0.950 223 5.0 115 21.3 13.0
Weighted τ ★ Kneedle 0.928 0.887 502 11.3 39 7.2 20.1
Weighted τ ★ Median 0.998 0.560 1960 44.0 1 0.2 50.0
Weighted τ ★ Oracle 0.810 0.950 223 5.0 103 19.0 13.2

Aggregated Performance: Weighted Balanced Accuracy and Sensitivity at Fixed Specificity

The table above gives the full picture but requires readers to weigh sensitivity against specificity themselves. Here we provide two complementary summaries that make the trade-off explicit.

Weighted Balanced Accuracy (WBA) averages sensitivity and specificity with an explicit weight w on specificity:

\[\text{WBA}(w) = w \cdot \text{Specificity} + (1 - w) \cdot \text{Sensitivity}\]

At w = 0.5 this is ordinary balanced accuracy (symmetric). Higher w rewards measures that keep FPs low — appropriate here because false positives introduce systematic bias in sample composition (see §False Positive Covariate Analysis), while false negatives reduce power but do not distort who remains in the sample.

Sensitivity at 95% Specificity (S@95) reads directly off the ROC curve: it is the highest sensitivity a measure can achieve while keeping the false positive rate at or below 5%. This answers the applied question — given I am only willing to incorrectly exclude 1 in 20 real attenders, how many non-attenders can I catch? — without requiring any weight choice.

Both are computed using the Kneedle threshold (or χ² for Mahalanobis). WBA is also shown at w ∈ {0.5, 0.6, 0.7, 0.8, 0.9} so readers can see whether measure rankings are stable across reasonable choices of specificity preference.

Show code
truth <- results$true_inattentive

# ── Sensitivity at 95% Specificity from ROC curve ────────────────────────────
sens_at_95spec <- function(nm) {
  m      <- bench_direction[[nm]]
  signal <- if (m$high_bad) results[[m$col]] else -results[[m$col]]
  signal[!is.finite(signal)] <- NA
  thresholds <- sort(unique(signal[!is.na(signal)]))
  roc <- do.call(rbind, lapply(thresholds, function(t) {
    flag <- signal > t
    TP <- sum( flag &  truth, na.rm=TRUE); FP <- sum( flag & !truth, na.rm=TRUE)
    TN <- sum(!flag & !truth, na.rm=TRUE); FN <- sum(!flag &  truth, na.rm=TRUE)
    data.frame(spec = TN/(TN+FP), sens = TP/(TP+FN))
  }))
  # Best sensitivity among rows where specificity >= 0.95
  eligible <- roc[!is.na(roc$spec) & roc$spec >= 0.95, ]
  if (nrow(eligible) == 0) return(NA_real_)
  round(max(eligible$sens, na.rm = TRUE), 3)
}

# ── WBA at a given weight ─────────────────────────────────────────────────────
wba <- function(sens, spec, w) round(w * spec + (1 - w) * sens, 3)

# ── Build summary table ───────────────────────────────────────────────────────
weights   <- seq(0.5, 0.9, by = 0.1)
wba_tab <- do.call(rbind, lapply(names(bench_direction), function(nm) {
  # Use Kneedle (or chi-sq) flag
  flag <- bench_cuts[[nm]]$flag_kn
  cs   <- confusion_stats(flag, truth)
  sens <- cs$Sensitivity
  spec <- cs$Specificity
  s95  <- sens_at_95spec(nm)

  wba_vals <- setNames(sapply(weights, function(w) wba(sens, spec, w)),
                       paste0("WBA(w=", weights, ")"))

  data.frame(
    Measure   = bench_labels[[nm]],
    Sens      = round(sens, 3),
    Spec      = round(spec, 3),
    `S@95%`   = s95,
    as.data.frame(t(wba_vals)),
    is_proposed = nm %in% c("spearman_pooled", "wtau_pooled"),
    check.names = FALSE
  )
}))

prop_rows_wba <- which(wba_tab$is_proposed)
wba_display   <- wba_tab[, !names(wba_tab) %in% "is_proposed"]

# ── Main WBA table ────────────────────────────────────────────────────────────
tbl_wba <- kable(
  wba_display,
  col.names = c("Measure", "Sensitivity", "Specificity", "S@95% Spec",
                paste0("WBA w=", weights)),
  caption = paste0(
    "Sensitivity, specificity, Sensitivity at 95% Specificity (S@95%), and ",
    "Weighted Balanced Accuracy at five specificity weights. ",
    "Higher w = greater penalty for false positives. ",
    "\u2605 = proposed complexity\u2013RT feature."
  ),
  digits = 3
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(prop_rows_wba, background = "#e8f5e9", color = "#1b5e20") |>
  # Bold the S@95 and WBA(w=0.7) columns as primary recommended summaries
  column_spec(4, bold = TRUE) |>
  column_spec(7, bold = TRUE)   # w=0.7 column (4 fixed cols + 3rd weight)

cat(as.character(tbl_wba), "\n\n")
Sensitivity, specificity, Sensitivity at 95% Specificity (S@95%), and Weighted Balanced Accuracy at five specificity weights. Higher w = greater penalty for false positives. ★ = proposed complexity–RT feature.
Measure Sensitivity Specificity S@95% Spec WBA w=0.5 WBA w=0.6 WBA w=0.7 WBA w=0.8 WBA w=0.9
LongString 0.525 0.963 0.525 0.744 0.788 0.831 0.875 0.919
IRV 0.401 0.974 0.405 0.688 0.745 0.802 0.860 0.917
Mahalanobis D² (RT) 0.000 0.967 0.002 0.484 0.580 0.677 0.774 0.871
Laz.R 0.630 0.953 0.630 0.792 0.824 0.856 0.889 0.921
Zh 0.477 0.919 0.458 0.698 0.742 0.786 0.830 0.874
Even-Odd Consistency 0.050 0.957 0.050 0.504 0.594 0.685 0.776 0.866
Spearman ρ ★ 0.906 0.894 0.787 0.900 0.899 0.897 0.896 0.895
Weighted τ ★ 0.928 0.887 0.810 0.908 0.904 0.900 0.896 0.891
Show code
wba_long <- tidyr::pivot_longer(
  wba_tab[, c("Measure", paste0("WBA(w=", weights, ")"))],
  cols      = -Measure,
  names_to  = "weight",
  values_to = "WBA"
) |>
  mutate(
    weight  = factor(weight, levels = paste0("WBA(w=", weights, ")")),
    w_label = factor(
      gsub("WBA\\(w=(.*)\\)", "w = \\1", weight),
      levels = gsub("WBA\\(w=(.*)\\)", "w = \\1", paste0("WBA(w=", weights, ")"))
    ),
    Measure = factor(Measure, levels = rev(bench_labels[names(bench_direction)]))
  )

# Rank each measure within each weight for annotation
wba_long <- wba_long |>
  group_by(weight) |>
  mutate(rank_label = paste0("#", rank(-WBA, ties.method = "min"))) |>
  ungroup()

proposed_measures_vec <- bench_labels[c("spearman_pooled", "wtau_pooled")]

ggplot(wba_long, aes(x = w_label, y = Measure, fill = WBA)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.3f\n(%s)", WBA, rank_label)),
            size = 2.8, lineheight = 0.9) +
  scale_fill_gradient2(
    low      = "tomato",
    mid      = "lightyellow",
    high     = "#2e7d32",
    midpoint = 0.65,
    name     = "WBA"
  ) +
  labs(
    title    = "Weighted Balanced Accuracy across specificity preference weights",
    subtitle = "Cell = WBA value; rank in parentheses. Bold green rows = proposed features.",
    x        = "Specificity weight (w)",
    y        = NULL
  ) +
  theme(
    axis.text.y = element_text(
      colour = ifelse(
        rev(bench_labels[names(bench_direction)]) %in% proposed_measures_vec,
        "#1b5e20", "grey20"
      ),
      face = ifelse(
        rev(bench_labels[names(bench_direction)]) %in% proposed_measures_vec,
        "bold", "plain"
      )
    ),
    legend.position = "right"
  )

Weighted Balanced Accuracy for each measure (rows) across specificity weights w = 0.5–0.9 (columns). Green = higher WBA, red = lower. Stable rankings across columns indicate the conclusion is robust to the choice of weight. Proposed features shown in bold green row labels.

Cutoff Visualisation

Show code
make_bench_cutoff_plot <- function(nm, proposed = FALSE) {
  m      <- bench_direction[[nm]]
  cuts   <- bench_cuts[[nm]]
  scores <- results[[m$col]]
  high_b <- m$high_bad

  # Label first threshold as chi-sq for Mahalanobis, Kneedle otherwise
  primary_label <- if (nm == "mahal_rt") "χ²(0.975)" else "Kneedle"

  thresh_df <- data.frame(
    xintercept = c(
      if (high_b) cuts$t_kn    else -cuts$t_kn,
      if (high_b) cuts$t_med   else -cuts$t_med,
      if (high_b) cuts$t_oracle else -cuts$t_oracle
    ),
    ltype  = c("solid", "dashed", "dotted"),
    label  = c(primary_label, "Median", "Oracle")
  )

  p <- ggplot(results, aes(x = .data[[m$col]], fill = class_label)) +
    geom_density(alpha = 0.45, colour = NA) +
    geom_vline(data = thresh_df,
               aes(xintercept = xintercept, linetype = ltype),
               colour = "black", linewidth = 0.75, inherit.aes = FALSE) +
    scale_linetype_identity() +
    annotate("text",
             x     = thresh_df$xintercept,
             y     = Inf,
             label = thresh_df$label,
             hjust = if (high_b) c(-0.1,-0.1,1.1) else c(1.1,1.1,-0.1),
             vjust = 1.4, size = 2.8) +
    scale_fill_manual(
      values = c("Attender (true)" = "steelblue",
                 "Non-attender (true)" = "tomato"), name = NULL) +
    labs(title = bench_labels[[nm]], x = m$col, y = "Density") +
    theme(legend.position = "none")

  if (proposed) {
    p <- p + theme(
      panel.border    = element_rect(colour = "#2e7d32", fill = NA,
                                     linewidth = 1.8),
      plot.background = element_rect(fill = "#f1f8f1", colour = NA)
    )
  }
  p
}

all_cutoff_plots <- c(
  lapply(names(bench_direction)[1:6], make_bench_cutoff_plot, proposed = FALSE),
  lapply(names(bench_direction)[7:8], make_bench_cutoff_plot, proposed = TRUE)
)

wrap_plots(all_cutoff_plots, ncol = 2) +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

Density distributions for each measure with Kneedle (solid), Median (dashed), and Oracle (dotted) cutoff lines. Red shading marks the flagging region. Proposed features (ρ, τ) are shown with a green panel border.

ROC Curves: All Measures

Show code
truth <- results$true_inattentive

roc_list <- lapply(names(bench_direction), function(nm) {
  m      <- bench_direction[[nm]]
  signal <- if (m$high_bad) results[[m$col]] else -results[[m$col]]
  signal[!is.finite(signal)] <- NA
  r      <- make_roc_df(signal, truth, bench_labels[[nm]])
  auc_v  <- round(roc_auc(signal, truth), 3)
  r$variant <- sprintf("%s (AUC=%.3f)", bench_labels[[nm]], auc_v)
  r$proposed <- nm %in% c("spearman_pooled", "wtau_pooled")
  r
})

roc_all <- do.call(rbind, roc_list)

proposed_variants <- unique(roc_all$variant[roc_all$proposed])
other_variants    <- unique(roc_all$variant[!roc_all$proposed])

n_other   <- length(other_variants)
grey_pal  <- colorRampPalette(c("grey60", "steelblue3"))(n_other)
colour_map <- c(
  setNames(grey_pal, other_variants),
  setNames(c("#2e7d32", "#81c784"), proposed_variants)
)
size_map <- c(
  setNames(rep(0.7, n_other),   other_variants),
  setNames(c(1.4, 1.4),          proposed_variants)
)

ggplot(roc_all, aes(x = fpr, y = tpr,
                    colour = variant, linewidth = variant)) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "grey80") +
  geom_path() +
  scale_colour_manual(values = colour_map, name = NULL) +
  scale_linewidth_manual(values = size_map, guide = "none") +
  labs(title    = "ROC curves — all attention-checking measures",
       subtitle = "Green = proposed complexity–RT features",
       x = "False Positive Rate (1 − Specificity)",
       y = "Sensitivity") +
  theme(legend.text = element_text(size = 8),
        legend.key.width = unit(1.2, "cm"))

ROC curves for all eight measures. Proposed complexity–RT features (ρ, τ) are shown as bold green lines; established measures in grey/blue. AUC values are reported in the legend.

False Positive Covariate Analysis

Are false positives from each measure systematically different from the true negatives (correctly retained attenders)? This section examines the selectivity profile of each measure’s errors: if a measure flags people not because they are inattentive but because they share some substantive characteristic (trait extremity, response style, low attention but genuine engagement), the resulting exclusions introduce selection bias rather than merely noise.

We compare FP vs TN attenders on five covariates:

  1. θ̂ (IRT trait estimate) — are FPs clustering at the centre of the latent trait? A measure whose FPs are predominantly mid-range respondents is biased against a specific substantive group.
  2. Observed mean score — same question via the raw scale mean, which is available without IRT assumptions.
  3. Latent attention A — are FPs simply low-attention genuine attenders? If so, the exclusion may be defensible (they were marginally engaged), but it should be acknowledged.
  4. IRV (response variability) — do FPs show restricted range use? Low IRV can arise from mid-point responding or from genuine homogeneity of trait expression; high IRV from acquiescent or extreme responding.
  5. Mean RT — are FPs simply fast responders overall, or is their exclusion driven by RT–complexity decoupling specifically?

For each measure, FP and TN attenders are compared via density overlays and a summary table of group means and standardised mean differences (Cohen’s d).

Show code
# ── IRT theta: read directly from simulation output ───────────────────────────
# theta_1 and theta_2 are the true latent trait values from the DGP,
# one per scale. We use both individually and their mean as an overall index.
results$theta_1   <- dat$theta_1
results$theta_2   <- dat$theta_2
results$theta_mean <- rowMeans(cbind(dat$theta_1, dat$theta_2), na.rm = TRUE)

# ── Observed mean score (all items, reverse-scored) ───────────────────────────
# Build index of which columns in resp_mat need reversing
resp_mat_scored <- resp_mat
for (s in seq_len(ns)) {
  rev_item_nums <- instrument[[s]]$reverse   # integer item indices within scale s
  if (length(rev_item_nums) > 0) {
    col_names <- paste0("Y_s", s, "_i", rev_item_nums)
    # Only reverse columns that actually exist in resp_mat
    valid <- col_names[col_names %in% colnames(resp_mat_scored)]
    if (length(valid) > 0) {
      resp_mat_scored[, valid] <- K + 1 - resp_mat[, valid]
    }
  }
}
results$mean_score <- rowMeans(resp_mat_scored, na.rm = TRUE)

# ── Mean RT (overall speed) ───────────────────────────────────────────────────
results$mean_rt <- rowMeans(rt_mat, na.rm = TRUE)

# IRV and A are already in results

cat("Covariate summary (attenders only):\n")
Covariate summary (attenders only):
Show code
att <- results[results$is_non_attender == 0, ]
for (col in c("theta_1", "theta_2", "theta_mean", "mean_score", "A", "irv", "mean_rt")) {
  cat(sprintf("  %-12s  mean = %.3f  sd = %.3f\n",
              col, mean(att[[col]], na.rm = TRUE), sd(att[[col]], na.rm = TRUE)))
}
  theta_1       mean = 0.005  sd = 0.993
  theta_2       mean = 0.005  sd = 1.023
  theta_mean    mean = 0.005  sd = 0.708
  mean_score    mean = 3.045  sd = 0.270
  A             mean = 0.670  sd = 0.201
  irv           mean = 1.056  sd = 0.185
  mean_rt       mean = 5.052  sd = 0.570
Show code
# Build a unified FP/TN flag column for each measure (among attenders only)
# using the Kneedle / chi-sq thresholds from bench_cuts

fp_flag_df <- results[results$is_non_attender == 0, ] |>
  select(ID, A, theta_1, theta_2, theta_mean, mean_score, irv, mean_rt)

for (nm in names(bench_direction)) {
  cuts <- bench_cuts[[nm]]
  fp_flag_df[[paste0("fp_", nm)]] <- as.logical(cuts$flag_kn[results$is_non_attender == 0])
}

cat(sprintf("FP flag data frame: %d attenders × %d columns\n",
            nrow(fp_flag_df), ncol(fp_flag_df)))
FP flag data frame: 4459 attenders × 16 columns

Density Overlays: FP vs TN by Covariate

Each plot below focuses on one covariate and shows all eight measures as facets. Within each facet, the blue distribution is the true negatives (attenders correctly retained) and the red distribution is the false positives (attenders incorrectly flagged).

How to read these: there are three patterns to watch for:

  • Near-complete overlap (red ≈ blue in shape and location) — the measure’s false positives are not selectively different from true negatives on that covariate. The errors are essentially random with respect to it, which is the best-case scenario.
  • Shifted distribution (red centre is left or right of blue) — the measure is systematically excluding attenders at a particular level of the covariate. For example, red shifted toward θ = 0 means the measure preferentially flags mid-trait respondents; red shifted toward low A means it mainly catches low-attention-but-genuine attenders.
  • Narrow, tall red distribution (red is more peaked and concentrated than blue, even if centred similarly) — the false positives come from a very specific, homogeneous slice of the attender population. This is a stronger form of bias than a simple shift: the measure is not just nudging toward one end of the covariate, it is selecting a particular type of respondent with unusual precision. For instance, a narrow red spike at moderate θ values would suggest the measure almost exclusively flags respondents near the scale midpoint, leaving very high and very low scorers alone even when they are inattentive.

In general, a measure whose FPs look like a random sample of the TN distribution (same shape, same spread, same centre) is making errors that do not differentially harm any subgroup. A measure whose FPs form a tight, distinctive cluster is effectively acting as a filter for a specific kind of respondent, regardless of whether those respondents are actually inattentive.

The proposed measures (ρ, τ) are shown with a green facet border.

Show code
covariates <- list(
  list(col = "theta_1",    label = "\u03b8\u2081 (Scale 1 trait)"),
  list(col = "theta_2",    label = "\u03b8\u2082 (Scale 2 trait)"),
  list(col = "theta_mean", label = "\u03b8\u0304 (mean trait)"),
  list(col = "mean_score", label = "Mean observed score"),
  list(col = "A",          label = "Latent attention A"),
  list(col = "irv",        label = "IRV (response variability)"),
  list(col = "mean_rt",    label = "Mean RT (ms)")
)

measure_order <- c("longstring", "irv", "mahal_rt", "lazr_pooled",
                   "zh", "even_odd", "spearman_pooled", "wtau_pooled")

# Build a long data frame: one row per attender × measure
fp_long <- do.call(rbind, lapply(measure_order, function(nm) {
  flag_col <- paste0("fp_", nm)
  data.frame(
    fp_flag_df,
    measure   = bench_labels[[nm]],
    group     = ifelse(fp_flag_df[[flag_col]], "FP (flagged)", "TN (retained)"),
    proposed  = nm %in% c("spearman_pooled", "wtau_pooled"),
    stringsAsFactors = FALSE
  )
}))

fp_long$measure <- factor(fp_long$measure, levels = bench_labels[measure_order])
fp_long$group   <- factor(fp_long$group, levels = c("TN (retained)", "FP (flagged)"))

# One plot per covariate — each rendered separately so text isn't squashed
for (cv in covariates) {
  p <- ggplot(fp_long, aes(x = .data[[cv$col]], fill = group)) +
    geom_density(alpha = 0.5, colour = NA) +
    scale_fill_manual(
      values = c("TN (retained)" = "steelblue", "FP (flagged)" = "tomato"),
      name   = NULL) +
    facet_wrap(~ measure, nrow = 2, scales = "free_y") +
    labs(
      title    = sprintf("FP vs TN: %s", cv$label),
      subtitle = "Red = false positives (flagged attenders); blue = true negatives (retained attenders)",
      x        = cv$label,
      y        = "Density"
    ) +
    theme(
      legend.position  = "bottom",
      strip.text       = element_text(size = 9, face = "bold"),
      strip.background = element_rect(
        fill = ifelse(
          levels(fp_long$measure) %in%
            bench_labels[c("spearman_pooled", "wtau_pooled")],
          "#e8f5e9", "grey92"
        ),
        colour = ifelse(
          levels(fp_long$measure) %in%
            bench_labels[c("spearman_pooled", "wtau_pooled")],
          "#2e7d32", "grey70"
        ),
        linewidth = 0.8
      )
    )
  print(p)
}

Summary Table: Group Means and Cohen’s d

Show code
cohens_d <- function(x_fp, x_tn) {
  n1 <- sum(!is.na(x_fp)); n2 <- sum(!is.na(x_tn))
  if (n1 < 2 || n2 < 2) return(NA_real_)
  sd_pooled <- sqrt(((n1-1)*var(x_fp, na.rm=TRUE) +
                     (n2-1)*var(x_tn, na.rm=TRUE)) / (n1+n2-2))
  if (sd_pooled == 0) return(NA_real_)
  (mean(x_fp, na.rm=TRUE) - mean(x_tn, na.rm=TRUE)) / sd_pooled
}

covariate_cols   <- c("theta_1", "theta_2", "theta_mean", "mean_score", "A", "irv", "mean_rt")
covariate_labels <- c("\u03b8\u2081 (S1)", "\u03b8\u2082 (S2)", "\u03b8\u0304 (mean)",
                      "Mean score", "Attention A", "IRV", "Mean RT")

fp_tab <- do.call(rbind, lapply(measure_order, function(nm) {
  flag   <- fp_flag_df[[paste0("fp_", nm)]]
  fp_grp <- fp_flag_df[flag  == TRUE,  ]
  tn_grp <- fp_flag_df[flag  == FALSE, ]
  n_fp   <- sum(flag, na.rm = TRUE)
  n_tn   <- sum(!flag, na.rm = TRUE)

  row_vals <- unlist(lapply(covariate_cols, function(cv) {
    d    <- round(cohens_d(fp_grp[[cv]], tn_grp[[cv]]), 2)
    m_fp <- round(mean(fp_grp[[cv]], na.rm = TRUE), 2)
    m_tn <- round(mean(tn_grp[[cv]], na.rm = TRUE), 2)
    c(m_fp, m_tn, d)
  }))

  df <- as.data.frame(t(row_vals))
  names(df) <- as.vector(outer(
    covariate_labels,
    c("FP mean", "TN mean", "d"),
    paste, sep = " \u2014 "
  ))
  cbind(Measure = bench_labels[[nm]], `N FP` = n_fp, `N TN` = n_tn, df)
}))

prop_rows <- which(fp_tab$Measure %in% c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))

# One sub-table per covariate — use cat() so knitr renders HTML correctly
for (i in seq_along(covariate_cols)) {
  cv_lbl <- covariate_labels[i]
  cols_i <- c("Measure", "N FP", "N TN",
              paste(cv_lbl, c("FP mean", "TN mean", "d"), sep = " \u2014 "))
  sub_tab <- fp_tab[, cols_i]

  tbl_html <- kable(
    sub_tab,
    format     = "html",
    col.names  = c("Measure", "N FP", "N TN", "Mean (FP)", "Mean (TN)", "Cohen's d"),
    caption    = sprintf("FP vs TN comparison on %s. Positive d = FPs score higher.", cv_lbl),
    digits     = 3
  ) |>
    kable_styling(full_width = FALSE) |>
    row_spec(prop_rows, background = "#e8f5e9", color = "#1b5e20") |>
    column_spec(6, bold = TRUE)

  cat(as.character(tbl_html), "\n\n")
}
FP vs TN comparison on θ₁ (S1). Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 0.09 0.01 0.61
IRV 115 4344 -0.13 0.01 1.05
Mahalanobis D² (RT) 146 4313 -0.07 0.01 -0.03
Laz.R 209 4250 -0.01 0.01 0.81
Zh 363 4096 0.05 0.00 -2.21
Even-Odd Consistency 192 4267 -0.10 0.01 -0.17
Spearman ρ ★ 473 3986 0.01 0.00 -1.78
Weighted τ ★ 502 3957 0.04 0.00 -1.78
FP vs TN comparison on θ₂ (S2). Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 0.00 0.00 0.94
IRV 115 4344 0.01 -0.16 0.67
Mahalanobis D² (RT) 146 4313 0.01 -0.06 1.06
Laz.R 209 4250 0.01 -0.06 0.80
Zh 363 4096 0.00 0.04 1.29
Even-Odd Consistency 192 4267 0.01 -0.02 1.01
Spearman ρ ★ 473 3986 0.00 0.03 1.19
Weighted τ ★ 502 3957 0.00 0.05 1.19
FP vs TN comparison on θ̄ (mean). Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 0.09 3.04 1.06
IRV 115 4344 -0.13 3.02 1.07
Mahalanobis D² (RT) 146 4313 -0.07 3.05 1.06
Laz.R 209 4250 -0.02 3.05 1.07
Zh 363 4096 0.05 3.03 1.03
Even-Odd Consistency 192 4267 -0.11 3.03 1.06
Spearman ρ ★ 473 3986 0.00 3.03 1.04
Weighted τ ★ 502 3957 0.04 3.04 1.04
FP vs TN comparison on Mean score. Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 -0.09 3.04 -0.63
IRV 115 4344 -0.09 3.05 -2.24
Mahalanobis D² (RT) 146 4313 -0.01 3.04 0.04
Laz.R 209 4250 -0.06 3.04 -1.52
Zh 363 4096 0.01 3.05 1.52
Even-Odd Consistency 192 4267 0.09 3.05 -0.27
Spearman ρ ★ 473 3986 0.04 3.05 0.81
Weighted τ ★ 502 3957 0.03 3.05 0.81
FP vs TN comparison on Attention A. Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 0.01 -0.01 5.35
IRV 115 4344 0.01 -0.11 5.58
Mahalanobis D² (RT) 146 4313 0.01 0.00 5.21
Laz.R 209 4250 0.01 0.02 5.44
Zh 363 4096 0.00 -0.06 4.25
Even-Odd Consistency 192 4267 0.00 -0.05 5.00
Spearman ρ ★ 473 3986 0.00 -0.05 4.39
Weighted τ ★ 502 3957 0.00 -0.03 4.39
FP vs TN comparison on IRV. Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 -0.09 0.79 5.04
IRV 115 4344 -0.10 0.87 5.04
Mahalanobis D² (RT) 146 4313 -0.01 0.67 5.05
Laz.R 209 4250 -0.06 0.82 5.03
Zh 363 4096 0.00 0.32 5.12
Even-Odd Consistency 192 4267 0.09 0.64 5.05
Spearman ρ ★ 473 3986 0.04 0.39 5.13
Weighted τ ★ 502 3957 0.02 0.39 5.14
FP vs TN comparison on Mean RT. Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 0.00 0.67 0.55
IRV 115 4344 -0.11 0.66 0.96
Mahalanobis D² (RT) 146 4313 -0.04 0.67 0.29
Laz.R 209 4250 -0.03 0.66 0.72
Zh 363 4096 0.03 0.70 -1.68
Even-Odd Consistency 192 4267 -0.01 0.67 -0.10
Spearman ρ ★ 473 3986 0.02 0.70 -1.41
Weighted τ ★ 502 3957 0.03 0.71 -1.45

Cohen’s d Heatmap: All Measures × Covariates

Show code
d_long <- do.call(rbind, lapply(measure_order, function(nm) {
  flag   <- fp_flag_df[[paste0("fp_", nm)]]
  fp_grp <- fp_flag_df[flag == TRUE,  ]
  tn_grp <- fp_flag_df[flag == FALSE, ]
  do.call(rbind, lapply(seq_along(covariate_cols), function(i) {
    data.frame(
      measure   = bench_labels[[nm]],
      covariate = covariate_labels[i],
      d         = cohens_d(fp_grp[[covariate_cols[i]]],
                           tn_grp[[covariate_cols[i]]])
    )
  }))
}))

d_long$measure <- factor(d_long$measure,
  levels = rev(bench_labels[measure_order]))

ggplot(d_long, aes(x = covariate, y = measure, fill = d)) +
  geom_tile(colour = "white", linewidth = 0.6) +
  geom_text(aes(label = sprintf("%.2f", d)), size = 3) +
  scale_fill_gradient2(
    low      = "#6a0dad",   # purple = FPs score lower
    mid      = "white",
    high     = "#2e7d32",   # green  = FPs score higher
    midpoint = 0,
    na.value = "grey85",
    name     = "Cohen's d\n(FP − TN)"
  ) +
  labs(title    = "False positive selectivity: Cohen's d (FP vs TN attenders)",
       subtitle = "Rows = measures; columns = covariates. Proposed features in green.",
       x = NULL, y = NULL) +
  theme(
    axis.text.x  = element_text(angle = 20, hjust = 1),
    axis.text.y  = element_text(
      colour = ifelse(
        rev(bench_labels[measure_order]) %in%
          c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"),
        "#1b5e20", "grey20"
      ),
      face = ifelse(
        rev(bench_labels[measure_order]) %in%
          c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"),
        "bold", "plain"
      )
    )
  )

Cohen’s d for each measure (rows) × covariate (columns). Positive d (green) means FPs score higher than TNs; negative d (purple) means FPs score lower. Values near zero mean the measure’s FPs are not selectively different from TNs on that covariate. Proposed features highlighted with a star in the row label.

Sensitivity Analysis I: Sample Size (N)

How stable are the eight measures as sample size decreases from the full dataset of 5000 respondents down to N = 100? We draw random subsamples (without replacement) at each of the following sizes: N = 5,000 · 2,000 · 1,000 · 500 · 250 · 100, recomputing all measures and their Kneedle-threshold performance at each size.

Because subsampling is stochastic, we repeat each condition 20 times and report the mean ± SD of sensitivity and FP rate across replications.

Design. The simulation should be run with at least N = 5,000 respondents so that the largest subsample equals the full dataset. Subsampling preserves the original attender/non-attender ratio within each draw.

Show code
n_ss_reps <- 20   # replications per N level — enough for stable means
ss_sizes  <- c(100, 250, 500, 1000, 2000, 5000)
# Cap at actual sample size
ss_sizes  <- ss_sizes[ss_sizes <= nrow(dat)]

set.seed(2025)

# Pre-generate subsample indices: list[N_level][rep] -> row indices
ss_idx <- lapply(ss_sizes, function(n_s) {
  lapply(seq_len(n_ss_reps), function(r) {
    att_idx <- which(dat$is_non_attender == 0)
    non_idx <- which(dat$is_non_attender == 1)
    prop_non <- mean(dat$is_non_attender == 1)
    n_non <- round(n_s * prop_non)
    n_att <- n_s - n_non
    c(
      sample(att_idx, min(n_att, length(att_idx)), replace = FALSE),
      sample(non_idx, min(n_non, length(non_idx)), replace = FALSE)
    )
  })
})
names(ss_idx) <- as.character(ss_sizes)

cat(sprintf("Subsample sizes: %s\nReplications: %d\n",
            paste(ss_sizes, collapse = ", "), n_ss_reps))
Subsample sizes: 100, 250, 500, 1000, 2000, 5000
Replications: 20
Show code
# ── Vectorised measure helpers — defined here, used by both sensitivity loops ─

# LongString: max run length per row
longstring_vec <- function(mat) {
  apply(mat, 1, function(x) {
    x <- x[!is.na(x)]
    if (length(x) == 0) return(NA_integer_)
    max(rle(x)$lengths)
  })
}

# IRV
irv_vec <- function(mat) apply(mat, 1, sd, na.rm = TRUE)

# Even-Odd: vectorised via column indexing
eo_vec <- function(mat) {
  k      <- ncol(mat)
  e_idx  <- seq(2, k, by = 2)
  o_idx  <- seq(1, k, by = 2)
  min_p  <- min(length(e_idx), length(o_idx))
  if (min_p < 3) return(rep(NA_real_, nrow(mat)))
  e_idx  <- e_idx[seq_len(min_p)]; o_idx <- o_idx[seq_len(min_p)]
  E <- mat[, e_idx, drop = FALSE]
  O <- mat[, o_idx, drop = FALSE]
  me  <- rowMeans(E, na.rm = TRUE); mo <- rowMeans(O, na.rm = TRUE)
  cov_eo <- rowMeans(E * O, na.rm = TRUE) - me * mo
  sd_e   <- apply(E, 1, sd, na.rm = TRUE)
  sd_o   <- apply(O, 1, sd, na.rm = TRUE)
  ifelse(sd_e == 0 | sd_o == 0, NA_real_, cov_eo / (sd_e * sd_o))
}

# Laz.R: cast to integer first for speed
lazr_vec <- function(mat) {
  mat_int <- matrix(as.integer(mat), nrow = nrow(mat))
  apply(mat_int, 1, Laz.R)
}

# Mahalanobis on log-RT
mahal_vec <- function(log_rt) {
  mu    <- colMeans(log_rt, na.rm = TRUE)
  cov_m <- tryCatch(
    cov(log_rt, use = "pairwise.complete.obs") + diag(1e-4, ncol(log_rt)),
    error = function(e) diag(ncol(log_rt))
  )
  tryCatch(
    mahalanobis(log_rt, center = mu, cov = cov_m),
    error = function(e) rowSums(sweep(log_rt, 2, mu)^2)
  )
}

# Spearman rho
rho_vec <- function(rt_mat, cx) {
  cx_rk <- rank(cx)
  apply(rt_mat, 1, function(rt_i) cor(cx_rk, rank(rt_i), method = "spearman"))
}

# Vectorised weighted tau
wtau_vectorised <- function(rt_mat, cx) {
  pairs   <- combn(length(cx), 2)
  cx_diff <- cx[pairs[1,]] - cx[pairs[2,]]
  w       <- abs(cx_diff)
  w_sum   <- sum(w)
  rt_diff <- rt_mat[, pairs[1,], drop=FALSE] - rt_mat[, pairs[2,], drop=FALSE]
  conc    <- sweep(sign(rt_diff), 2, sign(cx_diff), `*`)
  as.numeric((conc %*% w) / w_sum)
}

# ── Per-subsample performance ─────────────────────────────────────────────────

# For each subsample, recompute measures and apply Kneedle cutoff
compute_ss_perf <- function(row_idx) {
  sub_dat <- dat[row_idx, ]
  sub_res <- sub_dat[, c("ID", "is_non_attender")]
  sub_res$true_inatt <- sub_res$is_non_attender == 1
  n_att   <- sum(!sub_res$true_inatt)
  n_non   <- sum(sub_res$true_inatt)
  if (n_att < 5 || n_non < 5) return(NULL)

  sub_resp <- as.matrix(sub_dat[, y_cols_all])
  sub_rt   <- as.matrix(sub_dat[, rt_cols_all])

  # 1. LongString
  sub_res$longstring      <- longstring_vec(sub_resp)
  # 2. IRV
  sub_res$irv             <- irv_vec(sub_resp)
  # 3. Mahalanobis on log-RT
  sub_res$mahal_rt        <- mahal_vec(log(pmax(sub_rt, 1e-6)))
  # 4. Laz.R
  sub_res$lazr_pooled     <- lazr_vec(sub_resp)
  # 5. Even-Odd
  sub_res$even_odd        <- eo_vec(sub_resp)
  # 6. Spearman pooled
  sub_res$spearman_pooled <- rho_vec(sub_rt, cx_all)
  # 7. Weighted tau — vectorised
  sub_res$wtau_pooled     <- wtau_vectorised(sub_rt, cx_all)
  # Zh excluded (too slow per subsample)
  sub_res$zh <- NA_real_

  # Apply cutoffs per measure and compute sens / FP rate
  # Mahalanobis uses chi-squared outlier threshold; all others use Kneedle.
  perf_list <- lapply(names(bench_direction), function(nm) {
    m <- bench_direction[[nm]]
    if (nm == "zh") return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))
    signal <- if (m$high_bad) sub_res[[m$col]] else -sub_res[[m$col]]
    signal[!is.finite(signal)] <- NA
    if (all(is.na(signal))) return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))

    if (nm == "mahal_rt") {
      # chi-sq threshold scales with df = number of items used
      flag <- sub_res$mahal_rt > qchisq(0.975, df = ni_total)
    } else {
      sorted_sig <- sort(signal, na.last = NA)
      if (length(sorted_sig) < 2)
        return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))
      k_idx <- kneedle_fn(sorted_sig, sign = -1)
      flag  <- signal > sorted_sig[k_idx]
    }
    cs <- confusion_stats(flag, sub_res$true_inatt)
    data.frame(
      measure  = nm,
      sens     = cs$Sensitivity,
      spec     = cs$Specificity,
      fp_n     = cs$FP,
      fn_n     = cs$FN
    )
  })

  do.call(rbind, perf_list)
}

# Run across all sizes and reps
ss_results_list <- lapply(ss_sizes, function(n_s) {
  reps <- ss_idx[[as.character(n_s)]]
  rep_perf <- lapply(reps, function(idx) {
    pf <- tryCatch(compute_ss_perf(idx), error = function(e) NULL)
    if (!is.null(pf)) pf$n <- n_s
    pf
  })
  do.call(rbind, Filter(Negate(is.null), rep_perf))
})

ss_results <- do.call(rbind, ss_results_list)

cat(sprintf("Sensitivity-N analysis complete: %d rows of results.\n", nrow(ss_results)))
Sensitivity-N analysis complete: 960 rows of results.

Sensitivity and False Positive Rate vs. N

Show code
ss_summary <- ss_results |>
  group_by(n, measure) |>
  summarise(
    mean_sens = round(mean(sens, na.rm = TRUE), 3),
    sd_sens   = round(sd(sens,   na.rm = TRUE), 3),
    mean_spec = round(mean(spec, na.rm = TRUE), 3),
    sd_spec   = round(sd(spec,   na.rm = TRUE), 3),
    mean_fp   = round(mean(fp_n, na.rm = TRUE), 1),
    sd_fp     = round(sd(fp_n,   na.rm = TRUE), 1),
    mean_fn   = round(mean(fn_n, na.rm = TRUE), 1),
    sd_fn     = round(sd(fn_n,   na.rm = TRUE), 1),
    .groups   = "drop"
  ) |>
  mutate(measure_label = bench_labels[measure])

make_wide <- function(data, value_col, prefix) {
  data |>
    select(measure_label, n, all_of(value_col)) |>
    tidyr::pivot_wider(names_from = n, values_from = all_of(value_col),
                       names_prefix = prefix) |>
    mutate(is_proposed = measure_label %in% c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))
}

render_wide <- function(wide_df, col_suffix, caption_text) {
  prop_rows <- which(wide_df$is_proposed)
  kable(
    wide_df[, !names(wide_df) %in% "is_proposed"],
    col.names = c("Measure", paste0("N=", ss_sizes)),
    caption   = caption_text
  ) |>
    kable_styling(full_width = FALSE) |>
    row_spec(prop_rows, background = "#e8f5e9", color = "#1b5e20")
}

render_wide(make_wide(ss_summary, "mean_sens", "N="),
  "sens", "Mean sensitivity by N. Zh excluded (GRM refit too slow).")
Mean sensitivity by N. Zh excluded (GRM refit too slow).
Measure N=100 N=250 N=500 N=1000 N=2000 N=5000
Even-Odd Consistency 0.059 0.066 0.077 0.066 0.058 0.050
IRV 0.359 0.396 0.426 0.416 0.408 0.401
Laz.R 0.618 0.652 0.653 0.642 0.636 0.630
LongString 0.486 0.543 0.520 0.516 0.520 0.525
Mahalanobis D² (RT) 0.000 0.002 0.001 0.000 0.000 0.000
Spearman ρ ★ 0.891 0.930 0.899 0.912 0.906 0.906
Weighted τ ★ 0.900 0.937 0.934 0.927 0.931 0.928
Zh NaN NaN NaN NaN NaN NaN
Show code
render_wide(make_wide(ss_summary, "mean_spec", "N="),
  "spec", "Mean specificity by N.")
Mean specificity by N.
Measure N=100 N=250 N=500 N=1000 N=2000 N=5000
Even-Odd Consistency 0.911 0.930 0.937 0.936 0.946 0.957
IRV 0.949 0.961 0.964 0.967 0.972 0.974
Laz.R 0.960 0.956 0.957 0.952 0.954 0.953
LongString 0.959 0.971 0.973 0.970 0.966 0.963
Mahalanobis D² (RT) 0.997 0.980 0.974 0.970 0.968 0.967
Spearman ρ ★ 0.883 0.876 0.890 0.887 0.889 0.894
Weighted τ ★ 0.889 0.877 0.881 0.889 0.883 0.887
Zh NaN NaN NaN NaN NaN NaN
Show code
render_wide(make_wide(ss_summary, "mean_fp", "N="),
  "fp",   "Mean false positive count (flagged attenders) by N.")
Mean false positive count (flagged attenders) by N.
Measure N=100 N=250 N=500 N=1000 N=2000 N=5000
Even-Odd Consistency 8.0 15.7 28.1 57.2 96.7 192
IRV 4.6 8.6 16.1 29.6 50.5 115
Laz.R 3.6 9.8 19.3 42.8 82.8 209
LongString 3.6 6.4 12.2 27.1 61.0 167
Mahalanobis D² (RT) 0.2 4.3 11.5 26.6 56.3 146
Spearman ρ ★ 10.4 27.6 49.2 100.6 197.2 473
Weighted τ ★ 9.9 27.4 53.0 98.7 209.1 502
Zh NaN NaN NaN NaN NaN NaN
Show code
render_wide(make_wide(ss_summary, "mean_fn", "N="),
  "fn",   "Mean false negative count (missed non-attenders) by N.")
Mean false negative count (missed non-attenders) by N.
Measure N=100 N=250 N=500 N=1000 N=2000 N=5000
Even-Odd Consistency 9.8 23.0 46.3 93.0 187.8 474
IRV 7.0 16.3 31.0 63.0 127.8 324
Laz.R 4.2 9.4 18.8 38.7 78.6 200
LongString 5.7 12.3 25.9 52.2 103.8 257
Mahalanobis D² (RT) 11.0 27.0 54.0 108.0 216.0 541
Spearman ρ ★ 1.2 1.9 5.4 9.5 20.3 51
Weighted τ ★ 1.1 1.7 3.5 7.9 15.0 39
Zh NaN NaN NaN NaN NaN NaN
Show code
ss_colours <- c(
  setNames(colorRampPalette(c("grey60", "steelblue3"))(6),
           setdiff(unique(ss_summary$measure_label),
                   c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))),
  "Spearman \u03c1 \u2605" = "#2e7d32",
  "Weighted \u03c4 \u2605" = "#81c784"
)
ss_sizes_map <- c(
  setNames(rep(0.6, 6),
           setdiff(unique(ss_summary$measure_label),
                   c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))),
  "Spearman \u03c1 \u2605" = 1.4,
  "Weighted \u03c4 \u2605" = 1.4
)

make_ss_line_plot <- function(y_col, sd_col, y_label, title_str) {
  ggplot(ss_summary, aes(x = n, y = .data[[y_col]],
                          colour = measure_label, linewidth = measure_label)) +
    geom_ribbon(aes(ymin = pmax(.data[[y_col]] - .data[[sd_col]], 0),
                    ymax = pmin(.data[[y_col]] + .data[[sd_col]], 1),
                    fill = measure_label),
                alpha = 0.08, colour = NA) +
    geom_line() +
    geom_point(size = 2) +
    scale_x_continuous(breaks = ss_sizes, labels = scales::comma) +
    scale_colour_manual(values = ss_colours, name = NULL) +
    scale_fill_manual(  values = ss_colours, name = NULL, guide = "none") +
    scale_linewidth_manual(values = ss_sizes_map, guide = "none") +
    labs(title = title_str,
         subtitle = "Mean \u00b1 1 SD across replications; Zh excluded",
         x = "Sample size (N)", y = y_label) +
    theme(legend.position = "right", legend.text = element_text(size = 8))
}

make_ss_line_plot("mean_sens", "sd_sens", "Sensitivity", "Sensitivity vs. N") /
make_ss_line_plot("mean_spec", "sd_spec", "Specificity", "Specificity vs. N") +
  plot_layout(guides = "collect")

Sensitivity (top) and specificity (bottom) as a function of sample size N. Points show means across replications; ribbons show ± 1 SD. Proposed features (ρ, τ) are bold green; established measures in grey/blue.

Sensitivity Analysis II: Number of Items

How do the measures degrade as fewer items are available? We create holdout item sets by randomly removing between 0 and 30 items (in steps of 5) from the full pool of 40 items, recomputing each measure on the remaining items only.

Items are removed at random (preserving scale membership proportionally) 15 times per holdout level, and performance is averaged across replications. This simulates scenarios where researchers want to apply RT-based screening to shorter instruments.

Note on Zh. Zh requires fitting a new GRM for every holdout set and is excluded from this sensitivity analysis. Its item-count sensitivity can be inferred from the single-scale Zh results in the benchmark section.

Show code
n_item_reps   <- 15   # replications per removal level
items_removed <- seq(0, 30, by = 5)   # 0, 5, 10, 15, 20, 25, 30

set.seed(2026)

holdout_idx <- lapply(items_removed, function(n_rem) {
  lapply(seq_len(n_item_reps), function(r) {
    if (n_rem == 0) return(seq_len(ni_total))
    removed <- sample(seq_len(ni_total), n_rem, replace = FALSE)
    setdiff(seq_len(ni_total), removed)
  })
})
names(holdout_idx) <- as.character(items_removed)

cat(sprintf("Item removal levels: %s\nReplications: %d\n",
            paste(items_removed, collapse = ", "), n_item_reps))
Item removal levels: 0, 5, 10, 15, 20, 25, 30
Replications: 15
Show code
# All vectorised helpers (longstring_vec, irv_vec, eo_vec, lazr_vec,
# mahal_vec, rho_vec, wtau_vectorised) defined in sensitivity-n-compute above.

compute_item_perf <- function(item_idx) {
  n_items  <- length(item_idx)
  truth    <- results$true_inattentive
  n_att    <- sum(!truth)

  sub_resp <- resp_mat[, item_idx, drop = FALSE]
  sub_rt   <- rt_mat[,  item_idx, drop = FALSE]
  sub_cx   <- cx_all[item_idx]

  tmp <- data.frame(
    longstring      = longstring_vec(sub_resp),
    irv             = irv_vec(sub_resp),
    mahal_rt        = mahal_vec(log(pmax(sub_rt, 1e-6))),
    lazr_pooled     = lazr_vec(sub_resp),
    even_odd        = eo_vec(sub_resp),
    zh              = NA_real_,          # excluded — too slow per holdout
    spearman_pooled = rho_vec(sub_rt, sub_cx),
    wtau_pooled     = if (n_items >= 2) wtau_vectorised(sub_rt, sub_cx)
                      else rep(NA_real_, nrow(sub_rt))
  )

  perf_list <- lapply(names(bench_direction), function(nm) {
    m <- bench_direction[[nm]]
    if (nm == "zh") return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))
    signal <- if (m$high_bad) tmp[[m$col]] else -tmp[[m$col]]
    signal[!is.finite(signal)] <- NA
    if (all(is.na(signal))) return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))

    if (nm == "mahal_rt") {
      # chi-sq df = number of retained items for this holdout
      flag <- tmp$mahal_rt > qchisq(0.975, df = n_items)
    } else {
      sorted_sig <- sort(signal, na.last = NA)
      if (length(sorted_sig) < 2) return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))
      k_idx <- kneedle_fn(sorted_sig, sign = -1)
      flag  <- signal > sorted_sig[k_idx]
    }
    cs <- confusion_stats(flag, truth)
    data.frame(
      measure = nm,
      sens    = cs$Sensitivity,
      spec    = cs$Specificity,
      fp_n    = cs$FP,
      fn_n    = cs$FN
    )
  })

  do.call(rbind, perf_list)
}

# Run across all item-removal levels and reps
item_results_list <- lapply(items_removed, function(n_rem) {
  reps <- holdout_idx[[as.character(n_rem)]]
  rep_perf <- lapply(reps, function(idx) {
    pf <- tryCatch(compute_item_perf(idx), error = function(e) NULL)
    if (!is.null(pf)) pf$n_removed <- n_rem
    pf
  })
  do.call(rbind, Filter(Negate(is.null), rep_perf))
})

item_results <- do.call(rbind, item_results_list)
item_results$n_items <- ni_total - item_results$n_removed

cat(sprintf("Sensitivity-items analysis complete: %d rows.\n", nrow(item_results)))
Sensitivity-items analysis complete: 840 rows.

Sensitivity and False Positive Rate vs. Number of Items

Show code
item_summary <- item_results |>
  group_by(n_items, measure) |>
  summarise(
    mean_sens = round(mean(sens, na.rm = TRUE), 3),
    sd_sens   = round(sd(sens,   na.rm = TRUE), 3),
    mean_spec = round(mean(spec, na.rm = TRUE), 3),
    sd_spec   = round(sd(spec,   na.rm = TRUE), 3),
    mean_fp   = round(mean(fp_n, na.rm = TRUE), 1),
    sd_fp     = round(sd(fp_n,   na.rm = TRUE), 1),
    mean_fn   = round(mean(fn_n, na.rm = TRUE), 1),
    sd_fn     = round(sd(fn_n,   na.rm = TRUE), 1),
    .groups   = "drop"
  ) |>
  mutate(measure_label = bench_labels[measure])

k_levels <- sort(unique(item_summary$n_items))

make_wide_items <- function(data, value_col) {
  data |>
    select(measure_label, n_items, all_of(value_col)) |>
    tidyr::pivot_wider(names_from = n_items, values_from = all_of(value_col),
                       names_prefix = "k=") |>
    mutate(is_proposed = measure_label %in% c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))
}

render_wide_items <- function(wide_df, caption_text) {
  prop_rows <- which(wide_df$is_proposed)
  kable(
    wide_df[, !names(wide_df) %in% "is_proposed"],
    col.names = c("Measure", paste0("k=", k_levels)),
    caption   = caption_text
  ) |>
    kable_styling(full_width = FALSE) |>
    row_spec(prop_rows, background = "#e8f5e9", color = "#1b5e20")
}

render_wide_items(make_wide_items(item_summary, "mean_sens"),
  "Mean sensitivity by number of retained items (k). Zh excluded.")
Mean sensitivity by number of retained items (k). Zh excluded.
Measure k=10 k=15 k=20 k=25 k=30 k=35 k=40
Even-Odd Consistency 0.160 0.086 0.066 0.073 0.078 0.082 0.050
IRV 0.385 0.397 0.400 0.398 0.401 0.401 0.401
Laz.R 0.432 0.487 0.555 0.612 0.629 0.632 0.630
LongString 0.502 0.508 0.491 0.490 0.488 0.509 0.525
Mahalanobis D² (RT) 0.025 0.004 0.001 0.001 0.001 0.000 0.000
Spearman ρ ★ 0.525 0.631 0.698 0.681 0.816 0.852 0.906
Weighted τ ★ 0.591 0.704 0.746 0.744 0.864 0.885 0.928
Zh NaN NaN NaN NaN NaN NaN NaN
Show code
render_wide_items(make_wide_items(item_summary, "mean_spec"),
  "Mean specificity by number of retained items (k).")
Mean specificity by number of retained items (k).
Measure k=10 k=15 k=20 k=25 k=30 k=35 k=40
Even-Odd Consistency 0.780 0.873 0.909 0.915 0.921 0.925 0.957
IRV 0.956 0.947 0.956 0.965 0.959 0.969 0.974
Laz.R 0.918 0.943 0.945 0.945 0.949 0.955 0.953
LongString 0.887 0.912 0.951 0.957 0.969 0.966 0.963
Mahalanobis D² (RT) 0.976 0.975 0.972 0.969 0.969 0.967 0.967
Spearman ρ ★ 0.900 0.913 0.919 0.923 0.916 0.908 0.894
Weighted τ ★ 0.881 0.898 0.907 0.911 0.903 0.903 0.887
Zh NaN NaN NaN NaN NaN NaN NaN
Show code
render_wide_items(make_wide_items(item_summary, "mean_fp"),
  "Mean false positive count (flagged attenders) by number of retained items (k).")
Mean false positive count (flagged attenders) by number of retained items (k).
Measure k=10 k=15 k=20 k=25 k=30 k=35 k=40
Even-Odd Consistency 961.4 565.7 406.1 378.7 354.1 336.1 192
IRV 194.6 236.1 196.7 153.9 180.6 140.1 115
Laz.R 366.1 253.1 246.7 245.8 226.6 199.1 209
LongString 504.8 392.5 217.7 192.5 139.0 150.6 167
Mahalanobis D² (RT) 107.9 113.5 126.0 139.3 139.9 147.3 146
Spearman ρ ★ 446.2 389.9 362.0 341.3 373.3 410.4 473
Weighted τ ★ 531.3 456.3 414.1 395.0 432.8 432.8 502
Zh NaN NaN NaN NaN NaN NaN NaN
Show code
render_wide_items(make_wide_items(item_summary, "mean_fn"),
  "Mean false negative count (missed non-attenders) by number of retained items (k).")
Mean false negative count (missed non-attenders) by number of retained items (k).
Measure k=10 k=15 k=20 k=25 k=30 k=35 k=40
Even-Odd Consistency 368.9 430.3 457.7 459.5 463.5 465.8 474
IRV 332.9 326.1 324.6 325.7 323.9 324.3 324
Laz.R 307.4 277.4 240.9 209.8 200.7 198.9 200
LongString 269.3 265.9 275.5 275.9 276.7 265.9 257
Mahalanobis D² (RT) 527.5 539.1 540.4 540.3 540.5 540.8 541
Spearman ρ ★ 257.1 199.5 163.3 172.6 99.5 79.9 51
Weighted τ ★ 221.0 160.0 137.4 138.4 73.5 62.1 39
Zh NaN NaN NaN NaN NaN NaN NaN
Show code
item_colours <- c(
  setNames(colorRampPalette(c("grey60", "steelblue3"))(6),
           setdiff(unique(item_summary$measure_label),
                   c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))),
  "Spearman \u03c1 \u2605" = "#2e7d32",
  "Weighted \u03c4 \u2605" = "#81c784"
)
item_size_map <- c(
  setNames(rep(0.6, 6),
           setdiff(unique(item_summary$measure_label),
                   c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))),
  "Spearman \u03c1 \u2605" = 1.4,
  "Weighted \u03c4 \u2605" = 1.4
)

make_item_line_plot <- function(y_col, sd_col, y_label, title_str) {
  ggplot(item_summary, aes(x = n_items, y = .data[[y_col]],
                            colour = measure_label, linewidth = measure_label)) +
    geom_ribbon(aes(ymin = pmax(.data[[y_col]] - .data[[sd_col]], 0),
                    ymax = pmin(.data[[y_col]] + .data[[sd_col]], 1),
                    fill = measure_label),
                alpha = 0.08, colour = NA) +
    geom_line() +
    geom_point(size = 2) +
    scale_x_continuous(breaks = k_levels) +
    scale_colour_manual(values = item_colours, name = NULL) +
    scale_fill_manual(  values = item_colours, name = NULL, guide = "none") +
    scale_linewidth_manual(values = item_size_map, guide = "none") +
    labs(title = title_str,
         subtitle = "Mean \u00b1 1 SD across replications",
         x = "Items retained (k)", y = y_label) +
    theme(legend.position = "right", legend.text = element_text(size = 8))
}

make_item_line_plot("mean_sens", "sd_sens", "Sensitivity", "Sensitivity vs. items") /
make_item_line_plot("mean_spec", "sd_spec", "Specificity", "Specificity vs. items") +
  plot_layout(guides = "collect")

Sensitivity (top) and specificity (bottom) as a function of the number of retained items. Ribbons show ± 1 SD across replications. Proposed complexity–RT features are highlighted in green.

Interaction: N × Items

The heatmap below shows mean sensitivity as a joint function of sample size and number of items for each proposed feature, enabling a quick read of the operating conditions under which ρ and τ remain viable.

Show code
# Re-run a coarser grid (N × k) for the heatmap — use fewer reps
n_hm_reps   <- 10
ss_hm_sizes <- ss_sizes
items_hm    <- seq(10, ni_total, by = 10)  # 10, 20, 30, 40

set.seed(999)

hm_results <- do.call(rbind, lapply(ss_hm_sizes, function(n_s) {
  do.call(rbind, lapply(items_hm, function(k) {
    do.call(rbind, lapply(seq_len(n_hm_reps), function(r) {
      att_idx  <- which(dat$is_non_attender == 0)
      non_idx  <- which(dat$is_non_attender == 1)
      prop_non <- mean(dat$is_non_attender == 1)
      n_non    <- round(n_s * prop_non)
      n_att    <- n_s - n_non
      row_idx  <- c(
        sample(att_idx, min(n_att, length(att_idx)), replace = FALSE),
        sample(non_idx, min(n_non, length(non_idx)), replace = FALSE)
      )
      if (length(row_idx) < 20) return(NULL)

      item_idx <- sample(seq_len(ni_total), k, replace = FALSE)
      sub_rt   <- rt_mat[row_idx, ][, item_idx, drop = FALSE]
      sub_cx   <- cx_all[item_idx]
      truth    <- (dat$is_non_attender == 1)[row_idx]
      n_att_s  <- sum(!truth)
      if (n_att_s < 5 || sum(truth) < 5) return(NULL)

      cx_rk_hm <- rank(sub_cx)
      rho_hm   <- apply(sub_rt, 1, function(rt_i)
        cor(cx_rk_hm, rank(rt_i), method = "spearman"))
      wtau_hm  <- wtau_vectorised(sub_rt, sub_cx)

      kneedle_stats <- function(signal, truth) {
        signal[!is.finite(signal)] <- NA
        sorted <- sort(signal, na.last = NA)
        if (length(sorted) < 2)
          return(list(sens = NA, spec = NA, fp_pct = NA, fn_pct = NA))
        k_i  <- kneedle_fn(sorted, sign = -1)
        flag <- signal > sorted[k_i]
        cs   <- confusion_stats(flag, truth)
        n_att_loc <- sum(!truth)
        n_non_loc <- sum(truth)
        list(sens   = cs$Sensitivity,
             spec   = cs$Specificity,
             fp_pct = if (n_att_loc > 0) 100 * cs$FP / n_att_loc else NA,
             fn_pct = if (n_non_loc > 0) 100 * cs$FN / n_non_loc else NA)
      }

      s_rho  <- kneedle_stats(-rho_hm,  truth)
      s_wtau <- kneedle_stats(-wtau_hm, truth)

      data.frame(
        n             = n_s, k = k,
        sens_rho      = s_rho$sens,    sens_wtau   = s_wtau$sens,
        spec_rho      = s_rho$spec,    spec_wtau   = s_wtau$spec,
        fp_pct_rho    = s_rho$fp_pct,  fp_pct_wtau = s_wtau$fp_pct,
        fn_pct_rho    = s_rho$fn_pct,  fn_pct_wtau = s_wtau$fn_pct
      )
    }))
  }))
}))

hm_summary <- hm_results |>
  group_by(n, k) |>
  summarise(
    sens_rho    = round(mean(sens_rho,    na.rm = TRUE), 3),
    sens_wtau   = round(mean(sens_wtau,   na.rm = TRUE), 3),
    spec_rho    = round(mean(spec_rho,    na.rm = TRUE), 3),
    spec_wtau   = round(mean(spec_wtau,   na.rm = TRUE), 3),
    fp_pct_rho  = round(mean(fp_pct_rho,  na.rm = TRUE), 1),
    fp_pct_wtau = round(mean(fp_pct_wtau, na.rm = TRUE), 1),
    fn_pct_rho  = round(mean(fn_pct_rho,  na.rm = TRUE), 1),
    fn_pct_wtau = round(mean(fn_pct_wtau, na.rm = TRUE), 1),
    .groups     = "drop"
  )

fmt_label_y <- function(x) formatC(as.numeric(x), format = "d", big.mark = ",")

make_heatmap <- function(data, rho_col, wtau_col, metric_label,
                         fill_label, low_col, mid_col, high_col,
                         midpoint_val, fmt = "%.2f") {
  long_df <- data |>
    select(n, k, rho = all_of(rho_col), wtau = all_of(wtau_col)) |>
    tidyr::pivot_longer(cols = c(rho, wtau),
                        names_to = "feature", values_to = "value") |>
    mutate(feature = ifelse(feature == "rho", "Spearman \u03c1", "Weighted \u03c4"))

  ggplot(long_df, aes(x = factor(k), y = factor(n), fill = value)) +
    geom_tile(colour = "white", linewidth = 0.5) +
    geom_text(aes(label = sprintf(fmt, value)), size = 2.6) +
    scale_fill_gradient2(low = low_col, mid = mid_col, high = high_col,
                         midpoint = midpoint_val, na.value = "grey85",
                         name = fill_label) +
    facet_wrap(~ feature, ncol = 2) +
    scale_y_discrete(labels = fmt_label_y) +
    labs(title = paste(metric_label, "heatmap: N \u00d7 items retained"),
         x = "Items retained (k)", y = "Sample size (N)") +
    theme(axis.text.x = element_text(angle = 0),
          legend.position = "right")
}

p_hm_sens <- make_heatmap(hm_summary, "sens_rho", "sens_wtau",
  "Sensitivity", "Mean\nSensitivity",
  "tomato", "lightyellow", "#2e7d32", 0.5)

p_hm_spec <- make_heatmap(hm_summary, "spec_rho", "spec_wtau",
  "Specificity", "Mean\nSpecificity",
  "tomato", "lightyellow", "#2e7d32", 0.5)

p_hm_fp <- make_heatmap(hm_summary, "fp_pct_rho", "fp_pct_wtau",
  "False Positive Rate (%)", "Mean FP\nRate (%)",
  "#2e7d32", "lightyellow", "tomato", 50, fmt = "%.1f%%")

p_hm_fn <- make_heatmap(hm_summary, "fn_pct_rho", "fn_pct_wtau",
  "False Negative Rate (%)", "Mean FN\nRate (%)",
  "#2e7d32", "lightyellow", "tomato", 50, fmt = "%.1f%%")

p_hm_sens / p_hm_spec / p_hm_fp / p_hm_fn

Mean sensitivity, specificity, FP count, and FN count for Spearman ρ (left column) and weighted τ (right column) as a joint function of sample size N and items retained k. Grey cells = conditions not evaluated.

The Complexity–RT Feature

Computing Spearman Correlations

We compute the complexity–RT Spearman correlation separately for each scale using that scale’s 20 items and fixed complexity scores:

\[\rho_{i,s} = \text{Spearman}\!\left(\text{rank}(c_1^{(s)}, \ldots, c_{ni}^{(s)}),\; \text{rank}(\text{RT}_{i,1}^{(s)}, \ldots, \text{RT}_{i,ni}^{(s)})\right)\]

No outlier capping is applied. Spearman’s rank correlation is robust to extreme values by construction — a single stall-contaminated RT moves to the top of that person’s rank order but does not inflate the correlation beyond its \([-1, 1]\) bounds. Capping would discard potentially informative signal about relative item difficulty.

Show code
# ── roc_auc, make_roc_df, kneedle_fn, apply_kneedle_flag are now defined in
# ── the setup chunk and available globally ────────────────────────────────────

# Attention-vs-A plot (filters results dynamically — does NOT use a pre-snapshot)
make_attention_plot <- function(col, label) {
  att_df <- results[results$is_non_attender == 0, ]
  na_q <- quantile(results[[col]][results$is_non_attender == 1],
                   probs = c(0.05, 0.25, 0.50, 0.75, 0.95), na.rm = TRUE)
  ggplot(att_df, aes(x = A, y = .data[[col]])) +
    annotate("rect", xmin = -Inf, xmax = Inf,
             ymin = na_q["5%"],  ymax = na_q["95%"],
             fill = "tomato", alpha = 0.08) +
    annotate("rect", xmin = -Inf, xmax = Inf,
             ymin = na_q["25%"], ymax = na_q["75%"],
             fill = "tomato", alpha = 0.12) +
    geom_hline(yintercept = na_q["50%"], colour = "tomato3",
               linetype = "dashed", linewidth = 0.7) +
    geom_point(aes(colour = factor(Z)), alpha = 0.35, size = 1.4) +
    geom_smooth(method = "loess", span = 0.6, se = TRUE,
                colour = "steelblue4", fill = "steelblue", alpha = 0.2) +
    scale_colour_manual(values = c("0" = "grey60", "1" = "steelblue"),
                        labels = c("Control", "Treatment"), name = "Arm") +
    labs(title = label, subtitle = "Red band = non-attender 5–95th pct",
         x = "Latent attention A", y = "Score")
}

# Subtype ridge plot (uses results$subtype — only call after subtype-label chunk)
make_subtype_ridge <- function(col, label) {
  subtype_order <- results |>
    group_by(subtype) |>
    summarise(med = median(.data[[col]], na.rm = TRUE), .groups = "drop") |>
    arrange(med) |>
    pull(subtype)
  results$subtype_f <- factor(results$subtype, levels = subtype_order)
  ggplot(results,
         aes(x = .data[[col]], y = subtype_f,
             fill = ifelse(subtype == "Attender", "Attender", "Non-attender"))) +
    ggridges::geom_density_ridges(
      alpha = 0.55, colour = "white", scale = 0.85,
      quantile_lines = TRUE, quantiles = 2) +
    geom_vline(xintercept = 0, linetype = "dashed",
               colour = "grey30", linewidth = 0.7) +
    scale_fill_manual(values = c("Attender" = "steelblue",
                                  "Non-attender" = "tomato"), name = NULL) +
    labs(title = label, subtitle = "Median line shown; dashed = 0",
         x = "Score", y = NULL) +
    theme(legend.position = "right")
}

# ── Per-scale Spearman rho ────────────────────────────────────────────────────
compute_spearman_scale <- function(dat, scale_idx) {
  ni      <- params$items_per_scale
  rt_cols <- paste0("RT_s", scale_idx, "_i", seq_len(ni))
  cx      <- instrument[[scale_idx]]$complexity
  rt_mat  <- as.matrix(dat[, rt_cols])
  cx_rank <- rank(cx)
  apply(rt_mat, 1, function(rt_i) {
    cor(cx_rank, rank(rt_i), method = "spearman")
  })
}

for (s in seq_len(ns)) {
  results[[paste0("spearman_s", s)]] <- compute_spearman_scale(dat, s)
}

cat("Spearman ρ summary (attenders vs non-attenders):\n")
Spearman ρ summary (attenders vs non-attenders):
Show code
for (col in c("spearman_s1", "spearman_s2")) {
  att <- results[[col]][results$is_non_attender == 0]
  non <- results[[col]][results$is_non_attender == 1]
  cat(sprintf("  %-20s  attender mean = %.3f  |  non-attender mean = %.3f\n",
              col, mean(att, na.rm = TRUE), mean(non, na.rm = TRUE)))
}
  spearman_s1           attender mean = 0.455  |  non-attender mean = -0.005
  spearman_s2           attender mean = 0.424  |  non-attender mean = 0.016

Computing Weighted Kendall τ

Standard Spearman ρ treats every rank inversion equally, so a noise-driven flip between two items whose complexity scores are nearly identical counts the same as a flip between an easy and a hard item. When item complexities cluster closely, even genuinely attentive respondents can accumulate many near-tie inversions from RT noise alone, driving ρ downward unfairly.

The complexity-weighted Kendall τ addresses this by scaling each concordant or discordant pair by the absolute complexity difference between the two items:

\[\tau_w = \frac{\displaystyle\sum_{i<j} |c_i - c_j|\cdot \text{sign}(c_i - c_j)\cdot\text{sign}(\text{RT}_i - \text{RT}_j)} {\displaystyle\sum_{i<j} |c_i - c_j|}\]

A flip between items 0.05 complexity-units apart contributes almost nothing; a flip between items 1.0 apart contributes fully. The statistic is still in \([-1, 1]\), anchored at 0 for random responding, and drops into the same threshold and mixture pipeline as ρ.

Show code
compute_weighted_tau <- function(dat, scale_idx) {
  ni      <- params$items_per_scale
  rt_cols <- paste0("RT_s", scale_idx, "_i", seq_len(ni))
  cx      <- instrument[[scale_idx]]$complexity
  rt_mat  <- as.matrix(dat[, rt_cols])

  pairs   <- combn(ni, 2)        # 2 × C(ni,2) matrix of index pairs
  cx_diff <- cx[pairs[1,]] - cx[pairs[2,]]
  w       <- abs(cx_diff)        # pair weights = |complexity difference|

  apply(rt_mat, 1, function(rt_i) {
    rt_diff     <- rt_i[pairs[1,]] - rt_i[pairs[2,]]
    concordance <- sign(cx_diff) * sign(rt_diff)
    sum(w * concordance) / sum(w)
  })
}

for (s in seq_len(ns)) {
  results[[paste0("wtau_s", s)]] <- compute_weighted_tau(dat, s)
}

cat("Weighted τ summary (attenders vs non-attenders):\n")
Weighted τ summary (attenders vs non-attenders):
Show code
for (col in c("wtau_s1", "wtau_s2")) {
  att <- results[[col]][results$is_non_attender == 0]
  non <- results[[col]][results$is_non_attender == 1]
  cat(sprintf("  %-12s  attender mean = %.3f  |  non-attender mean = %.3f\n",
              col, mean(att, na.rm = TRUE), mean(non, na.rm = TRUE)))
}
  wtau_s1       attender mean = 0.475  |  non-attender mean = -0.005
  wtau_s2       attender mean = 0.477  |  non-attender mean = 0.013
Show code
# Assign subtype here so all downstream chunks (including wtau comparisons) can use it
results$subtype <- with(results, case_when(
  is_non_attender == 0                         ~ "Attender",
  non_attender_style == "plausible_noise"      ~ "plausible_noise",
  TRUE                                         ~ non_attender_pattern_type
))

Comparing Spearman ρ and Weighted τ

Distributions

Show code
make_dist_panel <- function(col, label, xlab) {
  ggplot(results, aes(x = .data[[col]], fill = class_label)) +
    geom_density(alpha = 0.5, colour = NA) +
    geom_vline(xintercept = 0, linetype = "dashed",
               colour = "grey30", linewidth = 0.7) +
    scale_fill_manual(
      values = c("Attender (true)" = "steelblue",
                 "Non-attender (true)" = "tomato"), name = NULL) +
    labs(title = label, x = xlab, y = "Density")
}

(make_dist_panel("spearman_s1", "Spearman ρ — Scale 1", "Spearman ρ") +
 make_dist_panel("spearman_s2", "Spearman ρ — Scale 2", "Spearman ρ")) /
(make_dist_panel("wtau_s1",    "Weighted τ — Scale 1", "Weighted τ") +
 make_dist_panel("wtau_s2",    "Weighted τ — Scale 2", "Weighted τ"))

Distributions of Spearman ρ (top) and weighted τ (bottom) by true class, for each scale. If near-tie noise is inflating FPs, the weighted τ distributions should show better separation — a cleaner gap between attenders and non-attenders.

AUC and Descriptive Comparison

Show code
compare_features <- list(
  "ρ (S1)"   = "spearman_s1",
  "ρ (S2)"   = "spearman_s2",
  "τ_w (S1)" = "wtau_s1",
  "τ_w (S2)" = "wtau_s2"
)

compare_tab <- do.call(rbind, lapply(names(compare_features), function(nm) {
  col <- compare_features[[nm]]
  data.frame(
    Feature  = nm,
    AUC      = round(roc_auc(results[[col]], results$true_inattentive), 3),
    mean_att = round(mean(results[[col]][results$is_non_attender == 0], na.rm = TRUE), 3),
    mean_non = round(mean(results[[col]][results$is_non_attender == 1], na.rm = TRUE), 3),
    sd_att   = round(sd(results[[col]][results$is_non_attender == 0],   na.rm = TRUE), 3),
    sd_non   = round(sd(results[[col]][results$is_non_attender == 1],   na.rm = TRUE), 3)
  )
}))

kable(
  compare_tab,
  col.names = c("Feature", "AUC",
                "Mean (attenders)", "Mean (non-attenders)",
                "SD (attenders)",   "SD (non-attenders)"),
  caption = "Spearman ρ vs. weighted τ: discriminability by scale"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which.max(compare_tab$AUC), bold = TRUE, background = "#eef6ee")
Spearman ρ vs. weighted τ: discriminability by scale
Feature AUC Mean (attenders) Mean (non-attenders) SD (attenders) SD (non-attenders)
ρ (S1) 0.926 0.455 -0.005 0.205 0.232
ρ (S2) 0.902 0.424 0.016 0.212 0.230
τ_w (S1) 0.932 0.475 -0.005 0.205 0.236
τ_w (S2) 0.914 0.477 0.013 0.218 0.251

How Attention Moderates Weighted τ

Show code
att_results_w <- results[results$is_non_attender == 0, ]

make_attention_plot("wtau_s1", "Weighted τ — Scale 1") +
make_attention_plot("wtau_s2", "Weighted τ — Scale 2")

Weighted τ vs. latent attention A among attenders, for Scale 1 (left) and Scale 2 (right). The red band shows the non-attender 5–95th percentile range. Compare to the equivalent Spearman ρ plots above — the weighted τ should show a tighter attender band if near-tie noise was adding variance to ρ.

Weighted τ by Non-attender Subtype

Show code
make_subtype_ridge("wtau_s1", "Weighted τ — Scale 1") +
make_subtype_ridge("wtau_s2", "Weighted τ — Scale 2")

Weighted τ distributions by subtype for Scale 1 (left) and Scale 2 (right). Compare to the Spearman ρ subtype plots — subtypes that were hard to detect under ρ may show greater separation here.
Show code
wtau_subtype_tab <- results |>
  group_by(subtype) |>
  summarise(
    N          = n(),
    mean_rho_s1  = round(mean(spearman_s1, na.rm = TRUE), 3),
    mean_wtau_s1 = round(mean(wtau_s1,     na.rm = TRUE), 3),
    mean_rho_s2  = round(mean(spearman_s2, na.rm = TRUE), 3),
    mean_wtau_s2 = round(mean(wtau_s2,     na.rm = TRUE), 3),
    .groups = "drop"
  ) |>
  arrange(mean_wtau_s1)

kable(
  wtau_subtype_tab,
  col.names = c("Subtype", "N",
                "Mean ρ (S1)", "Mean τ_w (S1)",
                "Mean ρ (S2)", "Mean τ_w (S2)"),
  caption = "Spearman ρ vs. weighted τ by subtype"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(wtau_subtype_tab$subtype == "Attender"), bold = TRUE)
Spearman ρ vs. weighted τ by subtype
Subtype N Mean ρ (S1) Mean τ_w (S1) Mean ρ (S2) Mean τ_w (S2)
midpoint_default 53 -0.032 -0.039 -0.001 -0.027
uniform_random 25 -0.024 -0.037 0.058 0.045
straightline 114 -0.016 -0.018 0.013 0.025
plausible_noise 143 -0.012 -0.008 0.001 -0.008
acquiescent 54 -0.001 -0.003 0.013 0.008
near_straightline 57 0.004 0.008 0.069 0.090
diagonal 17 0.007 0.021 -0.027 -0.053
alternating 17 0.038 0.027 0.020 -0.007
biased_random 36 0.039 0.032 0.015 0.029
disacquiescent 25 0.030 0.037 0.021 -0.003
Attender 4459 0.455 0.475 0.424 0.477

Head-to-head FP / FN: Kneedle Threshold

Show code
# Apply Kneedle to Spearman ρ per scale (also needed here for comparison plots)
kn_s1 <- apply_kneedle_flag(results$spearman_s1)
kn_s2 <- apply_kneedle_flag(results$spearman_s2)

results$flag_kn_s1 <- kn_s1$flag
results$flag_kn_s2 <- kn_s2$flag

thresh_zero_s1   <- 0
thresh_zero_s2   <- 0
thresh_oracle_s1 <- quantile(results$spearman_s1[results$is_non_attender == 0], 0.05, na.rm = TRUE)
thresh_oracle_s2 <- quantile(results$spearman_s2[results$is_non_attender == 0], 0.05, na.rm = TRUE)

results$flag_zero_s1   <- results$spearman_s1 < thresh_zero_s1
results$flag_zero_s2   <- results$spearman_s2 < thresh_zero_s2
results$flag_oracle_s1 <- results$spearman_s1 < thresh_oracle_s1
results$flag_oracle_s2 <- results$spearman_s2 < thresh_oracle_s2

# Apply Kneedle to weighted τ per scale
kn_wtau_s1 <- apply_kneedle_flag(results$wtau_s1)
kn_wtau_s2 <- apply_kneedle_flag(results$wtau_s2)

results$flag_kn_wtau_s1 <- kn_wtau_s1$flag
results$flag_kn_wtau_s2 <- kn_wtau_s2$flag

thresh_zero_wtau_s1   <- 0
thresh_zero_wtau_s2   <- 0
thresh_oracle_wtau_s1 <- quantile(results$wtau_s1[results$is_non_attender == 0], 0.05, na.rm = TRUE)
thresh_oracle_wtau_s2 <- quantile(results$wtau_s2[results$is_non_attender == 0], 0.05, na.rm = TRUE)

results$flag_zero_wtau_s1   <- results$wtau_s1 < thresh_zero_wtau_s1
results$flag_zero_wtau_s2   <- results$wtau_s2 < thresh_zero_wtau_s2
results$flag_oracle_wtau_s1 <- results$wtau_s1 < thresh_oracle_wtau_s1
results$flag_oracle_wtau_s2 <- results$wtau_s2 < thresh_oracle_wtau_s2

cat(sprintf("Kneedle thresholds — ρ:    S1 = %.3f | S2 = %.3f\n",
            kn_s1$threshold, kn_s2$threshold))
Kneedle thresholds — ρ:    S1 = 0.173 | S2 = 0.135
Show code
cat(sprintf("Kneedle thresholds — τ_w:  S1 = %.3f | S2 = %.3f\n",
            kn_wtau_s1$threshold, kn_wtau_s2$threshold))
Kneedle thresholds — τ_w:  S1 = 0.168 | S2 = 0.205
Show code
# Head-to-head performance table
head2head <- do.call(rbind, lapply(list(
    list(flag = results$flag_kn_s1,      label = "ρ (S1)"),
    list(flag = results$flag_kn_wtau_s1, label = "τ_w (S1)"),
    list(flag = results$flag_kn_s2,      label = "ρ (S2)"),
    list(flag = results$flag_kn_wtau_s2, label = "τ_w (S2)")
  ), function(x) {
  cs <- confusion_stats(x$flag, results$true_inattentive)
  data.frame(
    Feature     = x$label,
    Sensitivity = round(cs$Sensitivity, 3),
    Specificity = round(cs$Specificity, 3),
    Pct_Flagged = round(cs$Flagged_pct, 1),
    TP = cs$TP, FP = cs$FP, TN = cs$TN, FN = cs$FN
  )
}))

kable(
  head2head,
  col.names = c("Feature", "Sensitivity", "Specificity",
                "% Flagged", "TP", "FP", "TN", "FN"),
  caption = "Head-to-head: Spearman ρ vs. weighted τ (Kneedle threshold per scale)"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which.max(head2head$Sensitivity), bold = TRUE, background = "#eef6ee")
Head-to-head: Spearman ρ vs. weighted τ (Kneedle threshold per scale)
Feature Sensitivity Specificity % Flagged TP FP TN FN
ρ (S1) 0.769 0.906 16.7 416 421 4038 125
τ_w (S1) 0.767 0.921 15.4 415 353 4106 126
ρ (S2) 0.664 0.906 15.6 359 421 4038 182
τ_w (S2) 0.765 0.887 18.3 414 503 3956 127
Show code
make_fp_comparison_plot <- function(flag_rho, flag_wtau, scale_label) {
  results |>
    filter(is_non_attender == 0) |>
    tidyr::pivot_longer(
      cols      = c(all_of(flag_rho), all_of(flag_wtau)),
      names_to  = "feature",
      values_to = "flagged"
    ) |>
    mutate(feature = factor(feature,
      levels = c(flag_rho, flag_wtau),
      labels = c("Spearman ρ (Kneedle)", "Weighted τ (Kneedle)"))) |>
    ggplot(aes(x = A, fill = flagged)) +
    geom_density(alpha = 0.55, colour = NA) +
    scale_fill_manual(
      values = c("TRUE" = "tomato", "FALSE" = "steelblue"),
      labels = c("TRUE" = "FP (flagged)", "FALSE" = "TN (retained)"),
      name = NULL) +
    facet_wrap(~ feature, ncol = 2) +
    labs(title = scale_label, x = "Latent attention A", y = "Density")
}

make_fp_comparison_plot("flag_kn_s1", "flag_kn_wtau_s1", "Scale 1") /
make_fp_comparison_plot("flag_kn_s2", "flag_kn_wtau_s2", "Scale 2")

Attention distribution of false positives under Spearman ρ vs. weighted τ (Kneedle threshold), for Scale 1 (top) and Scale 2 (bottom). If weighted τ is reducing near-tie FPs, its FP group should be more concentrated at genuinely low-A attenders.
Show code
rho_wtau_subtype <- results |>
  filter(is_non_attender == 1) |>
  group_by(subtype) |>
  summarise(
    N            = n(),
    det_rho_s1   = round(mean(flag_kn_s1),      3),
    det_wtau_s1  = round(mean(flag_kn_wtau_s1), 3),
    det_rho_s2   = round(mean(flag_kn_s2),      3),
    det_wtau_s2  = round(mean(flag_kn_wtau_s2), 3),
    .groups = "drop"
  ) |>
  arrange(desc(det_wtau_s1))

kable(
  rho_wtau_subtype,
  col.names = c("Subtype", "N",
                "Det. ρ (S1)", "Det. τ_w (S1)",
                "Det. ρ (S2)", "Det. τ_w (S2)"),
  caption = "Detection rates by subtype: Spearman ρ vs. weighted τ (Kneedle)"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(rho_wtau_subtype$det_wtau_s1 < 0.5),  color = "tomato") |>
  row_spec(which(rho_wtau_subtype$det_wtau_s1 >= 0.5), color = "steelblue")
Detection rates by subtype: Spearman ρ vs. weighted τ (Kneedle)
Subtype N Det. ρ (S1) Det. τ_w (S1) Det. ρ (S2) Det. τ_w (S2)
near_straightline 57 0.842 0.825 0.596 0.649
straightline 114 0.772 0.781 0.667 0.737
acquiescent 54 0.815 0.778 0.685 0.833
midpoint_default 53 0.774 0.774 0.698 0.811
plausible_noise 143 0.748 0.762 0.713 0.804
disacquiescent 25 0.800 0.760 0.680 0.800
uniform_random 25 0.760 0.760 0.600 0.720
alternating 17 0.647 0.706 0.529 0.647
diagonal 17 0.765 0.706 0.647 0.882
biased_random 36 0.694 0.694 0.583 0.722
Show code
make_rho_wtau_roc_panel <- function(rho_col, wtau_col, truth, scale_label) {
  r_rho  <- make_roc_df(results[[rho_col]],  truth, "Spearman ρ")
  r_wtau <- make_roc_df(results[[wtau_col]], truth, "Weighted τ")
  auc_rho  <- round(roc_auc(results[[rho_col]],  truth), 3)
  auc_wtau <- round(roc_auc(results[[wtau_col]], truth), 3)
  roc_df <- rbind(r_rho, r_wtau)

  ggplot(roc_df, aes(x = fpr, y = tpr, colour = variant)) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey60") +
    geom_path(linewidth = 0.9) +
    scale_colour_manual(
      values = c("Spearman ρ" = "steelblue", "Weighted τ" = "darkorange"),
      labels = c(
        sprintf("Spearman ρ  (AUC = %.3f)", auc_rho),
        sprintf("Weighted τ  (AUC = %.3f)", auc_wtau)
      ),
      name = NULL) +
    labs(title = scale_label,
         x = "False Positive Rate (1 − Specificity)",
         y = "Sensitivity (True Positive Rate)") +
    theme(legend.position = "bottom")
}

make_rho_wtau_roc_panel("spearman_s1", "wtau_s1", results$true_inattentive, "Scale 1") +
make_rho_wtau_roc_panel("spearman_s2", "wtau_s2", results$true_inattentive, "Scale 2")

ROC curves for Spearman ρ and weighted τ on Scale 1 (left) and Scale 2 (right). A higher AUC for τ_w confirms that down-weighting near-tie rank inversions improves discrimination.

Computing Within-person RT Regression Features

Spearman ρ measures whether complexity and RT co-vary in rank order, but it does not quantify how much unexplained RT variation remains after accounting for complexity. To capture this, we fit a within-person OLS regression of log-transformed RT on item complexity alone, pooled across all 40 items:

\[\log(\text{RT}_{ij}) = \alpha_i + \beta_i \cdot c_j + \varepsilon_{ij}\]

This gives two per-person features:

  • \(\hat\beta_i\)complexity slope: how much log-RT increases per unit of complexity. Positive for attenders; near zero for non-attenders who are insensitive to item difficulty.
  • MARmean absolute residual \(\frac{1}{n}\sum_j|\hat\varepsilon_{ij}|\): average unexplained log-RT variance after accounting for complexity. High MAR means RT is highly variable in ways that complexity alone cannot predict.

Note that MAR and Spearman ρ are complementary rather than redundant. ρ captures whether the rank order of RTs matches the rank order of complexity; MAR captures how much absolute deviation remains around the complexity trend, regardless of whether that trend is positive. A respondent could have a moderate positive ρ but high MAR (complexity partially predicts their RTs but with a lot of scatter), or low ρ and low MAR (flat RT pattern with little variance — characteristic of patterned non-attenders).

Show code
compute_rt_regression <- function(dat) {
  ni  <- params$items_per_scale

  cx_all <- unlist(lapply(seq_len(ns), function(s) instrument[[s]]$complexity))

  rt_cols_all <- unlist(lapply(seq_len(ns), function(s)
    paste0("RT_s", s, "_i", seq_len(ni))))

  rt_mat <- as.matrix(dat[, rt_cols_all])

  # Fit OLS per person: log(RT) ~ complexity
  # Use vapply to guarantee a 2-row matrix regardless of nrow(dat)
  out <- vapply(seq_len(nrow(rt_mat)), function(i) {
    rt_i   <- rt_mat[i, ]
    log_rt <- log(pmax(rt_i, 1e-6))

    fit <- tryCatch(
      lm.fit(x = cbind(1, cx_all), y = log_rt),
      error = function(e) NULL
    )

    if (is.null(fit) || length(fit$coefficients) < 2) {
      return(c(beta = NA_real_, mar = NA_real_))
    }

    c(beta = fit$coefficients[[2]],
      mar  = mean(abs(fit$residuals), na.rm = TRUE))
  }, numeric(2))

  # out is 2 x nrow(dat); transpose to nrow x 2
  out <- t(out)
  colnames(out) <- c("beta", "mar")

  data.frame(
    reg_beta = out[, "beta"],
    reg_mar  = out[, "mar"]
  )
}

reg_feats       <- compute_rt_regression(dat)
results$reg_beta <- reg_feats$reg_beta
results$reg_mar  <- reg_feats$reg_mar

cat("Regression feature summary (attenders vs non-attenders):\n")
Regression feature summary (attenders vs non-attenders):
Show code
for (col in c("reg_beta", "reg_mar")) {
  att <- results[[col]][results$is_non_attender == 0]
  non <- results[[col]][results$is_non_attender == 1]
  cat(sprintf("  %-12s  attender mean = %+.3f  |  non-attender mean = %+.3f\n",
              col, mean(att, na.rm = TRUE), mean(non, na.rm = TRUE)))
}
  reg_beta      attender mean = +0.200  |  non-attender mean = +0.002
  reg_mar       attender mean = +0.277  |  non-attender mean = +0.195

Understanding the Feature: Instrument Complexity Structure

Show code
cx_df <- do.call(rbind, lapply(seq_len(ns), function(s) {
  data.frame(
    scale      = paste0("Scale ", s),
    item       = seq_len(ni),
    complexity = instrument[[s]]$complexity
  )
}))

ggplot(cx_df, aes(x = item, y = complexity)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") +
  geom_point(size = 2.5, colour = "steelblue") +
  geom_segment(aes(xend = item, yend = 0), linewidth = 0.4, alpha = 0.5,
               colour = "steelblue") +
  facet_wrap(~ scale, ncol = 2) +
  labs(title    = "Item complexity scores by scale",
       subtitle = sprintf(
         "complexity_rt_weight = %g — each SD of complexity multiplies RT by exp(%g × A)",
         params$complexity_rt_weight, params$complexity_rt_weight),
       x = "Item index", y = "Complexity score")

Per-item complexity scores for each scale. Items to the right are harder and should elicit longer RTs from attentive respondents.

Feature Distributions: Scale 1 and Scale 2

Show code
rho_variants <- list(
  "Scale 1" = "spearman_s1",
  "Scale 2" = "spearman_s2"
)

plots_all <- lapply(names(rho_variants), function(label) {
  col <- rho_variants[[label]]
  ggplot(results, aes(x = .data[[col]], fill = class_label)) +
    geom_density(alpha = 0.5, colour = NA) +
    geom_vline(xintercept = 0, linetype = "dashed",
               colour = "grey30", linewidth = 0.7) +
    scale_fill_manual(
      values = c("Attender (true)" = "steelblue",
                 "Non-attender (true)" = "tomato"), name = NULL) +
    labs(title = label, x = "Spearman ρ", y = "Density")
})

wrap_plots(plots_all, ncol = 2)

Distributions of Spearman ρ for Scale 1 (left) and Scale 2 (right) by true attender status. The dashed line marks ρ = 0. Wider separation between blue and red distributions indicates better discriminative power.

Discriminability Comparison

```codhdawkpuihhqbhwyrbapnkpvbjlvtauc_tab <- data.frame( Variant = names(rho_variants), n_items = c(ni, ni), AUC = sapply(rho_variants, function(col) round(roc_auc(results[[col]], results\(true_inattentive), 3)), mean_att = sapply(rho_variants, function(col) round(mean(results[[col]][results\)is_non_attender == 0], na.rm = TRUE), 3)), mean_non = sapply(rho_variants, function(col) round(mean(results[[col]][results\(is_non_attender == 1], na.rm = TRUE), 3)), sd_att = sapply(rho_variants, function(col) round(sd(results[[col]][results\)is_non_attender == 0], na.rm = TRUE), 3)), sd_non = sapply(rho_variants, function(col) round(sd(results[[col]][results$is_non_attender == 1], na.rm = TRUE), 3)), row.names = NULL )

kable( auc_tab, col.names = c(“Scale”, “Items”, “AUC”, “Mean ρ (attenders)”, “Mean ρ (non-attenders)”, “SD ρ (attenders)”, “SD ρ (non-attenders)”), caption = “Discriminability of Spearman ρ by scale” ) |> kable_styling(full_width = FALSE) |> row_spec(which.max(auc_tab$AUC), bold = TRUE, background = “#eef6ee”)


## How Attention Moderates Each Scale


::: {.cell}

```{.r .cell-code}
make_attention_plot("spearman_s1", "Scale 1") +
make_attention_plot("spearman_s2", "Scale 2")

Spearman ρ for Scale 1 (left) and Scale 2 (right) vs. latent attention A among attenders. The red band shows the non-attender 5–95th percentile range.

:::

Feature by Non-attender Subtype

Show code
make_subtype_ridge("spearman_s1", "Scale 1") +
make_subtype_ridge("spearman_s2", "Scale 2")

Spearman ρ distributions by subtype (ridge plots), shown separately for Scale 1 (left) and Scale 2 (right). Subtypes whose distribution overlaps heavily with attenders will be harder to detect.
Show code
subtype_sp_tab <- results |>
  group_by(subtype) |>
  summarise(
    N       = n(),
    mean_s1 = round(mean(spearman_s1, na.rm = TRUE), 3),
    mean_s2 = round(mean(spearman_s2, na.rm = TRUE), 3),
    .groups = "drop"
  ) |>
  arrange(mean_s1)

kable(
  subtype_sp_tab,
  col.names = c("Subtype", "N", "Mean ρ (S1)", "Mean ρ (S2)"),
  caption   = "Mean Spearman ρ by scale and subtype"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(subtype_sp_tab$subtype == "Attender"), bold = TRUE)
Mean Spearman ρ by scale and subtype
Subtype N Mean ρ (S1) Mean ρ (S2)
midpoint_default 53 -0.032 -0.001
uniform_random 25 -0.024 0.058
straightline 114 -0.016 0.013
plausible_noise 143 -0.012 0.001
acquiescent 54 -0.001 0.013
near_straightline 57 0.004 0.069
diagonal 17 0.007 -0.027
disacquiescent 25 0.030 0.021
alternating 17 0.038 0.020
biased_random 36 0.039 0.015
Attender 4459 0.455 0.424

Regression Feature Distributions

Show code
p_beta <- ggplot(results, aes(x = reg_beta, fill = class_label)) +
  geom_density(alpha = 0.5, colour = NA) +
  geom_vline(xintercept = 0, linetype = "dashed",
             colour = "grey30", linewidth = 0.7) +
  scale_fill_manual(
    values = c("Attender (true)" = "steelblue",
               "Non-attender (true)" = "tomato"), name = NULL) +
  labs(title    = "β̂ — complexity slope",
       subtitle = "log(RT) increase per unit complexity",
       x = "β̂", y = "Density")

p_mar <- ggplot(results, aes(x = reg_mar, fill = class_label)) +
  geom_density(alpha = 0.5, colour = NA) +
  scale_fill_manual(
    values = c("Attender (true)" = "steelblue",
               "Non-attender (true)" = "tomato"), name = NULL) +
  labs(title    = "MAR — mean absolute residual",
       subtitle = "Unexplained log(RT) variance after complexity-only regression",
       x = "MAR", y = "Density")

p_beta + p_mar

Distributions of the two within-person regression features by true attender status. β (complexity slope) should be positive for attenders and near zero for non-attenders. MAR should be lower for attenders whose RTs follow a clean complexity-sensitive pattern.
Show code
att_only <- results[results$is_non_attender == 0, ]

make_reg_attention_plot <- function(col, label, ylab) {
  na_q <- quantile(results[[col]][results$is_non_attender == 1],
                   probs = c(0.05, 0.25, 0.50, 0.75, 0.95), na.rm = TRUE)
  ggplot(att_only, aes(x = A, y = .data[[col]])) +
    annotate("rect", xmin = -Inf, xmax = Inf,
             ymin = na_q["5%"],  ymax = na_q["95%"],
             fill = "tomato", alpha = 0.08) +
    annotate("rect", xmin = -Inf, xmax = Inf,
             ymin = na_q["25%"], ymax = na_q["75%"],
             fill = "tomato", alpha = 0.12) +
    geom_hline(yintercept = na_q["50%"], colour = "tomato3",
               linetype = "dashed", linewidth = 0.7) +
    geom_point(aes(colour = factor(Z)), alpha = 0.35, size = 1.4) +
    geom_smooth(method = "loess", span = 0.6, se = TRUE,
                colour = "steelblue4", fill = "steelblue", alpha = 0.2) +
    geom_vline(xintercept = 0, colour = NA) +
    scale_colour_manual(values = c("0" = "grey60", "1" = "steelblue"),
                        labels = c("Control", "Treatment"), name = "Arm") +
    labs(title = label,
         subtitle = "Red band = non-attender 5–95th pct",
         x = "Latent attention A", y = ylab)
}

make_reg_attention_plot("reg_beta", "β̂ vs. latent attention", "β̂ (complexity slope)")

β̂ vs. latent attention A among attenders. The feature should increase with A because more attentive respondents are more sensitive to item difficulty. The red band shows the non-attender distribution for reference.

Regression Features by Non-attender Subtype

Show code
subtype_reg_tab <- results |>
  group_by(subtype) |>
  summarise(
    N          = n(),
    mean_beta  = round(mean(reg_beta, na.rm = TRUE), 3),
    mean_mar   = round(mean(reg_mar,  na.rm = TRUE), 3),
    .groups    = "drop"
  ) |>
  arrange(mean_beta)

kable(
  subtype_reg_tab,
  col.names = c("Subtype", "N", "Mean β̂", "Mean MAR"),
  caption   = "Within-person regression features by group and subtype"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(subtype_reg_tab$subtype == "Attender"), bold = TRUE)
Within-person regression features by group and subtype
Subtype N Mean β | Mean MA
midpoint_default 53 -0.009 0.191
diagonal 17 -0.006 0.189
plausible_noise 143 0.000 0.198
acquiescent 54 0.002 0.194
straightline 114 0.002 0.194
uniform_random 25 0.004 0.202
alternating 17 0.006 0.197
disacquiescent 25 0.006 0.194
biased_random 36 0.009 0.196
near_straightline 57 0.010 0.196
Attender 4459 0.200 0.277

AUC of Regression Features vs. Spearman ρ

Show code
all_features <- list(
  "ρ (Scale 1)"  = list(col = "spearman_s1", direction = "low"),
  "ρ (Scale 2)"  = list(col = "spearman_s2", direction = "low"),
  "β̂ (slope)"   = list(col = "reg_beta",    direction = "low"),
  "MAR"           = list(col = "reg_mar",     direction = "high")
)

feat_auc_tab <- do.call(rbind, lapply(names(all_features), function(nm) {
  col  <- all_features[[nm]]$col
  dir  <- all_features[[nm]]$direction
  scores <- if (dir == "low") results[[col]] else -results[[col]]
  auc    <- roc_auc(scores, results$true_inattentive)
  data.frame(
    Feature   = nm,
    Direction = ifelse(dir == "low", "Low → inattentive", "High → inattentive"),
    AUC       = round(auc, 3),
    mean_att  = round(mean(results[[col]][results$is_non_attender == 0], na.rm=TRUE), 3),
    mean_non  = round(mean(results[[col]][results$is_non_attender == 1], na.rm=TRUE), 3)
  )
}))

kable(
  feat_auc_tab,
  col.names = c("Feature", "Flag direction", "AUC",
                "Mean (attenders)", "Mean (non-attenders)"),
  caption   = "Individual-feature AUCs: Spearman ρ (per scale) and regression features"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which.max(feat_auc_tab$AUC), bold = TRUE,
           background = "#eef6ee")
Individual-feature AUCs: Spearman ρ (per scale) and regression features
Feature Flag direction AUC Mean (attenders) Mean (non-attenders)
ρ (Scale 1) Low → inattentive 0.926 0.455 -0.005
ρ (Scale 2) Low → inattentive 0.902 0.424 0.016
β̂ (slope) |Low → inattentive | 0.98 | 0.20 | 0.00
MAR High → inattentive 0.020 0.277 0.195