Detecting Inattentive Respondents: Complexity–RT Correlation

Can item complexity × response time predict attention?

Author

Jamie C. Lee

Published

April 27, 2026

Overview

This document evaluates two complementary RT-based features for detecting inattentive survey respondents:

  1. Spearman ρ — the rank correlation between item complexity and per-item response time. Attentive respondents take longer on harder items; non-attenders’ RTs are driven by clicking rhythm, not difficulty.
  2. MAR (mean absolute residual) — the average unexplained log-RT variance after fitting a within-person regression of log(RT) on item complexity alone. High MAR means RT varies in ways that complexity cannot explain, which is a signal of non-systematic (inattentive) responding.

Both features treat RT solely as a function of item complexity — no reverse-coding term is included. The question addressed here is whether combining these two features in a mixture model improves classification of attentive vs. inattentive respondents compared to using Spearman ρ alone.

In this simulation, item complexity is a fixed property of the instrument (instrument[[s]]$complexity), and per-item RTs are available as RT_s{s}_i{j}. The complexity effect on RT is built into the data-generating process for attenders: at full attention (\(A = 1\)), each SD of complexity multiplies per-item RT by \(\exp(\texttt{complexity\_rt\_weight} \times A)\). At \(A = 0\) the effect vanishes. Non-attenders are entirely unaffected by complexity.

The simulation was generated with the following complexity parameters:

  • item_complexity_mean = 0, item_complexity_sd = 1
  • complexity_rt_weight = 0.3

Approach summary. Spearman ρ and MAR are both derived from the complexity–RT relationship but capture different aspects of it: ρ captures whether complexity and RT co-vary in rank order; MAR captures how much unexplained residual variance remains after accounting for complexity. Their combination in a 2-feature mixture model is evaluated against ρ alone (Kneedle threshold) and against the existing Laz.R-based and Zh-based approaches in the final comparison section.


Simulation Ground Truth

Show code
class_tab <- data.frame(
  Group = c("Attenders", "Non-attenders"),
  N     = c(sum(dat$is_non_attender == 0), sum(dat$is_non_attender == 1)),
  Pct   = round(100 * c(mean(dat$is_non_attender == 0),
                         mean(dat$is_non_attender == 1)), 1)
)

na_dat  <- dat[dat$is_non_attender == 1, ]
sub_tab <- as.data.frame(table(
  Style   = na_dat$non_attender_style,
  Pattern = na_dat$non_attender_pattern_type
))
sub_tab <- sub_tab[sub_tab$Freq > 0, ]

kable(class_tab, col.names = c("Group", "N", "%"),
      caption = "Sample composition") |>
  kable_styling(full_width = FALSE)
Sample composition
Group N %
Attenders 4244 84.9
Non-attenders 756 15.1
Show code
kable(sub_tab, caption = "Non-attender subtypes") |>
  kable_styling(full_width = FALSE)
Non-attender subtypes
Style Pattern Freq
2 with_pattern acquiescent 80
4 with_pattern alternating 25
6 with_pattern biased_random 43
8 with_pattern diagonal 26
10 with_pattern disacquiescent 28
12 with_pattern midpoint_default 72
14 with_pattern near_straightline 84
16 with_pattern straightline 140
18 with_pattern uniform_random 43

Benchmarking Established Attention-Checking Methods

Before evaluating our proposed complexity–RT features in detail, we benchmark six established measures of inattentive responding against one another and against the proposed Spearman ρ and weighted τ. All measures are computed across all 40 items (both scales pooled), so that comparisons are made on an equal footing with our own features.

The five established measures are:

  1. LongString — the length of the longest uninterrupted run of identical responses in a person’s item sequence. Patterned non-attenders (straightliners, alternators) produce unusually long runs; attenders and random responders do not.
  2. IRV (Intra-individual Response Variability) — the standard deviation of a person’s raw item responses across all items. Non-attenders who straightline or give near-constant responses show very low IRV; random or biased-random responders may show elevated IRV.
  3. Mahalanobis’ Distance on RTs — the multivariate distance between a respondent’s per-item RT vector and the sample mean RT vector, scaled by the sample covariance of RTs. This identifies respondents whose RT profile across items is a multivariate outlier relative to the group — flagging both uniformly fast and unusually patterned RT vectors.
  4. Laz.R — sequential transitional predictability of item responses (defined in the main analysis below). High Laz.R signals mechanically regular response sequences.
  5. Zh — the standardised person-fit statistic from a Graded Response Model (Drasgow et al., 1987). Strongly negative values indicate responses inconsistent with any plausible latent trait.

Our two proposed features — Spearman ρ and weighted τ — are included as the final two columns in all comparison tables and highlighted in green for easy identification.

Note on item pooling. All measures below are computed over all 40 items simultaneously (both scales pooled into one item string per respondent), matching the full-instrument condition against which we evaluate sensitivity in subsequent sections.

Show code
# ── Compute all benchmark measures across all items ───────────────────────────
ni_total  <- ns * ni   # total items across both scales

# All item response columns
y_cols_all  <- unlist(lapply(seq_len(ns), function(s)
  paste0("Y_s", s, "_i", seq_len(ni))))
rt_cols_all <- unlist(lapply(seq_len(ns), function(s)
  paste0("RT_s", s, "_i", seq_len(ni))))
cx_all      <- unlist(lapply(seq_len(ns), function(s)
  instrument[[s]]$complexity))

resp_mat <- as.matrix(dat[, y_cols_all])
rt_mat   <- as.matrix(dat[, rt_cols_all])

# ── 1. LongString ─────────────────────────────────────────────────────────────
longstring <- function(x) {
  if (all(is.na(x))) return(NA_integer_)
  x   <- x[!is.na(x)]
  rle_x <- rle(x)
  max(rle_x$lengths)
}
results$longstring <- apply(resp_mat, 1, longstring)

# ── 2. IRV ────────────────────────────────────────────────────────────────────
results$irv <- apply(resp_mat, 1, sd, na.rm = TRUE)

# ── 3. Mahalanobis Distance on RTs ───────────────────────────────────────────
# Compute on log-RT to reduce extreme-RT leverage; use robust estimate for
# the covariance to guard against non-attender contamination.
log_rt_mat <- log(pmax(rt_mat, 1e-6))

# Mean vector and covariance across respondents
rt_mean <- colMeans(log_rt_mat, na.rm = TRUE)
rt_cov  <- tryCatch(
  cov(log_rt_mat, use = "pairwise.complete.obs"),
  error = function(e) diag(ncol(log_rt_mat))
)
# Regularise: add small ridge in case of near-singularity
rt_cov_reg <- rt_cov + diag(1e-4, nrow(rt_cov))

results$mahal_rt <- tryCatch({
  mahal_vals <- mahalanobis(log_rt_mat, center = rt_mean, cov = rt_cov_reg)
  mahal_vals
}, error = function(e) {
  # Fallback: Euclidean distance from mean (scaled)
  apply(sweep(log_rt_mat, 2, rt_mean)^2, 1, sum)
})

# ── 4. Laz.R (pooled across both scales) ─────────────────────────────────────
lazr_pooled <- apply(resp_mat, 1, Laz.R)
results$lazr_pooled <- lazr_pooled

# ── 5. Zh (iterative purification GRM, computed here for benchmarking) ────────
# NOTE: moved to chunk benchmark-zh below (cache: true) to avoid re-running
#       on every render. Placeholder so downstream summary cat() doesn't error.
results$zh <- NA_real_   # will be overwritten by benchmark-zh chunk

# ── Spearman rho and weighted tau pooled (all items) ─────────────────────────
cx_rank_all <- rank(cx_all)

results$spearman_pooled <- apply(rt_mat, 1, function(rt_i) {
  cor(cx_rank_all, rank(rt_i), method = "spearman")
})

# Weighted tau pooled
pairs_all <- combn(ni_total, 2)
cx_diff_all <- cx_all[pairs_all[1,]] - cx_all[pairs_all[2,]]
w_all       <- abs(cx_diff_all)

results$wtau_pooled <- apply(rt_mat, 1, function(rt_i) {
  rt_diff     <- rt_i[pairs_all[1,]] - rt_i[pairs_all[2,]]
  concordance <- sign(cx_diff_all) * sign(rt_diff)
  sum(w_all * concordance) / sum(w_all)
})

cat("All benchmark measures computed.\n")
All benchmark measures computed.
Show code
cat(sprintf("  LongString  : attender mean = %.2f  |  non-attender mean = %.2f\n",
  mean(results$longstring[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$longstring[results$is_non_attender == 1], na.rm=TRUE)))
  LongString  : attender mean = 3.95  |  non-attender mean = 9.99
Show code
cat(sprintf("  IRV         : attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$irv[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$irv[results$is_non_attender == 1], na.rm=TRUE)))
  IRV         : attender mean = 1.052  |  non-attender mean = 0.924
Show code
cat(sprintf("  Mahal RT    : attender mean = %.2f  |  non-attender mean = %.2f\n",
  mean(results$mahal_rt[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$mahal_rt[results$is_non_attender == 1], na.rm=TRUE)))
  Mahal RT    : attender mean = 42.28  |  non-attender mean = 26.95
Show code
cat(sprintf("  Laz.R (pool): attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$lazr_pooled[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$lazr_pooled[results$is_non_attender == 1], na.rm=TRUE)))
  Laz.R (pool): attender mean = 0.361  |  non-attender mean = 0.613
Show code
cat(sprintf("  Zh          : attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$zh[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$zh[results$is_non_attender == 1], na.rm=TRUE)))
  Zh          : attender mean = NaN  |  non-attender mean = NaN
Show code
cat(sprintf("  Spearman ρ  : attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$spearman_pooled[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$spearman_pooled[results$is_non_attender == 1], na.rm=TRUE)))
  Spearman ρ  : attender mean = 0.445  |  non-attender mean = -0.007
Show code
cat(sprintf("  Weighted τ  : attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$wtau_pooled[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$wtau_pooled[results$is_non_attender == 1], na.rm=TRUE)))
  Weighted τ  : attender mean = 0.477  |  non-attender mean = -0.007
Show code
# ── Shared lookup objects used by all benchmark chunks ────────────────────────
# Direction of each measure: high_bad = TRUE means HIGH scores → flag inattentive
# Mahalanobis: high_bad = FALSE because LOW D² → inattentive in this simulation
# (non-attenders cluster near the RT centroid; attenders are the outliers)
bench_direction <- list(
  longstring      = list(col = "longstring",      high_bad = TRUE),
  irv             = list(col = "irv",             high_bad = FALSE),
  mahal_rt        = list(col = "mahal_rt",        high_bad = FALSE),
  lazr_pooled     = list(col = "lazr_pooled",     high_bad = TRUE),
  zh              = list(col = "zh",              high_bad = FALSE),
  spearman_pooled = list(col = "spearman_pooled", high_bad = FALSE),
  wtau_pooled     = list(col = "wtau_pooled",     high_bad = FALSE)
)

bench_labels <- c(
  longstring      = "LongString",
  irv             = "IRV",
  mahal_rt        = "Mahalanobis D² (RT)",
  lazr_pooled     = "Laz.R",
  zh              = "Zh",
  spearman_pooled = "Spearman ρ ★",
  wtau_pooled     = "Weighted τ ★"
)
Show code
# Zh via iterative-purification GRM. Cached so re-renders are instant.
compute_zh_purified <- function(dat, scale_idx) {
  ni_s      <- params$items_per_scale
  K_s       <- params$n_categories
  y_cols_s  <- paste0("Y_s", scale_idx, "_i", seq_len(ni_s))
  rev_items <- instrument[[scale_idx]]$reverse
  resp_s              <- as.matrix(dat[, y_cols_s])
  resp_scored         <- resp_s
  resp_scored[, rev_items] <- K_s + 1 - resp_s[, rev_items]
  resp_df   <- as.data.frame(resp_scored)

  fit1      <- mirt(resp_df, model = 1, itemtype = "graded", verbose = FALSE)
  pf1       <- personfit(fit1, method = "ML")
  keep_mask <- !is.finite(pf1$Zh) | pf1$Zh > -2
  cat(sprintf("Scale %d — Round 1: %d flagged, refitting on %d of %d\n",
              scale_idx, sum(!keep_mask), sum(keep_mask), nrow(dat)))

  fit2           <- mirt(resp_df[keep_mask, ], model = 1,
                         itemtype = "graded", verbose = FALSE)
  pars_fixed     <- mod2values(fit2)
  pars_fixed$est <- FALSE
  fit2_all       <- mirt(resp_df, model = 1, itemtype = "graded",
                         verbose = FALSE, pars = pars_fixed, TOL = NA)
  pf2 <- personfit(fit2_all, method = "ML")
  cat(sprintf("Scale %d — Round 2: Zh range [%.2f, %.2f], mean = %.3f\n",
              scale_idx, min(pf2$Zh, na.rm=TRUE),
              max(pf2$Zh, na.rm=TRUE), mean(pf2$Zh, na.rm=TRUE)))
  pf2$Zh
}

zh_list    <- lapply(seq_len(ns), compute_zh_purified, dat = dat)
Scale 1 — Round 1: 449 flagged, refitting on 4551 of 5000
Scale 1 — Round 2: Zh range [-12.99, 3.58], mean = -0.294
Scale 2 — Round 1: 438 flagged, refitting on 4562 of 5000
Scale 2 — Round 2: Zh range [-11.03, 3.58], mean = -0.251
Show code
results$zh <- rowMeans(do.call(cbind, zh_list), na.rm = TRUE)
cat("Zh computed and stored in results$zh\n")
Zh computed and stored in results$zh

Distributions by True Class

Each panel below shows the attender (blue) and non-attender (red) density distributions for a single measure. Better measures show greater separation between the two distributions.

Note the direction of each measure’s signal: LongString and Laz.R are high → inattentive; Mahal RT is low → inattentive in this simulation — non-attenders cluster near the RT centroid (their flat/random profiles are close to the sample mean), so unusually low D² flags inattention; IRV is low → inattentive for straightliners but high → inattentive for random responders (bimodal non-attender distribution expected); Zh is low → inattentive; Spearman ρ and weighted τ are low → inattentive.

Show code
bench_measures <- list(
  list(col = "longstring",     label = "LongString",
       xlab = "Max run length",           ref = NULL,    high_bad = TRUE),
  list(col = "irv",            label = "IRV",
       xlab = "SD of item responses",     ref = NULL,    high_bad = FALSE),
  list(col = "mahal_rt",       label = "Mahalanobis Distance (RT)",
       xlab = "Mahalanobis D²",           ref = NULL,    high_bad = FALSE),
  list(col = "lazr_pooled",    label = "Laz.R (pooled)",
       xlab = "Sequential predictability", ref = 1/K,   high_bad = TRUE),
  list(col = "zh",             label = "Zh (person-fit)",
       xlab = "Zh",                        ref = -2,    high_bad = FALSE)
)

proposed_measures <- list(
  list(col = "spearman_pooled", label = "Spearman ρ (proposed)",
       xlab = "Spearman ρ",  ref = 0, high_bad = FALSE),
  list(col = "wtau_pooled",    label = "Weighted τ (proposed)",
       xlab = "Weighted τ",  ref = 0, high_bad = FALSE)
)

make_bench_dist_plot <- function(m, proposed = FALSE) {
  p <- ggplot(results, aes(x = .data[[m$col]], fill = class_label)) +
    geom_density(alpha = 0.5, colour = NA) +
    scale_fill_manual(
      values = c("Attender (true)" = "steelblue",
                 "Non-attender (true)" = "tomato"), name = NULL) +
    labs(title = m$label, x = m$xlab, y = "Density") +
    theme(legend.position = "none")
  if (!is.null(m$ref)) {
    p <- p + geom_vline(xintercept = m$ref, linetype = "dashed",
                        colour = "grey30", linewidth = 0.7)
  }
  if (proposed) {
    p <- p + theme(
      panel.border     = element_rect(colour = "#2e7d32", fill = NA,
                                      linewidth = 1.8),
      plot.background  = element_rect(fill = "#f1f8f1", colour = NA)
    )
  }
  p
}

all_plots <- c(
  lapply(bench_measures,    make_bench_dist_plot, proposed = FALSE),
  lapply(proposed_measures, make_bench_dist_plot, proposed = TRUE)
)

wrap_plots(all_plots, ncol = 2) +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

Attender (blue) vs. non-attender (red) distributions for all seven measures, computed across all 40 items. Dashed reference lines mark conventional flagging thresholds where applicable. The bottom two panels (green borders) show the two proposed complexity–RT features.

AUC and Descriptive Statistics

Show code
# bench_direction and bench_labels are defined in benchmark-setup above

bench_auc_tab <- do.call(rbind, lapply(names(bench_direction), function(nm) {
  m      <- bench_direction[[nm]]
  # roc_auc() flags scores < t as inattentive, so pass scores where LOW = inattentive.
  # high_bad=FALSE (includes Mahalanobis): low raw score = inattentive → pass as-is
  # high_bad=TRUE: high raw score = inattentive → negate so low = inattentive
  scores <- if (m$high_bad) -results[[m$col]] else results[[m$col]]
  scores[!is.finite(scores)] <- NA
  auc_val <- roc_auc(scores, results$true_inattentive)
  data.frame(
    Measure  = bench_labels[[nm]],
    AUC      = round(auc_val, 3),
    Mean_att = round(mean(results[[m$col]][results$is_non_attender == 0], na.rm=TRUE), 3),
    SD_att   = round(sd(results[[m$col]][results$is_non_attender == 0],   na.rm=TRUE), 3),
    Mean_non = round(mean(results[[m$col]][results$is_non_attender == 1], na.rm=TRUE), 3),
    SD_non   = round(sd(results[[m$col]][results$is_non_attender == 1],   na.rm=TRUE), 3),
    is_proposed = nm %in% c("spearman_pooled", "wtau_pooled")
  )
}))

proposed_rows <- which(bench_auc_tab$is_proposed)

kable(
  bench_auc_tab[, !names(bench_auc_tab) %in% "is_proposed"],
  col.names = c("Measure", "AUC",
                "Mean (attenders)", "SD (attenders)",
                "Mean (non-attenders)", "SD (non-attenders)"),
  caption = "Discriminability of all seven measures (all 40 items). ★ = proposed complexity–RT feature."
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which.max(bench_auc_tab$AUC), bold = TRUE, background = "#eef6ee") |>
  row_spec(proposed_rows, background = "#e8f5e9", color = "#1b5e20")
Discriminability of all seven measures (all 40 items). ★ = proposed complexity–RT feature.
Measure AUC Mean (attenders) SD (attenders) Mean (non-attenders) SD (non-attenders)
LongString 0.723 3.959 1.300 10.693 9.267
IRV 0.567 1.056 0.185 0.916 0.475
Mahalanobis D² (RT) 0.875 41.273 9.251 29.126 5.546
Laz.R 0.766 0.359 0.049 0.628 0.259
Zh 0.707 -0.079 1.142 -1.866 2.899
Spearman ρ ★ 0.964 0.440 0.164 0.007 0.157
Weighted τ ★ 0.968 0.472 0.167 0.006 0.166

Cutoff Selection and Classification Performance

We flag respondents using the following cutoff strategies for each measure:

  • Kneedle — the elbow point on the sorted score curve (data-driven, no ground-truth required). Used for all measures, including Mahalanobis D² as a secondary comparison threshold.
  • Outlier (2.5th pct) — for Mahalanobis D² only: flags respondents whose D² falls in the lowest 2.5% of the sample. This direction correction is necessary because in this simulation non-attenders have lower D² than attenders — their flat or random RT profiles cluster near the sample centroid while attenders, whose RTs vary meaningfully with item complexity, are the multivariate outliers. Flagging the low tail mirrors the classical outlier-exclusion logic but applied to the correct tail given the inverted direction.
  • Median split — the sample median (a simple, robust alternative)
  • Oracle — the attender 5th (or 95th) percentile, chosen with knowledge of ground truth, providing an upper bound on performance

Mahalanobis D² shows four rows in the table (Outlier, Kneedle, Median, Oracle) to allow direct comparison of the two data-driven thresholds. All other measures show three rows. Counts and percentages are reported for both false positives (FP: attenders incorrectly flagged) and false negatives (FN: non-attenders missed). Proposed-feature rows are highlighted in green.

Show code
# Apply cutoffs for each measure.
# signal = -scores when high_bad=FALSE (low raw score → inattentive → high signal).
# signal =  scores when high_bad=TRUE  (high raw score → inattentive → high signal).
# Mahalanobis: high_bad=FALSE because LOW D² → inattentive.
#
# For Mahalanobis two thresholds are computed:
#   "Outlier (2.5th pct)" — flags the lowest 2.5% of D² values (signal > 97.5th pct).
#   "Kneedle"             — data-driven elbow on the sorted signal curve.

apply_bench_cutoffs <- function(col, high_bad) {
  scores <- results[[col]]
  truth  <- results$true_inattentive

  signal <- if (high_bad) scores else -scores
  signal[!is.finite(signal)] <- NA

  # Primary threshold: outlier 2.5th pct for Mahalanobis, Kneedle for all others
  if (col == "mahal_rt") {
    t_kn    <- quantile(signal, 0.975, na.rm = TRUE)   # 97.5th pct of signal = 2.5th pct of D²
    flag_kn <- signal > t_kn
  } else {
    sorted_sig <- sort(signal, na.last = NA)
    k_idx      <- kneedle_fn(sorted_sig, sign = -1)
    t_kn       <- sorted_sig[k_idx]
    flag_kn    <- signal > t_kn
  }

  # Secondary Kneedle threshold for Mahalanobis (for comparison)
  if (col == "mahal_rt") {
    sorted_sig      <- sort(signal, na.last = NA)
    k_idx           <- kneedle_fn(sorted_sig, sign = -1)
    t_kn_kneedle    <- sorted_sig[k_idx]
    flag_kn_kneedle <- signal > t_kn_kneedle
  } else {
    t_kn_kneedle    <- NULL
    flag_kn_kneedle <- NULL
  }

  # Median on signal
  t_med      <- median(signal, na.rm = TRUE)
  flag_med   <- signal > t_med

  # Oracle: attender 95th percentile of signal
  t_oracle    <- quantile(signal[!truth], 0.95, na.rm = TRUE)
  flag_oracle <- signal > t_oracle

  list(flag_kn = flag_kn, flag_med = flag_med, flag_oracle = flag_oracle,
       flag_kn_kneedle = flag_kn_kneedle,
       t_kn = t_kn, t_med = t_med, t_oracle = t_oracle,
       t_kn_kneedle = t_kn_kneedle)
}

bench_cuts <- lapply(names(bench_direction), function(nm) {
  m <- bench_direction[[nm]]
  apply_bench_cutoffs(m$col, m$high_bad)
})
names(bench_cuts) <- names(bench_direction)

# Build performance table
make_bench_perf_row <- function(nm, cutoff_name, flag) {
  truth <- results$true_inattentive
  n_att <- sum(!truth)
  n_non <- sum(truth)
  cs    <- confusion_stats(flag, truth)
  fp_n  <- cs$FP;  fp_pct <- round(100 * cs$FP / n_att, 1)
  fn_n  <- cs$FN;  fn_pct <- round(100 * cs$FN / n_non, 1)
  data.frame(
    Measure  = bench_labels[[nm]],
    Cutoff   = cutoff_name,
    Sens     = round(cs$Sensitivity, 3),
    Spec     = round(cs$Specificity, 3),
    FP_n     = fp_n,  FP_pct = fp_pct,
    FN_n     = fn_n,  FN_pct = fn_pct,
    Flagged  = round(cs$Flagged_pct, 1),
    is_proposed = nm %in% c("spearman_pooled", "wtau_pooled")
  )
}

perf_rows <- do.call(rbind, lapply(names(bench_direction), function(nm) {
  rows <- rbind(
    make_bench_perf_row(nm, "Kneedle", bench_cuts[[nm]]$flag_kn),
    make_bench_perf_row(nm, "Median",  bench_cuts[[nm]]$flag_med),
    make_bench_perf_row(nm, "Oracle",  bench_cuts[[nm]]$flag_oracle)
  )
  # For Mahalanobis, rename primary row and add the Kneedle comparison row
  if (nm == "mahal_rt") {
    rows$Cutoff[rows$Cutoff == "Kneedle"] <- "Outlier (2.5th pct)"
    rows <- rbind(
      rows,
      make_bench_perf_row(nm, "Kneedle", bench_cuts[[nm]]$flag_kn_kneedle)
    )
  }
  rows
}))

proposed_perf_rows <- which(perf_rows$is_proposed)

kable(
  perf_rows[, !names(perf_rows) %in% "is_proposed"],
  col.names = c("Measure", "Cutoff",
                "Sensitivity", "Specificity",
                "FP (N)", "FP (%)",
                "FN (N)", "FN (%)",
                "% Flagged"),
  caption = paste0(
    "Classification performance for all measures. ",
    "Mahalanobis D\u00b2 shows two thresholds: \u2018Outlier (2.5th pct)\u2019 flags the ",
    "lowest 2.5% of D\u00b2 values (direction-corrected outlier exclusion); ",
    "\u2018Kneedle\u2019 is the data-driven elbow. Direction is inverted relative to ",
    "the standard assumption: LOW D\u00b2 \u2192 inattentive in this simulation. ",
    "FP = attenders incorrectly flagged; FN = non-attenders missed. \u2605 = proposed feature."
  )
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(proposed_perf_rows, background = "#e8f5e9", color = "#1b5e20") |>
  row_spec(which(perf_rows$Cutoff == "Oracle"), italic = TRUE)
Classification performance for all measures. Mahalanobis D² shows two thresholds: ‘Outlier (2.5th pct)’ flags the lowest 2.5% of D² values (direction-corrected outlier exclusion); ‘Kneedle’ is the data-driven elbow. Direction is inverted relative to the standard assumption: LOW D² → inattentive in this simulation. FP = attenders incorrectly flagged; FN = non-attenders missed. ★ = proposed feature.
Measure Cutoff Sensitivity Specificity FP (N) FP (%) FN (N) FN (%) % Flagged
LongString Kneedle 0.525 0.963 167 3.7 257 47.5 9.0
LongString Median 0.632 0.738 1168 26.2 199 36.8 30.2
LongString Oracle 0.525 0.963 167 3.7 257 47.5 9.0
IRV Kneedle 0.401 0.974 115 2.6 324 59.9 6.6
IRV Median 0.558 0.510 2187 49.0 239 44.2 49.8
IRV Oracle 0.405 0.952 214 4.8 322 59.5 8.7
Mahalanobis D² (RT) Outlier (2.5th pct) 0.150 0.990 44 1.0 460 85.0 2.5
Mahalanobis D² (RT) Median 0.956 0.555 1983 44.5 24 4.4 50.0
Mahalanobis D² (RT) Oracle 0.390 0.950 223 5.0 330 61.0 8.7
Mahalanobis D² (RT) Kneedle 0.189 0.982 79 1.8 439 81.1 3.6
Laz.R Kneedle 0.630 0.953 209 4.7 200 37.0 11.0
Laz.R Median 0.739 0.529 2100 47.1 141 26.1 50.0
Laz.R Oracle 0.630 0.950 223 5.0 200 37.0 11.3
Zh Kneedle 0.477 0.919 363 8.1 283 52.3 12.4
Zh Median 0.726 0.527 2107 47.3 148 27.4 50.0
Zh Oracle 0.458 0.950 223 5.0 293 54.2 9.4
Spearman ρ ★ Kneedle 0.906 0.894 473 10.6 51 9.4 19.3
Spearman ρ ★ Median 0.996 0.560 1961 44.0 2 0.4 50.0
Spearman ρ ★ Oracle 0.787 0.950 223 5.0 115 21.3 13.0
Weighted τ ★ Kneedle 0.928 0.887 502 11.3 39 7.2 20.1
Weighted τ ★ Median 0.998 0.560 1960 44.0 1 0.2 50.0
Weighted τ ★ Oracle 0.810 0.950 223 5.0 103 19.0 13.2

Aggregated Performance: Weighted Balanced Accuracy and Sensitivity at Fixed Specificity

The table above gives the full picture but requires readers to weigh sensitivity against specificity themselves. Here we provide two complementary summaries that make the trade-off explicit.

Weighted Balanced Accuracy (WBA) averages sensitivity and specificity with an explicit weight w on specificity:

\[\text{WBA}(w) = w \cdot \text{Specificity} + (1 - w) \cdot \text{Sensitivity}\]

At w = 0.5 this is ordinary balanced accuracy (symmetric). Higher w rewards measures that keep FPs low — appropriate here because false positives introduce systematic bias in sample composition (see §False Positive Covariate Analysis), while false negatives reduce power but do not distort who remains in the sample.

Sensitivity at 95% Specificity (S@95) reads directly off the ROC curve: it is the highest sensitivity a measure can achieve while keeping the false positive rate at or below 5%. This answers the applied question — given I am only willing to incorrectly exclude 1 in 20 real attenders, how many non-attenders can I catch? — without requiring any weight choice.

Both are computed using the Kneedle threshold (or χ² for Mahalanobis). WBA is also shown at w ∈ {0.5, 0.6, 0.7, 0.8, 0.9} so readers can see whether measure rankings are stable across reasonable choices of specificity preference.

Show code
truth <- results$true_inattentive

# ── Sensitivity at 95% Specificity from ROC curve ────────────────────────────
sens_at_95spec <- function(nm) {
  m      <- bench_direction[[nm]]
  # Flag inattentive when signal > t, so signal must be HIGH for inattentive.
  # high_bad=TRUE: high raw = inattentive → signal = raw
  # high_bad=FALSE (includes Mahalanobis): low raw = inattentive → signal = -raw
  signal <- if (m$high_bad) results[[m$col]] else -results[[m$col]]
  signal[!is.finite(signal)] <- NA
  thresholds <- sort(unique(signal[!is.na(signal)]))
  roc <- do.call(rbind, lapply(thresholds, function(t) {
    flag <- signal > t
    TP <- sum( flag &  truth, na.rm=TRUE); FP <- sum( flag & !truth, na.rm=TRUE)
    TN <- sum(!flag & !truth, na.rm=TRUE); FN <- sum(!flag &  truth, na.rm=TRUE)
    data.frame(spec = TN/(TN+FP), sens = TP/(TP+FN))
  }))
  # Best sensitivity among rows where specificity >= 0.95
  eligible <- roc[!is.na(roc$spec) & roc$spec >= 0.95, ]
  if (nrow(eligible) == 0) return(NA_real_)
  round(max(eligible$sens, na.rm = TRUE), 3)
}

# ── WBA at a given weight ─────────────────────────────────────────────────────
wba <- function(sens, spec, w) round(w * spec + (1 - w) * sens, 3)

# ── Build summary table ───────────────────────────────────────────────────────
weights   <- seq(0.5, 0.9, by = 0.1)
wba_tab <- do.call(rbind, lapply(names(bench_direction), function(nm) {
  # Use primary threshold flag (Outlier 2.5th pct for Mahalanobis, Kneedle for others)
  flag <- bench_cuts[[nm]]$flag_kn
  cs   <- confusion_stats(flag, truth)
  sens <- cs$Sensitivity
  spec <- cs$Specificity
  s95  <- sens_at_95spec(nm)

  wba_vals <- setNames(sapply(weights, function(w) wba(sens, spec, w)),
                       paste0("WBA(w=", weights, ")"))

  data.frame(
    Measure   = bench_labels[[nm]],
    Sens      = round(sens, 3),
    Spec      = round(spec, 3),
    `S@95%`   = s95,
    as.data.frame(t(wba_vals)),
    is_proposed = nm %in% c("spearman_pooled", "wtau_pooled"),
    check.names = FALSE
  )
}))

prop_rows_wba <- which(wba_tab$is_proposed)
wba_display   <- wba_tab[, !names(wba_tab) %in% "is_proposed"]

# ── Main WBA table ────────────────────────────────────────────────────────────
tbl_wba <- kable(
  wba_display,
  col.names = c("Measure", "Sensitivity", "Specificity", "S@95% Spec",
                paste0("WBA w=", weights)),
  caption = paste0(
    "Sensitivity, specificity, Sensitivity at 95% Specificity (S@95%), and ",
    "Weighted Balanced Accuracy at five specificity weights. ",
    "Higher w = greater penalty for false positives. ",
    "\u2605 = proposed complexity\u2013RT feature."
  ),
  digits = 3
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(prop_rows_wba, background = "#e8f5e9", color = "#1b5e20") |>
  # Bold the S@95 and WBA(w=0.7) columns as primary recommended summaries
  column_spec(4, bold = TRUE) |>
  column_spec(7, bold = TRUE)   # w=0.7 column (4 fixed cols + 3rd weight)

cat(as.character(tbl_wba), "\n\n")
Sensitivity, specificity, Sensitivity at 95% Specificity (S@95%), and Weighted Balanced Accuracy at five specificity weights. Higher w = greater penalty for false positives. ★ = proposed complexity–RT feature.
Measure Sensitivity Specificity S@95% Spec WBA w=0.5 WBA w=0.6 WBA w=0.7 WBA w=0.8 WBA w=0.9
LongString 0.525 0.963 0.525 0.744 0.788 0.831 0.875 0.919
IRV 0.401 0.974 0.405 0.688 0.745 0.802 0.860 0.917
Mahalanobis D² (RT) 0.150 0.990 0.390 0.570 0.654 0.738 0.822 0.906
Laz.R 0.630 0.953 0.630 0.792 0.824 0.856 0.889 0.921
Zh 0.477 0.919 0.458 0.698 0.742 0.786 0.830 0.874
Spearman ρ ★ 0.906 0.894 0.787 0.900 0.899 0.897 0.896 0.895
Weighted τ ★ 0.928 0.887 0.810 0.908 0.904 0.900 0.896 0.891
Show code
wba_long <- tidyr::pivot_longer(
  wba_tab[, c("Measure", paste0("WBA(w=", weights, ")"))],
  cols      = -Measure,
  names_to  = "weight",
  values_to = "WBA"
) |>
  mutate(
    weight  = factor(weight, levels = paste0("WBA(w=", weights, ")")),
    w_label = factor(
      gsub("WBA\\(w=(.*)\\)", "w = \\1", weight),
      levels = gsub("WBA\\(w=(.*)\\)", "w = \\1", paste0("WBA(w=", weights, ")"))
    ),
    Measure = factor(Measure, levels = rev(bench_labels[names(bench_direction)]))
  )

# Rank each measure within each weight for annotation
wba_long <- wba_long |>
  group_by(weight) |>
  mutate(rank_label = paste0("#", rank(-WBA, ties.method = "min"))) |>
  ungroup()

proposed_measures_vec <- bench_labels[c("spearman_pooled", "wtau_pooled")]

ggplot(wba_long, aes(x = w_label, y = Measure, fill = WBA)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.3f\n(%s)", WBA, rank_label)),
            size = 2.8, lineheight = 0.9) +
  scale_fill_gradient2(
    low      = "tomato",
    mid      = "lightyellow",
    high     = "#2e7d32",
    midpoint = 0.65,
    name     = "WBA"
  ) +
  labs(
    title    = "Weighted Balanced Accuracy across specificity preference weights",
    subtitle = "Cell = WBA value; rank in parentheses. Bold green rows = proposed features.",
    x        = "Specificity weight (w)",
    y        = NULL
  ) +
  theme(
    axis.text.y = element_text(
      colour = ifelse(
        rev(bench_labels[names(bench_direction)]) %in% proposed_measures_vec,
        "#1b5e20", "grey20"
      ),
      face = ifelse(
        rev(bench_labels[names(bench_direction)]) %in% proposed_measures_vec,
        "bold", "plain"
      )
    ),
    legend.position = "right"
  )

Weighted Balanced Accuracy for each measure (rows) across specificity weights w = 0.5–0.9 (columns). Green = higher WBA, red = lower. Stable rankings across columns indicate the conclusion is robust to the choice of weight. Proposed features shown in bold green row labels.

Cutoff Visualisation

Show code
make_bench_cutoff_plot <- function(nm, proposed = FALSE) {
  m      <- bench_direction[[nm]]
  cuts   <- bench_cuts[[nm]]
  scores <- results[[m$col]]
  high_b <- m$high_bad

  # Thresholds are stored on signal scale (signal = scores if high_bad, -scores if !high_bad).
  # Convert back to raw score scale: raw = signal if high_bad, raw = -signal if !high_bad.
  sig_to_raw <- function(t) if (high_b) t else -t

  # Build threshold data frame with distinct line types
  thresh_df <- data.frame(
    xintercept = c(sig_to_raw(cuts$t_kn),
                   sig_to_raw(cuts$t_med),
                   sig_to_raw(cuts$t_oracle)),
    ltype = c("solid", "dashed", "dotted"),
    label = c(if (nm == "mahal_rt") "Outlier (2.5th pct)" else "Kneedle",
              "Median", "Oracle"),
    stringsAsFactors = FALSE
  )

  # For Mahalanobis, add Kneedle as a fourth line
  if (nm == "mahal_rt" && !is.null(cuts$t_kn_kneedle)) {
    thresh_df <- rbind(thresh_df, data.frame(
      xintercept = sig_to_raw(cuts$t_kn_kneedle),
      ltype      = "longdash",
      label      = "Kneedle",
      stringsAsFactors = FALSE
    ))
  }

  # Flag region: high_bad=FALSE (low raw = inattentive) → shade left tail
  #              high_bad=TRUE  (high raw = inattentive) → shade right tail
  x_range    <- range(scores, na.rm = TRUE)
  primary_x  <- thresh_df$xintercept[1]
  shade_xmin <- if (!high_b) x_range[1] else primary_x
  shade_xmax <- if (!high_b) primary_x   else x_range[2]

  p <- ggplot(results, aes(x = .data[[m$col]], fill = class_label)) +
    annotate("rect",
             xmin = shade_xmin, xmax = shade_xmax,
             ymin = -Inf, ymax = Inf,
             alpha = 0.08, fill = "tomato") +
    geom_density(alpha = 0.45, colour = NA) +
    geom_vline(data = thresh_df,
               aes(xintercept = xintercept, linetype = ltype),
               colour = "black", linewidth = 0.65, inherit.aes = FALSE) +
    scale_linetype_identity(
      guide  = "legend",
      breaks = c("solid", "longdash", "dashed", "dotted"),
      labels = c("Outlier / Kneedle", "Kneedle (Mahal)", "Median", "Oracle")
    ) +
    # Labels placed above the plot area using annotation on a secondary y scale
    annotate("label",
             x      = thresh_df$xintercept,
             y      = Inf,
             label  = thresh_df$label,
             hjust  = 0.5,
             vjust  = 1.15,
             size   = 2.5,
             fill   = "white",
             colour = "grey20",
             label.size = 0.2) +
    scale_fill_manual(
      values = c("Attender (true)" = "steelblue",
                 "Non-attender (true)" = "tomato"), name = NULL) +
    labs(title = bench_labels[[nm]], x = m$col, y = "Density") +
    theme(legend.position = "none")

  if (proposed) {
    p <- p + theme(
      panel.border    = element_rect(colour = "#2e7d32", fill = NA,
                                     linewidth = 1.8),
      plot.background = element_rect(fill = "#f1f8f1", colour = NA)
    )
  }
  p
}

proposed_nms    <- c("spearman_pooled", "wtau_pooled")
benchmark_nms   <- setdiff(names(bench_direction), proposed_nms)

all_cutoff_plots <- c(
  lapply(benchmark_nms, make_bench_cutoff_plot, proposed = FALSE),
  lapply(proposed_nms,  make_bench_cutoff_plot, proposed = TRUE)
)

wrap_plots(all_cutoff_plots, ncol = 2) +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

Density distributions for each measure with cutoff lines marking the Kneedle, Median, and Oracle thresholds. Red shading marks the flagging region. Mahalanobis D² shows an additional Kneedle line for comparison. Proposed features (ρ, τ) are shown with a green panel border.

ROC Curves: All Measures

Show code
truth <- results$true_inattentive

roc_list <- lapply(names(bench_direction), function(nm) {
  m      <- bench_direction[[nm]]
  signal <- if (m$high_bad) results[[m$col]] else -results[[m$col]]
  signal[!is.finite(signal)] <- NA
  r      <- make_roc_df(signal, truth, bench_labels[[nm]])
  auc_v  <- round(roc_auc(signal, truth), 3)
  r$variant <- sprintf("%s (AUC=%.3f)", bench_labels[[nm]], auc_v)
  r$proposed <- nm %in% c("spearman_pooled", "wtau_pooled")
  r
})

roc_all <- do.call(rbind, roc_list)

proposed_variants <- unique(roc_all$variant[roc_all$proposed])
other_variants    <- unique(roc_all$variant[!roc_all$proposed])

n_other   <- length(other_variants)
grey_pal  <- colorRampPalette(c("grey60", "steelblue3"))(n_other)
colour_map <- c(
  setNames(grey_pal, other_variants),
  setNames(c("#2e7d32", "#81c784"), proposed_variants)
)
size_map <- c(
  setNames(rep(0.7, n_other),   other_variants),
  setNames(c(1.4, 1.4),          proposed_variants)
)

ggplot(roc_all, aes(x = fpr, y = tpr,
                    colour = variant, linewidth = variant)) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "grey80") +
  geom_path() +
  scale_colour_manual(values = colour_map, name = NULL) +
  scale_linewidth_manual(values = size_map, guide = "none") +
  labs(title    = "ROC curves — all attention-checking measures",
       subtitle = "Green = proposed complexity–RT features",
       x = "False Positive Rate (1 − Specificity)",
       y = "Sensitivity") +
  theme(legend.text = element_text(size = 8),
        legend.key.width = unit(1.2, "cm"))

ROC curves for all seven measures. Proposed complexity–RT features (ρ, τ) are shown as bold green lines; established measures in grey/blue. AUC values are reported in the legend.

False Positive Covariate Analysis

Are false positives from each measure systematically different from the true negatives (correctly retained attenders)? This section examines the selectivity profile of each measure’s errors: if a measure flags people not because they are inattentive but because they share some substantive characteristic (trait extremity, response style, low attention but genuine engagement), the resulting exclusions introduce selection bias rather than merely noise.

We compare FP vs TN attenders on five covariates:

  1. θ̂ (IRT trait estimate) — are FPs clustering at the centre of the latent trait? A measure whose FPs are predominantly mid-range respondents is biased against a specific substantive group.
  2. Observed mean score — same question via the raw scale mean, which is available without IRT assumptions.
  3. Latent attention A — are FPs simply low-attention genuine attenders? If so, the exclusion may be defensible (they were marginally engaged), but it should be acknowledged.
  4. IRV (response variability) — do FPs show restricted range use? Low IRV can arise from mid-point responding or from genuine homogeneity of trait expression; high IRV from acquiescent or extreme responding.
  5. Mean RT — are FPs simply fast responders overall, or is their exclusion driven by RT–complexity decoupling specifically?

For each measure, FP and TN attenders are compared via density overlays and a summary table of group means and standardised mean differences (Cohen’s d).

Show code
# ── IRT theta: read directly from simulation output ───────────────────────────
# theta_1 and theta_2 are the true latent trait values from the DGP,
# one per scale. We use both individually and their mean as an overall index.
results$theta_1   <- dat$theta_1
results$theta_2   <- dat$theta_2
results$theta_mean <- rowMeans(cbind(dat$theta_1, dat$theta_2), na.rm = TRUE)

# ── Observed mean score (all items, reverse-scored) ───────────────────────────
# Build index of which columns in resp_mat need reversing
resp_mat_scored <- resp_mat
for (s in seq_len(ns)) {
  rev_item_nums <- instrument[[s]]$reverse   # integer item indices within scale s
  if (length(rev_item_nums) > 0) {
    col_names <- paste0("Y_s", s, "_i", rev_item_nums)
    # Only reverse columns that actually exist in resp_mat
    valid <- col_names[col_names %in% colnames(resp_mat_scored)]
    if (length(valid) > 0) {
      resp_mat_scored[, valid] <- K + 1 - resp_mat[, valid]
    }
  }
}
results$mean_score <- rowMeans(resp_mat_scored, na.rm = TRUE)

# ── Mean RT (overall speed) ───────────────────────────────────────────────────
results$mean_rt <- rowMeans(rt_mat, na.rm = TRUE)

# IRV and A are already in results

cat("Covariate summary (attenders only):\n")
Covariate summary (attenders only):
Show code
att <- results[results$is_non_attender == 0, ]
for (col in c("theta_1", "theta_2", "theta_mean", "mean_score", "A", "irv", "mean_rt")) {
  cat(sprintf("  %-12s  mean = %.3f  sd = %.3f\n",
              col, mean(att[[col]], na.rm = TRUE), sd(att[[col]], na.rm = TRUE)))
}
  theta_1       mean = 0.010  sd = 1.001
  theta_2       mean = 0.001  sd = 1.020
  theta_mean    mean = 0.006  sd = 0.708
  mean_score    mean = 3.050  sd = 0.328
  A             mean = 0.670  sd = 0.201
  irv           mean = 1.056  sd = 0.185
  mean_rt       mean = 4.840  sd = 1.085
Show code
# Build a unified FP/TN flag column for each measure (among attenders only)
# using the Kneedle / chi-sq thresholds from bench_cuts

fp_flag_df <- results[results$is_non_attender == 0, ] |>
  select(ID, A, theta_1, theta_2, theta_mean, mean_score, irv, mean_rt)

for (nm in names(bench_direction)) {
  cuts <- bench_cuts[[nm]]
  fp_flag_df[[paste0("fp_", nm)]] <- as.logical(cuts$flag_kn[results$is_non_attender == 0])
}

cat(sprintf("FP flag data frame: %d attenders × %d columns\n",
            nrow(fp_flag_df), ncol(fp_flag_df)))
FP flag data frame: 4459 attenders × 15 columns

Density Overlays: FP vs TN by Covariate

Each plot below focuses on one covariate and shows all seven measures as facets. Within each facet, the blue distribution is the true negatives (attenders correctly retained) and the red distribution is the false positives (attenders incorrectly flagged).

How to read these: there are three patterns to watch for:

  • Near-complete overlap (red ≈ blue in shape and location) — the measure’s false positives are not selectively different from true negatives on that covariate. The errors are essentially random with respect to it, which is the best-case scenario.
  • Shifted distribution (red centre is left or right of blue) — the measure is systematically excluding attenders at a particular level of the covariate. For example, red shifted toward θ = 0 means the measure preferentially flags mid-trait respondents; red shifted toward low A means it mainly catches low-attention-but-genuine attenders.
  • Narrow, tall red distribution (red is more peaked and concentrated than blue, even if centred similarly) — the false positives come from a very specific, homogeneous slice of the attender population. This is a stronger form of bias than a simple shift: the measure is not just nudging toward one end of the covariate, it is selecting a particular type of respondent with unusual precision. For instance, a narrow red spike at moderate θ values would suggest the measure almost exclusively flags respondents near the scale midpoint, leaving very high and very low scorers alone even when they are inattentive.

In general, a measure whose FPs look like a random sample of the TN distribution (same shape, same spread, same centre) is making errors that do not differentially harm any subgroup. A measure whose FPs form a tight, distinctive cluster is effectively acting as a filter for a specific kind of respondent, regardless of whether those respondents are actually inattentive.

The proposed measures (ρ, τ) are shown with a green facet border.

Show code
covariates <- list(
  list(col = "theta_1",    label = "\u03b8\u2081 (Scale 1 trait)"),
  list(col = "theta_2",    label = "\u03b8\u2082 (Scale 2 trait)"),
  list(col = "theta_mean", label = "\u03b8\u0304 (mean trait)"),
  list(col = "mean_score", label = "Mean observed score"),
  list(col = "A",          label = "Latent attention A"),
  list(col = "irv",        label = "IRV (response variability)"),
  list(col = "mean_rt",    label = "Mean RT (ms)")
)

measure_order <- c("longstring", "irv", "mahal_rt", "lazr_pooled",
                   "zh", "spearman_pooled", "wtau_pooled")

# Build a long data frame: one row per attender × measure
fp_long <- do.call(rbind, lapply(measure_order, function(nm) {
  flag_col <- paste0("fp_", nm)
  data.frame(
    fp_flag_df,
    measure   = bench_labels[[nm]],
    group     = ifelse(fp_flag_df[[flag_col]], "FP (flagged)", "TN (retained)"),
    proposed  = nm %in% c("spearman_pooled", "wtau_pooled"),
    stringsAsFactors = FALSE
  )
}))

fp_long$measure <- factor(fp_long$measure, levels = bench_labels[measure_order])
fp_long$group   <- factor(fp_long$group, levels = c("TN (retained)", "FP (flagged)"))

# One plot per covariate — each rendered separately so text isn't squashed
for (cv in covariates) {
  p <- ggplot(fp_long, aes(x = .data[[cv$col]], fill = group)) +
    geom_density(alpha = 0.5, colour = NA) +
    scale_fill_manual(
      values = c("TN (retained)" = "steelblue", "FP (flagged)" = "tomato"),
      name   = NULL) +
    facet_wrap(~ measure, nrow = 2, scales = "free_y") +
    labs(
      title    = sprintf("FP vs TN: %s", cv$label),
      subtitle = "Red = false positives (flagged attenders); blue = true negatives (retained attenders)",
      x        = cv$label,
      y        = "Density"
    ) +
    theme(
      legend.position  = "bottom",
      strip.text       = element_text(size = 9, face = "bold"),
      strip.background = element_rect(
        fill = ifelse(
          levels(fp_long$measure) %in%
            bench_labels[c("spearman_pooled", "wtau_pooled")],
          "#e8f5e9", "grey92"
        ),
        colour = ifelse(
          levels(fp_long$measure) %in%
            bench_labels[c("spearman_pooled", "wtau_pooled")],
          "#2e7d32", "grey70"
        ),
        linewidth = 0.8
      )
    )
  print(p)
}

Summary Table: Group Means and Cohen’s d

Show code
cohens_d <- function(x_fp, x_tn) {
  n1 <- sum(!is.na(x_fp)); n2 <- sum(!is.na(x_tn))
  if (n1 < 2 || n2 < 2) return(NA_real_)
  sd_pooled <- sqrt(((n1-1)*var(x_fp, na.rm=TRUE) +
                     (n2-1)*var(x_tn, na.rm=TRUE)) / (n1+n2-2))
  if (sd_pooled == 0) return(NA_real_)
  (mean(x_fp, na.rm=TRUE) - mean(x_tn, na.rm=TRUE)) / sd_pooled
}

covariate_cols   <- c("theta_1", "theta_2", "theta_mean", "mean_score", "A", "irv", "mean_rt")
covariate_labels <- c("\u03b8\u2081 (S1)", "\u03b8\u2082 (S2)", "\u03b8\u0304 (mean)",
                      "Mean score", "Attention A", "IRV", "Mean RT")

fp_tab <- do.call(rbind, lapply(measure_order, function(nm) {
  flag   <- fp_flag_df[[paste0("fp_", nm)]]
  fp_grp <- fp_flag_df[flag  == TRUE,  ]
  tn_grp <- fp_flag_df[flag  == FALSE, ]
  n_fp   <- sum(flag, na.rm = TRUE)
  n_tn   <- sum(!flag, na.rm = TRUE)

  row_vals <- unlist(lapply(covariate_cols, function(cv) {
    d    <- round(cohens_d(fp_grp[[cv]], tn_grp[[cv]]), 2)
    m_fp <- round(mean(fp_grp[[cv]], na.rm = TRUE), 2)
    m_tn <- round(mean(tn_grp[[cv]], na.rm = TRUE), 2)
    c(m_fp, m_tn, d)
  }))

  df <- as.data.frame(t(row_vals))
  names(df) <- as.vector(outer(
    covariate_labels,
    c("FP mean", "TN mean", "d"),
    paste, sep = " \u2014 "
  ))
  cbind(Measure = bench_labels[[nm]], `N FP` = n_fp, `N TN` = n_tn, df)
}))

prop_rows <- which(fp_tab$Measure %in% c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))

# One sub-table per covariate — use cat() so knitr renders HTML correctly
for (i in seq_along(covariate_cols)) {
  cv_lbl <- covariate_labels[i]
  cols_i <- c("Measure", "N FP", "N TN",
              paste(cv_lbl, c("FP mean", "TN mean", "d"), sep = " \u2014 "))
  sub_tab <- fp_tab[, cols_i]

  tbl_html <- kable(
    sub_tab,
    format     = "html",
    col.names  = c("Measure", "N FP", "N TN", "Mean (FP)", "Mean (TN)", "Cohen's d"),
    caption    = sprintf("FP vs TN comparison on %s. Positive d = FPs score higher.", cv_lbl),
    digits     = 3
  ) |>
    kable_styling(full_width = FALSE) |>
    row_spec(prop_rows, background = "#e8f5e9", color = "#1b5e20") |>
    column_spec(6, bold = TRUE)

  cat(as.character(tbl_html), "\n\n")
}
FP vs TN comparison on θ₁ (S1). Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 -0.06 0.01 0.61
IRV 115 4344 -0.02 0.01 1.05
Mahalanobis D² (RT) 44 4415 0.23 0.00 -0.09
Laz.R 209 4250 -0.07 0.01 0.81
Zh 363 4096 -0.05 0.01 -2.21
Spearman ρ ★ 473 3986 -0.09 0.01 -1.78
Weighted τ ★ 502 3957 -0.08 0.01 -1.78
FP vs TN comparison on θ₂ (S2). Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 0.01 -0.07 0.94
IRV 115 4344 0.01 -0.07 0.67
Mahalanobis D² (RT) 44 4415 0.01 0.20 1.08
Laz.R 209 4250 0.01 -0.10 0.80
Zh 363 4096 0.02 -0.06 1.29
Spearman ρ ★ 473 3986 0.02 -0.08 1.19
Weighted τ ★ 502 3957 0.02 -0.06 1.19
FP vs TN comparison on θ̄ (mean). Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 -0.08 3.04 1.06
IRV 115 4344 -0.03 3.01 1.07
Mahalanobis D² (RT) 44 4415 0.22 3.11 1.06
Laz.R 209 4250 -0.08 3.04 1.07
Zh 363 4096 -0.06 3.04 1.03
Spearman ρ ★ 473 3986 -0.12 3.04 1.04
Weighted τ ★ 502 3957 -0.11 3.04 1.04
FP vs TN comparison on Mean score. Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 -0.03 3.05 -0.63
IRV 115 4344 -0.07 3.05 -2.24
Mahalanobis D² (RT) 44 4415 0.06 3.05 0.13
Laz.R 209 4250 -0.06 3.05 -1.52
Zh 363 4096 -0.03 3.05 1.52
Spearman ρ ★ 473 3986 0.01 3.05 0.81
Weighted τ ★ 502 3957 0.03 3.05 0.81
FP vs TN comparison on Attention A. Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 0 -0.02 4.76
IRV 115 4344 0 -0.12 4.92
Mahalanobis D² (RT) 44 4415 0 0.18 4.56
Laz.R 209 4250 0 -0.03 4.88
Zh 363 4096 0 -0.04 4.84
Spearman ρ ★ 473 3986 0 -0.02 4.84
Weighted τ ★ 502 3957 0 -0.02 4.76
FP vs TN comparison on IRV. Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 -0.03 0.79 4.84
IRV 115 4344 -0.08 0.87 4.84
Mahalanobis D² (RT) 44 4415 0.06 0.65 4.84
Laz.R 209 4250 -0.06 0.82 4.84
Zh 363 4096 -0.03 0.32 4.84
Spearman ρ ★ 473 3986 0.01 0.39 4.84
Weighted τ ★ 502 3957 0.03 0.39 4.85
FP vs TN comparison on Mean RT. Positive d = FPs score higher.
Measure N FP N TN Mean (FP) Mean (TN) Cohen’s d
LongString 167 4292 -0.04 0.67 -0.08
IRV 115 4344 -0.05 0.66 0.07
Mahalanobis D² (RT) 44 4415 0.15 0.67 -0.26
Laz.R 209 4250 -0.06 0.66 0.04
Zh 363 4096 -0.04 0.70 0.00
Spearman ρ ★ 473 3986 -0.04 0.70 0.00
Weighted τ ★ 502 3957 -0.03 0.71 -0.08

Cohen’s d Heatmap: All Measures × Covariates

Show code
d_long <- do.call(rbind, lapply(measure_order, function(nm) {
  flag   <- fp_flag_df[[paste0("fp_", nm)]]
  fp_grp <- fp_flag_df[flag == TRUE,  ]
  tn_grp <- fp_flag_df[flag == FALSE, ]
  do.call(rbind, lapply(seq_along(covariate_cols), function(i) {
    data.frame(
      measure   = bench_labels[[nm]],
      covariate = covariate_labels[i],
      d         = cohens_d(fp_grp[[covariate_cols[i]]],
                           tn_grp[[covariate_cols[i]]])
    )
  }))
}))

d_long$measure <- factor(d_long$measure,
  levels = rev(bench_labels[measure_order]))

ggplot(d_long, aes(x = covariate, y = measure, fill = d)) +
  geom_tile(colour = "white", linewidth = 0.6) +
  geom_text(aes(label = sprintf("%.2f", d)), size = 3) +
  scale_fill_gradient2(
    low      = "#6a0dad",   # purple = FPs score lower
    mid      = "white",
    high     = "#2e7d32",   # green  = FPs score higher
    midpoint = 0,
    na.value = "grey85",
    name     = "Cohen's d\n(FP − TN)"
  ) +
  labs(title    = "False positive selectivity: Cohen's d (FP vs TN attenders)",
       subtitle = "Rows = measures; columns = covariates. Proposed features in green.",
       x = NULL, y = NULL) +
  theme(
    axis.text.x  = element_text(angle = 20, hjust = 1),
    axis.text.y  = element_text(
      colour = ifelse(
        rev(bench_labels[measure_order]) %in%
          c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"),
        "#1b5e20", "grey20"
      ),
      face = ifelse(
        rev(bench_labels[measure_order]) %in%
          c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"),
        "bold", "plain"
      )
    )
  )

Cohen’s d for each measure (rows) × covariate (columns). Positive d (green) means FPs score higher than TNs; negative d (purple) means FPs score lower. Values near zero mean the measure’s FPs are not selectively different from TNs on that covariate. Proposed features highlighted with a star in the row label.

Sensitivity Analysis I: Sample Size (N)

How stable are the seven measures as sample size decreases from the full dataset of 5000 respondents down to N = 100? We draw random subsamples (without replacement) at each of the following sizes: N = 5,000 · 2,000 · 1,000 · 500 · 250 · 100, recomputing all measures and their Kneedle-threshold performance at each size.

Because subsampling is stochastic, we repeat each condition 20 times and report the mean ± SD of sensitivity and FP rate across replications.

Design. The simulation should be run with at least N = 5,000 respondents so that the largest subsample equals the full dataset. Subsampling preserves the original attender/non-attender ratio within each draw.

Show code
n_ss_reps <- 20   # replications per N level — enough for stable means
ss_sizes  <- c(100, 250, 500, 1000, 2000, 5000)
# Cap at actual sample size
ss_sizes  <- ss_sizes[ss_sizes <= nrow(dat)]

set.seed(2025)

# Pre-generate subsample indices: list[N_level][rep] -> row indices
ss_idx <- lapply(ss_sizes, function(n_s) {
  lapply(seq_len(n_ss_reps), function(r) {
    att_idx <- which(dat$is_non_attender == 0)
    non_idx <- which(dat$is_non_attender == 1)
    prop_non <- mean(dat$is_non_attender == 1)
    n_non <- round(n_s * prop_non)
    n_att <- n_s - n_non
    c(
      sample(att_idx, min(n_att, length(att_idx)), replace = FALSE),
      sample(non_idx, min(n_non, length(non_idx)), replace = FALSE)
    )
  })
})
names(ss_idx) <- as.character(ss_sizes)

cat(sprintf("Subsample sizes: %s\nReplications: %d\n",
            paste(ss_sizes, collapse = ", "), n_ss_reps))
Subsample sizes: 100, 250, 500, 1000, 2000, 5000
Replications: 20
Show code
# ── Vectorised measure helpers — defined here, used by both sensitivity loops ─

# LongString: max run length per row
longstring_vec <- function(mat) {
  apply(mat, 1, function(x) {
    x <- x[!is.na(x)]
    if (length(x) == 0) return(NA_integer_)
    max(rle(x)$lengths)
  })
}

# IRV
irv_vec <- function(mat) apply(mat, 1, sd, na.rm = TRUE)

# Laz.R: cast to integer first for speed
lazr_vec <- function(mat) {
  mat_int <- matrix(as.integer(mat), nrow = nrow(mat))
  apply(mat_int, 1, Laz.R)
}

# Mahalanobis on log-RT
mahal_vec <- function(log_rt) {
  mu    <- colMeans(log_rt, na.rm = TRUE)
  cov_m <- tryCatch(
    cov(log_rt, use = "pairwise.complete.obs") + diag(1e-4, ncol(log_rt)),
    error = function(e) diag(ncol(log_rt))
  )
  tryCatch(
    mahalanobis(log_rt, center = mu, cov = cov_m),
    error = function(e) rowSums(sweep(log_rt, 2, mu)^2)
  )
}

# Spearman rho
rho_vec <- function(rt_mat, cx) {
  cx_rk <- rank(cx)
  apply(rt_mat, 1, function(rt_i) cor(cx_rk, rank(rt_i), method = "spearman"))
}

# Vectorised weighted tau
wtau_vectorised <- function(rt_mat, cx) {
  pairs   <- combn(length(cx), 2)
  cx_diff <- cx[pairs[1,]] - cx[pairs[2,]]
  w       <- abs(cx_diff)
  w_sum   <- sum(w)
  rt_diff <- rt_mat[, pairs[1,], drop=FALSE] - rt_mat[, pairs[2,], drop=FALSE]
  conc    <- sweep(sign(rt_diff), 2, sign(cx_diff), `*`)
  as.numeric((conc %*% w) / w_sum)
}

# ── Per-subsample performance ─────────────────────────────────────────────────

# For each subsample, recompute measures and apply Kneedle cutoff
compute_ss_perf <- function(row_idx) {
  sub_dat <- dat[row_idx, ]
  sub_res <- sub_dat[, c("ID", "is_non_attender")]
  sub_res$true_inatt <- sub_res$is_non_attender == 1
  n_att   <- sum(!sub_res$true_inatt)
  n_non   <- sum(sub_res$true_inatt)
  if (n_att < 5 || n_non < 5) return(NULL)

  sub_resp <- as.matrix(sub_dat[, y_cols_all])
  sub_rt   <- as.matrix(sub_dat[, rt_cols_all])

  # 1. LongString
  sub_res$longstring      <- longstring_vec(sub_resp)
  # 2. IRV
  sub_res$irv             <- irv_vec(sub_resp)
  # 3. Mahalanobis on log-RT
  sub_res$mahal_rt        <- mahal_vec(log(pmax(sub_rt, 1e-6)))
  # 4. Laz.R
  sub_res$lazr_pooled     <- lazr_vec(sub_resp)
  # 5. Spearman pooled
  sub_res$spearman_pooled <- rho_vec(sub_rt, cx_all)
  # 7. Weighted tau — vectorised
  sub_res$wtau_pooled     <- wtau_vectorised(sub_rt, cx_all)
  # Zh excluded (too slow per subsample)
  sub_res$zh <- NA_real_

  # Apply cutoffs per measure and compute sens / FP rate
  # Mahalanobis uses outlier (2.5th pct) threshold; all others use Kneedle.
  perf_list <- lapply(names(bench_direction), function(nm) {
    m <- bench_direction[[nm]]
    if (nm == "zh") return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))
    signal <- if (m$high_bad) sub_res[[m$col]] else -sub_res[[m$col]]
    signal[!is.finite(signal)] <- NA
    if (all(is.na(signal))) return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))

    if (nm == "mahal_rt") {
      # Outlier exclusion: flag lowest 2.5% of D² (highest 2.5% of -D² signal)
      flag <- signal > quantile(signal, 0.975, na.rm = TRUE)
    } else {
      sorted_sig <- sort(signal, na.last = NA)
      if (length(sorted_sig) < 2)
        return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))
      k_idx <- kneedle_fn(sorted_sig, sign = -1)
      flag  <- signal > sorted_sig[k_idx]
    }
    cs <- confusion_stats(flag, sub_res$true_inatt)
    data.frame(
      measure  = nm,
      sens     = cs$Sensitivity,
      spec     = cs$Specificity,
      fp_n     = cs$FP,
      fn_n     = cs$FN
    )
  })

  do.call(rbind, perf_list)
}

# Run across all sizes and reps
ss_results_list <- lapply(ss_sizes, function(n_s) {
  reps <- ss_idx[[as.character(n_s)]]
  rep_perf <- lapply(reps, function(idx) {
    pf <- tryCatch(compute_ss_perf(idx), error = function(e) NULL)
    if (!is.null(pf)) pf$n <- n_s
    pf
  })
  do.call(rbind, Filter(Negate(is.null), rep_perf))
})

ss_results <- do.call(rbind, ss_results_list)

cat(sprintf("Sensitivity-N analysis complete: %d rows of results.\n", nrow(ss_results)))
Sensitivity-N analysis complete: 840 rows of results.

Sensitivity and False Positive Rate vs. N

Show code
ss_summary <- ss_results |>
  group_by(n, measure) |>
  summarise(
    mean_sens = round(mean(sens, na.rm = TRUE), 3),
    sd_sens   = round(sd(sens,   na.rm = TRUE), 3),
    mean_spec = round(mean(spec, na.rm = TRUE), 3),
    sd_spec   = round(sd(spec,   na.rm = TRUE), 3),
    mean_fp   = round(mean(fp_n, na.rm = TRUE), 1),
    sd_fp     = round(sd(fp_n,   na.rm = TRUE), 1),
    mean_fn   = round(mean(fn_n, na.rm = TRUE), 1),
    sd_fn     = round(sd(fn_n,   na.rm = TRUE), 1),
    .groups   = "drop"
  ) |>
  mutate(measure_label = bench_labels[measure])

make_wide <- function(data, value_col, prefix) {
  data |>
    select(measure_label, n, all_of(value_col)) |>
    tidyr::pivot_wider(names_from = n, values_from = all_of(value_col),
                       names_prefix = prefix) |>
    mutate(is_proposed = measure_label %in% c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))
}

render_wide <- function(wide_df, col_suffix, caption_text) {
  prop_rows <- which(wide_df$is_proposed)
  kable(
    wide_df[, !names(wide_df) %in% "is_proposed"],
    col.names = c("Measure", paste0("N=", ss_sizes)),
    caption   = caption_text
  ) |>
    kable_styling(full_width = FALSE) |>
    row_spec(prop_rows, background = "#e8f5e9", color = "#1b5e20")
}

render_wide(make_wide(ss_summary, "mean_sens", "N="),
  "sens", "Mean sensitivity by N. Zh excluded (GRM refit too slow).")
Mean sensitivity by N. Zh excluded (GRM refit too slow).
Measure N=100 N=250 N=500 N=1000 N=2000 N=5000
IRV 0.359 0.396 0.426 0.416 0.408 0.401
Laz.R 0.618 0.652 0.653 0.642 0.636 0.630
LongString 0.486 0.543 0.520 0.516 0.520 0.525
Mahalanobis D² (RT) 0.000 0.002 0.001 0.000 0.000 0.000
Spearman ρ ★ 0.891 0.930 0.899 0.912 0.906 0.906
Weighted τ ★ 0.900 0.937 0.934 0.927 0.931 0.928
Zh NaN NaN NaN NaN NaN NaN
Show code
render_wide(make_wide(ss_summary, "mean_spec", "N="),
  "spec", "Mean specificity by N.")
Mean specificity by N.
Measure N=100 N=250 N=500 N=1000 N=2000 N=5000
IRV 0.949 0.961 0.964 0.967 0.972 0.974
Laz.R 0.960 0.956 0.957 0.952 0.954 0.953
LongString 0.959 0.971 0.973 0.970 0.966 0.963
Mahalanobis D² (RT) 0.966 0.969 0.971 0.972 0.972 0.972
Spearman ρ ★ 0.883 0.876 0.890 0.887 0.889 0.894
Weighted τ ★ 0.889 0.877 0.881 0.889 0.883 0.887
Zh NaN NaN NaN NaN NaN NaN
Show code
render_wide(make_wide(ss_summary, "mean_fp", "N="),
  "fp",   "Mean false positive count (flagged attenders) by N.")
Mean false positive count (flagged attenders) by N.
Measure N=100 N=250 N=500 N=1000 N=2000 N=5000
IRV 4.6 8.6 16.1 29.6 50.5 115
Laz.R 3.6 9.8 19.3 42.8 82.8 209
LongString 3.6 6.4 12.2 27.1 61.0 167
Mahalanobis D² (RT) 3.0 7.0 12.9 25.0 50.0 125
Spearman ρ ★ 10.4 27.6 49.2 100.6 197.2 473
Weighted τ ★ 9.9 27.4 53.0 98.7 209.1 502
Zh NaN NaN NaN NaN NaN NaN
Show code
render_wide(make_wide(ss_summary, "mean_fn", "N="),
  "fn",   "Mean false negative count (missed non-attenders) by N.")
Mean false negative count (missed non-attenders) by N.
Measure N=100 N=250 N=500 N=1000 N=2000 N=5000
IRV 7.0 16.3 31.0 63.0 127.8 324
Laz.R 4.2 9.4 18.8 38.7 78.6 200
LongString 5.7 12.3 25.9 52.2 103.8 257
Mahalanobis D² (RT) 11.0 27.0 54.0 108.0 216.0 541
Spearman ρ ★ 1.2 1.9 5.4 9.5 20.3 51
Weighted τ ★ 1.1 1.7 3.5 7.9 15.0 39
Zh NaN NaN NaN NaN NaN NaN
Show code
ss_colours <- c(
  setNames(colorRampPalette(c("grey60", "steelblue3"))(6),
           setdiff(unique(ss_summary$measure_label),
                   c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))),
  "Spearman \u03c1 \u2605" = "#2e7d32",
  "Weighted \u03c4 \u2605" = "#81c784"
)
ss_sizes_map <- c(
  setNames(rep(0.6, 6),
           setdiff(unique(ss_summary$measure_label),
                   c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))),
  "Spearman \u03c1 \u2605" = 1.4,
  "Weighted \u03c4 \u2605" = 1.4
)

make_ss_line_plot <- function(y_col, sd_col, y_label, title_str) {
  ggplot(ss_summary, aes(x = n, y = .data[[y_col]],
                          colour = measure_label, linewidth = measure_label)) +
    geom_ribbon(aes(ymin = pmax(.data[[y_col]] - .data[[sd_col]], 0),
                    ymax = pmin(.data[[y_col]] + .data[[sd_col]], 1),
                    fill = measure_label),
                alpha = 0.08, colour = NA) +
    geom_line() +
    geom_point(size = 2) +
    scale_x_continuous(breaks = ss_sizes, labels = scales::comma) +
    scale_colour_manual(values = ss_colours, name = NULL) +
    scale_fill_manual(  values = ss_colours, name = NULL, guide = "none") +
    scale_linewidth_manual(values = ss_sizes_map, guide = "none") +
    labs(title = title_str,
         subtitle = "Mean \u00b1 1 SD across replications; Zh excluded",
         x = "Sample size (N)", y = y_label) +
    theme(legend.position = "right", legend.text = element_text(size = 8))
}

make_ss_line_plot("mean_sens", "sd_sens", "Sensitivity", "Sensitivity vs. N") /
make_ss_line_plot("mean_spec", "sd_spec", "Specificity", "Specificity vs. N") +
  plot_layout(guides = "collect")

Sensitivity (top) and specificity (bottom) as a function of sample size N. Points show means across replications; ribbons show ± 1 SD. Proposed features (ρ, τ) are bold green; established measures in grey/blue.

Sensitivity Analysis II: Number of Items

How do the measures degrade as fewer items are available? We create holdout item sets by randomly removing between 0 and 30 items (in steps of 5) from the full pool of 40 items, recomputing each measure on the remaining items only.

Items are removed at random (preserving scale membership proportionally) 15 times per holdout level, and performance is averaged across replications. This simulates scenarios where researchers want to apply RT-based screening to shorter instruments.

Note on Zh. Zh requires fitting a new GRM for every holdout set and is excluded from this sensitivity analysis. Its item-count sensitivity can be inferred from the single-scale Zh results in the benchmark section.

Show code
n_item_reps   <- 15   # replications per removal level
items_removed <- seq(0, 30, by = 5)   # 0, 5, 10, 15, 20, 25, 30

set.seed(2026)

holdout_idx <- lapply(items_removed, function(n_rem) {
  lapply(seq_len(n_item_reps), function(r) {
    if (n_rem == 0) return(seq_len(ni_total))
    removed <- sample(seq_len(ni_total), n_rem, replace = FALSE)
    setdiff(seq_len(ni_total), removed)
  })
})
names(holdout_idx) <- as.character(items_removed)

cat(sprintf("Item removal levels: %s\nReplications: %d\n",
            paste(items_removed, collapse = ", "), n_item_reps))
Item removal levels: 0, 5, 10, 15, 20, 25, 30
Replications: 15
Show code
# All vectorised helpers (longstring_vec, irv_vec, lazr_vec,
# mahal_vec, rho_vec, wtau_vectorised) defined in sensitivity-n-compute above.

compute_item_perf <- function(item_idx) {
  n_items  <- length(item_idx)
  truth    <- results$true_inattentive
  n_att    <- sum(!truth)

  sub_resp <- resp_mat[, item_idx, drop = FALSE]
  sub_rt   <- rt_mat[,  item_idx, drop = FALSE]
  sub_cx   <- cx_all[item_idx]

  tmp <- data.frame(
    longstring      = longstring_vec(sub_resp),
    irv             = irv_vec(sub_resp),
    mahal_rt        = mahal_vec(log(pmax(sub_rt, 1e-6))),
    lazr_pooled     = lazr_vec(sub_resp),
    zh              = NA_real_,          # excluded — too slow per holdout
    spearman_pooled = rho_vec(sub_rt, sub_cx),
    wtau_pooled     = if (n_items >= 2) wtau_vectorised(sub_rt, sub_cx)
                      else rep(NA_real_, nrow(sub_rt))
  )

  perf_list <- lapply(names(bench_direction), function(nm) {
    m <- bench_direction[[nm]]
    if (nm == "zh") return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))
    signal <- if (m$high_bad) tmp[[m$col]] else -tmp[[m$col]]
    signal[!is.finite(signal)] <- NA
    if (all(is.na(signal))) return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))

    if (nm == "mahal_rt") {
      # Outlier exclusion: flag lowest 2.5% of D² (highest 2.5% of -D² signal)
      flag <- signal > quantile(signal, 0.975, na.rm = TRUE)
    } else {
      sorted_sig <- sort(signal, na.last = NA)
      if (length(sorted_sig) < 2) return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))
      k_idx <- kneedle_fn(sorted_sig, sign = -1)
      flag  <- signal > sorted_sig[k_idx]
    }
    cs <- confusion_stats(flag, truth)
    data.frame(
      measure = nm,
      sens    = cs$Sensitivity,
      spec    = cs$Specificity,
      fp_n    = cs$FP,
      fn_n    = cs$FN
    )
  })

  do.call(rbind, perf_list)
}

# Run across all item-removal levels and reps
item_results_list <- lapply(items_removed, function(n_rem) {
  reps <- holdout_idx[[as.character(n_rem)]]
  rep_perf <- lapply(reps, function(idx) {
    pf <- tryCatch(compute_item_perf(idx), error = function(e) NULL)
    if (!is.null(pf)) pf$n_removed <- n_rem
    pf
  })
  do.call(rbind, Filter(Negate(is.null), rep_perf))
})

item_results <- do.call(rbind, item_results_list)
item_results$n_items <- ni_total - item_results$n_removed

cat(sprintf("Sensitivity-items analysis complete: %d rows.\n", nrow(item_results)))
Sensitivity-items analysis complete: 735 rows.

Sensitivity and False Positive Rate vs. Number of Items

Show code
item_summary <- item_results |>
  group_by(n_items, measure) |>
  summarise(
    mean_sens = round(mean(sens, na.rm = TRUE), 3),
    sd_sens   = round(sd(sens,   na.rm = TRUE), 3),
    mean_spec = round(mean(spec, na.rm = TRUE), 3),
    sd_spec   = round(sd(spec,   na.rm = TRUE), 3),
    mean_fp   = round(mean(fp_n, na.rm = TRUE), 1),
    sd_fp     = round(sd(fp_n,   na.rm = TRUE), 1),
    mean_fn   = round(mean(fn_n, na.rm = TRUE), 1),
    sd_fn     = round(sd(fn_n,   na.rm = TRUE), 1),
    .groups   = "drop"
  ) |>
  mutate(measure_label = bench_labels[measure])

k_levels <- sort(unique(item_summary$n_items))

make_wide_items <- function(data, value_col) {
  data |>
    select(measure_label, n_items, all_of(value_col)) |>
    tidyr::pivot_wider(names_from = n_items, values_from = all_of(value_col),
                       names_prefix = "k=") |>
    mutate(is_proposed = measure_label %in% c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))
}

render_wide_items <- function(wide_df, caption_text) {
  prop_rows <- which(wide_df$is_proposed)
  kable(
    wide_df[, !names(wide_df) %in% "is_proposed"],
    col.names = c("Measure", paste0("k=", k_levels)),
    caption   = caption_text
  ) |>
    kable_styling(full_width = FALSE) |>
    row_spec(prop_rows, background = "#e8f5e9", color = "#1b5e20")
}

render_wide_items(make_wide_items(item_summary, "mean_sens"),
  "Mean sensitivity by number of retained items (k). Zh excluded.")
Mean sensitivity by number of retained items (k). Zh excluded.
Measure k=10 k=15 k=20 k=25 k=30 k=35 k=40
IRV 0.385 0.397 0.400 0.398 0.401 0.401 0.401
Laz.R 0.432 0.487 0.555 0.612 0.629 0.632 0.630
LongString 0.502 0.508 0.491 0.490 0.488 0.509 0.525
Mahalanobis D² (RT) 0.026 0.005 0.001 0.001 0.001 0.000 0.000
Spearman ρ ★ 0.525 0.631 0.698 0.681 0.816 0.852 0.906
Weighted τ ★ 0.591 0.704 0.746 0.744 0.864 0.885 0.928
Zh NaN NaN NaN NaN NaN NaN NaN
Show code
render_wide_items(make_wide_items(item_summary, "mean_spec"),
  "Mean specificity by number of retained items (k).")
Mean specificity by number of retained items (k).
Measure k=10 k=15 k=20 k=25 k=30 k=35 k=40
IRV 0.956 0.947 0.956 0.965 0.959 0.969 0.974
Laz.R 0.918 0.943 0.945 0.945 0.949 0.955 0.953
LongString 0.887 0.912 0.951 0.957 0.969 0.966 0.963
Mahalanobis D² (RT) 0.975 0.973 0.972 0.972 0.972 0.972 0.972
Spearman ρ ★ 0.900 0.913 0.919 0.923 0.916 0.908 0.894
Weighted τ ★ 0.881 0.898 0.907 0.911 0.903 0.903 0.887
Zh NaN NaN NaN NaN NaN NaN NaN
Show code
render_wide_items(make_wide_items(item_summary, "mean_fp"),
  "Mean false positive count (flagged attenders) by number of retained items (k).")
Mean false positive count (flagged attenders) by number of retained items (k).
Measure k=10 k=15 k=20 k=25 k=30 k=35 k=40
IRV 194.6 236.1 196.7 153.9 180.6 140.1 115
Laz.R 366.1 253.1 246.7 245.8 226.6 199.1 209
LongString 504.8 392.5 217.7 192.5 139.0 150.6 167
Mahalanobis D² (RT) 110.7 122.3 124.4 124.3 124.5 124.8 125
Spearman ρ ★ 446.2 389.9 362.0 341.3 373.3 410.4 473
Weighted τ ★ 531.3 456.3 414.1 395.0 432.8 432.8 502
Zh NaN NaN NaN NaN NaN NaN NaN
Show code
render_wide_items(make_wide_items(item_summary, "mean_fn"),
  "Mean false negative count (missed non-attenders) by number of retained items (k).")
Mean false negative count (missed non-attenders) by number of retained items (k).
Measure k=10 k=15 k=20 k=25 k=30 k=35 k=40
IRV 332.9 326.1 324.6 325.7 323.9 324.3 324
Laz.R 307.4 277.4 240.9 209.8 200.7 198.9 200
LongString 269.3 265.9 275.5 275.9 276.7 265.9 257
Mahalanobis D² (RT) 526.7 538.3 540.4 540.3 540.5 540.8 541
Spearman ρ ★ 257.1 199.5 163.3 172.6 99.5 79.9 51
Weighted τ ★ 221.0 160.0 137.4 138.4 73.5 62.1 39
Zh NaN NaN NaN NaN NaN NaN NaN
Show code
item_colours <- c(
  setNames(colorRampPalette(c("grey60", "steelblue3"))(6),
           setdiff(unique(item_summary$measure_label),
                   c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))),
  "Spearman \u03c1 \u2605" = "#2e7d32",
  "Weighted \u03c4 \u2605" = "#81c784"
)
item_size_map <- c(
  setNames(rep(0.6, 6),
           setdiff(unique(item_summary$measure_label),
                   c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))),
  "Spearman \u03c1 \u2605" = 1.4,
  "Weighted \u03c4 \u2605" = 1.4
)

make_item_line_plot <- function(y_col, sd_col, y_label, title_str) {
  ggplot(item_summary, aes(x = n_items, y = .data[[y_col]],
                            colour = measure_label, linewidth = measure_label)) +
    geom_ribbon(aes(ymin = pmax(.data[[y_col]] - .data[[sd_col]], 0),
                    ymax = pmin(.data[[y_col]] + .data[[sd_col]], 1),
                    fill = measure_label),
                alpha = 0.08, colour = NA) +
    geom_line() +
    geom_point(size = 2) +
    scale_x_continuous(breaks = k_levels) +
    scale_colour_manual(values = item_colours, name = NULL) +
    scale_fill_manual(  values = item_colours, name = NULL, guide = "none") +
    scale_linewidth_manual(values = item_size_map, guide = "none") +
    labs(title = title_str,
         subtitle = "Mean \u00b1 1 SD across replications",
         x = "Items retained (k)", y = y_label) +
    theme(legend.position = "right", legend.text = element_text(size = 8))
}

make_item_line_plot("mean_sens", "sd_sens", "Sensitivity", "Sensitivity vs. items") /
make_item_line_plot("mean_spec", "sd_spec", "Specificity", "Specificity vs. items") +
  plot_layout(guides = "collect")

Sensitivity (top) and specificity (bottom) as a function of the number of retained items. Ribbons show ± 1 SD across replications. Proposed complexity–RT features are highlighted in green.

Interaction: N × Items

The heatmap below shows mean sensitivity as a joint function of sample size and number of items for each proposed feature, enabling a quick read of the operating conditions under which ρ and τ remain viable.

Show code
# Re-run a coarser grid (N × k) for the heatmap — use fewer reps
n_hm_reps   <- 10
ss_hm_sizes <- ss_sizes
items_hm    <- seq(10, ni_total, by = 10)  # 10, 20, 30, 40

set.seed(999)

hm_results <- do.call(rbind, lapply(ss_hm_sizes, function(n_s) {
  do.call(rbind, lapply(items_hm, function(k) {
    do.call(rbind, lapply(seq_len(n_hm_reps), function(r) {
      att_idx  <- which(dat$is_non_attender == 0)
      non_idx  <- which(dat$is_non_attender == 1)
      prop_non <- mean(dat$is_non_attender == 1)
      n_non    <- round(n_s * prop_non)
      n_att    <- n_s - n_non
      row_idx  <- c(
        sample(att_idx, min(n_att, length(att_idx)), replace = FALSE),
        sample(non_idx, min(n_non, length(non_idx)), replace = FALSE)
      )
      if (length(row_idx) < 20) return(NULL)

      item_idx <- sample(seq_len(ni_total), k, replace = FALSE)
      sub_rt   <- rt_mat[row_idx, ][, item_idx, drop = FALSE]
      sub_cx   <- cx_all[item_idx]
      truth    <- (dat$is_non_attender == 1)[row_idx]
      n_att_s  <- sum(!truth)
      if (n_att_s < 5 || sum(truth) < 5) return(NULL)

      cx_rk_hm <- rank(sub_cx)
      rho_hm   <- apply(sub_rt, 1, function(rt_i)
        cor(cx_rk_hm, rank(rt_i), method = "spearman"))
      wtau_hm  <- wtau_vectorised(sub_rt, sub_cx)

      kneedle_stats <- function(signal, truth) {
        signal[!is.finite(signal)] <- NA
        sorted <- sort(signal, na.last = NA)
        if (length(sorted) < 2)
          return(list(sens = NA, spec = NA, fp_pct = NA, fn_pct = NA))
        k_i  <- kneedle_fn(sorted, sign = -1)
        flag <- signal > sorted[k_i]
        cs   <- confusion_stats(flag, truth)
        n_att_loc <- sum(!truth)
        n_non_loc <- sum(truth)
        list(sens   = cs$Sensitivity,
             spec   = cs$Specificity,
             fp_pct = if (n_att_loc > 0) 100 * cs$FP / n_att_loc else NA,
             fn_pct = if (n_non_loc > 0) 100 * cs$FN / n_non_loc else NA)
      }

      s_rho  <- kneedle_stats(-rho_hm,  truth)
      s_wtau <- kneedle_stats(-wtau_hm, truth)

      data.frame(
        n             = n_s, k = k,
        sens_rho      = s_rho$sens,    sens_wtau   = s_wtau$sens,
        spec_rho      = s_rho$spec,    spec_wtau   = s_wtau$spec,
        fp_pct_rho    = s_rho$fp_pct,  fp_pct_wtau = s_wtau$fp_pct,
        fn_pct_rho    = s_rho$fn_pct,  fn_pct_wtau = s_wtau$fn_pct
      )
    }))
  }))
}))

hm_summary <- hm_results |>
  group_by(n, k) |>
  summarise(
    sens_rho    = round(mean(sens_rho,    na.rm = TRUE), 3),
    sens_wtau   = round(mean(sens_wtau,   na.rm = TRUE), 3),
    spec_rho    = round(mean(spec_rho,    na.rm = TRUE), 3),
    spec_wtau   = round(mean(spec_wtau,   na.rm = TRUE), 3),
    fp_pct_rho  = round(mean(fp_pct_rho,  na.rm = TRUE), 1),
    fp_pct_wtau = round(mean(fp_pct_wtau, na.rm = TRUE), 1),
    fn_pct_rho  = round(mean(fn_pct_rho,  na.rm = TRUE), 1),
    fn_pct_wtau = round(mean(fn_pct_wtau, na.rm = TRUE), 1),
    .groups     = "drop"
  )

fmt_label_y <- function(x) formatC(as.numeric(x), format = "d", big.mark = ",")

make_heatmap <- function(data, rho_col, wtau_col, metric_label,
                         fill_label, low_col, mid_col, high_col,
                         midpoint_val, fmt = "%.2f") {
  long_df <- data |>
    select(n, k, rho = all_of(rho_col), wtau = all_of(wtau_col)) |>
    tidyr::pivot_longer(cols = c(rho, wtau),
                        names_to = "feature", values_to = "value") |>
    mutate(feature = ifelse(feature == "rho", "Spearman \u03c1", "Weighted \u03c4"))

  ggplot(long_df, aes(x = factor(k), y = factor(n), fill = value)) +
    geom_tile(colour = "white", linewidth = 0.5) +
    geom_text(aes(label = sprintf(fmt, value)), size = 2.6) +
    scale_fill_gradient2(low = low_col, mid = mid_col, high = high_col,
                         midpoint = midpoint_val, na.value = "grey85",
                         name = fill_label) +
    facet_wrap(~ feature, ncol = 2) +
    scale_y_discrete(labels = fmt_label_y) +
    labs(title = paste(metric_label, "heatmap: N \u00d7 items retained"),
         x = "Items retained (k)", y = "Sample size (N)") +
    theme(axis.text.x = element_text(angle = 0),
          legend.position = "right")
}

p_hm_sens <- make_heatmap(hm_summary, "sens_rho", "sens_wtau",
  "Sensitivity", "Mean\nSensitivity",
  "tomato", "lightyellow", "#2e7d32", 0.5)

p_hm_spec <- make_heatmap(hm_summary, "spec_rho", "spec_wtau",
  "Specificity", "Mean\nSpecificity",
  "tomato", "lightyellow", "#2e7d32", 0.5)

p_hm_fp <- make_heatmap(hm_summary, "fp_pct_rho", "fp_pct_wtau",
  "False Positive Rate (%)", "Mean FP\nRate (%)",
  "#2e7d32", "lightyellow", "tomato", 50, fmt = "%.1f%%")

p_hm_fn <- make_heatmap(hm_summary, "fn_pct_rho", "fn_pct_wtau",
  "False Negative Rate (%)", "Mean FN\nRate (%)",
  "#2e7d32", "lightyellow", "tomato", 50, fmt = "%.1f%%")

p_hm_sens / p_hm_spec / p_hm_fp / p_hm_fn

Mean sensitivity, specificity, FP count, and FN count for Spearman ρ (left column) and weighted τ (right column) as a joint function of sample size N and items retained k. Grey cells = conditions not evaluated.

The Complexity–RT Feature

Computing Spearman Correlations

We compute the complexity–RT Spearman correlation separately for each scale using that scale’s 20 items and fixed complexity scores:

\[\rho_{i,s} = \text{Spearman}\!\left(\text{rank}(c_1^{(s)}, \ldots, c_{ni}^{(s)}),\; \text{rank}(\text{RT}_{i,1}^{(s)}, \ldots, \text{RT}_{i,ni}^{(s)})\right)\]

No outlier capping is applied. Spearman’s rank correlation is robust to extreme values by construction — a single stall-contaminated RT moves to the top of that person’s rank order but does not inflate the correlation beyond its \([-1, 1]\) bounds. Capping would discard potentially informative signal about relative item difficulty.

Show code
# ── roc_auc, make_roc_df, kneedle_fn, apply_kneedle_flag are now defined in
# ── the setup chunk and available globally ────────────────────────────────────

# Attention-vs-A plot (filters results dynamically — does NOT use a pre-snapshot)
make_attention_plot <- function(col, label) {
  att_df <- results[results$is_non_attender == 0, ]
  na_q <- quantile(results[[col]][results$is_non_attender == 1],
                   probs = c(0.05, 0.25, 0.50, 0.75, 0.95), na.rm = TRUE)
  ggplot(att_df, aes(x = A, y = .data[[col]])) +
    annotate("rect", xmin = -Inf, xmax = Inf,
             ymin = na_q["5%"],  ymax = na_q["95%"],
             fill = "tomato", alpha = 0.08) +
    annotate("rect", xmin = -Inf, xmax = Inf,
             ymin = na_q["25%"], ymax = na_q["75%"],
             fill = "tomato", alpha = 0.12) +
    geom_hline(yintercept = na_q["50%"], colour = "tomato3",
               linetype = "dashed", linewidth = 0.7) +
    geom_point(aes(colour = factor(Z)), alpha = 0.35, size = 1.4) +
    geom_smooth(method = "loess", span = 0.6, se = TRUE,
                colour = "steelblue4", fill = "steelblue", alpha = 0.2) +
    scale_colour_manual(values = c("0" = "grey60", "1" = "steelblue"),
                        labels = c("Control", "Treatment"), name = "Arm") +
    labs(title = label, subtitle = "Red band = non-attender 5–95th pct",
         x = "Latent attention A", y = "Score")
}

# Subtype ridge plot (uses results$subtype — only call after subtype-label chunk)
make_subtype_ridge <- function(col, label) {
  subtype_order <- results |>
    group_by(subtype) |>
    summarise(med = median(.data[[col]], na.rm = TRUE), .groups = "drop") |>
    arrange(med) |>
    pull(subtype)
  results$subtype_f <- factor(results$subtype, levels = subtype_order)
  ggplot(results,
         aes(x = .data[[col]], y = subtype_f,
             fill = ifelse(subtype == "Attender", "Attender", "Non-attender"))) +
    ggridges::geom_density_ridges(
      alpha = 0.55, colour = "white", scale = 0.85,
      quantile_lines = TRUE, quantiles = 2) +
    geom_vline(xintercept = 0, linetype = "dashed",
               colour = "grey30", linewidth = 0.7) +
    scale_fill_manual(values = c("Attender" = "steelblue",
                                  "Non-attender" = "tomato"), name = NULL) +
    labs(title = label, subtitle = "Median line shown; dashed = 0",
         x = "Score", y = NULL) +
    theme(legend.position = "right")
}

# ── Per-scale Spearman rho ────────────────────────────────────────────────────
compute_spearman_scale <- function(dat, scale_idx) {
  ni      <- params$items_per_scale
  rt_cols <- paste0("RT_s", scale_idx, "_i", seq_len(ni))
  cx      <- instrument[[scale_idx]]$complexity
  rt_mat  <- as.matrix(dat[, rt_cols])
  cx_rank <- rank(cx)
  apply(rt_mat, 1, function(rt_i) {
    cor(cx_rank, rank(rt_i), method = "spearman")
  })
}

for (s in seq_len(ns)) {
  results[[paste0("spearman_s", s)]] <- compute_spearman_scale(dat, s)
}

cat("Spearman ρ summary (attenders vs non-attenders):\n")
Spearman ρ summary (attenders vs non-attenders):
Show code
for (col in c("spearman_s1", "spearman_s2")) {
  att <- results[[col]][results$is_non_attender == 0]
  non <- results[[col]][results$is_non_attender == 1]
  cat(sprintf("  %-20s  attender mean = %.3f  |  non-attender mean = %.3f\n",
              col, mean(att, na.rm = TRUE), mean(non, na.rm = TRUE)))
}
  spearman_s1           attender mean = 0.436  |  non-attender mean = -0.010
  spearman_s2           attender mean = 0.408  |  non-attender mean = -0.008

Computing Weighted Kendall τ

Standard Spearman ρ treats every rank inversion equally, so a noise-driven flip between two items whose complexity scores are nearly identical counts the same as a flip between an easy and a hard item. When item complexities cluster closely, even genuinely attentive respondents can accumulate many near-tie inversions from RT noise alone, driving ρ downward unfairly.

The complexity-weighted Kendall τ addresses this by scaling each concordant or discordant pair by the absolute complexity difference between the two items:

\[\tau_w = \frac{\displaystyle\sum_{i<j} |c_i - c_j|\cdot \text{sign}(c_i - c_j)\cdot\text{sign}(\text{RT}_i - \text{RT}_j)} {\displaystyle\sum_{i<j} |c_i - c_j|}\]

A flip between items 0.05 complexity-units apart contributes almost nothing; a flip between items 1.0 apart contributes fully. The statistic is still in \([-1, 1]\), anchored at 0 for random responding, and drops into the same threshold and mixture pipeline as ρ.

Show code
compute_weighted_tau <- function(dat, scale_idx) {
  ni      <- params$items_per_scale
  rt_cols <- paste0("RT_s", scale_idx, "_i", seq_len(ni))
  cx      <- instrument[[scale_idx]]$complexity
  rt_mat  <- as.matrix(dat[, rt_cols])

  pairs   <- combn(ni, 2)        # 2 × C(ni,2) matrix of index pairs
  cx_diff <- cx[pairs[1,]] - cx[pairs[2,]]
  w       <- abs(cx_diff)        # pair weights = |complexity difference|

  apply(rt_mat, 1, function(rt_i) {
    rt_diff     <- rt_i[pairs[1,]] - rt_i[pairs[2,]]
    concordance <- sign(cx_diff) * sign(rt_diff)
    sum(w * concordance) / sum(w)
  })
}

for (s in seq_len(ns)) {
  results[[paste0("wtau_s", s)]] <- compute_weighted_tau(dat, s)
}

cat("Weighted τ summary (attenders vs non-attenders):\n")
Weighted τ summary (attenders vs non-attenders):
Show code
for (col in c("wtau_s1", "wtau_s2")) {
  att <- results[[col]][results$is_non_attender == 0]
  non <- results[[col]][results$is_non_attender == 1]
  cat(sprintf("  %-12s  attender mean = %.3f  |  non-attender mean = %.3f\n",
              col, mean(att, na.rm = TRUE), mean(non, na.rm = TRUE)))
}
  wtau_s1       attender mean = 0.454  |  non-attender mean = -0.010
  wtau_s2       attender mean = 0.459  |  non-attender mean = -0.010
Show code
# Assign subtype here so all downstream chunks (including wtau comparisons) can use it
results$subtype <- with(results, case_when(
  is_non_attender == 0                         ~ "Attender",
  non_attender_style == "plausible_noise"      ~ "plausible_noise",
  TRUE                                         ~ non_attender_pattern_type
))

Comparing Spearman ρ and Weighted τ

Distributions

Show code
make_dist_panel <- function(col, label, xlab) {
  ggplot(results, aes(x = .data[[col]], fill = class_label)) +
    geom_density(alpha = 0.5, colour = NA) +
    geom_vline(xintercept = 0, linetype = "dashed",
               colour = "grey30", linewidth = 0.7) +
    scale_fill_manual(
      values = c("Attender (true)" = "steelblue",
                 "Non-attender (true)" = "tomato"), name = NULL) +
    labs(title = label, x = xlab, y = "Density")
}

(make_dist_panel("spearman_s1", "Spearman ρ — Scale 1", "Spearman ρ") +
 make_dist_panel("spearman_s2", "Spearman ρ — Scale 2", "Spearman ρ")) /
(make_dist_panel("wtau_s1",    "Weighted τ — Scale 1", "Weighted τ") +
 make_dist_panel("wtau_s2",    "Weighted τ — Scale 2", "Weighted τ"))

Distributions of Spearman ρ (top) and weighted τ (bottom) by true class, for each scale. If near-tie noise is inflating FPs, the weighted τ distributions should show better separation — a cleaner gap between attenders and non-attenders.

AUC and Descriptive Comparison

Show code
compare_features <- list(
  "ρ (S1)"   = "spearman_s1",
  "ρ (S2)"   = "spearman_s2",
  "τ_w (S1)" = "wtau_s1",
  "τ_w (S2)" = "wtau_s2"
)

compare_tab <- do.call(rbind, lapply(names(compare_features), function(nm) {
  col <- compare_features[[nm]]
  data.frame(
    Feature  = nm,
    AUC      = round(roc_auc(results[[col]], results$true_inattentive), 3),
    mean_att = round(mean(results[[col]][results$is_non_attender == 0], na.rm = TRUE), 3),
    mean_non = round(mean(results[[col]][results$is_non_attender == 1], na.rm = TRUE), 3),
    sd_att   = round(sd(results[[col]][results$is_non_attender == 0],   na.rm = TRUE), 3),
    sd_non   = round(sd(results[[col]][results$is_non_attender == 1],   na.rm = TRUE), 3)
  )
}))

kable(
  compare_tab,
  col.names = c("Feature", "AUC",
                "Mean (attenders)", "Mean (non-attenders)",
                "SD (attenders)",   "SD (non-attenders)"),
  caption = "Spearman ρ vs. weighted τ: discriminability by scale"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which.max(compare_tab$AUC), bold = TRUE, background = "#eef6ee")
Spearman ρ vs. weighted τ: discriminability by scale
Feature AUC Mean (attenders) Mean (non-attenders) SD (attenders) SD (non-attenders)
ρ (S1) 0.909 0.436 -0.010 0.234 0.229
ρ (S2) 0.895 0.408 -0.008 0.226 0.230
τ_w (S1) 0.913 0.454 -0.010 0.237 0.234
τ_w (S2) 0.908 0.459 -0.010 0.235 0.248

How Attention Moderates Weighted τ

Show code
att_results_w <- results[results$is_non_attender == 0, ]

make_attention_plot("wtau_s1", "Weighted τ — Scale 1") +
make_attention_plot("wtau_s2", "Weighted τ — Scale 2")

Weighted τ vs. latent attention A among attenders, for Scale 1 (left) and Scale 2 (right). The red band shows the non-attender 5–95th percentile range. Compare to the equivalent Spearman ρ plots above — the weighted τ should show a tighter attender band if near-tie noise was adding variance to ρ.

Weighted τ by Non-attender Subtype

Show code
make_subtype_ridge("wtau_s1", "Weighted τ — Scale 1") +
make_subtype_ridge("wtau_s2", "Weighted τ — Scale 2")

Weighted τ distributions by subtype for Scale 1 (left) and Scale 2 (right). Compare to the Spearman ρ subtype plots — subtypes that were hard to detect under ρ may show greater separation here.
Show code
wtau_subtype_tab <- results |>
  group_by(subtype) |>
  summarise(
    N          = n(),
    mean_rho_s1  = round(mean(spearman_s1, na.rm = TRUE), 3),
    mean_wtau_s1 = round(mean(wtau_s1,     na.rm = TRUE), 3),
    mean_rho_s2  = round(mean(spearman_s2, na.rm = TRUE), 3),
    mean_wtau_s2 = round(mean(wtau_s2,     na.rm = TRUE), 3),
    .groups = "drop"
  ) |>
  arrange(mean_wtau_s1)

kable(
  wtau_subtype_tab,
  col.names = c("Subtype", "N",
                "Mean ρ (S1)", "Mean τ_w (S1)",
                "Mean ρ (S2)", "Mean τ_w (S2)"),
  caption = "Spearman ρ vs. weighted τ by subtype"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(wtau_subtype_tab$subtype == "Attender"), bold = TRUE)
Spearman ρ vs. weighted τ by subtype
Subtype N Mean ρ (S1) Mean τ_w (S1) Mean ρ (S2) Mean τ_w (S2)
midpoint_default 53 -0.034 -0.045 0.016 0.027
plausible_noise 143 -0.039 -0.037 -0.019 -0.011
disacquiescent 25 -0.014 -0.023 -0.029 -0.034
uniform_random 25 -0.021 -0.022 -0.029 -0.055
biased_random 36 -0.004 -0.010 0.023 0.018
acquiescent 54 -0.008 -0.009 -0.007 -0.001
near_straightline 57 -0.016 -0.006 0.042 0.025
straightline 114 -0.002 0.000 -0.006 -0.024
alternating 17 0.117 0.125 -0.129 -0.135
diagonal 17 0.135 0.130 -0.045 -0.007
Attender 4459 0.436 0.454 0.408 0.459

Head-to-head FP / FN: Kneedle Threshold

Show code
# Apply Kneedle to Spearman ρ per scale (also needed here for comparison plots)
kn_s1 <- apply_kneedle_flag(results$spearman_s1)
kn_s2 <- apply_kneedle_flag(results$spearman_s2)

results$flag_kn_s1 <- kn_s1$flag
results$flag_kn_s2 <- kn_s2$flag

thresh_zero_s1   <- 0
thresh_zero_s2   <- 0
thresh_oracle_s1 <- quantile(results$spearman_s1[results$is_non_attender == 0], 0.05, na.rm = TRUE)
thresh_oracle_s2 <- quantile(results$spearman_s2[results$is_non_attender == 0], 0.05, na.rm = TRUE)

results$flag_zero_s1   <- results$spearman_s1 < thresh_zero_s1
results$flag_zero_s2   <- results$spearman_s2 < thresh_zero_s2
results$flag_oracle_s1 <- results$spearman_s1 < thresh_oracle_s1
results$flag_oracle_s2 <- results$spearman_s2 < thresh_oracle_s2

# Apply Kneedle to weighted τ per scale
kn_wtau_s1 <- apply_kneedle_flag(results$wtau_s1)
kn_wtau_s2 <- apply_kneedle_flag(results$wtau_s2)

results$flag_kn_wtau_s1 <- kn_wtau_s1$flag
results$flag_kn_wtau_s2 <- kn_wtau_s2$flag

thresh_zero_wtau_s1   <- 0
thresh_zero_wtau_s2   <- 0
thresh_oracle_wtau_s1 <- quantile(results$wtau_s1[results$is_non_attender == 0], 0.05, na.rm = TRUE)
thresh_oracle_wtau_s2 <- quantile(results$wtau_s2[results$is_non_attender == 0], 0.05, na.rm = TRUE)

results$flag_zero_wtau_s1   <- results$wtau_s1 < thresh_zero_wtau_s1
results$flag_zero_wtau_s2   <- results$wtau_s2 < thresh_zero_wtau_s2
results$flag_oracle_wtau_s1 <- results$wtau_s1 < thresh_oracle_wtau_s1
results$flag_oracle_wtau_s2 <- results$wtau_s2 < thresh_oracle_wtau_s2

cat(sprintf("Kneedle thresholds — ρ:    S1 = %.3f | S2 = %.3f\n",
            kn_s1$threshold, kn_s2$threshold))
Kneedle thresholds — ρ:    S1 = 0.090 | S2 = 0.086
Show code
cat(sprintf("Kneedle thresholds — τ_w:  S1 = %.3f | S2 = %.3f\n",
            kn_wtau_s1$threshold, kn_wtau_s2$threshold))
Kneedle thresholds — τ_w:  S1 = 0.104 | S2 = 0.159
Show code
# Head-to-head performance table
head2head <- do.call(rbind, lapply(list(
    list(flag = results$flag_kn_s1,      label = "ρ (S1)"),
    list(flag = results$flag_kn_wtau_s1, label = "τ_w (S1)"),
    list(flag = results$flag_kn_s2,      label = "ρ (S2)"),
    list(flag = results$flag_kn_wtau_s2, label = "τ_w (S2)")
  ), function(x) {
  cs <- confusion_stats(x$flag, results$true_inattentive)
  data.frame(
    Feature     = x$label,
    Sensitivity = round(cs$Sensitivity, 3),
    Specificity = round(cs$Specificity, 3),
    Pct_Flagged = round(cs$Flagged_pct, 1),
    TP = cs$TP, FP = cs$FP, TN = cs$TN, FN = cs$FN
  )
}))

kable(
  head2head,
  col.names = c("Feature", "Sensitivity", "Specificity",
                "% Flagged", "TP", "FP", "TN", "FN"),
  caption = "Head-to-head: Spearman ρ vs. weighted τ (Kneedle threshold per scale)"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which.max(head2head$Sensitivity), bold = TRUE, background = "#eef6ee")
Head-to-head: Spearman ρ vs. weighted τ (Kneedle threshold per scale)
Feature Sensitivity Specificity % Flagged TP FP TN FN
ρ (S1) 0.638 0.915 14.5 345 378 4081 196
τ_w (S1) 0.669 0.916 14.7 362 373 4086 179
ρ (S2) 0.658 0.909 15.2 356 406 4053 185
τ_w (S2) 0.739 0.888 18.0 400 498 3961 141
Show code
make_fp_comparison_plot <- function(flag_rho, flag_wtau, scale_label) {
  results |>
    filter(is_non_attender == 0) |>
    tidyr::pivot_longer(
      cols      = c(all_of(flag_rho), all_of(flag_wtau)),
      names_to  = "feature",
      values_to = "flagged"
    ) |>
    mutate(feature = factor(feature,
      levels = c(flag_rho, flag_wtau),
      labels = c("Spearman ρ (Kneedle)", "Weighted τ (Kneedle)"))) |>
    ggplot(aes(x = A, fill = flagged)) +
    geom_density(alpha = 0.55, colour = NA) +
    scale_fill_manual(
      values = c("TRUE" = "tomato", "FALSE" = "steelblue"),
      labels = c("TRUE" = "FP (flagged)", "FALSE" = "TN (retained)"),
      name = NULL) +
    facet_wrap(~ feature, ncol = 2) +
    labs(title = scale_label, x = "Latent attention A", y = "Density")
}

make_fp_comparison_plot("flag_kn_s1", "flag_kn_wtau_s1", "Scale 1") /
make_fp_comparison_plot("flag_kn_s2", "flag_kn_wtau_s2", "Scale 2")

Attention distribution of false positives under Spearman ρ vs. weighted τ (Kneedle threshold), for Scale 1 (top) and Scale 2 (bottom). If weighted τ is reducing near-tie FPs, its FP group should be more concentrated at genuinely low-A attenders.
Show code
rho_wtau_subtype <- results |>
  filter(is_non_attender == 1) |>
  group_by(subtype) |>
  summarise(
    N            = n(),
    det_rho_s1   = round(mean(flag_kn_s1),      3),
    det_wtau_s1  = round(mean(flag_kn_wtau_s1), 3),
    det_rho_s2   = round(mean(flag_kn_s2),      3),
    det_wtau_s2  = round(mean(flag_kn_wtau_s2), 3),
    .groups = "drop"
  ) |>
  arrange(desc(det_wtau_s1))

kable(
  rho_wtau_subtype,
  col.names = c("Subtype", "N",
                "Det. ρ (S1)", "Det. τ_w (S1)",
                "Det. ρ (S2)", "Det. τ_w (S2)"),
  caption = "Detection rates by subtype: Spearman ρ vs. weighted τ (Kneedle)"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(rho_wtau_subtype$det_wtau_s1 < 0.5),  color = "tomato") |>
  row_spec(which(rho_wtau_subtype$det_wtau_s1 >= 0.5), color = "steelblue")
Detection rates by subtype: Spearman ρ vs. weighted τ (Kneedle)
Subtype N Det. ρ (S1) Det. τ_w (S1) Det. ρ (S2) Det. τ_w (S2)
disacquiescent 25 0.760 0.760 0.600 0.680
midpoint_default 53 0.774 0.736 0.623 0.660
straightline 114 0.605 0.719 0.649 0.772
plausible_noise 143 0.650 0.692 0.678 0.769
acquiescent 54 0.630 0.648 0.685 0.685
near_straightline 57 0.614 0.614 0.614 0.719
biased_random 36 0.583 0.611 0.639 0.611
uniform_random 25 0.720 0.600 0.720 0.840
alternating 17 0.529 0.471 0.765 1.000
diagonal 17 0.353 0.471 0.647 0.706
Show code
make_rho_wtau_roc_panel <- function(rho_col, wtau_col, truth, scale_label) {
  r_rho  <- make_roc_df(results[[rho_col]],  truth, "Spearman ρ")
  r_wtau <- make_roc_df(results[[wtau_col]], truth, "Weighted τ")
  auc_rho  <- round(roc_auc(results[[rho_col]],  truth), 3)
  auc_wtau <- round(roc_auc(results[[wtau_col]], truth), 3)
  roc_df <- rbind(r_rho, r_wtau)

  ggplot(roc_df, aes(x = fpr, y = tpr, colour = variant)) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey60") +
    geom_path(linewidth = 0.9) +
    scale_colour_manual(
      values = c("Spearman ρ" = "steelblue", "Weighted τ" = "darkorange"),
      labels = c(
        sprintf("Spearman ρ  (AUC = %.3f)", auc_rho),
        sprintf("Weighted τ  (AUC = %.3f)", auc_wtau)
      ),
      name = NULL) +
    labs(title = scale_label,
         x = "False Positive Rate (1 − Specificity)",
         y = "Sensitivity (True Positive Rate)") +
    theme(legend.position = "bottom")
}

make_rho_wtau_roc_panel("spearman_s1", "wtau_s1", results$true_inattentive, "Scale 1") +
make_rho_wtau_roc_panel("spearman_s2", "wtau_s2", results$true_inattentive, "Scale 2")

ROC curves for Spearman ρ and weighted τ on Scale 1 (left) and Scale 2 (right). A higher AUC for τ_w confirms that down-weighting near-tie rank inversions improves discrimination.

Computing Within-person RT Regression Features

Spearman ρ measures whether complexity and RT co-vary in rank order, but it does not quantify how much unexplained RT variation remains after accounting for complexity. To capture this, we fit a within-person OLS regression of log-transformed RT on item complexity alone, pooled across all 40 items:

\[\log(\text{RT}_{ij}) = \alpha_i + \beta_i \cdot c_j + \varepsilon_{ij}\]

This gives two per-person features:

  • \(\hat\beta_i\)complexity slope: how much log-RT increases per unit of complexity. Positive for attenders; near zero for non-attenders who are insensitive to item difficulty.
  • MARmean absolute residual \(\frac{1}{n}\sum_j|\hat\varepsilon_{ij}|\): average unexplained log-RT variance after accounting for complexity. High MAR means RT is highly variable in ways that complexity alone cannot predict.

Note that MAR and Spearman ρ are complementary rather than redundant. ρ captures whether the rank order of RTs matches the rank order of complexity; MAR captures how much absolute deviation remains around the complexity trend, regardless of whether that trend is positive. A respondent could have a moderate positive ρ but high MAR (complexity partially predicts their RTs but with a lot of scatter), or low ρ and low MAR (flat RT pattern with little variance — characteristic of patterned non-attenders).

Show code
compute_rt_regression <- function(dat) {
  ni  <- params$items_per_scale

  cx_all <- unlist(lapply(seq_len(ns), function(s) instrument[[s]]$complexity))

  rt_cols_all <- unlist(lapply(seq_len(ns), function(s)
    paste0("RT_s", s, "_i", seq_len(ni))))

  rt_mat <- as.matrix(dat[, rt_cols_all])

  # Fit OLS per person: log(RT) ~ complexity
  # Use vapply to guarantee a 2-row matrix regardless of nrow(dat)
  out <- vapply(seq_len(nrow(rt_mat)), function(i) {
    rt_i   <- rt_mat[i, ]
    log_rt <- log(pmax(rt_i, 1e-6))

    fit <- tryCatch(
      lm.fit(x = cbind(1, cx_all), y = log_rt),
      error = function(e) NULL
    )

    if (is.null(fit) || length(fit$coefficients) < 2) {
      return(c(beta = NA_real_, mar = NA_real_))
    }

    c(beta = fit$coefficients[[2]],
      mar  = mean(abs(fit$residuals), na.rm = TRUE))
  }, numeric(2))

  # out is 2 x nrow(dat); transpose to nrow x 2
  out <- t(out)
  colnames(out) <- c("beta", "mar")

  data.frame(
    reg_beta = out[, "beta"],
    reg_mar  = out[, "mar"]
  )
}

reg_feats       <- compute_rt_regression(dat)
results$reg_beta <- reg_feats$reg_beta
results$reg_mar  <- reg_feats$reg_mar

cat("Regression feature summary (attenders vs non-attenders):\n")
Regression feature summary (attenders vs non-attenders):
Show code
for (col in c("reg_beta", "reg_mar")) {
  att <- results[[col]][results$is_non_attender == 0]
  non <- results[[col]][results$is_non_attender == 1]
  cat(sprintf("  %-12s  attender mean = %+.3f  |  non-attender mean = %+.3f\n",
              col, mean(att, na.rm = TRUE), mean(non, na.rm = TRUE)))
}
  reg_beta      attender mean = +0.192  |  non-attender mean = -0.003
  reg_mar       attender mean = +0.273  |  non-attender mean = +0.194

Understanding the Feature: Instrument Complexity Structure

Show code
cx_df <- do.call(rbind, lapply(seq_len(ns), function(s) {
  data.frame(
    scale      = paste0("Scale ", s),
    item       = seq_len(ni),
    complexity = instrument[[s]]$complexity
  )
}))

ggplot(cx_df, aes(x = item, y = complexity)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") +
  geom_point(size = 2.5, colour = "steelblue") +
  geom_segment(aes(xend = item, yend = 0), linewidth = 0.4, alpha = 0.5,
               colour = "steelblue") +
  facet_wrap(~ scale, ncol = 2) +
  labs(title    = "Item complexity scores by scale",
       subtitle = sprintf(
         "complexity_rt_weight = %g — each SD of complexity multiplies RT by exp(%g × A)",
         params$complexity_rt_weight, params$complexity_rt_weight),
       x = "Item index", y = "Complexity score")

Per-item complexity scores for each scale. Items to the right are harder and should elicit longer RTs from attentive respondents.

Feature Distributions: Scale 1 and Scale 2

Show code
rho_variants <- list(
  "Scale 1" = "spearman_s1",
  "Scale 2" = "spearman_s2"
)

plots_all <- lapply(names(rho_variants), function(label) {
  col <- rho_variants[[label]]
  ggplot(results, aes(x = .data[[col]], fill = class_label)) +
    geom_density(alpha = 0.5, colour = NA) +
    geom_vline(xintercept = 0, linetype = "dashed",
               colour = "grey30", linewidth = 0.7) +
    scale_fill_manual(
      values = c("Attender (true)" = "steelblue",
                 "Non-attender (true)" = "tomato"), name = NULL) +
    labs(title = label, x = "Spearman ρ", y = "Density")
})

wrap_plots(plots_all, ncol = 2)

Distributions of Spearman ρ for Scale 1 (left) and Scale 2 (right) by true attender status. The dashed line marks ρ = 0. Wider separation between blue and red distributions indicates better discriminative power.

Discriminability Comparison

```cfrzpxertmbtgwhlzswbaphwtchrodaauc_tab <- data.frame( Variant = names(rho_variants), n_items = c(ni, ni), AUC = sapply(rho_variants, function(col) round(roc_auc(results[[col]], results\(true_inattentive), 3)), mean_att = sapply(rho_variants, function(col) round(mean(results[[col]][results\)is_non_attender == 0], na.rm = TRUE), 3)), mean_non = sapply(rho_variants, function(col) round(mean(results[[col]][results\(is_non_attender == 1], na.rm = TRUE), 3)), sd_att = sapply(rho_variants, function(col) round(sd(results[[col]][results\)is_non_attender == 0], na.rm = TRUE), 3)), sd_non = sapply(rho_variants, function(col) round(sd(results[[col]][results$is_non_attender == 1], na.rm = TRUE), 3)), row.names = NULL )

kable( auc_tab, col.names = c(“Scale”, “Items”, “AUC”, “Mean ρ (attenders)”, “Mean ρ (non-attenders)”, “SD ρ (attenders)”, “SD ρ (non-attenders)”), caption = “Discriminability of Spearman ρ by scale” ) |> kable_styling(full_width = FALSE) |> row_spec(which.max(auc_tab$AUC), bold = TRUE, background = “#eef6ee”)


## How Attention Moderates Each Scale


::: {.cell}

```{.r .cell-code}
make_attention_plot("spearman_s1", "Scale 1") +
make_attention_plot("spearman_s2", "Scale 2")

Spearman ρ for Scale 1 (left) and Scale 2 (right) vs. latent attention A among attenders. The red band shows the non-attender 5–95th percentile range.

:::

Feature by Non-attender Subtype

Show code
make_subtype_ridge("spearman_s1", "Scale 1") +
make_subtype_ridge("spearman_s2", "Scale 2")

Spearman ρ distributions by subtype (ridge plots), shown separately for Scale 1 (left) and Scale 2 (right). Subtypes whose distribution overlaps heavily with attenders will be harder to detect.
Show code
subtype_sp_tab <- results |>
  group_by(subtype) |>
  summarise(
    N       = n(),
    mean_s1 = round(mean(spearman_s1, na.rm = TRUE), 3),
    mean_s2 = round(mean(spearman_s2, na.rm = TRUE), 3),
    .groups = "drop"
  ) |>
  arrange(mean_s1)

kable(
  subtype_sp_tab,
  col.names = c("Subtype", "N", "Mean ρ (S1)", "Mean ρ (S2)"),
  caption   = "Mean Spearman ρ by scale and subtype"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(subtype_sp_tab$subtype == "Attender"), bold = TRUE)
Mean Spearman ρ by scale and subtype
Subtype N Mean ρ (S1) Mean ρ (S2)
plausible_noise 143 -0.039 -0.019
midpoint_default 53 -0.034 0.016
uniform_random 25 -0.021 -0.029
near_straightline 57 -0.016 0.042
disacquiescent 25 -0.014 -0.029
acquiescent 54 -0.008 -0.007
biased_random 36 -0.004 0.023
straightline 114 -0.002 -0.006
alternating 17 0.117 -0.129
diagonal 17 0.135 -0.045
Attender 4459 0.436 0.408

Regression Feature Distributions

Show code
p_beta <- ggplot(results, aes(x = reg_beta, fill = class_label)) +
  geom_density(alpha = 0.5, colour = NA) +
  geom_vline(xintercept = 0, linetype = "dashed",
             colour = "grey30", linewidth = 0.7) +
  scale_fill_manual(
    values = c("Attender (true)" = "steelblue",
               "Non-attender (true)" = "tomato"), name = NULL) +
  labs(title    = "β̂ — complexity slope",
       subtitle = "log(RT) increase per unit complexity",
       x = "β̂", y = "Density")

p_mar <- ggplot(results, aes(x = reg_mar, fill = class_label)) +
  geom_density(alpha = 0.5, colour = NA) +
  scale_fill_manual(
    values = c("Attender (true)" = "steelblue",
               "Non-attender (true)" = "tomato"), name = NULL) +
  labs(title    = "MAR — mean absolute residual",
       subtitle = "Unexplained log(RT) variance after complexity-only regression",
       x = "MAR", y = "Density")

p_beta + p_mar

Distributions of the two within-person regression features by true attender status. β (complexity slope) should be positive for attenders and near zero for non-attenders. MAR should be lower for attenders whose RTs follow a clean complexity-sensitive pattern.
Show code
att_only <- results[results$is_non_attender == 0, ]

make_reg_attention_plot <- function(col, label, ylab) {
  na_q <- quantile(results[[col]][results$is_non_attender == 1],
                   probs = c(0.05, 0.25, 0.50, 0.75, 0.95), na.rm = TRUE)
  ggplot(att_only, aes(x = A, y = .data[[col]])) +
    annotate("rect", xmin = -Inf, xmax = Inf,
             ymin = na_q["5%"],  ymax = na_q["95%"],
             fill = "tomato", alpha = 0.08) +
    annotate("rect", xmin = -Inf, xmax = Inf,
             ymin = na_q["25%"], ymax = na_q["75%"],
             fill = "tomato", alpha = 0.12) +
    geom_hline(yintercept = na_q["50%"], colour = "tomato3",
               linetype = "dashed", linewidth = 0.7) +
    geom_point(aes(colour = factor(Z)), alpha = 0.35, size = 1.4) +
    geom_smooth(method = "loess", span = 0.6, se = TRUE,
                colour = "steelblue4", fill = "steelblue", alpha = 0.2) +
    geom_vline(xintercept = 0, colour = NA) +
    scale_colour_manual(values = c("0" = "grey60", "1" = "steelblue"),
                        labels = c("Control", "Treatment"), name = "Arm") +
    labs(title = label,
         subtitle = "Red band = non-attender 5–95th pct",
         x = "Latent attention A", y = ylab)
}

make_reg_attention_plot("reg_beta", "β̂ vs. latent attention", "β̂ (complexity slope)")

β̂ vs. latent attention A among attenders. The feature should increase with A because more attentive respondents are more sensitive to item difficulty. The red band shows the non-attender distribution for reference.

Regression Features by Non-attender Subtype

Show code
subtype_reg_tab <- results |>
  group_by(subtype) |>
  summarise(
    N          = n(),
    mean_beta  = round(mean(reg_beta, na.rm = TRUE), 3),
    mean_mar   = round(mean(reg_mar,  na.rm = TRUE), 3),
    .groups    = "drop"
  ) |>
  arrange(mean_beta)

kable(
  subtype_reg_tab,
  col.names = c("Subtype", "N", "Mean β̂", "Mean MAR"),
  caption   = "Within-person regression features by group and subtype"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(subtype_reg_tab$subtype == "Attender"), bold = TRUE)
Within-person regression features by group and subtype
Subtype N Mean β | Mean MA
uniform_random 25 -0.009 0.192
disacquiescent 25 -0.008 0.198
plausible_noise 143 -0.005 0.197
alternating 17 -0.003 0.179
midpoint_default 53 -0.003 0.198
straightline 114 -0.003 0.192
acquiescent 54 0.000 0.196
biased_random 36 0.000 0.199
near_straightline 57 0.000 0.186
diagonal 17 0.017 0.199
Attender 4459 0.192 0.273

AUC of Regression Features vs. Spearman ρ

Show code
all_features <- list(
  "ρ (Scale 1)"  = list(col = "spearman_s1", direction = "low"),
  "ρ (Scale 2)"  = list(col = "spearman_s2", direction = "low"),
  "β̂ (slope)"   = list(col = "reg_beta",    direction = "low"),
  "MAR"           = list(col = "reg_mar",     direction = "high")
)

feat_auc_tab <- do.call(rbind, lapply(names(all_features), function(nm) {
  col  <- all_features[[nm]]$col
  dir  <- all_features[[nm]]$direction
  scores <- if (dir == "low") results[[col]] else -results[[col]]
  auc    <- roc_auc(scores, results$true_inattentive)
  data.frame(
    Feature   = nm,
    Direction = ifelse(dir == "low", "Low → inattentive", "High → inattentive"),
    AUC       = round(auc, 3),
    mean_att  = round(mean(results[[col]][results$is_non_attender == 0], na.rm=TRUE), 3),
    mean_non  = round(mean(results[[col]][results$is_non_attender == 1], na.rm=TRUE), 3)
  )
}))

kable(
  feat_auc_tab,
  col.names = c("Feature", "Flag direction", "AUC",
                "Mean (attenders)", "Mean (non-attenders)"),
  caption   = "Individual-feature AUCs: Spearman ρ (per scale) and regression features"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which.max(feat_auc_tab$AUC), bold = TRUE,
           background = "#eef6ee")
Individual-feature AUCs: Spearman ρ (per scale) and regression features
Feature Flag direction AUC Mean (attenders) Mean (non-attenders)
ρ (Scale 1) Low → inattentive 0.909 0.436 -0.010
ρ (Scale 2) Low → inattentive 0.895 0.408 -0.008
β̂ (slope) |Low → inattentive | 0.96 | 0.19 | -0.00
MAR High → inattentive 0.047 0.273 0.194