Detecting Inattentive Respondents: Complexity–RT Correlation

Can item complexity × response time predict attention?

Author

Jamie C. Lee

Published

April 27, 2026

Overview

This document evaluates two complementary RT-based features for detecting inattentive survey respondents:

Spearman ρ — the rank correlation between item complexity and per-item response time. Attentive respondents take longer on harder items; non-attenders’ RTs are driven by clicking rhythm, not difficulty.
MAR (mean absolute residual) — the average unexplained log-RT variance after fitting a within-person regression of log(RT) on item complexity alone. High MAR means RT varies in ways that complexity cannot explain, which is a signal of non-systematic (inattentive) responding.

Both features treat RT solely as a function of item complexity — no reverse-coding term is included. The question addressed here is whether combining these two features in a mixture model improves classification of attentive vs. inattentive respondents compared to using Spearman ρ alone.

In this simulation, item complexity is a fixed property of the instrument (instrument[[s]]$complexity), and per-item RTs are available as RT_s{s}_i{j}. The complexity effect on RT is built into the data-generating process for attenders: at full attention ($A = 1$), each SD of complexity multiplies per-item RT by $\exp(\texttt{complexity\_rt\_weight} \times A)$. At $A = 0$ the effect vanishes. Non-attenders are entirely unaffected by complexity.

The simulation was generated with the following complexity parameters:

item_complexity_mean = 0, item_complexity_sd = 1
complexity_rt_weight = 0.3

Approach summary. Spearman ρ and MAR are both derived from the complexity–RT relationship but capture different aspects of it: ρ captures whether complexity and RT co-vary in rank order; MAR captures how much unexplained residual variance remains after accounting for complexity. Their combination in a 2-feature mixture model is evaluated against ρ alone (Kneedle threshold) and against the existing Laz.R-based and Zh-based approaches in the final comparison section.

Simulation Ground Truth

Show code

class_tab <- data.frame(
  Group = c("Attenders", "Non-attenders"),
  N     = c(sum(dat$is_non_attender == 0), sum(dat$is_non_attender == 1)),
  Pct   = round(100 * c(mean(dat$is_non_attender == 0),
                         mean(dat$is_non_attender == 1)), 1)
)

na_dat  <- dat[dat$is_non_attender == 1, ]
sub_tab <- as.data.frame(table(
  Style   = na_dat$non_attender_style,
  Pattern = na_dat$non_attender_pattern_type
))
sub_tab <- sub_tab[sub_tab$Freq > 0, ]

kable(class_tab, col.names = c("Group", "N", "%"),
      caption = "Sample composition") |>
  kable_styling(full_width = FALSE)

Sample composition
Group	N	%
Attenders	4244	84.9
Non-attenders	756	15.1

Show code

kable(sub_tab, caption = "Non-attender subtypes") |>
  kable_styling(full_width = FALSE)

Non-attender subtypes
	Style	Pattern	Freq
2	with_pattern	acquiescent	80
4	with_pattern	alternating	25
6	with_pattern	biased_random	43
8	with_pattern	diagonal	26
10	with_pattern	disacquiescent	28
12	with_pattern	midpoint_default	72
14	with_pattern	near_straightline	84
16	with_pattern	straightline	140
18	with_pattern	uniform_random	43

Benchmarking Established Attention-Checking Methods

Before evaluating our proposed complexity–RT features in detail, we benchmark six established measures of inattentive responding against one another and against the proposed Spearman ρ and weighted τ. All measures are computed across all 40 items (both scales pooled), so that comparisons are made on an equal footing with our own features.

The five established measures are:

LongString — the length of the longest uninterrupted run of identical responses in a person’s item sequence. Patterned non-attenders (straightliners, alternators) produce unusually long runs; attenders and random responders do not.
IRV (Intra-individual Response Variability) — the standard deviation of a person’s raw item responses across all items. Non-attenders who straightline or give near-constant responses show very low IRV; random or biased-random responders may show elevated IRV.
Mahalanobis’ Distance on RTs — the multivariate distance between a respondent’s per-item RT vector and the sample mean RT vector, scaled by the sample covariance of RTs. This identifies respondents whose RT profile across items is a multivariate outlier relative to the group — flagging both uniformly fast and unusually patterned RT vectors.
Laz.R — sequential transitional predictability of item responses (defined in the main analysis below). High Laz.R signals mechanically regular response sequences.
Zh — the standardised person-fit statistic from a Graded Response Model (Drasgow et al., 1987). Strongly negative values indicate responses inconsistent with any plausible latent trait.

Our two proposed features — Spearman ρ and weighted τ — are included as the final two columns in all comparison tables and highlighted in green for easy identification.

Note on item pooling. All measures below are computed over all 40 items simultaneously (both scales pooled into one item string per respondent), matching the full-instrument condition against which we evaluate sensitivity in subsequent sections.

Show code

# ── Compute all benchmark measures across all items ───────────────────────────
ni_total  <- ns * ni   # total items across both scales

# All item response columns
y_cols_all  <- unlist(lapply(seq_len(ns), function(s)
  paste0("Y_s", s, "_i", seq_len(ni))))
rt_cols_all <- unlist(lapply(seq_len(ns), function(s)
  paste0("RT_s", s, "_i", seq_len(ni))))
cx_all      <- unlist(lapply(seq_len(ns), function(s)
  instrument[[s]]$complexity))

resp_mat <- as.matrix(dat[, y_cols_all])
rt_mat   <- as.matrix(dat[, rt_cols_all])

# ── 1. LongString ─────────────────────────────────────────────────────────────
longstring <- function(x) {
  if (all(is.na(x))) return(NA_integer_)
  x   <- x[!is.na(x)]
  rle_x <- rle(x)
  max(rle_x$lengths)
}
results$longstring <- apply(resp_mat, 1, longstring)

# ── 2. IRV ────────────────────────────────────────────────────────────────────
results$irv <- apply(resp_mat, 1, sd, na.rm = TRUE)

# ── 3. Mahalanobis Distance on RTs ───────────────────────────────────────────
# Compute on log-RT to reduce extreme-RT leverage; use robust estimate for
# the covariance to guard against non-attender contamination.
log_rt_mat <- log(pmax(rt_mat, 1e-6))

# Mean vector and covariance across respondents
rt_mean <- colMeans(log_rt_mat, na.rm = TRUE)
rt_cov  <- tryCatch(
  cov(log_rt_mat, use = "pairwise.complete.obs"),
  error = function(e) diag(ncol(log_rt_mat))
)
# Regularise: add small ridge in case of near-singularity
rt_cov_reg <- rt_cov + diag(1e-4, nrow(rt_cov))

results$mahal_rt <- tryCatch({
  mahal_vals <- mahalanobis(log_rt_mat, center = rt_mean, cov = rt_cov_reg)
  mahal_vals
}, error = function(e) {
  # Fallback: Euclidean distance from mean (scaled)
  apply(sweep(log_rt_mat, 2, rt_mean)^2, 1, sum)
})

# ── 4. Laz.R (pooled across both scales) ─────────────────────────────────────
lazr_pooled <- apply(resp_mat, 1, Laz.R)
results$lazr_pooled <- lazr_pooled

# ── 5. Zh (iterative purification GRM, computed here for benchmarking) ────────
# NOTE: moved to chunk benchmark-zh below (cache: true) to avoid re-running
#       on every render. Placeholder so downstream summary cat() doesn't error.
results$zh <- NA_real_   # will be overwritten by benchmark-zh chunk

# ── Spearman rho and weighted tau pooled (all items) ─────────────────────────
cx_rank_all <- rank(cx_all)

results$spearman_pooled <- apply(rt_mat, 1, function(rt_i) {
  cor(cx_rank_all, rank(rt_i), method = "spearman")
})

# Weighted tau pooled
pairs_all <- combn(ni_total, 2)
cx_diff_all <- cx_all[pairs_all[1,]] - cx_all[pairs_all[2,]]
w_all       <- abs(cx_diff_all)

results$wtau_pooled <- apply(rt_mat, 1, function(rt_i) {
  rt_diff     <- rt_i[pairs_all[1,]] - rt_i[pairs_all[2,]]
  concordance <- sign(cx_diff_all) * sign(rt_diff)
  sum(w_all * concordance) / sum(w_all)
})

cat("All benchmark measures computed.\n")

All benchmark measures computed.

Show code

cat(sprintf("  LongString  : attender mean = %.2f  |  non-attender mean = %.2f\n",
  mean(results$longstring[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$longstring[results$is_non_attender == 1], na.rm=TRUE)))

  LongString  : attender mean = 3.95  |  non-attender mean = 9.99

Show code

cat(sprintf("  IRV         : attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$irv[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$irv[results$is_non_attender == 1], na.rm=TRUE)))

  IRV         : attender mean = 1.052  |  non-attender mean = 0.924

Show code

cat(sprintf("  Mahal RT    : attender mean = %.2f  |  non-attender mean = %.2f\n",
  mean(results$mahal_rt[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$mahal_rt[results$is_non_attender == 1], na.rm=TRUE)))

  Mahal RT    : attender mean = 42.28  |  non-attender mean = 26.95

Show code

cat(sprintf("  Laz.R (pool): attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$lazr_pooled[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$lazr_pooled[results$is_non_attender == 1], na.rm=TRUE)))

  Laz.R (pool): attender mean = 0.361  |  non-attender mean = 0.613

Show code

cat(sprintf("  Zh          : attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$zh[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$zh[results$is_non_attender == 1], na.rm=TRUE)))

  Zh          : attender mean = NaN  |  non-attender mean = NaN

Show code

cat(sprintf("  Spearman ρ  : attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$spearman_pooled[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$spearman_pooled[results$is_non_attender == 1], na.rm=TRUE)))

  Spearman ρ  : attender mean = 0.445  |  non-attender mean = -0.007

Show code

cat(sprintf("  Weighted τ  : attender mean = %.3f  |  non-attender mean = %.3f\n",
  mean(results$wtau_pooled[results$is_non_attender == 0], na.rm=TRUE),
  mean(results$wtau_pooled[results$is_non_attender == 1], na.rm=TRUE)))

  Weighted τ  : attender mean = 0.477  |  non-attender mean = -0.007

Show code

# ── Shared lookup objects used by all benchmark chunks ────────────────────────
# Direction of each measure: high_bad = TRUE means HIGH scores → flag inattentive
# Mahalanobis: high_bad = FALSE because LOW D² → inattentive in this simulation
# (non-attenders cluster near the RT centroid; attenders are the outliers)
bench_direction <- list(
  longstring      = list(col = "longstring",      high_bad = TRUE),
  irv             = list(col = "irv",             high_bad = FALSE),
  mahal_rt        = list(col = "mahal_rt",        high_bad = FALSE),
  lazr_pooled     = list(col = "lazr_pooled",     high_bad = TRUE),
  zh              = list(col = "zh",              high_bad = FALSE),
  spearman_pooled = list(col = "spearman_pooled", high_bad = FALSE),
  wtau_pooled     = list(col = "wtau_pooled",     high_bad = FALSE)
)

bench_labels <- c(
  longstring      = "LongString",
  irv             = "IRV",
  mahal_rt        = "Mahalanobis D² (RT)",
  lazr_pooled     = "Laz.R",
  zh              = "Zh",
  spearman_pooled = "Spearman ρ ★",
  wtau_pooled     = "Weighted τ ★"
)

Show code

# Zh via iterative-purification GRM. Cached so re-renders are instant.
compute_zh_purified <- function(dat, scale_idx) {
  ni_s      <- params$items_per_scale
  K_s       <- params$n_categories
  y_cols_s  <- paste0("Y_s", scale_idx, "_i", seq_len(ni_s))
  rev_items <- instrument[[scale_idx]]$reverse
  resp_s              <- as.matrix(dat[, y_cols_s])
  resp_scored         <- resp_s
  resp_scored[, rev_items] <- K_s + 1 - resp_s[, rev_items]
  resp_df   <- as.data.frame(resp_scored)

  fit1      <- mirt(resp_df, model = 1, itemtype = "graded", verbose = FALSE)
  pf1       <- personfit(fit1, method = "ML")
  keep_mask <- !is.finite(pf1$Zh) | pf1$Zh > -2
  cat(sprintf("Scale %d — Round 1: %d flagged, refitting on %d of %d\n",
              scale_idx, sum(!keep_mask), sum(keep_mask), nrow(dat)))

  fit2           <- mirt(resp_df[keep_mask, ], model = 1,
                         itemtype = "graded", verbose = FALSE)
  pars_fixed     <- mod2values(fit2)
  pars_fixed$est <- FALSE
  fit2_all       <- mirt(resp_df, model = 1, itemtype = "graded",
                         verbose = FALSE, pars = pars_fixed, TOL = NA)
  pf2 <- personfit(fit2_all, method = "ML")
  cat(sprintf("Scale %d — Round 2: Zh range [%.2f, %.2f], mean = %.3f\n",
              scale_idx, min(pf2$Zh, na.rm=TRUE),
              max(pf2$Zh, na.rm=TRUE), mean(pf2$Zh, na.rm=TRUE)))
  pf2$Zh
}

zh_list    <- lapply(seq_len(ns), compute_zh_purified, dat = dat)

Scale 1 — Round 1: 449 flagged, refitting on 4551 of 5000
Scale 1 — Round 2: Zh range [-12.99, 3.58], mean = -0.294
Scale 2 — Round 1: 438 flagged, refitting on 4562 of 5000
Scale 2 — Round 2: Zh range [-11.03, 3.58], mean = -0.251

Show code

results$zh <- rowMeans(do.call(cbind, zh_list), na.rm = TRUE)
cat("Zh computed and stored in results$zh\n")

Zh computed and stored in results$zh

Distributions by True Class

Each panel below shows the attender (blue) and non-attender (red) density distributions for a single measure. Better measures show greater separation between the two distributions.

Note the direction of each measure’s signal: LongString and Laz.R are high → inattentive; Mahal RT is low → inattentive in this simulation — non-attenders cluster near the RT centroid (their flat/random profiles are close to the sample mean), so unusually low D² flags inattention; IRV is low → inattentive for straightliners but high → inattentive for random responders (bimodal non-attender distribution expected); Zh is low → inattentive; Spearman ρ and weighted τ are low → inattentive.

Show code

bench_measures <- list(
  list(col = "longstring",     label = "LongString",
       xlab = "Max run length",           ref = NULL,    high_bad = TRUE),
  list(col = "irv",            label = "IRV",
       xlab = "SD of item responses",     ref = NULL,    high_bad = FALSE),
  list(col = "mahal_rt",       label = "Mahalanobis Distance (RT)",
       xlab = "Mahalanobis D²",           ref = NULL,    high_bad = FALSE),
  list(col = "lazr_pooled",    label = "Laz.R (pooled)",
       xlab = "Sequential predictability", ref = 1/K,   high_bad = TRUE),
  list(col = "zh",             label = "Zh (person-fit)",
       xlab = "Zh",                        ref = -2,    high_bad = FALSE)
)

proposed_measures <- list(
  list(col = "spearman_pooled", label = "Spearman ρ (proposed)",
       xlab = "Spearman ρ",  ref = 0, high_bad = FALSE),
  list(col = "wtau_pooled",    label = "Weighted τ (proposed)",
       xlab = "Weighted τ",  ref = 0, high_bad = FALSE)
)

make_bench_dist_plot <- function(m, proposed = FALSE) {
  p <- ggplot(results, aes(x = .data[[m$col]], fill = class_label)) +
    geom_density(alpha = 0.5, colour = NA) +
    scale_fill_manual(
      values = c("Attender (true)" = "steelblue",
                 "Non-attender (true)" = "tomato"), name = NULL) +
    labs(title = m$label, x = m$xlab, y = "Density") +
    theme(legend.position = "none")
  if (!is.null(m$ref)) {
    p <- p + geom_vline(xintercept = m$ref, linetype = "dashed",
                        colour = "grey30", linewidth = 0.7)
  }
  if (proposed) {
    p <- p + theme(
      panel.border     = element_rect(colour = "#2e7d32", fill = NA,
                                      linewidth = 1.8),
      plot.background  = element_rect(fill = "#f1f8f1", colour = NA)
    )
  }
  p
}

all_plots <- c(
  lapply(bench_measures,    make_bench_dist_plot, proposed = FALSE),
  lapply(proposed_measures, make_bench_dist_plot, proposed = TRUE)
)

wrap_plots(all_plots, ncol = 2) +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

Attender (blue) vs. non-attender (red) distributions for all seven measures, computed across all 40 items. Dashed reference lines mark conventional flagging thresholds where applicable. The bottom two panels (green borders) show the two proposed complexity–RT features.

AUC and Descriptive Statistics

Show code

# bench_direction and bench_labels are defined in benchmark-setup above

bench_auc_tab <- do.call(rbind, lapply(names(bench_direction), function(nm) {
  m      <- bench_direction[[nm]]
  # roc_auc() flags scores < t as inattentive, so pass scores where LOW = inattentive.
  # high_bad=FALSE (includes Mahalanobis): low raw score = inattentive → pass as-is
  # high_bad=TRUE: high raw score = inattentive → negate so low = inattentive
  scores <- if (m$high_bad) -results[[m$col]] else results[[m$col]]
  scores[!is.finite(scores)] <- NA
  auc_val <- roc_auc(scores, results$true_inattentive)
  data.frame(
    Measure  = bench_labels[[nm]],
    AUC      = round(auc_val, 3),
    Mean_att = round(mean(results[[m$col]][results$is_non_attender == 0], na.rm=TRUE), 3),
    SD_att   = round(sd(results[[m$col]][results$is_non_attender == 0],   na.rm=TRUE), 3),
    Mean_non = round(mean(results[[m$col]][results$is_non_attender == 1], na.rm=TRUE), 3),
    SD_non   = round(sd(results[[m$col]][results$is_non_attender == 1],   na.rm=TRUE), 3),
    is_proposed = nm %in% c("spearman_pooled", "wtau_pooled")
  )
}))

proposed_rows <- which(bench_auc_tab$is_proposed)

kable(
  bench_auc_tab[, !names(bench_auc_tab) %in% "is_proposed"],
  col.names = c("Measure", "AUC",
                "Mean (attenders)", "SD (attenders)",
                "Mean (non-attenders)", "SD (non-attenders)"),
  caption = "Discriminability of all seven measures (all 40 items). ★ = proposed complexity–RT feature."
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which.max(bench_auc_tab$AUC), bold = TRUE, background = "#eef6ee") |>
  row_spec(proposed_rows, background = "#e8f5e9", color = "#1b5e20")

Discriminability of all seven measures (all 40 items). ★ = proposed complexity–RT feature.
Measure	AUC	Mean (attenders)	SD (attenders)	Mean (non-attenders)	SD (non-attenders)
LongString	0.723	3.959	1.300	10.693	9.267
IRV	0.567	1.056	0.185	0.916	0.475
Mahalanobis D² (RT)	0.875	41.273	9.251	29.126	5.546
Laz.R	0.766	0.359	0.049	0.628	0.259
Zh	0.707	-0.079	1.142	-1.866	2.899
Spearman ρ ★	0.964	0.440	0.164	0.007	0.157
Weighted τ ★	0.968	0.472	0.167	0.006	0.166

Cutoff Selection and Classification Performance

We flag respondents using the following cutoff strategies for each measure:

Kneedle — the elbow point on the sorted score curve (data-driven, no ground-truth required). Used for all measures, including Mahalanobis D² as a secondary comparison threshold.
Outlier (2.5th pct) — for Mahalanobis D² only: flags respondents whose D² falls in the lowest 2.5% of the sample. This direction correction is necessary because in this simulation non-attenders have lower D² than attenders — their flat or random RT profiles cluster near the sample centroid while attenders, whose RTs vary meaningfully with item complexity, are the multivariate outliers. Flagging the low tail mirrors the classical outlier-exclusion logic but applied to the correct tail given the inverted direction.
Median split — the sample median (a simple, robust alternative)
Oracle — the attender 5th (or 95th) percentile, chosen with knowledge of ground truth, providing an upper bound on performance

Mahalanobis D² shows four rows in the table (Outlier, Kneedle, Median, Oracle) to allow direct comparison of the two data-driven thresholds. All other measures show three rows. Counts and percentages are reported for both false positives (FP: attenders incorrectly flagged) and false negatives (FN: non-attenders missed). Proposed-feature rows are highlighted in green.

Show code

# Apply cutoffs for each measure.
# signal = -scores when high_bad=FALSE (low raw score → inattentive → high signal).
# signal =  scores when high_bad=TRUE  (high raw score → inattentive → high signal).
# Mahalanobis: high_bad=FALSE because LOW D² → inattentive.
#
# For Mahalanobis two thresholds are computed:
#   "Outlier (2.5th pct)" — flags the lowest 2.5% of D² values (signal > 97.5th pct).
#   "Kneedle"             — data-driven elbow on the sorted signal curve.

apply_bench_cutoffs <- function(col, high_bad) {
  scores <- results[[col]]
  truth  <- results$true_inattentive

  signal <- if (high_bad) scores else -scores
  signal[!is.finite(signal)] <- NA

  # Primary threshold: outlier 2.5th pct for Mahalanobis, Kneedle for all others
  if (col == "mahal_rt") {
    t_kn    <- quantile(signal, 0.975, na.rm = TRUE)   # 97.5th pct of signal = 2.5th pct of D²
    flag_kn <- signal > t_kn
  } else {
    sorted_sig <- sort(signal, na.last = NA)
    k_idx      <- kneedle_fn(sorted_sig, sign = -1)
    t_kn       <- sorted_sig[k_idx]
    flag_kn    <- signal > t_kn
  }

  # Secondary Kneedle threshold for Mahalanobis (for comparison)
  if (col == "mahal_rt") {
    sorted_sig      <- sort(signal, na.last = NA)
    k_idx           <- kneedle_fn(sorted_sig, sign = -1)
    t_kn_kneedle    <- sorted_sig[k_idx]
    flag_kn_kneedle <- signal > t_kn_kneedle
  } else {
    t_kn_kneedle    <- NULL
    flag_kn_kneedle <- NULL
  }

  # Median on signal
  t_med      <- median(signal, na.rm = TRUE)
  flag_med   <- signal > t_med

  # Oracle: attender 95th percentile of signal
  t_oracle    <- quantile(signal[!truth], 0.95, na.rm = TRUE)
  flag_oracle <- signal > t_oracle

  list(flag_kn = flag_kn, flag_med = flag_med, flag_oracle = flag_oracle,
       flag_kn_kneedle = flag_kn_kneedle,
       t_kn = t_kn, t_med = t_med, t_oracle = t_oracle,
       t_kn_kneedle = t_kn_kneedle)
}

bench_cuts <- lapply(names(bench_direction), function(nm) {
  m <- bench_direction[[nm]]
  apply_bench_cutoffs(m$col, m$high_bad)
})
names(bench_cuts) <- names(bench_direction)

# Build performance table
make_bench_perf_row <- function(nm, cutoff_name, flag) {
  truth <- results$true_inattentive
  n_att <- sum(!truth)
  n_non <- sum(truth)
  cs    <- confusion_stats(flag, truth)
  fp_n  <- cs$FP;  fp_pct <- round(100 * cs$FP / n_att, 1)
  fn_n  <- cs$FN;  fn_pct <- round(100 * cs$FN / n_non, 1)
  data.frame(
    Measure  = bench_labels[[nm]],
    Cutoff   = cutoff_name,
    Sens     = round(cs$Sensitivity, 3),
    Spec     = round(cs$Specificity, 3),
    FP_n     = fp_n,  FP_pct = fp_pct,
    FN_n     = fn_n,  FN_pct = fn_pct,
    Flagged  = round(cs$Flagged_pct, 1),
    is_proposed = nm %in% c("spearman_pooled", "wtau_pooled")
  )
}

perf_rows <- do.call(rbind, lapply(names(bench_direction), function(nm) {
  rows <- rbind(
    make_bench_perf_row(nm, "Kneedle", bench_cuts[[nm]]$flag_kn),
    make_bench_perf_row(nm, "Median",  bench_cuts[[nm]]$flag_med),
    make_bench_perf_row(nm, "Oracle",  bench_cuts[[nm]]$flag_oracle)
  )
  # For Mahalanobis, rename primary row and add the Kneedle comparison row
  if (nm == "mahal_rt") {
    rows$Cutoff[rows$Cutoff == "Kneedle"] <- "Outlier (2.5th pct)"
    rows <- rbind(
      rows,
      make_bench_perf_row(nm, "Kneedle", bench_cuts[[nm]]$flag_kn_kneedle)
    )
  }
  rows
}))

proposed_perf_rows <- which(perf_rows$is_proposed)

kable(
  perf_rows[, !names(perf_rows) %in% "is_proposed"],
  col.names = c("Measure", "Cutoff",
                "Sensitivity", "Specificity",
                "FP (N)", "FP (%)",
                "FN (N)", "FN (%)",
                "% Flagged"),
  caption = paste0(
    "Classification performance for all measures. ",
    "Mahalanobis D\u00b2 shows two thresholds: \u2018Outlier (2.5th pct)\u2019 flags the ",
    "lowest 2.5% of D\u00b2 values (direction-corrected outlier exclusion); ",
    "\u2018Kneedle\u2019 is the data-driven elbow. Direction is inverted relative to ",
    "the standard assumption: LOW D\u00b2 \u2192 inattentive in this simulation. ",
    "FP = attenders incorrectly flagged; FN = non-attenders missed. \u2605 = proposed feature."
  )
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(proposed_perf_rows, background = "#e8f5e9", color = "#1b5e20") |>
  row_spec(which(perf_rows$Cutoff == "Oracle"), italic = TRUE)

Classification performance for all measures. Mahalanobis D² shows two thresholds: ‘Outlier (2.5th pct)’ flags the lowest 2.5% of D² values (direction-corrected outlier exclusion); ‘Kneedle’ is the data-driven elbow. Direction is inverted relative to the standard assumption: LOW D² → inattentive in this simulation. FP = attenders incorrectly flagged; FN = non-attenders missed. ★ = proposed feature.
Measure	Cutoff	Sensitivity	Specificity	FP (N)	FP (%)	FN (N)	FN (%)	% Flagged
LongString	Kneedle	0.525	0.963	167	3.7	257	47.5	9.0
LongString	Median	0.632	0.738	1168	26.2	199	36.8	30.2
LongString	Oracle	0.525	0.963	167	3.7	257	47.5	9.0
IRV	Kneedle	0.401	0.974	115	2.6	324	59.9	6.6
IRV	Median	0.558	0.510	2187	49.0	239	44.2	49.8
IRV	Oracle	0.405	0.952	214	4.8	322	59.5	8.7
Mahalanobis D² (RT)	Outlier (2.5th pct)	0.150	0.990	44	1.0	460	85.0	2.5
Mahalanobis D² (RT)	Median	0.956	0.555	1983	44.5	24	4.4	50.0
Mahalanobis D² (RT)	Oracle	0.390	0.950	223	5.0	330	61.0	8.7
Mahalanobis D² (RT)	Kneedle	0.189	0.982	79	1.8	439	81.1	3.6
Laz.R	Kneedle	0.630	0.953	209	4.7	200	37.0	11.0
Laz.R	Median	0.739	0.529	2100	47.1	141	26.1	50.0
Laz.R	Oracle	0.630	0.950	223	5.0	200	37.0	11.3
Zh	Kneedle	0.477	0.919	363	8.1	283	52.3	12.4
Zh	Median	0.726	0.527	2107	47.3	148	27.4	50.0
Zh	Oracle	0.458	0.950	223	5.0	293	54.2	9.4
Spearman ρ ★	Kneedle	0.906	0.894	473	10.6	51	9.4	19.3
Spearman ρ ★	Median	0.996	0.560	1961	44.0	2	0.4	50.0
Spearman ρ ★	Oracle	0.787	0.950	223	5.0	115	21.3	13.0
Weighted τ ★	Kneedle	0.928	0.887	502	11.3	39	7.2	20.1
Weighted τ ★	Median	0.998	0.560	1960	44.0	1	0.2	50.0
Weighted τ ★	Oracle	0.810	0.950	223	5.0	103	19.0	13.2

Aggregated Performance: Weighted Balanced Accuracy and Sensitivity at Fixed Specificity

The table above gives the full picture but requires readers to weigh sensitivity against specificity themselves. Here we provide two complementary summaries that make the trade-off explicit.

Weighted Balanced Accuracy (WBA) averages sensitivity and specificity with an explicit weight w on specificity:

\[\text{WBA}(w) = w \cdot \text{Specificity} + (1 - w) \cdot \text{Sensitivity}\]

At w = 0.5 this is ordinary balanced accuracy (symmetric). Higher w rewards measures that keep FPs low — appropriate here because false positives introduce systematic bias in sample composition (see §False Positive Covariate Analysis), while false negatives reduce power but do not distort who remains in the sample.

Sensitivity at 95% Specificity (S@95) reads directly off the ROC curve: it is the highest sensitivity a measure can achieve while keeping the false positive rate at or below 5%. This answers the applied question — given I am only willing to incorrectly exclude 1 in 20 real attenders, how many non-attenders can I catch? — without requiring any weight choice.

Both are computed using the Kneedle threshold (or χ² for Mahalanobis). WBA is also shown at w ∈ {0.5, 0.6, 0.7, 0.8, 0.9} so readers can see whether measure rankings are stable across reasonable choices of specificity preference.

Show code

truth <- results$true_inattentive

# ── Sensitivity at 95% Specificity from ROC curve ────────────────────────────
sens_at_95spec <- function(nm) {
  m      <- bench_direction[[nm]]
  # Flag inattentive when signal > t, so signal must be HIGH for inattentive.
  # high_bad=TRUE: high raw = inattentive → signal = raw
  # high_bad=FALSE (includes Mahalanobis): low raw = inattentive → signal = -raw
  signal <- if (m$high_bad) results[[m$col]] else -results[[m$col]]
  signal[!is.finite(signal)] <- NA
  thresholds <- sort(unique(signal[!is.na(signal)]))
  roc <- do.call(rbind, lapply(thresholds, function(t) {
    flag <- signal > t
    TP <- sum( flag &  truth, na.rm=TRUE); FP <- sum( flag & !truth, na.rm=TRUE)
    TN <- sum(!flag & !truth, na.rm=TRUE); FN <- sum(!flag &  truth, na.rm=TRUE)
    data.frame(spec = TN/(TN+FP), sens = TP/(TP+FN))
  }))
  # Best sensitivity among rows where specificity >= 0.95
  eligible <- roc[!is.na(roc$spec) & roc$spec >= 0.95, ]
  if (nrow(eligible) == 0) return(NA_real_)
  round(max(eligible$sens, na.rm = TRUE), 3)
}

# ── WBA at a given weight ─────────────────────────────────────────────────────
wba <- function(sens, spec, w) round(w * spec + (1 - w) * sens, 3)

# ── Build summary table ───────────────────────────────────────────────────────
weights   <- seq(0.5, 0.9, by = 0.1)
wba_tab <- do.call(rbind, lapply(names(bench_direction), function(nm) {
  # Use primary threshold flag (Outlier 2.5th pct for Mahalanobis, Kneedle for others)
  flag <- bench_cuts[[nm]]$flag_kn
  cs   <- confusion_stats(flag, truth)
  sens <- cs$Sensitivity
  spec <- cs$Specificity
  s95  <- sens_at_95spec(nm)

  wba_vals <- setNames(sapply(weights, function(w) wba(sens, spec, w)),
                       paste0("WBA(w=", weights, ")"))

  data.frame(
    Measure   = bench_labels[[nm]],
    Sens      = round(sens, 3),
    Spec      = round(spec, 3),
    `S@95%`   = s95,
    as.data.frame(t(wba_vals)),
    is_proposed = nm %in% c("spearman_pooled", "wtau_pooled"),
    check.names = FALSE
  )
}))

prop_rows_wba <- which(wba_tab$is_proposed)
wba_display   <- wba_tab[, !names(wba_tab) %in% "is_proposed"]

# ── Main WBA table ────────────────────────────────────────────────────────────
tbl_wba <- kable(
  wba_display,
  col.names = c("Measure", "Sensitivity", "Specificity", "S@95% Spec",
                paste0("WBA w=", weights)),
  caption = paste0(
    "Sensitivity, specificity, Sensitivity at 95% Specificity (S@95%), and ",
    "Weighted Balanced Accuracy at five specificity weights. ",
    "Higher w = greater penalty for false positives. ",
    "\u2605 = proposed complexity\u2013RT feature."
  ),
  digits = 3
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(prop_rows_wba, background = "#e8f5e9", color = "#1b5e20") |>
  # Bold the S@95 and WBA(w=0.7) columns as primary recommended summaries
  column_spec(4, bold = TRUE) |>
  column_spec(7, bold = TRUE)   # w=0.7 column (4 fixed cols + 3rd weight)

cat(as.character(tbl_wba), "\n\n")

Sensitivity, specificity, Sensitivity at 95% Specificity (S@95%), and Weighted Balanced Accuracy at five specificity weights. Higher w = greater penalty for false positives. ★ = proposed complexity–RT feature.
Measure	Sensitivity	Specificity	S@95% Spec	WBA w=0.5	WBA w=0.6	WBA w=0.7	WBA w=0.8	WBA w=0.9
LongString	0.525	0.963	0.525	0.744	0.788	0.831	0.875	0.919
IRV	0.401	0.974	0.405	0.688	0.745	0.802	0.860	0.917
Mahalanobis D² (RT)	0.150	0.990	0.390	0.570	0.654	0.738	0.822	0.906
Laz.R	0.630	0.953	0.630	0.792	0.824	0.856	0.889	0.921
Zh	0.477	0.919	0.458	0.698	0.742	0.786	0.830	0.874
Spearman ρ ★	0.906	0.894	0.787	0.900	0.899	0.897	0.896	0.895
Weighted τ ★	0.928	0.887	0.810	0.908	0.904	0.900	0.896	0.891

Show code

wba_long <- tidyr::pivot_longer(
  wba_tab[, c("Measure", paste0("WBA(w=", weights, ")"))],
  cols      = -Measure,
  names_to  = "weight",
  values_to = "WBA"
) |>
  mutate(
    weight  = factor(weight, levels = paste0("WBA(w=", weights, ")")),
    w_label = factor(
      gsub("WBA\\(w=(.*)\\)", "w = \\1", weight),
      levels = gsub("WBA\\(w=(.*)\\)", "w = \\1", paste0("WBA(w=", weights, ")"))
    ),
    Measure = factor(Measure, levels = rev(bench_labels[names(bench_direction)]))
  )

# Rank each measure within each weight for annotation
wba_long <- wba_long |>
  group_by(weight) |>
  mutate(rank_label = paste0("#", rank(-WBA, ties.method = "min"))) |>
  ungroup()

proposed_measures_vec <- bench_labels[c("spearman_pooled", "wtau_pooled")]

ggplot(wba_long, aes(x = w_label, y = Measure, fill = WBA)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.3f\n(%s)", WBA, rank_label)),
            size = 2.8, lineheight = 0.9) +
  scale_fill_gradient2(
    low      = "tomato",
    mid      = "lightyellow",
    high     = "#2e7d32",
    midpoint = 0.65,
    name     = "WBA"
  ) +
  labs(
    title    = "Weighted Balanced Accuracy across specificity preference weights",
    subtitle = "Cell = WBA value; rank in parentheses. Bold green rows = proposed features.",
    x        = "Specificity weight (w)",
    y        = NULL
  ) +
  theme(
    axis.text.y = element_text(
      colour = ifelse(
        rev(bench_labels[names(bench_direction)]) %in% proposed_measures_vec,
        "#1b5e20", "grey20"
      ),
      face = ifelse(
        rev(bench_labels[names(bench_direction)]) %in% proposed_measures_vec,
        "bold", "plain"
      )
    ),
    legend.position = "right"
  )

Weighted Balanced Accuracy for each measure (rows) across specificity weights w = 0.5–0.9 (columns). Green = higher WBA, red = lower. Stable rankings across columns indicate the conclusion is robust to the choice of weight. Proposed features shown in bold green row labels.

Cutoff Visualisation

Show code

make_bench_cutoff_plot <- function(nm, proposed = FALSE) {
  m      <- bench_direction[[nm]]
  cuts   <- bench_cuts[[nm]]
  scores <- results[[m$col]]
  high_b <- m$high_bad

  # Thresholds are stored on signal scale (signal = scores if high_bad, -scores if !high_bad).
  # Convert back to raw score scale: raw = signal if high_bad, raw = -signal if !high_bad.
  sig_to_raw <- function(t) if (high_b) t else -t

  # Build threshold data frame with distinct line types
  thresh_df <- data.frame(
    xintercept = c(sig_to_raw(cuts$t_kn),
                   sig_to_raw(cuts$t_med),
                   sig_to_raw(cuts$t_oracle)),
    ltype = c("solid", "dashed", "dotted"),
    label = c(if (nm == "mahal_rt") "Outlier (2.5th pct)" else "Kneedle",
              "Median", "Oracle"),
    stringsAsFactors = FALSE
  )

  # For Mahalanobis, add Kneedle as a fourth line
  if (nm == "mahal_rt" && !is.null(cuts$t_kn_kneedle)) {
    thresh_df <- rbind(thresh_df, data.frame(
      xintercept = sig_to_raw(cuts$t_kn_kneedle),
      ltype      = "longdash",
      label      = "Kneedle",
      stringsAsFactors = FALSE
    ))
  }

  # Flag region: high_bad=FALSE (low raw = inattentive) → shade left tail
  #              high_bad=TRUE  (high raw = inattentive) → shade right tail
  x_range    <- range(scores, na.rm = TRUE)
  primary_x  <- thresh_df$xintercept[1]
  shade_xmin <- if (!high_b) x_range[1] else primary_x
  shade_xmax <- if (!high_b) primary_x   else x_range[2]

  p <- ggplot(results, aes(x = .data[[m$col]], fill = class_label)) +
    annotate("rect",
             xmin = shade_xmin, xmax = shade_xmax,
             ymin = -Inf, ymax = Inf,
             alpha = 0.08, fill = "tomato") +
    geom_density(alpha = 0.45, colour = NA) +
    geom_vline(data = thresh_df,
               aes(xintercept = xintercept, linetype = ltype),
               colour = "black", linewidth = 0.65, inherit.aes = FALSE) +
    scale_linetype_identity(
      guide  = "legend",
      breaks = c("solid", "longdash", "dashed", "dotted"),
      labels = c("Outlier / Kneedle", "Kneedle (Mahal)", "Median", "Oracle")
    ) +
    # Labels placed above the plot area using annotation on a secondary y scale
    annotate("label",
             x      = thresh_df$xintercept,
             y      = Inf,
             label  = thresh_df$label,
             hjust  = 0.5,
             vjust  = 1.15,
             size   = 2.5,
             fill   = "white",
             colour = "grey20",
             label.size = 0.2) +
    scale_fill_manual(
      values = c("Attender (true)" = "steelblue",
                 "Non-attender (true)" = "tomato"), name = NULL) +
    labs(title = bench_labels[[nm]], x = m$col, y = "Density") +
    theme(legend.position = "none")

  if (proposed) {
    p <- p + theme(
      panel.border    = element_rect(colour = "#2e7d32", fill = NA,
                                     linewidth = 1.8),
      plot.background = element_rect(fill = "#f1f8f1", colour = NA)
    )
  }
  p
}

proposed_nms    <- c("spearman_pooled", "wtau_pooled")
benchmark_nms   <- setdiff(names(bench_direction), proposed_nms)

all_cutoff_plots <- c(
  lapply(benchmark_nms, make_bench_cutoff_plot, proposed = FALSE),
  lapply(proposed_nms,  make_bench_cutoff_plot, proposed = TRUE)
)

wrap_plots(all_cutoff_plots, ncol = 2) +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

Density distributions for each measure with cutoff lines marking the Kneedle, Median, and Oracle thresholds. Red shading marks the flagging region. Mahalanobis D² shows an additional Kneedle line for comparison. Proposed features (ρ, τ) are shown with a green panel border.

ROC Curves: All Measures

Show code

truth <- results$true_inattentive

roc_list <- lapply(names(bench_direction), function(nm) {
  m      <- bench_direction[[nm]]
  signal <- if (m$high_bad) results[[m$col]] else -results[[m$col]]
  signal[!is.finite(signal)] <- NA
  r      <- make_roc_df(signal, truth, bench_labels[[nm]])
  auc_v  <- round(roc_auc(signal, truth), 3)
  r$variant <- sprintf("%s (AUC=%.3f)", bench_labels[[nm]], auc_v)
  r$proposed <- nm %in% c("spearman_pooled", "wtau_pooled")
  r
})

roc_all <- do.call(rbind, roc_list)

proposed_variants <- unique(roc_all$variant[roc_all$proposed])
other_variants    <- unique(roc_all$variant[!roc_all$proposed])

n_other   <- length(other_variants)
grey_pal  <- colorRampPalette(c("grey60", "steelblue3"))(n_other)
colour_map <- c(
  setNames(grey_pal, other_variants),
  setNames(c("#2e7d32", "#81c784"), proposed_variants)
)
size_map <- c(
  setNames(rep(0.7, n_other),   other_variants),
  setNames(c(1.4, 1.4),          proposed_variants)
)

ggplot(roc_all, aes(x = fpr, y = tpr,
                    colour = variant, linewidth = variant)) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "grey80") +
  geom_path() +
  scale_colour_manual(values = colour_map, name = NULL) +
  scale_linewidth_manual(values = size_map, guide = "none") +
  labs(title    = "ROC curves — all attention-checking measures",
       subtitle = "Green = proposed complexity–RT features",
       x = "False Positive Rate (1 − Specificity)",
       y = "Sensitivity") +
  theme(legend.text = element_text(size = 8),
        legend.key.width = unit(1.2, "cm"))

ROC curves for all seven measures. Proposed complexity–RT features (ρ, τ) are shown as bold green lines; established measures in grey/blue. AUC values are reported in the legend.

False Positive Covariate Analysis

Are false positives from each measure systematically different from the true negatives (correctly retained attenders)? This section examines the selectivity profile of each measure’s errors: if a measure flags people not because they are inattentive but because they share some substantive characteristic (trait extremity, response style, low attention but genuine engagement), the resulting exclusions introduce selection bias rather than merely noise.

We compare FP vs TN attenders on five covariates:

θ̂ (IRT trait estimate) — are FPs clustering at the centre of the latent trait? A measure whose FPs are predominantly mid-range respondents is biased against a specific substantive group.
Observed mean score — same question via the raw scale mean, which is available without IRT assumptions.
Latent attention A — are FPs simply low-attention genuine attenders? If so, the exclusion may be defensible (they were marginally engaged), but it should be acknowledged.
IRV (response variability) — do FPs show restricted range use? Low IRV can arise from mid-point responding or from genuine homogeneity of trait expression; high IRV from acquiescent or extreme responding.
Mean RT — are FPs simply fast responders overall, or is their exclusion driven by RT–complexity decoupling specifically?

For each measure, FP and TN attenders are compared via density overlays and a summary table of group means and standardised mean differences (Cohen’s d).

Show code

# ── IRT theta: read directly from simulation output ───────────────────────────
# theta_1 and theta_2 are the true latent trait values from the DGP,
# one per scale. We use both individually and their mean as an overall index.
results$theta_1   <- dat$theta_1
results$theta_2   <- dat$theta_2
results$theta_mean <- rowMeans(cbind(dat$theta_1, dat$theta_2), na.rm = TRUE)

# ── Observed mean score (all items, reverse-scored) ───────────────────────────
# Build index of which columns in resp_mat need reversing
resp_mat_scored <- resp_mat
for (s in seq_len(ns)) {
  rev_item_nums <- instrument[[s]]$reverse   # integer item indices within scale s
  if (length(rev_item_nums) > 0) {
    col_names <- paste0("Y_s", s, "_i", rev_item_nums)
    # Only reverse columns that actually exist in resp_mat
    valid <- col_names[col_names %in% colnames(resp_mat_scored)]
    if (length(valid) > 0) {
      resp_mat_scored[, valid] <- K + 1 - resp_mat[, valid]
    }
  }
}
results$mean_score <- rowMeans(resp_mat_scored, na.rm = TRUE)

# ── Mean RT (overall speed) ───────────────────────────────────────────────────
results$mean_rt <- rowMeans(rt_mat, na.rm = TRUE)

# IRV and A are already in results

cat("Covariate summary (attenders only):\n")

Covariate summary (attenders only):

Show code

att <- results[results$is_non_attender == 0, ]
for (col in c("theta_1", "theta_2", "theta_mean", "mean_score", "A", "irv", "mean_rt")) {
  cat(sprintf("  %-12s  mean = %.3f  sd = %.3f\n",
              col, mean(att[[col]], na.rm = TRUE), sd(att[[col]], na.rm = TRUE)))
}

  theta_1       mean = 0.010  sd = 1.001
  theta_2       mean = 0.001  sd = 1.020
  theta_mean    mean = 0.006  sd = 0.708
  mean_score    mean = 3.050  sd = 0.328
  A             mean = 0.670  sd = 0.201
  irv           mean = 1.056  sd = 0.185
  mean_rt       mean = 4.840  sd = 1.085

Show code

# Build a unified FP/TN flag column for each measure (among attenders only)
# using the Kneedle / chi-sq thresholds from bench_cuts

fp_flag_df <- results[results$is_non_attender == 0, ] |>
  select(ID, A, theta_1, theta_2, theta_mean, mean_score, irv, mean_rt)

for (nm in names(bench_direction)) {
  cuts <- bench_cuts[[nm]]
  fp_flag_df[[paste0("fp_", nm)]] <- as.logical(cuts$flag_kn[results$is_non_attender == 0])
}

cat(sprintf("FP flag data frame: %d attenders × %d columns\n",
            nrow(fp_flag_df), ncol(fp_flag_df)))

FP flag data frame: 4459 attenders × 15 columns

Density Overlays: FP vs TN by Covariate

Each plot below focuses on one covariate and shows all seven measures as facets. Within each facet, the blue distribution is the true negatives (attenders correctly retained) and the red distribution is the false positives (attenders incorrectly flagged).

How to read these: there are three patterns to watch for:

Near-complete overlap (red ≈ blue in shape and location) — the measure’s false positives are not selectively different from true negatives on that covariate. The errors are essentially random with respect to it, which is the best-case scenario.
Shifted distribution (red centre is left or right of blue) — the measure is systematically excluding attenders at a particular level of the covariate. For example, red shifted toward θ = 0 means the measure preferentially flags mid-trait respondents; red shifted toward low A means it mainly catches low-attention-but-genuine attenders.
Narrow, tall red distribution (red is more peaked and concentrated than blue, even if centred similarly) — the false positives come from a very specific, homogeneous slice of the attender population. This is a stronger form of bias than a simple shift: the measure is not just nudging toward one end of the covariate, it is selecting a particular type of respondent with unusual precision. For instance, a narrow red spike at moderate θ values would suggest the measure almost exclusively flags respondents near the scale midpoint, leaving very high and very low scorers alone even when they are inattentive.

In general, a measure whose FPs look like a random sample of the TN distribution (same shape, same spread, same centre) is making errors that do not differentially harm any subgroup. A measure whose FPs form a tight, distinctive cluster is effectively acting as a filter for a specific kind of respondent, regardless of whether those respondents are actually inattentive.

The proposed measures (ρ, τ) are shown with a green facet border.

Show code

covariates <- list(
  list(col = "theta_1",    label = "\u03b8\u2081 (Scale 1 trait)"),
  list(col = "theta_2",    label = "\u03b8\u2082 (Scale 2 trait)"),
  list(col = "theta_mean", label = "\u03b8\u0304 (mean trait)"),
  list(col = "mean_score", label = "Mean observed score"),
  list(col = "A",          label = "Latent attention A"),
  list(col = "irv",        label = "IRV (response variability)"),
  list(col = "mean_rt",    label = "Mean RT (ms)")
)

measure_order <- c("longstring", "irv", "mahal_rt", "lazr_pooled",
                   "zh", "spearman_pooled", "wtau_pooled")

# Build a long data frame: one row per attender × measure
fp_long <- do.call(rbind, lapply(measure_order, function(nm) {
  flag_col <- paste0("fp_", nm)
  data.frame(
    fp_flag_df,
    measure   = bench_labels[[nm]],
    group     = ifelse(fp_flag_df[[flag_col]], "FP (flagged)", "TN (retained)"),
    proposed  = nm %in% c("spearman_pooled", "wtau_pooled"),
    stringsAsFactors = FALSE
  )
}))

fp_long$measure <- factor(fp_long$measure, levels = bench_labels[measure_order])
fp_long$group   <- factor(fp_long$group, levels = c("TN (retained)", "FP (flagged)"))

# One plot per covariate — each rendered separately so text isn't squashed
for (cv in covariates) {
  p <- ggplot(fp_long, aes(x = .data[[cv$col]], fill = group)) +
    geom_density(alpha = 0.5, colour = NA) +
    scale_fill_manual(
      values = c("TN (retained)" = "steelblue", "FP (flagged)" = "tomato"),
      name   = NULL) +
    facet_wrap(~ measure, nrow = 2, scales = "free_y") +
    labs(
      title    = sprintf("FP vs TN: %s", cv$label),
      subtitle = "Red = false positives (flagged attenders); blue = true negatives (retained attenders)",
      x        = cv$label,
      y        = "Density"
    ) +
    theme(
      legend.position  = "bottom",
      strip.text       = element_text(size = 9, face = "bold"),
      strip.background = element_rect(
        fill = ifelse(
          levels(fp_long$measure) %in%
            bench_labels[c("spearman_pooled", "wtau_pooled")],
          "#e8f5e9", "grey92"
        ),
        colour = ifelse(
          levels(fp_long$measure) %in%
            bench_labels[c("spearman_pooled", "wtau_pooled")],
          "#2e7d32", "grey70"
        ),
        linewidth = 0.8
      )
    )
  print(p)
}

Summary Table: Group Means and Cohen’s d

Show code

cohens_d <- function(x_fp, x_tn) {
  n1 <- sum(!is.na(x_fp)); n2 <- sum(!is.na(x_tn))
  if (n1 < 2 || n2 < 2) return(NA_real_)
  sd_pooled <- sqrt(((n1-1)*var(x_fp, na.rm=TRUE) +
                     (n2-1)*var(x_tn, na.rm=TRUE)) / (n1+n2-2))
  if (sd_pooled == 0) return(NA_real_)
  (mean(x_fp, na.rm=TRUE) - mean(x_tn, na.rm=TRUE)) / sd_pooled
}

covariate_cols   <- c("theta_1", "theta_2", "theta_mean", "mean_score", "A", "irv", "mean_rt")
covariate_labels <- c("\u03b8\u2081 (S1)", "\u03b8\u2082 (S2)", "\u03b8\u0304 (mean)",
                      "Mean score", "Attention A", "IRV", "Mean RT")

fp_tab <- do.call(rbind, lapply(measure_order, function(nm) {
  flag   <- fp_flag_df[[paste0("fp_", nm)]]
  fp_grp <- fp_flag_df[flag  == TRUE,  ]
  tn_grp <- fp_flag_df[flag  == FALSE, ]
  n_fp   <- sum(flag, na.rm = TRUE)
  n_tn   <- sum(!flag, na.rm = TRUE)

  row_vals <- unlist(lapply(covariate_cols, function(cv) {
    d    <- round(cohens_d(fp_grp[[cv]], tn_grp[[cv]]), 2)
    m_fp <- round(mean(fp_grp[[cv]], na.rm = TRUE), 2)
    m_tn <- round(mean(tn_grp[[cv]], na.rm = TRUE), 2)
    c(m_fp, m_tn, d)
  }))

  df <- as.data.frame(t(row_vals))
  names(df) <- as.vector(outer(
    covariate_labels,
    c("FP mean", "TN mean", "d"),
    paste, sep = " \u2014 "
  ))
  cbind(Measure = bench_labels[[nm]], `N FP` = n_fp, `N TN` = n_tn, df)
}))

prop_rows <- which(fp_tab$Measure %in% c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))

# One sub-table per covariate — use cat() so knitr renders HTML correctly
for (i in seq_along(covariate_cols)) {
  cv_lbl <- covariate_labels[i]
  cols_i <- c("Measure", "N FP", "N TN",
              paste(cv_lbl, c("FP mean", "TN mean", "d"), sep = " \u2014 "))
  sub_tab <- fp_tab[, cols_i]

  tbl_html <- kable(
    sub_tab,
    format     = "html",
    col.names  = c("Measure", "N FP", "N TN", "Mean (FP)", "Mean (TN)", "Cohen's d"),
    caption    = sprintf("FP vs TN comparison on %s. Positive d = FPs score higher.", cv_lbl),
    digits     = 3
  ) |>
    kable_styling(full_width = FALSE) |>
    row_spec(prop_rows, background = "#e8f5e9", color = "#1b5e20") |>
    column_spec(6, bold = TRUE)

  cat(as.character(tbl_html), "\n\n")
}

FP vs TN comparison on θ₁ (S1). Positive d = FPs score higher.
Measure	N FP	N TN	Mean (FP)	Mean (TN)	Cohen’s d
LongString	167	4292	-0.06	0.01	0.61
IRV	115	4344	-0.02	0.01	1.05
Mahalanobis D² (RT)	44	4415	0.23	0.00	-0.09
Laz.R	209	4250	-0.07	0.01	0.81
Zh	363	4096	-0.05	0.01	-2.21
Spearman ρ ★	473	3986	-0.09	0.01	-1.78
Weighted τ ★	502	3957	-0.08	0.01	-1.78

FP vs TN comparison on θ₂ (S2). Positive d = FPs score higher.
Measure	N FP	N TN	Mean (FP)	Mean (TN)	Cohen’s d
LongString	167	4292	0.01	-0.07	0.94
IRV	115	4344	0.01	-0.07	0.67
Mahalanobis D² (RT)	44	4415	0.01	0.20	1.08
Laz.R	209	4250	0.01	-0.10	0.80
Zh	363	4096	0.02	-0.06	1.29
Spearman ρ ★	473	3986	0.02	-0.08	1.19
Weighted τ ★	502	3957	0.02	-0.06	1.19

FP vs TN comparison on θ̄ (mean). Positive d = FPs score higher.
Measure	N FP	N TN	Mean (FP)	Mean (TN)	Cohen’s d
LongString	167	4292	-0.08	3.04	1.06
IRV	115	4344	-0.03	3.01	1.07
Mahalanobis D² (RT)	44	4415	0.22	3.11	1.06
Laz.R	209	4250	-0.08	3.04	1.07
Zh	363	4096	-0.06	3.04	1.03
Spearman ρ ★	473	3986	-0.12	3.04	1.04
Weighted τ ★	502	3957	-0.11	3.04	1.04

FP vs TN comparison on Mean score. Positive d = FPs score higher.
Measure	N FP	N TN	Mean (FP)	Mean (TN)	Cohen’s d
LongString	167	4292	-0.03	3.05	-0.63
IRV	115	4344	-0.07	3.05	-2.24
Mahalanobis D² (RT)	44	4415	0.06	3.05	0.13
Laz.R	209	4250	-0.06	3.05	-1.52
Zh	363	4096	-0.03	3.05	1.52
Spearman ρ ★	473	3986	0.01	3.05	0.81
Weighted τ ★	502	3957	0.03	3.05	0.81

FP vs TN comparison on Attention A. Positive d = FPs score higher.
Measure	N FP	N TN	Mean (TN)	Cohen’s d
LongString	167	4292	-0.02	4.76
IRV	115	4344	-0.12	4.92
Mahalanobis D² (RT)	44	4415	0.18	4.56
Laz.R	209	4250	-0.03	4.88
Zh	363	4096	-0.04	4.84
Spearman ρ ★	473	3986	-0.02	4.84
Weighted τ ★	502	3957	-0.02	4.76

FP vs TN comparison on IRV. Positive d = FPs score higher.
Measure	N FP	N TN	Mean (FP)	Mean (TN)	Cohen’s d
LongString	167	4292	-0.03	0.79	4.84
IRV	115	4344	-0.08	0.87	4.84
Mahalanobis D² (RT)	44	4415	0.06	0.65	4.84
Laz.R	209	4250	-0.06	0.82	4.84
Zh	363	4096	-0.03	0.32	4.84
Spearman ρ ★	473	3986	0.01	0.39	4.84
Weighted τ ★	502	3957	0.03	0.39	4.85

FP vs TN comparison on Mean RT. Positive d = FPs score higher.
Measure	N FP	N TN	Mean (FP)	Mean (TN)	Cohen’s d
LongString	167	4292	-0.04	0.67	-0.08
IRV	115	4344	-0.05	0.66	0.07
Mahalanobis D² (RT)	44	4415	0.15	0.67	-0.26
Laz.R	209	4250	-0.06	0.66	0.04
Zh	363	4096	-0.04	0.70	0.00
Spearman ρ ★	473	3986	-0.04	0.70	0.00
Weighted τ ★	502	3957	-0.03	0.71	-0.08

Cohen’s d Heatmap: All Measures × Covariates

Show code

d_long <- do.call(rbind, lapply(measure_order, function(nm) {
  flag   <- fp_flag_df[[paste0("fp_", nm)]]
  fp_grp <- fp_flag_df[flag == TRUE,  ]
  tn_grp <- fp_flag_df[flag == FALSE, ]
  do.call(rbind, lapply(seq_along(covariate_cols), function(i) {
    data.frame(
      measure   = bench_labels[[nm]],
      covariate = covariate_labels[i],
      d         = cohens_d(fp_grp[[covariate_cols[i]]],
                           tn_grp[[covariate_cols[i]]])
    )
  }))
}))

d_long$measure <- factor(d_long$measure,
  levels = rev(bench_labels[measure_order]))

ggplot(d_long, aes(x = covariate, y = measure, fill = d)) +
  geom_tile(colour = "white", linewidth = 0.6) +
  geom_text(aes(label = sprintf("%.2f", d)), size = 3) +
  scale_fill_gradient2(
    low      = "#6a0dad",   # purple = FPs score lower
    mid      = "white",
    high     = "#2e7d32",   # green  = FPs score higher
    midpoint = 0,
    na.value = "grey85",
    name     = "Cohen's d\n(FP − TN)"
  ) +
  labs(title    = "False positive selectivity: Cohen's d (FP vs TN attenders)",
       subtitle = "Rows = measures; columns = covariates. Proposed features in green.",
       x = NULL, y = NULL) +
  theme(
    axis.text.x  = element_text(angle = 20, hjust = 1),
    axis.text.y  = element_text(
      colour = ifelse(
        rev(bench_labels[measure_order]) %in%
          c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"),
        "#1b5e20", "grey20"
      ),
      face = ifelse(
        rev(bench_labels[measure_order]) %in%
          c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"),
        "bold", "plain"
      )
    )
  )

Cohen’s d for each measure (rows) × covariate (columns). Positive d (green) means FPs score higher than TNs; negative d (purple) means FPs score lower. Values near zero mean the measure’s FPs are not selectively different from TNs on that covariate. Proposed features highlighted with a star in the row label.

Sensitivity Analysis I: Sample Size (N)

How stable are the seven measures as sample size decreases from the full dataset of 5000 respondents down to N = 100? We draw random subsamples (without replacement) at each of the following sizes: N = 5,000 · 2,000 · 1,000 · 500 · 250 · 100, recomputing all measures and their Kneedle-threshold performance at each size.

Because subsampling is stochastic, we repeat each condition 20 times and report the mean ± SD of sensitivity and FP rate across replications.

Design. The simulation should be run with at least N = 5,000 respondents so that the largest subsample equals the full dataset. Subsampling preserves the original attender/non-attender ratio within each draw.

Show code

n_ss_reps <- 20   # replications per N level — enough for stable means
ss_sizes  <- c(100, 250, 500, 1000, 2000, 5000)
# Cap at actual sample size
ss_sizes  <- ss_sizes[ss_sizes <= nrow(dat)]

set.seed(2025)

# Pre-generate subsample indices: list[N_level][rep] -> row indices
ss_idx <- lapply(ss_sizes, function(n_s) {
  lapply(seq_len(n_ss_reps), function(r) {
    att_idx <- which(dat$is_non_attender == 0)
    non_idx <- which(dat$is_non_attender == 1)
    prop_non <- mean(dat$is_non_attender == 1)
    n_non <- round(n_s * prop_non)
    n_att <- n_s - n_non
    c(
      sample(att_idx, min(n_att, length(att_idx)), replace = FALSE),
      sample(non_idx, min(n_non, length(non_idx)), replace = FALSE)
    )
  })
})
names(ss_idx) <- as.character(ss_sizes)

cat(sprintf("Subsample sizes: %s\nReplications: %d\n",
            paste(ss_sizes, collapse = ", "), n_ss_reps))

Subsample sizes: 100, 250, 500, 1000, 2000, 5000
Replications: 20

Show code

# ── Vectorised measure helpers — defined here, used by both sensitivity loops ─

# LongString: max run length per row
longstring_vec <- function(mat) {
  apply(mat, 1, function(x) {
    x <- x[!is.na(x)]
    if (length(x) == 0) return(NA_integer_)
    max(rle(x)$lengths)
  })
}

# IRV
irv_vec <- function(mat) apply(mat, 1, sd, na.rm = TRUE)

# Laz.R: cast to integer first for speed
lazr_vec <- function(mat) {
  mat_int <- matrix(as.integer(mat), nrow = nrow(mat))
  apply(mat_int, 1, Laz.R)
}

# Mahalanobis on log-RT
mahal_vec <- function(log_rt) {
  mu    <- colMeans(log_rt, na.rm = TRUE)
  cov_m <- tryCatch(
    cov(log_rt, use = "pairwise.complete.obs") + diag(1e-4, ncol(log_rt)),
    error = function(e) diag(ncol(log_rt))
  )
  tryCatch(
    mahalanobis(log_rt, center = mu, cov = cov_m),
    error = function(e) rowSums(sweep(log_rt, 2, mu)^2)
  )
}

# Spearman rho
rho_vec <- function(rt_mat, cx) {
  cx_rk <- rank(cx)
  apply(rt_mat, 1, function(rt_i) cor(cx_rk, rank(rt_i), method = "spearman"))
}

# Vectorised weighted tau
wtau_vectorised <- function(rt_mat, cx) {
  pairs   <- combn(length(cx), 2)
  cx_diff <- cx[pairs[1,]] - cx[pairs[2,]]
  w       <- abs(cx_diff)
  w_sum   <- sum(w)
  rt_diff <- rt_mat[, pairs[1,], drop=FALSE] - rt_mat[, pairs[2,], drop=FALSE]
  conc    <- sweep(sign(rt_diff), 2, sign(cx_diff), `*`)
  as.numeric((conc %*% w) / w_sum)
}

# ── Per-subsample performance ─────────────────────────────────────────────────

# For each subsample, recompute measures and apply Kneedle cutoff
compute_ss_perf <- function(row_idx) {
  sub_dat <- dat[row_idx, ]
  sub_res <- sub_dat[, c("ID", "is_non_attender")]
  sub_res$true_inatt <- sub_res$is_non_attender == 1
  n_att   <- sum(!sub_res$true_inatt)
  n_non   <- sum(sub_res$true_inatt)
  if (n_att < 5 || n_non < 5) return(NULL)

  sub_resp <- as.matrix(sub_dat[, y_cols_all])
  sub_rt   <- as.matrix(sub_dat[, rt_cols_all])

  # 1. LongString
  sub_res$longstring      <- longstring_vec(sub_resp)
  # 2. IRV
  sub_res$irv             <- irv_vec(sub_resp)
  # 3. Mahalanobis on log-RT
  sub_res$mahal_rt        <- mahal_vec(log(pmax(sub_rt, 1e-6)))
  # 4. Laz.R
  sub_res$lazr_pooled     <- lazr_vec(sub_resp)
  # 5. Spearman pooled
  sub_res$spearman_pooled <- rho_vec(sub_rt, cx_all)
  # 7. Weighted tau — vectorised
  sub_res$wtau_pooled     <- wtau_vectorised(sub_rt, cx_all)
  # Zh excluded (too slow per subsample)
  sub_res$zh <- NA_real_

  # Apply cutoffs per measure and compute sens / FP rate
  # Mahalanobis uses outlier (2.5th pct) threshold; all others use Kneedle.
  perf_list <- lapply(names(bench_direction), function(nm) {
    m <- bench_direction[[nm]]
    if (nm == "zh") return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))
    signal <- if (m$high_bad) sub_res[[m$col]] else -sub_res[[m$col]]
    signal[!is.finite(signal)] <- NA
    if (all(is.na(signal))) return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))

    if (nm == "mahal_rt") {
      # Outlier exclusion: flag lowest 2.5% of D² (highest 2.5% of -D² signal)
      flag <- signal > quantile(signal, 0.975, na.rm = TRUE)
    } else {
      sorted_sig <- sort(signal, na.last = NA)
      if (length(sorted_sig) < 2)
        return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))
      k_idx <- kneedle_fn(sorted_sig, sign = -1)
      flag  <- signal > sorted_sig[k_idx]
    }
    cs <- confusion_stats(flag, sub_res$true_inatt)
    data.frame(
      measure  = nm,
      sens     = cs$Sensitivity,
      spec     = cs$Specificity,
      fp_n     = cs$FP,
      fn_n     = cs$FN
    )
  })

  do.call(rbind, perf_list)
}

# Run across all sizes and reps
ss_results_list <- lapply(ss_sizes, function(n_s) {
  reps <- ss_idx[[as.character(n_s)]]
  rep_perf <- lapply(reps, function(idx) {
    pf <- tryCatch(compute_ss_perf(idx), error = function(e) NULL)
    if (!is.null(pf)) pf$n <- n_s
    pf
  })
  do.call(rbind, Filter(Negate(is.null), rep_perf))
})

ss_results <- do.call(rbind, ss_results_list)

cat(sprintf("Sensitivity-N analysis complete: %d rows of results.\n", nrow(ss_results)))

Sensitivity-N analysis complete: 840 rows of results.

Sensitivity and False Positive Rate vs. N

Show code

ss_summary <- ss_results |>
  group_by(n, measure) |>
  summarise(
    mean_sens = round(mean(sens, na.rm = TRUE), 3),
    sd_sens   = round(sd(sens,   na.rm = TRUE), 3),
    mean_spec = round(mean(spec, na.rm = TRUE), 3),
    sd_spec   = round(sd(spec,   na.rm = TRUE), 3),
    mean_fp   = round(mean(fp_n, na.rm = TRUE), 1),
    sd_fp     = round(sd(fp_n,   na.rm = TRUE), 1),
    mean_fn   = round(mean(fn_n, na.rm = TRUE), 1),
    sd_fn     = round(sd(fn_n,   na.rm = TRUE), 1),
    .groups   = "drop"
  ) |>
  mutate(measure_label = bench_labels[measure])

make_wide <- function(data, value_col, prefix) {
  data |>
    select(measure_label, n, all_of(value_col)) |>
    tidyr::pivot_wider(names_from = n, values_from = all_of(value_col),
                       names_prefix = prefix) |>
    mutate(is_proposed = measure_label %in% c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))
}

render_wide <- function(wide_df, col_suffix, caption_text) {
  prop_rows <- which(wide_df$is_proposed)
  kable(
    wide_df[, !names(wide_df) %in% "is_proposed"],
    col.names = c("Measure", paste0("N=", ss_sizes)),
    caption   = caption_text
  ) |>
    kable_styling(full_width = FALSE) |>
    row_spec(prop_rows, background = "#e8f5e9", color = "#1b5e20")
}

render_wide(make_wide(ss_summary, "mean_sens", "N="),
  "sens", "Mean sensitivity by N. Zh excluded (GRM refit too slow).")

Mean sensitivity by N. Zh excluded (GRM refit too slow).
Measure	N=100	N=250	N=500	N=1000	N=2000	N=5000
IRV	0.359	0.396	0.426	0.416	0.408	0.401
Laz.R	0.618	0.652	0.653	0.642	0.636	0.630
LongString	0.486	0.543	0.520	0.516	0.520	0.525
Mahalanobis D² (RT)	0.000	0.002	0.001	0.000	0.000	0.000
Spearman ρ ★	0.891	0.930	0.899	0.912	0.906	0.906
Weighted τ ★	0.900	0.937	0.934	0.927	0.931	0.928
Zh	NaN	NaN	NaN	NaN	NaN	NaN

Show code

render_wide(make_wide(ss_summary, "mean_spec", "N="),
  "spec", "Mean specificity by N.")

Mean specificity by N.
Measure	N=100	N=250	N=500	N=1000	N=2000	N=5000
IRV	0.949	0.961	0.964	0.967	0.972	0.974
Laz.R	0.960	0.956	0.957	0.952	0.954	0.953
LongString	0.959	0.971	0.973	0.970	0.966	0.963
Mahalanobis D² (RT)	0.966	0.969	0.971	0.972	0.972	0.972
Spearman ρ ★	0.883	0.876	0.890	0.887	0.889	0.894
Weighted τ ★	0.889	0.877	0.881	0.889	0.883	0.887
Zh	NaN	NaN	NaN	NaN	NaN	NaN

Show code

render_wide(make_wide(ss_summary, "mean_fp", "N="),
  "fp",   "Mean false positive count (flagged attenders) by N.")

Mean false positive count (flagged attenders) by N.
Measure	N=100	N=250	N=500	N=1000	N=2000	N=5000
IRV	4.6	8.6	16.1	29.6	50.5	115
Laz.R	3.6	9.8	19.3	42.8	82.8	209
LongString	3.6	6.4	12.2	27.1	61.0	167
Mahalanobis D² (RT)	3.0	7.0	12.9	25.0	50.0	125
Spearman ρ ★	10.4	27.6	49.2	100.6	197.2	473
Weighted τ ★	9.9	27.4	53.0	98.7	209.1	502
Zh	NaN	NaN	NaN	NaN	NaN	NaN

Show code

render_wide(make_wide(ss_summary, "mean_fn", "N="),
  "fn",   "Mean false negative count (missed non-attenders) by N.")

Mean false negative count (missed non-attenders) by N.
Measure	N=100	N=250	N=500	N=1000	N=2000	N=5000
IRV	7.0	16.3	31.0	63.0	127.8	324
Laz.R	4.2	9.4	18.8	38.7	78.6	200
LongString	5.7	12.3	25.9	52.2	103.8	257
Mahalanobis D² (RT)	11.0	27.0	54.0	108.0	216.0	541
Spearman ρ ★	1.2	1.9	5.4	9.5	20.3	51
Weighted τ ★	1.1	1.7	3.5	7.9	15.0	39
Zh	NaN	NaN	NaN	NaN	NaN	NaN

Show code

ss_colours <- c(
  setNames(colorRampPalette(c("grey60", "steelblue3"))(6),
           setdiff(unique(ss_summary$measure_label),
                   c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))),
  "Spearman \u03c1 \u2605" = "#2e7d32",
  "Weighted \u03c4 \u2605" = "#81c784"
)
ss_sizes_map <- c(
  setNames(rep(0.6, 6),
           setdiff(unique(ss_summary$measure_label),
                   c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))),
  "Spearman \u03c1 \u2605" = 1.4,
  "Weighted \u03c4 \u2605" = 1.4
)

make_ss_line_plot <- function(y_col, sd_col, y_label, title_str) {
  ggplot(ss_summary, aes(x = n, y = .data[[y_col]],
                          colour = measure_label, linewidth = measure_label)) +
    geom_ribbon(aes(ymin = pmax(.data[[y_col]] - .data[[sd_col]], 0),
                    ymax = pmin(.data[[y_col]] + .data[[sd_col]], 1),
                    fill = measure_label),
                alpha = 0.08, colour = NA) +
    geom_line() +
    geom_point(size = 2) +
    scale_x_continuous(breaks = ss_sizes, labels = scales::comma) +
    scale_colour_manual(values = ss_colours, name = NULL) +
    scale_fill_manual(  values = ss_colours, name = NULL, guide = "none") +
    scale_linewidth_manual(values = ss_sizes_map, guide = "none") +
    labs(title = title_str,
         subtitle = "Mean \u00b1 1 SD across replications; Zh excluded",
         x = "Sample size (N)", y = y_label) +
    theme(legend.position = "right", legend.text = element_text(size = 8))
}

make_ss_line_plot("mean_sens", "sd_sens", "Sensitivity", "Sensitivity vs. N") /
make_ss_line_plot("mean_spec", "sd_spec", "Specificity", "Specificity vs. N") +
  plot_layout(guides = "collect")

Sensitivity (top) and specificity (bottom) as a function of sample size N. Points show means across replications; ribbons show ± 1 SD. Proposed features (ρ, τ) are bold green; established measures in grey/blue.

Sensitivity Analysis II: Number of Items

How do the measures degrade as fewer items are available? We create holdout item sets by randomly removing between 0 and 30 items (in steps of 5) from the full pool of 40 items, recomputing each measure on the remaining items only.

Items are removed at random (preserving scale membership proportionally) 15 times per holdout level, and performance is averaged across replications. This simulates scenarios where researchers want to apply RT-based screening to shorter instruments.

Note on Zh. Zh requires fitting a new GRM for every holdout set and is excluded from this sensitivity analysis. Its item-count sensitivity can be inferred from the single-scale Zh results in the benchmark section.

Show code

n_item_reps   <- 15   # replications per removal level
items_removed <- seq(0, 30, by = 5)   # 0, 5, 10, 15, 20, 25, 30

set.seed(2026)

holdout_idx <- lapply(items_removed, function(n_rem) {
  lapply(seq_len(n_item_reps), function(r) {
    if (n_rem == 0) return(seq_len(ni_total))
    removed <- sample(seq_len(ni_total), n_rem, replace = FALSE)
    setdiff(seq_len(ni_total), removed)
  })
})
names(holdout_idx) <- as.character(items_removed)

cat(sprintf("Item removal levels: %s\nReplications: %d\n",
            paste(items_removed, collapse = ", "), n_item_reps))

Item removal levels: 0, 5, 10, 15, 20, 25, 30
Replications: 15

Show code

# All vectorised helpers (longstring_vec, irv_vec, lazr_vec,
# mahal_vec, rho_vec, wtau_vectorised) defined in sensitivity-n-compute above.

compute_item_perf <- function(item_idx) {
  n_items  <- length(item_idx)
  truth    <- results$true_inattentive
  n_att    <- sum(!truth)

  sub_resp <- resp_mat[, item_idx, drop = FALSE]
  sub_rt   <- rt_mat[,  item_idx, drop = FALSE]
  sub_cx   <- cx_all[item_idx]

  tmp <- data.frame(
    longstring      = longstring_vec(sub_resp),
    irv             = irv_vec(sub_resp),
    mahal_rt        = mahal_vec(log(pmax(sub_rt, 1e-6))),
    lazr_pooled     = lazr_vec(sub_resp),
    zh              = NA_real_,          # excluded — too slow per holdout
    spearman_pooled = rho_vec(sub_rt, sub_cx),
    wtau_pooled     = if (n_items >= 2) wtau_vectorised(sub_rt, sub_cx)
                      else rep(NA_real_, nrow(sub_rt))
  )

  perf_list <- lapply(names(bench_direction), function(nm) {
    m <- bench_direction[[nm]]
    if (nm == "zh") return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))
    signal <- if (m$high_bad) tmp[[m$col]] else -tmp[[m$col]]
    signal[!is.finite(signal)] <- NA
    if (all(is.na(signal))) return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))

    if (nm == "mahal_rt") {
      # Outlier exclusion: flag lowest 2.5% of D² (highest 2.5% of -D² signal)
      flag <- signal > quantile(signal, 0.975, na.rm = TRUE)
    } else {
      sorted_sig <- sort(signal, na.last = NA)
      if (length(sorted_sig) < 2) return(data.frame(measure = nm, sens = NA, spec = NA, fp_n = NA, fn_n = NA))
      k_idx <- kneedle_fn(sorted_sig, sign = -1)
      flag  <- signal > sorted_sig[k_idx]
    }
    cs <- confusion_stats(flag, truth)
    data.frame(
      measure = nm,
      sens    = cs$Sensitivity,
      spec    = cs$Specificity,
      fp_n    = cs$FP,
      fn_n    = cs$FN
    )
  })

  do.call(rbind, perf_list)
}

# Run across all item-removal levels and reps
item_results_list <- lapply(items_removed, function(n_rem) {
  reps <- holdout_idx[[as.character(n_rem)]]
  rep_perf <- lapply(reps, function(idx) {
    pf <- tryCatch(compute_item_perf(idx), error = function(e) NULL)
    if (!is.null(pf)) pf$n_removed <- n_rem
    pf
  })
  do.call(rbind, Filter(Negate(is.null), rep_perf))
})

item_results <- do.call(rbind, item_results_list)
item_results$n_items <- ni_total - item_results$n_removed

cat(sprintf("Sensitivity-items analysis complete: %d rows.\n", nrow(item_results)))

Sensitivity-items analysis complete: 735 rows.

Sensitivity and False Positive Rate vs. Number of Items

Show code

item_summary <- item_results |>
  group_by(n_items, measure) |>
  summarise(
    mean_sens = round(mean(sens, na.rm = TRUE), 3),
    sd_sens   = round(sd(sens,   na.rm = TRUE), 3),
    mean_spec = round(mean(spec, na.rm = TRUE), 3),
    sd_spec   = round(sd(spec,   na.rm = TRUE), 3),
    mean_fp   = round(mean(fp_n, na.rm = TRUE), 1),
    sd_fp     = round(sd(fp_n,   na.rm = TRUE), 1),
    mean_fn   = round(mean(fn_n, na.rm = TRUE), 1),
    sd_fn     = round(sd(fn_n,   na.rm = TRUE), 1),
    .groups   = "drop"
  ) |>
  mutate(measure_label = bench_labels[measure])

k_levels <- sort(unique(item_summary$n_items))

make_wide_items <- function(data, value_col) {
  data |>
    select(measure_label, n_items, all_of(value_col)) |>
    tidyr::pivot_wider(names_from = n_items, values_from = all_of(value_col),
                       names_prefix = "k=") |>
    mutate(is_proposed = measure_label %in% c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))
}

render_wide_items <- function(wide_df, caption_text) {
  prop_rows <- which(wide_df$is_proposed)
  kable(
    wide_df[, !names(wide_df) %in% "is_proposed"],
    col.names = c("Measure", paste0("k=", k_levels)),
    caption   = caption_text
  ) |>
    kable_styling(full_width = FALSE) |>
    row_spec(prop_rows, background = "#e8f5e9", color = "#1b5e20")
}

render_wide_items(make_wide_items(item_summary, "mean_sens"),
  "Mean sensitivity by number of retained items (k). Zh excluded.")

Mean sensitivity by number of retained items (k). Zh excluded.
Measure	k=10	k=15	k=20	k=25	k=30	k=35	k=40
IRV	0.385	0.397	0.400	0.398	0.401	0.401	0.401
Laz.R	0.432	0.487	0.555	0.612	0.629	0.632	0.630
LongString	0.502	0.508	0.491	0.490	0.488	0.509	0.525
Mahalanobis D² (RT)	0.026	0.005	0.001	0.001	0.001	0.000	0.000
Spearman ρ ★	0.525	0.631	0.698	0.681	0.816	0.852	0.906
Weighted τ ★	0.591	0.704	0.746	0.744	0.864	0.885	0.928
Zh	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Show code

render_wide_items(make_wide_items(item_summary, "mean_spec"),
  "Mean specificity by number of retained items (k).")

Mean specificity by number of retained items (k).
Measure	k=10	k=15	k=20	k=25	k=30	k=35	k=40
IRV	0.956	0.947	0.956	0.965	0.959	0.969	0.974
Laz.R	0.918	0.943	0.945	0.945	0.949	0.955	0.953
LongString	0.887	0.912	0.951	0.957	0.969	0.966	0.963
Mahalanobis D² (RT)	0.975	0.973	0.972	0.972	0.972	0.972	0.972
Spearman ρ ★	0.900	0.913	0.919	0.923	0.916	0.908	0.894
Weighted τ ★	0.881	0.898	0.907	0.911	0.903	0.903	0.887
Zh	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Show code

render_wide_items(make_wide_items(item_summary, "mean_fp"),
  "Mean false positive count (flagged attenders) by number of retained items (k).")

Mean false positive count (flagged attenders) by number of retained items (k).
Measure	k=10	k=15	k=20	k=25	k=30	k=35	k=40
IRV	194.6	236.1	196.7	153.9	180.6	140.1	115
Laz.R	366.1	253.1	246.7	245.8	226.6	199.1	209
LongString	504.8	392.5	217.7	192.5	139.0	150.6	167
Mahalanobis D² (RT)	110.7	122.3	124.4	124.3	124.5	124.8	125
Spearman ρ ★	446.2	389.9	362.0	341.3	373.3	410.4	473
Weighted τ ★	531.3	456.3	414.1	395.0	432.8	432.8	502
Zh	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Show code

render_wide_items(make_wide_items(item_summary, "mean_fn"),
  "Mean false negative count (missed non-attenders) by number of retained items (k).")

Mean false negative count (missed non-attenders) by number of retained items (k).
Measure	k=10	k=15	k=20	k=25	k=30	k=35	k=40
IRV	332.9	326.1	324.6	325.7	323.9	324.3	324
Laz.R	307.4	277.4	240.9	209.8	200.7	198.9	200
LongString	269.3	265.9	275.5	275.9	276.7	265.9	257
Mahalanobis D² (RT)	526.7	538.3	540.4	540.3	540.5	540.8	541
Spearman ρ ★	257.1	199.5	163.3	172.6	99.5	79.9	51
Weighted τ ★	221.0	160.0	137.4	138.4	73.5	62.1	39
Zh	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Show code

item_colours <- c(
  setNames(colorRampPalette(c("grey60", "steelblue3"))(6),
           setdiff(unique(item_summary$measure_label),
                   c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))),
  "Spearman \u03c1 \u2605" = "#2e7d32",
  "Weighted \u03c4 \u2605" = "#81c784"
)
item_size_map <- c(
  setNames(rep(0.6, 6),
           setdiff(unique(item_summary$measure_label),
                   c("Spearman \u03c1 \u2605", "Weighted \u03c4 \u2605"))),
  "Spearman \u03c1 \u2605" = 1.4,
  "Weighted \u03c4 \u2605" = 1.4
)

make_item_line_plot <- function(y_col, sd_col, y_label, title_str) {
  ggplot(item_summary, aes(x = n_items, y = .data[[y_col]],
                            colour = measure_label, linewidth = measure_label)) +
    geom_ribbon(aes(ymin = pmax(.data[[y_col]] - .data[[sd_col]], 0),
                    ymax = pmin(.data[[y_col]] + .data[[sd_col]], 1),
                    fill = measure_label),
                alpha = 0.08, colour = NA) +
    geom_line() +
    geom_point(size = 2) +
    scale_x_continuous(breaks = k_levels) +
    scale_colour_manual(values = item_colours, name = NULL) +
    scale_fill_manual(  values = item_colours, name = NULL, guide = "none") +
    scale_linewidth_manual(values = item_size_map, guide = "none") +
    labs(title = title_str,
         subtitle = "Mean \u00b1 1 SD across replications",
         x = "Items retained (k)", y = y_label) +
    theme(legend.position = "right", legend.text = element_text(size = 8))
}

make_item_line_plot("mean_sens", "sd_sens", "Sensitivity", "Sensitivity vs. items") /
make_item_line_plot("mean_spec", "sd_spec", "Specificity", "Specificity vs. items") +
  plot_layout(guides = "collect")

Sensitivity (top) and specificity (bottom) as a function of the number of retained items. Ribbons show ± 1 SD across replications. Proposed complexity–RT features are highlighted in green.

Interaction: N × Items

The heatmap below shows mean sensitivity as a joint function of sample size and number of items for each proposed feature, enabling a quick read of the operating conditions under which ρ and τ remain viable.

Show code

# Re-run a coarser grid (N × k) for the heatmap — use fewer reps
n_hm_reps   <- 10
ss_hm_sizes <- ss_sizes
items_hm    <- seq(10, ni_total, by = 10)  # 10, 20, 30, 40

set.seed(999)

hm_results <- do.call(rbind, lapply(ss_hm_sizes, function(n_s) {
  do.call(rbind, lapply(items_hm, function(k) {
    do.call(rbind, lapply(seq_len(n_hm_reps), function(r) {
      att_idx  <- which(dat$is_non_attender == 0)
      non_idx  <- which(dat$is_non_attender == 1)
      prop_non <- mean(dat$is_non_attender == 1)
      n_non    <- round(n_s * prop_non)
      n_att    <- n_s - n_non
      row_idx  <- c(
        sample(att_idx, min(n_att, length(att_idx)), replace = FALSE),
        sample(non_idx, min(n_non, length(non_idx)), replace = FALSE)
      )
      if (length(row_idx) < 20) return(NULL)

      item_idx <- sample(seq_len(ni_total), k, replace = FALSE)
      sub_rt   <- rt_mat[row_idx, ][, item_idx, drop = FALSE]
      sub_cx   <- cx_all[item_idx]
      truth    <- (dat$is_non_attender == 1)[row_idx]
      n_att_s  <- sum(!truth)
      if (n_att_s < 5 || sum(truth) < 5) return(NULL)

      cx_rk_hm <- rank(sub_cx)
      rho_hm   <- apply(sub_rt, 1, function(rt_i)
        cor(cx_rk_hm, rank(rt_i), method = "spearman"))
      wtau_hm  <- wtau_vectorised(sub_rt, sub_cx)

      kneedle_stats <- function(signal, truth) {
        signal[!is.finite(signal)] <- NA
        sorted <- sort(signal, na.last = NA)
        if (length(sorted) < 2)
          return(list(sens = NA, spec = NA, fp_pct = NA, fn_pct = NA))
        k_i  <- kneedle_fn(sorted, sign = -1)
        flag <- signal > sorted[k_i]
        cs   <- confusion_stats(flag, truth)
        n_att_loc <- sum(!truth)
        n_non_loc <- sum(truth)
        list(sens   = cs$Sensitivity,
             spec   = cs$Specificity,
             fp_pct = if (n_att_loc > 0) 100 * cs$FP / n_att_loc else NA,
             fn_pct = if (n_non_loc > 0) 100 * cs$FN / n_non_loc else NA)
      }

      s_rho  <- kneedle_stats(-rho_hm,  truth)
      s_wtau <- kneedle_stats(-wtau_hm, truth)

      data.frame(
        n             = n_s, k = k,
        sens_rho      = s_rho$sens,    sens_wtau   = s_wtau$sens,
        spec_rho      = s_rho$spec,    spec_wtau   = s_wtau$spec,
        fp_pct_rho    = s_rho$fp_pct,  fp_pct_wtau = s_wtau$fp_pct,
        fn_pct_rho    = s_rho$fn_pct,  fn_pct_wtau = s_wtau$fn_pct
      )
    }))
  }))
}))

hm_summary <- hm_results |>
  group_by(n, k) |>
  summarise(
    sens_rho    = round(mean(sens_rho,    na.rm = TRUE), 3),
    sens_wtau   = round(mean(sens_wtau,   na.rm = TRUE), 3),
    spec_rho    = round(mean(spec_rho,    na.rm = TRUE), 3),
    spec_wtau   = round(mean(spec_wtau,   na.rm = TRUE), 3),
    fp_pct_rho  = round(mean(fp_pct_rho,  na.rm = TRUE), 1),
    fp_pct_wtau = round(mean(fp_pct_wtau, na.rm = TRUE), 1),
    fn_pct_rho  = round(mean(fn_pct_rho,  na.rm = TRUE), 1),
    fn_pct_wtau = round(mean(fn_pct_wtau, na.rm = TRUE), 1),
    .groups     = "drop"
  )

fmt_label_y <- function(x) formatC(as.numeric(x), format = "d", big.mark = ",")

make_heatmap <- function(data, rho_col, wtau_col, metric_label,
                         fill_label, low_col, mid_col, high_col,
                         midpoint_val, fmt = "%.2f") {
  long_df <- data |>
    select(n, k, rho = all_of(rho_col), wtau = all_of(wtau_col)) |>
    tidyr::pivot_longer(cols = c(rho, wtau),
                        names_to = "feature", values_to = "value") |>
    mutate(feature = ifelse(feature == "rho", "Spearman \u03c1", "Weighted \u03c4"))

  ggplot(long_df, aes(x = factor(k), y = factor(n), fill = value)) +
    geom_tile(colour = "white", linewidth = 0.5) +
    geom_text(aes(label = sprintf(fmt, value)), size = 2.6) +
    scale_fill_gradient2(low = low_col, mid = mid_col, high = high_col,
                         midpoint = midpoint_val, na.value = "grey85",
                         name = fill_label) +
    facet_wrap(~ feature, ncol = 2) +
    scale_y_discrete(labels = fmt_label_y) +
    labs(title = paste(metric_label, "heatmap: N \u00d7 items retained"),
         x = "Items retained (k)", y = "Sample size (N)") +
    theme(axis.text.x = element_text(angle = 0),
          legend.position = "right")
}

p_hm_sens <- make_heatmap(hm_summary, "sens_rho", "sens_wtau",
  "Sensitivity", "Mean\nSensitivity",
  "tomato", "lightyellow", "#2e7d32", 0.5)

p_hm_spec <- make_heatmap(hm_summary, "spec_rho", "spec_wtau",
  "Specificity", "Mean\nSpecificity",
  "tomato", "lightyellow", "#2e7d32", 0.5)

p_hm_fp <- make_heatmap(hm_summary, "fp_pct_rho", "fp_pct_wtau",
  "False Positive Rate (%)", "Mean FP\nRate (%)",
  "#2e7d32", "lightyellow", "tomato", 50, fmt = "%.1f%%")

p_hm_fn <- make_heatmap(hm_summary, "fn_pct_rho", "fn_pct_wtau",
  "False Negative Rate (%)", "Mean FN\nRate (%)",
  "#2e7d32", "lightyellow", "tomato", 50, fmt = "%.1f%%")

p_hm_sens / p_hm_spec / p_hm_fp / p_hm_fn

Mean sensitivity, specificity, FP count, and FN count for Spearman ρ (left column) and weighted τ (right column) as a joint function of sample size N and items retained k. Grey cells = conditions not evaluated.

The Complexity–RT Feature

Computing Spearman Correlations

We compute the complexity–RT Spearman correlation separately for each scale using that scale’s 20 items and fixed complexity scores:

\[\rho_{i,s} = \text{Spearman}\!\left(\text{rank}(c_1^{(s)}, \ldots, c_{ni}^{(s)}),\; \text{rank}(\text{RT}_{i,1}^{(s)}, \ldots, \text{RT}_{i,ni}^{(s)})\right)\]

No outlier capping is applied. Spearman’s rank correlation is robust to extreme values by construction — a single stall-contaminated RT moves to the top of that person’s rank order but does not inflate the correlation beyond its $[-1, 1]$ bounds. Capping would discard potentially informative signal about relative item difficulty.

Show code

# ── roc_auc, make_roc_df, kneedle_fn, apply_kneedle_flag are now defined in
# ── the setup chunk and available globally ────────────────────────────────────

# Attention-vs-A plot (filters results dynamically — does NOT use a pre-snapshot)
make_attention_plot <- function(col, label) {
  att_df <- results[results$is_non_attender == 0, ]
  na_q <- quantile(results[[col]][results$is_non_attender == 1],
                   probs = c(0.05, 0.25, 0.50, 0.75, 0.95), na.rm = TRUE)
  ggplot(att_df, aes(x = A, y = .data[[col]])) +
    annotate("rect", xmin = -Inf, xmax = Inf,
             ymin = na_q["5%"],  ymax = na_q["95%"],
             fill = "tomato", alpha = 0.08) +
    annotate("rect", xmin = -Inf, xmax = Inf,
             ymin = na_q["25%"], ymax = na_q["75%"],
             fill = "tomato", alpha = 0.12) +
    geom_hline(yintercept = na_q["50%"], colour = "tomato3",
               linetype = "dashed", linewidth = 0.7) +
    geom_point(aes(colour = factor(Z)), alpha = 0.35, size = 1.4) +
    geom_smooth(method = "loess", span = 0.6, se = TRUE,
                colour = "steelblue4", fill = "steelblue", alpha = 0.2) +
    scale_colour_manual(values = c("0" = "grey60", "1" = "steelblue"),
                        labels = c("Control", "Treatment"), name = "Arm") +
    labs(title = label, subtitle = "Red band = non-attender 5–95th pct",
         x = "Latent attention A", y = "Score")
}

# Subtype ridge plot (uses results$subtype — only call after subtype-label chunk)
make_subtype_ridge <- function(col, label) {
  subtype_order <- results |>
    group_by(subtype) |>
    summarise(med = median(.data[[col]], na.rm = TRUE), .groups = "drop") |>
    arrange(med) |>
    pull(subtype)
  results$subtype_f <- factor(results$subtype, levels = subtype_order)
  ggplot(results,
         aes(x = .data[[col]], y = subtype_f,
             fill = ifelse(subtype == "Attender", "Attender", "Non-attender"))) +
    ggridges::geom_density_ridges(
      alpha = 0.55, colour = "white", scale = 0.85,
      quantile_lines = TRUE, quantiles = 2) +
    geom_vline(xintercept = 0, linetype = "dashed",
               colour = "grey30", linewidth = 0.7) +
    scale_fill_manual(values = c("Attender" = "steelblue",
                                  "Non-attender" = "tomato"), name = NULL) +
    labs(title = label, subtitle = "Median line shown; dashed = 0",
         x = "Score", y = NULL) +
    theme(legend.position = "right")
}

# ── Per-scale Spearman rho ────────────────────────────────────────────────────
compute_spearman_scale <- function(dat, scale_idx) {
  ni      <- params$items_per_scale
  rt_cols <- paste0("RT_s", scale_idx, "_i", seq_len(ni))
  cx      <- instrument[[scale_idx]]$complexity
  rt_mat  <- as.matrix(dat[, rt_cols])
  cx_rank <- rank(cx)
  apply(rt_mat, 1, function(rt_i) {
    cor(cx_rank, rank(rt_i), method = "spearman")
  })
}

for (s in seq_len(ns)) {
  results[[paste0("spearman_s", s)]] <- compute_spearman_scale(dat, s)
}

cat("Spearman ρ summary (attenders vs non-attenders):\n")

Spearman ρ summary (attenders vs non-attenders):

Show code

for (col in c("spearman_s1", "spearman_s2")) {
  att <- results[[col]][results$is_non_attender == 0]
  non <- results[[col]][results$is_non_attender == 1]
  cat(sprintf("  %-20s  attender mean = %.3f  |  non-attender mean = %.3f\n",
              col, mean(att, na.rm = TRUE), mean(non, na.rm = TRUE)))
}

  spearman_s1           attender mean = 0.436  |  non-attender mean = -0.010
  spearman_s2           attender mean = 0.408  |  non-attender mean = -0.008

Computing Weighted Kendall τ

Standard Spearman ρ treats every rank inversion equally, so a noise-driven flip between two items whose complexity scores are nearly identical counts the same as a flip between an easy and a hard item. When item complexities cluster closely, even genuinely attentive respondents can accumulate many near-tie inversions from RT noise alone, driving ρ downward unfairly.

The complexity-weighted Kendall τ addresses this by scaling each concordant or discordant pair by the absolute complexity difference between the two items:

\[\tau_w = \frac{\displaystyle\sum_{i<j} |c_i - c_j|\cdot \text{sign}(c_i - c_j)\cdot\text{sign}(\text{RT}_i - \text{RT}_j)} {\displaystyle\sum_{i<j} |c_i - c_j|}\]

A flip between items 0.05 complexity-units apart contributes almost nothing; a flip between items 1.0 apart contributes fully. The statistic is still in $[-1, 1]$, anchored at 0 for random responding, and drops into the same threshold and mixture pipeline as ρ.

Show code

compute_weighted_tau <- function(dat, scale_idx) {
  ni      <- params$items_per_scale
  rt_cols <- paste0("RT_s", scale_idx, "_i", seq_len(ni))
  cx      <- instrument[[scale_idx]]$complexity
  rt_mat  <- as.matrix(dat[, rt_cols])

  pairs   <- combn(ni, 2)        # 2 × C(ni,2) matrix of index pairs
  cx_diff <- cx[pairs[1,]] - cx[pairs[2,]]
  w       <- abs(cx_diff)        # pair weights = |complexity difference|

  apply(rt_mat, 1, function(rt_i) {
    rt_diff     <- rt_i[pairs[1,]] - rt_i[pairs[2,]]
    concordance <- sign(cx_diff) * sign(rt_diff)
    sum(w * concordance) / sum(w)
  })
}

for (s in seq_len(ns)) {
  results[[paste0("wtau_s", s)]] <- compute_weighted_tau(dat, s)
}

cat("Weighted τ summary (attenders vs non-attenders):\n")

Weighted τ summary (attenders vs non-attenders):

Show code

for (col in c("wtau_s1", "wtau_s2")) {
  att <- results[[col]][results$is_non_attender == 0]
  non <- results[[col]][results$is_non_attender == 1]
  cat(sprintf("  %-12s  attender mean = %.3f  |  non-attender mean = %.3f\n",
              col, mean(att, na.rm = TRUE), mean(non, na.rm = TRUE)))
}

  wtau_s1       attender mean = 0.454  |  non-attender mean = -0.010
  wtau_s2       attender mean = 0.459  |  non-attender mean = -0.010

Show code

# Assign subtype here so all downstream chunks (including wtau comparisons) can use it
results$subtype <- with(results, case_when(
  is_non_attender == 0                         ~ "Attender",
  non_attender_style == "plausible_noise"      ~ "plausible_noise",
  TRUE                                         ~ non_attender_pattern_type
))

Comparing Spearman ρ and Weighted τ

Distributions

Show code

make_dist_panel <- function(col, label, xlab) {
  ggplot(results, aes(x = .data[[col]], fill = class_label)) +
    geom_density(alpha = 0.5, colour = NA) +
    geom_vline(xintercept = 0, linetype = "dashed",
               colour = "grey30", linewidth = 0.7) +
    scale_fill_manual(
      values = c("Attender (true)" = "steelblue",
                 "Non-attender (true)" = "tomato"), name = NULL) +
    labs(title = label, x = xlab, y = "Density")
}

(make_dist_panel("spearman_s1", "Spearman ρ — Scale 1", "Spearman ρ") +
 make_dist_panel("spearman_s2", "Spearman ρ — Scale 2", "Spearman ρ")) /
(make_dist_panel("wtau_s1",    "Weighted τ — Scale 1", "Weighted τ") +
 make_dist_panel("wtau_s2",    "Weighted τ — Scale 2", "Weighted τ"))

Distributions of Spearman ρ (top) and weighted τ (bottom) by true class, for each scale. If near-tie noise is inflating FPs, the weighted τ distributions should show better separation — a cleaner gap between attenders and non-attenders.

AUC and Descriptive Comparison

Show code

compare_features <- list(
  "ρ (S1)"   = "spearman_s1",
  "ρ (S2)"   = "spearman_s2",
  "τ_w (S1)" = "wtau_s1",
  "τ_w (S2)" = "wtau_s2"
)

compare_tab <- do.call(rbind, lapply(names(compare_features), function(nm) {
  col <- compare_features[[nm]]
  data.frame(
    Feature  = nm,
    AUC      = round(roc_auc(results[[col]], results$true_inattentive), 3),
    mean_att = round(mean(results[[col]][results$is_non_attender == 0], na.rm = TRUE), 3),
    mean_non = round(mean(results[[col]][results$is_non_attender == 1], na.rm = TRUE), 3),
    sd_att   = round(sd(results[[col]][results$is_non_attender == 0],   na.rm = TRUE), 3),
    sd_non   = round(sd(results[[col]][results$is_non_attender == 1],   na.rm = TRUE), 3)
  )
}))

kable(
  compare_tab,
  col.names = c("Feature", "AUC",
                "Mean (attenders)", "Mean (non-attenders)",
                "SD (attenders)",   "SD (non-attenders)"),
  caption = "Spearman ρ vs. weighted τ: discriminability by scale"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which.max(compare_tab$AUC), bold = TRUE, background = "#eef6ee")

Spearman ρ vs. weighted τ: discriminability by scale
Feature	AUC	Mean (attenders)	Mean (non-attenders)	SD (attenders)	SD (non-attenders)
ρ (S1)	0.909	0.436	-0.010	0.234	0.229
ρ (S2)	0.895	0.408	-0.008	0.226	0.230
τ_w (S1)	0.913	0.454	-0.010	0.237	0.234
τ_w (S2)	0.908	0.459	-0.010	0.235	0.248

How Attention Moderates Weighted τ

Show code

att_results_w <- results[results$is_non_attender == 0, ]

make_attention_plot("wtau_s1", "Weighted τ — Scale 1") +
make_attention_plot("wtau_s2", "Weighted τ — Scale 2")

Weighted τ vs. latent attention A among attenders, for Scale 1 (left) and Scale 2 (right). The red band shows the non-attender 5–95th percentile range. Compare to the equivalent Spearman ρ plots above — the weighted τ should show a tighter attender band if near-tie noise was adding variance to ρ.

Weighted τ by Non-attender Subtype

Show code

make_subtype_ridge("wtau_s1", "Weighted τ — Scale 1") +
make_subtype_ridge("wtau_s2", "Weighted τ — Scale 2")

Weighted τ distributions by subtype for Scale 1 (left) and Scale 2 (right). Compare to the Spearman ρ subtype plots — subtypes that were hard to detect under ρ may show greater separation here.

Show code

wtau_subtype_tab <- results |>
  group_by(subtype) |>
  summarise(
    N          = n(),
    mean_rho_s1  = round(mean(spearman_s1, na.rm = TRUE), 3),
    mean_wtau_s1 = round(mean(wtau_s1,     na.rm = TRUE), 3),
    mean_rho_s2  = round(mean(spearman_s2, na.rm = TRUE), 3),
    mean_wtau_s2 = round(mean(wtau_s2,     na.rm = TRUE), 3),
    .groups = "drop"
  ) |>
  arrange(mean_wtau_s1)

kable(
  wtau_subtype_tab,
  col.names = c("Subtype", "N",
                "Mean ρ (S1)", "Mean τ_w (S1)",
                "Mean ρ (S2)", "Mean τ_w (S2)"),
  caption = "Spearman ρ vs. weighted τ by subtype"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(wtau_subtype_tab$subtype == "Attender"), bold = TRUE)

Spearman ρ vs. weighted τ by subtype
Subtype	N	Mean ρ (S1)	Mean τ_w (S1)	Mean ρ (S2)	Mean τ_w (S2)
midpoint_default	53	-0.034	-0.045	0.016	0.027
plausible_noise	143	-0.039	-0.037	-0.019	-0.011
disacquiescent	25	-0.014	-0.023	-0.029	-0.034
uniform_random	25	-0.021	-0.022	-0.029	-0.055
biased_random	36	-0.004	-0.010	0.023	0.018
acquiescent	54	-0.008	-0.009	-0.007	-0.001
near_straightline	57	-0.016	-0.006	0.042	0.025
straightline	114	-0.002	0.000	-0.006	-0.024
alternating	17	0.117	0.125	-0.129	-0.135
diagonal	17	0.135	0.130	-0.045	-0.007
Attender	4459	0.436	0.454	0.408	0.459

Head-to-head FP / FN: Kneedle Threshold

Show code

# Apply Kneedle to Spearman ρ per scale (also needed here for comparison plots)
kn_s1 <- apply_kneedle_flag(results$spearman_s1)
kn_s2 <- apply_kneedle_flag(results$spearman_s2)

results$flag_kn_s1 <- kn_s1$flag
results$flag_kn_s2 <- kn_s2$flag

thresh_zero_s1   <- 0
thresh_zero_s2   <- 0
thresh_oracle_s1 <- quantile(results$spearman_s1[results$is_non_attender == 0], 0.05, na.rm = TRUE)
thresh_oracle_s2 <- quantile(results$spearman_s2[results$is_non_attender == 0], 0.05, na.rm = TRUE)

results$flag_zero_s1   <- results$spearman_s1 < thresh_zero_s1
results$flag_zero_s2   <- results$spearman_s2 < thresh_zero_s2
results$flag_oracle_s1 <- results$spearman_s1 < thresh_oracle_s1
results$flag_oracle_s2 <- results$spearman_s2 < thresh_oracle_s2

# Apply Kneedle to weighted τ per scale
kn_wtau_s1 <- apply_kneedle_flag(results$wtau_s1)
kn_wtau_s2 <- apply_kneedle_flag(results$wtau_s2)

results$flag_kn_wtau_s1 <- kn_wtau_s1$flag
results$flag_kn_wtau_s2 <- kn_wtau_s2$flag

thresh_zero_wtau_s1   <- 0
thresh_zero_wtau_s2   <- 0
thresh_oracle_wtau_s1 <- quantile(results$wtau_s1[results$is_non_attender == 0], 0.05, na.rm = TRUE)
thresh_oracle_wtau_s2 <- quantile(results$wtau_s2[results$is_non_attender == 0], 0.05, na.rm = TRUE)

results$flag_zero_wtau_s1   <- results$wtau_s1 < thresh_zero_wtau_s1
results$flag_zero_wtau_s2   <- results$wtau_s2 < thresh_zero_wtau_s2
results$flag_oracle_wtau_s1 <- results$wtau_s1 < thresh_oracle_wtau_s1
results$flag_oracle_wtau_s2 <- results$wtau_s2 < thresh_oracle_wtau_s2

cat(sprintf("Kneedle thresholds — ρ:    S1 = %.3f | S2 = %.3f\n",
            kn_s1$threshold, kn_s2$threshold))

Kneedle thresholds — ρ:    S1 = 0.090 | S2 = 0.086

Show code

cat(sprintf("Kneedle thresholds — τ_w:  S1 = %.3f | S2 = %.3f\n",
            kn_wtau_s1$threshold, kn_wtau_s2$threshold))

Kneedle thresholds — τ_w:  S1 = 0.104 | S2 = 0.159

Show code

# Head-to-head performance table
head2head <- do.call(rbind, lapply(list(
    list(flag = results$flag_kn_s1,      label = "ρ (S1)"),
    list(flag = results$flag_kn_wtau_s1, label = "τ_w (S1)"),
    list(flag = results$flag_kn_s2,      label = "ρ (S2)"),
    list(flag = results$flag_kn_wtau_s2, label = "τ_w (S2)")
  ), function(x) {
  cs <- confusion_stats(x$flag, results$true_inattentive)
  data.frame(
    Feature     = x$label,
    Sensitivity = round(cs$Sensitivity, 3),
    Specificity = round(cs$Specificity, 3),
    Pct_Flagged = round(cs$Flagged_pct, 1),
    TP = cs$TP, FP = cs$FP, TN = cs$TN, FN = cs$FN
  )
}))

kable(
  head2head,
  col.names = c("Feature", "Sensitivity", "Specificity",
                "% Flagged", "TP", "FP", "TN", "FN"),
  caption = "Head-to-head: Spearman ρ vs. weighted τ (Kneedle threshold per scale)"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which.max(head2head$Sensitivity), bold = TRUE, background = "#eef6ee")

Head-to-head: Spearman ρ vs. weighted τ (Kneedle threshold per scale)
Feature	Sensitivity	Specificity	% Flagged	TP	FP	TN	FN
ρ (S1)	0.638	0.915	14.5	345	378	4081	196
τ_w (S1)	0.669	0.916	14.7	362	373	4086	179
ρ (S2)	0.658	0.909	15.2	356	406	4053	185
τ_w (S2)	0.739	0.888	18.0	400	498	3961	141

Show code

make_fp_comparison_plot <- function(flag_rho, flag_wtau, scale_label) {
  results |>
    filter(is_non_attender == 0) |>
    tidyr::pivot_longer(
      cols      = c(all_of(flag_rho), all_of(flag_wtau)),
      names_to  = "feature",
      values_to = "flagged"
    ) |>
    mutate(feature = factor(feature,
      levels = c(flag_rho, flag_wtau),
      labels = c("Spearman ρ (Kneedle)", "Weighted τ (Kneedle)"))) |>
    ggplot(aes(x = A, fill = flagged)) +
    geom_density(alpha = 0.55, colour = NA) +
    scale_fill_manual(
      values = c("TRUE" = "tomato", "FALSE" = "steelblue"),
      labels = c("TRUE" = "FP (flagged)", "FALSE" = "TN (retained)"),
      name = NULL) +
    facet_wrap(~ feature, ncol = 2) +
    labs(title = scale_label, x = "Latent attention A", y = "Density")
}

make_fp_comparison_plot("flag_kn_s1", "flag_kn_wtau_s1", "Scale 1") /
make_fp_comparison_plot("flag_kn_s2", "flag_kn_wtau_s2", "Scale 2")

Attention distribution of false positives under Spearman ρ vs. weighted τ (Kneedle threshold), for Scale 1 (top) and Scale 2 (bottom). If weighted τ is reducing near-tie FPs, its FP group should be more concentrated at genuinely low-A attenders.

Show code

rho_wtau_subtype <- results |>
  filter(is_non_attender == 1) |>
  group_by(subtype) |>
  summarise(
    N            = n(),
    det_rho_s1   = round(mean(flag_kn_s1),      3),
    det_wtau_s1  = round(mean(flag_kn_wtau_s1), 3),
    det_rho_s2   = round(mean(flag_kn_s2),      3),
    det_wtau_s2  = round(mean(flag_kn_wtau_s2), 3),
    .groups = "drop"
  ) |>
  arrange(desc(det_wtau_s1))

kable(
  rho_wtau_subtype,
  col.names = c("Subtype", "N",
                "Det. ρ (S1)", "Det. τ_w (S1)",
                "Det. ρ (S2)", "Det. τ_w (S2)"),
  caption = "Detection rates by subtype: Spearman ρ vs. weighted τ (Kneedle)"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(rho_wtau_subtype$det_wtau_s1 < 0.5),  color = "tomato") |>
  row_spec(which(rho_wtau_subtype$det_wtau_s1 >= 0.5), color = "steelblue")

Detection rates by subtype: Spearman ρ vs. weighted τ (Kneedle)
Subtype	N	Det. ρ (S1)	Det. τ_w (S1)	Det. ρ (S2)	Det. τ_w (S2)
disacquiescent	25	0.760	0.760	0.600	0.680
midpoint_default	53	0.774	0.736	0.623	0.660
straightline	114	0.605	0.719	0.649	0.772
plausible_noise	143	0.650	0.692	0.678	0.769
acquiescent	54	0.630	0.648	0.685	0.685
near_straightline	57	0.614	0.614	0.614	0.719
biased_random	36	0.583	0.611	0.639	0.611
uniform_random	25	0.720	0.600	0.720	0.840
alternating	17	0.529	0.471	0.765	1.000
diagonal	17	0.353	0.471	0.647	0.706

Show code

make_rho_wtau_roc_panel <- function(rho_col, wtau_col, truth, scale_label) {
  r_rho  <- make_roc_df(results[[rho_col]],  truth, "Spearman ρ")
  r_wtau <- make_roc_df(results[[wtau_col]], truth, "Weighted τ")
  auc_rho  <- round(roc_auc(results[[rho_col]],  truth), 3)
  auc_wtau <- round(roc_auc(results[[wtau_col]], truth), 3)
  roc_df <- rbind(r_rho, r_wtau)

  ggplot(roc_df, aes(x = fpr, y = tpr, colour = variant)) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey60") +
    geom_path(linewidth = 0.9) +
    scale_colour_manual(
      values = c("Spearman ρ" = "steelblue", "Weighted τ" = "darkorange"),
      labels = c(
        sprintf("Spearman ρ  (AUC = %.3f)", auc_rho),
        sprintf("Weighted τ  (AUC = %.3f)", auc_wtau)
      ),
      name = NULL) +
    labs(title = scale_label,
         x = "False Positive Rate (1 − Specificity)",
         y = "Sensitivity (True Positive Rate)") +
    theme(legend.position = "bottom")
}

make_rho_wtau_roc_panel("spearman_s1", "wtau_s1", results$true_inattentive, "Scale 1") +
make_rho_wtau_roc_panel("spearman_s2", "wtau_s2", results$true_inattentive, "Scale 2")

ROC curves for Spearman ρ and weighted τ on Scale 1 (left) and Scale 2 (right). A higher AUC for τ_w confirms that down-weighting near-tie rank inversions improves discrimination.

Computing Within-person RT Regression Features

Spearman ρ measures whether complexity and RT co-vary in rank order, but it does not quantify how much unexplained RT variation remains after accounting for complexity. To capture this, we fit a within-person OLS regression of log-transformed RT on item complexity alone, pooled across all 40 items:

\[\log(\text{RT}_{ij}) = \alpha_i + \beta_i \cdot c_j + \varepsilon_{ij}\]

This gives two per-person features:

$\hat\beta_i$ — complexity slope: how much log-RT increases per unit of complexity. Positive for attenders; near zero for non-attenders who are insensitive to item difficulty.
MAR — mean absolute residual $\frac{1}{n}\sum_j|\hat\varepsilon_{ij}|$: average unexplained log-RT variance after accounting for complexity. High MAR means RT is highly variable in ways that complexity alone cannot predict.

Note that MAR and Spearman ρ are complementary rather than redundant. ρ captures whether the rank order of RTs matches the rank order of complexity; MAR captures how much absolute deviation remains around the complexity trend, regardless of whether that trend is positive. A respondent could have a moderate positive ρ but high MAR (complexity partially predicts their RTs but with a lot of scatter), or low ρ and low MAR (flat RT pattern with little variance — characteristic of patterned non-attenders).

Show code

compute_rt_regression <- function(dat) {
  ni  <- params$items_per_scale

  cx_all <- unlist(lapply(seq_len(ns), function(s) instrument[[s]]$complexity))

  rt_cols_all <- unlist(lapply(seq_len(ns), function(s)
    paste0("RT_s", s, "_i", seq_len(ni))))

  rt_mat <- as.matrix(dat[, rt_cols_all])

  # Fit OLS per person: log(RT) ~ complexity
  # Use vapply to guarantee a 2-row matrix regardless of nrow(dat)
  out <- vapply(seq_len(nrow(rt_mat)), function(i) {
    rt_i   <- rt_mat[i, ]
    log_rt <- log(pmax(rt_i, 1e-6))

    fit <- tryCatch(
      lm.fit(x = cbind(1, cx_all), y = log_rt),
      error = function(e) NULL
    )

    if (is.null(fit) || length(fit$coefficients) < 2) {
      return(c(beta = NA_real_, mar = NA_real_))
    }

    c(beta = fit$coefficients[[2]],
      mar  = mean(abs(fit$residuals), na.rm = TRUE))
  }, numeric(2))

  # out is 2 x nrow(dat); transpose to nrow x 2
  out <- t(out)
  colnames(out) <- c("beta", "mar")

  data.frame(
    reg_beta = out[, "beta"],
    reg_mar  = out[, "mar"]
  )
}

reg_feats       <- compute_rt_regression(dat)
results$reg_beta <- reg_feats$reg_beta
results$reg_mar  <- reg_feats$reg_mar

cat("Regression feature summary (attenders vs non-attenders):\n")

Regression feature summary (attenders vs non-attenders):

Show code

for (col in c("reg_beta", "reg_mar")) {
  att <- results[[col]][results$is_non_attender == 0]
  non <- results[[col]][results$is_non_attender == 1]
  cat(sprintf("  %-12s  attender mean = %+.3f  |  non-attender mean = %+.3f\n",
              col, mean(att, na.rm = TRUE), mean(non, na.rm = TRUE)))
}

  reg_beta      attender mean = +0.192  |  non-attender mean = -0.003
  reg_mar       attender mean = +0.273  |  non-attender mean = +0.194

Understanding the Feature: Instrument Complexity Structure

Show code

cx_df <- do.call(rbind, lapply(seq_len(ns), function(s) {
  data.frame(
    scale      = paste0("Scale ", s),
    item       = seq_len(ni),
    complexity = instrument[[s]]$complexity
  )
}))

ggplot(cx_df, aes(x = item, y = complexity)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") +
  geom_point(size = 2.5, colour = "steelblue") +
  geom_segment(aes(xend = item, yend = 0), linewidth = 0.4, alpha = 0.5,
               colour = "steelblue") +
  facet_wrap(~ scale, ncol = 2) +
  labs(title    = "Item complexity scores by scale",
       subtitle = sprintf(
         "complexity_rt_weight = %g — each SD of complexity multiplies RT by exp(%g × A)",
         params$complexity_rt_weight, params$complexity_rt_weight),
       x = "Item index", y = "Complexity score")

Per-item complexity scores for each scale. Items to the right are harder and should elicit longer RTs from attentive respondents.

Feature Distributions: Scale 1 and Scale 2

Show code

rho_variants <- list(
  "Scale 1" = "spearman_s1",
  "Scale 2" = "spearman_s2"
)

plots_all <- lapply(names(rho_variants), function(label) {
  col <- rho_variants[[label]]
  ggplot(results, aes(x = .data[[col]], fill = class_label)) +
    geom_density(alpha = 0.5, colour = NA) +
    geom_vline(xintercept = 0, linetype = "dashed",
               colour = "grey30", linewidth = 0.7) +
    scale_fill_manual(
      values = c("Attender (true)" = "steelblue",
                 "Non-attender (true)" = "tomato"), name = NULL) +
    labs(title = label, x = "Spearman ρ", y = "Density")
})

wrap_plots(plots_all, ncol = 2)

Distributions of Spearman ρ for Scale 1 (left) and Scale 2 (right) by true attender status. The dashed line marks ρ = 0. Wider separation between blue and red distributions indicates better discriminative power.

Discriminability Comparison

```cfrzpxertmbtgwhlzswbaphwtchrodaauc_tab <- data.frame( Variant = names(rho_variants), n_items = c(ni, ni), AUC = sapply(rho_variants, function(col) round(roc_auc(results[[col]], results$true_inattentive), 3)), mean_att = sapply(rho_variants, function(col) round(mean(results[[col]][results$is_non_attender == 0], na.rm = TRUE), 3)), mean_non = sapply(rho_variants, function(col) round(mean(results[[col]][results$is_non_attender == 1], na.rm = TRUE), 3)), sd_att = sapply(rho_variants, function(col) round(sd(results[[col]][results$is_non_attender == 0], na.rm = TRUE), 3)), sd_non = sapply(rho_variants, function(col) round(sd(results[[col]][results$is_non_attender == 1], na.rm = TRUE), 3)), row.names = NULL )

kable( auc_tab, col.names = c(“Scale”, “Items”, “AUC”, “Mean ρ (attenders)”, “Mean ρ (non-attenders)”, “SD ρ (attenders)”, “SD ρ (non-attenders)”), caption = “Discriminability of Spearman ρ by scale” ) |> kable_styling(full_width = FALSE) |> row_spec(which.max(auc_tab$AUC), bold = TRUE, background = “#eef6ee”)


## How Attention Moderates Each Scale


::: {.cell}

```{.r .cell-code}
make_attention_plot("spearman_s1", "Scale 1") +
make_attention_plot("spearman_s2", "Scale 2")

Spearman ρ for Scale 1 (left) and Scale 2 (right) vs. latent attention A among attenders. The red band shows the non-attender 5–95th percentile range.

:::

Feature by Non-attender Subtype

Show code

make_subtype_ridge("spearman_s1", "Scale 1") +
make_subtype_ridge("spearman_s2", "Scale 2")

Spearman ρ distributions by subtype (ridge plots), shown separately for Scale 1 (left) and Scale 2 (right). Subtypes whose distribution overlaps heavily with attenders will be harder to detect.

Show code

subtype_sp_tab <- results |>
  group_by(subtype) |>
  summarise(
    N       = n(),
    mean_s1 = round(mean(spearman_s1, na.rm = TRUE), 3),
    mean_s2 = round(mean(spearman_s2, na.rm = TRUE), 3),
    .groups = "drop"
  ) |>
  arrange(mean_s1)

kable(
  subtype_sp_tab,
  col.names = c("Subtype", "N", "Mean ρ (S1)", "Mean ρ (S2)"),
  caption   = "Mean Spearman ρ by scale and subtype"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(subtype_sp_tab$subtype == "Attender"), bold = TRUE)

Mean Spearman ρ by scale and subtype
Subtype	N	Mean ρ (S1)	Mean ρ (S2)
plausible_noise	143	-0.039	-0.019
midpoint_default	53	-0.034	0.016
uniform_random	25	-0.021	-0.029
near_straightline	57	-0.016	0.042
disacquiescent	25	-0.014	-0.029
acquiescent	54	-0.008	-0.007
biased_random	36	-0.004	0.023
straightline	114	-0.002	-0.006
alternating	17	0.117	-0.129
diagonal	17	0.135	-0.045
Attender	4459	0.436	0.408

Regression Feature Distributions

Show code

p_beta <- ggplot(results, aes(x = reg_beta, fill = class_label)) +
  geom_density(alpha = 0.5, colour = NA) +
  geom_vline(xintercept = 0, linetype = "dashed",
             colour = "grey30", linewidth = 0.7) +
  scale_fill_manual(
    values = c("Attender (true)" = "steelblue",
               "Non-attender (true)" = "tomato"), name = NULL) +
  labs(title    = "β̂ — complexity slope",
       subtitle = "log(RT) increase per unit complexity",
       x = "β̂", y = "Density")

p_mar <- ggplot(results, aes(x = reg_mar, fill = class_label)) +
  geom_density(alpha = 0.5, colour = NA) +
  scale_fill_manual(
    values = c("Attender (true)" = "steelblue",
               "Non-attender (true)" = "tomato"), name = NULL) +
  labs(title    = "MAR — mean absolute residual",
       subtitle = "Unexplained log(RT) variance after complexity-only regression",
       x = "MAR", y = "Density")

p_beta + p_mar

Distributions of the two within-person regression features by true attender status. β (complexity slope) should be positive for attenders and near zero for non-attenders. MAR should be lower for attenders whose RTs follow a clean complexity-sensitive pattern.

Show code

att_only <- results[results$is_non_attender == 0, ]

make_reg_attention_plot <- function(col, label, ylab) {
  na_q <- quantile(results[[col]][results$is_non_attender == 1],
                   probs = c(0.05, 0.25, 0.50, 0.75, 0.95), na.rm = TRUE)
  ggplot(att_only, aes(x = A, y = .data[[col]])) +
    annotate("rect", xmin = -Inf, xmax = Inf,
             ymin = na_q["5%"],  ymax = na_q["95%"],
             fill = "tomato", alpha = 0.08) +
    annotate("rect", xmin = -Inf, xmax = Inf,
             ymin = na_q["25%"], ymax = na_q["75%"],
             fill = "tomato", alpha = 0.12) +
    geom_hline(yintercept = na_q["50%"], colour = "tomato3",
               linetype = "dashed", linewidth = 0.7) +
    geom_point(aes(colour = factor(Z)), alpha = 0.35, size = 1.4) +
    geom_smooth(method = "loess", span = 0.6, se = TRUE,
                colour = "steelblue4", fill = "steelblue", alpha = 0.2) +
    geom_vline(xintercept = 0, colour = NA) +
    scale_colour_manual(values = c("0" = "grey60", "1" = "steelblue"),
                        labels = c("Control", "Treatment"), name = "Arm") +
    labs(title = label,
         subtitle = "Red band = non-attender 5–95th pct",
         x = "Latent attention A", y = ylab)
}

make_reg_attention_plot("reg_beta", "β̂ vs. latent attention", "β̂ (complexity slope)")

β̂ vs. latent attention A among attenders. The feature should increase with A because more attentive respondents are more sensitive to item difficulty. The red band shows the non-attender distribution for reference.

Regression Features by Non-attender Subtype

Show code

subtype_reg_tab <- results |>
  group_by(subtype) |>
  summarise(
    N          = n(),
    mean_beta  = round(mean(reg_beta, na.rm = TRUE), 3),
    mean_mar   = round(mean(reg_mar,  na.rm = TRUE), 3),
    .groups    = "drop"
  ) |>
  arrange(mean_beta)

kable(
  subtype_reg_tab,
  col.names = c("Subtype", "N", "Mean β̂", "Mean MAR"),
  caption   = "Within-person regression features by group and subtype"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(subtype_reg_tab$subtype == "Attender"), bold = TRUE)

Within-person regression features by group and subtype
Subtype	N	Mean β	\| Mean MA
uniform_random	25	-0.009	0.192
disacquiescent	25	-0.008	0.198
plausible_noise	143	-0.005	0.197
alternating	17	-0.003	0.179
midpoint_default	53	-0.003	0.198
straightline	114	-0.003	0.192
acquiescent	54	0.000	0.196
biased_random	36	0.000	0.199
near_straightline	57	0.000	0.186
diagonal	17	0.017	0.199
Attender	4459	0.192	0.273

AUC of Regression Features vs. Spearman ρ

Show code

all_features <- list(
  "ρ (Scale 1)"  = list(col = "spearman_s1", direction = "low"),
  "ρ (Scale 2)"  = list(col = "spearman_s2", direction = "low"),
  "β̂ (slope)"   = list(col = "reg_beta",    direction = "low"),
  "MAR"           = list(col = "reg_mar",     direction = "high")
)

feat_auc_tab <- do.call(rbind, lapply(names(all_features), function(nm) {
  col  <- all_features[[nm]]$col
  dir  <- all_features[[nm]]$direction
  scores <- if (dir == "low") results[[col]] else -results[[col]]
  auc    <- roc_auc(scores, results$true_inattentive)
  data.frame(
    Feature   = nm,
    Direction = ifelse(dir == "low", "Low → inattentive", "High → inattentive"),
    AUC       = round(auc, 3),
    mean_att  = round(mean(results[[col]][results$is_non_attender == 0], na.rm=TRUE), 3),
    mean_non  = round(mean(results[[col]][results$is_non_attender == 1], na.rm=TRUE), 3)
  )
}))

kable(
  feat_auc_tab,
  col.names = c("Feature", "Flag direction", "AUC",
                "Mean (attenders)", "Mean (non-attenders)"),
  caption   = "Individual-feature AUCs: Spearman ρ (per scale) and regression features"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which.max(feat_auc_tab$AUC), bold = TRUE,
           background = "#eef6ee")

Individual-feature AUCs: Spearman ρ (per scale) and regression features
Feature	Flag direction	AUC	Mean (attenders)	Mean (non-attenders)
ρ (Scale 1)	Low → inattentive	0.909	0.436	-0.010
ρ (Scale 2)	Low → inattentive	0.895	0.408	-0.008
β̂ (slope)	\|Low → inattentive	\| 0.96	\| 0.19	\| -0.00
MAR	High → inattentive	0.047	0.273	0.194