Detecting Inattentive Respondents in Survey Experiments

Using Factor Mixture Models

Author

Jamie C. Lee

Published

April 14, 2026

Overview

This document attempts to identify (different types of) inattentive respondents in a simulated experiment. The simulation generates 500 participants, of whom 15% are true non-attenders (i.e., is_non_attender == 1). Non-attenders fall into two broad categories: patterned responders (straightliners, alternators, acquiescent responders, etc.) whose responses follow detectable regularities, and plausible noise responders whose responses superficially resemble genuine engagement but are entirely content-free. Among true attenders, latent attention \(A \sim \text{Beta}(3, 1.5)\) varies continuously — most are fairly attentive but a tail exists at lower attention levels.

Three approaches are evaluated, built up incrementally. High-level findings for each approach are summarized inline (in italics) and discussed in detail in the corresponding section:

  1. LAZR (+ Kneedle): A first-order Markov chain statistic (introduced by Biemann et al. 2025) applied to raw response patterns on multi-item individual difference scales, measuring sequential transitional predictability. Higher values indicate more patterned responding. A data-driven cutoff is identified using a Kneedle elbow algorithm, applied to the sorted Laz.R distribution.

    LAZR is useful for flagging obviously patterned non-attenders but has two limitations. First, it produces false negatives (FN) for inattentive subtypes whose responses lack detectable sequential structure (e.g., uniform random, biased random, and plausible noise respondents). These individuals’ response transitions are sufficiently irregular to fall below the Kneedle threshold, despite being content-free. Second, it produces false positives (FP) among attenders whose genuine attitudes cluster near the scale midpoint, moderate, consistent responding produces transition patterns that superficially resemble patterned inattention.

  2. Finite Mixture Model (2-feature: LAZR + RT): A Gaussian mixture model fit to two jointly informative features: 1) mean Laz.R across scales, and 2) within-person standard deviation of logged response times across items. The RT feature captures whether a person shows meaningful item-to-item timing variability, which genuine attenders exhibit due to content sensitivity, reverse-item processing, and fatigue effects that non-attenders lack entirely. Importantly, this feature assumes that the survey contains sufficient heterogeneity in item demands or complexity (e.g., a mix of forward and reverse-coded items, variation in item length, etc.), such that attentive respondents are naturally induced to vary their response pace across items. In surveys with highly homogeneous item complexity, the timing spread of attenders and non-attenders may converge, reducing the discriminative value of this feature.

    Adding RT variability to the mixture model substantially reduces FP rates relative to Laz.R alone: attenders whose genuine midpoint-consistent responses were previously misclassified as patterned are now correctly retained because their item-to-item timing variation signals genuine engagement. However, this feature does not improve FN rates. It provides no additional signal to detect uniform random, biased random, or plausible noise respondents.

  3. Finite Mixture Model (3-feature: LAZR + RT + Zh): The 2-feature mixture model extended with Zh, a standardised person-fit statistic from a Graded Response Model (GRM) estimated via iterative purification. Zh measures how well a person’s responses are explained by a single latent trait. Attenders’ responses are causally driven by a latent construct and should fit the GRM well, while non-attenders’ responses are not generated by any latent trait and should show systematically poor fit. Item parameters are estimated on a purified subsample to prevent contamination by non-attenders.

    Adding Zh substantially improves detection of uniform random and biased random responders, whose responses lack the local trait-consistency the GRM expects regardless of where their trait level is estimated. However, Zh fails to separate plausible noise responders from genuine attenders. This is because plausible noise responses are anchored around a scale point with partial sensitivity to item direction, a response pattern that is structurally similar to that of a genuine trait attender. The GRM estimates a plausible trait level for these respondents and finds their responses only weakly inconsistent with it, producing Zh values that overlap substantially with the attender distribution. In this sense, plausible noise represents a fundamental identification boundary for IRT-based person-fit statistics: respondents whose content-free responses happen to be locally consistent with some trait level will evade detection by any method that relies on trait-response consistency alone.

Throughout, classifier performance is evaluated against ground truth using sensitivity, specificity, and subtype-level detection rates.


Simulation Ground Truth

Before evaluating classifiers, we describe the true composition of the sample.

Show code
# Class breakdown
class_tab <- data.frame(
  Group            = c("Attenders", "Non-attenders"),
  N                = c(sum(dat$is_non_attender == 0),
                       sum(dat$is_non_attender == 1)),
  Pct              = round(100 * c(mean(dat$is_non_attender == 0),
                                    mean(dat$is_non_attender == 1)), 1)
)

# Non-attender subtype breakdown
na_dat   <- dat[dat$is_non_attender == 1, ]
sub_tab  <- as.data.frame(table(
  Style   = na_dat$non_attender_style,
  Pattern = na_dat$non_attender_pattern_type
))
sub_tab  <- sub_tab[sub_tab$Freq > 0, ]

kable(class_tab, col.names = c("Group", "N", "%"),
      caption = "Sample composition") |>
  kable_styling(full_width = FALSE)
Sample composition
Group N %
Attenders 424 84.8
Non-attenders 76 15.2
Show code
kable(sub_tab, caption = "Non-attender subtypes") |>
  kable_styling(full_width = FALSE)
Non-attender subtypes
Style Pattern Freq
2 with_pattern acquiescent 17
4 with_pattern alternating 2
6 with_pattern biased_random 6
8 with_pattern disacquiescent 3
10 with_pattern midpoint_default 4
12 with_pattern near_straightline 8
14 with_pattern straightline 13
16 with_pattern uniform_random 5

Non-attenders fall into two broad styles: plausible noise (hard to detect — responses look human but are not based on the individual’s true latent trait) and patterned subtypes (straightline, alternating, diagonal, etc.). The key challenge is distinguishing these different subtypes from genuine attenders, based on observables.

Attention and Dose Among Attenders

Show code
att_dat <- dat[dat$is_non_attender == 0, ]

# Theoretical logistic curve
a_seq  <- seq(0, 1, length.out = 300)
curve_df <- data.frame(
  A    = a_seq,
  Dose = inv_logit(params$dose_steepness * (a_seq - params$dose_halfpoint))
)

p_scatter <- ggplot() +
  geom_point(
    data  = att_dat[att_dat$Z == 1, ],
    aes(x = A, y = Dose),
    alpha = 0.45, size = 1.8, colour = "steelblue"
  ) +
  geom_line(
    data      = curve_df,
    aes(x = A, y = Dose),
    colour    = "tomato", linewidth = 1, linetype = "solid"
  ) +
  geom_vline(
    xintercept = params$dose_halfpoint,
    linetype   = "dashed", colour = "grey40", linewidth = 0.7
  ) +
  annotate(
    "text", x = params$dose_halfpoint + 0.02, y = 0.05,
    label = sprintf("halfpoint = %.2f", params$dose_halfpoint),
    hjust = 0, size = 3.5, colour = "grey30"
  ) +
  labs(
    title    = "Attention → Dose mapping (treated attenders)",
    subtitle = sprintf(
      "Logistic(steepness = %g, halfpoint = %g) + noise(sd = %g)",
      params$dose_steepness, params$dose_halfpoint, params$dose_noise
    ),
    x = "Latent attention A",
    y = "Realised Dose"
  )

p_dist <- ggplot(att_dat, aes(x = A, fill = factor(Z))) +
  geom_density(alpha = 0.45, colour = NA) +
  geom_vline(
    xintercept = params$dose_halfpoint,
    linetype   = "dashed", colour = "grey40", linewidth = 0.7
  ) +
  scale_fill_manual(
    values = c("0" = "grey60", "1" = "steelblue"),
    labels = c("Control (Z=0)", "Treatment (Z=1)"),
    name   = NULL
  ) +
  labs(
    title    = "Distribution of A by treatment arm (attenders only)",
    subtitle = "Distributions should overlap — A is independent of Z by design",
    x = "Latent attention A",
    y = "Density"
  )

p_scatter + p_dist

Among treated attenders, Dose is a logistic function of latent attention A, plus noise. The red curve shows the theoretical logistic mapping; blue points are realised doses. The right panel shows the marginal distribution of A, with the dose halfpoint marked.
Show code
att_summary <- att_dat |>
  group_by(Z) |>
  summarise(
    N          = n(),
    Mean_A     = round(mean(A),      3),
    SD_A       = round(sd(A),        3),
    Mean_Dose1 = round(mean(Dose_1), 3),
    SD_Dose1   = round(sd(Dose_1),   3),
    .groups    = "drop"
  ) |>
  mutate(Z = factor(Z, labels = c("Control (Z=0)", "Treatment (Z=1)")))

kable(
  att_summary,
  col.names = c("Arm", "N", "Mean A", "SD A",
                "Mean Dose₁", "SD Dose₁"),
  caption   = "Attention and potential dose (Dose_1) by treatment arm (attenders only)"
) |>
  kable_styling(full_width = FALSE)
Attention and potential dose (Dose_1) by treatment arm (attenders only)
Arm N Mean A SD A Mean Dose₁ SD Dose₁
Control (Z=0) 210 0.684 0.204 0.628 0.181
Treatment (Z=1) 214 0.672 0.187 0.617 0.163

Because Z is randomly assigned, A should be balanced across arms — Dose_1 (the potential dose each attender would receive under treatment) is therefore also balanced. Realised Dose equals Dose_1 only for treated participants; controls always have Dose = 0.


Approach 1: LAZR + Kneedle Algorithm

What is LAZR?

LAZR (Laz.R) measures sequential conditional predictability in raw item responses. For a person’s response vector \(x_1, x_2, \ldots, x_n\), it builds a transition probability matrix across adjacent item responses.

Under random responding (uniform transitions), LAZR \(\approx 1/K\) where \(K\) is the number of response categories. Under fully deterministic responding (e.g., straightlining), LAZR \(= 1\). Higher values indicate more predictable, less content-driven responding.

The Kneedle algorithm identifies a data-driven exclusion threshold by locating the point of maximum curvature — the “elbow” — in the sorted LAZR distribution. Respondents above this threshold are flagged as potentially inattentive.

Approach 1 summary. LAZR + Kneedle performs well for obviously patterned non-attenders (straightline, near-straightline, alternating, diagonal) but has two systematic limitations. False negatives: uniform random, biased random, and plausible noise responders produce sufficiently irregular response transitions to fall below the Kneedle threshold, and are therefore missed. False positives: attenders with genuine midpoint-consistent response styles are sometimes flagged because their stable, moderate responses produce transition patterns that superficially resemble patterned inattention. Subtype-level detection rates and the overall false positive rate are reported below.

Show code
# LAZR statistic
Laz.R <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  if (n < 2) return(NA_real_)
  tr      <- table(x[-n], x[-1])
  row_sum <- rowSums(tr)
  row_sum[row_sum == 0] <- NA
  sum(rowSums(tr^2) / row_sum, na.rm = TRUE) / (n - 1)
}

# Kneedle algorithm
dist2d <- function(a, b, c) {
  v1 <- b - c; v2 <- a - b
  det(cbind(v1, v2)) / sqrt(sum(v1 * v1))
}

kneedle <- function(values, sign) {
  start <- c(1, values[1])
  end   <- c(length(values), values[length(values)])
  which.max(sapply(seq_along(values), function(idx) {
    sign * -1 * dist2d(c(idx, values[idx]), start, end)
  }))
}

compute_laz_cutoff <- function(x) {
  x_sorted <- sort(x, decreasing = TRUE, na.last = NA)
  k_idx    <- kneedle(x_sorted, sign = -1)
  cutoff_p <- k_idx / length(x_sorted)
  quantile(x, probs = 1 - cutoff_p, na.rm = TRUE)
}

compute_lazr <- function(dat, scale_idx) {
  y_cols <- paste0("Y_s", scale_idx, "_i", seq_len(ni))
  apply(as.matrix(dat[, y_cols]), 1, Laz.R)
}

Computing LAZR and Thresholds

Show code
# Build results data frame
results <- dat[, c("ID", "Z", "is_non_attender", "A",
                    "non_attender_style", "non_attender_pattern_type")]
results$true_inattentive <- results$is_non_attender == 1
results$class_label      <- factor(
  ifelse(results$true_inattentive, "Non-attender (true)", "Attender (true)"))

# LAZR per scale
for (s in seq_len(ns)) {
  results[[paste0("lazr_s", s)]] <- compute_lazr(dat, s)
}

# Kneedle thresholds and flags
kneedle_flags <- list()
for (s in seq_len(ns)) {
  col    <- paste0("lazr_s", s)
  thresh <- compute_laz_cutoff(results[[col]])
  flag   <- results[[col]] >= thresh
  results[[paste0("flag_kneedle_s", s)]] <- flag
  kneedle_flags[[s]] <- list(threshold = thresh, n_flagged = sum(flag))
}

# Union flag: flagged by either scale
results$flag_union    <- results$flag_kneedle_s1 | results$flag_kneedle_s2
results$attentive_kneedle <- !results$flag_union
Show code
plots_lazr <- lapply(seq_len(ns), function(s) {
  local({
    col    <- paste0("lazr_s", s)
    thresh <- kneedle_flags[[s]]$threshold
    df     <- results
    df$class_label <- factor(ifelse(df$true_inattentive,
                                     "Non-attender (true)", "Attender (true)"))
    ggplot(df, aes(x = .data[[col]], fill = class_label)) +
      geom_density(alpha = 0.5, color = NA) +
      geom_vline(xintercept = thresh, linetype = "dashed",
                 color = "black", linewidth = 0.9) +
      annotate("text", x = thresh + 0.01, y = Inf,
               label = sprintf("Kneedle\n%.3f", thresh),
               hjust = 0, vjust = 1.4, size = 3.5) +
      scale_fill_manual(
        values = c("Attender (true)" = "steelblue",
                   "Non-attender (true)" = "tomato"), name = NULL) +
      labs(title    = sprintf("Scale %d — LAZR distribution", s),
           subtitle = "Higher LAZR = more predictable transitions",
           x = "LAZR score", y = "Density")
  })
})
wrap_plots(plots_lazr, ncol = 2)

LAZR score distributions by true attender status. The dashed vertical line shows the Kneedle threshold. Non-attenders (red) have a somewhat uniform distribution of LAZR values, creating a meaningful overlap with attenders. False negative (retaining non-attenders) rates tend to be higher than false positive (excluding attentive participants) rates.
Show code
plots_elbow <- lapply(seq_len(ns), function(s) {
  local({
    col    <- paste0("lazr_s", s)
    thresh <- kneedle_flags[[s]]$threshold
    scores <- sort(results[[col]], decreasing = TRUE)
    df     <- data.frame(rank = seq_along(scores), lazr = scores)
    df$flagged <- df$lazr >= thresh
    ggplot(df, aes(x = rank, y = lazr, color = flagged)) +
      geom_line(linewidth = 0.7) +
      geom_point(data = df[which.min(abs(df$lazr - thresh)), ],
                 aes(x = rank, y = lazr),
                 color = "black", size = 4, shape = 21, fill = "yellow") +
      scale_color_manual(values = c("FALSE" = "steelblue", "TRUE" = "tomato"),
                          guide = "none") +
      labs(title    = sprintf("Scale %d — Sorted LAZR (Kneedle elbow)", s),
           subtitle = "Yellow = knee point; red = flagged",
           x = "Rank (highest first)", y = "LAZR score")
  })
})
wrap_plots(plots_elbow, ncol = 2)

Sorted LAZR scores with the Kneedle elbow point (yellow dot). Respondents to the left of the elbow (red) are flagged as potentially inattentive.

Laz.R Performance: How good is Laz.R at identifying attentive and inattentive participants?

The tables below report Laz.R’s sensitivity, specificity, FP, and FN rates.

  • Sensitivity: Of all true non-attenders, what fraction did Laz.R correctly flag as inattentive?

  • Specificity: Of all true attenders, what fraction did Laz.R correctly pass as attentive?

Show code
# ── LAZR + Kneedle performance summary ────────────────────────────────────
cs_kneedle <- confusion_stats(results$flag_union, results$true_inattentive)

# Overall sensitivity and specificity
overall_tab <- data.frame(
  Metric = c("Sensitivity", "Specificity", "% Flagged", "TP", "FP", "TN", "FN"),
  Value  = round(unlist(cs_kneedle[c("Sensitivity", "Specificity",
                                      "Flagged_pct", "TP", "FP", "TN", "FN")]), 3)
)

kable(
  overall_tab,
  col.names = c("Metric", "Value"),
  caption   = "Overall performance — LAZR + Kneedle"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(overall_tab$Metric %in% c("Sensitivity", "Specificity")),
           bold = TRUE)
Overall performance — LAZR + Kneedle
Metric Value
Sensitivity Sensitivity 0.632
Specificity Specificity 0.901
Flagged_pct % Flagged 18.000
TP TP 48.000
FP FP 42.000
TN TN 382.000
FN FN 28.000
Show code
# Detection rates by subtype
subtype_kneedle <- results |>
  mutate(
    subtype = case_when(
      is_non_attender == 0 ~ "Attender",
      non_attender_style == "plausible_noise" ~ "plausible_noise",
      TRUE ~ non_attender_pattern_type
    )
  ) |>
  group_by(subtype) |>
  summarise(
    N         = n(),
    n_flagged = sum(flag_union),
    flag_rate = round(mean(flag_union), 3),
    .groups   = "drop"
  ) |>
  arrange(desc(flag_rate))

# Split into FP (attenders incorrectly flagged) and FN (non-attenders missed)
fp_tab <- subtype_kneedle |>
  filter(subtype == "Attender") |>
  rename(rate = flag_rate)

fn_tab <- subtype_kneedle |>
  filter(subtype != "Attender") |>
  rename(rate = flag_rate)

kable(
  fp_tab |> select(subtype, N, n_flagged, rate),
  col.names = c("Subtype", "N", "N incorrectly flagged", "False positive rate"),
  caption   = "False positive rate — attenders incorrectly flagged by LAZR + Kneedle"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(1, color = "tomato")
False positive rate — attenders incorrectly flagged by LAZR + Kneedle
Subtype N N incorrectly flagged False positive rate
Attender 424 42 0.099
Show code
kable(
  fn_tab |> select(subtype, N, n_flagged, rate),
  col.names = c("Subtype", "N", "N detected", "Detection rate (sensitivity)"),
  caption   = "Detection rates by non-attender subtype — LAZR + Kneedle"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(fn_tab$rate < 0.5),  color = "tomato") |>
  row_spec(which(fn_tab$rate >= 0.5), color = "steelblue")
Detection rates by non-attender subtype — LAZR + Kneedle
Subtype N N detected Detection rate (sensitivity)
acquiescent 17 17 1.000
alternating 2 2 1.000
disacquiescent 3 3 1.000
midpoint_default 4 4 1.000
near_straightline 8 8 1.000
straightline 13 13 1.000
plausible_noise 18 1 0.056
biased_random 6 0 0.000
uniform_random 5 0 0.000

Approach 2: Finite Mixture Model (LAZR + RT)

The first mixture model extends LAZR with a second feature — within-person standard deviation of log-transformed item response times — and fits a 2-class Gaussian mixture model jointly on both features. Rather than applying a univariate threshold to LAZR alone, the mixture model identifies latent classes by finding the joint feature configuration that best separates the population into two groups under a multivariate normal assumption.

The RT variability feature is motivated by the observation that genuine attenders’ response times vary across items due to content sensitivity, reverse-item processing, and cumulative fatigue — none of which modulate non-attenders’ mechanical response process. Adding this feature specifically targets the false positive problem identified in Approach 1: attenders with midpoint-consistent response styles have similarly moderate-to-low Laz.R values to some non-attenders, but their item-to-item timing variability should be higher than truly non-attentive respondents, allowing the mixture model to retain them correctly.

Approach 2 summary. Adding RT variability to LAZR in a joint mixture model substantially reduces false positive rates relative to LAZR + Kneedle alone — attenders with midpoint-consistent response styles are more correctly retained because their timing variability distinguishes them from patterned non-attenders. However, false negative rates are not meaningfully improved: uniform random, biased random, and plausible noise responders remain largely undetected because their low timing variability is consistent with the inattentive class on this feature. Subtype-level detection rates are reported below.

Show code
# ── Within-person RT features ──────────────────────────────────────────────
compute_rt_features <- function(dat, scale_idx) {
  ni      <- params$items_per_scale
  rt_cols <- paste0("RT_s", scale_idx, "_i", seq_len(ni))
  rt_mat  <- as.matrix(dat[, rt_cols])

  rt_mat_w <- apply(rt_mat, 2, function(x) {
    cap <- quantile(x, 0.99, na.rm = TRUE)
    pmin(x, cap)
  })
  log_rt <- log(rt_mat_w + 1e-6)

  rt_logsd <- apply(log_rt, 1, sd,  na.rm = TRUE)
  rt_iqr   <- apply(log_rt, 1, IQR, na.rm = TRUE)

  data.frame(rt_logsd = rt_logsd, rt_iqr = rt_iqr)
}

rt_feat_list <- lapply(seq_len(ns), compute_rt_features, dat = dat)

results$rt_logsd <- rowMeans(
  do.call(cbind, lapply(rt_feat_list, `[[`, "rt_logsd")), na.rm = TRUE)
results$rt_iqr   <- rowMeans(
  do.call(cbind, lapply(rt_feat_list, `[[`, "rt_iqr")), na.rm = TRUE)
Show code
p_rt_logsd <- ggplot(results, aes(x = rt_logsd, fill = class_label)) +
  geom_density(alpha = 0.5, colour = NA) +
  scale_fill_manual(
    values = c("Attender (true)" = "steelblue",
               "Non-attender (true)" = "tomato"),
    name = NULL) +
  labs(title    = "Within-person SD of log(RT) across items",
       subtitle = "Higher = more item-to-item timing variability",
       x = "SD of log(RT)", y = "Density")

p_rt_iqr <- ggplot(results, aes(x = rt_iqr, fill = class_label)) +
  geom_density(alpha = 0.5, colour = NA) +
  scale_fill_manual(
    values = c("Attender (true)" = "steelblue",
               "Non-attender (true)" = "tomato"),
    name = NULL) +
  labs(title    = "IQR of log(RT) across items",
       subtitle = "Robust measure of within-person timing spread",
       x = "IQR of log(RT)", y = "Density")

p_rt_logsd + p_rt_iqr

Within-person RT variability by true attender status. Attenders should show higher item-to-item timing spread, because their response pace is determined by item complexity and fatigue over time. Non-attenders should show lower spread, because their RT is not sensitive to item complexity.

Within-person SD of log-transformed RTs distinguishes attenders from non-attenders more effectively than IQR. IQR measures spread using only the middle 50% of a person’s RT distribution, discarding the top and bottom quartiles entirely. However, the most informative timing signals (e.g., systematic slowing on reverse-coded items, high-attention deliberation on cognitively demanding items, and fatigue-driven speedup on later items) tend to produce values in these discarded tails. SD retains sensitivity to the full distribution, including these extreme values, and therefore better captures the item-level RT variability that genuine engagement produces.

In surveys where stall events or tab-switching frequently contaminate item-level RTs with artifactual spikes, IQR’s robustness to outliers would be advantageous. However, in the present simulation stall events are absorbed into page-entry times rather than item RTs, leaving the item-level RT matrix relatively clean and making SD’s sensitivity to extremes an asset rather than a liability. This bounds of this measure’s effectiveness in identifying attentive and inattentive latent groups should be further tested.

Show code
# ── 2-feature Gaussian mixture model: LAZR + RT-SD ────────────────────────
results$mean_lazr <- rowMeans(
  results[, paste0("lazr_s", seq_len(ns))], na.rm = TRUE)

feat_mat <- scale(results[, c("mean_lazr", "rt_logsd")])

set.seed(42)
mix_fit <- Mclust(feat_mat, G = 2, verbose = FALSE)

class_means     <- mix_fit$parameters$mean
lazr_by_class   <- class_means["mean_lazr", ]
inatt_mix_class <- which.max(lazr_by_class)

results$mix_class      <- mix_fit$classification
results$mix_prob_inatt <- mix_fit$z[, inatt_mix_class]
results$flag_mixture   <- results$mix_class == inatt_mix_class
results$attentive_mix  <- !results$flag_mixture

cat("2-feature mixture model class means (standardised):\n")
2-feature mixture model class means (standardised):
Show code
print(round(class_means, 3))
            [,1]   [,2]
mean_lazr  2.237 -0.268
rt_logsd  -1.638  0.196
Show code
# ── Posterior distribution ─────────────────────────────────────────────────
p_post <- ggplot(results, aes(x = mix_prob_inatt, fill = class_label)) +
  geom_histogram(bins = 40, position = "identity", alpha = 0.55, colour = NA) +
  scale_fill_manual(
    values = c("Attender (true)" = "steelblue",
               "Non-attender (true)" = "tomato"),
    name = NULL) +
  labs(title    = "Posterior P(inattentive) — 2-feature mixture model",
       subtitle = "LAZR + SD log(RT)",
       x = "P(inattentive | data)", y = "Count")

# ── Decision surface ───────────────────────────────────────────────────────
grid_range <- function(x, n = 80) seq(min(x) - 0.2, max(x) + 0.2, length.out = n)

grid_df <- expand.grid(
  mean_lazr = grid_range(feat_mat[, "mean_lazr"]),
  rt_logsd  = grid_range(feat_mat[, "rt_logsd"])
)
grid_pred   <- predict(mix_fit, newdata = as.matrix(grid_df))
grid_df$p_inatt <- grid_pred$z[, inatt_mix_class]

obs_df <- data.frame(
  mean_lazr  = feat_mat[, "mean_lazr"],
  rt_logsd   = feat_mat[, "rt_logsd"],
  true_class = results$class_label
)

p_surface <- ggplot() +
  geom_raster(data = grid_df,
              aes(x = mean_lazr, y = rt_logsd, fill = p_inatt),
              alpha = 0.6) +
  scale_fill_gradient2(low = "steelblue", mid = "white", high = "tomato",
                       midpoint = 0.5, name = "P(inattentive)") +
  geom_point(data = obs_df,
             aes(x = mean_lazr, y = rt_logsd, colour = true_class),
             size = 1.5, alpha = 0.7) +
  scale_colour_manual(
    values = c("Attender (true)" = "navy",
               "Non-attender (true)" = "firebrick"),
    name = NULL) +
  labs(title    = "Mixture model decision surface",
       subtitle = "Background = P(inattentive); points = true class",
       x = "Mean LAZR (std)", y = "SD log(RT) (std)")

p_post / p_surface

Top: posterior probability of being inattentive under the 2-feature mixture model. A bimodal distribution indicates clean separation; overlap indicates ambiguity. Bottom left: mixture model decision surface in feature space.

2-Feature Model Performance

Show code
cs_mixture <- confusion_stats(results$flag_mixture, results$true_inattentive)

overall_mix2 <- data.frame(
  Metric = c("Sensitivity", "Specificity", "% Flagged", "TP", "FP", "TN", "FN"),
  Value  = round(unlist(cs_mixture[c("Sensitivity", "Specificity",
                                      "Flagged_pct", "TP", "FP", "TN", "FN")]), 3)
)

kable(overall_mix2,
      col.names = c("Metric", "Value"),
      caption   = "Overall performance — 2-feature mixture model") |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(overall_mix2$Metric %in% c("Sensitivity", "Specificity")),
           bold = TRUE)
Overall performance — 2-feature mixture model
Metric Value
Sensitivity Sensitivity 0.618
Specificity Specificity 0.998
Flagged_pct % Flagged 9.600
TP TP 47.000
FP FP 1.000
TN TN 423.000
FN FN 29.000
Show code
subtype_mix2 <- results |>
  mutate(
    subtype = case_when(
      is_non_attender == 0 ~ "Attender",
      non_attender_style == "plausible_noise" ~ "plausible_noise",
      TRUE ~ non_attender_pattern_type
    )
  ) |>
  group_by(subtype) |>
  summarise(
    N         = n(),
    n_flagged = sum(flag_mixture),
    rate      = round(mean(flag_mixture), 3),
    .groups   = "drop"
  ) |>
  arrange(desc(rate))

fp_mix2 <- subtype_mix2 |> filter(subtype == "Attender")
fn_mix2 <- subtype_mix2 |> filter(subtype != "Attender")

kable(fp_mix2 |> select(subtype, N, n_flagged, rate),
      col.names = c("Subtype", "N", "N incorrectly flagged", "False positive rate"),
      caption   = "False positive rate — 2-feature mixture model") |>
  kable_styling(full_width = FALSE) |>
  row_spec(1, color = "tomato")
False positive rate — 2-feature mixture model
Subtype N N incorrectly flagged False positive rate
Attender 424 1 0.002
Show code
kable(fn_mix2 |> select(subtype, N, n_flagged, rate),
      col.names = c("Subtype", "N", "N detected", "Detection rate"),
      caption   = "Detection rates by subtype — 2-feature mixture model") |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(fn_mix2$rate < 0.5),  color = "tomato") |>
  row_spec(which(fn_mix2$rate >= 0.5), color = "steelblue")
Detection rates by subtype — 2-feature mixture model
Subtype N N detected Detection rate
alternating 2 2 1.000
disacquiescent 3 3 1.000
midpoint_default 4 4 1.000
near_straightline 8 8 1.000
straightline 13 13 1.000
acquiescent 17 16 0.941
plausible_noise 18 1 0.056
biased_random 6 0 0.000
uniform_random 5 0 0.000

Approach 3: Finite Mixture Model (LAZR + RT + Zh)

The third model extends the 2-feature mixture model by adding Zh as a third feature. Zh is a standardised person-fit statistic derived from a Graded Response Model (GRM) that quantifies how well a person’s observed response pattern is explained by a single latent trait. Because attenders’ responses are causally generated by a latent construct, their responses are locally consistent with their estimated trait level and produce Zh values near zero. Non-attenders’ responses are not generated by any latent trait, so even the best-fitting trait level leaves their responses statistically surprising, producing strongly negative Zh values.

To prevent non-attenders from distorting item parameter estimates — which would artificially inflate person-fit for inattentive respondents — item parameters are estimated via iterative purification: a first-pass GRM is fit on all respondents, the worst-fitting respondents are removed, and the GRM is refit on the remaining clean subsample. All respondents are then scored using these purified item parameters.

Approach 3 summary. Adding Zh to the mixture model improves detection of uniform random and biased random respondents, whose responses lack the local trait-consistency the GRM expects regardless of where their trait level is estimated. However, Zh does not improve detection of plausible noise respondents. Plausible noise responses are anchored around a scale value with partial sensitivity to item direction — a pattern structurally similar to genuine moderate-trait attenders near the scale midpoint. The GRM estimates a plausible trait level for these respondents and finds their responses only weakly inconsistent with it, producing Zh values that overlap substantially with the attender distribution. Plausible noise therefore represents a fundamental identification boundary for IRT-based person-fit statistics.

Show code
# ── Person-fit statistic (Zh) via iterative purification ──────────────────
# Stage 1: fit GRM on everyone to get initial person-fit estimates
# Stage 2: remove worst-fitting respondents, refit GRM on clean sample
# Stage 3: score everyone using purified item parameters
# This avoids contamination of item parameter estimates by non-attenders

compute_lz_purified <- function(dat, scale_idx) {
  ni        <- params$items_per_scale
  K         <- params$n_categories
  y_cols    <- paste0("Y_s", scale_idx, "_i", seq_len(ni))
  rev_items <- instrument[[scale_idx]]$reverse

  resp_mat    <- as.matrix(dat[, y_cols])
  resp_scored <- resp_mat
  resp_scored[, rev_items] <- K + 1 - resp_mat[, rev_items]
  resp_df     <- as.data.frame(resp_scored)
  n           <- nrow(dat)

  # Round 1: fit GRM on everyone
  fit1 <- mirt(resp_df, model = 1, itemtype = "graded", verbose = FALSE)
  pf1  <- personfit(fit1, method = "ML")

  # Retain respondents whose person-fit is not severely poor
  keep_mask <- !is.finite(pf1$Zh) | pf1$Zh > -2
  cat(sprintf(
    "Scale %d — Round 1: %d flagged as poor fit, refitting on %d of %d\n",
    scale_idx, sum(!keep_mask), sum(keep_mask), n
  ))

  # Round 2: refit GRM on only the well-fitting participants
  fit2 <- mirt(
    resp_df[keep_mask, ],
    model    = 1,
    itemtype = "graded",
    verbose  = FALSE
  )

  # Extract purified item parameters and fix them
  pars_fixed        <- mod2values(fit2)
  pars_fixed$est    <- FALSE   # fix all parameters — no re-estimation

  # Score all participants using purified parameters
  fit2_all <- mirt(
    resp_df,
    model    = 1,
    itemtype = "graded",
    verbose  = FALSE,
    pars     = pars_fixed,
    TOL      = NA
  )

  pf2 <- personfit(fit2_all, method = "ML")

  cat(sprintf(
    "Scale %d — Round 2: Zh range [%.2f, %.2f], mean = %.3f\n",
    scale_idx, min(pf2$Zh, na.rm = TRUE),
    max(pf2$Zh, na.rm = TRUE), mean(pf2$Zh, na.rm = TRUE)
  ))

  pf2$Zh
}

lz_list    <- lapply(seq_len(ns), compute_lz_purified, dat = dat)
Scale 1 — Round 1: 58 flagged as poor fit, refitting on 442 of 500
Scale 1 — Round 2: Zh range [-12.24, 3.59], mean = -0.686
Scale 2 — Round 1: 50 flagged as poor fit, refitting on 450 of 500
Scale 2 — Round 2: Zh range [-12.70, 3.55], mean = -0.639
Show code
results$lz <- rowMeans(do.call(cbind, lz_list), na.rm = TRUE)
Show code
ggplot(results, aes(x = lz, fill = class_label)) +
  geom_density(alpha = 0.5, colour = NA) +
  geom_vline(xintercept = -2, linetype = "dashed",
             colour = "black", linewidth = 0.8) +
  annotate("text", x = -2.1, y = Inf,
           label = "Zh = −2", hjust = 1, vjust = 1.4,
           size = 3.5, colour = "grey30") +
  scale_fill_manual(
    values = c("Attender (true)" = "steelblue",
               "Non-attender (true)" = "tomato"),
    name = NULL) +
  labs(
    title    = "Zh person-fit statistic",
    subtitle = "More negative = responses less consistent with a latent trait",
    x = "Zh", y = "Density"
  )

Distribution of Zh person-fit statistic by true attender status. Zh is approximately standard normal for well-fitting respondents. Strongly negative values indicate responses inconsistent with any latent trait — the expected signature of non-attenders. The dashed line marks Zh = -2, a conventional flagging threshold.
Show code
subtype_lz <- results |>
  mutate(
    subtype = case_when(
      is_non_attender == 0 ~ "Attender",
      non_attender_style == "plausible_noise" ~ "plausible_noise",
      TRUE ~ non_attender_pattern_type
    )
  ) |>
  group_by(subtype) |>
  summarise(
    N       = n(),
    mean_lz = round(mean(lz, na.rm = TRUE), 3),
    sd_lz   = round(sd(lz,   na.rm = TRUE), 3),
    pct_below_minus2 = round(100 * mean(lz < -2, na.rm = TRUE), 1),
    .groups = "drop"
  ) |>
  arrange(mean_lz)

kable(
  subtype_lz,
  col.names = c("Subtype", "N", "Mean Zh", "SD Zh", "% below −2"),
  caption   = "Zh person-fit by group and subtype"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(subtype_lz$subtype == "Attender"), bold = TRUE)
Zh person-fit by group and subtype
Subtype N Mean Zh SD Zh % below −2
disacquiescent 3 -8.993 0.516 100.0
acquiescent 17 -8.842 0.754 100.0
alternating 2 -7.098 2.076 100.0
biased_random 6 -5.188 1.196 100.0
uniform_random 5 -4.984 1.012 100.0
straightline 13 -4.202 5.134 53.8
near_straightline 8 -3.995 3.175 62.5
plausible_noise 18 -0.922 0.713 11.1
Attender 424 0.017 1.171 8.0
midpoint_default 4 3.087 0.107 0.0
Show code
ggplot(results, aes(x = mix2_prob_inatt, fill = class_label)) +
  geom_histogram(bins = 40, position = "identity", alpha = 0.55, colour = NA) +
  scale_fill_manual(
    values = c("Attender (true)" = "steelblue",
               "Non-attender (true)" = "tomato"),
    name = NULL) +
  labs(
    title    = "Posterior P(inattentive) — 3-feature mixture model",
    subtitle = "LAZR + SD log(RT) + Zh",
    x = "P(inattentive | data)", y = "Count"
  )

Posterior probability of being inattentive under the 3-feature mixture model.
Show code
pairs_df <- data.frame(
  mean_lazr  = feat_mat2[, "mean_lazr"],
  rt_logsd   = feat_mat2[, "rt_logsd"],
  lz         = feat_mat2[, "lz"],
  true_class = results$class_label
)

p12 <- ggplot(pairs_df, aes(mean_lazr, rt_logsd, colour = true_class)) +
  geom_point(size = 1.3, alpha = 0.6) +
  scale_colour_manual(values = c("steelblue", "tomato"), name = NULL) +
  labs(x = "Mean LAZR (std)", y = "SD log(RT) (std)")

p13 <- ggplot(pairs_df, aes(mean_lazr, lz, colour = true_class)) +
  geom_point(size = 1.3, alpha = 0.6) +
  scale_colour_manual(values = c("steelblue", "tomato"), name = NULL) +
  labs(x = "Mean LAZR (std)", y = "Zh person-fit (std)")

p23 <- ggplot(pairs_df, aes(rt_logsd, lz, colour = true_class)) +
  geom_point(size = 1.3, alpha = 0.6) +
  scale_colour_manual(values = c("steelblue", "tomato"), name = NULL) +
  labs(x = "SD log(RT) (std)", y = "Zh person-fit (std)")

(p12 | p13 | p23) +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

Pairwise feature plots coloured by true class.

3-Feature Model Performance

Show code
cs_kneedle <- confusion_stats(results$flag_union,   results$true_inattentive)
cs_mixture <- confusion_stats(results$flag_mixture, results$true_inattentive)
cs_mixture2 <- confusion_stats(results$flag_mixture2, results$true_inattentive)

perf_tab <- data.frame(
  Metric   = c("Sensitivity", "Specificity",
               "% Flagged", "TP", "FP", "TN", "FN"),
  Kneedle  = round(unlist(cs_kneedle[c("Sensitivity", "Specificity",
                                        "Flagged_pct", "TP", "FP", "TN", "FN")]), 3),
  Mixture2 = round(unlist(cs_mixture[c("Sensitivity", "Specificity",
                                        "Flagged_pct", "TP", "FP", "TN", "FN")]), 3),
  Mixture3 = round(unlist(cs_mixture2[c("Sensitivity", "Specificity",
                                         "Flagged_pct", "TP", "FP", "TN", "FN")]), 3)
)

kable(
  perf_tab,
  col.names = c("Metric", "Kneedle", "Mixture (2-feature)",
                "Mixture (3-feature: LAZR + RT + Zh)"),
  caption   = "Classifier performance across all three approaches"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(perf_tab$Metric %in% c("Sensitivity", "Specificity")),
           bold = TRUE)
Classifier performance across all three approaches
Metric Kneedle Mixture (2-feature) Mixture (3-feature: LAZR + RT + Zh)
Sensitivity Sensitivity 0.632 0.618 0.776
Specificity Specificity 0.901 0.998 1.000
Flagged_pct % Flagged 18.000 9.600 11.800
TP TP 48.000 47.000 59.000
FP FP 42.000 1.000 0.000
TN TN 382.000 423.000 424.000
FN FN 28.000 29.000 17.000
Show code
# Subtype detection rates across all three approaches
subtype_3way <- results |>
  mutate(
    subtype = case_when(
      is_non_attender == 0 ~ "Attender",
      non_attender_style == "plausible_noise" ~ "plausible_noise",
      TRUE ~ non_attender_pattern_type
    )
  ) |>
  group_by(subtype) |>
  summarise(
    N              = n(),
    n_kneedle      = sum(flag_union),
    rate_kneedle   = round(mean(flag_union),    3),
    n_mix2         = sum(flag_mixture),
    rate_mix2      = round(mean(flag_mixture),  3),
    n_mix3         = sum(flag_mixture2),
    rate_mix3      = round(mean(flag_mixture2), 3),
    .groups        = "drop"
  ) |>
  arrange(desc(rate_mix3))

kable(
  subtype_3way,
  col.names = c("Subtype", "N",
                "N (Kneedle)", "Rate (Kneedle)",
                "N (Mix 2)", "Rate (Mix 2)",
                "N (Mix 3)", "Rate (Mix 3)"),
  caption   = "Detection rates by subtype across all three approaches"
) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(subtype_3way$subtype == "Attender"), bold = TRUE) |>
  row_spec(which(subtype_3way$subtype != "Attender" &
                 subtype_3way$rate_mix3 < 0.5), color = "tomato") |>
  row_spec(which(subtype_3way$subtype != "Attender" &
                 subtype_3way$rate_mix3 >= 0.5), color = "steelblue")
Detection rates by subtype across all three approaches
Subtype N N (Kneedle) Rate (Kneedle) N (Mix 2) Rate (Mix 2) N (Mix 3) Rate (Mix 3)
acquiescent 17 17 1.000 16 0.941 17 1.000
alternating 2 2 1.000 2 1.000 2 1.000
biased_random 6 0 0.000 0 0.000 6 1.000
disacquiescent 3 3 1.000 3 1.000 3 1.000
midpoint_default 4 4 1.000 4 1.000 4 1.000
near_straightline 8 8 1.000 8 1.000 8 1.000
straightline 13 13 1.000 13 1.000 13 1.000
uniform_random 5 0 0.000 0 0.000 5 1.000
plausible_noise 18 1 0.056 1 0.056 1 0.056
Attender 424 42 0.099 1 0.002 0 0.000