Detecting Inattentive Respondents in Survey Experiments
Using Factor Mixture Models
Author
Jamie C. Lee
Published
April 14, 2026
Overview
This document attempts to identify (different types of) inattentive respondents in a simulated experiment. The simulation generates 500 participants, of whom 15% are true non-attenders (i.e., is_non_attender == 1). Non-attenders fall into two broad categories: patterned responders (straightliners, alternators, acquiescent responders, etc.) whose responses follow detectable regularities, and plausible noise responders whose responses superficially resemble genuine engagement but are entirely content-free. Among true attenders, latent attention \(A \sim \text{Beta}(3, 1.5)\) varies continuously — most are fairly attentive but a tail exists at lower attention levels.
Three approaches are evaluated, built up incrementally. High-level findings for each approach are summarized inline (in italics) and discussed in detail in the corresponding section:
LAZR (+ Kneedle): A first-order Markov chain statistic (introduced by Biemann et al. 2025) applied to raw response patterns on multi-item individual difference scales, measuring sequential transitional predictability. Higher values indicate more patterned responding. A data-driven cutoff is identified using a Kneedle elbow algorithm, applied to the sorted Laz.R distribution.
LAZR is useful for flagging obviously patterned non-attenders but has two limitations. First, it produces false negatives (FN) for inattentive subtypes whose responses lack detectable sequential structure (e.g., uniform random, biased random, and plausible noise respondents).These individuals’ response transitions are sufficiently irregular to fall below the Kneedle threshold, despite being content-free.Second, it produces false positives (FP) among attenders whose genuine attitudes cluster near the scale midpoint,moderate, consistent responding produces transition patterns that superficially resemble patterned inattention.
Finite Mixture Model (2-feature: LAZR + RT): A Gaussian mixture model fit to two jointly informative features: 1) mean Laz.R across scales, and 2) within-person standard deviation of logged response times across items. The RT feature captures whether a person shows meaningful item-to-item timing variability, which genuine attenders exhibit due to content sensitivity, reverse-item processing, and fatigue effects that non-attenders lack entirely. Importantly, this feature assumes that the survey contains sufficient heterogeneity in item demands or complexity (e.g., a mix of forward and reverse-coded items, variation in item length, etc.), such that attentive respondents are naturally induced to vary their response pace across items. In surveys with highly homogeneous item complexity, the timing spread of attenders and non-attenders may converge, reducing the discriminative value of this feature.
Adding RT variability to the mixture model substantially reduces FP rates relative to Laz.R alone: attenders whose genuine midpoint-consistent responses were previously misclassified as patterned are now correctly retained because their item-to-item timing variation signals genuine engagement.However, this feature does not improve FN rates. It provides no additional signal to detect uniform random, biased random, or plausible noise respondents.
Finite Mixture Model (3-feature: LAZR + RT + Zh): The 2-feature mixture model extended with Zh, a standardised person-fit statistic from a Graded Response Model (GRM) estimated via iterative purification. Zh measures how well a person’s responses are explained by a single latent trait. Attenders’ responses are causally driven by a latent construct and should fit the GRM well, while non-attenders’ responses are not generated by any latent trait and should show systematically poor fit. Item parameters are estimated on a purified subsample to prevent contamination by non-attenders.
Adding Zh substantially improves detection of uniform random and biased random responders, whose responses lack the local trait-consistency the GRM expects regardless of where their trait level is estimated. However, Zh fails to separate plausible noise responders from genuine attenders. This is because plausible noise responses are anchored around a scale point with partial sensitivity to item direction, a response pattern that is structurally similar to that of a genuine trait attender. The GRM estimates a plausible trait level for these respondents and finds their responses only weakly inconsistent with it, producing Zh values that overlap substantially with the attender distribution. In this sense, plausible noise represents a fundamental identification boundary for IRT-based person-fit statistics: respondents whose content-free responses happen to be locally consistent with some trait level will evade detection by any method that relies on trait-response consistency alone.
Throughout, classifier performance is evaluated against ground truth using sensitivity, specificity, and subtype-level detection rates.
Simulation Ground Truth
Before evaluating classifiers, we describe the true composition of the sample.
Non-attenders fall into two broad styles: plausible noise (hard to detect — responses look human but are not based on the individual’s true latent trait) and patterned subtypes (straightline, alternating, diagonal, etc.). The key challenge is distinguishing these different subtypes from genuine attenders, based on observables.
Attention and Dose Among Attenders
Show code
att_dat <- dat[dat$is_non_attender ==0, ]# Theoretical logistic curvea_seq <-seq(0, 1, length.out =300)curve_df <-data.frame(A = a_seq,Dose =inv_logit(params$dose_steepness * (a_seq - params$dose_halfpoint)))p_scatter <-ggplot() +geom_point(data = att_dat[att_dat$Z ==1, ],aes(x = A, y = Dose),alpha =0.45, size =1.8, colour ="steelblue" ) +geom_line(data = curve_df,aes(x = A, y = Dose),colour ="tomato", linewidth =1, linetype ="solid" ) +geom_vline(xintercept = params$dose_halfpoint,linetype ="dashed", colour ="grey40", linewidth =0.7 ) +annotate("text", x = params$dose_halfpoint +0.02, y =0.05,label =sprintf("halfpoint = %.2f", params$dose_halfpoint),hjust =0, size =3.5, colour ="grey30" ) +labs(title ="Attention → Dose mapping (treated attenders)",subtitle =sprintf("Logistic(steepness = %g, halfpoint = %g) + noise(sd = %g)", params$dose_steepness, params$dose_halfpoint, params$dose_noise ),x ="Latent attention A",y ="Realised Dose" )p_dist <-ggplot(att_dat, aes(x = A, fill =factor(Z))) +geom_density(alpha =0.45, colour =NA) +geom_vline(xintercept = params$dose_halfpoint,linetype ="dashed", colour ="grey40", linewidth =0.7 ) +scale_fill_manual(values =c("0"="grey60", "1"="steelblue"),labels =c("Control (Z=0)", "Treatment (Z=1)"),name =NULL ) +labs(title ="Distribution of A by treatment arm (attenders only)",subtitle ="Distributions should overlap — A is independent of Z by design",x ="Latent attention A",y ="Density" )p_scatter + p_dist
Among treated attenders, Dose is a logistic function of latent attention A, plus noise. The red curve shows the theoretical logistic mapping; blue points are realised doses. The right panel shows the marginal distribution of A, with the dose halfpoint marked.
Attention and potential dose (Dose_1) by treatment arm (attenders only)
Arm
N
Mean A
SD A
Mean Dose₁
SD Dose₁
Control (Z=0)
210
0.684
0.204
0.628
0.181
Treatment (Z=1)
214
0.672
0.187
0.617
0.163
Because Z is randomly assigned, A should be balanced across arms — Dose_1 (the potential dose each attender would receive under treatment) is therefore also balanced. Realised Dose equals Dose_1 only for treated participants; controls always have Dose = 0.
Approach 1: LAZR + Kneedle Algorithm
What is LAZR?
LAZR (Laz.R) measures sequential conditional predictability in raw item responses. For a person’s response vector \(x_1, x_2, \ldots, x_n\), it builds a transition probability matrix across adjacent item responses.
Under random responding (uniform transitions), LAZR \(\approx 1/K\) where \(K\) is the number of response categories. Under fully deterministic responding (e.g., straightlining), LAZR \(= 1\). Higher values indicate more predictable, less content-driven responding.
The Kneedle algorithm identifies a data-driven exclusion threshold by locating the point of maximum curvature — the “elbow” — in the sorted LAZR distribution. Respondents above this threshold are flagged as potentially inattentive.
Approach 1 summary. LAZR + Kneedle performs well for obviously patterned non-attenders (straightline, near-straightline, alternating, diagonal) but has two systematic limitations. False negatives: uniform random, biased random, and plausible noise responders produce sufficiently irregular response transitions to fall below the Kneedle threshold, and are therefore missed. False positives: attenders with genuine midpoint-consistent response styles are sometimes flagged because their stable, moderate responses produce transition patterns that superficially resemble patterned inattention. Subtype-level detection rates and the overall false positive rate are reported below.
# Build results data frameresults <- dat[, c("ID", "Z", "is_non_attender", "A","non_attender_style", "non_attender_pattern_type")]results$true_inattentive <- results$is_non_attender ==1results$class_label <-factor(ifelse(results$true_inattentive, "Non-attender (true)", "Attender (true)"))# LAZR per scalefor (s inseq_len(ns)) { results[[paste0("lazr_s", s)]] <-compute_lazr(dat, s)}# Kneedle thresholds and flagskneedle_flags <-list()for (s inseq_len(ns)) { col <-paste0("lazr_s", s) thresh <-compute_laz_cutoff(results[[col]]) flag <- results[[col]] >= thresh results[[paste0("flag_kneedle_s", s)]] <- flag kneedle_flags[[s]] <-list(threshold = thresh, n_flagged =sum(flag))}# Union flag: flagged by either scaleresults$flag_union <- results$flag_kneedle_s1 | results$flag_kneedle_s2results$attentive_kneedle <-!results$flag_union
Show code
plots_lazr <-lapply(seq_len(ns), function(s) {local({ col <-paste0("lazr_s", s) thresh <- kneedle_flags[[s]]$threshold df <- results df$class_label <-factor(ifelse(df$true_inattentive,"Non-attender (true)", "Attender (true)"))ggplot(df, aes(x = .data[[col]], fill = class_label)) +geom_density(alpha =0.5, color =NA) +geom_vline(xintercept = thresh, linetype ="dashed",color ="black", linewidth =0.9) +annotate("text", x = thresh +0.01, y =Inf,label =sprintf("Kneedle\n%.3f", thresh),hjust =0, vjust =1.4, size =3.5) +scale_fill_manual(values =c("Attender (true)"="steelblue","Non-attender (true)"="tomato"), name =NULL) +labs(title =sprintf("Scale %d — LAZR distribution", s),subtitle ="Higher LAZR = more predictable transitions",x ="LAZR score", y ="Density") })})wrap_plots(plots_lazr, ncol =2)
LAZR score distributions by true attender status. The dashed vertical line shows the Kneedle threshold. Non-attenders (red) have a somewhat uniform distribution of LAZR values, creating a meaningful overlap with attenders. False negative (retaining non-attenders) rates tend to be higher than false positive (excluding attentive participants) rates.
kable( fn_tab |>select(subtype, N, n_flagged, rate),col.names =c("Subtype", "N", "N detected", "Detection rate (sensitivity)"),caption ="Detection rates by non-attender subtype — LAZR + Kneedle") |>kable_styling(full_width =FALSE) |>row_spec(which(fn_tab$rate <0.5), color ="tomato") |>row_spec(which(fn_tab$rate >=0.5), color ="steelblue")
Detection rates by non-attender subtype — LAZR + Kneedle
Subtype
N
N detected
Detection rate (sensitivity)
acquiescent
17
17
1.000
alternating
2
2
1.000
disacquiescent
3
3
1.000
midpoint_default
4
4
1.000
near_straightline
8
8
1.000
straightline
13
13
1.000
plausible_noise
18
1
0.056
biased_random
6
0
0.000
uniform_random
5
0
0.000
Approach 2: Finite Mixture Model (LAZR + RT)
The first mixture model extends LAZR with a second feature — within-person standard deviation of log-transformed item response times — and fits a 2-class Gaussian mixture model jointly on both features. Rather than applying a univariate threshold to LAZR alone, the mixture model identifies latent classes by finding the joint feature configuration that best separates the population into two groups under a multivariate normal assumption.
The RT variability feature is motivated by the observation that genuine attenders’ response times vary across items due to content sensitivity, reverse-item processing, and cumulative fatigue — none of which modulate non-attenders’ mechanical response process. Adding this feature specifically targets the false positive problem identified in Approach 1: attenders with midpoint-consistent response styles have similarly moderate-to-low Laz.R values to some non-attenders, but their item-to-item timing variability should be higher than truly non-attentive respondents, allowing the mixture model to retain them correctly.
Approach 2 summary. Adding RT variability to LAZR in a joint mixture model substantially reduces false positive rates relative to LAZR + Kneedle alone — attenders with midpoint-consistent response styles are more correctly retained because their timing variability distinguishes them from patterned non-attenders. However, false negative rates are not meaningfully improved: uniform random, biased random, and plausible noise responders remain largely undetected because their low timing variability is consistent with the inattentive class on this feature. Subtype-level detection rates are reported below.
p_rt_logsd <-ggplot(results, aes(x = rt_logsd, fill = class_label)) +geom_density(alpha =0.5, colour =NA) +scale_fill_manual(values =c("Attender (true)"="steelblue","Non-attender (true)"="tomato"),name =NULL) +labs(title ="Within-person SD of log(RT) across items",subtitle ="Higher = more item-to-item timing variability",x ="SD of log(RT)", y ="Density")p_rt_iqr <-ggplot(results, aes(x = rt_iqr, fill = class_label)) +geom_density(alpha =0.5, colour =NA) +scale_fill_manual(values =c("Attender (true)"="steelblue","Non-attender (true)"="tomato"),name =NULL) +labs(title ="IQR of log(RT) across items",subtitle ="Robust measure of within-person timing spread",x ="IQR of log(RT)", y ="Density")p_rt_logsd + p_rt_iqr
Within-person RT variability by true attender status. Attenders should show higher item-to-item timing spread, because their response pace is determined by item complexity and fatigue over time. Non-attenders should show lower spread, because their RT is not sensitive to item complexity.
Within-person SD of log-transformed RTs distinguishes attenders from non-attenders more effectively than IQR. IQR measures spread using only the middle 50% of a person’s RT distribution, discarding the top and bottom quartiles entirely. However, the most informative timing signals (e.g., systematic slowing on reverse-coded items, high-attention deliberation on cognitively demanding items, and fatigue-driven speedup on later items) tend to produce values in these discarded tails. SD retains sensitivity to the full distribution, including these extreme values, and therefore better captures the item-level RT variability that genuine engagement produces.
In surveys where stall events or tab-switching frequently contaminate item-level RTs with artifactual spikes, IQR’s robustness to outliers would be advantageous. However, in the present simulation stall events are absorbed into page-entry times rather than item RTs, leaving the item-level RT matrix relatively clean and making SD’s sensitivity to extremes an asset rather than a liability. This bounds of this measure’s effectiveness in identifying attentive and inattentive latent groups should be further tested.
Top: posterior probability of being inattentive under the 2-feature mixture model. A bimodal distribution indicates clean separation; overlap indicates ambiguity. Bottom left: mixture model decision surface in feature space.
kable(fn_mix2 |>select(subtype, N, n_flagged, rate),col.names =c("Subtype", "N", "N detected", "Detection rate"),caption ="Detection rates by subtype — 2-feature mixture model") |>kable_styling(full_width =FALSE) |>row_spec(which(fn_mix2$rate <0.5), color ="tomato") |>row_spec(which(fn_mix2$rate >=0.5), color ="steelblue")
Detection rates by subtype — 2-feature mixture model
Subtype
N
N detected
Detection rate
alternating
2
2
1.000
disacquiescent
3
3
1.000
midpoint_default
4
4
1.000
near_straightline
8
8
1.000
straightline
13
13
1.000
acquiescent
17
16
0.941
plausible_noise
18
1
0.056
biased_random
6
0
0.000
uniform_random
5
0
0.000
Approach 3: Finite Mixture Model (LAZR + RT + Zh)
The third model extends the 2-feature mixture model by adding Zh as a third feature. Zh is a standardised person-fit statistic derived from a Graded Response Model (GRM) that quantifies how well a person’s observed response pattern is explained by a single latent trait. Because attenders’ responses are causally generated by a latent construct, their responses are locally consistent with their estimated trait level and produce Zh values near zero. Non-attenders’ responses are not generated by any latent trait, so even the best-fitting trait level leaves their responses statistically surprising, producing strongly negative Zh values.
To prevent non-attenders from distorting item parameter estimates — which would artificially inflate person-fit for inattentive respondents — item parameters are estimated via iterative purification: a first-pass GRM is fit on all respondents, the worst-fitting respondents are removed, and the GRM is refit on the remaining clean subsample. All respondents are then scored using these purified item parameters.
Approach 3 summary. Adding Zh to the mixture model improves detection of uniform random and biased random respondents, whose responses lack the local trait-consistency the GRM expects regardless of where their trait level is estimated. However, Zh does not improve detection of plausible noise respondents. Plausible noise responses are anchored around a scale value with partial sensitivity to item direction — a pattern structurally similar to genuine moderate-trait attenders near the scale midpoint. The GRM estimates a plausible trait level for these respondents and finds their responses only weakly inconsistent with it, producing Zh values that overlap substantially with the attender distribution. Plausible noise therefore represents a fundamental identification boundary for IRT-based person-fit statistics.
Show code
# ── Person-fit statistic (Zh) via iterative purification ──────────────────# Stage 1: fit GRM on everyone to get initial person-fit estimates# Stage 2: remove worst-fitting respondents, refit GRM on clean sample# Stage 3: score everyone using purified item parameters# This avoids contamination of item parameter estimates by non-attenderscompute_lz_purified <-function(dat, scale_idx) { ni <- params$items_per_scale K <- params$n_categories y_cols <-paste0("Y_s", scale_idx, "_i", seq_len(ni)) rev_items <- instrument[[scale_idx]]$reverse resp_mat <-as.matrix(dat[, y_cols]) resp_scored <- resp_mat resp_scored[, rev_items] <- K +1- resp_mat[, rev_items] resp_df <-as.data.frame(resp_scored) n <-nrow(dat)# Round 1: fit GRM on everyone fit1 <-mirt(resp_df, model =1, itemtype ="graded", verbose =FALSE) pf1 <-personfit(fit1, method ="ML")# Retain respondents whose person-fit is not severely poor keep_mask <-!is.finite(pf1$Zh) | pf1$Zh >-2cat(sprintf("Scale %d — Round 1: %d flagged as poor fit, refitting on %d of %d\n", scale_idx, sum(!keep_mask), sum(keep_mask), n ))# Round 2: refit GRM on only the well-fitting participants fit2 <-mirt( resp_df[keep_mask, ],model =1,itemtype ="graded",verbose =FALSE )# Extract purified item parameters and fix them pars_fixed <-mod2values(fit2) pars_fixed$est <-FALSE# fix all parameters — no re-estimation# Score all participants using purified parameters fit2_all <-mirt( resp_df,model =1,itemtype ="graded",verbose =FALSE,pars = pars_fixed,TOL =NA ) pf2 <-personfit(fit2_all, method ="ML")cat(sprintf("Scale %d — Round 2: Zh range [%.2f, %.2f], mean = %.3f\n", scale_idx, min(pf2$Zh, na.rm =TRUE),max(pf2$Zh, na.rm =TRUE), mean(pf2$Zh, na.rm =TRUE) )) pf2$Zh}lz_list <-lapply(seq_len(ns), compute_lz_purified, dat = dat)
Scale 1 — Round 1: 58 flagged as poor fit, refitting on 442 of 500
Scale 1 — Round 2: Zh range [-12.24, 3.59], mean = -0.686
Scale 2 — Round 1: 50 flagged as poor fit, refitting on 450 of 500
Scale 2 — Round 2: Zh range [-12.70, 3.55], mean = -0.639
ggplot(results, aes(x = lz, fill = class_label)) +geom_density(alpha =0.5, colour =NA) +geom_vline(xintercept =-2, linetype ="dashed",colour ="black", linewidth =0.8) +annotate("text", x =-2.1, y =Inf,label ="Zh = −2", hjust =1, vjust =1.4,size =3.5, colour ="grey30") +scale_fill_manual(values =c("Attender (true)"="steelblue","Non-attender (true)"="tomato"),name =NULL) +labs(title ="Zh person-fit statistic",subtitle ="More negative = responses less consistent with a latent trait",x ="Zh", y ="Density" )
Distribution of Zh person-fit statistic by true attender status. Zh is approximately standard normal for well-fitting respondents. Strongly negative values indicate responses inconsistent with any latent trait — the expected signature of non-attenders. The dashed line marks Zh = -2, a conventional flagging threshold.