For each child assessment used as an outcome in one of our program evaluation data sets, we want to get a sense of its overall measurement properties. Specifically, we want to address
Measurement invariance: Are the test scores measuring the construct of interest in the same way for different groups.
Precision: How does heteroskedastic measurement error affect the reported results ?
Each of these has a number of sub-components and the goal is to set up our analyses in a logical flow along the following lines.
Diagnosing whether there is an issue. This should be a procedure that is robust (i.e., not dependent on a specific model) and sensitive to the underlying issue.
Identifying the source of the issue. If there is evidence of measurement bias or issues with precision, how can we find the source of the issue?
Based on part 2, proposing one or more resolutions to each issue (e.g., remove an item, adjust a test score, etc.).
Before addressing these three steps for each of our two topics, we must also consider whether the data are compatible with the basic assumptions of item response theory (IRT) – if not, the proposed procedures will not be applicable.
This document illustrates the described analyses using the baseline MDAT language assessment from the WB SIEF Malawi data (https://microdata.worldbank.org/index.php/catalog/3416).
To use this code with a different example, save the assessment data as a data.frame named items and the rest of the code should run fine. Make sure that each column of items represents an assessment item scored as binary (1 = correct 0 = incorrect), and the rows represent the respondents. No other variables should be included in items.
Note that for the measurement bias analyses described below, it will also be necessary to identify a group variable – this is discussed in the first code snippet under “Measurement bias.”
### Preamble ---
library(sgof)
## Warning: package 'sgof' was built under R version 3.5.2
## Warning: package 'poibin' was built under R version 3.5.2
# for multiple inference
# Load data
data <- read.csv("r1.child.tests_clean.csv")
# Convert MDAT language items to binary
temp <- data[grep('fm[1-9]', names(data))]
items <- temp == "yes"
items <- as.data.frame(items*1)
(Be sure to run the relevant parts of all previous codes snippets in order for each new snippet to run.)
Let \(X\) be an indicator (e.g., an assessment item) of a construct, \(\theta\). In the psychometric literature, \(\theta\) is considered a latent variable (random effect) and models used to estimate \(\theta\) make assumptions about \(f(X \mid \theta)\). This is called the measurement model. For binary data, the two main assumptions are
\[P[X = 1 \mid \theta] \geq P[X = 1 \mid \theta'], \quad \forall \theta \geq \theta'.\]
This means that higher levels of the construct result in a higher probability of responding “correctly” to the indicator. Monotone non-increasing indicators can be “reverse coded”, and are not a problem. But non-monotone IRFs are incompatible with IRT – either those items need to be discarded or a different theory needs to be adopted (e.g., ideal point models http://dx.doi.org/10.1037/0021-9010.91.1.25).
\[X_i \mid \theta \perp X_j, \quad \forall i \neq j.\]
This means that \(\theta\) is sufficient to explain the statistical association among the indicators, and for this reason we interpret \(\theta\) as representing the construct of interest. This assumption is often modified (e.g. to allow dependence among certain subsets of items; to allow for multidimensional constructs), but at some point the joint distribution \(f(X \mid \theta)\) must factorize to something numerically tractable.
IRT models also make assumptions about \(g(\theta)\), which is called the population model. The usual assumption is that \(\theta \thicksim N(0, 1)\), with the scale and location being arbitrary (under-identified) features of the model.
When a specific IRT model is used (e.g., the Rasch model, the two-parameter logistic model), parametric forms for the measurement model and the population model are typically assumed. Hence, there is potential for mis-specification of the model, which leads to problems with robustness. In general, it is desirable that we can at least check for issues related to measurement without first having to make strong assumptions about the data, or fit complicated models whose operational characteristics are unknown in a given application. This is usually referred to as non-parametric IRT, although this title is a bit misleading – often parametric procedures are used (e.g., regression), but the parameters of an IRT model are not required.
Based on the non-parametric IRT literature, it is proposed that initial diagnosis of measurement issues use the “rest score” \(T_i = \sum_{j \neq i} X_j\) score as a proxy for \(\theta\). The rest score is just the total score of the test with the item(s) of interest removed. It has been shown that \(P[X_i = 1 \mid T_i]\) is always monotone if the IRF is, so checking assumptions related to monotonicity with the rest score is a robust approach (http://www.stat.cmu.edu/tr/tr661/tr661.ps).
Non-parametric IRT has its limitations. In particular, there is no generally agreed upon way of checking conditional independence without using a specific IRT model (e.g., http://dx.doi.org/10.1027/1614-2241/a000115). The relative utility of model-based approaches continues to be debated as well (http://dx.doi.org/10.1037/met0000121). We do not discuss such methods further here.
It is worth noting that this document does not (yet) consider robustness in terms of cluster-robust standard errors for use with inference. Procedures for clustering and sampling weights will be implemented after the overall procedure for measurement analysis has been hashed out.
As per the previous discussion we can test the assumption of latent monotonicity by checking whether \(P[X_i = 1 \mid T_i]\) is monotone increasing for each item. Below is an example using the baseline MDAT language assessment from the SEIF Malawi data. I use loess but the reader may have their own preferred method non-parametric / semi-parametric / smoothed regression.
# Constants and storage
n_items <- ncol(items)
rest_scores <- 0:(n_items - 1)
gg_data <- data.frame(rest = NA, prob = NA, SE = NA, item = NA)
# Loop over items
for (i in 1:n_items) {
# Rest score
temp <- items
rest <- apply(temp[,-i], 1, sum)
# Model
mod <- paste0(names(temp)[i], " ~ rest")
# Fit Loess
fit <- loess(mod, temp)
# Predicted values
pred <- predict(fit, newdata = rest_scores, se = T)
# Storage
gg_temp <- data.frame(
rest = rest_scores,
prob = pred$fit,
SE = pred$se.fit,
item = rep(names(temp)[i], times = n_items)
)
gg_data <- rbind(gg_data, gg_temp)
}
# Plotting the results
p <- ggplot(gg_data, aes(x = rest, y = prob)) +
geom_line(aes(color = item), lwd = .8) +
geom_ribbon(aes(ymin = prob - 1.96*SE,
ymax = prob + 1.96*SE,
fill = item), alpha = .4)
p + facet_wrap(~ item)
## Warning: Removed 1 rows containing missing values (geom_path).
There is no evidence that any item is non-monotone in the rest score. Note that the standard error regions use a normal reference distribution.
An indicator \(X\) is said to be an invariant measure of \(\theta\) if it is independent of any other variable \(W\) conditional on \(\theta\):
\[ X \mid \theta \perp W. \]
\(W\) is often gender or another demographic variable denoting a pre-existing social group. In the context of RCTs, \(W\) might be also be treatment conditions or time points.
While measurement invariance is defined for any property of the distribution of \(X \mid \theta\), the present discussion will focus on binary predictors and address whether
\[ P[X = 1 \mid \theta, W] = P[X = 1 \mid \theta] \]
i.e., whether the IRF is independent of potential confounders \(W\), which will be referred to as (lack of) measurement bias.
One widely used procedure for testing measurement bias is based on the Mantel-Haenszel (MH) test of a two (indicator) by two (group) table stratified by levels of the rest (or total) score. The MH test is uniformly most power for detecting differences in the location of the IRFs across groups (Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ, US: Lawrence Erlbaum Associates, Inc). However the MH test has a number of shortcomings.
When the groups differ in the slope of the IRFs, using the MH test with the rest score can lead to false positives in which items are falsely identified as be being towards the more able group ( https://doi.org/10.2307/1165031). Caution should be taken to ensure that the IRFS of the two group are approximately parallel before using this method. Below we use Loess to check this assumption.
The MH procedure assumes that the only biased item with is the one being tested which is generally not realistic. In practice, it has been shown that the procedure is relatively robust to this assumption so long as “most” items do not show bias (Holland & Wainer, 1993). In general, if a test is made up of a “large” number of biased items, there is no generally applicable method of detecting measurement bias. Some authors have recommended iterative procedures in which the items are removed one at a time, but there are many issues that arise when taking this approach. Research has proposed a non-iterative, model-based method, but this is only applicable to the Rasch model (https://doi.org/10.1007/s11336-014-9408-y) In practice, if most items on a test are biased, it is not feasible to sort out the bias from the “real” differences on the construct.
Despite these shortcomings, the MH test has been show to be quite reliable in many applications and continues to be widely used in operational settings (https://www.ets.org/Media/Research/pdf/RR-12-08.pdf)
If one is willing to assume a parametric (usually logistic) form for \(P[X_i = 1 \mid T_i]\), standard glm methodology becomes available. In this case, it is possible to model differences in the location as well as the slopes of the IRFs, and to use procedures for nested models to test for bias in sets of items rather than checking each item individually. A regression based approach can be viewed as a sensitivity analysis for the MH approach, or vice versa.
The following code illustrates these three approaches to checking item bias over gender for the MDAT language scores at baseline:
In each case, gender is used as an example. To reproduce these analyses with a different group variable, just name that variable group and its levels group1 and group2, as below. Make sure that length(group) == nrow(items). Grouping variables with more than 2 levels are not considered in this document.
# Use gender as example groups
group <- data$cr_gender
group1 <- levels(group)[1]
group2 <- levels(group)[2]
# Storage
gg_data <- data.frame(rest = NA, prob = NA, SE = NA, item = NA, group = NA)
# Loop over items
for (i in 1:n_items) {
# Data
temp <- items
temp$rest <- apply(temp[,-i], 1, sum)
# Model
mod <- paste0(names(temp)[i], " ~ rest")
# Loess by group
f_fit <- loess(mod, subset(temp, group == group1))
m_fit <- loess(mod, subset(temp, group == group2))
f_pred <- predict(f_fit, newdata = rest_scores, se = T)
m_pred <- predict(m_fit, newdata = rest_scores, se = T)
# Storage
gg_temp <- data.frame(rest = rep(rest_scores, times = 2),
prob = c(f_pred$fit, m_pred$fit),
SE = c(f_pred$se.fit, m_pred$se.fit),
item = rep(names(items)[i], times = n_items*2),
group = rep(c(group1, group2), each = n_items)
)
gg_data <- rbind(gg_data, gg_temp)
}
# Plotting the results
p <- ggplot(gg_data, aes(x = rest, y = prob, group = group)) +
geom_line(aes(color = group), lwd = .8) +
geom_ribbon(aes(ymin = prob - 1.96*SE,
ymax = prob + 1.96*SE,
fill = group), alpha = .4)
p + facet_wrap(~ item)
## Warning: Removed 2 rows containing missing values (geom_path).
There are slight but appreciable differences between groups on some items. Item 25 and 29 appears to be biased towards males; items 33-35 towards females. The response functions are not strictly parallel (equal slopes) but deviations from this assumption appear quite mild for most items. Therefore, we consider the next Mantel-Haenszel tests for each item.
There are two points worth noting about this implementation of the MH procedure
Because several items had either very low or very high proportions of endorsement, strata were created using the deciles of the rest score, rather than the more customary approach of using the individual score categories. This was done to prevent empty cells in the MH test.
Confidence intervals are reported at 95% level, but an item was flagged as having bias using the Benjaminiâ“Hochberg procedure keeping false discovery rate below 5%.
# Storage
MH <- data.frame(item = names(items), OR = NA, lower = NA, upper = NA, pvalue = NA, bias = 0)
# Using deciles of rest score for strata (to avoid empty cells)
tenths <- seq(0, 1, by = .1)
# Loop over items
for(i in 1:n_items) {
temp <- items
temp$rest <- apply(temp[,-i], 1, sum)
deciles <- cut(temp$rest, unique(quantile(temp$rest, tenths)))
mh <- mantelhaen.test(temp[,i], group, deciles)
MH$OR[i] <- mh$conf.int[1]
MH$lower[i] <- mh$conf.int[1]
MH$upper[i] <- mh$conf.int[2]
MH$pvalue[i] <- mh$p.value
}
# BenjaminiâHochberg procedure for false discovery rate = 5%
MH$bias <- p.adjust(MH$pvalue, method = "BH") < .05
# Rounding for readability
MH[,-1] <- lapply(MH[,-1], function(x) round(x, 3))
MH
## item OR lower upper pvalue bias
## 1 fm23 1.086 1.086 1.605 0.006 1
## 2 fm22 0.840 0.840 2.941 0.201 0
## 3 fm20 0.079 0.079 2.834 0.700 0
## 4 fm30 1.114 1.114 1.819 0.006 1
## 5 fm35 1.455 1.455 3.052 0.000 1
## 6 fm36 1.474 1.474 2.398 0.000 1
## 7 fm24 0.600 0.600 0.909 0.005 1
## 8 fm21 0.566 0.566 2.236 0.872 0
## 9 fm28 0.749 0.749 1.158 0.561 0
## 10 fm29 0.804 0.804 1.155 0.722 0
## 11 fm37 0.704 0.704 1.031 0.109 0
## 12 fm38 0.580 0.580 1.314 0.588 0
## 13 fm33 0.885 0.885 1.387 0.401 0
## 14 fm27 0.720 0.720 1.417 0.955 0
## 15 fm34 1.373 1.373 2.262 0.000 1
## 16 fm32 0.685 0.685 1.008 0.066 0
## 17 fm41 0.655 0.655 1.295 0.698 0
## 18 fm40 0.688 0.688 1.195 0.532 0
## 19 fm39 0.680 0.680 1.071 0.191 0
The MH tests support the conclusion that items 33-35 are biased in favor of females, with odds ratios between .5 and .6 for each item. The MH test also identified items 28 and 37 as being biased in favor of females, with odds ratios of the similar magnitude. Items 25 and 29 were biased in favor of males, but the effect sizes were much smaller than those for the items that favored females.
Last, consider an approach based on logistic regression. This allows for testing of differences in the magnitude of the slope as well as the location of the IRFs across groups. This approach can serve to corroborate the MH procedure because it is not sensitive to false positives related to difference in slope. It can also be used to test the entire set of items simultaneously via nested models, before examining bias for individual items.
# Setting up data
long_data <- data.frame(score = NA, rest = NA, item = NA, group = NA)
# Loop over items
for (i in 1:n_items) {
temp <- items
temp$rest <- apply(temp[,-i], 1, sum)
long_temp <- data.frame(score = temp[,i],
rest = temp$rest,
item = rep(names(temp)[i], nrow(temp)),
group = group
)
long_data <- rbind(long_data, long_temp)
}
### logistic regression models ---
# No group differences
mod0 <- glm(score ~ -1 + item + rest:item, data = long_data, family = binomial)
# Group differences in location
mod1 <- glm(score ~ -1 + item + rest:item + group:item, data = long_data, family = binomial)
# Group difference in location and slope
mod2 <- glm(score ~ -1 + item + rest:item:group + group:item, data = long_data, family = binomial)
# Overall test of whether items were biased in slopes
anova(mod1, mod2, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: score ~ -1 + item + rest:item + group:item
## Model 2: score ~ -1 + item + rest:item:group + group:item
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 40223 28657
## 2 40204 28636 19 21.564 0.3065
# Overall test of whether items were biased in intercepts
anova(mod0, mod1, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: score ~ -1 + item + rest:item
## Model 2: score ~ -1 + item + rest:item + group:item
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 40242 28760
## 2 40223 28657 19 102.94 1.563e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The comparison of mod2 (group interactions with slopes and intercepts) and mod1 (group interactions with intercepts only) leads to the conclusion that mod2 did not fit the data better than mod1 This supports the conclusion that the MH test is appropriate for these data.
The comparison of mod1 and mod0 (no group differences) leads to the conclusion that mod1 fits the data significantly better than mod0. This also supports the conclusions of the MH tests, that some items showed bias between groups.
Let’s follow up this model comparison procedure by examining the odds ratios and significance levels for the item by group interactions in mod1
# Getting coefficients for item bias from mod1
temp <- as.data.frame(summary(mod1)$coefficients)
mod1_bias <- temp[grep(":group", row.names(temp)), ]
# Computing odds ratios
mod1_bias$OR <- exp(mod1_bias$Estimate)
# BenjaminiâHochberg procedure for false discovery rate = 5%
mod1_bias$bias <- p.adjust(mod1_bias[,4], method = "BH") < .05
# Rounding for readability
mod1_bias[] <- lapply(mod1_bias, function(x) round(x, 3))
mod1_bias
## Estimate Std. Error z value Pr(>|z|) OR bias
## itemfm20:groupmale -0.345 0.745 -0.464 0.643 0.708 0
## itemfm21:groupmale 0.193 0.336 0.575 0.565 1.213 0
## itemfm22:groupmale 0.509 0.310 1.643 0.100 1.664 0
## itemfm23:groupmale 0.287 0.099 2.896 0.004 1.333 1
## itemfm24:groupmale -0.298 0.105 -2.851 0.004 0.742 1
## itemfm27:groupmale -0.011 0.175 -0.061 0.952 0.989 0
## itemfm28:groupmale -0.083 0.111 -0.743 0.457 0.921 0
## itemfm29:groupmale -0.035 0.092 -0.384 0.701 0.965 0
## itemfm30:groupmale 0.349 0.126 2.780 0.005 1.417 1
## itemfm32:groupmale -0.189 0.099 -1.915 0.055 0.828 0
## itemfm33:groupmale 0.099 0.116 0.855 0.393 1.104 0
## itemfm34:groupmale 0.587 0.128 4.597 0.000 1.799 1
## itemfm35:groupmale 0.754 0.189 3.997 0.000 2.126 1
## itemfm36:groupmale 0.626 0.124 5.041 0.000 1.869 1
## itemfm37:groupmale -0.157 0.097 -1.617 0.106 0.855 0
## itemfm38:groupmale -0.176 0.207 -0.847 0.397 0.839 0
## itemfm39:groupmale -0.139 0.114 -1.216 0.224 0.870 0
## itemfm40:groupmale -0.129 0.142 -0.907 0.364 0.879 0
## itemfm41:groupmale -0.131 0.176 -0.742 0.458 0.877 0
The logistic regression leads to the same substantive conclusions as the MH tests. All of the same items were identified as having bias, the direction of bias was the same, but the point estimates were larger overall.