Nondifferential misclassification of binary exposure

Nondifferential misclassification of a binary exposure biases results toward the null. Under the null, however, this is not really a bias, because the estimate is only pushed toward the null value it already targets. The hypothesis test therefore remains valid under the null, and the type I error rate \(\alpha\) stays at or below the nominal 5%.
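The argument can be made explicit with a short derivation. The notation is an assumption for illustration: \(A^*\) denotes the misclassified exposure, \(A\) the true exposure, \(Y\) the outcome, and \(p\) the common outcome risk under the null, so that \(P(Y=1 \mid A=a) = p\) for both \(a\). Nondifferential misclassification means \(P(A^*=1 \mid A=a, Y=y) = P(A^*=1 \mid A=a)\). Then

\[
\begin{aligned}
P(Y=1, A^*=1) &= \sum_{a \in \{0,1\}} P(A^*=1 \mid A=a, Y=1)\, P(Y=1 \mid A=a)\, P(A=a) \\
&= p \sum_{a \in \{0,1\}} P(A^*=1 \mid A=a)\, P(A=a) \\
&= P(Y=1)\, P(A^*=1).
\end{aligned}
\]

So under the null the misclassified exposure is also independent of the outcome, which is why a test based on \(A^*\) does not reject more often than it should.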

This can be confirmed with a simulation under the null in which hypothesis tests are repeated using both the true and the misclassified exposure status. The scenario used below is deliberately extreme: both the sensitivity and the specificity are very low (0.55). Nondifferential misclassification then “biases” the chance departures from the null that arise in individual samples (occurring about 5% of the time) toward the null, so the observed \(\alpha\) using the misclassified exposure comes out lower than the nominal 5%.

## Decide total number of individuals
N <- 10^7

## True exposure status is assigned 40:60
## Outcome probability is 20% for both exposed and unexposed
dat <- data.frame(A = rbinom(n = N, size = 1, prob = 0.4),
                  Y = rbinom(n = N, size = 1, prob = 0.2))

## Determine misclassification sensitivity/specificity
sens <- 0.55
spec <- 0.55

## Pick exposed individuals to misclassify
Nexposed <- sum(dat$A == 1)
posMisTo0 <- sample(which(dat$A == 1), size = Nexposed*(1-sens))

## Pick unexposed individuals to misclassify
Nunexposed <- sum(dat$A == 0)
posMisTo1 <- sample(which(dat$A == 0), size = Nunexposed*(1-spec))

## Misclassify them
dat$Amis <- dat$A
dat$Amis[posMisTo0] <- 0
dat$Amis[posMisTo1] <- 1
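
As a sanity check on the misclassification step, the realized sensitivity and specificity can be computed directly and compared with the targets. The following is a minimal standalone sketch, not part of the original analysis: the seed, the smaller `N`, the `round()` around the sample sizes, and the names `realized_sens`, `realized_spec`, `expected_prev`, and `observed_prev` are all assumptions made for illustration.

```r
## Assumptions (not in the original): fixed seed and a smaller N for speed
set.seed(1)
N    <- 10^5
sens <- 0.55
spec <- 0.55

## True exposure, assigned 40:60 as in the main simulation
A <- rbinom(n = N, size = 1, prob = 0.4)

## Misclassify a fixed fraction of exposed and of unexposed individuals
idx1 <- which(A == 1)
idx0 <- which(A == 0)
posMisTo0 <- sample(idx1, size = round(length(idx1) * (1 - sens)))
posMisTo1 <- sample(idx0, size = round(length(idx0) * (1 - spec)))
Amis <- A
Amis[posMisTo0] <- 0
Amis[posMisTo1] <- 1

## Realized sensitivity and specificity of the misclassified exposure
realized_sens <- mean(Amis[A == 1] == 1)
realized_spec <- mean(Amis[A == 0] == 0)

## Apparent exposure prevalence: sens * P(A = 1) + (1 - spec) * P(A = 0)
expected_prev <- sens * 0.4 + (1 - spec) * 0.6
observed_prev <- mean(Amis)

c(realized_sens = realized_sens, realized_spec = realized_spec,
  expected_prev = expected_prev, observed_prev = observed_prev)
```

With sensitivity and specificity both 0.55 and a true prevalence of 0.4, the apparent prevalence works out to 0.55 × 0.4 + 0.45 × 0.6 = 0.49, so the misclassified exposure should split close to 51:49 rather than 60:40.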

## Table entire dataset using true exposure
xtabs( ~ Y + A, data = dat)
##    A
## Y         0       1
##   0 4803507 3198183
##   1 1199066  799244

## Table entire dataset using misclassified exposure
xtabs( ~ Y + Amis, data = dat)
##    Amis
## Y         0       1
##   0 4081627 3920063
##   1 1018631  979679

## Regard the entire dataset as iteration of n = 1000 cohort studies
dat$group <- rep(seq_len(N/1000), each = 1000)

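## Run a chi-squared test within each group (one test per simulated cohort)
## Note: p_true disables the continuity correction (correct = FALSE), whereas
## p_mis uses the chisq.test default (correct = TRUE for 2x2 tables)
library(dplyr)
out <- dat %>%
    group_by(group) %>%
    summarise(p_true = chisq.test(y = Y, x = A, correct = FALSE)$p.value,
              p_mis  = chisq.test(y = Y, x = Amis)$p.value)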

## Summarize p-value distributions
summary(out[,c("p_true","p_mis")])
##      p_true             p_mis          
##  Min.   :0.000008   Min.   :0.0001437  
##  1st Qu.:0.248691   1st Qu.:0.2826344  
##  Median :0.495661   Median :0.5435623  
##  Mean   :0.499105   Mean   :0.5397154  
##  3rd Qu.:0.751144   3rd Qu.:0.8085370  
##  Max.   :1.000000   Max.   :1.0000000

## Proportions of alpha errors
data.frame(prop_reject_true = mean(out$p_true < 0.05),
           prop_reject_mis  = mean(out$p_mis  < 0.05))
##   prop_reject_true prop_reject_mis
## 1           0.0497          0.0427

## Empirical CDF of p values
layout(matrix(1:2, ncol = 2))
plot(ecdf(out$p_true), main = "Using true exposure", xlab = "p", ylab = "Empirical cdf")
plot(ecdf(out$p_mis),  main = "Using misclassified exposure", xlab = "p", ylab = "Empirical cdf")
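
One detail of the code above is worth isolating: `p_true` is computed with `correct = FALSE`, while `p_mis` uses the default of `chisq.test()`, which applies Yates' continuity correction to 2×2 tables. The sketch below computes the misclassified-exposure p-value both ways, so that the contribution of the correction can be separated from that of the misclassification itself. It is a rough illustration under assumed settings, not a rerun of the analysis above: the seed, the number of studies, the helper `one_study()`, and the use of independent per-individual flips (rather than misclassifying a fixed fraction) are all assumptions.

```r
## Assumptions (not in the original): seed, number of studies, helper names
set.seed(1)
n_studies <- 2000   # fewer than the 10^4 cohorts used above, for speed
n_per     <- 1000   # individuals per simulated cohort
sens <- 0.55
spec <- 0.55

one_study <- function() {
  A <- rbinom(n_per, size = 1, prob = 0.4)   # true exposure
  Y <- rbinom(n_per, size = 1, prob = 0.2)   # outcome, independent of A (null)
  ## Nondifferential misclassification: flips depend on A only, never on Y
  Amis <- ifelse(A == 1,
                 rbinom(n_per, size = 1, prob = sens),
                 rbinom(n_per, size = 1, prob = 1 - spec))
  c(p_true            = chisq.test(x = A,    y = Y, correct = FALSE)$p.value,
    p_mis_uncorrected = chisq.test(x = Amis, y = Y, correct = FALSE)$p.value,
    p_mis_corrected   = chisq.test(x = Amis, y = Y, correct = TRUE)$p.value)
}

## One row per simulated study, one column per test
res <- t(replicate(n_studies, one_study()))

## Proportion of studies rejecting at the 5% level, per test
colMeans(res < 0.05)
```

Comparing the `p_mis_uncorrected` and `p_mis_corrected` columns shows how much of any gap from `p_true` is attributable to the continuity correction, since the correction can only increase a p-value computed on the same table.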