False Positive inflation of Wilcoxon and Permutation of P(x<y).

In the case of heteroscedasticity, the Wilcoxon-Mann-Whitney test is does not necessarily control the Type I error rate for comparing the median of two distributions.

To demonstrate this, we will same groups from two normal distributions with mean (and by symmetry median) 0, one with larger variance than the other, and calculate the Wilcoxon p value.

We are sampling from distributions following the null hypothesis of P(x<y)=P(y>x). However, the sampling is not identically distributed, and therefore we see an inflation of False positives in the QQ plot below:

x <- sapply(seq_len(10000), \(i){
group1 <- rnorm(50)
group2 <- rnorm(50,0,10)
wilcox.test(group1,group2,exact=TRUE)$p.value
})

plot(-log10(rank(x)/length(x)),-log10(x), xlab="Expected", ylab="Observed")
abline(0,1)

Note that the Wilcoxon test is identical to the permutation test of the point estimate of P(x<y). For completeness, we do that permutation test and show the same issue:

library(doParallel)

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: parallel

registerDoParallel(20)

numPerm <- 100000

x <- foreach(i=seq_len(300), .combine = c) %dopar% {
  group1 <- rnorm(50)
  group2 <- rnorm(50,0,10)
  
  # obs.stat <- sum(apply(expand.grid(group1,group2), 1, \(x)(x[1]<x[2])))
  
  obs.mat <- (matrix(group1, nrow=50, ncol=50) < matrix(group2, nrow=50, ncol=50, byrow = T))
  obs.stat <- sum(obs.mat)
  perm.table <- sapply(seq_len(numPerm), function(x){
    perm <- sample(c(group1, group2))
    perm.mat <- (matrix(perm[1:50], nrow=50, ncol=50) < matrix(perm[51:100], nrow=50, ncol=50, byrow = T))
    perm.stat <- sum(perm.mat)
    return(perm.stat)
  })
  min(mean(obs.stat<perm.table),mean(obs.stat>perm.table))*2
}

plot(-log10(rank(x)/length(x)),-log10(x), xlab="Expected", ylab="Observed")
abline(0,1)

To demonstrate equivalency, we can compare p values from the exact and finite permutation estimate:

library(doParallel)
registerDoParallel(20)

numPerm <- 100000

res <- foreach(i=seq_len(50), .combine = rbind) %dopar% {
  group1 <- rnorm(50)
  group2 <- rnorm(50,0,10)
  

  obs.mat <- (matrix(group1, nrow=50, ncol=50) < matrix(group2, nrow=50, ncol=50, byrow = T))
  obs.stat <- sum(obs.mat)
  perm.table <- sapply(seq_len(numPerm), function(x){
    perm <- sample(c(group1, group2))
    perm.mat <- (matrix(perm[1:50], nrow=50, ncol=50) < matrix(perm[51:100], nrow=50, ncol=50, byrow = T))
    perm.stat <- sum(perm.mat)
    return(perm.stat)
  })
  return(c(min(mean(obs.stat<perm.table),mean(obs.stat>perm.table))*2, wilcox.test(group1,group2, exact = T)$p.value))
}

plot(res, xlab="Permutation p", ylab="Exact p")

# plot(-log10(rank(x)/length(x)),-log10(x), xlab="Expected", ylab="Observed")
# abline(0,1)

Therefore, if we use a permutation null to test false positive control for Wilcoxon in this case, we will see no inflation of false positives.

This occurs because the fundamental assumption behind the permutation procedure is violated: samples are not exchangeable between the two groups, and yet, there is no difference in medians under the null.

As a bonus comparison, we can see that t tests don’t suffer from this issue:

x <- sapply(seq_len(10000), \(i){
group1 <- rnorm(50)
group2 <- rnorm(50,0,10)
t.test(group1,group2,exact=TRUE)$p.value
})

plot(-log10(rank(x)/length(x)),-log10(x), xlab="Expected", ylab="Observed")
abline(0,1)