In previous examples, hypothesis testing with two independent samples drawn from normally distributed populations was explored. Often, however, data is not normally distributed, and the t-test can give misleading results. In the case of non-normally distributed data, nonparametric methods can be used. Nonparametric methods are so named because they do not rely on distributional assumptions such as normality. For two independent samples from non-normally distributed populations, the Mann-Whitney U Test is used. The Mann-Whitney test is also known as the Wilcoxon Rank Sum Test.

In this example, we are interested in cabbages from a field trial. Samples are collected on two cultivars of cabbage and the ascorbic acid (Vitamin C) content is measured. Are the two populations of cabbage identical in their Vitamin C content?

Getting Started

The data is loaded from the MASS package. After plotting a Q-Q plot and a histogram of the Vitamin C content for each cultivar, the data does appear to exhibit some non-normality in its distribution.

library(MASS)
library(ggplot2)
library(car)
library(caret)
## Loading required package: lattice
library(reshape2)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(coin)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
data("cabbages")
c39 <- filter(cabbages, Cult == 'c39')
c52 <- filter(cabbages, Cult == 'c52')
par(mfrow=c(2,2))
qqPlot(c39$VitC, main = 'Q-Q Plot of c39 Concentration')
qqPlot(c52$VitC, main = 'Q-Q Plot of c52 Concentration')
hist(c39$VitC, main = 'Histogram of c39 Concentration')
hist(c52$VitC, main = 'Histogram of c52 Concentration')

The histogram of c39 shows the values are skewed to the left, while the histogram of c52 shows a slight shift to the right. This gives us enough motivation to employ the Mann-Whitney U test, which, being nonparametric, does not assume normality. What makes the Mann-Whitney test even more compelling is that it performs nearly as well as the t-test even when the data is normally distributed.
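
As an optional, more formal check to supplement the plots (not part of the original write-up), a Shapiro-Wilk normality test could be run on each cultivar's Vitamin C values; a small p-value would likewise point toward non-normality.

# Shapiro-Wilk test of normality for each cultivar's Vitamin C content
shapiro.test(c39$VitC)
shapiro.test(c52$VitC)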

The Mann-Whitney U Test

Formulating a hypothesis is the same as before. There is no reason to assume the Vitamin C concentration in the two cultivars of cabbage is different. Therefore, we can state the hypothesis formally:

\[ H_0: \mu_1 = \mu_2 \] \[ H_A: \mu_1 \neq \mu_2 \]

Where \(\mu_1\) and \(\mu_2\) denote the location (median) of the Vitamin C concentration for the c39 and c52 cabbages, respectively.

The test statistic in the Mann-Whitney setting is denoted \(U\) and is the smaller of the two values derived from the rank sums of the samples (defined below). The null hypothesis is rejected if \(U \leq U_0\), where \(U_0\) is a critical value looked up in a table for small sample sizes (published tables typically cover up to about 20 observations per sample). For larger sample sizes, \(U\) is approximately normally distributed (Wikipedia).

The test is nonparametric in the sense that it uses the ranks of the values rather than the values themselves. The values are ordered and then ranked from 1 (the smallest value) up to the largest value. Tied values each receive the mean of the ranks they would have received. For example, for the data points {4, 7, 7, 8} the ranks are {1, 2.5, 2.5, 4}; the rank 2.5 comes from (2 + 3) / 2. The ranks are then summed within each sample. The sum of the ranks for a sample is typically denoted \(R_k\), where \(k\) is a sample indicator.
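
This behavior can be reproduced directly with R's rank() function; the short vector below is just the toy example above, not part of the cabbage data.

# Tied values receive the average of the ranks they would otherwise occupy
rank(c(4, 7, 7, 8), ties.method = 'average')
## [1] 1.0 2.5 2.5 4.0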

How to calculate the test manually will be described later. The Mann-Whitney U test can be performed with the base stats package in R with the wilcox.test() function.

mw <- wilcox.test(VitC ~ Cult, data = cabbages)
## Warning in wilcox.test.default(x = c(51L, 55L, 45L, 42L, 53L, 50L, 50L, :
## cannot compute exact p-value with ties

An alternative implementation of the Mann-Whitney test can be found in the coin package. wilcox.test() reports the rank sum statistic as \(W\) (the same quantity that was called \(U\) earlier; the two labels are used interchangeably), while coin's wilcox_test() reports a standardized \(Z\) statistic. The p-values reported by the two implementations are nearly identical, but the printed test statistics differ.

coinw <- wilcox_test(VitC ~ Cult, data = cabbages)
coinw
## 
##  Asymptotic Wilcoxon-Mann-Whitney Test
## 
## data:  VitC by Cult (c39, c52)
## Z = -5.1497, p-value = 2.609e-07
## alternative hypothesis: true mu is not equal to 0
mw
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  VitC by Cult
## W = 102, p-value = 2.714e-07
## alternative hypothesis: true location shift is not equal to 0

Both tests report a tiny p-value. Regardless of which implementation is used, the null hypothesis would be rejected, and it can therefore be concluded that the two populations are not identical. According to ?wilcox.test, there are several options for how the statistic and p-value are computed. It appears from the output that wilcox.test() applies a continuity correction to the normal approximation (its default), which shifts the p-value slightly relative to coin's asymptotic result.
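
One way to check this (a quick aside, not in the original analysis) is to turn the continuity correction off via the correct argument of wilcox.test(); the resulting p-value should then essentially match the asymptotic value reported by coin.

# Normal approximation without the continuity correction; the p-value should
# closely match the asymptotic result from coin's wilcox_test()
wilcox.test(VitC ~ Cult, data = cabbages, correct = FALSE)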

The following is an example of how to manually calculate the Mann-Whitney U test: finding the \(U\) statistic and z-score and then determining the p-value. The p-value computed here comes out slightly different from the ones reported above (more on that below), but it is still very small and would result in the rejection of the null.

Manually Calculating the Mann-Whitney U test

\(U\) (or \(W\), if you prefer) for the two samples in the test is given by:

\[ U_1 = R_1 - \frac{n_1(n_1 + 1)}{2} \] \[ U_2 = R_2 - \frac{n_2(n_2 + 1)}{2} \]

Where \(R_1\) and \(R_2\) are the sum of the ranks of the two samples as previously noted.

To calculate the \(U\) statistic, start by sorting the data.

cab.sort <- cabbages[order(cabbages$VitC),]

Then assign the ranks and split into two groups.

cab.sort$vit.ranked <- rank(cab.sort$VitC, ties.method = 'average')
c39 <- filter(cab.sort, Cult == 'c39')
c52 <- filter(cab.sort, Cult == 'c52')

With the ranks calculated, the sums can be found and then the values of \(U_1\) and \(U_2\) can be computed. I'll assign the values we need to new variables to keep things easier to read.

# Rank sums for each sample
R1 <- sum(c39$vit.ranked)
R2 <- sum(c52$vit.ranked)

# Sample sizes
n1 <- length(c39$vit.ranked)
n2 <- length(c52$vit.ranked)

# U statistics as defined above
U1 <- R1 - n1 * (n1 + 1) / 2
U2 <- R2 - n2 * (n2 + 1) / 2

With \(U_1\) and \(U_2\) computed, the last step is to find the minimum of the two values. We can see this matches the \(W\) statistic as reported by the wilcox.test() function.

U <- min(U1, U2)
U
## [1] 102
mw$statistic
##   W 
## 102

Calculating the p-value

As was alluded to previously, since the sample size is relatively large, \(U\) is approximately normally distributed. Therefore, a z-score can be found for \(U\) and used to calculate the p-value. In this setting, the z-score is defined as:

\[ z = \frac{U - m_U}{\sigma_U} \]

Where \(m_U\) and \(\sigma_U\) are the mean and standard deviation of \(U\), respectively. The mean of \(U\) is the product of the two sample sizes divided by two:

\[ m_U = \frac{n_{1}n_2}{2} \]

The variance of \(U\) is the product of the two sample sizes multiplied by the sum of the sample sizes plus one, all divided by 12.

\[ \sigma^2_U = \frac{n_{1}n_{2}(n_1 + n_2 + 1)}{12} \]

However, since there are ties in our data, a correction needs to be applied. The tie-corrected variance is defined as:

\[ \sigma^2_U = \frac{n_{1}n_{2}(n + 1)}{12} \left(1 - \sum_{i=1}^{k} \frac{t^3_i - t_i}{n^3 - n} \right) \]

where \(n = n_1 + n_2\), \(t_i\) is the number of observations sharing the \(i\)-th tied rank, and \(k\) is the number of distinct ranks (untied values have \(t_i = 1\) and contribute nothing to the sum).
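
To make the \(t_i\) notation concrete, consider the earlier toy example {4, 7, 7, 8}: the tie counts are \(t = (1, 2, 1)\), so only the pair of tied 7s contributes to the correction term. A small illustration (separate from the cabbage data):

# Correction term for the toy example {4, 7, 7, 8}: one pair of tied 7s
t.counts <- c(1, 2, 1)
n.toy <- 4
sum((t.counts^3 - t.counts) / (n.toy^3 - n.toy))  # 6 / 60 = 0.1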

Before solving the equations, find the rank counts denoted \(t_i\) above. Getting the rank counts is straightforward with the count() function. Note that with both plyr and dplyr loaded, dplyr::count() masks plyr::count(), so it is the dplyr version that runs here and the counts end up in a column named n.

# dplyr::count() tallies how many observations share each rank
rank.counts <- count(cab.sort, vit.ranked)

With the equations defined and the rank counts in hand, the z-score can be calculated. The standard deviation calculation is broken into two pieces to make it easier to keep track of.

# Sum of sample sizes
n <- n1 + n2

# Mean of U
mu <- (n1 * n2) / 2

# Pieces of the tie-corrected variance
sigma_1 <- ((n1 * n2) * (n + 1)) / 12
sigma_2 <- 1 - sum((rank.counts$n^3 - rank.counts$n) / (n^3 - n))

# Final tie-corrected standard deviation
sigma <- sqrt(sigma_1 * sigma_2)

# z-score
z <- abs(U - mu) / sigma
z
## [1] 5.149704
coinw@statistic@teststatistic
##       c39 
## -5.149704

The z-score is the same apart from the sign. The p-values can then be calculated, though as mentioned they come out slightly different. What's interesting is that if the computed z-score is passed to the coinw@distribution@pvalue() function, the p-value matches the one reported by wilcox_test().

(1 - pt(z, n)) * 2 # subtracted from one and multiplied by two as it is a two-tailed test
## [1] 3.05005e-06
coinw@distribution@pvalue(z)
## [1] 2.608977e-07
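
The gap appears to come from the distribution used: the pt() call above evaluates the z-score against a t-distribution, while the asymptotic Mann-Whitney p-value is based on the standard normal. Swapping in pnorm() reproduces the coin result.

# Two-sided p-value from the standard normal approximation
(1 - pnorm(z)) * 2  # ~2.609e-07, essentially the value reported by coin above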

In either case, the null would be rejected, and we would conclude the two populations are not identical.

Conclusion

In this post, the Mann-Whitney U test was employed to compare two populations of cabbage. The Mann-Whitney U test is very useful due to its nonparametric nature, as it does not need to assume normality. According to Wikipedia, the Mann-Whitney test yields results similar to Student's t-test when the data is normally distributed, while remaining valid when it is not.
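
As a final point of comparison (not part of the original analysis), a Welch two-sample t-test can be run on the same data, which here leads to the same conclusion as the Mann-Whitney test.

# Welch two-sample t-test on the same data for comparison
t.test(VitC ~ Cult, data = cabbages)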