Introduction

Statistical tests for checking whether experimental (sampling) groups come from the same population are commonly based on evaluating the null hypothesis that their central measures (means or medians) are all equal. For this, the classical (parametric) analysis of variance F-test is built as follows:

\[F = \frac{(I-1)^{-1}\sum_{i=1}^In_i(\bar{y}_i -\bar{y})^2}{s^2}\]

where \(\bar{y}_i\) is the mean of the \(n_i\) observations of group \(i\) (\(i = 1, 2, ..., I\)), and \(s^2\) is the estimate of the residual variance.

Under the assumptions of the ANOVA model (normality, independence and homogeneity of variances) and under the null hypothesis, the numerator of \(F\) divided by \(\sigma^2\) follows a \(\chi^2_{I-1}\) distribution (chi-squared, with \(I-1\) degrees of freedom) divided by \(I-1\), while \((n-I)s^2/\sigma^2\) follows a \(\chi^2_{n-I}\), with \(n = \sum_in_i\).

If the residual variance \(\sigma^2\) is known rather than estimated, then

\[\frac{\sum_{i=1}^In_i(\bar{y}_i -\bar{y})^2}{\sigma^2} \sim \chi^2_{I-1}\]

In both cases, the higher the variation of the group means \(\bar{y}_i\) around the general mean \(\bar{y}\), the higher the ratio, suggesting that the groups did not come from the same population.
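To fix ideas, here is a minimal sketch in R assembling \(F\) by hand on simulated data and checking it against R's own ANOVA table; the object names are just illustrative.

# F statistic built by hand, compared with R's ANOVA table
set.seed(123)
g <- gl(3, 10)                            # I = 3 groups, n_i = 10 each
y <- rnorm(30)                            # H0 true: one single population
ybar_i <- tapply(y, g, mean)              # group means
n_i <- tabulate(g)                        # group sizes
s2 <- sum((y - ybar_i[g])^2)/(30 - 3)     # residual variance estimate
F_stat <- (sum(n_i*(ybar_i - mean(y))^2)/(3 - 1))/s2
F_stat
anova(lm(y ~ g))[1, "F value"]            # should match F_stat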

This is the basis for the construction of some of the main nonparametric tests used as alternatives to the F-test.

Kruskal-Wallis’ test

Example 1: Bean Gold Mosaic Virus

(Image source: https://agrolink.com.br)

Bemisia tabaci is a small insect known to transmit BGMV, which can cause total yield loss in common bean crops. Let us consider the example of a completely randomized design experiment in a field cultivated with common beans, where four insecticide treatments (groups) are compared. The response variable is grain yield (kg/ha). In R, import the data with

path <- "https://raw.githubusercontent.com/arsilva87/nonparametric/main/insecticides.csv"
dat <- read.csv(path)
head(dat)
    treat rep yield
1 Control   1  1451
2 Control   2  1533
3 Control   3  1493
4 Control   4  1299
5 Control   5  2463
6 Control   6  1441

Below, the box plot per treatment reveals outliers. Perhaps they are due to some design problem or something else; it is difficult to say at this point (if it matters at all!). The fact is that the data analyst has to deal with them, for the means can be highly inflated in this case, which would lead to misinterpretation under a careless choice of methods. Try running a one-way ANOVA F-test and then Tukey's comparisons, as sketched after the plot.

boxplot(yield ~ treat, dat)
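As suggested, here is a minimal sketch of that parametric route; with the outliers in place, its output should be read with suspicion.

# one-way ANOVA F-test, with the residual variance inflated by the outliers
fit <- aov(yield ~ treat, data = dat)
summary(fit)
# Tukey's pairwise comparisons of means
TukeyHSD(fit)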

In a situation like this, the H-test (Kruskal and Wallis, 1952) comes in handy. As it is based on the ranks (general positions) of the observations, it is robust to outliers. In fact, this nonparametric test only requires that the observations be independent of one another.

Formulation

Now let us display the observations, and then their ranks, in a convenient layout.

aggregate(yield ~ treat, dat, FUN = c)
      treat yield.1 yield.2 yield.3 yield.4 yield.5 yield.6
1       B+C    1594    1658    1700    1900    6592    1648
2 Beauveria    1455    1575    1500    1498    5053    1513
3   Calypso    1368    1536    1634    1545    5970    1492
4   Control    1451    1533    1493    1299    2463    1441
r <- rank(dat$yield)
aggregate(r ~ treat, dat, FUN = c)
      treat r.1 r.2 r.3 r.4 r.5 r.6
1       B+C  15  18  19  20  24  17
2 Beauveria   5  14   9   8  22  10
3   Calypso   2  12  16  13  23   6
4   Control   4  11   7   1  21   3

Notice that the outliers all lie in replication #5, for every treatment. This information could lead us to another way of dealing with these data, but for now let us focus on the ranks. For instance, the highest observation, 6592, is ranked as 24, the lowest, 1299, is ranked as 1, and so forth. Did you realize that ranking is just a monotonic transformation? Thus, instead of evaluating the variation of the group means \(\bar{y}_i\), we evaluate the variation of the group rank means \(\bar{r}_i\), using the test statistic

\[H = \left(\frac{n-1}{n}\right)\frac{1}{\sigma^2} \sum_{i=1}^I n_i(\bar{r}_i - \bar{r})^2\]

Because of the way the ranks are assigned to the data, from 1 through \(n = 24\), it is reasonable to assume a discrete Uniform distribution for the ranks, \(R \sim U_d(1, n)\). From this, we have \(\bar{r} = \frac{n+1}{2}\) and \(\sigma^2 = \frac{n^2-1}{12}\). The term \((n-1)/n\) is just a bias correction for the assumed variance \(\sigma^2\). Thus, from the previous section, we can say that

\[H \sim \chi^2_{I-1}\]

This is an asymptotic (\(n \rightarrow \infty\)) approximation, based on the CLT (central limit theorem, applied to \(\bar{r}_i\)), that usually works well for \(I \geq 3\) and \(n_i \geq 6\). For other cases, exact tables are available in Kruskal and Wallis (1952).

In R, we obtain \(H = 7.62\), which points to a difference among the treatments, although only marginally significant (\(p = 0.05455\), based on \(\chi^2_3\)).

kruskal.test(yield ~ treat, dat)

    Kruskal-Wallis rank sum test

data:  yield by treat
Kruskal-Wallis chi-squared = 7.62, df = 3, p-value = 0.05455
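
As a sanity check, here is a minimal sketch computing \(H\) directly from the formula above; the object names are just illustrative.

# H computed by hand
n <- nrow(dat)
r <- rank(dat$yield)                      # ranks 1..n, as before
rbar_i <- tapply(r, dat$treat, mean)      # rank mean per group
n_i <- tapply(r, dat$treat, length)       # n per group
H <- ((n - 1)/n) * sum(n_i*(rbar_i - (n + 1)/2)^2) / ((n^2 - 1)/12)
H   # 7.62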

Correction for ties

If some observations are tied, the following correction must be applied:

\[H' = \frac{H}{C}\]

where

\[C = 1 - \frac{\sum_s(t_s^3 - t_s)}{n^3 - n}\]

and \(t_s\) is the number of tied observations in the \(s\)-th group of ties.
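
A minimal sketch of this computation in R follows; note that kruskal.test() already applies the correction automatically.

# tie-correction factor C
n <- nrow(dat)
t_s <- table(dat$yield)          # sizes of the groups of equal values
t_s <- t_s[t_s > 1]              # only actual ties contribute
C <- 1 - sum(t_s^3 - t_s)/(n^3 - n)
C   # 1 here, since these yields have no ties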

Dunn’s test

Considering the result of the H-test in the previous section, one could ask questions like: Which treatments differ from the others? Is there a significant difference between the group of isolated techniques (B and C) and the integrated one (B+C)? For pairwise or more general contrasts between groups, Dunn's test can be applied. Like the H-test, it is based on the rank means (\(\bar{r}_i\)) to evaluate the null hypothesis that the contrasted groups come from the same population. Here is the formulation of the test:

\[z = \frac{d_m}{\sqrt{\sigma_m^2}} \rightarrow Normal(0,1)\] where

\[d_m = \frac{\sum_i n_i \bar{r}_i}{\sum_i n_i} - \frac{\sum_j n_j \bar{r}_j}{\sum_j n_j}\] and

\[\sigma_m^2 = \frac{n(n+1)}{12}\left( \frac{1}{\sum_i n_i} + \frac{1}{\sum_j n_j} \right)\]

If \(|z| \geq z_{1-\alpha/2k}\), the contrast is significant; \(k\) is the number of contrasts.

And when there are ties, the following variance correction is needed:

\[\sigma_m^2 = \left( \frac{n(n+1)}{12} - \frac{\sum_s(t_s^3 - t_s)}{12(n - 1)} \right) \left( \frac{1}{\sum_i n_i} + \frac{1}{\sum_j n_j} \right)\]

In R

# install.packages('dunn.test')
library(dunn.test)
with(dat, dunn.test(yield, treat, 
                    method = "bonferroni", 
                    kw = FALSE))

                         Comparison of yield by treat                          
                                 (Bonferroni)                                  
Col Mean-|
Row Mean |        B+C   Beauveri    Calypso
---------+---------------------------------
Beauveri |   1.837117
         |     0.1986
         |
 Calypso |   1.673817  -0.163299
         |     0.2825     1.0000
         |
 Control |   2.694438   0.857321   1.020620
         |    0.0212*     1.0000     0.9223

alpha = 0.05
Reject Ho if p <= alpha/2

From this output we notice a difference (\(p = 0.0212\)) between the Control (no insecticide) and the integration of the biological Beauveria with the chemical Calypso (B+C).

B+C vs. groups B and C

Here is some home-brewed code to test a more general contrast.

# ranks of all observations (as computed earlier)
r <- rank(dat$yield)
# rank means per group (levels in alphabetical order: B+C, Beauveria, Calypso, Control)
r_treat <- with(dat, tapply(r, treat, mean))
# n per group
n_treat <- with(dat, tapply(r, treat, length))
# total n
n <- nrow(dat)
# contrast: rank mean of B+C vs. that of the pooled group {Beauveria, Calypso}
dm <- r_treat[1] - (r_treat[2] + r_treat[3])/2
# contrast variance
var_m <- n*(n + 1)/12 * (1/n_treat[1] + 1/(n_treat[2] + n_treat[3]))
# Dunn's z
z <- as.vector(dm/sqrt(var_m))
# one-sided p-value (dunn.test convention: reject if p <= alpha/2)
pnorm(z, lower.tail = FALSE)
[1] 0.02132919

Friedman’s test

In 1937, Milton Friedman proposed a method somewhat similar to the repeated-measures ANOVA. It consists of ranking the data within each row (block) of a two-way table and then testing whether the different columns (groups) of the resulting table of ranks can be supposed to have all come from the same population. The procedure is similar to the construction of the Kruskal-Wallis statistic \(H\), except for the specific way the ranks are assigned.

For example, let us suppose that each 'replication' of the bean data comes from the very same four field plots (one for each treatment/group), but in a different year. This means that there is no actual replication, and that the data can be ranked by block (year), from 1 through 4 in this case. Rearranging the data per 'rep' results in

# data per block (rep)
aggregate(yield ~ rep, dat, FUN = c)
  rep yield.1 yield.2 yield.3 yield.4
1   1    1451    1455    1368    1594
2   2    1533    1575    1536    1658
3   3    1493    1500    1634    1700
4   4    1299    1498    1545    1900
5   5    2463    5053    5970    6592
6   6    1441    1513    1492    1648
# ranks per block (rep)
aggregate(yield ~ rep, dat, FUN = rank)
  rep yield.1 yield.2 yield.3 yield.4
1   1       2       3       1       4
2   2       1       3       2       4
3   3       1       2       3       4
4   4       1       2       3       4
5   5       1       2       3       4
6   6       1       3       2       4

Notice that treatment 'B+C' (last column) received the highest rank in every replication. Now, after computing the new rank means per group, \(\bar{r}_i\), we calculate Friedman's chi-square

\[X^2 = \left(\frac{I-1}{I}\right)\frac{1}{\sigma^2} \sum_{i=1}^I (\bar{r}_i - \bar{r})^2\]

where \(\bar{r} = \frac{I+1}{2}\), \(\sigma^2 = \frac{I^2-1}{12b}\), and \(b\) is the number of blocks (‘rep’, \(b = 6\) in this case).

In R

friedman.test(yield ~ treat | rep, dat)

    Friedman rank sum test

data:  yield and treat and rep
Friedman chi-squared = 14.6, df = 3, p-value = 0.002192
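
As a check, \(X^2\) can be computed by hand from the formula above; the object names are just illustrative.

# Friedman's chi-square by hand
b <- 6; I <- 4
r_block <- with(dat, ave(yield, rep, FUN = rank))  # ranks within each block
rbar_i <- tapply(r_block, dat$treat, mean)         # rank mean per group
X2 <- ((I - 1)/I) * sum((rbar_i - (I + 1)/2)^2) / ((I^2 - 1)/(12*b))
X2   # 14.6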

Multiple comparisons of rank means

After running Friedman’s test, multiple comparisons between rank means can be done using Dunn’s approach. The contrast

\[d_m = \bar{r}_i - \bar{r}_j\] has the variance

\[\sigma^2_m = \frac{I(I+1)}{6b}\]

Then, based on the approximation of the standard Normal distribution, if \(|d_m/\sqrt{\sigma^2_m}| \geq z_{1-\alpha/2k}\), the contrast is significant; \(k\) is the number of contrasts.

In R

# install.packages('PMCMRplus')
library(PMCMRplus)
frdAllPairsNemenyiTest(yield ~ treat | rep, dat)
          B+C     Beauveria Calypso
Beauveria 0.18329 -         -      
Calypso   0.11358 0.99606   -      
Control   0.00083 0.27864   0.39863

Note: this implementation uses Tukey's Studentized range distribution instead of the standard Normal to calculate the p-values.
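
For comparison, here is a minimal sketch of the z-based contrast from the formulas above, for B+C vs. Control; unlike the Nemenyi output, this p-value is not adjusted for multiplicity, so it should be compared with \(\alpha/k\).

# Dunn-type contrast after Friedman's test, normal approximation
b <- 6; I <- 4
r_block <- with(dat, ave(yield, rep, FUN = rank))  # ranks within each block
rbar_i <- tapply(r_block, dat$treat, mean)         # rank mean per group
dm <- rbar_i["B+C"] - rbar_i["Control"]            # contrast of rank means
var_m <- I*(I + 1)/(6*b)                           # contrast variance
z <- as.vector(dm/sqrt(var_m))
2*pnorm(abs(z), lower.tail = FALSE)                # two-sided p-value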


Example 2 (exercise)

Test for differences (p < 0.05) in the number of tomato fruits damaged by Tuta absoluta per field plot, cultivated with three different varieties (variety A is known to be resistant). Do not forget to state how you choose the tests.

y <- c(0, 0, 0, 0, 1, 0,   8, 5, 0, 3, 1, 6,   10, 4, 0, NA, 12, 7)
group <- gl(3, 6, labels = LETTERS[1:3])

References

Dunn, O.J. (1964) Multiple comparisons using rank sums. Technometrics, 6(3): 241–252.

Friedman, M. (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200): 675–701.

Kruskal, W. H., Wallis, W. A. (1952) Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260): 583–621.


License: CC BY 4.0