What is Kruskal-Wallis test?


Name: Sabiena Glen

Date: March 24, 2023

Kruskal-Wallis test by rank is a non-parametric alternative to one-way ANOVA test, which extends the two-samples Wilcoxon test in the situation where there are more than two groups. It’s recommended when the assumptions of one-way ANOVA test are not met.

We’ll use the built-in R data set named PlantGrowth. It contains the weight of plants obtained under a control and two different treatment conditions.

Attach and view data

The dataset PlantGrowth is attached and is displayed below.

attach(PlantGrowth)
df <- PlantGrowth
table(df)
##       group
## weight ctrl trt1 trt2
##   3.59    0    1    0
##   3.83    0    1    0
##   4.17    1    1    0
##   4.32    0    1    0
##   4.41    0    1    0
##   4.5     1    0    0
##   4.53    1    0    0
##   4.61    1    0    0
##   4.69    0    1    0
##   4.81    0    1    0
##   4.89    0    1    0
##   4.92    0    0    1
##   5.12    0    0    1
##   5.14    1    0    0
##   5.17    1    0    0
##   5.18    1    0    0
##   5.26    0    0    1
##   5.29    0    0    1
##   5.33    1    0    0
##   5.37    0    0    1
##   5.5     0    0    1
##   5.54    0    0    1
##   5.58    1    0    0
##   5.8     0    0    1
##   5.87    0    1    0
##   6.03    0    1    0
##   6.11    1    0    0
##   6.15    0    0    1
##   6.31    0    0    1

Check data

The dataset PlantGrowth is checked below.

# load dataset
attach(PlantGrowth)
## The following objects are masked from PlantGrowth (pos = 3):
## 
##     group, weight
# print the head of the dataset
head(PlantGrowth)
##   weight group
## 1   4.17  ctrl
## 2   5.58  ctrl
## 3   5.18  ctrl
## 4   6.11  ctrl
## 5   4.50  ctrl
## 6   4.61  ctrl
# print the dimensions of the dataset
dim(PlantGrowth)
## [1] 30  2
# print the names of the columns in the dataset
names(PlantGrowth)
## [1] "weight" "group"

Show the group levels

In R terminology, the column “group” is called factor and the different categories (“ctr”, “trt1”, “trt2”) are named factor levels. The levels are ordered alphabetically.

The group levels are shown below.

# Show the group levels
attach(PlantGrowth)
## The following objects are masked from PlantGrowth (pos = 3):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 4):
## 
##     group, weight
my_data <- PlantGrowth
levels(my_data$group)
## [1] "ctrl" "trt1" "trt2"

If the levels are not automatically in the correct order, re-order them as follows:

attach(PlantGrowth)
## The following objects are masked from PlantGrowth (pos = 3):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 4):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 5):
## 
##     group, weight
my_data <- PlantGrowth
my_data$group <- factor(my_data$group,
                         levels = c('ctrl', 'trt1', 'trt2'))
levels(my_data$group)
## [1] "ctrl" "trt1" "trt2"

Compute summary statistics by groups

It’s possible to compute summary statistics by groups. The dplyr package can be used.

#Compute summary statistics by groups:
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
attach(PlantGrowth)
## The following objects are masked from PlantGrowth (pos = 4):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 5):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 6):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 7):
## 
##     group, weight
my_data <- PlantGrowth
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE),
    median = median(weight, na.rm = TRUE),
    IQR = IQR(weight, na.rm = TRUE)
  )
## # A tibble: 3 × 6
##   group count  mean    sd median   IQR
##   <fct> <int> <dbl> <dbl>  <dbl> <dbl>
## 1 ctrl     10  5.03 0.583   5.15 0.743
## 2 trt1     10  4.66 0.794   4.55 0.662
## 3 trt2     10  5.53 0.443   5.44 0.467

Visualize the data using box plots

The boxplot of the weight column of the dataset PlantGrowth is shown below.

# Boxplot of weight
boxplot(PlantGrowth$weight)

The boxplot of the group column of the dataset PlantGrowth is shown below.

# Boxplot of group
boxplot(PlantGrowth$group)

Visualize data with ggpubr

A boxplot that plots weight by group and colour by group.

attach(PlantGrowth)
## The following objects are masked from PlantGrowth (pos = 3):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 5):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 6):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 7):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 8):
## 
##     group, weight
my_data <- PlantGrowth
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.2.3
## Loading required package: ggplot2
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")

library("ggpubr")
ggline(my_data, x = "group", y = "weight", 
       add = c("mean_se", "jitter"), 
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")

Compute Kruskal-Wallis test

The Kruskal-Wallis test is done and the results are shown below.

attach(PlantGrowth)
## The following objects are masked from PlantGrowth (pos = 5):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 6):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 8):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 9):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 10):
## 
##     group, weight
## The following objects are masked from PlantGrowth (pos = 11):
## 
##     group, weight
my_data <- PlantGrowth
kruskal.test(weight ~ group, data = my_data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  weight by group
## Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842

Interpret the Kruskal-Wallis test result

As the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the treatment groups.

Multiple pairwise-comparison between groups

From the output of the Kruskal-Wallis test, we know that there is a significant difference between groups, but we don’t know which pairs of groups are different.

It’s possible to use the function pairwise.wilcox.test() to calculate pairwise comparisons between group levels with corrections for multiple testing.

pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                 p.adjust.method = "BH")
## Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute
## exact p-value with ties
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  PlantGrowth$weight and PlantGrowth$group 
## 
##      ctrl  trt1 
## trt1 0.199 -    
## trt2 0.095 0.027
## 
## P value adjustment method: BH

Interpret the Kruskal-Wallis test result

The pairwise comparison shows that, only trt1 and trt2 are significantly different (p < 0.05).