The Games-Howell post-hoc test is another approach to comparing all pairs of groups or treatments. Although rather similar to Tukey's test in its formulation, the Games-Howell test does not assume equal variances or sample sizes. The test was designed around Welch's degrees of freedom correction and uses Tukey's studentized range distribution, denoted \(q\). Since the Games-Howell test does not rely on equal variances and sample sizes, it is often recommended over approaches such as Tukey's test when those assumptions are in doubt.

In this example, the InsectSprays dataset available in R will be examined using the Games-Howell test procedure. Since there are few implementations of the Games-Howell test available in R, I’ve also provided a function that performs the test and outputs a variety of relevant statistics.

Getting Started

First, load the packages and data that will be used. The userfriendlyscience package was the only package I could find with an implementation of the Games-Howell test.

library(userfriendlyscience)
## Loading required package: ggplot2
library(ggplot2)
data("InsectSprays")

Plot a quick boxplot to visualize the distributions of the groups.

ggplot(InsectSprays, aes(x=spray, y=count, fill=spray)) + geom_boxplot()

The groups do not have equal variances, which makes the Games-Howell test an ideal candidate for the post-hoc analysis. Note that the test still assumes the groups are approximately normally distributed, but it does not require the equal-variance assumption that Tukey's test does.
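As a quick numeric check of what the boxplot suggests, the per-group sample variances can be computed with base R's tapply():

# Sample variance of insect counts within each spray group
tapply(InsectSprays$count, InsectSprays$spray, var)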

Performing the Games-Howell Test

The userfriendlyscience package offers a function, oneway(), that can perform a Games-Howell post-hoc test in addition to a one-way ANOVA.

one.way <- oneway(InsectSprays$spray, y = InsectSprays$count, posthoc = 'games-howell')
one.way
## ### Oneway Anova for y=count and x=spray (groups: A, B, C, D, E, F)
## 
## Eta Squared: 95% CI = [0.6; 0.77], point estimate = 0.72
## 
##                                      SS Df     MS    F     p
## Between groups (error + effect) 2668.83  5 533.77 34.7 <.001
## Within groups (error only)      1015.17 66  15.38           
## 
## ### Post hoc test: games-howell
## 
##        t    df     p
## A:B 0.45 21.78  .997
## A:C 8.41 14.74 <.001
## A:D 6.21 16.73 <.001
## A:E 7.58 13.91 <.001
## A:F 0.96 20.52  .925
## B:C 9.75 15.50 <.001
## B:D 7.29 17.76 <.001
## B:E 8.89 14.52 <.001
## B:F 0.61 19.50  .989
## C:D 3.08 20.87  .056
## C:E 1.87 21.63  .447
## C:F 7.75 13.20 <.001
## D:E 1.61 19.57  .601
## D:F 6.08 14.48 <.001
## E:F 7.07 12.70 <.001

The function provides an ANOVA fit in addition to the post-hoc test. The post-hoc output is a table containing the t-value, degrees of freedom, and p-value of each possible group combination. From the output, it can be seen that the group combinations A:C, A:D, A:E, B:C, B:D, B:E, C:F, D:F, and E:F are all significantly different from each other. These differences make intuitive sense, as the boxplot shows that these groups are distributed quite differently from one another.

Calculating the Games-Howell Test

The Games-Howell test declares the means of groups \(i\) and \(j\) significantly different when:

\[ \large |\bar{x}_i - \bar{x}_j| > q_{\alpha,\, k,\, df} \, \sigma \]

Where \(q_{\alpha,\, k,\, df}\) is the upper-\(\alpha\) quantile of Tukey's studentized range distribution for \(k\) groups, and \(\sigma\) is the standard error of the difference:

\[ \sigma = \sqrt{{\frac{1}{2} \left(\frac{s^2_i}{n_i} + \frac{s^2_j}{n_j}\right)}} \]

The degrees of freedom are calculated using Welch's correction:

\[ \large df = \frac{\left(\frac{s^2_i}{n_i} + \frac{s^2_j}{n_j}\right)^2}{\frac{\left(\frac{s_i^2}{n_i}\right)^2}{n_i - 1} + \frac{\left(\frac{s_j^2}{n_j}\right)^2}{n_j - 1}} \]

The t-value is found with Welch’s t-test:

\[ \large t = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}}} \]

Confidence intervals are then formed with the studentized range critical value rather than a t critical value, matching the qtukey() call in the function below:

\[ \bar{x}_i - \bar{x}_j \pm q_{\alpha,\, k,\, df} \sqrt{{\frac{1}{2} \left(\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}\right)}} \]

Lastly, the p-value is the upper-tail probability of \(t \sqrt{2}\) under the studentized range distribution with \(k\) groups and \(df\) degrees of freedom:

\[ \large p = P\left(q_{k,\, df} \geq t \sqrt{2}\right) \]

This corresponds to ptukey(t * sqrt(2), k, df, lower.tail = FALSE) in R.
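To make these formulas concrete, here is a small worked sketch that computes the A:B comparison directly from the definitions above; it should reproduce the first row of the earlier post-hoc table (t ≈ 0.45, df ≈ 21.78, p ≈ .997). Note that nmeans is 6 because the studentized range accounts for all six spray groups, not just the pair being compared.

# Observations for sprays A and B
a <- InsectSprays$count[InsectSprays$spray == "A"]
b <- InsectSprays$count[InsectSprays$spray == "B"]

# Welch's t-value
t.ab <- abs(mean(a) - mean(b)) / sqrt(var(a) / length(a) + var(b) / length(b))

# Welch-corrected degrees of freedom
df.ab <- (var(a) / length(a) + var(b) / length(b))^2 /
  ((var(a) / length(a))^2 / (length(a) - 1) +
     (var(b) / length(b))^2 / (length(b) - 1))

# p-value from the upper tail of the studentized range distribution
p.ab <- ptukey(t.ab * sqrt(2), nmeans = 6, df = df.ab, lower.tail = FALSE)

c(t = t.ab, df = df.ab, p = p.ab)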

The function to perform the Games-Howell test takes two arguments, the group vector and the data vector.

games.howell <- function(grp, obs) {
  
  # Create all pairwise combinations of the groups
  combs <- combn(unique(grp), 2)
  
  # Statistics that will be used throughout the calculations:
  # n = sample size of each group
  # groups = number of groups in data
  # Mean = means of each group sample
  # std = variance of each group sample
  n <- tapply(obs, grp, length)
  groups <- length(n)
  Mean <- tapply(obs, grp, mean)
  std <- tapply(obs, grp, var)
  
  statistics <- lapply(1:ncol(combs), function(x) {
    
    # Mean difference (second group in the pair minus the first)
    mean.diff <- Mean[combs[2,x]] - Mean[combs[1,x]]
    
    # Welch's t-value
    t <- abs(Mean[combs[1,x]] - Mean[combs[2,x]]) /
      sqrt((std[combs[1,x]] / n[combs[1,x]]) + (std[combs[2,x]] / n[combs[2,x]]))
    
    # Degrees of Freedom
    df <- (std[combs[1,x]] / n[combs[1,x]] + std[combs[2,x]] / n[combs[2,x]])^2 / # Numerator Degrees of Freedom
            ((std[combs[1,x]] / n[combs[1,x]])^2 / (n[combs[1,x]] - 1) + # Part 1 of Denominator Degrees of Freedom 
              (std[combs[2,x]] / n[combs[2,x]])^2 / (n[combs[2,x]] - 1)) # Part 2 of Denominator Degrees of Freedom
    
    # p-value from the upper tail of the studentized range distribution
    p <- ptukey(t * sqrt(2), groups, df, lower.tail = FALSE)
    
    # Standard error (sigma) of the mean difference
    se <- sqrt(0.5 * (std[combs[1,x]] / n[combs[1,x]] + std[combs[2,x]] / n[combs[2,x]]))
          
    # Upper and lower confidence limits, using the studentized range critical value
    upper.conf <- mean.diff + qtukey(p = 0.95, nmeans = groups, df = df) * se
    lower.conf <- mean.diff - qtukey(p = 0.95, nmeans = groups, df = df) * se
    
    # Group Combinations
    grp.comb <- paste(combs[1,x], ':', combs[2,x])
    
    # Collect all statistics into list
    stats <- list(grp.comb, mean.diff, se, t, df, p, upper.conf, lower.conf)
  })
  
  # Unlist statistics collected earlier
  stats.unlisted <- lapply(statistics, function(x) {
    unlist(x)
  })
  
  # Create dataframe from flattened list
  results <- data.frame(matrix(unlist(stats.unlisted), nrow = length(stats.unlisted), byrow=TRUE))
  
  # Convert every column except the group labels from character to numeric and round
  results[2:ncol(results)] <- round(as.numeric(as.matrix(results[2:ncol(results)])), digits = 3)
  
  # Rename data frame columns
  colnames(results) <- c('groups', 'Mean Difference', 'Standard Error', 't', 'df', 'p', 'upper limit', 'lower limit')

  return(results)
}

Running the function outputs all group combinations with their respective mean differences, standard errors, t-values, degrees of freedom, p-values, and upper and lower confidence limits. The results can be compared with the output of the oneway() function for verification.

games.howell(InsectSprays$spray, InsectSprays$count)
##    groups Mean Difference Standard Error     t     df     p upper limit
## 1   A : B           0.833          1.299 0.454 21.784 0.997       6.562
## 2   A : C         -12.417          1.044 8.407 14.739 0.000      -7.607
## 3   A : D          -9.583          1.090 6.214 16.735 0.000      -4.641
## 4   A : E         -11.000          1.026 7.580 13.910 0.000      -6.236
## 5   A : F           2.167          1.593 0.962 20.523 0.925       9.229
## 6   B : C         -13.250          0.961 9.754 15.499 0.000      -8.855
## 7   B : D         -10.417          1.011 7.289 17.758 0.000      -5.868
## 8   B : E         -11.833          0.941 8.894 14.523 0.000      -7.492
## 9   B : F           1.333          1.539 0.613 19.498 0.989       8.192
## 10  C : D           2.833          0.651 3.078 20.872 0.056       5.715
## 11  C : E           1.417          0.536 1.868 21.631 0.447       3.783
## 12  C : F          14.583          1.331 7.748 13.201 0.000      20.810
## 13  D : E          -1.417          0.621 1.612 19.570 0.601       1.351
## 14  D : F          11.750          1.367 6.076 14.479 0.000      18.063
## 15  E : F          13.167          1.317 7.071 12.699 0.000      19.364
##    lower limit
## 1       -4.896
## 2      -17.226
## 3      -14.525
## 4      -15.764
## 5       -4.895
## 6      -17.645
## 7      -14.965
## 8      -16.175
## 9       -5.526
## 10      -0.048
## 11      -0.949
## 12       8.357
## 13      -4.185
## 14       5.437
## 15       6.969

Aside: Games-Howell Test and Tukey’s Test

As mentioned earlier, the Games-Howell test and Tukey's test often report similar results on data with equal variances and equal sample sizes. This similarity can be demonstrated using the PlantGrowth dataset, which has been explored in previous posts.

games.howell(PlantGrowth$group, PlantGrowth$weight)
##        groups Mean Difference Standard Error     t     df     p
## 1 ctrl : trt1          -0.371          0.220 1.191 16.524 0.475
## 2 ctrl : trt2           0.494          0.164 2.134 16.786 0.113
## 3 trt1 : trt2           0.865          0.203 3.010 14.104 0.024
##   upper limit lower limit
## 1       0.430      -1.172
## 2       1.089      -0.101
## 3       1.616       0.114
TukeyHSD(aov(weight ~ group, data = PlantGrowth), 'group')
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
## 
## $group
##             diff        lwr       upr     p adj
## trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
## trt2-ctrl  0.494 -0.1972161 1.1852161 0.1979960
## trt2-trt1  0.865  0.1737839 1.5562161 0.0120064

Running the games.howell() and TukeyHSD() functions on the PlantGrowth data shows that, regardless of which test is performed, the conclusion is the same: only treatments 1 and 2 are significantly different from each other. The confidence intervals from the Games-Howell and Tukey tests are also close to each other. Therefore, it can make good sense to default to the Games-Howell test to avoid the equal-variance and equal-sample-size assumptions. Tukey's test is still valuable, as it often provides tighter confidence intervals than Games-Howell, but the data must first be examined to make sure it meets the test's assumptions.
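One minimal way to examine those assumptions with base R is sketched below, testing the model residuals for approximate normality and the groups for homogeneity of variance:

fit <- aov(weight ~ group, data = PlantGrowth)

# Shapiro-Wilk test for approximate normality of the residuals
shapiro.test(residuals(fit))

# Bartlett's test for homogeneity of variance across the treatment groups
bartlett.test(weight ~ group, data = PlantGrowth)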

Summary

The Games-Howell test, a procedure for performing post-hoc analysis on all pairs of groups, was examined in this post. The test was performed with the oneway() function in the userfriendlyscience package, as well as a new custom function to verify the results and add some more statistics. The Games-Howell test is more flexible than Tukey's test in that it does not assume equal variances or sample sizes. However, the Games-Howell test will often report looser confidence intervals than Tukey's test, and therefore which test to apply should depend on the particular data.

The Games-Howell function above is also available as a Gist.
