This code and vignette come from the blog “Stats and R” by Antoine Soetewey. Link here: https://www.statsandr.com/blog/how-to-do-a-t-test-or-anova-for-many-variables-at-once-in-r-and-communicate-the-results-in-a-better-way/
This piece of code automates the process of drawing boxplots and performing the tests on several variables at once, illustrated below with the iris dataset. The Species variable has 3 levels, so we first remove one to keep only two groups, then draw a boxplot and run a t-test for each of the 4 continuous variables at once. The continuous variables we want to test are variables 1 to 4 of the iris dataset.
dat <- iris
# remove one level to have only two groups
dat <- subset(dat, Species != "setosa")
dat$Species <- factor(dat$Species)
# boxplots and t-tests for the 4 variables at once
for (i in 1:4) { # variables to compare are variables 1 to 4
  boxplot(dat[, i] ~ dat$Species, # draw boxplots by group
    ylab = names(dat[i]), # rename y-axis with variable's name
    xlab = "Species"
  )
  print(t.test(dat[, i] ~ dat$Species)) # print results of t-test
}
##
## Welch Two Sample t-test
##
## data: dat[, i] by dat$Species
## t = -5.6292, df = 94.025, p-value = 1.866e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8819731 -0.4220269
## sample estimates:
## mean in group versicolor mean in group virginica
## 5.936 6.588
##
## Welch Two Sample t-test
##
## data: dat[, i] by dat$Species
## t = -3.2058, df = 97.927, p-value = 0.001819
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.33028364 -0.07771636
## sample estimates:
## mean in group versicolor mean in group virginica
## 2.770 2.974
##
## Welch Two Sample t-test
##
## data: dat[, i] by dat$Species
## t = -12.604, df = 95.57, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.49549 -1.08851
## sample estimates:
## mean in group versicolor mean in group virginica
## 4.260 5.552
##
## Welch Two Sample t-test
##
## data: dat[, i] by dat$Species
## t = -14.625, df = 89.043, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.7951002 -0.6048998
## sample estimates:
## mean in group versicolor mean in group virginica
## 1.326 2.026
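If you prefer a single summary table to four separate printouts, the same tests can also be collected in a data frame. Here is a minimal base-R sketch, using the same dat and the same columns 1 to 4 as above:
# collect the four Welch t-tests in one data frame (sketch)
res <- do.call(rbind, lapply(1:4, function(i) {
  test <- t.test(dat[, i] ~ dat$Species)
  data.frame(
    Variable = names(dat)[i],
    t = unname(test$statistic),
    df = unname(test$parameter),
    p_value = test$p.value
  )
}))
res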
library(ggpubr)

# Edit from here #
x <- which(names(dat) == "Species") # name of grouping variable
y <- which(names(dat) == "Sepal.Length" | # names of variables to test
  names(dat) == "Sepal.Width" |
  names(dat) == "Petal.Length" |
  names(dat) == "Petal.Width")
method <- "t.test" # one of "wilcox.test" or "t.test"
paired <- FALSE # if TRUE, the data frame must list all individuals at T1 first,
# then the same individuals, in the same order, at T2 (see the sketch after this block)
# Edit until here

# Edit at your own risk
for (i in y) {
  for (j in x) {
    if (paired) {
      p <- ggpaired(dat,
        x = colnames(dat[j]), y = colnames(dat[i]),
        color = colnames(dat[j]), line.color = "gray", line.size = 0.4,
        palette = "npg",
        legend = "none",
        xlab = colnames(dat[j]),
        ylab = colnames(dat[i]),
        add = "jitter"
      )
    } else {
      p <- ggboxplot(dat,
        x = colnames(dat[j]), y = colnames(dat[i]),
        color = colnames(dat[j]),
        palette = "npg",
        legend = "none",
        add = "jitter"
      )
    }
    # add the name of the test and its p-value to the plot
    print(p + stat_compare_means(aes(label = paste0(..method.., ", p-value = ", ..p.format..)),
      method = method,
      paired = paired,
      ref.group = NULL
    ))
  }
}
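The comment next to paired above refers to the row order that ggpaired() expects. As a purely hypothetical sketch (dat_paired, time and score are made-up names used only for illustration), a correctly ordered paired data set looks like this:
# hypothetical paired layout: all subjects at time T1 first, then the same
# subjects, in the same order, at time T2
set.seed(42)
dat_paired <- data.frame(
  time = factor(rep(c("T1", "T2"), each = 10)), # grouping variable (x above)
  score = c(rnorm(10, mean = 5), rnorm(10, mean = 6)) # tested variable (y above)
)
# with such data, set paired <- TRUE and point x and y to "time" and "score"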
If you would like to adjust these p-values for multiple testing, or use another adjustment method, you can use the p.adjust() function. Below are the raw p-values found above, together with the p-values derived from the main adjustment methods (presented in a data frame):
raw_pvalue <- numeric(length = 4)
for (i in 1:4) {
  raw_pvalue[i] <- t.test(dat[, i] ~ dat$Species,
    paired = FALSE,
    alternative = "two.sided"
  )$p.value
}

df <- data.frame(
  Variable = names(dat[, 1:4]),
  raw_pvalue = round(raw_pvalue, 3)
)

df$Bonferroni <- p.adjust(df$raw_pvalue, method = "bonferroni")
df$BH <- p.adjust(df$raw_pvalue, method = "BH")
df$Holm <- p.adjust(df$raw_pvalue, method = "holm")
df$Hochberg <- p.adjust(df$raw_pvalue, method = "hochberg")
df$Hommel <- p.adjust(df$raw_pvalue, method = "hommel")
df$BY <- round(p.adjust(df$raw_pvalue, method = "BY"), 3)

knitr::kable(df)
| Variable | raw_pvalue | Bonferroni | BH | Holm | Hochberg | Hommel | BY |
|---|---|---|---|---|---|---|---|
| Sepal.Length | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Sepal.Width | 0.002 | 0.008 | 0.002 | 0.002 | 0.002 | 0.002 | 0.004 |
| Petal.Length | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Petal.Width | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
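As a quick check on what these adjustments do, the Bonferroni correction simply multiplies each raw p-value by the number of tests (capped at 1). A minimal sketch with the rounded raw p-values from the table above:
# Bonferroni by hand versus p.adjust(), using the rounded p-values shown above
raw <- c(0.000, 0.002, 0.000, 0.000)
pmin(raw * length(raw), 1) # 0.000 0.008 0.000 0.000
p.adjust(raw, method = "bonferroni") # same values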
Below is the same process with an ANOVA. Note that we reload the iris dataset to include all three species this time:
dat <- iris

# Edit from here
x <- which(names(dat) == "Species") # name of grouping variable
y <- which(names(dat) == "Sepal.Length" | # names of variables to test
  names(dat) == "Sepal.Width" |
  names(dat) == "Petal.Length" |
  names(dat) == "Petal.Width")
method1 <- "anova" # one of "anova" or "kruskal.test"
method2 <- "t.test" # one of "wilcox.test" or "t.test"
my_comparisons <- list( # comparisons for post-hoc tests
  c("setosa", "versicolor"),
  c("setosa", "virginica"),
  c("versicolor", "virginica")
)
# Edit until here

# Edit at your own risk
for (i in y) {
  for (j in x) {
    p <- ggboxplot(dat,
      x = colnames(dat[j]), y = colnames(dat[i]),
      color = colnames(dat[j]),
      legend = "none",
      palette = "npg",
      add = "jitter"
    )
    print(
      p + stat_compare_means(aes(label = paste0(..method.., ", p-value = ", ..p.format..)),
        method = method1, label.y = max(dat[, i], na.rm = TRUE)
      ) +
        stat_compare_means(comparisons = my_comparisons, method = method2, label = "p.format") # remove this layer if the p-value of the ANOVA or Kruskal-Wallis test is >= alpha
    )
  }
}
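For reference, the numbers annotated on these plots can also be reproduced in the console with base R. A minimal sketch for a single variable (Sepal.Length is used here only as an example):
# overall ANOVA p-value for one variable
summary(aov(Sepal.Length ~ Species, data = dat))
# one of the pairwise comparisons displayed on the plot (Welch t-test by default)
t.test(Sepal.Length ~ Species,
  data = droplevels(subset(dat, Species %in% c("versicolor", "virginica")))
)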
If you want to apply the same automated process to your own data, you will need to modify the name of the grouping variable (Species), the names of the variables you want to test (Sepal.Length, etc.), whether you want to perform an ANOVA (anova) or a Kruskal-Wallis test (kruskal.test), and the comparisons for the post-hoc tests.
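As a purely hypothetical illustration (my_data, group, var1 and var2 below are made-up names, not objects from the code above), the lines between the two # Edit markers could be adapted as follows:
# hypothetical adaptation of the "Edit from here" block to your own data
dat <- my_data # your own data frame
x <- which(names(dat) == "group") # name of your grouping variable
y <- which(names(dat) %in% c("var1", "var2")) # names of the variables to test
method1 <- "kruskal.test" # or "anova"
method2 <- "wilcox.test" # or "t.test"
my_comparisons <- list( # post-hoc comparisons between your group levels
  c("A", "B"),
  c("A", "C"),
  c("B", "C")
)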