Introduction

T-tests are vital statistical tools for comparing means. This hands-on R demo session aims to provide a concise yet thorough understanding of the following types of t-tests:

  • Single Sample T-test
  • Independent Samples T-test
  • Dependent Samples T-test
  • Nonparametric Alternatives

Single Sample T-test

Also known as the one-sample t-test.

Used when you want to compare the mean of a single sample to a known value or theoretical expectation.

Let’s do an example of the single sample t-test with the reading and writing score data. Suppose we want to see whether the population our sample comes from has a mean writing score of 15. Our hypothesis is that the mean writing score is greater than 15.

The t.test function performs t-tests. The first argument is the variable. To do a single sample t-test, include the argument mu = with the comparison value.

# Load the data set
dta_rw <- read.csv("readwrite2.csv")

t.test(dta_rw$writing,mu=15)
## 
##  One Sample t-test
## 
## data:  dta_rw$writing
## t = 21.396, df = 1996, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 15
## 95 percent confidence interval:
##  16.85580 17.23033
## sample estimates:
## mean of x 
##  17.04306

Let’s run through the logic of null hypothesis testing for the single sample t-test.

  1. Test statistic: the t-statistic.
  2. Assume the null hypothesis is true: the mean writing score is 15.
  3. The sampling distribution of the t-statistic is a t distribution with df = N - 1 = 1997 - 1 = 1996 (see the by-hand check after this list).
  4. The p-value is < .001.
  5. Reject the null hypothesis.
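
To see where these numbers come from, here is a minimal by-hand sketch of the calculation; it should reproduce the t, df, and p-value that t.test reported above.

# By-hand one sample t-test (a sketch; compare with the t.test output above)
x      <- na.omit(dta_rw$writing)            # drop any missing values, as t.test does
n      <- length(x)
t_stat <- (mean(x) - 15) / (sd(x) / sqrt(n)) # t = (sample mean - mu) / standard error
df     <- n - 1
c(t = t_stat, df = df, p = 2 * pt(-abs(t_stat), df)) # two-sided p-value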

The population mean is significantly greater than 15. Note that the t-test just says it is implausible that the population mean is 15, but doesn’t say anything about direction. Because the sample mean is above 15, we are allowed to say that the mean is significantly greater than 15.
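
If you want to build the direction into the test itself, t.test also has an alternative argument (the default, used above, is alternative = "two.sided"); a minimal sketch:

t.test(dta_rw$writing, mu = 15, alternative = "greater") # one-sided test: mean > 15

When the sample mean falls on the hypothesized side, the one-sided p-value is simply half of the two-sided one, so the conclusion here is the same.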

Key assumptions:

  • random sampling
  • independence of observations
  • normality

Independent Samples T-test

Applied when you have two different groups and you want to compare their means.

For this example, we will use proficiency level to predict writing scores, focusing only on novice and intermediate low students (proficiency level = 1 and proficiency level = 2). The t.test function does independent samples t-tests as well, using the standard R model syntax: outcome ~ predictor. The tilde (~) indicates prediction. We saw this syntax with regression, and we’ll continue to see it throughout the semester.

Let’s first visually examine the data.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dta_rw_12 <- dta_rw %>%
  filter(level == 1 | level == 2) %>%
  mutate(level_rename = case_when(level == 1 ~ "Novice",
                                  level == 2 ~ "Intermediate Low")) %>%
  mutate(level_rename = relevel(factor(level_rename),ref="Novice"))

ggplot(data = dta_rw_12, aes(x = as.factor(level_rename), 
                             y = writing)) +
  geom_boxplot() + # add the boxplot
  geom_point(position = position_jitter(width = .1), alpha = .1, size = 2) + # add individual data points
  stat_summary(fun = mean, geom = "point", color = "darkblue") + # add the mean as a point
  stat_summary(fun = mean, geom = "line", aes(group = "level_rename"), color = "darkblue") + # add the line between groups
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.3, color = "darkblue") +  # add error bars 
  labs(x = "Proficiency level",
       y = "Writing scores") # rename x- and y-axis

The means of the writing scores for the “Novice” and “Intermediate Low” groups are visibly different (indicated by the non-overlapping 95% CIs). The “Intermediate Low” group has a higher mean score, suggesting that, on average, students with an “Intermediate Low” proficiency level tend to have better writing scores than those classified as “Novice”.

Note that an assumption of the independent samples t-test is homogeneity of variance: the variance is the same in both groups. By default, the t.test function does not make this assumption and instead uses a variation of the independent samples t-test called Welch’s t-test. Start by running the analysis assuming homogeneity of variance by adding the argument var.equal=TRUE.

t.test(dta_rw_12$writing ~ dta_rw_12$level_rename,var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  dta_rw_12$writing by dta_rw_12$level_rename
## t = -15.82, df = 566, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Novice and group Intermediate Low is not equal to 0
## 95 percent confidence interval:
##  -3.300049 -2.571089
## sample estimates:
##           mean in group Novice mean in group Intermediate Low 
##                       10.34513                       13.28070

Logic of null hypothesis testing for independent samples t-test

  1. Test statistic: the t statistic.
  2. Assume the null hypothesis is true: the population means are equal.
  3. The sampling distribution of the t statistic is a t distribution with df = N - 2. Here, df = 568 - 2 = 566, which shows up in the output (see the quick check after this list).
  4. The p-value is < 2.2e-16 (< .001).
  5. p < alpha, so reject the null hypothesis.
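
As a quick sanity check on the df, you can count the students in each group (a small sketch using the filtered data frame from above):

table(dta_rw_12$level_rename)      # students per proficiency group
sum(!is.na(dta_rw_12$writing)) - 2 # df = N - 2; should match the 566 in the output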

Key assumptions:

  • random sampling
  • independence of observations
  • normality
  • homogeneity of variance

Test for homogeneity of variance

There are three different tests of homogeneity of variance:

  • Bartlett
  • Levene
  • Fligner-Killeen

Of these, Levene’s test is probably the most common and Fligner-Killeen is probably the best. A thorough exploration of the assumption of homogeneity of variance would involve running all three tests, but they almost always agree, so just picking one is fine too. You’re safe if you use Levene’s test as your go-to.

These tests are available in the car package.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
bartlett.test(writing~level_rename,data=dta_rw_12)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  writing by level_rename
## Bartlett's K-squared = 0.16452, df = 1, p-value = 0.685
leveneTest(writing~level_rename,data=dta_rw_12)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.1961 0.6581
##       566
fligner.test(writing~level_rename,data=dta_rw_12)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  writing by level_rename
## Fligner-Killeen:med chi-squared = 0.05789, df = 1, p-value = 0.8099

For all three tests, the null hypothesis is homogeneity of variance, so a significant result indicates that homogeneity of variance is not a plausible assumption. All three tests were nonsignificant, so there is no evidence of violation of homogeneity of variance.

Effect Size

Let’s calculate the effect size and interpret it.

library(effsize)

# Obtain effect size and its CIs
cohen.d(dta_rw_12$writing, dta_rw_12$level_rename) 
## 
## Cohen's d
## 
## d estimate: -1.356135 (large)
## 95 percent confidence interval:
##     lower     upper 
## -1.542137 -1.170133
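
To make the d estimate above more concrete, here is a minimal by-hand sketch using the usual pooled-SD formula, d = (M1 - M2) / SD_pooled; it should match the cohen.d output up to rounding.

# Cohen's d by hand (a sketch; group 1 is Novice, the reference level)
m <- tapply(dta_rw_12$writing, dta_rw_12$level_rename, mean)
s <- tapply(dta_rw_12$writing, dta_rw_12$level_rename, sd)
n <- tapply(dta_rw_12$writing, dta_rw_12$level_rename, length)
sd_pooled <- sqrt(((n[1] - 1) * s[1]^2 + (n[2] - 1) * s[2]^2) / (sum(n) - 2))
unname((m[1] - m[2]) / sd_pooled) # difference in means in pooled-SD units

Either way, a difference of roughly 1.36 pooled standard deviations between the Novice and Intermediate Low groups is a large effect.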

Welch’s t-test

The version of the independent samples t-test that drops the assumption of homogeneity of variance is called Welch’s t-test. It is the default for the t.test function.

t.test(dta_rw_12$writing ~ dta_rw_12$level_rename)
## 
##  Welch Two Sample t-test
## 
## data:  dta_rw_12$writing by dta_rw_12$level_rename
## t = -15.9, df = 490.03, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Novice and group Intermediate Low is not equal to 0
## 95 percent confidence interval:
##  -3.298333 -2.572805
## sample estimates:
##           mean in group Novice mean in group Intermediate Low 
##                       10.34513                       13.28070

Not much difference. No surprise since all three tests of homogeneity of variance were nonsignificant, which means the assumption of homogeneity of variance was reasonable.

All else being equal, the independent samples t-test has slightly more power. Of course, that’s at the price of making an additional assumption. Either choice - assume or don’t assume - is reasonable if there is no evidence that homogeneity of variance is violated.

Dependent Samples T-test

Ideal for before-and-after scenarios, where you want to compare the means of the same group at two different times.

For this section, let’s look at the French & O’Brien grammar data. We used this data for the regression exercise. For now, we would like to compare grammar scores at time 1 to grammar scores at time 2.

library(haven)

dta_fo <- read_sav("French & O'Brien grammar.sav")

# Compare the means
dta_mean <- dta_fo %>%
  select(gram_1, gram_2) %>%
  summarize(mean_gram_1 = round(mean(gram_1), 3),
          mean_gram_2 = round(mean(gram_2), 3))

dta_mean
## # A tibble: 1 × 2
##   mean_gram_1 mean_gram_2
##         <dbl>       <dbl>
## 1        16.6        27.2

Those observed sample means are different, but can we be confident that the population means - that is, the population mean at time 1 and the population mean at time 2 - are different? That’s where the dependent samples t-test comes in.

Before we run the t-test, let’s visually examine the data. However, ggplot2 only takes long format data, so we need to restructure the data before creating the graph.

# restructure the dataset
dta_long <- dta_fo %>%
  select(subject, gram_1, gram_2) %>%
  pivot_longer(cols = c(gram_1, gram_2), # Specify the columns to be pivoted into long format
               names_to = "Test", # Rename the new column holding the original column names as "Test"
               values_to = "Score")

head(dta_long) # look at the first six rows
## # A tibble: 6 × 3
##   subject Test   Score
##   <chr>   <chr>  <dbl>
## 1 S53     gram_1    15
## 2 S53     gram_2    30
## 3 S62     gram_1    15
## 4 S62     gram_2    30
## 5 S58     gram_1    15
## 6 S58     gram_2    30
# data visualization
ggplot(data = dta_long, aes(x = as.factor(Test), 
                             y = Score)) +
  geom_boxplot() + # add the boxplot
  geom_point(position = position_jitter(width = .1), alpha = .15, size = 2) + # add individual data points
  stat_summary(fun = mean, geom = "point", color = "darkblue") + # add the mean as a point
  stat_summary(fun = mean, geom = "line", aes(group = "Test"), color = "darkblue") + # add the line between groups
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.3, color = "darkblue") +  # add error bars 
  labs(x = "",
       y = "Grammar scores") + # rename x- and y-axis 
  scale_x_discrete(labels = c("gram_1" = "Time 1", 
                              "gram_2" = "Time 2")) # modifying the x-axis tick labels

Based on the visual representation, there is a clear upward trend in grammar scores between Time 1 and Time 2. The box plots (medians), the mean points with error bars, and the individual data points all show this increase.

Now let’s run the dependent samples t-test. It uses the same t.test function, but with the additional argument of paired = TRUE.

t.test(dta_fo$gram_1, dta_fo$gram_2, paired=TRUE)
## 
##  Paired t-test
## 
## data:  dta_fo$gram_1 and dta_fo$gram_2
## t = -25.245, df = 103, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -11.501185  -9.825738
## sample estimates:
## mean difference 
##       -10.66346

This is significant, so we can conclude that the population means are different. We can look at the means from the descriptive statistics or the graph to determine the direction of difference. The sample mean at time 2 is higher than the sample mean at time 1.
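
One quick way to confirm the direction is to look at the paired differences directly (a small sketch; note that t.test above reports gram_1 minus gram_2, which is why the mean difference is negative):

mean(dta_fo$gram_1 - dta_fo$gram_2, na.rm = TRUE) # matches the mean difference from t.test
mean(dta_fo$gram_2 - dta_fo$gram_1, na.rm = TRUE) # positive: scores increased from time 1 to time 2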

Effect Size

Let’s calculate the effect size and interpret it.

# Obtain effect size and its CIs
cohen.d(dta_fo$gram_1, dta_fo$gram_2, paired=TRUE) 
## 
## Cohen's d
## 
## d estimate: -2.347607 (large)
## 95 percent confidence interval:
##     lower     upper 
## -2.702907 -1.992306

We can therefore make a nice interpretation - grammar scores are significantly higher at time 2 compared to time 1. The magnitude of this effect is large.

Nonparametric Alternatives

For those times when your data doesn’t meet the assumptions for a traditional t-test, we’ll also explore nonparametric alternatives.

Mann-Whitney U test

The main nonparametric alternative to the independent samples t-test is the Mann-Whitney U test, also called Mann-Whitney-Wilcoxon test, also called Wilcoxon rank-sum test, also called Wilcoxon-Mann-Whitney test, but most commonly called Mann-Whitney U.

wilcox.test(reading~factor(level_rename),data=dta_rw_12)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  reading by factor(level_rename)
## W = 19879, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

The test statistic is called W or U. The standardized version of U has a sampling distribution that is approximately normal. But you don’t need to worry about that; R directly outputs the p-value.

Based on this test, there was a significant difference in medians. Note that the Mann-Whitney U test is about medians, not means. If scores are normally distributed (an assumption of the independent samples t-test), then mean = median and the two tests can be expected to give essentially the same results, except that the parametric independent samples t-test will have more power and be more stable.
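
Since the test is about medians, you can check the direction of the difference by looking at the group medians (a small sketch using the same reading scores):

tapply(dta_rw_12$reading, dta_rw_12$level_rename, median, na.rm = TRUE) # median reading score per group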

Permutation test

A permutation test offers a different nonparametric approach to independent samples mean comparison. The permutation test for a two-group mean comparison is sometimes called a one-way test, but that’s a vague name. In a manuscript, I would call it a nonparametric version of an independent samples t-test using a permutation approach.

The oneway_test function is in the coin package.

library(coin)
## Loading required package: survival

There is an exact version, which does all possible permutations.

oneway_test(reading ~ factor(level_rename),
            data=dta_rw_12,
            distribution="exact")
## 
##  Exact Two-Sample Fisher-Pitman Permutation Test
## 
## data:  reading by
##   factor(level_rename) (Novice, Intermediate Low)
## Z = -10.042, p-value < 2.2e-16
## alternative hypothesis: true mu is not equal to 0

There is also an approximate version, which does a more limited random selection of permutations. This is particularly useful with a large sample size where calculating every permutation would take a long time. Here, we are doing 9999 permutations.

oneway_test(reading ~ factor(level_rename),
            data=dta_rw_12,
            distribution = approximate(nresample=9999))
## 
##  Approximative Two-Sample Fisher-Pitman Permutation Test
## 
## data:  reading by
##   factor(level_rename) (Novice, Intermediate Low)
## Z = -10.042, p-value < 1e-04
## alternative hypothesis: true mu is not equal to 0
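
Because the approximate version draws a random set of permutations, the p-value can vary slightly from run to run; setting a seed first should make it reproducible (a minimal sketch; the seed value is arbitrary):

set.seed(1234) # any fixed seed makes the resampled permutations reproducible
oneway_test(reading ~ factor(level_rename),
            data = dta_rw_12,
            distribution = approximate(nresample = 9999))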

Nonparametric alternatives to the dependent samples t-test

The main nonparametric alternative to the dependent samples t-test is the Wilcoxon signed rank test.

wilcox.test(dta_fo$gram_1,
            dta_fo$gram_2,
            paired=TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  dta_fo$gram_1 and dta_fo$gram_2
## V = 0, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

There is also a permutation test, but it requires long format data. Additionally, the grouping variable and the stratification variable should be factors.

# Running the Symmetry Test
symmetry_test(
  # Response and grouping variable
  Score ~ factor(Test)
  # Blocking or stratification variable
  | factor(subject),
  # Data being used
  data = dta_long)
## 
##  Asymptotic General Symmetry Test
## 
## data:  Score by
##   factor(Test) (gram_1, gram_2) 
##   stratified by factor(subject)
## Z = -9.4621, p-value < 2.2e-16
## alternative hypothesis: two.sided