T-tests are vital statistical tools for comparing means. This hands-on R demo session aims to provide a concise yet thorough understanding of the following types of t-tests: the single sample t-test, the independent samples t-test, the dependent samples t-test, and their nonparametric alternatives.
Single sample t-test (aka one sample t-test)
Used when you want to compare the mean of a single sample to a known value or theoretical expectation.
Let’s do an example of the single sample t-test with the reading and writing score data. Suppose we want to see whether the population our sample comes from has a mean writing score of 15. The hypothesis is that the mean writing score is greater than 15.
The t.test function performs t-tests. The first argument is the variable to test. To do a single sample t-test, include the argument mu = with the comparison value.
# Load the data set
dta_rw <- read.csv("readwrite2.csv")
# Single sample t-test comparing the mean writing score to 15
t.test(dta_rw$writing, mu = 15)
##
## One Sample t-test
##
## data: dta_rw$writing
## t = 21.396, df = 1996, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 15
## 95 percent confidence interval:
## 16.85580 17.23033
## sample estimates:
## mean of x
## 17.04306
Run through the logic of null hypothesis testing for the single sample t-test.
The population mean is significantly greater than 15. Note that the t-test just says it is implausible that the population mean is 15, but doesn’t say anything about direction. Because the sample mean is above 15, we are allowed to say that the mean is significantly greater than 15.
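Since our hypothesis was directional (greater than 15), we could also have run a one-sided version of the test by adding the alternative argument; a minimal sketch:
# One-sided single sample t-test: H1 is that the true mean is greater than 15
t.test(dta_rw$writing, mu = 15, alternative = "greater")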
Key assumptions: observations are independent, and the scores are (approximately) normally distributed in the population (less of a concern with large samples).
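A quick way to eyeball the normality assumption is a Q-Q plot of the sample; a minimal base R sketch:
# Q-Q plot: points should fall close to the line if scores are roughly normal
qqnorm(dta_rw$writing)
qqline(dta_rw$writing)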
Independent samples t-test
Applied when you have two different groups and you want to compare their means.
For this example, we will use proficiency level to predict writing scores, focusing only on novice and intermediate low students (proficiency level = 1 and proficiency level = 2). The t.test function does independent samples t-tests as well, using the standard R model syntax: outcome ~ predictor. The tilde (~) indicates prediction. We saw this syntax with regression, and we’ll continue to see it throughout the semester.
Let’s first visually examine the data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dta_rw_12 <- dta_rw %>%
filter(level == 1 | level == 2) %>%
mutate(level_rename = case_when(level == 1 ~ "Novice",
level == 2 ~ "Intermediate Low")) %>%
mutate(level_rename = relevel(factor(level_rename),ref="Novice"))
ggplot(data = dta_rw_12, aes(x = as.factor(level_rename),
y = writing)) +
geom_boxplot() + # add the boxplot
geom_point(position = position_jitter(width = .1), alpha = .1, size = 2) + # add individual data points
stat_summary(fun = mean, geom = "point", color = "darkblue") + # add the mean as a point
stat_summary(fun = mean, geom = "line", aes(group = 1), color = "darkblue") + # add the line connecting the group means
stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.3, color = "darkblue") + # add bootstrapped 95% CI error bars (requires the Hmisc package)
labs(x = "Proficiency level",
y = "Writing scores") # rename x- and y-axis
The means of the writing scores for the “Novice” and “Intermediate Low” groups are visibly different (indicated by the non-overlapping 95% CIs). The “Intermediate Low” group has a higher mean score, suggesting that, on average, students with an “Intermediate Low” proficiency level tend to have better writing scores than those classified as “Novice”.
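To put numbers on that visual impression, we can compute the sample size, mean, and SD for each group; a quick dplyr sketch:
# Descriptive statistics by proficiency level
dta_rw_12 %>%
  group_by(level_rename) %>%
  summarize(n = n(),
            mean_writing = mean(writing),
            sd_writing = sd(writing))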
Note that an assumption of the independent samples t-test is homogeneity of variance: the variance is the same in both groups. By default, the t.test function does not make this assumption and instead uses a variation of the independent samples t-test called Welch’s t-test. Start by running the analysis assuming homogeneity of variance by adding the argument var.equal = TRUE.
# Independent samples t-test assuming equal variances
t.test(dta_rw_12$writing ~ dta_rw_12$level_rename, var.equal = TRUE)
##
## Two Sample t-test
##
## data: dta_rw_12$writing by dta_rw_12$level_rename
## t = -15.82, df = 566, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Novice and group Intermediate Low is not equal to 0
## 95 percent confidence interval:
## -3.300049 -2.571089
## sample estimates:
## mean in group Novice mean in group Intermediate Low
## 10.34513 13.28070
Run through the logic of null hypothesis testing for the independent samples t-test.
Key assumptions: observations are independent, scores are (approximately) normally distributed within each group, and the two groups have equal variances (homogeneity of variance; see below).
Test for homogeneity of variance
There are three different tests of homogeneity of variance: Bartlett’s test, Levene’s test, and the Fligner-Killeen test.
Of these, Levene’s test is probably most common and Fligner-Killeen is probably best. A thorough exploration of the assumption of homogeneity of variance would involve running all three tests, but they almost always agree, so just picking one is fine too. You’re safe if you use Levene’s test as your go-to.
Levene’s test is available in the car package; Bartlett’s test and the Fligner-Killeen test come with base R’s stats package.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
# Bartlett's test
bartlett.test(writing ~ level_rename, data = dta_rw_12)
##
## Bartlett test of homogeneity of variances
##
## data: writing by level_rename
## Bartlett's K-squared = 0.16452, df = 1, p-value = 0.685
# Levene's test (from the car package)
leveneTest(writing ~ level_rename, data = dta_rw_12)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.1961 0.6581
## 566
# Fligner-Killeen test
fligner.test(writing ~ level_rename, data = dta_rw_12)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: writing by level_rename
## Fligner-Killeen:med chi-squared = 0.05789, df = 1, p-value = 0.8099
For all three tests, the null hypothesis is homogeneity of variance, so a significant result indicates that homogeneity of variance is not a plausible assumption. All three tests were nonsignificant, so there is no evidence of violation of homogeneity of variance.
Let’s calculate the effect size and interpret it.
library(effsize)
# Obtain effect size and its CIs
cohen.d(dta_rw_12$writing, dta_rw_12$level_rename)
##
## Cohen's d
##
## d estimate: -1.356135 (large)
## 95 percent confidence interval:
## lower upper
## -1.542137 -1.170133
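For two independent groups, Cohen’s d is the mean difference divided by the pooled standard deviation. As a sanity check, here is a hand-rolled sketch of that formula (the variable names are our own):
# Cohen's d by hand: mean difference divided by the pooled SD
novice <- dta_rw_12$writing[dta_rw_12$level_rename == "Novice"]
intlow <- dta_rw_12$writing[dta_rw_12$level_rename == "Intermediate Low"]
sd_pooled <- sqrt(((length(novice) - 1) * var(novice) +
                   (length(intlow) - 1) * var(intlow)) /
                  (length(novice) + length(intlow) - 2))
(mean(novice) - mean(intlow)) / sd_pooled  # should match the cohen.d estimate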
The alternative to the independent samples t-test without the assumption of homogeneity of variance is Welch’s t-test. It is the default for the t.test function.
# Welch's t-test (the default; var.equal = FALSE)
t.test(dta_rw_12$writing ~ dta_rw_12$level_rename)
##
## Welch Two Sample t-test
##
## data: dta_rw_12$writing by dta_rw_12$level_rename
## t = -15.9, df = 490.03, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Novice and group Intermediate Low is not equal to 0
## 95 percent confidence interval:
## -3.298333 -2.572805
## sample estimates:
## mean in group Novice mean in group Intermediate Low
## 10.34513 13.28070
Not much difference. No surprise since all three tests of homogeneity of variance were nonsignificant, which means the assumption of homogeneity of variance was reasonable.
All else being equal, the independent samples t-test has slightly more power. Of course, that’s at the price of making an additional assumption. Either choice - assume or don’t assume - is reasonable if there is no evidence that homogeneity of variance is violated.
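As an aside, base R’s power.t.test() can quantify power for the equal-variance test; the values below are purely hypothetical, for illustration only:
# Power of a two-sample t-test under hypothetical values
power.t.test(n = 50,        # hypothetical per-group sample size
             delta = 0.5,   # hypothetical mean difference
             sd = 1,        # hypothetical common SD
             sig.level = 0.05,
             type = "two.sample")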
Dependent samples t-test (aka paired t-test)
Ideal for before-and-after scenarios, where you want to compare the means of the same group at two different times.
For this section, let’s look at the French & O’Brien grammar data. We used this data for the regression exercise. For now, we would like to compare grammar scores at time 1 to grammar scores at time 2.
library(haven)
dta_fo <- read_sav("French & O'Brien grammar.sav")
# Compare the means
dta_mean <- dta_fo %>%
select(gram_1, gram_2) %>%
summarize(mean_gram_1 = round(mean(gram_1), 3),
mean_gram_2 = round(mean(gram_2), 3))
dta_mean
## # A tibble: 1 × 2
## mean_gram_1 mean_gram_2
## <dbl> <dbl>
## 1 16.6 27.2
Those observed sample means are different, but can we be confident that the population means - that is, the population mean at time 1 and the population mean at time 2 - are different? That’s where the dependent samples t-test comes in.
Before we run the t-test, let’s visually examine the data. However, ggplot2 only takes long format data, so we need to restructure the data before creating the graph.
# restructure the dataset
dta_long <- dta_fo %>%
select(subject, gram_1, gram_2) %>%
pivot_longer(cols = c(gram_1, gram_2), # Specify the columns to be pivoted into long format
names_to = "Test", # Rename the new column holding the original column names as "Test"
values_to = "Score")
head(dta_long) # look at the first six rows
## # A tibble: 6 × 3
## subject Test Score
## <chr> <chr> <dbl>
## 1 S53 gram_1 15
## 2 S53 gram_2 30
## 3 S62 gram_1 15
## 4 S62 gram_2 30
## 5 S58 gram_1 15
## 6 S58 gram_2 30
# data visualization
ggplot(data = dta_long, aes(x = as.factor(Test),
y = Score)) +
geom_boxplot() + # add the boxplot
geom_point(position = position_jitter(width = .1), alpha = .15, size = 2) + # add individual data points
stat_summary(fun = mean, geom = "point", color = "darkblue") + # add the mean as a point
stat_summary(fun = mean, geom = "line", aes(group = 1), color = "darkblue") + # add the line connecting the group means
stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.3, color = "darkblue") + # add bootstrapped 95% CI error bars (requires the Hmisc package)
labs(x = "",
y = "Grammar scores") + # rename x- and y-axis
scale_x_discrete(labels = c("gram_1" = "Time 1",
"gram_2" = "Time 2")) # modifying the x-axis tick labels
Based on the visual representation, there is a clear upward trend in grammar scores between Time 1 and Time 2. The aggregated data, represented by the box plots (median points), line graph (mean points with error bars), and individual data points, further confirm this increase.
Now let’s run the dependent samples t-test. It uses the same t.test function, but with the additional argument paired = TRUE.
t.test(dta_fo$gram_1, dta_fo$gram_2, paired=TRUE)
##
## Paired t-test
##
## data: dta_fo$gram_1 and dta_fo$gram_2
## t = -25.245, df = 103, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -11.501185 -9.825738
## sample estimates:
## mean difference
## -10.66346
This is significant, so we can conclude that the population means are different. We can look at the means from the descriptive statistics or the graph to determine the direction of difference. The sample mean at time 2 is higher than the sample mean at time 1.
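Incidentally, a dependent samples t-test is the same as a single sample t-test on the within-subject difference scores; a quick sketch to confirm:
# Equivalent one-sample t-test on the time 1 minus time 2 differences
t.test(dta_fo$gram_1 - dta_fo$gram_2, mu = 0)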
Let’s calculate the effect size and interpret it.
# Obtain effect size and its CIs
cohen.d(dta_fo$gram_1, dta_fo$gram_2, paired=TRUE)
##
## Cohen's d
##
## d estimate: -2.347607 (large)
## 95 percent confidence interval:
## lower upper
## -2.702907 -1.992306
We can therefore make a nice interpretation - grammar scores are significantly higher at time 2 compared to time 1. The magnitude of this effect is large.
Nonparametric alternatives
For those times when your data doesn’t meet the assumptions of a traditional t-test, nonparametric alternatives are available.
The main nonparametric alternative to the independent samples t-test is the Mann-Whitney U test, also called the Mann-Whitney-Wilcoxon test, also called the Wilcoxon rank-sum test, also called the Wilcoxon-Mann-Whitney test, but most commonly called Mann-Whitney U. Here we’ll apply it to the reading scores for the same two proficiency groups.
# Mann-Whitney U test comparing reading scores across the two groups
wilcox.test(reading ~ factor(level_rename), data = dta_rw_12)
##
## Wilcoxon rank sum test with continuity correction
##
## data: reading by factor(level_rename)
## W = 19879, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
The test statistic is called W or U. The standardized version of U has a sampling distribution that is approximately normal. But you don’t need to worry about that: R directly outputs the p-value.
Based on this test, there was a significant difference in median reading scores between the two groups.
Note that Mann-Whitney U is about medians, not means. If scores are normally distributed (an assumption of the independent samples t-test), then mean = median and the results can be expected to be identical, except the parametric independent samples t-test will have more power and be more stable.
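Since the test concerns medians, it helps to report the group medians alongside it; a quick sketch:
# Median reading score in each proficiency group
dta_rw_12 %>%
  group_by(level_rename) %>%
  summarize(median_reading = median(reading))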
A permutation test offers a different nonparametric approach to independent samples mean comparison. The permutation test for 2 group mean comparison is sometimes called a one way test, but that’s a vague name. In a manuscript, I would call it a nonparametric version of an independent samples t-test using a permutation approach.
The one way test is in the coin package.
library(coin)
## Loading required package: survival
There is an exact version, which does all possible permutations.
oneway_test(reading ~ factor(level_rename),
data=dta_rw_12,
distribution="exact")
##
## Exact Two-Sample Fisher-Pitman Permutation Test
##
## data: reading by
## factor(level_rename) (Novice, Intermediate Low)
## Z = -10.042, p-value < 2.2e-16
## alternative hypothesis: true mu is not equal to 0
There is also an approximate version, which does a more limited random selection of permutations. This is particularly useful with a large sample size, where calculating every permutation would take a long time. Here, we are doing 9999 permutations.
oneway_test(reading ~ factor(level_rename),
data=dta_rw_12,
distribution = approximate(nresample=9999))
##
## Approximative Two-Sample Fisher-Pitman Permutation Test
##
## data: reading by
## factor(level_rename) (Novice, Intermediate Low)
## Z = -10.042, p-value < 1e-04
## alternative hypothesis: true mu is not equal to 0
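Because the approximate version draws permutations at random, the p-value can wobble slightly from run to run. Setting a seed first makes the result reproducible (the seed value here is arbitrary):
# Set a seed so the random permutations, and thus the p-value, are reproducible
set.seed(123)
oneway_test(reading ~ factor(level_rename),
            data = dta_rw_12,
            distribution = approximate(nresample = 9999))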
The main nonparametric alternative to the dependent samples t-test is the Wilcoxon signed rank test.
wilcox.test(dta_fo$gram_1,
dta_fo$gram_2,
paired=TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: dta_fo$gram_1 and dta_fo$gram_2
## V = 0, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
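As before, reporting the medians being compared is good practice; a quick sketch:
# Median grammar score at each time point
dta_fo %>%
  summarize(median_gram_1 = median(gram_1),
            median_gram_2 = median(gram_2))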
There is also a permutation test, but it requires long format data, and the grouping variable and the stratification variable should be factors.
# Running the Symmetry Test
symmetry_test(
# Response and grouping variable
Score ~ factor(Test)
# Blocking or stratification variable
| factor(subject),
# Data being used
data = dta_long)
##
## Asymptotic General Symmetry Test
##
## data: Score by
## factor(Test) (gram_1, gram_2)
## stratified by factor(subject)
## Z = -9.4621, p-value < 2.2e-16
## alternative hypothesis: two.sided