Comparing Two Means

Anna Shirokanova and Olesya Volchenko

March 11, 2021

Let’s compare mean values

In the general case,

  1. start with exploratory statistics (get the feel of your data);
  2. check the assumptions;
  3. if the assumptions hold, perform the t-test, else run the non-parametric test.

Is there just one “t-test”?

There are three kinds of the t-test:

  1. the one-sample t-test, which compares a sample mean with a hypothesized value:

\[ t = \frac{\bar{X} - \mu}{S/\sqrt{n-1}}\]

  2. the paired-samples t-test, which works on the mean of the pairwise differences:

\[ t = \frac{\bar{d}}{S_d/\sqrt{n}}\]

  3. the independent-samples t-test, which compares the means of two separate groups (the Welch form shown here allows unequal variances):

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}\]
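In R, all three correspond to calls to t.test(). A minimal sketch with made-up vectors x and y and a hypothesized value mu0 (illustrative values only, not the survey data used below):

# Hypothetical example vectors -- only to show the three call patterns
set.seed(1)
x <- rnorm(30, mean = 5); y <- rnorm(30, mean = 6); mu0 <- 5

t.test(x, mu = mu0)          # 1. one-sample t-test
t.test(x, y, paired = TRUE)  # 2. paired-samples t-test (x and y of equal length)
t.test(x, y)                 # 3. independent-samples (Welch) t-test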

What are the paired and independent samples?

Two samples are called paired when each observation of group 1 has a particular corresponding observation in group 2.

Examples: the same respondents measured twice (e.g., alcohol consumption on a weekday and on a weekend day, as in the paired-samples example below), or naturally matched pairs such as twins.

In paired samples, the sample size is therefore the same in the two groups.
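As a toy illustration (hypothetical numbers, not the survey data used below), paired data have one row per unit measured twice, while independent samples are two separate vectors that may differ in length:

# Paired: each unit (e.g., respondent) is measured twice -- toy values
paired_df <- data.frame(id     = 1:4,
                        before = c(5, 7, 6, 8),
                        after  = c(6, 9, 8, 8))
# Independent: two separate groups, possibly of different sizes
group1 <- c(5, 7, 6, 8)
group2 <- c(6, 9, 8)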

Independent-samples t-test

How to proceed?

In the general case,

  1. start with exploratory statistics (get the feel of your data);
  2. check assumptions;
  3. if assumptions hold, perform the t-test, else run the non-parametric test.

1. Start with data inspection and checking assumptions to decide on the test

Example: What is the average age at which males and females in Russia start smoking, among those who do? Is it the same for both? In practical terms, at what age should anti-smoking campaigns target male and female adolescents?

Data: RLMS-HSE, a survey of Russian households https://www.hse.ru/en/rlms/

Questionnaire: “Please remember your age when you began smoking. How old were you then?”

library(foreign)
a <- read.spss("alcohol.sav", use.value.labels = T, to.data.frame = T)
class(a$smoke)
## [1] "factor"
table(a$smoke)
## 
##   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23 
##   1   3  10  37  23  13  70  26 120 115 454 506 729 452 659 176 387  84  59  45 
##  24  25  26  27  28  29  30  31  32  33  34  35  36  38  39  40  42  43  45  46 
##  21  83  19  15   7   5  50   2   9   7   4  13   4   5   2   7   3   1   6   2 
##  47  48  49  50  53  55  56  57  59  60  63 
##   2   1   1   7   1   1   1   1   1   1   1
a$smoke <- as.numeric(as.character(a$smoke)) # convert the factor levels to numeric ages

Evaluate the normality of the distribution with numbers

statistics by groups

library(psych)
describeBy(a, a$sex) # age has such a wide range because one household member answers about all household members, even babies and the very old
## 
##  Descriptive statistics by group 
## group: male
##           vars    n  mean    sd median trimmed   mad min max range  skew
## sex*         1 8098  1.00  0.00      1    1.00  0.00   1   1     0   NaN
## age          2 8098 34.72 21.22     33   34.00 25.20   0  94    94  0.24
## material*    3 6364  3.48  1.14      4    3.51  1.48   1   5     4 -0.24
## marital*     4 6469  2.11  1.06      2    1.96  1.48   1   6     5  1.16
## alcohol*     5 3465  4.35  1.13      4    4.40  1.48   1   6     5 -0.61
## smoke        6 3075 16.63  4.10     16   16.45  2.97   4  60    56  2.45
##           kurtosis   se
## sex*           NaN 0.00
## age          -0.77 0.24
## material*    -1.04 0.01
## marital*      1.00 0.01
## alcohol*      0.41 0.02
## smoke        17.60 0.07
## ------------------------------------------------------------ 
## group: female
##           vars     n  mean    sd median trimmed   mad min max range  skew
## sex*         1 10274  2.00  0.00      2    2.00  0.00   2   2     0   NaN
## age          2 10274 40.69 22.99     40   40.45 26.69   0 101   101  0.07
## material*    3  8579  3.49  1.16      4    3.53  1.48   1   5     4 -0.29
## marital*     4  8669  2.80  1.46      2    2.74  1.48   1   6     5  0.40
## alcohol*     5  3088  5.03  0.92      5    5.12  1.48   1   6     5 -0.86
## smoke        6  1177 19.05  6.24     18   17.98  2.97   6  63    57  2.69
##           kurtosis   se
## sex*           NaN 0.00
## age          -0.93 0.23
## material*    -1.03 0.01
## marital*     -1.27 0.02
## alcohol*      0.92 0.02
## smoke        10.40 0.18

Skew and especially kurtosis indicate a non-normal distribution.

There are over 3,000 male smokers and over 1,100 female smokers in the sample.

The minimum reported age of starting to smoke is 4 years for males and 6 years for females.

Evaluate the normality of the distribution with graphs

histograms by groups

a$sex <- relevel(a$sex, ref = "female") # for ggplot colours

library(plyr)
mu <- ddply(a, "sex", summarise, grp.mean = mean(smoke, na.rm = T)) # calculate mean by group

library(ggplot2)
ggplot(a, aes(x = smoke, color = sex, fill = sex)) +
      geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5, binwidth = 3) +
      geom_vline(data = mu, aes(xintercept = grp.mean, color = sex), linetype = "dashed") +
      labs(title = "Smoke by Sex histogram", x = "Started Smoking (age)", y = "Density") +
      theme_classic() +
      theme(legend.position = "top")

ggplot(a, aes(x = smoke, color = sex, fill = sex)) +
      geom_density(alpha = 0.5) +
      labs(title = "Smoke by Sex histogram", x = "Started Smoking (age)", y = "Density") +
      theme_classic()

# if you are a respectable boomer with no time for ggplot, this will also do:

f <- a[a$sex == "female", ] # subsets for simple base-R histograms
m <- a[a$sex == "male", ]
par(mfrow = c(1,2))
hist(f$smoke, ylim = c(0, 2100))
hist(m$smoke, ylim = c(0, 2100))

The histograms show that the distribution in both groups is skewed to the right.

Compare the variance in the groups with boxplots

boxplots

ggplot(a, aes(x = sex, y = smoke)) + 
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4, size = 4) +
  theme_classic() +
  ggtitle("Age When Started Smoking by Sex of Respondent")

means <- aggregate(smoke ~  sex, a, mean) # calculate means on the side
par(mar = c(4,5,1,1)+.5) # set the margins of a plot right
boxplot(a$smoke ~ a$sex, 
        xlab = "Sex", ylab = "Age Started Smoking")
points(1:2, means$smoke, col = "red")
title("Age When Started Smoking by Sex of Respondent")

Results of Exploratory Analysis: the age of starting to smoke is right-skewed in both groups, the spread looks larger among females, and the group sizes differ considerably (about 3,100 male vs. 1,200 female smokers).

2. Let’s check the assumptions of the t-test

Normality checks by Tests
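The formal check can be done with a normality test per group. As a minimal sketch (output not shown), the Shapiro-Wilk test could be applied to the male and female subsets created above; note that shapiro.test() requires between 3 and 5000 non-missing values, which holds for both groups here:

# Shapiro-Wilk test of normality, run separately by sex (sketch, output omitted)
shapiro.test(m$smoke) # H0: the data come from a normal distribution
shapiro.test(f$smoke) # a small p-value means we reject normality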

Normality checks by Graphs

As an alternative to a formal test of normality, you can draw a Q-Q (quantile-quantile) plot.

Normality checks by Graphs (Q-Q)

A Normal distribution resembles this:
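If you want to generate such a reference picture yourself, you can draw the Q-Q plot of simulated normal data (a quick sketch using random values, not the survey data):

# Simulated standard-normal data: the points hug the straight reference line
set.seed(42)
z <- rnorm(1000)
qqnorm(z); qqline(z, col = 2)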

What we have in the data looks like this:

par(mfrow = c(1,2))
qqnorm(f$smoke); qqline(f$smoke, col = 2)
qqnorm(m$smoke); qqline(m$smoke, col = 2) # the points deviate from the straight reference lines

These two Q-Q plots do not look normal (heavy right tail).

Homogeneity (equality) of variances of the DV across groups

options(scipen = 99) # print very small p-values as decimals rather than in scientific notation

var(a$smoke[a$sex=="male"], na.rm = T)
## [1] 16.79787
var(a$smoke[a$sex=="female"], na.rm = T)
## [1] 38.89688
library(car)
leveneTest(a$smoke ~ a$sex)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value                Pr(>F)    
## group    1      80 < 0.00000000000000022 ***
##       4250                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bartlett.test(a$smoke ~ a$sex)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  a$smoke by a$sex
## Bartlett's K-squared = 331.84, df = 1, p-value < 0.00000000000000022

Both tests show that data like these would be very unlikely if the null hypothesis of equal variances were true. Thus, we conclude that the variances are not equal and use the Welch (unequal-variances) t-test.

3. Now run the t-test

t.test(a$smoke ~ a$sex, var.equal = F) 
## 
##  Welch Two Sample t-test
## 
## data:  a$smoke by a$sex
## t = 12.357, df = 1580.4, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.040116 2.809952
## sample estimates:
## mean in group female   mean in group male 
##             19.05268             16.62764

If you do not specify the variance assumption, the result is identical, because var.equal = FALSE (the Welch test) is the default in t.test(). (For comparison, a sketch of the pooled-variance version follows the output below.)

t.test(a$smoke ~ a$sex)
## 
##  Welch Two Sample t-test
## 
## data:  a$smoke by a$sex
## t = 12.357, df = 1580.4, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.040116 2.809952
## sample estimates:
## mean in group female   mean in group male 
##             19.05268             16.62764
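For comparison only (not part of the original analysis), the classical pooled-variance Student's t-test could be forced as below; given the unequal variances found earlier, the Welch version reported above remains the appropriate choice:

# Student's t-test with pooled variance -- shown only for comparison,
# not appropriate here because the variances differ between the sexes
t.test(a$smoke ~ a$sex, var.equal = TRUE)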

Conclusions:

  1. On average, males start smoking at 16.6 years, while females start at 19.1 years.

  2. The t-statistic is t(1580.4) = 12.36 (p < 0.001), so the observed difference in means between the two groups is statistically significant (the mean starting age is higher among females).

Results of the Assumptions’ Check: the distributions are clearly non-normal and the variances are unequal. The Welch t-test accounts for unequal variances, but given the non-normality it is prudent to cross-check the result with a non-parametric test.

Non-parametric Test for Two Independent Samples

wilcox.test(smoke ~ sex, data = a)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  smoke by sex
## W = 2252622, p-value < 0.00000000000000022
## alternative hypothesis: true location shift is not equal to 0

Conclusion: the Wilcoxon rank sum statistic W equals 2252622 (p < 0.001), which means that males and females start smoking at different ages, and this difference is statistically significant.

If you are in doubt about which test to apply (e.g., the distributions are non-normal) and the two tests give diverging conclusions, stick to the non-parametric result (but note that it does not report the difference in means).

Paired Samples t-test

Example: alcohol consumption on weekdays and on weekends (same respondents).

Data: ESS 2014, UK sample. https://www.europeansocialsurvey.org/download.html?file=ESS7GB&c=GB&y=2014

library(foreign)
GBdata <- read.spss("ESS7GB.sav", to.data.frame = T, use.value.labels = T)
GBdata$alcwkdy <- as.numeric(as.character(GBdata$alcwkdy))
GBdata$alcwknd <- as.numeric(as.character(GBdata$alcwknd))
summary(GBdata$alcwkdy) # How many grams of alcohol the respondent consumes on an average weekday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0    18.0    36.3    48.0   632.0     461
summary(GBdata$alcwknd) # How many grams of alcohol the respondent consumes on an average weekend day
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   17.00   40.00   62.43   77.75 1948.00     462

Draw a histogram of the differences to get an idea of their distribution, which the paired t-test assumes to be normal:

#hist(GBdata$alcwkdy - GBdata$alcwknd)
ggplot(GBdata, aes(x = alcwkdy - alcwknd)) +
      geom_histogram(aes(y =..density..), position = "identity", alpha = 0.5, binwidth = 10) +
      geom_density(alpha = 0.6) +
      labs(title = "Difference in Alcohol Consumption on Weekend and Weekday, UK", 
           x = "Difference in grams", 
           y = "Density") +
      theme_classic()

The distribution of the differences looks rather left-skewed and leptokurtic, but it roughly follows a bell shape.
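The same shape can also be checked numerically (a sketch; psych was loaded earlier, and describe() reports skew and kurtosis for the vector of differences):

# Numeric check of the shape of the plotted differences (output not shown)
describe(GBdata$alcwkdy - GBdata$alcwknd)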

Now, the paired t-test itself: let's test whether respondents drink differently on an average weekend day than on an average weekday.

mean(GBdata$alcwknd - GBdata$alcwkdy, na.rm = T) # average difference in grams
## [1] 26.14159
t.test(GBdata$alcwknd, GBdata$alcwkdy, paired = T)
## 
##  Paired t-test
## 
## data:  GBdata$alcwknd and GBdata$alcwkdy
## t = 13.119, df = 1800, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  22.23332 30.04986
## sample estimates:
## mean of the differences 
##                26.14159

Conclusion: the mean amount of alcohol consumed on a weekend day is 26.1 grams higher than on a weekday. The paired t(1800) = 13.12 (p < 0.001), with a 95% confidence interval of [22.23; 30.05] grams for the true mean difference. Therefore, we conclude that the difference is statistically significant: alcohol consumption on weekend days is indeed higher.

Non-parametric equivalent

wilcox.test(x = GBdata$alcwknd, y = GBdata$alcwkdy, paired = TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  GBdata$alcwknd and GBdata$alcwkdy
## V = 771289, p-value < 0.00000000000000022
## alternative hypothesis: true location shift is not equal to 0

Conclusion: Wilcoxon’s signed rank statistic V = 771289 (p < 0.001), which means that alcohol consumption on weekend days differs from that on weekdays (and, given the positive mean difference, is higher); this difference is statistically significant.

Both tests have reached the same conclusion.

Effect size for the t-test

library(effsize)
cohen.d(a$smoke ~ a$sex, na.rm = T)
## 
## Cohen's d
## 
## d estimate: 0.5066159 (medium)
## 95 percent confidence interval:
##     lower     upper 
## 0.4385600 0.5746717

With a Cohen’s d of 0.5, 69 % of the “treatment” group will be above the mean of the “control” group (Cohen’s U3), 80 % of the two groups will overlap, and there is a 64 % chance that a person picked at random from the treatment (female) group will have a higher score than a person picked at random from the control (male) group.

For comparison, with a Cohen’s d of 0.3, 62 % of the treatment group would be above the mean of the control group (Cohen’s U3), 88 % of the two groups would overlap, and there would be a 58 % chance that a person picked at random from the treatment group has a higher score than a person picked at random from the control group (probability of superiority). Moreover, in order to have one more favorable outcome in the treatment group compared to the control group, we would need to treat 10.6 people; that is, if 100 people go through the treatment, 9.4 more people will have a favorable outcome than if they had received the control treatment.
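Under the usual assumption of two normal distributions with equal spread, these summaries can be computed directly from d. A small sketch (the number needed to treat additionally requires an assumed control event rate, so it is not computed here):

# Converting Cohen's d into U3, overlap and probability of superiority
d <- 0.5
U3          <- pnorm(d)               # share of the treatment group above the control mean
overlap     <- 2 * pnorm(-abs(d) / 2) # overlapping coefficient
superiority <- pnorm(d / sqrt(2))     # P(random treatment case > random control case)
round(c(U3 = U3, overlap = overlap, superiority = superiority), 2)
# gives U3 = 0.69, overlap = 0.80, superiority = 0.64; with d = 0.3 these become 0.62, 0.88, 0.58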

library(rstatix)
rstatix::wilcox_effsize(smoke ~ sex, data = a, na.rm = T) # for non-parametric
## # A tibble: 1 x 7
##   .y.   group1 group2 effsize    n1    n2 magnitude
## * <chr> <chr>  <chr>    <dbl> <int> <int> <ord>    
## 1 smoke female male     0.191 10274  8098 small

The effect size of the Mann-Whitney-Wilcoxon test is 0.19 (small), even though the test is statistically significant. So, overall, the difference in the age of first smoking between males and females is small to medium in size.

More to read:

Recap independent-samples t-test in pictures

Source: Dr. Allison Horst https://twitter.com/allison_horst/status/1216411185240690688?s=20

“But couldn’t our conclusion be wrong? Yes! Type I errors (based on your samples you conclude there is a significant difference in means, when there actually isn’t)”:

“Or Type II errors (based on your samples you conclude there isn’t a significant difference, when there actually is)”:
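A quick simulation (not from the original slides) illustrates the Type I error rate: when there is no true difference between the groups, about 5% of t-tests still come out "significant" at the 0.05 level:

# Simulate many t-tests under the null hypothesis of no difference
set.seed(1)
pvals <- replicate(10000, t.test(rnorm(50), rnorm(50))$p.value)
mean(pvals < 0.05) # proportion of false positives, close to 0.05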

“And remember: the p-value is not enough. Discuss/report differences in other (often more meaningful) ways. Data visualization, actual differences in context, effect sizes, etc.”

The end.