Comparing Two Means

Anna Shirokanova and Olesya Volchenko

February 17, 2020

Let’s compare mean values

Mean values refer to continuous variables.
Mean values can be compared across groups.
If there are two groups, t-test is the parametric test you need to make a conclusion about the two populations.
If there are more than two groups, use one-way ANOVA (coming soon).

In the general case,

start with exploratory statistics (get the feel of your data);
check assumptions;
if assumptions hold, perform the t-test, else run the non-parametric test.

Is there just one “t-test”?

There are three kinds of the t-test:

comparing the sample mean to the population parameter (the “one-sample t-test”)

\[ t = \frac{\bar{X} - \mu}{S/\sqrt{n-1}}\]

comparing the means of paired samples (“paired-samples t-test”)

\[ t = \frac{\bar{d}}{S_d/\sqrt{n}}\]

comparing the means of two different groups (“independent-samples t-test”)

\[ t = \frac{\bar{X_1} - \bar{X_2} }{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}\]

What are the paired and independent samples?

Two samples are called paired when each observation of group 1 has a particular corresponding observation in group 2.

Examples:

husbands (group 1) and wives (group 2) evaluating wedding photo services
students’ performance before the test (1) and after the test (2)

In paired samples, the sample size is the same in the two groups.

Independent-samples t-test

can be calculated when there is a continuous variable and a grouping variable with 2 levels
answers the question whether the mean value is the same across two different groups
has three assumptions:
- normal distribution of the continuous variable in each group
- equal variances of the continuous variable in each group
- sample size is 30 or more in the smaller group

How to proceed?

In the general case,

start with exploratory statistics (get the feel of your data);
check assumptions;
if assumptions hold, perform the t-test, else run the non-parametric test.

1. Start with data inspection and checking assumptions to decide on the test

Example: What is the average age when males and females in Russia start smoking, when they do? Is it the same? In practical terms, what is the age at which to target anti-smoking campaigns for male and female adolescents?

Data: RLMS-HSE, a survey of Russian households https://www.hse.ru/en/rlms/

Questionnaire: “Please remember your age when you began smoking. How old were you then?”

library(foreign)
a <- read.spss("alcohol.sav", use.value.labels = T, to.data.frame = T)

a$sex <- as.factor(a$sex)

a$smoke <- as.numeric(a$smoke)

Evaluate the normality of distribution with numbers

statistics by groups

library(psych)
describeBy(a, a$sex)

## 
##  Descriptive statistics by group 
## group: male
##           vars    n  mean    sd median trimmed   mad min max range  skew
## sex*         1 8098  1.00  0.00      1    1.00  0.00   1   1     0   NaN
## age          2 8098 34.72 21.22     33   34.00 25.20   0  94    94  0.24
## material*    3 6364  3.48  1.14      4    3.51  1.48   1   5     4 -0.24
## marital*     4 6469  2.11  1.06      2    1.96  1.48   1   6     5  1.16
## alcohol*     5 3465  4.35  1.13      4    4.40  1.48   1   6     5 -0.61
## smoke        6 3075 13.61  3.99     13   13.45  2.97   1  50    49  1.99
##           kurtosis   se
## sex*           NaN 0.00
## age          -0.77 0.24
## material*    -1.04 0.01
## marital*      1.00 0.01
## alcohol*      0.41 0.02
## smoke        12.60 0.07
## -------------------------------------------------------- 
## group: female
##           vars     n  mean    sd median trimmed   mad min max range  skew
## sex*         1 10274  2.00  0.00      2    2.00  0.00   2   2     0   NaN
## age          2 10274 40.69 22.99     40   40.45 26.69   0 101   101  0.07
## material*    3  8579  3.49  1.16      4    3.53  1.48   1   5     4 -0.29
## marital*     4  8669  2.80  1.46      2    2.74  1.48   1   6     5  0.40
## alcohol*     5  3088  5.03  0.92      5    5.12  1.48   1   6     5 -0.86
## smoke        6  1177 15.98  5.91     15   14.98  2.97   3  51    48  2.31
##           kurtosis   se
## sex*           NaN 0.00
## age          -0.93 0.23
## material*    -1.03 0.01
## marital*     -1.27 0.02
## alcohol*      0.92 0.02
## smoke         7.26 0.17

Skew and especially kurtosis indicate non-normal distribution.

Let’s also replace those ages below 6 with 6 years old (or else you could filter them losing observations).

#a$smoke[a$smoke < 6] <- 5 # you can replace the suspicious observations (like those who say they started smoking at 5 years old or earlier)
#describeBy(a, a$sex)

Evaluate the normality of distribution with graphs

histograms by groups

a$sex <- relevel(a$sex, ref = "female") # for ggplot colours

library(ggplot2)
ggplot(a, aes(x = smoke, color = sex, fill = sex)) +
      geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5, binwidth = 3) +
      labs(title = "Smoke by Sex histogram",x = "Started Smoking (age)", y = "Density") +
      theme_classic()

ggplot(a, aes(x = smoke, color = sex, fill = sex)) +
      geom_density(alpha = 0.5) +
      labs(title = "Smoke by Sex histogram",x = "Started Smoking (age)", y = "Density") +
      theme_classic()

#if you are a respectable boomer with no time for ggplot, this will also do:

f <- subset(a[a$sex == "female",]) # for simple histograms
m <- subset(a[a$sex == "male",])
par(mfrow = c(1,2))
hist(f$smoke, ylim = c(0, 2100))
hist(m$smoke, ylim = c(0, 2100))

Histograms show that distribution in both groups is skewed to the right.

Compare the variance in the groups with boxplots

boxplots

ggplot(a, aes(x = sex, y = smoke)) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, geom = "point", shape = 4, size = 4) +
  theme_classic() +
  ggtitle("Age When Started Smoking by Sex of Respondent")

means <- aggregate(smoke ~  sex, a, mean) # calculate means on the side
par(mar = c(4,5,1,1)+.5) # set the margins of a plot right
boxplot(a$smoke ~ a$sex, 
        xlab = "Sex", ylab = "Age Started Smoking")
points(1:2, means$smoke, col = "red")
title("Age When Started Smoking by Sex of Respondent")

Results of Exploratory Analysis

both groups have thousands of observations
females have a higher age of first smoking
both distributions are rather symmetric but have outliers

2. Let’s check the assumptions of the t-test

Normal distribution of the DV (=dependent variable) by group;
Homogeneous variances in the groups.

Normality checks by Tests

There are formal tests to check normality (Shapiro-Wilk test, Kolmogorov-Smirnov test), but:
they are oversensitive (= tell that distribution is not normal even when slight deviations from normality occur).
In addition, we care less about normality for large representative samples (over 300 obs.), following the Central Limit Theorem.
See: http://thestatsgeek.com/2013/09/28/the-t-test-and-robustness-to-non-normality/ and https://notstatschat.rbind.io/2019/02/09/what-have-i-got-against-the-shapiro-wilk-test/

Normality checks by Graphs

Alternatively to the formal test of normality, you can draw a Q-Q (quantile-quantile) plot.

The Q-Q plot shows the variable’s distribution against the normal distribution. For interpretation, see: https://goo.gl/ZyGjzK and https://stats.stackexchange.com/questions/101274/how-to-interpret-a-qq-plot
In a normal distribution, there will be a straight line.
In a skewed distribution, the line will resemble a snake hanging from a tree branch (as one quant psychologist once put it).
A non-normal distribution also has high absolute values of skew and kurtosis in descriptive stats.
See: https://data.library.virginia.edu/understanding-q-q-plots/

Normality checks by Graphs (Q-Q)

A Normal distribution resembles this:

What we have in the data looks like this:

par(mfrow = c(1,2))
qqnorm(f$smoke); qqline(f$smoke, col= 2)
qqnorm(m$smoke); qqline(m$smoke, col= 2) # these two are not straight lines

These two Q-Q plots do not look normal.

Homogeneity (equality) of variances of the DV across groups

If variances are not equal, Welch’s correction should be applied to the t-test.
In R, the regular t.test uses Welch’s correction by default.
Formal tests for homogeneity of variances also exist, e.g. Levene’s test or Bartlett’s test (but they are not needed if other tests are wisely used)

options(scipen = 99) # for small numbers to show zeroes
library(car)
leveneTest(a$smoke ~ a$sex)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value                Pr(>F)    
## group    1  81.853 < 0.00000000000000022 ***
##       4250                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

bartlett.test(a$smoke ~ a$sex)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  a$smoke by a$sex
## Bartlett's K-squared = 289.28, df = 1, p-value <
## 0.00000000000000022

Both tests show that the data have a very low probability to occur if null hypothesis is true. Thus, the variances are not equal.

3. Now run the t-test

the distributions of the DV are not normal but n is really high, so non-normality can be effectively ignored
variances are not equal, so Welch’s correction should be applied

t.test(a$smoke ~ a$sex, var.equal = F)

## 
##  Welch Two Sample t-test
## 
## data:  a$smoke by a$sex
## t = 12.712, df = 1603.4, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.005746 2.737652
## sample estimates:
## mean in group female   mean in group male 
##             15.98471             13.61301

If you do not specify unequal variances, the results is equivalent:

t.test(a$smoke ~ a$sex)

## 
##  Welch Two Sample t-test
## 
## data:  a$smoke by a$sex
## t = 12.712, df = 1603.4, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.005746 2.737652
## sample estimates:
## mean in group female   mean in group male 
##             15.98471             13.61301

Conclusions:

on average, males start to smoke at 13.6, while females at 16.0 years.
The t-statistic t(1603.4) = 12.7 (p-value < 0.001), which means that the observed difference in means is statistically significant across the two groups (higher among females).

Results of the Assumptions’ Check

normally, a large sample means that normality is not that much of a concern
if distributions are not normal, you can bootstrap standard errors (re-sample them many times)
if distributions are not normal, you might also want to do both parametric and non-parametric tests and compare the results –

Non-parametric Test for Two Independent Samples

no assumptions about the distribution apply
it is slightly less powerful than the t-test if the assumptions hold

wilcox.test(smoke ~ sex, data = a)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  smoke by sex
## W = 2252600, p-value < 0.00000000000000022
## alternative hypothesis: true location shift is not equal to 0

Conclusion: the Wilcoxon rank sum W equals 2252600 at p < 0.001, which means that males and females start smoking at different ages and this difference is statistically significant.

If you are in doubt about which test to apply, e.g. you have two groups of 29 observations, and if the two tests give diverging conclusions, stick to the non-parametric result.

Paired Samples t-test

Example: alcohol consumption on weekdays and on weekends (same respondents).

Data: ESS 2014, UK sample. https://www.europeansocialsurvey.org/download.html?file=ESS7GB&c=GB&y=2014

library(foreign)
GBdata <- read.spss("ESS7GB.sav", to.data.frame = T, use.value.labels = T)
GBdata$alcwkdy <- as.numeric(as.character(GBdata$alcwkdy))
GBdata$alcwknd <- as.numeric(as.character(GBdata$alcwknd))
summary(GBdata$alcwkdy) # How many grams of alcohol respondent consumes on an average weekday

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0    18.0    36.3    48.0   632.0     461

summary(GBdata$alcwknd) # How many grams of alcohol respondent consumes on an average weekend day

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   17.00   40.00   62.43   77.75 1948.00     462

Draw the histogram of differences to get the idea of the distribution assumed to be normal:

#hist(GBdata$alcwkdy - GBdata$alcwknd)
ggplot(GBdata, aes(x = alcwkdy - alcwknd)) +
      geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5, binwidth = 10) +
      geom_density(alpha=0.6) +
      labs(title = "Difference in Alcohol Consumption on Weekend and Weekday, UK", 
           x = "Difference in grams", 
           y = "Density") +
      theme_classic()

The distribution looks rather left-skewed and leptokurtic, but it also follows the bell shape.

Now, the paired t-test itself: let’s compare whether respondents drink on an average weekend day differently than on an average weekday.

mean(GBdata$alcwknd - GBdata$alcwkdy, na.rm = T) # average difference in grams

## [1] 26.14159

t.test(GBdata$alcwknd, GBdata$alcwkdy, paired = T)

## 
##  Paired t-test
## 
## data:  GBdata$alcwknd and GBdata$alcwkdy
## t = 13.119, df = 1800, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  22.23332 30.04986
## sample estimates:
## mean of the differences 
##                26.14159

Conclusion: The mean amount of alcohol consumed on a weekend day is 25.5 grams higher than on a weekday. The paired t(1800) = 14.70 (p < 0.001), with 95% confidence intervals = [22.1; 28.9] for the true difference between the two means. Therefore, we conclude that the difference is statistically significant and alcohol consumption on weekends is, indeed, higher.

Non-parametric equivalent

If the assumptions of t-test do not hold, use the non-parametric Mann-Whitney’s U/ Wilcoxon’s Rank Sum test.

wilcox.test(x = GBdata$alcwknd, y = GBdata$alcwkdy, paired = TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  GBdata$alcwknd and GBdata$alcwkdy
## V = 771290, p-value < 0.00000000000000022
## alternative hypothesis: true location shift is not equal to 0

Conclusion: Wilcoxon’s signed rank V = 771290 (p < 0.001), which means that alcohol consumption on weekend days is greater than on an average weekday, and this difference is statistically significant.

Both tests have reached the same conclusion.

Effect size for the t-test

effect size of the t-test shows how strong the statistically significant difference is, given the sample sizes
a popular measure in the literature is Cohen’s D
for interpretation, see: http://rpsychologist.com/d3/cohend/

library(effsize)
cohen.d(a$smoke ~ a$sex, na.rm = T)

## 
## Cohen's d
## 
## d estimate: 0.5156865 (medium)
## 95 percent confidence interval:
##     lower     upper 
## 0.4475998 0.5837731

With a Cohen’s d of 0.5, 69 % of the “treatment” group will be above the mean of the “control” group (Cohen’s U3), 80 % of the two groups will overlap, and there is a 64 % chance that a person picked at random from the treatment (weekend) group will have a higher score than a person picked at random from the control (weekday) group.

With a Cohen’s d of 0.3, 62 % of the treatment group will be above the mean of the control group (Cohen’s U3), 88 % of the two groups will overlap, and there is a 58 % chance that a person picked at random from the treatment group will have a higher score than a person picked at random from the control group (probability of superiority). Moreover, in order to have one more favorable outcome in the treatment group compared to the control group, we need to treat 10.6 people. This means that if 100 people go through the treatment, 9.4 more people will have a favorable outcome compared to if they had received the control treatment.

Recap independent-samples t-test in pictures

Source: Dr. Allison Horst https://twitter.com/allison_horst/status/1216411185240690688?s=20

“But couldn’t our conclusion be wrong? Yes! Type I errors (based on your samples you conclude there is a significant difference in means, when there actually isn’t)”:

“Or Type II errors (based on your samples you conclude there isn’t a significant difference, when there actually is)”:

“And remember: the p-value is not enough. Discuss/report differences in other (often more meaningful) ways. Data visualization, actual differences in context, effect sizes, etc.”

The end.