Comparing Two Means

Anna Shirokanova and Olesya Volchenko

February 17, 2020

Let’s compare mean values

In the general case,

  1. start with exploratory statistics (get the feel of your data);
  2. check assumptions;
  3. if assumptions hold, perform the t-test, else run the non-parametric test.

Is there just one “t-test”?

There are three kinds of the t-test:

\[ t = \frac{\bar{X} - \mu}{S/\sqrt{n-1}}\]

\[ t = \frac{\bar{d}}{S_d/\sqrt{n}}\]

\[ t = \frac{\bar{X_1} - \bar{X_2} }{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}\]

What are the paired and independent samples?

Two samples are called paired when each observation of group 1 has a particular corresponding observation in group 2.

Examples:

In paired samples, the sample size is the same in the two groups.

Independent-samples t-test

How to proceed?

In the general case,

  1. start with exploratory statistics (get the feel of your data);
  2. check assumptions;
  3. if assumptions hold, perform the t-test, else run the non-parametric test.

1. Start with data inspection and checking assumptions to decide on the test

Example: What is the average age when males and females in Russia start smoking, when they do? Is it the same? In practical terms, what is the age at which to target anti-smoking campaigns for male and female adolescents?

Data: RLMS-HSE, a survey of Russian households https://www.hse.ru/en/rlms/

Questionnaire: “Please remember your age when you began smoking. How old were you then?”

library(foreign)
a <- read.spss("alcohol.sav", use.value.labels = T, to.data.frame = T)

a$sex <- as.factor(a$sex)

a$smoke <- as.numeric(a$smoke)

Evaluate the normality of distribution with numbers

statistics by groups

library(psych)
describeBy(a, a$sex)
## 
##  Descriptive statistics by group 
## group: male
##           vars    n  mean    sd median trimmed   mad min max range  skew
## sex*         1 8098  1.00  0.00      1    1.00  0.00   1   1     0   NaN
## age          2 8098 34.72 21.22     33   34.00 25.20   0  94    94  0.24
## material*    3 6364  3.48  1.14      4    3.51  1.48   1   5     4 -0.24
## marital*     4 6469  2.11  1.06      2    1.96  1.48   1   6     5  1.16
## alcohol*     5 3465  4.35  1.13      4    4.40  1.48   1   6     5 -0.61
## smoke        6 3075 13.61  3.99     13   13.45  2.97   1  50    49  1.99
##           kurtosis   se
## sex*           NaN 0.00
## age          -0.77 0.24
## material*    -1.04 0.01
## marital*      1.00 0.01
## alcohol*      0.41 0.02
## smoke        12.60 0.07
## -------------------------------------------------------- 
## group: female
##           vars     n  mean    sd median trimmed   mad min max range  skew
## sex*         1 10274  2.00  0.00      2    2.00  0.00   2   2     0   NaN
## age          2 10274 40.69 22.99     40   40.45 26.69   0 101   101  0.07
## material*    3  8579  3.49  1.16      4    3.53  1.48   1   5     4 -0.29
## marital*     4  8669  2.80  1.46      2    2.74  1.48   1   6     5  0.40
## alcohol*     5  3088  5.03  0.92      5    5.12  1.48   1   6     5 -0.86
## smoke        6  1177 15.98  5.91     15   14.98  2.97   3  51    48  2.31
##           kurtosis   se
## sex*           NaN 0.00
## age          -0.93 0.23
## material*    -1.03 0.01
## marital*     -1.27 0.02
## alcohol*      0.92 0.02
## smoke         7.26 0.17

Skew and especially kurtosis indicate non-normal distribution.

Let’s also replace those ages below 6 with 6 years old (or else you could filter them losing observations).

#a$smoke[a$smoke < 6] <- 5 # you can replace the suspicious observations (like those who say they started smoking at 5 years old or earlier)
#describeBy(a, a$sex)

Evaluate the normality of distribution with graphs

histograms by groups

a$sex <- relevel(a$sex, ref = "female") # for ggplot colours

library(ggplot2)
ggplot(a, aes(x = smoke, color = sex, fill = sex)) +
      geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5, binwidth = 3) +
      labs(title = "Smoke by Sex histogram",x = "Started Smoking (age)", y = "Density") +
      theme_classic()

ggplot(a, aes(x = smoke, color = sex, fill = sex)) +
      geom_density(alpha = 0.5) +
      labs(title = "Smoke by Sex histogram",x = "Started Smoking (age)", y = "Density") +
      theme_classic()

#if you are a respectable boomer with no time for ggplot, this will also do:

f <- subset(a[a$sex == "female",]) # for simple histograms
m <- subset(a[a$sex == "male",])
par(mfrow = c(1,2))
hist(f$smoke, ylim = c(0, 2100))
hist(m$smoke, ylim = c(0, 2100))

Histograms show that distribution in both groups is skewed to the right.

Compare the variance in the groups with boxplots

boxplots

ggplot(a, aes(x = sex, y = smoke)) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, geom = "point", shape = 4, size = 4) +
  theme_classic() +
  ggtitle("Age When Started Smoking by Sex of Respondent")

means <- aggregate(smoke ~  sex, a, mean) # calculate means on the side
par(mar = c(4,5,1,1)+.5) # set the margins of a plot right
boxplot(a$smoke ~ a$sex, 
        xlab = "Sex", ylab = "Age Started Smoking")
points(1:2, means$smoke, col = "red")
title("Age When Started Smoking by Sex of Respondent")

Results of Exploratory Analysis

2. Let’s check the assumptions of the t-test

Normality checks by Tests

Normality checks by Graphs

Alternatively to the formal test of normality, you can draw a Q-Q (quantile-quantile) plot.

Normality checks by Graphs (Q-Q)

A Normal distribution resembles this:

What we have in the data looks like this:

par(mfrow = c(1,2))
qqnorm(f$smoke); qqline(f$smoke, col= 2)
qqnorm(m$smoke); qqline(m$smoke, col= 2) # these two are not straight lines

These two Q-Q plots do not look normal.

Homogeneity (equality) of variances of the DV across groups

options(scipen = 99) # for small numbers to show zeroes
library(car)
leveneTest(a$smoke ~ a$sex)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value                Pr(>F)    
## group    1  81.853 < 0.00000000000000022 ***
##       4250                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bartlett.test(a$smoke ~ a$sex)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  a$smoke by a$sex
## Bartlett's K-squared = 289.28, df = 1, p-value <
## 0.00000000000000022

Both tests show that the data have a very low probability to occur if null hypothesis is true. Thus, the variances are not equal.

3. Now run the t-test

t.test(a$smoke ~ a$sex, var.equal = F) 
## 
##  Welch Two Sample t-test
## 
## data:  a$smoke by a$sex
## t = 12.712, df = 1603.4, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.005746 2.737652
## sample estimates:
## mean in group female   mean in group male 
##             15.98471             13.61301

If you do not specify unequal variances, the results is equivalent:

t.test(a$smoke ~ a$sex)
## 
##  Welch Two Sample t-test
## 
## data:  a$smoke by a$sex
## t = 12.712, df = 1603.4, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.005746 2.737652
## sample estimates:
## mean in group female   mean in group male 
##             15.98471             13.61301

Conclusions:

  1. on average, males start to smoke at 13.6, while females at 16.0 years.

  2. The t-statistic t(1603.4) = 12.7 (p-value < 0.001), which means that the observed difference in means is statistically significant across the two groups (higher among females).

Results of the Assumptions’ Check

Non-parametric Test for Two Independent Samples

wilcox.test(smoke ~ sex, data = a)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  smoke by sex
## W = 2252600, p-value < 0.00000000000000022
## alternative hypothesis: true location shift is not equal to 0

Conclusion: the Wilcoxon rank sum W equals 2252600 at p < 0.001, which means that males and females start smoking at different ages and this difference is statistically significant.

If you are in doubt about which test to apply, e.g. you have two groups of 29 observations, and if the two tests give diverging conclusions, stick to the non-parametric result.

Paired Samples t-test

Example: alcohol consumption on weekdays and on weekends (same respondents).

Data: ESS 2014, UK sample. https://www.europeansocialsurvey.org/download.html?file=ESS7GB&c=GB&y=2014

library(foreign)
GBdata <- read.spss("ESS7GB.sav", to.data.frame = T, use.value.labels = T)
GBdata$alcwkdy <- as.numeric(as.character(GBdata$alcwkdy))
GBdata$alcwknd <- as.numeric(as.character(GBdata$alcwknd))
summary(GBdata$alcwkdy) # How many grams of alcohol respondent consumes on an average weekday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0    18.0    36.3    48.0   632.0     461
summary(GBdata$alcwknd) # How many grams of alcohol respondent consumes on an average weekend day
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   17.00   40.00   62.43   77.75 1948.00     462

Draw the histogram of differences to get the idea of the distribution assumed to be normal:

#hist(GBdata$alcwkdy - GBdata$alcwknd)
ggplot(GBdata, aes(x = alcwkdy - alcwknd)) +
      geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5, binwidth = 10) +
      geom_density(alpha=0.6) +
      labs(title = "Difference in Alcohol Consumption on Weekend and Weekday, UK", 
           x = "Difference in grams", 
           y = "Density") +
      theme_classic()

The distribution looks rather left-skewed and leptokurtic, but it also follows the bell shape.

Now, the paired t-test itself: let’s compare whether respondents drink on an average weekend day differently than on an average weekday.

mean(GBdata$alcwknd - GBdata$alcwkdy, na.rm = T) # average difference in grams
## [1] 26.14159
t.test(GBdata$alcwknd, GBdata$alcwkdy, paired = T)
## 
##  Paired t-test
## 
## data:  GBdata$alcwknd and GBdata$alcwkdy
## t = 13.119, df = 1800, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  22.23332 30.04986
## sample estimates:
## mean of the differences 
##                26.14159

Conclusion: The mean amount of alcohol consumed on a weekend day is 25.5 grams higher than on a weekday. The paired t(1800) = 14.70 (p < 0.001), with 95% confidence intervals = [22.1; 28.9] for the true difference between the two means. Therefore, we conclude that the difference is statistically significant and alcohol consumption on weekends is, indeed, higher.

Non-parametric equivalent

wilcox.test(x = GBdata$alcwknd, y = GBdata$alcwkdy, paired = TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  GBdata$alcwknd and GBdata$alcwkdy
## V = 771290, p-value < 0.00000000000000022
## alternative hypothesis: true location shift is not equal to 0

Conclusion: Wilcoxon’s signed rank V = 771290 (p < 0.001), which means that alcohol consumption on weekend days is greater than on an average weekday, and this difference is statistically significant.

Both tests have reached the same conclusion.

Effect size for the t-test

library(effsize)
cohen.d(a$smoke ~ a$sex, na.rm = T)
## 
## Cohen's d
## 
## d estimate: 0.5156865 (medium)
## 95 percent confidence interval:
##     lower     upper 
## 0.4475998 0.5837731

With a Cohen’s d of 0.5, 69 % of the “treatment” group will be above the mean of the “control” group (Cohen’s U3), 80 % of the two groups will overlap, and there is a 64 % chance that a person picked at random from the treatment (weekend) group will have a higher score than a person picked at random from the control (weekday) group.

With a Cohen’s d of 0.3, 62 % of the treatment group will be above the mean of the control group (Cohen’s U3), 88 % of the two groups will overlap, and there is a 58 % chance that a person picked at random from the treatment group will have a higher score than a person picked at random from the control group (probability of superiority). Moreover, in order to have one more favorable outcome in the treatment group compared to the control group, we need to treat 10.6 people. This means that if 100 people go through the treatment, 9.4 more people will have a favorable outcome compared to if they had received the control treatment.

Recap independent-samples t-test in pictures

Source: Dr. Allison Horst https://twitter.com/allison_horst/status/1216411185240690688?s=20

“But couldn’t our conclusion be wrong? Yes! Type I errors (based on your samples you conclude there is a significant difference in means, when there actually isn’t)”:

“Or Type II errors (based on your samples you conclude there isn’t a significant difference, when there actually is)”:

“And remember: the p-value is not enough. Discuss/report differences in other (often more meaningful) ways. Data visualization, actual differences in context, effect sizes, etc.”

The end.