Comparing Two Means

Anna Shirokanova and Olesya Volchenko

March 11, 2021

Let’s compare mean values

In the general case,

  1. start with exploratory statistics (get the feel of your data);
  2. check the assumptions;
  3. if the assumptions hold, perform the t-test, else run the non-parametric test.

Is there just one “t-test”?

There are three kinds of the t-test:

  1. the one-sample t-test, which compares a sample mean with a hypothesized value:

\[ t = \frac{\bar{X} - \mu}{S/\sqrt{n-1}}\]

  2. the paired-samples t-test, which works on the mean of the pairwise differences:

\[ t = \frac{\bar{d}}{S_d/\sqrt{n}}\]

  3. the independent-samples t-test, which compares the means of two separate groups (the Welch form shown here allows unequal variances):

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}\]
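In R, all three correspond to calls to t.test(). A minimal sketch with made-up vectors x and y and a hypothesized value mu0 (illustrative values only, not the survey data used below):

# Hypothetical example vectors -- only to show the three call patterns
set.seed(1)
x <- rnorm(30, mean = 5); y <- rnorm(30, mean = 6); mu0 <- 5

t.test(x, mu = mu0)          # 1. one-sample t-test
t.test(x, y, paired = TRUE)  # 2. paired-samples t-test (x and y of equal length)
t.test(x, y)                 # 3. independent-samples (Welch) t-test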

What are the paired and independent samples?

Two samples are called paired when each observation of group 1 has a particular corresponding observation in group 2.

Examples: the same respondents measured twice (e.g., alcohol consumption on a weekday and on a weekend day, as in the paired-samples example below), or naturally matched pairs such as twins.

In paired samples, the sample size is therefore the same in the two groups.
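As a toy illustration (hypothetical numbers, not the survey data used below), paired data have one row per unit measured twice, while independent samples are two separate vectors that may differ in length:

# Paired: each unit (e.g., respondent) is measured twice -- toy values
paired_df <- data.frame(id     = 1:4,
                        before = c(5, 7, 6, 8),
                        after  = c(6, 9, 8, 8))
# Independent: two separate groups, possibly of different sizes
group1 <- c(5, 7, 6, 8)
group2 <- c(6, 9, 8)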

Independent-samples t-test

How to proceed?

In the general case,

  1. start with exploratory statistics (get the feel of your data);
  2. check assumptions;
  3. if assumptions hold, perform the t-test, else run the non-parametric test.

1. Start with data inspection and checking assumptions to decide on the test

Example: What is the average age at which males and females in Russia start smoking, among those who do? Is it the same for both? In practical terms, at what age should anti-smoking campaigns target male and female adolescents?

Data: RLMS-HSE, a survey of Russian households https://www.hse.ru/en/rlms/

Questionnaire: “Please remember your age when you began smoking. How old were you then?”

library(foreign)
a <- read.spss("alcohol.sav", use.value.labels = T, to.data.frame = T)
class(a$smoke)
## [1] "factor"
table(a$smoke)
## 
##   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23 
##   1   3  10  37  23  13  70  26 120 115 454 506 729 452 659 176 387  84  59  45 
##  24  25  26  27  28  29  30  31  32  33  34  35  36  38  39  40  42  43  45  46 
##  21  83  19  15   7   5  50   2   9   7   4  13   4   5   2   7   3   1   6   2 
##  47  48  49  50  53  55  56  57  59  60  63 
##   2   1   1   7   1   1   1   1   1   1   1
a$smoke <- as.numeric(as.character(a$smoke)) # convert the factor levels to numeric ages

Evaluate the normality of the distribution with numbers

statistics by groups

library(psych)
describeBy(a, a$sex) # age has such a wide range because one household member answers about all household members, even babies and the very old
## 
##  Descriptive statistics by group 
## group: male
##           vars    n  mean    sd median trimmed   mad min max range  skew
## sex*         1 8098  1.00  0.00      1    1.00  0.00   1   1     0   NaN
## age          2 8098 34.72 21.22     33   34.00 25.20   0  94    94  0.24
## material*    3 6364  3.48  1.14      4    3.51  1.48   1   5     4 -0.24
## marital*     4 6469  2.11  1.06      2    1.96  1.48   1   6     5  1.16
## alcohol*     5 3465  4.35  1.13      4    4.40  1.48   1   6     5 -0.61
## smoke        6 3075 16.63  4.10     16   16.45  2.97   4  60    56  2.45
##           kurtosis   se
## sex*           NaN 0.00
## age          -0.77 0.24
## material*    -1.04 0.01
## marital*      1.00 0.01
## alcohol*      0.41 0.02
## smoke        17.60 0.07
## ------------------------------------------------------------ 
## group: female
##           vars     n  mean    sd median trimmed   mad min max range  skew
## sex*         1 10274  2.00  0.00      2    2.00  0.00   2   2     0   NaN
## age          2 10274 40.69 22.99     40   40.45 26.69   0 101   101  0.07
## material*    3  8579  3.49  1.16      4    3.53  1.48   1   5     4 -0.29
## marital*     4  8669  2.80  1.46      2    2.74  1.48   1   6     5  0.40
## alcohol*     5  3088  5.03  0.92      5    5.12  1.48   1   6     5 -0.86
## smoke        6  1177 19.05  6.24     18   17.98  2.97   6  63    57  2.69
##           kurtosis   se
## sex*           NaN 0.00
## age          -0.93 0.23
## material*    -1.03 0.01
## marital*     -1.27 0.02
## alcohol*      0.92 0.02
## smoke        10.40 0.18

Skew and especially kurtosis indicate a non-normal distribution.

There are over 3,000 male smokers and over 1,100 female smokers in the sample.

The minimum reported age of starting to smoke is 4 years for males and 6 years for females.

Evaluate the normality of the distribution with graphs

histograms by groups

a$sex <- relevel(a$sex, ref = "female") # for ggplot colours

library(plyr)
mu <- ddply(a, "sex", summarise, grp.mean = mean(smoke, na.rm = T)) # calculate mean by group

library(ggplot2)
ggplot(a, aes(x = smoke, color = sex, fill = sex)) +
      geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5, binwidth = 3) +
      geom_vline(data = mu, aes(xintercept = grp.mean, color = sex), linetype = "dashed") +
      labs(title = "Smoke by Sex histogram", x = "Started Smoking (age)", y = "Density") +
      theme_classic() +
      theme(legend.position = "top")

ggplot(a, aes(x = smoke, color = sex, fill = sex)) +
      geom_density(alpha = 0.5) +
      labs(title = "Smoke by Sex histogram", x = "Started Smoking (age)", y = "Density") +
      theme_classic()

# if you are a respectable boomer with no time for ggplot, this will also do:

f <- a[a$sex == "female", ] # subsets for simple base-R histograms
m <- a[a$sex == "male", ]
par(mfrow = c(1,2))
hist(f$smoke, ylim = c(0, 2100))
hist(m$smoke, ylim = c(0, 2100))

The histograms show that the distribution in both groups is skewed to the right.

Compare the variance in the groups with boxplots

boxplots

ggplot(a, aes(x = sex, y = smoke)) + 
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4, size = 4) +
  theme_classic() +
  ggtitle("Age When Started Smoking by Sex of Respondent")

means <- aggregate(smoke ~  sex, a, mean) # calculate means on the side
par(mar = c(4,5,1,1)+.5) # set the margins of a plot right
boxplot(a$smoke ~ a$sex, 
        xlab = "Sex", ylab = "Age Started Smoking")
points(1:2, means$smoke, col = "red")
title("Age When Started Smoking by Sex of Respondent")

Results of Exploratory Analysis: the age of starting to smoke is right-skewed in both groups, the spread looks larger among females, and the group sizes differ considerably (about 3,100 male vs. 1,200 female smokers).

2. Let’s check the assumptions of the t-test

Normality checks by Tests
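The formal check can be done with a normality test per group. As a minimal sketch (output not shown), the Shapiro-Wilk test could be applied to the male and female subsets created above; note that shapiro.test() requires between 3 and 5000 non-missing values, which holds for both groups here:

# Shapiro-Wilk test of normality, run separately by sex (sketch, output omitted)
shapiro.test(m$smoke) # H0: the data come from a normal distribution
shapiro.test(f$smoke) # a small p-value means we reject normality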

Normality checks by Graphs

As an alternative to a formal test of normality, you can draw a Q-Q (quantile-quantile) plot.

Normality checks by Graphs (Q-Q)

A Normal distribution resembles this:
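If you want to generate such a reference picture yourself, you can draw the Q-Q plot of simulated normal data (a quick sketch using random values, not the survey data):

# Simulated standard-normal data: the points hug the straight reference line
set.seed(42)
z <- rnorm(1000)
qqnorm(z); qqline(z, col = 2)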

What we have in the data looks like this:

par(mfrow = c(1,2))
qqnorm(f$smoke); qqline(f$smoke, col = 2)
qqnorm(m$smoke); qqline(m$smoke, col = 2) # the points deviate from the straight reference lines

These two Q-Q plots do not look normal (heavy right tail).

Homogeneity (equality) of variances of the DV across groups

options(scipen = 99) # print very small p-values as decimals rather than in scientific notation

var(a$smoke[a$sex=="male"], na.rm = T)
## [1] 16.79787
var(a$smoke[a$sex=="female"], na.rm = T)
## [1] 38.89688
library(car)
leveneTest(a$smoke ~ a$sex)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value                Pr(>F)    
## group    1      80 < 0.00000000000000022 ***
##       4250                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bartlett.test(a$smoke ~ a$sex)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  a$smoke by a$sex
## Bartlett's K-squared = 331.84, df = 1, p-value < 0.00000000000000022

Both tests show that data like these would be very unlikely if the null hypothesis of equal variances were true. Thus, we conclude that the variances are not equal and use the Welch (unequal-variances) t-test.

3. Now run the t-test

t.test(a$smoke ~ a$sex, var.equal = F) 
## 
##  Welch Two Sample t-test
## 
## data:  a$smoke by a$sex
## t = 12.357, df = 1580.4, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.040116 2.809952
## sample estimates:
## mean in group female   mean in group male 
##             19.05268             16.62764

If you do not specify the variance assumption, the result is identical, because var.equal = FALSE (the Welch test) is the default in t.test(). (For comparison, a sketch of the pooled-variance version follows the output below.)

t.test(a$smoke ~ a$sex)
## 
##  Welch Two Sample t-test
## 
## data:  a$smoke by a$sex
## t = 12.357, df = 1580.4, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.040116 2.809952
## sample estimates:
## mean in group female   mean in group male 
##             19.05268             16.62764
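For comparison only (not part of the original analysis), the classical pooled-variance Student's t-test could be forced as below; given the unequal variances found earlier, the Welch version reported above remains the appropriate choice:

# Student's t-test with pooled variance -- shown only for comparison,
# not appropriate here because the variances differ between the sexes
t.test(a$smoke ~ a$sex, var.equal = TRUE)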

Conclusions:

  1. On average, males start smoking at 16.6 years, while females start at 19.1 years.

  2. The t-statistic is t(1580.4) = 12.36 (p < 0.001), so the observed difference in means between the two groups is statistically significant (the mean starting age is higher among females).

Results of the Assumptions’ Check: the distributions are clearly non-normal and the variances are unequal. The Welch t-test accounts for unequal variances, but given the non-normality it is prudent to cross-check the result with a non-parametric test.

Non-parametric Test for Two Independent Samples

wilcox.test(smoke ~ sex, data = a)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  smoke by sex
## W = 2252622, p-value < 0.00000000000000022
## alternative hypothesis: true location shift is not equal to 0

Conclusion: the Wilcoxon rank sum statistic W equals 2252622 (p < 0.001), which means that males and females start smoking at different ages, and this difference is statistically significant.

If you are in doubt about which test to apply (e.g., the distributions are non-normal) and the two tests give diverging conclusions, stick to the non-parametric result (but note that it does not report the difference in means).

Paired Samples t-test

Example: alcohol consumption on weekdays and on weekends (same respondents).

Data: ESS 2014, UK sample. https://www.europeansocialsurvey.org/download.html?file=ESS7GB&c=GB&y=2014

library(foreign)
GBdata <- read.spss("ESS7GB.sav", to.data.frame = T, use.value.labels = T)
GBdata$alcwkdy <- as.numeric(as.character(GBdata$alcwkdy))
GBdata$alcwknd <- as.numeric(as.character(GBdata$alcwknd))
summary(GBdata$alcwkdy) # How many grams of alcohol the respondent consumes on an average weekday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0    18.0    36.3    48.0   632.0     461
summary(GBdata$alcwknd) # How many grams of alcohol the respondent consumes on an average weekend day
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   17.00   40.00   62.43   77.75 1948.00     462

Draw a histogram of the differences to get an idea of their distribution, which the paired t-test assumes to be normal:

#hist(GBdata$alcwkdy - GBdata$alcwknd)
ggplot(GBdata, aes(x = alcwkdy - alcwknd)) +
      geom_histogram(aes(y =..density..), position = "identity", alpha = 0.5, binwidth = 10) +
      geom_density(alpha = 0.6) +
      labs(title = "Difference in Alcohol Consumption on Weekend and Weekday, UK", 
           x = "Difference in grams", 
           y = "Density") +
      theme_classic()

The distribution of the differences looks rather left-skewed and leptokurtic, but it roughly follows a bell shape.
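The same shape can also be checked numerically (a sketch; psych was loaded earlier, and describe() reports skew and kurtosis for the vector of differences):

# Numeric check of the shape of the plotted differences (output not shown)
describe(GBdata$alcwkdy - GBdata$alcwknd)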

Now, the paired t-test itself: let's test whether respondents drink differently on an average weekend day than on an average weekday.

mean(GBdata$alcwknd - GBdata$alcwkdy, na.rm = T) # average difference in grams
## [1] 26.14159
t.test(GBdata$alcwknd, GBdata$alcwkdy, paired = T)
## 
##  Paired t-test
## 
## data:  GBdata$alcwknd and GBdata$alcwkdy
## t = 13.119, df = 1800, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  22.23332 30.04986
## sample estimates:
## mean of the differences 
##                26.14159

Conclusion: the mean amount of alcohol consumed on a weekend day is 26.1 grams higher than on a weekday. The paired t(1800) = 13.12 (p < 0.001), with a 95% confidence interval of [22.23; 30.05] grams for the true mean difference. Therefore, we conclude that the difference is statistically significant: alcohol consumption on weekend days is indeed higher.

Non-parametric equivalent

wilcox.test(x = GBdata$alcwknd, y = GBdata$alcwkdy, paired = TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  GBdata$alcwknd and GBdata$alcwkdy
## V = 771289, p-value < 0.00000000000000022
## alternative hypothesis: true location shift is not equal to 0

Conclusion: Wilcoxon’s signed rank statistic V = 771289 (p < 0.001), which means that alcohol consumption on weekend days differs from that on weekdays (and, given the positive mean difference, is higher); this difference is statistically significant.

Both tests have reached the same conclusion.

Effect size for the t-test

library(effsize)
cohen.d(a$smoke ~ a$sex, na.rm = T)
## 
## Cohen's d
## 
## d estimate: 0.5066159 (medium)
## 95 percent confidence interval:
##     lower     upper 
## 0.4385600 0.5746717

With a Cohen’s d of 0.5, 69 % of the “treatment” group will be above the mean of the “control” group (Cohen’s U3), 80 % of the two groups will overlap, and there is a 64 % chance that a person picked at random from the treatment (female) group will have a higher score than a person picked at random from the control (male) group.

For comparison, with a Cohen’s d of 0.3, 62 % of the treatment group would be above the mean of the control group (Cohen’s U3), 88 % of the two groups would overlap, and there would be a 58 % chance that a person picked at random from the treatment group has a higher score than a person picked at random from the control group (probability of superiority). Moreover, in order to have one more favorable outcome in the treatment group compared to the control group, we would need to treat 10.6 people; that is, if 100 people go through the treatment, 9.4 more people will have a favorable outcome than if they had received the control treatment.
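Under the usual assumption of two normal distributions with equal spread, these summaries can be computed directly from d. A small sketch (the number needed to treat additionally requires an assumed control event rate, so it is not computed here):

# Converting Cohen's d into U3, overlap and probability of superiority
d <- 0.5
U3          <- pnorm(d)               # share of the treatment group above the control mean
overlap     <- 2 * pnorm(-abs(d) / 2) # overlapping coefficient
superiority <- pnorm(d / sqrt(2))     # P(random treatment case > random control case)
round(c(U3 = U3, overlap = overlap, superiority = superiority), 2)
# gives U3 = 0.69, overlap = 0.80, superiority = 0.64; with d = 0.3 these become 0.62, 0.88, 0.58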

library(rstatix)
rstatix::wilcox_effsize(smoke ~ sex, data = a, na.rm = T) # for non-parametric
## # A tibble: 1 x 7
##   .y.   group1 group2 effsize    n1    n2 magnitude
## * <chr> <chr>  <chr>    <dbl> <int> <int> <ord>    
## 1 smoke female male     0.191 10274  8098 small

The effect size of the Mann-Whitney-Wilcoxon test is 0.19 (small), even though the test is statistically significant. So, overall, the difference in the age of first smoking between males and females is small to medium in size.

More to read:

Recap independent-samples t-test in pictures

Source: Dr. Allison Horst https://twitter.com/allison_horst/status/1216411185240690688?s=20

“But couldn’t our conclusion be wrong? Yes! Type I errors (based on your samples you conclude there is a significant difference in means, when there actually isn’t)”:

“Or Type II errors (based on your samples you conclude there isn’t a significant difference, when there actually is)”:
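A quick simulation (not from the original slides) illustrates the Type I error rate: when there is no true difference between the groups, about 5% of t-tests still come out "significant" at the 0.05 level:

# Simulate many t-tests under the null hypothesis of no difference
set.seed(1)
pvals <- replicate(10000, t.test(rnorm(50), rnorm(50))$p.value)
mean(pvals < 0.05) # proportion of false positives, close to 0.05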

“And remember: the p-value is not enough. Discuss/report differences in other (often more meaningful) ways. Data visualization, actual differences in context, effect sizes, etc.”

The end.