Anna Shirokanova and Olesya Volchenko
February 17, 2020
In the general case,
There are three kinds of the t-test:
\[ t = \frac{\bar{X} - \mu}{S/\sqrt{n-1}}\]
\[ t = \frac{\bar{d}}{S_d/\sqrt{n}}\]
\[ t = \frac{\bar{X_1} - \bar{X_2} }{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}\]
Two samples are called paired when each observation of group 1 has a particular corresponding observation in group 2.
Examples:
husbands (group 1) and wives (group 2) evaluating wedding photo services
students’ performance before the test (1) and after the test (2)
In paired samples, the sample size is the same in the two groups.
can be calculated when there is a continuous variable and a grouping variable with 2 levels
answers the question whether the mean value is the same across two different groups
has three assumptions:
In the general case,
Example: What is the average age when males and females in Russia start smoking, when they do? Is it the same? In practical terms, what is the age at which to target anti-smoking campaigns for male and female adolescents?
Data: RLMS-HSE, a survey of Russian households https://www.hse.ru/en/rlms/
Questionnaire: “Please remember your age when you began smoking. How old were you then?”
statistics by groups
##
## Descriptive statistics by group
## group: male
## vars n mean sd median trimmed mad min max range skew
## sex* 1 8098 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## age 2 8098 34.72 21.22 33 34.00 25.20 0 94 94 0.24
## material* 3 6364 3.48 1.14 4 3.51 1.48 1 5 4 -0.24
## marital* 4 6469 2.11 1.06 2 1.96 1.48 1 6 5 1.16
## alcohol* 5 3465 4.35 1.13 4 4.40 1.48 1 6 5 -0.61
## smoke 6 3075 13.61 3.99 13 13.45 2.97 1 50 49 1.99
## kurtosis se
## sex* NaN 0.00
## age -0.77 0.24
## material* -1.04 0.01
## marital* 1.00 0.01
## alcohol* 0.41 0.02
## smoke 12.60 0.07
## --------------------------------------------------------
## group: female
## vars n mean sd median trimmed mad min max range skew
## sex* 1 10274 2.00 0.00 2 2.00 0.00 2 2 0 NaN
## age 2 10274 40.69 22.99 40 40.45 26.69 0 101 101 0.07
## material* 3 8579 3.49 1.16 4 3.53 1.48 1 5 4 -0.29
## marital* 4 8669 2.80 1.46 2 2.74 1.48 1 6 5 0.40
## alcohol* 5 3088 5.03 0.92 5 5.12 1.48 1 6 5 -0.86
## smoke 6 1177 15.98 5.91 15 14.98 2.97 3 51 48 2.31
## kurtosis se
## sex* NaN 0.00
## age -0.93 0.23
## material* -1.03 0.01
## marital* -1.27 0.02
## alcohol* 0.92 0.02
## smoke 7.26 0.17
Skew and especially kurtosis indicate non-normal distribution.
Let’s also replace those ages below 6 with 6 years old (or else you could filter them losing observations).
histograms by groups
a$sex <- relevel(a$sex, ref = "female") # for ggplot colours
library(ggplot2)
ggplot(a, aes(x = smoke, color = sex, fill = sex)) +
geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5, binwidth = 3) +
labs(title = "Smoke by Sex histogram",x = "Started Smoking (age)", y = "Density") +
theme_classic()
ggplot(a, aes(x = smoke, color = sex, fill = sex)) +
geom_density(alpha = 0.5) +
labs(title = "Smoke by Sex histogram",x = "Started Smoking (age)", y = "Density") +
theme_classic()
#if you are a respectable boomer with no time for ggplot, this will also do:
f <- subset(a[a$sex == "female",]) # for simple histograms
m <- subset(a[a$sex == "male",])
par(mfrow = c(1,2))
hist(f$smoke, ylim = c(0, 2100))
hist(m$smoke, ylim = c(0, 2100))
Histograms show that distribution in both groups is skewed to the right.
boxplots
ggplot(a, aes(x = sex, y = smoke)) +
geom_boxplot() +
stat_summary(fun.y = mean, geom = "point", shape = 4, size = 4) +
theme_classic() +
ggtitle("Age When Started Smoking by Sex of Respondent")
means <- aggregate(smoke ~ sex, a, mean) # calculate means on the side
par(mar = c(4,5,1,1)+.5) # set the margins of a plot right
boxplot(a$smoke ~ a$sex,
xlab = "Sex", ylab = "Age Started Smoking")
points(1:2, means$smoke, col = "red")
title("Age When Started Smoking by Sex of Respondent")
they are oversensitive (= tell that distribution is not normal even when slight deviations from normality occur).
In addition, we care less about normality for large representative samples (over 300 obs.), following the Central Limit Theorem.
See: http://thestatsgeek.com/2013/09/28/the-t-test-and-robustness-to-non-normality/ and https://notstatschat.rbind.io/2019/02/09/what-have-i-got-against-the-shapiro-wilk-test/
Alternatively to the formal test of normality, you can draw a Q-Q (quantile-quantile) plot.
A Normal distribution resembles this:
What we have in the data looks like this:
par(mfrow = c(1,2))
qqnorm(f$smoke); qqline(f$smoke, col= 2)
qqnorm(m$smoke); qqline(m$smoke, col= 2) # these two are not straight lines
These two Q-Q plots do not look normal.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 81.853 < 0.00000000000000022 ***
## 4250
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Bartlett test of homogeneity of variances
##
## data: a$smoke by a$sex
## Bartlett's K-squared = 289.28, df = 1, p-value <
## 0.00000000000000022
Both tests show that the data have a very low probability to occur if null hypothesis is true. Thus, the variances are not equal.
##
## Welch Two Sample t-test
##
## data: a$smoke by a$sex
## t = 12.712, df = 1603.4, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.005746 2.737652
## sample estimates:
## mean in group female mean in group male
## 15.98471 13.61301
If you do not specify unequal variances, the results is equivalent:
##
## Welch Two Sample t-test
##
## data: a$smoke by a$sex
## t = 12.712, df = 1603.4, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.005746 2.737652
## sample estimates:
## mean in group female mean in group male
## 15.98471 13.61301
Conclusions:
on average, males start to smoke at 13.6, while females at 16.0 years.
The t-statistic t(1603.4) = 12.7 (p-value < 0.001), which means that the observed difference in means is statistically significant across the two groups (higher among females).
##
## Wilcoxon rank sum test with continuity correction
##
## data: smoke by sex
## W = 2252600, p-value < 0.00000000000000022
## alternative hypothesis: true location shift is not equal to 0
Conclusion: the Wilcoxon rank sum W equals 2252600 at p < 0.001, which means that males and females start smoking at different ages and this difference is statistically significant.
If you are in doubt about which test to apply, e.g. you have two groups of 29 observations, and if the two tests give diverging conclusions, stick to the non-parametric result.
Example: alcohol consumption on weekdays and on weekends (same respondents).
Data: ESS 2014, UK sample. https://www.europeansocialsurvey.org/download.html?file=ESS7GB&c=GB&y=2014
library(foreign)
GBdata <- read.spss("ESS7GB.sav", to.data.frame = T, use.value.labels = T)
GBdata$alcwkdy <- as.numeric(as.character(GBdata$alcwkdy))
GBdata$alcwknd <- as.numeric(as.character(GBdata$alcwknd))
summary(GBdata$alcwkdy) # How many grams of alcohol respondent consumes on an average weekday
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 18.0 36.3 48.0 632.0 461
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 17.00 40.00 62.43 77.75 1948.00 462
Draw the histogram of differences to get the idea of the distribution assumed to be normal:
#hist(GBdata$alcwkdy - GBdata$alcwknd)
ggplot(GBdata, aes(x = alcwkdy - alcwknd)) +
geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5, binwidth = 10) +
geom_density(alpha=0.6) +
labs(title = "Difference in Alcohol Consumption on Weekend and Weekday, UK",
x = "Difference in grams",
y = "Density") +
theme_classic()
The distribution looks rather left-skewed and leptokurtic, but it also follows the bell shape.
Now, the paired t-test itself: let’s compare whether respondents drink on an average weekend day differently than on an average weekday.
## [1] 26.14159
##
## Paired t-test
##
## data: GBdata$alcwknd and GBdata$alcwkdy
## t = 13.119, df = 1800, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 22.23332 30.04986
## sample estimates:
## mean of the differences
## 26.14159
Conclusion: The mean amount of alcohol consumed on a weekend day is 25.5 grams higher than on a weekday. The paired t(1800) = 14.70 (p < 0.001), with 95% confidence intervals = [22.1; 28.9] for the true difference between the two means. Therefore, we conclude that the difference is statistically significant and alcohol consumption on weekends is, indeed, higher.
Non-parametric equivalent
##
## Wilcoxon signed rank test with continuity correction
##
## data: GBdata$alcwknd and GBdata$alcwkdy
## V = 771290, p-value < 0.00000000000000022
## alternative hypothesis: true location shift is not equal to 0
Conclusion: Wilcoxon’s signed rank V = 771290 (p < 0.001), which means that alcohol consumption on weekend days is greater than on an average weekday, and this difference is statistically significant.
Both tests have reached the same conclusion.
##
## Cohen's d
##
## d estimate: 0.5156865 (medium)
## 95 percent confidence interval:
## lower upper
## 0.4475998 0.5837731
With a Cohen’s d of 0.5, 69 % of the “treatment” group will be above the mean of the “control” group (Cohen’s U3), 80 % of the two groups will overlap, and there is a 64 % chance that a person picked at random from the treatment (weekend) group will have a higher score than a person picked at random from the control (weekday) group.
With a Cohen’s d of 0.3, 62 % of the treatment group will be above the mean of the control group (Cohen’s U3), 88 % of the two groups will overlap, and there is a 58 % chance that a person picked at random from the treatment group will have a higher score than a person picked at random from the control group (probability of superiority). Moreover, in order to have one more favorable outcome in the treatment group compared to the control group, we need to treat 10.6 people. This means that if 100 people go through the treatment, 9.4 more people will have a favorable outcome compared to if they had received the control treatment.
Source: Dr. Allison Horst https://twitter.com/allison_horst/status/1216411185240690688?s=20
“But couldn’t our conclusion be wrong? Yes! Type I errors (based on your samples you conclude there is a significant difference in means, when there actually isn’t)”:
“Or Type II errors (based on your samples you conclude there isn’t a significant difference, when there actually is)”:
“And remember: the p-value is not enough. Discuss/report differences in other (often more meaningful) ways. Data visualization, actual differences in context, effect sizes, etc.”
The end.