Anna Shirokanova and Olesya Volchenko
February 18, 2019
In the general case, 1) start with exploratory statistics (get the feel of your data); 2) check assumptions; 3) if assumptions hold, perform the t-test, else run the non-parametric test.
library(foreign)
a <- read.spss("C:/Users/lssi5/Dropbox/Data Analysis II (2019)/data/alcohol.sav", use.value.labels = T, to.data.frame = T)
a$sex <- as.factor(a$sex)
a$sex <- relevel(a$sex, ref = "female")
a$smoke <- as.numeric(a$smoke)
descriptive statistics by groups
library(psych)
describeBy(a, a$sex)
##
## Descriptive statistics by group
## group: female
## vars n mean sd median trimmed mad min max range skew
## sex* 1 10274 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## age 2 10274 40.69 22.99 40 40.45 26.69 0 101 101 0.07
## material* 3 8579 3.49 1.16 4 3.53 1.48 1 5 4 -0.29
## marital* 4 8669 2.80 1.46 2 2.74 1.48 1 6 5 0.40
## alcohol* 5 3088 5.03 0.92 5 5.12 1.48 1 6 5 -0.86
## smoke 6 1177 15.98 5.91 15 14.98 2.97 3 51 48 2.31
## kurtosis se
## sex* NaN 0.00
## age -0.93 0.23
## material* -1.03 0.01
## marital* -1.27 0.02
## alcohol* 0.92 0.02
## smoke 7.26 0.17
## --------------------------------------------------------
## group: male
## vars n mean sd median trimmed mad min max range skew
## sex* 1 8098 2.00 0.00 2 2.00 0.00 2 2 0 NaN
## age 2 8098 34.72 21.22 33 34.00 25.20 0 94 94 0.24
## material* 3 6364 3.48 1.14 4 3.51 1.48 1 5 4 -0.24
## marital* 4 6469 2.11 1.06 2 1.96 1.48 1 6 5 1.16
## alcohol* 5 3465 4.35 1.13 4 4.40 1.48 1 6 5 -0.61
## smoke 6 3075 13.61 3.99 13 13.45 2.97 1 50 49 1.99
## kurtosis se
## sex* NaN 0.00
## age -0.77 0.24
## material* -1.04 0.01
## marital* 1.00 0.01
## alcohol* 0.41 0.02
## smoke 12.60 0.07
histograms by groups
library(ggplot2)
ggplot(a, aes(x = smoke, color = sex, fill = sex)) +
geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5) +
labs(title = "Smoke by Sex histogram",x = "Started Smoking (age)", y = "Density") +
theme_classic() +
facet_grid(. ~ sex)
If you are in a desert with no time for ggplot, this will also do:
par(mfrow = c(1, 2))
hist(a$smoke[a$sex == "female"], ylim = c(0, 2100))
hist(a$smoke[a$sex == "male"], ylim = c(0, 2100))
boxplots
boxplot(a$smoke ~ a$sex,
xlab = "Sex",
ylab = "Age Started Smoking",
par(mar = c(4,5,1,1)+.5),
names = c("Females", "Males"))
ggplot(a, aes(x = sex, y = smoke)) +
geom_boxplot() +
stat_summary(fun.y = mean, geom = "point", shape = 4, size = 4) +
theme_classic() +
ggtitle("Age Started Smoking by Sex of Respondent")
## Warning: Removed 14120 rows containing non-finite values (stat_boxplot).
## Warning: Removed 14120 rows containing non-finite values (stat_summary).
There are formal tests to check normality (Shapiro-Wilk, Kolmogorov-Smirnov test), but they are oversensitive (= tell that distribution is not normal even with slight deviations from normality). In addition, we care less about normality for large representative samples (over 300 obs.), following the Central Limit Theorem. See: http://thestatsgeek.com/2013/09/28/the-t-test-and-robustness-to-non-normality/
As an alternative to the formal test of normality, you can draw a Q-Q (quantile-quantile) plot.
qqnorm(a$smoke[a$sex == "female"]); qqline(a$smoke[a$sex == "female"], col= 2)
qqnorm(a$smoke[a$sex == "male"]); qqline(a$smoke[a$sex == "female"], col= 2) #these two are not straight lines
t.test(a$smoke ~ as.factor(a$sex), var.equal = F) # Compare with:
##
## Welch Two Sample t-test
##
## data: a$smoke by as.factor(a$sex)
## t = 12.712, df = 1603.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.005746 2.737652
## sample estimates:
## mean in group female mean in group male
## 15.98471 13.61301
t.test(a$smoke ~ as.factor(a$sex))
##
## Welch Two Sample t-test
##
## data: a$smoke by as.factor(a$sex)
## t = 12.712, df = 1603.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.005746 2.737652
## sample estimates:
## mean in group female mean in group male
## 15.98471 13.61301
Conclusion: on average, males start to smoke at 13.6, while females at 16.0 years. The t-statistic is equal to 12.7 (p-value < 0.001), which means that the observed difference in means is statistically significant across the two groups.
wilcox.test(smoke ~ sex, data = a)
##
## Wilcoxon rank sum test with continuity correction
##
## data: smoke by sex
## W = 2252600, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
# wilcox.test(a$smoke ~ a$sex)
Conclusion: the Wilcoxon rank sum W = 2252600 (p < 0.001), which means that the males and females start smoking at different ages and this difference is statistically significant.
(If you are in doubt, e.g. you have two groups of 29 observations, and reach differenct conclusions with non-parametric and parametric tests, stick to the non-parametric outcome.)
Example: alcohol consumption on weekdays and on weekends (same respondents). Data: ESS 2014, UK sample.
library(foreign)
GBdata <- read.spss("C:/Users/lssi5/Dropbox/Data Analysis II (2019)/data/ESS7GB.sav", to.data.frame = T, use.value.labels = T)
GBdata$alcwkdy <- as.numeric(as.character(GBdata$alcwkdy))
GBdata$alcwknd <- as.numeric(as.character(GBdata$alcwknd))
summary(GBdata$alcwkdy) # How many grams of alcohol respondent consumes on an average weekday
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 18.0 36.3 48.0 632.0 461
summary(GBdata$alcwknd) # How many grams of alcohol respondent consumes on an average weekend day
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 17.00 40.00 62.43 77.75 1948.00 462
#hist(GBdata$alcwkdy)
#hist(GBdata$alcwknd)
ggplot(GBdata, aes(x = alcwkdy)) +
geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5) +
geom_density(alpha=0.6) +
labs(title = "Alcohol Consumption on Weekday, UK",x = "Alcohol consumed in grams", y = "Density") +
theme_classic()
ggplot(GBdata, aes(x = alcwknd)) +
geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5) +
geom_density(alpha=0.6) +
labs(title = "Alcohol Consumption on Weekend Day, UK",x = "Alcohol consumed in grams", y = "Density") +
theme_classic()
ggplot(GBdata, aes(x = alcwknd - alcwkdy)) +
geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5) +
geom_density(alpha=0.6) +
labs(title = "The difference between alcohol consumed during weekends and weekdays in grams, UK",x = "The difference in grams", y = "Density") +
theme_classic()
let’s compare whether respondents drink on an average weekend day MORE than on an average weekday.
mean(GBdata$alcwkdy - GBdata$alcwknd, na.rm = T) # average difference in grams
## [1] -26.14159
t.test(GBdata$alcwknd, GBdata$alcwkdy, paired = T)
##
## Paired t-test
##
## data: GBdata$alcwknd and GBdata$alcwkdy
## t = 13.119, df = 1800, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 22.23332 30.04986
## sample estimates:
## mean of the differences
## 26.14159
Conclusion: The mean amount of alcohol consumed on a weekend day is 20 grams higher than on a weekday. The paired t(1180) = 10.248 (p < 0.001), with 95%CI = [15; 22] Therefore, we conclude that the difference is statistically significant and alcohol consumption on weekends is, indeed, higher.
wilcox.test(x = GBdata$alcwknd, y = GBdata$alcwkdy, paired = TRUE, alternative = "greater")
##
## Wilcoxon signed rank test with continuity correction
##
## data: GBdata$alcwknd and GBdata$alcwkdy
## V = 771290, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0
Conclusion: Wilcoxon’s signed rank V = 277510 (p < 0.001), which means that alcohol consumption on weekend days is greater than on an average weekday, and this difference is statistically significant.
Field (2016, p.558): “When testing the difference between independent means, Yuen proposed a test that compares trimmed means and uses a bootstrap to compute confidence intervals. The trimming reduces the impact of outliers. We can implement this method thanks to a function by Wilcox34 in the R software package.”
# Try this at home:
# library(yuen)
# yuenbt(smoke ~ as.factor(sex), data = a)
With a Cohen’s d of 0.5, 69 % of the treatment group will be above the mean of the control group (Cohen’s U3), 80 % of the two groups will overlap, and there is a 64 % chance that a person picked at random from the treatment group will have a higher score than a person picked at random from the control group (probability of superiority).
library(lsr)
## Warning: package 'lsr' was built under R version 3.5.2
cohensD(a$smoke ~ a$sex)
## [1] 0.5156865