Comparing Two Means

Anna Shirokanova and Olesya Volchenko

February 18, 2019

Let’s compare mean values.

In the general case, 1) start with exploratory statistics (get the feel of your data); 2) check assumptions; 3) if assumptions hold, perform the t-test, else run the non-parametric test.

Start with data inspection and assumptions to decide on the test

library(foreign)
a <- read.spss("C:/Users/lssi5/Dropbox/Data Analysis II (2019)/data/alcohol.sav", use.value.labels = T, to.data.frame = T)
a$sex <- as.factor(a$sex)
a$sex <- relevel(a$sex, ref = "female")
a$smoke <- as.numeric(a$smoke)

Inspect the Distribution

descriptive statistics by groups

library(psych)
describeBy(a, a$sex)
## 
##  Descriptive statistics by group 
## group: female
##           vars     n  mean    sd median trimmed   mad min max range  skew
## sex*         1 10274  1.00  0.00      1    1.00  0.00   1   1     0   NaN
## age          2 10274 40.69 22.99     40   40.45 26.69   0 101   101  0.07
## material*    3  8579  3.49  1.16      4    3.53  1.48   1   5     4 -0.29
## marital*     4  8669  2.80  1.46      2    2.74  1.48   1   6     5  0.40
## alcohol*     5  3088  5.03  0.92      5    5.12  1.48   1   6     5 -0.86
## smoke        6  1177 15.98  5.91     15   14.98  2.97   3  51    48  2.31
##           kurtosis   se
## sex*           NaN 0.00
## age          -0.93 0.23
## material*    -1.03 0.01
## marital*     -1.27 0.02
## alcohol*      0.92 0.02
## smoke         7.26 0.17
## -------------------------------------------------------- 
## group: male
##           vars    n  mean    sd median trimmed   mad min max range  skew
## sex*         1 8098  2.00  0.00      2    2.00  0.00   2   2     0   NaN
## age          2 8098 34.72 21.22     33   34.00 25.20   0  94    94  0.24
## material*    3 6364  3.48  1.14      4    3.51  1.48   1   5     4 -0.24
## marital*     4 6469  2.11  1.06      2    1.96  1.48   1   6     5  1.16
## alcohol*     5 3465  4.35  1.13      4    4.40  1.48   1   6     5 -0.61
## smoke        6 3075 13.61  3.99     13   13.45  2.97   1  50    49  1.99
##           kurtosis   se
## sex*           NaN 0.00
## age          -0.77 0.24
## material*    -1.04 0.01
## marital*      1.00 0.01
## alcohol*      0.41 0.02
## smoke        12.60 0.07

Inspect the Distribution

histograms by groups

library(ggplot2)
ggplot(a, aes(x = smoke, color = sex, fill = sex)) +
      geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5) +
      labs(title = "Smoke by Sex histogram",x = "Started Smoking (age)", y = "Density") +
      theme_classic() +
      facet_grid(. ~ sex) 

If you are in a desert with no time for ggplot, this will also do:

par(mfrow = c(1, 2))
hist(a$smoke[a$sex == "female"], ylim = c(0, 2100))
hist(a$smoke[a$sex == "male"], ylim = c(0, 2100))

Inspect the Distribution

boxplots

boxplot(a$smoke ~ a$sex, 
        xlab = "Sex", 
        ylab = "Age Started Smoking", 
        par(mar = c(4,5,1,1)+.5), 
        names = c("Females", "Males"))

ggplot(a, aes(x = sex, y = smoke)) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, geom = "point", shape = 4, size = 4) +
  theme_classic() +
  ggtitle("Age Started Smoking by Sex of Respondent")
## Warning: Removed 14120 rows containing non-finite values (stat_boxplot).
## Warning: Removed 14120 rows containing non-finite values (stat_summary).

Results of Exploratory Analysis:

Let’s check the assumptions of t-test:

There are formal tests to check normality (Shapiro-Wilk, Kolmogorov-Smirnov test), but they are oversensitive (= tell that distribution is not normal even with slight deviations from normality). In addition, we care less about normality for large representative samples (over 300 obs.), following the Central Limit Theorem. See: http://thestatsgeek.com/2013/09/28/the-t-test-and-robustness-to-non-normality/

As an alternative to the formal test of normality, you can draw a Q-Q (quantile-quantile) plot.

QQ-plot for normality check

qqnorm(a$smoke[a$sex == "female"]); qqline(a$smoke[a$sex == "female"], col= 2)

qqnorm(a$smoke[a$sex == "male"]); qqline(a$smoke[a$sex == "female"], col= 2) #these two are not straight lines

Homogeneity (equality) of variances of DV across groups

t.test(a$smoke ~ as.factor(a$sex), var.equal = F) # Compare with:
## 
##  Welch Two Sample t-test
## 
## data:  a$smoke by as.factor(a$sex)
## t = 12.712, df = 1603.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.005746 2.737652
## sample estimates:
## mean in group female   mean in group male 
##             15.98471             13.61301
t.test(a$smoke ~ as.factor(a$sex))
## 
##  Welch Two Sample t-test
## 
## data:  a$smoke by as.factor(a$sex)
## t = 12.712, df = 1603.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.005746 2.737652
## sample estimates:
## mean in group female   mean in group male 
##             15.98471             13.61301

Conclusion: on average, males start to smoke at 13.6, while females at 16.0 years. The t-statistic is equal to 12.7 (p-value < 0.001), which means that the observed difference in means is statistically significant across the two groups.

Non-parametric Test for Two Independent Samples:

wilcox.test(smoke ~ sex, data = a)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  smoke by sex
## W = 2252600, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
# wilcox.test(a$smoke ~ a$sex)

Conclusion: the Wilcoxon rank sum W = 2252600 (p < 0.001), which means that the males and females start smoking at different ages and this difference is statistically significant.

(If you are in doubt, e.g. you have two groups of 29 observations, and reach differenct conclusions with non-parametric and parametric tests, stick to the non-parametric outcome.)

Paired Samples t-test

Example: alcohol consumption on weekdays and on weekends (same respondents). Data: ESS 2014, UK sample.

library(foreign)
GBdata <- read.spss("C:/Users/lssi5/Dropbox/Data Analysis II (2019)/data/ESS7GB.sav", to.data.frame = T, use.value.labels = T)
GBdata$alcwkdy <- as.numeric(as.character(GBdata$alcwkdy))
GBdata$alcwknd <- as.numeric(as.character(GBdata$alcwknd))
summary(GBdata$alcwkdy) # How many grams of alcohol respondent consumes on an average weekday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0    18.0    36.3    48.0   632.0     461
summary(GBdata$alcwknd) # How many grams of alcohol respondent consumes on an average weekend day
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   17.00   40.00   62.43   77.75 1948.00     462

Draw histograms to get the idea of the distributions:

#hist(GBdata$alcwkdy)
#hist(GBdata$alcwknd)
ggplot(GBdata, aes(x = alcwkdy)) +
      geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5) +
      geom_density(alpha=0.6) +
      labs(title = "Alcohol Consumption on Weekday, UK",x = "Alcohol consumed in grams", y = "Density") +
      theme_classic()

ggplot(GBdata, aes(x = alcwknd)) +
      geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5) +
      geom_density(alpha=0.6) +
      labs(title = "Alcohol Consumption on Weekend Day, UK",x = "Alcohol consumed in grams", y = "Density") +
      theme_classic()

ggplot(GBdata, aes(x = alcwknd - alcwkdy)) +
      geom_histogram(aes(y=..density..), position = "identity", alpha = 0.5) +
      geom_density(alpha=0.6) +
      labs(title = "The difference between alcohol consumed during weekends and weekdays in grams, UK",x = "The difference in grams", y = "Density") +
      theme_classic()

Now, the paired t-test itself

let’s compare whether respondents drink on an average weekend day MORE than on an average weekday.

mean(GBdata$alcwkdy - GBdata$alcwknd, na.rm = T) # average difference in grams
## [1] -26.14159
t.test(GBdata$alcwknd, GBdata$alcwkdy, paired = T)
## 
##  Paired t-test
## 
## data:  GBdata$alcwknd and GBdata$alcwkdy
## t = 13.119, df = 1800, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  22.23332 30.04986
## sample estimates:
## mean of the differences 
##                26.14159

Conclusion: The mean amount of alcohol consumed on a weekend day is 20 grams higher than on a weekday. The paired t(1180) = 10.248 (p < 0.001), with 95%CI = [15; 22] Therefore, we conclude that the difference is statistically significant and alcohol consumption on weekends is, indeed, higher.

Non-parametric equivalent:

wilcox.test(x = GBdata$alcwknd, y = GBdata$alcwkdy, paired = TRUE, alternative = "greater")
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  GBdata$alcwknd and GBdata$alcwkdy
## V = 771290, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0

Conclusion: Wilcoxon’s signed rank V = 277510 (p < 0.001), which means that alcohol consumption on weekend days is greater than on an average weekday, and this difference is statistically significant.

Some extras: bootstrapped CI and effect size for parametric t-test

Field (2016, p.558): “When testing the difference between independent means, Yuen proposed a test that compares trimmed means and uses a bootstrap to compute confidence intervals. The trimming reduces the impact of outliers. We can implement this method thanks to a function by Wilcox34 in the R software package.”

# Try this at home:
# library(yuen)
# yuenbt(smoke ~ as.factor(sex), data = a)

With a Cohen’s d of 0.5, 69 % of the treatment group will be above the mean of the control group (Cohen’s U3), 80 % of the two groups will overlap, and there is a 64 % chance that a person picked at random from the treatment group will have a higher score than a person picked at random from the control group (probability of superiority).

library(lsr)
## Warning: package 'lsr' was built under R version 3.5.2
cohensD(a$smoke ~ a$sex) 
## [1] 0.5156865

The end!