At some point in our work we face the need of making comparisons between two (or more) groups of observations
The classical method for comparing the means of two groups is to do a t-test, which assumes:
Here’s how you would do this in R, using the PRB data for now:
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
prb<-read_csv("https://raw.githubusercontent.com/coreysparks/data/master/PRB2013_new.csv", col_names=T)
## Parsed with column specification:
## cols(
## .default = col_integer(),
## Country = col_character(),
## Continent = col_character(),
## Region = col_character(),
## Population = col_double(),
## Rate_of_natural_increase = col_double(),
## ProjectedPopMid2025 = col_double(),
## ProjectedPopMid2050 = col_double(),
## ProjectedPopChange_08_50Perc = col_double(),
## IMR = col_double(),
## TFR = col_double(),
## PercPop1549HIVAIDS1995 = col_double(),
## PercPop1549HIVAIDS2011_2013 = col_double(),
## GDPGrowth2000_2006 = col_double(),
## GDPGrowth2007_2011 = col_double()
## )
## See spec(...) for full column specifications.
names(prb)<-tolower(names(prb))
Let’s say we want to test for equality of the total fertility rate in African vs Non-African countries. To do this analysis completely, I would calculate summary statistics for each group, produce a graphical summary of the data, categorized by group and only then do the test. Remember it’s best to explore your data first, before jumping into testing!
prb_new<-prb%>%
mutate(Africa=ifelse(prb$continent=="Africa",yes= "Africa",no= "Not Africa"))
#summary statistics by group
prb_new%>%
group_by(Africa)%>%
summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())
## # A tibble: 2 x 4
## Africa means sds n
## <chr> <dbl> <dbl> <int>
## 1 Africa 4.612727 1.418535 55
## 2 Not Africa 2.250980 0.888532 153
#boxplot
prb_new%>%
ggplot(aes(x=Africa, y=tfr))+geom_boxplot()
#test
t.test(tfr~Africa, data=prb_new)
##
## Welch Two Sample t-test
##
## data: tfr by Africa
## t = 11.559, df = 69.813, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.954226 2.769268
## sample estimates:
## mean in group Africa mean in group Not Africa
## 4.612727 2.250980
So, based on the things we just did, we can conclude: 1) The summary statistics suggest that the means are very different between the two groups, African countries have a TFR of 4.61 on average, compared to non-African countries, who’s TFR is 2.25. That’s a difference of almost 2.4 children per woman on average. 2) The graphical summary supports the statement that African countries have a overall higher distribution, but there are some countries whose fertility rate is lower that some non-African countries, and vice verse. 3) The t-test shows that the means are significantly different from one another. How do I conclude this? First, the t-statistic is 11.59, with 69.8 degrees of freedom. The probability of seeing a value this large from this t-distribution is:
#get the various parts of t statistic
n1<-length(prb_new$tfr[prb_new$Africa=="Africa"])
n2<-length(prb_new$tfr[prb_new$Africa=="Not Africa"])
m1<-mean(prb_new$tfr[prb_new$Africa=="Africa"], na.rm=T)
m2<-mean(prb_new$tfr[prb_new$Africa=="Not Africa"], na.rm=T)
s1<-var(prb_new$tfr[prb_new$Africa=="Africa"], na.rm=T)
s2<-var(prb_new$tfr[prb_new$Africa=="Not Africa"], na.rm=T)
#calculate t
t<-(m1-m2 )/ sqrt(s1/n1 +s2/n2)
#get the p value
2*pt(-abs(t), df=n2-n1)
## [1] 5.226801e-20
which is very, very small. So this would say that we reject the null hypothesis that the means are equal with great confidence.
We’ve learned a new test that we can use under a specific situation!!
See this video for how I feel about having tests that do one thing only.
I propose that we try to minimize the use of uni-tasker tests, in favor of using more general classes of models that can be used in many settings.
I learned these things from this text, and it’s really good, and certainly not the only game in town. It’s also kind of old now. What I learned from this book is that the linear model is extremely flexible and useful in many, many settings, besides just linear regression.
I also read this book, which, as a basis for teaching everything, uses the linear model as the jumping off point; which made a lot of sense to me, of course.
I’m not going to review linear statistical models here, there are 1,000 page books for that, and Field covers this well in chapters 7, 10 and 11 of his book.
So, this would be how I would write a linear model to compare two group means. This assumes your dependent variable is continuous.
Some Definitions: \(y_i\) = Your outcome measure on each of the \(i = 1 \cdots n\) observations Group = A binary representation of your two groups (see above: Africa, Not Africa) \(\beta_0\) = the mean of the “baseline” group - you choose what group this is \(\beta_1\) = how much the mean of the comparison group differs from the baseline group \(\epsilon\) = residual variation in the mean
The model would then be:
\(y_i = \beta_0 + \beta_1*\text{Group} + \epsilon_i\)
Assuming:
\(\epsilon_i \sim \text{Normal}(0, \sigma^2_{\epsilon})\)
is, that we estimate one group’s mean to be \(\beta_0\) and the express the other group’s mean as: \(\beta_0 + \beta_1\).
We then test the hypothesis that \(\beta_1 = 0\). Or, is the difference in mean’s between the two groups equal 0.
Here is a model like the one we did above for the African - Non African comparison: \(\text{TFR}_i = \beta_0 + \beta_1*\text{Not Africa} + e_i\)
Where \(\beta_0\) is the mean TFR in Africa, and \(\beta_1\) describes how the mean of the Non-African countries relates to the mean of the African countries.
e contains all the information on TFR that the difference between groups doesn’t explain, and is called the residual.
Now we do the test:
library(broom)
africa_fit<-lm(tfr~Africa, data=prb_new)
tidy(africa_fit)
## term estimate std.error statistic p.value
## 1 (Intercept) 4.612727 0.1420638 32.46941 5.803242e-83
## 2 AfricaNot Africa -2.361747 0.1656416 -14.25818 1.523522e-32
So, the labels here are a little different, \(\beta_0\) is labeled as (Intercept) and \(\beta_1\) is labeled as AfricaNot Africa, indicating that the “Not Africa” group is being compared to the “Africa” group.
the mean for African countries is \(\beta_0\) = 4.61 and the mean for the Non African countries is \(\beta_0 + \beta_1\) = 2.25, based on the estimated parameters.
How does this compare with our simpler descriptive reality?
#summary statistics by group
prb_new%>%
group_by(Africa)%>%
summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())
## # A tibble: 2 x 4
## Africa means sds n
## <chr> <dbl> <dbl> <int>
## 1 Africa 4.612727 1.418535 55
## 2 Not Africa 2.250980 0.888532 153
Spot on!
We need to evaluate a key assumption about the errors of the model.
qqnorm(rstudent(africa_fit), main="Q-Q Plot for Model Residuals")
qqline(rstudent(africa_fit), col="red")
Not bad, if everything was perfect, then all the dots would line up on the line.
There’s a formal, but overly sensitive test for normality that we can use here:
shapiro.test(resid(africa_fit))
##
## Shapiro-Wilk normality test
##
## data: resid(africa_fit)
## W = 0.96692, p-value = 8.517e-05
Which fails the normality test. This is pretty important assumption of the linear model in many settings, here not so much. But we should do our due diligence. If you have evidence of non-normality of your residuals, you can attempt to transform the dependent variable via a set of approaches.
Things we can try are the: natural log transform = log(x) square root transform = sqrt(x) reciprocal transformation = 1/x
We’ll try the log transform first:
africa_fit<-lm( log(tfr)~Africa, data=prb_new)
tidy(africa_fit)
## term estimate std.error statistic p.value
## 1 (Intercept) 1.4726415 0.04771856 30.86098 3.446976e-79
## 2 AfricaNot Africa -0.7269834 0.05563823 -13.06626 8.154790e-29
So everything is on a differnt scale now, and doesn’t look right. We can recover the real means by exponentiating the coefficients:
#African mean
exp(coef(africa_fit)["(Intercept)"])
## (Intercept)
## 4.360739
exp(sum(coef(africa_fit)))
## [1] 2.107828
The means are a little different from the un-transformed data, but we’re still right on target with our analysis, and our interpretation has not changed.
We can examine the normality assumption here:
qqnorm(rstudent(africa_fit), main="Q-Q Plot for Model Residuals")
qqline(rstudent(africa_fit), col="red")
shapiro.test(resid(africa_fit))
##
## Shapiro-Wilk normality test
##
## data: resid(africa_fit)
## W = 0.98851, p-value = 0.0932
The dots line up better and the Shapiro-Wilk test has a p value larger that typical \(\alpha\) levels, so it looks like we are good.
Now let’s open a ‘really real’ data file. This is a sample from the 2015 1-year American Community Survey microdata, meaning that each row in these data is a person who responded to the survey in 2015.
I’ve done an extract (do example in class) and stored the data in a stata format on my github data site. The file we are using is called usa_00045.dta.
There is also a codebook that describes the data and all the response levels for each variable in the data. They are also on my github data page, and called Codebook_DEM7273_IPUMS2015.
I can read it from github directly by using the read_dta() function in the haven library:
library(haven)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
newpums<-ipums%>%
filter(labforce==2, age>=18, incwage>0)%>%
mutate(mywage= ifelse(incwage%in%c(999998,999999), NA,incwage),
sexrecode=ifelse(sex==1, "male", "female"))
newpums%>%
group_by(sexrecode)%>%
summarise(means=mean(mywage, na.rm=T))
## # A tibble: 2 x 2
## sexrecode means
## <chr> <dbl>
## 1 female 39446.25
## 2 male 59532.69
newpums%>%
ggplot(aes(x=sexrecode, y=mywage))+geom_boxplot()
t.test(mywage~sexrecode, data=newpums)
##
## Welch Two Sample t-test
##
## data: mywage by sexrecode
## t = -63.019, df = 115890, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -20711.17 -19461.73
## sample estimates:
## mean in group female mean in group male
## 39446.25 59532.69
sexinc<-lm(mywage~sexrecode, data=newpums)
tidy(sexinc)
## term estimate std.error statistic p.value
## 1 (Intercept) 39446.25 233.7754 168.73568 0
## 2 sexrecodemale 20086.45 324.3544 61.92746 0
So, men make more money on average than women. The difference is about $20,000 dollars.
Want to see some really non-normal residuals?
qqnorm(rstudent(sexinc), main="Q-Q Plot for Model Residuals")
qqline(rstudent(sexinc), col="red")
We can try transforms:
sexinc2<-lm(log(mywage)~sexrecode, data=newpums)
sexinc3<-lm(sqrt(mywage)~sexrecode, data=newpums)
sexinc4<-lm(I(1/mywage)~sexrecode, data=newpums)
qqnorm(rstudent(sexinc2), main="Q-Q Plot for Model Residuals")
qqline(rstudent(sexinc2), col="red")
qqnorm(rstudent(sexinc3), main="Q-Q Plot for Model Residuals")
qqline(rstudent(sexinc3), col="red")
qqnorm(rstudent(sexinc4), main="Q-Q Plot for Model Residuals")
#qqline(rstudent(sexinc4), col="red")
Honestly, none of these make me very happy.