Comparing groups

At some point in our work we face the need of making comparisons between two (or more) groups of observations

We gave this group Drug A and the other group Drug G, and we want to know if group A did better
Is the poverty rate different in towns along the US-Mexico border than in towns not adjacent to the border?
Many times our data come with a priori group structure, as in the first example
More often, we end up constructing a group variable from other data, as in the second example
If you are collecting your own data (such as an experimental setting), you will often specify comparison groups when you design your survey
But if you are using secondary data (collected by someone else) they may not have collected data in the same way as you need it, so you have to recode the data into another form.

Inspecting variation due to group structure

Initial investigation of differences across groups in the univariate setting is very common in social research.
We inspect our variables for differences between groups. This is often an initial step in larger analyses
It is often done to see if certain basic differences exist, then more complicated analyses are undertaken
Generally a first-step kind of approach

Traditional approach

The classical method for comparing the means of two groups is to do a t-test, which assumes:

The two samples have been drawn from normal distribution
The observations within each sample are independent of one another
The sample sizes of the two samples, n1 and n2 are large (>30 each)
The the sample standard deviations, $s_1$ and $s_2$ are equal

Here’s how you would do this in R, using the PRB data for now:

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
prb<-read_csv("https://raw.githubusercontent.com/coreysparks/data/master/PRB2013_new.csv", col_names=T)

## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   Country = col_character(),
##   Continent = col_character(),
##   Region = col_character(),
##   Population = col_double(),
##   Rate_of_natural_increase = col_double(),
##   ProjectedPopMid2025 = col_double(),
##   ProjectedPopMid2050 = col_double(),
##   ProjectedPopChange_08_50Perc = col_double(),
##   IMR = col_double(),
##   TFR = col_double(),
##   PercPop1549HIVAIDS1995 = col_double(),
##   PercPop1549HIVAIDS2011_2013 = col_double(),
##   GDPGrowth2000_2006 = col_double(),
##   GDPGrowth2007_2011 = col_double()
## )

## See spec(...) for full column specifications.

names(prb)<-tolower(names(prb))

Let’s say we want to test for equality of the total fertility rate in African vs Non-African countries. To do this analysis completely, I would calculate summary statistics for each group, produce a graphical summary of the data, categorized by group and only then do the test. Remember it’s best to explore your data first, before jumping into testing!

prb_new<-prb%>%
  mutate(Africa=ifelse(prb$continent=="Africa",yes= "Africa",no= "Not Africa"))

#summary statistics by group
prb_new%>%
  group_by(Africa)%>%
  summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())

## # A tibble: 2 x 4
##       Africa    means      sds     n
##        <chr>    <dbl>    <dbl> <int>
## 1     Africa 4.612727 1.418535    55
## 2 Not Africa 2.250980 0.888532   153

#boxplot
prb_new%>%
  ggplot(aes(x=Africa, y=tfr))+geom_boxplot()

#test
t.test(tfr~Africa, data=prb_new)

## 
##  Welch Two Sample t-test
## 
## data:  tfr by Africa
## t = 11.559, df = 69.813, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.954226 2.769268
## sample estimates:
##     mean in group Africa mean in group Not Africa 
##                 4.612727                 2.250980

So, based on the things we just did, we can conclude: 1) The summary statistics suggest that the means are very different between the two groups, African countries have a TFR of 4.61 on average, compared to non-African countries, who’s TFR is 2.25. That’s a difference of almost 2.4 children per woman on average. 2) The graphical summary supports the statement that African countries have a overall higher distribution, but there are some countries whose fertility rate is lower that some non-African countries, and vice verse. 3) The t-test shows that the means are significantly different from one another. How do I conclude this? First, the t-statistic is 11.59, with 69.8 degrees of freedom. The probability of seeing a value this large from this t-distribution is:

#get the various parts of t statistic
n1<-length(prb_new$tfr[prb_new$Africa=="Africa"])
n2<-length(prb_new$tfr[prb_new$Africa=="Not Africa"])
m1<-mean(prb_new$tfr[prb_new$Africa=="Africa"], na.rm=T)
m2<-mean(prb_new$tfr[prb_new$Africa=="Not Africa"], na.rm=T)
s1<-var(prb_new$tfr[prb_new$Africa=="Africa"], na.rm=T)
s2<-var(prb_new$tfr[prb_new$Africa=="Not Africa"], na.rm=T)

#calculate t
t<-(m1-m2 )/ sqrt(s1/n1 +s2/n2)
#get the p value
2*pt(-abs(t), df=n2-n1)

## [1] 5.226801e-20

which is very, very small. So this would say that we reject the null hypothesis that the means are equal with great confidence.

This is great!

We’ve learned a new test that we can use under a specific situation!!

See this video for how I feel about having tests that do one thing only.

I propose that we try to minimize the use of uni-tasker tests, in favor of using more general classes of models that can be used in many settings.

Linear statistical models

I learned these things from this text, and it’s really good, and certainly not the only game in town. It’s also kind of old now. What I learned from this book is that the linear model is extremely flexible and useful in many, many settings, besides just linear regression.

I also read this book, which, as a basis for teaching everything, uses the linear model as the jumping off point; which made a lot of sense to me, of course.

What this is not

I’m not going to review linear statistical models here, there are 1,000 page books for that, and Field covers this well in chapters 7, 10 and 11 of his book.

Linear model to compare two groups.

So, this would be how I would write a linear model to compare two group means. This assumes your dependent variable is continuous.

Some Definitions: $y_i$ = Your outcome measure on each of the $i = 1 \cdots n$ observations Group = A binary representation of your two groups (see above: Africa, Not Africa) $\beta_0$ = the mean of the “baseline” group - you choose what group this is $\beta_1$ = how much the mean of the comparison group differs from the baseline group $\epsilon$ = residual variation in the mean

The model would then be:

$y_i = \beta_0 + \beta_1*\text{Group} + \epsilon_i$

Assuming:

$\epsilon_i \sim \text{Normal}(0, \sigma^2_{\epsilon})$

What this says:

is, that we estimate one group’s mean to be $\beta_0$ and the express the other group’s mean as: $\beta_0 + \beta_1$.

We then test the hypothesis that $\beta_1 = 0$. Or, is the difference in mean’s between the two groups equal 0.

Here is a model like the one we did above for the African - Non African comparison: $\text{TFR}_i = \beta_0 + \beta_1*\text{Not Africa} + e_i$

Where $\beta_0$ is the mean TFR in Africa, and $\beta_1$ describes how the mean of the Non-African countries relates to the mean of the African countries.

e contains all the information on TFR that the difference between groups doesn’t explain, and is called the residual.

Now we do the test:

library(broom)
africa_fit<-lm(tfr~Africa, data=prb_new)
tidy(africa_fit)

##               term  estimate std.error statistic      p.value
## 1      (Intercept)  4.612727 0.1420638  32.46941 5.803242e-83
## 2 AfricaNot Africa -2.361747 0.1656416 -14.25818 1.523522e-32

So, the labels here are a little different, $\beta_0$ is labeled as (Intercept) and $\beta_1$ is labeled as AfricaNot Africa, indicating that the “Not Africa” group is being compared to the “Africa” group.

the mean for African countries is $\beta_0$ = 4.61 and the mean for the Non African countries is $\beta_0 + \beta_1$ = 2.25, based on the estimated parameters.

How does this compare with our simpler descriptive reality?

#summary statistics by group
prb_new%>%
  group_by(Africa)%>%
  summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())

## # A tibble: 2 x 4
##       Africa    means      sds     n
##        <chr>    <dbl>    <dbl> <int>
## 1     Africa 4.612727 1.418535    55
## 2 Not Africa 2.250980 0.888532   153

Spot on!

Wait!

We need to evaluate a key assumption about the errors of the model.

qqnorm(rstudent(africa_fit), main="Q-Q Plot for Model Residuals")
qqline(rstudent(africa_fit), col="red")

Not bad, if everything was perfect, then all the dots would line up on the line.

There’s a formal, but overly sensitive test for normality that we can use here:

shapiro.test(resid(africa_fit))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(africa_fit)
## W = 0.96692, p-value = 8.517e-05

Which fails the normality test. This is pretty important assumption of the linear model in many settings, here not so much. But we should do our due diligence. If you have evidence of non-normality of your residuals, you can attempt to transform the dependent variable via a set of approaches.

Typical transformations

Things we can try are the: natural log transform = log(x) square root transform = sqrt(x) reciprocal transformation = 1/x

We’ll try the log transform first:

africa_fit<-lm( log(tfr)~Africa, data=prb_new)
tidy(africa_fit)

##               term   estimate  std.error statistic      p.value
## 1      (Intercept)  1.4726415 0.04771856  30.86098 3.446976e-79
## 2 AfricaNot Africa -0.7269834 0.05563823 -13.06626 8.154790e-29

So everything is on a differnt scale now, and doesn’t look right. We can recover the real means by exponentiating the coefficients:

#African mean
exp(coef(africa_fit)["(Intercept)"])

## (Intercept) 
##    4.360739

exp(sum(coef(africa_fit)))

## [1] 2.107828

The means are a little different from the un-transformed data, but we’re still right on target with our analysis, and our interpretation has not changed.

We can examine the normality assumption here:

qqnorm(rstudent(africa_fit), main="Q-Q Plot for Model Residuals")
qqline(rstudent(africa_fit), col="red")

shapiro.test(resid(africa_fit))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(africa_fit)
## W = 0.98851, p-value = 0.0932

The dots line up better and the Shapiro-Wilk test has a p value larger that typical $\alpha$ levels, so it looks like we are good.

Really Real data example

Now let’s open a ‘really real’ data file. This is a sample from the 2015 1-year American Community Survey microdata, meaning that each row in these data is a person who responded to the survey in 2015.

I’ve done an extract (do example in class) and stored the data in a stata format on my github data site. The file we are using is called usa_00045.dta.

There is also a codebook that describes the data and all the response levels for each variable in the data. They are also on my github data page, and called Codebook_DEM7273_IPUMS2015.

I can read it from github directly by using the read_dta() function in the haven library:

library(haven)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")

newpums<-ipums%>%
  filter(labforce==2, age>=18, incwage>0)%>%
  mutate(mywage= ifelse(incwage%in%c(999998,999999), NA,incwage),
         sexrecode=ifelse(sex==1, "male", "female"))
  
newpums%>%
  group_by(sexrecode)%>%
  summarise(means=mean(mywage, na.rm=T))

## # A tibble: 2 x 2
##   sexrecode    means
##       <chr>    <dbl>
## 1    female 39446.25
## 2      male 59532.69

newpums%>%
   ggplot(aes(x=sexrecode, y=mywage))+geom_boxplot()

t.test(mywage~sexrecode, data=newpums)

## 
##  Welch Two Sample t-test
## 
## data:  mywage by sexrecode
## t = -63.019, df = 115890, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -20711.17 -19461.73
## sample estimates:
## mean in group female   mean in group male 
##             39446.25             59532.69

sexinc<-lm(mywage~sexrecode, data=newpums)
tidy(sexinc)

##            term estimate std.error statistic p.value
## 1   (Intercept) 39446.25  233.7754 168.73568       0
## 2 sexrecodemale 20086.45  324.3544  61.92746       0

So, men make more money on average than women. The difference is about $20,000 dollars.

Want to see some really non-normal residuals?

qqnorm(rstudent(sexinc), main="Q-Q Plot for Model Residuals")
qqline(rstudent(sexinc), col="red")

We can try transforms:

sexinc2<-lm(log(mywage)~sexrecode, data=newpums)

sexinc3<-lm(sqrt(mywage)~sexrecode, data=newpums)

sexinc4<-lm(I(1/mywage)~sexrecode, data=newpums)

qqnorm(rstudent(sexinc2), main="Q-Q Plot for Model Residuals")
qqline(rstudent(sexinc2), col="red")

qqnorm(rstudent(sexinc3), main="Q-Q Plot for Model Residuals")
qqline(rstudent(sexinc3), col="red")

qqnorm(rstudent(sexinc4), main="Q-Q Plot for Model Residuals")

#qqline(rstudent(sexinc4), col="red")

Honestly, none of these make me very happy.

What to do, what to do?

Are you planning on using this model for prediction?

No - Use the test to do your comparison and stop
Yes - Don’t

Are my results valid?

Yes - the means are different, we know that based on lots of evidence.

DEM 7273 Example 5 - Comparing two groups with the linear model

Corey S. Sparks, PhD

September 20, 2017