Comparing More than two groups

We have seen how to compare 2 groups using the linear model, as compared to the two sample t-test. Now we move on to comparing more than two groups. Traditionally, this would be done using a statistical tool called the analysis of variance (ANOVA). This method, although it has “variance” in the name really is used to compare central tendency (means) between multiple groups.

We have seen so far that we can calculate the amount of variation within a sample using the sample variance. This is, for each group in our data, the within group variance. The ANOVA works by comparing the amount of within group variation to the amount of between group variation. This is a different concept, that we will describe below.

The idea is that, if the amount of between group variance is low, compared to the amount of within group variance, then the means of the different groups are similar to one another. Here is what this may look like:

The first figure shows the example where the means of three groups (vertical lines) are very different from one another. The amount of within group variation is the same.

The second figure shows the example where the means of three groups (vertical lines) are similar to one another. The amount of within group variation is the same.

The graphs are nice, but we need a statistical way to compare the distributions of these groups.

Remember the two group case?

Now we want to extend this logic to more than 2 groups. In the two group case, we were testing the null hypothesis $H_0 : \mu_1=\mu_2$, that the 2 group means were equal - Now we are testing the hypothesis - $H_0 : \mu_1=\mu_2= \cdots = \mu_k$, or that the k group means are equal, verses the alternative that they are not all equal. This alternative hypothesis does not specify which means are different, just that they are not all equal, so, the alternative hypothesis for 3 groups, for example would be : $H_A: \mu_1\neq\mu_2 \text{ or } \mu_1\neq\mu_3 \text{ or } \mu_2\neq\mu_3$

One method of doing this is to simply do 3 different t-tests one for each of the null hypotheses. We run into problems with this, because as the number of pair-wise tests increases, the probability of falsely rejecting one of the alternative hypotheses increases (type I error).

Although we may have set our holy alpha value at .05, the actual probability of rejecting any of our null hypotheses is smaller. It’s more like .05/#of tests

So instead of doing lots of pair wise t-tests, we test for general variation among the means simultaneously using the analysis of within and between group variance. Enter the ANOVA table

In the two group case, the within group variance is:

$s_w = \frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n1+n2-2}$

This is cute for two groups, but as the number of groups increases, we need a more general formula for within sample variation, and we also need a formula for between sample variation.

General forms of within and between group variation

Some notation first: $\bar{y_{..}}$ = the grand mean (mean of all observations) = $\bar{y_{..}}= \sum_{i=1}^{n} y_{ij}/n$ $\bar{y_{j.}}$ = the mean of the jth group = $\bar{y_{j.}}= \sum_{i=1}^{n} y_{ij}/n_j$

Now we express each individual relative to the grand mean and the group means.

The total amount of variation in the sample is the Total Sums of Squares: $TSS = \sum_{i=1}^n \sum_{j=1}^{n_j}(y_{ij} - \bar{y_{..}})^2 = (n-1)s_T^2$

We can partition the TSS into sources, one source within, and one source between groups.

$TSS = (y_{ij} - {y_{..}} )^2=j (y{ij} -{y_{j.}})^2+i ({y{j.}} -{y_{..}})^2 $

The first quantity on the right captures the deviations of the $n_j$ observations from the mean of group j, and is a measure of the within sample variation, SSW

$SSW = \sum_{ij}(y_{ij}-\bar{y_{..}})^2$

The second quantity captures the deviations of the j group means from the grand mean, and is a measure of between group variation, SSB.

$SSB = \sum_i n_i (\bar{y_{j.}} - \bar{y_{..}})^2$

using these three quantities, we can calculate estimates of the within group varince ($s_w^2$), the between group variance ($s_B^2$).

$s_w^2 = \frac{SSW}{n-k}$

$s_B^2= \frac{SSB}{k-1}$

k-1 and n-k are the degrees of freedom for $s_B^2$ and $s_w^2$

These terms are also called the mean squares, becuase the represent average deviations around the group means and the grand mena, respectively.

These numbers are usually presented in what’s called the ANOVA Table, along with an F-test, which is $F = s_B^2/s_w^2$. Finally a useful measure of overal model fit is often presented, which is the model $R^2$. $R^2= SSB/SST$ and is the percent of variation in the outcome that is explained by the model, in this case, the differences between the groups.

More Linear statistical models

The ANOVA model can be written:

$y_i = \beta_0 + \tau_j + \epsilon_{ij}$

This model states that we can write the ith observation of the jth group as the product of three terms: the grand mean, $\beta_0$, a group-specific deviation (effect) from the grand mean, $\tau_j$, both of which are parameters to be estimated. The final term,$\epsilon_{ij}$, represents the random deviation of each observation around the grand mean. The ??ij are often referred to as the “error term”, since it only represents residual variation of individuals from the mean. This residual variation is just variation that is not attributable to the group structure. We assume the $\epsilon_{ij}$’s are Normally distributed with constant variance.

The interpretation of this model really focuses on the $\tau_j$ terms. These represent the differences in the means of the groups from the grand mean. They may be positive, negative or zero. The ANOVA model specifies the test of the hypothesis that all means are equal using the model terms:

$H_0 : \tau_1=\tau_2= \cdots = \tau_j =0$ versus the alternative hypothesis that at least one of the $\tau_j$’s are note 0.

Post hoc tests

The ANOVA model is good for examining if there is a global difference in means -i.e. at least two of the groups have different means, but it tells nothing about which groups are different from which other group. This is the realm of so-called post hoc tests, Latin for “Occurring or done after the event”, where the event is the ANOVA F test. This basically involves doing all “pairwise” comparisons between the groups in the analysis.

This is done only after evidence of a significant F test from the ANOVA!
It is customary to recalculate the alpha level based on the number of tests being done, again to avoid the type I error. These are typically called “post hoc corrections” and there are a lot of them. I will use the Bonferroni correction in all my examples.

Linear model to compare multiple groups.

So, this would be how I would write a linear model to compare more than two group means. This assumes your dependent variable is continuous. This acutally works just like the previous example of comparing two means.

The model we are using is:

$y_i = \beta_0 + \tau_j + \epsilon_{ij}$

Assuming:

$\epsilon_i \sim \text{Normal}(0, \sigma^2_{\epsilon})$

Now we do the test using the PRB data. Here, we will test if the TFR is the same across continents, not just between Africa and Not African countries.

library(broom)
library(readr)
library(dplyr)
library(ggplot2)
prb<-read_csv("https://raw.githubusercontent.com/coreysparks/data/master/PRB2013_new.csv", col_names=T)
names(prb)<-tolower(names(prb))

#summary statistics by group
prb%>%
  group_by(continent)%>%
  summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())

## # A tibble: 6 x 4
##       continent    means       sds     n
##           <chr>    <dbl>     <dbl> <int>
## 1        Africa 4.612727 1.4185353    55
## 2          Asia 2.517647 1.0326095    51
## 3        Europe 1.553333 0.2282343    45
## 4 North America 2.207407 0.5455596    27
## 5       Oceania 3.176471 0.9010615    17
## 6 South America 2.500000 0.4760952    13

#boxplot
prb%>%
  ggplot(aes(x=continent, y=tfr))+geom_boxplot()+ggtitle(label = "Total Fertility Rate Across Continents")

Our models are fit again using lm(), and we will use the anova() function to do our F test.

tfr_fit<-prb%>%
  lm(tfr~continent, data=.)
anova(tfr_fit)

## Analysis of Variance Table
## 
## Response: tfr
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## continent   5 266.61  53.322  57.379 < 2.2e-16 ***
## Residuals 202 187.72   0.929                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F test shows that there is significant variation between continents in our analysis. But we do not know which continents are different from each other. We can perform the post hoc tests to see that:

pairwise.t.test(x = prb$tfr,g = prb$continent,p.adjust.method = "bonf")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  prb$tfr and prb$continent 
## 
##               Africa  Asia    Europe  North America Oceania
## Asia          < 2e-16 -       -       -             -      
## Europe        < 2e-16 3.1e-05 -       -             -      
## North America < 2e-16 1.000   0.087   -             -      
## Oceania       3.2e-06 0.233   2.1e-07 0.020         -      
## South America 3.0e-10 1.000   0.031   1.000         0.874  
## 
## P value adjustment method: bonferroni

We can also examine the $\tau_j$’s from our model. These are equivalent to the previous example, when we tested if $\beta_1$ was not 0.

tidy(tfr_fit)

##                     term  estimate std.error  statistic      p.value
## 1            (Intercept)  4.612727 0.1299852  35.486551 9.608196e-89
## 2          continentAsia -2.095080 0.1873967 -11.179923 6.708022e-23
## 3        continentEurope -3.059394 0.1937705 -15.788747 4.138926e-37
## 4 continentNorth America -2.405320 0.2265265 -10.618271 3.267317e-21
## 5       continentOceania -1.436257 0.2675074  -5.369036 2.164102e-07
## 6 continentSouth America -2.112727 0.2972876  -7.106679 2.001239e-11

So, we have here the deviations of each continent’s mean from the mean of the “reference group”. In this case, Africa was chosen as the reference group, since it came first alpha-numerically. If we want to see how each group differs from the “world mean” we can adust how R does the comparisons.

prb$continent2<-relevel(as.factor(prb$continent),ref = "North America")
tfr_fit2<-lm(tfr~continent2, data=prb)
tidy(tfr_fit2)

##                      term   estimate std.error  statistic      p.value
## 1             (Intercept)  2.2074074 0.1855212 11.8984133 4.374452e-25
## 2        continent2Africa  2.4053199 0.2265265 10.6182712 3.267317e-21
## 3          continent2Asia  0.3102397 0.2294329  1.3522022 1.778226e-01
## 4        continent2Europe -0.6540741 0.2346678 -2.7872344 5.823166e-03
## 5       continent2Oceania  0.9690632 0.2984663  3.2468092 1.366392e-03
## 6 continent2South America  0.2925926 0.3254256  0.8991074 3.696660e-01

Again, we can check the means using summarise:

cont_means<-prb%>%
  group_by(continent2)%>%
  summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())
cont_means$diff<-cont_means$means-cont_means$means[1]
cont_means

## # A tibble: 6 x 5
##      continent2    means       sds     n       diff
##          <fctr>    <dbl>     <dbl> <int>      <dbl>
## 1 North America 2.207407 0.5455596    27  0.0000000
## 2        Africa 4.612727 1.4185353    55  2.4053199
## 3          Asia 2.517647 1.0326095    51  0.3102397
## 4        Europe 1.553333 0.2282343    45 -0.6540741
## 5       Oceania 3.176471 0.9010615    17  0.9690632
## 6 South America 2.500000 0.4760952    13  0.2925926

And we see that the column labeled “diff” is the same as the parameters estimated in the tfr_fit2 model, above.

Wait!

We need to evaluate a key assumption about the errors of the model.

qqnorm(rstudent(tfr_fit), main="Q-Q Plot for Model Residuals")
qqline(rstudent(tfr_fit), col="red")

Not bad, if everything was perfect, then all the dots would line up on the line.

There’s a formal, but overly sensitive test for normality that we can use here:

shapiro.test(resid(tfr_fit))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(tfr_fit)
## W = 0.96102, p-value = 1.756e-05

Which fails the normality test. This is pretty important assumption of the linear model in many settings, here not so much. But we should do our due diligence. If you have evidence of non-normality of your residuals, you can attempt to transform the dependent variable via a set of approaches.

Typical transformations

Things we can try are the: natural log transform = log(x) square root transform = sqrt(x) reciprocal transformation = 1/x

We’ll try the log transform first:

tfr_fit3<-lm( log(tfr)~continent, data=prb)
qqnorm(rstudent(tfr_fit3), main="Q-Q Plot for Model Residuals")
qqline(rstudent(tfr_fit3), col="red")

shapiro.test(resid(tfr_fit3))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(tfr_fit3)
## W = 0.98564, p-value = 0.03318

AHHHHH! SOOO close!

Really Real data example

Now let’s open a ‘really real’ data file. This is a sample from the 2015 1-year American Community Survey microdata, meaning that each row in these data is a person who responded to the survey in 2015.

I’ve done an extract (do example in class) and stored the data in a stata format on my github data site. The file we are using is called usa_00045.dta.

There is also a codebook that describes the data and all the response levels for each variable in the data. They are also on my github data page, and called Codebook_DEM7273_IPUMS2015.

I can read it from github directly by using the read_dta() function in the haven library:

library(haven)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")

newpums<-ipums%>%
  filter(labforce==2, age>=18, incwage>0)%>%
  mutate(mywage= ifelse(incwage%in%c(999998,999999), NA,incwage),
         sexrecode=ifelse(sex==1, "male", "female"))%>%
  mutate(race_eth = case_when(.$hispan %in% c(1:4) & .$race %in%c(1:9) ~ "hispanic", 
                          .$hispan ==0 & .$race==1 ~"nh_white",
                         .$hispan ==0 & .$race==2 ~"nh_black",
                         .$hispan ==0 & .$race%in%c(3,7,8,9) ~"nh_other",
                         .$hispan ==0 & .$race%in%c(4:6) ~"nh_asian",
                          .$hispan==9 ~ "missing"))
newpums%>%
  group_by(race_eth)%>%
  summarise(meaninc=mean(mywage, na.rm=T), sds=sd(mywage, na.rm=T), n=n())

## # A tibble: 5 x 4
##   race_eth  meaninc      sds     n
##      <chr>    <dbl>    <dbl> <int>
## 1 hispanic 34787.89 39528.74 17909
## 2 nh_asian 60151.03 69964.62  7418
## 3 nh_black 36271.81 39148.12 12474
## 4 nh_other 40001.45 48277.40  3370
## 5 nh_white 54229.95 63980.76 91311

newpums$logwage<-log(newpums$mywage)

#box plot by race/ethnicity
newpums%>%
   ggplot(aes(x=race_eth, y=logwage))+geom_boxplot()

#histogram by race/ethnicity
newpums%>%
   ggplot(aes(logwage, fill=race_eth))+geom_histogram(aes(y=0.5*..density..))+facet_wrap(~race_eth)+ggtitle(label = "Log-Wage by Race/Ethnicity")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#ANOVA model
newpums$race_eth<-relevel(as.factor(newpums$race_eth), ref = "nh_white")

raceinc<-newpums%>%
  lm(mywage~race_eth, data=.)

anova(raceinc)

## Analysis of Variance Table
## 
## Response: mywage
##               Df     Sum Sq    Mean Sq F value    Pr(>F)    
## race_eth       4 9.2284e+12 2.3071e+12  657.23 < 2.2e-16 ***
## Residuals 132477 4.6504e+14 3.5103e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

pairwise.t.test(x = newpums$mywage,g = newpums$race_eth,p.adjust.method = "bonf")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  newpums$mywage and newpums$race_eth 
## 
##          nh_white hispanic nh_asian nh_black
## hispanic < 2e-16  -        -        -       
## nh_asian 1.3e-15  < 2e-16  -        -       
## nh_black < 2e-16  0.317    < 2e-16  -       
## nh_other < 2e-16  2.8e-05  < 2e-16  0.012   
## 
## P value adjustment method: bonferroni

tidy(raceinc)

##               term   estimate std.error  statistic       p.value
## 1      (Intercept)  54229.949  196.0704 276.584095  0.000000e+00
## 2 race_ethhispanic -19442.063  484.2029 -40.152717  0.000000e+00
## 3 race_ethnh_asian   5921.081  715.3045   8.277708  1.267188e-16
## 4 race_ethnh_black -17958.142  565.5570 -31.753017 1.953130e-220
## 5 race_ethnh_other -14228.504 1039.2702 -13.690861  1.231080e-42

So, NH- White’s make an average of 5.422994910^{4} dollars, Hispanics make 1.944206310^{4} less on average than NH whites, NH Asians make 5921.0812901 more than NH whites, NH blacks make 1.795814210^{4} less than NH whites, and NH other’s make 1.422850410^{4} less than NH whites.

Normality checks

qqnorm(rstudent(raceinc))
qqline(rstudent(raceinc), col=2)

DEM 7273 Example 6 - Comparing multiple groups with the linear model - ANOVA

Corey S. Sparks, PhD

September 27, 2017