We have seen how to compare 2 groups using the linear model, as compared to the two sample t-test. Now we move on to comparing more than two groups. Traditionally, this would be done using a statistical tool called the analysis of variance (ANOVA). This method, although it has “variance” in the name really is used to compare central tendency (means) between multiple groups.
We have seen so far that we can calculate the amount of variation within a sample using the sample variance. This is, for each group in our data, the within group variance. The ANOVA works by comparing the amount of within group variation to the amount of between group variation. This is a different concept, that we will describe below.
The idea is that, if the amount of between group variance is low, compared to the amount of within group variance, then the means of the different groups are similar to one another. Here is what this may look like:
The first figure shows the example where the means of three groups (vertical lines) are very different from one another. The amount of within group variation is the same.
The second figure shows the example where the means of three groups (vertical lines) are similar to one another. The amount of within group variation is the same.
The graphs are nice, but we need a statistical way to compare the distributions of these groups.
Now we want to extend this logic to more than 2 groups. In the two group case, we were testing the null hypothesis \(H_0 : \mu_1=\mu_2\), that the 2 group means were equal - Now we are testing the hypothesis - \(H_0 : \mu_1=\mu_2= \cdots = \mu_k\), or that the k group means are equal, verses the alternative that they are not all equal. This alternative hypothesis does not specify which means are different, just that they are not all equal, so, the alternative hypothesis for 3 groups, for example would be : \(H_A: \mu_1\neq\mu_2 \text{ or } \mu_1\neq\mu_3 \text{ or } \mu_2\neq\mu_3\)
One method of doing this is to simply do 3 different t-tests one for each of the null hypotheses. We run into problems with this, because as the number of pair-wise tests increases, the probability of falsely rejecting one of the alternative hypotheses increases (type I error).
Although we may have set our holy alpha value at .05, the actual probability of rejecting any of our null hypotheses is smaller. It’s more like .05/#of tests
So instead of doing lots of pair wise t-tests, we test for general variation among the means simultaneously using the analysis of within and between group variance. Enter the ANOVA table
In the two group case, the within group variance is:
\(s_w = \frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n1+n2-2}\)
This is cute for two groups, but as the number of groups increases, we need a more general formula for within sample variation, and we also need a formula for between sample variation.
Some notation first: \(\bar{y_{..}}\) = the grand mean (mean of all observations) = \(\bar{y_{..}}= \sum_{i=1}^{n} y_{ij}/n\) \(\bar{y_{j.}}\) = the mean of the jth group = \(\bar{y_{j.}}= \sum_{i=1}^{n} y_{ij}/n_j\)
Now we express each individual relative to the grand mean and the group means.
The total amount of variation in the sample is the Total Sums of Squares: \(TSS = \sum_{i=1}^n \sum_{j=1}^{n_j}(y_{ij} - \bar{y_{..}})^2 = (n-1)s_T^2\)
We can partition the TSS into sources, one source within, and one source between groups.
$TSS = (y_{ij} - {y_{..}} )^2=j (y{ij} -{y_{j.}})^2+i ({y{j.}} -{y_{..}})^2 $
The first quantity on the right captures the deviations of the \(n_j\) observations from the mean of group j, and is a measure of the within sample variation, SSW
\(SSW = \sum_{ij}(y_{ij}-\bar{y_{..}})^2\)
The second quantity captures the deviations of the j group means from the grand mean, and is a measure of between group variation, SSB.
\(SSB = \sum_i n_i (\bar{y_{j.}} - \bar{y_{..}})^2\)
using these three quantities, we can calculate estimates of the within group varince (\(s_w^2\)), the between group variance (\(s_B^2\)).
\(s_w^2 = \frac{SSW}{n-k}\)
\(s_B^2= \frac{SSB}{k-1}\)
k-1 and n-k are the degrees of freedom for \(s_B^2\) and \(s_w^2\)
These terms are also called the mean squares, becuase the represent average deviations around the group means and the grand mena, respectively.
These numbers are usually presented in what’s called the ANOVA Table, along with an F-test, which is \(F = s_B^2/s_w^2\). Finally a useful measure of overal model fit is often presented, which is the model \(R^2\). \(R^2= SSB/SST\) and is the percent of variation in the outcome that is explained by the model, in this case, the differences between the groups.
The ANOVA model can be written:
\(y_i = \beta_0 + \tau_j + \epsilon_{ij}\)
This model states that we can write the ith observation of the jth group as the product of three terms: the grand mean, \(\beta_0\), a group-specific deviation (effect) from the grand mean, \(\tau_j\), both of which are parameters to be estimated. The final term,\(\epsilon_{ij}\), represents the random deviation of each observation around the grand mean. The ??ij are often referred to as the “error term”, since it only represents residual variation of individuals from the mean. This residual variation is just variation that is not attributable to the group structure. We assume the \(\epsilon_{ij}\)’s are Normally distributed with constant variance.
The interpretation of this model really focuses on the \(\tau_j\) terms. These represent the differences in the means of the groups from the grand mean. They may be positive, negative or zero. The ANOVA model specifies the test of the hypothesis that all means are equal using the model terms:
\(H_0 : \tau_1=\tau_2= \cdots = \tau_j =0\) versus the alternative hypothesis that at least one of the \(\tau_j\)’s are note 0.
The ANOVA model is good for examining if there is a global difference in means -i.e. at least two of the groups have different means, but it tells nothing about which groups are different from which other group. This is the realm of so-called post hoc tests, Latin for “Occurring or done after the event”, where the event is the ANOVA F test. This basically involves doing all “pairwise” comparisons between the groups in the analysis.
So, this would be how I would write a linear model to compare more than two group means. This assumes your dependent variable is continuous. This acutally works just like the previous example of comparing two means.
The model we are using is:
\(y_i = \beta_0 + \tau_j + \epsilon_{ij}\)
Assuming:
\(\epsilon_i \sim \text{Normal}(0, \sigma^2_{\epsilon})\)
Now we do the test using the PRB data. Here, we will test if the TFR is the same across continents, not just between Africa and Not African countries.
library(broom)
library(readr)
library(dplyr)
library(ggplot2)
prb<-read_csv("https://raw.githubusercontent.com/coreysparks/data/master/PRB2013_new.csv", col_names=T)
names(prb)<-tolower(names(prb))
#summary statistics by group
prb%>%
group_by(continent)%>%
summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())
## # A tibble: 6 x 4
## continent means sds n
## <chr> <dbl> <dbl> <int>
## 1 Africa 4.612727 1.4185353 55
## 2 Asia 2.517647 1.0326095 51
## 3 Europe 1.553333 0.2282343 45
## 4 North America 2.207407 0.5455596 27
## 5 Oceania 3.176471 0.9010615 17
## 6 South America 2.500000 0.4760952 13
#boxplot
prb%>%
ggplot(aes(x=continent, y=tfr))+geom_boxplot()+ggtitle(label = "Total Fertility Rate Across Continents")
Our models are fit again using lm(), and we will use the anova() function to do our F test.
tfr_fit<-prb%>%
lm(tfr~continent, data=.)
anova(tfr_fit)
## Analysis of Variance Table
##
## Response: tfr
## Df Sum Sq Mean Sq F value Pr(>F)
## continent 5 266.61 53.322 57.379 < 2.2e-16 ***
## Residuals 202 187.72 0.929
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F test shows that there is significant variation between continents in our analysis. But we do not know which continents are different from each other. We can perform the post hoc tests to see that:
pairwise.t.test(x = prb$tfr,g = prb$continent,p.adjust.method = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: prb$tfr and prb$continent
##
## Africa Asia Europe North America Oceania
## Asia < 2e-16 - - - -
## Europe < 2e-16 3.1e-05 - - -
## North America < 2e-16 1.000 0.087 - -
## Oceania 3.2e-06 0.233 2.1e-07 0.020 -
## South America 3.0e-10 1.000 0.031 1.000 0.874
##
## P value adjustment method: bonferroni
We can also examine the \(\tau_j\)’s from our model. These are equivalent to the previous example, when we tested if \(\beta_1\) was not 0.
tidy(tfr_fit)
## term estimate std.error statistic p.value
## 1 (Intercept) 4.612727 0.1299852 35.486551 9.608196e-89
## 2 continentAsia -2.095080 0.1873967 -11.179923 6.708022e-23
## 3 continentEurope -3.059394 0.1937705 -15.788747 4.138926e-37
## 4 continentNorth America -2.405320 0.2265265 -10.618271 3.267317e-21
## 5 continentOceania -1.436257 0.2675074 -5.369036 2.164102e-07
## 6 continentSouth America -2.112727 0.2972876 -7.106679 2.001239e-11
So, we have here the deviations of each continent’s mean from the mean of the “reference group”. In this case, Africa was chosen as the reference group, since it came first alpha-numerically. If we want to see how each group differs from the “world mean” we can adust how R does the comparisons.
prb$continent2<-relevel(as.factor(prb$continent),ref = "North America")
tfr_fit2<-lm(tfr~continent2, data=prb)
tidy(tfr_fit2)
## term estimate std.error statistic p.value
## 1 (Intercept) 2.2074074 0.1855212 11.8984133 4.374452e-25
## 2 continent2Africa 2.4053199 0.2265265 10.6182712 3.267317e-21
## 3 continent2Asia 0.3102397 0.2294329 1.3522022 1.778226e-01
## 4 continent2Europe -0.6540741 0.2346678 -2.7872344 5.823166e-03
## 5 continent2Oceania 0.9690632 0.2984663 3.2468092 1.366392e-03
## 6 continent2South America 0.2925926 0.3254256 0.8991074 3.696660e-01
Again, we can check the means using summarise:
cont_means<-prb%>%
group_by(continent2)%>%
summarise(means=mean(tfr, na.rm=T), sds=sd(tfr, na.rm=T), n=n())
cont_means$diff<-cont_means$means-cont_means$means[1]
cont_means
## # A tibble: 6 x 5
## continent2 means sds n diff
## <fctr> <dbl> <dbl> <int> <dbl>
## 1 North America 2.207407 0.5455596 27 0.0000000
## 2 Africa 4.612727 1.4185353 55 2.4053199
## 3 Asia 2.517647 1.0326095 51 0.3102397
## 4 Europe 1.553333 0.2282343 45 -0.6540741
## 5 Oceania 3.176471 0.9010615 17 0.9690632
## 6 South America 2.500000 0.4760952 13 0.2925926
And we see that the column labeled “diff” is the same as the parameters estimated in the tfr_fit2 model, above.
We need to evaluate a key assumption about the errors of the model.
qqnorm(rstudent(tfr_fit), main="Q-Q Plot for Model Residuals")
qqline(rstudent(tfr_fit), col="red")
Not bad, if everything was perfect, then all the dots would line up on the line.
There’s a formal, but overly sensitive test for normality that we can use here:
shapiro.test(resid(tfr_fit))
##
## Shapiro-Wilk normality test
##
## data: resid(tfr_fit)
## W = 0.96102, p-value = 1.756e-05
Which fails the normality test. This is pretty important assumption of the linear model in many settings, here not so much. But we should do our due diligence. If you have evidence of non-normality of your residuals, you can attempt to transform the dependent variable via a set of approaches.
Things we can try are the: natural log transform = log(x) square root transform = sqrt(x) reciprocal transformation = 1/x
We’ll try the log transform first:
tfr_fit3<-lm( log(tfr)~continent, data=prb)
qqnorm(rstudent(tfr_fit3), main="Q-Q Plot for Model Residuals")
qqline(rstudent(tfr_fit3), col="red")
shapiro.test(resid(tfr_fit3))
##
## Shapiro-Wilk normality test
##
## data: resid(tfr_fit3)
## W = 0.98564, p-value = 0.03318
AHHHHH! SOOO close!
Now let’s open a ‘really real’ data file. This is a sample from the 2015 1-year American Community Survey microdata, meaning that each row in these data is a person who responded to the survey in 2015.
I’ve done an extract (do example in class) and stored the data in a stata format on my github data site. The file we are using is called usa_00045.dta.
There is also a codebook that describes the data and all the response levels for each variable in the data. They are also on my github data page, and called Codebook_DEM7273_IPUMS2015.
I can read it from github directly by using the read_dta() function in the haven library:
library(haven)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
newpums<-ipums%>%
filter(labforce==2, age>=18, incwage>0)%>%
mutate(mywage= ifelse(incwage%in%c(999998,999999), NA,incwage),
sexrecode=ifelse(sex==1, "male", "female"))%>%
mutate(race_eth = case_when(.$hispan %in% c(1:4) & .$race %in%c(1:9) ~ "hispanic",
.$hispan ==0 & .$race==1 ~"nh_white",
.$hispan ==0 & .$race==2 ~"nh_black",
.$hispan ==0 & .$race%in%c(3,7,8,9) ~"nh_other",
.$hispan ==0 & .$race%in%c(4:6) ~"nh_asian",
.$hispan==9 ~ "missing"))
newpums%>%
group_by(race_eth)%>%
summarise(meaninc=mean(mywage, na.rm=T), sds=sd(mywage, na.rm=T), n=n())
## # A tibble: 5 x 4
## race_eth meaninc sds n
## <chr> <dbl> <dbl> <int>
## 1 hispanic 34787.89 39528.74 17909
## 2 nh_asian 60151.03 69964.62 7418
## 3 nh_black 36271.81 39148.12 12474
## 4 nh_other 40001.45 48277.40 3370
## 5 nh_white 54229.95 63980.76 91311
newpums$logwage<-log(newpums$mywage)
#box plot by race/ethnicity
newpums%>%
ggplot(aes(x=race_eth, y=logwage))+geom_boxplot()
#histogram by race/ethnicity
newpums%>%
ggplot(aes(logwage, fill=race_eth))+geom_histogram(aes(y=0.5*..density..))+facet_wrap(~race_eth)+ggtitle(label = "Log-Wage by Race/Ethnicity")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#ANOVA model
newpums$race_eth<-relevel(as.factor(newpums$race_eth), ref = "nh_white")
raceinc<-newpums%>%
lm(mywage~race_eth, data=.)
anova(raceinc)
## Analysis of Variance Table
##
## Response: mywage
## Df Sum Sq Mean Sq F value Pr(>F)
## race_eth 4 9.2284e+12 2.3071e+12 657.23 < 2.2e-16 ***
## Residuals 132477 4.6504e+14 3.5103e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pairwise.t.test(x = newpums$mywage,g = newpums$race_eth,p.adjust.method = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: newpums$mywage and newpums$race_eth
##
## nh_white hispanic nh_asian nh_black
## hispanic < 2e-16 - - -
## nh_asian 1.3e-15 < 2e-16 - -
## nh_black < 2e-16 0.317 < 2e-16 -
## nh_other < 2e-16 2.8e-05 < 2e-16 0.012
##
## P value adjustment method: bonferroni
tidy(raceinc)
## term estimate std.error statistic p.value
## 1 (Intercept) 54229.949 196.0704 276.584095 0.000000e+00
## 2 race_ethhispanic -19442.063 484.2029 -40.152717 0.000000e+00
## 3 race_ethnh_asian 5921.081 715.3045 8.277708 1.267188e-16
## 4 race_ethnh_black -17958.142 565.5570 -31.753017 1.953130e-220
## 5 race_ethnh_other -14228.504 1039.2702 -13.690861 1.231080e-42
So, NH- White’s make an average of 5.422994910^{4} dollars, Hispanics make 1.944206310^{4} less on average than NH whites, NH Asians make 5921.0812901 more than NH whites, NH blacks make 1.795814210^{4} less than NH whites, and NH other’s make 1.422850410^{4} less than NH whites.
Normality checks
qqnorm(rstudent(raceinc))
qqline(rstudent(raceinc), col=2)