This is a simple exercise where we do an Anova test. However we have unbalanced data. We therefore have to be carefull what we ask R to do. This post will work through a simple example showing how to do type III in R. It matters because the default anova in R is type I. This might not always be what you want.
Before we start some explanation about type III anova. You can skip this if you just want to know how to do this in R.
In unbalanced studies ie we dont have the same number of observation in every cell. This Type III or type I difference matters. When you use type III you hypothesise that sample size of the cell is independent of the sample size.
Therefore we use unweighted means. Alternatively one can say; H0, does not depend on the the proportion in the cells.
Some caveats => In Type III, the answer changes depending on the parametrization. that is it depends on the reference group. In this data we have man and woman. The P values for the Anova table will depend on who we consider as reference group.
However it is not order dependend.(type I is) type III tests the main effect to the full model. (we do not respect the rule of marginality). And thus depends on the parametrization. To be valid one needs to use ‘contr.sum’ in R.
So how does Type III anova test for main effects?
for the effect of factor A we test y= mu + b + ab eps y= mu + a + b + ab + eps
for the effect of factor B we test y= mu + a + ab eps y= mu + a + b + ab + eps
for the interaction effect we test y= mu + a + b + eps y= mu + a + b + ab + eps
The important take away for type III is that if you have an unbalanced study and you hypothesise that the cells have different sizes and that this is not due to something important. Use type III.
# type III
fit.type3 <- Anova(lm(DependentVar~factorA*factorB,
contrasts=list(factorA='contr.sum', FactorB ='contr.sum'), data = data),
type='III')
Below you can see the our data. In total we have 132 salries of both man and woman. We do not have the same count in every cell. what we mean by this is that, we do not have exactly the same amount of people in either male or woman or in the degree no degree groups. Therefore we have an unbalanced study.
No degree Degree
Male 42 18
Female 24 48
Here I show a couple of ways you could visualize your data.
two plots depending on your preferense for Moustache plots (common name box plots) or Violin plots ( a very PC name for kutjes plot).
interaction.plot(Education, Gender, Salary, xlab = 'Education', ylab = 'Salary', main= 'interaction plot for Education and gender')
The mean plots show us a very nice descriptive because it shows that woman and men differ but not by a lot. However the difference in salary for those with degree vs those without degree is “dramatic”. Now we want to test if this difference we see in the plot is statistically different.
we test, Do woman get paid on average less than man?
\(H_0\) = \[\dfrac { (\overline\mu_{11}+\overline\mu_{12})}{2} = \dfrac {( \overline\mu_{21}+\overline\mu_{22})}{2}\]
we need to specify contr.sum in R to use type III.
data$Gender =relevel(Gender, ref="Male" )
data$Education=relevel(Education, ref="No degree")
contrasts(data$Gender) = contr.sum
contrasts(data$Education) = contr.sum
fit.type3 <-Anova(lm(Salary ~ Gender*Education, data = data), type ='III')
summary(fit.type3)
pander(fit.type3)
data$Gender =relevel(Gender, ref="Female" )
data$Education=relevel(Education, ref="Degree")
contrasts(data$Gender) = contr.sum
contrasts(data$Education) = contr.sum
fit.type3.b
Analysis of Deviance Table (Type III tests)
Response: Salary
Df F Pr(>F)
(Intercept) 1 6148.2213 < 2.2e-16 ***
Gender 1 23.2910 3.894e-06 ***
Education 1 111.9487 < 2.2e-16 ***
Gender:Education 1 0.3385 0.5617
Residuals 128
---
Signif. codes:
0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
And there you have it, On average Gender and education lead to different salaries.
Do note to make our analysis robust we include the argument white.adjust = TRUE in the Anova function.
We can go more granular by using Tukey multicomp.
The code below gives the robust Tukey test. The detail is in the vcov = vcovHC.
confint(tukey_treat_all)
Simultaneous Confidence Intervals
Fit: lm(formula = Salary ~ Gender * Education, data = data)
Quantile = 2.505
95% family-wise confidence level
Linear Hypotheses:
Estimate lwr upr
(Intercept) == 0 22.9563 22.2229 23.6897
Gender1 == 0 -1.4129 -2.1463 -0.6795
Education1 == 0 3.0977 2.3643 3.8311
Gender1:Education1 == 0 0.1703 -0.5631 0.9037
In this approach we really compare all the groups. In our case we do not have that many groups so I would prefer to do this.
fit.lm2<- aov(Salary~Gender*Education, data = data)
thsd<-TukeyHSD(fit.lm2)
pander(thsd$`Gender:Education`)
-----------------------------------------------------------------------
diff lwr upr p adj
-------------------------------- -------- -------- -------- -----------
**Male:Degree-Female:Degree** 2.485 0.3809 4.59 0.01359
**Female:No -6.536 -8.439 -4.633 5.218e-14
degree-Female:Degree**
**Male:No -3.369 -4.978 -1.761 1.467e-06
degree-Female:Degree**
**Female:No -9.021 -11.4 -6.647 2.82e-14
degree-Male:Degree**
**Male:No degree-Male:Degree** -5.855 -8 -3.71 4.469e-10
**Male:No degree-Female:No 3.167 1.218 5.115 0.000255
degree**
-----------------------------------------------------------------------
thsd$`Gender:Education`
diff lwr
Male:Degree-Female:Degree 2.485193 0.3808626
Female:No degree-Female:Degree -6.536031 -8.4394704
Male:No degree-Female:Degree -3.369498 -4.9781975
Female:No degree-Male:Degree -9.021225 -11.3952304
Male:No degree-Male:Degree -5.854691 -7.9996242
Male:No degree-Female:No degree 3.166533 1.2183015
upr p adj
Male:Degree-Female:Degree 4.589524 1.358516e-02
Female:No degree-Female:Degree -4.632592 5.218048e-14
Male:No degree-Female:Degree -1.760798 1.467154e-06
Female:No degree-Male:Degree -6.647219 2.819966e-14
Male:No degree-Male:Degree -3.709759 4.469114e-10
Male:No degree-Female:No degree 5.114765 2.549672e-04
We can also plot the TukeyHSD, the best plot is the last one.
plot(thsd)
From the above you should conclude the following.
a Woman without a degree makes less than a man with a degree. a woman without a degree earns less than a woman with a degree. a woman with degree earns less than a man with a degree.
a Man without degree makes less than a woman with a degree. a man without degree makes less than a man with a degree. a man without degree makes more than a woman without degree.
Assumptions what? you need the residuals to be normaly distributed. Equal variance (check for homoscedasticity, a difficult word that means equal variance). Outliers might cause issues.
shapiro.test(residuals(fit.lm2))
Shapiro-Wilk normality test
data: residuals(fit.lm2)
W = 0.99563, p-value = 0.9614
we do not reject the shapiro wilk test therefore we have normality.
durbinWatsonTest(fit.lm2, alternative="two.sided",data=data)
lag Autocorrelation D-W Statistic p-value
1 -0.1826385 2.345174 0.086
Alternative hypothesis: rho != 0
we do not reject the durbin watson test hence we have independence.
leveneTest(Salary~Gender*Education, data=data)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 0.8222 0.4839
128
No rejection hence we have equal variance.
plot(fit.lm2)
the residuals look normal we seem to have a few outliers but we should not worry about them affecting our estimates to much since we used a Robust approach above.