Analysis of covariance (ANCOVA) is a statistical method that accounts for third variables, called covariates, when investigating the relationship between an independent and a dependent variable. The covariate is continuous, is never the key independent variable, and is always observed. For example, suppose a study aims to estimate the effect of training (independent variable) on job performance (dependent variable). Some of the variation in performance between those who took the training and those who did not will be left unexplained. To account for this variation, a performance test can be given before the training to measure everyone's baseline performance. This baseline score can serve as the covariate.
We use regression analysis to build models that describe how variation in the independent variables affects the dependent variable. If there is a categorical variable with binary values such as Yes/No or Male/Female, a simple regression fitted to each group gives a different result for each value of the categorical variable. ANCOVA lets us see the effect of the categorical variable by including it alongside the continuous predictor and comparing the regression lines for each level of the categorical variable.
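To make "comparing the regression lines" concrete, the model can be written out as below. This is a sketch in generic notation that is not taken from the original text: $y$ is the dependent variable, $x$ the continuous predictor, and $g$ a 0/1 indicator for the categorical variable.

$$
y_i = \beta_0 + \beta_1 x_i + \beta_2 g_i + \beta_3 x_i g_i + \varepsilon_i
$$

$$
g_i = 0:\; y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
\qquad
g_i = 1:\; y_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\, x_i + \varepsilon_i
$$

Under this notation, $\beta_3$ is the difference in slopes between the two groups. If $\beta_3 = 0$, the two lines are parallel and $\beta_2$ is the constant vertical gap between them, which is the group effect adjusted for the covariate.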
In this example we will use selected columns from the built-in dataset mtcars in R. The columns we will use are:
am: the type of transmission (0 = automatic, 1 = manual)
mpg: miles per gallon
hp: horsepower
We want to see the effect of “am” on the regression between “mpg” and “hp”.
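# Load packages: dplyr for select() and the pipe, skimr for skim()
library(dplyr)
library(tidyr)
library(skimr)
# Keep only the three columns used in this example
mtcars_2 <- mtcars %>% select(am, mpg, hp)
# Summarise the selected columns
skim(mtcars_2)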
| Summary | Value |
|---|---|
| Name | mtcars_2 |
| Number of rows | 32 |
| Number of columns | 3 |
| Column type frequency: numeric | 3 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| am | 0 | 1 | 0.41 | 0.50 | 0.0 | 0.00 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▆ |
| mpg | 0 | 1 | 20.09 | 6.03 | 10.4 | 15.43 | 19.2 | 22.8 | 33.9 | ▃▇▅▁▂ |
| hp | 0 | 1 | 146.69 | 68.56 | 52.0 | 96.50 | 123.0 | 180.0 | 335.0 | ▇▇▆▃▁ |
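Before fitting a combined model, it can help to look at the regression of mpg on hp within each transmission type separately. The following is a minimal sketch using only base R and the built-in mtcars data; the names `groups` and `lines_by_am` are illustrative and not part of the original example.

```r
# Split the built-in mtcars data by transmission type (am: 0 or 1)
groups <- split(mtcars, mtcars$am)

# Fit mpg ~ hp separately within each group and keep the coefficients;
# the result has one row per am level and columns "(Intercept)" and "hp"
lines_by_am <- t(sapply(groups, function(d) coef(lm(mpg ~ hp, data = d))))
lines_by_am
```

If the two rows show similar slopes but different intercepts, that suggests roughly parallel lines, which is the situation ANCOVA without an interaction term assumes.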
mtcars_m <- aov(mpg~hp*am,data = mtcars_2)
summary(mtcars_m)
## Df Sum Sq Mean Sq F value Pr(>F)
## hp 1 678.4 678.4 77.391 1.50e-09 ***
## am 1 202.2 202.2 23.072 4.75e-05 ***
## hp:am 1 0.0 0.0 0.001 0.981
## Residuals 28 245.4 8.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output above shows that both hp and am are significant, but the hp:am interaction is not. We can now fit the model without the interaction between the categorical variable and the predictor variable, and see that both variables still have a significant effect on mpg.
mtcars_mv2 <- aov(mpg~hp+am,data = mtcars_2)
summary(mtcars_mv2)
## Df Sum Sq Mean Sq F value Pr(>F)
## hp 1 678.4 678.4 80.15 7.63e-10 ***
## am 1 202.2 202.2 23.89 3.46e-05 ***
## Residuals 29 245.4 8.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
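One point worth keeping in mind when reading these tables is that `aov()` reports sequential (Type I) sums of squares, so the row for am describes its contribution after hp has already been accounted for. The sketch below illustrates this by swapping the order of the terms; it uses `anova(lm(...))`, which produces the same sequential table, and the variable names are illustrative.

```r
# Sum of squares for am when it enters after the covariate hp
ss_am_last <- anova(lm(mpg ~ hp + am, data = mtcars))["am", "Sum Sq"]

# Sum of squares for am when it enters first, before hp is adjusted for
ss_am_first <- anova(lm(mpg ~ am + hp, data = mtcars))["am", "Sum Sq"]

c(am_entered_last = ss_am_last, am_entered_first = ss_am_first)
```

Entering the covariate first, as the models above do, means the am row reflects the transmission effect adjusted for horsepower rather than the raw group difference.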
We can compare the two models to check whether the interaction between the variables is truly not significant:
m1 <- aov(mpg~hp*am,data = mtcars_2)
m2 <- aov(mpg~hp+am,data = mtcars_2)
anova(m1,m2)
## Analysis of Variance Table
##
## Model 1: mpg ~ hp * am
## Model 2: mpg ~ hp + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 245.43
## 2 29 245.44 -1 -0.0052515 6e-04 0.9806
From the result we can see that the interaction between hp and am is not significant (p = 0.98), so the simpler model without the interaction describes the data just as well.
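With no interaction term, the model implies two parallel regression lines, one per transmission type. As a rough sketch of how those lines could be read off the fitted coefficients, the code below refits the additive model with `lm()` on the built-in mtcars data; the names `fit`, `line_automatic` and `line_manual` are illustrative.

```r
# Additive ANCOVA model: common hp slope, separate intercepts by am
fit <- lm(mpg ~ hp + am, data = mtcars)
b <- coef(fit)

# am = 0 (automatic) uses the base intercept; am = 1 (manual) adds the am coefficient
line_automatic <- c(intercept = unname(b["(Intercept)"]), slope = unname(b["hp"]))
line_manual <- c(intercept = unname(b["(Intercept)"] + b["am"]), slope = unname(b["hp"]))
rbind(automatic = line_automatic, manual = line_manual)

# 95% confidence interval for the adjusted transmission effect
confint(fit)["am", ]
```

The am coefficient is the vertical distance between the two lines, that is, the estimated difference in mpg between manual and automatic cars at the same horsepower.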