I shall continue to use ANCOVA as a shorthand for “analyses that combine continuous and categorical variables”. Just remember that it is just another application of LMs, and that the continuous variable may well be of primary interest.
2024-01-29
Recall that orthogonality is about shared information: two EVs are orthogonal when neither carries information about the other.
Example: the simulated bodyfat data, with two EVs, sex and bodyweight.
The groups (sex) differ in mean bodyweight: sex entails information about bodyweight and vice versa.
In an experimental study, you would assign the categorical treatment EV to experimental units by randomisation.
This reduces any non-orthogonality to just stochastic differences in the distribution of the continuous EV between groups.
(Whereas sex cannot be assigned, so the bodyweight differences between the sexes are inherent.)
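The effect of randomisation can be sketched with a small simulation. Everything here is hypothetical: the data frame `units`, the continuous EV `weight` and the treatment labels are invented for illustration and do not come from any dataset used in these slides.

```r
# Hypothetical experimental units with a continuous EV (all names are assumptions)
set.seed(1)                                   # arbitrary seed, for reproducibility only
units <- data.frame(weight = rnorm(20, mean = 70, sd = 10))

# Randomise: 10 units per treatment, allocated in random order
units$treatment <- factor(sample(rep(c("control", "treated"), each = 10)))

# The group means of the continuous EV now differ only by chance
group_means <- tapply(units$weight, units$treatment, mean)
print(group_means)

# ANOVA of the continuous EV on the treatment: a check for shared information
check <- anova(lm(weight ~ treatment, data = units))
print(check)
```

Re-running with a different seed changes the group means, but any difference between them is purely stochastic, never systematic.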
Grazing study from earlier: ‘The initial,
In a planned experiment, you can make the design orthogonal.
Example: Effect of lactose treatment (cat. EV) on bacterial growth over 5 days (cont. EV):
10 flasks have lactose added and 10 are without.
Each day, 2 flasks per treatment are taken out and analysed for bacterial counts.
Orthogonal ANCOVA design.
with(Bugs, table(lactose, day))
##        day
## lactose 1 2 3 4 5
##       1 2 2 2 2 2
##       2 2 2 2 2 2
Since orthogonality means “no shared information”, we can test for it by ANOVA (!)
Predict the cont. IV (day) from the cat. IV (lactose):
lm(day ~ lactose, Bugs) %>% anova()
## Analysis of Variance Table
##
## Response: day
##           Df Sum Sq Mean Sq F value Pr(>F)
## lactose    1      0  0.0000       0      1
## Residuals 18     40  2.2222
Why? The average value of day must be the same for both levels of lactose, so lactose explains none of the variation in day.
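The check can be reproduced without the Bugs data, because orthogonality depends only on the design. The sketch below rebuilds the layout from the table of counts (2 flasks per lactose-by-day combination); the data frame `design` and its `flask` column are assumptions made for illustration, and base R is used instead of the pipe.

```r
# Hypothetical reconstruction of the design only (no bacterial counts):
# 2 lactose treatments x 5 days x 2 flasks = 20 flasks
design <- expand.grid(flask = 1:2, day = 1:5, lactose = 1:2)
design$lactose <- factor(design$lactose)

# Every lactose-by-day combination occurs exactly twice
counts <- with(design, table(lactose, day))
print(counts)

# Mean day is the same in both treatments ...
day_means <- tapply(design$day, design$lactose, mean)
print(day_means)

# ... so lactose explains none of the variation in day
orth <- anova(lm(day ~ lactose, data = design))
print(orth)
```

The sum of squares for lactose is zero (up to floating-point error) and all 40 units of variation in day end up in the residuals.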
Should day be continuous or categorical?
It is not always clear-cut whether an EV should be continuous or categorical. The Bugs dataset is the perfect example: bacterial counts are taken every day over five days.
We could treat day as a continuous variable, or as a five-level factor. What is better?
Let’s try both and compare the output.
lm(bacteria ~ day + lactose)
| | Df | Adj SS | Adj MS | F | Pr(>F) |
|---|---|---|---|---|---|
| day | 1 | 297.97 | 297.97 | 130.56 | 0 |
| lactose | 1 | 397.07 | 397.07 | 173.99 | 0 |
| Residuals | 17 | 38.80 | 2.28 | | |
lm(bacteria ~ factor(day) + lactose)
| | Df | Adj SS | Adj MS | F | Pr(>F) |
|---|---|---|---|---|---|
| factor(day) | 4 | 298.51 | 74.63 | 27.30 | 0 |
| lactose | 1 | 397.07 | 397.07 | 145.28 | 0 |
| Residuals | 14 | 38.26 | 2.73 | | |
With day as a five-level factor, \(F\) drops from \(130.56\) to \(27.30\).
Why is this?
Adjusted SS for day increases slightly, from \(297.97\) to \(298.51\).
Adjusted MS for day drops dramatically, from \(297.97\) to \(74.63\).
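day as a five-level factor has 4 Df, whereas day as a straight line has only 1 Df. Nearly the same SS is spread over four times as many Df, so the MS (and with it \(F\)) drops. Here, treating day as continuous gives more power.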
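The arithmetic behind the drop can be traced directly from the two ANOVA tables. The sketch below only recomputes MS and \(F\) from the tabulated, rounded SS and Df values; the variable names are chosen here for illustration.

```r
# Values copied from the two ANOVA tables (rounded to two decimals there)
ss_day <- c(continuous = 297.97, factor = 298.51)   # adjusted SS for day
df_day <- c(continuous = 1,      factor = 4)        # Df used by day
ss_res <- c(continuous = 38.80,  factor = 38.26)    # residual SS
df_res <- c(continuous = 17,     factor = 14)       # residual Df

ms_day <- ss_day / df_day      # mean square for day
ms_res <- ss_res / df_res      # residual mean square
f_day  <- ms_day / ms_res      # F ratio for day

print(round(cbind(ms_day, ms_res, f_day), 2))
```

The recomputed \(F\) ratios agree with the tabulated ones up to rounding. The factor model loses twice: its MS for day is divided by 4 instead of 1, and its residual MS rises because the residuals keep only 14 Df.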
It depends… you need to plot the data:
If there is a clear trend, treating an IV as continuous will give more power. But the trend may be non-linear!
m <- lm(bac2 ~ day2, Bugs2)
anova(m)
## Analysis of Variance Table
##
## Response: bac2
##           Df  Sum Sq Mean Sq F value Pr(>F)
## day2       1   0.494  0.4938  0.0604 0.8086
## Residuals 18 147.144  8.1747
In this example, assuming linearity is clearly foolish!
Still, day seems to affect bacterial number. Treating it as a factor, factor(day), takes 9 Df, which is not ideal due to the loss of power.
There are methods that permit modelling of non-linear trends.
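As a rough illustration of the trade-off, the sketch below compares a straight line, a ten-level factor and a quadratic polynomial. The data frame `toy` is entirely made up (a symmetric hump over 10 days, 2 flasks per day, with a fixed offset of ±0.5 instead of random noise), and the quadratic via `poly()` is just one simple option, assumed here for illustration rather than being the method these slides go on to use.

```r
# Hypothetical hump-shaped data: 10 days, 2 flasks per day, no random noise
toy <- data.frame(day = rep(1:10, each = 2))
toy$bacteria <- 30 - (toy$day - 5.5)^2 + rep(c(-0.5, 0.5), times = 10)

# Straight line: 1 Df, but it misses the symmetric hump entirely
fit_line <- anova(lm(bacteria ~ day, data = toy))

# Ten-level factor: captures the hump, at the cost of 9 Df
fit_factor <- anova(lm(bacteria ~ factor(day), data = toy))

# Quadratic polynomial: captures the hump with only 2 Df
fit_quad <- anova(lm(bacteria ~ poly(day, 2), data = toy))

print(fit_line)
print(fit_factor)
print(fit_quad)
```

In this constructed case the quadratic explains the same variation as the factor while leaving 17 rather than 10 residual Df. With real data the shape of the trend is unknown, which is why plotting comes first.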