2024-01-29

Orthogonality

in the context of continuous and categorical variables

I shall continue to use ANCOVA as shorthand for “analyses that combine continuous and categorical variables”. Remember that it is just another application of LMs, and that the continuous variable may well be of primary interest.

ANCOVA is usually non-orthogonal

Recall that orthogonality is about shared information between explanatory variables (EVs).

  • Observational studies will almost always be at least somewhat non-orthogonal.

Example: the simulated bodyfat data, with two EVs, sex and bodyweight.

  • Males and females (sex) differ in mean bodyweight:
    \(\rightarrow\) sex entails information about bodyweight and vice versa.
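To make the idea of shared information concrete, here is a minimal sketch on made-up data. The data frame `sim`, the group means (65 kg and 80 kg) and the standard deviation are assumptions chosen for illustration; they are not the simulated bodyfat data referred to above.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 50       # hypothetical number of individuals per sex

sim <- data.frame(sex = factor(rep(c("female", "male"), each = n)))

# Assumed group means and SD: illustrative values only
sim$bodyweight <- rnorm(2 * n,
                        mean = ifelse(sim$sex == "male", 80, 65),
                        sd   = 10)

# Mean of the continuous EV at each level of the categorical EV
group_means <- tapply(sim$bodyweight, sim$sex, mean)
print(group_means)

# Predicting one EV from the other: a large F ratio signals shared information
sex_weight_anova <- anova(lm(bodyweight ~ sex, data = sim))
print(sex_weight_anova)
```

If the two group means differ clearly, knowing the sex of an individual tells you something about its bodyweight, which is exactly what non-orthogonality means here.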

DOE can reduce non-orthogonality

In an experimental study, you would assign the categorical treatment EV to experimental units by randomisation.

  • Experimental units differ in the continuous EV (the ‘covariate’); but
  • the distribution of the continuous EV is then, ideally, similar across all treatment groups.

This reduces any non-orthogonality to just stochastic differences in the distribution of the continuous EV between groups.

(Whereas sex is not a treatment — the bodyweight differences are inherent).
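The effect of randomisation can be sketched on made-up data. The data frame `units`, its covariate values and the group sizes below are hypothetical; the point is only that, after random assignment, the covariate means of the groups differ by chance alone.

```r
set.seed(2)   # arbitrary seed, for reproducibility only

# 40 hypothetical experimental units with a pre-existing continuous covariate
units <- data.frame(covariate = rnorm(40, mean = 10, sd = 2))

# Randomly assign 20 units to each treatment
units$treatment <- factor(sample(rep(c("control", "treated"), each = 20)))

# Covariate means per group: any difference is purely stochastic
print(tapply(units$covariate, units$treatment, mean))

# Remaining non-orthogonality, expressed as an F ratio
rand_anova <- anova(lm(covariate ~ treatment, data = units))
print(rand_anova)
```

Rerunning this with different seeds gives an impression of how large the chance differences between groups can be at this sample size.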

Can it go wrong?

Grazing study from earlier: ‘The initial, pre-grazing size of the plant was recorded as the diameter of the top of its rootstock.’

  • Grazing itself could affect rootstock diameter, so you do not want to measure it post-grazing!
  • But as it happened, the grazed group ended up with bigger plants.

Orthogonal ANCOVA designs

In a planned experiment, you can eliminate non-orthogonality entirely if you have full control over the continuous EV.

Example: Effect of lactose treatment (cat. EV) on bacterial growth over 5 days (cont. EV):

  • 20 culture flasks, 10 with lactose added and 10 w/o.
  • each day, 2 flasks per treatment are taken out and analysed for bacterial counts

Orthogonal ANCOVA design.


with(Bugs, table(lactose, day))
##        day
## lactose 1 2 3 4 5
##       1 2 2 2 2 2
##       2 2 2 2 2 2
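A balanced layout like this one can be generated with `expand.grid()`. The sketch below builds a hypothetical data frame called `design` with the same structure as the table above; it is not the actual Bugs dataset, and the column `flask` is an assumed replicate label.

```r
# Every combination of replicate flask, day and treatment: 2 x 5 x 2 = 20 rows
design <- expand.grid(
  flask   = 1:2,            # 2 flasks per treatment per day
  day     = 1:5,            # sampling day
  lactose = factor(1:2)     # two treatment levels, coded 1 and 2 as in the table
)

# Cross-tabulate treatment against day: every cell should hold 2 flasks
design_table <- with(design, table(lactose, day))
print(design_table)
```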

Another test for orthogonality

Since orthogonality means “no shared information”, we can test for it by ANOVA (!)

  • If no shared information, then it should be impossible to learn anything about the continuous EV from the categorical EV.
  • Test: fit a model that seeks to predict the cont. EV (day) from the cat. EV (lactose);
  • If fully orthogonal, the \(F\) ratio should be zero:
lm(day ~ lactose, Bugs) %>% anova()
## Analysis of Variance Table
## 
## Response: day
##           Df Sum Sq Mean Sq F value Pr(>F)
## lactose    1      0  0.0000       0      1
## Residuals 18     40  2.2222

Why? — the average value of day must be exactly the same at each level of lactose
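The same check can be run on a stand-in for the design. The data frame `design` below is a hypothetical reconstruction (2 flasks per treatment per day), not the Bugs dataset itself.

```r
# Hypothetical balanced design: 2 flasks x 5 days x 2 treatments
design <- expand.grid(flask = 1:2, day = 1:5, lactose = factor(1:2))

# Mean day at each level of lactose: identical in a fully orthogonal design
day_means <- tapply(design$day, design$lactose, mean)
print(day_means)

# Try to predict the continuous EV from the categorical EV
orth_anova <- anova(lm(day ~ lactose, data = design))
print(orth_anova)
```

Because both group means coincide, the between-group sum of squares vanishes and all the variation in day ends up in the residual line.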

Continuous or categorical?

Not always clear-cut

How should we treat day?

It is not always clear-cut whether an EV should be continuous or categorical. The Bugs dataset is the perfect example: bacterial counts are taken every day over five days.

  • we may intuitively treat day as a continuous variable,
  • but we could equally treat it as categorical with five levels.

What is better?

Let’s try both and compare the output.

What’s the difference?

Continuous

lm(bacteria ~ day + lactose)
Adjusted SSQ ANOVA, day as continuous.

             Df   Adj SS   Adj MS        F   Pr(>F)
day           1   297.97   297.97   130.56        0
lactose       1   397.07   397.07   173.99        0
Residuals    17    38.80     2.28

Categorical

lm(bacteria ~ factor(day) + lactose)
Adjusted SSQ ANOVA, day as factor.

              Df   Adj SS   Adj MS        F   Pr(>F)
factor(day)    4   298.51    74.63    27.30        0
lactose        1   397.07   397.07   145.28        0
Residuals     14    38.26     2.73


With day as a five-level factor, the \(F\) ratio for day drops from \(130.56\) to \(27.30\).
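The comparison can be reproduced on simulated data. In the sketch below, `bugs_sim` and all coefficients are made up, so the numbers will not match the tables; `drop1()` is used here as one base-R way to obtain sums of squares adjusted for the other term.

```r
set.seed(3)   # arbitrary seed, for reproducibility only

# Hypothetical balanced design with an assumed linear growth and treatment shift
bugs_sim <- expand.grid(flask = 1:2, day = 1:5, lactose = factor(1:2))
bugs_sim$bacteria <- 5 + 3 * bugs_sim$day +
  8 * (bugs_sim$lactose == "2") +
  rnorm(nrow(bugs_sim), sd = 1.5)

m_cont <- lm(bacteria ~ day + lactose, data = bugs_sim)          # day: 1 DoF
m_fact <- lm(bacteria ~ factor(day) + lactose, data = bugs_sim)  # day: 4 DoF

# Each term is tested after adjusting for the other term in the model
tab_cont <- drop1(m_cont, test = "F")
tab_fact <- drop1(m_fact, test = "F")
print(tab_cont)
print(tab_fact)
```

Compare the `Df`, `Sum of Sq` and `F value` columns for the day term between the two tables.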

Why is this?

Why the difference?



Adjusted SSQ for day increases slightly, from \(297.97\) to \(298.51\):

  • fitting five separate means can never fit worse than a straight line, and will usually fit slightly better (even if the actual relationship is perfectly linear!).
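A quick way to convince yourself of this is to simulate data whose true relationship is exactly linear and compare the residual sums of squares. The variables `day` and `y` below are made up for illustration.

```r
set.seed(4)   # arbitrary seed, for reproducibility only

day <- rep(1:5, each = 4)                 # 5 days, 4 hypothetical observations each
y   <- 2 + 3 * day + rnorm(length(day))   # the truth is a straight line plus noise

# Residual sum of squares: straight line (1 DoF) vs five means (4 DoF)
rss_line  <- deviance(lm(y ~ day))
rss_means <- deviance(lm(y ~ factor(day)))

print(c(line = rss_line, five_means = rss_means))
```

The straight line is a special case of the five-means model, so the five-means fit cannot leave more residual variation; the extra 3 DoF simply soak up noise.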

Why the difference? — DoF!



Adjusted MS for day drops dramatically, from \(297.97\) to \(74.63\).

  • day as a five-level factor has 4 DoF,
  • day as a straight line has only 1 DoF!
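The arithmetic behind the two \(F\) ratios can be retraced from the tabulated values: each mean square is the sum of squares divided by its DoF, and \(F\) is the mean square for day divided by the residual mean square. The variable names below are chosen for this sketch only, and small discrepancies with the tables are due to rounding of the printed inputs.

```r
# Residual mean squares: residual SS / residual DoF
ms_resid_cont <- 38.80 / 17   # day as continuous
ms_resid_fact <- 38.26 / 14   # day as five-level factor

# Mean squares for day: adjusted SS / DoF
ms_day_cont <- 297.97 / 1
ms_day_fact <- 298.51 / 4

# F ratio = MS(day) / MS(residual)
f_cont <- ms_day_cont / ms_resid_cont
f_fact <- ms_day_fact / ms_resid_fact

print(round(c(continuous = f_cont, factor = f_fact), 2))
```

Almost the same sum of squares is spread over four DoF instead of one, which is what shrinks the mean square and hence the \(F\) ratio.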

Here, treating day as continuous gives more power!

But is continuous always better?

It depends… you need to plot the data:

  • growth looks very linear in each group;
  • although parallel lines seem inappropriate…

If there is a clear trend, treating an EV as continuous will give more power. But the trend may be non-linear!

You need to check linearity

m <- lm(bac2 ~ day2, Bugs2)
anova(m)
## Analysis of Variance Table
## 
## Response: bac2
##           Df  Sum Sq Mean Sq F value Pr(>F)
## day2       1   0.494  0.4938  0.0604 0.8086
## Residuals 18 147.144  8.1747

In this example, assuming linearity is clearly foolish!

  • day seems to affect bacterial number:
    there was growth followed by decline.
  • factor(day) would take 9 DoF, which is not ideal because it costs power.

There are methods that permit modelling of non-linear trends.
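One simple option, shown here only as a sketch, is to add a quadratic term and compare nested models. The data below are simulated with a made-up rise-and-fall pattern; the layout of 10 days with 2 observations per day is an assumption consistent with the DoF above, not the actual Bugs2 data.

```r
set.seed(5)   # arbitrary seed, for reproducibility only

day2 <- rep(1:10, each = 2)   # assumed layout: 10 days, 2 observations per day
# Hypothetical growth followed by decline, peaking mid-way
bac2 <- 10 - 0.4 * (day2 - 5.5)^2 + rnorm(length(day2), sd = 1)

m_line <- lm(bac2 ~ day2)               # straight line: 1 DoF
m_quad <- lm(bac2 ~ day2 + I(day2^2))   # line plus curvature: 2 DoF
m_fact <- lm(bac2 ~ factor(day2))       # one mean per day: 9 DoF

# Does curvature improve on the straight line?
line_vs_quad <- anova(m_line, m_quad)
print(line_vs_quad)

# Is there structure left that the quadratic misses?
quad_vs_fact <- anova(m_quad, m_fact)
print(quad_vs_fact)
```

A quadratic captures a single rise and fall with only 2 DoF; whether it is adequate for a given dataset still has to be judged from plots and from comparisons like the second one.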

Week’s Summary

  • We have used the LM framework to analyse the effect of continuous and categorical EVs (‘ANCOVA’).
  • We have defined orthogonality between a continuous and a categorical variable.
  • Some variables may be treated as either continuous or categorical.
    • If the data follow a trend, continuous will increase power;
    • But the trend may not be linear!