BS2004 | Week 3

2024-01-29

Blocking

Controlling for known confounds

In the last mini-lecture, we introduced randomised designs as a way to guard against selection biases and unconscious patterns.

Can we do better still?

We may have information about confounds, or about the pattern they follow:

Low water level
      +-------+-------+-------+-------+
  |   ¦       ¦       ¦       ¦       ¦
  |   ¦   A   ¦   A   ¦   A   ¦   A   ¦
  |   ¦       ¦       ¦       ¦       ¦
G |   +-------+-------+-------+-------+
r |   ¦       ¦       ¦       ¦       ¦
a |   ¦   B   ¦   B   ¦   B   ¦   B   ¦
d |   ¦       ¦       ¦       ¦       ¦
i |   +-------+-------+-------+-------+
e |   ¦       ¦       ¦       ¦       ¦
n |   ¦   C   ¦   C   ¦   C   ¦   C   ¦
t |   ¦       ¦       ¦       ¦       ¦
  |   +-------+-------+-------+-------+
  |   ¦       ¦       ¦       ¦       ¦
  |   ¦   D   ¦   D   ¦   D   ¦   D   ¦
  V   ¦       ¦       ¦       ¦       ¦
      +-------+-------+-------+-------+
High water level

E.g., gradient across plots.
This design is bad: it confounds fertiliser with water level.
We could simply randomise.
But can we explicitly account for – model – the effect of water level?

Yes: by blocking.

Blocking for water level

With blocking, we’re conducting the same ‘mini-experiment’ at four water levels:

Low water level
      +-------+-------+-------+-------+
  |   ¦       ¦       ¦       ¦       ¦
  |   ¦   A   ¦   B   ¦   C   ¦   D   ¦ Block 1
  |   ¦       ¦       ¦       ¦       ¦
G |   +-------+-------+-------+-------+
r |   ¦       ¦       ¦       ¦       ¦
a |   ¦   D   ¦   A   ¦   B   ¦   C   ¦ Block 2
d |   ¦       ¦       ¦       ¦       ¦
i |   +-------+-------+-------+-------+
e |   ¦       ¦       ¦       ¦       ¦
n |   ¦   C   ¦   B   ¦   A   ¦   D   ¦ Block 3
t |   ¦       ¦       ¦       ¦       ¦
  |   +-------+-------+-------+-------+
  |   ¦       ¦       ¦       ¦       ¦
  |   ¦   D   ¦   C   ¦   A   ¦   B   ¦ Block 4
  V   ¦       ¦       ¦       ¦       ¦
      +-------+-------+-------+-------+
High water level

Each block is a ‘mini-experiment’ with all four fertiliser treatments.
Randomise fertiliser order within each block.
For each plot, we record block and fertiliser.
Analyse all blocks in one linear model (LM), to estimate block and fertiliser effects:

lm(yield ~ block + fertiliser)
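
The within-block randomisation and the resulting data layout can be sketched in R as follows. This is a minimal illustration rather than the lecture's own code: the seed, the data frame name `design`, and the column name `position` are assumptions made for the example.

```r
set.seed(2024)  # arbitrary seed, only so the sketch is reproducible

fertilisers <- c("A", "B", "C", "D")
n_blocks <- 4

# For each block, draw an independent random order of the four fertilisers
design <- do.call(rbind, lapply(seq_len(n_blocks), function(b) {
  data.frame(
    block      = b,
    position   = seq_along(fertilisers),  # hypothetical plot position within the block
    fertiliser = sample(fertilisers)      # random order, each fertiliser used once
  )
}))
design$block      <- factor(design$block)
design$fertiliser <- factor(design$fertiliser)

# Every fertiliser should appear exactly once in every block
table(design$block, design$fertiliser)
```

Once the yields have been recorded in an extra column (say `yield`), the same data frame can be passed to `lm(yield ~ block + fertiliser, data = design)`.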

Bean counting

The advantages of blocked design

The Beans dataset is from a similar experiment:

Comparing the yield of 6 varieties of bean;
24 plots, in 4 blocks of 6 plots each.

We can see the difference between a fully randomised design and a randomised block design by analysing the data in two ways:

Ignoring block in the analysis (i.e., pretending the design was fully randomised):
lm(yield ~ bean)
Are there differences in yield between the bean varieties?
Including block in the analysis:
lm(yield ~ block + bean)
Are there differences in yield between the bean varieties, once any differences between the blocks have been accounted for?

Bean counting

Analysed as a fully randomised design

yield.m1 <- lm(yield ~ bean,  # ignoring blocks
               data = Beans,
               contrasts = list(bean=contr.sum)
               )
anova(yield.m1)

## Analysis of Variance Table
## 
## Response: yield
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## bean       5 444.43  88.887  14.586 8.579e-06 ***
## Residuals 18 109.69   6.094                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bean counting

Analysed as a randomised block design

yield.m2 <- lm(yield ~ block + bean,  # including blocks info
               data = Beans,
               contrasts = list(block=contr.sum, bean=contr.sum)
               )
anova(yield.m2)

## Analysis of Variance Table
## 
## Response: yield
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## block      3  52.90  17.632  4.6567   0.01713 *  
## bean       5 444.44  88.887 23.4757 1.341e-06 ***
## Residuals 15  56.79   3.786                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The gains from blocking

Note: the small \(P\)-value for block in the ANOVA table (\(P = 0.017\)) already indicates that blocks account for some of the observed variation, so blocking was worthwhile…

By taking account of variation between blocks in our model, we have

reduced the Error MSQ, \(6.094 \longrightarrow 3.786\),
despite lower Error DoF \(18 \longrightarrow 15\);
increased F for bean: \(F=14.586 \longrightarrow 23.4757\);
increased the strength of our evidence for an effect of bean:
\(P=8.579\times 10^{-6} \longrightarrow 1.341\times 10^{-6}\).
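
The F ratios and P-values quoted above can be re-derived by hand from the mean squares in the two ANOVA tables. The sketch below uses the rounded table values, so it should agree with the printed output only to about three significant figures; the variable names are introduced here for illustration.

```r
# Mean squares copied from the two ANOVA tables
msq_bean  <- 88.887
msq_err_1 <- 6.094   # error MSQ with blocks ignored (18 df)
msq_err_2 <- 3.786   # error MSQ with blocks included (15 df)

# F is the treatment MSQ divided by the error MSQ
f1 <- msq_bean / msq_err_1
f2 <- msq_bean / msq_err_2

# P-values are upper-tail probabilities of the F distribution
p1 <- pf(f1, df1 = 5, df2 = 18, lower.tail = FALSE)
p2 <- pf(f2, df1 = 5, df2 = 15, lower.tail = FALSE)

c(F_unblocked = f1, F_blocked = f2)
c(P_unblocked = p1, P_blocked = p2)
```

The bean MSQ is the same in both analyses; only the denominator (the error MSQ) and the error DoF change.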

Partitioning SSQs in blocked design

An ANOVA that factors in the blocks

partitions the total SSQ and DoF into
- between-block: here, 3 df; and
- within-block: here, 20 df
further partitions the within-block SSQ and DoF into
- treatment: here, 5 df (here, treatment is the bean variety) and
- error: 15 df
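
As a quick arithmetic check of this partition, the sketch below adds up the sums of squares and degrees of freedom from the blocked ANOVA table. The numbers are copied from the output shown earlier; the variable names are made up for the example.

```r
# Sums of squares and degrees of freedom from the blocked ANOVA table
ssq <- c(block = 52.90, bean = 444.44, error = 56.79)
dof <- c(block = 3,     bean = 5,      error = 15)

# Within-block = treatment + error
ssq_within <- ssq[["bean"]] + ssq[["error"]]
dof_within <- dof[["bean"]] + dof[["error"]]

# Total = between-block + within-block
ssq_total <- ssq[["block"]] + ssq_within
dof_total <- dof[["block"]] + dof_within   # 24 plots - 1

# If blocks are ignored, the block SSQ and DoF are absorbed into the error term
ssq_error_unblocked <- ssq[["block"]] + ssq[["error"]]
dof_error_unblocked <- dof[["block"]] + dof[["error"]]

c(within = ssq_within, total = ssq_total, error_unblocked = ssq_error_unblocked)
```

The last quantity matches the residual SSQ (109.69 on 18 df) of the analysis that ignored blocks.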

Another perspective

Error \((\epsilon)\) is the unexplained scatter around the fitted means — around the estimates / predicted values.

The aim of the analysis is to separate the signal from the noise:

(differences in) fitted means estimate the signal in the data
the error variance \(\sigma^2\), estimated by the Error MSQ, measures the noise in the data

\[\epsilon \sim \mathcal{N}(0, \sigma^2)\]

Blocking allows us to ‘push’ some of the variance from the error term onto known confounds, and thus gives us a clearer picture of the signal.
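
This 'pushing' of variance can be seen in a small simulation. The sketch below uses entirely made-up data (not the Beans dataset): the block effects, treatment effects, noise level and seed are all assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated randomised block design: 4 blocks x 4 treatments, one plot each
sim <- expand.grid(block = factor(1:4), treat = factor(LETTERS[1:4]))

block_effect <- c(-3, -1, 1, 3)   # assumed gradient across blocks
treat_effect <- c(0, 1, 2, 3)     # assumed treatment differences

sim$y <- 10 +
  block_effect[as.integer(sim$block)] +
  treat_effect[as.integer(sim$treat)] +
  rnorm(nrow(sim), mean = 0, sd = 1)   # epsilon ~ N(0, 1)

m_ignore <- lm(y ~ treat, data = sim)          # blocks ignored
m_block  <- lm(y ~ block + treat, data = sim)  # blocks modelled

# Error MSQ = residual SSQ / residual DoF
mse_ignore <- sum(residuals(m_ignore)^2) / df.residual(m_ignore)
mse_block  <- sum(residuals(m_block)^2)  / df.residual(m_block)

c(ignoring_blocks = mse_ignore, with_blocks = mse_block)
```

Because the simulated design is balanced, the treatment SSQ is identical in both models; only the error term shrinks once the block variation is modelled.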

Blocked designs

Blocks should…

represent a factor known/believed to affect the response.
be internally as homogeneous as possible, and therefore as different from one another as possible: this maximises the SSQ that can be accounted for by block.

Other applications

It’s not just about counting beans on plots of land!

Changes over time: if you cannot do the entire experiment in one session / day / week… time becomes a blocking variable.
Differences between batches, litters…

Latin Squares

What to do if there are two possible confounds? We can block for both!

    C o l u m n   b l o c k s
+-------+-------+-------+-------+
¦       ¦       ¦       ¦       ¦
¦   A   ¦   B   ¦   C   ¦   D   ¦
¦       ¦       ¦       ¦       ¦  R
+-------+-------+-------+-------+  o
¦       ¦       ¦       ¦       ¦  w
¦   B   ¦   C   ¦   D   ¦   A   ¦ 
¦       ¦       ¦       ¦       ¦ 
+-------+-------+-------+-------+  B
¦       ¦       ¦       ¦       ¦  l
¦   C   ¦   D   ¦   A   ¦   B   ¦  o
¦       ¦       ¦       ¦       ¦  c
+-------+-------+-------+-------+  k
¦       ¦       ¦       ¦       ¦  s
¦   D   ¦   A   ¦   B   ¦   C   ¦ 
¦       ¦       ¦       ¦       ¦
+-------+-------+-------+-------+

Need not be plots on a field
N treatments = N rows = N columns
Fails if the gradient is diagonal (i.e., if there is a row:column interaction)

Model becomes

lm(y ~ row + column + treat)
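
One way to build and analyse such a layout in R is sketched below. It starts from the cyclic square shown in the diagram and shuffles its rows and columns, which keeps the Latin-square property. The seed, the object names and the placeholder response `y` (pure noise, standing in for real measurements) are assumptions for illustration only.

```r
set.seed(1)  # arbitrary seed for reproducibility

n <- 4
treatments <- LETTERS[1:n]

# Cyclic Latin square: cell (i, j) gets treatment number ((i + j - 2) mod n) + 1
square <- outer(1:n, 1:n, function(i, j) treatments[(i + j - 2) %% n + 1])

# Randomise by permuting whole rows and whole columns
square <- square[sample(n), sample(n)]

# Long format: one line per plot (as.vector() reads the matrix column by column)
layout <- data.frame(
  row    = factor(rep(1:n, times = n)),
  column = factor(rep(1:n, each = n)),
  treat  = factor(as.vector(square))
)

# Placeholder response (noise only), so that the model can be fitted
layout$y <- rnorm(nrow(layout))

fit <- lm(y ~ row + column + treat, data = layout)
anova(fit)
```

With N = 4, rows, columns and treatments each take 3 DoF, leaving 6 DoF for error out of the 15 available.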