2024-01-29

Coding categorical \(X\)

Dummies for dummies

Starting simple

A single categorical predictor with three levels: comparing the effect of three kinds of fertiliser on yield (see Grafen & Hails, Section 3.2, ‘Expressing all models as linear equations’).

The R model would be

lm(yield ~ treatment)

Grafen & Hails’ ‘pseudo-matrix’ notation (as in Ch. 5.2, p. 87) would be

\[y = \mu + \begin{bmatrix} \mathrm{Treatment} & \mathrm{Difference} \\ 1 & \alpha_1 \\ 2 & \alpha_2 \\ 3 & -\alpha_1-\alpha_2 \end{bmatrix} + \epsilon \]

Writing out the model

Take our \(n\) measurements of \(y\), with \(i = 1\ldots n\).

Then, the value of the \(i^\mathrm{th}\) observation \(y_i\) is:

  • \(y_i = \mu + \alpha_1 + \epsilon_i\) if treatment = Fertiliser 1;
  • \(y_i = \mu + \alpha_2 + \epsilon_i\) if treatment = Fertiliser 2;
  • \(y_i = \mu - \alpha_1 - \alpha_2 + \epsilon_i\) if treatment = Fertiliser 3.

Can we write this as a single equation that captures the “if fertiliser X” part? Yes:

We need ‘ON-OFF switches’ for the coefficients, so that \(\alpha_1\) gets ‘switched on’ for fertiliser 1, \(\alpha_2\) for fertiliser 2, and both get ‘switched on’ with a negative sign for fertiliser 3.

Coding categorical variables

We build the ‘if fertiliser is X’ switches by introducing ‘dummy variables’. We’ll need two \((x_1, x_2)\) for three fertilisers, and in general \(m-1\) for a categorical variable with \(m\) levels — note the link to degrees of freedom here!

We set the ‘switches’ like this:

  • Fertiliser 1: \(x_1 = 1, x_2 = 0\)
  • Fertiliser 2: \(x_1 = 0, x_2 = 1\)
  • Fertiliser 3: \(x_1 = -1, x_2 = -1\)

\(\mathbf C=\left[\begin{array}{rr} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{array}\right]\)

This is the “contrast matrix”.

We can now write a single linear equation for the linear model: \[y_i = \mu + \alpha_1 x_{1,i} + \alpha_2 x_{2,i} + \epsilon_i\]
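
To make this concrete, here’s a minimal sketch (the data layout and the labels F1–F3 are hypothetical) that builds the two switch columns by hand; fitting lm(y ~ x1 + x2) on these columns should give the same fit as lm(y ~ treatment) under this coding:

treatment <- factor(rep(c("F1", "F2", "F3"), each = 4))  # 12 hypothetical plots
x1 <- ifelse(treatment == "F1", 1, ifelse(treatment == "F3", -1, 0))
x2 <- ifelse(treatment == "F2", 1, ifelse(treatment == "F3", -1, 0))
data.frame(treatment, x1, x2)  # one row per plot, switches set as above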

How do the dummies work?

\[y_i = \mu + \alpha_1 x_{1,i} + \alpha_2 x_{2,i} + \epsilon_i\]

  • For Fertiliser 1 \((x_1 = 1, x_2 = 0)\), we get
    \(y_i = \mu + (\alpha_1 \times 1) + (\alpha_2 \times 0) + \epsilon_i = \mu + \alpha_1 + \epsilon_i\)

  • For Fertiliser 2 \((x_1 = 0, x_2 = 1)\), we get
    \(y_i = \mu + (\alpha_1 \times 0) + (\alpha_2 \times 1) + \epsilon_i = \mu + \alpha_2 + \epsilon_i\)

  • For Fertiliser 3 \((x_1 = -1, x_2 = -1)\), we get
    \(y_i = \mu + (\alpha_1 \times (-1)) + (\alpha_2 \times (-1)) + \epsilon_i = \mu - \alpha_1 - \alpha_2 + \epsilon_i\)

Same for more than 3 categories

\[ \begin{array}{lrrr} &x_1&x_2&x_3\\ \mathrm{Group\ A:}&1&0&0 \\ \mathrm{Group\ B:}&0&1&0 \\ \mathrm{Group\ C:}&0&0&1 \\ \mathrm{Group\ D:}&-1&-1&-1 \end{array} \]

\[y_i = \mu + \alpha_1 x_{1,i} + \alpha_2 x_{2,i} + \alpha_3 x_{3,i} + \epsilon_i\]
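
R can generate this matrix for you: contr.sum() takes the number of levels.

contr.sum(4)  # sum-contrast matrix for a four-level factor
##   [,1] [,2] [,3]
## 1    1    0    0
## 2    0    1    0
## 3    0    0    1
## 4   -1   -1   -1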

Matrix notation for LMs

Fear not!

For God hath not given us the spirit of fear; but of power, and of love, and of a sound mind.

— 2 Timothy 1:7

Spelling it out for all \(y_i\):

\[y_i = \mu + \alpha_1 x_{1,i} + \alpha_2 x_{2,i} + \epsilon_i\] …if we write this out for all our \(y_i\), \(i=1, \ldots, n\), we get:

  • \(y_1 = \mu + \alpha_1 x_{1,1} + \alpha_2 x_{2,1} + \epsilon_1\)
  • \(y_2 = \mu + \alpha_1 x_{1,2} + \alpha_2 x_{2,2} + \epsilon_2\)
  • \(y_3 = \mu + \alpha_1 x_{1,3} + \alpha_2 x_{2,3} + \epsilon_3\)
  • \(\ldots\) — pretty tedious and redundant!
  • \(y_n = \mu + \alpha_1 x_{1,n} + \alpha_2 x_{2,n} + \epsilon_n\)

We have \(n\) equations that all look the same!
Time for more elegant notation.

Vectors and matrices

Let’s clean up this redundancy and…

  • collect all our \(n\) measurements \(y_1 \ldots y_n\) into a vector \(\mathbf y\);
  • combine the dummy-coded categorical variable \(x_1,x_2\) into a matrix \(\mathbf X\) with an extra column of \(1\)’s — this is the ‘switch’ for \(\mu\), which we want to be always ‘ON’;
  • collect our coefficients \(\mu, \alpha_1, \alpha_2\) into a vector \(\boldsymbol\beta\);
  • and do the same for our \(n\) errors \(\epsilon_i\) — they go into \(\boldsymbol\epsilon\):

\[\mathbf y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},\ \mathbf X = \begin{bmatrix} 1 &x_{1,1} & x_{2,1}\\ 1 & x_{1,2} & x_{2,2} \\ \vdots & \vdots & \vdots \\ 1 & x_{1,n} & x_{2,n} \end{bmatrix},\ \boldsymbol\beta = \begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \end{bmatrix},\ \boldsymbol\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \]

Simplicity, finally!

We can now state the whole model as

\[\mathbf{y} = \mathbf X\boldsymbol\beta+\boldsymbol\epsilon\] which, if you’ve had (and remember) some matrix arithmetic, means

\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 &x_{1,1} & x_{2,1}\\ 1 & x_{1,2} & x_{2,2} \\ \vdots & \vdots & \vdots \\ 1 & x_{1,n} & x_{2,n} \end{bmatrix} \begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \]
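
You can also ask R for \(\mathbf X\) directly: model.matrix() builds the design matrix from a formula. A minimal sketch with a hypothetical six-observation factor:

treatment <- factor(rep(c("F1", "F2", "F3"), each = 2))
contrasts(treatment) <- contr.sum
model.matrix(~ treatment)  # X: the column of 1s plus the two dummy columns
##   (Intercept) treatment1 treatment2
## 1           1          1          0
## 2           1          1          0
## 3           1          0          1
## 4           1          0          1
## 5           1         -1         -1
## 6           1         -1         -1
## (attributes omitted)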

The universal LM

\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\]

Coefficients are often called ‘betas’: the “intercept” (Grafen & Hails’ \(\mu\)) becomes \(\beta_0\), and the remaining coefficients become \(\beta_1, \beta_2, \ldots\)

This works for any number \(m\) of coefficients, mixing multiple dummy-coded factors and continuous variables:

\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & \ldots & x_{m,1}\\ 1 & x_{1,2} & \ldots & x_{m,2} \\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{1,n} & \ldots & x_{m,n} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_m \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \]
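
The same machinery covers mixed predictors. In this sketch (group labels and dose values are made up), a three-level factor contributes two dummy columns (using R’s default coding, introduced below) and a continuous variable contributes its values as one column:

grp  <- factor(c("A", "A", "B", "B", "C", "C"))
dose <- c(1.2, 0.7, 1.5, 0.9, 1.1, 1.4)
model.matrix(~ grp + dose)  # intercept, two dummies for grp, dose as its own column
##   (Intercept) grpB grpC dose
## 1           1    0    0  1.2
## 2           1    0    0  0.7
## 3           1    1    0  1.5
## 4           1    1    0  0.9
## 5           1    0    1  1.1
## 6           1    0    1  1.4
## (attributes omitted)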

Contrasts

Deviation Contrasts

The coding scheme for categorical \(X\) used in Grafen & Hails is known as deviation contrasts, or ‘sum contrasts’ in R.

  • Group A: \(x_1 = 1, x_2 = 0\)
  • Group B: \(x_1 = 0, x_2 = 1\)
  • Group C: \(x_1 = -1, x_2 = -1\)

\(\mathbf C=\left[\begin{array}{rr} 1&0\\0&1 \\ -1&-1 \end{array}\right]\)

Note how the columns of the contrast matrix sum to zero.
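
You can verify this quickly in R:

colSums(contr.sum(3))  # each column sums to zero
## [1] 0 0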

You set sum contrasts like this in R:

group <- factor(c("A", "B", "C"))
contrasts(group) <- contr.sum  # set contrasts
contrasts(group)  # contrast matrix
##   [,1] [,2]
## A    1    0
## B    0    1
## C   -1   -1

Treatment Contrasts

The default coding in R is different! R uses treatment contrasts AKA ‘dummy coding’:

  • Group A: \(x_1 = 0, x_2 = 0\)
  • Group B: \(x_1 = 1, x_2 = 0\)
  • Group C: \(x_1 = 0, x_2 = 1\)

\(\mathbf C=\left[\begin{array}{rr} 0&0 \\1&0 \\0&1 \end{array}\right]\)

Note how the columns do not sum to zero.

This is what you get unless you tell R otherwise!

group <- factor(c("A", "B", "C"))  # no explicit contrasts set
contrasts(group)  # contrast matrix is for 'treatment contrasts'!
##   B C
## A 0 0
## B 1 0
## C 0 1

You get different coefficients!

The model equation is still \[y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i\]

but the interpretation of the coefficients \(\boldsymbol\beta\) is very different:

  • \(\beta_0\): the intercept is now the mean of Group A (the reference level), \(E(y\mid x =\mathrm{A})=\beta_0\).
  • \(\beta_1\): difference between the means of Groups B and A, \(E(y\mid x=\mathrm{B})=\beta_0+\beta_1\).
  • \(\beta_2\): difference between the means of Groups C and A, \(E(y\mid x=\mathrm{C})=\beta_0 + \beta_2\).

…and they test different \(H_0\)’s

The ‘meaning’ of the \(P\)-values that come with the coefficient estimates also changes when the coding scheme changes:

  • \(H_{0}: \beta_0 = 0\). Is the mean in Group A different from zero? — not interesting.
  • \(H_{0}: \beta_1 = 0\). Is the mean in B different from A? — interesting!
  • \(H_{0}: \beta_2 = 0\). Is the mean in C different from A? — interesting!

But you don’t get a test comparing B and C… for this, you’d need a post-hoc test (see the sketch at the end of the simulation example below).

Simulation example

Three groups A, B, C, with means of 0, 1, 2:

library(magrittr)  # for the %>% pipe
y <- c(rnorm(n=100, mean=0), rnorm(n=100, mean=1), rnorm(n=100, mean=2))
x <- factor(rep(c("A", "B", "C"), each=100))

lm(y ~ x) %>% summary %>% coefficients %>% round(3)  # default treatment contrasts
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -0.157      0.100  -1.569    0.118
## xB             1.198      0.141   8.477    0.000
## xC             2.311      0.141  16.354    0.000
contrasts(x) <- contr.sum  # switch to sum (deviation) contrasts
lm(y ~ x) %>% summary %>% coefficients %>% round(3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    1.013      0.058  17.557     0.00
## x1            -1.170      0.082 -14.336     0.00
## x2             0.028      0.082   0.346     0.73
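
Note how, under sum contrasts, the intercept (≈ 1.01) is the grand mean of the three group means, and x1 and x2 are the deviations of Groups A and B from it. Neither coding tests B against C directly; as a minimal post-hoc sketch with the simulated data, Tukey’s HSD on an aov() refit gives all pairwise comparisons:

fit <- aov(y ~ x)  # same model, refit with aov() so TukeyHSD() can use it
TukeyHSD(fit)      # all pairwise group differences, with adjusted p-values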