Planned comparisons and posthoc analyses

3/20/2024

Motivating example

…

In the PISA 2018 study, reading skills in 15-year-olds were compared across 79 countries.

A small example data set

Small data set
ID	group	score
1	A	72
2	B	88
3	C	76
4	A	89
5	C	46
6	B	17
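
The code on the following slides pipes an object called `dataset` into `lm()`, but its construction is not shown. A minimal sketch, assuming it is a plain data frame with `group` stored as a factor (the original may well be a tibble read from a file):

```r
# Recreate the six-row example data.
# Assumption: `group` is a factor, so that relevel() and contrasts() apply to it.
dataset <- data.frame(
  ID    = 1:6,
  group = factor(c("A", "B", "C", "A", "C", "B")),
  score = c(72, 88, 76, 89, 46, 17)
)

str(dataset)
```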

Last year we learned:

Categorical variables can be coded into numeric variables
These numeric variables are usually dummy (0/1) variables (R default)
Output gives differences between groups and a reference group
Reference group is usually the first group (R default)
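
These defaults can be inspected directly. The sketch below uses a stand-alone factor with the same three levels as the example data; `coding` is just an illustrative name.

```r
# A factor with the three groups from the example
group <- factor(c("A", "B", "C", "A", "C", "B"))

# Levels are sorted alphabetically, so "A" comes first and acts as the reference
levels(group)

# Default coding for an unordered factor: treatment (dummy) coding.
# Rows are the groups, columns are the dummy variables R will create.
coding <- contrasts(group)
print(coding)
```

The reference group is the row that contains only zeros.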

A small example data set

dataset %>% lm(score ~ group, data = .) %>% 
  tidy()

## # A tibble: 3 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)     80.5      22.8     3.53   0.0386
## 2 groupB         -28.0      32.2    -0.869  0.449 
## 3 groupC         -19.5      32.2    -0.605  0.588

Which group has the highest mean?
Which group has the lowest mean?

Coding categorical variables: design matrix

Original Data plus Design Matrix
ID	group	score	(Intercept)	groupB	groupC
1	A	72	1	0	0
2	B	88	1	1	0
3	C	76	1	0	1
4	A	89	1	0	0
5	C	46	1	0	1
6	B	17	1	1	0

Coding categorical variables: design matrix

model.matrix(score ~ group, data = dataset)

##   (Intercept) groupB groupC
## 1           1      0      0
## 2           1      1      0
## 3           1      0      1
## 4           1      0      0
## 5           1      0      1
## 6           1      1      0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Coding categorical variables: design matrix

Design Matrix
(Intercept)	groupB	groupC
1	0	0
1	1	0
1	0	1
1	0	0
1	0	1
1	1	0
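
One way to see what the design matrix does is to multiply it by the estimated coefficients: each row then yields the predicted score for that observation, which is its group mean. A small sketch, assuming the data are rebuilt as a base R data frame (the name `fitted_by_hand` is illustrative):

```r
dataset <- data.frame(
  group = factor(c("A", "B", "C", "A", "C", "B")),
  score = c(72, 88, 76, 89, 46, 17)
)

fit <- lm(score ~ group, data = dataset)
X   <- model.matrix(fit)            # the design matrix shown above

# Row i of X times the coefficient vector = predicted score for observation i
fitted_by_hand <- drop(X %*% coef(fit))
print(fitted_by_hand)
```

For a row with `groupB = 1`, this adds the intercept and the `groupB` coefficient; for a group A row, only the intercept remains.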

Coding categorical variables: design matrix

## # A tibble: 3 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)     80.5      22.8     3.53   0.0386
## 2 groupB         -28.0      32.2    -0.869  0.449 
## 3 groupC         -19.5      32.2    -0.605  0.588

Output is a set of contrasts:

groupB: the effect of this new dummy variable is the contrast between group B and group A (the reference group)
groupC: the contrast between group C and group A
(Intercept): ??

Output is a set of contrasts:

groupB: the effect of this new dummy variable is the contrast between group B and group A (the reference group)
groupC: the contrast between group C and group A
(Intercept): the contrast between group A and 0

Output is a set of contrasts:

groupB: group B - group A
groupC: group C - group A
(Intercept): group A - 0

Output is a set of contrasts:

groupB: \(\mu_B - \mu_A\)
groupC: \(\mu_C - \mu_A\)
(Intercept): \(\mu_A - 0 = \mu_A\)
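
These identities can be checked against the sample group means. The sketch below rebuilds the example data as a base R data frame and lines up the `lm()` coefficients with the corresponding differences of means; `comparison` is an illustrative name.

```r
dataset <- data.frame(
  group = factor(c("A", "B", "C", "A", "C", "B")),
  score = c(72, 88, 76, 89, 46, 17)
)

means <- tapply(dataset$score, dataset$group, mean)   # sample means per group
b     <- coef(lm(score ~ group, data = dataset))      # default dummy coding

comparison <- rbind(
  from_lm    = unname(b),
  from_means = unname(c(means["A"],
                        means["B"] - means["A"],
                        means["C"] - means["A"]))
)
colnames(comparison) <- names(b)
print(comparison)
```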

Take home message:

How you code your dummy variables has a direct influence on the interpretation of the output.

How the categorical variable is coded influences the contrasts you see in the output.

If you want to see certain contrasts, then change the way you code your variables.

Different dummy variables lead to different contrasts in output

Original Data plus Design Matrix
ID	group	score	(Intercept)	groupA	groupB
1	A	72	1	1	0
2	B	88	1	0	1
3	C	76	1	0	0
4	A	89	1	1	0
5	C	46	1	0	0
6	B	17	1	0	1

Different variables, different contrasts

## # A tibble: 3 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)     61        22.8     2.68   0.0752
## 2 groupA          19.5      32.2     0.605  0.588 
## 3 groupB          -8.5      32.2    -0.264  0.809
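
The slides do not show the code behind this output. One way to obtain it, sketched here under the assumption that the two dummy variables are built by hand, is:

```r
dataset <- data.frame(
  group = factor(c("A", "B", "C", "A", "C", "B")),
  score = c(72, 88, 76, 89, 46, 17)
)

# Hand-made dummy variables: group C gets 0 on both, so C is the reference
dataset$groupA <- as.numeric(dataset$group == "A")
dataset$groupB <- as.numeric(dataset$group == "B")

fit_c_ref <- lm(score ~ groupA + groupB, data = dataset)
print(coef(fit_c_ref))
```

The intercept is now the mean of group C, and the two slopes are the contrasts A - C and B - C.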

Simple solution for everyday situations

Change the reference group using relevel().

dataset %>% 
  mutate(group = relevel(group, ref = "B")) %>% 
  lm(score ~ group, data = .) %>% tidy()

## # A tibble: 3 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    52.5       22.8     2.30    0.105
## 2 groupA         28         32.2     0.869   0.449
## 3 groupC          8.50      32.2     0.264   0.809
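
Note that `relevel()` expects a factor. If `group` happens to be stored as a character vector, it can be converted first. A base R sketch of the same step, assuming character input (`b_ref` is an illustrative name):

```r
dataset <- data.frame(
  group = c("A", "B", "C", "A", "C", "B"),   # character, not yet a factor
  score = c(72, 88, 76, 89, 46, 17),
  stringsAsFactors = FALSE
)

# Convert to a factor, then move "B" to the front so it becomes the reference
dataset$group <- relevel(factor(dataset$group), ref = "B")
levels(dataset$group)

b_ref <- coef(lm(score ~ group, data = dataset))
print(b_ref)
```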

Motivating example

…

In the PISA 2018 study, reading skills in 15-year-olds were compared across 79 countries.

A contrast

Small data set
ID	group	score
1	A	72
2	B	88
3	C	76
4	A	89
5	C	46
6	B	17

A contrast

Contrast 1: Compare the mean of group A with the average of the means of groups B and C.

A contrast

\[L1 = M_A - \frac{(M_B + M_C)}{2}\]

A contrast (2)

\[L1 = M_A - \frac{(M_B + M_C)}{2}\] \[ = M_A - \frac{M_B}{2} - \frac{M_C}{2}\]

\[ = 1\times M_A - 0.5 \times M_B - 0.5 \times M_C\]

A contrast (3)

\[L1 = M_A - \frac{(M_B + M_C)}{2}\] \[ = 1\times M_A - 0.5 \times M_B - 0.5 \times M_C\]

l1 <- c(1, -0.5, -0.5)

Group means and contrast weights
	A	B	C
mean	80.5	52.5	61.0
l1	1.0	-0.5	-0.5
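
With the group means and the weights side by side, the value of the contrast is simply the weighted sum of the means. A short sketch (the names `group_means` and `L1_value` are illustrative):

```r
group_means <- c(A = 80.5, B = 52.5, C = 61)
l1 <- c(1, -0.5, -0.5)

# The weights of a contrast sum to zero
sum(l1)

# L1 = 1 * M_A - 0.5 * M_B - 0.5 * M_C
L1_value <- sum(l1 * group_means)
print(L1_value)   # 23.75
```

So in this sample, group A scores 23.75 points above the average of groups B and C.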

Another contrast

\[L2 = M_B - M_C\]
\[ = 1\times M_B - 1 \times M_C\]
\[ = 1\times M_B - 1 \times M_C + 0 \times M_A\]
\[ = 0 \times M_A + 1 \times M_B -1 \times M_C\]

Group means and contrast weights
	A	B	C
mean	80.5	52.5	61
l2	0.0	1.0	-1

Combining the two contrasts

Group means and contrast weights
	A	B	C
mean	80.5	52.5	61.0
l1	1.0	-0.5	-0.5
l2	0.0	1.0	-1.0

Contrasts

l1 <- c(1, -0.5, -0.5)
l2 <- c(0, 1, -1)
rbind(l1, l2)

##    [,1] [,2] [,3]
## l1    1 -0.5 -0.5
## l2    0  1.0 -1.0
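
Stacking the contrasts as rows has a practical benefit: multiplying the matrix by the vector of group means evaluates all contrasts at once. A sketch using the sample means from the example (`L` and `estimates` are illustrative names):

```r
group_means <- c(A = 80.5, B = 52.5, C = 61)

L <- rbind(l1 = c(1, -0.5, -0.5),
           l2 = c(0,  1,   -1))

# Each row of L times the means gives one contrast value
estimates <- drop(L %*% group_means)
print(estimates)   # l1 = 23.75, l2 = -8.5
```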

Inverting the contrast matrix

Contrast matrix L

l1	1	-0.5	-0.5
l2	0	1.0	-1.0

Inverting (with the generalized inverse, because L is not square)…

Coding scheme S for new variables
	group1	group2
A	0.67	0.0
B	-0.33	0.5
C	-0.33	-0.5

l1 <- c(1, -0.5, -0.5)
l2 <- c(0, 1, -1)
rbind(l1, l2)

##    [,1] [,2] [,3]
## l1    1 -0.5 -0.5
## l2    0  1.0 -1.0

S <- rbind(l1, l2) %>% MASS::ginv()  # ginv() comes from the MASS package
S

##            [,1]          [,2]
## [1,]  0.6666667 -7.514131e-17
## [2,] -0.3333333  5.000000e-01
## [3,] -0.3333333 -5.000000e-01
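
Two quick checks may help to see why this works. First, `L %*% S` should be the 2 × 2 identity matrix (up to rounding, which is also why a value like `-7.5e-17` appears instead of an exact 0). Second, as far as I can tell, when every contrast row sums to zero the same coding scheme is obtained by adding an intercept row of 1/3s to `L` and taking the ordinary inverse. The names `L_full` and `S_full` are illustrative.

```r
library(MASS)   # for ginv()

L <- rbind(l1 = c(1, -0.5, -0.5),
           l2 = c(0,  1,   -1))
S <- ginv(L)

# Check 1: S is a right inverse of L
print(round(L %*% S, 10))

# Check 2: add a row for the intercept (the unweighted mean of the group means)
# and invert the resulting square matrix
L_full <- rbind(intercept = c(1/3, 1/3, 1/3), L)
S_full <- solve(L_full)
print(round(S_full, 2))   # first column is all 1s; columns 2 and 3 match S
```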

From coding matrix \(S\) to the numeric variables

Design Matrix (the new variables)
group	(Intercept)	group1	group2
A	1	0.67	0.0
B	1	-0.33	0.5
C	1	-0.33	-0.5
A	1	0.67	0.0
C	1	-0.33	-0.5
B	1	-0.33	0.5
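
To use these new variables in a model, the coding scheme can be attached to the factor with `contrasts<-`; `lm()` then builds the design matrix above by itself. The sketch below assumes the example data as a base R data frame. With this coding, the `group1` and `group2` coefficients should equal the two planned contrasts, and the intercept the unweighted mean of the three group means.

```r
library(MASS)   # for ginv()

dataset <- data.frame(
  group = factor(c("A", "B", "C", "A", "C", "B")),
  score = c(72, 88, 76, 89, 46, 17)
)

L <- rbind(l1 = c(1, -0.5, -0.5),
           l2 = c(0,  1,   -1))
S <- ginv(L)

# Attach the coding scheme: one row per level (A, B, C), one column per new variable
contrasts(dataset$group) <- S

fit_planned <- lm(score ~ group, data = dataset)
print(coef(fit_planned))
```

Here `group1` is 23.75 (A versus the average of B and C) and `group2` is -8.5 (B versus C), the same values as obtained by applying the contrast weights to the group means directly.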