3/20/2024

Motivating example

...

In the PISA 2018 study, reading skills in 15-year-olds was compared across 79 countries.

A small example data set

Small data set
ID group score
1 A 72
2 B 88
3 C 76
4 A 89
5 C 46
6 B 17

Last year we learned:

  • Categorical variables can be coded into numeric variables
  • Numeric variables are usually dummy variables (R default)
  • Output gives differences between groups and a reference group
  • Reference group is usually the first group (R default)

A small example data set

dataset %>% lm(score ~ group, data = .) %>% 
  tidy()
## # A tibble: 3 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)     80.5      22.8     3.53   0.0386
## 2 groupB         -28.0      32.2    -0.869  0.449 
## 3 groupC         -19.5      32.2    -0.605  0.588
  • which group has the highest mean?
  • which group has the lowest mean?

Coding categorical variables: design matrix

Original Data plus Design Matrix
ID group score (Intercept) groupB groupC
1 A 72 1 0 0
2 B 88 1 1 0
3 C 76 1 0 1
4 A 89 1 0 0
5 C 46 1 0 1
6 B 17 1 1 0

Coding categorical variables: design matrix

model.matrix(score ~ group, data = dataset)
##   (Intercept) groupB groupC
## 1           1      0      0
## 2           1      1      0
## 3           1      0      1
## 4           1      0      0
## 5           1      0      1
## 6           1      1      0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Coding categorical variables: design matrix

Design Matrix
(Intercept) groupB groupC
1 0 0
1 1 0
1 0 1
1 0 0
1 0 1
1 1 0

Coding categorical variables: design matrix

## # A tibble: 3 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)     80.5      22.8     3.53   0.0386
## 2 groupB         -28.0      32.2    -0.869  0.449 
## 3 groupC         -19.5      32.2    -0.605  0.588

Output is a set of contrasts:

  • groupB: the effect of this new dummy variable is the contrast between group B and group A (the reference group)
  • groupC: the contrast between group C and group A
  • (Intercept): ??

Output is a set of contrasts:

  • groupB: the effect of this new dummy variable is the contrast between group B and group A (the reference group)
  • groupC: the contrast between group C and group A
  • (Intercept): the contrast between group A and 0

Output is a set of contrasts:

  • groupB: group B - group A
  • groupC: group C - group A
  • (Intercept): group A - 0

Output is a set of contrasts:

  • groupB: \(\mu_B - \mu_A\)
  • groupC: \(\mu_C - \mu_A\)
  • (Intercept): \(\mu_A - 0 = \mu_A\)

Take home message:

How you code your dummy variables has a direct influence on the interpretation of the output.

How the categorical variable is coded influences the contrasts you see in the output.

If you want to see certain contrasts, then change the way you code your variables.

Different dummy variables lead to different contrasts in output

Orgininal Data plus Design Matrix
ID group score (Intercept) groupA groupB
1 A 72 1 1 0
2 B 88 1 0 1
3 C 76 1 0 0
4 A 89 1 1 0
5 C 46 1 0 0
6 B 17 1 0 1

Different variables, different contrasts

## # A tibble: 3 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)     61        22.8     2.68   0.0752
## 2 groupA          19.5      32.2     0.605  0.588 
## 3 groupB          -8.5      32.2    -0.264  0.809

Simple solution for everyday situations

change the reference group using relevel().

dataset %>% 
  mutate(group = relevel(group, ref = "B")) %>% 
  lm(score ~ group, data = .) %>% tidy()
## # A tibble: 3 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    52.5       22.8     2.30    0.105
## 2 groupA         28         32.2     0.869   0.449
## 3 groupC          8.50      32.2     0.264   0.809

Other solutions for more complicated situations

  • What is the difference between group A, compared to the average of the other groups?

We explain how it all works, you only have to remember how to do it in R, and understand the results in the output.

Motivating example

...

In the PISA 2018 study, reading skills in 15-year-olds was compared across 79 countries.

A contrast

Small data set
ID group score
1 A 72
2 B 88
3 C 76
4 A 89
5 C 46
6 B 17

A contrast

Contrast 1: Compare mean group A with the average of groups B and C.

A contrast

\[L1 = M_A - \frac{(M_B + M_C)}{2}\]

A contrast (2)

\[L1 = M_A - \frac{(M_B + M_C)}{2}\] \[ = M_A - \frac{M_B}{2} - \frac{M_C}{2}\]

\[ = 1\times M_A - 0.5 \times M_B - 0.5 \times M_C\]

A contrast (3)

\[L1 = M_A - \frac{(M_B + M_C)}{2}\] \[ = 1\times M_A - 0.5 \times M_B - 0.5 \times M_C\]

l1 <- c(1, -0.5, -0.5)
Group means
A B C
80.5 52.5 61.0
1.0 -0.5 -0.5

Another contrast

\[L2 = M_B - M_C\] \[ = 1\times M_B - 1 \times M_C\] \[ = 1\times M_B - 1 \times M_C + 0 \times M_A\] \[ = 0 \times M_A + 1 \times M_B -1 \times M_C\]
Group means
A B C
80.5 52.5 61
0.0 1.0 -1

Combining the two contrasts

Group means
A B C
80.5 52.5 61.0
1.0 -0.5 -0.5
0.0 1.0 -1.0

Contrasts

l1 <- c(1, -0.5, -0.5)
l2 <- c(0, 1, -1)
rbind(l1, l2) 
##    [,1] [,2] [,3]
## l1    1 -0.5 -0.5
## l2    0  1.0 -1.0

Inverting the contrast matrix

Contrast matrix L
l1 1 -0.5 -0.5
l2 0 1.0 -1.0

Inverting…..

Coding scheme S for new variables
group1 group2
A 0.67 0.0
B -0.33 0.5
C -0.33 -0.5

l1 <- c(1, -0.5, -0.5)
l2 <- c(0, 1, -1)
rbind(l1, l2) 
##    [,1] [,2] [,3]
## l1    1 -0.5 -0.5
## l2    0  1.0 -1.0
S <- rbind(l1, l2) %>% ginv()
S 
##            [,1]          [,2]
## [1,]  0.6666667 -7.514131e-17
## [2,] -0.3333333  5.000000e-01
## [3,] -0.3333333 -5.000000e-01

From coding matrix \(S\) to the numeric variables

Design Matrix (the new variables)
group (Intercept) group1 group2
A 1 0.67 0.0
B 1 -0.33 0.5
C 1 -0.33 -0.5
A 1 0.67 0.0
C 1 -0.33 -0.5
B 1 -0.33 0.5

Get the output for the new contrasts

dataset %>% 
  lm(score ~ group, contrasts = list(group = S), data = .) %>% 
  tidy(conf.int = TRUE)
## # A tibble: 3 × 7
##   term        estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)     64.7      13.2     4.92   0.0161     22.8     107. 
## 2 group1          23.8      27.9     0.851  0.457     -65.1     113. 
## 3 group2          -8.5      32.2    -0.264  0.809    -111.       94.0

Exercise exam questions

We have a dataset on three countries. Given is the following contrast matrix.

Contrasts
Botswana Denmark New Zealand
l1 0.3 0.3 0.3
l2 -1.0 1.0 0.0
l3 -0.5 -0.5 1.0

What does the second contrast stand for?

Exercise

We have a dataset on three countries. Given is the following contrast matrix.

Contrasts
Botswana Denmark New Zealand
l1 0.33 0.33 0.33
l2 -1.00 1.00 0.00
l3 -0.50 -0.50 1.00

What does the first contrast stand for?

Exercise

We have a dataset on three countries. Given is the following contrast matrix.

Contrasts
Botswana Denmark New Zealand
l1 0.33 0.33 0.33
l2 -1.00 1.00 0.00
l3 -0.50 -0.50 1.00

What does the third contrast stand for?

Exercise

We have a dataset on three countries. Given is the following contrast matrix.

Contrasts
Botswana Denmark New Zealand
l1 0.33 0.33 0.33
l2 -1.00 1.00 0.00
l3 -0.50 -0.50 1.00

What do we call a set of contrasts like this?

Exercise

We have a dataset on three countries. Given is the following contrast matrix.

Botswana Denmark New Zealand
l1 0.33 0.33 0.33
l2 -1.00 1.00 0.00
l3 -0.50 -0.50 1.00
How do we interpret the output of the model, based on these contrasts?
term estimate std.error statistic p.value
(Intercept) 46.00 16.94927 2.7139811 0.0729188
country1 -32.50 41.51706 -0.7828106 0.4908470
country2 -11.25 35.95483 -0.3128926 0.7748526

Note: the order of the groups is very important to know. You can check the ordering of the groups with levels()

dataset %>% pull(group) %>% levels()
## [1] "A" "B" "C"
dataset %>%
  mutate(group = relevel(group, ref = "C")) %>% 
  pull(group) %>% levels()
## [1] "C" "A" "B"

Your contrasts should match the ordering of your categorical factor variable.

A priori vs. Post hoc

  • Many different contrasts possible for one data set.
  • Per group variable with k levels, you can get k contrasts (including the intercept) per analysis.
  • But you can run endless models, so number of contrasts is also almost endless.
  • Science or fishing expedition?
  • Specific research question vs exploratory analysis.
  • A priori vs post hoc
  • A priori: use regular p-values and confidence intervals
  • post-hoc: adjust p-values and confidence intervals

Type I and type II errors

Four different scenarios for hypothesis tests.
Test conclusion
do not reject \(H_0\) reject \(H_0\)
\(H_0\) true OK Type I Error
\(H_A\) true Type II Error OK

Example

Data on 2 European countries and 2 Asian countries

RQ: what is the difference in the general mean in Europe compared to the general mean in Asia?

levels(dataset_3$country)
## [1] "Albania"     "India"       "Malaysia"    "Switzerland"

How to set up the contrast matrix?

General approach for contrasts

  1. State Research Question
  2. Formulate in terms of a contrast
  3. Specify the contrast into a vector/matrix L
  4. Take the inverse of L to obtain matrix S
  5. Submit matrix S to lm()
  6. Interpret the output as the contrast
  7. Answer Research Question

A priori vs. Post hoc

Data on 2 European countries and 2 Asian countries

RQ: what is the difference in the general mean in Europe compared to the general mean in Asia?

levels(dataset_3$country)
## [1] "Albania"     "India"       "Malaysia"    "Switzerland"

\[ L1 =\frac{M_{Albania} + M_{Switzerland}}{2} - \frac{M_{India} + M_{Malysia}}{2}\]

A priori vs. Post hoc

Data on 2 European countries and 2 Asian countries

Q: what is the difference in the general mean in Europe compared to the general mean in Asia?

levels(dataset_3$country)
## [1] "Albania"     "India"       "Malaysia"    "Switzerland"

\[ L1 =\frac{M_{Albania} + M_{Switzerland}}{2} - \frac{M_{India} + M_{Malysia}}{2}\]

\[0.5 M_{Albania} + 0.5M_{Switzerland} - 0.5 M_{India} - 0.5 M_{Malaysia}\]

\[0.5 M_{Albania} - 0.5 M_{India} - 0.5 M_{Malaysia} + 0.5M_{Switzerland}\]

Setting up the contrast matrix

Contrast matrix

l1 <- c(0.5, -0.5, -0.5, 0.5)
L <- rbind(l1) # has one row

Compute inverse

S <- ginv(L)
S # has one column
##      [,1]
## [1,]  0.5
## [2,] -0.5
## [3,] -0.5
## [4,]  0.5

Setting coding matrix

Four groups, that implies you need four contrasts in your model. This is guaranteed when you assign \(\mathbf{S}\) to your factor variable.

contrasts(dataset_3$country) <- S
contrasts(dataset_3$country) # has now three columns
##             [,1] [,2] [,3]
## Albania      0.5 -0.1 -0.7
## India       -0.5 -0.7  0.1
## Malaysia    -0.5  0.7 -0.1
## Switzerland  0.5  0.1  0.7

Running analysis with four contrasts

R by default adds a variable with all 1s (the intercept).

dataset_3 %>% 
  lm(score ~ country, data = .) %>% tidy()
## # A tibble: 4 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    39.8       8.33     4.77  0.00883
## 2 country1      -29.5      16.7     -1.77  0.151  
## 3 country2       -3.80     16.7     -0.228 0.831  
## 4 country3       18.4      16.7      1.10  0.331

Interpretation

## # A tibble: 4 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    39.8       8.33     4.77  0.00883
## 2 country1      -29.5      16.7     -1.77  0.151  
## 3 country2       -3.80     16.7     -0.228 0.831  
## 4 country3       18.4      16.7      1.10  0.331

Your question can be answered by the effect of country1.

The other contrasts are not relevant.

The European grand mean differs by -29.5 from the Asian grand mean.

Check

The European grand mean differs by -29.5 from the Asian grand mean.

## # A tibble: 4 × 2
##   country      mean
##   <fct>       <dbl>
## 1 Albania      12.5
## 2 India        59  
## 3 Malaysia     50  
## 4 Switzerland  37.5

When looking at these means, you wonder whether the European countries differ from each other, and whether the Asian countries differ from each other. You want to test two null-hypotheses, with an overall Type I error rate of 0.05.

Post hoc questions, so adjust Type I error rate of tests.

Post hoc

pairwise.t.test(dataset_3$score, # dependent variable
                dataset_3$country,  # independent variable
                p.adjust.method = 'none') # no adjustment method yet
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  dataset_3$score and dataset_3$country 
## 
##             Albania India Malaysia
## India       0.12    -     -       
## Malaysia    0.19    0.72  -       
## Switzerland 0.35    0.41  0.62    
## 
## P value adjustment method: none

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  dataset_3$score and dataset_3$country 
## 
##             Albania India Malaysia
## India       0.12    -     -       
## Malaysia    0.19    0.72  -       
## Switzerland 0.35    0.41  0.62    
## 
## P value adjustment method: none

Relevant \(p\)-values are for Albania vs Switzerland, and India vs. Malaysia. Two hypothesis tests, so multiply \(p\)-values by 2.

Adjusting \(p\)-values using Bonferroni

out <- pairwise.t.test(dataset_3$score, # dependent variable
                dataset_3$country,  # independent variable
                p.adjust.method = 'none') # no adjustment method yet
out$p.value * 2
##               Albania     India Malaysia
## India       0.2394130        NA       NA
## Malaysia    0.3734509 1.4438561       NA
## Switzerland 0.6970468 0.8263315 1.247731

The null-hypothesis that Albania and Switzerland have the same population mean was not rejected, \(p = 0.6970468\). The null-hypothesis that India and Malaysia have the same population mean could also not be rejected, \(p= 1\).

Course of action

  • Before collecting the data, determine what are your research questions and put these in the form of contrasts. For example, with three groups you want to determine the difference between 1 and 2, and the difference between 2 and 3.
  • Collect the data
  • Do the two contrast analyses, answer the research questions. No need to adjust anything.
  • If you have further questions and want to explore, make some new contrasts and correct the \(\alpha\) or \(p\)-value by the number of post-hoc contrasts.

Note: ANOVA is a specific type of research question about any differences between groups. If this question is not part of your research, you don’t have to perform an ANOVA.