…
In the PISA 2018 study, reading skills in 15-year-olds was compared across 79 countries.
3/20/2024
…
In the PISA 2018 study, reading skills in 15-year-olds was compared across 79 countries.
| ID | group | score |
|---|---|---|
| 1 | A | 72 |
| 2 | B | 88 |
| 3 | C | 76 |
| 4 | A | 89 |
| 5 | C | 46 |
| 6 | B | 17 |
dataset %>% lm(score ~ group, data = .) %>% tidy()
## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 80.5 22.8 3.53 0.0386 ## 2 groupB -28.0 32.2 -0.869 0.449 ## 3 groupC -19.5 32.2 -0.605 0.588
| ID | group | score | (Intercept) | groupB | groupC |
|---|---|---|---|---|---|
| 1 | A | 72 | 1 | 0 | 0 |
| 2 | B | 88 | 1 | 1 | 0 |
| 3 | C | 76 | 1 | 0 | 1 |
| 4 | A | 89 | 1 | 0 | 0 |
| 5 | C | 46 | 1 | 0 | 1 |
| 6 | B | 17 | 1 | 1 | 0 |
model.matrix(score ~ group, data = dataset)
## (Intercept) groupB groupC ## 1 1 0 0 ## 2 1 1 0 ## 3 1 0 1 ## 4 1 0 0 ## 5 1 0 1 ## 6 1 1 0 ## attr(,"assign") ## [1] 0 1 1 ## attr(,"contrasts") ## attr(,"contrasts")$group ## [1] "contr.treatment"
| (Intercept) | groupB | groupC |
|---|---|---|
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 80.5 22.8 3.53 0.0386 ## 2 groupB -28.0 32.2 -0.869 0.449 ## 3 groupC -19.5 32.2 -0.605 0.588
How you code your dummy variables has a direct influence on the interpretation of the output.
How the categorical variable is coded influences the contrasts you see in the output.
If you want to see certain contrasts, then change the way you code your variables.
| ID | group | score | (Intercept) | groupA | groupB |
|---|---|---|---|---|---|
| 1 | A | 72 | 1 | 1 | 0 |
| 2 | B | 88 | 1 | 0 | 1 |
| 3 | C | 76 | 1 | 0 | 0 |
| 4 | A | 89 | 1 | 1 | 0 |
| 5 | C | 46 | 1 | 0 | 0 |
| 6 | B | 17 | 1 | 0 | 1 |
## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 61 22.8 2.68 0.0752 ## 2 groupA 19.5 32.2 0.605 0.588 ## 3 groupB -8.5 32.2 -0.264 0.809
change the reference group using relevel().
dataset %>% mutate(group = relevel(group, ref = "B")) %>% lm(score ~ group, data = .) %>% tidy()
## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 52.5 22.8 2.30 0.105 ## 2 groupA 28 32.2 0.869 0.449 ## 3 groupC 8.50 32.2 0.264 0.809
We explain how it all works, you only have to remember how to do it in R, and understand the results in the output.
…
In the PISA 2018 study, reading skills in 15-year-olds was compared across 79 countries.
| ID | group | score |
|---|---|---|
| 1 | A | 72 |
| 2 | B | 88 |
| 3 | C | 76 |
| 4 | A | 89 |
| 5 | C | 46 |
| 6 | B | 17 |
Contrast 1: Compare mean group A with the average of groups B and C.
\[L1 = M_A - \frac{(M_B + M_C)}{2}\]
\[L1 = M_A - \frac{(M_B + M_C)}{2}\] \[ = M_A - \frac{M_B}{2} - \frac{M_C}{2}\]
\[ = 1\times M_A - 0.5 \times M_B - 0.5 \times M_C\]
\[L1 = M_A - \frac{(M_B + M_C)}{2}\] \[ = 1\times M_A - 0.5 \times M_B - 0.5 \times M_C\]
l1 <- c(1, -0.5, -0.5)
| A | B | C |
|---|---|---|
| 80.5 | 52.5 | 61.0 |
| 1.0 | -0.5 | -0.5 |
| A | B | C |
|---|---|---|
| 80.5 | 52.5 | 61 |
| 0.0 | 1.0 | -1 |
| A | B | C |
|---|---|---|
| 80.5 | 52.5 | 61.0 |
| 1.0 | -0.5 | -0.5 |
| 0.0 | 1.0 | -1.0 |
l1 <- c(1, -0.5, -0.5) l2 <- c(0, 1, -1) rbind(l1, l2)
## [,1] [,2] [,3] ## l1 1 -0.5 -0.5 ## l2 0 1.0 -1.0
| l1 | 1 | -0.5 | -0.5 |
| l2 | 0 | 1.0 | -1.0 |
Inverting…..
| group1 | group2 | |
|---|---|---|
| A | 0.67 | 0.0 |
| B | -0.33 | 0.5 |
| C | -0.33 | -0.5 |
l1 <- c(1, -0.5, -0.5) l2 <- c(0, 1, -1) rbind(l1, l2)
## [,1] [,2] [,3] ## l1 1 -0.5 -0.5 ## l2 0 1.0 -1.0
S <- rbind(l1, l2) %>% ginv() S
## [,1] [,2] ## [1,] 0.6666667 -7.514131e-17 ## [2,] -0.3333333 5.000000e-01 ## [3,] -0.3333333 -5.000000e-01
| group | (Intercept) | group1 | group2 |
|---|---|---|---|
| A | 1 | 0.67 | 0.0 |
| B | 1 | -0.33 | 0.5 |
| C | 1 | -0.33 | -0.5 |
| A | 1 | 0.67 | 0.0 |
| C | 1 | -0.33 | -0.5 |
| B | 1 | -0.33 | 0.5 |
dataset %>% lm(score ~ group, contrasts = list(group = S), data = .) %>% tidy(conf.int = TRUE)
## # A tibble: 3 × 7 ## term estimate std.error statistic p.value conf.low conf.high ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 64.7 13.2 4.92 0.0161 22.8 107. ## 2 group1 23.8 27.9 0.851 0.457 -65.1 113. ## 3 group2 -8.5 32.2 -0.264 0.809 -111. 94.0
We have a dataset on three countries. Given is the following contrast matrix.
| Botswana | Denmark | New Zealand | |
|---|---|---|---|
| l1 | 0.3 | 0.3 | 0.3 |
| l2 | -1.0 | 1.0 | 0.0 |
| l3 | -0.5 | -0.5 | 1.0 |
What does the second contrast stand for?
We have a dataset on three countries. Given is the following contrast matrix.
| Botswana | Denmark | New Zealand | |
|---|---|---|---|
| l1 | 0.33 | 0.33 | 0.33 |
| l2 | -1.00 | 1.00 | 0.00 |
| l3 | -0.50 | -0.50 | 1.00 |
What does the first contrast stand for?
We have a dataset on three countries. Given is the following contrast matrix.
| Botswana | Denmark | New Zealand | |
|---|---|---|---|
| l1 | 0.33 | 0.33 | 0.33 |
| l2 | -1.00 | 1.00 | 0.00 |
| l3 | -0.50 | -0.50 | 1.00 |
What does the third contrast stand for?
We have a dataset on three countries. Given is the following contrast matrix.
| Botswana | Denmark | New Zealand | |
|---|---|---|---|
| l1 | 0.33 | 0.33 | 0.33 |
| l2 | -1.00 | 1.00 | 0.00 |
| l3 | -0.50 | -0.50 | 1.00 |
What do we call a set of contrasts like this?
We have a dataset on three countries. Given is the following contrast matrix.
| Botswana | Denmark | New Zealand | |
|---|---|---|---|
| l1 | 0.33 | 0.33 | 0.33 |
| l2 | -1.00 | 1.00 | 0.00 |
| l3 | -0.50 | -0.50 | 1.00 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 46.00 | 16.94927 | 2.7139811 | 0.0729188 |
| country1 | -32.50 | 41.51706 | -0.7828106 | 0.4908470 |
| country2 | -11.25 | 35.95483 | -0.3128926 | 0.7748526 |
Note: the order of the groups is very important to know. You can check the ordering of the groups with levels()
dataset %>% pull(group) %>% levels()
## [1] "A" "B" "C"
dataset %>% mutate(group = relevel(group, ref = "C")) %>% pull(group) %>% levels()
## [1] "C" "A" "B"
Your contrasts should match the ordering of your categorical factor variable.
| Test conclusion | |||
| do not reject \(H_0\) | reject \(H_0\) | ||
| \(H_0\) true | OK | Type I Error | |
| \(H_A\) true | Type II Error | OK |
Data on 2 European countries and 2 Asian countries
RQ: what is the difference in the general mean in Europe compared to the general mean in Asia?
levels(dataset_3$country)
## [1] "Albania" "India" "Malaysia" "Switzerland"
How to set up the contrast matrix?
lm()Data on 2 European countries and 2 Asian countries
RQ: what is the difference in the general mean in Europe compared to the general mean in Asia?
levels(dataset_3$country)
## [1] "Albania" "India" "Malaysia" "Switzerland"
\[ L1 =\frac{M_{Albania} + M_{Switzerland}}{2} - \frac{M_{India} + M_{Malysia}}{2}\]
Data on 2 European countries and 2 Asian countries
Q: what is the difference in the general mean in Europe compared to the general mean in Asia?
levels(dataset_3$country)
## [1] "Albania" "India" "Malaysia" "Switzerland"
\[ L1 =\frac{M_{Albania} + M_{Switzerland}}{2} - \frac{M_{India} + M_{Malysia}}{2}\]
\[0.5 M_{Albania} + 0.5M_{Switzerland} - 0.5 M_{India} - 0.5 M_{Malaysia}\]
\[0.5 M_{Albania} - 0.5 M_{India} - 0.5 M_{Malaysia} + 0.5M_{Switzerland}\]
Contrast matrix
l1 <- c(0.5, -0.5, -0.5, 0.5) L <- rbind(l1) # has one row
Compute inverse
S <- ginv(L) S # has one column
## [,1] ## [1,] 0.5 ## [2,] -0.5 ## [3,] -0.5 ## [4,] 0.5
Four groups, that implies you need four contrasts in your model. This is guaranteed when you assign \(\mathbf{S}\) to your factor variable.
contrasts(dataset_3$country) <- S contrasts(dataset_3$country) # has now three columns
## [,1] [,2] [,3] ## Albania 0.5 -0.1 -0.7 ## India -0.5 -0.7 0.1 ## Malaysia -0.5 0.7 -0.1 ## Switzerland 0.5 0.1 0.7
R by default adds a variable with all 1s (the intercept).
dataset_3 %>% lm(score ~ country, data = .) %>% tidy()
## # A tibble: 4 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 39.8 8.33 4.77 0.00883 ## 2 country1 -29.5 16.7 -1.77 0.151 ## 3 country2 -3.80 16.7 -0.228 0.831 ## 4 country3 18.4 16.7 1.10 0.331
## # A tibble: 4 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 39.8 8.33 4.77 0.00883 ## 2 country1 -29.5 16.7 -1.77 0.151 ## 3 country2 -3.80 16.7 -0.228 0.831 ## 4 country3 18.4 16.7 1.10 0.331
Your question can be answered by the effect of country1.
The other contrasts are not relevant.
The European grand mean differs by -29.5 from the Asian grand mean.
The European grand mean differs by -29.5 from the Asian grand mean.
## # A tibble: 4 × 2 ## country mean ## <fct> <dbl> ## 1 Albania 12.5 ## 2 India 59 ## 3 Malaysia 50 ## 4 Switzerland 37.5
When looking at these means, you wonder whether the European countries differ from each other, and whether the Asian countries differ from each other. You want to test two null-hypotheses, with an overall Type I error rate of 0.05.
Post hoc questions, so adjust Type I error rate of tests.
pairwise.t.test(dataset_3$score, # dependent variable
dataset_3$country, # independent variable
p.adjust.method = 'none') # no adjustment method yet
## ## Pairwise comparisons using t tests with pooled SD ## ## data: dataset_3$score and dataset_3$country ## ## Albania India Malaysia ## India 0.12 - - ## Malaysia 0.19 0.72 - ## Switzerland 0.35 0.41 0.62 ## ## P value adjustment method: none
## ## Pairwise comparisons using t tests with pooled SD ## ## data: dataset_3$score and dataset_3$country ## ## Albania India Malaysia ## India 0.12 - - ## Malaysia 0.19 0.72 - ## Switzerland 0.35 0.41 0.62 ## ## P value adjustment method: none
Relevant \(p\)-values are for Albania vs Switzerland, and India vs. Malaysia. Two hypothesis tests, so multiply \(p\)-values by 2.
out <- pairwise.t.test(dataset_3$score, # dependent variable
dataset_3$country, # independent variable
p.adjust.method = 'none') # no adjustment method yet
out$p.value * 2
## Albania India Malaysia ## India 0.2394130 NA NA ## Malaysia 0.3734509 1.4438561 NA ## Switzerland 0.6970468 0.8263315 1.247731
The null-hypothesis that Albania and Switzerland have the same population mean was not rejected, \(p = 0.6970468\). The null-hypothesis that India and Malaysia have the same population mean could also not be rejected, \(p= 1\).
Note: ANOVA is a specific type of research question about any differences between groups. If this question is not part of your research, you don’t have to perform an ANOVA.