This demo is meant to give a quick run down of the difference between dummy codes and contrast codes!
Libraries
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(psych)
library(knitr)
library(ggplot2)##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
source('http://psych.colorado.edu/~jclab/R/mcSummaryLm.R')Generating some fake data
d <- data_frame(group = c(rep("A", 10), rep("B", 10)),
score = c(rnorm(10, 40, 5), rnorm(10, 50, 5)))## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
Dummy Codes
To see the power of contrast coding we’ll compare a model with dummy coding (the default method of coding in R for factors) to a model with contrast coding when ‘group’ only has two levels (‘A’ and ‘B’). Here’s a summary of the data:
kable(do.call(data.frame, aggregate(. ~ group, d, function(x) c(mean = mean(x), sd = sd(x)))))| group | score.mean | score.sd |
|---|---|---|
| A | 38.59420 | 7.353140 |
| B | 50.49503 | 4.198186 |
And here’s a dummy-coded model:
# First, compute our dummy codes
d$A_B.dummy <- NA
d$A_B.dummy <- ifelse(d$group == "A", 0, 1)
# Now, run a model predicting "score" from "group", and get the model summary
m1 <- lm(score ~ A_B.dummy, data = d)
mcSummary(m1)## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## lm(formula = score ~ A_B.dummy, data = d)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 708.149 1 708.149 0.523 19.755 0
## Error 645.241 18 35.847
## Corr Total 1353.390 19 71.231
##
## RMSE AdjEtaSq
## 5.987 0.497
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 38.594 1.893 20.384 14895.120 0.958 NA 34.616 42.572 0
## A_B.dummy 11.901 2.678 4.445 708.149 0.523 NA 6.275 17.526 0
The intercept (when x is 0) is the value of the mean of group ‘A’. As you can see, the intercept is the same value as the descriptive mean of level ‘A’. The estimate for ‘A_B’ is the difference between the mean for group ‘A’ and the mean for group ‘B’.
Plot of our dummy coded model:
ggplot(d,
aes(x = A_B.dummy,
y = score)) +
geom_point() +
geom_path(stat = 'summary', fun = 'mean', aes(group = 1)) +
labs(x = "Group: 0 = A, 1 = B",
y = "'Score'") +
# geom_point(aes(y = score, x = 0.5),
# stat = 'summary', fun = 'mean', size = 2, col = 'red3') +
scale_x_continuous(breaks = c(0,1))Contrast Codes
Now here’s a contrast coded model:
# What are our levels of "group"?
factor(d$group) # or## [1] A A A A A A A A A A B B B B B B B B B B
## Levels: A B
table(d$group)##
## A B
## 10 10
# Compute our contrast code
d$A_B.contrast <- NA
d$A_B.contrast <- ifelse(d$group == "A", -1, 1)
# Now, run a model predicting "score" from "group", and get the model summary
m2 <- lm(score ~ A_B.contrast, data = d)
mcSummary(m2)## lm(formula = score ~ A_B.contrast, data = d)
##
## Omnibus ANOVA
## SS df MS EtaSq F p
## Model 708.149 1 708.149 0.523 19.755 0
## Error 645.241 18 35.847
## Corr Total 1353.390 19 71.231
##
## RMSE AdjEtaSq
## 5.987 0.497
##
## Coefficients
## Est StErr t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 44.545 1.339 33.273 39684.449 0.984 NA 41.732 47.357 0
## A_B.contrast 5.950 1.339 4.445 708.149 0.523 NA 3.138 8.763 0
The intercept (when x is 0) is now exactly halfway between group A and group B’s mean. The estimate for ‘A_B’ is still the difference between the mean for group ‘A’ and the mean for group ‘B’. Only now, we have to multiply the estimate by 2 so that we get back into correct units (since we set our groups at -1 and 1, which are 2 whole numbers apart).
Plot of our contrast coded model:
ggplot(d,
aes(x = A_B.contrast,
y = score)) +
geom_point() +
geom_path(stat = 'summary', fun = 'mean', aes(group = 1)) +
labs(x = "Group: -1 = A, 1 = B",
y = "'Score'") +
geom_point(aes(y = score, x = 0),
stat = 'summary', fun = 'mean', size = 2, col = 'red3') +
scale_x_continuous(breaks = c(-1,0,1))When our groups are of equal size, the intercept gives our outcome variable’s grand mean–the mean across the whole sample.
mean(d$score)## [1] 44.54461
BUT! If your groups have unequal Ns, this changes a little. The intercept would then be the group-weighted mean.