This demo is meant to give a quick run down of the difference between dummy codes and contrast codes!

Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(psych)
library(knitr)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
source('http://psych.colorado.edu/~jclab/R/mcSummaryLm.R')

Generating some fake data

d <- data_frame(group = c(rep("A", 10), rep("B", 10)),
                score = c(rnorm(10, 40, 5), rnorm(10, 50, 5)))
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.

Dummy Codes

To see the power of contrast coding we’ll compare a model with dummy coding (the default method of coding in R for factors) to a model with contrast coding when ‘group’ only has two levels (‘A’ and ‘B’). Here’s a summary of the data:

kable(do.call(data.frame, aggregate(. ~ group, d, function(x) c(mean = mean(x), sd = sd(x)))))
group score.mean score.sd
A 38.59420 7.353140
B 50.49503 4.198186

And here’s a dummy-coded model:

# First, compute our dummy codes
d$A_B.dummy <- NA
d$A_B.dummy <- ifelse(d$group == "A", 0, 1)

# Now, run a model predicting "score" from "group", and get the model summary
m1 <- lm(score ~ A_B.dummy, data = d)
mcSummary(m1)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
## lm(formula = score ~ A_B.dummy, data = d)
## 
## Omnibus ANOVA
##                  SS df      MS EtaSq      F p
## Model       708.149  1 708.149 0.523 19.755 0
## Error       645.241 18  35.847               
## Corr Total 1353.390 19  71.231               
## 
##   RMSE AdjEtaSq
##  5.987    0.497
## 
## Coefficients
##                Est StErr      t    SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 38.594 1.893 20.384 14895.120 0.958  NA 34.616  42.572 0
## A_B.dummy   11.901 2.678  4.445   708.149 0.523  NA  6.275  17.526 0

The intercept (when x is 0) is the value of the mean of group ‘A’. As you can see, the intercept is the same value as the descriptive mean of level ‘A’. The estimate for ‘A_B’ is the difference between the mean for group ‘A’ and the mean for group ‘B’.

Plot of our dummy coded model:

ggplot(d,
       aes(x = A_B.dummy,
           y = score)) +
  geom_point() +
  geom_path(stat = 'summary', fun = 'mean', aes(group = 1)) +
  labs(x = "Group: 0 = A, 1 = B",
       y = "'Score'") +
#  geom_point(aes(y = score, x = 0.5), 
#             stat = 'summary', fun = 'mean', size = 2, col = 'red3') +
  scale_x_continuous(breaks = c(0,1))

Contrast Codes

Now here’s a contrast coded model:

# What are our levels of "group"?
factor(d$group) # or
##  [1] A A A A A A A A A A B B B B B B B B B B
## Levels: A B
table(d$group)
## 
##  A  B 
## 10 10
# Compute our contrast code
d$A_B.contrast <- NA
d$A_B.contrast <- ifelse(d$group == "A", -1, 1)

# Now, run a model predicting "score" from "group", and get the model summary
m2 <- lm(score ~ A_B.contrast, data = d)
mcSummary(m2)
## lm(formula = score ~ A_B.contrast, data = d)
## 
## Omnibus ANOVA
##                  SS df      MS EtaSq      F p
## Model       708.149  1 708.149 0.523 19.755 0
## Error       645.241 18  35.847               
## Corr Total 1353.390 19  71.231               
## 
##   RMSE AdjEtaSq
##  5.987    0.497
## 
## Coefficients
##                 Est StErr      t    SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept)  44.545 1.339 33.273 39684.449 0.984  NA 41.732  47.357 0
## A_B.contrast  5.950 1.339  4.445   708.149 0.523  NA  3.138   8.763 0

The intercept (when x is 0) is now exactly halfway between group A and group B’s mean. The estimate for ‘A_B’ is still the difference between the mean for group ‘A’ and the mean for group ‘B’. Only now, we have to multiply the estimate by 2 so that we get back into correct units (since we set our groups at -1 and 1, which are 2 whole numbers apart).

Plot of our contrast coded model:

ggplot(d,
       aes(x = A_B.contrast,
           y = score)) +
  geom_point() +
  geom_path(stat = 'summary', fun = 'mean', aes(group = 1)) +
  labs(x = "Group: -1 = A, 1 = B",
       y = "'Score'") +
  geom_point(aes(y = score, x = 0), 
             stat = 'summary', fun = 'mean', size = 2, col = 'red3') +
  scale_x_continuous(breaks = c(-1,0,1))

When our groups are of equal size, the intercept gives our outcome variable’s grand mean–the mean across the whole sample.

mean(d$score)
## [1] 44.54461

BUT! If your groups have unequal Ns, this changes a little. The intercept would then be the group-weighted mean.