This demo explores the difference between 2-level and 3-level categorical variables under dummy and contrast coding schemes.

One important point as we start: you ALWAYS need to create a full set of codes for any categorical variable you enter into your model. A full set always contains one fewer code than the number of levels in your variable (levels - 1). So if your group has 3 levels (like political party: Dem, Rep, Ind), you need 2 codes. If your group has 32 levels (like NFL teams: trust me, there are 32 of them), you need 31 codes. The codes as a set represent your variable of interest.
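
As a quick sketch of that rule (using nlevels() on the two datasets we load below), the number of codes you need is just the number of levels minus one:

# Sketch: count the codes a factor needs (levels - 1)
nlevels(PlantGrowth$group) - 1           # 3 groups        -> 2 codes
nlevels(MASS::Cars93$Manufacturer) - 1   # 32 manufacturers -> 31 codes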

Libraries

library(lmSupport)
library(ggplot2)
library(psych)
library(rgl)
library(MASS)
library(knitr)
source('http://psych.colorado.edu/~jclab/R/mcSummaryLm.R')

Evaluating datasets

str(PlantGrowth)
## 'data.frame':    30 obs. of  2 variables:
##  $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
##  $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
str(Cars93)
## 'data.frame':    93 obs. of  27 variables:
##  $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
##  $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
##  $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
##  $ Min.Price         : num  12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
##  $ Price             : num  15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
##  $ Max.Price         : num  18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
##  $ MPG.city          : int  25 18 20 19 22 22 19 16 19 16 ...
##  $ MPG.highway       : int  31 25 26 26 30 31 28 25 27 25 ...
##  $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
##  $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
##  $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
##  $ EngineSize        : num  1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
##  $ Horsepower        : int  140 200 172 172 208 110 170 180 170 200 ...
##  $ RPM               : int  6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
##  $ Rev.per.mile      : int  2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
##  $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
##  $ Fuel.tank.capacity: num  13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
##  $ Passengers        : int  5 5 5 6 4 6 6 6 5 6 ...
##  $ Length            : int  177 195 180 193 186 189 200 216 198 206 ...
##  $ Wheelbase         : int  102 115 102 106 109 105 111 116 108 114 ...
##  $ Width             : int  68 71 67 70 69 69 74 78 73 73 ...
##  $ Turn.circle       : int  37 38 37 37 39 41 42 45 41 43 ...
##  $ Rear.seat.room    : num  26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
##  $ Luggage.room      : int  11 15 14 17 13 16 17 21 14 18 ...
##  $ Weight            : int  2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
##  $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
##  $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...

The dataset “PlantGrowth” has 2 variables: group and weight.

Group: 3-level factor (ctrl; trt1; trt2)

Weight: numeric variable (yield as measured by dried weight of plants)

The dataset “Cars93” has 27 variables, but we’re only looking at the following 3.

Origin: 2-level factor (USA; non-USA)

DriveTrain: 3-level factor (4WD; Front; Rear)

Horsepower: numeric variable

Dummy Codes

Let’s look at the difference between a model that predicts an outcome with a 2-level factor, and one that predicts an outcome with a 3-level factor.

2-level factor review

First, descriptives.

# 2-level predictor
kable(do.call(data.frame, aggregate(Horsepower ~ Origin, Cars93, function(x) c(mean = mean(x), sd = sd(x)))))
Origin     Horsepower.mean   Horsepower.sd
USA               147.5208        54.45474
non-USA           139.8889        50.37145

Construct the dummy codes.

Cars93$USA.d <- ifelse(Cars93$Origin == "USA", 1, 0)

This coding scheme assigns each car a 1 if it was made in the USA, and a 0 if not.
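
If you want to double-check a code you've built (a quick sanity-check sketch), a cross-tab of the original factor against the new code does the trick:

# Sanity check: every USA car should get a 1, every non-USA car a 0
table(Cars93$Origin, Cars93$USA.d)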

Run the model.

horsepower.origin.d <- lm(Horsepower ~ USA.d, data = Cars93)
mcSummary(horsepower.origin.d)
## lm(formula = Horsepower ~ USA.d, data = Cars93)
## 
## Omnibus ANOVA
##                    SS df       MS EtaSq    F     p
## Model        1352.824  1 1352.824 0.005 0.49 0.486
## Error      251010.424 91 2758.356                 
## Corr Total 252363.247 92 2743.079                 
## 
##   RMSE AdjEtaSq
##  52.52   -0.006
## 
## Coefficients
##                 Est  StErr      t     SSR(3) EtaSq tol  CI_2.5 CI_97.5     p
## (Intercept) 139.889  7.829 17.868 880600.556 0.778  NA 124.337 155.441 0.000
## USA.d         7.632 10.898  0.700   1352.824 0.005  NA -14.015  29.279 0.486

Now the intercept describes the expected Horsepower when everything in the model is 0 (i.e., when a car is non-USA in origin), and the predictor “USA.d” tells us the mean difference in Horsepower between the intercept group (non-USA) and the group assigned a 1 (USA).

In this case, it looks like our non-USA cars have an average hpr of 139.9, and our USA cars average 7.6 more than that (147.5 hpr). But the difference isn’t significant (p = .486).
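
Here's a small sketch (not part of the output above) showing how the two coefficients rebuild those group means:

# Sketch: group means from the dummy-coded coefficients
b <- coef(horsepower.origin.d)
b["(Intercept)"]                # non-USA mean (139.9)
b["(Intercept)"] + b["USA.d"]   # USA mean (139.9 + 7.6 = 147.5)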

What do these codes “look” like? Let’s see a table of values on our variable of interest, Origin, and its assigned code.

kable(data.frame("Origin" = c("non-USA","USA"), "USA.d" = c(0,1)))
Origin     USA.d
non-USA        0
USA            1

Plot of our dummy coded model:

ggplot(Cars93,
       aes(x = USA.d,
           y = Horsepower)) +
  geom_point() +
  geom_path(stat = 'summary', fun = 'mean', aes(group = 1)) +
  labs(x = "Origin: 0 = non-USA, 1 = USA",
       y = "Horsepower") +
  scale_x_continuous(breaks = c(0,1))

3-level factor

First, descriptives.

# 3-level predictor
kable(do.call(data.frame, aggregate(Horsepower ~ DriveTrain, Cars93, function(x) c(mean = mean(x), sd = sd(x)))))
DriveTrain   Horsepower.mean   Horsepower.sd
4WD                 143.0000        63.05729
Front               132.8955        43.71399
Rear                190.1250        56.65554

Construct the dummy codes.

Cars93$Front.d <- ifelse(Cars93$DriveTrain == "Front", 1, 0)
Cars93$Rear.d <- ifelse(Cars93$DriveTrain == "Rear", 1, 0)

The first code assigns each car a 1 if it has front-wheel drive, and a 0 otherwise. The second code assigns each car a 1 if it has rear-wheel drive, and a 0 otherwise. A 4WD car gets a 0 on both codes, which makes 4WD the reference group.

Let’s look at these codes again now, before going on.

# Table of just codes
kable(data.frame("DriveTrain" = c("4WD","Front","Rear"), "Front.d" = c(0,1,0), "Rear.d" = c(0,0,1)))
DriveTrain   Front.d   Rear.d
4WD                0        0
Front              1        0
Rear               0        1
# How about in the actual dataset?
head(Cars93[,c('DriveTrain','Front.d','Rear.d')],20)
##    DriveTrain Front.d Rear.d
## 1       Front       1      0
## 2       Front       1      0
## 3       Front       1      0
## 4       Front       1      0
## 5        Rear       0      1
## 6       Front       1      0
## 7       Front       1      0
## 8        Rear       0      1
## 9       Front       1      0
## 10      Front       1      0
## 11      Front       1      0
## 12      Front       1      0
## 13      Front       1      0
## 14       Rear       0      1
## 15      Front       1      0
## 16      Front       1      0
## 17        4WD       0      0
## 18       Rear       0      1
## 19       Rear       0      1
## 20      Front       1      0

Run the model now.

horsepower.dt.d <- lm(Horsepower ~ Front.d + Rear.d, data = Cars93)
mcSummary(horsepower.dt.d)
## lm(formula = Horsepower ~ Front.d + Rear.d, data = Cars93)
## 
## Omnibus ANOVA
##                   SS df        MS EtaSq     F p
## Model       42309.23  2 21154.614 0.168 9.064 0
## Error      210054.02 90  2333.934              
## Corr Total 252363.25 92  2743.079              
## 
##    RMSE AdjEtaSq
##  48.311    0.149
## 
## Coefficients
##                 Est  StErr      t     SSR(3) EtaSq   tol  CI_2.5 CI_97.5     p
## (Intercept) 143.000 15.277  9.360 204490.000 0.493    NA 112.649 173.351 0.000
## Front.d     -10.104 16.378 -0.617    888.407 0.004 0.465 -42.642  22.433 0.539
## Rear.d       47.125 19.475  2.420  13666.250 0.061 0.465   8.435  85.815 0.018

Now the intercept describes the expected Horsepower when everything in the model is 0 (i.e., when a car has 4WD). The predictor “Front.d” tells us the mean difference in Horsepower between the intercept group (4WD) and front-wheel-drive cars, and the predictor “Rear.d” tells us the mean difference in Horsepower between the intercept group and rear-wheel-drive cars.

In this case, it looks like our 4WD cars have an average hpr of 143.0, our front-wheel-drive cars average 10.1 less than that (132.9 hpr), and our rear-wheel-drive cars average 47.1 more than that (190.1 hpr).
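
As a quick sketch, predict() on the three possible code combinations returns exactly those three group means:

# Sketch: predicted Horsepower for 4WD, Front, and Rear from the dummy-coded model
predict(horsepower.dt.d,
        newdata = data.frame(Front.d = c(0, 1, 0), Rear.d = c(0, 0, 1)))
# 143.0, 132.9, 190.1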

Only one of those differences (Rear.d, p = .018) is significant, though!
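
Incidentally, if you drop the factor itself into lm(), R builds an equivalent set of dummy (treatment) codes behind the scenes, using the first level (4WD) as the reference group. A quick check, assuming R's default contrasts settings:

# R's automatic treatment coding for DriveTrain (4WD is the reference level)
contrasts(Cars93$DriveTrain)

# These coefficients match the hand-built Front.d / Rear.d model above
coef(lm(Horsepower ~ DriveTrain, data = Cars93))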

And a graph of our results:

scatter3d(Horsepower ~ Front.d + Rear.d, data = Cars93, grid = F)

Contrast Codes

2-level factor review

Now here’s a contrast coded model:

# What are our levels of "group"?
table(Cars93$Origin)
## 
##     USA non-USA 
##      48      45
# Compute our contrast code: non-USA = -1/2, USA = +1/2
Cars93$n_US <- ifelse(Cars93$Origin == "non-USA", -1/2, 1/2)

# Now, run a model predicting Horsepower from our contrast code, and get the model summary
horsepower.origin.c <- lm(Horsepower ~ n_US, data = Cars93)
mcSummary(horsepower.origin.c)
## lm(formula = Horsepower ~ n_US, data = Cars93)
## 
## Omnibus ANOVA
##                    SS df       MS EtaSq    F     p
## Model        1352.824  1 1352.824 0.005 0.49 0.486
## Error      251010.424 91 2758.356                 
## Corr Total 252363.247 92 2743.079                 
## 
##   RMSE AdjEtaSq
##  52.52   -0.006
## 
## Coefficients
##                 Est  StErr      t      SSR(3) EtaSq tol  CI_2.5 CI_97.5     p
## (Intercept) 143.705  5.449 26.373 1918552.609 0.884  NA 132.881 154.528 0.000
## n_US          7.632 10.898  0.700    1352.824 0.005  NA -14.015  29.279 0.486

The intercept (when the code is 0) is exactly halfway between the means for non-USA and USA cars. The estimate for ‘n_US’ is again the difference between the USA and non-USA means (USA minus non-USA); we just can’t read the predicted group mean for non-USA cars directly off a coefficient in this model. Dummy codes are better for that.
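
The group means are still recoverable, though; you just have to combine the coefficients yourself (a quick sketch):

# Sketch: group means from the contrast-coded coefficients
b <- coef(horsepower.origin.c)
b["(Intercept)"] - 0.5 * b["n_US"]   # non-USA mean (139.9)
b["(Intercept)"] + 0.5 * b["n_US"]   # USA mean (147.5)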

3-level factor

Construct the contrast codes.

table(Cars93$DriveTrain)
## 
##   4WD Front  Rear 
##    10    67    16
# Code 1: Front vs. Rear
Cars93$Front_Rear <- (-1/2)*(Cars93$DriveTrain == "Front") + (0)*(Cars93$DriveTrain == "4WD") + (1/2)*(Cars93$DriveTrain == "Rear")

# Code 2: 4WD vs. (Front and Rear)
Cars93$four_FR <- (1/3)*(Cars93$DriveTrain == "Front") + (-2/3)*(Cars93$DriveTrain == "4WD") + (1/3)*(Cars93$DriveTrain == "Rear")

Let’s look at these codes again now, before going on.

# Table of just codes
kable(data.frame("DriveTrain" = c("4WD","Front","Rear"), "Front_Rear" = c(0,-1/2,1/2), "four_FR" = c(-2/3,1/3,1/3)))
DriveTrain   Front_Rear      four_FR
4WD                 0.0   -0.6666667
Front              -0.5    0.3333333
Rear                0.5    0.3333333
# How about in the actual dataset? It's actually slightly less helpful for the contrast coding scheme I think...
head(Cars93[,c('DriveTrain','Front_Rear','four_FR')],20)
##    DriveTrain Front_Rear    four_FR
## 1       Front       -0.5  0.3333333
## 2       Front       -0.5  0.3333333
## 3       Front       -0.5  0.3333333
## 4       Front       -0.5  0.3333333
## 5        Rear        0.5  0.3333333
## 6       Front       -0.5  0.3333333
## 7       Front       -0.5  0.3333333
## 8        Rear        0.5  0.3333333
## 9       Front       -0.5  0.3333333
## 10      Front       -0.5  0.3333333
## 11      Front       -0.5  0.3333333
## 12      Front       -0.5  0.3333333
## 13      Front       -0.5  0.3333333
## 14       Rear        0.5  0.3333333
## 15      Front       -0.5  0.3333333
## 16      Front       -0.5  0.3333333
## 17        4WD        0.0 -0.6666667
## 18       Rear        0.5  0.3333333
## 19       Rear        0.5  0.3333333
## 20      Front       -0.5  0.3333333

The intercept is now a bit harder to visualize. My advice: just accept that it is the (unweighted) mean of the 3 group means, and don’t try too hard to picture it in your mind.

Now run the model.

horsepower.dt.c <- lm(Horsepower ~ Front_Rear + four_FR, data = Cars93)
mcSummary(horsepower.dt.c)
## lm(formula = Horsepower ~ Front_Rear + four_FR, data = Cars93)
## 
## Omnibus ANOVA
##                   SS df        MS EtaSq     F p
## Model       42309.23  2 21154.614 0.168 9.064 0
## Error      210054.02 90  2333.934              
## Corr Total 252363.25 92  2743.079              
## 
##    RMSE AdjEtaSq
##  48.311    0.149
## 
## Coefficients
##                 Est  StErr      t      SSR(3) EtaSq   tol  CI_2.5 CI_97.5    p
## (Intercept) 155.340  6.783 22.901 1224036.469 0.854    NA 141.864 168.816 0.00
## Front_Rear   57.229 13.443  4.257   42301.548 0.168 0.939  30.523  83.936 0.00
## four_FR      18.510 16.690  1.109    2870.646 0.013 0.939 -14.648  51.669 0.27

So the intercept is the mean of our three group means. And our two predictors (Front_Rear and four_FR) are a pair of codes representing a single factor (our cars’ drive trains), with each code telling us about a specific difference among our factor levels.

Rear-wheel-drive cars average 57.2 more hpr than front-wheel-drive cars, a significant difference. But is 4WD different from front- and rear-wheel drive together? Nope, b = 18.51, p = .27.
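
Here's a sketch (not part of the output above) tying those estimates back to the group means:

# Sketch: rebuild the contrast-coded estimates from the group means
m <- tapply(Cars93$Horsepower, Cars93$DriveTrain, mean)
mean(m)                                  # intercept: mean of the 3 group means (155.34)
m["Rear"] - m["Front"]                   # Front_Rear estimate (57.23)
mean(m[c("Front", "Rear")]) - m["4WD"]   # four_FR estimate (18.51)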

And here’s a graph of our two contrast codes:

ggplot(data = Cars93) +
  geom_point(aes(y = Horsepower, x = four_FR), stat = 'summary', fun = 'mean', color = 'dodgerblue', size = 2) +
  geom_path(aes(y = Horsepower, x = four_FR, group = 1), stat = 'summary', fun = 'mean', color = 'dodgerblue') +
  geom_point(aes(y = Horsepower, x = Front_Rear), stat = 'summary', fun = 'mean', color = 'red3', size = 2) +
  geom_path(data = Cars93[Cars93$DriveTrain != "4WD",], aes(y = Horsepower, x = Front_Rear, group = 1), stat = 'summary', fun = 'mean', color = 'red3') +
  labs(title="Cars' Horsepower by Drive Train",
       x = "Drive Train
Blue = 4WD vs. Front & Rear WD;
Red = Front vs. Rear WD",
       y = "Horsepower") +
  theme_bw() +
  coord_cartesian(xlim = c(-.75, .50)) +
  theme(legend.position = "none")

And a graph of just our data:

Cars93$DriveTrain <- factor(Cars93$DriveTrain, levels = c("Front", "4WD", "Rear"))

ggplot(data = Cars93) +
  geom_point(aes(y = Horsepower, x = DriveTrain), stat = 'summary', fun = 'mean',size = 2) +
  geom_path(stat = 'summary', fun = 'mean', aes(y = Horsepower, x = DriveTrain, group = 1)) +
  labs(title="Cars' Horsepower by Drive Train",
       x = "Drive Train",
       y = "Horsepower") +
  theme_bw() +
  theme(legend.position = "none")

Contrast Codes again

Okay, why don’t we try all this again with a dataset that doesn’t involve drive trains and horsepower? Let’s use PlantGrowth!

This dataset describes the effect on plant growth of a 3-condition experiment. In one condition, trt1, plants were exposed to blacklight. In a second condition, trt2, an experimenter sang to the plants for one hour a day. The last condition was, naturally, a control.

Research Question 1: Did our two treatments, on average, affect plant growth compared with our control?

The first step is building our contrast codes. We have three levels in our group variable (ctrl, trt1, trt2), so we need 2 codes!

# verifying the names of the levels in our group variable
table(PlantGrowth$group)
## 
## ctrl trt1 trt2 
##   10   10   10
# Contrast codes
## Code 1: Control vs. Treatments 1 & 2
PlantGrowth$ctrl_tr12 <- (1/3) * (PlantGrowth$group == "trt1") + (-2/3) * (PlantGrowth$group == "ctrl") + (1/3) * (PlantGrowth$group == "trt2")

## Code 2: Treatment 1 vs. Treatment 2
PlantGrowth$tr1_tr2 <- (-1/2) * (PlantGrowth$group == "trt1") + (0) * (PlantGrowth$group == "ctrl") + (1/2) * (PlantGrowth$group == "trt2")
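
One nice property to check before we go on (a quick sketch): with equal group sizes, these two codes are uncorrelated, which is why you'll see tol = 1 in the model output below.

# Sketch: the two contrast codes are orthogonal (correlation of essentially 0)
with(PlantGrowth, cor(ctrl_tr12, tr1_tr2))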

Model comparison for our question:

Model A: \(Weight_i = \beta_0 + \beta_1 Ctrl\_Tr12_i + \beta_2 Tr1\_Tr2_i + \epsilon_i\)

Model C: \(Weight_i = \beta_0 + \beta_2 Tr1\_Tr2_i + \epsilon_i\)

Null: \(H_0: \beta_1 = 0\)

Substantive Null: There is no difference between control and our two treatments, on average.

Now run the model

plants.m1 <- lm(weight ~ ctrl_tr12 + tr1_tr2, data = PlantGrowth)
mcSummary(plants.m1)
## lm(formula = weight ~ ctrl_tr12 + tr1_tr2, data = PlantGrowth)
## 
## Omnibus ANOVA
##                SS df    MS EtaSq     F     p
## Model       3.766  2 1.883 0.264 4.846 0.016
## Error      10.492 27 0.389                  
## Corr Total 14.258 29 0.492                  
## 
##   RMSE AdjEtaSq
##  0.623     0.21
## 
## Coefficients
##               Est StErr      t  SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept) 5.073 0.114 44.573 772.060 0.987  NA  4.839   5.307 0.000
## ctrl_tr12   0.061 0.241  0.255   0.025 0.002   1 -0.434   0.557 0.801
## tr1_tr2     0.865 0.279  3.103   3.741 0.263   1  0.293   1.437 0.004

What’s our conclusion?

Research Question 2: Was there a difference between treatment 1 (blacklight) and treatment 2 (singing) on our plants’ growth?

Can we answer this question with the contrast codes we already have? Let’s take a look!

table(PlantGrowth$group, PlantGrowth$ctrl_tr12)
##       
##        -0.666666666666667 0.333333333333333
##   ctrl                 10                 0
##   trt1                  0                10
##   trt2                  0                10
table(PlantGrowth$group, PlantGrowth$tr1_tr2)
##       
##        -0.5  0 0.5
##   ctrl    0 10   0
##   trt1   10  0   0
##   trt2    0  0  10

Model comparison for our question:

Model A: \(Weight_i = \beta_0 + \beta_1 Ctrl\_Tr12_i + \beta_2 Tr1\_Tr2_i + \epsilon_i\)

Model C: \(Weight_i = \beta_0 + \beta_1 Ctrl\_Tr12_i + \epsilon_i\)

Null: \(H_0: \beta_2 = 0\)

Substantive Null: There is no difference between our two treatments.

Run the model again

mcSummary(plants.m1)
## lm(formula = weight ~ ctrl_tr12 + tr1_tr2, data = PlantGrowth)
## 
## Omnibus ANOVA
##                SS df    MS EtaSq     F     p
## Model       3.766  2 1.883 0.264 4.846 0.016
## Error      10.492 27 0.389                  
## Corr Total 14.258 29 0.492                  
## 
##   RMSE AdjEtaSq
##  0.623     0.21
## 
## Coefficients
##               Est StErr      t  SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept) 5.073 0.114 44.573 772.060 0.987  NA  4.839   5.307 0.000
## ctrl_tr12   0.061 0.241  0.255   0.025 0.002   1 -0.434   0.557 0.801
## tr1_tr2     0.865 0.279  3.103   3.741 0.263   1  0.293   1.437 0.004

What’s our conclusion?

Research Question 3: Does condition matter at all?

This last question is what a traditional one-way ANOVA addresses. Forget about the specific levels of our condition (though why would you…?); we just want to know whether plant growth differs by condition at all.

That’s a 2-df question! Our two contrast codes together make up a single predictor, representing condition. Thus, we would make the following model comparison:

Model A: \(Weight_i = \beta_0 + \beta_1 Ctrl\_Tr12_i + \beta_2 Tr1\_Tr2_i + \epsilon_i\)

Model C: \(Weight_i = \beta_0 + \epsilon_i\)

Null: \(H_0: \beta_1 = \beta_2 = 0\)

Substantive Null: There is no difference in plant growth between our three conditions.

Run the model one last time

mcSummary(plants.m1)
## lm(formula = weight ~ ctrl_tr12 + tr1_tr2, data = PlantGrowth)
## 
## Omnibus ANOVA
##                SS df    MS EtaSq     F     p
## Model       3.766  2 1.883 0.264 4.846 0.016
## Error      10.492 27 0.389                  
## Corr Total 14.258 29 0.492                  
## 
##   RMSE AdjEtaSq
##  0.623     0.21
## 
## Coefficients
##               Est StErr      t  SSR(3) EtaSq tol CI_2.5 CI_97.5     p
## (Intercept) 5.073 0.114 44.573 772.060 0.987  NA  4.839   5.307 0.000
## ctrl_tr12   0.061 0.241  0.255   0.025 0.002   1 -0.434   0.557 0.801
## tr1_tr2     0.865 0.279  3.103   3.741 0.263   1  0.293   1.437 0.004
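
If you'd like to see this 2-df comparison spelled out as Model A vs. Model C, anova() on the two fitted models reproduces the omnibus F above (a sketch; plants.c here is a new intercept-only model):

# Sketch: explicit Model C (intercept only) vs. Model A comparison
plants.c <- lm(weight ~ 1, data = PlantGrowth)
anova(plants.c, plants.m1)   # same F (4.846) and p (.016) as the omnibus ANOVA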

What’s our conclusion?

A final note

This approach is great because it allows you to do an ANOVA and your “follow-up comparisons” simultaneously. You no longer need to ask questions in this order: “Was there any effect of condition (control, treatment1, treatment2)? Was tr1 different from tr2? Was the control different from tr1 and tr2 together? etc.” You can think of the most important comparisons right at the start of your analyses, without having to do post-hoc tests.

Of course, not every possible comparison is represented in your codes. What if you want to know whether tr2 is different from the control, after seeing that it is significantly better than tr1? That would require a new set of codes. You will get lots of practice on this week’s HW constructing codes that work for your research questions.