I. HELLO

Welcome to the script cheatsheet for Politics & Psychology. This is meant to be a repository of the code and models we’ve reviewed or learned about this semester.

1. How to read this document

Below, in Section I.2, you’ll see a list of research questions. These are divided into type (categorical or continuous) and number of (2 or 3+) predictor variables. Refer to this list to determine which Model you should reference.

Also: This document (.Rmd) is written in an R Markdown file. R Markdown is designed to produce documents that include both code and text–instead of requiring #’s to create a comment, you can type text directly into the file. The trade-off is that any code you want to run must be included in “r chunks”, like this:

3 + 4
## [1] 7

So, only code included in an R chunk will run. You can insert R chunks by clicking the green “+[c]” button above and selecting R. Or, you can hit Cmd + Option + I (Ctrl + Alt + I on Windows).

2. Research question types

Categorical Predictors

  • One predictor (2 levels): Section III.1.a

  • One predictor (3+ levels): Section III.1.b

  • Two or more predictors, no interaction: Section III.1.c

  • Two or more predictors, interaction: Section III.1.d

Continuous Predictors

  • One predictor: Section III.2.a

  • Two+ predictors, no interaction: Section III.2.b

  • Two+ predictors, at least one interaction: Section III.2.c

Mixed Predictors

  • Two+ predictors, no interaction: Section III.3.a

  • Two+ predictors, categorical predictors interact: Section III.3.b

  • Two+ predictors, continuous predictors interact: Section III.3.c

  • Two+ predictors, mixed interaction: Section III.3.d

II. Libraries and data

1. Libraries

library(psych)
library(tidyverse)

2. Data

This cheatsheet is going to be self-contained (i.e., it won’t require you to import a separate dataset), so you need to run this code to create the data you’ll use for the rest of the demo. (If this is confusing to you, don’t worry about it. Just run the chunk of code below and move onto the next part.)

III. Models

1. Categorical Predictors

a. Single categorical predictor, 2-levels (t-test)

This is a standard t-test! Here are two different ways to run a t-test.

i. Basic approach

t.test(data$y ~ data$cat1, var.equal = T)
## 
##  Two Sample t-test
## 
## data:  data$y by data$cat1
## t = -2.0888, df = 198, p-value = 0.038
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
##  -0.39460283 -0.01134886
## sample estimates:
## mean in group A mean in group B 
##       0.3649130       0.5678889

ii. General Linear Model approach

## Codes
# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2

# Dummy Codes
data$A.d <- NA
data$A.d[data$cat1 == "A"] <- 0
data$A.d[data$cat1 == "B"] <- 1

data$B.d <- NA
data$B.d[data$cat1 == "A"] <- 1
data$B.d[data$cat1 == "B"] <- 0


## Models

# with contrast codes
m1.c <- lm(y ~ A_B, data = data)
summary(m1.c)
## 
## Call:
## lm(formula = y ~ A_B, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.08289 -0.39741 -0.06389  0.38336  2.59211 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.46640    0.04859   9.599   <2e-16 ***
## A_B          0.20298    0.09717   2.089    0.038 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6849 on 198 degrees of freedom
## Multiple R-squared:  0.02156,    Adjusted R-squared:  0.01662 
## F-statistic: 4.363 on 1 and 198 DF,  p-value: 0.038
# with dummy codes: A is centered
m1.d1 <- lm(y ~ A.d, data = data)
summary(m1.d1)
## 
## Call:
## lm(formula = y ~ A.d, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.08289 -0.39741 -0.06389  0.38336  2.59211 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.36491    0.07141   5.110 7.55e-07 ***
## A.d          0.20298    0.09717   2.089    0.038 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6849 on 198 degrees of freedom
## Multiple R-squared:  0.02156,    Adjusted R-squared:  0.01662 
## F-statistic: 4.363 on 1 and 198 DF,  p-value: 0.038
# with dummy codes: B is centered
m1.d2 <- lm(y ~ B.d, data = data)
summary(m1.d2)
## 
## Call:
## lm(formula = y ~ B.d, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.08289 -0.39741 -0.06389  0.38336  2.59211 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.56789    0.06591   8.617 2.18e-15 ***
## B.d         -0.20298    0.09717  -2.089    0.038 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6849 on 198 degrees of freedom
## Multiple R-squared:  0.02156,    Adjusted R-squared:  0.01662 
## F-statistic: 4.363 on 1 and 198 DF,  p-value: 0.038

b. Single categorical predictor, 3-levels+ (One-way ANOVA)

This is a one-way ANOVA! Here are two different ways to run a one-way ANOVA.

i. Basic approach

m2 <- aov(data$y ~ data$cat2)
summary(m2)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## data$cat2     2  25.58  12.792   36.34 3.69e-14 ***
## Residuals   197  69.35   0.352                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ii. General Linear Model

Remember, you’ll always need k-1 codes in your model to fully represent a categorical variable (k being the number of levels in a categorical variable).

## Codes
# Contrast Codes
data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2 

data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3

# Dummy Codes: C is centered. 
data$C_D.d <- NA
data$C_D.d[data$cat2 == "C"] <- 0
data$C_D.d[data$cat2 == "D"] <- 1
data$C_D.d[data$cat2 == "E"] <- 0

data$C_E.d <- NA
data$C_E.d[data$cat2 == "C"] <- 0
data$C_E.d[data$cat2 == "D"] <- 0
data$C_E.d[data$cat2 == "E"] <- 1


## Models

# with contrast codes
m2.c <- lm(y ~ C_E + D_CE, data = data)
summary(m2.c)
## 
## Call:
## lm(formula = y ~ C_E + D_CE, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.07041 -0.36612 -0.03174  0.28220  2.28038 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.49130    0.04217  11.651  < 2e-16 ***
## C_E          0.84075    0.09967   8.436 6.99e-15 ***
## D_CE        -0.09617    0.09248  -1.040      0.3    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5933 on 197 degrees of freedom
## Multiple R-squared:  0.2695, Adjusted R-squared:  0.2621 
## F-statistic: 36.34 on 2 and 197 DF,  p-value: 3.688e-14
# with dummy codes: C is centered
m2.d <- lm(y ~ C_D.d + C_E.d, data = data)
summary(m2.d)
## 
## Call:
## lm(formula = y ~ C_D.d + C_E.d, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.07041 -0.36612 -0.03174  0.28220  2.28038 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03886    0.06897   0.563    0.574    
## C_D.d        0.51655    0.10405   4.964 1.49e-06 ***
## C_E.d        0.84075    0.09967   8.436 6.99e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5933 on 197 degrees of freedom
## Multiple R-squared:  0.2695, Adjusted R-squared:  0.2621 
## F-statistic: 36.34 on 2 and 197 DF,  p-value: 3.688e-14

c. Multiple categorical predictors: No interaction (ANOVA)

i. Basic approach

m3 <- aov(data$y ~ data$cat1 + data$cat2)
summary(m3)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## data$cat1     1   2.05   2.047   5.846   0.0165 *  
## data$cat2     2  24.26  12.129  34.642 1.31e-13 ***
## Residuals   196  68.63   0.350                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ii. General Linear Model

R

## Codes
# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2

# Dummy Codes
data$A.d <- NA # A is centered
data$A.d[data$cat1 == "A"] <- 0
data$A.d[data$cat1 == "B"] <- 1

# Cat 2 Contrast Codes
data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2 

data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3

# Cat 2 Dummy Codes: C is centered. 
data$C_D.d <- NA
data$C_D.d[data$cat2 == "C"] <- 0
data$C_D.d[data$cat2 == "D"] <- 1
data$C_D.d[data$cat2 == "E"] <- 0

data$C_E.d <- NA
data$C_E.d[data$cat2 == "C"] <- 0
data$C_E.d[data$cat2 == "D"] <- 0
data$C_E.d[data$cat2 == "E"] <- 1


## Models

# with contrast codes
m3.c <- lm(y ~ A_B + (C_E + D_CE), data = data)
summary(m3.c)
## 
## Call:
## lm(formula = y ~ A_B + (C_E + D_CE), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.12068 -0.32333 -0.02642  0.25787  2.23037 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.48588    0.04222  11.507  < 2e-16 ***
## A_B          0.12147    0.08460   1.436    0.153    
## C_E          0.82511    0.10000   8.251 2.26e-14 ***
## D_CE        -0.08860    0.09238  -0.959    0.339    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5917 on 196 degrees of freedom
## Multiple R-squared:  0.2771, Adjusted R-squared:  0.266 
## F-statistic: 25.04 on 3 and 196 DF,  p-value: 9.254e-14
# with dummy codes: C is centered
m3.d <- lm(y ~ A.d + (C_D.d + C_E.d), data = data)
summary(m3.d)
## 
## Call:
## lm(formula = y ~ A.d + (C_D.d + C_E.d), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.12068 -0.32333 -0.02642  0.25787  2.23037 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01694    0.07901  -0.214    0.830    
## A.d          0.12147    0.08460   1.436    0.153    
## C_D.d        0.50115    0.10432   4.804 3.09e-06 ***
## C_E.d        0.82511    0.10000   8.251 2.26e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5917 on 196 degrees of freedom
## Multiple R-squared:  0.2771, Adjusted R-squared:  0.266 
## F-statistic: 25.04 on 3 and 196 DF,  p-value: 9.254e-14

d. Multiple categorical predictors (Factorial ANOVA)

i. Basic approach

m4 <- aov(y ~ cat1 + cat2 + cat1:cat2, data = data)
summary(m4)
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## cat1          1   2.05   2.047   6.059  0.0147 *  
## cat2          2  24.26  12.129  35.909 5.4e-14 ***
## cat1:cat2     2   3.10   1.548   4.582  0.0114 *  
## Residuals   194  65.53   0.338                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ii. General Linear Model

R

## Codes
# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2

# Dummy Codes
data$A.d <- NA # A is centered
data$A.d[data$cat1 == "A"] <- 0
data$A.d[data$cat1 == "B"] <- 1

# Cat 2 Contrast Codes
data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2 

data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3

# Cat 2 Dummy Codes: C is centered. 
data$C_D.d <- NA
data$C_D.d[data$cat2 == "C"] <- 0
data$C_D.d[data$cat2 == "D"] <- 1
data$C_D.d[data$cat2 == "E"] <- 0

data$C_E.d <- NA
data$C_E.d[data$cat2 == "C"] <- 0
data$C_E.d[data$cat2 == "D"] <- 0
data$C_E.d[data$cat2 == "E"] <- 1


## Models

# with contrast codes
m4.c <- lm(y ~ A_B*(C_E + D_CE), data = data)
summary(m4.c)
## 
## Call:
## lm(formula = y ~ A_B * (C_E + D_CE), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.95903 -0.35559 -0.04732  0.26709  2.14435 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.49247    0.04180  11.782  < 2e-16 ***
## A_B          0.09816    0.08360   1.174  0.24175    
## C_E          0.80215    0.09860   8.135 4.82e-14 ***
## D_CE        -0.12922    0.09183  -1.407  0.16097    
## A_B:C_E      0.09707    0.19720   0.492  0.62311    
## A_B:D_CE     0.55101    0.18366   3.000  0.00305 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5812 on 194 degrees of freedom
## Multiple R-squared:  0.3097, Adjusted R-squared:  0.2919 
## F-statistic: 17.41 on 5 and 194 DF,  p-value: 3.178e-14
# with dummy codes: C is centered
m4.d <- lm(y ~ A.d*(C_D.d + C_E.d), data = data)
summary(m4.d)
## 
## Call:
## lm(formula = y ~ A.d * (C_D.d + C_E.d), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.95903 -0.35559 -0.04732  0.26709  2.14435 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.06832    0.09189  -0.744   0.4581    
## A.d          0.23330    0.13557   1.721   0.0869 .  
## C_D.d        0.78153    0.15006   5.208 4.85e-07 ***
## C_E.d        0.75361    0.14321   5.262 3.75e-07 ***
## A.d:C_D.d   -0.50247    0.20588  -2.441   0.0156 *  
## A.d:C_E.d    0.09707    0.19720   0.492   0.6231    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5812 on 194 degrees of freedom
## Multiple R-squared:  0.3097, Adjusted R-squared:  0.2919 
## F-statistic: 17.41 on 5 and 194 DF,  p-value: 3.178e-14

2. Continuous Predictors

a. Single continuous predictor (Simple regression)

There is now only one approach to evaluate these models: The General Linear Model approach.

# with contrast codes
m5 <- lm(y ~ cont1, data = data)
summary(m5)
## 
## Call:
## lm(formula = y ~ cont1, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.46274 -0.34876 -0.03768  0.28869  1.69326 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.45018    0.03319   13.56   <2e-16 ***
## cont1       -0.53251    0.03483  -15.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4689 on 198 degrees of freedom
## Multiple R-squared:  0.5414, Adjusted R-squared:  0.5391 
## F-statistic: 233.8 on 1 and 198 DF,  p-value: < 2.2e-16

b. Multiple continuous predictors: No interaction (Multiple regression)

m6 <- lm(y ~ cont1 + cont2, data = data)
summary(m6)
## 
## Call:
## lm(formula = y ~ cont1 + cont2, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18305 -0.32673 -0.00228  0.30554  1.72291 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.44688    0.03209  13.924  < 2e-16 ***
## cont1       -0.53063    0.03367 -15.762  < 2e-16 ***
## cont2       -0.24575    0.06355  -3.867  0.00015 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4532 on 197 degrees of freedom
## Multiple R-squared:  0.5738, Adjusted R-squared:  0.5695 
## F-statistic: 132.6 on 2 and 197 DF,  p-value: < 2.2e-16

c. Multiple continuous predictors (Multiple regression)

# With one interaction
m7.a <- lm(y ~ cont1*cont2 + cont3, data = data)
summary(m7.a)
## 
## Call:
## lm(formula = y ~ cont1 * cont2 + cont3, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58866 -0.26500  0.04565  0.19269  0.64683 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.45481    0.02509  18.124  < 2e-16 ***
## cont1       -0.49606    0.02647 -18.741  < 2e-16 ***
## cont2       -0.21774    0.04957  -4.392 1.84e-05 ***
## cont3        0.51502    0.13149   3.917 0.000124 ***
## cont1:cont2  0.50541    0.04761  10.615  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3527 on 195 degrees of freedom
## Multiple R-squared:  0.7445, Adjusted R-squared:  0.7393 
## F-statistic: 142.1 on 4 and 195 DF,  p-value: < 2.2e-16
# Full interaction model
m7.b <- lm(y ~ cont1*cont2*cont3, data = data)
summary(m7.b)
## 
## Call:
## lm(formula = y ~ cont1 * cont2 * cont3, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56791 -0.26212  0.05059  0.16782  0.59297 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.45525    0.02524  18.035  < 2e-16 ***
## cont1             -0.49600    0.02674 -18.547  < 2e-16 ***
## cont2             -0.22647    0.05052  -4.483 1.27e-05 ***
## cont3              0.52665    0.13345   3.946 0.000111 ***
## cont1:cont2        0.49871    0.04831  10.322  < 2e-16 ***
## cont1:cont3        0.01036    0.14605   0.071 0.943527    
## cont2:cont3       -0.27012    0.28310  -0.954 0.341205    
## cont1:cont2:cont3  0.22055    0.27681   0.797 0.426566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3538 on 192 degrees of freedom
## Multiple R-squared:  0.7468, Adjusted R-squared:  0.7376 
## F-statistic:  80.9 on 7 and 192 DF,  p-value: < 2.2e-16

3. Mixed Predictors (Multiple regression)

At this point, we’ll drop the dummy codes for simplicity’s sake.

a. Two+ predictors, no interaction

## Codes

# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2

data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2 

data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3


## Model

m8 <- lm(y ~ A_B + (C_E + D_CE) + cont1 + cont2, data = data)
summary(m8)
## 
## Call:
## lm(formula = y ~ A_B + (C_E + D_CE) + cont1 + cont2, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96856 -0.13121 -0.02026  0.12438  1.53044 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.44894    0.02107  21.302  < 2e-16 ***
## A_B          0.25306    0.04248   5.956 1.19e-08 ***
## C_E          0.72701    0.04995  14.555  < 2e-16 ***
## D_CE        -0.01640    0.04609  -0.356    0.722    
## cont1       -0.52700    0.02214 -23.802  < 2e-16 ***
## cont2       -0.21703    0.04136  -5.248 4.02e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2945 on 194 degrees of freedom
## Multiple R-squared:  0.8227, Adjusted R-squared:  0.8181 
## F-statistic:   180 on 5 and 194 DF,  p-value: < 2.2e-16

b. Two+ predictors, categorical predictors interaction

## Codes

# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2

data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2 

data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3


## Model

m9 <- lm(y ~ A_B*(C_E + D_CE) + cont1 + cont2, data = data)
summary(m9)
## 
## Call:
## lm(formula = y ~ A_B * (C_E + D_CE) + cont1 + cont2, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92025 -0.12299 -0.01515  0.09638  1.54795 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.44397    0.02085  21.299  < 2e-16 ***
## A_B          0.25465    0.04204   6.058 7.14e-09 ***
## C_E          0.71548    0.04909  14.576  < 2e-16 ***
## D_CE        -0.03265    0.04576  -0.713  0.47645    
## cont1       -0.52886    0.02227 -23.751  < 2e-16 ***
## cont2       -0.20866    0.04061  -5.138 6.79e-07 ***
## A_B:C_E      0.29621    0.09862   3.004  0.00302 ** 
## A_B:D_CE     0.10296    0.09309   1.106  0.27012    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2885 on 192 degrees of freedom
## Multiple R-squared:  0.8316, Adjusted R-squared:  0.8255 
## F-statistic: 135.5 on 7 and 192 DF,  p-value: < 2.2e-16

c. Two+ predictors, continuous predictors interact

## Codes

# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2

data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2 

data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3


## Model

m10 <- lm(y ~ A_B + (C_E + D_CE) + cont1 * cont2, data = data)
summary(m10)
## 
## Call:
## lm(formula = y ~ A_B + (C_E + D_CE) + cont1 * cont2, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31165 -0.07369 -0.00949  0.06417  0.31120 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.445706   0.008189  54.430   <2e-16 ***
## A_B          0.294124   0.016553  17.769   <2e-16 ***
## C_E          0.699697   0.019424  36.022   <2e-16 ***
## D_CE        -0.029861   0.017910  -1.667   0.0971 .  
## cont1       -0.491064   0.008671 -56.634   <2e-16 ***
## cont2       -0.183506   0.016101 -11.397   <2e-16 ***
## cont1:cont2  0.512306   0.015502  33.048   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1144 on 193 degrees of freedom
## Multiple R-squared:  0.9734, Adjusted R-squared:  0.9725 
## F-statistic:  1176 on 6 and 193 DF,  p-value: < 2.2e-16

d. Two+ predictors, mixed interaction

## Codes

# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2

data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2 

data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3


## Model

m11.a <- lm(y ~ A_B*cont1 + (C_E + D_CE) + cont2, data = data) # 2-way mixed interaction, plus one categorical control and one continuous control
summary(m11.a)
## 
## Call:
## lm(formula = y ~ A_B * cont1 + (C_E + D_CE) + cont2, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97092 -0.13244 -0.02016  0.12088  1.53355 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.449173   0.021348  21.041  < 2e-16 ***
## A_B          0.252798   0.042726   5.917 1.47e-08 ***
## cont1       -0.527044   0.022206 -23.734  < 2e-16 ***
## C_E          0.727386   0.050313  14.457  < 2e-16 ***
## D_CE        -0.017100   0.047094  -0.363    0.717    
## cont2       -0.217298   0.041612  -5.222 4.56e-07 ***
## A_B:cont1   -0.003495   0.045492  -0.077    0.939    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2953 on 193 degrees of freedom
## Multiple R-squared:  0.8227, Adjusted R-squared:  0.8172 
## F-statistic: 149.3 on 6 and 193 DF,  p-value: < 2.2e-16
m11.b <- lm(y ~ A_B*cont1 + (C_E + D_CE)*cont2, data = data) # two 2-way mixed interactions
summary(m11.b)
## 
## Call:
## lm(formula = y ~ A_B * cont1 + (C_E + D_CE) * cont2, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90509 -0.11960 -0.03281  0.12086  1.33832 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.450233   0.021308  21.130  < 2e-16 ***
## A_B          0.253551   0.042705   5.937 1.34e-08 ***
## cont1       -0.526329   0.022187 -23.722  < 2e-16 ***
## C_E          0.724633   0.050195  14.436  < 2e-16 ***
## D_CE        -0.020262   0.047007  -0.431   0.6669    
## cont2       -0.209600   0.041722  -5.024 1.16e-06 ***
## A_B:cont1    0.002546   0.045625   0.056   0.9556    
## C_E:cont2    0.018987   0.098680   0.192   0.8476    
## D_CE:cont2  -0.160210   0.091565  -1.750   0.0818 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2945 on 191 degrees of freedom
## Multiple R-squared:  0.8255, Adjusted R-squared:  0.8182 
## F-statistic:   113 on 8 and 191 DF,  p-value: < 2.2e-16
m11.c <- lm(y ~ A_B*(C_E + D_CE)*cont1 + cont2, data = data) # 3-way ineraction plus a continuous control. (These are about as complicated as interactions get in the literature.)
summary(m11.c)
## 
## Call:
## lm(formula = y ~ A_B * (C_E + D_CE) * cont1 + cont2, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92270 -0.11470 -0.01578  0.09795  1.48038 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.434249   0.022157  19.599  < 2e-16 ***
## A_B             0.269169   0.044329   6.072 6.90e-09 ***
## C_E             0.718413   0.049897  14.398  < 2e-16 ***
## D_CE           -0.006560   0.050522  -0.130  0.89683    
## cont1          -0.534344   0.023046 -23.186  < 2e-16 ***
## cont2          -0.202134   0.041317  -4.892 2.14e-06 ***
## A_B:C_E         0.292624   0.100010   2.926  0.00386 ** 
## A_B:D_CE        0.067742   0.101050   0.670  0.50344    
## A_B:cont1       0.015119   0.046309   0.326  0.74442    
## C_E:cont1      -0.034343   0.055424  -0.620  0.53625    
## D_CE:cont1      0.019923   0.049842   0.400  0.68982    
## A_B:C_E:cont1  -0.001934   0.110842  -0.017  0.98610    
## A_B:D_CE:cont1 -0.141660   0.099961  -1.417  0.15810    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2902 on 187 degrees of freedom
## Multiple R-squared:  0.8341, Adjusted R-squared:  0.8234 
## F-statistic: 78.33 on 12 and 187 DF,  p-value: < 2.2e-16
m11.d <- lm(y ~ A_B*(C_E + D_CE)*cont1*cont2, data = data) #4-way interaction--these are rare, and can be quite difficult to interpret
summary(m11.d)
## 
## Call:
## lm(formula = y ~ A_B * (C_E + D_CE) * cont1 * cont2, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34315 -0.06172 -0.00193  0.04925  0.28615 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.439586   0.008265  53.185  < 2e-16 ***
## A_B                   0.300917   0.016530  18.204  < 2e-16 ***
## C_E                   0.692857   0.018599  37.253  < 2e-16 ***
## D_CE                 -0.036088   0.018852  -1.914   0.0572 .  
## cont1                -0.498177   0.008751 -56.929  < 2e-16 ***
## cont2                -0.185246   0.019419  -9.539  < 2e-16 ***
## A_B:C_E               0.220095   0.037197   5.917 1.67e-08 ***
## A_B:D_CE              0.036058   0.037703   0.956   0.3402    
## A_B:cont1            -0.002711   0.017502  -0.155   0.8771    
## C_E:cont1            -0.012730   0.020766  -0.613   0.5407    
## D_CE:cont1            0.001199   0.019125   0.063   0.9501    
## A_B:cont2             0.019021   0.038839   0.490   0.6249    
## C_E:cont2             0.042560   0.037736   1.128   0.2609    
## D_CE:cont2            0.077273   0.048229   1.602   0.1109    
## cont1:cont2           0.509107   0.017532  29.039  < 2e-16 ***
## A_B:C_E:cont1         0.030213   0.041533   0.727   0.4679    
## A_B:D_CE:cont1       -0.027270   0.038250  -0.713   0.4768    
## A_B:C_E:cont2         0.023168   0.075472   0.307   0.7592    
## A_B:D_CE:cont2       -0.030698   0.096458  -0.318   0.7507    
## A_B:cont1:cont2       0.005439   0.035063   0.155   0.8769    
## C_E:cont1:cont2       0.050771   0.040260   1.261   0.2090    
## D_CE:cont1:cont2      0.032917   0.039378   0.836   0.4043    
## A_B:C_E:cont1:cont2  -0.070475   0.080519  -0.875   0.3826    
## A_B:D_CE:cont1:cont2 -0.060845   0.078755  -0.773   0.4408    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1072 on 176 degrees of freedom
## Multiple R-squared:  0.9787, Adjusted R-squared:  0.9759 
## F-statistic: 351.3 on 23 and 176 DF,  p-value: < 2.2e-16

IV. Plots

1. Single Categorical Predictor

a. Two-level

plot1.a <- ggplot(data) +
  geom_bar(aes(x = cat1, y = y),
           stat = "summary", fun = "mean",
           position = position_dodge(.9))

plot1.a +
  xlab("Categorical Variable 1") +
  ylab("Outcome")

b. Three-level

plot1.b <- ggplot(data) +
  geom_bar(aes(x = cat2, y = y),
           stat = "summary", fun = "mean",
           position = position_dodge(.9))

plot1.b +
  xlab("Categorical Variable 2") +
  ylab("Outcome")

2. Single Continuous Predictor

plot2 <- ggplot(data) +
  geom_smooth(method = "lm",
              aes(x = cont1, y = y))

plot2 +
  xlab("Continuous Variable 1") +
  ylab("Outcome")

3. Two Categorical Predictors

plot3 <- ggplot(data) +
  geom_bar(aes(x = cat1,
               y = y,
               fill = cat2),
           stat = "summary", fun = "mean",
           position = position_dodge(.9))

plot3 + 
  xlab("Categorical Variable 1") +
  ylab("Outcome") +
  scale_fill_manual("Categorical Variable 2", values = c("red3","mediumorchid4","dodgerblue"))

# Or, reversed:

plot3.1 <- ggplot(data) +
  geom_bar(aes(x = cat2,
               y = y,
               fill = cat1),
           stat = "summary", fun = "mean",
           position = position_dodge(.9))

plot3.1 + 
  xlab("Categorical Variable 2") +
  ylab("Outcome") +
  scale_fill_manual("Categorical Variable 1", values = c("green4","salmon"))

4. One Categorical, One Continuous Predictor

a. Two-level

plot4.a <- ggplot(data) +
  geom_smooth(method = "lm",
            aes(x = cont1, 
                y = y))

plot4.a + 
  xlab("Continuous Variable 1") +
  ylab("Outcome") +
  facet_wrap(~cat1)

b. Three-level

plot4.b <- ggplot(data) +
  geom_smooth(method = "lm",
            aes(x = cont1, 
                y = y))

plot4.b + 
  xlab("Continuous Variable 1") +
  ylab("Outcome") +
  facet_wrap(~cat2)

V. Model selection questions

If you’re still not sure about any of this, email me to ask!

1. Should I include an interaction term?

Including an interaction term is, like most design and analysis decisions, a judgment call. You should do so when you have a reason to expect that the effect of one variable might depend on what’s going on with another variable.

a. Examples

Some examples of typical interaction questions.

i. Intersectionality

The concept of intersectionality draws on the idea of an interaction effect. When you make the argument that black women experience discrimination in the workplace to a different degree than either black men or white women, you are citing an interaction effect (See Section III.1.d for code).

ii. Math scores

Let’s say I’m trying to model student scores on a math test (Y) as a function of math test scores in the previous year (X1) and a binary variable indicating whether the student attended a refresher course in rudimentary math topics (X2). Because the course only covered rudimentary abilities, there are good theoretical reasons to think that it might produce a bigger impact on test scores for students who started at a lower baseline, and little or no impact on those students who were already doing well (and thus already knew the rudimentary topics it covered). So I should include an interaction term between prior test scores and course attendance to test if this is true or not (I would predict that the interaction term coefficient would be negative and significant in this case; see Section III.3.d for code).

iii. Demographic variables

Often, demographics will be included as controls, but not as interaction terms. However, sometimes it will be useful to use demographics as an interaction term. Let’s say that we are examining support for expanding Medicare, a government program that provides health insurance to older and disabled people, among Democrats and Republicans. We would expect Democrats to support this policy more than Republicans, because Dems support public policies more than Reps, but we might also expect older people to support this more than younger people, because they are more likely to rely on this program. Do older Republicans support the policy more than younger ones? Do they support it as much as Democrats do? To evaluate this questions, we’d need to run an interaction (Section III.3.d again).

2. How do I know if a variable is a predictor or a control?

In general, a variable is a predictor if you have a specific, directional hypothesis about it, and/or if you want to draw conclusions about its relationship to your outcome variable. A variable is a control (or covariate) if you’re pretty confident that it will be related to your outcome variable, but you don’t have any specific hypotheses about it.

Control variables are also not included in interaction terms. That’s generally reserved for predictors only.