I. HELLO
Welcome to the script cheatsheet for Politics & Psychology. This is meant to be a repository of the code and models we’ve reviewed or learned about this semester.
1. How to read this document
Below, in Section I.2, you’ll see a list of research questions. These are divided into type (categorical or continuous) and number of (2 or 3+) predictor variables. Refer to this list to determine which Model you should reference.
Also: This document (.Rmd) is written in an R Markdown file. R Markdown is designed to produce documents that include both code and text–instead of requiring #’s to create a comment, you can type text directly into the file. The trade-off is that any code you want to run must be included in “r chunks”, like this:
## [1] 7
So, only code included in an R chunk will run. You can insert R chunks by clicking the green “+[c]” button above and selecting R. Or, you can hit Cmd + Option + I (Ctrl + Alt + I on Windows).
2. Research question types
Categorical Predictors
One predictor (2 levels): Section III.1.a
One predictor (3+ levels): Section III.1.b
Two or more predictors, no interaction: Section III.1.c
Two or more predictors, interaction: Section III.1.d
Continuous Predictors
One predictor: Section III.2.a
Two+ predictors, no interaction: Section III.2.b
Two+ predictors, at least one interaction: Section III.2.c
Mixed Predictors
Two+ predictors, no interaction: Section III.3.a
Two+ predictors, categorical predictors interact: Section III.3.b
Two+ predictors, continuous predictors interact: Section III.3.c
Two+ predictors, mixed interaction: Section III.3.d
II. Libraries and data
2. Data
This cheatsheet is going to be self-contained (i.e., it won’t require you to import a separate dataset), so you need to run this code to create the data you’ll use for the rest of the demo. (If this is confusing to you, don’t worry about it. Just run the chunk of code below and move onto the next part.)
III. Models
1. Categorical Predictors
a. Single categorical predictor, 2-levels (t-test)
This is a standard t-test! Here are two different ways to run a t-test.
i. Basic approach
##
## Two Sample t-test
##
## data: data$y by data$cat1
## t = -2.0888, df = 198, p-value = 0.038
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
## -0.39460283 -0.01134886
## sample estimates:
## mean in group A mean in group B
## 0.3649130 0.5678889
ii. General Linear Model approach
## Codes
# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2
# Dummy Codes
data$A.d <- NA
data$A.d[data$cat1 == "A"] <- 0
data$A.d[data$cat1 == "B"] <- 1
data$B.d <- NA
data$B.d[data$cat1 == "A"] <- 1
data$B.d[data$cat1 == "B"] <- 0
## Models
# with contrast codes
m1.c <- lm(y ~ A_B, data = data)
summary(m1.c)
##
## Call:
## lm(formula = y ~ A_B, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.08289 -0.39741 -0.06389 0.38336 2.59211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.46640 0.04859 9.599 <2e-16 ***
## A_B 0.20298 0.09717 2.089 0.038 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6849 on 198 degrees of freedom
## Multiple R-squared: 0.02156, Adjusted R-squared: 0.01662
## F-statistic: 4.363 on 1 and 198 DF, p-value: 0.038
##
## Call:
## lm(formula = y ~ A.d, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.08289 -0.39741 -0.06389 0.38336 2.59211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.36491 0.07141 5.110 7.55e-07 ***
## A.d 0.20298 0.09717 2.089 0.038 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6849 on 198 degrees of freedom
## Multiple R-squared: 0.02156, Adjusted R-squared: 0.01662
## F-statistic: 4.363 on 1 and 198 DF, p-value: 0.038
##
## Call:
## lm(formula = y ~ B.d, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.08289 -0.39741 -0.06389 0.38336 2.59211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.56789 0.06591 8.617 2.18e-15 ***
## B.d -0.20298 0.09717 -2.089 0.038 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6849 on 198 degrees of freedom
## Multiple R-squared: 0.02156, Adjusted R-squared: 0.01662
## F-statistic: 4.363 on 1 and 198 DF, p-value: 0.038
b. Single categorical predictor, 3-levels+ (One-way ANOVA)
This is a one-way ANOVA! Here are two different ways to run a one-way ANOVA.
i. Basic approach
## Df Sum Sq Mean Sq F value Pr(>F)
## data$cat2 2 25.58 12.792 36.34 3.69e-14 ***
## Residuals 197 69.35 0.352
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ii. General Linear Model
Remember, you’ll always need k-1 codes in your model to fully represent a categorical variable (k being the number of levels in a categorical variable).
## Codes
# Contrast Codes
data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2
data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3
# Dummy Codes: C is centered.
data$C_D.d <- NA
data$C_D.d[data$cat2 == "C"] <- 0
data$C_D.d[data$cat2 == "D"] <- 1
data$C_D.d[data$cat2 == "E"] <- 0
data$C_E.d <- NA
data$C_E.d[data$cat2 == "C"] <- 0
data$C_E.d[data$cat2 == "D"] <- 0
data$C_E.d[data$cat2 == "E"] <- 1
## Models
# with contrast codes
m2.c <- lm(y ~ C_E + D_CE, data = data)
summary(m2.c)
##
## Call:
## lm(formula = y ~ C_E + D_CE, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07041 -0.36612 -0.03174 0.28220 2.28038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.49130 0.04217 11.651 < 2e-16 ***
## C_E 0.84075 0.09967 8.436 6.99e-15 ***
## D_CE -0.09617 0.09248 -1.040 0.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5933 on 197 degrees of freedom
## Multiple R-squared: 0.2695, Adjusted R-squared: 0.2621
## F-statistic: 36.34 on 2 and 197 DF, p-value: 3.688e-14
##
## Call:
## lm(formula = y ~ C_D.d + C_E.d, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07041 -0.36612 -0.03174 0.28220 2.28038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03886 0.06897 0.563 0.574
## C_D.d 0.51655 0.10405 4.964 1.49e-06 ***
## C_E.d 0.84075 0.09967 8.436 6.99e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5933 on 197 degrees of freedom
## Multiple R-squared: 0.2695, Adjusted R-squared: 0.2621
## F-statistic: 36.34 on 2 and 197 DF, p-value: 3.688e-14
c. Multiple categorical predictors: No interaction (ANOVA)
i. Basic approach
## Df Sum Sq Mean Sq F value Pr(>F)
## data$cat1 1 2.05 2.047 5.846 0.0165 *
## data$cat2 2 24.26 12.129 34.642 1.31e-13 ***
## Residuals 196 68.63 0.350
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ii. General Linear Model
R
## Codes
# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2
# Dummy Codes
data$A.d <- NA # A is centered
data$A.d[data$cat1 == "A"] <- 0
data$A.d[data$cat1 == "B"] <- 1
# Cat 2 Contrast Codes
data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2
data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3
# Cat 2 Dummy Codes: C is centered.
data$C_D.d <- NA
data$C_D.d[data$cat2 == "C"] <- 0
data$C_D.d[data$cat2 == "D"] <- 1
data$C_D.d[data$cat2 == "E"] <- 0
data$C_E.d <- NA
data$C_E.d[data$cat2 == "C"] <- 0
data$C_E.d[data$cat2 == "D"] <- 0
data$C_E.d[data$cat2 == "E"] <- 1
## Models
# with contrast codes
m3.c <- lm(y ~ A_B + (C_E + D_CE), data = data)
summary(m3.c)
##
## Call:
## lm(formula = y ~ A_B + (C_E + D_CE), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.12068 -0.32333 -0.02642 0.25787 2.23037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.48588 0.04222 11.507 < 2e-16 ***
## A_B 0.12147 0.08460 1.436 0.153
## C_E 0.82511 0.10000 8.251 2.26e-14 ***
## D_CE -0.08860 0.09238 -0.959 0.339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5917 on 196 degrees of freedom
## Multiple R-squared: 0.2771, Adjusted R-squared: 0.266
## F-statistic: 25.04 on 3 and 196 DF, p-value: 9.254e-14
##
## Call:
## lm(formula = y ~ A.d + (C_D.d + C_E.d), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.12068 -0.32333 -0.02642 0.25787 2.23037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01694 0.07901 -0.214 0.830
## A.d 0.12147 0.08460 1.436 0.153
## C_D.d 0.50115 0.10432 4.804 3.09e-06 ***
## C_E.d 0.82511 0.10000 8.251 2.26e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5917 on 196 degrees of freedom
## Multiple R-squared: 0.2771, Adjusted R-squared: 0.266
## F-statistic: 25.04 on 3 and 196 DF, p-value: 9.254e-14
d. Multiple categorical predictors (Factorial ANOVA)
i. Basic approach
## Df Sum Sq Mean Sq F value Pr(>F)
## cat1 1 2.05 2.047 6.059 0.0147 *
## cat2 2 24.26 12.129 35.909 5.4e-14 ***
## cat1:cat2 2 3.10 1.548 4.582 0.0114 *
## Residuals 194 65.53 0.338
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ii. General Linear Model
R
## Codes
# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2
# Dummy Codes
data$A.d <- NA # A is centered
data$A.d[data$cat1 == "A"] <- 0
data$A.d[data$cat1 == "B"] <- 1
# Cat 2 Contrast Codes
data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2
data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3
# Cat 2 Dummy Codes: C is centered.
data$C_D.d <- NA
data$C_D.d[data$cat2 == "C"] <- 0
data$C_D.d[data$cat2 == "D"] <- 1
data$C_D.d[data$cat2 == "E"] <- 0
data$C_E.d <- NA
data$C_E.d[data$cat2 == "C"] <- 0
data$C_E.d[data$cat2 == "D"] <- 0
data$C_E.d[data$cat2 == "E"] <- 1
## Models
# with contrast codes
m4.c <- lm(y ~ A_B*(C_E + D_CE), data = data)
summary(m4.c)
##
## Call:
## lm(formula = y ~ A_B * (C_E + D_CE), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.95903 -0.35559 -0.04732 0.26709 2.14435
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.49247 0.04180 11.782 < 2e-16 ***
## A_B 0.09816 0.08360 1.174 0.24175
## C_E 0.80215 0.09860 8.135 4.82e-14 ***
## D_CE -0.12922 0.09183 -1.407 0.16097
## A_B:C_E 0.09707 0.19720 0.492 0.62311
## A_B:D_CE 0.55101 0.18366 3.000 0.00305 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5812 on 194 degrees of freedom
## Multiple R-squared: 0.3097, Adjusted R-squared: 0.2919
## F-statistic: 17.41 on 5 and 194 DF, p-value: 3.178e-14
##
## Call:
## lm(formula = y ~ A.d * (C_D.d + C_E.d), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.95903 -0.35559 -0.04732 0.26709 2.14435
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.06832 0.09189 -0.744 0.4581
## A.d 0.23330 0.13557 1.721 0.0869 .
## C_D.d 0.78153 0.15006 5.208 4.85e-07 ***
## C_E.d 0.75361 0.14321 5.262 3.75e-07 ***
## A.d:C_D.d -0.50247 0.20588 -2.441 0.0156 *
## A.d:C_E.d 0.09707 0.19720 0.492 0.6231
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5812 on 194 degrees of freedom
## Multiple R-squared: 0.3097, Adjusted R-squared: 0.2919
## F-statistic: 17.41 on 5 and 194 DF, p-value: 3.178e-14
2. Continuous Predictors
a. Single continuous predictor (Simple regression)
There is now only one approach to evaluate these models: The General Linear Model approach.
##
## Call:
## lm(formula = y ~ cont1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.46274 -0.34876 -0.03768 0.28869 1.69326
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.45018 0.03319 13.56 <2e-16 ***
## cont1 -0.53251 0.03483 -15.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4689 on 198 degrees of freedom
## Multiple R-squared: 0.5414, Adjusted R-squared: 0.5391
## F-statistic: 233.8 on 1 and 198 DF, p-value: < 2.2e-16
b. Multiple continuous predictors: No interaction (Multiple regression)
##
## Call:
## lm(formula = y ~ cont1 + cont2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18305 -0.32673 -0.00228 0.30554 1.72291
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.44688 0.03209 13.924 < 2e-16 ***
## cont1 -0.53063 0.03367 -15.762 < 2e-16 ***
## cont2 -0.24575 0.06355 -3.867 0.00015 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4532 on 197 degrees of freedom
## Multiple R-squared: 0.5738, Adjusted R-squared: 0.5695
## F-statistic: 132.6 on 2 and 197 DF, p-value: < 2.2e-16
c. Multiple continuous predictors (Multiple regression)
##
## Call:
## lm(formula = y ~ cont1 * cont2 + cont3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58866 -0.26500 0.04565 0.19269 0.64683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.45481 0.02509 18.124 < 2e-16 ***
## cont1 -0.49606 0.02647 -18.741 < 2e-16 ***
## cont2 -0.21774 0.04957 -4.392 1.84e-05 ***
## cont3 0.51502 0.13149 3.917 0.000124 ***
## cont1:cont2 0.50541 0.04761 10.615 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3527 on 195 degrees of freedom
## Multiple R-squared: 0.7445, Adjusted R-squared: 0.7393
## F-statistic: 142.1 on 4 and 195 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = y ~ cont1 * cont2 * cont3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56791 -0.26212 0.05059 0.16782 0.59297
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.45525 0.02524 18.035 < 2e-16 ***
## cont1 -0.49600 0.02674 -18.547 < 2e-16 ***
## cont2 -0.22647 0.05052 -4.483 1.27e-05 ***
## cont3 0.52665 0.13345 3.946 0.000111 ***
## cont1:cont2 0.49871 0.04831 10.322 < 2e-16 ***
## cont1:cont3 0.01036 0.14605 0.071 0.943527
## cont2:cont3 -0.27012 0.28310 -0.954 0.341205
## cont1:cont2:cont3 0.22055 0.27681 0.797 0.426566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3538 on 192 degrees of freedom
## Multiple R-squared: 0.7468, Adjusted R-squared: 0.7376
## F-statistic: 80.9 on 7 and 192 DF, p-value: < 2.2e-16
3. Mixed Predictors (Multiple regression)
At this point, we’ll drop the dummy codes for simplicity’s sake.
a. Two+ predictors, no interaction
## Codes
# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2
data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2
data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3
## Model
m8 <- lm(y ~ A_B + (C_E + D_CE) + cont1 + cont2, data = data)
summary(m8)
##
## Call:
## lm(formula = y ~ A_B + (C_E + D_CE) + cont1 + cont2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96856 -0.13121 -0.02026 0.12438 1.53044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.44894 0.02107 21.302 < 2e-16 ***
## A_B 0.25306 0.04248 5.956 1.19e-08 ***
## C_E 0.72701 0.04995 14.555 < 2e-16 ***
## D_CE -0.01640 0.04609 -0.356 0.722
## cont1 -0.52700 0.02214 -23.802 < 2e-16 ***
## cont2 -0.21703 0.04136 -5.248 4.02e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2945 on 194 degrees of freedom
## Multiple R-squared: 0.8227, Adjusted R-squared: 0.8181
## F-statistic: 180 on 5 and 194 DF, p-value: < 2.2e-16
b. Two+ predictors, categorical predictors interaction
## Codes
# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2
data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2
data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3
## Model
m9 <- lm(y ~ A_B*(C_E + D_CE) + cont1 + cont2, data = data)
summary(m9)
##
## Call:
## lm(formula = y ~ A_B * (C_E + D_CE) + cont1 + cont2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92025 -0.12299 -0.01515 0.09638 1.54795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.44397 0.02085 21.299 < 2e-16 ***
## A_B 0.25465 0.04204 6.058 7.14e-09 ***
## C_E 0.71548 0.04909 14.576 < 2e-16 ***
## D_CE -0.03265 0.04576 -0.713 0.47645
## cont1 -0.52886 0.02227 -23.751 < 2e-16 ***
## cont2 -0.20866 0.04061 -5.138 6.79e-07 ***
## A_B:C_E 0.29621 0.09862 3.004 0.00302 **
## A_B:D_CE 0.10296 0.09309 1.106 0.27012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2885 on 192 degrees of freedom
## Multiple R-squared: 0.8316, Adjusted R-squared: 0.8255
## F-statistic: 135.5 on 7 and 192 DF, p-value: < 2.2e-16
c. Two+ predictors, continuous predictors interact
## Codes
# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2
data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2
data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3
## Model
m10 <- lm(y ~ A_B + (C_E + D_CE) + cont1 * cont2, data = data)
summary(m10)
##
## Call:
## lm(formula = y ~ A_B + (C_E + D_CE) + cont1 * cont2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31165 -0.07369 -0.00949 0.06417 0.31120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.445706 0.008189 54.430 <2e-16 ***
## A_B 0.294124 0.016553 17.769 <2e-16 ***
## C_E 0.699697 0.019424 36.022 <2e-16 ***
## D_CE -0.029861 0.017910 -1.667 0.0971 .
## cont1 -0.491064 0.008671 -56.634 <2e-16 ***
## cont2 -0.183506 0.016101 -11.397 <2e-16 ***
## cont1:cont2 0.512306 0.015502 33.048 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1144 on 193 degrees of freedom
## Multiple R-squared: 0.9734, Adjusted R-squared: 0.9725
## F-statistic: 1176 on 6 and 193 DF, p-value: < 2.2e-16
d. Two+ predictors, mixed interaction
## Codes
# Contrast Codes
data$A_B <- NA
data$A_B[data$cat1 == "A"] <- -1/2
data$A_B[data$cat1 == "B"] <- 1/2
data$C_E <- NA # this code compares group C to group E, zeroing out group D
data$C_E[data$cat2 == "C"] <- -1/2
data$C_E[data$cat2 == "D"] <- 0
data$C_E[data$cat2 == "E"] <- 1/2
data$D_CE <- NA # this code compares group D to the average across groups C and E
data$D_CE[data$cat2 == "C"] <- 1/3
data$D_CE[data$cat2 == "D"] <- -2/3
data$D_CE[data$cat2 == "E"] <- 1/3
## Model
m11.a <- lm(y ~ A_B*cont1 + (C_E + D_CE) + cont2, data = data) # 2-way mixed interaction, plus one categorical control and one continuous control
summary(m11.a)
##
## Call:
## lm(formula = y ~ A_B * cont1 + (C_E + D_CE) + cont2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.97092 -0.13244 -0.02016 0.12088 1.53355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.449173 0.021348 21.041 < 2e-16 ***
## A_B 0.252798 0.042726 5.917 1.47e-08 ***
## cont1 -0.527044 0.022206 -23.734 < 2e-16 ***
## C_E 0.727386 0.050313 14.457 < 2e-16 ***
## D_CE -0.017100 0.047094 -0.363 0.717
## cont2 -0.217298 0.041612 -5.222 4.56e-07 ***
## A_B:cont1 -0.003495 0.045492 -0.077 0.939
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2953 on 193 degrees of freedom
## Multiple R-squared: 0.8227, Adjusted R-squared: 0.8172
## F-statistic: 149.3 on 6 and 193 DF, p-value: < 2.2e-16
m11.b <- lm(y ~ A_B*cont1 + (C_E + D_CE)*cont2, data = data) # two 2-way mixed interactions
summary(m11.b)
##
## Call:
## lm(formula = y ~ A_B * cont1 + (C_E + D_CE) * cont2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90509 -0.11960 -0.03281 0.12086 1.33832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.450233 0.021308 21.130 < 2e-16 ***
## A_B 0.253551 0.042705 5.937 1.34e-08 ***
## cont1 -0.526329 0.022187 -23.722 < 2e-16 ***
## C_E 0.724633 0.050195 14.436 < 2e-16 ***
## D_CE -0.020262 0.047007 -0.431 0.6669
## cont2 -0.209600 0.041722 -5.024 1.16e-06 ***
## A_B:cont1 0.002546 0.045625 0.056 0.9556
## C_E:cont2 0.018987 0.098680 0.192 0.8476
## D_CE:cont2 -0.160210 0.091565 -1.750 0.0818 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2945 on 191 degrees of freedom
## Multiple R-squared: 0.8255, Adjusted R-squared: 0.8182
## F-statistic: 113 on 8 and 191 DF, p-value: < 2.2e-16
m11.c <- lm(y ~ A_B*(C_E + D_CE)*cont1 + cont2, data = data) # 3-way ineraction plus a continuous control. (These are about as complicated as interactions get in the literature.)
summary(m11.c)
##
## Call:
## lm(formula = y ~ A_B * (C_E + D_CE) * cont1 + cont2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92270 -0.11470 -0.01578 0.09795 1.48038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.434249 0.022157 19.599 < 2e-16 ***
## A_B 0.269169 0.044329 6.072 6.90e-09 ***
## C_E 0.718413 0.049897 14.398 < 2e-16 ***
## D_CE -0.006560 0.050522 -0.130 0.89683
## cont1 -0.534344 0.023046 -23.186 < 2e-16 ***
## cont2 -0.202134 0.041317 -4.892 2.14e-06 ***
## A_B:C_E 0.292624 0.100010 2.926 0.00386 **
## A_B:D_CE 0.067742 0.101050 0.670 0.50344
## A_B:cont1 0.015119 0.046309 0.326 0.74442
## C_E:cont1 -0.034343 0.055424 -0.620 0.53625
## D_CE:cont1 0.019923 0.049842 0.400 0.68982
## A_B:C_E:cont1 -0.001934 0.110842 -0.017 0.98610
## A_B:D_CE:cont1 -0.141660 0.099961 -1.417 0.15810
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2902 on 187 degrees of freedom
## Multiple R-squared: 0.8341, Adjusted R-squared: 0.8234
## F-statistic: 78.33 on 12 and 187 DF, p-value: < 2.2e-16
m11.d <- lm(y ~ A_B*(C_E + D_CE)*cont1*cont2, data = data) #4-way interaction--these are rare, and can be quite difficult to interpret
summary(m11.d)
##
## Call:
## lm(formula = y ~ A_B * (C_E + D_CE) * cont1 * cont2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34315 -0.06172 -0.00193 0.04925 0.28615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.439586 0.008265 53.185 < 2e-16 ***
## A_B 0.300917 0.016530 18.204 < 2e-16 ***
## C_E 0.692857 0.018599 37.253 < 2e-16 ***
## D_CE -0.036088 0.018852 -1.914 0.0572 .
## cont1 -0.498177 0.008751 -56.929 < 2e-16 ***
## cont2 -0.185246 0.019419 -9.539 < 2e-16 ***
## A_B:C_E 0.220095 0.037197 5.917 1.67e-08 ***
## A_B:D_CE 0.036058 0.037703 0.956 0.3402
## A_B:cont1 -0.002711 0.017502 -0.155 0.8771
## C_E:cont1 -0.012730 0.020766 -0.613 0.5407
## D_CE:cont1 0.001199 0.019125 0.063 0.9501
## A_B:cont2 0.019021 0.038839 0.490 0.6249
## C_E:cont2 0.042560 0.037736 1.128 0.2609
## D_CE:cont2 0.077273 0.048229 1.602 0.1109
## cont1:cont2 0.509107 0.017532 29.039 < 2e-16 ***
## A_B:C_E:cont1 0.030213 0.041533 0.727 0.4679
## A_B:D_CE:cont1 -0.027270 0.038250 -0.713 0.4768
## A_B:C_E:cont2 0.023168 0.075472 0.307 0.7592
## A_B:D_CE:cont2 -0.030698 0.096458 -0.318 0.7507
## A_B:cont1:cont2 0.005439 0.035063 0.155 0.8769
## C_E:cont1:cont2 0.050771 0.040260 1.261 0.2090
## D_CE:cont1:cont2 0.032917 0.039378 0.836 0.4043
## A_B:C_E:cont1:cont2 -0.070475 0.080519 -0.875 0.3826
## A_B:D_CE:cont1:cont2 -0.060845 0.078755 -0.773 0.4408
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1072 on 176 degrees of freedom
## Multiple R-squared: 0.9787, Adjusted R-squared: 0.9759
## F-statistic: 351.3 on 23 and 176 DF, p-value: < 2.2e-16
IV. Plots
1. Single Categorical Predictor
2. Single Continuous Predictor
plot2 <- ggplot(data) +
geom_smooth(method = "lm",
aes(x = cont1, y = y))
plot2 +
xlab("Continuous Variable 1") +
ylab("Outcome")
3. Two Categorical Predictors
plot3 <- ggplot(data) +
geom_bar(aes(x = cat1,
y = y,
fill = cat2),
stat = "summary", fun = "mean",
position = position_dodge(.9))
plot3 +
xlab("Categorical Variable 1") +
ylab("Outcome") +
scale_fill_manual("Categorical Variable 2", values = c("red3","mediumorchid4","dodgerblue"))
# Or, reversed:
plot3.1 <- ggplot(data) +
geom_bar(aes(x = cat2,
y = y,
fill = cat1),
stat = "summary", fun = "mean",
position = position_dodge(.9))
plot3.1 +
xlab("Categorical Variable 2") +
ylab("Outcome") +
scale_fill_manual("Categorical Variable 1", values = c("green4","salmon"))
V. Model selection questions
If you’re still not sure about any of this, email me to ask! af13@williams.edu
1. Should I include an interaction term?
Including an interaction term is, like most design and analysis decisions, a judgment call. You should do so when you have a reason to expect that the effect of one variable might depend on what’s going on with another variable.
a. Examples
Some examples of typical interaction questions.
i. Intersectionality
The concept of intersectionality draws on the idea of an interaction effect. When you make the argument that black women experience discrimination in the workplace to a different degree than either black men or white women, you are citing an interaction effect (See Section III.1.d for code).
ii. Math scores
Let’s say I’m trying to model student scores on a math test (Y) as a function of math test scores in the previous year (X1) and a binary variable indicating whether the student attended a refresher course in rudimentary math topics (X2). Because the course only covered rudimentary abilities, there are good theoretical reasons to think that it might produce a bigger impact on test scores for students who started at a lower baseline, and little or no impact on those students who were already doing well (and thus already knew the rudimentary topics it covered). So I should include an interaction term between prior test scores and course attendance to test if this is true or not (I would predict that the interaction term coefficient would be negative and significant in this case; see Section III.3.d for code).
iii. Demographic variables
Often, demographics will be included as controls, but not as interaction terms. However, sometimes it will be useful to use demographics as an interaction term. Let’s say that we are examining support for expanding Medicare, a government program that provides health insurance to older and disabled people, among Democrats and Republicans. We would expect Democrats to support this policy more than Republicans, because Dems support public policies more than Reps, but we might also expect older people to support this more than younger people, because they are more likely to rely on this program. Do older Republicans support the policy more than younger ones? Do they support it as much as Democrats do? To evaluate this questions, we’d need to run an interaction (Section III.3.d again).
2. How do I know if a variable is a predictor or a control?
In general, a variable is a predictor if you have a specific, directional hypothesis about it, and/or if you want to draw conclusions about its relationship to your outcome variable. A variable is a control (or covariate) if you’re pretty confident that it will be related to your outcome variable, but you don’t have any specific hypotheses about it.
Control variables are also not included in interaction terms. That’s generally reserved for predictors only.