Interactions for Continuous Variables via Multiple Regression

This is a short demonstration of how interaction terms can be constructed in R when running a multiple regression model. This uses the Prestige dataset from the car package.

library(car)
data("Prestige")
str(Prestige)
## 'data.frame':    102 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...

Main Effects Model

Predict income based upon education and prestige.

mod_1 <- lm(income ~ education + prestige, data = Prestige)
summary(mod_1)
## 
## Call:
## lm(formula = income ~ education + prestige, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6649.0 -1453.5   102.4  1234.3 14901.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -714.66    1257.53  -0.568    0.571    
## education    -169.63     207.01  -0.819    0.415    
## prestige      199.30      32.83   6.071 2.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2989 on 99 degrees of freedom
## Multiple R-squared:  0.5144, Adjusted R-squared:  0.5046 
## F-statistic: 52.43 on 2 and 99 DF,  p-value: 2.962e-16

Education is not a significant predictor of income, but prestige is. But maybe there is an interaction between income and prestige?

Interactions for Continuous Varialbes

** Method 1**: Create a new variable that is the product of the two variables.

Prestige$int_raw <- Prestige$education * Prestige$prestige
mod_2 <- lm(income ~ education + prestige + int_raw, data = Prestige)
summary(mod_2)
## 
## Call:
## lm(formula = income ~ education + prestige + int_raw, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7107.5 -1606.3   130.3  1036.2 15345.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 8074.663   3903.035   2.069   0.0412 *
## education   -995.023    402.441  -2.472   0.0151 *
## prestige       8.833     86.453   0.102   0.9188  
## int_raw       16.582      6.989   2.373   0.0196 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2921 on 98 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5267 
## F-statistic: 38.47 on 3 and 98 DF,  p-value: < 2.2e-16

The interaction is significant! And the main effect of prestige basically falls away. But what about multicollinearity? Also, ignoring the individual coefficients, note the overall model information: \(R^2\) = .54, \(F\)(3,98) = 38.47, \(p\) < .001.

vif(mod_2) # variance inflation factors 
## education  prestige   int_raw 
##  14.27187  26.18730  60.62971
sqrt(vif(mod_2)) > 2 # problem?
## education  prestige   int_raw 
##      TRUE      TRUE      TRUE

All VIFS are greater than 2, and the VIF for the interaction is especially problematic.

Method 2: Create the new variable based upon the product of the centered raw variables.

Prestige$int_cent <- scale(Prestige$education, scale = FALSE) * scale(Prestige$prestige, scale = FALSE)
mod_3 <- lm(income ~ education + prestige + int_cent, data = Prestige)
summary(mod_3)
## 
## Call:
## lm(formula = income ~ education + prestige + int_cent, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7107.5 -1606.3   130.3  1036.2 15345.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -264.216   1243.697  -0.212   0.8322    
## education   -218.450    203.376  -1.074   0.2854    
## prestige     186.888     32.512   5.748 1.02e-07 ***
## int_cent      16.582      6.989   2.373   0.0196 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2921 on 98 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5267 
## F-statistic: 38.47 on 3 and 98 DF,  p-value: < 2.2e-16

Using this method, the interaction is still significant, but the main effects change (with prestige significant again). The oveall model fit, however, is identical to Method 1: \(R^2\) = .54, \(F\)(3,98) = 38.47, \(p\) < .001. However, we have drastically reduced our issue with multicollinearity:

vif(mod_3)
## education  prestige  int_cent 
##  3.644822  3.703458  1.234984
sqrt(vif(mod_3)) > 2 # problem? no problem!
## education  prestige  int_cent 
##     FALSE     FALSE     FALSE

Method 3: Let R do it for us! We should also note that R has easy syntax to create interaction variables, but which method does it use? Let’s check:

mod_4 <- lm(income ~ education * prestige, data = Prestige)
summary(mod_4)
## 
## Call:
## lm(formula = income ~ education * prestige, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7107.5 -1606.3   130.3  1036.2 15345.5 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        8074.663   3903.035   2.069   0.0412 *
## education          -995.023    402.441  -2.472   0.0151 *
## prestige              8.833     86.453   0.102   0.9188  
## education:prestige   16.582      6.989   2.373   0.0196 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2921 on 98 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5267 
## F-statistic: 38.47 on 3 and 98 DF,  p-value: < 2.2e-16

Uh oh! Using this approach, R does not center our variables beforehand and we end up with the un-centered coefficients as above. However, if we use the Anova function in car, then we get the correct coefficients:

Anova(mod_4)
## Anova Table (Type II tests)
## 
## Response: income
##                       Sum Sq Df F value    Pr(>F)    
## education            5997200  1  0.7029   0.40386    
## prestige           329175528  1 38.5791 1.275e-08 ***
## education:prestige  48032945  1  5.6294   0.01961 *  
## Residuals          836183552 98                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion

The convenience functions for building interaction terms in R are very straight-forward to use (e.g., using \(^3\) to add all possible three-way interactions and so forth), however it should be noted that this should primarily be used for comparing overall fit and the interaction terms, rather than looking at lower order effects, as it does not center our variables beforehand. We can use Anova() in the car library or manually center the variables via scale() to get the proper effects and avoid variance inflation due to multicollinearity.

mod_4 <- lm(income ~ education:prestige, data = Prestige) summary(mod_4)A