Main Effects Model
Predict income from education and prestige (main effects only). The Prestige data set, along with the vif() and Anova() functions used below, comes from the car package.
library(car)  # loads the Prestige data and provides vif() and Anova()
mod_1 <- lm(income ~ education + prestige, data = Prestige)
summary(mod_1)
##
## Call:
## lm(formula = income ~ education + prestige, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6649.0 -1453.5 102.4 1234.3 14901.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -714.66 1257.53 -0.568 0.571
## education -169.63 207.01 -0.819 0.415
## prestige 199.30 32.83 6.071 2.36e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2989 on 99 degrees of freedom
## Multiple R-squared: 0.5144, Adjusted R-squared: 0.5046
## F-statistic: 52.43 on 2 and 99 DF, p-value: 2.962e-16
Education is not a significant predictor of income, but prestige is. But perhaps there is an interaction between education and prestige?
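One reason education fails to reach significance here may be that education and prestige are themselves strongly correlated, so they compete for the same variance. A quick check with base R:
cor(Prestige$education, Prestige$prestige)  # expect a strong positive correlation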
Interactions for Continuous Variables
Method 1: Create a new variable that is the product of the two raw variables.
Prestige$int_raw <- Prestige$education * Prestige$prestige
mod_2 <- lm(income ~ education + prestige + int_raw, data = Prestige)
summary(mod_2)
##
## Call:
## lm(formula = income ~ education + prestige + int_raw, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7107.5 -1606.3 130.3 1036.2 15345.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8074.663 3903.035 2.069 0.0412 *
## education -995.023 402.441 -2.472 0.0151 *
## prestige 8.833 86.453 0.102 0.9188
## int_raw 16.582 6.989 2.373 0.0196 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2921 on 98 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5267
## F-statistic: 38.47 on 3 and 98 DF, p-value: < 2.2e-16
The interaction is significant! And the main effect of prestige essentially falls away. Ignoring the individual coefficients for now, note the overall model information: \(R^2\) = .54, \(F\)(3,98) = 38.47, \(p\) < .001.
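Since mod_2 adds exactly one term to mod_1, we can also test the interaction via a model comparison; the resulting \(F\) should simply be the squared \(t\) for int_raw above (a quick check with base R's anova()):
anova(mod_1, mod_2)  # F test for adding the interaction term
But what about multicollinearity?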
vif(mod_2) # variance inflation factors
## education prestige int_raw
## 14.27187 26.18730 60.62971
sqrt(vif(mod_2)) > 2 # problem?
## education prestige int_raw
## TRUE TRUE TRUE
The square roots of all three VIFs exceed 2, and the VIF for the interaction term (over 60) is especially problematic.
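The inflated VIFs are no surprise: a raw product term is necessarily highly correlated with its components. A quick check (assuming int_raw as created above):
cor(Prestige[, c("education", "prestige", "int_raw")])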
Method 2: Create the new variable based upon the product of the centered raw variables.
# scale(x, scale = FALSE) mean-centers x (subtracts the mean without rescaling)
Prestige$int_cent <- scale(Prestige$education, scale = FALSE) *
  scale(Prestige$prestige, scale = FALSE)
mod_3 <- lm(income ~ education + prestige + int_cent, data = Prestige)
summary(mod_3)
##
## Call:
## lm(formula = income ~ education + prestige + int_cent, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7107.5 -1606.3 130.3 1036.2 15345.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -264.216 1243.697 -0.212 0.8322
## education -218.450 203.376 -1.074 0.2854
## prestige 186.888 32.512 5.748 1.02e-07 ***
## int_cent 16.582 6.989 2.373 0.0196 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2921 on 98 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5267
## F-statistic: 38.47 on 3 and 98 DF, p-value: < 2.2e-16
Using this method, the interaction is still significant, but the main effects change (with prestige significant again). The overall model fit, however, is identical to Method 1: \(R^2\) = .54, \(F\)(3,98) = 38.47, \(p\) < .001. And we have drastically reduced our issue with multicollinearity:
vif(mod_3)
## education prestige int_cent
## 3.644822 3.703458 1.234984
sqrt(vif(mod_3)) > 2 # problem? no problem!
## education prestige int_cent
## FALSE FALSE FALSE
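Centering is purely a reparameterization: it changes the coefficients (and their tests) but not the model's predictions. Two quick checks (assuming the objects above): the fitted values of Methods 1 and 2 should be identical, and the centered product should be far less correlated with its components than the raw product was.
all.equal(fitted(mod_2), fitted(mod_3))  # same predictions from both models
cor(Prestige$int_cent, Prestige[, c("education", "prestige")])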
Method 3: Let R do it for us! R's formula syntax makes it easy to specify an interaction directly, but which of the two methods above does it use? Let's check:
mod_4 <- lm(income ~ education * prestige, data = Prestige)
summary(mod_4)
##
## Call:
## lm(formula = income ~ education * prestige, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7107.5 -1606.3 130.3 1036.2 15345.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8074.663 3903.035 2.069 0.0412 *
## education -995.023 402.441 -2.472 0.0151 *
## prestige 8.833 86.453 0.102 0.9188
## education:prestige 16.582 6.989 2.373 0.0196 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2921 on 98 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5267
## F-statistic: 38.47 on 3 and 98 DF, p-value: < 2.2e-16
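The coefficients are identical to Method 1, which we can verify directly (a quick check, assuming the models above):
all.equal(unname(coef(mod_2)), unname(coef(mod_4)))  # should be TRUE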
Uh oh! Using this approach, R does not center the variables beforehand, so we end up with the same un-centered coefficients as above. However, if we pass the model to the Anova() function in car, we get Type II significance tests for the main effects that do not suffer from this problem:
Anova(mod_4)
## Anova Table (Type II tests)
##
## Response: income
## Sum Sq Df F value Pr(>F)
## education 5997200 1 0.7029 0.40386
## prestige 329175528 1 38.5791 1.275e-08 ***
## education:prestige 48032945 1 5.6294 0.01961 *
## Residuals 836183552 98
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
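Finally, a significant interaction is easiest to grasp visually. A minimal sketch using base R and mod_4 (the quantiles and grid size here are arbitrary choices): plot predicted income across prestige at a low and a high value of education; non-parallel lines reflect the interaction.
eds  <- quantile(Prestige$education, c(.25, .75))  # "low" and "high" education
pres <- seq(min(Prestige$prestige), max(Prestige$prestige), length.out = 50)
grid <- expand.grid(education = eds, prestige = pres)
grid$pred <- predict(mod_4, newdata = grid)
matplot(pres, matrix(grid$pred, ncol = 2, byrow = TRUE), type = "l",
        xlab = "Prestige", ylab = "Predicted income")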