Today we covered polynomial regression and multicollinearity. Polynomial regression uses a polynomial model instead of a linear model. Multicollinearity refers to predictors that are related to each other.
Two of the conditions for linear regression are linearity and equal variance. Linearity of the data is important to check when doing exploratory data analysis; if the data points follow a curved pattern, then a linear model is not likely the best choice.
If the data seems to follow a linear pattern, then we must also check for equal variance after creating the linear model. The residuals should have a randomized distribution. We want the model to over-predict and under-predict at random. If there are any trends or patterns in the residuals, then we have trouble; we do not want to systematically under-predict or overpredict.
The example using the women dataset was very helpful in understanding the process of doing polynomial regression.
plot(weight~height, data = women)
The data points seem to follow a linear pattern when we plot them, but the residual plot tells a different story.
linearmodwgt <- lm(weight ~ height, data=women)
plot(linearmodwgt$residuals ~ linearmodwgt$fitted.values, ylab = "residuals", xlab = "fitted values")
Here we can clearly see the residuals have a trend. Thus, the equal variance assumption for linear regression does not hold.
Another tool I found useful in looking at residuals is the residualPlot function found in the car package. The argument is the model, and it will output a residual plot along with a line to show the trend in the residuals. A straight line would indicate a randomized distribution of residuals. Here we can see this line is curved.
library(car)
## Warning: package 'car' was built under R version 3.4.3
residualPlot(linearmodwgt)
The women dataset is an extreme case of residuals clearly following a trend. In other cases, the trend may not be easily seen. One such example I found was with the cars dataset.
carmods <- lm(dist ~ speed, data=cars)
residualPlot(carmods)
These residuals don’t show a clear pattern, but the line is still curved so it may be beneficial to see if a polynomial regression model results in a better fit.
Polynomial regression is one method to approach cases where data is not linear and/or the equal variance assumption is violated using a linear model. When we covered simple linear regression, we learned how to check for equal variance and linearity but we weren’t able to do much if one or both conditions were violated. With polynomial regression, we may be able to create a model more suitable for non-linear data. An example of a polynomial regression equation: \[\hat{y} = \beta_0 + \beta_1x + \beta_2x^2\]
The lm function is still used to create a polynomial regression model. The syntax for a k-th power polynomial regression model is model_name <- lm(response ~ predictor + I(predictor^2) + … +I(predictor^k)). Below is how to create a quadratic regression model for the women dataset.
quadwgt <- lm(weight ~ height + I(height^2), data = women)
residualPlot(quadwgt)
This residual plot looks much better than the plot for the linear model. Now let’s test the significance of the quadratic term. This is done with a t-test, or by fitting both the quadratic and linear models and using anova.
summary(quadwgt)
##
## Call:
## lm(formula = weight ~ height + I(height^2), data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50941 -0.29611 -0.00941 0.28615 0.59706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
## height -7.34832 0.77769 -9.449 6.58e-07 ***
## I(height^2) 0.08306 0.00598 13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
We see a very small p-value so this quadratic term is statistically significant. This example outlines an important point. While the t-test tells us we should definitely have the quadratic term, the quadratic model is only slightly more accurate than the linear one in predicting weight using height.
Polynomial regression models are more difficult to explain and interpret, especially for people without any background in statistics. In this example, the quadratic term is not practically significant, so it’s better to just stick with the linear model.
It’s not ideal to have correlated predictors, since they will inflate standard errors. We can check for multicollinearity using the plot function. The output is a matrix of plots, and we then visually inspect each plot for correlation.
If two predictors seem correlated, we can check the correlation using cor. If the absolute value of the correlation exceeds 0.90, there is cause for concern.
Below is the plot for the mtcars dataset. This has a lot of variables so the plot matrix is very large.
plot(mtcars)
We can see some potential correlation between the variables wt and mpg, which represent weight and miles per US gallon, respectively. If we were building a model and wanted to use these two variables as predictors, we may run into trouble. Let’s check the correlation:
attach(mtcars)
cor(wt, mpg)
## [1] -0.8676594
The correlation is fairly high but below the theshold for concern. Still, it may be better if we don’t use both of these predictors in the model. This only checks for pairwise correlations; since one predictor could be correlated with several others, we need to dig deeper.
The way we check for correlations between multiple predictors is to create a linear model with the predictor in question as a response, and all the others as predictors. Below is the code if we wanted to test horsepower against the other predictors.
HPtest <- lm(hp ~ mpg + cyl + disp + drat + wt + qsec + vs + am + gear + carb)
summary(HPtest)
##
## Call:
## lm(formula = hp ~ mpg + cyl + disp + drat + wt + qsec + vs +
## am + gear + carb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.681 -15.558 0.799 18.106 34.718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.0484 184.5041 0.428 0.67270
## mpg -2.0631 2.0906 -0.987 0.33496
## cyl 8.2037 10.0861 0.813 0.42513
## disp 0.4390 0.1492 2.942 0.00778 **
## drat -4.6185 16.0829 -0.287 0.77680
## wt -27.6600 19.2704 -1.435 0.16591
## qsec -1.7844 7.3639 -0.242 0.81089
## vs 25.8129 19.8512 1.300 0.20758
## am 9.4863 20.7599 0.457 0.65240
## gear 7.2164 14.6160 0.494 0.62662
## carb 18.7487 7.0288 2.667 0.01441 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.97 on 21 degrees of freedom
## Multiple R-squared: 0.9028, Adjusted R-squared: 0.8565
## F-statistic: 19.5 on 10 and 21 DF, p-value: 1.898e-08
We look at the \(r^2\) for this model, which is 0.9028. Then we calculate the Variance Inflation Factor, or VIF, which is equal to \(1/(1-r^2)\). The VIF is a measure of how much we are inflating our variance. VIF is equal to 1 if \(r^2\) is 0, meaning there is no inflation. As \(r^2\) approaches 1, VIF becomes infinitely large, indicating excessive inflation.
In practice, it’s very rare for the VIF to take on a value of 1, so a VIF that is greater than 10 is cause for concern.
The VIF for horsepower:
1/(1-0.9028^2)
## [1] 5.406804
This is less than 10, so this predictor is alright.
A high VIF causes many problems. Tests for dropping coefficients will have test statistics closer to 0, giving higher p-values which reduce the power of the test. We would be more inclined to fail to reject the null hypothesis and end up dropping the predictor. Confidence intervals will also be wider. Coefficients for predictors with a high VIF will have an inflated standard error, and are highly variable from sample to sample.