In class we covered the basics of polynomial regression and multicollinearity. Polynomial regression is a form of multiple linear regression with a polynomial term. We also learned if two or more predictors in your regression equation are correlated then you have multicollinearity. To start we will look at Polynomial Regression.
First, we viewed our residual plots just as we did in previous chapters, in an ideal situation there would be no trend in this scatter plot. If there is a trend, then we are systematically over/underpredicting results. We don’t want to systematically over/under predict, we want this to be as random as possible. When there is a trend we use polynomial regression. In class we used the women data set to demonstrate this. Below you can find the scatterplot of the women data set residuals.
data("women")
names(women)
## [1] "height" "weight"
attach(women)
wt={weight}
ht={height}
mymod=lm(wt~ht)
myresids=mymod$residuals
plot(myresids~mymod$fitted.values)
abline(0,0)
As you can see there is a definite trend when we plot the residuals. We are systematically overpredicting on the outside and underpredicting in the center.
To try to fix our plot we can add a quadratic term to our model.
mymod2=lm(wt~ht+I(ht^2))
myresids2=mymod2$residuals
plot(myresids2~mymod2$fitted.values)
abline(0,0)
As you can see the trend in the residual plot is practically gone.
We can check that we need this term by using the summary function below.
summary(mymod2)
##
## Call:
## lm(formula = wt ~ ht + I(ht^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50941 -0.29611 -0.00941 0.28615 0.59706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
## ht -7.34832 0.77769 -9.449 6.58e-07 ***
## I(ht^2) 0.08306 0.00598 13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
Our p-val for the quadratic term is 9.32e-09 which is very small and will be significant for almost all alphas, so we should add it! With the p-val being that small we could probably add another term.
mymod3=lm(wt~ht+I(ht^2)+I(ht^3))
myresids3=mymod3$residuals
plot(myresids3~mymod3$fitted.values)
abline(0,0)
Now lets check to see if we should include this term.
summary(mymod3)
##
## Call:
## lm(formula = wt ~ ht + I(ht^2) + I(ht^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40677 -0.17391 0.03091 0.12051 0.42191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.967e+02 2.946e+02 -3.044 0.01116 *
## ht 4.641e+01 1.366e+01 3.399 0.00594 **
## I(ht^2) -7.462e-01 2.105e-01 -3.544 0.00460 **
## I(ht^3) 4.253e-03 1.079e-03 3.940 0.00231 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2583 on 11 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9997
## F-statistic: 1.679e+04 on 3 and 11 DF, p-value: < 2.2e-16
This pval of .00231 is significant with our alpha so we should include it as well! Next, I will touch on the basics of Multicollinearity.
Multicollinearity as I stated before is when two or more predictors are correlated. This can be problematic as it might inflate our standard error. To check for correlation we first plot the function, then we use the function cor(x1, x2) to check if they are actually correlated. We run into a problem if the correlation is greater than .9.
To check the variance inflation factor(VIF) we can use lm(xj~predictors except xj). Then we run a summary function on this equation to find our R squared. Once we have our R squared we can calculate VIF as 1/(1-Rsquared).A VIF closer to 1 is better and any VIF over 10 is very bad which means the standard error is inflated.