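Today we talked about polynomial regression and multicollinearity. Polynomial regression is just like linear regression, except that a predictor can enter the model raised to different powers (for example x and x^2). The model is still linear in its coefficients, so we can still fit it with lm().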
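Here is a minimal sketch of how a squared predictor gets into an lm() formula. It uses the built-in iris data with an explicit data argument, and the object names (fit_bare, fit_I, fit_poly) are just made up for illustration. The point is that inside a formula `^` means crossing terms, not arithmetic, so the power has to be wrapped in I() or built with poly(..., raw = TRUE).

```r
# Without I(), `Sepal.Width^2` is read as formula crossing and collapses
# back to Sepal.Width, so no squared term is actually fitted.
fit_bare <- lm(Sepal.Length ~ Sepal.Width + Sepal.Width^2, data = iris)

# I() protects the arithmetic, so the squared predictor is really added.
fit_I <- lm(Sepal.Length ~ Sepal.Width + I(Sepal.Width^2), data = iris)

# poly(..., raw = TRUE) builds the same raw powers in one call.
fit_poly <- lm(Sepal.Length ~ poly(Sepal.Width, 2, raw = TRUE), data = iris)

length(coef(fit_bare))  # intercept and slope only
length(coef(fit_I))     # intercept, linear and quadratic terms

# The two quadratic fits agree up to coefficient names.
all.equal(unname(coef(fit_I)), unname(coef(fit_poly)))
```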
attach(iris)
irismod <- lm(Sepal.Length ~ Sepal.Width)
summary(irismod)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.5561 -0.6333 -0.1120  0.5579  2.2226
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   6.5262     0.4789   13.63   <2e-16 ***
## Sepal.Width  -0.2234     0.1551   -1.44    0.152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8251 on 148 degrees of freedom
## Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
## F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
plot(irismod$fitted.values, irismod$residuals)
abline(0,0)
Looking at the residuals against the fitted values, there may be a slight curve to our data, with the higher and lower values being underestimated. We can see if adding a quadratic term will improve our model.
qirismod <- lm(Sepal.Length ~ Sepal.Width + I(Sepal.Width^2))
plot(qirismod$fitted.values, qirismod$residuals)
abline(0,0)
These residual plots don’t tell us much on their own, so we should go to the model summary to check.
summary(qirismod)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + I(Sepal.Width^2))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.63153 -0.62177 -0.08282  0.50531  2.33336
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        2.4594     2.4020   1.024   0.3076
## Sepal.Width        2.4312     1.5445   1.574   0.1176
## I(Sepal.Width^2)  -0.4246     0.2458  -1.727   0.0862 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8196 on 147 degrees of freedom
## Multiple R-squared: 0.03344, Adjusted R-squared: 0.02029
## F-statistic: 2.543 on 2 and 147 DF, p-value: 0.08209
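The quadratic term has a p-value of 0.0862, which is above the usual 0.05 cutoff, and the adjusted R-squared only goes from 0.007 to 0.020, so we probably don’t want to include the quadratic term.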
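Another way to check the quadratic term is a nested-model F test. This is only a sketch: it refits both models with data = iris instead of relying on attach(), and the names fit_linear and fit_quad are made up for illustration.

```r
# Refit both models with an explicit data argument (illustrative names).
fit_linear <- lm(Sepal.Length ~ Sepal.Width, data = iris)
fit_quad   <- lm(Sepal.Length ~ Sepal.Width + I(Sepal.Width^2), data = iris)

# Partial F test: does the quadratic term reduce the residual sum of
# squares by more than we would expect from chance?
cmp <- anova(fit_linear, fit_quad)
print(cmp)

# With a single added term, the F test and the t test on that term
# give the same p-value.
p_f <- cmp[["Pr(>F)"]][2]
p_t <- summary(fit_quad)$coefficients["I(Sepal.Width^2)", "Pr(>|t|)"]
all.equal(p_f, p_t)
```

Because only one term is added, this test carries the same information as the t test in the summary. It becomes more useful when we want to test several added powers at once.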
Multicollinearity is important because bad things happen when our predictors are strongly correlated with each other. The standard errors of the coefficients are inflated, which pulls the t statistics closer to zero. This makes it hard to see which predictors are significant. It also makes the confidence intervals much wider, and the estimates become unstable: small changes in the data can change them a lot.
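Sepal.Width and its square are a handy example of correlated predictors: in the summaries above, the standard error on Sepal.Width went from 0.1551 to 1.5445 once the squared term was added. The sketch below computes a variance inflation factor (VIF) by hand for that pair. It relies on the fact that with only two predictors the VIF is 1 / (1 - r^2), where r is their correlation. It also tries centering the predictor first, which is one common remedy and not something we covered as the required fix. All variable names are made up for illustration.

```r
x <- iris$Sepal.Width

# Correlation between the raw predictor and its square.
r_raw <- cor(x, x^2)

# With two predictors, VIF = 1 / (1 - r^2) for both of them.
vif_raw <- 1 / (1 - r_raw^2)

# Center the predictor before squaring and repeat the calculation.
xc <- x - mean(x)
r_centered <- cor(xc, xc^2)
vif_centered <- 1 / (1 - r_centered^2)

# Compare the two inflation factors side by side.
c(raw = vif_raw, centered = vif_centered)
```

Centering changes what the linear coefficient means (it becomes the slope at the mean of Sepal.Width), but it does not change the quadratic coefficient or the fitted values.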