In class we covered multicollinearity and polynomial regression models.
We first fit a linear model and got a high R^2 value, but we saw a pattern in the residual plot, so we knew this was not a good model.
data(women)
attach(women)
lmod <- lm(weight ~ height)
summary(lmod)
##
## Call:
## lm(formula = weight ~ height)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
lreside <- residuals(lmod)
plot(lreside ~ height)
abline(0, 0)
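As a quick aside, base R can draw essentially the same diagnostic straight from the fitted model (plotted against fitted values instead of height). This is just an alternative, not what we did in class:

# built-in residuals-vs-fitted diagnostic for lmod
plot(lmod, which = 1)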
So the next step I took was to make an x^2 variable, use it to fit a polynomial model, and then check that residual plot as well.
xsq <- height^2
qmod <- lm(weight ~ height + xsq)
summary(qmod)
##
## Call:
## lm(formula = weight ~ height + xsq)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50941 -0.29611 -0.00941  0.28615  0.59706 
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
## height       -7.34832    0.77769  -9.449 6.58e-07 ***
## xsq           0.08306    0.00598  13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
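Before checking the residuals, one aside: the same quadratic model can be fit without creating a separate xsq variable by using I() inside the formula, which gives identical coefficients. A sketch, not what we ran in class:

# identical fit, no separate xsq variable needed
qmod2 <- lm(weight ~ height + I(height^2))
# poly() gives an equivalent fit using orthogonal terms, which removes
# the collinearity between height and height^2 mentioned at the end
qmod3 <- lm(weight ~ poly(height, 2))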
qresid <- residuals(qmod)
plot(qresid ~ height)
abline(0, 0)
As we can see, this residual plot has no pattern, so the polynomial model is the better fit. The polynomial model also has a higher coefficient of determination (R^2 of 0.9995 versus 0.991 for the linear model).
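Because the linear model is nested inside the quadratic one, we could also compare them with a formal F-test. A sketch we did not run in class:

# nested-model F-test: does adding xsq significantly improve the fit?
anova(lmod, qmod)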
The last thing to do is compare the two prediction lines and see how the polynomial curve actually fits better.
plot(weight ~ height)
x <- seq(from = 58, to = 72, by = 0.1)
coef(qmod)
##  (Intercept)       height          xsq 
## 261.87818358  -7.34831933   0.08306399
y <- coef(qmod)[1] + coef(qmod)[2]*x + coef(qmod)[3]*x^2
lines(x, y, lty = 1, col = 2)
abline(lmod)
As we can see, the red curve follows the data a little better than the straight black line.
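Instead of computing y by hand from the coefficients, predict() would draw the same curve. A sketch (newdata needs both columns because qmod was fit with the separate xsq variable):

# same red curve via predict() instead of manual arithmetic
y2 <- predict(qmod, newdata = data.frame(height = x, xsq = x^2))
lines(x, y2, lty = 2, col = 4)  # dashed blue, sits on top of the red curve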
The other thing we talked about was multicollinearity, which is when two (or more) of your predictor variables are correlated with each other. When we regress one predictor on the other, we don't want to see an R^2 value of more than .9; above that, the two variables carry nearly the same information about the response, so the model can't separate their effects and the individual coefficient estimates become unstable.
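Our own quadratic model is actually a handy example: height and xsq are almost perfectly correlated, since squaring barely bends the relationship over such a narrow height range. A quick check in base R (if the car package is installed, its vif() function computes this for every predictor at once):

# correlation between the two predictors (very close to 1 here)
cor(height, xsq)
# R^2 from regressing one predictor on the other; above 0.9 signals
# trouble, i.e. a variance inflation factor 1/(1 - R^2) above 10
r2 <- summary(lm(xsq ~ height))$r.squared
1/(1 - r2)  # the variance inflation factor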