In class we covered two topics: multicollinearity and polynomial models.

Polynomial Models

We first fit a linear model and got a high R^2 value, but we saw a pattern in the residual plot, so we knew this was not a good model.

data(women)              # built-in data: average heights and weights of 15 women
attach(women)            # put height and weight on the search path
lmod<-lm(weight~height)  # simple linear model
summary(lmod)
## 
## Call:
## lm(formula = weight ~ height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
lreside<-residuals(lmod)
plot(lreside~height)     # residuals vs. height
abline(0,0)              # horizontal reference line at zero

So the next step was to create an x^2 variable, use it to fit a polynomial (quadratic) model, and then check that residual plot as well.

xsq<-height^2            # squared term for the quadratic model
qmod<-lm(weight~height+xsq)
summary(qmod)
## 
## Call:
## lm(formula = weight ~ height + xsq)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50941 -0.29611 -0.00941  0.28615  0.59706 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
## height       -7.34832    0.77769  -9.449 6.58e-07 ***
## xsq           0.08306    0.00598  13.891 9.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9994 
## F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16
qresid<-residuals(qmod)
plot(qresid~height)
abline(0,0)

As we can see, this residual plot has no clear pattern, which tells us the polynomial model fits much better. The polynomial model also has a higher coefficient of determination (R^2 of 0.9995 vs. 0.991).
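
We can pull both R^2 values straight from the model summaries to confirm this (a quick check I added, not part of the original class code):

summary(lmod)$r.squared   # 0.991 for the linear model
summary(qmod)$r.squared   # 0.9995 for the polynomial model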

The last step is to compare the two prediction lines and see how the polynomial line actually fits better.

plot(weight~height)                  # scatterplot of the data
x<-seq(from=58, to=72, by=.1)        # grid of heights for drawing the curve
coef(qmod)
##  (Intercept)       height          xsq 
## 261.87818358  -7.34831933   0.08306399
y<-coef(qmod)[1]+coef(qmod)[2]*x+coef(qmod)[3]*x^2   # quadratic predictions
lines(x,y, lty=1, col=2)             # red polynomial curve
abline(lmod)                         # black linear fit

As we can see, the red curve follows the data a little better than the straight black line.
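
Besides eyeballing the plot, we can compare the two models with a formal F-test, since the linear model is nested inside the quadratic one. This anova() call is my own addition, not something we ran in class:

anova(lmod, qmod)   # tests whether the xsq term significantly improves the fit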

Multicollinearity

The other topic we talked about was multicollinearity, which is when two of your predictor variables are highly correlated with each other. A common rule of thumb is that we don't want an R^2 above .9 when regressing one predictor on another; past that point the two variables carry largely redundant information about the response, which makes the coefficient estimates unstable and hard to interpret.
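
In fact, the polynomial model above has this problem built in, since height and xsq are nearly perfectly correlated. Here is a quick sketch I added to illustrate; the centering trick at the end is one common fix, not something we used above:

cor(height, xsq)                      # nearly 1: height and its square move together
summary(lm(xsq~height))$r.squared     # well above .9, flagging multicollinearity
hctr<-height-mean(height)             # centering height before squaring is one fix
cor(hctr, hctr^2)                     # essentially zero, since height is symmetric around its mean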