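Today we went over polynomial regression and multicollinearity. Polynomial regression is the same as our multiple linear regression, with the one difference that we add a polynomial term (such as x^2) of an existing predictor to the equation. Multicollinearity describes how strongly your predictors are correlated with one another. When a predictor is highly correlated with the others (r > .9), the variance of its coefficient estimate is inflated by a factor of 1/(1-R^2), where R^2 comes from regressing that predictor on the remaining predictors. This factor is the variance inflation factor (VIF). The standard error, and with it the width of the confidence interval, grows by the square root of that factor. The book states that we should be concerned about this when our variance inflation factor is > 10 or r > .9.
Some of the code we got from today includes: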
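To make the 1/(1-R^2) factor concrete, here is a minimal sketch that computes a VIF by hand. It uses the built-in mtcars data purely as a stand-in (this data set and the choice of predictors disp, hp, and wt are my assumptions, not something from class).

```r
# Hypothetical illustration on the built-in mtcars data (not from class).
# Suppose a model uses disp, hp and wt as predictors.
# To get the VIF for disp, regress disp on the OTHER predictors.
aux <- lm(disp ~ hp + wt, data = mtcars)

# R^2 of that auxiliary regression
r2 <- summary(aux)$r.squared

# Variance inflation factor for disp
vif_disp <- 1 / (1 - r2)

# Factor by which the standard error (and CI width) grows
se_inflation <- sqrt(vif_disp)

vif_disp
se_inflation
```

If the car package is available, its vif() function applied to the full fitted model should report the same kind of value for each predictor.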
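data(women)
# Fit the simple linear model; data = women makes attach() unnecessary
line1 <- lm(weight ~ height, data = women)
# Scatterplot of weight against height with the fitted line
plot(weight ~ height, data = women)
abline(line1)
# Residuals against fitted values, with a horizontal reference line at 0
womenResids <- resid(line1)
plot(womenResids ~ fitted(line1))
abline(h = 0)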
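These lines fit a simple linear model and plot its residuals against the fitted values. For the women data the residuals follow an obvious curved pattern: the line underestimates weight for the shortest and tallest women and overestimates it for the heights in between. This suggests adding another predictor that is simply height^2. The I() wrapper makes R treat ^ as ordinary arithmetic rather than as a formula operator. This is done by: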
line2 <- lm(weight ~ height + I(height^2),data=women)
anova(line1,line2)
## Analysis of Variance Table
##
## Model 1: weight ~ height
## Model 2: weight ~ height + I(height^2)
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)
## 1     13 30.2333
## 2     12  1.7701  1    28.463 192.96 9.322e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first line fits the quadratic model, and anova() compares it with the original model using a partial F-test to check whether the new predictor is significant. In this case the very low p-value (9.322e-09) means we should keep the new predictor: adding height^2 drops the residual sum of squares from 30.23 to 1.77.
When it comes to multicollinearity, we learned that simply calling plot() on a data frame gives us a scatterplot matrix, with every variable plotted against every other variable. We can use this to see if there are any obvious correlations.
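As a check on what anova() is doing, the partial F statistic can be rebuilt from the two residual sums of squares. This is only a sketch of the calculation; the variable names (rss1, rss2, F_stat, and so on) are my own.

```r
data(women)
line1 <- lm(weight ~ height, data = women)
line2 <- lm(weight ~ height + I(height^2), data = women)

# Residual sums of squares for the reduced and full models
rss1 <- sum(resid(line1)^2)
rss2 <- sum(resid(line2)^2)

# Degrees of freedom: number of added terms, and residual df of the full model
df_num <- line1$df.residual - line2$df.residual
df_den <- line2$df.residual

# Partial F statistic: drop in RSS per added term, over the full model's MSE
F_stat <- ((rss1 - rss2) / df_num) / (rss2 / df_den)

# Upper-tail p-value from the F distribution
p_val <- pf(F_stat, df_num, df_den, lower.tail = FALSE)

F_stat
p_val
```

With the numbers in the table above this is (28.463 / 1) / (1.7701 / 12), which matches the F value that anova() reports.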
library(alr3)
## Loading required package: car
plot(wblake)
With the wblake data set, the scatterplot matrix shows some heavy correlations between the variables.
If we wanted to check this further, we would use the following line to get the exact correlation.
cor(wblake[2],wblake[3])
##            Scale
## Length 0.9386473
In this instance we can see a strong positive correlation (r = 0.94) between Length and Scale, which is above the .9 level at which we should be concerned.
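Rather than checking one pair at a time, cor() can also be given several columns at once, which returns every pairwise correlation in one matrix. The sketch below uses four columns of the built-in mtcars data as a stand-in so that it runs without extra packages. The data set, the column choice, and the names cmat, threshold, and high are my assumptions. With alr3 loaded, cor(wblake) should work the same way, assuming all of its columns are numeric.

```r
# Hypothetical illustration on built-in data (not from class)
preds <- mtcars[, c("cyl", "disp", "hp", "wt")]

# Full matrix of pairwise correlations between the chosen columns
cmat <- cor(preds)
round(cmat, 2)

# Flag pairs whose absolute correlation exceeds the threshold.
# upper.tri() keeps each pair once and skips the diagonal of 1s.
threshold <- 0.9
high <- which(abs(cmat) > threshold & upper.tri(cmat), arr.ind = TRUE)

# Row and column positions of the flagged pairs (0 rows if none)
high
```

This is the numeric counterpart of the scatterplot matrix: each cell in the matrix corresponds to one panel of plot().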