In class on Thursday we finished the chapter 4 material by covering quadratic regression, and we used the remaining class time to introduce multicollinearity.
Quadratic regression is best used when the relationship between x and y appears smooth but curved, with a single bend rather than a straight-line trend. The equation is:
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon \]
The quadratic equation (without the error term) above relates the mean value of y to the value of x through the equation of a parabola. \(\beta_0\) is the y-intercept, \(\beta_1\) shifts the parabola left or right, and \(\beta_2\) changes the curvature of the parabola.
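As a quick illustration (with made-up coefficient values, not fitted from any data), we can plot the mean function for a positive and a negative \(\beta_2\) to see how the quadratic coefficient controls the curvature:
# Sketch with made-up coefficients: the sign of beta_2 flips the parabola
b0 <- 1; b1 <- 2
curve(b0 + b1*x + 0.5*x^2, from = -6, to = 6, ylab = "mean of y")
curve(b0 + b1*x - 0.5*x^2, from = -6, to = 6, add = TRUE, lty = 2)
legend("topleft", legend = c("beta_2 = 0.5", "beta_2 = -0.5"), lty = c(1, 2))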
One thing we spent some time on in class was the importance of checking for a trend in the residuals. We did this by plotting the residuals and looking at whether our over- and under-estimates followed a pattern. We want the errors to be scattered randomly around zero, not to systematically over-predict in some ranges and under-predict in others. The women data set built into R is a good example: the residuals from the linear regression model follow a clear trend, and the quadratic regression model reduces that trend substantially.
data("women")
mymod<-lm(weight ~ height, data= women)
plot(mymod$residuals ~ mymod$fitted.values)
abline(0,0)
After fitting the linear regression model, the plot of the residuals against the fitted values shows that the model over-predicts for fitted weights below about 120 lbs and above about 155 lbs. Within that range the linear regression model systematically under-predicts. We can again plot the residuals for the quadratic model in the hope that adding the squared term improves the fit and removes the trend.
# Add a squared height term with I() to fit the quadratic model
mymod2 <- lm(weight ~ height + I(height^2), data = women)
# Residual plot for the quadratic fit
plot(mymod2$residuals ~ mymod2$fitted.values)
abline(0, 0)
The over-predictions and under-predictions seem improved, but there are still some ranges that appear to have strictly over-predictions and others with only under-predictions. We can run some tests to determine whether the quadratic term improves the model.
We can run a t-test for \(\beta_2\) to determine whether we can drop it from our model. The null hypothesis is \(\beta_2 = 0\) and the alternative hypothesis is \(\beta_2 \neq 0\).
summary(mymod2)
##
## Call:
## lm(formula = weight ~ height + I(height^2), data = women)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50941 -0.29611 -0.00941  0.28615  0.59706 
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
## height       -7.34832    0.77769  -9.449 6.58e-07 ***
## I(height^2)   0.08306    0.00598  13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
The p-value for the t-test on \(\beta_2\) is 9.32e-09, so we have strong evidence to reject the null hypothesis and conclude that the quadratic term is useful in our model.
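As a cross-check, we could also compare the straight-line and quadratic fits directly with a partial F-test using anova(); since only one term is added, this is equivalent to the t-test above (the F statistic is the square of the t statistic).
# Partial F-test comparing the nested models; equivalent to the t-test on beta_2
anova(mymod, mymod2)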
Just as it sounds, multicollinearity occurs when our predictors are correlated with one another, which is something we really do not want in our model. We can check for it by computing the pairwise correlations of the predictors and by plotting the predictors against each other. A common rule of thumb is to be concerned when the absolute correlation between two predictors exceeds 0.9, i.e. |cor(x_i, x_j)| > 0.9.
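As a quick illustration (using the built-in mtcars data rather than anything from class, with disp, hp, and wt as example predictors), we can look at the pairwise correlations numerically and visually:
# Example check with built-in data: three candidate predictors from mtcars
preds <- mtcars[, c("disp", "hp", "wt")]
round(cor(preds), 2)   # pairwise correlations; worry when |cor| > 0.9
pairs(preds)           # scatterplot matrix of the predictors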
Correlated predictors mean our standard errors will be inflated and, in turn, the magnitudes of our test statistics will shrink. The p-values then get larger and we lose power to reject the null hypothesis. In the context of testing \(\beta_2\) with a t-test as above, we would be less likely to include the term in our model because the test would suggest it is not important. Our \(\beta\) estimates also become unstable from sample to sample.
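A small simulation (my own sketch, not something we ran in class) makes the inflation concrete: fit the same true model once with roughly independent predictors and once with nearly identical predictors, and compare the standard errors reported by summary().
# Same true coefficients, different amounts of correlation between predictors
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2_ind <- rnorm(n)                  # roughly independent of x1
x2_cor <- x1 + rnorm(n, sd = 0.05)  # almost a copy of x1, |cor| near 1
y_ind <- 1 + 2*x1 + 3*x2_ind + rnorm(n)
y_cor <- 1 + 2*x1 + 3*x2_cor + rnorm(n)
summary(lm(y_ind ~ x1 + x2_ind))$coefficients  # small standard errors
summary(lm(y_cor ~ x1 + x2_cor))$coefficients  # much larger standard errors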