In class on Thursday we finished the chapter 4 material by covering quadratic regression, and we used the remaining class time to introduce multicollinearity.
Quadratic regression is best used when the relationship between x and y appears smooth but curved, with a single bend rather than a straight-line trend. The equation is:
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon \]
The quadratic equation (without the error term) above relates the mean value of y to the value of x through the equation of a parabola. \(\beta_0\) is the y-intercept, \(\beta_1\) shifts the parabola left or right, and \(\beta_2\) changes the curvature of the parabola.
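As a quick illustration (with made-up coefficient values, not fitted from any data), we can plot the mean function for a positive and a negative \(\beta_2\) to see how the quadratic coefficient controls the curvature:
# Sketch with made-up coefficients: the sign of beta_2 flips the parabola
b0 <- 1; b1 <- 2
curve(b0 + b1*x + 0.5*x^2, from = -6, to = 6, ylab = "mean of y")
curve(b0 + b1*x - 0.5*x^2, from = -6, to = 6, add = TRUE, lty = 2)
legend("topleft", legend = c("beta_2 = 0.5", "beta_2 = -0.5"), lty = c(1, 2))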
One thing we spent some time on in class was the importance of checking for a trend in the residuals. We did this by plotting the residuals and looking at whether our over- and under-estimates followed a pattern. We want the errors to be scattered randomly around zero, not to systematically over-predict in some ranges and under-predict in others. The women data set built into R is a good example: the residuals from the linear regression model follow a clear trend, and the quadratic regression model reduces that trend substantially.
data("women")
mymod<-lm(weight ~ height, data= women)
plot(mymod$residuals ~ mymod$fitted.values)
abline(0,0)
After fitting the linear regression model, the plot of the residuals against the fitted values shows that the model over-predicts for fitted weights below about 120 lbs and above about 155 lbs. Within that range the linear regression model systematically under-predicts. We can again plot the residuals for the quadratic model in the hope that adding the squared term improves the fit and removes the trend.
# Add a squared height term with I() to fit the quadratic model
mymod2 <- lm(weight ~ height + I(height^2), data = women)
# Residual plot for the quadratic fit
plot(mymod2$residuals ~ mymod2$fitted.values)
abline(0, 0)
The over-predictions and under-predictions seem improved, but there are still some ranges that appear to have strictly over-predictions and others with only under-predictions. We can run some tests to determine whether the quadratic term improves the model.
We can run a t-test for \(\beta_2\) to determine whether we can drop it from our model. The null hypothesis is \(\beta_2 = 0\) and the alternative hypothesis is \(\beta_2 \neq 0\).
summary(mymod2)
##
## Call:
## lm(formula = weight ~ height + I(height^2), data = women)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50941 -0.29611 -0.00941  0.28615  0.59706 
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
## height       -7.34832    0.77769  -9.449 6.58e-07 ***
## I(height^2)   0.08306    0.00598  13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
The p-value for the t-test on \(\beta_2\) is 9.32e-09, so we have strong evidence to reject the null hypothesis and conclude that the quadratic term is useful in our model.
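As a cross-check, we could also compare the straight-line and quadratic fits directly with a partial F-test using anova(); since only one term is added, this is equivalent to the t-test above (the F statistic is the square of the t statistic).
# Partial F-test comparing the nested models; equivalent to the t-test on beta_2
anova(mymod, mymod2)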
Just as it sounds, multicollinearity occurs when our predictors are correlated with one another, which is something we really do not want in our model. We can check for it by computing the pairwise correlations of the predictors and by plotting the predictors against each other. A common rule of thumb is to be concerned when the absolute correlation between two predictors exceeds 0.9, i.e. |cor(x_i, x_j)| > 0.9.
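As a quick illustration (using the built-in mtcars data rather than anything from class, with disp, hp, and wt as example predictors), we can look at the pairwise correlations numerically and visually:
# Example check with built-in data: three candidate predictors from mtcars
preds <- mtcars[, c("disp", "hp", "wt")]
round(cor(preds), 2)   # pairwise correlations; worry when |cor| > 0.9
pairs(preds)           # scatterplot matrix of the predictors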
Correlated predictors mean our standard errors will be inflated and, in turn, the magnitudes of our test statistics will shrink. The p-values then get larger and we lose power to reject the null hypothesis. In the context of testing \(\beta_2\) with a t-test as above, we would be less likely to include the term in our model because the test would suggest it is not important. Our \(\beta\) estimates also become unstable from sample to sample.
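A small simulation (my own sketch, not something we ran in class) makes the inflation concrete: fit the same true model once with roughly independent predictors and once with nearly identical predictors, and compare the standard errors reported by summary().
# Same true coefficients, different amounts of correlation between predictors
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2_ind <- rnorm(n)                  # roughly independent of x1
x2_cor <- x1 + rnorm(n, sd = 0.05)  # almost a copy of x1, |cor| near 1
y_ind <- 1 + 2*x1 + 3*x2_ind + rnorm(n)
y_cor <- 1 + 2*x1 + 3*x2_cor + rnorm(n)
summary(lm(y_ind ~ x1 + x2_ind))$coefficients  # small standard errors
summary(lm(y_cor ~ x1 + x2_cor))$coefficients  # much larger standard errors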