Today in class we finished up Chapter 4 by looking at Quadratic/Polynomial Regression, and also started Chapter 5 by looking at Multicollinearity.

Quadratic/Polynomial Regression

We use quadratic regression when we can see that there is a smooth relationship between x and y, but it is not a linear relationship. If the relationship is linear, we can simply use our multiple linear regression from earlier in the chapter. However, if the relationship is not linear, quadratic regression allows us to build a much more accurate model. Our equation for quadratic regression is: \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i + \hat{\beta_2}x_i^2\] This equation is very similar to our multiple linear regression equation, except for the addition of the squared term at the end. We can interpret these beta values as follows.

* B0 -> y-intercept

* B1 -> Shifts the parabola left or right (see the vertex note after this list)

* B2 -> Affects the curvature of the parabola
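A quick note on why B1 shifts the parabola left or right: the vertex of the fitted parabola sits at \[x = -\frac{\hat{\beta_1}}{2\hat{\beta_2}}\] so changing B1 (with B2 held fixed) slides the vertex horizontally, while the sign and size of B2 determine whether the parabola opens up or down and how sharply it curves.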

There are two ways to test whether the quadratic term is necessary or whether we can drop it from our model. The first is a t-test for B2, where the null hypothesis is that B2 = 0. If the p-value for this t-test is large, we fail to reject the null hypothesis; there is then no evidence that the quadratic term helps, so we can drop it from our model. The second is an F-test comparing the full model (with the quadratic term) to the reduced model (without it); if the F-test says the quadratic term is not significant, we drop it from our model.
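As a minimal sketch of the F-test approach (the data frame dat and the variable names y and x here are placeholders, not our class data), we fit both models and compare them with anova():

reduced <- lm(y ~ x, data = dat)            # reduced model without the quadratic term
full <- lm(y ~ x + I(x^2), data = dat)      # full model with the quadratic term
anova(reduced, full)                        # partial F-test for the quadratic term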

One important thing to look at when we are using polynomial regression is the residual plot for our model. Ideally the residual plot shows no trend: the errors look random, with roughly equal numbers of overestimates and underestimates. If there is a trend in our residuals, it means our model is systematically overpredicting in some regions and underpredicting in others.

The women data set in R provides an excellent example of when quadratic regression is necessary.

data("women")
women.mod <- lm(weight~height, data = women)
plot(women.mod$residuals ~ women.mod$fitted.values)
abline(0,0)

After setting up our original linear regression, we plot the residual values of our model against the fitted (predicted) values. As you can see in the plot, there is a clear trend in our data: we are overpredicting at the two ends of the graph and underpredicting for the middle values. This trend indicates to us that we should try including a quadratic term in our regression, in hopes of creating a better model that isn't systematically over- or underpredicting.

women.mod2 <- lm(weight ~ height + I(height^2), data = women)   # add the quadratic term
plot(women.mod2$residuals ~ height, data = women)                # residuals vs. height
abline(0, 0)                                                     # horizontal reference line at 0

With this new residual plot, we can see that the errors are now much more random and no longer form the parabola we were seeing before. This is a strong indication that we should keep the quadratic term in our model. We can test this using the t-test, which is included in our model summary.

summary(women.mod2)
## 
## Call:
## lm(formula = weight ~ height + I(height^2), data = women)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50941 -0.29611 -0.00941  0.28615  0.59706 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
## height       -7.34832    0.77769  -9.449 6.58e-07 ***
## I(height^2)   0.08306    0.00598  13.891 9.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9994 
## F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16

Looking at the p-value for the I(height^2) term, we can see that it is extremely small and essentially 0. This indicates that we reject our null hypothesis that B2 is equal to 0, so we should keep the quadratic term in our model.
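We could also run the F-test comparison described earlier on these two models. Since only one parameter is added, it reaches the same conclusion as the t-test (the F statistic is just the square of the t statistic):

anova(women.mod, women.mod2)    # compare the linear and quadratic fits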

Multicollinearity

Multicollinearity is when our predictors are correlated with each other, and it is not a good thing to have. We can check this by looking at the pairwise correlations of the predictors. Calling plot(data) creates a plot of all of our variables against each other. This becomes concerning if |cor(xj, xk)| > 0.9 for any pair of predictors.
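As an illustration on a built-in data set (mtcars here is just a stand-in, not our class example), we can look at both the scatterplot matrix and the numeric correlations:

plot(mtcars[, c("disp", "hp", "wt")])            # all pairwise scatterplots for a few predictors
round(cor(mtcars[, c("disp", "hp", "wt")]), 2)   # pairwise correlations; watch for |cor| > 0.9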

This is bad because it inflates our standard errors, which shrinks our test statistics toward 0. We then lose power to reject the null hypothesis Bk = 0, meaning we are less likely to include Bk in our model even when it belongs there. It also makes our estimators unstable: we would expect our B(hat) values to be similar from sample to sample, but with multicollinearity the estimates become less repeatable and can tell a different story depending on the data set.
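As a quick simulated illustration (made-up data, not from class), giving a model two nearly identical predictors inflates the standard errors even though y really only depends on x1:

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)   # x2 is nearly a copy of x1
y <- 2 + 3 * x1 + rnorm(100)       # y truly depends only on x1
summary(lm(y ~ x1 + x2))           # note the large standard errors on x1 and x2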

VIF (Variance Inflation Factor)

We compute the VIF for a predictor xj by fitting a model for xj on the other predictors, lm(xj ~ all predictors except xj), and recording the R-squared from that fit, which we call R2j. Then our variance inflation factor is \[\text{VIF}_j = \frac{1}{1-R_j^2}\]

GOOD: R2j = 0 => VIF = 1, which is good :)

BAD: R2j close to 1 => VIF is huge, which is bad! :(

As a rule of thumb, we will conclude that a VIF > 10 is bad.
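As a rough sketch of how this can be computed in R (reusing the mtcars stand-in from above; the vif() function in the car package does the same thing if that package is installed):

mt.mod <- lm(mpg ~ disp + hp + wt, data = mtcars)                 # example model with correlated predictors
r2.disp <- summary(lm(disp ~ hp + wt, data = mtcars))$r.squared   # R2j from regressing disp on the other predictors
1 / (1 - r2.disp)                                                 # VIF for disp, computed by hand
# car::vif(mt.mod)                                                # same idea via the car package, if installed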