We first reviewed residual plots, which plot the residuals against the predicted values. What we want to see is no trend on this plot: sometimes we’re overpredicting, sometimes we’re underpredicting, but the errors look random. What we don’t want to see is a trend, because a trend means we’re systematically overpredicting in some regions and underpredicting in others. If we see a trend, we can’t continue with our work; we need to change the model. One way we can do this is with a polynomial regression model.
We might want to use polynomial regression if, when we first plot the data, we see a smooth relationship that isn’t linear. We can start by adding an \(x^2\) term and keep adding higher powers as long as each new term is significant. We can still use the matrix equation for polynomial regression because, even though the powers of \(x\) are increasing, the model is still linear in the \(\beta\)’s.
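To make the matrix point concrete, here is the quadratic case \(y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i\) written in the usual matrix form; only the design matrix changes, the model is still linear in the \(\beta\)’s:
\[
\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad
X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}, \qquad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
\]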
Once we create the model (say, a quadratic regression model), we have two options for testing whether we need the \(\beta_2x^2\) term. We can perform a t-test for \(\beta_2\), testing the null hypothesis \(\beta_2=0\) against the alternative hypothesis \(\beta_2\neq0\), or we can perform a partial F-test, with the reduced model not including the \(\beta_2x^2\) term.
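As a quick sketch of the partial F-test in R (using hypothetical variables y and x, not data from this document), anova() compares the two nested fits:

# Hypothetical example: y and x are placeholders, not defined in this document
reduced.mod <- lm(y ~ x)            # reduced model, no quadratic term
full.mod <- lm(y ~ x + I(x^2))      # full model with the quadratic term
anova(reduced.mod, full.mod)        # partial F-test for the x^2 term

When only one term is dropped, this F-test is equivalent to the t-test for that coefficient.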
Even though we might have statistical evidence that a polynomial regression is more accurate, we might sometimes choose to stick with linear regression, because polynomial regression is harder to interpret and involves more complicated calculations.
I will use the women data set as an example.
data(women)                                           # built-in data set of heights and weights
attach(women)                                         # so we can refer to height and weight directly
women.mod <- lm(weight ~ height)                      # simple linear regression
plot(women.mod$residuals ~ women.mod$fitted.values)   # residuals vs. fitted values
abline(0, 0)                                          # horizontal reference line at zero
Plotting the residuals against the predicted values, we can see that there is an obvious trend. This is not what we want to see. We are systematically overpredicting and underpredicting.
Let’s add a quadratic term to see if that improves the residual plot.
women.mod.quad <- lm(weight ~ height + I(height^2))   # I() lets ^ mean "squared" inside the formula
plot(women.mod.quad$residuals ~ women.mod.quad$fitted.values)
abline(0,0)
This residual plot is a little better, but I wouldn’t necessarily say it’s good yet.
Let’s test to see if we need the quadratic term.
summary(women.mod.quad)
##
## Call:
## lm(formula = weight ~ height + I(height^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50941 -0.29611 -0.00941 0.28615 0.59706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
## height -7.34832 0.77769 -9.449 6.58e-07 ***
## I(height^2) 0.08306 0.00598 13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
I(height^2) has a p-value near 0 (9.32e-09), so it is significant. We do need the quadratic term!
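We could also reach the same conclusion with the partial F-test described earlier, comparing the linear and quadratic fits we already have:

anova(women.mod, women.mod.quad)   # partial F-test for the height^2 term

Since only one term is added, the F-statistic here is just the square of the t-statistic above, so the p-value matches.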
Okay, let’s try adding another term and see if that improves our residual plot.
women.mod.x3 <- lm(weight ~ height + I(height^2) + I(height^3))
plot(women.mod.x3$residuals ~ women.mod.x3$fitted.values)
abline(0,0)
This looks better. Let’s test to double check that we should include the term.
summary(women.mod.x3)
##
## Call:
## lm(formula = weight ~ height + I(height^2) + I(height^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40677 -0.17391 0.03091 0.12051 0.42191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.967e+02 2.946e+02 -3.044 0.01116 *
## height 4.641e+01 1.366e+01 3.399 0.00594 **
## I(height^2) -7.462e-01 2.105e-01 -3.544 0.00460 **
## I(height^3) 4.253e-03 1.079e-03 3.940 0.00231 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2583 on 11 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9997
## F-statistic: 1.679e+04 on 3 and 11 DF, p-value: < 2.2e-16
We should include this last term as well: its p-value (0.00231) is below 0.05, so the cubic term is significant.
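As before, a partial F-test comparing the quadratic and cubic fits gives the same check:

anova(women.mod.quad, women.mod.x3)   # partial F-test for the height^3 term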
Multicollinearity occurs when predictors are correlated with each other. This isn’t good, because it inflates the standard errors of our coefficient estimates.
Before we work with data, we should explore it a little bit. This includes looking for multicollinearity. To look at the relationship between pairs of predictors, we use the plot function to visually inspect. If two variables look related, we can check their correlation with cor(\(x_1\), \(x_2\)). A common rule of thumb is that a correlation above 0.9 (in absolute value) is a problem.
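A sketch of this check using the built-in mtcars data (not the women example above, which has only one predictor):

plot(mtcars[, c("disp", "hp", "wt")])   # pairwise scatterplots of three predictors
cor(mtcars$disp, mtcars$wt)             # numeric correlation for one suspicious-looking pair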
However, pairwise correlations are limited (a predictor can be explained by a combination of the others even when no single pair is highly correlated), so we can also fit lm(\(x_j\) ~ all other predictors) and look at the R-squared to measure one predictor’s multicollinearity with the rest of the predictors.
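Continuing the mtcars illustration, we regress one predictor on the others and look at the R-squared:

disp.mod <- lm(disp ~ hp + wt, data = mtcars)   # how well do the others explain disp?
summary(disp.mod)$r.squared                     # an R-squared near 1 signals multicollinearity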
We also talked about the VIF, or variance inflation factor, which is \(1/(1-R_j^2)\) for \(x_j\), where \(R_j^2\) is the R-squared from regressing \(x_j\) on the other predictors.
If \(x_j\) is not related to the others, then \(R_j^2\) will be 0 and the VIF equals 1.
If \(x_j\) is entirely explained by the others, then \(R_j^2\) will be 1 and the VIF approaches infinity (bad).
A VIF over 10 is considered bad. A high VIF means the standard error is inflated, which makes the confidence interval extra wide and the predictor look less significant (the test statistic goes down and the p-value goes up).