Quadratic Regression

What is a Quadratic Regression?

This applies to polynomial regression more generally. It is used if there is a smooth relationship between x and y but not linear. It is still linear in betas since only x’s change.

\[\hat{y}= {\beta_0}+{\beta_1} x+{\beta_2} x^2\]

Beta_0 is still the y intercept (when x = 0). Beta_1 shifts the parabola left or right Beta_2 affects the curavture of the parabola

How do you test if a Quardratic Regression is necessary?

2 options for testing: 1. T-test for Beta_2 2. F-test and compare the quadratic regression and the linear regression models

Example with Woman Data

library(MASS)
data(women)
attach(women)
names(women)
## [1] "height" "weight"

We can first look at the scatterplot of the data.

plot(weight~height)

We can see a slight curvature in the data. Let’s look at the linear regression model.

Linear Regression Model

modLin = lm(weight ~ height)
summary(modLin)
## 
## Call:
## lm(formula = weight ~ height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

The F-test shows the p-value is 1.091e-14. It is very small which means the model is a good fit. It can be better though, which we will try later.

Now, we can look at the scatterplot and show the line as well.

plot(weight~height)
abline(modLin)

It isn’t perfect but it isn’t too far off either.

Now, we can look at the residuals. That is a better indication of how good the model is. We can plot the residuals against the predicted values. We hope that there is no trend in the residuals. If there is, we’d want to use a different model to better predict it.

plot(modLin$residuals ~ modLin$fitted.values)
abline(0,0)

There is a clear quadratic formation with the residuals. This shows us that we should try the quadratic regression model.

Quadratic Regression Model

Let’s look at the quadratic regression next.

modQuad = lm(weight ~ height + I(height^2))
summary(modQuad)
## 
## Call:
## lm(formula = weight ~ height + I(height^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50941 -0.29611 -0.00941  0.28615  0.59706 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
## height       -7.34832    0.77769  -9.449 6.58e-07 ***
## I(height^2)   0.08306    0.00598  13.891 9.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9994 
## F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16

The F-test shows the p-value is < 2.2e-16. It is very small which means the model is a good fit. It is better than the linear regression model.

Now, let’s look at the scatterplot and add in both lines.

The quadratic line is harder than the linear regression line. First, we need to find the range of the heights (our x value). Then we need to find a bunch of x values so that their predicted values make up the line fairly well. Then we can make the line using the coefficents and the predicted x values. We plot the quadratic in red and the linear in blue to see the difference.

range(height)
## [1] 58 72
x = seq(from=58, to=72, by=.1)
y = coef(modQuad)[1] + coef(modQuad)[2]*x + coef(modQuad)[3]*x^2
plot(weight~height)
lines(x, y, lty=1, col=2)
abline(modLin, col=4)

We can see that the red line is a much better fit. This is one way to confirm that the regression model was good.

Another way is to look at the residuals again. We hope that there is no trend. If there is one, we could look at a cubic or more. For this, however, we are going to stop at quadratic.

residsQuad = modQuad$residuals
plot(residsQuad ~ modQuad$fitted.values)
abline(0,0)

The residuals look much better. There is no clear formation.

Multicollinearity

Multicollinearity is when predictors are correlated (not good).

We’re going to use the water data because we want a couple predictors to see if they are correlated.

library(alr3)
## Loading required package: car
## 
## Attaching package: 'alr3'
## The following object is masked from 'package:MASS':
## 
##     forbes
data(water)
attach(water)
names(water)
## [1] "Year"    "APMAM"   "APSAB"   "APSLAKE" "OPBPC"   "OPRC"    "OPSLAKE"
## [8] "BSAAM"

Look at Pairwise Correlations of Predictors

plot(water[2:4])

We can see that a couple of the scatterplots look to be correlated. We can look at the correlations to confirm.

cor(APMAM,APSAB)
## [1] 0.8276864
cor(APSAB,APSLAKE)
## [1] 0.9003047
cor(APSLAKE,APMAM)
## [1] 0.816076

We can see that all of the correlations are pretty high. It gets concerning if the correlation is about 0.9. The correlation between APSAB and APSLAKE is concerning. The other two are close but not quite.

Why is Multicollinearity Bad?

It’ll inflate your standard errors.

This means that when asking if we can drop Beta_j, sometimes we don’t include it when we should. Since the standard errors are larger, the teststat is closer to 0 so we lose power to reject the null of Beta_j equals 0 so we’re less likely to include Beta_j in our model.

Correlated predictors make estimator Beta_hat unstable.

Beta estimators from two different data samples could be different even if it’s the same sample size and the same population. Beta_hat is not consisten from sample to sample.

VIF - Varience Inflation Factor

  1. lm(x_j ~ all predictors except x_j)
  2. get R-squared
  3. VIF = 1/1-R-squared

If R-squared = 0, VIF = 1 - not inflating variances If R-squared = 1, VIF is really big - inflating variances a lot Rule of Thumb: if VIF > 10, bad and inflating variances