Quadratic Regression

In our most recent class we talked about quadratic regression, which is similar to linear regression but models a smooth relationship between x and y that is not a straight line. Our equation for this is: \[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{\beta}_2 x_i^2\]

In this equation, beta(zero) is our intercept, beta(one) shifts the parabola left and right (the vertex sits at x = -beta(one) / (2 beta(two))), and beta(two) controls the curvature of the relationship.
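To see how these coefficients change the shape of the curve, here is a minimal sketch in R with made-up coefficient values (b0, b1, b2 are just illustrative numbers, not estimates from any model we fit in class):

b0 <- 1; b1 <- -2; b2 <- 0.5          # hypothetical values, for illustration only
curve(b0 + b1 * x + b2 * x^2, from = -2, to = 6,
      xlab = "x", ylab = "predicted y", main = "Quadratic mean function")
abline(v = -b1 / (2 * b2), lty = 2)   # vertex: changing b1 shifts this left/right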

When we want to see if using quadratic regression is necessary, we look at the plot of our residuals. In class we did this using the wblake data. First we fit the model and then we plot the residuals:

library(alr3)
## Loading required package: car
data(wblake)
attach(wblake)
head(wblake)
##   Age Length   Scale
## 1   1     71 1.90606
## 2   1     64 1.87707
## 3   1     57 1.09736
## 4   1     68 1.33108
## 5   1     72 1.59283
## 6   1     80 1.91602
mymod3 <- lm(Age ~ Scale)                # linear model: Age on Scale
myresids3 <- mymod3$residuals
plot(myresids3 ~ mymod3$fitted.values)   # residuals vs. fitted values
abline(0,0)                              # horizontal reference line at zero

Because we can see a trend in this residual plot, we have an issue: a trend means that we are systematically over- and under-predicting. We will fix our linear model by using quadratic regression, adding an x-squared term (and, optionally, an x-cubed term) to the model. The code should look something like this:

xsq2 <- Scale^2                        # squared term
xc <- Scale^3                          # cubed term (optional)
mymod4 <- lm(Age ~ Scale + xsq2 + xc)
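
An equivalent, arguably more idiomatic way to write this in R is to use I() inside the formula, so we do not need to create the extra variables first (mymod4b is just a name introduced here for illustration; it fits the same model as mymod4):

mymod4b <- lm(Age ~ Scale + I(Scale^2) + I(Scale^3))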

We will now plot the residuals of our quadratic equation and check for a trend:

myresids4 <- mymod4$residuals
plot(myresids4 ~ mymod4$fitted.values)
abline(0,0)

This plot looks better. If we compare the previous graph and this graph around a fitted value of about 1, we can see that the residuals here are spread more evenly above and below the zero line (roughly equal numbers of points on each side) than before, meaning we have fewer systematic errors in our predictions.

We also talked about the fact that you can run a t-test for beta(two), or use an F-test comparing the linear and quadratic models, to see if the quadratic term is necessary. If we run a t-test, we use the summary function and look at the p-values:

summary(mymod4)
## 
## Call:
## lm(formula = Age ~ Scale + xsq2 + xc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.99281 -0.65734  0.02646  0.64421  2.75180 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.094809   0.400642  -2.733  0.00654 ** 
## Scale        1.172388   0.210092   5.580 4.23e-08 ***
## xsq2        -0.021289   0.032111  -0.663  0.50769    
## xc          -0.002065   0.001495  -1.381  0.16788    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9339 on 435 degrees of freedom
## Multiple R-squared:  0.7813, Adjusted R-squared:  0.7798 
## F-statistic:   518 on 3 and 435 DF,  p-value: < 2.2e-16

Our p-value for the squared term is > .05, so we fail to reject the null hypothesis that beta(two) = 0; the t-test does not give us evidence that the quadratic term is needed in this model.
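
For the F-test version of this comparison, one option (a sketch, not output we generated in class) is to hand both fitted models to anova(), which runs the nested-model F-test for the terms added to the linear model:

anova(mymod3, mymod4)   # does adding the squared and cubed terms improve the fit?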

Ch 5

We also began chapter 5 and discussed multicollinearity. Multicollinearity is when our predictors are correlated with each other, which is not a good thing because it inflates our standard errors. This gives us less power to reject the null hypothesis and makes us more likely to leave predictors out of our model, even ones we may want to include.
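
A common way to check for multicollinearity is the variance inflation factor (VIF); the car package (already loaded above by alr3) provides vif(). As a sketch (not something we ran in class), applying it to our cubic model would show that the polynomial terms are highly correlated with each other, which is typical when powers of the same variable are entered without centering:

vif(mymod4)   # large values (e.g. > 10) suggest standard errors inflated by correlated predictors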