For this learning log, I will discuss quadratic regression and, more generally, when to use polynomial regression to fit a model to non-linear data.
Recall that a quadratic function is an expanded linear function with an x-squared term:
\[ \hat{Y}_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 \]
The B2 coefficient on the quadratic term is what allows for curvature in our model. This new term also adjusts our previous understanding of B1: the slope is no longer one constant number, so B1 now has more to do with where the curve sits left to right (the vertex of the parabola lands at x = -B1/(2B2)) than with being "the slope of the model." B0 is still chillin' as our y-intercept though.
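To see why, take the derivative of the model with respect to x: the slope at any particular x is
\[ \frac{d\hat{Y}}{dx} = \beta_1 + 2\beta_2 x \]
which changes as x changes instead of staying fixed like it does in a simple linear model.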
Let's use the built-in women dataset to see how quadratic regression works.
data(women)
summary(women)
##      height         weight
##  Min.   :58.0   Min.   :115.0
##  1st Qu.:61.5   1st Qu.:124.5
##  Median :65.0   Median :135.0
##  Mean   :65.0   Mean   :136.7
##  3rd Qu.:68.5   3rd Qu.:148.0
##  Max.   :72.0   Max.   :164.0
attach(women)
plot(height, weight)
At first glance, the relationship looks straight and linear to the naked eye. To check the soundness of a linear model, we'll look at the residuals to see if a SLR model is appropriate. (Hint, hint: it's not.)
Women <- lm(weight~height)
Women.Residuals <- Women$residuals
plot(Women.Residuals~Women$fitted.values)
abline(0,0)
Darn it, we observe a pretty clear pattern in our residuals, which leads us to believe that a SLR model is not an appropriate model for this relationship. The pattern in the residuals shows that the model underpredicts at the smallest fitted values, overpredicts in the mid-range, and underpredicts again at the largest fitted values. These systematic errors for certain x values make the model inappropriate; we would much rather see random scatter in the residual values.
Consequently, a quadratic regression may be the answer. Below are a few ways that you can easily add polynomial terms to the model:
- By declaring a new "variable": newvar <- x^2, then lm(y ~ x + newvar)
- Using the cool I() function to add the squared term: lm(y ~ x + I(x^2))
- The same trick works for any additional polynomial term: lm(y ~ x + I(x^2) + I(x^3))
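As a quick sanity check (a minimal sketch using the women data attached above; newvar is just an illustrative name), both specifications give identical coefficient estimates:
# Create an explicit squared predictor, then fit the model both ways
newvar <- height^2
fit.newvar <- lm(weight ~ height + newvar)
fit.I <- lm(weight ~ height + I(height^2))
# The estimates match; only the labels on the terms differ
coef(fit.newvar)
coef(fit.I)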
So now that we know how to code a quadratic regression, let's see if this model is more appropriate for predicting a woman's weight based on her height.
Quad.Women <- lm(weight ~ height + I(height^2))
Quad.Women
##
## Call:
## lm(formula = weight ~ height + I(height^2))
##
## Coefficients:
## (Intercept) height I(height^2)
## 261.87818 -7.34832 0.08306
In our R output, we see the quadratic term listed as another predictor variable at the end. Notice that the coefficient B2 is relatively small; keep this in the back of your mind for later.
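Written out as an equation, the fitted model from this output is:
\[ \widehat{\text{weight}} = 261.878 - 7.348 \, \text{height} + 0.083 \, \text{height}^2 \]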
Now, using some fancy code work that Dr. Knudson showed me (don't worry, you'll learn about it in the Chp 4-5 R-Guide), we can plot our new model against the data to see if it appears to be more accurate than our previous linear model.
plot(weight~height)
x <- seq(from=58, to=72, by=0.1)
quad <- 261.87818 + x*-7.34832 + x^2*0.08306
lines(x, quad, lty=1, col="orange")
abline(Women, col ="blue")
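If you would rather not copy the coefficients by hand, you can trace the same curve with predict() instead (a minimal sketch reusing x and Quad.Women from above; the line type and color are just for illustration):
# Let R compute the fitted weights along the grid of heights
quad.pred <- predict(Quad.Women, newdata = data.frame(height = x))
lines(x, quad.pred, lty = 2, col = "darkgreen")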
The orange line definitely follows the data better than the straight blue line, but let's check our residuals to confirm this. We'd hate to be fooled like last time…
Quad.Women.Residuals <- Quad.Women$residuals
plot(Quad.Women.Residuals~Quad.Women$fitted.values)
abline(0,0)
Much better: our residuals, while not perfect, appear to be much more random, with no defined pattern.
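If you want to compare the two residual plots directly, here is a minimal sketch (reusing the objects created above) that puts them side by side:
# Draw the SLR and quadratic residual plots in one window for easy comparison
par(mfrow = c(1, 2))
plot(Women.Residuals ~ Women$fitted.values, main = "SLR residuals")
abline(0, 0)
plot(Quad.Women.Residuals ~ Quad.Women$fitted.values, main = "Quadratic residuals")
abline(0, 0)
par(mfrow = c(1, 1))  # reset the plotting layout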
This might lead us to believe that the quadratic term is significant in the model. However, as you know from previous LLs, we should conduct a hypothesis test to check the statistical significance of the quadratic term. We'll do so using the summary function as usual.
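Formally, the test on the quadratic coefficient is
\[ H_0: \beta_2 = 0 \quad \text{vs.} \quad H_a: \beta_2 \neq 0 \]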
summary(Quad.Women)
##
## Call:
## lm(formula = weight ~ height + I(height^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50941 -0.29611 -0.00941 0.28615 0.59706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
## height -7.34832 0.77769 -9.449 6.58e-07 ***
## I(height^2) 0.08306 0.00598 13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
Looking at our p-values, we see that the p-value for I(height^2) is very small (0.00000000932), so we would conclude from our t-test that the quadratic term is significant in the model.
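An equivalent way to reach the same conclusion (a sketch, assuming the SLR fit Women and the quadratic fit Quad.Women are both still in your workspace) is a partial F-test comparing the two nested models:
# Compare the simple linear model to the quadratic model;
# with a single added term, this F-test matches the t-test above
anova(Women, Quad.Women)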
Since you're a curious little statistician, you wonder if additional polynomial terms (x^3, x^4, etc.) are significant as well. We can check these terms in the same fashion as x^2.
poly.mod <- lm(weight~ height + I(height^2) + I(height^3))
summary(poly.mod)
##
## Call:
## lm(formula = weight ~ height + I(height^2) + I(height^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40677 -0.17391 0.03091 0.12051 0.42191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.967e+02 2.946e+02 -3.044 0.01116 *
## height 4.641e+01 1.366e+01 3.399 0.00594 **
## I(height^2) -7.462e-01 2.105e-01 -3.544 0.00460 **
## I(height^3) 4.253e-03 1.079e-03 3.940 0.00231 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2583 on 11 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9997
## F-statistic: 1.679e+04 on 3 and 11 DF, p-value: < 2.2e-16
And there you have it: the cubic term would be statistically significant as well. But…
If we look at the point estimates for B2 and B3, we notice that their values are small decimals, especially for x^3. Even though these terms may be statistically significant, their overall impact on the model is not large; in other words, they are not practically significant. Depending on our audience and budget constraints, our decision to include more terms or drop some may differ.
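One way to put that practical impact in context (a minimal sketch using the models fit above) is to compare the fitted values from the quadratic and cubic models directly:
# Largest change in predicted weight (in pounds) from adding the cubic term
max(abs(fitted(Quad.Women) - fitted(poly.mod)))
Judging from the residual ranges in the two summaries, that difference can only be around a pound at most, which is tiny next to weights running from 115 to 164 pounds.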