Today in class we worked on quadratic regression, which is a special case of polynomial regression. You should use it when the relationship between x and y follows a smooth curve rather than a straight line.
Even though the model contains both x and x², it is still classified as linear regression, because the model is linear in the coefficients (the betas). The beta on x shifts the parabola left and right, while the beta on x² controls the steepness and direction of the parabola.
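To get a feel for this, here is a small sketch (not from class, just an illustration) that uses R's curve() function to draw two parabolas that share the same intercept and x coefficient but have different x² coefficients:
# Same intercept and x term, different x^2 coefficients
curve(1 + 2*x + 0.5*x^2, from = -10, to = 10, ylab = "y")
curve(1 + 2*x + 2*x^2, add = TRUE, lty = 2)   # steeper parabola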
There are two tests to see if we want to use the quadratic model or not. First, we can look at the t-test for the b_2 coefficient in the quadratic model; if that p-value is significant, we want to keep the quadratic model. The second way is to conduct a partial F-test that compares the two nested models (linear versus quadratic).
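As a rough sketch of what those two checks look like in R (the names linmod and quadmod are placeholders, and y and x are assumed to already exist): the t-test comes from summary() of the quadratic fit, and the nested-model F-test can be run with anova().
linmod  <- lm(y ~ x)            # straight-line model
quadmod <- lm(y ~ x + I(x^2))   # quadratic model
summary(quadmod)                # t-test for the I(x^2) coefficient
anova(linmod, quadmod)          # partial F-test comparing the two models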
A different way to see if we want the quadratic model is by looking at the residuals. We can plot the residuals against the fitted values. If there seem to be trends above or below the zero line of the linear model, we may want to try a quadratic model. When I say trends, I mean that in some regions there are more data points above the line than below the line (or vice versa), which means that we are systematically overpredicting (or systematically underpredicting). A really helpful function for this is residualPlot(). You can use this function on both the linear model and the quadratic model. It will draw a red line that shows the trend of the residuals. Ideally, we would want this line to be flat. So, we can compare the red curves/lines from both models and see which one is more extremely curved.
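One note: residualPlot() comes from the car package, so you have to load that first. A minimal sketch, using the placeholder model names from above:
library(car)            # residualPlot() lives in the car package
residualPlot(linmod)    # the red trend line should be roughly flat if the fit is adequate
residualPlot(quadmod)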
To use R, there are a couple of changes from the previous simple and multiple linear regressions. We can create the x² variable two different ways. First, we can define a new variable and put that in our linear model:
xsq <- x^2
lm(y ~ x + xsq)
The second way is using I() in the lm function:
lm(y ~ x + I(x^2))
Hint: you can also use I() for polynomials of higher degree than squared. Just keep adding terms of the same form, changing the power to whatever degree you want.
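For example, a cubic model would follow the same pattern (just a sketch of the syntax, not a model we fit in class):
lm(y ~ x + I(x^2) + I(x^3))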
I’ll use an example from class to help demonstrate this. We will use the women data set. First, we will look at the plot and create the linear model to see how well it fits the data points.
data("women")
head(women)
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
plot(weight ~ height, data = women)
I always use the head() function to know what is in the data set I am working with. This plot looks like it has a slight curve. So, I’ll make the linear and quadratic models and see what the residualPlot(), F-test, and t-tests look like.
lmmod <- lm(weight ~ height, data = women)
summary(lmmod)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
Looking at the summary for the linear model, both the F-test and t-test have significant p-values. Remembering back, the F-test tells us that there is at least one significant predictor in the model, while the t-tests tell us which predictors are significant. In our one-predictor model, the difference doesn’t matter that much. The R² is pretty good as well at 99.1%: that proportion of the variability in weight can be linearly attributed to height.
Next, I’ll look at the residuals to see if there are any obvious trends. If we want to stay with the linear model, we will hope that those don’t exist. There are two ways to do this. We can use the residualPlot() command, or we can create our own plot: plot(residuals(lmmod) ~ fitted(lmmod)). The residualPlot() output doesn’t render with knitr here, so you can run it yourself to see.
plot(residuals(lmmod) ~ fitted(lmmod))
As you can see, our own plot and the residualPlot() output show the same pattern. Wow! This curve is extreme. We definitely want to fit a quadratic model.
quadmod <- lm(weight ~ height + I(height^2), data = women)
summary(quadmod)
##
## Call:
## lm(formula = weight ~ height + I(height^2), data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50941 -0.29611 -0.00941 0.28615 0.59706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
## height -7.34832 0.77769 -9.449 6.58e-07 ***
## I(height^2) 0.08306 0.00598 13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
After running the quadratic model summary, we can look at the t-test and F-test p-values. Looking at the F-test, we can see that at least one of the predictors is significant: the p-value is less than our alpha level of .05. Diving deeper, we can look at the t-test p-values. Both the height and height² terms are significant, showing that the quadratic model is a good fit. The R² also increased, which means even more of the variability in weight is explained by the quadratic model.
We can look at the residuals of this model to see if the red line is less curved than it was for the linear model. You can also do the plot of the residuals against the fitted values to see if the points look more randomly dispersed.
plot(residuals(quadmod) ~ fitted(quadmod))
Our red line is much less curved (you will have to run it yourself to see)! It still has a slight curve, however. We could try to add more degrees to the model, but at some point the improvement would only be statistically significant, not practically significant.
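If we did want to check whether another degree is worth it (this is just a sketch of how you would check, not something we concluded in class, and cubmod is a name I made up), we could fit a cubic model and look at the t-test for the new term plus the partial F-test against the quadratic model:
cubmod <- lm(weight ~ height + I(height^2) + I(height^3), data = women)
summary(cubmod)          # is the I(height^3) coefficient significant?
anova(quadmod, cubmod)   # partial F-test: does the cubic term add anything?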
In conclusion, for this example, the quadratic model provides a better fit than the original linear one.
After talking about quadratic models in class, we talked about multicollinearity. When using several predictors, you don’t want them to be heavily correlated with each other. If they are, several things go wrong. Our standard errors will be inflated, and we use those values for testing whether a predictor can be dropped. If there is collinearity, then the SE increases -> the test statistic decreases (closer to 0) -> the p-value increases and we can’t reject the null as easily.
Besides inflating the SE, the betas become very unstable. Looking from sample to sample, there will be big differences in the estimated coefficients if the predictors are correlated.
A way to test this is the variance inflation factor (VIF). There are a few steps to run this analysis (see the sketch after this list):
1. Regress the predictor you’re testing on all of the other predictors: lm(x_j ~ all other predictors).
2. Get that model’s R².
3. Use this equation: VIF = 1 / (1 - R²).
If the predictor is totally unrelated to the others, the VIF will be 1. The more correlated the predictors are, the bigger the VIF gets. Ideally, you don’t want a VIF higher than 10.
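Here is a sketch of computing one VIF by hand and then checking it with the vif() function from the car package. The mtcars model below is just an illustration I picked, not an example from class:
# Model with several (fairly correlated) predictors
mod <- lm(mpg ~ disp + hp + wt, data = mtcars)

# VIF for disp by hand: regress disp on the other predictors
aux <- lm(disp ~ hp + wt, data = mtcars)
1 / (1 - summary(aux)$r.squared)

# car::vif() computes this for every predictor at once
library(car)
vif(mod)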