Concepts

Today in class we covered the concepts of polynomial regression and multicollinearity. The idea of polynomial regression is actually pretty simple: it is a type of multiple linear regression where we add a polynomial term (such as a squared predictor) to the regression equation. Multicollinearity, on the other hand, is when two or more predictors in your regression equation are correlated. It is considered a problem when these predictors are highly correlated (|r| > .9) because it can inflate the standard errors of their coefficients, making it more likely that we conclude a predictor is not significant when it actually is.
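
As a quick sketch with generic names (not tied to the later examples), a quadratic regression model for a response y and a single predictor x has the form y = b0 + b1*x + b2*x^2 + error. Even though the fitted relationship between y and x is curved, the model is still linear in the coefficients b0, b1, and b2, which is why lm() can fit it just like any other multiple regression.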

We might turn to polynomial regression when we see a systematic pattern in the residuals of a model fit without a polynomial term. Multicollinearity, ideally, is something we would check for before fitting any regression model with multiple predictors.

Polynomial Regression

Creating the Model

To illustrate the key equations and code needed for polynomial regression, I will use an example based on the iris dataset. First, we'll start with a model without a polynomial term, looking at the relationship between sepal length and sepal width.

mod <- lm(Sepal.Length ~ Sepal.Width, data = iris)

We can then make a plot of the residuals and the fitted sepal lengths using our model.

plot(mod$fitted.values, mod$residuals)   # residuals vs. fitted values
abline(0, 0)                             # horizontal reference line at 0

Though these residuals are fairly evenly scattered around 0 across the range of fitted values, it looks like we could be systematically underestimating sepal length at the mid-range fitted values and overestimating it at the high and low fitted values.
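
One optional way to make this kind of curvature easier to see (an extra check, not something from class) is to overlay a lowess smoother on the residual plot; a clear bend in the smoothed line suggests a pattern that the straight-line model is missing.

plot(mod$fitted.values, mod$residuals)                            # same residual plot as above
abline(0, 0)
lines(lowess(mod$fitted.values, mod$residuals), col = "blue")     # smoothed trend in the residuals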

To try to fix this, we can add a quadratic term to our model like this:

modp <- lm(Sepal.Length ~ Sepal.Width + I(Sepal.Width^2), data = iris)
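
As a side note (an alternative sketch, not the approach from class), R's poly() function fits an equivalent model using orthogonal polynomial terms, which keeps the linear and squared terms from being highly correlated with each other:

modp2 <- lm(Sepal.Length ~ poly(Sepal.Width, 2), data = iris)   # orthogonal linear + quadratic terms

This version gives the same fitted values and overall F-test as the I(Sepal.Width^2) model; only the individual coefficients are parameterized differently.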

Again, we can look at the residual plot to see if the pattern of our residuals has disappeared.

plot(modp$fitted.values, modp$residuals)
abline(0,0)

It seems our residuals no longer show any systematic pattern of over- or underestimation. However, we can check whether the quadratic term actually improves the model using one of two tests. The first is a t-test on the quadratic coefficient, whose result we can find in the summary of our quadratic regression model.

Testing the Polynomial Term

summary(modp)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + I(Sepal.Width^2), data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63153 -0.62177 -0.08282  0.50531  2.33336 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        2.4594     2.4020   1.024   0.3076  
## Sepal.Width        2.4312     1.5445   1.574   0.1176  
## I(Sepal.Width^2)  -0.4246     0.2458  -1.727   0.0862 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8196 on 147 degrees of freedom
## Multiple R-squared:  0.03344,    Adjusted R-squared:  0.02029 
## F-statistic: 2.543 on 2 and 147 DF,  p-value: 0.08209

With a p-value of .0862, the quadratic term is only significant at the relatively lenient alpha level of .1. From this, we might conclude that the quadratic term is not necessary in the model. The second way to check whether the quadratic term is needed is a partial F-test comparing the model with and without it.

anova(mod, modp)
## Analysis of Variance Table
## 
## Model 1: Sepal.Length ~ Sepal.Width
## Model 2: Sepal.Length ~ Sepal.Width + I(Sepal.Width^2)
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)  
## 1    148 100.756                             
## 2    147  98.752  1    2.0044 2.9838 0.0862 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This test gives the same p-value because, when only a single term is added, the partial F-statistic is the square of that term's t-statistic (1.727^2 ≈ 2.98), so we again conclude that the quadratic term probably is not necessary in our model.

Visualizing the Regression Equations

Finally, we can visualize the two regression equations on a plot of our sepal lengths and sepal widths like this:

range(iris$Sepal.Width)                                      # range of sepal widths to plot over
## [1] 2.0 4.4
x <- seq(from = 2, to = 4.4, by = .05)                       # grid of sepal width values covering that range
y <- coef(modp)[1] + coef(modp)[2]*x + coef(modp)[3]*x^2     # quadratic predictions on the grid
plot(iris$Sepal.Width, iris$Sepal.Length)                    # scatterplot of the data
lines(x, y, col = "red")                                     # quadratic fit
abline(mod)                                                  # original straight-line fit

In this plot, the red line represents our quadratic regression equation and the black line represents our original regression equation. It is hard to see whether or not the quadratic term really helps the regression line to fit the data, but we know from our tests that the quadratic term is probably not necessary.
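
As a side note, the same red curve can be drawn with predict() instead of typing out the coefficients by hand (a small alternative sketch):

x <- seq(from = 2, to = 4.4, by = .05)
y <- predict(modp, newdata = data.frame(Sepal.Width = x))   # quadratic predictions on the grid
plot(iris$Sepal.Width, iris$Sepal.Length)
lines(x, y, col = "red")
abline(mod)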

Multicollinearity

We did not get far on the topic of multicollinearity, but we can look at it briefly with the following example. First, we need to create a multiple linear regression model; we will predict sepal length from sepal width, petal length, and petal width.

flo <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)

We can then look at the plots of our predictor variables against one another. We can also check the correlation of each pair of predictor variables.

plot(iris[2:4])

cor(iris[2], iris[3])
##             Petal.Length
## Sepal.Width   -0.4284401
cor(iris[3], iris[4])
##              Petal.Width
## Petal.Length   0.9628654
cor(iris[2], iris[4])
##             Petal.Width
## Sepal.Width  -0.3661259

We see that the correlation between petal length and sepal width is -.43, the correlation between petal width and petal length is .96 and the correlation between petal width and sepal width is -.37. We consider multicollinearity to be a problem when |correlation| > .9. Thus, we could consider the correlation between petal width and petal length to be an issue.
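
As a side note, all three of these correlations (along with the 1s on the diagonal) can be obtained in a single call by passing the whole block of predictor columns to cor():

cor(iris[2:4])   # correlation matrix of Sepal.Width, Petal.Length, and Petal.Width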

We can then consider the variance inflation factor (VIF) for each predictor. To do this, we fit a model for each predictor with the remaining predictors as its explanatory variables, and then plug the R-squared of that model into the VIF equation.

As an example, we'll calculate the VIF of petal width.

pw <- lm(Petal.Width ~ Petal.Length + Sepal.Width, data = iris)
summary(pw)
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length + Sepal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53907 -0.11443 -0.01447  0.12168  0.65419 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.70648    0.15133  -4.668 6.78e-06 ***
## Petal.Length  0.42627    0.01045  40.804  < 2e-16 ***
## Sepal.Width   0.09940    0.04231   2.349   0.0201 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2034 on 147 degrees of freedom
## Multiple R-squared:  0.9297, Adjusted R-squared:  0.9288 
## F-statistic: 972.7 on 2 and 147 DF,  p-value: < 2.2e-16

We see that the R-squared of the model is .9297, so we plug that value into our VIF equation where VIF = 1/(1 - R-squared).

1/(1-.9297)
## [1] 14.22475

We see that the VIF for petal width is 14.22, and since we consider any VIF over 10 to be an issue, this confirms the multicollinearity problem suggested by the high correlation between petal width and petal length.
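
As a side note, if the car package is installed, its vif() function computes the VIF for every predictor in the model at once, which is a handy check on the hand calculation:

# install.packages("car")   # if the package is not already installed
library(car)
vif(flo)   # VIFs for Sepal.Width, Petal.Length, and Petal.Width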

Similarity to Other Topics

Polynomial regression is similar to our other regression topics in that the same tests apply to it; it differs only in the variables that the model contains. Multicollinearity is a fairly new topic for us, but it applies to the regression models that we have made thus far.

Overall, both of these topics fit well with the objectives of the course because they both relate to regression modeling, which is a main focus of the course.