Polynomial Regression

In class last Thursday we reviewed previous data sets and performed polynomial regression. Polynomial (quadratic) regression is an extension of our Multiple Regression Model. If we plot our data and line of best fit, we may notice that a curved line could predict more accurately than a straight line. We want to look for any systematic under- or overestimation. We will begin by loading and attaching the wblake data.

library(alr3)
## Loading required package: car
data(wblake)    # load the WBLake fish data from the alr3 package
attach(wblake)  # attach so Age and Scale can be referenced directly
head(wblake)
##   Age Length   Scale
## 1   1     71 1.90606
## 2   1     64 1.87707
## 3   1     57 1.09736
## 4   1     68 1.33108
## 5   1     72 1.59283
## 6   1     80 1.91602

I will be creating a model to predict the age of a fish from the radius of its scale. We can plot the residuals of our model and observe whether they are systematically above or below zero in any region.

mod1 <- lm(Age ~ Scale)                    # simple linear regression of Age on Scale
plot(mod1$residuals ~ mod1$fitted.values)  # residuals vs. fitted values
abline(0, 0)                               # reference line at zero

Looking at the plot, for fitted values less than 4 and greater than about 7-8, we are systematically overestimating. So we can try fitting a quadratic model to see if it estimates our data better. This is done by adding the squared predictor as a second term. See the code below.

mod2 <- lm(Age ~ Scale + I(Scale^2))       # quadratic model: Scale plus Scale^2
plot(mod2$residuals ~ mod2$fitted.values)  # residuals vs. fitted values
abline(0, 0)                               # reference line at zero

The plot above shows the residuals of the new quadratic model. It looks better, but not ideal: we are still underestimating for fitted values less than 1, and the residuals remain fairly spread out for values over 4. So the new model isn't perfect, but I would say it is more accurate than the SLR.

We can compare the two models using an ANOVA (partial F) test.

anova(mod1,mod2)
## Analysis of Variance Table
## 
## Model 1: Age ~ Scale
## Model 2: Age ~ Scale + I(Scale^2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    437 492.13                                  
## 2    436 381.08  1    111.04 127.05 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Seeing that our p-value is extremely small, we can say that the quadratic model provides a significantly better fit than our SLR model. We could continue this idea and add higher-degree terms, which may or may not help. It's up to the user to judge whether the improvement is large enough to justify a quadratic vs. cubic vs. quartic model, etc.
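
As a quick sketch of that idea (the cubic model mod3 below is my own addition, not something we fit in class), we could add a cubic term and run the same partial F-test to see whether the extra degree is worth keeping:

mod3 <- lm(Age ~ Scale + I(Scale^2) + I(Scale^3))  # cubic model, for illustration only
anova(mod2, mod3)                                  # does the cubic term improve on the quadratic?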

We also began our Chapter 5 topic, multicollinearity. This is essentially the relationship/correlation our predictors may have with one another in a Multiple Regression Model. We can check for multicollinearity by computing the correlation matrix of all the predictors. Looking at those correlation values, we want to see whether any two predictors are highly correlated with one another, in which case it may be unnecessary to include both. If the absolute value of a correlation coefficient is greater than 0.9, we say multicollinearity exists.
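
As a small sketch (this uses Length and Scale from wblake as stand-in predictors, which is my own illustration rather than an example from class), the check is just the off-diagonal entries of the correlation matrix:

cor(wblake[, c("Length", "Scale")])  # correlation matrix of the predictors
# flag any off-diagonal value with absolute value above 0.9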

Another tool is the Variance Inflation Factor, \[VIF_i = \frac{1}{1 - R_i^2},\] where \(R_i^2\) is the \(R^2\) from regressing the \(i\)-th predictor on all the other predictors. If our predictors are completely unrelated, the Variance Inflation Factor equals 1. However, if a VIF is greater than 10, we say multicollinearity exists.
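
The car package (which alr3 loads for us) has a vif() function that computes these directly. As a sketch, again using Length and Scale from wblake as the two predictors (my own illustration, not an example from class):

mod4 <- lm(Age ~ Length + Scale, data = wblake)  # multiple regression with two predictors
vif(mod4)                                        # one VIF per predictor; values above 10 flag multicollinearity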

Both of the checks I outlined are good tools to use, since different scenarios may give them different results. So it's a good rule to run both.