On Thursday, we began class by looking at the distribution of residual plots, which plot the points based on the predicted values and residuals. Ideally, we want there to be no trend in the residual plots. However, if there is a trend, such as a curve, this is not a good sign since that would be a result of systematically overpredicting and underpredicting. In order to improve these residual plots and the regression model overall, Polynomial Regression is used. This takes a similar form to Simple Linear Regression, except the equation becomes more complex as the terms that are added are polynomials. Here is the general equation: \(\hat y= \beta_0 + \beta_1 x + \beta_2 x^2 +...+\beta_n x^n\)
We can use this to better predict the response variable when given the values of the predictor variables. In class, we took a look at the residual plots of both wblake and women datasets to see if there was a good distribution. For example, here is what we found from the women dataset:
attach(women)
data(women)
head(women)
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
range(height)
## [1] 58 72
wommod<-lm(weight~height)
residplot<-wommod$residuals
plot(residplot~wommod$fitted.values,xlab="Predicted Values",ylab="Residuals",main="Residual Plot for Women")
abline(0,0)
The residual plot for the women data set has a clear curve in it meaning that there is systematic over/under prediction.There should be points balanced above and below the line y=0 This means it would be good to consider using polynomial regression.
Adding polynomial terms to the equation can help better distribute the residuals up to a certain point. When adding the polynomial terms, we need to make sure that the term is significant so that we know whether or not we should add it to the model. We use the following hypothesis test for the nth \(\beta\) term: \(H_0: \beta_n=0\) \(H_A: \beta_n\neq 0\) For our example, we are going to see if adding a 3rd beta term, i.e. \(\beta_2\), is significant. This gives us the following hypotheses: \[H_0: \beta_2=0\] \[H_A: \beta_2\neq 0\] We want to take a look at the residual plot to see if the distribution of the data points improved. This would be a good indicator of whether or not the beta term is going to be significant.
xsq<-(height)^2
polymod<-lm(weight~height+xsq)
residplot2<-polymod$residuals
plot(residplot2~polymod$fitted.values, xlab="predicted values", ylab="residuals",main="Residual Plot for Women")
abline(0,0)
There seems to be a much less noticeable trend in the residual plot, which means that this model is going to have less of an issue with systematic under predicting and over predicting. In order to conduct this hypothesis test and determine the p-values, there are two methods that can be used: the T-test, or the partial F-test. We can use the summary() function to find the T-test p-value, or we can use the anova( , ) function to find the partial F-test p-value. Either way, we should get the same result, so in general, conducting both tests isn’t necessary.
summary(polymod)
##
## Call:
## lm(formula = weight ~ height + xsq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50941 -0.29611 -0.00941 0.28615 0.59706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
## height -7.34832 0.77769 -9.449 6.58e-07 ***
## xsq 0.08306 0.00598 13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
anova(wommod,polymod)
## Analysis of Variance Table
##
## Model 1: weight ~ height
## Model 2: weight ~ height + xsq
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 13 30.2333
## 2 12 1.7701 1 28.463 192.96 9.322e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both tests result in a p-value of 9.32e-09, which is statistically significant. This means that we reject the null hypothesis in favor of the alternative, so \(\beta_2\neq 0\). This means that the beta term is significant enough to add to the regression model. We can keep adding terms to improve the prediction of the response values until the terms are no longer significant.
xcub<-(height)^3
polymod2<-lm(weight~height+xsq+xcub)
residplot3<-polymod2$residuals
plot(residplot3~polymod2$fitted.values,xlab="Predicted Values",ylab="Residuals",main="Residual Plot for Women")
abline(0,0)
summary(polymod2)
##
## Call:
## lm(formula = weight ~ height + xsq + xcub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40677 -0.17391 0.03091 0.12051 0.42191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.967e+02 2.946e+02 -3.044 0.01116 *
## height 4.641e+01 1.366e+01 3.399 0.00594 **
## xsq -7.462e-01 2.105e-01 -3.544 0.00460 **
## xcub 4.253e-03 1.079e-03 3.940 0.00231 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2583 on 11 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9997
## F-statistic: 1.679e+04 on 3 and 11 DF, p-value: < 2.2e-16
The residual plot seems a bit more scattered than when there was just a 2nd degree polynomial, so this is a good sign for the model. The p-value of the beta term on \(x^3\) (\(\beta_3\)) is 0.00231, which is still significant at the \(\alpha=0.05\) level, so we again reject the null hypothesis where \(\beta_3=0\) in favor of the alternative where \(\beta_3\neq 0\). We can try looking at adding another beta term to see if it’s still significant.
xq<-(height)^4
polymod3<-lm(weight~height+xsq+xcub+xq)
residplot4<-polymod3$residuals
plot(residplot4~polymod3$fitted.values,xlab="Predicted Values",ylab="Residuals",main="Residual Plot for Women")
abline(0,0)
summary(polymod3)
##
## Call:
## lm(formula = weight ~ height + xsq + xcub + xq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32577 -0.11881 0.07792 0.13334 0.30466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.824e+03 4.550e+03 1.939 0.0812 .
## height -5.552e+02 2.814e+02 -1.973 0.0768 .
## xsq 1.319e+01 6.515e+00 2.024 0.0705 .
## xcub -1.389e-01 6.693e-02 -2.076 0.0646 .
## xq 5.507e-04 2.574e-04 2.140 0.0581 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2244 on 10 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9998
## F-statistic: 1.669e+04 on 4 and 10 DF, p-value: < 2.2e-16
At this point, the residual plot isn’t much better, and the \(x^4\) term is statistically insignificant since the p-value is 0.0581 (where alpha=0.05). This means that we accept the null hypothesis, where \(\beta_4=0\) This leaves us with a 3rd degree polynomial regression model: \(\hat y= \beta_0 + \beta_1 x + \beta_2 x^2 +\beta_3 x^3\)
We also took a look at chapter 5, where we discussed multicollinearity, which is when predictors are correlated to each other. This is not a good sign because it means that the standard errors are going to be inflated. We use the cor() function between 2 variables to see what the correlation value is between them. If the absolute value of the correlation is greater than 0.9, it is at a value that is too high to have both variables included in our model (i.e. there is multicollinearity).
We can also do a pairwise correlation, where we look at a variable’s correlation to all the other predictors. This will take the form of lm(x_j ~all other predictors). We then look at the summary for the \(R^2\) value, and calculate what is called the variance inflation factor(VIF), which is \(\frac{1}{1-R_j^2}\) for \(x_j\). If there is no relation between \(x_j\) and all the other predictors, then the VIF is going to be 1. However, if \(x_j\) is completely related to the other predictors, then the VIF is going to approach infinity. The correlation of the variable \(x_j\) is considered too high if the VIF is greater than 10. The problem with having a high VIF/ high correlation between variables is that the confidence interval is going to be extra wide, which means that the model is going to be less accurate in its predictions as the potential values are going to take on a larger range of values. Additionally,the test stat for dropping \(x_j\) is going to get closer to 0, making it harder to detect the importance of \(x_j\). Overall, \(\widehat\beta_j\) is going to be highly variable as a result, meaning that it is unstable. This means that it is not a very reliable value, and is something that we don’t really want to include in the model as a result.