Learning Log 12

Polynomial Regression

On Thursday, we began class by looking at the distribution of residual plots, which plot the points based on the predicted values and residuals. Ideally, we want there to be no trend in the residual plots. However, if there is a trend, such as a curve, this is not a good sign since that would be a result of systematically overpredicting and underpredicting. In order to improve these residual plots and the regression model overall, Polynomial Regression is used. This takes a similar form to Simple Linear Regression, except the equation becomes more complex as the terms that are added are polynomials. Here is the general equation: \(\hat y= \beta_0 + \beta_1 x + \beta_2 x^2 +...+\beta_n x^n\)

We can use this to better predict the response variable when given the values of the predictor variables. In class, we took a look at the residual plots of both wblake and women datasets to see if there was a good distribution. For example, here is what we found from the women dataset:

attach(women)
data(women)
head(women)

##   height weight
## 1     58    115
## 2     59    117
## 3     60    120
## 4     61    123
## 5     62    126
## 6     63    129

range(height)

## [1] 58 72

wommod<-lm(weight~height)
residplot<-wommod$residuals
plot(residplot~wommod$fitted.values,xlab="Predicted Values",ylab="Residuals",main="Residual Plot for Women")
abline(0,0)

The residual plot for the women data set has a clear curve in it meaning that there is systematic over/under prediction.There should be points balanced above and below the line y=0 This means it would be good to consider using polynomial regression.

Adding polynomial terms to the equation can help better distribute the residuals up to a certain point. When adding the polynomial terms, we need to make sure that the term is significant so that we know whether or not we should add it to the model. We use the following hypothesis test for the nth \(\beta\) term: \(H_0: \beta_n=0\) \(H_A: \beta_n\neq 0\) For our example, we are going to see if adding a 3rd beta term, i.e. \(\beta_2\), is significant. This gives us the following hypotheses: \[H_0: \beta_2=0\] \[H_A: \beta_2\neq 0\] We want to take a look at the residual plot to see if the distribution of the data points improved. This would be a good indicator of whether or not the beta term is going to be significant.

xsq<-(height)^2
polymod<-lm(weight~height+xsq)
residplot2<-polymod$residuals
plot(residplot2~polymod$fitted.values, xlab="predicted values", ylab="residuals",main="Residual Plot for Women")
abline(0,0)

There seems to be a much less noticeable trend in the residual plot, which means that this model is going to have less of an issue with systematic under predicting and over predicting. In order to conduct this hypothesis test and determine the p-values, there are two methods that can be used: the T-test, or the partial F-test. We can use the summary() function to find the T-test p-value, or we can use the anova( , ) function to find the partial F-test p-value. Either way, we should get the same result, so in general, conducting both tests isn’t necessary.

summary(polymod)

## 
## Call:
## lm(formula = weight ~ height + xsq)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50941 -0.29611 -0.00941  0.28615  0.59706 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
## height       -7.34832    0.77769  -9.449 6.58e-07 ***
## xsq           0.08306    0.00598  13.891 9.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9994 
## F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16

anova(wommod,polymod)

## Analysis of Variance Table
## 
## Model 1: weight ~ height
## Model 2: weight ~ height + xsq
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     13 30.2333                                  
## 2     12  1.7701  1    28.463 192.96 9.322e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both tests result in a p-value of 9.32e-09, which is statistically significant. This means that we reject the null hypothesis in favor of the alternative, so \(\beta_2\neq 0\). This means that the beta term is significant enough to add to the regression model. We can keep adding terms to improve the prediction of the response values until the terms are no longer significant.

xcub<-(height)^3
polymod2<-lm(weight~height+xsq+xcub)
residplot3<-polymod2$residuals
plot(residplot3~polymod2$fitted.values,xlab="Predicted Values",ylab="Residuals",main="Residual Plot for Women")
abline(0,0)

summary(polymod2)

## 
## Call:
## lm(formula = weight ~ height + xsq + xcub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40677 -0.17391  0.03091  0.12051  0.42191 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -8.967e+02  2.946e+02  -3.044  0.01116 * 
## height       4.641e+01  1.366e+01   3.399  0.00594 **
## xsq         -7.462e-01  2.105e-01  -3.544  0.00460 **
## xcub         4.253e-03  1.079e-03   3.940  0.00231 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2583 on 11 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9997 
## F-statistic: 1.679e+04 on 3 and 11 DF,  p-value: < 2.2e-16

The residual plot seems a bit more scattered than when there was just a 2nd degree polynomial, so this is a good sign for the model. The p-value of the beta term on \(x^3\) (\(\beta_3\)) is 0.00231, which is still significant at the \(\alpha=0.05\) level, so we again reject the null hypothesis where \(\beta_3=0\) in favor of the alternative where \(\beta_3\neq 0\). We can try looking at adding another beta term to see if it’s still significant.

xq<-(height)^4
polymod3<-lm(weight~height+xsq+xcub+xq)
residplot4<-polymod3$residuals
plot(residplot4~polymod3$fitted.values,xlab="Predicted Values",ylab="Residuals",main="Residual Plot for Women")
abline(0,0)

summary(polymod3)

## 
## Call:
## lm(formula = weight ~ height + xsq + xcub + xq)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32577 -0.11881  0.07792  0.13334  0.30466 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  8.824e+03  4.550e+03   1.939   0.0812 .
## height      -5.552e+02  2.814e+02  -1.973   0.0768 .
## xsq          1.319e+01  6.515e+00   2.024   0.0705 .
## xcub        -1.389e-01  6.693e-02  -2.076   0.0646 .
## xq           5.507e-04  2.574e-04   2.140   0.0581 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2244 on 10 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9998 
## F-statistic: 1.669e+04 on 4 and 10 DF,  p-value: < 2.2e-16

At this point, the residual plot isn’t much better, and the \(x^4\) term is statistically insignificant since the p-value is 0.0581 (where alpha=0.05). This means that we accept the null hypothesis, where \(\beta_4=0\) This leaves us with a 3rd degree polynomial regression model: \(\hat y= \beta_0 + \beta_1 x + \beta_2 x^2 +\beta_3 x^3\)

Learning Log 12

Kristen Rutschke

3/13/2018

Learning Log 12

Polynomial Regression

Multicollinearity