On Thursday we learned about quadratic regression. This is the process of adding an x^2 term (and possibly x^3) to a linear model to see whether it makes the residuals behave better.
data(women)                  # built-in dataset: heights and weights of 15 American women
attach(women)                # so we can refer to height and weight directly
plot(weight ~ height)        # scatterplot of the raw data
mod <- lm(weight ~ height)   # fit the simple linear model
abline(mod)                  # add the fitted line to the scatterplot
res <- residuals(mod)        # extract the residuals (no need to refit the model)
plot(res)                    # index plot of the residuals
As we can see from the residuals graph, the residuals form a smile shape, which is not what we want. We want residuals that scatter more or less randomly around 0. The scatterplot of the women data shows a slight curvature too, which accounts for the smile.
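A quick extra check (not from the class notes, just a sketch following the same idea) is to plot the residuals against the fitted values with a reference line at 0, which makes the smile easier to see:
plot(fitted(mod), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)    # residuals should scatter randomly around this line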
Now we do an F-test by looking at the model summary.
summary(mod)
##
## Call:
## lm(formula = weight ~ height)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
From the F-test we see that the p-value is 1.091e-14, so the linear model is a highly significant fit, which makes us wonder if we could do even better.
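As an aside (a sketch, not something from class): that same p-value can be recomputed from the summary object, since summary.lm stores the F statistic and its degrees of freedom:
f <- summary(mod)$fstatistic
pf(f[1], f[2], f[3], lower.tail = FALSE)   # recomputes the overall p-value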
To see if an x^2 term improves the fit, we add I(height^2) to the linear model (the I() wrapper is needed so that ^ is treated as arithmetic rather than formula syntax).
mod2 <- lm(weight ~ height + I(height^2))
summary(mod2)
##
## Call:
## lm(formula = weight ~ height + I(height^2))
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50941 -0.29611 -0.00941  0.28615  0.59706 
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
## height       -7.34832    0.77769  -9.449 6.58e-07 ***
## I(height^2)   0.08306    0.00598  13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
The p-value of this model is even smaller, and the residual standard error drops from 1.525 to 0.3841, so we keep the quadratic term.
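A more direct way to compare the two nested models is a partial F-test, which (as a sketch of a standard check, not something from the class notes) anova() can run for us:
anova(mod, mod2)    # a significant result means the quadratic term really helps
Now we look at the new model's residuals.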
res <- residuals(mod2)   # reuse the quadratic model fitted above
plot(res)
They are better than before but still show a slight pattern, so we add I(height^3) as well.
mod3 <- lm(weight ~ height + I(height^2) + I(height^3))
res <- residuals(mod3)
plot(res)
summary(mod3)
##
## Call:
## lm(formula = weight ~ height + I(height^2) + I(height^3))
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40677 -0.17391  0.03091  0.12051  0.42191 
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -8.967e+02  2.946e+02  -3.044  0.01116 * 
## height       4.641e+01  1.366e+01   3.399  0.00594 **
## I(height^2) -7.462e-01  2.105e-01  -3.544  0.00460 **
## I(height^3)  4.253e-03  1.079e-03   3.940  0.00231 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2583 on 11 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9997
## F-statistic: 1.679e+04 on 3 and 11 DF, p-value: < 2.2e-16
The overall p-value is essentially the same, but the cubic term is significant (p = 0.00231) and the residuals are even more centered around 0, so we keep it.
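To see what we actually gained, a quick sketch (not from the class notes): redraw the scatterplot and overlay the fitted values of all three models.
plot(weight ~ height)
abline(mod)                             # linear fit
lines(height, fitted(mod2), lty = 2)    # quadratic fit
lines(height, fitted(mod3), lty = 3)    # cubic fit
Given how small the change in R-squared was, the quadratic and cubic curves should be nearly indistinguishable on this scale.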
We also learned about multicollinearity, which is when the predictors are correlated with each other. It inflates our standard errors and makes our estimators unstable. We can test for it in R with cor(), e.g. cor(data[, 1], data[, 2]). If the resulting number is close to 1 or -1, the two predictors are highly correlated.
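In fact, the polynomial terms we just used are a built-in example of this: height and height^2 rise together almost in lockstep, so their correlation should be very close to 1. A quick sketch:
cor(height, height^2)    # nearly 1: the two predictors are almost redundant
(If the car package is installed, car::vif(mod2) gives variance inflation factors, another common multicollinearity check, though we did not cover it in class.)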