This is what we worked on in class today. We looked at quadratic regression models. These models are interesting because they reuse the predictors you already have to build a more accurate model. We started by working with the women data. When we first plotted the residuals from the linear fit, there was a bad trend: the points moved together and formed the shape of a banana. This is not good because it means the model is systematically overpredicting for some heights and underpredicting for others. After creating a height squared and a height cubed variable, we had a much better model. This is interesting to me because we are using the data we already have to make a better model, without collecting any new variables.

We then moved into Chapter 5 and talked about multicollinearity, which is when the predictors are correlated with each other. It is not good because it inflates the standard errors of the coefficients, and inflated standard errors give us wider confidence intervals. We talked about the Variance Inflation Factor (VIF). The VIF has values that are “allowed” and values that are “not good”, and we talked about why those values are good or bad.
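To make the VIF idea concrete, here is a minimal sketch that computes it by hand for height in a model that also contains height squared. It assumes the textbook definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The names `w`, `aux`, `r2`, and `vif_height` are made up for this illustration, and the class may have used a package function instead.

```r
# Sketch: VIF for height when height^2 is also in the model.
# Assumes VIF_j = 1 / (1 - R_j^2), with R_j^2 taken from the
# regression of predictor j on the remaining predictors.
w <- data.frame(height = women$height,
                heightSquared = women$height^2)

# Auxiliary regression: how well does heightSquared explain height?
aux <- lm(height ~ heightSquared, data = w)
r2 <- summary(aux)$r.squared

vif_height <- 1 / (1 - r2)
vif_height  # very large, because height and height^2 move almost together
```

A VIF of 1 means the predictor is uncorrelated with the others. Common rules of thumb treat values above about 5 or 10 as a warning sign, though the cutoffs used in class may differ.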
data(women)
attach(women)
head(women)
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
myData <- lm(weight ~ height)
myData
##
## Call:
## lm(formula = weight ~ height)
##
## Coefficients:
## (Intercept) height
## -87.52 3.45
summary(myData)
##
## Call:
## lm(formula = weight ~ height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
plot(myData)
plot(height,weight, ylab = "Women's Weight", xlab = "Women's Height", main = "Women")
Womenresids <- myData$residuals
plot(Womenresids)
Our Formula: weight = -87.52 + (3.45)(height)
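As a quick check of the formula, the sketch below plugs one height into it by hand and compares the result with `predict()`. The height of 65 inches and the names `lin`, `by_hand`, and `from_predict` are chosen only for illustration.

```r
# Refit the linear model with an explicit data argument so this block
# does not depend on attach(women).
lin <- lm(weight ~ height, data = women)

# Plug height = 65 into weight = b0 + b1 * height by hand.
b <- coef(lin)
by_hand <- unname(b[1] + b[2] * 65)

# Same prediction through predict().
from_predict <- unname(predict(lin, newdata = data.frame(height = 65)))

c(by_hand = by_hand, from_predict = from_predict)
```

The two numbers should agree, which confirms that the formula is just the fitted coefficients written out.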
heightSquared <- height^2
quad <- lm(weight ~ height + heightSquared)
summary(quad)
##
## Call:
## lm(formula = weight ~ height + heightSquared)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50941 -0.29611 -0.00941 0.28615 0.59706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
## height -7.34832 0.77769 -9.449 6.58e-07 ***
## heightSquared 0.08306 0.00598 13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
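New Formula: weight = 261.88 + (-7.348)(height) + (0.08306)(height^2). Check the p-values: both height and heightSquared are significant.
How to plot the model: we can’t use abline because the model has more than one slope term, so it is no longer a straight line. Instead, we compute the predicted values over a grid of heights and draw them on top of the data with lines():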
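coef(myData)
# grid of heights covering the range of the data
x <- seq(from = 58, to = 72, by = 0.1)
x
# predicted weight from the quadratic model at each grid height
y <- coef(quad)[1] + coef(quad)[2] * x + coef(quad)[3] * (x^2)
y
length(y)
length(x)
plot(weight ~ height)
lines(x, y, lty = 3, col = 2)  # quadratic fit (dotted, red)
abline(myData)                 # linear fit for comparison
plot(weight ~ height, lty = 2)
abline(0, 0)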
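The same curve can be drawn without typing out the coefficients. This is a sketch of an alternative, not what we did in class: it assumes the squared term is written inside the formula with `I(height^2)` so that `predict()` can rebuild it from a new `height` column. The names `quad2`, `grid`, and `curve_y` are illustrative.

```r
# Quadratic fit with the squared term written inside the formula.
quad2 <- lm(weight ~ height + I(height^2), data = women)

# Fine grid of heights across the observed range.
grid <- data.frame(height = seq(from = 58, to = 72, by = 0.1))

# Predicted weights on the grid.
curve_y <- predict(quad2, newdata = grid)

plot(weight ~ height, data = women)
lines(grid$height, curve_y, lty = 3, col = 2)
```

Writing the term as `I(height^2)` also means there is no separate heightSquared variable to keep in sync with height.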
names(quad)
plot(quad$residuals ~ quad$fitted.values)
abline(0, 0)
newResiduals <- quad$residuals
plot(newResiduals)
abline(0,0)
We want residuals scattered both above and below the zero line, so that some points are overpredicted and some are underpredicted with no pattern.
There is still a little bit of a trend, so now we can add the height cubed term.
heightCubed <- height^3
cube <- lm(weight ~ height + heightSquared + heightCubed)
cubeResid <- cube$residuals
plot(cubeResid)
summary(cube)
##
## Call:
## lm(formula = weight ~ height + heightSquared + heightCubed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40677 -0.17391 0.03091 0.12051 0.42191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.967e+02 2.946e+02 -3.044 0.01116 *
## height 4.641e+01 1.366e+01 3.399 0.00594 **
## heightSquared -7.462e-01 2.105e-01 -3.544 0.00460 **
## heightCubed 4.253e-03 1.079e-03 3.940 0.00231 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2583 on 11 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9997
## F-statistic: 1.679e+04 on 3 and 11 DF, p-value: < 2.2e-16
All three terms are significant: height, height squared, and height cubed.
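To see the improvement side by side, this sketch refits the three models and collects the residual standard error and adjusted R-squared from each summary. The names `m1`, `m2`, `m3`, `models`, and `comparison`, and the `I()` terms, are my own shorthand; the fits themselves are the same linear, quadratic, and cubic models as above.

```r
m1 <- lm(weight ~ height, data = women)
m2 <- lm(weight ~ height + I(height^2), data = women)
m3 <- lm(weight ~ height + I(height^2) + I(height^3), data = women)

models <- list(linear = m1, quadratic = m2, cubic = m3)

# One row per model: residual standard error and adjusted R-squared.
comparison <- data.frame(
  resid_se = sapply(models, function(m) summary(m)$sigma),
  adj_r2   = sapply(models, function(m) summary(m)$adj.r.squared)
)
round(comparison, 4)
```

The values should match the three summaries printed above: the residual standard error shrinks with each added term while the adjusted R-squared creeps up.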
If you plot the model rather than the data, with plot(mod), R draws its standard diagnostic plots, starting with residuals versus fitted values.
Now, let’s look at the fish data.
We DON’T want the index on the x-axis! (The index is just the row number of the fish.)
Use this instead:
plot(mod$residuals ~ mod$fitted.values)
library(alr3)
data(wblake)
head(wblake)
attach(wblake)
myMod <- lm(Age ~ Scale)
myMod
summary(myMod)
plot(Age, Length)
myResid <- myMod$residuals
plot(myResid)
abline(0, 0)
plot(myMod$residuals ~ myMod$fitted.values)
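The difference between the two residual plots is easiest to see side by side. The sketch below uses the built-in women data instead of the fish data so that it runs without installing alr3. Here `fit` is an illustrative name, and `resid()` and `fitted()` are the standard accessor functions that return the same values as `$residuals` and `$fitted.values`.

```r
fit <- lm(weight ~ height, data = women)

op <- par(mfrow = c(1, 2))  # two panels side by side

# Residuals against row index: the x-axis is only the order of the rows.
plot(resid(fit), xlab = "Index (row number)", ylab = "Residual")
abline(h = 0)

# Residuals against fitted values: the x-axis is what the model predicted.
plot(fitted(fit), resid(fit), xlab = "Fitted value", ylab = "Residual")
abline(h = 0)

par(op)  # restore the previous plotting layout
```

In the women data the rows happen to be sorted by height, so the two panels look alike. When the rows are in no particular order, the index plot can hide a pattern that the fitted-value plot shows.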