In Class March 8th

This is what we worked on in class today. We looked at using quadratic regression models. These models are interesting because you use the predictors you already have to make a more accurate model. We started by working with the women data. At first, when we plotted the residuals, there was a bad trend: the points moved together and formed the shape of a banana. This is not good because it means we are systematically overpredicting at some heights and underpredicting at others. After creating a height squared and a height cubed variable, we had a much better model. This is interesting to me because we are using the data we have to make a better model without collecting any new variables.

We then moved into chapter 5 and talked about multicollinearity, which is when the predictors are correlated with each other. It is not good because it inflates our standard errors. If our standard errors are inflated, we will have wider confidence intervals. We talked about the Variance Inflation Factor (VIF), which measures how much the variance of a coefficient estimate is inflated by that correlation. The VIF has values that are “allowed” and values that are “not good”. We talked about these values and why they are good or bad.
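To make the VIF idea concrete, here is a minimal sketch that computes it by hand for the women data, using the standard definition VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the other predictors. The variable names (`height2`, `r2`, `vif`) are my own. Cutoffs such as 5 or 10 are commonly quoted rules of thumb and are not necessarily the values from class.

```r
# Hand-computed VIF for the two predictors of a quadratic fit to `women`
height  <- women$height
height2 <- height^2          # squared term, built from the same variable

# R^2 from regressing one predictor on the other
r2 <- summary(lm(height2 ~ height))$r.squared

# With only two predictors, both share the same VIF
vif <- 1 / (1 - r2)
vif
```

Because height and height squared are almost perfectly correlated over this range, the VIF comes out far above either rule of thumb.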

data(women)
attach(women)
head(women)

##   height weight
## 1     58    115
## 2     59    117
## 3     60    120
## 4     61    123
## 5     62    126
## 6     63    129

myData <- lm(weight ~ height)
myData

## 
## Call:
## lm(formula = weight ~ height)
## 
## Coefficients:
## (Intercept)       height  
##      -87.52         3.45

summary(myData)

## 
## Call:
## lm(formula = weight ~ height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

plot(myData)

plot(height,weight, ylab = "Women's Weight", xlab = "Women's Height", main = "Women")

Womenresids <- myData$residuals
plot(Womenresids)

Our Formula: weight = -87.52 + (3.45)(height)
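As a quick check of this formula, the sketch below plugs one height into the fitted line by hand and again through `predict()`. The object names (`fit_lin`, `new_height`) are illustrative, and the model is refit with `data = women` so the snippet runs on its own.

```r
# Refit the straight-line model without relying on attach()
fit_lin <- lm(weight ~ height, data = women)

new_height <- 65  # illustrative value, inside the observed 58-72 range

# By hand: intercept + slope * height
by_hand <- unname(coef(fit_lin)[1] + coef(fit_lin)[2] * new_height)

# Same prediction through predict()
by_predict <- unname(predict(fit_lin, newdata = data.frame(height = new_height)))

by_hand
by_predict
```

Both values should agree, at roughly -87.52 + 3.45 * 65, or about 136.7.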

heightSquared <- height^2
quad <- lm(weight ~ height + heightSquared)
summary(quad)

## 
## Call:
## lm(formula = weight ~ height + heightSquared)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50941 -0.29611 -0.00941  0.28615  0.59706 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   261.87818   25.19677  10.393 2.36e-07 ***
## height         -7.34832    0.77769  -9.449 6.58e-07 ***
## heightSquared   0.08306    0.00598  13.891 9.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9994 
## F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16

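New Formula: weight = 261.88 + (-7.348)(height) + (0.08306)(height^2)

Check the p-value: the heightSquared term is significant (p = 9.32e-09).

How to plot the model: we can’t use abline because the fitted curve is no longer a straight line (there is more than one slope). Instead, we need to plot the values and the predicted values: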

coef(myData)
x <- seq(from = 58, to = 72, by = 0.1)  # grid of heights
x
y <- coef(quad)[1] + coef(quad)[2] * x + coef(quad)[3] * (x^2)  # predicted weights on the grid
y
length(y)
length(x)
plot(weight ~ height)
lines(x, y, lty = 3, col = 2)  # quadratic fit
abline(myData)  # straight-line fit for comparison
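Another way to draw the same curve, sketched here as an alternative to what we did in class, is to put the squared term inside the formula with `I()` and let `predict()` do the arithmetic. The names `fit_quad` and `grid` are my own.

```r
# Quadratic term written inside the formula, so predict() can rebuild it
fit_quad <- lm(weight ~ height + I(height^2), data = women)

# Fine grid of heights across the observed range
grid <- data.frame(height = seq(from = 58, to = 72, by = 0.1))
grid$weight_hat <- predict(fit_quad, newdata = grid)

plot(weight ~ height, data = women)
lines(grid$height, grid$weight_hat, lty = 3, col = 2)  # fitted curve
```

This avoids typing the coefficients out one by one, which helps once more terms are added.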

plot(weight ~ height, lty = 2)
abline(0,0)

names(quad)
plot(quad$residuals ~ quad$fitted.values)
abline(0,0)

newResiduals <- quad$residuals
plot(newResiduals)
abline(0,0)

We want there to be residuals both above and below the line: some overpredictions and some underpredictions, with no pattern.

There is still a little bit of a trend, so now we can add the height cubed term.
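A rough numeric companion to looking at the plot is to check the signs of the residuals in order of height. This is only a sketch; the helper `flips` and the object names are my own. A long run of same-signed residuals is what the banana shape looks like in numbers.

```r
fit_lin  <- lm(weight ~ height, data = women)
fit_quad <- lm(weight ~ height + I(height^2), data = women)

# +1 = underprediction (point above the fit), -1 = overprediction
signs_lin  <- sign(resid(fit_lin))
signs_quad <- sign(resid(fit_quad))

# Count how often the sign flips as height increases
flips <- function(s) sum(diff(s) != 0)

signs_lin
flips(signs_lin)
flips(signs_quad)
```

For the straight-line fit the residuals are positive at the short heights, negative through the middle, and positive again at the tall heights.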

heightCubed <- height^3
cube <- lm(weight ~ height + heightSquared + heightCubed)
cubeResid <- cube$residuals
plot(cubeResid)

summary(cube)

## 
## Call:
## lm(formula = weight ~ height + heightSquared + heightCubed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40677 -0.17391  0.03091  0.12051  0.42191 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -8.967e+02  2.946e+02  -3.044  0.01116 * 
## height         4.641e+01  1.366e+01   3.399  0.00594 **
## heightSquared -7.462e-01  2.105e-01  -3.544  0.00460 **
## heightCubed    4.253e-03  1.079e-03   3.940  0.00231 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2583 on 11 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9997 
## F-statistic: 1.679e+04 on 3 and 11 DF,  p-value: < 2.2e-16

They are all significant: height, height squared, and height cubed.

If you plot the model rather than the data, with plot(mod), R draws diagnostic plots, starting with residuals against fitted values.
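One more way to check whether each added term earns its place is to compare the three nested models with `anova()`. We did not do this in class; it is a sketch with my own object names, refitting the models with `I()` terms so it runs on its own.

```r
fit_lin  <- lm(weight ~ height, data = women)
fit_quad <- lm(weight ~ height + I(height^2), data = women)
fit_cube <- lm(weight ~ height + I(height^2) + I(height^3), data = women)

# Sequential F tests: each row compares a model with the one above it
comparison <- anova(fit_lin, fit_quad, fit_cube)
comparison
```

A small p-value in a row suggests the term added in that row improves the fit.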

Now, let’s look at the fish data

DON'T WANT INDEX ON THE X AXIS!! (That is just the row number of the fish!)

Use this instead:

plot(mod$residuals ~ mod$fitted.values)

library(alr3)
data(wblake)
head(wblake)
attach(wblake)

myMod <- lm(Age ~ Scale)
myMod
summary(myMod)
plot(Age, Length)
myResid <- myMod$residuals
plot(myResid)
abline(0,0)

plot(myMod$residuals ~ myMod$fitted.values)
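Since this residuals-against-fitted plot comes up for every model, it can be wrapped in a small function. The helper below is hypothetical (the name `plot_resid_vs_fitted` is mine), and it is shown on the women data so it runs without the alr3 package.

```r
# Hypothetical helper: residuals on the y axis, fitted values on the x axis
plot_resid_vs_fitted <- function(model, ...) {
  plot(fitted(model), resid(model),
       xlab = "Fitted values", ylab = "Residuals", ...)
  abline(h = 0, lty = 2)  # reference line at zero
  # Return the plotted values invisibly so they can be inspected
  invisible(data.frame(fitted = fitted(model), resid = resid(model)))
}

fit_lin <- lm(weight ~ height, data = women)
out <- plot_resid_vs_fitted(fit_lin, main = "women: weight ~ height")
```

The same call should work for the fish model, for example `plot_resid_vs_fitted(myMod)`.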