library(alr3)
## Warning: package 'alr3' was built under R version 3.3.3
## Loading required package: car
data(brains)
Let’s plot brain weight against body weight and take a look.
plot(BrainWt~BodyWt, brains)
It does not look good! Most animals are small with small brains, so the points pile up near the origin, and just a few big animals with big brains stretch out the axes. But let’s fit a model anyway and see what happens.
brainMod <- lm(BrainWt~BodyWt, brains)
summary(brainMod)
##
## Call:
## lm(formula = BrainWt ~ BodyWt, data = brains)
##
## Residuals:
## Min 1Q Median 3Q Max
## -810.15 -88.52 -79.65 -13.02 2050.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.00864 43.55574 2.089 0.0409 *
## BodyWt 0.96646 0.04767 20.276 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 334.7 on 60 degrees of freedom
## Multiple R-squared: 0.8726, Adjusted R-squared: 0.8705
## F-statistic: 411.1 on 1 and 60 DF, p-value: < 2.2e-16
plot(brainMod)
Though the summary of our model shows body weight as a significant predictor of brain weight, the diagnostic plots show that this model violates several conditions of linear regression. The model doesn’t fly as is; we need to make some changes and find a better one.
NOTE: the most important thing to look for is curvature in the residuals vs. fitted values plot. Curvature indicates that we have the wrong mean function, and in that case nothing else in the diagnostics matters: we know immediately that we need a better model if we want it to mean anything.
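To make the curvature check concrete, here is a minimal sketch on simulated data (not the brains data; the variable names `x`, `y`, `curvedMod`, and `bendCheck` are made up for illustration). It pulls out only the residuals vs. fitted panel using the `which` argument of `plot()` for `lm` objects, then does a rough numeric version of the same check by regressing the residuals on a quadratic in the fitted values.

```r
# Simulated data, for illustration only: the true mean function is curved,
# but we deliberately fit a straight line to it.
set.seed(1)
x <- runif(100, 1, 10)
y <- x^2 + rnorm(100, sd = 2)

curvedMod <- lm(y ~ x)

# which = 1 asks plot() for only the residuals vs. fitted panel.
plot(curvedMod, which = 1)

# A rough numeric version of the eyeball test: if the residuals bend,
# a quadratic term in the fitted values will pick it up.
res <- resid(curvedMod)
fit <- fitted(curvedMod)
bendCheck <- lm(res ~ fit + I(fit^2))
coef(bendCheck)[["I(fit^2)"]]
```

A quadratic coefficient clearly different from zero is one informal sign of a wrong mean function; the plot itself is still the primary tool.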
Possible transformations when your model looks crummy:
1) sqrt(Y) -> helps w/ heteroscedasticity, could help mean function
2) log(Y) -> needs Y > 0, helps w/ heteroscedasticity, could help mean function
3) 1/Y -> works well if you know of an inverse relationship between variables (duh)
betterBrainMod <- lm(log(BrainWt)~log(BodyWt), brains)
plot(betterBrainMod)
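The model above logs both the response and the predictor, which amounts to fitting a power law: log(Y) = b0 + b1 log(X) means Y is roughly exp(b0) * X^b1. The sketch below shows how one might read the coefficients and back-transform a prediction. It uses simulated data with a known exponent, since the point is the interpretation; the names `bodyWt`, `brainWt`, and `powerMod`, and the values 10 and 0.75, are assumptions made up for this example and are not estimates from the brains data.

```r
# Simulated power-law data, for illustration only (not the brains data).
set.seed(42)
bodyWt  <- exp(runif(200, log(0.01), log(5000)))        # hypothetical body weights
brainWt <- 10 * bodyWt^0.75 * exp(rnorm(200, sd = 0.3)) # multiplicative noise

powerMod <- lm(log(brainWt) ~ log(bodyWt))

# The slope estimates the exponent; exp(intercept) estimates the multiplier.
b_hat <- coef(powerMod)[["log(bodyWt)"]]
a_hat <- exp(coef(powerMod)[["(Intercept)"]])
c(multiplier = a_hat, exponent = b_hat)

# predict() returns values on the log scale, so exponentiate to get back
# to the original units.
newAnimal <- data.frame(bodyWt = 100)
exp(predict(powerMod, newAnimal))
```

One caveat: exponentiating a log-scale prediction gives something closer to a median than a mean on the original scale, so treat the back-transformed value as a typical brain weight, not an average one.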
Let’s try more transformations with another dataset.
data(stopping)
stopMod <- lm(Distance~Speed, stopping)
summary(stopMod)
##
## Call:
## lm(formula = Distance ~ Speed, data = stopping)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.410 -7.343 -1.334 5.927 35.608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.1309 3.2308 -6.231 5.04e-08 ***
## Speed 3.1416 0.1514 20.751 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.77 on 60 degrees of freedom
## Multiple R-squared: 0.8777, Adjusted R-squared: 0.8757
## F-statistic: 430.6 on 1 and 60 DF, p-value: < 2.2e-16
plot(stopMod)
Again, the summary of our model shows significance, but the diagnostic plots show that there’s room for improvement.
sqrtStopMod <- lm(sqrt(Distance)~Speed, stopping)
plot(sqrtStopMod)
summary(sqrtStopMod)
##
## Call:
## lm(formula = sqrt(Distance) ~ Speed, data = stopping)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.49948 -0.54761 0.00469 0.53153 1.54350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.932396 0.197909 4.711 1.5e-05 ***
## Speed 0.252466 0.009274 27.223 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7209 on 60 degrees of freedom
## Multiple R-squared: 0.9251, Adjusted R-squared: 0.9239
## F-statistic: 741.1 on 1 and 60 DF, p-value: < 2.2e-16
logStopMod <- lm(log(Distance)~Speed, stopping)
plot(logStopMod)
inverseStopMod <- lm((1/Distance)~Speed, stopping)
plot(inverseStopMod)
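Flipping between separate sets of diagnostic plots makes the three candidates hard to compare, so it can help to put their residuals vs. fitted panels side by side. The sketch below does this with R’s built-in `cars` data (columns `speed` and `dist`) as a stand-in, so that it runs without alr3; the list name `candidates` is made up for this example, and the resulting plots will not match the ones from `stopping`.

```r
# Stand-in data: the built-in `cars` data set, not alr3's `stopping`.
candidates <- list(
  sqrt    = lm(sqrt(dist) ~ speed, data = cars),
  log     = lm(log(dist) ~ speed, data = cars),
  inverse = lm(I(1 / dist) ~ speed, data = cars)
)

# One row, three panels: residuals vs. fitted for each transformation.
oldPar <- par(mfrow = c(1, 3))
for (name in names(candidates)) {
  plot(candidates[[name]], which = 1, main = name)
}
par(oldPar)  # restore the previous plotting layout
```

Each response here lives on a different scale, so R-squared and residual standard error are not directly comparable across the three fits; the shape of the residual plots is the fairer basis for choosing.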
It looks like the model relating speed to the square root of stopping distance comes closest to satisfying the conditions of linear regression, and the relationship between the predictor (speed) and the transformed response (square root of stopping distance) is still statistically significant.
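A model for the square root of distance predicts square roots, so predictions need to be squared before they can be read as distances. Here is a minimal sketch of that back-transformation, again using the built-in `cars` data as a stand-in for `stopping`; the names `sqrtCarsMod`, `newSpeeds`, `predSqrt`, and `predDist` are made up for illustration.

```r
# Stand-in data: the built-in `cars` data set, not alr3's `stopping`.
sqrtCarsMod <- lm(sqrt(dist) ~ speed, data = cars)

# Predictions and 95% prediction intervals on the square-root scale.
newSpeeds <- data.frame(speed = c(10, 20))
predSqrt  <- predict(sqrtCarsMod, newSpeeds, interval = "prediction")

# Squaring keeps the interval ordering only when the bounds are
# non-negative, so check that before back-transforming.
stopifnot(all(predSqrt[, "lwr"] >= 0))
predDist <- predSqrt^2
predDist
```

The squared interval endpoints still bracket the prediction, but the interval is no longer symmetric around it, which is what you would expect when the spread grows with speed.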