library(alr3)
## Warning: package 'alr3' was built under R version 3.3.3
## Loading required package: car
data(brains)

Let’s plot brain weight against body weight and see what the relationship looks like.

plot(BrainWt~BodyWt, brains)

It does not look good! Most of the points are small animals with small brains, and just a few big animals with big brains stretch out the axes. But let’s fit a model anyway and see what happens.

brainMod <- lm(BrainWt~BodyWt, brains)
summary(brainMod)
## 
## Call:
## lm(formula = BrainWt ~ BodyWt, data = brains)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -810.15  -88.52  -79.65  -13.02 2050.52 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 91.00864   43.55574   2.089   0.0409 *  
## BodyWt       0.96646    0.04767  20.276   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 334.7 on 60 degrees of freedom
## Multiple R-squared:  0.8726, Adjusted R-squared:  0.8705 
## F-statistic: 411.1 on 1 and 60 DF,  p-value: < 2.2e-16
plot(brainMod)

Though the summary of our model shows body weight as a significant predictor of brain weight, the diagnostic plots show that this model violates several conditions of linear regression. The residual summary hints at the same problem: the median residual is -79.65 while the maximum is 2050.52, so the residuals are strongly skewed. This model doesn’t fly as is. We need to make some changes and find a better model.

NOTE: the most important thing to look for is curvature in the residuals vs. fitted values plot. Curvature means we have the wrong mean function, and when the mean function is wrong the other diagnostics don’t matter much: we know immediately that we need a better model if we want it to mean anything.
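
Here is a minimal sketch of one way to check for that curvature with a number instead of by eye. It uses simulated data (not the brains data), and the variable names and the idea of regressing the residuals on a quadratic in the fitted values are my own illustration, not something from alr3. If you only want the one panel, `plot(fit, which = 1)` draws just the residuals vs. fitted plot.

```r
# Simulated data (an assumption for illustration): the true mean is curved.
set.seed(1)
x <- runif(200, 1, 10)
y <- 2 + x^2 + rnorm(200, sd = 2)

# Fit a straight line anyway.
fit <- lm(y ~ x)

# Regress the residuals on a quadratic in the fitted values.
# A clearly nonzero squared term suggests curvature left in the residuals.
curv <- lm(resid(fit) ~ poly(fitted(fit), 2))
curv_p <- summary(curv)$coefficients[3, 4]  # p-value of the quadratic term
curv_p
```

A tiny p-value here points the same way as a bowed residuals vs. fitted plot: the straight-line mean function is missing something.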

Possible transformations when your model looks crummy:

1) sqrt(Y) -> helps w/ heteroscedasticity, could help mean function
2) log(Y) -> needs Y > 0, helps w/ heteroscedasticity, could help mean function
3) 1/Y -> works well if you know of an inverse relationship between variables (duh)
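
To see what "helps w/ heteroscedasticity" can look like, here is a rough sketch on simulated data where sqrt(Y) is linear in X by construction. The data, the `spread_ratio` helper, and the choice to compare residual spread above and below the median fitted value are all assumptions made for illustration.

```r
# Simulated data (an assumption): sqrt(y) is linear in x with constant noise,
# so on the raw scale the spread of y grows with its mean.
set.seed(2)
x <- runif(1000, 1, 20)
y <- (2 + 0.5 * x + rnorm(1000, sd = 0.5))^2

# Hypothetical helper: sd of residuals for large fitted values
# divided by sd of residuals for small fitted values.
spread_ratio <- function(fit) {
  r <- resid(fit)
  f <- fitted(fit)
  hi <- f > median(f)
  sd(r[hi]) / sd(r[!hi])
}

raw_fit  <- lm(y ~ x)
sqrt_fit <- lm(sqrt(y) ~ x)

ratio_raw  <- spread_ratio(raw_fit)
ratio_sqrt <- spread_ratio(sqrt_fit)
c(raw = ratio_raw, sqrt = ratio_sqrt)
```

A ratio near 1 means the residual spread is about the same across the range of fitted values, which is what we want. The raw-scale fit should come out noticeably above 1 here.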

betterBrainMod <- lm(log(BrainWt)~log(BodyWt), brains)
plot(betterBrainMod)
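
Notice that this model logs both the response and the predictor, which goes a step beyond the Y-only transformations listed above. In a log-log model the slope is the exponent of a power-law relationship between the original variables. A quick sketch on simulated data, where the exponent 0.75, the multiplier 10, and the variable names are made up for illustration and are not estimates from the brains data:

```r
# Simulated power-law data (an assumption): brain = 10 * body^0.75 * noise.
set.seed(3)
body  <- exp(runif(100, log(0.01), log(5000)))  # spans several orders of magnitude
brain <- 10 * body^0.75 * exp(rnorm(100, sd = 0.3))

# Logging both sides turns the power law into a straight line:
# log(brain) = log(10) + 0.75 * log(body) + error
loglog_fit <- lm(log(brain) ~ log(body))
slope <- unname(coef(loglog_fit)[2])
slope
```

The fitted slope should land close to the exponent used to generate the data.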

Let’s try these transformations on another dataset, this time modeling stopping distance as a function of speed.

data(stopping)
stopMod <- lm(Distance~Speed, stopping)
summary(stopMod)
## 
## Call:
## lm(formula = Distance ~ Speed, data = stopping)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.410  -7.343  -1.334   5.927  35.608 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.1309     3.2308  -6.231 5.04e-08 ***
## Speed         3.1416     0.1514  20.751  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.77 on 60 degrees of freedom
## Multiple R-squared:  0.8777, Adjusted R-squared:  0.8757 
## F-statistic: 430.6 on 1 and 60 DF,  p-value: < 2.2e-16
plot(stopMod)

Again, we see the summary of our model shows significance, but our plots show that there’s room for improvement.

sqrtStopMod <- lm(sqrt(Distance)~Speed, stopping)
plot(sqrtStopMod)

summary(sqrtStopMod)
## 
## Call:
## lm(formula = sqrt(Distance) ~ Speed, data = stopping)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.49948 -0.54761  0.00469  0.53153  1.54350 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.932396   0.197909   4.711  1.5e-05 ***
## Speed       0.252466   0.009274  27.223  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7209 on 60 degrees of freedom
## Multiple R-squared:  0.9251, Adjusted R-squared:  0.9239 
## F-statistic: 741.1 on 1 and 60 DF,  p-value: < 2.2e-16
logStopMod <- lm(log(Distance)~Speed, stopping)
plot(logStopMod)

inverseStopMod <- lm((1/Distance)~Speed, stopping)
plot(inverseStopMod)

Of the models we tried, the one relating the square root of stopping distance to speed comes closest to satisfying the conditions of linear regression, and the relationship between response (stopping distance) and predictor (speed) is still statistically significant (t = 27.223, p < 2e-16).
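
One practical consequence of modeling sqrt(Distance): predictions come out on the square-root scale and have to be squared to get back to distance units. Here is a small sketch of that step. So that it runs without alr3 installed, it uses R's built-in `cars` data (columns `speed` and `dist`) as a stand-in for `stopping`; that swap and the speed value of 20 are my assumptions, so the numbers will not match the output above.

```r
# Built-in 'cars' data used as a stand-in for alr3's 'stopping' (an assumption).
sqrt_fit <- lm(sqrt(dist) ~ speed, data = cars)

# Predict at a hypothetical speed of 20.
new_speed <- data.frame(speed = 20)
pred_sqrt <- predict(sqrt_fit, newdata = new_speed)  # on the sqrt scale

# Square the prediction to return to the original distance units.
pred_dist <- unname(pred_sqrt)^2
pred_dist
```

Squaring the prediction is the simplest back-transformation. It is not exactly an estimate of the mean distance on the original scale, so treat it as a rough figure.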