Transforming to get linear trend

Let’s look at the brains data in the alr3 package to analyze the relationship between brain weight (BrainWt) and body weight (BodyWt).

library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(brains)
attach(brains)
head(brains)
##            BrainWt  BodyWt
## Arctic_fox  44.500   3.385
## Owl_monkey  15.499   0.480
## Beaver       8.100   1.350
## Cow        423.012 464.983
## Gray_wolf  119.498  36.328
## Goat       114.996  27.660

Now let’s look at a scatter plot of the two variables.

plot(BodyWt, BrainWt)

From this graph we can see that there doesn’t appear to be a linear relationship between the two variables.

But let’s investigate further by fitting a linear model and examining its diagnostic plots.

mymod <- lm(BrainWt ~ BodyWt)
plot(mymod)

Looking at the residuals vs. fitted plot, we see that a straight line clearly does not describe the relationship between the two variables on their original scales. This plot should always be checked first, before moving on to the other diagnostic plots.
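Calling plot() on a fitted model steps through four diagnostic plots, so it can help to request them explicitly. The sketch below uses simulated data rather than the brains data (the names x, y and fit are illustrative assumptions) to show the which argument and a 2 x 2 layout.

```r
# Simulated stand-in data with a curved (power-law) relationship.
# x, y and fit are hypothetical names, not part of the brains data.
set.seed(1)
x <- exp(runif(60, min = 0, max = 6))
y <- 10 * x^0.75 * exp(rnorm(60, sd = 0.3))
fit <- lm(y ~ x)

# Residuals vs. fitted only (the first of the four diagnostic plots)
plot(fit, which = 1)

# All four default diagnostic plots on one page
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)  # restore the previous plotting layout
```

The same calls should work unchanged on mymod, for example plot(mymod, which = 1).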

Transformation

Now we will try one of the common transformations to improve our diagnostic plots. First let’s try the square root of each variable.

mytranmod <- lm(sqrt(BrainWt) ~ sqrt(BodyWt))
plot(mytranmod)

This looks better than before, but we can try the log transformation to see whether it improves things further.

mylogmod <- lm(log(BrainWt) ~ log(BodyWt))
plot(mylogmod)

In the residuals vs. fitted plot for the log transformation, the variability is much closer to constant throughout. There are still a few outliers, but of the transformations tried here the log does the best job of producing a linear trend.

The second plot, the normal Q-Q plot, is also a good sign: the quantiles of the residuals closely follow those of a normal distribution.

Third is the scale-location plot. Ideally the red trend line would be flat. We do see some curvature, with smaller residuals at small fitted brain weights and larger residuals at medium fitted brain weights, but the trend is close enough to horizontal for the extent of our work right now.

Finally, the residuals vs. leverage plot has improved as well. No points fall outside the red Cook’s distance contours. This is much better than the original linear model.
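One way to read the log-log fit: a straight line on the log scale, log(BrainWt) = a + b * log(BodyWt), corresponds to a power law BrainWt = exp(a) * BodyWt^b on the original scale. The following is a minimal sketch with simulated data, not the brains data. The multiplier 9, the exponent 0.75, and every variable name are assumptions chosen only for illustration.

```r
# Simulated power-law data; body, brain and logfit are hypothetical names.
set.seed(42)
body  <- exp(runif(200, min = -3, max = 8))          # made-up body weights
brain <- 9 * body^0.75 * exp(rnorm(200, sd = 0.4))   # power law with multiplicative noise

logfit <- lm(log(brain) ~ log(body))
b <- coef(logfit)[["log(body)"]]    # estimated exponent (0.75 was used to simulate)
a <- coef(logfit)[["(Intercept)"]]  # estimated log multiplier (log(9) was used)

# Largest Cook's distance in the fit. plot() draws its contours at 0.5 and 1,
# so a maximum below 0.5 means no point falls outside them.
max_cook <- max(cooks.distance(logfit))
```

On the real data, coef(mylogmod) and cooks.distance(mylogmod) would give the corresponding quantities for the brains model.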

Transforming stopping data

We will try this exercise again with the stopping data in the alr3 package.

library(alr3)
data("stopping")
head(stopping)
##   Speed Distance
## 1     4        4
## 2     5        2
## 3     5        4
## 4     5        8
## 5     5        8
## 6     7        7

Now we will look at the plot of the two variables.

with(stopping, plot(Speed, Distance))

This plot appears to have more of a linear trend than the previous example, but we will again look at the diagnostic plots of the linear model before we come to any conclusions.

mymod2 <- lm(Distance ~ Speed , data = stopping)
plot(mymod2)

The residual plot of the simple linear model is solid, but out of curiosity let’s try a transformation to see if we can improve upon it.

mysqrtmod <- with(stopping, lm(sqrt(Distance) ~ Speed))
plot(mysqrtmod)

Score! The variability appears to be constant throughout, and the trend line in the residuals vs. fitted plot is almost perfectly horizontal. Better is the enemy of good, so we will stick with this transformation instead of trying to improve any further. (More than likely this is the best transformation anyway.)

The other three plots are also positive signs. The quantiles of our residuals follow those of a normal distribution. The trend line in the scale-location plot slopes slightly upward, but it is close to the horizontal line that would be ideal. And the residuals vs. leverage plot shows no points beyond the Cook’s distance contours, so no single data point has overwhelming influence on our coefficients.
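A model fit to sqrt(Distance) predicts on the square-root scale, so predictions have to be squared to get back to distance units. The sketch below illustrates this with R’s built-in cars data (columns speed and dist) as a stand-in for stopping. That substitution is an assumption made so the example runs without alr3, and the names sqrtfit, new_speeds, pred_sqrt and pred_dist are hypothetical.

```r
# Built-in cars data used as a stand-in for the stopping data.
sqrtfit <- lm(sqrt(dist) ~ speed, data = cars)

# Predict at two illustrative speeds
new_speeds <- data.frame(speed = c(10, 20))
pred_sqrt  <- predict(sqrtfit, newdata = new_speeds)  # on the square-root scale
pred_dist  <- pred_sqrt^2                             # back on the original distance scale
```

Squaring the fitted value is a simple back-transformation. It is not exactly the mean distance at that speed, because the square of an average is not the average of the squares.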