Transforming to Reveal a Linear Trend

Let’s use the brains data set in the alr3 package and take a look at the relationship between the body weight (BodyWt) and brain weight (BrainWt) variables.

library(alr3)   # provides the brains data set
data(brains)
attach(brains)  # lets us refer to BrainWt and BodyWt directly

Before we attempt to build a linear model, let’s visualize the data by making a scatterplot.

plot(BrainWt ~ BodyWt)

We can immediately see that there does not appear to be a useful linear relationship between the BodyWt and BrainWt variables. To be sure, we should check some diagnostic plots.

mymod <- lm(BrainWt ~ BodyWt)  # fit a simple linear regression
plot(mymod)                    # standard diagnostic plots

The Residuals vs Fitted plot is usually enough to determine there isn’t a linear relationship between our variables. In this case, we see that the points on this graph follow a clear pattern rather than random scatter, and the red fitted line is not even close to the horizontal dotted line at a value of 0.
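If we only want that one panel rather than all four diagnostic plots, the which argument of plot() for lm objects lets us request it alone:

plot(mymod, which = 1)  # Residuals vs Fitted only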

Transformation

Transforming our model helps bring out a linear trend. In class we mostly focused on the square root and log transformations. We can apply these transformations to one or both variables and determine which new model gives the best linear trend.
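As a quick sketch, transforming just one variable might look like taking the log of the response only:

plot(log(BrainWt) ~ BodyWt)

Here, though, we will transform both variables, starting with the square root.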

plot(sqrt(BrainWt) ~ sqrt(BodyWt))

We can see just from plotting this transformation that our data looks more linear, but still not quite as linear as we want. Let’s try the log transformation applied to both variables.

plot(log(BrainWt) ~ log(BodyWt))

This transformation looks much more linear. We can analyze the other diagnostic plots in R to be sure.

logmod <- lm(log(BrainWt) ~ log(BodyWt))  # fit the log-log model
plot(logmod)

In the Residuals vs Fitted plot, the model does not appear to be systematically over-predicting or under-predicting any values. The red line is also pretty close to the horizontal dotted line at 0, which indicates small residuals. The Normal Q-Q plot also has the points falling pretty close to the dotted line, so the transformation makes the residuals look normally distributed. The Scale-Location plot has a red line that isn’t horizontal, but it also isn’t overly concerning: we see some curvature where the smaller brain weights have smaller residuals and the medium brain weights have larger ones. The Cook’s distance plot has improved from the original simple linear regression model as well. All our points are well within the dotted red lines that mark points that influence our model too heavily.
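If we want to back up the visual check with numbers, we can compute the Cook’s distances directly. The 4/n cutoff below is one common rule of thumb, not the only convention:

# Cook’s distance for each observation in the log-log model
d <- cooks.distance(logmod)
# one common rule of thumb flags points with distance above 4/n
which(d > 4 / nrow(brains))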

Outliers

In class we also discussed outliers and tried to better explain what they are and why they matter. An outlier is any point that is well separated from the rest of our data. We can identify an outlier in the y-direction as having a studentized residual >= 2 or <= -2. Going back to the Cook’s distance plot, if a point lies outside the red dotted lines and is thus weighting our model too heavily, it is considered an outlier as a predictor, or in the x-direction.
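A minimal sketch of that studentized-residual check on our log-log model, using the rstudent() function from base R:

# studentized residuals for each observation in logmod
studres <- rstudent(logmod)
# flag any observations that are outliers in the y-direction
which(abs(studres) >= 2)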

Outliers can be difficult to explain. If we have an outlier, we can check that the observation is correct and, if so, consider whether there are any outside predictors acting on the data or even any special circumstances that caused the outlier. We can also consider whether the point should be included in the data at all when we build a model.
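As a hedged sketch of that last idea, we could refit the log-log model with any flagged rows excluded and compare the coefficients. Which rows get flagged, if any, depends on the checks above:

# rows flagged as y-direction outliers, if any
idx <- which(abs(rstudent(logmod)) >= 2)
# keep everything that was not flagged (safe even when idx is empty)
keep <- !(seq_len(nrow(brains)) %in% idx)
logmod2 <- lm(log(BrainWt) ~ log(BodyWt), data = brains, subset = keep)
coef(logmod)   # original fit
coef(logmod2)  # fit without the flagged rows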