Transformations

As if linear modeling isn’t already a tough enough concept, let’s try changing it up, literally. Sometimes our data looks awful when we get it, and transforming one of the variables may give us a clue about how to start interpreting it.

library(alr3)
## Loading required package: car
data(brains)
attach(brains)
plot(BrainWt~BodyWt)

The brain/body weight data does not appear linear at all. There are a couple of huge outliers. This makes sense given the huge range of animal weights: most animals weigh under 500 kg, and within that crowded corner of the plot it is hard to tell where the brain weights lie. Hmm… what can we do about it?
We’ll first create an SLR (simple linear regression) model on the original data, look at its residuals, and then decide how to transform. Linear model creation:

brainmod <- lm(BrainWt~BodyWt)
brainresids <- brainmod$residuals
brainfitted <- brainmod$fitted.values
plot(brainresids~brainfitted, xlab = "Fitted Values", ylab = "Brainmod Residuals")
abline(0,0)

qqnorm(brainresids)
qqline(brainresids)

Those residuals look awful because they’re not spread out across the fitted values and are very clumped. The Q-Q plot also shows HUGE outliers, many standard deviations away from the normal reference line.
My first instinct was to take the log of both the body and brain weights. However, you can always start with just one or the other and then decide whether to use both. I then created a new model, which I called logbrainmod.

LogBodyWt <- log(BodyWt)
LogBrainWt <- log(BrainWt)
logbrainmod <- lm(LogBrainWt~LogBodyWt)
plot(LogBrainWt~LogBodyWt)

Wow, doesn’t that just look beautiful in comparison!! There’s definitely a more linear trend here, even if it isn’t perfect. I also want to look at the residuals in this case.
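One way to see why logging both variables helps: a straight line on the log-log scale, log(y) = b0 + b1 * log(x), is the same thing as a power law y = exp(b0) * x^b1 on the original scale. Here is a minimal sketch on simulated data (the multiplier 10 and exponent 0.75 are made up for illustration, not estimates from the brains data) showing that the log-log slope recovers the exponent.

```r
# Simulated power-law data; every number here is an illustrative assumption,
# not a value taken from the brains data set.
set.seed(1)
n <- 60
body <- exp(rnorm(n, mean = 1, sd = 3))            # weights spanning many orders of magnitude
brain <- 10 * body^0.75 * exp(rnorm(n, sd = 0.3))  # power law with multiplicative noise

logmod <- lm(log(brain) ~ log(body))

# The slope on the log-log scale estimates the exponent (0.75 here);
# exp(intercept) estimates the multiplier (10 here).
loglog_slope <- unname(coef(logmod)["log(body)"])
loglog_mult <- unname(exp(coef(logmod)["(Intercept)"]))
print(c(slope = loglog_slope, multiplier = loglog_mult))

plot(log(brain) ~ log(body))
abline(logmod)
```

So when a log-log plot looks straight, the slope can be read as "brain weight grows like body weight raised to this power".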

logbrainresids <- logbrainmod$residuals
logbrainfitted <- logbrainmod$fitted.values
plot(logbrainresids~logbrainfitted, xlab = "Fitted Values", ylab = "LogBrainmod Residuals")
abline(0,0)

The residuals look much more spread out. We can check their normality using a qqnorm plot:

qqnorm(logbrainresids)
qqline(logbrainresids)

They’re much better than the previous model’s, without huge outliers! The next way we can judge how appropriate the transformation is for our linear model is a scale-location plot, where we plot the square root of the absolute value of the standardized residuals versus the fitted values. Taking the square root reduces the skewness of the absolute residuals, which makes it easier to recognize trends in their spread.

# Take abs() of the standardized residuals first: sqrt() of a negative number is NaN.
sqrtbrainresids <- sqrt(abs(rstandard(brainmod)))
plot(sqrtbrainresids~brainfitted, main = "Scale Location Plot original values", ylab = "Sqrt of |Standardized OG Model Residuals|")

sqrtresids <- sqrt(abs(rstandard(logbrainmod)))
plot(sqrtresids~logbrainfitted, main="Scale Location Plot transformed values", ylab = "Sqrt of |Standardized Log Model Residuals|")

As you can tell, the transformed values are much more spread out and don’t have a pattern or trend, which is what we want.

After all that individual plotting, I’m exhausted. Next, we were shown in class the command plot(model_name), which draws the standard diagnostic plots for us. I’ll use the stopping distance data to display this fantastic command’s abilities.

Let’s look at the data we’re working with first.

library(alr3)
data(stopping)
attach(stopping)
plot(Distance~Speed)

I’d say there looks to be a very strong relationship between distance and speed, as is to be expected.
Using plot() to draw all the helpful graphs, I will compare the original values with some transformed values and decide which gives the best linear model.

stopmod1 <- lm(Distance~Speed)
plot(stopmod1)

We would prefer the red trend line in the first graph (residuals vs fitted) to follow the grey horizontal line, rather than the quadratic shape it appears to follow. A flat red line would mean the residuals have no pattern.
Let’s try getting rid of that pattern by taking the square root of one of the variables and seeing whether it helps with these residuals.

stopmod2 <- lm(Distance~sqrt(Speed))
plot(stopmod2)

Nope, that only made it worse. How about taking the square root of distance?

stopmod3 <- lm(sqrt(Distance)~Speed)
plot(stopmod3)

Gosh darn that just about did it! The red line on the first graph looks very close to following the straight horizontal line, which is exactly what we want.
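If you want a number to go along with eyeballing the red line, one option (an extra check, not one of the plots above) is to add a squared term to the model and look at its t statistic: a value far from 0 suggests there is still curvature left over. A minimal sketch on simulated data, where I assume the square root of distance is truly linear in speed:

```r
# Simulated stopping data (an assumption for illustration): sqrt(distance) is
# linear in speed, so distance itself is quadratic in speed.
set.seed(1)
speed <- runif(60, min = 5, max = 40)
dist <- (1 + 0.25 * speed + rnorm(60, sd = 0.3))^2

# t statistic of an added squared term; far from 0 suggests leftover curvature.
curvature_t <- function(y, x) {
  fit <- lm(y ~ x + I(x^2))
  summary(fit)$coefficients["I(x^2)", "t value"]
}

t_raw <- curvature_t(dist, speed)              # analogous to Distance ~ Speed
t_sqrt_dist <- curvature_t(sqrt(dist), speed)  # analogous to sqrt(Distance) ~ Speed
print(c(raw = t_raw, sqrt_dist = t_sqrt_dist))
```

The raw model should show a much larger t statistic than the square-root model, which matches what the bent versus flat red lines are telling us.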

The great thing about these graphs is that they label the most extreme points by row number for us, rather than making us do the grunt work of hunting for the outliers. The command also produces a whole bunch of graphs at once, which is prime for humans, who might otherwise mistype something while building each plot by hand and have to troubleshoot for hours on end.
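Two handy variations on plot(model_name), sketched with R’s built-in cars data (columns speed and dist) as a stand-in because it ships with R: par(mfrow = c(2, 2)) puts all four graphs on one page, and the which argument picks out a single graph.

```r
# cars is a built-in data set (columns speed and dist), used here as a stand-in.
carmod <- lm(sqrt(dist) ~ speed, data = cars)

op <- par(mfrow = c(2, 2))  # 2 x 2 grid so all four diagnostics fit on one page
plot(carmod)
par(op)                     # restore the previous plotting layout

plot(carmod, which = 1)     # residuals vs fitted only
plot(carmod, which = 3)     # scale-location only

# The scale-location y-axis values, computed by hand for comparison.
scaleloc <- sqrt(abs(rstandard(carmod)))
```

Saving and restoring par() keeps the 2 x 2 layout from leaking into later plots.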

Outliers

We touched on outliers at the end of class and noted the outliers in our models earlier in this learning log. The first priority in dealing with an outlier is finding out how far away it is in the y direction. The rule we follow for deciding whether a point is an outlier is that if the absolute value of its studentized residual is > 2, then we can count it as an outlier in the y direction.
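The > 2 rule is easy to apply with rstudent(), which returns the studentized residuals of a fitted model. A minimal sketch on simulated data with one outlier planted on purpose (the data and the choice of row 15 are made up for illustration):

```r
# Simulated data with a planted outlier; all values are illustrative assumptions.
set.seed(42)
x <- 1:30
y <- 2 + 3 * x + rnorm(30)
y[15] <- y[15] + 20          # push one point far off the line in the y direction

outmod <- lm(y ~ x)
stud <- rstudent(outmod)     # studentized residuals, one per observation

flagged <- which(abs(stud) > 2)  # row numbers that break the > 2 rule
print(flagged)
print(round(stud[flagged], 2))
```

Keep in mind that even in well-behaved data roughly 5% of points will land beyond 2 just by chance, so a flag is a reason to look closer rather than a verdict.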
Next, we have to determine how to deal with the outliers…
1) it could be an incorrectly recorded point, like accidentally adding a 0 so that a 20 kg animal becomes a 200 kg animal, which would affect the linear model.
2) if it is a correctly recorded point,
a) you must determine why it is an outlier (maybe a car with a huge stopping distance recently had its brake pads recalled)
b) and you should look for a predictor to add that could explain this and possibly other outliers.
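To see how much a single mis-recorded value like the one in 1) can matter, here is a sketch on simulated data (all numbers invented for illustration) where one 20 kg weight is typed in as 200 kg:

```r
# Simulated weights; the linear relationship and noise level are assumptions.
set.seed(7)
body <- seq(10, 200, by = 10)                # true body weights in kg
brain <- 5 + 0.5 * body + rnorm(20, sd = 2)  # made-up linear relationship

body_typo <- body
body_typo[2] <- 200                          # the 20 kg animal recorded as 200 kg

slope_good <- unname(coef(lm(brain ~ body))["body"])
slope_typo <- unname(coef(lm(brain ~ body_typo))["body_typo"])
print(c(correct = slope_good, with_typo = slope_typo))
```

One wrong row out of 20 is enough to pull the slope down noticeably, which is why checking for recording errors comes first.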