title: "Learning Log 14"
author: "Jimmy Kroll"
date: "March 20, 2018"
output: html_document

This learning log discusses variable transformations, which help our linear models meet certain assumptions when the data follow particular (nonlinear) relationships.

Residual Analysis

When we evaluate our linear models, one of the things we're interested in is our residuals. Examining the residuals shows us whether our model assumptions hold, specifically constant variance (homoscedasticity) of the residuals and randomness of the residuals. These assumptions most often fail when the relationship in the data isn't truly linear. However, there are certain adjustments, known as transformations, that we can perform on our data to create a linear relationship.
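As a quick refresher, here is a minimal sketch of pulling residuals out of a fitted model and plotting them against the fitted values. Since we haven't loaded our own data yet, it uses R's built-in cars data purely as a stand-in example.

fit <- lm(dist ~ speed, data = cars)   # stand-in model on the built-in cars data
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                 # residuals should scatter randomly around this line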

Diagnostic Plots

As we mentioned previously, the goal of transforming our variables is to improve our model assumptions by adjusting the residual plot and other diagnostic plots. To accompany our understanding of variable transformations and diagnostic plots, we will use the brains data set.

library(alr4)
## Warning: package 'alr4' was built under R version 3.3.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.3.3
## Loading required package: effects
## Warning: package 'effects' was built under R version 3.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.3.3
## 
## Attaching package: 'carData'
## The following objects are masked from 'package:car':
## 
##     Guyer, UN, Vocab
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
data(brains)
head(brains)
##            BrainWt  BodyWt
## Arctic fox  44.500   3.385
## Owl monkey  15.499   0.480
## Beaver       8.100   1.350
## Cow        423.012 464.983
## Gray wolf  119.498  36.328
## Goat       114.996  27.660
dim(brains)
## [1] 62  2
with(brains, plot(BrainWt, BodyWt))

As we take a look at our first plot, we see that the data contain a few outliers and perhaps a nonlinear trend, even among points with BrainWt less than 1000.

For the sake of this first exercise, we will remove the three data points with a brain weight greater than 1000 using the subset command.

brains2 <- subset(brains, BrainWt<1000)
attach(brains2)
plot(brains2)

With our subsetted data set, we will now look at our diagnostic plots to see if a basic linear model between BrainWt and BodyWt satisfies our model assumptions.

Mod1 <- lm(BodyWt~BrainWt)
plot(Mod1)

A really easy way to check our diagnostic plots is to call the plot command with our model name as the only argument. This gives us four plots: our standard residual plot (residuals vs. fitted values), the normal Q-Q plot, a scale-location plot of the standardized residuals, and a leverage plot. We are very familiar with the first two, but we haven't seen the latter two before. The scale-location plot plots the square root of the absolute standardized residuals against the model's fitted values; this reduces the skewness of the displayed residuals and lets any trends in their spread appear more clearly. We will discuss the leverage plot at the end of this LL.
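If you would rather step through the plots one at a time instead of cycling through them interactively, plot.lm also accepts a which argument; a minimal sketch (the plot numbering follows R's documentation for plot.lm):

plot(Mod1, which = 1)  # residuals vs. fitted values
plot(Mod1, which = 2)  # normal Q-Q plot
plot(Mod1, which = 3)  # scale-location plot
plot(Mod1, which = 5)  # residuals vs. leverage
par(mfrow = c(2, 2)); plot(Mod1)  # or arrange all four in one 2x2 window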

In looking at our residual plot, there is a very apparent trend in the residuals, as the red trend line varies greatly from the y = 0 line. We could continue to check our remaining three plots, but since the assumptions shown in the residual plot are violated, we need to stop there. Of all the diagnostic plots, the residual plot is the most important one to satisfy; meeting its assumptions is essential for the model.

Since the current data can’t quite be explained by a linear model, we will look to transform the data in order to use a linear model.

Transformations

There are three main transformations that we can make on the response or predictor variable to make the data more linear in shape: square root (or some other power), logarithmic, or inverse.

The square root (or power) transformation can help reduce residual variance that grows as the predictor increases (heteroscedasticity). The log transformation can also help with heteroscedasticity, but more importantly, it helps straighten the data. The inverse transformation (1/y or 1/x) helps create a more linear relationship between x and y if you believe there is an inverse relationship between the two variables.
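As a minimal sketch of what these look like in R, each transformation can be applied directly inside the model formula (the names m_sqrt, m_log, and m_inv are just labels introduced here; the same functions can be wrapped around the predictor instead of, or in addition to, the response):

m_sqrt <- lm(sqrt(BodyWt) ~ BrainWt, data = brains2)  # square root / power
m_log  <- lm(log(BodyWt) ~ BrainWt, data = brains2)   # logarithmic
m_inv  <- lm(I(1/BodyWt) ~ BrainWt, data = brains2)   # inverse; I() protects the arithmetic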

Let's try these transformations on the brains2 data to see if we can improve our diagnostic plots. From our scatterplot, it seemed that BodyWt was increasing faster than BrainWt, so we'll try transforming the response variable first.

Mod2 <- lm(sqrt(BodyWt)~BrainWt)
plot(Mod2)

The square root transformation seems relatively good, but we still have a large group of data points clustered together. The log transformation on the predictor should help with this.

Mod3 <- lm(sqrt(BodyWt)~log(BrainWt))
plot(Mod3)

Now our data are spread out more, but a trend in the residuals has appeared. Let's try a log transformation on both variables to eliminate it.

Mod4 <- lm(log(BodyWt)~log(BrainWt))
plot(Mod4)

By doing a log transformation on both body and brain weight, we’ve straightened our data and decreased the variance of our residuals so that a linear model is appropriate to explain the relationship between the two variables.
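As an aside, a model that is linear in log(BodyWt) and log(BrainWt) corresponds to a power relationship on the original scale: log(BodyWt) = b0 + b1*log(BrainWt) implies BodyWt = exp(b0) * BrainWt^b1. A minimal sketch of overlaying that fitted curve on the original scatterplot (power_fit is just a name introduced here):

b <- coef(Mod4)                               # intercept b0 and slope b1
power_fit <- function(x) exp(b[1]) * x^b[2]   # back-transformed fit: exp(b0) * x^b1
plot(BrainWt, BodyWt)
curve(power_fit, add = TRUE)                  # overlay the fitted power curve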

If we go ahead and check our other diagnostic plots, they are promising as well: the normal Q-Q plot follows the y = x line and the standardized residuals also appear random.

We will discuss the leverage plot and outliers in the next section.

Outliers

As you probably know from previous stats classes or just from living life, outliers are data observations that do not quite fit with the rest of the data. In a two-variable world, this could be an observation whose x-value is far from the rest of the data, or an observation whose y-value is separated from the other data points. This section will focus on two types of outliers that can influence our models, and on measures to find these values.

The first are influential points. An influential point is one whose presence significantly changes our point estimates or standard errors. The second are leverage points. Leverage points are outliers with respect to x such that their horizontal separation from the rest of the data can pull our model toward them, regardless of the other data points that are clustered more closely together.

Most times we can identify outliers in a scatterplot, but we can also use our leverage plot to assess hidden outliers. Let’s pull up our leverage plot from our original model.

Mod1 <- lm(BodyWt~BrainWt)
plot(Mod1)

The fourth plot shows our standardized residuals vs. their leverage. The dotted red lines mark contours of Cook's distance, which measures how much influence each point has on the fit; points beyond these bounds have too much influence on the model and should be investigated or removed. When we compare this plot to the one from our final model, we see that our high-leverage points have all been pulled into an appropriate range for the model.
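For a numeric rather than visual check, R provides hatvalues and cooks.distance for any lm object. A minimal sketch (the 4/n cutoff is only a common rule of thumb, not a hard rule):

lev <- hatvalues(Mod1)         # leverage of each observation
cd  <- cooks.distance(Mod1)    # Cook's distance for each observation
which(cd > 4 / nrow(brains2))  # flag unusually influential points (rule of thumb)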