Loading the necessary packages for this tutorial:
library(openintro)
library(dplyr)
library(ggplot2)
library(e1071)
Sometimes we have to deal with data that don't have a normal shape but a skewed one. By the book, negative skewness reveals that the mean of the values is less than the median, meaning the distribution is left-skewed. Positive skewness suggests that the mean is larger than the median, and the distribution is right-skewed. For example, let's consider the histograms of the BodyWt and BrainWt variables from the mammals dataset.
ggplot(data = mammals, aes(BodyWt)) + geom_histogram()
ggplot(data = mammals, aes(BrainWt)) + geom_histogram()
The histograms indicate that both variables are right-skewed. We can confirm this with the skewness() function from the e1071 package.
skewness(mammals$BodyWt)
## [1] 6.249429
skewness(mammals$BrainWt)
## [1] 4.828829
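Both values are strongly positive, matching what the histograms show. As a sanity check, we can also compare the mean and the median directly; for right-skewed data the mean should exceed the median:
# For right-skewed data, extreme values pull the mean above the median
mean(mammals$BodyWt)
median(mammals$BodyWt)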
Keep in mind that heavily skewed data can have a negative impact on linear regression: a handful of extreme values can dominate the least-squares fit. For example, if we take the scatterplot of BodyWt and BrainWt, we will see something very odd.
ggplot(mammals, aes(BodyWt, BrainWt)) + geom_point()
Sometimes there really is no meaningful relationship between the two variables. However, in many situations a simple transformation of one or both of the variables can reveal a clear relationship.
ggplot2 provides the scale_x_log10() and scale_y_log10() functions, which apply a base-10 log transformation to each axis.
# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()
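An equivalent approach, and one place the dplyr package we loaded earlier comes in handy, is to transform the variables themselves before plotting. Here is a minimal sketch; the mammals_log, log_body, and log_brain names are just illustrative:
# Create base-10 log columns explicitly instead of transforming the axes
mammals_log <- mammals %>%
  mutate(log_body  = log10(BodyWt),
         log_brain = log10(BrainWt))
ggplot(mammals_log, aes(log_body, log_brain)) + geom_point()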
Let’s create a linear regression model to predict BrainWt from BodyWt using the raw values from the dataset.
lm.model <- lm(BrainWt ~ BodyWt, data = mammals)
Now, let’s take a look at the summary:
summary(lm.model)
##
## Call:
## lm(formula = BrainWt ~ BodyWt, data = mammals)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -810.07  -88.52  -79.64  -13.02 2050.33 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 91.00440   43.55258    2.09   0.0409 *  
## BodyWt       0.96650    0.04766   20.28   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 334.7 on 60 degrees of freedom
## Multiple R-squared: 0.8727, Adjusted R-squared: 0.8705
## F-statistic: 411.2 on 1 and 60 DF, p-value: < 2.2e-16
Notice that the residual standard error is very high: 334.7, on the scale of brain weight in grams.
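Beyond that headline number, a quick residuals-versus-fitted plot makes the problem visible. This uses base R's built-in diagnostic plot for lm objects:
# Residuals vs. fitted values; a strong pattern or extreme outliers signal a poor fit
plot(lm.model, which = 1)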
Maybe a log transformation of the values will help us improve the model. For that, we will use the log1p() function, which computes log(1 + x) using the natural logarithm; the added 1 keeps the transformation well-defined when a value is zero.
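A quick illustration of why log1p() is safer than a plain log() here:
log(0)    # -Inf: a zero would break a plain log transform
log1p(0)  # 0: log1p handles zeros gracefully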
lm_log.model <- lm(log1p(BrainWt) ~ log1p(BodyWt), data = mammals)
Now, let’s take a look at the summary:
summary(lm_log.model)
##
## Call:
## lm(formula = log1p(BrainWt) ~ log1p(BodyWt), data = mammals)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27379 -0.57842 -0.04617  0.40597  2.05871 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.40033    0.14208   9.856 3.69e-14 ***
## log1p(BodyWt)  0.89959    0.04578  19.652  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7927 on 60 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8633
## F-statistic: 386.2 on 1 and 60 DF, p-value: < 2.2e-16
Our residual standard error decreased to 0.79. Bear in mind that this value is on the log scale, so it is not directly comparable to the 334.7 from before; still, together with the much more symmetric residuals, it indicates a far better fit than our previous model.
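To use this model on the original scale, remember to undo the transformation with expm1(), the inverse of log1p(). A minimal sketch, predicting the brain weight in grams of a hypothetical 60 kg mammal (the new_mammal name is just illustrative):
# Predict on the log scale, then back-transform with expm1() (inverse of log1p)
new_mammal <- data.frame(BodyWt = 60)
pred_log <- predict(lm_log.model, newdata = new_mammal)
expm1(pred_log)  # predicted BrainWt in grams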