Loading the necessary libraries for this tutorial:

library(openintro)
library(dplyr)
library(ggplot2)
library(e1071)

Sometimes we have to deal with data that don't have a normal shape but a skewed one. By the book, negative skewness means that the mean of the values is less than the median, so the data distribution is left-skewed. Positive skewness means that the mean is greater than the median, so the data distribution is right-skewed. For example, let's consider the histograms of the BodyWt and BrainWt variables from the mammals dataset.

# Histogram of body weight
ggplot(data = mammals, aes(BodyWt)) + geom_histogram()

# Histogram of brain weight
ggplot(data = mammals, aes(BrainWt)) + geom_histogram()

The histograms indicate that both variables are right-skewed. We can confirm this with the skewness() function from the e1071 package.

skewness(mammals$BodyWt)
## [1] 6.249429
skewness(mammals$BrainWt)
## [1] 4.828829
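
This matches the mean-median rule described above: with such heavy right skew, the mean should sit well above the median. A quick check (both comparisons should come out TRUE for this dataset):

# For right-skewed data the mean exceeds the median
mean(mammals$BodyWt) > median(mammals$BodyWt)
mean(mammals$BrainWt) > median(mammals$BrainWt)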

Keep in mind that skewed data can hurt a linear regression. For example, a scatterplot of BodyWt against BrainWt looks very odd:

# Scatterplot of the raw values
ggplot(mammals, aes(BodyWt, BrainWt)) + geom_point()

Sometimes there really is no meaningful relationship between the two variables. However, in many situations a simple transformation of one or both of the variables can reveal a clear relationship.

ggplot2 provides the scale_x_log10() and scale_y_log10() functions, which perform a base-10 log transformation of each axis.

# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() +
  scale_x_log10() + scale_y_log10()
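
Note that these scales transform the axes at plot time. An equivalent sketch, if you prefer to transform the data itself, is to apply log10() inside aes(); the points are identical, but the axis labels then show log units:

# Same picture, transforming the variables instead of the axes
ggplot(data = mammals, aes(x = log10(BodyWt), y = log10(BrainWt))) +
  geom_point()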

Let’s create a linear regression model to predict BrainWt from BodyWt, using the raw values from the dataset.

# Fit a simple linear model on the raw scale
lm.model <- lm(BrainWt ~ BodyWt, data = mammals)

Now, let’s take a look into the summary:

summary(lm.model)
## 
## Call:
## lm(formula = BrainWt ~ BodyWt, data = mammals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -810.07  -88.52  -79.64  -13.02 2050.33 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 91.00440   43.55258    2.09   0.0409 *  
## BodyWt       0.96650    0.04766   20.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 334.7 on 60 degrees of freedom
## Multiple R-squared:  0.8727, Adjusted R-squared:  0.8705 
## F-statistic: 411.2 on 1 and 60 DF,  p-value: < 2.2e-16

Notice how high the residual standard error is: 334.7, in the units of BrainWt.
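
One way to see what is going on is a residuals-versus-fitted plot, a standard diagnostic for lm objects (a quick sketch using base R; which = 1 selects the first diagnostic plot):

# Residuals vs. fitted values for the raw-scale model
plot(lm.model, which = 1)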

Maybe a log transformation of the values can help us improve the model. For that, we will use the log1p() function, which computes log(1 + x), the natural logarithm of one plus its argument; the +1 keeps the transformation finite if a value happens to be zero.
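
A quick illustration of why the +1 matters:

log(0)    # -Inf: the plain logarithm blows up at zero
log1p(0)  # 0: log(1 + 0) stays finite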

# Fit the same model on log1p-transformed values
lm_log.model <- lm(log1p(BrainWt) ~ log1p(BodyWt), data = mammals)

Now, let’s take a look into the summary:

summary(lm_log.model)
## 
## Call:
## lm(formula = log1p(BrainWt) ~ log1p(BodyWt), data = mammals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27379 -0.57842 -0.04617  0.40597  2.05871 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.40033    0.14208   9.856 3.69e-14 ***
## log1p(BodyWt)  0.89959    0.04578  19.652  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7927 on 60 degrees of freedom
## Multiple R-squared:  0.8655, Adjusted R-squared:  0.8633 
## F-statistic: 386.2 on 1 and 60 DF,  p-value: < 2.2e-16

Our residual standard error decreased to 0.79. Strictly speaking, the two values aren't directly comparable, since this model is fit in log units while the first is fit on the raw scale; still, together with the scatterplots above, it is clear that the log-log model is a much better fit.
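
If we want predictions on the original scale, we can invert the transformation with expm1(), the inverse of log1p(). A minimal sketch, using an arbitrary illustrative body weight:

# Predict brain weight for a mammal with BodyWt = 60;
# predict() applies log1p() from the formula, expm1() undoes it
new_data <- data.frame(BodyWt = 60)
expm1(predict(lm_log.model, newdata = new_data))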