Loading the necessary packages for this tutorial:
library(openintro)
library(dplyr)
library(ggplot2)
library(e1071)
Sometimes we have to deal with data that don't have a normal shape but a skewed one. By the book, negative skewness reveals that the mean of the values is less than the median, meaning the distribution is left-skewed. Positive skewness suggests that the mean is larger than the median, and the distribution is right-skewed. For example, let's consider the histograms of the BodyWt and BrainWt variables from the mammals dataset.
ggplot(data = mammals, aes(BodyWt)) + geom_histogram()
ggplot(data = mammals, aes(BrainWt)) + geom_histogram()
The histograms indicate that both variables are right-skewed. We can confirm this with the skewness() function from the e1071 package.
skewness(mammals$BodyWt)
## [1] 6.249429
skewness(mammals$BrainWt)
## [1] 4.828829
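Both values are strongly positive, matching what the histograms show. As a sanity check, we can also compare the mean and the median directly; for right-skewed data the mean should exceed the median:
# For right-skewed data, extreme values pull the mean above the median
mean(mammals$BodyWt)
median(mammals$BodyWt)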
Keep in mind that heavily skewed data can have a negative impact on linear regression: a handful of extreme values can dominate the least-squares fit. For example, if we take the scatterplot of BodyWt and BrainWt, we will see something very odd.
ggplot(mammals, aes(BodyWt, BrainWt)) + geom_point()
Sometimes there really is no meaningful relationship between the two variables. However, in many situations a simple transformation of one or both of the variables can reveal a clear relationship.
ggplot2 provides the scale_x_log10() and scale_y_log10() functions, which apply a base-10 log transformation to each axis.
# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()
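An equivalent approach, and one place the dplyr package we loaded earlier comes in handy, is to transform the variables themselves before plotting. Here is a minimal sketch; the mammals_log, log_body, and log_brain names are just illustrative:
# Create base-10 log columns explicitly instead of transforming the axes
mammals_log <- mammals %>%
  mutate(log_body  = log10(BodyWt),
         log_brain = log10(BrainWt))
ggplot(mammals_log, aes(log_body, log_brain)) + geom_point()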
Let’s create a linear regression model to predict BrainWt from BodyWt using the raw values from the dataset.
lm.model <- lm(BrainWt ~ BodyWt, data = mammals)
Now, let’s take a look at the summary:
summary(lm.model)
##
## Call:
## lm(formula = BrainWt ~ BodyWt, data = mammals)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -810.07  -88.52  -79.64  -13.02 2050.33 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 91.00440   43.55258    2.09   0.0409 *  
## BodyWt       0.96650    0.04766   20.28   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 334.7 on 60 degrees of freedom
## Multiple R-squared: 0.8727, Adjusted R-squared: 0.8705
## F-statistic: 411.2 on 1 and 60 DF, p-value: < 2.2e-16
Notice that the residual standard error is very high: 334.7, on the scale of brain weight in grams.
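Beyond that headline number, a quick residuals-versus-fitted plot makes the problem visible. This uses base R's built-in diagnostic plot for lm objects:
# Residuals vs. fitted values; a strong pattern or extreme outliers signal a poor fit
plot(lm.model, which = 1)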
Maybe a log transformation of the values will help us improve the model. For that, we will use the log1p() function, which computes log(1 + x) using the natural logarithm; the added 1 keeps the transformation well-defined when a value is zero.
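A quick illustration of why log1p() is safer than a plain log() here:
log(0)    # -Inf: a zero would break a plain log transform
log1p(0)  # 0: log1p handles zeros gracefully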
lm_log.model <- lm(log1p(BrainWt) ~ log1p(BodyWt), data = mammals)
Now, let’s take a look at the summary:
summary(lm_log.model)
##
## Call:
## lm(formula = log1p(BrainWt) ~ log1p(BodyWt), data = mammals)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27379 -0.57842 -0.04617  0.40597  2.05871 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.40033    0.14208   9.856 3.69e-14 ***
## log1p(BodyWt)  0.89959    0.04578  19.652  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7927 on 60 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8633
## F-statistic: 386.2 on 1 and 60 DF, p-value: < 2.2e-16
Our residual standard error decreased to 0.79. Bear in mind that this value is on the log scale, so it is not directly comparable to the 334.7 from before; still, together with the much more symmetric residuals, it indicates a far better fit than our previous model.
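To use this model on the original scale, remember to undo the transformation with expm1(), the inverse of log1p(). A minimal sketch, predicting the brain weight in grams of a hypothetical 60 kg mammal (the new_mammal name is just illustrative):
# Predict on the log scale, then back-transform with expm1() (inverse of log1p)
new_mammal <- data.frame(BodyWt = 60)
pred_log <- predict(lm_log.model, newdata = new_mammal)
expm1(pred_log)  # predicted BrainWt in grams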