The R function boxCox from the car package
estimates, by maximum likelihood, a power transformation of the response
variable in a model. Transforming the response this way can produce a more
linear relationship between the predictor and dependent variables.
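For reference, the Box-Cox family itself is short enough to sketch in base R. The helper name box_cox below is hypothetical (it is not part of car). It follows the standard definition, \((y^\lambda - 1)/\lambda\) for \(\lambda \neq 0\) and \(\log y\) for \(\lambda = 0\), and assumes strictly positive input.

```r
# Hypothetical helper, not part of car: the standard Box-Cox power family.
# Assumes y is strictly positive.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) {
    log(y)                    # limiting case as lambda -> 0
  } else {
    (y^lambda - 1) / lambda
  }
}

y <- c(1, 2, 4)
box_cox(y, 1)   # lambda = 1 only shifts the data: y - 1
box_cox(y, 0)   # lambda = 0 is the log transformation
```

A \(\lambda\) of 1 leaves the shape of the data unchanged, which is why values of \(\lambda\) far from 1 signal that a transformation may help.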
We’ll use the built-in mtcars dataset. Let’s plot mpg, the variable
we’d like to predict, against horsepower:
data(mtcars)
plot(mtcars$hp, mtcars$mpg)
The relationship between horsepower and miles per gallon looks roughly linear, but we can fit a simple linear model to evaluate whether an ordinary least squares (OLS) regression is adequate:
# Create model to predict MPG from Horsepower
lm_hp <- lm(mpg ~ hp, mtcars)
summary(lm_hp)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.7121 -2.1122 -0.8854  1.5819  8.2360
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
We can plot our model to begin evaluating how well it predicts MPG.
Calling the plot function on an lm object produces a set of
diagnostic plots, which makes it a useful tool here.
plot(lm_hp)
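By default, plot draws the diagnostic plots for an lm object one after another. A minimal sketch, using only base R graphics, that places the four default plots on a single page instead:

```r
# Refit the model so this snippet runs on its own.
data(mtcars)
lm_hp <- lm(mpg ~ hp, data = mtcars)

# Arrange the four default diagnostic plots in a 2 x 2 grid,
# then restore the previous graphics settings.
old_par <- par(mfrow = c(2, 2))
plot(lm_hp)
par(old_par)
```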
Looking at the above plots, the residuals vs. fitted plot shows a bit of a pattern, and the Q-Q plot departs from the reference line in the tails, suggesting that a transformation could help. We can use a Box-Cox transformation to improve our fit.
library(car)  # provides boxCox()
bc <- boxCox(lm_hp)
Our Box-Cox plot shows the profile log-likelihood peaking at a small, non-zero \(\lambda\) value of around 0.05. Since \(\lambda = 0\) corresponds to a log transformation, this indicates we’re close to a log transformation.
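To read the estimate off numerically rather than from the plot, the profile log-likelihood can be recomputed by hand. The sketch below uses only base R and is not a description of how car implements boxCox internally. The function name profile_loglik and the grid of \(\lambda\) values are assumptions made for illustration.

```r
data(mtcars)
y <- mtcars$mpg
n <- length(y)

# Profile log-likelihood of lambda, up to an additive constant:
#   -n/2 * log(RSS / n) + (lambda - 1) * sum(log(y))
profile_loglik <- function(lambda) {
  y_t <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(residuals(lm(y_t ~ hp, data = mtcars))^2)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

lambdas <- seq(-2, 2, by = 0.01)   # illustrative grid
loglik <- sapply(lambdas, profile_loglik)

# Grid value of lambda with the highest profile log-likelihood
lambda_hat <- lambdas[which.max(loglik)]
lambda_hat
```

The second term in the log-likelihood is the Jacobian of the transformation. Without it, fits on differently scaled responses could not be compared.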
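Strictly speaking, \(\lambda\) applies to the response, mpg. Here we apply a simple log transformation to the predictor, hp, instead, which also improves on the untransformed fit: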
lm_log_hp <- lm(mpg ~ log(hp), mtcars)
summary(lm_log_hp)
##
## Call:
## lm(formula = mpg ~ log(hp), data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.9427 -1.7053 -0.4931  1.7194  8.6460
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   72.640      6.004  12.098 4.55e-13 ***
## log(hp)      -10.764      1.224  -8.792 8.39e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.239 on 30 degrees of freedom
## Multiple R-squared: 0.7204, Adjusted R-squared: 0.7111
## F-statistic: 77.3 on 1 and 30 DF, p-value: 8.387e-10
Finally, we can plot the diagnostics for our log-transformed model:
plot(lm_log_hp)
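For comparison, here is a sketch of the model that a \(\lambda\) near zero most directly suggests, with the log applied to the response rather than to hp. The name lm_log_mpg and the example horsepower value of 150 are assumptions for illustration, not part of the analysis above.

```r
data(mtcars)

# Log-transform the response, as a lambda near 0 suggests.
lm_log_mpg <- lm(log(mpg) ~ hp, data = mtcars)
summary(lm_log_mpg)

# Predictions are on the log scale; exponentiate to get back to mpg units.
pred_mpg <- exp(predict(lm_log_mpg, newdata = data.frame(hp = 150)))
pred_mpg
```

Because the response is on a different scale, the R-squared and residual standard error of this model are not directly comparable with those of the models fitted to untransformed mpg. The diagnostic plots from plot(lm_log_mpg) are a better basis for comparison.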