Power Transforms in R

The boxCox function from the car package uses maximum likelihood to estimate a power transformation of the response variable in a model. Such a transformation can make the relationship between the predictors and the response closer to linear and bring the residuals closer to normality.
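For reference, the one-parameter Box-Cox family is usually written as below for a strictly positive response \(y\). This is the standard textbook form, not something specific to this post:

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log y, & \lambda = 0
\end{cases}
\]

Under this definition \(\lambda = 1\) leaves the data unchanged apart from a shift, \(\lambda = 0.5\) corresponds to a square root, and \(\lambda = 0\) corresponds to the log.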

We’ll use the built-in mtcars dataset. Let’s plot our horsepower variable against our mpg, which is the variable we’d like to predict:

data(mtcars)
plot(mtcars$hp, mtcars$mpg)

The relationship between horsepower and miles per gallon looks roughly linear, but we can fit a simple linear model to check how well ordinary least squares (OLS) regression describes it:

# Create model to predict MPG from Horsepower
lm_hp <- lm(mpg ~ hp, mtcars)
summary(lm_hp)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

We can plot our model to begin evaluating how effective it is at predicting MPG. The plot function called on an lm object is a useful tool here.

plot(lm_hp)
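Called interactively, plot on an lm object steps through the diagnostic plots one at a time. As a minimal sketch, the base-graphics layout below shows them in a single 2x2 grid, and the which argument of plot picks out individual panels. The model is refit here only so that the snippet runs on its own.

```r
# Refit the model so this snippet is self-contained
lm_hp <- lm(mpg ~ hp, data = mtcars)

# Show the four default diagnostic plots in a 2x2 grid,
# then restore the previous graphics settings
op <- par(mfrow = c(2, 2))
plot(lm_hp)
par(op)

# Only the residuals vs fitted plot (1) and the normal QQ plot (2)
plot(lm_hp, which = 1:2)
```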

Looking at the above plots, there’s a bit of a pattern in our residuals vs fitted plot and some tail behavior in our QQ plot, both of which suggest that a transformation could help. We can use a Box-Cox transformation to improve our fit.

library(car)  # provides boxCox
bc <- boxCox(lm_hp)

Our Box-Cox plot shows the profile log-likelihood peaking at a \(\lambda\) value of around 0.05. Since \(\lambda = 0\) corresponds to the log transformation, this indicates we’re close to a log transformation.
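Rather than reading \(\lambda\) off the plot, we can pull it out of the object that boxCox returns. The sketch below assumes the return value is a list with the grid of \(\lambda\) values in x and the matching profile log-likelihoods in y. That is how the car documentation describes it, but it is worth checking against your installed version. The name lambda_hat is just an illustrative choice.

```r
library(car)

# Refit the model so this snippet is self-contained
lm_hp <- lm(mpg ~ hp, data = mtcars)

# plotit = FALSE skips the plot and returns the lambda grid (x)
# together with the profile log-likelihood at each grid point (y)
bc <- boxCox(lm_hp, plotit = FALSE)

# Grid value of lambda with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

The result is only as fine as the grid. A denser grid can be requested through the lambda argument, for example lambda = seq(-2, 2, by = 0.01).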

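Strictly speaking, boxCox estimates a transformation of the response, mpg. Here we instead apply a simple log transformation to the predictor, hp, and see how the fit changes: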

lm_log_hp <- lm(mpg ~ log(hp), mtcars)
summary(lm_log_hp)
## 
## Call:
## lm(formula = mpg ~ log(hp), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9427 -1.7053 -0.4931  1.7194  8.6460 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   72.640      6.004  12.098 4.55e-13 ***
## log(hp)      -10.764      1.224  -8.792 8.39e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.239 on 30 degrees of freedom
## Multiple R-squared:  0.7204, Adjusted R-squared:  0.7111 
## F-statistic:  77.3 on 1 and 30 DF,  p-value: 8.387e-10

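The log-transformed model has a higher R-squared (0.7204 versus 0.6024) and a lower residual standard error (3.239 versus 3.863) than the original model. We can plot its diagnostics in the same way: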

plot(lm_log_hp)
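Because the Box-Cox procedure concerns the response rather than the predictor, a natural companion check is to log-transform mpg itself. The following is a sketch of that alternative, not part of the analysis above, and lm_log_mpg is a name introduced here for illustration.

```r
# Log-transform the response, as a lambda near 0 suggests
lm_log_mpg <- lm(log(mpg) ~ hp, data = mtcars)
summary(lm_log_mpg)

# Residuals vs fitted and normal QQ plots for the new model
plot(lm_log_mpg, which = 1:2)
```

Note that the residual standard error of this model is on the log(mpg) scale, so it cannot be compared directly with the values reported for the models whose response is mpg.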