Blog 4: Box-Cox Transformation

data("rock")

Box-Cox Transformation

The Box-Cox Transformation is used to transform either the response or predictor variables to be normally distributed, even if the underlying variare not normally distributed. The transformation takes the form:

\[\begin{aligned} y(\lambda) = \begin{cases} \frac{y^{\lambda} - 1}{\lambda}, & \text{if } \begin{aligned}[t] \lambda_1 \ne 0, \end{aligned} \\ log(y), & \text{otherwise} \end{cases} \end{aligned}\]

Modeling

First, let’s construct a linear model to predict our perm variable, which measures permeability in milli-Darcies per rock in the sample

model <- lm(perm ~ area + peri + shape, rock)

hist(residuals(model))

Our residuals vs fitted plot above does show normality in our residuals. However, let’s create some other diagnostic plots for our mode via the built-in plot command.

plot(model)

We see a general pattern form our Residuals vs Fitted Values plot above.

summary(model)
## 
## Call:
## lm(formula = perm ~ area + peri + shape, data = rock)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -750.26  -59.57   10.66  100.25  620.91 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 485.61797  158.40826   3.066 0.003705 ** 
## area          0.09133    0.02499   3.654 0.000684 ***
## peri         -0.34402    0.05111  -6.731 2.84e-08 ***
## shape       899.06926  506.95098   1.773 0.083070 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 246 on 44 degrees of freedom
## Multiple R-squared:  0.7044, Adjusted R-squared:  0.6843 
## F-statistic: 34.95 on 3 and 44 DF,  p-value: 1.033e-11

Transformation

Now, we can run a Box-Cox transformation on our linear model to find the value of \(\lambda\).

# Run Box-Cox 
bc <- boxcox(model)

lambda <- bc$x[which.max(bc$y)]

We can now create another linear model using the \(\lambda\) parameter returned from ur Box-Cox transformation. We see a roughly 8% improvement in our adjusted \(R^2\) value from this model undergoing the transformation

model_transformed <- lm((perm**lambda) ~ area + peri + shape, rock )
summary(model_transformed)
## 
## Call:
## lm(formula = (perm^lambda) ~ area + peri + shape, data = rock)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30235 -0.32020  0.09305  0.32367  1.06157 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.577e+00  3.451e-01  10.365 2.18e-13 ***
## area         3.074e-04  5.445e-05   5.645 1.12e-06 ***
## peri        -1.025e-03  1.113e-04  -9.207 8.02e-12 ***
## shape        1.242e+00  1.104e+00   1.125    0.267    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5359 on 44 degrees of freedom
## Multiple R-squared:  0.7802, Adjusted R-squared:  0.7652 
## F-statistic: 52.06 on 3 and 44 DF,  p-value: 1.601e-14
plot(model_transformed)

We see a generally better fit from our diagnostic plots above, with less of a noticeable pattern in our Residuals vs Fitted plot above (though stil not ideal).

Using our log-likelihood measures (with the goal of minimizing this criteria for our transformed data model, we print the results for each of our models

# Log-Likelihood for each model
print(logLik(model))
## 'log Lik.' -330.2796 (df=5)
print(logLik(model_transformed))
## 'log Lik.' -36.08093 (df=5)

We can see that our Box-Cox transformed model shows a marked improvement in this criterion for our transformed model. This implies a better fit and a “simpler” model, despite needing to transform the underlying data.