Box-Cox Transformations

In this section, we create a histogram of all the vectors after taking a look at the data. Notice how neither of them appear particularly ‘normal.’

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
sapply(cars, hist)

##          speed     dist     
## breaks   Numeric,6 Numeric,7
## counts   Integer,5 Integer,6
## density  Numeric,5 Numeric,6
## mids     Numeric,5 Numeric,6
## xname    "X[[i]]"  "X[[i]]" 
## equidist TRUE      TRUE

Here we build our first, naive model. It has an \(R^2\) of .65.

model1 <- lm(cars$speed ~ cars$dist)
summary(model1)
## 
## Call:
## lm(formula = cars$speed ~ cars$dist)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5293 -2.1550  0.3615  2.4377  6.4179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## cars$dist    0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

We can see in the scatter plot that this line doesn’t seem to fit the data as well as we’d like.

plot(cars$speed, cars$dist)
abline(model1)

model1
## 
## Call:
## lm(formula = cars$speed ~ cars$dist)
## 
## Coefficients:
## (Intercept)    cars$dist  
##      8.2839       0.1656

Perhaps this is due to the lack of normality in our data. Let’s take a closer look.

qqnorm(cars$speed)
qqline(cars$speed)

qqnorm(cars$dist)
qqline(cars$dist)

In either case, there is certainly room for improvement. However, the distance vector has larger residuals in the extrema. We can fix this with a Box-Cox transformation. First we install and use the forecast library.

library(forecast)

Then we apply the transform. This is equivalent to \[ x \rightarrow x^\lambda \] where lambda is given by the BoxCox.lambda(x) function and is chosen in such a way as to minimize mean square error. The reason this works is quite complex, but is due to the nature of logarithms and their tendency to show up in the real world.

cars2 <- cars
transform <- function(x){
  x <- BoxCox(x, lambda = BoxCox.lambda(x))
}
cars2 <- as.data.frame(sapply(cars2, transform))
sapply(cars2, hist)

##          speed     dist     
## breaks   Integer,9 Numeric,8
## counts   Integer,8 Integer,7
## density  Numeric,8 Numeric,7
## mids     Numeric,8 Numeric,7
## xname    "X[[i]]"  "X[[i]]" 
## equidist TRUE      TRUE
sapply(cars2, qqnorm)

##   speed      dist      
## x Numeric,50 Numeric,50
## y Numeric,50 Numeric,50

Now, let’s build a new model and compare the results.

plot(cars2$speed, cars2$dist)
model2 <- lm(cars2$dist ~ cars2$speed)
abline(model2)

model2
## 
## Call:
## lm(formula = cars2$dist ~ cars2$speed)
## 
## Coefficients:
## (Intercept)  cars2$speed  
##    -0.09958      1.07446
summary(model2)
## 
## Call:
## lm(formula = cars2$dist ~ cars2$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2914 -0.8456 -0.1705  0.7269  3.4134 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.09958    0.68422  -0.146    0.885    
## cars2$speed  1.07446    0.09629  11.158 6.21e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.268 on 48 degrees of freedom
## Multiple R-squared:  0.7217, Adjusted R-squared:  0.716 
## F-statistic: 124.5 on 1 and 48 DF,  p-value: 6.209e-15

Eyyo! Now our model has an \(R^2\) of .72. Id’ say that’s a good result. However, you might be asking:

What? Why did we magically change our data to make our model better?

Well, this stackoverflow post may clear it up a bit for you. In short, the real-world is messy and solutions to the differential equations that govern things like physics tend to be logarithmic. Let’s check our transform for the data vectors vectors.

BoxCox.lambda(cars$speed)
## [1] 0.6051103
BoxCox.lambda(cars$dist)
## [1] 0.3471381

As we can see, neither data vector was entirely normal and a better model was built using an exponential transform. Because the real-world tends to be logarithmic, applying an exponential functions reverts this. This corresponds to the physics, which suggests that stopping distance, \(d\) is \[ d = \frac{v^2}{2 \mu g} \] where \(\mu\) is a coefficient of friction and \(g\) is the acceleration due to gravity. You can see range of values provided, we can approximate this quadratic equation by transforming the data into something that’s more exponential.

Note that Box-Cox transformations can be a quick way to improve your model, but that they must always be rooted in the underlying data. Otherwise, they’re meaningless.