In this section, we create a histogram of all the vectors after taking a look at the data. Notice how neither of them appear particularly ‘normal.’
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
sapply(cars, hist)
## speed dist
## breaks Numeric,6 Numeric,7
## counts Integer,5 Integer,6
## density Numeric,5 Numeric,6
## mids Numeric,5 Numeric,6
## xname "X[[i]]" "X[[i]]"
## equidist TRUE TRUE
Here we build our first, naive model. It has an \(R^2\) of .65.
model1 <- lm(cars$speed ~ cars$dist)
summary(model1)
##
## Call:
## lm(formula = cars$speed ~ cars$dist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5293 -2.1550 0.3615 2.4377 6.4179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.28391 0.87438 9.474 1.44e-12 ***
## cars$dist 0.16557 0.01749 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
We can see in the scatter plot that this line doesn’t seem to fit the data as well as we’d like.
plot(cars$speed, cars$dist)
abline(model1)
model1
##
## Call:
## lm(formula = cars$speed ~ cars$dist)
##
## Coefficients:
## (Intercept) cars$dist
## 8.2839 0.1656
Perhaps this is due to the lack of normality in our data. Let’s take a closer look.
qqnorm(cars$speed)
qqline(cars$speed)
qqnorm(cars$dist)
qqline(cars$dist)
In either case, there is certainly room for improvement. However, the distance vector has larger residuals in the extrema. We can fix this with a Box-Cox transformation. First we install and use the forecast library.
library(forecast)
Then we apply the transform. This is equivalent to \[
x \rightarrow x^\lambda
\] where lambda is given by the BoxCox.lambda(x) function and is chosen in such a way as to minimize mean square error. The reason this works is quite complex, but is due to the nature of logarithms and their tendency to show up in the real world.
cars2 <- cars
transform <- function(x){
x <- BoxCox(x, lambda = BoxCox.lambda(x))
}
cars2 <- as.data.frame(sapply(cars2, transform))
sapply(cars2, hist)
## speed dist
## breaks Integer,9 Numeric,8
## counts Integer,8 Integer,7
## density Numeric,8 Numeric,7
## mids Numeric,8 Numeric,7
## xname "X[[i]]" "X[[i]]"
## equidist TRUE TRUE
sapply(cars2, qqnorm)
## speed dist
## x Numeric,50 Numeric,50
## y Numeric,50 Numeric,50
Now, let’s build a new model and compare the results.
plot(cars2$speed, cars2$dist)
model2 <- lm(cars2$dist ~ cars2$speed)
abline(model2)
model2
##
## Call:
## lm(formula = cars2$dist ~ cars2$speed)
##
## Coefficients:
## (Intercept) cars2$speed
## -0.09958 1.07446
summary(model2)
##
## Call:
## lm(formula = cars2$dist ~ cars2$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2914 -0.8456 -0.1705 0.7269 3.4134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.09958 0.68422 -0.146 0.885
## cars2$speed 1.07446 0.09629 11.158 6.21e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.268 on 48 degrees of freedom
## Multiple R-squared: 0.7217, Adjusted R-squared: 0.716
## F-statistic: 124.5 on 1 and 48 DF, p-value: 6.209e-15
Eyyo! Now our model has an \(R^2\) of .72. Id’ say that’s a good result. However, you might be asking:
What? Why did we magically change our data to make our model better?
Well, this stackoverflow post may clear it up a bit for you. In short, the real-world is messy and solutions to the differential equations that govern things like physics tend to be logarithmic. Let’s check our transform for the data vectors vectors.
BoxCox.lambda(cars$speed)
## [1] 0.6051103
BoxCox.lambda(cars$dist)
## [1] 0.3471381
As we can see, neither data vector was entirely normal and a better model was built using an exponential transform. Because the real-world tends to be logarithmic, applying an exponential functions reverts this. This corresponds to the physics, which suggests that stopping distance, \(d\) is \[ d = \frac{v^2}{2 \mu g} \] where \(\mu\) is a coefficient of friction and \(g\) is the acceleration due to gravity. You can see range of values provided, we can approximate this quadratic equation by transforming the data into something that’s more exponential.
Note that Box-Cox transformations can be a quick way to improve your model, but that they must always be rooted in the underlying data. Otherwise, they’re meaningless.