The kinetic energy (KE) of a car is \(KE = \frac{1}{2}mv^2\), where \(m\) is the mass of the car and \(v\) is its speed. For KE in joules (J), mass must be in kilograms (kg) and speed in meters per second (m/s).
For a car to slow down while braking, the brakes must do work on the wheels of the car. Work (W) is defined as \(W = Fd\cos\theta\), where \(F\) is the force applied by the brakes, \(d\) is the stopping distance of the car, and \(\theta\) is the angle between the force and the displacement. Since the frictional force of the brakes is anti-parallel to the displacement of the car, \(\theta = \pi\) and \(\cos(\pi) = -1\), so \(W = -Fd\). The negative sign tells us that work is transferring energy from the system (the car) to its surroundings as heat. It can be dropped in this analysis without loss of information, as it cancels out in the steps below. For work to be in joules (J), force must be in newtons (N) and distance in meters (m).
Note also that drivers tend to apply steady pressure to the brake pedal when stopping, so we can treat the braking force as constant.
By the Work-Energy Theorem, \(\Delta KE = W\): \[ \Delta KE = KE_f - KE_i \\ \text{The car stops, so } KE_f = 0\ \text{J}. \\ -KE_i = W \\ \text{Recall that the work is negative: } W = -Fd. \\ KE_i = Fd \\ \frac{1}{2}mv_i^2 = Fd \\ d = \frac{1}{2}\frac{m}{F}v_i^2 \\ \text{Since } F,\ m \text{ and } 1/2 \text{ are all constant, we can absorb them into a single constant } C = \frac{m}{2F}. \\ \text{Note that } F/m \text{ is an acceleration, so } C \text{ has units } \frac{s^2}{m}. \\ d = Cv_i^2 \\ \text{The units of } d \text{ are } \frac{s^2}{m}\cdot\frac{m^2}{s^2} = m. \]
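As a quick numerical check of the scaling derived above, the sketch below evaluates \(d = \frac{m}{2F}v_i^2\) for a made-up car. The mass, the braking force, and the helper name `stopping_distance` are illustrative assumptions, not values taken from any dataset.

```r
# Stopping distance from the work-energy result d = m * v^2 / (2 * F).
# All inputs are hypothetical SI values chosen only for illustration.
stopping_distance <- function(v, mass = 1500, brake_force = 7500) {
  # v in m/s, mass in kg, brake_force in N; returns distance in m
  mass * v^2 / (2 * brake_force)
}

v <- c(10, 20, 40)          # initial speeds in m/s
d <- stopping_distance(v)   # 10, 40 and 160 m with the defaults above
print(data.frame(speed_m_s = v, distance_m = d))

# Doubling the speed quadruples the distance, whatever mass and force are.
print(d[2] / d[1])
print(d[3] / d[2])
```

With these assumed values the constant is \(C = 0.1\ s^2/m\), and each doubling of speed multiplies the stopping distance by four.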
For physically meaningful (non-negative) speeds, the above equation says that distance always increases with speed, so a simple linear regression will tell us the two variables are correlated. However, distance is quadratic with respect to speed, so a straight-line model of distance on speed is not valid. We will nevertheless proceed with the analysis to highlight the importance of using data visualizations and residual analysis to test the validity of linear models.
Linear regression is simple to run in R using lm(). Here we fit it to R’s built-in cars dataset and save the fitted model in a variable ‘fit’.
summary(cars$speed) #looks like mph
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 15.0 15.4 19.0 25.0
summary(cars$dist) #looks like feet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 26.00 36.00 42.98 56.00 120.00
fit <- lm(cars$dist ~ cars$speed)
summary(fit)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
If we didn’t know better, we’d say that there is a highly significant linear relationship, as the p-value is much less than 0.05, and call it quits. But we know that this isn’t the case.
plot(cars$speed, cars$dist, xlab = "Initial Speed (mph)", ylab = "Braking Distance (ft)", col = "darkviolet")
abline(fit, col = "gold3")
In this case the data visualization isn’t much help in determining that the linear model isn’t valid. In other fields, this would look like a good fit.
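One thing the scatter plot does not make obvious is what the straight line implies at low speeds. The sketch below refits the same model with a `data` argument (the name `fit_line` is introduced here only for illustration) and asks for predictions at the slowest and fastest recorded speeds. From the coefficients printed above, the prediction at 4 mph is about -1.85 ft.

```r
# Refit the straight-line model on R's built-in cars data
# (same model as `fit` in the text, written with a data argument).
fit_line <- lm(dist ~ speed, data = cars)

# Predicted stopping distance at the slowest and fastest recorded speeds.
new_speeds <- data.frame(speed = c(4, 25))
pred <- predict(fit_line, newdata = new_speeds)
print(pred)

# The prediction at 4 mph is below zero, which no real car can do.
print(unname(pred[1] < 0))
```

A negative stopping distance is a sign that the straight line is only a local approximation and not a description of the underlying physics.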
Inference from a linear model is reliable as long as the residuals are approximately normally distributed, have roughly constant spread, and appear random when plotted against the fitted values or the independent variable.
res <- resid(fit)
summary(res)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -29.069 -9.525 -2.272 0.000 9.215 43.201
hist(res, xlab = "Residuals of Distance")
The residuals are right-skewed, but the skew doesn’t seem bad enough to invalidate the linear model outright.
plot(fitted(fit), resid(fit), col = 'steelblue4')
qqnorm(res)
qqline(res)
The residuals here do not appear random. They seem to grow in range as the initial speed increases, and are not uniformly scattered around zero. The Normal Q-Q plot also shows deviation from the theoretical values at the upper quantiles. Given this, the skewness of the residuals, and the underlying physics, a linear model is not valid.
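The visual impressions above can be backed up with simple numerical checks. The sketch below is one possible approach, not the only one: it runs a Shapiro-Wilk test on the residuals and then regresses the squared residuals on speed as a rough, hand-rolled check for non-constant variance (in the spirit of a Breusch-Pagan test, but not that test itself). The names `fit_line`, `res_line`, `sw` and `aux` are introduced here for illustration.

```r
# Same straight-line model as `fit` in the text.
fit_line <- lm(dist ~ speed, data = cars)
res_line <- resid(fit_line)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
sw <- shapiro.test(res_line)
print(sw)

# Rough check for non-constant variance: regress the squared residuals on
# speed. A positive slope means the spread of the residuals grows with speed.
aux <- lm(I(res_line^2) ~ speed, data = cars)
print(coef(summary(aux)))
```

Neither check replaces the plots, but together they give a number to go with what the eye sees.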
We can transform the data so that a linear regression matches the physics. One way is to make the formula look linear through a substitution.
\[ d = Cv_i^2 \\ \text{Let } x = v_i^2 \\ d = Cx \]
So we square speed and use it as the predictive variable.
speed_sq <- cars$speed^2
fit2 <- lm(cars$dist ~ speed_sq)
summary(fit2)
##
## Call:
## lm(formula = cars$dist ~ speed_sq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.448 -9.211 -3.594 5.076 45.862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.86005 4.08633 2.168 0.0351 *
## speed_sq 0.12897 0.01319 9.781 5.2e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.05 on 48 degrees of freedom
## Multiple R-squared: 0.6659, Adjusted R-squared: 0.6589
## F-statistic: 95.67 on 1 and 48 DF, p-value: 5.2e-13
plot(speed_sq, cars$dist, xlab = 'Speed^2 in (mph)^2', ylab = 'Braking distance in ft', col = 'blue')
abline(fit2, col = 'orange')
hist(resid(fit2))
plot(speed_sq, resid(fit2))
qqnorm(resid(fit2))
qqline(resid(fit2))
Non-normality of the residuals and heteroskedasticity are still problems with this technique.
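The derivation gives \(d = Cx\) with no intercept, while the fit above estimates one. As a sketch of how the model could be forced through the origin to match the formula, the code below drops the intercept with `0 +` in the formula. The name `fit_origin` is introduced here for illustration. Note that R computes R-squared differently for models without an intercept, so it should not be compared directly with the value reported above.

```r
# Fit d = C * speed^2 with no intercept, as the physics suggests.
# I() protects the square so it is treated as arithmetic, not formula syntax.
fit_origin <- lm(dist ~ 0 + I(speed^2), data = cars)

# The single coefficient is the estimate of C in ft / mph^2.
print(coef(fit_origin))

# Residual standard error, on the same scale (ft) as the other fits.
print(sigma(fit_origin))

# Residuals against speed, to compare by eye with the plots above.
plot(cars$speed, resid(fit_origin),
     xlab = "Initial Speed (mph)", ylab = "Residuals (ft)")
abline(h = 0, lty = 2)
```

Whether dropping the intercept is appropriate depends on how much one trusts the idealized derivation; the residual plot is the place to look before deciding.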
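Alternatively, we can take the logarithm of both sides, which turns the power law into a straight line.
\[ d = Cv_i^2 \\ \log(d) = \log(Cv_i^2) \\ \log(d) = \log(C) + 2\log(v_i) \\ \text{Let } y = \log(d),\ B_0 = \log(C),\ x = \log(v_i) \\ y = B_0 + 2x \]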
fit3 <- lm(log(cars$dist) ~ log(cars$speed))
summary(fit3)
##
## Call:
## lm(formula = log(cars$dist) ~ log(cars$speed))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00215 -0.24578 -0.02898 0.20717 0.88289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.7297 0.3758 -1.941 0.0581 .
## log(cars$speed) 1.6024 0.1395 11.484 2.26e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4053 on 48 degrees of freedom
## Multiple R-squared: 0.7331, Adjusted R-squared: 0.7276
## F-statistic: 131.9 on 1 and 48 DF, p-value: 2.259e-15
plot(log(cars$speed), log(cars$dist), xlab = 'log(Speed in mph)', ylab = 'log(Braking distance in ft)', col = 'steelblue')
abline(fit3, col = 'orangered')
hist(resid(fit3))
plot(log(cars$speed), resid(fit3))
qqnorm(resid(fit3))
qqline(resid(fit3))
Both the normality of the residuals and the heteroskedasticity are improved by this technique.
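The log-log form also lets us compare the fitted exponent with the value of 2 predicted by the derivation. The summary above reports a slope of 1.6024 with a standard error of 0.1395, which is about 2.85 standard errors below 2. The sketch below makes that comparison explicit. The default `summary()` output tests the slope against 0, so the test against 2 is computed by hand here. The names `fit_log`, `est`, `se`, `t_stat` and `p_val` are introduced for illustration.

```r
# Same log-log model as `fit3` in the text, written with a data argument.
fit_log <- lm(log(dist) ~ log(speed), data = cars)

coefs <- coef(summary(fit_log))
est <- coefs["log(speed)", "Estimate"]
se  <- coefs["log(speed)", "Std. Error"]

# t statistic and two-sided p-value for H0: slope = 2.
t_stat <- (est - 2) / se
p_val  <- 2 * pt(-abs(t_stat), df = df.residual(fit_log))
print(c(estimate = est, t = t_stat, p = p_val))

# 95% confidence interval for the exponent.
print(confint(fit_log, "log(speed)"))

# Back-transform the intercept to get C on the original scale.
print(exp(coef(fit_log)[["(Intercept)"]]))
```

A fitted exponent below 2 suggests that the idealized constant-force model does not capture everything in these data, even though the log-log fit behaves better than the other two.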