DATA 605 HW11

The Physics of Braking Distance

The kinetic energy (KE) of a car is \(KE = \frac{1}{2}mv^2\), where ‘m’ is the mass of the car and ‘v’ is its speed. For KE in joules (J), mass must be in kilograms (kg) and speed in meters per second (m/s).
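As a quick illustration of the unit requirement, the sketch below converts a speed from mph to m/s before computing kinetic energy. The helper names `mph_to_ms` and `kinetic_energy` and the 1500 kg mass are assumptions made for this example only.

```r
# Hypothetical helpers for illustration only.
mph_to_ms <- function(mph) mph * 0.44704        # 1 mph = 0.44704 m/s exactly
kinetic_energy <- function(m, v) 0.5 * m * v^2  # joules when m is in kg, v in m/s

mass_kg <- 1500                # assumed car mass (illustrative)
v_ms <- mph_to_ms(25)          # 25 mph expressed in m/s
ke_joules <- kinetic_energy(mass_kg, v_ms)
ke_joules
```

Passing the speed in mph directly would give a number that is not in joules, which is why the conversion comes first.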

For a car to slow down while braking, the brakes must do work on the wheels of the car. Work (W) is defined as \(W = Fd\cos\theta\), where ‘F’ is the force applied by the brakes, ‘d’ is the stopping distance of the car, and \(\theta\) is the angle between the force and the displacement. Since the frictional force of the brakes is anti-parallel to the displacement of the car, \(\theta = \pi\) and \(\cos(\pi) = -1\), so \(W = -Fd\). The negative sign tells us that the work transfers energy from the system (the car) to its surroundings as heat. The sign can be dropped in this analysis without loss of information, as it cancels out in the steps below. For work to be in joules (J), force must be in newtons (N) and distance in meters (m).

Note also that drivers tend to apply steady pressure to the brake pedal when stopping, so we can treat the braking force as a constant.

By the Work-Energy Theorem, \(\Delta KE = W\): \[ \begin{aligned} \Delta KE &= KE_f - KE_i \\ KE_f &= 0 \text{ J} \quad \text{(the car stops)} \\ -KE_i &= W \\ KE_i &= |W| \quad \text{(recall that the work is negative)} \\ \frac{1}{2}mv_i^2 &= Fd \\ d &= \frac{1}{2}\frac{m}{F}v_i^2 \end{aligned} \]

F, m and 1/2 are all constant, so we can absorb them into a single constant \(C = \frac{m}{2F}\). Since F/m is an acceleration, C has units of \(\frac{s^2}{m}\): \[ d = Cv_i^2 \] The units of d are \(\frac{s^2}{m} \cdot \frac{m^2}{s^2} = m\).
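To make the quadratic relationship concrete, here is a minimal sketch that evaluates the stopping distance for a few speeds. The function name `braking_distance` and the deceleration of 7 m/s^2 are assumptions chosen for illustration, with the deceleration standing in for F/m.

```r
# Hypothetical helper: d = v^2 / (2a), where a = F/m is the constant deceleration.
braking_distance <- function(v, a) {
  v^2 / (2 * a)
}

a <- 7                        # assumed deceleration in m/s^2 (illustrative)
v <- c(10, 20, 40)            # initial speeds in m/s
d <- braking_distance(v, a)   # stopping distances in m

# Each doubling of speed multiplies the distance by four.
d[2] / d[1]
d[3] / d[2]
```

Both ratios equal 4 regardless of the value chosen for `a`, because the constant cancels.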

Speed and distance are both non-negative, and distance grows steadily with speed, so a simple linear regression will tell us the two variables are correlated. However, distance is quadratic with respect to speed, so a straight-line model in speed is invalid. We will proceed with the analysis anyway, to highlight the importance of using data visualizations and residual analysis to test the validity of linear models.

The Linear Regression

Linear regression is done very simply in R using lm(). The fitted model will be saved in the variable ‘fit’.

summary(cars$speed) #looks like mph

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    12.0    15.0    15.4    19.0    25.0

summary(cars$dist) #looks like feet

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   26.00   36.00   42.98   56.00  120.00

fit <- lm(cars$dist ~ cars$speed)
summary(fit)

## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

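If we didn’t know better, we’d say that there is a highly significant linear relationship, as the p-value is much less than 0.05, and call it quits. But a small p-value only says the slope is unlikely to be zero; it does not say a straight line is the right model, and we know that it isn’t.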
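The point can be checked on made-up data. The sketch below fits a straight line to noise-free quadratic data, where the line is the wrong model by construction; the constant 0.1 and the variable names are assumptions for this example and are unrelated to the `cars` data.

```r
# Noise-free quadratic data: a straight line is the wrong model by construction.
v <- 1:25
d <- 0.1 * v^2                 # assumed constant C = 0.1 (illustrative)

line_fit <- lm(d ~ v)
slope_p <- summary(line_fit)$coefficients["v", "Pr(>|t|)"]
slope_p < 0.05                 # the slope is still highly significant

# The residuals are U-shaped: positive at both ends, negative in the middle.
round(resid(line_fit)[c(1, 13, 25)], 2)
```

The slope passes the significance test even though the residuals show an obvious systematic pattern, which is why the residual checks matter.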

Data Visualization

plot(cars$speed, cars$dist, xlab = "Initial Speed (mph)", ylab = "Braking Distance (ft)", col = 'darkviolet')
abline(fit, col = "gold3")

In this case the data visualization isn’t much help in determining that the linear model isn’t valid. In other fields, this would look like a good fit.

Residuals

Inference from a linear model relies on residuals that are approximately normally distributed, have constant spread, and appear random when plotted against the independent variable or the fitted values.

res <- resid(fit)
summary(res)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -29.069  -9.525  -2.272   0.000   9.215  43.201

hist(res, xlab = "Residuals of Distance")

There is some right skew to the residuals, but it doesn’t seem bad enough to invalidate the linear model outright.

plot(fitted(fit), resid(fit),  col = 'steelblue4')

qqnorm(res)
qqline(res)

The residuals here do not appear random. They seem to grow in range as the initial speed increases, and are not uniformly scattered around zero. The Normal Q-Q plot also shows deviation from the theoretical values at the upper quantiles. Given this, the skewness of the residuals, and the underlying physics, a linear model is not valid.
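The visual impressions can be backed up with simple numerical checks. The sketch below is one possible approach using only base R: a Shapiro-Wilk test for normality and a rough auxiliary regression of the squared residuals on speed. The names `fit_check`, `res_check`, `sw` and `aux` are introduced here so the block runs on its own, and the auxiliary regression is an informal stand-in for a formal heteroskedasticity test.

```r
# Refit the straight-line model so this block is self-contained.
fit_check <- lm(dist ~ speed, data = cars)
res_check <- resid(fit_check)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value is evidence against normality.
sw <- shapiro.test(res_check)
sw$p.value

# Rough heteroskedasticity check: regress the squared residuals on speed.
# A clearly positive slope suggests the spread grows with speed.
aux <- lm(I(res_check^2) ~ speed, data = cars)
summary(aux)$coefficients
```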

Making the Data Linear

We can transform the variables so that a linear regression matches the form predicted by the physics.

Substitution

We can make the formula look linear by making a substitution.

\[ d = Cv_i^2 \\ \text{Let } x = v_i^2 \\ d = Cx \]

So we square the speed and use it as the predictor variable.

speed_sq = cars$speed^2
fit2 <- lm(cars$dist ~ speed_sq)
summary(fit2)

## 
## Call:
## lm(formula = cars$dist ~ speed_sq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.448  -9.211  -3.594   5.076  45.862 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.86005    4.08633   2.168   0.0351 *  
## speed_sq     0.12897    0.01319   9.781  5.2e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.05 on 48 degrees of freedom
## Multiple R-squared:  0.6659, Adjusted R-squared:  0.6589 
## F-statistic: 95.67 on 1 and 48 DF,  p-value: 5.2e-13

plot(speed_sq, cars$dist, xlab = 'Speed^2 in (mph)^2', ylab = 'Braking distance in ft', col = 'blue')
abline(fit2, col = 'orange')

hist(resid(fit2))

plot(speed_sq, resid(fit2))

qqnorm(resid(fit2))
qqline(resid(fit2))

Non-normality of the residuals and heteroskedasticity are still problems with this technique.
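The derivation gives \(d = Cx\) with no intercept, whereas lm() adds one by default. As a sketch of a variant that matches the physics more closely, the block below forces the fit through the origin; `fit_origin`, `C_hat` and `pred` are names introduced for this example.

```r
# Fit d = C * v^2 through the origin. The "0 +" removes the intercept and
# I() makes speed^2 ordinary arithmetic rather than formula syntax.
fit_origin <- lm(dist ~ 0 + I(speed^2), data = cars)
C_hat <- unname(coef(fit_origin))   # estimate of C in ft per mph^2
C_hat

# Predicted braking distance (ft) at 10 and 20 mph.
pred <- predict(fit_origin, newdata = data.frame(speed = c(10, 20)))
pred[2] / pred[1]                   # doubling the speed quadruples the prediction
```

One caution when reading summary() for this model: R computes R-squared differently for fits without an intercept, so it should not be compared directly with the R-squared values reported above.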

Log-Log

\[ d = Cv_i^2 \\ \log(d) = \log(Cv_i^2) \\ \log(d) = \log(C) + 2\log(v_i) \\ \text{Let: } y = \log(d),\ B_0 = \log(C),\ x = \log(v_i) \\ y = B_0 + 2x \]

fit3 <- lm(log(cars$dist) ~ log(cars$speed))
summary(fit3)

## 
## Call:
## lm(formula = log(cars$dist) ~ log(cars$speed))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00215 -0.24578 -0.02898  0.20717  0.88289 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.7297     0.3758  -1.941   0.0581 .  
## log(cars$speed)   1.6024     0.1395  11.484 2.26e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4053 on 48 degrees of freedom
## Multiple R-squared:  0.7331, Adjusted R-squared:  0.7276 
## F-statistic: 131.9 on 1 and 48 DF,  p-value: 2.259e-15

plot(log(cars$speed), log(cars$dist), xlab = 'log(Speed) in log(mph)', ylab = 'log(Braking distance) in log(ft)', col = 'steelblue')
abline(fit3, col = 'orangered')

hist(resid(fit3))

plot(log(cars$speed), resid(fit3))

qqnorm(resid(fit3))
qqline(resid(fit3))

Both the normality of the residuals and the heteroskedasticity are improved using this technique.
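The derivation predicts a slope of exactly 2 on the log-log scale, so it is worth comparing the fitted slope with that value. The sketch below refits the model under the assumed name `fit_log`, computes a 95% confidence interval for the slope, and back-transforms predictions to feet.

```r
# Refit the log-log model so this block is self-contained.
fit_log <- lm(log(dist) ~ log(speed), data = cars)

# 95% confidence interval for the exponent; the physics predicts 2.
ci <- confint(fit_log)["log(speed)", ]
ci

# Predictions come out on the log scale, so exponentiate them
# to get braking distances in feet.
new_speeds <- data.frame(speed = c(10, 20))
exp(predict(fit_log, newdata = new_speeds))
```

With the estimate and standard error reported above (1.60 and 0.14), the interval should run from roughly 1.3 to 1.9, which falls short of 2. That suggests the idealized constant-force model does not fully describe these data, even though the log-log fit behaves better than the others.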