CUNY 605 Homework Week 11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

Loading the dataset.

# Loading
data(mtcars)
# Print the first 6 rows
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

How many observations (rows) and variables (columns) are there?

nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11

The question asks for stopping distance as a function of speed. The mtcars dataset loaded here records neither quantity directly (R's built-in cars dataset does, as speed and dist), so we derive both. The dataset includes each car's quarter-mile time in seconds (qsec), from which we can compute an average speed in km/hr over the quarter mile. This is not ideal, because the car is accelerating throughout the run, but we will assume it is the speed the car is travelling at when it begins to brake.

# Calculate speed for each car
# Let's create another dataframe mtcars.copy
# Remember: 1 km = 0.621371 miles
mtcars.copy <- mtcars
mtcars.copy$speed <- (0.25 * 1/0.621371)/(mtcars.copy$qsec * 1/3600) # quarter mile in km / time in hours = km/hr

# Show the head of mtcars.copy
head(mtcars.copy)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                      speed
## Mazda RX4         87.99575
## Mazda RX4 Wag     85.10047
## Datsun 710        77.82966
## Hornet 4 Drive    74.50669
## Hornet Sportabout 85.10047
## Valiant           71.63254
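As a check on the unit conversion, the same calculation can be written as a small function. This is only a sketch: the name quarter_mile_kmh is a hypothetical helper introduced here for illustration, not part of the original analysis.

```r
# Hypothetical helper (not in the original analysis): average speed in km/hr
# over a quarter mile covered in `qsec` seconds.
quarter_mile_kmh <- function(qsec) {
  miles_per_km <- 0.621371
  distance_km  <- 0.25 / miles_per_km  # quarter mile expressed in km
  time_hr      <- qsec / 3600          # seconds expressed in hours
  distance_km / time_hr
}

# The Mazda RX4 has qsec = 16.46 in mtcars
quarter_mile_kmh(16.46)
```

The result should agree with the speed column shown above for the Mazda RX4.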

Let’s obtain the stopping distance formula. After a Google search, we see a reference: https://en.wikipedia.org/wiki/Braking_distance. In this reference, it states: “If a driver puts on the brakes of a car, the car will not come to a stop immediately. The stopping distance is the distance the car travels before it comes to a rest. It depends on the speed of the car and the coefficient of friction \(\mu\) between the wheels and the road. This stopping distance formula does not include the effect of anti-lock brakes or brake pumping. The SI unit for stopping distance is meters.”

\[d = \frac{v^2}{2 \mu g}\]

Where \(d\) is the stopping distance in meters, \(v\) is the speed in m/s, \(\mu\) is the coefficient of friction (unitless and here assumed to be 0.7), and \(g = 9.80\ m/s^2\) is the gravitational acceleration.

Let’s calculate all of the braking (stopping) distances for each car.

# Remember to convert speed into meters/sec
# 1 km/hr --> 1000m/1km * 1hr/60 min * 1 min/60s
conversion <- 1000/1 * 1/60 * 1/60
mu <- 0.7
g <- 9.8
mtcars.copy$stop_distance <- ((mtcars.copy$speed * conversion)^2)/(2*mu*g)

# print head
head(mtcars.copy)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                      speed stop_distance
## Mazda RX4         87.99575      43.54761
## Mazda RX4 Wag     85.10047      40.72910
## Datsun 710        77.82966      34.06679
## Hornet 4 Drive    74.50669      31.21989
## Hornet Sportabout 85.10047      40.72910
## Valiant           71.63254      28.85770
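The braking-distance calculation can also be wrapped in a function, which makes the dependence on the square of the speed easy to see. This is a minimal sketch: braking_distance_m is a hypothetical name, and the defaults for mu and g are simply the values assumed in the text.

```r
# Hypothetical helper: braking distance in meters for a speed given in km/hr.
# mu = 0.7 and g = 9.8 m/s^2 are the values assumed in the text.
braking_distance_m <- function(speed_kmh, mu = 0.7, g = 9.8) {
  v <- speed_kmh * 1000 / 3600  # km/hr -> m/s
  v^2 / (2 * mu * g)
}

braking_distance_m(87.99575)    # speed computed above for the Mazda RX4
braking_distance_m(c(50, 100))  # doubling the speed quadruples the distance
```

Because the distance grows with the square of the speed, the relationship between the two derived columns is curved rather than straight.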

Since we are only interested in one variable (speed) and in one outcome (stop_distance), we can munge our data and eliminate the other columns.

mtcars.copy <- mtcars.copy[,c("speed", "stop_distance")]
head(mtcars.copy)
##                      speed stop_distance
## Mazda RX4         87.99575      43.54761
## Mazda RX4 Wag     85.10047      40.72910
## Datsun 710        77.82966      34.06679
## Hornet 4 Drive    74.50669      31.21989
## Hornet Sportabout 85.10047      40.72910
## Valiant           71.63254      28.85770

Now let’s attach this data so we can begin the linear regression analysis.

attach(mtcars.copy)
mtcars.lm <- lm(stop_distance ~ speed)
mtcars.lm
## 
## Call:
## lm(formula = stop_distance ~ speed)
## 
## Coefficients:
## (Intercept)        speed  
##    -38.3229       0.9329

So the formula that was calculated for the linear regression is:

\[\widehat{stopdistance} = 0.9329*speed - 38.3229\]
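To show how the fitted formula is used, the sketch below rebuilds the two derived columns, refits the model, and predicts the stopping distance at 80 km/hr both with predict() and by hand from the coefficients. The names df, fit, and the speed of 80 km/hr are illustrative choices, and the model is fitted with a data argument instead of attach().

```r
data(mtcars)

# Rebuild the derived columns used in the analysis
df <- data.frame(speed = (0.25 / 0.621371) / (mtcars$qsec / 3600))  # km/hr
df$stop_distance <- (df$speed * 1000 / 3600)^2 / (2 * 0.7 * 9.8)    # meters

fit <- lm(stop_distance ~ speed, data = df)

# Prediction at an illustrative speed of 80 km/hr
pred <- predict(fit, newdata = data.frame(speed = 80))

# The same prediction computed directly from the coefficients
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * 80

pred
by_hand
```

Both values should agree, at roughly 36 m for the coefficients reported above.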

Let’s visualize this with a scatterplot, with the regression line overlaid on the data points.

plot(speed, stop_distance)
abline(mtcars.lm)

Let’s evaluate the quality of the model. The information we obtained shows us the regression model’s basic values, but does not tell us anything about the model’s quality. In fact, there are many different ways to evaluate a regression model’s quality. The function summary() extracts some additional information that we can use to determine how well the data fit the resulting model.

summary(mtcars.lm)
## 
## Call:
## lm(formula = stop_distance ~ speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3680 -0.3376 -0.2241  0.2222  1.8132 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -38.32285    0.96078  -39.89   <2e-16 ***
## speed         0.93294    0.01167   79.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.53 on 30 degrees of freedom
## Multiple R-squared:  0.9953, Adjusted R-squared:  0.9952 
## F-statistic:  6390 on 1 and 30 DF,  p-value: < 2.2e-16

According to the textbook, the residuals are the differences between the actual measured values and the corresponding values on the fitted regression line. Min is the smallest residual, which is the distance from the regression line to the point furthest below the line. Similarly, Max is the distance from the regression line to the point furthest above the line. Median is the median value of all the residuals. A good model’s residuals should be roughly balanced around, and not too far from, a mean of zero. Such a model would tend to have a median near zero, minimum and maximum values of roughly the same magnitude, and first and third quartile values of roughly the same magnitude. This model meets these criteria only in part: the residuals are small relative to stopping distances of tens of meters, and the quartiles are of similar size (-0.34 and 0.22), but the median is -0.22 rather than zero and the maximum (1.81) is about five times the magnitude of the minimum (-0.37), so the residuals are skewed toward large positive values.

The standard error column shows the statistical standard error for each of the coefficients. For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient. A large ratio means that there is relatively little uncertainty in the coefficient estimate.

print("Is the standard error at least 5 to 10 times smaller than the corresponding coefficients for speed?")
## [1] "Is the standard error at least 5 to 10 times smaller than the corresponding coefficients for speed?"
0.93294/0.01167 > 10
## [1] TRUE
print("Is the standard error at least 5 to 10 times smaller than the corresponding coefficients for the intercept?")
## [1] "Is the standard error at least 5 to 10 times smaller than the corresponding coefficients for the intercept?"
abs(-38.32285)/0.96078 > 10
## [1] TRUE
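The same check can be made without retyping numbers by reading the estimates and standard errors from the coefficient table. This is a sketch that refits the model on rebuilt data; df, fit, coefs, and ratio are illustrative names.

```r
data(mtcars)
df <- data.frame(speed = (0.25 / 0.621371) / (mtcars$qsec / 3600))
df$stop_distance <- (df$speed * 1000 / 3600)^2 / (2 * 0.7 * 9.8)
fit <- lm(stop_distance ~ speed, data = df)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# Ratio of each coefficient's magnitude to its standard error
ratio <- abs(coefs[, "Estimate"]) / coefs[, "Std. Error"]
ratio
ratio > 10
```

This ratio is the absolute value of the t value column in the summary, so the check can also be read directly from that column.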

The last column, labeled \(Pr(>|t|)\), shows the p-value of each coefficient: the probability of observing a t value at least this extreme if the true coefficient were zero. The smaller the p-value, the stronger the evidence that the term contributes to the model. Here, both the intercept and speed have p-values below 2e-16 and are highly significant.

The residual standard error is a measure of the total variation in the residual values. If the residuals are normally distributed, the first and third quartiles of the residuals should lie about 0.67 residual standard errors on either side of zero, which here is about ±0.36 (0.67 × 0.53); the observed quartiles are -0.34 and 0.22. Degrees of freedom is the total number of measurements used to generate the model minus the number of coefficients in the model, here 32 - 2 = 30.
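The quantities in this paragraph can be extracted from the model and compared with what a normal distribution would imply. The following is a sketch under the same setup as before; df, fit, and sigma_hat are illustrative names.

```r
data(mtcars)
df <- data.frame(speed = (0.25 / 0.621371) / (mtcars$qsec / 3600))
df$stop_distance <- (df$speed * 1000 / 3600)^2 / (2 * 0.7 * 9.8)
fit <- lm(stop_distance ~ speed, data = df)

sigma_hat <- sigma(fit)  # residual standard error
df.residual(fit)         # observations minus number of coefficients

# Quartiles expected if residuals were normal with this standard error
qnorm(c(0.25, 0.75)) * sigma_hat

# Quartiles actually observed
quantile(resid(fit), c(0.25, 0.75))
```

Comparing the last two results shows how closely the observed quartiles follow the normal expectation.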

The multiple R-squared value is a number between 0 and 1. It is a statistical measure of how well the model describes the measured data. The reported \(R^2\) of 0.9953 for this model means that the model explains 99.53% of the variation in stop_distance. This is a very strong linear relationship.

The adjusted R-squared value is the \(R^2\) value modified to take into account the number of predictors used in the model. The adjusted \(R^2\) is never larger than the \(R^2\) value; here it is 0.9952, barely below 0.9953, because the model has only one predictor.
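Both values can be reproduced from the residuals, which makes their definitions concrete. This is a sketch using the standard formulas; the names rss, tss, n, and p are illustrative.

```r
data(mtcars)
df <- data.frame(speed = (0.25 / 0.621371) / (mtcars$qsec / 3600))
df$stop_distance <- (df$speed * 1000 / 3600)^2 / (2 * 0.7 * 9.8)
fit <- lm(stop_distance ~ speed, data = df)

rss <- sum(resid(fit)^2)                                   # residual sum of squares
tss <- sum((df$stop_distance - mean(df$stop_distance))^2)  # total sum of squares
n   <- nrow(df)                                            # number of observations
p   <- 1                                                   # number of predictors

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2)
```

These should match the Multiple R-squared and Adjusted R-squared lines reported by summary().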

Let’s analyze the residuals, first with a plot of the residuals against the fitted values and then with a normal quantile-quantile plot.

plot(fitted(mtcars.lm), resid(mtcars.lm))

qqnorm(resid(mtcars.lm))
qqline(resid(mtcars.lm))

The residual plot and the quantile-versus-quantile plot are very interesting. The residuals show a clear pattern rather than random scatter, which means that a straight line in speed does not fully capture the relationship, despite the very high \(R^2\) value. This is expected here: stop_distance was computed from the square of speed, so the true relationship is quadratic and a linear fit misses the curvature. The Q-Q plot likewise suggests that the residuals are not normally distributed. A more complex model, for example one that includes a squared speed term, would be better able to explain the data.
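One way to probe the residual pattern is to compare the linear fit with a model on the squared speed, since stop_distance was computed from the square of speed. This is a sketch, not part of the original analysis; fit_lin and fit_quad are illustrative names. Calling summary() on the quadratic fit is avoided here because R may warn about an essentially perfect fit.

```r
data(mtcars)
df <- data.frame(speed = (0.25 / 0.621371) / (mtcars$qsec / 3600))
df$stop_distance <- (df$speed * 1000 / 3600)^2 / (2 * 0.7 * 9.8)

fit_lin  <- lm(stop_distance ~ speed, data = df)       # model used in the text
fit_quad <- lm(stop_distance ~ I(speed^2), data = df)  # squared-speed term only

# Largest absolute residual under each model
max(abs(resid(fit_lin)))
max(abs(resid(fit_quad)))
```

The residuals of the squared-speed model should be zero up to floating-point error, because the response was generated by exactly that formula.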