Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

#Load the packages needed
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(broom)
## Warning: package 'broom' was built under R version 4.2.2
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.2.2
library(carData)
# Note: the cars dataset used below ships with base R (the datasets
# package), so loading carData is not strictly required here.
cars.data = cars
head(cars.data)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Visualize the Data

We will plot the speed values from the cars.data dataset as the independent variable on the x-axis. The dependent variable is the dist column from the cars.data dataset, which we plot on the y-axis. The function argument main="Cars" provides a title for the plot, while xlab="Speed" and ylab="Distance" provide labels for the x- and y-axes, respectively.

plot(cars.data[,"speed"],cars.data[,"dist"], main="Cars",
xlab="Speed", ylab="Distance")

The figure above shows that the Distance tends to increase as the Speed increases. If we superimpose a straight line on this scatter plot (See below), we see that the relationship between the predictor (Speed) and the output (Distance) is roughly linear.

It is not perfectly linear, however. As the Speed increases, we see a larger spread in Distance.

scatter.smooth(cars.data[,"speed"],cars.data[,"dist"], main="Cars",
xlab="Speed", ylab="Distance") 

The Linear Model Function

ŷ = a0 + a1x1

where x1 is the input to the system, a0 is the y-intercept of the line, a1 is the slope, and ŷ is the output value the model predicts. The hat (ˆ) indicates a predicted or estimated value, rather than the actual observed value.

cars.lm <- lm(dist ~ speed, data=cars.data)
cars.lm
## 
## Call:
## lm(formula = dist ~ speed, data = cars.data)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

The y-intercept is a0 = -17.579 and the slope is a1 = 3.932. Thus, the final regression model is: dist = -17.579 + 3.932 * speed.
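As a quick check, we can pull the coefficients out of the fitted model object and apply this formula by hand. This is a minimal sketch; the speed of 10 mph is an arbitrary example value, not part of the original analysis.

a0 <- coef(cars.lm)["(Intercept)"]
a1 <- coef(cars.lm)["speed"]
a0 + a1 * 10   # about -17.579 + 3.932 * 10 = 21.7 feet to stop at 10 mph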

The following code plots the original data along with the fitted line. The abline() function draws a line using the slope and intercept of the linear model given as its argument.

plot(dist ~ speed, data=cars.data)
abline(cars.lm)

Evaluating the Quality of the Model

The summary() function, called below, provides much more information about the fitted model. The first few lines of its output, Call:, simply repeat how the lm() function was called.

The residuals are the differences between the actual measured values and the corresponding values on the fitted regression line.

Each data point’s residual is the distance that the individual data point is above (positive residual) or below (negative residual) the regression line. Min is the minimum residual value, which is the distance from the regression line to the point furthest below the line. Similarly, Max is the distance from the regression line to the point furthest above the line. Median is the median value of all of the residuals. The 1Q and 3Q values are the points that mark the first and third quartiles of all the sorted residual values.

If the line is a good fit to the data, we would expect residual values that are normally distributed around a mean of zero.

A good model would tend to have a median value near zero, minimum and maximum values of roughly the same magnitude, and first and third quartile values of roughly the same magnitude.
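We can reproduce these summary statistics directly from the model; the default output of quantile() should match (up to rounding) the Min, 1Q, Median, 3Q, and Max values that summary() reports below.

quantile(resid(cars.lm))   # min, first quartile, median, third quartile, max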

The estimated coefficient values are simply the fitted regression model values. The Std. Error column shows the statistical standard error for each of the coefficients. We look for a standard error that is at least five to ten times smaller than the corresponding coefficient.

Pr(>|t|) shows the probability of observing a test statistic (t value) as extreme as or more extreme than the one observed, assuming there is no linear relationship between the predictor and response variables. This value is also known as the p-value of the coefficient.
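Since broom is already loaded, the same coefficient table can also be extracted as a tidy data frame. Note that the statistic column is simply the estimate divided by the std.error, i.e., the t value.

tidy(cars.lm)   # term, estimate, std.error, statistic, p.value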

The final few lines in the output provide some statistical information about the quality of the regression model’s fit to the data. The Residual standard error is a measure of the total variation in the residual values. If the residuals are distributed normally, the first and third quartiles of the residuals should be about 1.5 times this standard error. The number of degrees of freedom is the total number of measurements or observations used to generate the model, minus the number of coefficients in the model. Here, we used 50 observations to produce a regression model with two coefficients (the slope and the intercept), giving 48 degrees of freedom.

The Multiple R-squared value is a number between 0 and 1. It is a statistical measure of how well the model describes the measured data. The Adjusted R-squared value is the R-squared value modified to take into account the number of predictors used in the model. The adjusted R-squared is always smaller than the R-squared value.

summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
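As a sketch of how these quantities are defined, we can recompute the degrees of freedom, residual standard error, and R-squared directly from the residuals.

res <- resid(cars.lm)
n <- nrow(cars.data)                 # 50 observations
deg.free <- n - 2                    # minus the two fitted coefficients
rse <- sqrt(sum(res^2) / deg.free)   # residual standard error, about 15.38
r2 <- 1 - sum(res^2) / sum((cars.data$dist - mean(cars.data$dist))^2)
c(df = deg.free, rse = rse, r.squared = r2)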

For this model, the residuals are reasonably well behaved: the median (-2.272) is close to zero, and the 1Q and 3Q values (-9.525 and 9.215) are of roughly the same magnitude, although the maximum (43.201) is noticeably larger in magnitude than the minimum (-29.069).

If we wanted to predict the distance required for a car to stop given its speed, we would fit the model to a training set to estimate the coefficients, and then plug new speed values into the fitted model formula.
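A minimal prediction sketch follows; the speed of 21 mph is an arbitrary example value. The interval = "confidence" argument adds a confidence interval around the predicted mean stopping distance.

predict(cars.lm, newdata = data.frame(speed = 21), interval = "confidence")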

The intercept coefficient estimate is the model’s predicted stopping distance when the speed is zero: -17.579 feet.

Hypothetically, this would mean that a car going 0 mph takes -17.579 feet to stop. Since a negative stopping distance is impossible, the y-intercept cannot be meaningfully interpreted in this instance.

The second row in the Coefficients table is the slope: the effect speed has on the distance required for a car to stop. The slope term in our model says that for every 1 mph increase in the speed of a car, the required stopping distance goes up by 3.9324 feet.


The coefficient for speed is 9.464 times larger than its standard error (0.4155). This ratio means that there is relatively little variability in the slope estimate, a1, so we can be fairly confident in it. The standard error for the intercept, a0, is 6.7584, which is only about 2.6 times smaller than the estimated value of -17.5791. This suggests that there is less certainty in the estimate of this coefficient for this model.

The Pr(>|t|) value for speed is 1.49e-12, so we can say that there is strong evidence of a linear relationship between speed and distance.

The Residual standard error means that the actual distance required to stop can deviate from the true regression line by approximately 15.38 feet, on average.

The multiple R-squared is 0.6511, which means that this model explains 65.11% of the data’s variation.

The F-statistic (89.57, with a p-value of 1.49e-12) is much larger than 1, which indicates that there is a relationship between the predictor and the response variable.

Residual Analysis

Residual analysis examines the residual values to see what they can tell us about the model’s quality.

Residual values greater than zero mean that the regression model predicted a value that was too small compared to the actual measured value, and negative values indicate that the regression model predicted a value that was too large. A model that fits the data well would tend to over-predict as often as it under-predicts. Thus, if we plot the residual values, we would expect to see them distributed normally around zero for a well-fitted model.
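As a quick numeric companion to this idea, we can count how often the model under-predicts (positive residuals) versus over-predicts (negative residuals):

table(sign(resid(cars.lm)))   # counts of negative (-1) and positive (1) residuals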

The following function produces the residuals plot for our model.

plot(fitted(cars.lm), resid(cars.lm))

The plot above shows that the residuals look uniformly scattered above and below zero, with no obvious pattern as the fitted values change.
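Since ggplot2 and broom were loaded earlier, an equivalent residual plot can be sketched with them; augment() attaches the fitted values (.fitted) and residuals (.resid) to the original data.

ggplot(augment(cars.lm), aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals", title = "Cars")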

Another test of the residuals uses the quantile-versus-quantile (Q-Q) plot.

qqnorm(resid(cars.lm))
qqline(resid(cars.lm))

The points fall approximately along the straight line drawn by qqline(), which indicates that the residuals are roughly normally distributed. This test further suggests that using only the speed as a predictor in the model is sufficient to explain the data.
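A complementary numeric check is the Shapiro-Wilk normality test from base R; a small p-value would suggest the residuals depart from normality, which this test may flag here given the large positive residuals noted earlier.

shapiro.test(resid(cars.lm))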

The above two diagnostic plots, plus two additional ones, can be obtained by using the plot() function with the linear model as the parameter.

The “Scale-Location” plot is an alternate way of visualizing the residuals versus the fitted values from the linear regression model, except that the residuals are standardized and then square-root transformed. This essentially folds the residuals and can aid in finding patterns in them.

The Residuals vs Leverage plot can be used to identify possible outliers.

par(mfrow=c(2,2))
plot(cars.lm)

In the Scale-Location plot, the red line is approximately horizontal, which means the average magnitude of the standardized residuals is not changing much as a function of the fitted values. Whether the spread around the red line varies with the fitted values (that is, whether the variability of the magnitudes is roughly constant) is less clear.

The Residuals vs Leverage plot can be used to detect heteroskedasticity and non-linearity. The spread of the standardized residuals should not change as a function of leverage; here it appears to decrease, which suggests heteroskedasticity. In addition, points with high leverage may be influential: that is, deleting them would change the model substantially.
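To follow up numerically, we can compute the leverage and Cook’s distance of each observation; unusually large values flag potentially influential points. This is a sketch, not part of the original analysis.

lev <- hatvalues(cars.lm)             # leverage of each observation
cook <- cooks.distance(cars.lm)       # overall influence of each observation
head(sort(cook, decreasing = TRUE))   # observations with the largest influence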

Conclusion

From our one-factor linear model of the cars dataset, we see a number of promising regularities in the residuals, indicating a strong relationship between the explanatory and response variables (speed and stopping distance). The residuals are roughly evenly distributed around zero, and the model demonstrates a highly significant relationship between the variables (low p-values). Despite this, the model is not a perfect fit: it explains only about 65% of the variation in the data, and the residuals show some divergence from normality in the tails, indicating skew, among other factors yet to be discovered.

It is likely that speed would continue to play a significant role in a multivariate model, given the strong correlation indicated by the statistics and the nearly normal properties of the residuals. A larger number of observations might also improve the model by bringing the residuals closer to normality.