Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of Chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).
stopping_distance = 3.932(speed) - 17.579
In this linear model, we see the y-intercept is -17.579, and the coefficient for speed is 3.932. Also, speed has a p-value less than 0.001. The adjusted R squared value is 0.6438, and the residual standard error is 15.38.
carsLM <- lm(dist ~ speed, data = cars)
summary(carsLM)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
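As a small illustration of how the fitted equation is used, the sketch below predicts stopping distance for a few speeds, once by plugging into the equation by hand and once with predict(). The speeds 10, 15, and 20 mph are hypothetical example values chosen to lie inside the observed range of the data. They are not part of the assignment.

```r
# Refit the model so this sketch runs on its own
carsLM <- lm(dist ~ speed, data = cars)

# Pull the intercept and slope out of the fitted model
b <- coef(carsLM)

# Hypothetical example speeds (mph), chosen inside the observed range
new_speeds <- data.frame(speed = c(10, 15, 20))

# Prediction by hand: intercept + slope * speed
by_hand <- unname(b["(Intercept)"] + b["speed"] * new_speeds$speed)

# The same prediction using predict()
by_predict <- unname(predict(carsLM, newdata = new_speeds))

# Show both sets of predictions side by side (in feet)
print(cbind(new_speeds, by_hand, by_predict))
```

Both columns should agree, since predict() evaluates the same linear equation. Predictions for speeds far outside the measured range should be treated with caution.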
In this section, each step corresponds to one step of the textbook’s analysis.
This plot shows the relationship between speed and stopping distance. Speed was measured in miles per hour, and stopping distance was measured in feet.
plot(cars[,"speed"], cars[,"dist"], main = "Speed vs. Stopping Distance", xlab="Speed (mph)", ylab="Stopping Distance (feet)")
This section shows the linear model, the output of the model, and a plot of the regression line over the data. The regression line appears to follow the upward trend of the data points well.
carsLM <- lm(dist ~ speed, data = cars)
carsLM
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept)        speed
##     -17.579        3.932
plot(dist ~ speed, data=cars)
abline(carsLM)
In this summary, you can see the residuals, the coefficients, and other statistics. The Residuals section summarizes the differences between the measured values and the values on the regression line. Each residual is the vertical distance between the regression line and one data point.
The Coefficients section shows the values for the linear regression equation. Each entry in the Std. Error column is the standard error of the corresponding coefficient. As a rule of thumb, a standard error at least five to ten times smaller than the coefficient is best. In this case, the standard error of speed meets this guideline. The ratio of the estimate to the standard error is the test statistic, which is labeled as the t value in the summary table. The last column shows the probability of seeing a test statistic as extreme as or more extreme than the one observed if the true coefficient were zero.
The last several lines provide statistical information about the model as a whole. The residual standard error measures the typical size of the residuals: it is the square root of the residual sum of squares divided by the residual degrees of freedom. The multiple R-squared value measures how well the model describes the measured data, as the fraction of the variation in stopping distance that the model explains. Values closer to one indicate a better fit. The adjusted R-squared value is similar, except that it takes into account the number of predictors used in the model. The F-statistic compares the current model to one with only the intercept. It is more informative for models with multiple predictors.
In this model for the cars data, the standard error for speed is in a good range at about 1/9 the size of the coefficient. The p-value for speed is also significant. The R-squared values are moderate: at a little more than 0.6, they are neither strong nor weak. Since this model contains only one predictor, the F-statistic adds little beyond the t-test for speed, but its small p-value indicates that a model with speed as the independent variable fits the data significantly better than a model with only the intercept.
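The coefficient checks described above can be reproduced by hand. The following is a minimal sketch, not the textbook's code: it pulls the estimates and standard errors from the coefficient table, computes their ratio, and recomputes the two-sided p-values from the t distribution. The variable names (est, se, ratio, and so on) are chosen here for illustration.

```r
# Refit the model so this sketch runs on its own
carsLM <- lm(dist ~ speed, data = cars)

# Coefficient table as a numeric matrix
coefs <- coef(summary(carsLM))
est <- coefs[, "Estimate"]
se  <- coefs[, "Std. Error"]

# How many times larger each estimate is than its standard error
ratio <- abs(est / se)

# The t value is the estimate divided by its standard error
t_value <- est / se

# Two-sided p-value from the t distribution with the residual degrees of freedom
df_resid <- df.residual(carsLM)
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

print(round(cbind(est, se, ratio, t_value), 3))
print(signif(p_value, 3))
```

The recomputed t values and p-values should match the t value and Pr(>|t|) columns reported by summary().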
summary(carsLM)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
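To make the last three lines of the summary more concrete, the sketch below recomputes the residual standard error, the R-squared values, and the F-statistic from the residuals using their standard definitions. It is an illustrative check rather than part of the textbook's procedure, and the variable names are chosen here for the example.

```r
# Refit the model so this sketch runs on its own
carsLM <- lm(dist ~ speed, data = cars)

res <- resid(carsLM)
n <- nrow(cars)   # number of observations
p <- 1            # number of predictors (speed only)

# Residual and total sums of squares
rss <- sum(res^2)
tss <- sum((cars$dist - mean(cars$dist))^2)

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rse <- sqrt(rss / (n - p - 1))

# Multiple R-squared: fraction of the variation in dist explained by the model
r2 <- 1 - rss / tss

# Adjusted R-squared: penalizes the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# F-statistic: this model compared to the intercept-only model
f_stat <- ((tss - rss) / p) / (rss / (n - p - 1))

print(round(c(rse = rse, r2 = r2, adj_r2 = adj_r2, f_stat = f_stat), 4))
```

With a single predictor, the F-statistic equals the square of the t value for speed, which is why it adds little information in this model.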
This analysis looks at the residuals to help determine the quality of the model. A well-fit model will have residuals scattered around 0. A better model will also have residuals that are spread nearly uniformly across all of the data, so the residuals should not show a trend. When the residuals indicate a well-fit model, the model sufficiently explains the data. Along with a residual plot, a Q-Q plot shows whether the residuals are normally distributed. When the residuals are normally distributed, the points on the Q-Q plot fall along a straight line.
In the graphs below, the residuals are not uniform across the data, and the right side of the Q-Q plot does not follow a straight line. The final four graphs, arranged in a grid, show more information about the residuals as well as outliers.
plot(fitted(carsLM),resid(carsLM))
qqnorm(resid(carsLM))
qqline(resid(carsLM))
par(mfrow=c(2,2))
plot(carsLM)
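The visual checks above can be supplemented with a few numeric ones. The sketch below is an addition that goes beyond the plots and is offered only as one possible approach: it confirms that the residuals average to zero, compares the residual spread for the lower and upper halves of the fitted values (a rough, informal check of uniformity chosen for this example), and runs a Shapiro-Wilk normality test with shapiro.test() from base R.

```r
# Refit the model so this sketch runs on its own
carsLM <- lm(dist ~ speed, data = cars)

res <- resid(carsLM)
fit <- fitted(carsLM)

# Residuals from a least-squares fit with an intercept average to zero
mean_res <- mean(res)

# Informal uniformity check: compare the residual spread for the lower
# and upper halves of the fitted values (the split at the median is an
# assumption made for this example, not a formal test)
low_half <- fit <= median(fit)
sd_low  <- sd(res[low_half])
sd_high <- sd(res[!low_half])

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)

print(c(mean_res = mean_res, sd_low = sd_low, sd_high = sd_high))
print(sw)
```

A noticeably larger spread in one half, or a small Shapiro-Wilk p-value, would be consistent with what the residual and Q-Q plots show, but these numbers should be read alongside the plots rather than in place of them.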
Lilja, David J; Linse, Greta M. (2022). Linear Regression Using R: An Introduction to Data Modeling, 2nd Edition. University of Minnesota Libraries Publishing. Retrieved from the University of Minnesota Digital Conservancy, https://hdl.handle.net/11299/189222.