Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
A linear regression model describes the relationship between the input (predictor) variables and the output (response) variable. The cars data set has 50 observations and 2 variables. To determine whether a linear relationship appears to exist between the predictor (speed) and the output value (stopping distance), we first plot the data:
plot(cars[,"speed"],cars[,"dist"], main="Stopping Distance as a Function of Speed",
xlab="Speed", ylab="Distance")
The figure shows a scatter plot of stopping distance versus speed, with speed as the independent variable on the x-axis and stopping distance as the dependent variable on the y-axis.
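To put a number on what the scatter plot suggests, a quick check along the following lines could confirm the size of the data set and compute the Pearson correlation between the two variables. The names `cars_dim` and `speed_dist_cor` are illustrative and not part of the original analysis.

```r
# Dimensions of the built-in cars data frame: rows (observations), columns (variables)
cars_dim <- dim(cars)
print(cars_dim)

# Pearson correlation between speed and stopping distance;
# values close to +1 or -1 indicate a strong linear association
speed_dist_cor <- cor(cars$speed, cars$dist)
print(speed_dist_cor)
```

A correlation close to 1 is consistent with the upward trend in the plot, but it does not by itself show that a straight line is the right model; the residual analysis later in this report addresses that.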
The lm() function fits linear models to data. Such models examine the relationship between one or more independent variables (predictors) and a dependent variable (response). Fitting dist as a function of speed gives:
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
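The call that produced this output is not shown. Judging from the Call line and from the object name used later in summary(cars.lm), it was presumably along these lines (a reconstruction, not necessarily the exact original code):

```r
# Fit a simple linear regression of stopping distance (dist) on speed
cars.lm <- lm(dist ~ speed, data = cars)

# Printing the fitted object shows the call and the estimated coefficients
cars.lm

# coef() returns the same estimates as a named numeric vector
coef(cars.lm)
```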
In this case, the \(y\)-intercept is \(\mathrm{a}_0=-17.579\) and the slope is \(\mathrm{a}_1=3.932\). Thus, the final regression model is: \[ \text{Distance} = -17.579 + 3.932 \times \text{speed} \] Calling abline() with the fitted model as its argument draws the regression line on the active plot window, using the model's slope and intercept. The two variables appear to be linearly related.
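The regression line can be overlaid on the scatter plot roughly as follows. This is a sketch: the plotting call repeats the one used earlier, and the speed of 15 used for the prediction is an arbitrary illustrative value, not one taken from the original analysis.

```r
# Refit the model so that this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

# Scatter plot of the raw data
plot(cars$speed, cars$dist,
     main = "Stopping Distance as a Function of Speed",
     xlab = "Speed", ylab = "Distance")

# abline() reads the intercept and slope from the fitted model
abline(cars.lm)

# Predicted stopping distance at an illustrative speed of 15
# (equivalent to evaluating a0 + a1 * 15 by hand)
pred_15 <- predict(cars.lm, newdata = data.frame(speed = 15))
print(pred_15)
```

Using predict() rather than typing the rounded coefficients avoids rounding error in the result.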
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
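For a model that fits the data well, the residuals would tend to have a median near zero, minimum and maximum values of roughly similar magnitude, and first and third quartile values of roughly the same magnitude. Here the median (-2.272) is fairly close to zero and the quartiles (-9.525 and 9.215) are similar in magnitude, but the maximum (43.201) is noticeably larger in magnitude than the minimum (-29.069). The Estimate column gives the coefficient values.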
The standard error for speed is 9.46 times smaller than the coefficient value (3.9324/0.4155 = 9.46). This large ratio means that there is relatively little variability in the slope estimate, a1.
The standard error for the intercept, a0, is 6.7584, which is only about 2.6 times smaller than the magnitude of the estimated value of -17.5791 for this coefficient (17.5791/6.7584 = 2.60). This smaller ratio means there is more uncertainty in the estimate of the intercept than in the estimate of the slope.
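The ratio of each estimate to its standard error can be read directly from the coefficient table. A minimal sketch follows; the names `coef_table` and `ratio` are illustrative.

```r
# Refit the model so that this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

# Coefficient table as a matrix with columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)"
coef_table <- coef(summary(cars.lm))

# Magnitude of each estimate relative to its standard error
ratio <- abs(coef_table[, "Estimate"]) / coef_table[, "Std. Error"]
print(round(ratio, 2))
```

These ratios are the absolute values of the t value column in the summary output, which is why the slope, with a ratio near 9.46, is estimated far more precisely than the intercept, with a ratio near 2.60.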
The column labeled Pr(>|t|) gives the p-value of each coefficient. For the slope estimate for speed it is 1.49e-12, a tiny value. Since this value is so small, we can say that there is strong evidence of a linear relationship between a car's speed and its stopping distance.
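The Residual standard error (15.38) is an estimate of the standard deviation of the residual values. If the residuals are distributed normally, the first and third quartiles of the residuals should lie at about 0.67 times this standard error on either side of zero, that is, near -10.4 and 10.4. The observed quartiles of -9.525 and 9.215 are reasonably close to these values.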
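As a rough check of the quartiles against the residual standard error, the quartiles expected under normality can be compared with the observed ones. This sketch assumes normal residuals centered at zero, for which the quartiles sit at about 0.674 standard deviations from the mean; the variable names are illustrative.

```r
# Refit the model so that this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

# Residual standard error as reported by summary()
rse <- summary(cars.lm)$sigma

# Quartiles expected for normal residuals with mean 0 and sd = rse
expected_q <- qnorm(c(0.25, 0.75)) * rse

# Quartiles actually observed in the residuals
observed_q <- quantile(resid(cars.lm), c(0.25, 0.75))

print(round(expected_q, 2))
print(round(observed_q, 2))
```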
The number of degrees of freedom is the total number of measurements or observations used to generate the model, minus the number of coefficients in the model. The data frame has 50 rows, corresponding to 50 independent measurements. We used this data to produce a regression model with two coefficients: the slope and the intercept. Thus, we are left with 50 - 2 = 48 degrees of freedom.
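The same count can be reproduced in R, for example as below. The names `n_obs`, `n_coef`, `df_manual` and `df_model` are illustrative.

```r
# Refit the model so that this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

n_obs  <- nrow(cars)             # number of observations
n_coef <- length(coef(cars.lm))  # intercept and slope

# Degrees of freedom by hand, and as stored in the fitted model
df_manual <- n_obs - n_coef
df_model  <- df.residual(cars.lm)

print(c(manual = df_manual, model = df_model))
```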
The Multiple R-squared value is a number between 0 and 1. It is a statistical measure of how well the model describes the measured data. The reported \(R^2\) of 0.6511 for this model means that 65.11% of the variability in stopping distance is explained by the variation in the cars' speed.
The Adjusted R-squared value is the \(R^2\) value modified to take into account the number of predictors used in the model. The adjusted \(R^2\) (0.6438 here) is never larger than the \(R^2\) value.
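Both values can be recomputed from the residuals using the standard definitions. This is a sketch for illustration; `ss_res`, `ss_tot`, `r2` and `adj_r2` are names introduced here.

```r
# Refit the model so that this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

# Residual and total sums of squares
ss_res <- sum(resid(cars.lm)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared: share of the variability in dist explained by the model
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared: penalizes the number of predictors p
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
```

Because the factor (n - 1) / (n - p - 1) is greater than 1 whenever the model has at least one predictor, the adjusted value cannot exceed the unadjusted one.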
The F-statistic compares the fitted model with a model that contains only the intercept. In a simple linear regression with a single predictor, the F-statistic is the t value of the slope squared (9.464 squared is about 89.57), and the two statistics have the same p-value (1.49e-12) when compared to their respective distributions.
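The relationship between the F-statistic and the slope's t value can be verified numerically. A minimal sketch, with illustrative variable names:

```r
# Refit the model so that this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)
s <- summary(cars.lm)

# t value and p-value of the slope
t_speed <- coef(s)["speed", "t value"]
t_p     <- coef(s)["speed", "Pr(>|t|)"]

# F-statistic and its p-value from the F distribution
f_stat <- s$fstatistic[["value"]]
f_p    <- pf(f_stat, s$fstatistic[["numdf"]], s$fstatistic[["dendf"]],
             lower.tail = FALSE)

print(c(t_squared = t_speed^2, f_stat = f_stat))
print(c(t_p = t_p, f_p = f_p))
```

This equality holds only when the model has a single predictor; with several predictors the F-statistic tests all of them jointly.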
Overall, the model indicates a statistically significant positive relationship between speed and stopping distance: the coefficient for speed is positive (3.93), and speed explains about 65% of the variability in stopping distance.
Residual analysis examines the residuals of the model. A residual is the difference between the actual measured value stored in the data frame and the value that the fitted regression line predicts for that corresponding data point. Residual values greater than zero mean that the regression model predicted a value that was too small compared to the actual measured value, and negative values indicate that the regression model predicted a value that was too large. A model that fits the data well would tend to overpredict as often as it underpredicts.
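The definition of a residual can be checked by computing the residuals by hand and counting how often the model underpredicts and overpredicts. The names below are illustrative.

```r
# Refit the model so that this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

# Residual = measured value minus fitted (predicted) value
manual_resid <- cars$dist - fitted(cars.lm)

# Should agree with the residuals stored in the model
print(all.equal(unname(manual_resid), unname(resid(cars.lm))))

# Positive residual: prediction too small; negative: prediction too large
n_under <- sum(manual_resid > 0)
n_over  <- sum(manual_resid < 0)
print(c(underpredicted = n_under, overpredicted = n_over))
```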
plot(fitted(cars.lm),resid(cars.lm), main = "The residual values versus the output values from the SLR")
abline(0,0)
Another test of the residuals uses the quantile-versus-quantile, or Q-Q, plot that provides a nice visual indication of whether the residuals from the model are normally distributed.
qqnorm(resid(cars.lm))
qqline(resid(cars.lm))
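This plot indicates that the residuals are approximately normally distributed if the plotted points follow a straight line. Most of the points do, although the large maximum residual (43.201, compared with a minimum of -29.069) suggests that any departure from normality is most likely in the upper tail.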
par(mfrow=c(2,2)) # Set up a 2 by 2 grid
plot(cars.lm) # plots for the linear regression model
The Residuals vs Fitted plot shows the differences between observed and predicted values on the y-axis and the predicted values on the x-axis. It is used to check linearity: a random scatter of points with no clear pattern supports the linear model.
The Normal Q-Q plot assesses whether the residuals are normally distributed. If they are, the points should roughly follow a straight line; deviations from the line suggest non-normality of the residuals.
The Scale-Location plot shows the square root of the absolute standardized residuals against the fitted values and is used to check for constant variance of the residuals.
The Residuals vs Leverage plot helps identify outliers and high-leverage points that disproportionately influence the regression model.
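The points highlighted in the Residuals vs Leverage plot can also be listed numerically. The sketch below uses two common rules of thumb that are assumptions here, not part of the original analysis: Cook's distance above 4/n and leverage above 2p/n, where p is the number of coefficients.

```r
# Refit the model so that this block is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

n <- nrow(cars)
p <- length(coef(cars.lm))

# Influence measures for every observation
cooks <- cooks.distance(cars.lm)
lev   <- hatvalues(cars.lm)

# Rule-of-thumb cutoffs (assumed for illustration, not definitive)
influential   <- which(cooks > 4 / n)
high_leverage <- which(lev > 2 * p / n)

print(influential)
print(high_leverage)
```

Observations flagged this way are candidates for a closer look, not automatic removals; different texts use different cutoffs.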