Using the “cars” data set in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
Load the cars data set from the datasets package and view the data frame. The data frame contains 2 variables (speed and dist) and 50 observations. The abbreviation <dbl> next to each variable gives its type: double, that is, a real number. The data is then passed through na.omit(), which drops any rows containing NA values, and the result is stored as vehicles.
library(tidymodels)
data(cars)
glimpse(cars)
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13~
## $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34~
vehicles <- na.omit(cars)
The dimensions of the new object, vehicles, match those of the cars data set (50 observations and 2 variables), so no rows were removed.
dim(vehicles)
## [1] 50 2
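As a cross-check on the missing-value step, the NA entries can also be counted directly. This is a minimal sketch using only base R; the name n_missing is introduced here for illustration and is not part of the original analysis.

```r
# Count missing values in the built-in cars data set (base R only).
data(cars)

# NA count per column; all zeros means na.omit() has nothing to drop
n_missing <- colSums(is.na(cars))
n_missing

# TRUE when na.omit() leaves the number of rows unchanged
nrow(na.omit(cars)) == nrow(cars)
```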
The summary() function calculates summary statistics for each variable: Min. (minimum), 1st Qu. (25th percentile), Median, Mean, 3rd Qu. (75th percentile), and Max. (maximum).
summary(vehicles)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
The first step in a one-factor modeling process is to determine whether the relationship between the predictor and the output value looks linear.
This scatterplot shows the relationship between a car's speed and its stopping distance (dist) in the vehicles data. The x-axis shows the independent variable (speed) and the y-axis shows the dependent variable (stopping distance).
library(graphics)
# plot a scatter plot
plot(vehicles$speed, vehicles$dist,
main="Scatterplot of Speed vs. Stopping Distance",
pch = 16,
xlab="Speed",
ylab="Stopping Distance")
# plot a regression line
abline(lm(dist ~ speed, data = vehicles), col = 'red')
The scatterplot indicates that stopping distance tends to increase as speed increases. A regression line was superimposed on the scatterplot to show that the relationship between the predictor (speed) and the output (stopping distance) is roughly linear.
To determine the degree of linearity in the relationship between the independent variable speed and the dependent variable stopping distance, a linear model is calculated using the lm() function in R.
vehicles_lm <- lm(vehicles$dist ~ vehicles$speed)
vehicles_lm
##
## Call:
## lm(formula = vehicles$dist ~ vehicles$speed)
##
## Coefficients:
## (Intercept) vehicles$speed
## -17.579 3.932
The model output gives the linear function:
\[ \text{stopping distance} = -17.579 + 3.932 \times \text{speed} \]
The y-intercept is \(a_0\) = -17.579 and the slope is \(a_1\) = 3.932.
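To see the fitted line in use, the following sketch predicts stopping distance at a few speeds. It assumes the model is refit with a formula and a data argument so that predict() can accept newdata; the names fit and new_speeds are illustrative and not part of the original analysis. By hand, a speed of 15 gives roughly -17.579 + 3.932 × 15 ≈ 41.4.

```r
# Refit with formula + data so predict() can map 'speed' in newdata.
# 'fit' and 'new_speeds' are hypothetical names used only for this sketch.
fit <- lm(dist ~ speed, data = cars)

new_speeds <- data.frame(speed = c(10, 15, 20))

# Predicted stopping distances from the fitted model
pred <- predict(fit, newdata = new_speeds)
pred

# The same calculation by hand: intercept + slope * speed
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * new_speeds$speed
by_hand
```

The speeds were chosen inside the observed range (4 to 25); predictions outside that range would be extrapolation.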
Use the summary() function to extract additional information about how well the model fits the data.
summary(vehicles_lm)
##
## Call:
## lm(formula = vehicles$dist ~ vehicles$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## vehicles$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Residuals: The differences between the actual measured values and the corresponding values on the fitted regression line.
The Min is the minimum residual value, the distance from the regression line to the point furthest below the line (-29.069). The Max is the corresponding distance to the point furthest above the line (43.201).
The Median value, -2.272, is the median of all the residuals.
The 1Q and 3Q values (-9.525 and 9.215) mark the first and third quartiles of the sorted residual values.
Is this model a good fit?
Yes, I think this model is a good fit because the median residual (-2.272) is close to zero and the first and third quartile values (-9.525 and 9.215) are roughly the same magnitude.
Coefficients: As a guideline, the standard error should be at least five to ten times smaller than the corresponding coefficient. Evidence: the standard error for speed (0.4155) is about 9.46 times smaller than the coefficient value (3.9324). The standard error for the intercept (6.7584) is only about 2.6 times smaller than its coefficient (-17.5791).
Significance or p-value of the coefficient: The column labeled Pr(>|t|) shows the probability that the corresponding coefficient is not relevant in the model. The probability that speed is not relevant in this model is \(1.49 \times 10^{-12}\), a tiny value. The probability that the intercept is not relevant is 0.0123.
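The coefficient-to-standard-error ratio can be computed from the coefficient table rather than read off by eye. This is a minimal sketch that refits the model on cars under the illustrative name fit; coefs and ratio are likewise names introduced here. The ratio is the absolute value of the t value column.

```r
# Ratio of each coefficient to its standard error.
# 'fit', 'coefs', and 'ratio' are hypothetical names for this sketch.
fit <- lm(dist ~ speed, data = cars)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# How many times smaller the standard error is than the coefficient
ratio <- abs(coefs[, "Estimate"]) / coefs[, "Std. Error"]
round(ratio, 3)
```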
Quality of the regression model’s fit to the data.
Residual standard error: Measures the total variation in the residual values; here it is 15.38. According to the book, if the residuals are normally distributed, the first and third quartiles of the residuals should be about 1.5 times this standard error.
Degrees of freedom: The number of measurements or observations used to generate the model, minus the number of coefficients in the model. This data frame had 50 unique rows corresponding to 50 independent measurements. The regression model had two coefficients: the slope and the intercept. Thus, we are left with 50 - 2 = 48 degrees of freedom.
Multiple R-squared: A value between 0 and 1 that measures how well the model describes the measured data. The reported \(R^2\) of 0.6511 for this model means that the model explains 65.11% of the data's variation.
Adjusted R-squared: The R-squared value adjusted for the number of predictors in the model. It is expected to be slightly smaller than the R-squared value, so the reported value of 0.6438 is reasonable.
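The quality measures above can be reproduced from the residuals, which shows how they relate to one another. This is a sketch under the assumption that the model is refit on cars as fit; all variable names below are illustrative.

```r
# Recompute degrees of freedom, residual standard error, R-squared,
# and adjusted R-squared by hand. All names here are hypothetical.
fit <- lm(dist ~ speed, data = cars)

n   <- nrow(cars)                            # number of observations
p   <- length(coef(fit))                     # coefficients: intercept and slope
df  <- n - p                                 # residual degrees of freedom
rss <- sum(resid(fit)^2)                     # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares

rse    <- sqrt(rss / df)                     # residual standard error
r2     <- 1 - rss / tss                      # multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / df        # adjusted R-squared

round(c(df = df, rse = rse, r2 = r2, adj_r2 = adj_r2), 4)
```

The results should agree with the values printed by summary(): 48 degrees of freedom, 15.38, 0.6511, and 0.6438.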
F-statistic: Not useful because this linear regression model has only one independent variable.
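The reason the F-statistic adds little here can be checked with the numbers already reported: with a single predictor, the F-statistic equals the square of the slope's t value.
\[ F = t^2 = 9.464^2 \approx 89.57 \]
Both tests also report the same p-value, \(1.49 \times 10^{-12}\), so the F-test repeats the information already in the coefficient table.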
Examine the residual values to assess the model’s quality. A plot of the residuals shows how they are distributed; for a well-fitted model we would expect them to be scattered uniformly around zero.
# residual plot for a well-fitted model
# plot() and abline() are separate base-graphics calls, so they are not joined with `+`
plot(fitted(vehicles_lm), resid(vehicles_lm), col = "blue")
abline(0, 0) # add a horizontal line at 0
The x-axis displays the fitted values and the y-axis displays the residuals. The spread of the residuals tends to increase somewhat as we move to the right, but the residuals are otherwise scattered fairly uniformly above and below zero. Overall, this residual plot suggests that speed is a good predictor of stopping distance.
An additional test of the residuals uses the Q-Q plot, which provides a visual indication of whether the residuals from the model are normally distributed.
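One rough way to put a number on the visual impression of changing spread is to compare the standard deviation of the residuals in the lower and upper halves of the fitted values. This is an informal sketch, not a formal test of constant variance; the split at the median and all names below are assumptions made for illustration.

```r
# Compare residual spread for low versus high fitted values.
# Informal check only; 'fit', 'upper', 'sd_low', 'sd_high' are hypothetical names.
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# TRUE for observations whose fitted value is above the median fitted value
upper <- fitted(fit) > median(fitted(fit))

sd_low  <- sd(res[!upper])   # spread of residuals at lower fitted values
sd_high <- sd(res[upper])    # spread of residuals at higher fitted values
c(sd_low = sd_low, sd_high = sd_high)
```

Similar values would be consistent with a roughly constant spread; a clearly larger value in one half would support the impression of changing spread.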
#quantile-versus-quantile (Q-Q) test
#create Q-Q plot
qqnorm(resid(vehicles_lm))
#add straight diagonal line to plot
#plot with red color, line width (lwd), and dashed line (lty)
qqline(resid(vehicles_lm), col = 'red', lwd = 2, lty = 2)
The quantile-versus-quantile (Q-Q) plot above shows that the two ends of the residual distribution diverge slightly from the reference line. If the residuals were normally distributed, the points would follow the straight line. The plot suggests that the distribution’s tails are “slightly heavier” than what we would expect from a normal distribution. However, the departure is small, so the residuals are likely close to normally distributed, which supports speed as a good predictor of stopping distance.
The linear regression model for the cars (vehicles) data set shows a relatively good fit for the relationship between the predictor (speed) and the response (stopping distance, dist). The summary() output for the model shows that the F-statistic (F = 89.57) is much greater than 1 and that the p-value (\(1.49 \times 10^{-12}\)) is far below the 0.05 significance level. This finding is good because it indicates that speed contributes significantly to explaining stopping distance.
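The visual reading of the Q-Q plot can be complemented with a numeric normality check. The sketch below uses the Shapiro-Wilk test from base R's stats package as one possible choice; it is not part of the textbook analysis, and the names fit and sw are illustrative. A small p-value would suggest that the residuals depart from normality.

```r
# Shapiro-Wilk normality test on the model residuals (stats package, base R).
# This is an optional complement to the Q-Q plot, not the textbook's method.
fit <- lm(dist ~ speed, data = cars)

sw <- shapiro.test(resid(fit))
sw$statistic   # W statistic; values near 1 are consistent with normality
sw$p.value     # p-value for the null hypothesis of normally distributed residuals
```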
Lilja, D. J. (2017, August 8). Linear Regression Using R: An Introduction to Data Modeling, 2nd Edition. University Digital Conservancy Home. https://conservancy.umn.edu/handle/11299/189222