Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)
The data set “cars” is built inside the R and we don’t need to load a special library to get this data.
The data frame consists of two columns (speed and dist) and 50 rows. The summary() function will show a numeric range (between 12 and 25 for speed and between 2 and 120 for distance).
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
glimpse(cars)
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13~
## $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34~
First, we should determine whether or not a linear relationship exists between the predictor and the output value.
It looks like linear relationship between speed and distance, when speed is increased, the distance goes up too.
ggplot(cars, aes(x=speed, y=dist)) +
geom_point (color="blue") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90))+
labs(x = 'Speed, mph', y = "Stopping distance, ft", title = "Stopping Distance vs. Speed")
Next, we will build the linear model and check if speed/distance relationship is actually linear.
We see that the intercept is -17.5791 and the slope is 3.9324.
lm_cars <- lm(cars$dist ~ cars$speed)
lm_cars
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Coefficients:
## (Intercept) cars$speed
## -17.579 3.932
Next, we plot the original data along with the fitted line.
plot(cars, xlab = "Speed, mph", ylab = "Stopping distance, ft",
las = 1, main = "Stopping Distance vs. Speed")
abline(lm_cars)
Using summary() function for our linear model, we will get more information about it.
The Std. Error column shows the statistical standard error for each of the coefficients, it should be at least five to ten times smaller than the corresponding coefficient. For the speed the estimated value is 3.9324, the std. error is 0.4155 which is 10 times smaller 3.9324/0.4155=9.46, this ration is called t-statistic and we also see it in the summary table.
Pr(>|t|) shows the probability of observing a test statistic (t value) as extreme or more extreme as the one observed, assuming there is no linear relationship between the predictor and response variables. This value is tiny, just 1.49e-12. It means that there is strong evidence of a linear relationship between the speed and the stopping distance.
The Multiple R-squared value is 0.6511. It is a statistical measure of how well the model describes the measured data, we are at 65%.
summary(lm_cars)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The last check for our model is residual analysis. The residuals are the differences between the actual measured values and the corresponding values on the fitted regression line.
For a model to be a good fit with the data, the residuals should be normally distributed around a mean of zero, uniformly scattered above and below zero, the median value near zero, Min/Max and 1Q/3Q are around the same magnitude. In the summary() results we see that our model satisfies the requirements for median (-2.72) Min/Max (-29/43) and 1Q/3Q (-9.525/9.215).
To check other requirements, we will build the residuals plot. The residuals looks like normally distributed around the zero, uniformly scattered above and below zero though there is different variance of the outliers at the beginning and the end of the plot.
We should continue checking the residuals.
plot(fitted(lm_cars),resid(lm_cars), xlab = "Fitted", ylab = "Residuals",
main = "Residuals plot")
abline(0,0)
Another test is to use the quantile-versus-quantile plot, Q-Q. The Q-Q plot shows whether the residuals are normally distributed. We see that the right end diverge from the line. But most of the residuals are on the line and follow normal distribution.
qqnorm(resid(lm_cars))
qqline(resid(lm_cars))
The overall results of the analysis.
par(mfrow=c(2,2))
plot(lm_cars)
As a result of our analysis, the linear model shows good results. The t and p values shows that model fits and that there is a strong correlation between the explanatory (speed) and response variable (stopping distance). The residuals normally distributed around the zero and seems to follow q-q plot. Still the model explains only 65% of the actual data, and residual q-q plot shows some divergence. As a result, we should discover factors that cause that and try to find a model that would be a better fit.