library(tidyverse)
Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
lm_speed <- lm(dist~speed, data = cars)
plot(cars)
abline(lm_speed)
ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point() +  # replaces the plot() call above
  geom_abline(slope = lm_speed$coefficients[2],
              intercept = lm_speed$coefficients[1],
              color = "red")  # regression line
The two plots show the same thing; I just prefer ggplot for its customization and clarity. The distance and speed coordinates are plotted with plot() or geom_point(), and then the regression (best-fit) line is drawn from the coefficients stored in lm_speed. There are some clear outliers where the stopping distance is much higher than for other cars at the same speed. This could be due to a third factor; weight would make the most sense.
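To see which observations those are, one quick sketch (using the lm_speed fit above; keeping only the top three rows is an arbitrary choice) is to rank the rows by absolute residual:
cars %>%
  mutate(residual = resid(lm_speed)) %>%   # residuals come back in the same row order as cars
  arrange(desc(abs(residual))) %>%         # largest departures from the line first
  head(3)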
summary(lm_speed)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The summary table offers some key evaluation tools, starting with a statistical breakdown of the residuals. A Min of -29.069 means the point furthest below the regression line sits 29.069 units under it, while the Max (43.201) is the point furthest above it. Moving on to the 1Q and 3Q values: a well-fitting model will have first- and third-quartile residuals of similar magnitude (with opposite signs). That holds for the cars data set, where the first- and third-quartile residuals are close, and the median is fairly close to zero, which is a good sign so far.
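As a quick sanity check (not part of the summary output itself), those residual quantiles can be pulled straight from the residual vector:
summary(resid(lm_speed))                                      # five-number summary plus the mean
quantile(resid(lm_speed), probs = c(0, 0.25, 0.5, 0.75, 1))   # just the five numbers reported above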
Moving on to the Coefficients section: the Estimate column holds the estimated y-intercept and slope. At zero speed the predicted stopping distance is -17.58, which does not make much physical sense, since a stopped car does not roll backwards. The speed estimate is the rate at which stopping distance increases with each unit increase in speed. A well-fitting model generally has a standard error at least 5-10 times smaller than the estimate; the ratio of estimate to standard error is exactly the t value. In this case the intercept's t value is a little worrying, but the slope coefficient (speed) is approaching 10, which is great, meaning the slope estimate is precise relative to its uncertainty. Lastly, the p-value tells us how likely we would be to see a slope this large if there were actually no relationship between speed and distance; by convention, p < 0.05 is considered statistically significant. From my understanding, the significance of the p-value for the intercept is typically not important.
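To make the t value column concrete, it can be recomputed by hand from the stored coefficient table (a small illustrative check, not something the summary requires):
coefs <- coef(summary(lm_speed))              # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs[, "Estimate"] / coefs[, "Std. Error"]   # reproduces the t value column (-2.601 and 9.464 above)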
The final section of the summary assesses the model's fit to the data. The Multiple R-squared shows that 65.1% of the variance in stopping distance is explained by the car's speed. The Adjusted R-squared matters mainly for models with multiple predictors, where it penalizes added variables so you can tell whether they actually provide value or are just overfitting the model.
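Since R-squared is just one minus the ratio of the residual sum of squares to the total sum of squares, it can also be recomputed by hand (a quick check using the objects already defined above):
sse <- sum(resid(lm_speed)^2)                 # residual (unexplained) sum of squares
sst <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares around the mean
1 - sse / sst                                 # should match the reported 0.6511
summary(lm_speed)$r.squared                   # the value R stores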
# I like ggplot much better, so I'll do a slight workaround to get the residual plot
residual_vals <- resid(lm_speed)
fitted_vals <- fitted(lm_speed)
resid_df <- data.frame(Fitted = fitted_vals, Residuals = residual_vals)
ggplot(resid_df, aes(x = Fitted, y = Residuals)) +
  geom_point() +
  labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted Values Plot") +
  geom_smooth(method = "lm", se = FALSE, color = "red")  # for OLS residuals this line sits flat at zero
## `geom_smooth()` using formula = 'y ~ x'
Looking at the residuals (each point's vertical distance from the fitted regression line): they are not evenly scattered around zero. Setting the outliers aside, the points form a central cluster where most of them are grouped rather than spreading out consistently across the fitted values.
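For comparison (not part of my ggplot workaround), base R's plot method for lm objects draws the same diagnostic directly, with a loess smooth instead of a straight reference line:
plot(lm_speed, which = 1)   # base R residuals-vs-fitted diagnostic plot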
I will try one more visualization method for a better view:
ggplot(resid_df, aes(sample = Residuals)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "QQ Plot of Residuals")