Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis from chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).
Housekeeping
Load the R libraries that will be used for this exercise:
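The loading chunk itself is not shown in this write-up. As a minimal sketch, the package list below is an assumption inferred from the functions used later (`ggplot()`, `%>%`, `pivot_longer()`, `autoplot()`, and a `glance()`-style model summary); adjust it to match your own setup.

```r
# Assumed package list; the original chunk is not shown in this write-up.
pkgs <- c("ggplot2", "dplyr", "tidyr", "broom", "ggfortify")

# Find out which of the assumed packages are not installed.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Attach the packages that are available and report the rest.
for (p in setdiff(pkgs, missing_pkgs)) {
  library(p, character.only = TRUE)
}
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```

Checking with `requireNamespace()` first means the script reports missing packages instead of stopping at the first failed `library()` call.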
Exploring the dataset
Learn the basic shape and type of the data:
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 1…
## $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 3…
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
The cars data frame consists of 50 observations (rows) and 2 features, labeled speed and dist.
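The code behind the output above is not shown. A base R sketch that reports the same shape, types, and summary statistics could look like this (the row and column listing above resembles `dplyr::glimpse()`, but that is an assumption; `str()` gives comparable information):

```r
# `cars` ships with base R (package datasets), so nothing needs to be imported.
cars_dim <- dim(cars)   # number of rows and columns
cars_dim

str(cars)               # column names, types, and the first few values
summary(cars)           # min, quartiles, median, mean, and max per column
```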
Visualizations of the cars dataset
Visually inspect a scatterplot of the data points, with distance on the y-axis and speed on the x-axis:
#basic scatter plot
ggplot( cars, aes( x = speed, y = dist ) ) +
  geom_point() +
  ggtitle( 'Scatter plot of `cars` data' ) +
  ylab( 'distance' )
There is a positive trend: distance tends to increase with speed. The relationship appears steady across the range of speeds, so the data may be suitable for a linear model.
Next, visualize the features of the data as box plots and look for symmetry of the distributions or for outliers:
#basic box plot
#reshape to long format: one row per (feature, value) pair
cars_long <- cars %>%
  pivot_longer( everything(), names_to = "feature", values_to = "value" )
ggplot( cars_long, aes( x = feature, y = value ) ) +
  geom_boxplot( color = "red", fill = "orange", alpha = 0.2 ) +
  ggtitle( 'Box plot of `cars` data' )
There appears to be only one outlier, in the distance data. The solid red line within each box marks the median; because each line sits approximately in the center of its box, the distributions are roughly symmetric, which is consistent with (though not proof of) an approximately normal shape. We can take another look at the distributions with a density plot:
#plot the density of features
#after_stat(density) replaces the deprecated ..density.. notation
ggplot( cars ) +
  # Top
  geom_density( aes( x = speed, y = after_stat(density) ), fill = "#69b3a2" ) +
  annotate( "label", x = 4.5, y = 0.075, label = "speed", color = "#69b3a2" ) +
  # Bottom
  geom_density( aes( x = dist, y = -after_stat(density) ), fill = "#404080" ) +
  annotate( "label", x = 4.5, y = -0.075, label = "distance", color = "#404080" ) +
  xlab( "x" ) +
  ggtitle( 'Density of cars features' )
The figure above mirrors the density plots of the two features. There are no obvious signs of multiple modes, and both profiles are roughly bell-shaped, although distance has a longer right tail (its mean of 42.98 exceeds its median of 36).
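The single outlier noted in the box plot can also be identified numerically. The sketch below uses base R's `boxplot.stats()`, which applies the usual 1.5 × IQR whisker convention; ggplot2 computes its hinges slightly differently, so treat this as a cross-check rather than the exact rule behind the figure.

```r
# Values lying beyond the whiskers (more than 1.5 * IQR outside the hinges).
dist_outliers  <- boxplot.stats(cars$dist)$out
speed_outliers <- boxplot.stats(cars$speed)$out

dist_outliers    # stopping distances flagged as outliers
speed_outliers   # speeds flagged as outliers

# Rows of the data frame that hold the flagged distances.
cars[cars$dist %in% dist_outliers, ]
```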
Next, calculate the correlation between the two features:
## [1] 0.8068949
There is a strong positive correlation between speed and distance. This suggests that much of the variance in the distance variable can be explained as a function of the speed variable. These preliminary inspections of the data suggest that a linear model might be appropriate here.
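The call that produced the correlation is not shown; a minimal sketch using base R's `cor()`, which computes the Pearson correlation by default, would be:

```r
# Pearson correlation between speed and stopping distance.
r <- cor(cars$speed, cars$dist)
r

# For a simple linear regression, the squared correlation equals R-squared,
# the share of variance in dist accounted for by speed.
r_squared <- r^2
r_squared
```

Squaring the correlation gives a direct preview of the R-squared value that the regression summary reports.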
Linear Regression
Now fit the linear model to the data:
#fit a linear regression to the cars data to model distance by speed
#(the formula + data form keeps clean coefficient names and allows predict() on new data)
cars_lm <- lm( dist ~ speed, data = cars )
#display a summary of the model
summary( cars_lm )
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
The summary of the linear model tells us that the cars data can be described by the linear equation: \[\text{dist} = -17.58 + 3.932 \times \text{speed}\]
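To make the fitted equation concrete, the sketch below predicts stopping distance at two speeds and reproduces the same numbers by hand from the coefficients. The speeds 10 and 20 are hypothetical values chosen for illustration, and the model is refit here with the `data =` form of `lm()` so that `predict()` accepts new data.

```r
# Refit with a formula plus data argument so predict() can take new speeds.
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds chosen for illustration.
new_speeds <- data.frame(speed = c(10, 20))
predicted <- predict(fit, newdata = new_speeds)
predicted

# The same predictions computed directly from intercept + slope * speed.
b <- coef(fit)
manual <- b[["(Intercept)"]] + b[["speed"]] * new_speeds$speed
manual
```

With the rounded coefficients from the summary, a speed of 20 gives roughly -17.58 + 3.932 × 20 ≈ 61.1.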
Here, the scatterplot of the data is shown together with the fitted line from the linear model.
cars_predict <- predict( cars_lm )
ggplot( cars, aes( y = dist, x = speed ) ) +
  geom_point() +
  geom_line( aes( y = cars_predict ) ) +
  ggtitle( 'Linear Model' ) +
  ylab( 'distance' )
Regression Diagnostics
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.651 0.644 15.4 89.6 1.49e-12 1 -207. 419. 425.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Statistical Significance: From the summary statistics for the linear regression, the p-value of the fit is 1.49e-12, which is very small (well under the relatively strict criterion of 0.01). Therefore, we can safely reject the null hypothesis that there is no linear relationship between the two variables.
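Very small p-values are easy to misread as 0 once they are rounded for display. As a sketch in base R (the model is refit here so the block is self-contained), the exact values can be pulled out of the summary object:

```r
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# p-value of the t-test on the slope; format() keeps the scientific notation.
p_slope <- s$coefficients["speed", "Pr(>|t|)"]
format(p_slope, digits = 3)

# p-value of the overall F-test, computed from the stored F statistic.
f <- s$fstatistic
p_model <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
format(p_model, digits = 3)

# Proportion of variance in dist explained by the model.
s$r.squared
```

With a single predictor the slope's t-test and the overall F-test are equivalent, so the two p-values agree.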
Test Linear Regression Assumptions: The function autoplot() generates a series of visualizations that help us test the assumptions we made about the data when performing the linear regression:
- detecting non-linearity with the Residuals vs Fitted plot: the residuals are scattered reasonably symmetrically around zero, suggesting that a linear form is adequate and that the residuals do not change systematically as a function of the fitted value.
- testing normality of the residuals with the Normal Q-Q plot: when the quantiles of the standardized residuals are plotted against the theoretical normal quantiles, the points follow a line for all but the most extreme values. This supports the assumption that the residuals are approximately normally distributed.
- testing for constant variance with the Scale-Location plot: similar to the residuals plot, but it shows the square root of the absolute standardized residuals. The general magnitude of the scaled residuals doesn’t change much, so we can be reasonably assured that the residual variance is relatively even across the fitted values.
- looking for outlier influence with the Residuals vs Leverage plot: the trend in this plot is relatively flat, indicating that there aren’t any obvious outliers in the data with too strong an influence over the linear regression fit.
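The four visual checks above can be complemented with numbers. The following is a minimal sketch using base R diagnostics rather than the plotting call used in this report; the choice of a Shapiro-Wilk test and of showing the three largest Cook's distances is an assumption made for illustration.

```r
fit <- lm(dist ~ speed, data = cars)

std_res <- rstandard(fit)        # standardized residuals (Q-Q and Scale-Location panels)
lev     <- hatvalues(fit)        # leverage of each observation
cook    <- cooks.distance(fit)   # influence of each observation on the fit

# Formal normality test on the residuals, to accompany the Q-Q plot.
shapiro.test(residuals(fit))

# Observations with the largest influence on the fitted line.
head(sort(cook, decreasing = TRUE), 3)

# Observations with unusually large standardized residuals.
which(abs(std_res) > 2)
```

A common rule of thumb treats standardized residuals beyond about ±2 as worth a second look; it is a screening heuristic, not a formal test.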
Conclusion
Taken together, these results indicate that the cars dataset is a good candidate for linear regression, which yields a statistically significant fit described by the equation: \[\text{dist} = -17.58 + 3.932 \times \text{speed}\]
Thank you for reading.