Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
#Displaying the cars dataset contents:
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
#Showing summary statistics with the summary() function:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
==> First, visualize the cars data with a scatter plot. The x-axis is speed and the y-axis is stopping distance:
plot(cars[,"speed"], cars[,"dist"], main='CARS DATASET', xlab='Speed', ylab='Distance')
The plot shows that as car speed increases, the stopping distance also increases, as expected. A regression model will help us quantify this relationship.
==> Next, the linear model function:
cars_lm <- lm(cars$dist ~ cars$speed)
cars_lm
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Coefficients:
## (Intercept) cars$speed
## -17.579 3.932
plot(cars[,"speed"], cars[,"dist"], main='CARS DATASET with Linear Model Function', xlab='Speed', ylab='Distance')
abline(cars_lm, col="blue")
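As a rough sketch of how the fitted line can be used: the coefficients above give a predicted stopping distance of about -17.579 + 3.932 × speed. The snippet below refits the same model with the `data =` argument, which is an assumption made here so that `predict()` can accept new speeds (the original call uses `cars$dist ~ cars$speed`, which does not work well with `newdata`). The name `cars_lm2` and the example speeds are hypothetical and not part of the original analysis.

```r
# Refit the same model using the data = argument so predict() can take newdata.
# cars_lm2 is an illustrative name, not part of the original analysis.
cars_lm2 <- lm(dist ~ speed, data = cars)

# Example speeds chosen only for illustration (inside the observed range of 4 to 25).
new_speeds <- data.frame(speed = c(10, 15, 20))
predicted <- predict(cars_lm2, newdata = new_speeds)

# The same predictions computed by hand as intercept + slope * speed.
by_hand <- coef(cars_lm2)[["(Intercept)"]] +
  coef(cars_lm2)[["speed"]] * new_speeds$speed

print(predicted)
print(by_hand)
```

Both vectors should agree, which confirms that the blue line drawn by `abline()` is simply the intercept plus the slope times speed. Predictions outside the observed speed range would be extrapolation and should be treated with caution.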
==> Next, quality evaluation of the model:
summary(cars_lm)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
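The Multiple R-squared of 0.6511 and the Adjusted R-squared of 0.6438 tell us that the model explains about 65% of the variation in stopping distance. The speed coefficient is highly significant (p-value of 1.49e-12), so the relationship with speed is clear, but roughly 35% of the variation remains unexplained and the residual standard error is 15.38. The model is therefore a good fit, though not an excellent one, for the data provided.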
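To make the R-squared figures less of a black box, here is a minimal sketch that recomputes them from the residuals, assuming the usual definitions (R-squared as 1 minus the ratio of residual to total sum of squares, and the standard adjustment for one predictor). The variable names are illustrative only.

```r
# Same model as in the analysis above.
cars_lm <- lm(cars$dist ~ cars$speed)

# Residual sum of squares and total sum of squares around the mean distance.
ss_res <- sum(resid(cars_lm)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared: share of the variation in dist explained by the model.
r_squared <- 1 - ss_res / ss_tot

# Adjusted R-squared with n observations and p = 1 predictor.
n <- nrow(cars)
p <- 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# For a single predictor, R-squared equals the squared correlation.
r_from_cor <- cor(cars$speed, cars$dist)^2

print(c(r_squared = r_squared,
        adj_r_squared = adj_r_squared,
        r_from_cor = r_from_cor))
```

The first two values should match the ones reported by `summary(cars_lm)`, and the third shows that in simple linear regression R-squared is just the squared correlation between speed and distance.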
==> Next, residual analysis:
hist(cars_lm$residuals)
qqnorm(resid(cars_lm))
qqline(resid(cars_lm))
The histogram suggests that the residuals are approximately normally distributed.
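In the quantile-quantile (Q-Q) plot, most of the sample points lie close to the theoretical line drawn by qqline, which indicates that the residuals (not the raw observations) are approximately normally distributed. The points diverge from the line at the higher positive quantiles, so the largest positive residuals are bigger than a normal distribution would predict; the maximum residual is 43.201. Note that the Q-Q plot ranks residuals, not speeds, so it cannot tell us by itself at which speeds the model does well. The summary above puts the 75th percentile of speed at 19, and it is plausible that the model predicts stopping distance better below that speed, but that would need to be checked against a plot of residuals versus speed or fitted values.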
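One way to follow up on this, sketched here under the assumption that a residuals-versus-fitted plot is an acceptable complement to the Q-Q plot, is to plot the residuals against the fitted values and compare their spread below and above the third quartile of speed. The names `res`, `q3`, and `spread` are illustrative.

```r
# Same model as in the analysis above.
cars_lm <- lm(cars$dist ~ cars$speed)
res <- resid(cars_lm)

# Residuals against fitted values; a good fit scatters evenly around zero.
plot(fitted(cars_lm), res, main='Residuals vs Fitted',
     xlab='Fitted distance', ylab='Residual')
abline(h=0, col="blue", lty=2)

# Third quartile of speed (19 according to the summary output above).
q3 <- quantile(cars$speed, 0.75)

# Standard deviation of residuals for speeds up to q3 (FALSE) and above q3 (TRUE).
spread <- tapply(res, cars$speed > q3, sd)
print(spread)
```

If the residuals fan out as the fitted values grow, or if the `TRUE` group shows a clearly larger spread, that would support the idea that the model is less reliable at higher speeds.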