Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

#data()
# Loading
data(cars)
#?cars
# Check dimensions
dim(cars)
## [1] 50  2
# Inspect column names
names(cars)
## [1] "speed" "dist"
# Print the first 5 rows
#head(cars, 5)
# Plot the 2 variables in the data to visually determine if there is a relationship
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", main = "cars data", las = 1)
lines(lowess(cars$speed, cars$dist, f = 2/3, iter = 3), col = "red")

From this plot, it can be inferred that there is a positive, linear relationship between speed (predictor variable) and stopping distance (response variable).
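
As a quick numeric check of this visual impression (a small extra step beyond the textbook's analysis), the sample correlation between the two variables can be computed; a value close to +1 supports a strong positive linear association.

# Sample correlation between speed and stopping distance
cor(cars$speed, cars$dist)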

# Fit the linear model
dist.speed <- lm(cars$dist ~ cars$speed)
dist.speed
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Coefficients:
## (Intercept)   cars$speed  
##     -17.579        3.932

The intercept is -17.579 and the slope is 3.932. The slope implies that, according to this model, each additional 1 mph of speed adds about 3.9 feet of stopping distance.
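
A minimal sketch of turning these coefficients into a prediction (20 mph is just an illustrative value, not part of the original analysis):

# Predicted stopping distance at 20 mph: intercept + slope * speed
coef(dist.speed)[1] + coef(dist.speed)[2] * 20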

# Display the model's summary
summary(dist.speed)
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The standard error of the slope is quite low compared to its estimated value, implying low variability in the estimate. Its p-value is tiny, which indicates that speed is highly significant in explaining stopping distance. The standard error of the intercept is not as low relative to its estimate, and the corresponding p-value is higher as well. The intercept also has little practical meaning here: it is negative, and a stopping distance for a car at speed 0 (i.e. a stationary car) is not physically interpretable.
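
As a small supplement to the summary output (an addition, not part of the original write-up), confidence intervals for both coefficients make the same point: the slope's interval is narrow and well away from zero, while the intercept's is comparatively wide.

# 95% confidence intervals for the intercept and slope
confint(dist.speed, level = 0.95)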

The number of degrees of freedom is 48 because the data has 50 observations and 2 parameters (the intercept and the slope) are being estimated.

The R-squared value is about 0.65, meaning that speed explains roughly 65% of the variability in stopping distance.
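
Both quantities can also be pulled out of the fitted model programmatically (a minor convenience, assuming the dist.speed object above):

# Residual degrees of freedom: 50 observations minus 2 estimated parameters
df.residual(dist.speed)
# R-squared and adjusted R-squared, as reported in the summary
summary(dist.speed)$r.squared
summary(dist.speed)$adj.r.squared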

# Plot the regression line over the scatter plot
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", main = "cars data")
abline(dist.speed, col='blue')

The above plot shows the fitted regression line along with the individual data points. The largest deviations lie above the line, i.e. the model under-predicts by a larger margin than it over-predicts.
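
One way to confirm this impression numerically (not a step from the textbook) is to look at the most extreme residuals on each side:

# Three largest positive residuals (worst under-predictions) ...
head(sort(resid(dist.speed), decreasing = TRUE), 3)
# ... versus the three largest negative residuals (worst over-predictions)
head(sort(resid(dist.speed)), 3)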

# Plot the linear model's fitted values versus the residuals
plot(fitted(dist.speed), resid(dist.speed), xlab = "Fitted values", ylab = "Residuals")

The above plot shows that the spread of the residuals increases as the fitted values increase, i.e. the residual variance is not constant (heteroscedasticity).
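
R's built-in scale-location diagnostic shows the same thing more directly; this is an optional extra rather than part of the replicated steps.

# Scale-location plot: sqrt(|standardized residuals|) vs. fitted values;
# an upward trend indicates non-constant residual variance
plot(dist.speed, which = 3)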

# plot a histogram of linear model residuals
hist(dist.speed$residuals)

The histogram above suggests that the residuals are not normally distributed: the median is below 0 (about -2.3, as seen in the summary) and the distribution is positively skewed.
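
A formal test can supplement the visual impression (an optional addition using base R's Shapiro-Wilk test):

# Shapiro-Wilk test of normality on the residuals;
# a small p-value is evidence against normally distributed residuals
shapiro.test(resid(dist.speed))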

# Check if there is any pattern in residuals to indicate lack of independence
plot(dist.speed$residuals) 

The above plot shows no obvious trend or pattern across the observation index, so there is no strong indication of dependence among the residuals; the few large positive outliers noted earlier are visible again.
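
To probe independence a little further (an optional addition; the observations are not a time series, so this is only a rough check), the sample autocorrelation of the residuals in observation order can be plotted:

# Autocorrelation of the residuals in observation order;
# spikes outside the dashed confidence bounds would hint at dependence
acf(resid(dist.speed), main = "ACF of residuals")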

# Plot quantile-quantile plot to check for normality in residuals
qqnorm(resid(dist.speed))
qqline(resid(dist.speed))

The Q-Q plot shows the residuals deviating from normality, which suggests that there are attributes besides speed that are relevant for explaining the stopping distance.
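
For completeness, the standard diagnostic plots used above (residuals vs. fitted, Q-Q, scale-location, plus residuals vs. leverage) can also be produced in a single call; this is just a convenience wrapper around the same checks.

# Standard 2x2 panel of lm diagnostic plots
op <- par(mfrow = c(2, 2))
plot(dist.speed)
par(op)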