Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

data(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Visualize the Data

# dependent variable - stopping distance = dist
# independent variable - speed = speed

plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping distance")

The Linear Model Function

Stopping Distance = m×Speed + b

The y-intercept (b) is -17.579.

The slope (m) is 3.932.

The final regression model is:

Stopping Distance = -17.579 + 3.932×Speed

car_lm <- lm(dist ~ speed, data=cars)
car_lm
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
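As a small illustration of how the fitted equation is used, the sketch below computes a predicted stopping distance by hand from the coefficients and compares it with `predict()`. The speed value of 15 and the variable names `speed_new`, `manual`, and `via_predict` are illustrative choices, not part of the original analysis.

```r
# Fit the same model as in the text (cars is a built-in data set).
car_lm <- lm(dist ~ speed, data = cars)

# Named vector holding the intercept and the slope.
b <- coef(car_lm)

# Manual prediction at an illustrative speed of 15 (hypothetical value).
speed_new <- 15
manual <- unname(b["(Intercept)"] + b["speed"] * speed_new)

# The same prediction obtained through predict().
via_predict <- unname(predict(car_lm, newdata = data.frame(speed = speed_new)))

print(c(manual = manual, via_predict = via_predict))
```

Both numbers should agree, since `predict()` evaluates the same linear equation.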

Plot with abline:

plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping distance")
abline(car_lm)

Evaluating the Quality of the Model

summary(car_lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

A good model would tend to have residuals with a median value near zero, minimum and maximum values of roughly the same magnitude, and first and third quartile values of roughly the same magnitude.

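The median residual is -2.272, which is close to zero relative to the residual standard error of 15.38. The first and third quartiles (-9.525 and 9.215) are similar in magnitude, but the maximum (43.201) is noticeably larger in magnitude than the minimum (-29.069).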
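The residual summary can also be pulled out directly for comparison. This is a minimal sketch; the names `res`, `res_quartiles`, `tail_ratio`, and `quartile_ratio` are illustrative and do not appear in the original analysis.

```r
# Refit the model so the block is self-contained.
car_lm <- lm(dist ~ speed, data = cars)

# Residuals and their five-number summary (min, quartiles, max).
res <- resid(car_lm)
res_quartiles <- quantile(res)
print(round(res_quartiles, 3))

# Ratios near 1 indicate roughly symmetric tails and quartiles.
tail_ratio <- max(res) / abs(min(res))
quartile_ratio <- res_quartiles[["75%"]] / abs(res_quartiles[["25%"]])
print(c(tail_ratio = tail_ratio, quartile_ratio = quartile_ratio))
```

A tail ratio well above 1 together with a quartile ratio near 1 would suggest that the asymmetry sits mainly in the extreme residuals.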

The standard error for speed (0.4155) is 9.464 times smaller than the coefficient value (3.9324); this ratio is the t value reported in the summary.
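The ratio of the estimate to its standard error can be checked against the reported t value. A short sketch, with `coef_table`, `ratio`, and `t_value` as illustrative names:

```r
# Refit the model so the block is self-contained.
car_lm <- lm(dist ~ speed, data = cars)

# Coefficient table from the model summary (a numeric matrix).
coef_table <- coef(summary(car_lm))

# Estimate divided by its standard error for the speed coefficient.
ratio <- coef_table["speed", "Estimate"] / coef_table["speed", "Std. Error"]

# The t value that summary() reports for the same coefficient.
t_value <- coef_table["speed", "t value"]

print(c(ratio = ratio, t_value = t_value))
```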

The p-value for the slope estimate for speed is 1.49e-12, a tiny value. This means that the probability of observing a t value of 9.464 or more extreme (in absolute value), assuming there is no linear relationship between speed and stopping distance, is about 1.49e-12.

Since this value is so small, we can say that there is strong evidence of a linear relationship between speed and stopping distance.
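To make the link between the t value and the p-value concrete, the two-sided p-value can be recomputed from the t distribution with the model's residual degrees of freedom. This is a sketch; `p_manual` and `p_reported` are illustrative names.

```r
# Refit the model so the block is self-contained.
car_lm <- lm(dist ~ speed, data = cars)
coef_table <- coef(summary(car_lm))

# t statistic for the slope and the residual degrees of freedom.
t_value <- coef_table["speed", "t value"]
df <- df.residual(car_lm)

# Two-sided p-value: probability of |t| at least this large under the null.
p_manual <- 2 * pt(-abs(t_value), df)

# p-value as reported in the summary table.
p_reported <- coef_table["speed", "Pr(>|t|)"]

print(c(p_manual = p_manual, p_reported = p_reported))
```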

The cars data frame has 50 rows, corresponding to 50 measurements. We used this data to produce a regression model with two coefficients: the slope and the intercept. Thus, we are left with 50 - 2 = 48 degrees of freedom.
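The degrees-of-freedom arithmetic can be confirmed from the fitted object. A minimal sketch, with `n_obs` and `n_coef` as illustrative names:

```r
# Refit the model so the block is self-contained.
car_lm <- lm(dist ~ speed, data = cars)

# Number of observations and number of estimated coefficients.
n_obs <- nrow(cars)
n_coef <- length(coef(car_lm))

# Residual degrees of freedom: by hand and as stored in the model.
df_manual <- n_obs - n_coef
df_model <- df.residual(car_lm)

print(c(n_obs = n_obs, n_coef = n_coef, df_manual = df_manual, df_model = df_model))
```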

The reported multiple R-squared of 0.6511 for this model means that 65.11% of the variability in stopping distance is explained by the variation in speed.
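R-squared can be reproduced from its definition as one minus the ratio of the residual sum of squares to the total sum of squares. The names `ss_res`, `ss_tot`, `r2_manual`, and `r2_reported` below are illustrative.

```r
# Refit the model so the block is self-contained.
car_lm <- lm(dist ~ speed, data = cars)

# Residual sum of squares: variation left unexplained by the model.
ss_res <- sum(resid(car_lm)^2)

# Total sum of squares: variation of dist around its mean.
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared from the definition, and as reported by summary().
r2_manual <- 1 - ss_res / ss_tot
r2_reported <- summary(car_lm)$r.squared

print(c(r2_manual = r2_manual, r2_reported = r2_reported))
```

For a simple linear regression with one predictor, this value also equals the squared correlation between speed and dist.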

Residual Analysis

The residuals seem roughly evenly scattered above and below zero, with slightly more points falling below zero.

The Q-Q plot provides a nice visual indication of whether the residuals from the model are normally distributed.

Overall, the Q-Q plot follows a straight line, but the right end diverges from the line. This suggests the distribution’s right tail is “heavier” than what we would expect from a normal distribution. This pattern is indicative of a right-skewed distribution.

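plot(fitted(car_lm), resid(car_lm), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # dashed reference line at zero residual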

qqnorm(resid(car_lm))
qqline(resid(car_lm))

par(mfrow=c(2,2))
plot(car_lm)
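As a rough numeric companion to the Q-Q plot, the sample skewness of the residuals can be computed in base R. This is only a sketch: it uses the simple moment-based definition of skewness (one of several in use), and `res_skew` and `n_below` are illustrative names that are not part of the original analysis.

```r
# Refit the model so the block is self-contained.
car_lm <- lm(dist ~ speed, data = cars)
res <- resid(car_lm)

# Moment-based sample skewness: third central moment over variance^(3/2).
centered <- res - mean(res)
res_skew <- mean(centered^3) / mean(centered^2)^1.5

# Count of residuals below zero, to compare with the scatter plot.
n_below <- sum(res < 0)

print(c(res_skew = res_skew, n_below = n_below))
```

A positive skewness value is consistent with a heavier right tail, and a count above 25 is consistent with slightly more residuals falling below zero.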