Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook Chapter 3 (visualization, quality evaluation of the model, and residual analysis).

## Visualize the Data

data(cars)
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

The dependent variable, stopping distance, is plotted against speed, and a roughly linear positive trend is seen: as speed increases, the stopping distance also increases. Another feature to note is that the spread of the points seems to increase slightly as speed increases. Next, the degree of linearity will be assessed by fitting a regression model.

plot(cars$speed, cars$dist, main = 'Stopping Distance v. Speed', xlab = 'Speed', ylab = 'Stopping Distance')
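
As a quick supplementary check of the positive trend (an addition, not part of the textbook walkthrough), the Pearson correlation between the two variables can be computed; it should be roughly the square root of the R-squared reported later, about 0.81.

cor(cars$speed, cars$dist)  # Pearson correlation; ~0.81 suggests a fairly strong positive linear association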

## Evaluating the Quality of the Model

The linear model is generated from the cars data frame, and the coefficients are calculated using the method of least squares. This method finds the line that most closely fits the measured data by minimizing the sum of the squared vertical distances (residuals) between the line and the individual points. The linear model is assigned to lm_cars, which shows a y-intercept of -17.579 and a speed coefficient of 3.932. The resulting linear equation is shown below.

lm_cars <- lm(dist ~ speed, data = cars)
lm_cars
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

\[dist = 3.932 \cdot speed - 17.579\]
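
As a small illustration (an addition, not part of the original analysis), the fitted equation can be used to predict the stopping distance at a hypothetical speed of 21, either by hand or with predict() on the model object:

3.932 * 21 - 17.579                                 # by the equation above, roughly 65.0
predict(lm_cars, newdata = data.frame(speed = 21))  # same prediction from the fitted model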

The following code plots the original data along with the fitted line from lm_cars. The line generally seems to fit the trend of points well.

plot(dist ~ speed, data = cars, main = 'Stopping Distance v. Speed', xlab = 'Speed', ylab = 'Stopping Distance')
abline(lm_cars)

When calling the summary function on lm_cars, the residual statistics are reported. A good model will have residuals that are roughly normally distributed about a median of 0. The median in this case is -2.272, which is close to 0 relative to the range of residuals. The first and third quartiles are very nearly the same magnitude. The max and min differ, but overall these statistics follow what we would expect to see from a Gaussian distribution.
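
For reference, the residual quartiles discussed here can also be pulled straight from the fitted model (a supplementary check, not in the original write-up):

quantile(resid(lm_cars))   # min, 1Q, median, 3Q, max of the residuals
median(resid(lm_cars))     # should be close to 0 for a well-behaved fit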

Next we can look at the estimated coefficient values. We see that the standard error for the coefficient of speed is 3.9324/0.4155 = 9.46 times smaller than the coefficient itself. This ratio is the test statistic (the t value). For a good model we typically like to see the standard error at least 5 to 10 times smaller than the coefficient, so this is a reasonable ratio. The larger the ratio, the smaller the variability in the slope estimate. The y-intercept has a test statistic of 17.5791/6.7584 = 2.60, but this is not something you typically worry about for the y-intercept.
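
The ratios discussed above can be reproduced directly from the model's coefficient table (a supplementary sketch, not part of the original analysis):

coefs <- coef(summary(lm_cars))              # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs[, "Estimate"] / coefs[, "Std. Error"]  # reproduces the t values -2.601 and 9.464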

The probability of observing a test statistic of 9.464, assuming there is no relationship between speed and stopping distance, is Pr(>|t|) = 1.49e-12. This p-value is so small that there is strong statistical evidence of a linear relationship between speed and stopping distance. The p-value of the intercept is 0.0123, which means the probability of observing a t value of -2.601, assuming the true intercept is 0, is about 1.2%. This is not as small as the speed p-value, so there may be slightly more variability in the estimate of the y-intercept.
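
If desired, the p-value for speed can be recomputed from the t distribution using the model's residual degrees of freedom (a supplementary check, assuming the standard two-sided test):

t_speed <- coef(summary(lm_cars))["speed", "t value"]
2 * pt(abs(t_speed), df = df.residual(lm_cars), lower.tail = FALSE)  # two-sided p-value, ~1.49e-12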

Additionally, the residual standard error is 15.38, which is a measure of the total variation in the residual values. This model has 48 degrees of freedom because there are 50 rows in the cars data frame minus the two coefficients estimated to build the model. The multiple R-squared is 0.6511, meaning about 65% of the variability in stopping distance can be explained by the variation in speed.

summary(lm_cars)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
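
As a supplementary check (not part of the original analysis), the residual standard error and R-squared reported above can be recomputed by hand from the residuals:

n   <- nrow(cars)
rss <- sum(resid(lm_cars)^2)                  # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
sqrt(rss / (n - 2))                           # residual standard error, ~15.38 on n - 2 = 48 degrees of freedom
1 - rss / tss                                 # multiple R-squared, ~0.6511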

## Residual Analysis

The residual plot below shows that the residuals are generally distributed evenly about 0, with a slight increase in scatter and variation as the plot moves to the right. This is not a pronounced pattern, so using speed to explain the data may still be appropriate.

plot(fitted(lm_cars), resid(lm_cars))

The Q-Q plot is shown next, and it indicates that the residuals are generally normally distributed about 0. There is some deviation from the straight line at the higher values, which means the data departs from a normal distribution in the right tail. The plot suggests the residual distribution is slightly right-skewed, since the right tail is heavier than the left; this matches the pattern seen in the residual plot above. Even so, the points generally follow a straight line and there are no distinct or pronounced patterns deviating from it, so it is reasonable to say that speed is sufficient to predict stopping distance.

qqnorm(resid(lm_cars))
qqline(resid(lm_cars))

The diagnostic plots can be seen below. Viewing these plots together, you can see that the right side of the residual data is slightly more scattered, but overall there are no obvious outliers and the fit appears to satisfy the assumptions of a linear model. It can be concluded that speed can sufficiently explain the variation in the data.

par(mfrow=c(2,2))
plot(lm_cars)