Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
data(cars)
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
The dependent variable, stopping distance, is plotted against speed,
and a roughly linear positive trend is visible: as speed increases, the
stopping distance also increases. One other feature to note is that the
spread of the points seems to increase slightly as speed increases.
Next, the degree of linearity will be assessed by developing a
regression model.
plot(cars$speed, cars$dist, main = 'Stopping Distance v. Speed', xlab = 'Speed', ylab = 'Stopping Distance')
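As a quick numeric complement to the scatterplot, the strength of the linear association can be summarized with the Pearson correlation coefficient. This is a minimal sketch that is not part of the original analysis; the name r_speed_dist is introduced here purely for illustration.

```r
# Pearson correlation between speed and stopping distance.
# r_speed_dist is an illustrative name, not from the original analysis.
r_speed_dist <- cor(cars$speed, cars$dist)
r_speed_dist

# In simple linear regression with one predictor, the squared correlation
# equals the multiple R-squared of the least-squares fit.
r_speed_dist^2
```

A value close to 1 would support the positive linear trend seen in the plot.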
## Evaluating the Quality of the Model
The linear model is fit to the cars data frame, and the coefficients
are estimated by the method of least squares. This method finds the
line that most closely fits the measured data by minimizing the sum of
the squared vertical distances (residuals) between the line and the
individual points. The linear model is assigned to lm_cars, which shows
a y-intercept of -17.579 and a speed coefficient of 3.932. The
resulting linear equation is shown below.
lm_cars <- lm(dist ~ speed, data = cars)
lm_cars
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
\[\widehat{\text{dist}} = 3.932 \times \text{speed} - 17.579\]
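To make the least-squares step concrete, the two coefficients can also be computed directly from the closed-form formulas for simple linear regression. The following is a sketch under the assumption that only one predictor is used; the names x, y, slope_hat, and intercept_hat are illustrative and do not appear in the original analysis.

```r
# Closed-form least-squares estimates for a single predictor.
# x, y, slope_hat and intercept_hat are illustrative names.
x <- cars$speed
y <- cars$dist

# Slope: sum of cross-deviations divided by sum of squared deviations of x
slope_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the fitted line passes through the point (mean(x), mean(y))
intercept_hat <- mean(y) - slope_hat * mean(x)

c(intercept = intercept_hat, slope = slope_hat)

# These should agree with the coefficients returned by lm()
coef(lm(dist ~ speed, data = cars))
```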
The following code plots the original data along with the fitted line
from lm_cars. The line generally seems to fit the trend of
points well.
plot(dist ~ speed, data = cars, main = 'Stopping Distance v. Speed', xlab = 'Speed', ylab = 'Stopping Distance')
abline(lm_cars)
Calling the summary function on lm_cars reports the residual
statistics. A good model will have residuals that are roughly normally
distributed about a median of 0. The median in this case is -2.272,
which is close to 0 relative to the range of the residuals. The first
and third quartiles (-9.525 and 9.215) are nearly equal in magnitude.
The minimum and maximum (-29.069 and 43.201) differ more, but overall
these statistics are close to what we would expect to see from a
Gaussian distribution.
Next we can look at the estimated coefficient values. The standard error for the coefficient of speed is 3.9324/0.4155 = 9.46 times smaller than the coefficient itself. This ratio of the estimate to its standard error is the test statistic (t value). For a good model we typically like to see a standard error at least five to ten times smaller than the coefficient, so this is a reasonable ratio. The larger the ratio, the smaller the variability in the slope estimate relative to its size. The y-intercept has a test statistic of 17.5791/6.7584 = 2.60 in absolute value, but a small ratio is not something we typically worry about for the y-intercept.
The probability of observing a test statistic at least as extreme as 9.464, assuming there is no relationship between speed and stopping distance, is Pr(>|t|) = 1.49e-12. This p-value is so small that there is strong statistical evidence of a linear relationship between speed and stopping distance. The p-value of the intercept is 0.0123, which means that the probability of observing a t value at least as extreme as -2.601, assuming the true intercept is 0, is about 1.2%. This is not as small as the p-value for speed, which reflects the greater relative variability in the estimate of the y-intercept.
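The t values and p-values discussed above can be reproduced from the estimates and standard errors alone. This is a minimal sketch; coef_table, t_vals, and p_vals are illustrative names, and the model is refit inside the block only so that it runs on its own.

```r
# Refit the model so this block is self-contained
lm_cars <- lm(dist ~ speed, data = cars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(lm_cars))

# t value = estimate divided by its standard error
t_vals <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_vals <- 2 * pt(abs(t_vals), df = df.residual(lm_cars), lower.tail = FALSE)

cbind(t_value = t_vals, p_value = p_vals)
```

The printed values should match the "t value" and "Pr(>|t|)" columns of the model summary.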
Additionally, the residual standard error is 15.38, which is a measure of the total variation in the residual values. The model has 48 residual degrees of freedom because there are 50 rows in the cars data frame and two coefficients were estimated. The multiple R-squared is 0.6511, meaning about 65% of the variability in stopping distance can be explained by speed; the adjusted R-squared is slightly lower at 0.6438.
summary(lm_cars)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
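The residual standard error and the two R-squared values in the summary can be reproduced from the residuals directly, which may help clarify what each number measures. This is a sketch with illustrative names (rss, tss, n, p, rse, r_squared, adj_r_squared); the model is refit only to keep the block self-contained.

```r
# Refit the model so this block is self-contained
lm_cars <- lm(dist ~ speed, data = cars)

rss <- sum(resid(lm_cars)^2)                 # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
n   <- nrow(cars)                            # number of observations
p   <- length(coef(lm_cars))                 # number of estimated coefficients

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rse <- sqrt(rss / (n - p))

# Multiple R-squared: share of total variation explained by the model
r_squared <- 1 - rss / tss

# Adjusted R-squared: penalizes for the number of coefficients
adj_r_squared <- 1 - (rss / (n - p)) / (tss / (n - 1))

c(rse = rse, r_squared = r_squared, adj_r_squared = adj_r_squared)
```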
The residual plot below shows that the residuals are distributed roughly evenly about 0, with a slight increase in scatter as the plot moves to the right. This is not a pronounced pattern, however, so using speed to explain the data may be appropriate.
plot(fitted(lm_cars), resid(lm_cars), xlab = 'Fitted Values', ylab = 'Residuals')
abline(h = 0, lty = 2)
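One informal way to put a number on the increase in scatter is to compare the standard deviation of the residuals for the lower and upper halves of the fitted values. This is only a rough sketch and not a formal test of constant variance; the split at the median and the names upper_half and spread_by_half are assumptions made for illustration.

```r
# Refit the model so this block is self-contained
lm_cars <- lm(dist ~ speed, data = cars)

# TRUE for observations whose fitted value is above the median fitted value
upper_half <- fitted(lm_cars) > median(fitted(lm_cars))

# Standard deviation of the residuals within each half
# (tapply orders the groups FALSE, TRUE)
spread_by_half <- tapply(resid(lm_cars), upper_half, sd)
names(spread_by_half) <- c("lower fitted values", "upper fitted values")
spread_by_half
```

A noticeably larger value for the upper half would be consistent with the widening scatter seen in the residual plot.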
The Q-Q plot is shown next; it shows that the residuals are approximately normally distributed. There is some deviation from the straight line at the higher values, which means the residuals deviate from a normal distribution in the upper tail. The plot suggests that the residual distribution is slightly right-skewed, since the right tail is heavier than the left. This is consistent with the pattern seen in the residual plot previously. Even so, the points generally follow a straight line and there are no distinct or pronounced patterns deviating from it. It appears reasonable to say that speed is sufficient to predict stopping distance.
qqnorm(resid(lm_cars))
qqline(resid(lm_cars))
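The right skew read off the Q-Q plot can be cross-checked with the sample skewness of the residuals. This is a minimal sketch using the simple moment-based definition (the mean of the cubed standardized residuals); other skewness estimators exist, and the names res, z, and skewness_hat are illustrative.

```r
# Refit the model so this block is self-contained
lm_cars <- lm(dist ~ speed, data = cars)
res <- resid(lm_cars)

# Standardize the residuals using the population-style standard deviation
z <- (res - mean(res)) / sqrt(mean((res - mean(res))^2))

# Moment-based sample skewness: mean of the cubed standardized values
skewness_hat <- mean(z^3)
skewness_hat  # a positive value indicates a heavier right tail
```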
The diagnostic plots can be seen below. When viewing these plots
together, you can see that the right side of the residual data is
slightly more scattered, but overall there are no extreme outliers and
this linear model seems to meet the assumptions of linear regression.
It can be concluded that speed can sufficiently explain the variation
in the data.
par(mfrow=c(2,2))
plot(lm_cars)
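The quantities behind the diagnostic plots can also be inspected numerically, which makes it easier to check individual observations. The sketch below uses the base R functions rstandard() and cooks.distance(); the cutoff of 2 for standardized residuals is a common rule of thumb assumed here for illustration, and std_res and cooks_d are illustrative names.

```r
# Refit the model so this block is self-contained
lm_cars <- lm(dist ~ speed, data = cars)

# Standardized residuals and Cook's distances, one value per observation
std_res <- rstandard(lm_cars)
cooks_d <- cooks.distance(lm_cars)

# Observations whose standardized residual exceeds 2 in absolute value
# (the threshold of 2 is an assumed rule of thumb, not a formal test)
which(abs(std_res) > 2)

# Observation with the largest Cook's distance
cooks_d[which.max(cooks_d)]
```

Any observations flagged here are worth a second look in the plots above before deciding whether they materially affect the fit.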