Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)
The cars dataset has 2 variables: speed (in mph) and stopping distance (dist, in feet).
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
When the data were collected, the cars were going fairly slowly: speeds ranged from 4 to 25 mph, with a mean of 15.4 mph.
summary(cars$speed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 15.0 15.4 19.0 25.0
boxplot(cars$speed)
Stopping distance ranges from 2 to 120 ft, with a mean of 42.98 ft. The boxplot also shows an outlier at the upper end: the 120 ft maximum lies well beyond the upper whisker. The following summary and boxplot depict that.
summary(cars$dist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 26.00 36.00 42.98 56.00 120.00
boxplot(cars$dist)
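To see which observations the boxplot treats as outliers, we can ask R for them directly. This is a minimal sketch: `boxplot.stats()` is the base R helper that applies the usual 1.5 × IQR whisker rule, and the variable names (`dist_outliers`, `upper_fence`) are only illustrative.

```r
# Values that fall outside the boxplot whiskers (1.5 * IQR rule).
dist_stats <- boxplot.stats(cars$dist)
dist_outliers <- dist_stats$out
print(dist_outliers)

# The same rule written out by hand, using the quartiles from quantile().
q <- quantile(cars$dist, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
print(upper_fence)
print(cars$dist[cars$dist > upper_fence])
```

With the quartiles reported above (26 and 56), the upper fence works out to 56 + 1.5 × 30 = 101 ft, so the 120 ft observation is flagged.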
When we look at the relationship between speed and stopping distance, we clearly see a positive correlation between these 2 variables: when speed increases, stopping distance also increases.
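x <- cars$speed # car speed (mph)
y <- cars$dist # stopping distance (ft)
cars_lm <- lm(y ~ x) # simple linear model: distance as a function of speed
library(ggplot2)
# qplot() is deprecated since ggplot2 3.4.0, so the plot is built with ggplot()
ggplot(data.frame(x, y), aes(x, y)) +
geom_point() +
geom_abline(intercept = coef(cars_lm)[1],
slope = coef(cars_lm)[2]) +
expand_limits(y = -10) + # extend the y-axis below zero
labs(x = "Speed (mph)", y = "Stopping Distance (ft)",
title = "Cars Speed vs. Stopping Distance")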
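The plot shows the direction of the relationship; to put a number on its strength, we can compute the Pearson correlation coefficient. This is a small sketch using base R's `cor()` and `cor.test()`; the names `r` and `ct` are just illustrative.

```r
# Pearson correlation between speed and stopping distance.
r <- cor(cars$speed, cars$dist)
print(round(r, 4))

# Test of the null hypothesis that the true correlation is zero.
ct <- cor.test(cars$speed, cars$dist)
print(ct$p.value)

# In a simple linear regression, R-squared is the square of this correlation.
print(round(r^2, 4))
```

The coefficient comes out at about 0.81, a fairly strong positive linear association.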
Let’s evaluate the quality of our model using the summary output. Because there is only 1 explanatory variable (speed), this is called a simple linear regression.
# we already did that above, but I'll do it cleanly here
model <- lm(dist ~ speed, cars)
summary(model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Let’s read the above output of our model summary:
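Intercept= −17.5791. According to the model, a car going 0 mph would have a stopping distance of about −18 ft. A negative distance is not physically possible, and 0 mph lies outside the observed speed range (4 to 25 mph), so the intercept has no meaningful interpretation for this model. The "*" near it only indicates that the intercept differs from zero at the 0.05 significance level (p = 0.0123); it says nothing about whether the value is realistic.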
Slope= 3.9324. For every 1 mph increase in a car’s speed, the model suggests that its stopping distance increases by about 4 feet.
Standard error= 6.7584 (for intercept) and 0.4155 (for slope). The slope estimate is about 9.5 times its standard error (this ratio is the t value), so it is estimated quite precisely. The intercept is only about 2.6 standard errors away from zero, so it is estimated with much more relative uncertainty.
P-value= 1.49e-12 for speed (three significance stars, i.e. below 0.001) and 0.0123 for the intercept (one star, i.e. below 0.05). The very small p-value for speed means that, if speed had no linear effect on stopping distance, a slope this far from zero would be extremely unlikely. Speed is therefore a relevant predictor in the model.
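R-squared= 0.6511; adjusted R-squared: 0.6438. This means that speed explains about 65% of the variation in stopping distance. Adding more explanatory variables can never lower R-squared, even when they are irrelevant, which is why adjusted R-squared (which penalizes extra terms) is the better measure for comparing models. An R-squared close to 1 would mean the model fits the observed data almost perfectly.
Degrees of freedom: 48. The model was fit to 50 observations and estimates 2 coefficients (intercept and slope), leaving 50 − 2 = 48 residual degrees of freedom.
F-statistic= 89.57 on 1 and 48 degrees of freedom (p-value 1.49e-12). It’s high, so the variation explained by the model is much greater than the unexplained (residual) variation, and the model as a whole is significant. With a single predictor, the F-statistic equals the square of the slope's t value (9.464² ≈ 89.57), so the two tests agree.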
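The quantities read off the summary can be recomputed by hand, which is a useful check on the interpretation. The sketch below refits the same model and rebuilds the t values, residual degrees of freedom, adjusted R-squared, and F-statistic from their definitions; the `*_by_hand` names are only illustrative.

```r
model <- lm(dist ~ speed, data = cars)
s <- summary(model)
coefs <- coef(s)  # matrix with Estimate, Std. Error, t value, Pr(>|t|)

# t value = estimate divided by its standard error
t_by_hand <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# residual degrees of freedom = observations minus estimated coefficients
n <- nrow(cars)
p <- 1  # number of predictors (speed only)
df_by_hand <- n - (p + 1)

# adjusted R-squared penalizes R-squared for the number of predictors
adj_r2_by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

# with one predictor, F equals the squared t value of the slope
f_from_t <- coefs["speed", "t value"]^2

print(round(t_by_hand, 3))
print(df_by_hand)
print(round(adj_r2_by_hand, 4))
print(round(f_from_t, 2))
```

Each value should match the corresponding entry in the summary output above.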
library(ggplot2)
library(grid)
library(gridExtra)
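# residuals (y-axis) against fitted values (x-axis), with a reference line at zero
res_df <- data.frame(fitted = fitted(cars_lm), resid = resid(cars_lm))
plot1 <- ggplot(res_df, aes(fitted, resid)) + geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Fitted Values", y = "Residuals")
# normal Q-Q plot of the residuals, with a reference line
plot2 <- ggplot(res_df, aes(sample = resid)) + geom_qq() + geom_qq_line()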
grid.arrange(plot1, plot2, ncol=1, nrow=2)
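We can see that more points fall below zero than above zero (the median residual is −2.27). Since a residual is the observed distance minus the fitted distance, this tells us that our model overestimates a car's real stopping distance more often than it underestimates it. The underestimates, though fewer, are larger: the largest residual is 43.2 ft, against a smallest of −29.1 ft.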
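The visual impression can be checked by counting the residuals on each side of zero. This is a minimal sketch on a refit of the same model; `n_neg` and `n_pos` are illustrative names.

```r
model <- lm(dist ~ speed, data = cars)
res <- resid(model)

# Count residuals below and above zero.
n_neg <- sum(res < 0)  # fitted value above the observed distance (overestimate)
n_pos <- sum(res > 0)  # fitted value below the observed distance (underestimate)
print(c(negative = n_neg, positive = n_pos))

# Least squares with an intercept forces the residuals to average to zero,
# so the imbalance in counts reflects skew, not an overall bias in the mean.
print(mean(res))
print(median(res))
```

A negative median together with a mean of essentially zero points to a right-skewed residual distribution, which is also what the upper tail of the Q-Q plot suggests.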
I conclude with the following:
The significance of the speed explanatory variable agrees with the significance of the overall model according to the F-statistic, as it must in a simple linear regression. This tells us that speed is a good predictor and that the model explains more of the variation than it leaves unexplained. However, about 35% of the variation in stopping distance remains unexplained. Additional relevant explanatory variables might raise the R-squared, but this dataset contains only 2 variables in total (speed and dist), so no other predictors are available.
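To illustrate what the model can be used for, here is a short sketch that predicts stopping distance at a few speeds with base R's `predict()`. The speeds (10, 15, and 20 mph) are my own choice for illustration and sit inside the observed 4 to 25 mph range; predictions outside that range would be extrapolation.

```r
model <- lm(dist ~ speed, data = cars)

# Hypothetical speeds to predict for (not taken from the assignment).
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions with 95% prediction intervals for a single new car.
pred <- predict(model, newdata = new_speeds,
                interval = "prediction", level = 0.95)
print(cbind(new_speeds, round(pred, 1)))
```

The prediction intervals are wide, which is consistent with a residual standard error of about 15 ft and an R-squared of 0.65.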