Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed, and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

Visualization

The cars dataset has 2 variables: speed (in mph) and stopping distance (dist, in feet).

str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

When the data was collected, the cars were traveling at fairly low speeds: speed ranges from 4 to 25 mph (mean = 15.4 mph).

summary(cars$speed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    12.0    15.0    15.4    19.0    25.0
boxplot(cars$speed)

Stopping distance ranges from 2 to 120 ft, with an average of 42.98 ft. We also observe some outliers, as the boxplot below depicts.

summary(cars$dist)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   26.00   36.00   42.98   56.00  120.00
boxplot(cars$dist)

When we look at the correlation between speed and stopping distance, we clearly see a positive relationship between these 2 variables: as speed increases, stopping distance also increases.

x <- cars$speed  # car speed (mph)
y <- cars$dist   # stopping distance (ft)
cars_lm <- lm(y ~ x)   # simple linear model
library(ggplot2)
qplot(x, y,
      ylab="Stopping Distance (ft)", xlab="Speed (mph)",
      main="Cars Speed vs. Stopping Distance") +
  geom_abline(intercept = cars_lm$coefficients[1],
              slope = cars_lm$coefficients[2])
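
As a quick numerical check (a small addition, not part of the plot above), the Pearson correlation can be computed directly. For a simple linear regression its square equals the R-squared reported in the next section, so we expect a value of roughly sqrt(0.6511), about 0.81.

# Pearson correlation between speed and stopping distance;
# squaring it recovers the R-squared of the simple linear model below.
cor(cars$speed, cars$dist)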

Model Creation & Evaluation

Let’s evaluate the quality of our model using the summary output. Because there is only 1 explanatory variable (speed), this is called a simple linear regression.

# we already did that above, but I'll do it cleanly here
model <- lm(dist ~ speed, data = cars)
summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Let’s read the above output of our model summary:

Degrees of freedom: 48. The model was fit on 50 observations and estimates 2 parameters (the intercept and the speed coefficient), leaving 50 - 2 = 48 residual degrees of freedom.

F-statistic = 89.57, with a p-value of 1.49e-12. The statistic is large, meaning the model explains far more of the variation than is left in the errors; the model is therefore significant.
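
For completeness, the same quantities can also be read off the fitted model object programmatically; here is a short sketch using the model object fitted above.

# Pull the key summary quantities out of the fitted model object
s <- summary(model)
s$r.squared          # multiple R-squared (0.6511 above)
s$fstatistic         # F-statistic with its numerator/denominator df
df.residual(model)   # residual degrees of freedom (48)
confint(model)       # 95% confidence intervals for intercept and slope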

Residual Analysis

library(ggplot2)
library(grid)
library(gridExtra)

plot1 <- qplot(cars_lm$fitted.values, cars_lm$residuals, xlab="Fitted Values", ylab="Residuals")

plot2 <- ggplot() + geom_qq(aes(sample = cars_lm$residuals))

grid.arrange(plot1, plot2, ncol=1, nrow=2)

We can see that more points fall below zero than above zero (the median residual is -2.27). This tells us that, for a typical car, our model tends to overestimate the real stopping distance.
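
To back up that reading with numbers (a small check using the cars_lm object from before), we can look at the share of negative residuals and at their median:

# Share of residuals below zero and their median;
# a negative median means the fitted line sits above more than half the points.
mean(residuals(cars_lm) < 0)
median(residuals(cars_lm))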

I conclude with the following:

The significance of the speed explanatory variable agrees with the significance of the overall model according to the F-statistic. This tells us that speed is a good predictor, and that the model is doing more explaining than the errors. However, the R-squared could be increased if we introduced additional explanatory variables, especially if those variables are significant. In this dataset, we only have 2 variables in total, so there is nothing more to add.
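
Purely as an illustration of that last point, and going beyond the two variables actually available, one way to add an explanatory term without new data would be a squared-speed term, since stopping distance is expected to grow roughly with the square of speed:

# Illustrative sketch only: same data, extra quadratic term
model_quad <- lm(dist ~ speed + I(speed^2), data = cars)
summary(model_quad)   # compare its R-squared against the simple model's 0.6511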