1 Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis from chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).

library(tidyverse)  # attaches ggplot2, dplyr, and the %>% pipe used below
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

There are only two variables in the cars dataset, speed and dist. We’ll look at distance as a function of speed.

1.1 Plot the variables

Stopping distance (dist) is the response (y) variable, since it depends on speed, the explanatory (x) variable.

cars %>%
  ggplot(aes(x=speed, y=dist, color=dist)) +
  geom_point() +
  labs(title = "Distance based on speed", x="Speed",y="Distance",colour="Distance")

## Regression using linear model

cars_linear <- lm(dist ~ speed,data=cars)
coef(cars_linear)
## (Intercept)       speed 
##  -17.579095    3.932409
summary(cars_linear)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

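The model’s intercept is negative (-17.58), which is not physically meaningful: a car travelling at speed 0 should have a stopping distance of 0, not a negative one. The intercept is an extrapolation below the slowest observed speed (4), so this suggests the straight-line form does not hold at very low speeds and that the model could be modified, for example with a transformation of the variables. The slope says that each additional unit of speed is associated with an increase of 3.93 units in stopping distance.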
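To make the intercept and slope concrete, the fitted model can be evaluated at a few speeds with `predict()`. This is a minimal sketch: the speeds in `new_speeds` are hypothetical values chosen for illustration, and the model is refit so the snippet runs on its own.

```r
# Refit the model so this snippet is self-contained
cars_linear <- lm(dist ~ speed, data = cars)

# Hypothetical speeds for illustration; 0 lies outside the observed range (4 to 25)
new_speeds <- data.frame(speed = c(0, 10, 11, 20))
pred <- predict(cars_linear, newdata = new_speeds)
print(cbind(new_speeds, predicted_dist = pred))

# Predictions one unit of speed apart differ by exactly the slope
slope_check <- unname(pred[3] - pred[2])
print(slope_check)
```

The prediction at speed 0 reproduces the negative intercept, which shows why the model should not be trusted outside the range of observed speeds.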

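The p-value for the speed coefficient (1.49e-12), which in a one-predictor model is also the p-value of the overall F-test, is far below the 0.05 significance level, so speed is a statistically significant predictor of stopping distance. The model explains about 65% of the variation in distance (R-squared = 0.6511), which is consistent with the strong correlation between the variables, since in simple linear regression R-squared is the square of the correlation coefficient. The residuals have a median near 0 (-2.272) and first and third quartiles of similar magnitude (-9.525 and 9.215), but the maximum (43.201) lies farther from 0 than the minimum (-29.069). The residuals are therefore roughly symmetric in the middle with a longer right tail, close enough to a normal distribution for the model to be usable, subject to the residual plots in section 1.3.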
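In simple linear regression, R-squared equals the square of the Pearson correlation between the two variables, so the 65% figure and the correlation describe the same strength of relationship. A quick check, refitting the model so the snippet runs on its own:

```r
# Refit the model so this snippet is self-contained
cars_linear <- lm(dist ~ speed, data = cars)

# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)

# R-squared as reported by summary()
r_squared <- summary(cars_linear)$r.squared

print(c(r = r, r_squared_from_cor = r^2, r_squared_from_model = r_squared))
```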

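The model’s residual standard error of 15.38 is somewhat higher than 1.5 times the first and third quartiles of the residuals (1.5 × 9.525 ≈ 14.3 and 1.5 × 9.215 ≈ 13.8), the value expected for normally distributed residuals. The residuals are therefore a bit more spread out than desired.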
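The comparison between the residual standard error and the residual quartiles can be reproduced directly from the fitted model. This is a small sketch of that rule-of-thumb check; the variable names `res`, `rse`, and `q` are introduced here for illustration.

```r
# Refit the model so this snippet is self-contained
cars_linear <- lm(dist ~ speed, data = cars)

res <- resid(cars_linear)
rse <- summary(cars_linear)$sigma          # residual standard error
q   <- quantile(res, probs = c(0.25, 0.75)) # first and third quartiles

print(rse)
print(1.5 * abs(q))  # rule-of-thumb targets for normally distributed residuals
```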

plot(cars_linear)

1.2 Add linear model line to the scatterplot

cars %>%
  ggplot(aes(x=speed,y=dist))+
  geom_point() +
  geom_smooth(method="lm",color="red") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

cor(cars$speed,cars$dist)
## [1] 0.8068949

The correlation is 0.81, which is close to 1 and indicates a strong positive linear relationship between speed and stopping distance. Its square, 0.651, matches the model’s R-squared.

1.3 Residual Analysis and QQPlot

plot(fitted(cars_linear), resid(cars_linear))
abline(0,0)

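The residuals are scattered fairly evenly above and below 0 with no obvious pattern, which supports the assumption that a straight line is an appropriate form for the model. This plot checks linearity and the spread of the residuals; normality is better judged from the Q-Q plot.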
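The visual impression from the residual plot can be backed up with a few numbers. The sketch below is an informal check, not a formal test: it counts residuals on each side of 0 and compares their spread in the lower and upper halves of the fitted values (the median split is an arbitrary choice made here for illustration).

```r
# Refit the model so this snippet is self-contained
cars_linear <- lm(dist ~ speed, data = cars)

res <- resid(cars_linear)
fit <- fitted(cars_linear)

# How many residuals fall above and below 0
counts <- table(ifelse(res > 0, "above 0", "below 0"))
print(counts)

# Spread of residuals in the lower vs upper half of fitted values
half <- ifelse(fit <= median(fit), "lower fitted", "upper fitted")
spread <- tapply(res, half, sd)
print(spread)
```

Similar counts on each side of 0 agree with the plot, while a clearly larger standard deviation in one half would hint at non-constant variance.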

qqnorm(resid(cars_linear))
qqline(resid(cars_linear))

Between theoretical quantiles 1 and 2 the points deviate more from the reference line, which reflects the few large positive residuals seen in the model summary. Aside from this, the residuals follow the line closely, suggesting they are approximately normally distributed and that the model is a reasonable predictor of distance as a function of speed.
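A formal test can complement the visual reading of the Q-Q plot. The sketch below uses the Shapiro-Wilk test (`shapiro.test()` from base R), which is not part of the original analysis and is added here only as an optional cross-check. A small p-value would indicate a departure from normality.

```r
# Refit the model so this snippet is self-contained
cars_linear <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test of normality on the model residuals
sw <- shapiro.test(resid(cars_linear))
print(sw)
```

With only 50 observations the test has limited power, so its result is best read together with the Q-Q plot and not as a replacement for it.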