Question: Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

Loading the Built-in cars Dataset and Libraries:

library(datasets)
data(cars)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

View the First and Last 10 Rows:

head(cars, n=10)
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
tail(cars, n=10)
##    speed dist
## 41    20   52
## 42    20   56
## 43    20   64
## 44    22   66
## 45    23   54
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85

Number of Observations:

nrow(cars)
## [1] 50
ncol(cars)
## [1] 2
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
min(cars$speed)
## [1] 4
max(cars$speed)
## [1] 25

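Visualize the Data:

# In the cars dataset, speed is recorded in mph and stopping distance in ft
plot(cars[,"speed"], cars[,"dist"], main="Stopping Distance vs. Speed",
     xlab="Speed (mph)", ylab="Stopping distance (ft)")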

Summary Statistics:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Checking Missing Values:

table(is.na(cars))
## 
## FALSE 
##   100

Compute the Model:

cars.lm <- lm(dist ~ speed, data=cars)

cars.lm
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
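
The fitted coefficients define the line dist = -17.579 + 3.932 * speed. As a minimal sketch of how that line turns into predictions, the snippet below refits the same model under the name fit (so it runs on its own) and predicts stopping distance for a few speeds. The speeds 10, 15 and 20 are hypothetical values picked for illustration, chosen to stay inside the observed range.

```r
# Refit the same model so this snippet is self-contained
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph), inside the observed 4-25 range
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions of stopping distance (ft) from the fitted line
pred <- predict(fit, newdata = new_speeds)

# The same numbers computed by hand as intercept + slope * speed
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * new_speeds$speed

# Side-by-side comparison of both calculations
cbind(new_speeds, pred, by_hand)
```

Both columns should agree, since predict() evaluates the same linear equation.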

Plot the Original Data along with the Fitted Line:

plot(dist ~ speed, data=cars)
abline(cars.lm)

Evaluating the Quality of the Model:

summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Interpretation of the Results:

Residuals: the differences between the actual measured values and the corresponding values on the fitted regression line.

Min and Max - the residuals of the points furthest below and furthest above the regression line. Here the residuals range from -29.069 to 43.201.

1Q and 3Q - the first and third quartiles of the sorted residual values. Here they are -9.525 and 9.215, so the middle 50% of the residuals lie within this range.

Median - the median value of all of the residuals, here -2.272. For a well-fitting line we would expect the median to be near zero and 1Q and 3Q to be of roughly equal magnitude.
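
The residual summary can be reproduced directly from the model object. This is a small sketch, assuming the same model is refit under the name fit; it recomputes the five-number summary, the width of the interquartile range, and the mean of the residuals.

```r
# Refit the same model so this snippet is self-contained
fit <- lm(dist ~ speed, data = cars)

# Residual = observed dist minus fitted dist
res <- resid(fit)

# Min, 1Q, median, 3Q, max of the residuals
five_num <- quantile(res)

# Width of the interquartile range (3Q minus 1Q)
iqr_width <- IQR(res)

# For a least-squares fit with an intercept, the residuals average to zero
mean_res <- mean(res)

round(five_num, 3)
round(iqr_width, 3)
```

Note that the mean of the residuals is zero by construction, while the median (-2.272 here) need not be.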

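Intercept: The estimated intercept of the regression line is approximately -17.5791. It represents the predicted value of the dependent variable (dist) when the independent variable (speed) is zero. A negative stopping distance is not physically meaningful; a speed of zero lies outside the observed range of 4 to 25, so the intercept is best read as a fitting parameter rather than a real prediction.

Speed: The estimated coefficient for the variable “speed” is approximately 3.9324. This indicates that for each one-unit increase in speed, the stopping distance is estimated to increase by approximately 3.9324 units.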

Significance codes:

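Both the intercept and the coefficient for “speed” are significant at the 0.05 level. The intercept (p = 0.0123) is marked with ’*’, and the “speed” coefficient (p = 1.49e-12) is marked with ’***’, meaning it is significant even at the 0.001 level. Speed therefore has a statistically significant relationship with the dependent variable; the intercept is a model parameter, not a variable.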
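
The significance codes can be checked against the numeric p-values rather than read off the stars. The sketch below, which again refits the model as fit, pulls the coefficient table out of the summary and adds 95% confidence intervals with confint(); the interval level is a choice made here for illustration.

```r
# Refit the same model so this snippet is self-contained
fit <- lm(dist ~ speed, data = cars)

# Matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit))

# p-values for the intercept and the speed coefficient
p_values <- coef_table[, "Pr(>|t|)"]

# Which coefficients clear the 0.05 and 0.001 thresholds?
p_values < 0.05
p_values < 0.001

# 95% confidence intervals for both coefficients
ci <- confint(fit, level = 0.95)
ci
```

A confidence interval for speed that excludes zero tells the same story as its small p-value.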

Residual standard error: This is an estimate of the standard deviation of the error term in the regression model. In this case, it’s approximately 15.38.
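
As a rough check on where this number comes from, the residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom (48 here). The sketch below computes it by hand and compares it with sigma(), assuming the model is refit as fit.

```r
# Refit the same model so this snippet is self-contained
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares
ss_res <- sum(resid(fit)^2)

# Residual degrees of freedom: n minus the number of estimated coefficients
df_res <- df.residual(fit)

# Residual standard error computed by hand
rse_manual <- sqrt(ss_res / df_res)

# The value extracted from the model object, for comparison
rse_builtin <- sigma(fit)

c(manual = rse_manual, builtin = rse_builtin, df = df_res)
```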

Multiple R-squared and Adjusted R-squared:

The multiple R-squared value (0.6511) shows that approximately 65.11% of the variance in the dependent variable “dist” can be explained by the independent variable “speed” in this model.

The adjusted R-squared value (0.6438) penalizes the R-squared value for the number of predictors in the model. With a single predictor and 50 observations the penalty is small, so it is only slightly lower than the multiple R-squared value.
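
Both values can be recomputed from their definitions, which makes the adjustment concrete. This is a sketch under the usual formulas, R-squared = 1 - SS_res / SS_tot and adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), with p = 1 predictor; the model is refit as fit so the snippet stands alone.

```r
# Refit the same model so this snippet is self-contained
fit <- lm(dist ~ speed, data = cars)

# Unexplained variation: residual sum of squares
ss_res <- sum(resid(fit)^2)

# Total variation of dist around its mean
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# Multiple R-squared from its definition
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared: n observations, p predictors (excluding the intercept)
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(c(r2 = r2, adj_r2 = adj_r2), 4)
```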

F-statistic and p-value: The F-statistic tests the overall significance of the regression model. The extremely low p-value (1.49e-12) indicates that the regression model as a whole is statistically significant, which means that at least one independent variable has a non-zero coefficient in predicting the dependent variable.
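
With only one predictor, the overall F-test and the t-test on the speed coefficient test the same hypothesis, so the F-statistic equals the squared t value (9.464 squared is about 89.57) and the two p-values coincide. The sketch below illustrates this, assuming the model is refit as fit.

```r
# Refit the same model so this snippet is self-contained
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# Named vector: value, numdf, dendf
f_stat <- s$fstatistic

# p-value of the F-test from the upper tail of the F distribution
p_f <- pf(f_stat[["value"]], f_stat[["numdf"]], f_stat[["dendf"]],
          lower.tail = FALSE)

# t value and p-value for the speed coefficient
t_speed <- coef(s)["speed", "t value"]
p_t <- coef(s)["speed", "Pr(>|t|)"]

c(F = f_stat[["value"]], t_squared = t_speed^2, p_F = p_f, p_t = p_t)
```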

If the line is a good fit with the data, we would expect residual values that are normally distributed around a mean of zero. The model appears to be statistically significant, with the speed variable having a significant positive effect on the stopping distance of a car.

ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

Applying the Model:

linear_model <- lm(dist~speed,cars)

plot(linear_model)

Residual Analysis:

plot(fitted(cars.lm),resid(cars.lm))

qqnorm(resid(cars.lm))
qqline(resid(cars.lm))

par(mfrow=c(2,2))
plot(cars.lm)
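
The plots above are visual checks. As a complement, and not as part of the textbook's procedure, a formal normality test such as Shapiro-Wilk can be applied to the residuals. The sketch below uses shapiro.test() from base R on a refit of the same model; the 0.05 threshold in the comment is a conventional choice, not something derived from this analysis.

```r
# Refit the same model so this snippet is self-contained
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)

# A p-value below a chosen threshold (conventionally 0.05) would suggest
# a departure from normality that the Q-Q plot may only hint at
sw$statistic
sw$p.value
```

If the test and the Q-Q plot disagree, it is worth looking at which points in the tails of the Q-Q plot drift away from the reference line.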

Conclusion:

The relationship between stopping distance and speed is statistically significant (p-value of 1.49e-12), which is consistent with the physical expectation that the faster a car is going, the longer it takes to stop.

Based on the model, for every additional mph of speed, the stopping distance increases by about 3.932 ft.

The quantile-quantile plot is fairly close to a straight line, suggesting that the residuals are approximately normally distributed.

The adjusted R-squared value is 0.64, which means that around 64% of the variance in stopping distance can be explained by the speed of the car alone.

The lower the p-value, the stronger the evidence against the null hypothesis that speed has no effect on stopping distance.

A p-value of 0.05 or lower is generally considered statistically significant.