
Simple Linear Modeling

Problem Statement

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

Dependent variable: Stopping distance

Independent variable: Speed

Load the Data

Load the cars data

# Load the necessary package
library(readr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the cars dataset
data(cars)

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Visualize the data

First, we plot a scatter diagram to check whether a linear relationship exists between the car’s speed and its stopping distance.

The figure below shows that, as expected, stopping distance tends to increase with speed, so there is a relationship to model.

# Scatter plot of stopping distance against speed (base graphics, no extra packages needed)
plot(cars$speed, cars$dist, main = "Stopping Distance vs Speed", xlab = "Speed", ylab = "Stopping Distance")
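
As a rough numerical companion to the scatter plot, the strength of the linear association can be summarized with the Pearson correlation coefficient. This is a minimal sketch that was not part of the original analysis; the variable name `speed_dist_cor` is introduced here for illustration only.

```r
# Pearson correlation between speed and stopping distance
# (uses the built-in cars dataset; speed_dist_cor is an illustrative name)
data(cars)
speed_dist_cor <- cor(cars$speed, cars$dist)
speed_dist_cor
```

A value close to 1 would be consistent with the upward trend seen in the plot, although correlation alone does not confirm that a straight line is the right functional form.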

Simple Linear model of the data

Below is the simple linear model of the data, with speed as the independent variable and dist as the response variable. The fitted equation is given after the model output.

# Create a linear model predicting stopping distance based on speed
cars.lm <- lm(dist ~ speed, data = cars)

cars.lm
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

dist = -17.579 + 3.932 × speed
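
To make the fitted equation concrete, the sketch below uses `predict()` to estimate stopping distance at two speeds and checks the result against the coefficients applied by hand. The speeds 10 and 20 are hypothetical values chosen for illustration, and the names `new_speeds`, `predicted_dist`, and `manual_dist` do not appear in the original analysis.

```r
# Refit the model so the example is self-contained
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Hypothetical speeds chosen for illustration (not from the original analysis)
new_speeds <- data.frame(speed = c(10, 20))

# Predictions from the fitted model
predicted_dist <- predict(cars.lm, newdata = new_speeds)
predicted_dist

# The same predictions computed directly as intercept + slope * speed
b <- coef(cars.lm)
manual_dist <- b["(Intercept)"] + b["speed"] * new_speeds$speed
manual_dist

# Range of observed speeds; predictions outside it are extrapolations
range(cars$speed)
```

Both approaches should agree. Comparing the chosen speeds with `range(cars$speed)` is a quick way to see whether a prediction falls inside the observed data.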

Evaluate the Quality of the Model

# Summary of the model
summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Interpretation of the model output

  • Coefficients: In this model, the y-intercept is -17.579 and the slope is 3.932. The fitted equation predicts stopping distance (dist) from the car’s speed (speed): for every one-unit increase in speed, the stopping distance is expected to increase by 3.932 units. The intercept, -17.579, is the estimated stopping distance when the speed is zero. A negative stopping distance is not physically possible; the regression line superimposed on the data below suggests that the negative intercept arises from extrapolating the line beyond the range of the observed data.
# Scatter plot with regression line
plot(cars$speed, cars$dist, main = "Stopping Distance vs Speed", xlab = "Speed", ylab = "Stopping Distance")
abline(cars.lm, col = "red")

  • Statistical Significance: The p-value for the speed coefficient is very small (1.49e-12), indicating that the relationship between speed and stopping distance is statistically significant.

  • Practical Significance: The slope indicates a positive relationship between speed and stopping distance, but the negative intercept has no real-world meaning. It signals that the model should not be extrapolated beyond the range of the data from which it was derived, especially toward a speed of zero, where it predicts an impossible negative stopping distance. The model is useful for predicting stopping distances within the range of observed speeds; caution is needed when applying it outside this range.

  • Fit of the Model: The Multiple R-squared value is 0.6511, indicating that approximately 65.11% of the variance in stopping distance is explained by the model. The Adjusted R-squared value, which adjusts for the number of predictors in the model, is 0.6438. With a single predictor the two values are close, and both show that the model explains roughly two thirds of the variance in stopping distance.

  • Residuals: The residuals range from -29.069 to 43.201, which is wide relative to the residual standard error of 15.38 and suggests some large prediction errors in the model.
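
The quantities interpreted in the list above can be reproduced by hand from the residuals, which may help clarify what `summary()` reports. This is a sketch using standard textbook formulas, not the original author's code; all variable names below are introduced for illustration.

```r
# Refit the model so the example is self-contained
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

res <- residuals(cars.lm)
n   <- nrow(cars)             # number of observations
df  <- df.residual(cars.lm)   # residual degrees of freedom (n - 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(res^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)
r_squared <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes for the number of estimated coefficients
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df

# Residual standard error
rse <- sqrt(ss_res / df)

# t statistic and two-sided p-value for the speed coefficient
coefs   <- coef(summary(cars.lm))
t_speed <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]
p_speed <- 2 * pt(-abs(t_speed), df = df)

c(r_squared = r_squared, adj_r_squared = adj_r_squared,
  rse = rse, t_speed = t_speed, p_speed = p_speed)
```

These values should match the corresponding entries in the `summary(cars.lm)` output up to rounding.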

par(mfrow=c(2,2))
plot(cars.lm)

Residual analysis

Below is the analysis of the residual plots above:

  1. Residuals vs Fitted: There’s no clear pattern to the residuals, which is good as it suggests that the relationship is linear and the model fits well across all levels of the fitted values. However, there are a few outliers present.

  2. Normal Q-Q Plot: The residuals largely follow the theoretical line, indicating that they are approximately normally distributed. There are a few deviations at the high end (the right side of the plot), which suggest the presence of some outliers.

  3. Scale-Location: The spread of residuals seems constant across the range of fitted values, indicating homoscedasticity. Again, a few outliers are visible.

  4. Residuals vs Leverage: Most data points have low leverage, but a few points have higher leverage, which could potentially be influential. The Cook’s distance lines don’t indicate any points with unduly large influence on the model.

Overall, the model seems to fit the data reasonably well, though you may want to investigate the outliers further to determine if they should be included in the model.
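
One way to follow up on the outliers is to complement the visual diagnostics with numbers. The sketch below lists the largest standardized residuals, flags observations by Cook's distance, and runs a Shapiro-Wilk test on the residuals. The 4/n cutoff is a common rule of thumb and is an assumption here, not something used in the original analysis; the variable names are illustrative.

```r
# Refit the model so the example is self-contained
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Observations with the largest absolute standardized residuals
std_res <- rstandard(cars.lm)
head(sort(abs(std_res), decreasing = TRUE), 3)

# Cook's distance for each observation
cooks <- cooks.distance(cars.lm)

# Assumed rule-of-thumb cutoff of 4/n (a heuristic, not a formal test)
cutoff  <- 4 / nrow(cars)
flagged <- which(cooks > cutoff)
cars[flagged, ]

# Shapiro-Wilk test of normality on the residuals
shapiro.test(residuals(cars.lm))
```

Any rows flagged this way are candidates for closer inspection rather than automatic removal; whether to keep them depends on whether they reflect measurement problems or genuine variation.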