DATA605_HW11

HW 11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

cars <- cars

# Data Visualization
plot(cars)

# Or using the function example from the book
plot(cars[,"speed"],cars[,"dist"], main="cars",
    xlab="speed", ylab="dist")

This figure shows that the stopping distance tends to increase as the speed of the car increases.

Next we develop a regression model that will help us quantify the degree of linearity in the relationship between the output (distance) and the predictor (speed).
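As a quick numeric companion to the scatter plot, the sample correlation gives a first measure of how linear the relationship is. This is a minimal sketch, assuming the built-in cars data; the variable name r is only illustrative.

```r
# Pearson correlation between speed (mph) and stopping distance (ft)
r <- cor(cars$speed, cars$dist)
r

# For a single predictor, r^2 is the share of variance a straight line can explain
r^2
```

For a one-predictor model, the square of this correlation equals the Multiple R-squared that summary() reports for the fitted model.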

# Linear model
cars.lm <- lm(dist ~ speed, data=cars)

cars.lm

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

We can represent the output in the following linear equation:

\[ \text{Stopping Distance} = -17.579 + 3.932 \times \text{Speed} \]

The coefficients tell us:

The intercept is -17.579. This value represents the expected stopping distance when the speed is 0. However, in reality, a negative stopping distance isn’t meaningful, and this negative intercept likely reflects the limitations of using a simple linear model for this data.
The slope for speed is \(3.932\). This coefficient represents the average increase in stopping distance for each additional mile per hour of speed.

According to this model, if you increase the speed by 1 mph, the stopping distance increases by 3.932 feet on average.
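The interpretation of the slope can be checked with predict(). The sketch below refits the same model so that it runs on its own; the speeds 10 and 11 mph and the names new_speeds and pred are illustrative choices, not part of the original analysis.

```r
# Refit the model so this snippet is self-contained
cars.lm <- lm(dist ~ speed, data = cars)

# Predicted stopping distance at two speeds that differ by 1 mph
new_speeds <- data.frame(speed = c(10, 11))
pred <- predict(cars.lm, newdata = new_speeds)
pred

# The difference between the two predictions is the slope coefficient
diff(pred)
coef(cars.lm)["speed"]
```

Both speeds lie inside the observed range of the data, which is where the fitted line is most trustworthy.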

# Plotting the model
plot(dist ~ speed, data=cars)
abline(cars.lm, col="red")

When we superimpose the fitted line on this scatter plot, we see that the relationship between the predictor (speed) and the output (distance) is roughly, though not perfectly, linear. As the speed increases, so does the distance required to stop the car.

# Evaluating the Quality of the Model
summary(cars.lm)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

We first look at the residual values reported by summary(). A good model would tend to have a median value near zero, minimum and maximum values of roughly the same magnitude, and first and third quartile values of roughly the same magnitude.

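For this model, the residual values are not too far off what we would expect for Gaussian-distributed numbers: the median (-2.272) is small relative to the spread and the quartiles (-9.525 and 9.215) are nearly symmetric, although the maximum (43.201) is noticeably larger in magnitude than the minimum (-29.069).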
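The five-number summary that summary() prints can be reproduced directly from the residuals. A minimal sketch, refitting the model so the snippet stands alone; the name res is illustrative.

```r
cars.lm <- lm(dist ~ speed, data = cars)
res <- resid(cars.lm)

# Min, 1Q, Median, 3Q, Max of the residuals
quantile(res)

# With an intercept in the model, least-squares residuals average to zero
# (up to floating-point error)
mean(res)
```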

The standard error for the speed coefficient is much smaller than the coefficient itself (ratio of 3.9324/0.4155 ≈ 9.46, which is the reported t value), indicating a reliable estimate for speed’s effect on stopping distance (a1).
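The ratio of each estimate to its standard error is exactly what the "t value" column reports. The sketch below recomputes it from the coefficient table; coefs and t_manual are illustrative names.

```r
cars.lm <- lm(dist ~ speed, data = cars)

# Coefficient table as a numeric matrix
coefs <- coef(summary(cars.lm))

# t value = estimate / standard error, for the intercept and for speed
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
t_manual

# Compare against the column computed by summary()
coefs[, "t value"]
```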

The standard error for the intercept (a0) at 6.7584 is large compared to the intercept value of -17.5791, suggesting less certainty about the intercept’s value. This may impact the model’s accuracy for predictions outside the data range.

Despite this, the precise estimation of the speed coefficient gives us confidence in the model within the observed data range.
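One way to make the uncertainty in both coefficients concrete is a confidence interval. This sketch uses confint() at the conventional 95% level, which is an assumption on our part rather than something the assignment specifies, and also rebuilds the interval by hand from the standard errors.

```r
cars.lm <- lm(dist ~ speed, data = cars)

# 95% confidence intervals for the intercept (a0) and the slope (a1)
ci <- confint(cars.lm, level = 0.95)
ci

# The same intervals by hand: estimate +/- t quantile * standard error
coefs <- coef(summary(cars.lm))
t_crit <- qt(0.975, df = df.residual(cars.lm))
ci_manual <- cbind(coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"],
                   coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"])
ci_manual
```

A wide interval for the intercept next to a comparatively narrow one for the slope would reflect the difference in precision described above.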

The residual standard error of 15.38 (on 48 degrees of freedom) is a measure of the typical distance, in feet, between the observed data points and the fitted line.
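The residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom. A short sketch of that calculation, with rss and rse as illustrative names:

```r
cars.lm <- lm(dist ~ speed, data = cars)

# Residual sum of squares
rss <- sum(resid(cars.lm)^2)

# Degrees of freedom: 50 observations minus 2 estimated coefficients = 48
df_res <- df.residual(cars.lm)

# Residual standard error, computed by hand and taken from summary()
rse <- sqrt(rss / df_res)
rse
summary(cars.lm)$sigma
```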

The model explains 65.11% of the variance (Multiple R-squared), and after adjusting for the number of predictors, it’s 64.38% (Adjusted R-squared).
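Both R-squared values can be recovered from the sums of squares. The sketch below is one way to do it; the names rss, tss, r2 and adj_r2 are illustrative, and p counts the single predictor (speed).

```r
cars.lm <- lm(dist ~ speed, data = cars)

rss <- sum(resid(cars.lm)^2)                  # unexplained variation
tss <- sum((cars$dist - mean(cars$dist))^2)   # total variation in dist

# Multiple R-squared: share of total variation explained by the model
r2 <- 1 - rss / tss

# Adjusted R-squared: penalizes for the number of predictors p
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2)
```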

The F-statistic of 89.57 on 1 and 48 degrees of freedom, with a p-value of 1.49e-12, indicates that the model with speed fits significantly better than a model with only an intercept; the relationship between speed and stopping distance is statistically significant.
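With a single predictor, the F-statistic is the square of the slope's t value, which is why both carry the same p-value. A minimal sketch of that check, using illustrative names s, fstat, t_speed and p_value:

```r
cars.lm <- lm(dist ~ speed, data = cars)
s <- summary(cars.lm)

# F-statistic with its numerator and denominator degrees of freedom
fstat <- s$fstatistic
fstat

# In simple linear regression, F equals the squared t value of the slope
t_speed <- s$coefficients["speed", "t value"]
t_speed^2

# p-value from the upper tail of the F distribution
p_value <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
p_value
```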

# Residual Analysis
plot(fitted(cars.lm),resid(cars.lm))

In this plot, we see that the spread of the residuals tends to increase as we move to the right, toward larger fitted values. Additionally, the residuals are not uniformly scattered above and below zero. Overall, this plot tells us that using speed as the sole predictor in this regression model may not fully explain our data.

# Quantile-quantile (Q-Q) plot
qqnorm(resid(cars.lm))
qqline(resid(cars.lm))

Viewing this plot, we don’t see the residuals diverge significantly from the line. This behavior is consistent with residuals that are approximately normally distributed.

# An alternate way to visualize
par(mfrow=c(2,2))
plot(cars.lm)

Furthermore, the Normal Q-Q panel of these diagnostic plots supports the same conclusion: the residuals mostly follow the reference line, indicating that their distribution is not far from normal. A few points towards the ends deviate slightly from the line, which is common in real-world data.


Haig Bedros

2024-04-06
