Data 605 HW Week 11

Linear model for stopping distance as a function of speed
Simple Linear Regression Model
- Residuals For Linear Regression Model
- Conclusion for Simple Linear Regression Model
Linear Regression with Box Cox Transformation
Linear Regression with Log Transformation
- Residuals for the Log Transformation
- Conclusion for Linear Regression with Log Transformation

Linear model for stopping distance as a function of speed

For this exercise, I performed the linear regression analysis on the original data as well as on data transformed using the Box-Cox and logarithm methods.

plot(cars$speed, cars$dist, main="Scatter plot of Stopping Distance vs Speed", xlab="Speed", ylab="Stopping Distance")

Simple Linear Regression Model

Coefficients:

Estimate for Speed (3.9324): The coefficient indicates that for every 1 mph increase in speed, the stopping distance increases by approximately 3.93 feet.

Standard Error:

Speed (0.4155): The standard error is small relative to the estimate, indicating that the slope is estimated fairly precisely.

T-value:

Speed (9.464): The large positive t-value indicates a positive, highly statistically significant relationship between speed and stopping distance.

P-value

The p-value for speed (1.49e-12) is extremely small, so we reject the null hypothesis that the slope is zero and conclude that speed has a statistically significant effect on stopping distance.

Multiple R-squared (0.6511):

Approximately 65.11% of the variability in stopping distance is explained by the model.

# Build the linear model
car_model <- lm(dist ~ speed, data=cars)

# Summary
summary(car_model)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
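
The quantities interpreted above can also be pulled out of the fitted model programmatically instead of being read off the printed summary. The following is a minimal sketch that refits the same model on the built-in cars data; the object names (fit, coef_table, slope_ci) are illustrative and are not part of the original analysis.

```r
# Refit the model from the text on the built-in cars data
fit <- lm(dist ~ speed, data = cars)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- coef(summary(fit))
print(coef_table)

# The t value is the estimate divided by its standard error
t_speed <- coef_table["speed", "Estimate"] / coef_table["speed", "Std. Error"]

# 95% confidence interval for the slope
slope_ci <- confint(fit, "speed", level = 0.95)
print(slope_ci)

# Multiple R-squared
r2 <- summary(fit)$r.squared
```

A confidence interval for the slope that excludes zero tells the same story as the small p-value, and it also shows the range of plausible values for the increase in stopping distance per mph.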

# Plotting the model with actual vs fitted values
plot(cars$speed, cars$dist, main="Actual vs Fitted Values", xlab="Speed", ylab="Stopping Distance")
abline(car_model, col="red")

Residuals For Linear Regression Model

Residuals vs Fitted

The red line seems fairly flat and horizontal, although there is a slight curve. There doesn’t appear to be any systematic pattern to the residuals, suggesting residuals are randomly distributed and the model’s predictions are unbiased.

Normal Q-Q Plot

Most points seem to follow the reference line suggesting normal distribution, but there are some deviations at the ends, particularly in the upper right. This could indicate that the residuals have heavier tails than the normal distribution and that outliers are affecting the model.

Scale-Location

The residuals appear randomly scattered with a roughly constant spread across the range of fitted values, suggesting homoscedasticity.

Residuals vs Leverage

There are a few points with higher leverage but none exceed the Cook’s distance threshold significantly. Observation 49 appears to be the most influential, but it doesn’t appear to be problematic since it does not exceed Cook’s threshold.

Conclusion for Simple Linear Regression Model

Overall, these diagnostic plots suggest that the simple linear regression model fits reasonably well, although there might be some concerns about non-normality in the residuals due to the Q-Q plot’s tails. We should investigate potential outliers and their impact on the model.

# Residual diagnostic plots for the linear regression model
par(mfrow=c(2,2))
plot(car_model)
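
The visual impressions from the diagnostic plots can be supplemented with numeric checks. The sketch below uses shapiro.test and cooks.distance from base R on a refit of the same model. The 4/n cutoff for Cook's distance is a common rule of thumb and is an assumption here, not something taken from the analysis above (the dashed contours drawn by plot.lm are at 0.5 and 1).

```r
# Refit the simple linear model on the built-in cars data
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test of normality on the residuals
sw <- shapiro.test(residuals(fit))
print(sw)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Row number of the observation with the largest Cook's distance
most_influential <- unname(which.max(cd))
print(most_influential)

# Rule-of-thumb cutoff 4/n (an assumption; other cutoffs are in use)
flagged <- which(cd > 4 / nrow(cars))
print(flagged)
```

Observations flagged this way are candidates for a closer look, for example by refitting the model without them and comparing the coefficients.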

Linear Regression with Box Cox Transformation

Coefficients

Estimate for Speed (0.64483): The coefficient indicates that for every increase in speed by 1 mph, the Box-Cox transformed stopping distance increases by approximately 0.64483 units.

Standard Error

Speed (0.05957): The small standard error implies that the coefficient for speed is estimated fairly precisely.

T-value

Speed (10.825): This large positive t-value indicates a highly statistically significant relationship between speed and the Box-Cox transformed stopping distance.

P-value

Speed (1.77e-14): The small p-value indicates that we should reject the null hypothesis that the slope is zero and conclude that the effect of speed is statistically significant.

Multiple R-squared (0.7094):

Approximately 70.94% of the variability in the transformed stopping distance is accounted for by the model.

Identify Optimal Lambda for Box Cox Transformation

The Box-Cox transformation stabilizes variance and makes a dataset more closely approximate a normal distribution; it is useful when the homoscedasticity or normality assumptions are violated. The parameter λ (lambda) determines the nature and degree of the transformation applied to the data. We use the boxcox function to find the optimal lambda for the data. In this case, the optimal lambda is approximately 0.5.

# Load the library
library(MASS) 

# Find the optimal lambda for Box-Cox Transformation
boxcox(car_model, plotit = TRUE)
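
The optimal lambda can also be read from the object that boxcox returns instead of from the plot. This is a minimal sketch under stated assumptions: the finer lambda grid and the names lambda_hat and lambda_ci are illustrative choices, and the interval uses the usual chi-squared cutoff on the profile log-likelihood.

```r
library(MASS)

# Refit the simple linear model on the built-in cars data
fit <- lm(dist ~ speed, data = cars)

# Profile log-likelihood on a fine lambda grid, without drawing the plot
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Lambda that maximizes the profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
in_ci <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[in_ci])

print(lambda_hat)
print(lambda_ci)
```

A convenient value such as 0.5 (the square root) is often chosen when it falls inside this interval, since it is easier to interpret than the exact maximizer.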

Box Cox Transformation Summary

# Apply the Box-Cox transformation
lambda_optimal <- 0.5
cars$dist_boxcox <- (cars$dist^lambda_optimal - 1) / lambda_optimal

# Build model with Box Cox transformation
car_model_boxcox <- lm(dist_boxcox ~ speed, data=cars)

# Summary
summary(car_model_boxcox)

## 
## Call:
## lm(formula = dist_boxcox ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1369 -1.3966 -0.3598  1.1817  6.3069 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.55410    0.96888   0.572     0.57    
## speed        0.64483    0.05957  10.825 1.77e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.205 on 48 degrees of freedom
## Multiple R-squared:  0.7094, Adjusted R-squared:  0.7034 
## F-statistic: 117.2 on 1 and 48 DF,  p-value: 1.773e-14
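
With λ = 0.5 the Box-Cox transform is (sqrt(dist) - 1) / 0.5 = 2 * sqrt(dist) - 2, which is a linear rescaling of the square root. The model above is therefore equivalent to a regression of sqrt(dist) on speed: the slope is doubled, while the t-value, p-value, and R-squared are unchanged. The sketch below checks this; the data frame d and the model names are illustrative.

```r
lambda <- 0.5

# Local copy of cars with the Box-Cox transformed response
d <- transform(cars, dist_bc = (dist^lambda - 1) / lambda)

# Box-Cox model and the equivalent square-root model
fit_bc   <- lm(dist_bc ~ speed, data = d)
fit_sqrt <- lm(sqrt(dist) ~ speed, data = d)

# The Box-Cox slope is twice the square-root slope
slope_bc   <- unname(coef(fit_bc)["speed"])
slope_sqrt <- unname(coef(fit_sqrt)["speed"])
print(c(boxcox = slope_bc, sqrt = slope_sqrt))

# R-squared is identical because the two responses differ only by a linear map
r2_bc   <- summary(fit_bc)$r.squared
r2_sqrt <- summary(fit_sqrt)$r.squared
print(c(boxcox = r2_bc, sqrt = r2_sqrt))
```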

Box Cox Plot

# Scatter plot of transformed stopping distance vs. speed
plot(cars$speed, cars$dist_boxcox, 
     main = "Scatter Plot of Box-Cox Transformed Stopping Distance vs. Speed",
     xlab = "Speed (mph)", ylab = "Transformed Stopping Distance")

# Adding a regression line to the scatter plot
abline(car_model_boxcox, col = "red")

Residuals for Box Cox Transformation

Residuals vs Fitted

The residuals seem to be scattered around the horizontal line without any clear pattern, which suggests that the relationship is linear and the errors are homoscedastic.

Normal Q-Q Plot

The residuals largely follow the line, particularly in the middle of the distribution. Slight deviations at the ends suggest that the residuals have light tails, but overall the plot suggests an approximately normal distribution.

Scale-Location

The spread of residuals is relatively even across the range of fitted values, suggesting that the residuals are homoscedastic.

Residuals vs Leverage

Conclusion for Linear Regression with Box Cox Transformation

Overall, the plots suggest that the Box-Cox transformation has helped in linearizing the relationship and stabilizing the variance of the errors slightly. The transformation has normalized a bit of the skewness detected in the original linear regression model. The residuals don’t undermine the assumptions of linear regression.

# Residuals plots for the Box Cox Transformation
par(mfrow=c(2,2))
plot(car_model_boxcox)

Linear Regression with Log Transformation

Coefficients

Speed (0.12077): The coefficient implies that for each 1 mph increase in speed, the log of the stopping distance increases by approximately 0.12077 units.

Standard Error

Speed (0.01206): The small standard error implies that the coefficient for speed is estimated fairly precisely.

T-value

Speed (10.015): The positive t-value indicates a strong and statistically significant relationship between speed and the log of the stopping distance.

P-value

Speed (2.41e-13): The low p-value for the speed coefficient leads us to reject the null hypothesis and indicates a statistically significant effect.

Multiple R-squared (0.6763):

Approximately 67.63% of the variability in the log-transformed stopping distance can be explained by the model.

# Apply the log transformation
cars$log_dist <- log(cars$dist)

# Linear model with the log-transformed response
car_model_log <- lm(log_dist ~ speed, data=cars)

# Summary
summary(car_model_log)

## 
## Call:
## lm(formula = log_dist ~ speed, data = cars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.46604 -0.20800 -0.01683  0.24080  1.01519 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.67612    0.19614   8.546 3.34e-11 ***
## speed        0.12077    0.01206  10.015 2.41e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4463 on 48 degrees of freedom
## Multiple R-squared:  0.6763, Adjusted R-squared:  0.6696 
## F-statistic: 100.3 on 1 and 48 DF,  p-value: 2.413e-13
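
Because the response is on the natural-log scale, the slope is easier to read after exponentiating it: adding 0.12077 to log(dist) multiplies dist by exp(0.12077), which is roughly a 12.8% increase in stopping distance per 1 mph. A minimal sketch of that arithmetic, with illustrative variable names:

```r
# Refit the log model on the built-in cars data
fit_log <- lm(log(dist) ~ speed, data = cars)

# Slope on the log scale
b_speed <- unname(coef(fit_log)["speed"])

# Multiplicative effect on stopping distance of a 1 mph increase
multiplier <- exp(b_speed)

# The same effect expressed as a percentage change
pct_change <- 100 * (multiplier - 1)

print(multiplier)
print(pct_change)
```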

# Plot
plot(cars$speed, cars$log_dist, main="Scatter plot of Log(Stopping Distance) vs Speed", xlab="Speed", ylab="Log(Stopping Distance)")

# Add the fitted regression line to the plot
abline(car_model_log, col="red")

Residuals for the Log Transformation

Residuals vs Fitted

The points are randomly dispersed and hover around the horizontal line. There does not appear to be any systematic pattern, suggesting an appropriate linear relationship between the log-transformed stopping distance and speed.

Normal Q-Q Plot

The residuals follow the reference line closely except for a couple of slight deviations at the end suggesting the residuals are approximately normally distributed.

Scale-Location

The residuals do not show any obvious pattern and are spread fairly evenly, indicating that the residuals are homoscedastic.

Residuals vs Leverage

As in the previous models, most data points have low leverage. Once again, point 49 appears to have higher leverage, but it is not beyond the Cook's distance threshold, suggesting it is not disproportionately affecting the model.

Conclusion for Linear Regression with Log Transformation

The log transformation generates a decent linear regression model. The residuals appear to be evenly distributed with a constant variance, and they are mostly normally distributed. There is one data point with relatively high leverage, but it doesn't appear to be overly influential.

# Residuals plot for the log transformation 
par(mfrow=c(2,2))
plot(car_model_log)
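
The R-squared values reported for the three models are measured on different response scales (feet, Box-Cox units, and log units), so they are not directly comparable. One hedged way to put the models on a common footing is to back-transform the fitted values to feet and compare the root mean squared error there. The sketch below assumes a naive back-transformation, which ignores retransformation bias, and all object names are illustrative.

```r
lambda <- 0.5

# Local copy of cars with both transformed responses
d <- transform(cars,
               dist_bc  = (dist^lambda - 1) / lambda,
               log_dist = log(dist))

# The three models from the text
fit_lin <- lm(dist ~ speed, data = d)
fit_bc  <- lm(dist_bc ~ speed, data = d)
fit_log <- lm(log_dist ~ speed, data = d)

# Fitted values back on the original scale (feet)
pred_lin <- fitted(fit_lin)
pred_bc  <- (lambda * fitted(fit_bc) + 1)^(1 / lambda)  # inverse Box-Cox
pred_log <- exp(fitted(fit_log))                        # inverse log

# In-sample root mean squared error in feet
rmse <- function(pred) sqrt(mean((d$dist - pred)^2))
rmse_by_model <- c(linear = rmse(pred_lin),
                   boxcox = rmse(pred_bc),
                   log    = rmse(pred_log))
print(rmse_by_model)
```

These are in-sample errors, so they describe fit rather than predictive accuracy; the residual diagnostics above remain the main basis for judging whether each model's assumptions hold.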


2024-04-07
