For this exercise, I performed linear regression on the original data as well as on data transformed with the Box-Cox and logarithm methods.
plot(cars$speed, cars$dist, main="Scatter plot of Stopping Distance vs Speed", xlab="Speed", ylab="Stopping Distance")
Coefficients:
Estimate for Speed (3.9324): The coefficient indicates that for every 1 mph increase in speed, the stopping distance increases by approximately 3.93 feet.
Standard Error:
Speed (0.4155): The standard error is small relative to the estimate, indicating that the coefficient for speed is estimated precisely.
T-value:
Speed (9.464): The large positive t-value indicates a positive, highly statistically significant relationship between speed and stopping distance.
P-value:
The p-value (1.49e-12) for speed is extremely small. We reject the null hypothesis that speed has no effect and conclude that speed has a statistically significant effect on stopping distance.
Multiple R-squared (0.6511):
Approximately 65.11% of the variability in stopping distance is explained by the model.
# Build the simple linear regression model
car_model <- lm(dist ~ speed, data=cars)
# Summary
summary(car_model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
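The quantities interpreted above can be recomputed from the fitted model as a sanity check. The following is a minimal sketch, assuming the built-in cars dataset; the names fit, coefs, t_speed, p_speed, r2, and ci_speed are introduced here for illustration and are not part of the original analysis.

```r
# Refit the simple linear regression on the built-in cars data
fit <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))

# The t-value is the estimate divided by its standard error
t_speed <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_speed <- 2 * pt(abs(t_speed), df = df.residual(fit), lower.tail = FALSE)

# R-squared as 1 - (residual sum of squares / total sum of squares)
rss <- sum(residuals(fit)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
r2 <- 1 - rss / tss

# 95% confidence interval for the speed coefficient
ci_speed <- confint(fit, "speed", level = 0.95)
```

A confidence interval such as ci_speed tends to convey the precision of the estimate more directly than the standard error alone.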
# Plotting the model with actual vs fitted values
plot(cars$speed, cars$dist, main="Actual vs Fitted Values", xlab="Speed", ylab="Stopping Distance")
abline(car_model, col="red")
Residuals vs Fitted
The red line seems fairly flat and horizontal, although there is a slight curve. There doesn’t appear to be any systematic pattern to the residuals, suggesting residuals are randomly distributed and the model’s predictions are unbiased.
Normal Q-Q Plot
Most points seem to follow the reference line suggesting normal distribution, but there are some deviations at the ends, particularly in the upper right. This could indicate that the residuals have heavier tails than the normal distribution and that outliers are affecting the model.
Scale-Location
The residuals appear randomly scattered with roughly constant spread across the range of fitted values, suggesting homoscedasticity.
Residuals vs Leverage
There are a few points with higher leverage but none exceed the Cook’s distance threshold significantly. Observation 49 appears to be the most influential, but it doesn’t appear to be problematic since it does not exceed Cook’s threshold.
Overall, these diagnostic plots suggest that the simple linear regression model fits reasonably well, although there might be some concerns about non-normality in the residuals due to the Q-Q plot’s tails. We should investigate potential outliers and their impact on the model.
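The four diagnostic plots discussed above, plus two numeric checks that complement the visual reading, could be produced roughly as follows. This is a sketch under assumptions: the model is refit here as fit, and the Shapiro-Wilk test and the Cook's distance lookup are additions for illustration, not steps from the original analysis.

```r
fit <- lm(dist ~ speed, data = cars)

# Draw the four standard diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(residuals(fit))

# Cook's distance: identify the most influential observation
cd <- cooks.distance(fit)
most_influential <- names(which.max(cd))
max_cd <- max(cd)
```

Comparing max_cd against a conventional cutoff such as 0.5 gives a numeric counterpart to the contour lines in the Residuals vs Leverage plot.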
Coefficients
Estimate for Speed (0.64483): The coefficient indicates that for every increase in speed by 1 mph, the Box-Cox transformed stopping distance increases by approximately 0.64483 units.
Standard Error
Speed (0.05957): The small standard error implies that the coefficient for speed is estimated precisely.
T-value
Speed (10.825): The large positive t-value indicates a highly statistically significant relationship between speed and the Box-Cox transformed stopping distance.
P-value
Speed (1.77e-14): The very small p-value means we reject the null hypothesis of no effect; speed has a statistically significant effect on the transformed stopping distance.
Multiple R-squared (0.7094):
Approximately 70.94% of the variability in the transformed stopping distance is accounted for by the model.
The Box-Cox transformation can stabilize variance and make the data more closely approximate a normal distribution; it is useful when homoscedasticity or normality is violated. λ (lambda) determines the nature and degree of the transformation applied to the data. We use the boxcox function to find the optimal lambda for the data. In this case, the optimal lambda is approximately 0.5.
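For reference, the standard one-parameter Box-Cox family for a positive response $y$ can be written as below. The second expression specializes it to the $\lambda = 0.5$ used in this analysis, which shows that the transform is a shifted and scaled square root.

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
\log y, & \lambda = 0,
\end{cases}
\qquad
y^{(0.5)} = \frac{\sqrt{y} - 1}{0.5} = 2\left(\sqrt{y} - 1\right).
```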
# Load the library
library(MASS)
# Find the optimal lambda for Box-Cox Transformation
boxcox(car_model, plotit = TRUE)
# Apply the Box-Cox transformation
lambda_optimal <- 0.5
cars$dist_boxcox <- (cars$dist^lambda_optimal - 1) / lambda_optimal
# Build model with Box Cox transformation
car_model_boxcox <- lm(dist_boxcox ~ speed, data=cars)
# Summary
summary(car_model_boxcox)
##
## Call:
## lm(formula = dist_boxcox ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1369 -1.3966 -0.3598 1.1817 6.3069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.55410 0.96888 0.572 0.57
## speed 0.64483 0.05957 10.825 1.77e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.205 on 48 degrees of freedom
## Multiple R-squared: 0.7094, Adjusted R-squared: 0.7034
## F-statistic: 117.2 on 1 and 48 DF, p-value: 1.773e-14
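The value 0.5 assigned to lambda_optimal above appears to have been read off the boxcox plot and rounded. A sketch of extracting the maximizing lambda programmatically is shown here; the finer lambda grid and the names bc, lambda_hat, and lambda_ci are assumptions made for illustration.

```r
library(MASS)

fit <- lm(dist ~ speed, data = cars)

# Profile log-likelihood over a fine grid of lambda values, without plotting
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Lambda that maximizes the profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum
lambda_ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2])
```

Rounding lambda_hat to a nearby convenient value is common practice when that value lies inside lambda_ci, since it keeps the transformation interpretable.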
# Scatter plot of transformed stopping distance vs. speed
plot(cars$speed, cars$dist_boxcox,
main = "Scatter Plot of Box-Cox Transformed Stopping Distance vs. Speed",
xlab = "Speed (mph)", ylab = "Transformed Stopping Distance")
# Adding a regression line to the scatter plot
abline(car_model_boxcox, col = "red")
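Because the transformation with lambda = 0.5 equals 2 * (sqrt(dist) - 1), the Box-Cox slope is exactly twice the slope of a regression on sqrt(dist). The sketch below checks this; fit_bc, fit_sqrt, slope_bc, and slope_sqrt are illustrative names, not objects from the original code.

```r
lambda <- 0.5

# Same Box-Cox model as above, with the transform written inline
fit_bc <- lm(I((dist^lambda - 1) / lambda) ~ speed, data = cars)

# Regression on the plain square root of stopping distance
fit_sqrt <- lm(sqrt(dist) ~ speed, data = cars)

slope_bc <- unname(coef(fit_bc)["speed"])
slope_sqrt <- unname(coef(fit_sqrt)["speed"])

# The two slopes differ by a factor of 1 / lambda = 2
ratio <- slope_bc / slope_sqrt
```

Read this way, each 1 mph increase in speed raises the square root of stopping distance by half of 0.64483, or roughly 0.32.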
Residuals vs Fitted
The residuals seem to be scattered around the horizontal line without any clear pattern, which suggests that the relationship is linear and the errors are homoscedastic.
Normal Q-Q Plot
The residuals largely follow the line, particularly in the middle of the distribution. Slight deviations at the ends suggest that the tails differ a little from those of a normal distribution, but overall the residuals appear approximately normal.
Scale-Location
The spread of residuals is relatively even across the range of fitted values, suggesting that the variance is homoscedastic.
Residuals vs Leverage
There are a few points with higher leverage but none exceed the Cook’s distance threshold significantly. Observation 49 appears to be the most influential, but it doesn’t appear to be problematic.
Overall, the plots suggest that the Box-Cox transformation has helped in linearizing the relationship and stabilizing the variance of the errors slightly. The transformation has normalized a bit of the skewness detected in the original linear regression model. The residuals don’t undermine the assumptions of linear regression.
Coefficients
Speed (0.12077): The coefficient implies that for each 1 mph increase in speed, the log of the stopping distance increases by approximately 0.12077 units.
Standard Error
Speed (0.01206): The small standard error implies that the coefficient for speed is estimated precisely.
T-value
Speed (10.015): The positive t-value indicates a strong and statistically significant relationship between speed and the log of the stopping distance.
P-value
Speed (2.41e-13): The very small p-value for the speed coefficient means we reject the null hypothesis of no effect; speed has a statistically significant effect on the log of the stopping distance.
Multiple R-squared (0.6763):
Approximately 67.63% of the variability in the log-transformed stopping distance can be explained by the model.
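A coefficient on the log scale is often easier to read after exponentiating, because an additive change in log(dist) is a multiplicative change in dist. The sketch below, which assumes the natural log used by log() and introduces the illustrative names fit_log, multiplier, and pct_change, converts the speed coefficient into a percentage change.

```r
# Same log-response model as in this section, with the transform written inline
fit_log <- lm(log(dist) ~ speed, data = cars)

# exp(slope) is the multiplicative change in dist per 1 mph increase in speed
multiplier <- unname(exp(coef(fit_log)["speed"]))

# Express the multiplier as a percentage change
pct_change <- 100 * (multiplier - 1)
```

With the reported slope of 0.12077, the multiplier is about 1.128, so each additional 1 mph is associated with roughly a 12.8% longer stopping distance under this model.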
# Apply the log transformation
cars$log_dist <- log(cars$dist)
# Fit the linear model on the log-transformed response
car_model_log <- lm(log_dist ~ speed, data=cars)
# Summary
summary(car_model_log)
##
## Call:
## lm(formula = log_dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.46604 -0.20800 -0.01683 0.24080 1.01519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.67612 0.19614 8.546 3.34e-11 ***
## speed 0.12077 0.01206 10.015 2.41e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4463 on 48 degrees of freedom
## Multiple R-squared: 0.6763, Adjusted R-squared: 0.6696
## F-statistic: 100.3 on 1 and 48 DF, p-value: 2.413e-13
# Plot
plot(cars$speed, cars$log_dist, main="Scatter plot of Log(Stopping Distance) vs Speed", xlab="Speed", ylab="Log(Stopping Distance)")
# Add the fitted regression line to the plot
abline(car_model_log, col="red")
Residuals vs Fitted
The points are randomly dispersed around the horizontal line. There does not appear to be any systematic pattern, suggesting that a linear relationship between the log-transformed stopping distance and speed is appropriate.
Normal Q-Q Plot
The residuals follow the reference line closely except for a couple of slight deviations at the end suggesting the residuals are approximately normally distributed.
Scale-Location
The residuals do not show any obvious pattern and are spread fairly evenly, indicating that the variance of the residuals is homoscedastic.
Residuals vs Leverage
Like the previous models, most data points have low leverage. Once again, point 49 appears to have higher leverage, but it is not beyond the Cook’s distance threshold, suggesting it is not disproportionately affecting the model.
The log transformation generates a decent linear regression model. The residuals appear to be evenly distributed with constant variance, and they are approximately normally distributed. There is one data point with relatively high leverage, but it doesn’t appear to be overly influential.
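One caveat when comparing the three models: their R-squared values are computed on different response scales (feet, Box-Cox units, log units), so they are not directly comparable. A possible way to compare them is to back-transform the fitted values and measure error in feet. The sketch below does this under assumptions: the model names, the rmse helper, and the simple back-transformations (which ignore retransformation bias) are illustrative and not part of the original analysis.

```r
lambda <- 0.5

# Refit the three models with the transforms written inline
fit_raw <- lm(dist ~ speed, data = cars)
fit_bc  <- lm(I((dist^lambda - 1) / lambda) ~ speed, data = cars)
fit_log <- lm(log(dist) ~ speed, data = cars)

# Back-transform fitted values to the original scale (feet)
pred_raw <- fitted(fit_raw)
pred_bc  <- (lambda * fitted(fit_bc) + 1)^(1 / lambda)
pred_log <- exp(fitted(fit_log))

# Root mean squared error on the original scale
rmse <- function(pred) sqrt(mean((cars$dist - pred)^2))

rmse_by_model <- c(raw = rmse(pred_raw),
                   boxcox = rmse(pred_bc),
                   log = rmse(pred_log))
```

Because the untransformed model minimizes squared error in feet by construction, this comparison favors it on in-sample RMSE; the diagnostic plots remain the better guide to whether the model assumptions hold.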