# Load required libraries
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
# Load the mtcars dataset
data(mtcars)
# Open the help page (codebook) for mtcars
?mtcars
# Display the first few rows and summary statistics
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
summary(mtcars)
##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
# Create linear regression model
model <- lm(mpg ~ wt + hp, data = mtcars)
# View model summary
summary(model)
##
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.941 -1.600 -0.182  1.050  5.854
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
## F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
Answer: The model regresses mpg on wt and hp only. Both predictors have statistically significant negative relationships with mpg (p = 1.12e-06 for wt, p = 0.00145 for hp). The coefficient for wt is −3.88: because wt is recorded in units of 1000 lbs, a 1000-pound increase in weight is associated with a decrease of about 3.88 mpg, holding horsepower constant. The coefficient for hp is −0.032: each additional unit of horsepower is associated with a decrease of about 0.03 mpg, holding weight constant.
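The "holding the other predictor constant" reading can be checked directly. The sketch below uses two hypothetical cars (the `new_cars` data frame is made up for illustration) that differ by one unit of wt and share the same hp; the gap between their predicted mpg values should equal the wt coefficient.

```r
# Refit the additive model so the block is self-contained
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars: same horsepower, weights 1 unit (1000 lbs) apart
new_cars <- data.frame(wt = c(3, 4), hp = c(110, 110))
preds <- predict(model, newdata = new_cars)

# Difference in predicted mpg between the heavier and the lighter car
pred_gap <- unname(diff(preds))
pred_gap

# The same number, read straight from the fitted coefficients
coef(model)["wt"]
```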
# Fit model
model <- lm(mpg ~ wt + hp, data = mtcars)
# Diagnostic plots in a 2 x 2 grid; restore the previous layout afterwards
op <- par(mfrow = c(2, 2))
plot(model)
par(op)
Answer: I used diagnostic plots by running plot(model). From the Residuals vs Fitted plot, the points look fairly randomly scattered, so the linear relationship seems reasonable. The Normal Q-Q plot mostly follows a straight line, so the residuals appear approximately normal. The Scale-Location plot does not show a strong funnel shape, suggesting the variance is fairly constant. The Residuals vs Leverage plot does not show any extremely influential points. Overall, the assumptions appear to be reasonably met for this dataset.
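Visual checks can be complemented with a few numbers. The sketch below is one possible supplement, not part of the original analysis: it runs a Shapiro-Wilk test on the residuals and lists observations whose Cook's distance exceeds 4/n. The 4/n cutoff is only a common rule of thumb and is an assumption here; a flagged point deserves a closer look, not automatic removal.

```r
# Refit the additive model so the block is self-contained
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(model))
sw

# Cook's distance measures how much each observation influences the fit
cooks <- cooks.distance(model)
n <- nrow(mtcars)

# Observations above the rule-of-thumb cutoff 4/n (assumed threshold)
flagged <- cooks[cooks > 4 / n]
sort(flagged, decreasing = TRUE)
```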
# Get predicted values
predicted <- predict(model)
# Calculate MSE
mse <- mean((mtcars$mpg - predicted)^2)
mse
## [1] 6.095242
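Answer: The Mean Squared Error (MSE) of the regression model is 6.095, in units of mpg squared. Its square root (RMSE) is about 2.47, so the model's predictions are typically off by roughly 2.5 mpg. Because the predictions are made on the same data used to fit the model, this is an in-sample error and is likely to understate the error on new cars.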
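The MSE is closely related to the residual standard error reported by summary(model): both start from the residual sum of squares, but the MSE divides by n while the residual standard error divides by the residual degrees of freedom (29 here). A short sketch of that arithmetic:

```r
# Refit the additive model so the block is self-contained
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(model)^2)   # residual sum of squares
n   <- nrow(mtcars)

mse  <- rss / n                          # in-sample mean squared error
rmse <- sqrt(mse)                        # same quantity on the mpg scale
rse  <- sqrt(rss / df.residual(model))   # residual standard error

c(mse = mse, rmse = rmse, rse = rse)
```

The residual standard error is slightly larger than the RMSE because it corrects for the three estimated coefficients.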
summary(model)$r.squared
## [1] 0.8267855
Answer: The R² of the regression model is 0.8268, so approximately 82.7% of the variability in miles per gallon (mpg) in this sample is explained by weight and horsepower. This indicates a strong in-sample fit.
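To make the "share of variability explained" reading concrete, R² can be rebuilt from its definition, 1 − RSS/TSS, and adjusted R² from the usual degrees-of-freedom correction. The variable names below are illustrative.

```r
# Refit the additive model so the block is self-contained
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(model)^2)                  # unexplained variation
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation in mpg

n <- nrow(mtcars)   # number of observations
p <- 2              # number of predictors (wt and hp)

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2)
```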
# Model with interaction term
model_int <- lm(mpg ~ wt * hp, data = mtcars)
summary(model_int)
##
## Call:
## lm(formula = mpg ~ wt * hp, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.0632 -1.6491 -0.7362  1.4211  4.5513
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
## wt          -8.21662    1.26971  -6.471 5.20e-07 ***
## hp          -0.12010    0.02470  -4.863 4.04e-05 ***
## wt:hp        0.02785    0.00742   3.753 0.000811 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.153 on 28 degrees of freedom
## Multiple R-squared: 0.8848, Adjusted R-squared: 0.8724
## F-statistic: 71.66 on 3 and 28 DF, p-value: 2.981e-13
summary(model_int)$r.squared
## [1] 0.8847637
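Answer: I added an interaction between wt and hp using wt * hp, which expands to wt + hp + wt:hp. The interaction term is statistically significant (p = 0.000811 < 0.05), meaning the effect of weight on mpg depends on horsepower. Its coefficient is positive (0.028), so the negative effect of weight on mpg becomes weaker as horsepower increases. R² increased from 0.8268 to 0.8848 and adjusted R² from 0.8148 to 0.8724, indicating that the interaction improves the model's ability to explain variation in mpg even after accounting for the extra parameter.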
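With an interaction, weight no longer has a single slope: the effect of wt at a given horsepower is the wt coefficient plus the wt:hp coefficient times hp. The sketch below evaluates that slope at the hp quartiles (the choice of quartiles is arbitrary) and compares the two nested models with an F-test, which for a single added term is equivalent to the t-test on wt:hp.

```r
# Fit both models so the block is self-contained
data(mtcars)
model     <- lm(mpg ~ wt + hp, data = mtcars)
model_int <- lm(mpg ~ wt * hp, data = mtcars)

# Slope of wt at selected horsepower values: b_wt + b_wt:hp * hp
b <- coef(model_int)
hp_grid  <- quantile(mtcars$hp, c(0.25, 0.50, 0.75))
wt_slope <- b["wt"] + b["wt:hp"] * hp_grid
wt_slope

# Nested-model F-test: does adding wt:hp reduce the residual variance?
nested_test <- anova(model, model_int)
nested_test
```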
# Find 5th and 95th percentiles
lower_hp <- quantile(mtcars$hp, 0.05)
upper_hp <- quantile(mtcars$hp, 0.95)
# Winsorize hp
mtcars$hp_win <- ifelse(mtcars$hp < lower_hp, lower_hp,
                        ifelse(mtcars$hp > upper_hp, upper_hp,
                               mtcars$hp))
model_win <- lm(mpg ~ wt + hp_win, data = mtcars)
summary(model_win)
##
## Call:
## lm(formula = mpg ~ wt + hp_win, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.8825 -1.6545 -0.0968  0.8367  5.7259
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.31722    1.56964  23.774  < 2e-16 ***
## wt          -3.58279    0.66427  -5.394  8.5e-06 ***
## hp_win      -0.03952    0.01059  -3.732 0.000824 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.546 on 29 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.8215
## F-statistic: 72.34 on 2 and 29 DF, p-value: 5.348e-12
summary(model_win)$r.squared
## [1] 0.8330309
Answer: After winsorizing horsepower at the 5th and 95th percentiles, I refitted the model using wt and hp_win. R² rose slightly, from 0.8268 to 0.8330. The wt coefficient moved from −3.88 to −3.58 and the horsepower coefficient from −0.032 to −0.040; both remain negative and statistically significant, and wt still has the larger t-statistic in absolute value (−5.39 versus −3.73). Overall, winsorizing hp made only a minor difference to the model.
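The same clipping can be written more compactly with pmin() and pmax(). The helper below, `winsorize()`, is a hypothetical name introduced for illustration and is not a base R function; it is a minimal sketch that assumes a numeric vector with no missing values.

```r
# Hypothetical helper: clip x to its lower and upper quantiles
winsorize <- function(x, probs = c(0.05, 0.95)) {
  limits <- quantile(x, probs = probs, names = FALSE)
  # pmax raises values below the lower limit, pmin lowers values above the upper
  pmin(pmax(x, limits[1]), limits[2])
}

data(mtcars)
hp_win <- winsorize(mtcars$hp)

range(mtcars$hp)            # original range
range(hp_win)               # range after clipping
sum(hp_win != mtcars$hp)    # number of values that were changed
```

Working on a separate vector rather than adding a column to mtcars keeps the original data frame unchanged for later steps.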
library(car)
## Loading required package: carData
vif(model)
##       wt       hp
## 1.766625 1.766625
Answer: The VIF values for wt and hp are both about 1.77, well below the commonly used threshold of 5. (With only two predictors the two VIFs are necessarily equal.) This indicates that there is no serious multicollinearity problem in the model.
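The VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on the other predictors. With only wt and hp in the model this can be reproduced in base R, without the car package, as a quick check:

```r
data(mtcars)

# R-squared from regressing one predictor on the other
r2_wt_on_hp <- summary(lm(wt ~ hp, data = mtcars))$r.squared

# Variance inflation factor for wt (identical for hp in a two-predictor model)
vif_manual <- 1 / (1 - r2_wt_on_hp)
vif_manual
```

With two predictors, that R² is just the squared correlation between wt and hp, which is why both VIFs coincide.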
Answer: A higher R² does not necessarily mean better predictive performance. R² measures how well the model fits the data it was estimated on, and it never decreases when a predictor is added, even if that predictor does not improve predictions on new data. Predictive accuracy should therefore be evaluated on data not used for fitting, for example with MSE on a held-out test set or with cross-validation.
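As one way to act on this, the sketch below estimates out-of-sample MSE with leave-one-out cross-validation for the additive and interaction models. The function `loocv_mse()` is a hypothetical helper written for this example; it assumes the response is the first variable in the formula.

```r
# Hypothetical helper: leave-one-out cross-validated MSE for an lm formula
loocv_mse <- function(formula, data) {
  n <- nrow(data)
  response <- all.vars(formula)[1]   # assumes the response comes first
  sq_err <- numeric(n)
  for (i in seq_len(n)) {
    fit  <- lm(formula, data = data[-i, ])                   # fit without row i
    pred <- predict(fit, newdata = data[i, , drop = FALSE])  # predict row i
    sq_err[i] <- (data[[response]][i] - pred)^2
  }
  mean(sq_err)
}

data(mtcars)
cv_add <- loocv_mse(mpg ~ wt + hp, mtcars)
cv_int <- loocv_mse(mpg ~ wt * hp, mtcars)

c(additive = cv_add, interaction = cv_int)
```

Each cross-validated MSE can be compared with the corresponding in-sample MSE; the cross-validated value is the more honest guide to how the model would perform on cars it has not seen.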