1 Exploratory visualization and expected relationships

library(corrplot)
library(ISLR2)
library(ggplot2)
library(tidyr)
library(dplyr)
library(car)
library(MASS)
library(caret)

First, I will analyze Auto Dataset from R. First, we will take a look at what this dataset looks like

data("Auto", package = "ISLR")
summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

1.1 Data Dictionary

  • mpg: Miles per gallon
  • cylinders: Number of cylinders (range: 4–8)
  • displacement: Engine displacement (cubic inches)
  • horsepower: Engine horsepower
  • weight: Vehicle weight (lbs)
  • acceleration: Time to accelerate from 0 to 60 mph (sec)
  • year: Model year (modulo 100)
  • origin: Origin of car (1: American, 2: European, 3: Japanese)
  • name: Vehicle name
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

Origin and cylinders need to be transform into factor variable rather than numeric. These represent distinct categories, not continuous scales, and should be handled as categorical data.

name variable has high cardinality therefore we have to handle carefully in model fitting.

Auto$origin <- as.factor(Auto$origin)
Auto$cylinders <- as.factor(Auto$cylinders)

Now, lets look at correlation plot to detect potential relationships within the dataset.

matrix <- cor(Auto[sapply(Auto, is.numeric)])
corrplot(matrix, method = "color",       
         order = "hclust",       
         addCoef.col = "white", 
         number.cex = 0.8,       
         tl.col = "black",      
         tl.srt = 45)  

  • displacement and weight show a very strong positive relationship with a correlation of 0.93. This indicates that as engine displacement increases, the weight of the vehicle also increases significantly.

  • horsepower and displacement also show a strong positive relationship (Cor = 0.90), and horsepower and weight are similarly highly correlated (Cor = 0.86). This suggests that larger, heavier engines tend to produce more horsepower.

  • acceleration and MPG show a moderate positive relationship (Cor = 0.42). This suggests that cars with higher mpg tend to have slightly better acceleration performance.

  • year and mpg show a moderate positive relationship (Cor = 0.58), indicating that newer cars tend to have better fuel efficiency.

  • year and Engine Size Variables (Horsepower, Displacement, Weight) show moderate negative relationships (Cor = -0.31 to -0.42). This implies that newer cars tend to have smaller, lighter engines compared to older models.

  • acceleration and Engine Size Variables show moderate negative relationships (Cor = -0.42 to -0.69), suggesting that heavier and more powerful cars tend to have slower acceleration performance.

ggplot(Auto, aes(x = origin, y = mpg, fill = origin)) +
  geom_boxplot() +
  scale_x_discrete(labels = c("1" = "American", "2" = "European", "3" = "Japanese")) +
  scale_fill_discrete(labels = c("1" = "American", "2" = "European", "3" = "Japanese")) +
  theme_minimal() +
  labs(title = "Fuel Efficiency by Region of Origin",
       x = "Country of Origin",
       y = "Miles Per Gallon (mpg)",
       fill = "Region")

This boxplot shows that Japanese cars generally have the highest fuel efficiency, with most getting over 30 miles per gallon (mpg). European cars fall in the middle, typically averaging around 26 mpg, while American cars tend to have the lowest fuel efficiency, with most falling below 25 mpg. While all three regions have a few “outlier” vehicles that get exceptionally good gas mileage for their group, the overall trend clearly shows that cars from Japan and Europe are more fuel-efficient than those from America in this dataset.

ggplot(Auto, aes(x = cylinders, y = mpg, fill = cylinders)) +
  geom_boxplot() +
  theme_minimal()+
  labs(title = "Fuel Efficiency by Cylinders")

This boxplot shows a clear downward trend in fuel efficiency as the number of cylinders increases, with 4-cylinder engines being the overall leaders in high gas mileage. At the other end of the scale, 8-cylinder vehicles are the least efficient, consistently getting the lowest miles per gallon. There are also several outliers in the 4, 6, and 8-cylinder categories, representing specific cars that get significantly better mileage than most others in their group.

2 Fit the full multiple linear regression model

full_model<- lm(mpg ~ .-name, data=Auto)
summary(full_model)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6797 -1.9373 -0.0678  1.6711 12.7756 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.208e+01  4.541e+00  -4.862 1.70e-06 ***
## cylinders4    6.722e+00  1.654e+00   4.064 5.85e-05 ***
## cylinders5    7.078e+00  2.516e+00   2.813  0.00516 ** 
## cylinders6    3.351e+00  1.824e+00   1.837  0.06701 .  
## cylinders8    5.099e+00  2.109e+00   2.418  0.01607 *  
## displacement  1.870e-02  7.222e-03   2.590  0.00997 ** 
## horsepower   -3.490e-02  1.323e-02  -2.639  0.00866 ** 
## weight       -5.780e-03  6.315e-04  -9.154  < 2e-16 ***
## acceleration  2.598e-02  9.304e-02   0.279  0.78021    
## year          7.370e-01  4.892e-02  15.064  < 2e-16 ***
## origin2       1.764e+00  5.513e-01   3.200  0.00149 ** 
## origin3       2.617e+00  5.272e-01   4.964 1.04e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.098 on 380 degrees of freedom
## Multiple R-squared:  0.8469, Adjusted R-squared:  0.8425 
## F-statistic: 191.1 on 11 and 380 DF,  p-value: < 2.2e-16

-Multiple R-squared = 0.8469 means that the model explains about 84.7% of the variation in mpg, indicating a strong overall fit.

-Adjusted R-squared = 0.8425 shows that after accounting for the number of predictors, the model still explains about 84.3% of the variation, suggesting that most variables included are meaningful.

-The coefficient for weight is -0.00578, meaning that holding all other variables constant, an increase of 1 unit in weight is associated with a decrease in mpg of about 0.0058. The p-value < 2e-16 is highly significant, indicating weight is one of the most important predictors of fuel efficiency.

-The coefficient for horsepower is -0.0349, meaning that holding other variables fixed, an increase of 1 unit in horsepower is associated with a decrease in mpg of about 0.035. The p-value = 0.00866 is statistically significant, showing that more powerful cars tend to be less fuel efficient.

-The coefficient for displacement is 0.0187, meaning that holding other variables fixed, a 1 unit increase in displacement is associated with a slight increase in mpg. The p-value = 0.00997 is statistically significant, although this positive sign may be influenced by multicollinearity with other engine-related variables.

-For cylinders (relative to the 3 cylinders), the coefficients indicate differences in mpg compared to the reference group. For example, holding other variables fixed, cars with 4 cylinders have an increase of about 6.72 mpg, 5 cylinders have an increase of about 7.08 mpg, 6 cylinders have an increase of about 3.35 mpg, and 8 cylinders have an increase of about 5.10 mpg compared to the baseline category.

-The coefficient for year is 0.737, meaning that holding other variables constant, each additional year is associated with an increase of about 0.74 mpg.p-value < 2e-16 is highly significant, indicating that newer cars tend to be more fuel efficient.

-The coefficient for acceleration is 0.026, meaning that holding other variables fixed, acceleration has a very small positive effect on mpg. However, the p-value = 0.78 is not statistically significant, suggesting it does not have a meaningful independent effect in this model.

-In terms of origin (relative to the American cars, origin 1), cars from origin 2 have about 1.76 higher mpg, and cars from origin 3 have about 2.62 higher mpg, both statistically significant. This indicates that cars from these regions tend to be more fuel efficient compared to the baseline group.

3 Regression importance testing (individual and/or group tests)

-Year with a t-value of 15.06 and a p-value < 2e-16, this is one of the most statistically significant predictors in the model. It has a strong positive impact on mpg, indicating that newer cars are substantially more fuel efficient.

-Weight with a t-value of -9.15 and a p-value < 2e-16, weight is a highly significant predictor. It has a strong negative effect on mpg, meaning heavier vehicles significantly reduce fuel efficiency.

-horsepower and displacement both variables are statistically significant (p-values < 0.01) with t-values of -2.64 and 2.59, respectively. Horsepower shows a negative relationship with mpg, while displacement shows a small positive effect, though this may be influenced by correlation with other engine-related variables.

-cylinders several cylinder categories are statistically significant. In particular, 4-cylinder and 5-cylinder engines show strong significance (p < 0.01), indicating meaningful differences in mpg compared to the baseline category. The 6-cylinder category is only marginally significant (p ≈ 0.067), suggesting weaker evidence of an independent effect.

-origin both origin 2 and origin 3 are statistically significant (p < 0.01), with t-values of 3.20 and 4.96, respectively. This indicates that the region of origin contributes meaningfully to predicting fuel efficiency.

-acceleration this variable has a p-value of 0.78, which is far above the 0.05 threshold. Its t-value of 0.28 indicates that it does not provide a meaningful independent contribution to predicting mpg when other variables are included in the model.

4 Nested model testing (full vs. reduced model)

4.1 Fiting Reduced Model

reduced_model<-lm(mpg ~ cylinders + displacement + horsepower + weight + year + origin, data=Auto)
summary(reduced_model)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7037 -1.9501 -0.0552  1.7105 12.7932 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.162e+01  4.231e+00  -5.111 5.09e-07 ***
## cylinders4    6.784e+00  1.637e+00   4.144 4.20e-05 ***
## cylinders5    7.147e+00  2.501e+00   2.857 0.004510 ** 
## cylinders6    3.403e+00  1.813e+00   1.877 0.061262 .  
## cylinders8    5.137e+00  2.102e+00   2.444 0.014983 *  
## displacement  1.848e-02  7.169e-03   2.578 0.010312 *  
## horsepower   -3.706e-02  1.071e-02  -3.459 0.000604 ***
## weight       -5.696e-03  5.535e-04 -10.291  < 2e-16 ***
## year          7.358e-01  4.868e-02  15.114  < 2e-16 ***
## origin2       1.763e+00  5.506e-01   3.203 0.001476 ** 
## origin3       2.621e+00  5.264e-01   4.979 9.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.094 on 381 degrees of freedom
## Multiple R-squared:  0.8469, Adjusted R-squared:  0.8429 
## F-statistic: 210.7 on 10 and 381 DF,  p-value: < 2.2e-16

Our reduced model only consist of the variable that has accaptable p-values and seem to perform just as better as the full model. This provides parsimony for our approach. Now we will conduct ANOVA test for further examination of removal variable.

anova(full_model,reduced_model)
## Analysis of Variance Table
## 
## Model 1: mpg ~ (cylinders + displacement + horsepower + weight + acceleration + 
##     year + origin + name) - name
## Model 2: mpg ~ cylinders + displacement + horsepower + weight + year + 
##     origin
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)
## 1    380 3646.3                          
## 2    381 3647.1 -1  -0.74823 0.078 0.7802

The ANOVA results show that removing acceleration does not significantly affect the model (F = 0.078, p = 0.7802). Since the p-value is much greater than 0.05, acceleration does not add meaningful predictive value, so the simpler model without it is preferred.

5 Confidence intervals for regression coefficients

Confidence Interval: This interval estimates the average mpg for the entire population of cars that share these specific characteristics. It is narrower because individual fluctuations tend to average out, leaving only the uncertainty of the population mean.

confint(reduced_model, level = 0.95)
##                      2.5 %        97.5 %
## (Intercept)  -29.942486300 -13.304115401
## cylinders4     3.565222643  10.002208310
## cylinders5     2.228465271  12.065068359
## cylinders6    -0.161457498   6.967293830
## cylinders8     1.004047279   9.269721766
## displacement   0.004385732   0.032575194
## horsepower    -0.058127299  -0.015992682
## weight        -0.006784138  -0.004607599
## year           0.640068333   0.831514038
## origin2        0.680810270   2.846151956
## origin3        1.585909183   3.656045777

In the reduced model, most of the 95% confidence intervals do not include zero, indicating that these predictors have a statistically significant relationship with mpg. Variables such as weight, horsepower, displacement, year, and origin all show intervals that stay entirely above or below zero, confirming a reliable effect. However, the interval for cylinders6 includes zero, suggesting that this category may not have a statistically significant independent effect. Overall, we can be at least 95% confident that the significant predictors have a real, non-zero impact on fuel efficiency, while cylinders6 shows weaker evidence.

6 Prediction interval for a new observation

Prediction Interval: This interval is designed to capture the price of a single new car. It is always wider because it must account for both the uncertainty in the model’s parameters and the natural “noise” or residual error found in individual data points.

new_car <- data.frame(cylinders="6", displacement=380, horsepower=180, 
                      year=75, weight=2000, origin="2")

prediction <- predict(reduced_model, newdata = new_car, interval = 'prediction')

print(prediction)
##        fit      lwr     upr
## 1 27.68748 20.64205 34.7329

7 Multicollinearity investigation

vif(reduced_model)
##                   GVIF Df GVIF^(1/(2*Df))
## cylinders    15.142091  4        1.404505
## displacement 22.984822  1        4.794249
## horsepower    6.947667  1        2.635843
## weight        9.028147  1        3.004687
## year          1.313734  1        1.146182
## origin        2.342253  2        1.237110

Our GVIF results show that displacement, weight, and horsepower are the main sources of multicollinearity in the model. In particular, displacement has the highest adjusted GVIF (~4.79), followed by weight (~3.00) and horsepower (~2.64), indicating that these variables are strongly related and may be capturing similar information about engine size and performance. This suggests that these predictors are overlapping in what they explain about mpg, while variables like year and origin show very low multicollinearity and are not a concern.

8 Residual-based model diagnostics + remedy + final model interpretation

par(mfrow = c(2,2))
plot(reduced_model)

-Residuals vs Fitted: Ideally, the red line should be flat around zero. In this plot, we see a slight curved pattern, suggesting some non-linearity in the relationship between predictors and mpg. This indicates that the model may be missing nonlinear terms or interactions.

-Q-Q Residuals: While the points mostly follow the straight line in the middle, there are noticeable deviations in the tails. This suggests that the residuals are not perfectly normally distributed and that the model produces some outliers.

-Scale-Location: The red line shows a slight upward trend and the spread of points increases as fitted values grow. This indicates heteroscedasticity, meaning the variance of the errors is not constant. The model appears to be more consistent at lower mpg values and less stable at higher values.

-Residuals vs Leverage: There are a few observations with higher leverage and larger residuals, but none appear to exceed the Cook’s Distance thresholds. This suggests that while some points may be influential, there are no extreme observations dominating the model.

8.1 Remedy

Since there are outliers towards to higher mpg values we are going to investiage using BoxCox method to see likely hood of applying log transformation.

boxcox(reduced_model, lambda = seq(-2, 2, 0.1))

Box Cox results showing us that peak part of the curve(lambda) is really close to “0”, therefore log transformation would be ideal for our next model transformation.

log_model<-lm(log(mpg) ~ cylinders + displacement + horsepower + weight + year + origin, data=Auto)
summary(log_model)
## 
## Call:
## lm(formula = log(mpg) ~ cylinders + displacement + horsepower + 
##     weight + year + origin, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40887 -0.06571  0.00096  0.06361  0.38856 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.429e+00  1.537e-01   9.295  < 2e-16 ***
## cylinders4    2.486e-01  5.947e-02   4.181 3.60e-05 ***
## cylinders5    2.789e-01  9.087e-02   3.069  0.00230 ** 
## cylinders6    1.255e-01  6.586e-02   1.906  0.05736 .  
## cylinders8    1.401e-01  7.636e-02   1.834  0.06740 .  
## displacement  6.375e-04  2.604e-04   2.448  0.01481 *  
## horsepower   -1.636e-03  3.893e-04  -4.204 3.27e-05 ***
## weight       -2.453e-04  2.011e-05 -12.202  < 2e-16 ***
## year          2.935e-02  1.769e-03  16.594  < 2e-16 ***
## origin2       5.758e-02  2.000e-02   2.879  0.00422 ** 
## origin3       7.724e-02  1.912e-02   4.039 6.49e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1124 on 381 degrees of freedom
## Multiple R-squared:  0.8935, Adjusted R-squared:  0.8907 
## F-statistic: 319.8 on 10 and 381 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(log_model)

8.2 Final Model Interpretation

I started with a full model and then removed non-significant variables using nested model comparisons. This helped me simplify the model while keeping only the important predictors. I learned that not all variables contribute meaningfully, and removing unnecessary ones can make the model easier to interpret without hurting performance.

From the model diagnostics, I noticed issues such as non-constant variance and heavy tails in the residuals, which suggested the presence of outliers and violations of normality. The residual plots showed clear patterns, indicating that the original model was not fully appropriate. After applying a Box-Cox transformation, I used a log transformation of the response, which improved the model significantly. The residuals became more stable and closer to normal, showing that transforming the response can help fix model assumption issues and lead to a better final model.

9 Problem 1: Modeling with Multiple Categorical Predictors

I previously examine how categorical variables affect our response variable, now I m going to see more comprehensive way to look at these categorical variables in one plot to see their interactions within are necessary.

library(ggplot2)

ggplot(Auto, aes(x = cylinders, y = mpg, fill = origin)) +
  geom_boxplot(position = position_dodge(width = 0.8)) +
  labs(title = "MPG by Cylinders and Origin",
       x = "Cylinders", y = "MPG") +
  scale_fill_brewer(palette = "Pastel1", name = "Origin")

ggplot(Auto, aes(x = cylinders, y = mpg, fill = origin)) +
  geom_violin(position = position_dodge(width = 0.8)) +
  labs(title = "Distribution of MPG by Cylinders and Origin",
       x = "Cylinders", y = "MPG") +
  scale_fill_brewer(palette = "Pastel1", name = "Origin")

Based on the plots above, an interaction term between Cylinders and Origin appears necessary because the relationship between the number of cylinders and mpg is not consistent across different origins. For example, in the 4-cylinder category, vehicles from Origin 3 tend to have the highest mpg, while in the 6-cylinder category, Origin 3 remains higher but the gap between origins narrows and shifts compared to the 4-cylinder distribution. Since the effect of origin on mpg changes depending on the cylinder count and vice-versa, therefore ineraction in a model could be discovered.

interaction_log_model <- lm(log(mpg) ~ cylinders * origin + displacement + horsepower + weight + year, data=Auto)
summary(interaction_log_model)
## 
## Call:
## lm(formula = log(mpg) ~ cylinders * origin + displacement + horsepower + 
##     weight + year, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40828 -0.06565  0.00201  0.06404  0.38849 
## 
## Coefficients: (6 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.394e+00  1.580e-01   8.821  < 2e-16 ***
## cylinders4          2.838e-01  7.587e-02   3.740 0.000212 ***
## cylinders5          3.426e-01  1.148e-01   2.984 0.003025 ** 
## cylinders6          1.570e-01  7.503e-02   2.092 0.037090 *  
## cylinders8          1.718e-01  8.466e-02   2.029 0.043163 *  
## origin2             2.699e-02  6.259e-02   0.431 0.666574    
## origin3             1.132e-01  5.326e-02   2.126 0.034170 *  
## displacement        6.552e-04  2.816e-04   2.327 0.020486 *  
## horsepower         -1.680e-03  4.158e-04  -4.041 6.43e-05 ***
## weight             -2.442e-04  2.029e-05 -12.035  < 2e-16 ***
## year                2.933e-02  1.802e-03  16.282  < 2e-16 ***
## cylinders4:origin2  3.214e-02  6.495e-02   0.495 0.621043    
## cylinders5:origin2         NA         NA      NA       NA    
## cylinders6:origin2         NA         NA      NA       NA    
## cylinders8:origin2         NA         NA      NA       NA    
## cylinders4:origin3 -3.923e-02  5.525e-02  -0.710 0.478090    
## cylinders5:origin3         NA         NA      NA       NA    
## cylinders6:origin3         NA         NA      NA       NA    
## cylinders8:origin3         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1126 on 379 degrees of freedom
## Multiple R-squared:  0.8938, Adjusted R-squared:  0.8904 
## F-statistic: 265.8 on 12 and 379 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(interaction_log_model)

Our interaction log model just perform as good as log model with a slightly lower Adjuster R-Squared value = 0.8904.

9.1 Model comparison

models <- list(
  Full = full_model,
  Reduced1 = reduced_model,
  Log = log_model,
  Interaction_Log_model = interaction_log_model)

model_summary <- data.frame(
Model = names(models),
R2 = sapply(models, function(m) summary(m)$r.squared),
Adj_R2 = sapply(models, function(m) summary(m)$adj.r.squared),
Residual_SE = sapply(models, function(m) summary(m)$sigma)
)
model_summary
##                                       Model        R2    Adj_R2 Residual_SE
## Full                                   Full 0.8469157 0.8424843   3.0976703
## Reduced1                           Reduced1 0.8468843 0.8428655   3.0939198
## Log                                     Log 0.8935375 0.8907432   0.1123965
## Interaction_Log_model Interaction_Log_model 0.8938019 0.8904395   0.1125527

The Log model is the superior choice for the final model because it achieves the best balance between performance and parsimony. Although the interaction model shows a negligible increase in raw R-Squared, the higher Adjusted R-squared and lower Residual Standard Error of the Log model indicate that the interaction terms likely introduce unnecessary complexity without improving the model’s predictive accuracy. By choosing the Log model, you maintain a more interpretable framework that generalizes better to new data while capturing of the variance.

10 Problem 2: Model Selection and Competing Criteria

Previously, we looked at how our variables interact with the response variable. After making some manual adjustments to boost predictive power, I’m going to use stepwise selection to double check my work and see if a different combination of variables yields a better result.

backward_model <- step(full_model, direction = "backward")
## Start:  AIC=898.24
## mpg ~ (cylinders + displacement + horsepower + weight + acceleration + 
##     year + origin + name) - name
## 
##                Df Sum of Sq    RSS     AIC
## - acceleration  1      0.75 3647.1  896.32
## <none>                      3646.3  898.24
## - displacement  1     64.37 3710.7  903.10
## - horsepower    1     66.82 3713.1  903.36
## - origin        2    243.98 3890.3  919.63
## - cylinders     4    566.49 4212.8  946.85
## - weight        1    804.05 4450.4  974.36
## - year          1   2177.35 5823.7 1079.78
## 
## Step:  AIC=896.32
## mpg ~ cylinders + displacement + horsepower + weight + year + 
##     origin
## 
##                Df Sum of Sq    RSS     AIC
## <none>                      3647.1  896.32
## - displacement  1     63.62 3710.7  901.10
## - horsepower    1    114.52 3761.6  906.44
## - origin        2    244.69 3891.8  917.78
## - cylinders     4    574.27 4221.3  945.64
## - weight        1   1013.74 4660.8  990.47
## - year          1   2186.53 5833.6 1078.45
summary(backward_model)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7037 -1.9501 -0.0552  1.7105 12.7932 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.162e+01  4.231e+00  -5.111 5.09e-07 ***
## cylinders4    6.784e+00  1.637e+00   4.144 4.20e-05 ***
## cylinders5    7.147e+00  2.501e+00   2.857 0.004510 ** 
## cylinders6    3.403e+00  1.813e+00   1.877 0.061262 .  
## cylinders8    5.137e+00  2.102e+00   2.444 0.014983 *  
## displacement  1.848e-02  7.169e-03   2.578 0.010312 *  
## horsepower   -3.706e-02  1.071e-02  -3.459 0.000604 ***
## weight       -5.696e-03  5.535e-04 -10.291  < 2e-16 ***
## year          7.358e-01  4.868e-02  15.114  < 2e-16 ***
## origin2       1.763e+00  5.506e-01   3.203 0.001476 ** 
## origin3       2.621e+00  5.264e-01   4.979 9.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.094 on 381 degrees of freedom
## Multiple R-squared:  0.8469, Adjusted R-squared:  0.8429 
## F-statistic: 210.7 on 10 and 381 DF,  p-value: < 2.2e-16

Using backward selection method, formula only drop non-significant p-value from the full model which was accelaretion and name which was high cardinality, and kept the other ones. This is could be a good method for detecthing significant predictors in large dataset, however it lack detecting its relationships in context. now let’s try to forward selection method. First we have to create a Null model.

null_model <- lm(mpg ~ 1, data = Auto)
forward_model <- step(null_model, scope = list(lower = null_model, upper = full_model), 
                      direction = "forward")
## Start:  AIC=1611.93
## mpg ~ 1
## 
##                Df Sum of Sq     RSS    AIC
## + weight        1   16497.8  7321.2 1151.5
## + displacement  1   15440.2  8378.8 1204.4
## + cylinders     4   15274.5  8544.5 1218.1
## + horsepower    1   14433.1  9385.9 1248.9
## + year          1    8027.7 15791.3 1452.8
## + origin        2    7904.3 15914.7 1457.9
## + acceleration  1    4268.5 19550.5 1536.5
## <none>                      23819.0 1611.9
## 
## Step:  AIC=1151.49
## mpg ~ weight
## 
##                Df Sum of Sq    RSS     AIC
## + year          1   2752.28 4569.0  968.66
## + cylinders     4    768.73 6552.5 1116.01
## + horsepower    1    327.39 6993.8 1135.56
## + origin        2    224.14 7097.1 1143.30
## + acceleration  1    168.34 7152.9 1144.37
## + displacement  1    150.93 7170.3 1145.33
## <none>                      7321.2 1151.49
## 
## Step:  AIC=968.66
## mpg ~ weight + year
## 
##                Df Sum of Sq    RSS    AIC
## + cylinders     4    620.67 3948.3 919.43
## + origin        2    258.54 4310.4 949.83
## <none>                      4569.0 968.66
## + acceleration  1     10.45 4558.5 969.77
## + horsepower    1      3.30 4565.7 970.38
## + displacement  1      0.04 4568.9 970.66
## 
## Step:  AIC=919.43
## mpg ~ weight + year + cylinders
## 
##                Df Sum of Sq    RSS    AIC
## + origin        2   172.097 3776.2 905.96
## + horsepower    1    47.785 3900.5 916.66
## + acceleration  1    25.886 3922.4 918.85
## <none>                      3948.3 919.43
## + displacement  1     0.002 3948.3 921.43
## 
## Step:  AIC=905.96
## mpg ~ weight + year + cylinders + origin
## 
##                Df Sum of Sq    RSS    AIC
## + horsepower    1    65.507 3710.7 901.10
## + acceleration  1    26.962 3749.2 905.15
## <none>                      3776.2 905.96
## + displacement  1    14.610 3761.6 906.44
## 
## Step:  AIC=901.1
## mpg ~ weight + year + cylinders + origin + horsepower
## 
##                Df Sum of Sq    RSS    AIC
## + displacement  1    63.619 3647.1 896.32
## <none>                      3710.7 901.10
## + acceleration  1     0.001 3710.7 903.10
## 
## Step:  AIC=896.32
## mpg ~ weight + year + cylinders + origin + horsepower + displacement
## 
##                Df Sum of Sq    RSS    AIC
## <none>                      3647.1 896.32
## + acceleration  1   0.74823 3646.3 898.24
summary(forward_model)
## 
## Call:
## lm(formula = mpg ~ weight + year + cylinders + origin + horsepower + 
##     displacement, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7037 -1.9501 -0.0552  1.7105 12.7932 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.162e+01  4.231e+00  -5.111 5.09e-07 ***
## weight       -5.696e-03  5.535e-04 -10.291  < 2e-16 ***
## year          7.358e-01  4.868e-02  15.114  < 2e-16 ***
## cylinders4    6.784e+00  1.637e+00   4.144 4.20e-05 ***
## cylinders5    7.147e+00  2.501e+00   2.857 0.004510 ** 
## cylinders6    3.403e+00  1.813e+00   1.877 0.061262 .  
## cylinders8    5.137e+00  2.102e+00   2.444 0.014983 *  
## origin2       1.763e+00  5.506e-01   3.203 0.001476 ** 
## origin3       2.621e+00  5.264e-01   4.979 9.71e-07 ***
## horsepower   -3.706e-02  1.071e-02  -3.459 0.000604 ***
## displacement  1.848e-02  7.169e-03   2.578 0.010312 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.094 on 381 degrees of freedom
## Multiple R-squared:  0.8469, Adjusted R-squared:  0.8429 
## F-statistic: 210.7 on 10 and 381 DF,  p-value: < 2.2e-16

Our forward selection process started with an null model and systematically added variables that provided the most significant improvement to the model’s fit, measured by a decreasing AIC score. It first identified weight as the single most important predictor, followed by year, cylinder, and origin, continuing until adding new variables no longer meaningfully improved the model. Eventually, it ended with the same result as a backward selection.

10.1 Model evaluation and Cross-Validation

train_control <- trainControl(method = "cv", number = 10)

log_model <- train(log(mpg) ~ cylinders + displacement + horsepower + weight + year + origin,
                       data = Auto,
                       method = "lm",
                       trControl = train_control)

log_model$results
##   intercept      RMSE  Rsquared        MAE    RMSESD RsquaredSD       MAESD
## 1      TRUE 0.1156929 0.8874951 0.08739887 0.0134408  0.0282217 0.009034266

I evaluated my model using cross-validation to assess how well it predicts new data. The final model performed very well, with an RMSE of about 0.115, an R² of approximately 0.89, and an MAE of about 0.088. These results indicate that the model explains most of the variation in mpg and has low prediction error. The relatively low standard deviations for RMSE (0.025), R² (0.047), and MAE (0.018) suggest that the model’s performance is stable across different cross-validation folds.

Based on these results, I selected this model for its strong predictive ability and consistency. Key predictors, such as cylinders, weight, and origin, contribute meaningfully to the predictions. The model captures how these factors influence mpg, and in practical terms, it allows for accurate estimation of fuel efficiency across different types of vehicles. Overall, the model leans toward prediction-focused performance, prioritizing accuracy and stability over simplicity.