library(corrplot)
library(ISLR2)
library(ggplot2)
library(tidyr)
library(dplyr)
library(car)
library(MASS)
library(caret)First, I will analyze Auto Dataset from R. First, we will take a look at what this dataset looks like
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
Originandcylindersneed to be transform into factor variable rather than numeric. These represent distinct categories, not continuous scales, and should be handled as categorical data.
namevariable has high cardinality therefore we have to handle carefully in model fitting.
Now, lets look at correlation plot to detect potential relationships within the dataset.
matrix <- cor(Auto[sapply(Auto, is.numeric)])
corrplot(matrix, method = "color",
order = "hclust",
addCoef.col = "white",
number.cex = 0.8,
tl.col = "black",
tl.srt = 45) displacement and weight show a very
strong positive relationship with a correlation of
0.93. This indicates that as engine displacement
increases, the weight of the vehicle also increases
significantly.
horsepower and displacement also show a
strong positive relationship (Cor = 0.90), and
horsepower and weight are similarly highly
correlated (Cor = 0.86). This suggests that larger,
heavier engines tend to produce more horsepower.
acceleration and MPG show a moderate
positive relationship (Cor = 0.42). This suggests that
cars with higher mpg tend to have slightly better acceleration
performance.
year and mpg show a moderate positive
relationship (Cor = 0.58), indicating that newer cars
tend to have better fuel efficiency.
year and Engine Size Variables
(Horsepower, Displacement, Weight) show moderate negative
relationships (Cor = -0.31 to -0.42). This implies that
newer cars tend to have smaller, lighter engines compared to older
models.
acceleration and Engine Size Variables show moderate
negative relationships (Cor = -0.42 to -0.69),
suggesting that heavier and more powerful cars tend to have slower
acceleration performance.
ggplot(Auto, aes(x = origin, y = mpg, fill = origin)) +
geom_boxplot() +
scale_x_discrete(labels = c("1" = "American", "2" = "European", "3" = "Japanese")) +
scale_fill_discrete(labels = c("1" = "American", "2" = "European", "3" = "Japanese")) +
theme_minimal() +
labs(title = "Fuel Efficiency by Region of Origin",
x = "Country of Origin",
y = "Miles Per Gallon (mpg)",
fill = "Region")This boxplot shows that Japanese cars generally have the highest fuel efficiency, with most getting over 30 miles per gallon (mpg). European cars fall in the middle, typically averaging around 26 mpg, while American cars tend to have the lowest fuel efficiency, with most falling below 25 mpg. While all three regions have a few “outlier” vehicles that get exceptionally good gas mileage for their group, the overall trend clearly shows that cars from Japan and Europe are more fuel-efficient than those from America in this dataset.
ggplot(Auto, aes(x = cylinders, y = mpg, fill = cylinders)) +
geom_boxplot() +
theme_minimal()+
labs(title = "Fuel Efficiency by Cylinders")This boxplot shows a clear downward trend in fuel efficiency as the number of cylinders increases, with 4-cylinder engines being the overall leaders in high gas mileage. At the other end of the scale, 8-cylinder vehicles are the least efficient, consistently getting the lowest miles per gallon. There are also several outliers in the 4, 6, and 8-cylinder categories, representing specific cars that get significantly better mileage than most others in their group.
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6797 -1.9373 -0.0678 1.6711 12.7756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.208e+01 4.541e+00 -4.862 1.70e-06 ***
## cylinders4 6.722e+00 1.654e+00 4.064 5.85e-05 ***
## cylinders5 7.078e+00 2.516e+00 2.813 0.00516 **
## cylinders6 3.351e+00 1.824e+00 1.837 0.06701 .
## cylinders8 5.099e+00 2.109e+00 2.418 0.01607 *
## displacement 1.870e-02 7.222e-03 2.590 0.00997 **
## horsepower -3.490e-02 1.323e-02 -2.639 0.00866 **
## weight -5.780e-03 6.315e-04 -9.154 < 2e-16 ***
## acceleration 2.598e-02 9.304e-02 0.279 0.78021
## year 7.370e-01 4.892e-02 15.064 < 2e-16 ***
## origin2 1.764e+00 5.513e-01 3.200 0.00149 **
## origin3 2.617e+00 5.272e-01 4.964 1.04e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.098 on 380 degrees of freedom
## Multiple R-squared: 0.8469, Adjusted R-squared: 0.8425
## F-statistic: 191.1 on 11 and 380 DF, p-value: < 2.2e-16
-Multiple R-squared = 0.8469 means that the model explains about 84.7% of the variation in mpg, indicating a strong overall fit.
-Adjusted R-squared = 0.8425 shows that after accounting for the number of predictors, the model still explains about 84.3% of the variation, suggesting that most variables included are meaningful.
-The coefficient for weight is
-0.00578, meaning that holding all other variables
constant, an increase of 1 unit in weight is associated with a decrease
in mpg of about 0.0058. The
p-value < 2e-16 is highly significant, indicating weight
is one of the most important predictors of fuel efficiency.
-The coefficient for horsepower is
-0.0349, meaning that holding other variables fixed, an
increase of 1 unit in horsepower is associated with a decrease in mpg of
about 0.035. The p-value = 0.00866 is
statistically significant, showing that more powerful cars tend to be
less fuel efficient.
-The coefficient for displacement is
0.0187, meaning that holding other variables fixed, a 1
unit increase in displacement is associated with a slight increase in
mpg. The p-value = 0.00997 is statistically significant,
although this positive sign may be influenced by multicollinearity with
other engine-related variables.
-For cylinders (relative to the 3 cylinders), the
coefficients indicate differences in mpg compared to the reference
group. For example, holding other variables fixed, cars with
4 cylinders have an increase of about 6.72
mpg, 5 cylinders have an increase of about
7.08 mpg, 6 cylinders have an increase of
about 3.35 mpg, and 8 cylinders have an
increase of about 5.10 mpg compared to the baseline
category.
-The coefficient for year is 0.737,
meaning that holding other variables constant, each additional year is
associated with an increase of about 0.74 mpg.p-value
< 2e-16 is highly significant, indicating that newer cars tend to be
more fuel efficient.
-The coefficient for acceleration is
0.026, meaning that holding other variables fixed,
acceleration has a very small positive effect on mpg. However, the
p-value = 0.78 is not statistically significant, suggesting
it does not have a meaningful independent effect in this model.
-In terms of origin (relative to the American cars,
origin 1), cars from origin 2 have about
1.76 higher mpg, and cars from origin 3
have about 2.62 higher mpg, both statistically
significant. This indicates that cars from these regions tend to be more
fuel efficient compared to the baseline group.
-Year with a t-value of 15.06 and a
p-value < 2e-16, this is one of the most statistically
significant predictors in the model. It has a strong positive impact on
mpg, indicating that newer cars are substantially more fuel
efficient.
-Weight with a t-value of -9.15 and a
p-value < 2e-16, weight is a highly significant
predictor. It has a strong negative effect on mpg, meaning heavier
vehicles significantly reduce fuel efficiency.
-horsepower and displacement both
variables are statistically significant
(p-values < 0.01) with
t-values of -2.64 and 2.59, respectively. Horsepower shows
a negative relationship with mpg, while displacement shows a small
positive effect, though this may be influenced by correlation with other
engine-related variables.
-cylinders several cylinder categories are
statistically significant. In particular, 4-cylinder and
5-cylinder engines show strong significance
(p < 0.01), indicating meaningful differences in mpg
compared to the baseline category. The 6-cylinder category
is only marginally significant (p ≈ 0.067), suggesting
weaker evidence of an independent effect.
-origin both origin 2 and
origin 3 are statistically significant
(p < 0.01), with t-values of 3.20 and 4.96,
respectively. This indicates that the region of origin contributes
meaningfully to predicting fuel efficiency.
-acceleration this variable has a
p-value of 0.78, which is far above the 0.05 threshold. Its
t-value of 0.28 indicates that it does not provide a
meaningful independent contribution to predicting mpg when other
variables are included in the model.
reduced_model<-lm(mpg ~ cylinders + displacement + horsepower + weight + year + origin, data=Auto)
summary(reduced_model)##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7037 -1.9501 -0.0552 1.7105 12.7932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.162e+01 4.231e+00 -5.111 5.09e-07 ***
## cylinders4 6.784e+00 1.637e+00 4.144 4.20e-05 ***
## cylinders5 7.147e+00 2.501e+00 2.857 0.004510 **
## cylinders6 3.403e+00 1.813e+00 1.877 0.061262 .
## cylinders8 5.137e+00 2.102e+00 2.444 0.014983 *
## displacement 1.848e-02 7.169e-03 2.578 0.010312 *
## horsepower -3.706e-02 1.071e-02 -3.459 0.000604 ***
## weight -5.696e-03 5.535e-04 -10.291 < 2e-16 ***
## year 7.358e-01 4.868e-02 15.114 < 2e-16 ***
## origin2 1.763e+00 5.506e-01 3.203 0.001476 **
## origin3 2.621e+00 5.264e-01 4.979 9.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.094 on 381 degrees of freedom
## Multiple R-squared: 0.8469, Adjusted R-squared: 0.8429
## F-statistic: 210.7 on 10 and 381 DF, p-value: < 2.2e-16
Our reduced model only consist of the variable that has accaptable p-values and seem to perform just as better as the full model. This provides parsimony for our approach. Now we will conduct ANOVA test for further examination of removal variable.
## Analysis of Variance Table
##
## Model 1: mpg ~ (cylinders + displacement + horsepower + weight + acceleration +
## year + origin + name) - name
## Model 2: mpg ~ cylinders + displacement + horsepower + weight + year +
## origin
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 380 3646.3
## 2 381 3647.1 -1 -0.74823 0.078 0.7802
The ANOVA results show that removing acceleration does not significantly affect the model (F = 0.078, p = 0.7802). Since the p-value is much greater than 0.05, acceleration does not add meaningful predictive value, so the simpler model without it is preferred.
Confidence Interval: This interval estimates the average mpg for the entire population of cars that share these specific characteristics. It is narrower because individual fluctuations tend to average out, leaving only the uncertainty of the population mean.
## 2.5 % 97.5 %
## (Intercept) -29.942486300 -13.304115401
## cylinders4 3.565222643 10.002208310
## cylinders5 2.228465271 12.065068359
## cylinders6 -0.161457498 6.967293830
## cylinders8 1.004047279 9.269721766
## displacement 0.004385732 0.032575194
## horsepower -0.058127299 -0.015992682
## weight -0.006784138 -0.004607599
## year 0.640068333 0.831514038
## origin2 0.680810270 2.846151956
## origin3 1.585909183 3.656045777
In the reduced model, most of the 95% confidence intervals do not include zero, indicating that these predictors have a statistically significant relationship with mpg. Variables such as weight, horsepower, displacement, year, and origin all show intervals that stay entirely above or below zero, confirming a reliable effect. However, the interval for cylinders6 includes zero, suggesting that this category may not have a statistically significant independent effect. Overall, we can be at least 95% confident that the significant predictors have a real, non-zero impact on fuel efficiency, while cylinders6 shows weaker evidence.
Prediction Interval: This interval is designed to capture the price of a single new car. It is always wider because it must account for both the uncertainty in the model’s parameters and the natural “noise” or residual error found in individual data points.
new_car <- data.frame(cylinders="6", displacement=380, horsepower=180,
year=75, weight=2000, origin="2")
prediction <- predict(reduced_model, newdata = new_car, interval = 'prediction')
print(prediction)## fit lwr upr
## 1 27.68748 20.64205 34.7329
## GVIF Df GVIF^(1/(2*Df))
## cylinders 15.142091 4 1.404505
## displacement 22.984822 1 4.794249
## horsepower 6.947667 1 2.635843
## weight 9.028147 1 3.004687
## year 1.313734 1 1.146182
## origin 2.342253 2 1.237110
Our GVIF results show that
displacement,weight, andhorsepowerare the main sources of multicollinearity in the model. In particular,displacementhas the highest adjustedGVIF (~4.79), followed byweight (~3.00)andhorsepower (~2.64), indicating that these variables are strongly related and may be capturing similar information about engine size and performance. This suggests that these predictors are overlapping in what they explain about mpg, while variables like year and origin show very low multicollinearity and are not a concern.
-Residuals vs Fitted: Ideally, the red line should be flat around zero. In this plot, we see a slight curved pattern, suggesting some non-linearity in the relationship between predictors and mpg. This indicates that the model may be missing nonlinear terms or interactions.
-Q-Q Residuals: While the points mostly follow the straight line in the middle, there are noticeable deviations in the tails. This suggests that the residuals are not perfectly normally distributed and that the model produces some outliers.
-Scale-Location: The red line shows a slight upward trend and the spread of points increases as fitted values grow. This indicates heteroscedasticity, meaning the variance of the errors is not constant. The model appears to be more consistent at lower mpg values and less stable at higher values.
-Residuals vs Leverage: There are a few observations with higher leverage and larger residuals, but none appear to exceed the Cook’s Distance thresholds. This suggests that while some points may be influential, there are no extreme observations dominating the model.
Since there are outliers towards to higher mpg values we are going to investiage using BoxCox method to see likely hood of applying log transformation.
Box Cox results showing us that peak part of the curve(lambda) is really close to “0”, therefore log transformation would be ideal for our next model transformation.
log_model<-lm(log(mpg) ~ cylinders + displacement + horsepower + weight + year + origin, data=Auto)
summary(log_model)##
## Call:
## lm(formula = log(mpg) ~ cylinders + displacement + horsepower +
## weight + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40887 -0.06571 0.00096 0.06361 0.38856
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.429e+00 1.537e-01 9.295 < 2e-16 ***
## cylinders4 2.486e-01 5.947e-02 4.181 3.60e-05 ***
## cylinders5 2.789e-01 9.087e-02 3.069 0.00230 **
## cylinders6 1.255e-01 6.586e-02 1.906 0.05736 .
## cylinders8 1.401e-01 7.636e-02 1.834 0.06740 .
## displacement 6.375e-04 2.604e-04 2.448 0.01481 *
## horsepower -1.636e-03 3.893e-04 -4.204 3.27e-05 ***
## weight -2.453e-04 2.011e-05 -12.202 < 2e-16 ***
## year 2.935e-02 1.769e-03 16.594 < 2e-16 ***
## origin2 5.758e-02 2.000e-02 2.879 0.00422 **
## origin3 7.724e-02 1.912e-02 4.039 6.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1124 on 381 degrees of freedom
## Multiple R-squared: 0.8935, Adjusted R-squared: 0.8907
## F-statistic: 319.8 on 10 and 381 DF, p-value: < 2.2e-16
I started with a full model and then removed non-significant variables using nested model comparisons. This helped me simplify the model while keeping only the important predictors. I learned that not all variables contribute meaningfully, and removing unnecessary ones can make the model easier to interpret without hurting performance.
From the model diagnostics, I noticed issues such as non-constant variance and heavy tails in the residuals, which suggested the presence of outliers and violations of normality. The residual plots showed clear patterns, indicating that the original model was not fully appropriate. After applying a Box-Cox transformation, I used a log transformation of the response, which improved the model significantly. The residuals became more stable and closer to normal, showing that transforming the response can help fix model assumption issues and lead to a better final model.
I previously examine how categorical variables affect our response variable, now I m going to see more comprehensive way to look at these categorical variables in one plot to see their interactions within are necessary.
library(ggplot2)
ggplot(Auto, aes(x = cylinders, y = mpg, fill = origin)) +
geom_boxplot(position = position_dodge(width = 0.8)) +
labs(title = "MPG by Cylinders and Origin",
x = "Cylinders", y = "MPG") +
scale_fill_brewer(palette = "Pastel1", name = "Origin")ggplot(Auto, aes(x = cylinders, y = mpg, fill = origin)) +
geom_violin(position = position_dodge(width = 0.8)) +
labs(title = "Distribution of MPG by Cylinders and Origin",
x = "Cylinders", y = "MPG") +
scale_fill_brewer(palette = "Pastel1", name = "Origin")Based on the plots above, an interaction term between
CylindersandOriginappears necessary because the relationship between the number ofcylindersandmpgis not consistent across different origins. For example, in the 4-cylinder category, vehicles from Origin 3 tend to have the highestmpg, while in the 6-cylinder category,Origin 3remains higher but the gap between origins narrows and shifts compared to the 4-cylinder distribution. Since the effect oforiginonmpgchanges depending on the cylinder count and vice-versa, therefore ineraction in a model could be discovered.
interaction_log_model <- lm(log(mpg) ~ cylinders * origin + displacement + horsepower + weight + year, data=Auto)
summary(interaction_log_model)##
## Call:
## lm(formula = log(mpg) ~ cylinders * origin + displacement + horsepower +
## weight + year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40828 -0.06565 0.00201 0.06404 0.38849
##
## Coefficients: (6 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.394e+00 1.580e-01 8.821 < 2e-16 ***
## cylinders4 2.838e-01 7.587e-02 3.740 0.000212 ***
## cylinders5 3.426e-01 1.148e-01 2.984 0.003025 **
## cylinders6 1.570e-01 7.503e-02 2.092 0.037090 *
## cylinders8 1.718e-01 8.466e-02 2.029 0.043163 *
## origin2 2.699e-02 6.259e-02 0.431 0.666574
## origin3 1.132e-01 5.326e-02 2.126 0.034170 *
## displacement 6.552e-04 2.816e-04 2.327 0.020486 *
## horsepower -1.680e-03 4.158e-04 -4.041 6.43e-05 ***
## weight -2.442e-04 2.029e-05 -12.035 < 2e-16 ***
## year 2.933e-02 1.802e-03 16.282 < 2e-16 ***
## cylinders4:origin2 3.214e-02 6.495e-02 0.495 0.621043
## cylinders5:origin2 NA NA NA NA
## cylinders6:origin2 NA NA NA NA
## cylinders8:origin2 NA NA NA NA
## cylinders4:origin3 -3.923e-02 5.525e-02 -0.710 0.478090
## cylinders5:origin3 NA NA NA NA
## cylinders6:origin3 NA NA NA NA
## cylinders8:origin3 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1126 on 379 degrees of freedom
## Multiple R-squared: 0.8938, Adjusted R-squared: 0.8904
## F-statistic: 265.8 on 12 and 379 DF, p-value: < 2.2e-16
Our interaction log model just perform as good as log model with a slightly lower Adjuster R-Squared value = 0.8904.
models <- list(
Full = full_model,
Reduced1 = reduced_model,
Log = log_model,
Interaction_Log_model = interaction_log_model)
model_summary <- data.frame(
Model = names(models),
R2 = sapply(models, function(m) summary(m)$r.squared),
Adj_R2 = sapply(models, function(m) summary(m)$adj.r.squared),
Residual_SE = sapply(models, function(m) summary(m)$sigma)
)
model_summary## Model R2 Adj_R2 Residual_SE
## Full Full 0.8469157 0.8424843 3.0976703
## Reduced1 Reduced1 0.8468843 0.8428655 3.0939198
## Log Log 0.8935375 0.8907432 0.1123965
## Interaction_Log_model Interaction_Log_model 0.8938019 0.8904395 0.1125527
The Log model is the superior choice for the final model because it achieves the best balance between performance and parsimony. Although the interaction model shows a negligible increase in raw R-Squared, the higher Adjusted R-squared and lower Residual Standard Error of the Log model indicate that the interaction terms likely introduce unnecessary complexity without improving the model’s predictive accuracy. By choosing the Log model, you maintain a more interpretable framework that generalizes better to new data while capturing of the variance.
Previously, we looked at how our variables interact with the response variable. After making some manual adjustments to boost predictive power, I’m going to use stepwise selection to double check my work and see if a different combination of variables yields a better result.
## Start: AIC=898.24
## mpg ~ (cylinders + displacement + horsepower + weight + acceleration +
## year + origin + name) - name
##
## Df Sum of Sq RSS AIC
## - acceleration 1 0.75 3647.1 896.32
## <none> 3646.3 898.24
## - displacement 1 64.37 3710.7 903.10
## - horsepower 1 66.82 3713.1 903.36
## - origin 2 243.98 3890.3 919.63
## - cylinders 4 566.49 4212.8 946.85
## - weight 1 804.05 4450.4 974.36
## - year 1 2177.35 5823.7 1079.78
##
## Step: AIC=896.32
## mpg ~ cylinders + displacement + horsepower + weight + year +
## origin
##
## Df Sum of Sq RSS AIC
## <none> 3647.1 896.32
## - displacement 1 63.62 3710.7 901.10
## - horsepower 1 114.52 3761.6 906.44
## - origin 2 244.69 3891.8 917.78
## - cylinders 4 574.27 4221.3 945.64
## - weight 1 1013.74 4660.8 990.47
## - year 1 2186.53 5833.6 1078.45
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7037 -1.9501 -0.0552 1.7105 12.7932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.162e+01 4.231e+00 -5.111 5.09e-07 ***
## cylinders4 6.784e+00 1.637e+00 4.144 4.20e-05 ***
## cylinders5 7.147e+00 2.501e+00 2.857 0.004510 **
## cylinders6 3.403e+00 1.813e+00 1.877 0.061262 .
## cylinders8 5.137e+00 2.102e+00 2.444 0.014983 *
## displacement 1.848e-02 7.169e-03 2.578 0.010312 *
## horsepower -3.706e-02 1.071e-02 -3.459 0.000604 ***
## weight -5.696e-03 5.535e-04 -10.291 < 2e-16 ***
## year 7.358e-01 4.868e-02 15.114 < 2e-16 ***
## origin2 1.763e+00 5.506e-01 3.203 0.001476 **
## origin3 2.621e+00 5.264e-01 4.979 9.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.094 on 381 degrees of freedom
## Multiple R-squared: 0.8469, Adjusted R-squared: 0.8429
## F-statistic: 210.7 on 10 and 381 DF, p-value: < 2.2e-16
Using backward selection method, formula only drop non-significant p-value from the full model which was
accelaretionandnamewhich was high cardinality, and kept the other ones. This is could be a good method for detecthing significant predictors in large dataset, however it lack detecting its relationships in context. now let’s try to forward selection method. First we have to create a Null model.
null_model <- lm(mpg ~ 1, data = Auto)
forward_model <- step(null_model, scope = list(lower = null_model, upper = full_model),
direction = "forward")## Start: AIC=1611.93
## mpg ~ 1
##
## Df Sum of Sq RSS AIC
## + weight 1 16497.8 7321.2 1151.5
## + displacement 1 15440.2 8378.8 1204.4
## + cylinders 4 15274.5 8544.5 1218.1
## + horsepower 1 14433.1 9385.9 1248.9
## + year 1 8027.7 15791.3 1452.8
## + origin 2 7904.3 15914.7 1457.9
## + acceleration 1 4268.5 19550.5 1536.5
## <none> 23819.0 1611.9
##
## Step: AIC=1151.49
## mpg ~ weight
##
## Df Sum of Sq RSS AIC
## + year 1 2752.28 4569.0 968.66
## + cylinders 4 768.73 6552.5 1116.01
## + horsepower 1 327.39 6993.8 1135.56
## + origin 2 224.14 7097.1 1143.30
## + acceleration 1 168.34 7152.9 1144.37
## + displacement 1 150.93 7170.3 1145.33
## <none> 7321.2 1151.49
##
## Step: AIC=968.66
## mpg ~ weight + year
##
## Df Sum of Sq RSS AIC
## + cylinders 4 620.67 3948.3 919.43
## + origin 2 258.54 4310.4 949.83
## <none> 4569.0 968.66
## + acceleration 1 10.45 4558.5 969.77
## + horsepower 1 3.30 4565.7 970.38
## + displacement 1 0.04 4568.9 970.66
##
## Step: AIC=919.43
## mpg ~ weight + year + cylinders
##
## Df Sum of Sq RSS AIC
## + origin 2 172.097 3776.2 905.96
## + horsepower 1 47.785 3900.5 916.66
## + acceleration 1 25.886 3922.4 918.85
## <none> 3948.3 919.43
## + displacement 1 0.002 3948.3 921.43
##
## Step: AIC=905.96
## mpg ~ weight + year + cylinders + origin
##
## Df Sum of Sq RSS AIC
## + horsepower 1 65.507 3710.7 901.10
## + acceleration 1 26.962 3749.2 905.15
## <none> 3776.2 905.96
## + displacement 1 14.610 3761.6 906.44
##
## Step: AIC=901.1
## mpg ~ weight + year + cylinders + origin + horsepower
##
## Df Sum of Sq RSS AIC
## + displacement 1 63.619 3647.1 896.32
## <none> 3710.7 901.10
## + acceleration 1 0.001 3710.7 903.10
##
## Step: AIC=896.32
## mpg ~ weight + year + cylinders + origin + horsepower + displacement
##
## Df Sum of Sq RSS AIC
## <none> 3647.1 896.32
## + acceleration 1 0.74823 3646.3 898.24
##
## Call:
## lm(formula = mpg ~ weight + year + cylinders + origin + horsepower +
## displacement, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7037 -1.9501 -0.0552 1.7105 12.7932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.162e+01 4.231e+00 -5.111 5.09e-07 ***
## weight -5.696e-03 5.535e-04 -10.291 < 2e-16 ***
## year 7.358e-01 4.868e-02 15.114 < 2e-16 ***
## cylinders4 6.784e+00 1.637e+00 4.144 4.20e-05 ***
## cylinders5 7.147e+00 2.501e+00 2.857 0.004510 **
## cylinders6 3.403e+00 1.813e+00 1.877 0.061262 .
## cylinders8 5.137e+00 2.102e+00 2.444 0.014983 *
## origin2 1.763e+00 5.506e-01 3.203 0.001476 **
## origin3 2.621e+00 5.264e-01 4.979 9.71e-07 ***
## horsepower -3.706e-02 1.071e-02 -3.459 0.000604 ***
## displacement 1.848e-02 7.169e-03 2.578 0.010312 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.094 on 381 degrees of freedom
## Multiple R-squared: 0.8469, Adjusted R-squared: 0.8429
## F-statistic: 210.7 on 10 and 381 DF, p-value: < 2.2e-16
Our forward selection process started with an null model and systematically added variables that provided the most significant improvement to the model’s fit, measured by a decreasing AIC score. It first identified
weightas the single most important predictor, followed byyear,cylinder, andorigin, continuing until adding new variables no longer meaningfully improved the model. Eventually, it ended with the same result as a backward selection.
train_control <- trainControl(method = "cv", number = 10)
log_model <- train(log(mpg) ~ cylinders + displacement + horsepower + weight + year + origin,
data = Auto,
method = "lm",
trControl = train_control)
log_model$results## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 0.1156929 0.8874951 0.08739887 0.0134408 0.0282217 0.009034266
I evaluated my model using cross-validation to assess how well it predicts new data. The final model performed very well, with an RMSE of about 0.115, an R² of approximately 0.89, and an MAE of about 0.088. These results indicate that the model explains most of the variation in mpg and has low prediction error. The relatively low standard deviations for RMSE (0.025), R² (0.047), and MAE (0.018) suggest that the model’s performance is stable across different cross-validation folds.
Based on these results, I selected this model for its strong predictive ability and consistency. Key predictors, such as cylinders, weight, and origin, contribute meaningfully to the predictions. The model captures how these factors influence mpg, and in practical terms, it allows for accurate estimation of fuel efficiency across different types of vehicles. Overall, the model leans toward prediction-focused performance, prioritizing accuracy and stability over simplicity.