Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ISLR2)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom        1.0.10     ✔ rsample      1.3.1 
## ✔ dials        1.4.2      ✔ tailor       0.1.0 
## ✔ infer        1.0.9      ✔ tune         2.0.0 
## ✔ modeldata    1.5.1      ✔ workflows    1.3.0 
## ✔ parsnip      1.3.3      ✔ workflowsets 1.1.1 
## ✔ recipes      1.3.1      ✔ yardstick    1.3.2 
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
library(skimr)
library(GGally)

EDA

Skimming through the data

skim(Auto)
Data summary
Name Auto
Number of rows 392
Number of columns 9
_______________________
Column type frequency:
factor 1
numeric 8
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
name 0 1 FALSE 301 amc: 5, for: 5, toy: 5, amc: 4

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 23.45 7.81 9 17.00 22.75 29.00 46.6 ▆▇▆▃▁
cylinders 0 1 5.47 1.71 3 4.00 4.00 8.00 8.0 ▇▁▃▁▅
displacement 0 1 194.41 104.64 68 105.00 151.00 275.75 455.0 ▇▂▂▃▁
horsepower 0 1 104.47 38.49 46 75.00 93.50 126.00 230.0 ▆▇▃▁▁
weight 0 1 2977.58 849.40 1613 2225.25 2803.50 3614.75 5140.0 ▇▇▅▅▂
acceleration 0 1 15.54 2.76 8 13.78 15.50 17.02 24.8 ▁▆▇▂▁
year 0 1 75.98 3.68 70 73.00 76.00 79.00 82.0 ▇▆▇▆▇
origin 0 1 1.58 0.81 1 1.00 1.00 2.00 3.0 ▇▁▂▁▂
auto <- as_tibble(Auto) |>
  mutate(origin = as.factor(origin)) |>
  select(acceleration, mpg, cylinders, displacement, horsepower, weight, year, origin, -name)
glimpse(auto)
## Rows: 392
## Columns: 8
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders    <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower   <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight       <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ year         <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …

Correlations between varibles

ggpairs(auto)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Pairwise correlations show that acceleration has moderate relationships with several predictors, including a positive association with mpg and negative associations with cylinders, displacement, horsepower, and weight. The strongest correlations overall occur among the predictors themselves that is displacement, horsepower, and weight are highly correlated, and cylinders is closely tied to all three, suggesting clear multicollinearity within the engine size variables. Year is positively related to mpg and negatively related to engine size, reflecting industry shifts toward lighter, more efficient cars over time.

Distribution of acceleration

auto |>
  ggplot(aes(acceleration)) +
  geom_histogram(bins = 30, color = "white") +
  labs(title = "Distribution of Acceleration",
       y = "Count",
       x = "Acceleration (Time to accelerate from 0 to 60 mph) Sec") +
  theme_minimal()

The acceleration variable shows a roughly bell-shaped distribution, with most cars taking about 14–16 seconds to reach 60 mph and fewer observations at the extreme ends.

Distribution of mpg

auto |> 
  ggplot(aes(mpg)) +
  geom_histogram(bins = 30, color = "white") +
  labs(title = "MPG Distribution",
       x = "mpg (Miles per gallon)",
       y = "Count") +
  theme_minimal()

The mpg values are right-skewed, with most cars falling between about 15 and 30 mpg and only a few reaching the higher mileage range.

Distribution of cylinders

auto |> 
  ggplot(aes(cylinders)) +
  geom_bar() +
  labs(title = "Cylinders",
       y = "Count",
       x = "Number of Cylinders") +
  theme_minimal()

Most cars in the dataset have 4 cylinders, with smaller groups of 6- and 8-cylinder vehicles making up the rest.

Distribution of displacement

auto |>
  ggplot(aes(displacement)) +
  geom_histogram(bins = 30, color = "white") +
  labs(title = "Engine Displacement",
       x = "Displacement (Engine displacement in cu. inches)",
       y = "Count") + 
  theme_minimal()

Engine displacement is heavily right-skewed, with most cars clustered around smaller engines (roughly 100–150 cu. inches) and only a few having very large engines.

Distribution of horsepower

auto |>
  ggplot(aes(horsepower)) +
  geom_histogram(bins = 30, color = "white") +
  labs(title = "Horsepower",
       y = "Count",
       x = "Engine Horsepower") +
  theme_minimal()

Horsepower is right-skewed, with most cars producing between about 60 and 120 horsepower and fewer models in the higher-power range.

Distribution of weight

auto |>
  ggplot(aes(weight)) +
  geom_histogram(bins = 30, color = "white") +
  labs(title = "Car Weight",
       x = "Weight (Vehicle weight in (lbs))",
       y = "Count") +
  theme_minimal()

Car weight is right-skewed, with most vehicles weighing between about 2,000 and 3,500 pounds and fewer appearing in the heavier range.

Distribution of year

auto |>
  ggplot(aes(year)) +
  geom_bar() +
  labs(title = "Model Year",
       x = "Year",
       y = "Count") +
  scale_x_continuous(breaks = seq(min(auto$year), max(auto$year), by = 1)) +
  theme_bw()

The model years are evenly represented across the early 1970s to early 1980s, with each year appearing in similar numbers.

Distribution of origin

auto |>
  ggplot(aes(origin)) +
  geom_bar() +
  scale_x_discrete(labels = c(
    "1" = "American",
    "2" = "European",
    "3" = "Japanese")) +
  labs(
    title = "Origin of Cars",
    x = "Origin",
    y = "Count")

Most cars in the dataset are American, with smaller groups coming from Europe and Japan.

Summary of univariate EDA

A quick look at the Auto dataset shows a mix of symmetric and skewed variables. Acceleration is fairly balanced with no extreme values, which makes it a stable response for modeling. Several predictors—mpg, displacement, horsepower, and weight—are right-skewed, with most cars clustered at lower values and a small number extending into the higher range. Cylinders follows the expected pattern, with most vehicles using 4, 6, or 8 cylinders. Model year is spread evenly across the early 1970s to early 1980s, and the origin variable shows a clear split among American, European, and Japanese cars. Overall, the response looks well-behaved, while some predictors show skewness that may affect their relationship with acceleration, which the modeling step will help sort out.

Model Fitting

I did not create a separate train/test split because the assignment uses 5 fold cross validation as the main way to judge performance. Cross validation trains the model on different parts of the data and checks it on the fold that’s left out, which already gives a solid estimate of how well the model generalizes. Since each fold acts like a temporary test set, there’s no need for an extra split. I also used recipes so that any preprocessing (such as normalizing the numeric predictors) was done inside each fold rather than leaking information across folds.

Cross Validation Setup

set.seed(2025)

folds <- vfold_cv(auto, v = 5)

Single predictor models

predictors <- c("mpg", "cylinders", "displacement", "horsepower", "weight", "year")

#function to fit CV model for each predictor
cv_single_model <- function(var) {
  formula <- as.formula(paste("acceleration ~", var, "+ origin"))
  
  #recipe + workflow
  wf <- workflow() |>
    add_recipe(
      recipe(formula, data = auto) |>
        step_normalize(all_numeric_predictors())) |>
    add_model(linear_reg() |>
        set_engine("lm"))
  
fit_resamples(
    wf,
    resamples = folds,
    control = control_resamples(save_pred = TRUE)
  ) |>
    collect_metrics() |>
    dplyr::filter(.metric == "rmse") |>
    dplyr::transmute(model = var, rmse = mean, std_err)
}
cv_single_results <- purrr::map_df(predictors, cv_single_model)

cv_single_results
## # A tibble: 6 × 3
##   model         rmse std_err
##   <chr>        <dbl>   <dbl>
## 1 mpg           2.49   0.125
## 2 cylinders     2.38   0.128
## 3 displacement  2.28   0.141
## 4 horsepower    1.98   0.122
## 5 weight        2.49   0.155
## 6 year          2.56   0.141

The cross validation results show that horsepower is the strongest single predictor of acceleration, giving the lowest RMSE. This makes sense: cars with more horsepower tend to accelerate faster, and the relationship is fairly direct. Displacement and cylinders also perform reasonably well, while MPG, weight, and year show weaker predictive ability by themselves. Year performs the worst, which suggests that acceleration varies within each model year more than it varies across years. Based on these results, horsepower is the best starting variable for building a forward selection model.

Forward Selection CV

Ordered list of variables with least RMSE

ordered_vars <- cv_single_results |>
  arrange(rmse) |>
  pull(model)

ordered_vars
## [1] "horsepower"   "displacement" "cylinders"    "mpg"          "weight"      
## [6] "year"

Function for forward selection for any predictors

cv_model <- function(vars){
  formula <- as.formula(
    paste("acceleration ~", paste(vars, collapse = " + "), "+ origin")
  )

  wf <- workflow() |>
    add_recipe(
      recipe(formula, data = auto) |>
        step_normalize(all_numeric_predictors())
    ) |>
    add_model(linear_reg() |> set_engine("lm"))

  fit_resamples(
    wf,
    resamples = folds,
    control = control_resamples(save_pred = TRUE)
  ) |>
    collect_metrics() |>
    filter(.metric == "rmse") |>
    mutate(predictors = paste(vars, collapse = ", "))
}

Adding variables in their order

results_list <- list()

for(i in seq_along(ordered_vars)){
  vars <- ordered_vars[1:i]
  results_list[[i]] <- cv_model(vars)
}

cv_forward_results <- bind_rows(results_list) |>
  group_by(predictors) |>
  summarize(rmse = mean(mean))

cv_forward_results 
## # A tibble: 6 × 2
##   predictors                                              rmse
##   <chr>                                                  <dbl>
## 1 horsepower                                              1.98
## 2 horsepower, displacement                                1.95
## 3 horsepower, displacement, cylinders                     1.95
## 4 horsepower, displacement, cylinders, mpg                1.93
## 5 horsepower, displacement, cylinders, mpg, weight        1.73
## 6 horsepower, displacement, cylinders, mpg, weight, year  1.74

Plot

ggplot(cv_forward_results,
       aes(x = predictors, y = rmse)) +
  geom_line(group = 1) +
  geom_point(size = 2) +
  labs(
    title = "Forward Selection: RMSE by Predictor Combination",
    x = "Predictors Included",
    y = "RMSE"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

My Best model: acceleration ~ horsepower + displacement + cylinders + mpg + weight + origin

Forward-selection results showed that horsepower was the strongest single predictor of acceleration. As I added predictors one at a time, model performance improved only slightly until weight entered the model. The combination of horsepower, displacement, cylinders, mpg, and weight produced the lowest cross-validated RMSE (1.725), making it the best-performing model. Adding year increased RMSE, so it was not included in the final model. These results suggest that acceleration is most strongly related to engine power, engine size, fuel efficiency, and vehicle weight, while model year contributes little once these other variables are considered.

Final Model

Final Model fitting

final_vars <- c("horsepower", "displacement", "cylinders", "mpg", "weight")

final_formula <- as.formula(
  paste("acceleration ~", paste(final_vars, collapse = " + "), "+ origin"))

final_wf <- workflow() |>
  add_recipe(recipe(final_formula, data = auto) |>
               step_normalize(all_numeric_predictors())) |>
  add_model(linear_reg() |> 
              set_engine("lm"))

final_fit <- final_wf |> 
  fit(auto)

final_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##  (Intercept)    horsepower  displacement     cylinders           mpg  
##     15.50389      -3.26410      -0.73276      -0.28172      -0.04279  
##       weight       origin2       origin3  
##      2.60013       0.14094       0.06447

Interpretation

horsepower (−3.264) Cars with higher horsepower tend to accelerate faster. Since acceleration in this dataset is “time to reach 60 mph,” a negative coefficient means more horsepower reduces acceleration time, which matches expectations.

displacement (−0.733) Cars with larger engine displacement also tend to have quicker acceleration. The effect is weaker than horsepower but still negative, meaning bigger engines shorten acceleration time.

cylinders (−0.282) Cars with more cylinders accelerate slightly faster. This effect is small compared to horsepower and displacement, which suggests cylinders mainly matter because they are related to engine size and power.

mpg (−0.043) mpg has a very small negative effect. High mpg cars may be slightly quicker when holding everything else constant, but the effect is small.

weight (+2.600) Heavier cars accelerate more slowly. This is the strongest positive coefficient, meaning weight strongly increases acceleration time.

origin2 (+0.141) – European cars After accounting for horsepower, engine size, weight, and the other predictors, European cars take slightly longer to reach 60 mph compared to American cars.

origin3 (+0.064) – Japanese cars Japanese cars also show a small positive effect, meaning they take slightly longer to accelerate than American cars when holding everything else constant.

Co-efficients

final_coefs <- tidy(final_fit)
final_coefs
## # A tibble: 8 × 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   15.5        0.129   120.    1.15e-306
## 2 horsepower    -3.26       0.217   -15.1   1.36e- 40
## 3 displacement  -0.733      0.414    -1.77  7.74e-  2
## 4 cylinders     -0.282      0.286    -0.986 3.25e-  1
## 5 mpg           -0.0428     0.165    -0.259 7.95e-  1
## 6 weight         2.60       0.266     9.77  2.69e- 20
## 7 origin2        0.141      0.291     0.484 6.29e-  1
## 8 origin3        0.0645     0.295     0.219 8.27e-  1

From the final model, only horsepower and weight come out as truly meaningful. Both have very small p-values, which shows their effects are strong and consistent. The other variables (displacement, cylinders, mpg, and origin) don’t reach statistical significance once the main predictors are in the model. This means they don’t add much new information about acceleration beyond what horsepower and weight already explain.

the model above Cross Validated

final_cv <- fit_resamples(
  final_wf,
  resamples = folds,
  control = control_resamples(save_pred = TRUE))

final_rmse <- final_cv |>
  collect_metrics() |>
  filter(.metric == "rmse")

final_rmse
## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config        
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
## 1 rmse    standard    1.73     5   0.110 pre0_mod0_post0

The final model produced an average RMSE of about 1.73 across the five folds, with a standard error of roughly 0.11. This means the model’s predictions are, on average, about 1.7 acceleration units away from the true values. The small standard error also shows that the model performed consistently across the folds, which gives confidence that it generalizes well to new data.

Plot(prediction vs actual)

final_preds <- final_cv |> collect_predictions()

ggplot(final_preds, aes(.pred, acceleration)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  labs(
    title = "Predicted vs Actual Acceleration",
    x = "Predicted Acceleration",
    y = "Actual Acceleration"
  )
## `geom_smooth()` using formula = 'y ~ x'

The predicted vs actual plot shows a strong upward pattern, meaning the model’s predictions closely follow the true acceleration values. Most points cluster around the fitted line, which indicates the model is capturing the main relationship well.

Significant Predictors only

I also fit a reduced model that kept only the statistically significant variables from the full model, which were horsepower and weight. After cross-validating this reduced model using the same 5-fold setup, I compared its RMSE to the full model. This allowed me to check whether the extra, non-significant variables were actually helping the model or just adding noise.

sig_vars <- c("horsepower", "weight")

sig_formula <- as.formula(
  paste("acceleration ~", paste(sig_vars, collapse = " + ")))

sig_wf <- workflow() |>
  add_recipe(recipe(sig_formula, data = auto) |>
      step_normalize(all_numeric_predictors())) |>
  add_model(linear_reg() |> 
              set_engine("lm"))

# fit on full data
sig_fit <- sig_wf |> fit(auto)

tidy(sig_fit)
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    15.5     0.0882     176.  0       
## 2 horsepower     -3.59    0.176      -20.5 1.21e-63
## 3 weight          1.96    0.176       11.1 3.55e-25

In the reduced model, only horsepower and weight were included because they were the only statistically significant predictors from the full model. Both variables remained highly significant with very small p-values. The coefficient for horsepower is negative, meaning cars with more power accelerate faster. The coefficient for weight is positive, showing that heavier cars take longer to reach 60 mph. These results are consistent with the patterns seen in the full model, and they confirm that these two predictors carry most of the signal related to acceleration.

Cross validation of significant predictors

sig_cv <- fit_resamples(
  sig_wf,
  resamples = folds,
  control = control_resamples(save_pred = TRUE))

sig_cv_results <- sig_cv |>
  collect_metrics() |>
  filter(.metric == "rmse")

sig_cv_results
## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config        
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
## 1 rmse    standard    1.74     5   0.115 pre0_mod0_post0

The reduced model, which included only horsepower and weight, had a cross-validated RMSE of about 1.74. This is almost the same as the full model’s RMSE of about 1.73. Since the predictive accuracy barely changed after removing the non-significant variables, the extra predictors did not add much value. This suggests that horsepower and weight carry most of the information needed to predict acceleration, and a simpler model with just these two variables performs just as well as the more complicated version.