library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ISLR2)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.10 ✔ rsample 1.3.1
## ✔ dials 1.4.2 ✔ tailor 0.1.0
## ✔ infer 1.0.9 ✔ tune 2.0.0
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.3.3 ✔ workflowsets 1.1.1
## ✔ recipes 1.3.1 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
library(skimr)
library(GGally)
skim(Auto)
| Name | Auto |
| Number of rows | 392 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| name | 0 | 1 | FALSE | 301 | amc: 5, for: 5, toy: 5, amc: 4 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| mpg | 0 | 1 | 23.45 | 7.81 | 9 | 17.00 | 22.75 | 29.00 | 46.6 | ▆▇▆▃▁ |
| cylinders | 0 | 1 | 5.47 | 1.71 | 3 | 4.00 | 4.00 | 8.00 | 8.0 | ▇▁▃▁▅ |
| displacement | 0 | 1 | 194.41 | 104.64 | 68 | 105.00 | 151.00 | 275.75 | 455.0 | ▇▂▂▃▁ |
| horsepower | 0 | 1 | 104.47 | 38.49 | 46 | 75.00 | 93.50 | 126.00 | 230.0 | ▆▇▃▁▁ |
| weight | 0 | 1 | 2977.58 | 849.40 | 1613 | 2225.25 | 2803.50 | 3614.75 | 5140.0 | ▇▇▅▅▂ |
| acceleration | 0 | 1 | 15.54 | 2.76 | 8 | 13.78 | 15.50 | 17.02 | 24.8 | ▁▆▇▂▁ |
| year | 0 | 1 | 75.98 | 3.68 | 70 | 73.00 | 76.00 | 79.00 | 82.0 | ▇▆▇▆▇ |
| origin | 0 | 1 | 1.58 | 0.81 | 1 | 1.00 | 1.00 | 2.00 | 3.0 | ▇▁▂▁▂ |
auto <- as_tibble(Auto) |>
mutate(origin = as.factor(origin)) |>
select(acceleration, mpg, cylinders, displacement, horsepower, weight, year, origin, -name)
glimpse(auto)
## Rows: 392
## Columns: 8
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ year <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
ggpairs(auto)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Pairwise correlations show that acceleration has moderate relationships with several predictors, including a positive association with mpg and negative associations with cylinders, displacement, horsepower, and weight. The strongest correlations overall occur among the predictors themselves that is displacement, horsepower, and weight are highly correlated, and cylinders is closely tied to all three, suggesting clear multicollinearity within the engine size variables. Year is positively related to mpg and negatively related to engine size, reflecting industry shifts toward lighter, more efficient cars over time.
auto |>
ggplot(aes(acceleration)) +
geom_histogram(bins = 30, color = "white") +
labs(title = "Distribution of Acceleration",
y = "Count",
x = "Acceleration (Time to accelerate from 0 to 60 mph) Sec") +
theme_minimal()
The acceleration variable shows a roughly bell-shaped distribution, with
most cars taking about 14–16 seconds to reach 60 mph and fewer
observations at the extreme ends.
auto |>
ggplot(aes(mpg)) +
geom_histogram(bins = 30, color = "white") +
labs(title = "MPG Distribution",
x = "mpg (Miles per gallon)",
y = "Count") +
theme_minimal()
The mpg values are right-skewed, with most cars falling between about 15 and 30 mpg and only a few reaching the higher mileage range.
auto |>
ggplot(aes(cylinders)) +
geom_bar() +
labs(title = "Cylinders",
y = "Count",
x = "Number of Cylinders") +
theme_minimal()
Most cars in the dataset have 4 cylinders, with smaller groups of 6- and 8-cylinder vehicles making up the rest.
auto |>
ggplot(aes(displacement)) +
geom_histogram(bins = 30, color = "white") +
labs(title = "Engine Displacement",
x = "Displacement (Engine displacement in cu. inches)",
y = "Count") +
theme_minimal()
Engine displacement is heavily right-skewed, with most cars clustered around smaller engines (roughly 100–150 cu. inches) and only a few having very large engines.
auto |>
ggplot(aes(horsepower)) +
geom_histogram(bins = 30, color = "white") +
labs(title = "Horsepower",
y = "Count",
x = "Engine Horsepower") +
theme_minimal()
Horsepower is right-skewed, with most cars producing between about 60 and 120 horsepower and fewer models in the higher-power range.
auto |>
ggplot(aes(weight)) +
geom_histogram(bins = 30, color = "white") +
labs(title = "Car Weight",
x = "Weight (Vehicle weight in (lbs))",
y = "Count") +
theme_minimal()
Car weight is right-skewed, with most vehicles weighing between about 2,000 and 3,500 pounds and fewer appearing in the heavier range.
auto |>
ggplot(aes(year)) +
geom_bar() +
labs(title = "Model Year",
x = "Year",
y = "Count") +
scale_x_continuous(breaks = seq(min(auto$year), max(auto$year), by = 1)) +
theme_bw()
The model years are evenly represented across the early 1970s to early 1980s, with each year appearing in similar numbers.
auto |>
ggplot(aes(origin)) +
geom_bar() +
scale_x_discrete(labels = c(
"1" = "American",
"2" = "European",
"3" = "Japanese")) +
labs(
title = "Origin of Cars",
x = "Origin",
y = "Count")
Most cars in the dataset are American, with smaller groups coming from Europe and Japan.
A quick look at the Auto dataset shows a mix of symmetric and skewed variables. Acceleration is fairly balanced with no extreme values, which makes it a stable response for modeling. Several predictors—mpg, displacement, horsepower, and weight—are right-skewed, with most cars clustered at lower values and a small number extending into the higher range. Cylinders follows the expected pattern, with most vehicles using 4, 6, or 8 cylinders. Model year is spread evenly across the early 1970s to early 1980s, and the origin variable shows a clear split among American, European, and Japanese cars. Overall, the response looks well-behaved, while some predictors show skewness that may affect their relationship with acceleration, which the modeling step will help sort out.
I did not create a separate train/test split because the assignment uses 5 fold cross validation as the main way to judge performance. Cross validation trains the model on different parts of the data and checks it on the fold that’s left out, which already gives a solid estimate of how well the model generalizes. Since each fold acts like a temporary test set, there’s no need for an extra split. I also used recipes so that any preprocessing (such as normalizing the numeric predictors) was done inside each fold rather than leaking information across folds.
set.seed(2025)
folds <- vfold_cv(auto, v = 5)
predictors <- c("mpg", "cylinders", "displacement", "horsepower", "weight", "year")
#function to fit CV model for each predictor
cv_single_model <- function(var) {
formula <- as.formula(paste("acceleration ~", var, "+ origin"))
#recipe + workflow
wf <- workflow() |>
add_recipe(
recipe(formula, data = auto) |>
step_normalize(all_numeric_predictors())) |>
add_model(linear_reg() |>
set_engine("lm"))
fit_resamples(
wf,
resamples = folds,
control = control_resamples(save_pred = TRUE)
) |>
collect_metrics() |>
dplyr::filter(.metric == "rmse") |>
dplyr::transmute(model = var, rmse = mean, std_err)
}
cv_single_results <- purrr::map_df(predictors, cv_single_model)
cv_single_results
## # A tibble: 6 × 3
## model rmse std_err
## <chr> <dbl> <dbl>
## 1 mpg 2.49 0.125
## 2 cylinders 2.38 0.128
## 3 displacement 2.28 0.141
## 4 horsepower 1.98 0.122
## 5 weight 2.49 0.155
## 6 year 2.56 0.141
The cross validation results show that horsepower is the strongest single predictor of acceleration, giving the lowest RMSE. This makes sense: cars with more horsepower tend to accelerate faster, and the relationship is fairly direct. Displacement and cylinders also perform reasonably well, while MPG, weight, and year show weaker predictive ability by themselves. Year performs the worst, which suggests that acceleration varies within each model year more than it varies across years. Based on these results, horsepower is the best starting variable for building a forward selection model.
ordered_vars <- cv_single_results |>
arrange(rmse) |>
pull(model)
ordered_vars
## [1] "horsepower" "displacement" "cylinders" "mpg" "weight"
## [6] "year"
cv_model <- function(vars){
formula <- as.formula(
paste("acceleration ~", paste(vars, collapse = " + "), "+ origin")
)
wf <- workflow() |>
add_recipe(
recipe(formula, data = auto) |>
step_normalize(all_numeric_predictors())
) |>
add_model(linear_reg() |> set_engine("lm"))
fit_resamples(
wf,
resamples = folds,
control = control_resamples(save_pred = TRUE)
) |>
collect_metrics() |>
filter(.metric == "rmse") |>
mutate(predictors = paste(vars, collapse = ", "))
}
results_list <- list()
for(i in seq_along(ordered_vars)){
vars <- ordered_vars[1:i]
results_list[[i]] <- cv_model(vars)
}
cv_forward_results <- bind_rows(results_list) |>
group_by(predictors) |>
summarize(rmse = mean(mean))
cv_forward_results
## # A tibble: 6 × 2
## predictors rmse
## <chr> <dbl>
## 1 horsepower 1.98
## 2 horsepower, displacement 1.95
## 3 horsepower, displacement, cylinders 1.95
## 4 horsepower, displacement, cylinders, mpg 1.93
## 5 horsepower, displacement, cylinders, mpg, weight 1.73
## 6 horsepower, displacement, cylinders, mpg, weight, year 1.74
ggplot(cv_forward_results,
aes(x = predictors, y = rmse)) +
geom_line(group = 1) +
geom_point(size = 2) +
labs(
title = "Forward Selection: RMSE by Predictor Combination",
x = "Predictors Included",
y = "RMSE"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
My Best model: acceleration ~ horsepower + displacement + cylinders + mpg + weight + origin
Forward-selection results showed that horsepower was the strongest single predictor of acceleration. As I added predictors one at a time, model performance improved only slightly until weight entered the model. The combination of horsepower, displacement, cylinders, mpg, and weight produced the lowest cross-validated RMSE (1.725), making it the best-performing model. Adding year increased RMSE, so it was not included in the final model. These results suggest that acceleration is most strongly related to engine power, engine size, fuel efficiency, and vehicle weight, while model year contributes little once these other variables are considered.
final_vars <- c("horsepower", "displacement", "cylinders", "mpg", "weight")
final_formula <- as.formula(
paste("acceleration ~", paste(final_vars, collapse = " + "), "+ origin"))
final_wf <- workflow() |>
add_recipe(recipe(final_formula, data = auto) |>
step_normalize(all_numeric_predictors())) |>
add_model(linear_reg() |>
set_engine("lm"))
final_fit <- final_wf |>
fit(auto)
final_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call:
## stats::lm(formula = ..y ~ ., data = data)
##
## Coefficients:
## (Intercept) horsepower displacement cylinders mpg
## 15.50389 -3.26410 -0.73276 -0.28172 -0.04279
## weight origin2 origin3
## 2.60013 0.14094 0.06447
Interpretation
horsepower (−3.264) Cars with higher horsepower tend to accelerate faster. Since acceleration in this dataset is “time to reach 60 mph,” a negative coefficient means more horsepower reduces acceleration time, which matches expectations.
displacement (−0.733) Cars with larger engine displacement also tend to have quicker acceleration. The effect is weaker than horsepower but still negative, meaning bigger engines shorten acceleration time.
cylinders (−0.282) Cars with more cylinders accelerate slightly faster. This effect is small compared to horsepower and displacement, which suggests cylinders mainly matter because they are related to engine size and power.
mpg (−0.043) mpg has a very small negative effect. High mpg cars may be slightly quicker when holding everything else constant, but the effect is small.
weight (+2.600) Heavier cars accelerate more slowly. This is the strongest positive coefficient, meaning weight strongly increases acceleration time.
origin2 (+0.141) – European cars After accounting for horsepower, engine size, weight, and the other predictors, European cars take slightly longer to reach 60 mph compared to American cars.
origin3 (+0.064) – Japanese cars Japanese cars also show a small positive effect, meaning they take slightly longer to accelerate than American cars when holding everything else constant.
final_coefs <- tidy(final_fit)
final_coefs
## # A tibble: 8 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 15.5 0.129 120. 1.15e-306
## 2 horsepower -3.26 0.217 -15.1 1.36e- 40
## 3 displacement -0.733 0.414 -1.77 7.74e- 2
## 4 cylinders -0.282 0.286 -0.986 3.25e- 1
## 5 mpg -0.0428 0.165 -0.259 7.95e- 1
## 6 weight 2.60 0.266 9.77 2.69e- 20
## 7 origin2 0.141 0.291 0.484 6.29e- 1
## 8 origin3 0.0645 0.295 0.219 8.27e- 1
From the final model, only horsepower and weight come out as truly meaningful. Both have very small p-values, which shows their effects are strong and consistent. The other variables (displacement, cylinders, mpg, and origin) don’t reach statistical significance once the main predictors are in the model. This means they don’t add much new information about acceleration beyond what horsepower and weight already explain.
final_cv <- fit_resamples(
final_wf,
resamples = folds,
control = control_resamples(save_pred = TRUE))
final_rmse <- final_cv |>
collect_metrics() |>
filter(.metric == "rmse")
final_rmse
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 1.73 5 0.110 pre0_mod0_post0
The final model produced an average RMSE of about 1.73 across the five folds, with a standard error of roughly 0.11. This means the model’s predictions are, on average, about 1.7 acceleration units away from the true values. The small standard error also shows that the model performed consistently across the folds, which gives confidence that it generalizes well to new data.
final_preds <- final_cv |> collect_predictions()
ggplot(final_preds, aes(.pred, acceleration)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
labs(
title = "Predicted vs Actual Acceleration",
x = "Predicted Acceleration",
y = "Actual Acceleration"
)
## `geom_smooth()` using formula = 'y ~ x'
The predicted vs actual plot shows a strong upward pattern, meaning the model’s predictions closely follow the true acceleration values. Most points cluster around the fitted line, which indicates the model is capturing the main relationship well.
I also fit a reduced model that kept only the statistically significant variables from the full model, which were horsepower and weight. After cross-validating this reduced model using the same 5-fold setup, I compared its RMSE to the full model. This allowed me to check whether the extra, non-significant variables were actually helping the model or just adding noise.
sig_vars <- c("horsepower", "weight")
sig_formula <- as.formula(
paste("acceleration ~", paste(sig_vars, collapse = " + ")))
sig_wf <- workflow() |>
add_recipe(recipe(sig_formula, data = auto) |>
step_normalize(all_numeric_predictors())) |>
add_model(linear_reg() |>
set_engine("lm"))
# fit on full data
sig_fit <- sig_wf |> fit(auto)
tidy(sig_fit)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 15.5 0.0882 176. 0
## 2 horsepower -3.59 0.176 -20.5 1.21e-63
## 3 weight 1.96 0.176 11.1 3.55e-25
In the reduced model, only horsepower and weight were included because they were the only statistically significant predictors from the full model. Both variables remained highly significant with very small p-values. The coefficient for horsepower is negative, meaning cars with more power accelerate faster. The coefficient for weight is positive, showing that heavier cars take longer to reach 60 mph. These results are consistent with the patterns seen in the full model, and they confirm that these two predictors carry most of the signal related to acceleration.
sig_cv <- fit_resamples(
sig_wf,
resamples = folds,
control = control_resamples(save_pred = TRUE))
sig_cv_results <- sig_cv |>
collect_metrics() |>
filter(.metric == "rmse")
sig_cv_results
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 1.74 5 0.115 pre0_mod0_post0
The reduced model, which included only horsepower and weight, had a cross-validated RMSE of about 1.74. This is almost the same as the full model’s RMSE of about 1.73. Since the predictive accuracy barely changed after removing the non-significant variables, the extra predictors did not add much value. This suggests that horsepower and weight carry most of the information needed to predict acceleration, and a simpler model with just these two variables performs just as well as the more complicated version.