Welcome to my STA 631 Modeling Portfolio.
This portfolio demonstrates comprehensive statistical modeling skills using the tidymodels framework in R, with an emphasis on model validation, clarity, and reproducibility.
All examples use the Auto and Diamonds datasets and include:
- Multiple Linear Regression
- Polynomial Regression
- Interaction Models
- Multinomial Logistic Regression
- LDA & QDA
- Poisson Regression
- Ridge & Lasso Penalized Regression
💡 What this example demonstrates
- Fitting and interpreting multiple linear regression models
- Using EDA to check relationships and multicollinearity
- Applying cross-validation for model assessment
- Checking regression assumptions with diagnostic plots
# =======================================================
# STEP 1: Load and prepare Auto data
# =======================================================
library(tidyverse)   # dplyr, tidyr, tibble, ggplot2
library(tidymodels)  # recipes, parsnip, rsample, tune, yardstick, workflows

data("Auto", package = "ISLR2")
auto <- Auto %>%
as_tibble() %>%
mutate(
origin = factor(origin),
horsepower = as.numeric(horsepower)
) %>%
drop_na()
glimpse(auto)
## Rows: 392
## Columns: 9
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
## $ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…
response <- "acceleration"
predictors <- c("mpg", "cylinders", "displacement",
"horsepower", "weight", "year", "origin")
# =======================================================
# STEP 2: Exploratory Data Analysis
# =======================================================
# Univariate summary
vars_to_keep <- c(response, predictors)
auto %>%
dplyr::select(tidyselect::any_of(vars_to_keep)) %>%
summary()
## acceleration mpg cylinders displacement
## Min. : 8.00 Min. : 9.00 Min. :3.000 Min. : 68.0
## 1st Qu.:13.78 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0
## Median :15.50 Median :22.75 Median :4.000 Median :151.0
## Mean :15.54 Mean :23.45 Mean :5.472 Mean :194.4
## 3rd Qu.:17.02 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8
## Max. :24.80 Max. :46.60 Max. :8.000 Max. :455.0
## horsepower weight year origin
## Min. : 46.0 Min. :1613 Min. :70.00 1:245
## 1st Qu.: 75.0 1st Qu.:2225 1st Qu.:73.00 2: 68
## Median : 93.5 Median :2804 Median :76.00 3: 79
## Mean :104.5 Mean :2978 Mean :75.98
## 3rd Qu.:126.0 3rd Qu.:3615 3rd Qu.:79.00
## Max. :230.0 Max. :5140 Max. :82.00
auto %>% dplyr::select(acceleration, horsepower, weight) %>%
pivot_longer(everything(), names_to = "var", values_to = "val") %>%
ggplot(aes(x = val)) +
geom_histogram(bins = 30) +
facet_wrap(~var, scales = "free") +
labs(title = "Univariate Distributions of Key Variables")
# Multicollinearity check
auto_for_pairs <- auto[, c("acceleration", "mpg", "cylinders",
"displacement", "horsepower", "weight", "year")]
GGally::ggpairs(auto_for_pairs)
Note:
displacement, cylinders, horsepower, and weight are strongly correlated with one another, which signals multicollinearity. This motivates keeping only a subset of these predictors in the final model and using cross-validation to confirm that the reduced model predicts about as well as a larger one.
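One way to quantify this is with variance inflation factors. A minimal sketch, assuming the `car` package is installed (it is not loaded elsewhere in this portfolio):

```r
# VIFs for a model containing all the correlated numeric predictors;
# values well above 5-10 flag problematic multicollinearity.
car::vif(lm(acceleration ~ mpg + cylinders + displacement +
              horsepower + weight + year, data = auto))
```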
# =======================================================
# STEP 3: Fit Final MLR Model
# =======================================================
final_formula <- acceleration ~ horsepower + weight + displacement
final_lm <- lm(final_formula, data = auto)
summary(final_lm)
##
## Call:
## lm(formula = final_formula, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1506 -1.0407 -0.2052 0.9333 6.8062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.0729227 0.4839873 35.276 < 2e-16 ***
## horsepower -0.0835315 0.0051858 -16.108 < 2e-16 ***
## weight 0.0030709 0.0002883 10.652 < 2e-16 ***
## displacement -0.0100247 0.0026637 -3.763 0.000193 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.717 on 388 degrees of freedom
## Multiple R-squared: 0.6159, Adjusted R-squared: 0.6129
## F-statistic: 207.4 on 3 and 388 DF, p-value: < 2.2e-16
# 5-fold cross-validation to estimate out-of-sample performance
set.seed(123)
folds <- vfold_cv(auto, v = 5, strata = acceleration)
lm_spec <- linear_reg() %>%
set_engine("lm")
lm_wf <- workflow() %>%
add_model(lm_spec) %>%
add_formula(final_formula)
cv_results <- fit_resamples(
lm_wf,
resamples = folds,
control = control_resamples(save_pred = TRUE)
)
collect_metrics(cv_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 1.73 5 0.0625 pre0_mod0_post0
## 2 rsq standard 0.622 5 0.0295 pre0_mod0_post0
diag <- augment(final_lm) %>%
mutate(std_resid = rstandard(final_lm))
# Residuals vs Fitted
ggplot(diag, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Residuals vs Fitted")
# QQ Plot
ggplot(diag, aes(sample = std_resid)) +
stat_qq() + stat_qq_line() +
labs(title = "Normal Q–Q Plot")
# Scale-location
ggplot(diag, aes(x = .fitted, y = sqrt(abs(std_resid)))) +
geom_point() +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Scale–Location Plot")
💡 What this example demonstrates
- Building a polynomial regression model using tidymodels
- Creating nonlinear features with `step_poly()`
- Using cross-validation to validate the polynomial fit
- Visualizing and interpreting curvature in the horsepower–mpg relationship
# =======================================================
# STEP 1B: Prepare Auto data for polynomial regression
# =======================================================
auto_poly <- Auto %>%
as_tibble() %>%
mutate(horsepower = as.numeric(horsepower)) %>%
drop_na()
# =======================================================
# STEP 2B: Polynomial feature engineering
# =======================================================
poly_rec <- recipe(mpg ~ horsepower, data = auto_poly) %>%
step_poly(horsepower, degree = 2) %>% # creates orthogonal degree-1 and degree-2 terms
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors())
poly_rec
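To see exactly which columns the recipe produces, it can be prepped and baked on the training data (note that `step_poly()` yields orthogonal polynomial columns rather than raw powers):

```r
# Inspect the engineered features created by the recipe.
poly_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  head()
```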
# =======================================================
# STEP 3B: Fit polynomial regression with CV
# =======================================================
# Model specification
poly_spec <- linear_reg() %>%
set_engine("lm")
# Workflow
poly_wf <- workflow() %>%
add_model(poly_spec) %>%
add_recipe(poly_rec)
# 5-fold CV
set.seed(123)
poly_folds <- vfold_cv(auto_poly, v = 5, strata = mpg)
poly_cv <- fit_resamples(
poly_wf,
resamples = poly_folds,
control = control_resamples(save_pred = TRUE)
)
collect_metrics(poly_cv)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 4.37 5 0.239 pre0_mod0_post0
## 2 rsq standard 0.692 5 0.0320 pre0_mod0_post0
poly_fit <- fit(poly_wf, data = auto_poly)
poly_aug <- auto_poly %>%
mutate(.fitted = predict(poly_fit, new_data = auto_poly)$.pred)
ggplot(auto_poly, aes(x = horsepower, y = mpg)) +
geom_point(alpha = 0.5) +
geom_line(data = poly_aug,
aes(x = horsepower, y = .fitted),
linewidth = 1) +
labs(title = "Quadratic Polynomial Fit: mpg vs horsepower")
💡 What this example demonstrates
- Modeling interactions between quantitative and qualitative predictors
- Creating derived features (`volume`, `log_volume`) to address multicollinearity
- Fitting and validating interaction models with tidymodels workflows
- Visualizing how the relationship between log-volume and log-carat varies by cut category
- Using regression diagnostics to assess nonlinearity and model assumptions
# =======================================================
# STEP 1: Load and prepare Diamonds data
# =======================================================
library(ggplot2)
diamonds <- ggplot2::diamonds %>%
as_tibble()
diamonds_clean <- diamonds %>%
filter(x > 0, y > 0, z > 0) %>%
mutate(
volume = x * y * z,
log_carat = log(carat),
log_volume = log(volume),
cut = factor(cut, ordered = FALSE)
)
glimpse(diamonds_clean)
## Rows: 53,920
## Columns: 13
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,…
## $ cut <fct> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, …
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS…
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,…
## $ volume <dbl> 38.20203, 34.50586, 38.07688, 46.72458, 51.91725, 38.69395,…
## $ log_carat <dbl> -1.469676, -1.560648, -1.469676, -1.237874, -1.171183, -1.4…
## $ log_volume <dbl> 3.642889, 3.541129, 3.639607, 3.844270, 3.949651, 3.655683,…
# =======================================================
# STEP 2: EDA for Diamonds
# =======================================================
diamonds_clean %>%
dplyr::select(log_carat, log_volume, depth, table) %>%
cor() %>% round(3)
## log_carat log_volume depth table
## log_carat 1.000 0.999 0.031 0.192
## log_volume 0.999 1.000 0.016 0.181
## depth 0.031 0.016 1.000 -0.296
## table 0.192 0.181 -0.296 1.000
ggplot(diamonds_clean, aes(x = log_volume, y = log_carat)) +
geom_point(alpha = 0.1) +
geom_smooth(method = "lm") +
labs(title = "Log-Carat vs Log-Volume")
ggplot(diamonds_clean, aes(x = cut, y = carat)) +
geom_boxplot() +
scale_y_continuous(trans = "log10") +
labs(title = "Distribution of Carat by Cut Grade")
ggplot(diamonds_clean, aes(x = log_volume, y = log_carat, color = cut)) +
geom_point(alpha = 0.1) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Log-Carat vs Log-Volume by Cut")
# =======================================================
# STEP 3: Fit the log(carat) ~ log(volume) * cut model
# =======================================================
# model formula
formula_diamonds <- log_carat ~ log_volume * cut
lm_diamonds <- lm(formula_diamonds, data = diamonds_clean)
summary(lm_diamonds)
##
## Call:
## lm(formula = formula_diamonds, data = diamonds_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.47426 -0.01014 0.00008 0.01005 1.32058
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.146865 0.007784 -661.194 < 2e-16 ***
## log_volume 1.016385 0.001550 655.803 < 2e-16 ***
## cutGood -0.006643 0.008580 -0.774 0.4387
## cutVery Good 0.016136 0.008067 2.000 0.0455 *
## cutPremium 0.045073 0.008024 5.617 1.95e-08 ***
## cutIdeal 0.013612 0.007947 1.713 0.0867 .
## log_volume:cutGood -0.001673 0.001722 -0.971 0.3314
## log_volume:cutVery Good -0.008386 0.001613 -5.200 2.00e-07 ***
## log_volume:cutPremium -0.014469 0.001601 -9.037 < 2e-16 ***
## log_volume:cutIdeal -0.009912 0.001588 -6.242 4.36e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02834 on 53910 degrees of freedom
## Multiple R-squared: 0.9977, Adjusted R-squared: 0.9977
## F-statistic: 2.545e+06 on 9 and 53910 DF, p-value: < 2.2e-16
# =======================================================
# STEP 4: Diagnostics for the Interaction Model
# =======================================================
diag_diamonds <- augment(lm_diamonds) %>%
mutate(std_resid = rstandard(lm_diamonds))
ggplot(diag_diamonds, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Residuals vs Fitted — Diamonds Interaction Model",
x = "Fitted values",
y = "Residuals")
ggplot(diag_diamonds, aes(sample = std_resid)) +
stat_qq(alpha = 0.5) +
stat_qq_line() +
labs(title = "Normal Q–Q Plot — Diamonds Interaction Model",
x = "Theoretical Quantiles",
y = "Standardized Residuals")
ggplot(diag_diamonds, aes(x = .fitted, y = sqrt(abs(std_resid)))) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Scale–Location Plot — Diamonds Interaction Model",
x = "Fitted values",
y = "√|Standardized Residuals|")
The estimated slope for log_volume (≈ 1.02) shows that diamond weight (log-carat) increases almost proportionally with log-volume, consistent with physical expectations.
💡 What this example demonstrates
- Building multinomial, LDA, and QDA classifiers using tidymodels
- Using recipes and cross-validation for classification models
- Evaluating models with confusion matrices and accuracy metrics
- Comparing model performance in a structured way
# =======================================================
# STEP 1: Load and prepare Auto data (Classification)
# =======================================================
data("Auto", package = "ISLR2")
auto_class <- Auto %>%
as_tibble() %>%
mutate(
origin = factor(origin,
levels = c(1, 2, 3),
labels = c("USA", "Europe", "Japan")),
horsepower = as.numeric(horsepower)
) %>%
drop_na()
auto_class %>% count(origin)
## # A tibble: 3 × 2
## origin n
## <fct> <int>
## 1 USA 245
## 2 Europe 68
## 3 Japan 79
set.seed(123)
class_split <- initial_split(auto_class, prop = 0.8, strata = origin)
train_class <- training(class_split)
test_class <- testing(class_split)
nrow(train_class)
## [1] 313
nrow(test_class)
## [1] 79
# =======================================================
# STEP 2: Recipe and CV folds for classification
# =======================================================
# Classification recipe:
# - Outcome: origin (USA, Europe, Japan)
# - Predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year
# - Preprocessing: remove zero-variance predictors, center and scale numerics
class_rec <- recipe(origin ~ mpg + cylinders + displacement +
horsepower + weight + acceleration + year,
data = train_class) %>%
step_zv(all_predictors()) %>%
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors())
class_rec
# 5-fold cross-validation on training set, stratified by origin
set.seed(123)
class_folds <- vfold_cv(train_class, v = 5, strata = origin)
class_folds
## # 5-fold cross-validation using stratification
## # A tibble: 5 × 2
## splits id
## <list> <chr>
## 1 <split [249/64]> Fold1
## 2 <split [250/63]> Fold2
## 3 <split [250/63]> Fold3
## 4 <split [251/62]> Fold4
## 5 <split [252/61]> Fold5
# =======================================================
# STEP 3: Multinomial logistic regression (origin)
# =======================================================
# Model specification: multinomial logistic regression via nnet
multinom_spec <- multinom_reg(mode = "classification") %>%
set_engine("nnet")
multinom_wf <- workflow() %>%
add_model(multinom_spec) %>%
add_recipe(class_rec)
# 3a. Cross-validated performance (5-fold CV on training set)
set.seed(123)
multinom_cv <- fit_resamples(
multinom_wf,
resamples = class_folds,
control = control_resamples(save_pred = TRUE)
)
collect_metrics(multinom_cv)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.786 5 0.0117 pre0_mod0_post0
## 2 brier_class multiclass 0.139 5 0.00674 pre0_mod0_post0
## 3 roc_auc hand_till 0.871 5 0.0109 pre0_mod0_post0
# 3b–3c. Fit the final model on the training set and predict on the test set
multinom_fit <- fit(multinom_wf, data = train_class)
multinom_preds <- predict(multinom_fit, new_data = test_class, type = "class") %>%
bind_cols(
predict(multinom_fit, new_data = test_class, type = "prob"),
test_class %>% dplyr::select(origin) # attach the true class for evaluation
)
head(multinom_preds)
## # A tibble: 6 × 5
## .pred_class .pred_USA .pred_Europe .pred_Japan origin
## <fct> <dbl> <dbl> <dbl> <fct>
## 1 USA 1.000 2.67e- 5 7.34e- 7 USA
## 2 USA 1.000 9.12e-11 4.37e-11 USA
## 3 USA 1.000 1.15e-11 4.92e-11 USA
## 4 Europe 0.0640 6.10e- 1 3.26e- 1 Japan
## 5 USA 0.966 3.36e- 2 5.83e- 4 USA
## 6 Europe 0.0217 5.67e- 1 4.11e- 1 Japan
# 3d. Confusion matrix and accuracy on test set
multinom_conf <- conf_mat(multinom_preds, truth = origin, estimate = .pred_class)
multinom_conf
## Truth
## Prediction USA Europe Japan
## USA 42 0 5
## Europe 3 9 6
## Japan 4 5 5
multinom_metrics <- multinom_preds %>%
metrics(truth = origin, estimate = .pred_class)
multinom_metrics
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.709
## 2 kap multiclass 0.475
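Beyond accuracy and kappa, one-vs-all ROC curves give a per-class view of the test-set probabilities:

```r
# One-vs-all ROC curves for the three origin classes.
multinom_preds %>%
  roc_curve(truth = origin, .pred_USA, .pred_Europe, .pred_Japan) %>%
  autoplot()
```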
# =======================================================
# STEP 4: Linear Discriminant Analysis (LDA)
# =======================================================
# LDA model specification
# (discrim_linear() lives in the discrim extension package)
library(discrim)
lda_spec <- discrim_linear(mode = "classification") %>%
set_engine("MASS")
lda_wf <- workflow() %>%
add_model(lda_spec) %>%
add_recipe(class_rec)
# 4a. Cross-validated performance (5-fold CV)
set.seed(123)
lda_cv <- fit_resamples(
lda_wf,
resamples = class_folds,
control = control_resamples(save_pred = TRUE)
)
collect_metrics(lda_cv)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.732 5 0.0223 pre0_mod0_post0
## 2 brier_class multiclass 0.168 5 0.00925 pre0_mod0_post0
## 3 roc_auc hand_till 0.826 5 0.0211 pre0_mod0_post0
# 4b–4c. Fit the final model on the training set and predict on the test set
lda_fit <- fit(lda_wf, data = train_class)
lda_preds <- predict(lda_fit, new_data = test_class, type = "class") %>%
bind_cols(
predict(lda_fit, new_data = test_class, type = "prob"),
test_class %>% dplyr::select(origin)
)
head(lda_preds)
## # A tibble: 6 × 5
## .pred_class .pred_USA .pred_Europe .pred_Japan origin
## <fct> <dbl> <dbl> <dbl> <fct>
## 1 USA 0.951 0.0287 0.0206 USA
## 2 USA 0.997 0.00145 0.00178 USA
## 3 USA 0.996 0.00115 0.00261 USA
## 4 Europe 0.0583 0.614 0.328 Japan
## 5 USA 0.609 0.293 0.0981 USA
## 6 Europe 0.0263 0.585 0.389 Japan
# 4d. Confusion matrix and accuracy on test set
lda_conf <- conf_mat(lda_preds, truth = origin, estimate = .pred_class)
lda_conf
## Truth
## Prediction USA Europe Japan
## USA 41 0 1
## Europe 2 11 5
## Japan 6 3 10
lda_metrics <- lda_preds %>%
metrics(truth = origin, estimate = .pred_class)
lda_metrics
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.785
## 2 kap multiclass 0.630
# =======================================================
# STEP 5: Quadratic Discriminant Analysis (QDA)
# =======================================================
qda_spec <- discrim_quad(mode = "classification") %>%
set_engine("MASS")
qda_wf <- workflow() %>%
add_model(qda_spec) %>%
add_recipe(class_rec)
# 5a. Cross-validated performance (5-fold CV)
set.seed(123)
qda_cv <- fit_resamples(
qda_wf,
resamples = class_folds,
control = control_resamples(save_pred = TRUE)
)
collect_metrics(qda_cv)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.788 5 0.0210 pre0_mod0_post0
## 2 brier_class multiclass 0.161 5 0.0163 pre0_mod0_post0
## 3 roc_auc hand_till 0.874 5 0.0103 pre0_mod0_post0
# 5b–5c. Fit the final model on the training set and predict on the test set
qda_fit <- fit(qda_wf, data = train_class)
qda_preds <- predict(qda_fit, new_data = test_class, type = "class") %>%
bind_cols(
predict(qda_fit, new_data = test_class, type = "prob"),
test_class %>% dplyr::select(origin)
)
head(qda_preds)
## # A tibble: 6 × 5
## .pred_class .pred_USA .pred_Europe .pred_Japan origin
## <fct> <dbl> <dbl> <dbl> <fct>
## 1 USA 1 1.70e- 54 2.47e- 26 USA
## 2 USA 1 3.65e-132 2.94e- 87 USA
## 3 USA 1 4.58e-145 1.45e-101 USA
## 4 Japan 0.0124 2.87e- 1 7.01e- 1 Japan
## 5 USA 1.000 1.91e- 13 2.67e- 8 USA
## 6 Japan 0.00968 3.99e- 1 5.91e- 1 Japan
# Confusion matrix and accuracy on test set
qda_conf <- conf_mat(qda_preds, truth = origin, estimate = .pred_class)
qda_conf
## Truth
## Prediction USA Europe Japan
## USA 44 0 0
## Europe 3 7 5
## Japan 2 7 11
qda_metrics <- qda_preds %>%
metrics(truth = origin, estimate = .pred_class)
qda_metrics
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.785
## 2 kap multiclass 0.622
# =======================================================
# STEP 6: Compare Cross-Validated and Test-Set Accuracy
# =======================================================
# ---- Cross-validated accuracy ----
cv_multinom <- multinom_cv %>% collect_metrics() %>% filter(.metric == "accuracy") %>% select(mean)
cv_lda <- lda_cv %>% collect_metrics() %>% filter(.metric == "accuracy") %>% select(mean)
cv_qda <- qda_cv %>% collect_metrics() %>% filter(.metric == "accuracy") %>% select(mean)
# ---- Test-set accuracy ----
test_multinom <- multinom_metrics %>% filter(.metric == "accuracy") %>% select(.estimate)
test_lda <- lda_metrics %>% filter(.metric == "accuracy") %>% select(.estimate)
test_qda <- qda_metrics %>% filter(.metric == "accuracy") %>% select(.estimate)
# ---- Merging into one comparison tibble ----
comparison_table <- tibble(
Model = c("Multinomial Logistic Regression", "LDA", "QDA"),
CV_Accuracy = c(cv_multinom$mean, cv_lda$mean, cv_qda$mean),
Test_Accuracy = c(test_multinom$.estimate, test_lda$.estimate, test_qda$.estimate)
)
knitr::kable(
comparison_table,
caption = "Comparison of Classification Models (CV & Test Accuracy)",
digits = 3,
align = c("l", "c", "c")
)
| Model | CV_Accuracy | Test_Accuracy |
|---|---|---|
| Multinomial Logistic Regression | 0.786 | 0.709 |
| LDA | 0.732 | 0.785 |
| QDA | 0.788 | 0.785 |
Overall, all three classifiers performed well, with accuracies ranging from about 0.71 to 0.79. LDA and QDA achieved the strongest test-set performance (≈ 0.785), while multinomial logistic regression remained competitive but slightly weaker on unseen data. USA-origin cars were classified most reliably across models, with more confusion between Europe and Japan due to similar engine and weight characteristics. These results show how model flexibility (linear vs. quadratic boundaries) influences classification accuracy.
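The Europe–Japan confusion described above is easy to see by plotting any of the confusion matrices; for example:

```r
# Heatmap view of the LDA confusion matrix.
autoplot(lda_conf, type = "heatmap")
```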
💡 What this example demonstrates
- Fitting a Poisson regression model for count outcomes
- Modeling cylinder count using engine characteristics (displacement, horsepower, weight, mpg, year)
- Checking Poisson assumptions through overdispersion tests and diagnostic plots
- Interpreting multiplicative (rate) effects from a log-linear model
- Using tidymodels workflows alongside base `glm()` for GLM diagnostics
# =======================================================
# STEP 1: Load and prepare Auto data (Poisson)
# =======================================================
data("Auto", package = "ISLR2") # or ISLR if needed
auto_pois <- Auto %>%
as_tibble() %>%
mutate(
cylinders = as.integer(cylinders),
horsepower = as.numeric(horsepower)
) %>%
drop_na()
# Quick check of the distribution of cylinders
auto_pois %>%
count(cylinders)
## # A tibble: 5 × 2
## cylinders n
## <int> <int>
## 1 3 4
## 2 4 199
## 3 5 3
## 4 6 83
## 5 8 103
# We model cylinders as a count response, using predictors:
# displacement, horsepower, weight, year, and mpg
# =======================================================
# STEP 2: EDA for Poisson Regression
# =======================================================
ggplot(auto_pois, aes(x = cylinders)) +
geom_bar(fill = "steelblue") +
labs(
title = "Distribution of Cylinder Counts",
x = "Number of Cylinders",
y = "Count of Cars"
)
auto_pois %>%
dplyr::select(cylinders, displacement, horsepower, weight, mpg, year) %>%
pivot_longer(-cylinders, names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value, y = cylinders)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess") +
facet_wrap(~ variable, scales = "free") +
labs(
title = "Cylinders vs. Key Predictors",
x = "Predictor Value",
y = "Cylinders"
)
ggplot(auto_pois, aes(x = factor(year), y = cylinders)) +
geom_boxplot(outlier.alpha = 0.3) +
labs(
title = "Cylinders Across Model Years",
x = "Year",
y = "Cylinders"
) +
theme(axis.text.x = element_text(angle = 90))
cyl_mean <- mean(auto_pois$cylinders)
cyl_var <- var(auto_pois$cylinders)
cyl_summary <- tibble(
Mean_Cylinders = cyl_mean,
Variance_Cylinders = cyl_var,
Overdispersion = cyl_var > cyl_mean
)
cyl_summary
## # A tibble: 1 × 3
## Mean_Cylinders Variance_Cylinders Overdispersion
## <dbl> <dbl> <lgl>
## 1 5.47 2.91 FALSE
# =======================================================
# STEP 3: Fit Poisson Regression Model
# =======================================================
# Model specification
# (poisson_reg() lives in the poissonreg extension package)
library(poissonreg)
poisson_spec <- poisson_reg(mode = "regression") %>%
set_engine("glm")
# Workflow
poisson_wf <- workflow() %>%
add_model(poisson_spec) %>%
add_formula(cylinders ~ displacement + horsepower + weight + mpg + year)
# 3a. 5-fold cross-validation
set.seed(123)
poisson_folds <- vfold_cv(auto_pois, v = 5, strata = cylinders)
poisson_cv <- fit_resamples(
poisson_wf,
resamples = poisson_folds,
control = control_resamples(save_pred = TRUE)
)
collect_metrics(poisson_cv)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 0.595 5 0.0382 pre0_mod0_post0
## 2 rsq standard 0.882 5 0.0124 pre0_mod0_post0
poisson_glm <- glm(
cylinders ~ displacement + horsepower + weight + mpg + year,
data = auto_pois,
family = poisson()
)
summary(poisson_glm)
##
## Call:
## glm(formula = cylinders ~ displacement + horsepower + weight +
## mpg + year, family = poisson(), data = auto_pois)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.020e+00 5.509e-01 1.851 0.0641 .
## displacement 2.542e-03 6.413e-04 3.964 7.37e-05 ***
## horsepower -1.183e-03 1.251e-03 -0.945 0.3446
## weight 2.675e-05 7.738e-05 0.346 0.7296
## mpg -6.089e-03 6.619e-03 -0.920 0.3576
## year 4.360e-03 8.219e-03 0.531 0.5957
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 201.717 on 391 degrees of freedom
## Residual deviance: 21.527 on 386 degrees of freedom
## AIC: 1415
##
## Number of Fisher Scoring iterations: 4
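Because the model is log-linear, exponentiated coefficients act as multiplicative (rate) effects on the expected cylinder count. A quick sketch (the profile-likelihood intervals from `confint()` can take a moment to compute):

```r
# Multiplicative effect on expected cylinder count per one-unit
# increase in each predictor, with 95% profile-likelihood intervals.
exp(cbind(rate_ratio = coef(poisson_glm), confint(poisson_glm)))
```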
poisson_preds <- auto_pois %>%
mutate(
.pred = predict(poisson_glm, type = "response")
) %>%
dplyr::select(cylinders, .pred)
head(poisson_preds)
## # A tibble: 6 × 2
## cylinders .pred
## <int> <dbl>
## 1 8 6.93
## 2 8 7.59
## 3 8 6.95
## 4 8 6.79
## 5 8 6.79
## 6 8 9.08
# =======================================================
# STEP 4: Diagnostics for Poisson Regression
# =======================================================
poisson_diag <- augment(poisson_glm) %>%
mutate(
std_resid = rstandard(poisson_glm)
)
ggplot(poisson_diag, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess", se = FALSE) +
labs(
title = "Residuals vs Fitted — Poisson Model",
x = "Fitted Cylinder Count",
y = "Residuals"
)
ggplot(poisson_diag, aes(sample = std_resid)) +
stat_qq(alpha = 0.5) +
stat_qq_line() +
labs(
title = "Normal Q–Q Plot — Poisson Residuals",
x = "Theoretical Quantiles",
y = "Standardized Residuals"
)
ggplot(poisson_diag, aes(x = .fitted, y = sqrt(abs(std_resid)))) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess", se = FALSE) +
labs(
title = "Scale–Location Plot — Poisson Model",
x = "Fitted Values",
y = "√|Standardized Residuals|"
)
# Overdispersion test:
# If ratio >> 1, the model may be overdispersed.
residual_deviance <- deviance(poisson_glm)
df_residual <- df.residual(poisson_glm)
overdispersion_ratio <- residual_deviance / df_residual
tibble(
Residual_Deviance = residual_deviance,
DF_Residual = df_residual,
Overdispersion_Ratio = overdispersion_ratio
)
## # A tibble: 1 × 3
## Residual_Deviance DF_Residual Overdispersion_Ratio
## <dbl> <int> <dbl>
## 1 21.5 386 0.0558
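As a cross-check on the deviance-based ratio, the Pearson statistic gives the same conclusion:

```r
# Pearson-based dispersion estimate; values near 1 are consistent
# with the Poisson assumption that the variance equals the mean.
sum(residuals(poisson_glm, type = "pearson")^2) / df.residual(poisson_glm)
```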
The Poisson model fits the cylinder counts well, with a dispersion ratio far below 1 (≈ 0.056). This indicates underdispersion rather than overdispersion: the variance of cylinder counts is much smaller than a Poisson model would imply, since cylinder count is nearly determined by engine size, so the Poisson standard errors are if anything conservative. Among the predictors, only displacement shows a significant positive effect on cylinder count, consistent with the idea that larger engines generally have more cylinders. Other variables such as horsepower, weight, mpg, and year do not contribute meaningfully after accounting for displacement. Diagnostic plots show mild curvature but no major deviations, suggesting that the model provides a reasonable and interpretable approximation for count data in this setting.
💡 What this example demonstrates
- Applying penalized regression methods (ridge and lasso) using tidymodels
- Creating dummy variables and normalizing predictors for glmnet
- Tuning the penalty parameter λ through cross-validation
- Comparing model performance using test-set RMSE
- Understanding shrinkage and feature selection in regularized regression
# =======================================================
# STEP 1: Prepare Auto data for penalized regression
# =======================================================
data("Auto", package = "ISLR2") # or ISLR
auto_pen <- Auto %>%
as_tibble() %>%
mutate(
origin = factor(origin),
horsepower = as.numeric(horsepower)
) %>%
drop_na()
# Train/test split stratified by mpg (for stability)
set.seed(123)
pen_split <- initial_split(auto_pen, prop = 0.8, strata = mpg)
pen_train <- training(pen_split)
pen_test <- testing(pen_split)
nrow(pen_train)
## [1] 312
nrow(pen_test)
## [1] 80
# =======================================================
# STEP 2: Define recipe for penalized regression
# =======================================================
pen_rec <- recipe(mpg ~ cylinders + displacement + horsepower +
weight + acceleration + year + origin,
data = pen_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
pen_rec
# =======================================================
# STEP 3: Ridge and Lasso specs + tuning grid
# =======================================================
# Ridge: mixture = 0
ridge_spec <- linear_reg(
mode = "regression",
penalty = tune(),
mixture = 0
) %>%
set_engine("glmnet")
# Lasso: mixture = 1
lasso_spec <- linear_reg(
mode = "regression",
penalty = tune(),
mixture = 1
) %>%
set_engine("glmnet")
# Workflows
ridge_wf <- workflow() %>%
add_model(ridge_spec) %>%
add_recipe(pen_rec)
lasso_wf <- workflow() %>%
add_model(lasso_spec) %>%
add_recipe(pen_rec)
set.seed(123)
pen_folds <- vfold_cv(pen_train, v = 5, strata = mpg)
# Penalty grid (lambda)
lambda_grid <- grid_regular(penalty(), levels = 30)
lambda_grid
## # A tibble: 30 × 1
## penalty
## <dbl>
## 1 1 e-10
## 2 2.21e-10
## 3 4.89e-10
## 4 1.08e- 9
## 5 2.40e- 9
## 6 5.30e- 9
## 7 1.17e- 8
## 8 2.59e- 8
## 9 5.74e- 8
## 10 1.27e- 7
## # ℹ 20 more rows
# =======================================================
# STEP 4: Tune penalty for ridge and lasso
# =======================================================
set.seed(123)
ridge_tune <- tune_grid(
ridge_wf,
resamples = pen_folds,
grid = lambda_grid,
control = control_grid(save_pred = TRUE)
)
set.seed(123)
lasso_tune <- tune_grid(
lasso_wf,
resamples = pen_folds,
grid = lambda_grid,
control = control_grid(save_pred = TRUE)
)
ridge_tune %>%
collect_metrics() %>%
filter(.metric == "rmse") %>%
arrange(mean) %>%
slice(1:5)
## # A tibble: 5 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 e-10 rmse standard 3.42 5 0.135 pre0_mod01_post0
## 2 2.21e-10 rmse standard 3.42 5 0.135 pre0_mod02_post0
## 3 4.89e-10 rmse standard 3.42 5 0.135 pre0_mod03_post0
## 4 1.08e- 9 rmse standard 3.42 5 0.135 pre0_mod04_post0
## 5 2.40e- 9 rmse standard 3.42 5 0.135 pre0_mod05_post0
lasso_tune %>%
collect_metrics() %>%
filter(.metric == "rmse") %>%
arrange(mean) %>%
slice(1:5)
## # A tibble: 5 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0924 rmse standard 3.35 5 0.132 pre0_mod27_post0
## 2 0.0418 rmse standard 3.36 5 0.131 pre0_mod26_post0
## 3 0.00853 rmse standard 3.36 5 0.133 pre0_mod24_post0
## 4 0.0189 rmse standard 3.36 5 0.132 pre0_mod25_post0
## 5 0.0000000001 rmse standard 3.36 5 0.133 pre0_mod01_post0
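Plotting RMSE across the penalty grid makes the ridge/lasso comparison visual:

```r
# Cross-validated RMSE as a function of the penalty for each model.
autoplot(ridge_tune, metric = "rmse")
autoplot(lasso_tune, metric = "rmse")
```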
# =======================================================
# STEP 5: Finalize best ridge and lasso models
# =======================================================
best_ridge <- select_best(ridge_tune, metric = "rmse")
best_lasso <- select_best(lasso_tune, metric = "rmse")
best_ridge
## # A tibble: 1 × 2
## penalty .config
## <dbl> <chr>
## 1 0.0000000001 pre0_mod01_post0
best_lasso
## # A tibble: 1 × 2
## penalty .config
## <dbl> <chr>
## 1 0.0924 pre0_mod27_post0
# Final workflows
ridge_final_wf <- finalize_workflow(ridge_wf, best_ridge)
lasso_final_wf <- finalize_workflow(lasso_wf, best_lasso)
# Fit on full training data
ridge_final_fit <- fit(ridge_final_wf, data = pen_train)
lasso_final_fit <- fit(lasso_final_wf, data = pen_train)
# Predict on test data
ridge_test_preds <- predict(ridge_final_fit, new_data = pen_test) %>%
bind_cols(pen_test %>% dplyr::select(mpg))
lasso_test_preds <- predict(lasso_final_fit, new_data = pen_test) %>%
bind_cols(pen_test %>% dplyr::select(mpg))
# Test metrics
ridge_test_metrics <- ridge_test_preds %>%
metrics(truth = mpg, estimate = .pred)
lasso_test_metrics <- lasso_test_preds %>%
metrics(truth = mpg, estimate = .pred)
ridge_test_metrics
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 3.52
## 2 rsq standard 0.817
## 3 mae standard 2.45
lasso_test_metrics
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 3.38
## 2 rsq standard 0.829
## 3 mae standard 2.41
ridge_rmse <- ridge_test_metrics %>%
dplyr::filter(.metric == "rmse") %>%
dplyr::pull(.estimate)
lasso_rmse <- lasso_test_metrics %>%
dplyr::filter(.metric == "rmse") %>%
dplyr::pull(.estimate)
comparison_table <- tibble(
Model = c("Ridge", "Lasso"),
Test_RMSE = c(ridge_rmse, lasso_rmse)
)
knitr::kable(
comparison_table,
caption = "Test RMSE Comparison: Ridge vs Lasso Regression",
digits = 3,
align = c("l", "c")
)
| Model | Test_RMSE |
|---|---|
| Ridge | 3.519 |
| Lasso | 3.382 |
Both ridge and lasso regression produced strong predictive performance for mpg, with lasso achieving the lowest test RMSE (≈ 3.38) compared to ridge (≈ 3.52). This suggests that a sparser model with selective coefficient shrinkage provides a slightly better balance between bias and variance for this dataset. Ridge retained all predictors with small shrinkage, whereas lasso likely removed weaker predictors by shrinking their coefficients toward zero. Overall, the difference in performance is modest, but lasso offers a more interpretable model with comparable or better accuracy.
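The claim that lasso drops weak predictors can be verified directly from the fitted coefficients; a sketch (`lasso_coefs` is a name introduced here):

```r
# Coefficients of the finalized lasso fit at the selected penalty.
lasso_coefs <- lasso_final_fit %>%
  extract_fit_parsnip() %>%
  tidy()

lasso_coefs                            # all terms
lasso_coefs %>% filter(estimate == 0)  # terms removed by the L1 penalty
```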
This portfolio demonstrates a comprehensive set of modeling skills using the tidymodels ecosystem, including regression, classification, interaction modeling, count data modeling, and penalized methods. Across all examples, the emphasis was placed on clean data preparation, thoughtful exploratory analysis, appropriate model validation, and interpretable results.
The models used here show how different statistical approaches respond to nonlinear patterns, multicollinearity, categorical interactions, and varying levels of model flexibility. Together, these examples highlight the importance of understanding both the mathematical structure of models and the practical behavior of real-world datasets.