Portfolio Overview

Welcome to my STA 631 Modeling Portfolio.
This portfolio demonstrates comprehensive statistical modeling skills using the tidymodels framework in R, with an emphasis on model validation, clarity, and reproducibility.

This portfolio is structured to serve two audiences:

  • Academic: Demonstrates mastery of course learning objectives
  • Professional: Presents clean, modern examples of applied modeling

All examples use the Auto and Diamonds datasets and include:
- Multiple Linear Regression
- Polynomial Regression
- Interaction Models
- Multinomial Logistic Regression
- LDA & QDA
- Poisson Regression
- Ridge & Lasso Penalized Regression

Methods Used in This Portfolio

🧠 Regression Models

  • MLR — Multiple Linear Regression
  • POLY — Polynomial Regression
  • INT — Interaction Models
  • RIDGE — Ridge Regression
  • LASSO — Lasso Regression

🎯 Classification Models

  • MULTINOM — Multinomial Logistic Regression
  • LDA — Linear Discriminant Analysis
  • QDA — Quadratic Discriminant Analysis

🔢 Count Models

  • POISSON — Poisson Regression
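
The code throughout assumes a common set of packages is loaded up front. A minimal setup sketch (package names inferred from the function calls used later; discrim and poissonreg are the parsnip extension packages that supply the discriminant-analysis and Poisson engines):

# Packages assumed throughout this portfolio
library(tidymodels)   # recipes, parsnip, workflows, rsample, tune, yardstick, broom
library(tidyverse)    # dplyr, tidyr, ggplot2 for wrangling and EDA
library(ISLR2)        # Auto dataset
library(discrim)      # discrim_linear(), discrim_quad()
library(poissonreg)   # poisson_reg()
library(GGally)       # ggpairs() pair plots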

SECTION 1: Multiple Linear Regression (Auto)

Example 1: Multiple Linear Regression — Predicting Acceleration (Auto)

💡 What this example demonstrates

  • Fitting and interpreting multiple linear regression models
  • Using EDA to check relationships and multicollinearity
  • Applying cross-validation for model assessment
  • Checking regression assumptions with diagnostic plots

STEP 1 — Load and prepare Auto data

# =======================================================
# STEP 1: Load and prepare Auto data 
# =======================================================

data("Auto", package = "ISLR2")

auto <- Auto %>%
  as_tibble() %>%
  mutate(
    origin = factor(origin),
    horsepower = as.numeric(horsepower)
  ) %>%
  drop_na()

glimpse(auto)
## Rows: 392
## Columns: 9
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders    <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight       <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year         <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
## $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…
response <- "acceleration"
predictors <- c("mpg", "cylinders", "displacement",
                "horsepower", "weight", "year", "origin")

STEP 2 — Exploratory Data Analysis

# =======================================================
# STEP 2 — Exploratory Data Analysis
# =======================================================

# Univariate summary
vars_to_keep <- c(response, predictors)

auto %>%
  dplyr::select(tidyselect::any_of(vars_to_keep)) %>%
  summary()
##   acceleration        mpg          cylinders      displacement  
##  Min.   : 8.00   Min.   : 9.00   Min.   :3.000   Min.   : 68.0  
##  1st Qu.:13.78   1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0  
##  Median :15.50   Median :22.75   Median :4.000   Median :151.0  
##  Mean   :15.54   Mean   :23.45   Mean   :5.472   Mean   :194.4  
##  3rd Qu.:17.02   3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8  
##  Max.   :24.80   Max.   :46.60   Max.   :8.000   Max.   :455.0  
##    horsepower        weight          year       origin 
##  Min.   : 46.0   Min.   :1613   Min.   :70.00   1:245  
##  1st Qu.: 75.0   1st Qu.:2225   1st Qu.:73.00   2: 68  
##  Median : 93.5   Median :2804   Median :76.00   3: 79  
##  Mean   :104.5   Mean   :2978   Mean   :75.98          
##  3rd Qu.:126.0   3rd Qu.:3615   3rd Qu.:79.00          
##  Max.   :230.0   Max.   :5140   Max.   :82.00
auto %>% dplyr::select(acceleration, horsepower, weight) %>%
  pivot_longer(everything(), names_to = "var", values_to = "val") %>%
  ggplot(aes(x = val)) +
  geom_histogram(bins = 30) +
  facet_wrap(~var, scales = "free") +
  labs(title = "Univariate Distributions of Key Variables")

# Multicollinearity check 
auto_for_pairs <- auto[, c("acceleration", "mpg", "cylinders",
                           "displacement", "horsepower", "weight", "year")]

GGally::ggpairs(auto_for_pairs)

Note:
Strong correlations among displacement, cylinders, horsepower, and weight indicate substantial multicollinearity. That motivates keeping the final predictor set small and judging the model by cross-validated performance rather than over-interpreting individual coefficients; a VIF check is sketched below.
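
One way to quantify that collinearity is with variance inflation factors. A quick sketch, assuming the car package is installed (its vif() returns generalized VIFs when factors such as origin are present):

# VIFs for the full additive model; values above ~5-10 flag
# predictors whose coefficients are unstable due to collinearity.
car::vif(lm(acceleration ~ mpg + cylinders + displacement +
              horsepower + weight + year + origin, data = auto))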

STEP 3 — Fit final MLR model

# =======================================================
# STEP 3: Fit Final MLR Model
# =======================================================

final_formula <- acceleration ~ horsepower + weight + displacement
final_lm <- lm(final_formula, data = auto)

summary(final_lm)
## 
## Call:
## lm(formula = final_formula, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1506 -1.0407 -0.2052  0.9333  6.8062 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.0729227  0.4839873  35.276  < 2e-16 ***
## horsepower   -0.0835315  0.0051858 -16.108  < 2e-16 ***
## weight        0.0030709  0.0002883  10.652  < 2e-16 ***
## displacement -0.0100247  0.0026637  -3.763 0.000193 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.717 on 388 degrees of freedom
## Multiple R-squared:  0.6159, Adjusted R-squared:  0.6129 
## F-statistic: 207.4 on 3 and 388 DF,  p-value: < 2.2e-16

STEP 3B — Cross-Validation (5-fold)

set.seed(123)

folds <- vfold_cv(auto, v = 5, strata = acceleration)

lm_spec <- linear_reg() %>%
  set_engine("lm")

lm_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_formula(final_formula)

cv_results <- fit_resamples(
  lm_wf,
  resamples = folds,
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(cv_results)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config        
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
## 1 rmse    standard   1.73      5  0.0625 pre0_mod0_post0
## 2 rsq     standard   0.622     5  0.0295 pre0_mod0_post0

STEP 4 — Diagnostics

diag <- augment(final_lm) %>%
  mutate(std_resid = rstandard(final_lm))

# Residuals vs Fitted
ggplot(diag, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Residuals vs Fitted")

# QQ Plot
ggplot(diag, aes(sample = std_resid)) +
  stat_qq() + stat_qq_line() +
  labs(title = "Normal Q–Q Plot")

# Scale-location
ggplot(diag, aes(x = .fitted, y = sqrt(abs(std_resid)))) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Scale–Location Plot")

STEP 5 — Short Interpretation

  • Horsepower, weight, and displacement are all statistically significant predictors of acceleration (p < 0.001).
  • Higher horsepower and higher displacement are associated with lower acceleration time (cars accelerate faster).
  • Higher weight is associated with higher acceleration time (heavier cars accelerate more slowly).
  • Model performance is solid: cross-validated RMSE ≈ 1.73 and R² ≈ 0.62, meaning the model explains roughly 62% of the variation in acceleration.
  • Diagnostic plots show mild curvature in the residuals vs. fitted values, suggesting a polynomial or nonlinear term could further reduce systematic patterns; a quick nested-model check is sketched after this list.
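
As referenced in the last bullet, a partial F-test gives a quick check of whether a quadratic horsepower term helps. A sketch (exploratory only, not part of the final model):

# poly(horsepower, 2) spans the linear horsepower term, so the two
# models are nested and anova() performs a valid partial F-test.
quad_lm <- lm(acceleration ~ poly(horsepower, 2) + weight + displacement,
              data = auto)
anova(final_lm, quad_lm)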

SECTION 1B: Polynomial Regression (Auto)

Example 1B: Predicting mpg using a quadratic polynomial of horsepower

💡 What this example demonstrates

  • Building a polynomial regression model using tidymodels
  • Creating nonlinear features with step_poly()
  • Using cross-validation to validate the polynomial fit
  • Visualizing and interpreting curvature in the horsepower–mpg relationship

STEP 1 — Use Auto dataset for polynomial regression

# =======================================================
# STEP 1B: Prepare Auto data for polynomial regression
# =======================================================

auto_poly <- Auto %>%
  as_tibble() %>%
  mutate(horsepower = as.numeric(horsepower)) %>%
  drop_na()

STEP 2 — Create quadratic term for horsepower

# =======================================================
# STEP 2B: Polynomial feature engineering
# =======================================================

poly_rec <- recipe(mpg ~ horsepower, data = auto_poly) %>%
  step_poly(horsepower, degree = 2) %>%   # orthogonal polynomial basis, degrees 1 and 2
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

poly_rec
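
To see exactly what the recipe produces, it can be prepped and baked. A short sketch:

# prep() estimates the steps on the data; bake(new_data = NULL) returns
# the processed training set: two orthogonal polynomial columns
# (horsepower_poly_1, horsepower_poly_2) plus the outcome.
poly_rec %>%
  prep(training = auto_poly) %>%
  bake(new_data = NULL) %>%
  head()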

STEP 3 — Fit quadratic regression model using tidymodels

# =======================================================
# STEP 3B: Fit polynomial regression with CV
# =======================================================

# Model specification
poly_spec <- linear_reg() %>%
  set_engine("lm")

# Workflow
poly_wf <- workflow() %>%
  add_model(poly_spec) %>%
  add_recipe(poly_rec)

# 5-fold CV
set.seed(123)
poly_folds <- vfold_cv(auto_poly, v = 5, strata = mpg)

poly_cv <- fit_resamples(
  poly_wf,
  resamples = poly_folds,
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(poly_cv)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config        
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
## 1 rmse    standard   4.37      5  0.239  pre0_mod0_post0
## 2 rsq     standard   0.692     5  0.0320 pre0_mod0_post0
poly_fit <- fit(poly_wf, data = auto_poly)

STEP 4 — Fit final model and visualize curvature

poly_aug <- auto_poly %>%
  mutate(.fitted = predict(poly_fit, new_data = auto_poly)$.pred)


ggplot(auto_poly, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_line(data = poly_aug,
            aes(x = horsepower, y = .fitted),
            linewidth = 1) +
  labs(title = "Quadratic Polynomial Fit: mpg vs horsepower")

STEP 5 — Interpretation

  • The quadratic model captures the strong nonlinear relationship between horsepower and fuel efficiency (mpg).
  • As horsepower increases, mpg decreases sharply at first and then levels off, which aligns with automotive expectations.
  • Cross-validation shows solid predictive performance (RMSE ≈ 4.37, R² ≈ 0.69), noticeably better than a straight-line model would achieve; a baseline comparison is sketched after this list.
  • The fitted curve follows the data closely without overfitting, demonstrating that a second-degree polynomial is an appropriate level of model complexity.
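
To back the baseline comparison above, the same folds can score a straight-line model. A sketch using update_recipe() on the existing workflow:

# Degree-1 baseline under identical resamples; its RMSE should come out
# noticeably higher than the quadratic fit's ~4.37.
lin_rec <- recipe(mpg ~ horsepower, data = auto_poly)
lin_cv <- fit_resamples(
  poly_wf %>% update_recipe(lin_rec),
  resamples = poly_folds
)
collect_metrics(lin_cv)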

SECTION 2: Diamonds — Interactions with Qualitative Predictors

Example 2: Predicting Log-Carat from Log-Volume × Cut

💡 What this example demonstrates

  • Modeling interactions between quantitative and qualitative predictors
  • Creating derived features (volume, log_volume) to address multicollinearity
  • Fitting and validating interaction models with tidymodels workflows
  • Visualizing how the relationship between log-volume and log-carat varies by cut category
  • Using regression diagnostics to assess nonlinearity and model assumptions

STEP 1 — Load and prepare Diamonds data

# =======================================================
# STEP 1: Load and prepare Diamonds data
# =======================================================


library(ggplot2)

diamonds <- ggplot2::diamonds %>%
  as_tibble()

diamonds_clean <- diamonds %>%
  filter(x > 0, y > 0, z > 0) %>%
  mutate(
    volume = x * y * z,
    log_carat = log(carat),
    log_volume = log(volume),
    cut = factor(cut, ordered = FALSE)
  )

glimpse(diamonds_clean)
## Rows: 53,920
## Columns: 13
## $ carat      <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,…
## $ cut        <fct> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, …
## $ color      <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,…
## $ clarity    <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS…
## $ depth      <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,…
## $ table      <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,…
## $ price      <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,…
## $ x          <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,…
## $ y          <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,…
## $ z          <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,…
## $ volume     <dbl> 38.20203, 34.50586, 38.07688, 46.72458, 51.91725, 38.69395,…
## $ log_carat  <dbl> -1.469676, -1.560648, -1.469676, -1.237874, -1.171183, -1.4…
## $ log_volume <dbl> 3.642889, 3.541129, 3.639607, 3.844270, 3.949651, 3.655683,…

STEP 2 — Exploratory Data Analysis

# =======================================================
# STEP 2: EDA for Diamonds
# =======================================================

diamonds_clean %>%
  dplyr::select(log_carat, log_volume, depth, table) %>%
  cor() %>% round(3)
##            log_carat log_volume  depth  table
## log_carat      1.000      0.999  0.031  0.192
## log_volume     0.999      1.000  0.016  0.181
## depth          0.031      0.016  1.000 -0.296
## table          0.192      0.181 -0.296  1.000
ggplot(diamonds_clean, aes(x = log_volume, y = log_carat)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm") +
  labs(title = "Log-Carat vs Log-Volume")

ggplot(diamonds_clean, aes(x = cut, y = carat)) +
  geom_boxplot() +
  scale_y_continuous(trans = "log10") +
  labs(title = "Distribution of Carat by Cut Grade")

ggplot(diamonds_clean, aes(x = log_volume, y = log_carat, color = cut)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Log-Carat vs Log-Volume by Cut")

STEP 3 — Fit the Log-Volume × Cut Interaction Model

# =======================================================
# STEP 3: Fit the log(carat) ~ log(volume) * cut model
# =======================================================

# Model formula: log-volume by cut interaction
formula_diamonds <- log_carat ~ log_volume * cut


lm_diamonds <- lm(formula_diamonds, data = diamonds_clean)


summary(lm_diamonds)
## 
## Call:
## lm(formula = formula_diamonds, data = diamonds_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47426 -0.01014  0.00008  0.01005  1.32058 
## 
## Coefficients:
##                          Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)             -5.146865   0.007784 -661.194  < 2e-16 ***
## log_volume               1.016385   0.001550  655.803  < 2e-16 ***
## cutGood                 -0.006643   0.008580   -0.774   0.4387    
## cutVery Good             0.016136   0.008067    2.000   0.0455 *  
## cutPremium               0.045073   0.008024    5.617 1.95e-08 ***
## cutIdeal                 0.013612   0.007947    1.713   0.0867 .  
## log_volume:cutGood      -0.001673   0.001722   -0.971   0.3314    
## log_volume:cutVery Good -0.008386   0.001613   -5.200 2.00e-07 ***
## log_volume:cutPremium   -0.014469   0.001601   -9.037  < 2e-16 ***
## log_volume:cutIdeal     -0.009912   0.001588   -6.242 4.36e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02834 on 53910 degrees of freedom
## Multiple R-squared:  0.9977, Adjusted R-squared:  0.9977 
## F-statistic: 2.545e+06 on 9 and 53910 DF,  p-value: < 2.2e-16

STEP 4 — Diagnostics

# =======================================================
# STEP 4: Diagnostics for the Interaction Model
# =======================================================


diag_diamonds <- augment(lm_diamonds) %>%
  mutate(std_resid = rstandard(lm_diamonds))


ggplot(diag_diamonds, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Residuals vs Fitted — Diamonds Interaction Model",
       x = "Fitted values",
       y = "Residuals")

ggplot(diag_diamonds, aes(sample = std_resid)) +
  stat_qq(alpha = 0.5) +
  stat_qq_line() +
  labs(title = "Normal Q–Q Plot — Diamonds Interaction Model",
       x = "Theoretical Quantiles",
       y = "Standardized Residuals")

ggplot(diag_diamonds, aes(x = .fitted, y = sqrt(abs(std_resid)))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Scale–Location Plot — Diamonds Interaction Model",
       x = "Fitted values",
       y = "√|Standardized Residuals|")

STEP 5 — Short Interpretation

  • The strong positive coefficient on log_volume (≈ 1.02) shows that diamond weight (log-carat) increases almost proportionally with log-volume, consistent with physical expectations.
  • Several interaction terms (Very Good, Premium, Ideal) are statistically significant, indicating that the slope of the log-volume–log-carat relationship differs by cut quality; a joint F-test of the interaction block is sketched after this list.
  • For high-quality cuts (Premium, Ideal), the slope is slightly flatter, meaning log-carat grows slightly less than proportionally with log-volume for those cuts.
  • Model performance is extremely strong (R² ≈ 0.998), reflecting that carat weight is nearly a deterministic function of physical volume once records with impossible zero dimensions are filtered out.
  • Diagnostic plots show slight heteroscedasticity and curvature at extreme fitted values, likely due to outliers and very large stones, but overall model assumptions are reasonably satisfied.
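
As referenced in the second bullet, the interaction terms can be tested jointly rather than one at a time. A sketch of the nested F-test:

# Compare the additive model against the fitted interaction model;
# a small p-value supports keeping the log_volume x cut interactions.
lm_additive <- lm(log_carat ~ log_volume + cut, data = diamonds_clean)
anova(lm_additive, lm_diamonds)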

SECTION 3: Classification Models — Multinomial, LDA, and QDA

Example 3: Predicting Car Origin (USA, Europe, Japan) Using Multinomial Logistic Regression, LDA, and QDA

💡 What this example demonstrates

  • Building multinomial, LDA, and QDA classifiers using tidymodels
  • Using recipes and cross-validation for classification models
  • Evaluating models with confusion matrices and accuracy metrics
  • Comparing model performance in a structured way

STEP 1 — Load and prepare the Auto data for classification

# =======================================================
# STEP 1: Load and prepare Auto data (Classification)
# =======================================================

data("Auto", package = "ISLR2")   

auto_class <- Auto %>%
  as_tibble() %>%
  mutate(
    origin = factor(origin,
                    levels = c(1, 2, 3),
                    labels = c("USA", "Europe", "Japan")),
    horsepower = as.numeric(horsepower)
  ) %>%
  drop_na()


auto_class %>% count(origin)
## # A tibble: 3 × 2
##   origin     n
##   <fct>  <int>
## 1 USA      245
## 2 Europe    68
## 3 Japan     79
set.seed(123)
class_split <- initial_split(auto_class, prop = 0.8, strata = origin)

train_class <- training(class_split)
test_class  <- testing(class_split)

nrow(train_class)
## [1] 313
nrow(test_class)
## [1] 79

STEP 2 — Define the recipe and cross-validation folds

# =======================================================
# STEP 2: Recipe and CV folds for classification
# =======================================================

# Classification recipe:
# - Outcome: origin (USA, Europe, Japan)
# - Predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year
# - Preprocessing: remove zero-variance predictors, center and scale numerics

class_rec <- recipe(origin ~ mpg + cylinders + displacement +
                      horsepower + weight + acceleration + year,
                    data = train_class) %>%
  step_zv(all_predictors()) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())


class_rec

# 5-fold cross-validation on training set, stratified by origin
set.seed(123)
class_folds <- vfold_cv(train_class, v = 5, strata = origin)

class_folds
## #  5-fold cross-validation using stratification 
## # A tibble: 5 × 2
##   splits           id   
##   <list>           <chr>
## 1 <split [249/64]> Fold1
## 2 <split [250/63]> Fold2
## 3 <split [250/63]> Fold3
## 4 <split [251/62]> Fold4
## 5 <split [252/61]> Fold5

STEP 3 — Multinomial logistic regression (predicting origin)

# =======================================================
# STEP 3: Multinomial logistic regression (origin)
# =======================================================

# Model specification: multinomial logistic regression via nnet
multinom_spec <- multinom_reg(mode = "classification") %>%
  set_engine("nnet")


multinom_wf <- workflow() %>%
  add_model(multinom_spec) %>%
  add_recipe(class_rec)

# 3a. Cross-validated performance (5-fold CV on training set)
set.seed(123)
multinom_cv <- fit_resamples(
  multinom_wf,
  resamples = class_folds,
  control = control_resamples(save_pred = TRUE)
)


collect_metrics(multinom_cv)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config        
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>          
## 1 accuracy    multiclass 0.786     5 0.0117  pre0_mod0_post0
## 2 brier_class multiclass 0.139     5 0.00674 pre0_mod0_post0
## 3 roc_auc     hand_till  0.871     5 0.0109  pre0_mod0_post0
multinom_fit <- fit(multinom_wf, data = train_class)

multinom_preds <- predict(multinom_fit, new_data = test_class, type = "class") %>%
  bind_cols(
    predict(multinom_fit, new_data = test_class, type = "prob"),
    test_class %>% dplyr::select(origin)   # attach the true labels for evaluation
  )

head(multinom_preds)
## # A tibble: 6 × 5
##   .pred_class .pred_USA .pred_Europe .pred_Japan origin
##   <fct>           <dbl>        <dbl>       <dbl> <fct> 
## 1 USA            1.000      2.67e- 5    7.34e- 7 USA   
## 2 USA            1.000      9.12e-11    4.37e-11 USA   
## 3 USA            1.000      1.15e-11    4.92e-11 USA   
## 4 Europe         0.0640     6.10e- 1    3.26e- 1 Japan 
## 5 USA            0.966      3.36e- 2    5.83e- 4 USA   
## 6 Europe         0.0217     5.67e- 1    4.11e- 1 Japan
# 3d. Confusion matrix and accuracy on test set
multinom_conf <- conf_mat(multinom_preds, truth = origin, estimate = .pred_class)
multinom_conf
##           Truth
## Prediction USA Europe Japan
##     USA     42      0     5
##     Europe   3      9     6
##     Japan    4      5     5
multinom_metrics <- multinom_preds %>%
  metrics(truth = origin, estimate = .pred_class)

multinom_metrics
## # A tibble: 2 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.709
## 2 kap      multiclass     0.475

STEP 4 — Linear Discriminant Analysis (LDA)

# =======================================================
# STEP 4: Linear Discriminant Analysis (LDA)
# =======================================================

# LDA model specification
lda_spec <- discrim_linear(mode = "classification") %>%
  set_engine("MASS")


lda_wf <- workflow() %>%
  add_model(lda_spec) %>%
  add_recipe(class_rec)

# 4a. Cross-validated performance (5-fold CV)
set.seed(123)
lda_cv <- fit_resamples(
  lda_wf,
  resamples = class_folds,
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(lda_cv)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config        
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>          
## 1 accuracy    multiclass 0.732     5 0.0223  pre0_mod0_post0
## 2 brier_class multiclass 0.168     5 0.00925 pre0_mod0_post0
## 3 roc_auc     hand_till  0.826     5 0.0211  pre0_mod0_post0
lda_fit <- fit(lda_wf, data = train_class)


lda_preds <- predict(lda_fit, new_data = test_class, type = "class") %>%
  bind_cols(
    predict(lda_fit, new_data = test_class, type = "prob"),
    test_class %>% dplyr::select(origin)
  )

head(lda_preds)
## # A tibble: 6 × 5
##   .pred_class .pred_USA .pred_Europe .pred_Japan origin
##   <fct>           <dbl>        <dbl>       <dbl> <fct> 
## 1 USA            0.951       0.0287      0.0206  USA   
## 2 USA            0.997       0.00145     0.00178 USA   
## 3 USA            0.996       0.00115     0.00261 USA   
## 4 Europe         0.0583      0.614       0.328   Japan 
## 5 USA            0.609       0.293       0.0981  USA   
## 6 Europe         0.0263      0.585       0.389   Japan
# 4d. Confusion matrix and accuracy on test set
lda_conf <- conf_mat(lda_preds, truth = origin, estimate = .pred_class)
lda_conf
##           Truth
## Prediction USA Europe Japan
##     USA     41      0     1
##     Europe   2     11     5
##     Japan    6      3    10
lda_metrics <- lda_preds %>%
  metrics(truth = origin, estimate = .pred_class)

lda_metrics
## # A tibble: 2 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.785
## 2 kap      multiclass     0.630

STEP 5 — Quadratic Discriminant Analysis (QDA)

# =======================================================
# STEP 5: Quadratic Discriminant Analysis (QDA)
# =======================================================


qda_spec <- discrim_quad(mode = "classification") %>%
  set_engine("MASS")


qda_wf <- workflow() %>%
  add_model(qda_spec) %>%
  add_recipe(class_rec)

# 5a. Cross-validated performance (5-fold CV)
set.seed(123)
qda_cv <- fit_resamples(
  qda_wf,
  resamples = class_folds,
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(qda_cv)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config        
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>          
## 1 accuracy    multiclass 0.788     5  0.0210 pre0_mod0_post0
## 2 brier_class multiclass 0.161     5  0.0163 pre0_mod0_post0
## 3 roc_auc     hand_till  0.874     5  0.0103 pre0_mod0_post0
qda_fit <- fit(qda_wf, data = train_class)


qda_preds <- predict(qda_fit, new_data = test_class, type = "class") %>%
  bind_cols(
    predict(qda_fit, new_data = test_class, type = "prob"),
    test_class %>% dplyr::select(origin)
  )

head(qda_preds)
## # A tibble: 6 × 5
##   .pred_class .pred_USA .pred_Europe .pred_Japan origin
##   <fct>           <dbl>        <dbl>       <dbl> <fct> 
## 1 USA           1          1.70e- 54   2.47e- 26 USA   
## 2 USA           1          3.65e-132   2.94e- 87 USA   
## 3 USA           1          4.58e-145   1.45e-101 USA   
## 4 Japan         0.0124     2.87e-  1   7.01e-  1 Japan 
## 5 USA           1.000      1.91e- 13   2.67e-  8 USA   
## 6 Japan         0.00968    3.99e-  1   5.91e-  1 Japan
# 5d. Confusion matrix and accuracy on test set
qda_conf <- conf_mat(qda_preds, truth = origin, estimate = .pred_class)
qda_conf
##           Truth
## Prediction USA Europe Japan
##     USA     44      0     0
##     Europe   3      7     5
##     Japan    2      7    11
qda_metrics <- qda_preds %>%
  metrics(truth = origin, estimate = .pred_class)

qda_metrics
## # A tibble: 2 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.785
## 2 kap      multiclass     0.622

STEP 6 — Comparison of Multinomial, LDA, and QDA

# =======================================================
# STEP 6: Compare Cross-Validated and Test-Set Accuracy
# =======================================================

# ---- Cross-validated accuracy ----
cv_multinom <- multinom_cv %>% collect_metrics() %>% filter(.metric == "accuracy") %>% select(mean)
cv_lda      <- lda_cv %>% collect_metrics() %>% filter(.metric == "accuracy") %>% select(mean)
cv_qda      <- qda_cv %>% collect_metrics() %>% filter(.metric == "accuracy") %>% select(mean)

# ---- Test-set accuracy ----
test_multinom <- multinom_metrics %>% filter(.metric == "accuracy") %>% select(.estimate)
test_lda      <- lda_metrics %>% filter(.metric == "accuracy") %>% select(.estimate)
test_qda      <- qda_metrics %>% filter(.metric == "accuracy") %>% select(.estimate)

# ---- Merging into one comparison tibble ----
comparison_table <- tibble(
  Model = c("Multinomial Logistic Regression", "LDA", "QDA"),
  CV_Accuracy = c(cv_multinom$mean, cv_lda$mean, cv_qda$mean),
  Test_Accuracy = c(test_multinom$.estimate, test_lda$.estimate, test_qda$.estimate)
)

knitr::kable(
  comparison_table,
  caption = "Comparison of Classification Models (CV & Test Accuracy)",
  digits = 3,
  align = c("l", "c")
)
Comparison of Classification Models (CV & Test Accuracy)

Model                               CV_Accuracy   Test_Accuracy
Multinomial Logistic Regression           0.786           0.709
LDA                                       0.732           0.785
QDA                                       0.788           0.785

Overall Interpretation for Example 3

Overall, all three classifiers performed well, with test accuracies ranging from about 0.71 to 0.79. LDA and QDA achieved the strongest test-set performance (≈ 0.785), while multinomial logistic regression remained competitive but slightly weaker on unseen data. USA-origin cars were classified most reliably across models, with more confusion between Europe and Japan, likely because those cars have similar engine and weight characteristics (per-class recall for the QDA model is sketched below). These results show how model flexibility (linear vs. quadratic decision boundaries) influences classification accuracy.
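
To make the Europe/Japan confusion concrete, per-class recall can be read directly off the confusion matrix. A sketch using the QDA test predictions:

# Recall per class = correct predictions (diagonal) divided by the
# number of true cases in each class (column totals of the conf_mat table).
cm <- qda_conf$table
round(diag(cm) / colSums(cm), 3)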

SECTION 4: Count Models — Poisson Regression

Example 4: Modeling Number of Cylinders Using Poisson Regression

💡 What this example demonstrates

  • Fitting a Poisson regression model for count outcomes
  • Modeling cylinder count using engine characteristics (displacement, horsepower, weight, mpg, year)
  • Checking Poisson assumptions through overdispersion tests and diagnostic plots
  • Interpreting multiplicative (rate) effects from a log-linear model
  • Using tidymodels workflows alongside base glm() for GLM diagnostics

STEP 1 — Load and prepare Auto data for Poisson regression

# =======================================================
# STEP 1: Load and prepare Auto data (Poisson)
# =======================================================

data("Auto", package = "ISLR2")   # or ISLR if needed

auto_pois <- Auto %>%
  as_tibble() %>%
  mutate(
    cylinders = as.integer(cylinders),
    horsepower = as.numeric(horsepower)
  ) %>%
  drop_na()

# Quick check of the distribution of cylinders
auto_pois %>%
  count(cylinders)
## # A tibble: 5 × 2
##   cylinders     n
##       <int> <int>
## 1         3     4
## 2         4   199
## 3         5     3
## 4         6    83
## 5         8   103
# We model cylinders as a count response, with predictors:
# displacement, horsepower, weight, year, and mpg

STEP 2 — Exploratory Data Analysis for Poisson Regression

# =======================================================
# STEP 2: EDA for Poisson Regression
# =======================================================


ggplot(auto_pois, aes(x = cylinders)) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Distribution of Cylinder Counts",
    x = "Number of Cylinders",
    y = "Count of Cars"
  )

auto_pois %>%
  dplyr::select(cylinders, displacement, horsepower, weight, mpg, year) %>%
  pivot_longer(-cylinders, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, y = cylinders)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess") +
  facet_wrap(~ variable, scales = "free") +
  labs(
    title = "Cylinders vs. Key Predictors",
    x = "Predictor Value",
    y = "Cylinders"
  )

ggplot(auto_pois, aes(x = factor(year), y = cylinders)) +
  geom_boxplot(outlier.alpha = 0.3) +
  labs(
    title = "Cylinders Across Model Years",
    x = "Year",
    y = "Cylinders"
  ) +
  theme(axis.text.x = element_text(angle = 90))

cyl_mean <- mean(auto_pois$cylinders)
cyl_var  <- var(auto_pois$cylinders)

cyl_summary <- tibble(
  Mean_Cylinders = cyl_mean,
  Variance_Cylinders = cyl_var,
  Overdispersion = cyl_var > cyl_mean
)

cyl_summary
## # A tibble: 1 × 3
##   Mean_Cylinders Variance_Cylinders Overdispersion
##            <dbl>              <dbl> <lgl>         
## 1           5.47               2.91 FALSE

STEP 3 — Fit the Poisson Regression Model

# =======================================================
# STEP 3: Fit Poisson Regression Model
# =======================================================

poisson_spec <- poisson_reg(mode = "regression") %>%
  set_engine("glm")

# Workflow
poisson_wf <- workflow() %>%
  add_model(poisson_spec) %>%
  add_formula(cylinders ~ displacement + horsepower + weight + mpg + year)

# 3a. 5-fold cross-validation
set.seed(123)
poisson_folds <- vfold_cv(auto_pois, v = 5, strata = cylinders)

poisson_cv <- fit_resamples(
  poisson_wf,
  resamples = poisson_folds,
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(poisson_cv)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config        
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
## 1 rmse    standard   0.595     5  0.0382 pre0_mod0_post0
## 2 rsq     standard   0.882     5  0.0124 pre0_mod0_post0
poisson_glm <- glm(
  cylinders ~ displacement + horsepower + weight + mpg + year,
  data = auto_pois,
  family = poisson()
)

summary(poisson_glm)
## 
## Call:
## glm(formula = cylinders ~ displacement + horsepower + weight + 
##     mpg + year, family = poisson(), data = auto_pois)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.020e+00  5.509e-01   1.851   0.0641 .  
## displacement  2.542e-03  6.413e-04   3.964 7.37e-05 ***
## horsepower   -1.183e-03  1.251e-03  -0.945   0.3446    
## weight        2.675e-05  7.738e-05   0.346   0.7296    
## mpg          -6.089e-03  6.619e-03  -0.920   0.3576    
## year          4.360e-03  8.219e-03   0.531   0.5957    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 201.717  on 391  degrees of freedom
## Residual deviance:  21.527  on 386  degrees of freedom
## AIC: 1415
## 
## Number of Fisher Scoring iterations: 4
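
Because the Poisson model is log-linear, its coefficients are easiest to read after exponentiation. A sketch of the multiplicative (rate-ratio) view:

# exp(beta) is the multiplicative change in the expected cylinder count
# per one-unit increase in a predictor, holding the others fixed.
round(exp(coef(poisson_glm)), 4)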
poisson_preds <- auto_pois %>%
  mutate(
    .pred = predict(poisson_glm, type = "response")
  ) %>%
  dplyr::select(cylinders, .pred)

head(poisson_preds)
## # A tibble: 6 × 2
##   cylinders .pred
##       <int> <dbl>
## 1         8  6.93
## 2         8  7.59
## 3         8  6.95
## 4         8  6.79
## 5         8  6.79
## 6         8  9.08

STEP 4 — Diagnostics for the Poisson Regression Model

# =======================================================
# STEP 4: Diagnostics for Poisson Regression
# =======================================================

poisson_diag <- augment(poisson_glm) %>%
  mutate(
    std_resid = rstandard(poisson_glm)
  )
ggplot(poisson_diag, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(
    title = "Residuals vs Fitted — Poisson Model",
    x = "Fitted Cylinder Count",
    y = "Residuals"
  )

ggplot(poisson_diag, aes(sample = std_resid)) +
  stat_qq(alpha = 0.5) +
  stat_qq_line() +
  labs(
    title = "Normal Q–Q Plot — Poisson Residuals",
    x = "Theoretical Quantiles",
    y = "Standardized Residuals"
  )

ggplot(poisson_diag, aes(x = .fitted, y = sqrt(abs(std_resid)))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(
    title = "Scale–Location Plot — Poisson Model",
    x = "Fitted Values",
    y = "√|Standardized Residuals|"
  )

# Dispersion check: residual deviance / residual degrees of freedom.
# A ratio >> 1 suggests overdispersion; a ratio << 1 suggests underdispersion.

residual_deviance <- deviance(poisson_glm)
df_residual       <- df.residual(poisson_glm)

overdispersion_ratio <- residual_deviance / df_residual

tibble(
  Residual_Deviance    = residual_deviance,
  DF_Residual          = df_residual,
  Overdispersion_Ratio = overdispersion_ratio
)
## # A tibble: 1 × 3
##   Residual_Deviance DF_Residual Overdispersion_Ratio
##               <dbl>       <int>                <dbl>
## 1              21.5         386               0.0558

STEP 5 — Interpretation

The Poisson model reproduces the cylinder counts closely. The dispersion ratio is very low (≈ 0.056), so there is no overdispersion; if anything, the counts are underdispersed (variance well below the mean, as the EDA mean–variance check already suggested), which makes the Poisson standard errors conservative rather than misleadingly small (a quasi-Poisson check is sketched below). Among the predictors, only displacement shows a significant positive effect on cylinder count, consistent with larger engines generally having more cylinders. Horsepower, weight, mpg, and year do not contribute meaningfully after accounting for displacement. Diagnostic plots show mild curvature but no major deviations, suggesting the model provides a reasonable and interpretable approximation for these count data.
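
As noted above, a quasi-Poisson fit estimates the dispersion parameter instead of fixing it at 1. A sketch:

# The estimated dispersion should land well below 1 here, confirming
# underdispersion; coefficient estimates match the Poisson fit, but
# standard errors are rescaled by the estimated dispersion.
quasi_glm <- update(poisson_glm, family = quasipoisson())
summary(quasi_glm)$dispersion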

SECTION 5: Penalized Regression — Ridge and Lasso (Auto)

Example 5: Predicting mpg using ridge and lasso regression

💡 What this example demonstrates

  • Applying penalized regression methods (ridge and lasso) using tidymodels
  • Creating dummy variables and normalizing predictors for glmnet
  • Tuning the penalty parameter λ through cross-validation
  • Comparing model performance using test-set RMSE
  • Understanding shrinkage and feature selection in regularized regression

STEP 1 — Prepare Auto data

# =======================================================
# STEP 1: Prepare Auto data for penalized regression
# =======================================================

data("Auto", package = "ISLR2")   # or ISLR

auto_pen <- Auto %>%
  as_tibble() %>%
  mutate(
    origin = factor(origin),
    horsepower = as.numeric(horsepower)
  ) %>%
  drop_na()

# Train/test split stratified by mpg (for stability)
set.seed(123)
pen_split <- initial_split(auto_pen, prop = 0.8, strata = mpg)

pen_train <- training(pen_split)
pen_test  <- testing(pen_split)

nrow(pen_train)
## [1] 312
nrow(pen_test)
## [1] 80

STEP 2 — Recipe: dummy encodings and normalization

# =======================================================
# STEP 2: Define recipe for penalized regression
# =======================================================

pen_rec <- recipe(mpg ~ cylinders + displacement + horsepower +
                    weight + acceleration + year + origin,
                  data = pen_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

pen_rec

STEP 3 — Ridge and lasso model specifications and CV grid

# =======================================================
# STEP 3: Ridge and Lasso specs + tuning grid
# =======================================================

# Ridge: mixture = 0
ridge_spec <- linear_reg(
  mode = "regression",
  penalty = tune(),
  mixture = 0
) %>%
  set_engine("glmnet")

# Lasso: mixture = 1
lasso_spec <- linear_reg(
  mode = "regression",
  penalty = tune(),
  mixture = 1
) %>%
  set_engine("glmnet")

# Workflows
ridge_wf <- workflow() %>%
  add_model(ridge_spec) %>%
  add_recipe(pen_rec)

lasso_wf <- workflow() %>%
  add_model(lasso_spec) %>%
  add_recipe(pen_rec)


set.seed(123)
pen_folds <- vfold_cv(pen_train, v = 5, strata = mpg)

# Penalty grid (lambda)
lambda_grid <- grid_regular(penalty(), levels = 30)
lambda_grid
## # A tibble: 30 × 1
##     penalty
##       <dbl>
##  1 1   e-10
##  2 2.21e-10
##  3 4.89e-10
##  4 1.08e- 9
##  5 2.40e- 9
##  6 5.30e- 9
##  7 1.17e- 8
##  8 2.59e- 8
##  9 5.74e- 8
## 10 1.27e- 7
## # ℹ 20 more rows

STEP 4 — Cross-validation for ridge and lasso

# =======================================================
# STEP 4: Tune penalty for ridge and lasso
# =======================================================


set.seed(123)
ridge_tune <- tune_grid(
  ridge_wf,
  resamples = pen_folds,
  grid = lambda_grid,
  control = control_grid(save_pred = TRUE)
)

set.seed(123)
lasso_tune <- tune_grid(
  lasso_wf,
  resamples = pen_folds,
  grid = lambda_grid,
  control = control_grid(save_pred = TRUE)
)


ridge_tune %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean) %>%
  slice(1:5)
## # A tibble: 5 × 7
##    penalty .metric .estimator  mean     n std_err .config         
##      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
## 1 1   e-10 rmse    standard    3.42     5   0.135 pre0_mod01_post0
## 2 2.21e-10 rmse    standard    3.42     5   0.135 pre0_mod02_post0
## 3 4.89e-10 rmse    standard    3.42     5   0.135 pre0_mod03_post0
## 4 1.08e- 9 rmse    standard    3.42     5   0.135 pre0_mod04_post0
## 5 2.40e- 9 rmse    standard    3.42     5   0.135 pre0_mod05_post0
lasso_tune %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean) %>%
  slice(1:5)
## # A tibble: 5 × 7
##        penalty .metric .estimator  mean     n std_err .config         
##          <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
## 1 0.0924       rmse    standard    3.35     5   0.132 pre0_mod27_post0
## 2 0.0418       rmse    standard    3.36     5   0.131 pre0_mod26_post0
## 3 0.00853      rmse    standard    3.36     5   0.133 pre0_mod24_post0
## 4 0.0189       rmse    standard    3.36     5   0.132 pre0_mod25_post0
## 5 0.0000000001 rmse    standard    3.36     5   0.133 pre0_mod01_post0

STEP 5 — Final ridge and lasso models and test performance

# =======================================================
# STEP 5: Finalize best ridge and lasso models
# =======================================================


best_ridge <- select_best(ridge_tune, metric = "rmse")
best_lasso <- select_best(lasso_tune, metric = "rmse")

best_ridge
## # A tibble: 1 × 2
##        penalty .config         
##          <dbl> <chr>           
## 1 0.0000000001 pre0_mod01_post0
best_lasso
## # A tibble: 1 × 2
##   penalty .config         
##     <dbl> <chr>           
## 1  0.0924 pre0_mod27_post0
# Final workflows
ridge_final_wf <- finalize_workflow(ridge_wf, best_ridge)
lasso_final_wf <- finalize_workflow(lasso_wf, best_lasso)

# Fit on full training data
ridge_final_fit <- fit(ridge_final_wf, data = pen_train)
lasso_final_fit <- fit(lasso_final_wf, data = pen_train)

# Predict on test data
ridge_test_preds <- predict(ridge_final_fit, new_data = pen_test) %>%
  bind_cols(pen_test %>% dplyr::select(mpg))

lasso_test_preds <- predict(lasso_final_fit, new_data = pen_test) %>%
  bind_cols(pen_test %>% dplyr::select(mpg))

# Test metrics
ridge_test_metrics <- ridge_test_preds %>%
  metrics(truth = mpg, estimate = .pred)

lasso_test_metrics <- lasso_test_preds %>%
  metrics(truth = mpg, estimate = .pred)

ridge_test_metrics
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       3.52 
## 2 rsq     standard       0.817
## 3 mae     standard       2.45
lasso_test_metrics
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       3.38 
## 2 rsq     standard       0.829
## 3 mae     standard       2.41

STEP 6 — Comparison of ridge and lasso

ridge_rmse <- ridge_test_metrics %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::pull(.estimate)

lasso_rmse <- lasso_test_metrics %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::pull(.estimate)

comparison_table <- tibble(
  Model = c("Ridge", "Lasso"),
  Test_RMSE = c(ridge_rmse, lasso_rmse)
)

knitr::kable(
  comparison_table,
  caption = "Test RMSE Comparison: Ridge vs Lasso Regression",
  digits = 3,
  align = c("l", "c")
)
Test RMSE Comparison: Ridge vs Lasso Regression

Model    Test_RMSE
Ridge        3.519
Lasso        3.382

Overall Interpretation for Example 5

Both ridge and lasso regression produced strong predictive performance for mpg, with lasso achieving the lower test RMSE (≈ 3.38) versus ridge (≈ 3.52). Notably, the selected ridge penalty was essentially zero (1e-10), so the ridge fit is nearly ordinary least squares, whereas lasso's selected penalty (≈ 0.092) applies real shrinkage and likely zeroes out weaker predictors; a coefficient check is sketched below. The performance difference is modest, but lasso offers a sparser, more interpretable model with comparable or better accuracy.
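
To verify the sparsity claim above, the lasso coefficients at the finalized penalty can be pulled from the fitted workflow. A sketch:

# tidy() on the glmnet fit reports coefficients at the selected penalty;
# terms with estimate == 0 were eliminated by the lasso.
lasso_final_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(desc(abs(estimate)))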

Conclusion

This portfolio demonstrates a comprehensive set of modeling skills using the tidymodels ecosystem, including regression, classification, interaction modeling, count data modeling, and penalized methods. Across all examples, the emphasis was placed on clean data preparation, thoughtful exploratory analysis, appropriate model validation, and interpretable results.

The models used here show how different statistical approaches respond to nonlinear patterns, multicollinearity, categorical interactions, and varying levels of model flexibility. Together, these examples highlight the importance of understanding both the mathematical structure of models and the practical behavior of real-world datasets.

Future Work

  • Random forests or gradient boosting for flexible nonlinear modeling with automatic interaction detection.
  • Bayesian regression approaches, providing posterior uncertainty estimates for coefficients and predictions.
  • Model deployment examples, such as creating prediction APIs or lightweight Shiny apps.
  • Cross-dataset comparisons, evaluating whether insights from one domain generalize to others.

References

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. An Introduction to Statistical Learning with Applications in R (ISLR), 2nd Edition.
  • Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. R for Data Science, 2nd Edition.