Feature Engineering

Setup

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.8     ✔ rsample      1.3.0
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.8     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.3.1     ✔ yardstick    1.3.2
## ✔ recipes      1.3.0

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()

load("C:/Users/pearl/Downloads/cdc3.Rdata")

Introduction

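This report explores feature engineering for predicting gender from the other variables in the cdc3 dataset. Starting from a base logistic regression recipe, each section changes one preprocessing step and compares the resulting test-set ROC AUC.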

Split The Data

set.seed(123)
cdc3_split <- initial_split(cdc3, prop = 0.7, strata = gender)
cdc3_training <- training(cdc3_split)
cdc3_test <- testing(cdc3_split)
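
One way to confirm that stratifying on gender kept the class balance is to compare outcome proportions across the two partitions. The sketch below uses a made-up data frame (toy, with an assumed 30/70 class mix) in place of cdc3, so the numbers are illustrative only.

```r
library(rsample)

set.seed(123)

# Hypothetical stand-in for cdc3: an imbalanced two-level outcome plus one predictor
toy <- data.frame(
  gender = factor(rep(c("f", "m"), times = c(300, 700))),
  x = rnorm(1000)
)

toy_split <- initial_split(toy, prop = 0.7, strata = gender)

# Class proportions in each partition; with stratification both should sit near 0.3 / 0.7
train_prop <- prop.table(table(training(toy_split)$gender))
test_prop  <- prop.table(table(testing(toy_split)$gender))

print(train_prop)
print(test_prop)
```

The same two prop.table() calls could be run on cdc3_training and cdc3_test to check the real split.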

Base Model

logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

cdc3_recipe <- recipe(gender ~ ., data = cdc3_training) %>%
  step_corr(all_numeric(), threshold = 0.8) %>%
  step_log(all_numeric(), base = 10) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes())

cdc3_recipe_prep <- prep(cdc3_recipe, training = cdc3_training)
cdc3_training_prep <- bake(cdc3_recipe_prep, new_data = NULL)
cdc3_test_prep <- bake(cdc3_recipe_prep, new_data = cdc3_test)

logistic_fit <- logistic_model %>%
  fit(gender ~ ., data = cdc3_training_prep)

base_preds <- predict(logistic_fit, new_data = cdc3_test_prep, type = "prob") %>%
  bind_cols(cdc3_test_prep %>% select(gender))

roc_auc(base_preds, truth = gender, .pred_f)
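
The roc_auc() call above passes .pred_f, which is only correct if "f" is the first level of gender, because yardstick treats the first factor level as the event by default. The sketch below uses four made-up predictions (not cdc3 output) to show how the level order changes the result and how the event_level argument compensates.

```r
library(yardstick)

# Made-up predictions; "f" is the first factor level, so it is the event by default
toy_preds <- data.frame(
  gender  = factor(c("f", "f", "m", "m"), levels = c("f", "m")),
  .pred_f = c(0.9, 0.6, 0.7, 0.2)
)
auc_default <- roc_auc(toy_preds, truth = gender, .pred_f)$.estimate

# Same data with the levels reversed: "f" is now the second level
toy_rev <- toy_preds
toy_rev$gender <- factor(toy_rev$gender, levels = c("m", "f"))

# Default event level no longer matches the probability column
auc_mismatch <- roc_auc(toy_rev, truth = gender, .pred_f)$.estimate

# Telling yardstick that the event is the second level restores the intended AUC
auc_fixed <- roc_auc(toy_rev, truth = gender, .pred_f,
                     event_level = "second")$.estimate

print(c(default = auc_default, mismatch = auc_mismatch, fixed = auc_fixed))
```

Running levels(cdc3$gender) before computing AUC is a cheap check that the probability column and the event level agree.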

Change One:

Removed step_log()

# Change 1: Remove step_log()

cdc3_recipe1 <- recipe(gender ~ ., data = cdc3_training) %>%
  step_corr(all_numeric(), threshold = 0.8) %>%
  # Removed log transform
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes())

cdc3_recipe1_prep <- prep(cdc3_recipe1, training = cdc3_training)
cdc3_training1 <- bake(cdc3_recipe1_prep, new_data = NULL)
cdc3_test1 <- bake(cdc3_recipe1_prep, new_data = cdc3_test)

fit1 <- logistic_model %>%
  fit(gender ~ ., data = cdc3_training1)

preds1 <- predict(fit1, new_data = cdc3_test1, type = "prob") %>%
  bind_cols(cdc3_test1 %>% select(gender))

# Calculate AUC
roc_auc(preds1, truth = gender, .pred_f)

# ROC Plot
preds1 %>%
  roc_curve(truth = gender, .pred_f) %>%
  autoplot() +
  ggtitle("ROC Curve for Change 1: Removed Log Transformation")

Results: Removing the log transformation slightly reduced the model's performance. Some numeric predictors likely benefit from log scaling, which reduces right skew.
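
The effect of a log transform on skew can be illustrated with simulated data. This is a minimal sketch: the variable x is drawn from a log-normal distribution chosen for illustration, and skew() is a hypothetical helper, not a function from cdc3 or tidymodels.

```r
set.seed(1)

# Simulated right-skewed variable (assumed log-normal, not taken from cdc3)
x <- rlnorm(1000, meanlog = 5, sdlog = 0.5)

# Hypothetical helper: sample skewness as the standardized third central moment
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skew(x)         # clearly positive for right-skewed data
skew_log <- skew(log10(x))  # close to zero after the transform

print(c(raw = skew_raw, log10 = skew_log))
```

Applying skew() to each numeric column of cdc3_training would show which predictors, if any, are skewed enough to benefit from step_log().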

Change Two:

Removed step_corr()

cdc3_recipe2 <- recipe(gender ~ ., data = cdc3_training) %>%
  # Removed correlation filter
  step_log(all_numeric(), base = 10) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes())

cdc3_recipe2_prep <- prep(cdc3_recipe2, training = cdc3_training)
cdc3_training2 <- bake(cdc3_recipe2_prep, new_data = NULL)
cdc3_test2 <- bake(cdc3_recipe2_prep, new_data = cdc3_test)

fit2 <- logistic_model %>%
  fit(gender ~ ., data = cdc3_training2)

preds2 <- predict(fit2, new_data = cdc3_test2, type = "prob") %>%
  bind_cols(cdc3_test2 %>% select(gender))

roc_auc(preds2, truth = gender, .pred_f)

# ROC Plot
preds2 %>%
  roc_curve(truth = gender, .pred_f) %>%
  autoplot() +
  ggtitle("ROC Curve for Change 2: Removed Correlation Filter")

Results: step_corr() only removes numeric variables whose absolute pairwise correlation exceeds the threshold (0.8 here). If no pair of numeric features exceeds that threshold, the step removes nothing and the recipe behaves identically to the base model.
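
Whether step_corr() actually dropped anything can be read from the prepped recipe with tidy(). The sketch below runs on a made-up data frame (toy) in which column b is deliberately built as a near copy of column a; the column names and data are assumptions for illustration.

```r
library(recipes)

set.seed(42)
n <- 200

# Made-up data: b is almost a copy of a, c is independent noise
toy <- data.frame(
  gender = factor(sample(c("f", "m"), n, replace = TRUE)),
  a = rnorm(n)
)
toy$b <- toy$a + rnorm(n, sd = 0.1)
toy$c <- rnorm(n)

rec <- recipe(gender ~ ., data = toy) |>
  step_corr(all_numeric_predictors(), threshold = 0.8)

rec_prep <- prep(rec, training = toy)

# tidy() on step 1 lists the columns the correlation filter removed
removed <- tidy(rec_prep, number = 1)$terms
print(removed)
```

Calling tidy(cdc3_recipe_prep, number = 1) on the base recipe would show directly whether any cdc3 predictor was filtered out; an empty result would be consistent with the unchanged model.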

Change Three:

Added interaction term height:weight

cdc3_recipe3 <- recipe(gender ~ ., data = cdc3_training) %>%
  step_corr(all_numeric(), threshold = 0.8) %>%
  step_log(all_numeric(), base = 10) %>%
  step_normalize(all_numeric()) %>%
  step_interact(terms = ~ starts_with("height"):starts_with("weight")) %>%
  step_dummy(all_nominal(), -all_outcomes())

cdc3_recipe3_prep <- prep(cdc3_recipe3, training = cdc3_training)
cdc3_training3 <- bake(cdc3_recipe3_prep, new_data = NULL)
cdc3_test3 <- bake(cdc3_recipe3_prep, new_data = cdc3_test)

fit3 <- logistic_model %>%
  fit(gender ~ ., data = cdc3_training3)

preds3 <- predict(fit3, new_data = cdc3_test3, type = "prob") %>%
  bind_cols(cdc3_test3 %>% select(gender))

roc_auc(preds3, truth = gender, .pred_f)

# ROC Plot
preds3 %>%
  roc_curve(truth = gender, .pred_f) %>%
  autoplot() +
  ggtitle("ROC Curve for Change 3: Add height*weight Interaction")

The purpose of Change Three was to test whether an interaction between height and weight improves the model's ability to predict gender. The interaction term offered no additional predictive value, possibly because the main effects of height and weight already capture most of the signal the model can use.
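
For reference, an interaction column is the elementwise product of its two inputs, so it is not itself a linear combination of them. The sketch below checks both facts on simulated height and weight values (assumed distributions, not cdc3 data).

```r
set.seed(3)
n <- 100

# Simulated, correlated height and weight values (illustrative only)
toy <- data.frame(height = rnorm(n, mean = 67, sd = 4))
toy$weight <- 2 * toy$height + rnorm(n, mean = 35, sd = 25)

# The design matrix for height * weight contains both main effects and their product
mm <- model.matrix(~ height * weight, data = toy)
print(colnames(mm))  # "(Intercept)" "height" "weight" "height:weight"

product_matches <- all(mm[, "height:weight"] == toy$height * toy$weight)

# If the product were a linear combination of the main effects, these residuals would be zero
resid_sd <- sd(resid(lm(I(height * weight) ~ height + weight, data = toy)))

print(product_matches)
print(resid_sd)
```

A nonzero residual spread means the interaction carries information beyond the main effects; whether that information helps predict gender is an empirical question, which is what Change Three tests.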

Change Four:

Use only smoke100 as predictor

# Change 4: Use only smoke100 as predictor

# Step 1: Build the recipe using only smoke100
cdc3_recipe4 <- recipe(gender ~ smoke100, data = cdc3_training) %>%
  step_dummy(all_nominal(), -all_outcomes())

# Step 2: Prep and bake the data
cdc3_recipe4_prep <- prep(cdc3_recipe4, training = cdc3_training)
cdc3_training4 <- bake(cdc3_recipe4_prep, new_data = NULL)
cdc3_test4 <- bake(cdc3_recipe4_prep, new_data = cdc3_test)

# Step 3: Fit the logistic regression model
fit4 <- logistic_model %>%
  fit(gender ~ ., data = cdc3_training4)

# Step 4: Get predicted probabilities
preds4 <- predict(fit4, new_data = cdc3_test4, type = "prob") %>%
  bind_cols(cdc3_test4 %>% select(gender))

# Step 5: Compute AUC
roc_auc(preds4, truth = gender, .pred_f)

# Step 6: Plot the ROC curve with improved visuals
# autoplot() already applies coord_equal(), so it is not added again here
preds4 %>%
  roc_curve(truth = gender, .pred_f) %>%
  autoplot() +
  ggtitle("ROC Curve for Change 4: Using Only smoke100") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(face = "bold"),
    panel.grid.major = element_line(color = "gray80"),
    panel.grid.minor = element_blank()
  ) +
  xlab("1 - Specificity") +
  ylab("Sensitivity") +
  scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(limits = c(0, 1))

This resulted in an AUC of 0.54, only slightly better than random guessing (0.5). On its own, smoke100 carries only a weak signal: smoking status does not reliably distinguish gender in this dataset, likely because smoking history is distributed too similarly across genders for the model to separate the classes meaningfully.
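
Part of the reason a single binary predictor gives such a flat curve is structural: the model can output only two distinct probabilities, so the ROC curve has a single interior point. The sketch below demonstrates this with simulated data; the effect size and prevalence are assumptions, not estimates from cdc3.

```r
set.seed(7)
n <- 1000

# Simulated binary predictor and a weakly related two-level outcome (assumed values)
smoke100 <- rbinom(n, size = 1, prob = 0.45)
is_m <- rbinom(n, size = 1, prob = plogis(-0.2 + 0.3 * smoke100))
gender <- factor(ifelse(is_m == 1, "m", "f"))

fit <- glm(gender ~ smoke100, family = binomial)
p <- predict(fit, type = "response")

# One fitted probability per predictor value, so only two distinct scores exist
n_distinct_probs <- length(unique(round(p, 10)))
print(n_distinct_probs)
```

With only two scores, every observation in the same smoke100 group is tied, which limits how well any threshold can rank the classes.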

Conclusion

Through each individual change, we observed how different steps in a recipe can affect model performance.

Keeping the log transformation slightly improved AUC relative to removing it.
Removing the correlation filter (step_corr) had no impact.
Interaction terms and binary variables can add structure, but may not always increase predictive power.
Behavioral variables like smoke100 show weak individual signals.
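
The comparisons summarized above repeat the same prep, bake, fit, and predict sequence for every recipe. One possible way to reduce that repetition is a small helper built on workflow(), sketched below. The helper name eval_recipe(), the simulated toy data, and its height and weight distributions are all assumptions for illustration; the helper also assumes the outcome is called gender with "f" as its first level.

```r
library(tidymodels)

# Hypothetical helper: fit a logistic regression behind a recipe and return test AUC
eval_recipe <- function(rec, split) {
  wf <- workflow() %>%
    add_recipe(rec) %>%
    add_model(logistic_reg() %>% set_engine("glm"))

  # fit() preps the recipe on the training data; predict() bakes the test data
  fitted_wf <- fit(wf, data = training(split))

  preds <- predict(fitted_wf, new_data = testing(split), type = "prob") %>%
    bind_cols(testing(split) %>% select(gender))

  roc_auc(preds, truth = gender, .pred_f)$.estimate
}

set.seed(123)
n <- 500

# Simulated stand-in for cdc3 (illustrative distributions only)
toy <- tibble(
  gender = factor(sample(c("f", "m"), n, replace = TRUE), levels = c("f", "m")),
  height = rnorm(n, mean = ifelse(gender == "m", 70, 64), sd = 3),
  weight = rnorm(n, mean = ifelse(gender == "m", 190, 150), sd = 30)
)

split <- initial_split(toy, prop = 0.7, strata = gender)

rec_log <- recipe(gender ~ ., data = training(split)) %>%
  step_log(all_numeric_predictors(), base = 10) %>%
  step_normalize(all_numeric_predictors())

rec_raw <- recipe(gender ~ ., data = training(split)) %>%
  step_normalize(all_numeric_predictors())

aucs <- c(with_log = eval_recipe(rec_log, split),
          without_log = eval_recipe(rec_raw, split))
print(aucs)
```

Bundling the recipe and model in a workflow keeps preprocessing tied to the training data, so each variant is evaluated the same way and the AUC values can be collected in one place.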

This demonstrates the importance of feature engineering as an iterative, evidence-driven process, especially when working with real-world, imperfect data.

The ROC curve visualizations provide an essential lens into how each feature engineering change influenced the model’s predictive ability.

The base model and Changes 1 through 3 (removing log transformation, removing correlated features, and adding a height-weight interaction) all yielded ROC curves that closely hug the upper-left corner, indicating strong classifier performance and minimal change to the model’s ability to distinguish between genders.

This visual similarity suggests that these changes did not materially affect the ranking of predictions, likely because the remaining features are strong enough to carry the model under any of these preprocessing choices.

In contrast, Change 4, which limited the model to using only the smoke100 variable, produced a noticeably flatter ROC curve with a much lower AUC (~0.54), close to random guessing.

This confirms that smoke100 alone is a weak predictor of gender in this dataset. The stark visual difference in this curve reinforces the importance of using rich, multi-feature models and validates the value of graphical diagnostics in evaluating feature engineering choices.