Introduction

Feature engineering is a critical preprocessing step in predictive modeling. The quality and relevance of the engineered features often determine the performance and interpretability of machine learning models (Kuhn & Johnson, 2019). This report demonstrates how to perform robust and justified feature engineering on the Titanic dataset, ensuring no data leakage and maintaining transformation consistency across training and test datasets.

Step 1: Load the Data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)

# Set working directory
setwd("C:/DDS 8501 Titanic")  # Adjust as needed

train <- read_csv("train.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test <- read_csv("test.csv")
## Rows: 418 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (6): PassengerId, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

What we are doing:
We begin by importing the training and test datasets with the read_csv function from the readr package, which is part of the tidyverse. We also load stringr for the string manipulation tasks later in the pipeline; it is already attached with the core tidyverse, so the explicit call is redundant but harmless.

Why we are doing it:
The training data is used to build and validate models, while the test data evaluates generalization. These datasets must be loaded separately to prevent information leakage and to maintain the integrity of downstream evaluations (Kelleher & Tierney, 2018).
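
A quick structural check after loading can confirm that the two files differ only in the outcome column. The sketch below uses small inline stand-ins (train_demo and test_demo are hypothetical, not the real files) so that it runs on its own; with the real data, the same setdiff() calls would apply to train and test directly.

```r
library(readr)

# Hypothetical two-row stand-ins for train.csv and test.csv (illustration only)
train_demo <- read_csv(
  I("PassengerId,Survived,Pclass,Age\n1,0,3,22\n2,1,1,38\n"),
  show_col_types = FALSE
)
test_demo <- read_csv(
  I("PassengerId,Pclass,Age\n892,3,34.5\n893,3,47\n"),
  show_col_types = FALSE
)

# Columns present in one dataset but not the other
only_in_train <- setdiff(names(train_demo), names(test_demo))
only_in_test  <- setdiff(names(test_demo), names(train_demo))

print(only_in_train)  # expected to contain only the outcome column
print(only_in_test)   # expected to be empty
```

This is consistent with the column counts reported above (12 in the training file, 11 in the test file): the test file lacks Survived.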

Step 2: Extract Title from Name

extract_title <- function(name) {
  # Extracts the title using a capture group
  str_trim(str_match(name, ",\\s*([^\\.]+)\\.")[,2])
}

train <- train %>% mutate(Title = extract_title(Name))
test <- test %>% mutate(Title = extract_title(Name))

What we are doing:
We define and apply a function that uses a regular expression to extract each passenger’s title (e.g., “Mr”, “Mrs”, “Miss”) from the Name field. This creates a new feature called Title.

Why we are doing it:
The Name variable contains embedded social information. Extracting the title provides explicit access to an individual’s social role, gender, and potentially age category (e.g., “Master” indicates a young male). Studies have shown that such demographic and social identity features are often strong predictors in survival analysis (Hornik et al., 2020). Feature extraction from unstructured text enables domain-relevant insight that enhances model performance (Kuhn & Johnson, 2019).
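
To make the behavior of the regular expression concrete, the sketch below runs the same pattern on a few names. The last entry is a hypothetical malformed name, added only to show what happens when no title is present.

```r
library(stringr)

# Same pattern as extract_title(): the text between the first comma
# and the next period, with surrounding whitespace trimmed
extract_title <- function(name) {
  str_trim(str_match(name, ",\\s*([^\\.]+)\\.")[, 2])
}

sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Heikkinen, Miss. Laina",
  "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
  "No title here"  # hypothetical malformed entry: no comma or period
)

titles <- extract_title(sample_names)
print(titles)
# "Mr" "Miss" "the Countess" NA
```

Two details are worth noting: a name that does not match yields NA rather than an error, and the capture group keeps every word before the period, so a multi-word title such as “the Countess” is returned with its article.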

Step 3: Create a Quantitative Interaction Feature: Age * Pclass

train <- train %>% mutate(Age_Pclass = Age * Pclass)
test <- test %>% mutate(Age_Pclass = Age * Pclass)

What we are doing:
We create a new numeric feature by multiplying the Age and Pclass variables. This interaction term, Age_Pclass, lets a model capture how the effect of age may differ across travel classes.

Why we are doing it:
Interaction features help uncover dependencies between predictors. While Age measures passenger age and Pclass captures travel class (a proxy for socioeconomic status), their interaction may reflect compounded effects. For example, a 10-year-old in 1st class may have had higher survival odds than a 10-year-old in 3rd class. Interaction terms are well-established enhancements in both linear and nonlinear models (James et al., 2021). Because predictive modeling seeks to reflect complex realities, such combinations increase both granularity and model expressiveness.
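
A small worked example shows what the product looks like in practice. The rows below are hypothetical. Because Age contains missing values in the Titanic data, the sketch also shows that a missing Age propagates to a missing Age_Pclass, which any later imputation step would need to account for.

```r
library(dplyr)
library(tibble)

# Hypothetical rows: two 10-year-olds in different classes, one missing age
demo <- tibble(
  Age    = c(10, 10, NA),
  Pclass = c(1, 3, 2)
)

demo <- demo %>% mutate(Age_Pclass = Age * Pclass)
print(demo$Age_Pclass)
# 10 30 NA

# Count how many interaction values are missing
n_missing <- sum(is.na(demo$Age_Pclass))
```

The two 10-year-olds receive different values (10 and 30), which is how the feature separates equal ages by class.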

Step 4: Collapse Factor Levels for Title + Save Datasets

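collapse_titles <- function(title) {
  # Map raw titles to broader groups; unlisted or missing titles become "Other"
  case_when(
    title %in% c("Mr") ~ "Mr",
    title %in% c("Miss", "Mlle", "Ms") ~ "Miss",
    title %in% c("Mrs", "Mme") ~ "Mrs",
    title == "Master" ~ "Master",
    title %in% c("Dr", "Rev", "Col", "Major", "Capt") ~ "Officer",
    # extract_title() returns "the Countess" (with the article), so match that form too
    title %in% c("Don", "Sir", "Lady", "Countess", "the Countess", "Jonkheer", "Dona") ~ "Royalty",
    TRUE ~ "Other"
  )
}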

train <- train %>% mutate(Title_Collapsed = collapse_titles(Title))
test <- test %>% mutate(Title_Collapsed = collapse_titles(Title))

write_csv(train, "train_engineered.csv")
write_csv(test, "test_engineered.csv")

What we are doing:
We consolidate the Title categories into broader groups such as “Mr”, “Miss”, “Officer”, and “Royalty”. We then save the engineered versions of both datasets to CSV files.

Why we are doing it:
Categorical features with many sparse levels (high cardinality) can lead to overfitting and unreliable model estimates. By collapsing semantically similar or infrequent titles, we preserve essential distinctions while improving model robustness (Kuhn & Johnson, 2019). For example, collapsing “Capt”, “Col”, and “Dr” into “Officer” maintains hierarchical role information without the noise of rare labels. This also aligns with feature reduction strategies aimed at improving generalization and reducing variance (James et al., 2021).
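
The effect of collapsing on sparsity can be checked by counting levels before and after. The sketch below uses a hypothetical handful of titles and, for brevity, applies only the “Officer” rule rather than the full mapping.

```r
library(dplyr)
library(tibble)

# Hypothetical sample of raw titles (not the real frequency distribution)
titles <- tibble(
  Title = c("Mr", "Mr", "Mr", "Miss", "Miss", "Mrs", "Col", "Capt", "Dr")
)

# Before collapsing: several levels appear only once
raw_counts <- titles %>% count(Title, sort = TRUE)

# Apply only the "Officer" grouping for illustration
collapsed <- titles %>%
  mutate(Group = if_else(Title %in% c("Dr", "Rev", "Col", "Major", "Capt"),
                         "Officer", Title))
collapsed_counts <- collapsed %>% count(Group, sort = TRUE)

n_before <- n_distinct(titles$Title)
n_after  <- n_distinct(collapsed$Group)
print(c(before = n_before, after = n_after))
```

Running count() on the real Title column in the same way is a reasonable check before deciding which levels to merge.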

Why we save the data now:
By saving the engineered datasets here, we ensure that they are readily available for modeling and validation. This separation of feature engineering and model building improves reproducibility and auditability.
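
One practical consequence of saving to CSV is that Title_Collapsed is stored as plain text, so a later modeling script has to convert it back to a factor. A minimal sketch, assuming the seven groups produced by collapse_titles() and using hypothetical values in place of the saved columns, fixes the level set explicitly so that both datasets are encoded identically even when a group is absent from one of them.

```r
# The seven groups produced by collapse_titles(), in a fixed order
title_levels <- c("Mr", "Miss", "Mrs", "Master", "Officer", "Royalty", "Other")

# Hypothetical values standing in for the saved Title_Collapsed columns
train_titles <- c("Mr", "Miss", "Officer", "Mr")
test_titles  <- c("Mrs", "Royalty", "Mr")

# Convert with the shared level set rather than letting each dataset infer its own
train_f <- factor(train_titles, levels = title_levels)
test_f  <- factor(test_titles,  levels = title_levels)

same_levels <- identical(levels(train_f), levels(test_f))
print(same_levels)
print(table(test_f))  # groups absent from the test sample appear with count 0
```

Without the explicit levels argument, each factor would contain only the groups observed in its own data, and the two encodings could differ.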

Step 5: Detailed Justification for Each Feature Engineering Operation

This feature engineering pipeline is guided by both patterns observed in the data and decisions informed by the literature; the rationale for each operation is given under its step above.

Step 6: Importance of Consistent Transformations for Test Data

Applying exactly the same feature engineering steps to the test dataset as to the training dataset is essential. Failing to do so produces an inconsistent input space, and fitting any transformation on the test data introduces leakage; either problem makes performance evaluation unreliable.
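
One way to guarantee that both datasets receive identical steps is to wrap the whole pipeline in a single function and call it twice. The sketch below is an illustration, not part of the original pipeline: engineer_features() is a hypothetical wrapper, and the demo rows are small stand-ins for the real files.

```r
library(dplyr)
library(stringr)
library(tibble)

extract_title <- function(name) {
  str_trim(str_match(name, ",\\s*([^\\.]+)\\.")[, 2])
}

collapse_titles <- function(title) {
  case_when(
    title %in% c("Mr") ~ "Mr",
    title %in% c("Miss", "Mlle", "Ms") ~ "Miss",
    title %in% c("Mrs", "Mme") ~ "Mrs",
    title == "Master" ~ "Master",
    title %in% c("Dr", "Rev", "Col", "Major", "Capt") ~ "Officer",
    # The regex returns "the Countess" with its article, so both forms are listed
    title %in% c("Don", "Sir", "Lady", "Countess", "the Countess",
                 "Jonkheer", "Dona") ~ "Royalty",
    TRUE ~ "Other"
  )
}

# Hypothetical wrapper: every engineering step lives in one place
engineer_features <- function(df) {
  df %>%
    mutate(
      Title           = extract_title(Name),
      Age_Pclass      = Age * Pclass,
      Title_Collapsed = collapse_titles(Title)
    )
}

# Small stand-ins for the training and test data
train_demo <- tibble(
  Name     = c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"),
  Age      = c(22, 26),
  Pclass   = c(3, 3),
  Survived = c(0, 1)
)
test_demo <- tibble(Name = "Kelly, Mr. James", Age = 34.5, Pclass = 3)

train_eng <- engineer_features(train_demo)
test_eng  <- engineer_features(test_demo)

# The same three columns are added to both datasets
new_cols <- setdiff(names(train_eng), names(train_demo))
print(new_cols)
```

With this design, a change to any step is made once and reaches both datasets automatically, which removes the risk of the two code paths drifting apart.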

Why consistency matters:

  1. Predictive Integrity: Models learn patterns based on the structure and range of features in the training data. Any mismatch in test data features breaks this consistency (Kelleher & Tierney, 2018).
  2. Reproducibility: Academic and production pipelines must yield the same output for the same transformations. Any deviation introduces variability that is difficult to trace and debug.
  3. Data Leakage Avoidance: If transformation parameters are estimated from the test set (or from the training and test sets combined), the model is effectively “seeing” test data during training, which violates core machine learning principles (Kuhn & Johnson, 2019). Any statistics a transformation needs should be derived from the training set alone and then applied unchanged to the test set.

Scholarly Support:
Kelleher and Tierney (2018) emphasize the “closed-world” assumption in supervised learning: the model assumes that features seen during inference are drawn from the same distribution and transformation pipeline as the training data. Any deviation compromises generalization. James et al. (2021) further warn that preprocessing inconsistencies result in misleading error rates and underperformance in deployment.
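
The pipeline in this report uses only row-wise transformations, so no statistics need to be estimated. If a later step did require one, for example imputing missing Age values, a leakage-free version would learn the value from the training rows only and reuse it for the test rows. The sketch below illustrates that pattern under stated assumptions: median imputation is a hypothetical step not performed in this report, and the data are made up.

```r
library(dplyr)
library(tibble)

# Hypothetical data with missing ages
train_demo <- tibble(Age = c(22, 38, NA, 35), Pclass = c(3, 1, 3, 1))
test_demo  <- tibble(Age = c(NA, 47),         Pclass = c(3, 3))

# Learn the statistic from the training rows only
age_median <- median(train_demo$Age, na.rm = TRUE)

# Hypothetical helper: fill missing ages with a supplied value
impute_age <- function(df, fill) {
  df %>% mutate(Age = if_else(is.na(Age), fill, Age))
}

# Apply the same training-derived value to both datasets
train_imp <- impute_age(train_demo, age_median)
test_imp  <- impute_age(test_demo,  age_median)

print(age_median)
print(test_imp$Age)
```

Computing the median on test_demo, or on the two datasets combined, would let information from the test set influence the features, which is the leakage described above.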

Step 7: Summary of Feature Engineering

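This section provides a concise overview of the feature engineering operations and the rationale behind each. Title was extracted from Name to expose social role, gender, and approximate age category. Age_Pclass was created as the product of Age and Pclass so that a model can capture how the effect of age may differ by travel class. Title_Collapsed groups rare or semantically similar titles into broader categories to reduce sparsity. All three operations were applied identically to the training and test datasets.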

References