Feature engineering is a critical preprocessing step in predictive modeling. The quality and relevance of the engineered features often determine the performance and interpretability of machine learning models (Kuhn & Johnson, 2019). This report demonstrates how to perform robust and justified feature engineering on the Titanic dataset, ensuring no data leakage and maintaining transformation consistency across training and test datasets.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr) # already attached with the tidyverse; loaded explicitly to document the dependency
# Set working directory
setwd("C:/DDS 8501 Titanic") # Adjust as needed
train <- read_csv("train.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test <- read_csv("test.csv")
## Rows: 418 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (6): PassengerId, Pclass, Age, SibSp, Parch, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
What is being done:
This operation begins by importing the training and test datasets with
the read_csv() function from the readr package,
which is part of the tidyverse. It also loads
stringr for string manipulation tasks later in the
pipeline.
Why it is being done:
The training data is used to build and validate models, while the test
data evaluates generalization. These datasets must be loaded separately
to prevent information leakage and to maintain the integrity of
downstream evaluations (Kelleher & Tierney, 2018).
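A hedged alternative for the import step: the column specification printed above can be passed to read_csv() explicitly, which silences the guessing message and guarantees identical column types on every run. The col_types values below simply mirror the specification readr reported and should be adjusted if the files differ.
# Optional: declare column types up front instead of letting readr guess them
# (mirrors the specification printed above; adjust if the files differ)
train <- read_csv(
  "train.csv",
  col_types = cols(
    Name = col_character(), Sex = col_character(), Ticket = col_character(),
    Cabin = col_character(), Embarked = col_character(),
    .default = col_double()
  )
)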
extract_title <- function(name) {
# Extracts the title using a capture group
str_trim(str_match(name, ",\\s*([^\\.]+)\\.")[,2])
}
train <- train %>% mutate(Title = extract_title(Name))
test <- test %>% mutate(Title = extract_title(Name))
What is being done:
This operation defines and applies a function to extract each passenger’s
title (e.g., “Mr”, “Mrs”, “Miss”) from the Name field using
regular expressions. This creates a new feature called
Title.
Why it is being done:
The Name variable contains embedded social information.
Extracting the title provides explicit access to an individual’s social
role, gender, and potentially age category (e.g., “Master” indicates a
young male). Studies have shown that such demographic and social
identity features are often strong predictors in survival analysis
(Hornik et al., 2020). Feature extraction from unstructured text enables
domain-relevant insight that enhances model performance (Kuhn &
Johnson, 2019).
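As a quick, illustrative sanity check (not a required part of the pipeline), the extracted titles can be tabulated to confirm that the regular expression behaves as expected:
# Illustrative check: frequency of each extracted title in the training data
train %>%
  count(Title, sort = TRUE)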
train <- train %>% mutate(Age_Pclass = Age * Pclass)
test <- test %>% mutate(Age_Pclass = Age * Pclass)
What is being done:
This operation creates a new numeric feature by multiplying the
Age and Pclass variables. This interaction
term, Age_Pclass, captures how age varies across different
socioeconomic classes.
Why it is being done:
Interaction features are crucial in uncovering hidden dependencies
between predictors. While Age measures passenger
age and Pclass captures travel class (a proxy for
socioeconomic status), their interaction may reflect compounded effects.
For example, a 10-year-old in 1st class may have had higher survival
odds than a 10-year-old in 3rd class. Interaction terms are
well-established enhancements in both linear and nonlinear models
(James et al., 2021). As predictive modeling seeks to reflect complex
realities, such combinations increase both granularity and model
expressiveness.
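The claim about compounded effects can be illustrated with a simple, hedged check; the Child/Adult cut-off of 13 years below is an arbitrary illustrative choice, and rows with missing Age are dropped (note that Age_Pclass is likewise NA whenever Age is missing).
# Illustrative check: survival rate by class for children versus adults
# (the age cut-off of 13 is arbitrary; rows with missing Age are excluded)
train %>%
  filter(!is.na(Age)) %>%
  mutate(AgeGroup = if_else(Age < 13, "Child", "Adult")) %>%
  group_by(Pclass, AgeGroup) %>%
  summarise(SurvivalRate = mean(Survived), n = n(), .groups = "drop")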
collapse_titles <- function(title) {
case_when(
title %in% c("Mr") ~ "Mr",
title %in% c("Miss", "Mlle", "Ms") ~ "Miss",
title %in% c("Mrs", "Mme") ~ "Mrs",
title == "Master" ~ "Master",
title %in% c("Dr", "Rev", "Col", "Major", "Capt") ~ "Officer",
title %in% c("Don", "Sir", "Lady", "Countess", "Jonkheer", "Dona") ~ "Royalty",
TRUE ~ "Other"
)
}
train <- train %>% mutate(Title_Collapsed = collapse_titles(Title))
test <- test %>% mutate(Title_Collapsed = collapse_titles(Title))
write_csv(train, "train_engineered.csv")
write_csv(test, "test_engineered.csv")
What is being done:
This operation consolidates the Title categories into
broader groups such as “Mr”, “Miss”, “Officer”, and “Royalty”, then
saves the engineered versions of both datasets to CSV
files.
Why it is being done:
Categorical features with many sparse levels (high cardinality) can lead
to overfitting and unreliable model estimates. By collapsing
semantically similar or infrequent titles, this operation preserves
essential distinctions while improving model robustness (Kuhn &
Johnson, 2019). For example, collapsing “Capt”, “Col”, and “Dr” into
“Officer” maintains hierarchical role information without the noise of
rare labels. This also aligns with feature reduction strategies aimed at
improving generalization and reducing variance (James et al., 2021).
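A brief, optional check on the training data (so no test-set information is used) can confirm that each collapsed group is reasonably populated:
# Optional check: counts and survival rates of the collapsed title groups
train %>%
  group_by(Title_Collapsed) %>%
  summarise(n = n(), SurvivalRate = mean(Survived)) %>%
  arrange(desc(n))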
Why the data is saved now:
Saving the engineered datasets at this point ensures that they are
readily available for modeling and validation. This separation of
feature engineering from model building improves reproducibility and
auditability.
This feature engineering pipeline is driven by both data-driven insights and literature-informed decisions:
Applying exactly the same feature engineering steps to the test dataset as to the training dataset is essential. Failure to do so introduces data leakage or an inconsistent input space, rendering any performance evaluation meaningless.
Why consistency matters:
Scholarly Support:
Kelleher and Tierney (2018) emphasize the “closed-world” assumption in
supervised learning: the model assumes that features seen during
inference are drawn from the same distribution and transformation
pipeline as the training data. Any deviation compromises generalization.
James et al. (2021) further warn that preprocessing inconsistencies
result in misleading error rates and underperformance in deployment.
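One minimal, hedged way to make this consistency requirement explicit is to verify that no title level appears in the test data without also appearing in the training data:
# Consistency check: title levels present in test but absent from train
setdiff(unique(test$Title_Collapsed), unique(train$Title_Collapsed))
# character(0) means the test set introduces no unseen levels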
This section provides a concise overview of the feature engineering operations and the rationale behind each:
Title extraction from Name
Extraction of the passenger title via a capture-group regex converted
unstructured name strings into a categorical feature reflecting social
status and gender. Such demographic signals have been shown to improve
survival models (Hornik et al., 2020; Kuhn & Johnson,
2019).
Interaction term (Age × Pclass)
Creation of the Age_Pclass variable by multiplying Age and
Pclass captured compounded socioeconomic–age effects. Interaction
features often reveal patterns missed by linear terms alone (James et
al., 2021; Kelleher & Tierney, 2018).
Title collapsing
Consolidation of rare or semantically similar titles into broader groups
(e.g., “Dr”, “Rev”, “Col” → “Officer”) reduced feature cardinality,
mitigated overfitting, and improved model stability (Kuhn & Johnson,
2019; James et al., 2021).
Consistent transformations on test data
Application of identical preprocessing steps to test data prevented data
leakage, preserved the closed-world assumption, and ensured that
performance metrics accurately reflect generalization ability (Kelleher
& Tierney, 2018; Kuhn & Johnson, 2019).