The goal of this analysis is to build a predictive model to estimate survival on the Titanic and to see how different imputation methods for missing Age values affect model performance. We will:

- Load the Titanic data and explore missingness.
- Split the data into training and test sets.
- Try four different imputation strategies for Age:
  1. Global mean imputation
  2. Global median imputation
  3. Grouped median imputation (median Age within Pclass × Sex groups)
  4. KNN imputation (using preProcess(method = "knnImpute"))
- For each imputation, fit a logistic regression model predicting Survived.
- Compare performance on the test set using confusion matrices and accuracy.

Our analytics goal is predictive performance, so the "best" imputation method is the one that leads to the best out-of-sample classification results.
#install.packages("tidyverse")
#install.packages("caret")
library(tidyverse)
library(caret)
library(dplyr)
titanic <- read.csv("Titanic.csv", stringsAsFactors = FALSE)
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(titanic)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
titanic <- titanic %>%
mutate(
Pclass = factor(Pclass),
Sex = factor(Sex),
Embarked = factor(Embarked)
) %>%
select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
colSums(is.na(titanic))
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 177 0 0 0 0
Insight: Age is missing for 177 of 891 passengers (about 20%). The other selected variables are complete.
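To check whether the missingness is related to the other variables, we can tabulate it by group. A minimal sketch using only columns already in the data:

mean(is.na(titanic$Age)) # overall share of missing Age (177/891)
titanic %>%
  group_by(Pclass, Sex) %>%
  summarize(
    n_missing = sum(is.na(Age)),     # count of missing Age per group
    prop_missing = mean(is.na(Age)), # share of missing Age per group
    .groups = "drop"
  )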
To compare the imputation methods fairly, we first split the data into a training set and a test set, and then apply each imputation method on top of this fixed split.
set.seed(123) # for reproducibility
n <- nrow(titanic)
# (Pclass, Sex, and Embarked are already factors from the mutate() above)
train_index <- sample(seq_len(n), size = 0.7 * n)
train <- titanic[train_index, ]
test <- titanic[-train_index, ]
for (v in c("Pclass", "Sex", "Embarked")) {
base_levels <- levels(titanic[[v]]) # global levels from full data
train[[v]] <- factor(train[[v]], levels = base_levels)
test[[v]] <- factor(test[[v]], levels = base_levels)
}
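As an aside, sample() gives a simple random split. If we wanted the survival rate preserved in both sets, caret's createDataPartition() stratifies on the outcome; a sketch (not run here, since it would replace the split above):

# train_index_strat <- caret::createDataPartition(titanic$Survived, p = 0.7, list = FALSE)
# train <- titanic[train_index_strat, ]
# test <- titanic[-train_index_strat, ]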
colSums(is.na(train))
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 126 0 0 0 0
colSums(is.na(test))
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 51 0 0 0 0
impute_age_mean <- function(train, test) {
mean_age <- mean(train$Age, na.rm = TRUE)
train$Age[is.na(train$Age)] <- mean_age
test$Age[is.na(test$Age)] <- mean_age
list(train = train, test = test)
}
impute_age_median <- function(train, test) {
median_age <- median(train$Age, na.rm = TRUE)
train$Age[is.na(train$Age)] <- median_age
test$Age[is.na(test$Age)] <- median_age
list(train = train, test = test)
}
Grouped median imputation replaces each missing Age with the median Age of the passenger's Pclass × Sex group in the training data. This respects known structure in the data (e.g., first-class women tend to be older than third-class children), so we expect it to be more realistic than a single global number; a quick look at the group medians follows the function definition below.
impute_age_grouped_median <- function(train, test) {
age_by_group <- train %>%
group_by(Pclass, Sex) %>%
summarize(median_age = median(Age, na.rm = TRUE), .groups = "drop")
train_imp <- train %>%
left_join(age_by_group, by = c("Pclass", "Sex")) %>%
mutate(Age = ifelse(is.na(Age), median_age, Age)) %>%
select(-median_age)
test_imp <- test %>%
left_join(age_by_group, by = c("Pclass", "Sex")) %>%
mutate(Age = ifelse(is.na(Age), median_age, Age)) %>%
select(-median_age)
list(train = train_imp, test = test_imp)
}
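To see the structure this function exploits, we can inspect the group medians it computes from the training set (a quick check; exact values depend on the random split):

train %>%
  group_by(Pclass, Sex) %>%
  summarize(median_age = median(Age, na.rm = TRUE), .groups = "drop")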
For the fourth method we use caret's preProcess() with method = "knnImpute". Since KNN needs numeric predictors, we create numeric versions of the factor variables.
impute_age_knn <- function(train, test, k = 5) {
make_knn_data <- function(df) {
df %>%
mutate(
Pclass_num = as.numeric(Pclass),
Sex_num = ifelse(Sex == "male", 1, 0),
Embarked_num = as.numeric(Embarked)
) %>%
select(Age, Pclass_num, Sex_num, SibSp, Parch, Fare, Embarked_num)
}
train_knn <- make_knn_data(train)
test_knn <- make_knn_data(test)
combined <- bind_rows(train_knn, test_knn)
preProcVal <- caret::preProcess(combined, method = "knnImpute", k = k)
combined_imp <- predict(preProcVal, combined) # returns scaled (z-score) values
age_all <- combined_imp$Age
n_train <- nrow(train)
train$Age <- age_all[1:n_train]
test$Age <- age_all[(n_train + 1):length(age_all)]
list(train = train, test = test)
}
Note: KNN imputation with caret centers and scales all variables, so the imputed Age used in this method is on a z-score scale rather than in years. This does not affect the model comparison: logistic regression is invariant to a linear rescaling of a predictor (only the Age coefficient changes, not the fitted probabilities). One caveat is that preProcess() above is fit on the combined train and test rows, so test information can influence the imputer; a stricter variant is sketched below.
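Two optional refinements, shown as comments because they belong inside impute_age_knn(). The first assumes, as caret's preProcess object does for centered/scaled data, that the centering and scaling values are stored in preProcVal$mean and preProcVal$std:

# 1. Undo the z-scaling so Age is back in years:
# age_years <- combined_imp$Age * preProcVal$std["Age"] + preProcVal$mean["Age"]

# 2. Fit the imputer on the training rows only, then apply it to both sets,
#    so the test set cannot influence the imputation model:
# preProcTrain <- caret::preProcess(train_knn, method = "knnImpute", k = k)
# train_imp <- predict(preProcTrain, train_knn)
# test_imp  <- predict(preProcTrain, test_knn)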
fit_and_evaluate <- function(train, test, threshold = 0.45){
# Logistic regression model
model <- glm(
Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare,
data = train,
family = binomial
)
# Align factor levels between train and test
for (v in names(train)) {
if (is.factor(train[[v]])) {
test[[v]] <- factor(test[[v]], levels = levels(train[[v]]))
}
}
# Predictions
probs <- predict(model, newdata = test, type = "response")
preds <- ifelse(probs >= threshold, 1, 0)
cm <- table(Predicted = preds, Actual = test$Survived)
accuracy <- sum(diag(cm)) / sum(cm)
list(
model = model,
cm = cm,
accuracy = accuracy
)
}
results <- list()
## Method 1: Global mean age
imp_mean <- impute_age_mean(train, test)
res_mean <- fit_and_evaluate(imp_mean$train, imp_mean$test)
results$mean <- res_mean
## Method 2: Global median age
imp_median <- impute_age_median(train, test)
res_median <- fit_and_evaluate(imp_median$train, imp_median$test)
results$median <- res_median
## Method 3: Grouped median age (Pclass x Sex)
imp_grouped <- impute_age_grouped_median(train, test)
res_grouped <- fit_and_evaluate(imp_grouped$train, imp_grouped$test)
results$grouped <- res_grouped
## Method 4: KNN
imp_knn <- impute_age_knn(train, test, k = 5)
res_knn <- fit_and_evaluate(imp_knn$train, imp_knn$test)
results$knn <- res_knn
#global mean
results$mean$cm
## Actual
## Predicted 0 1
## 0 135 28
## 1 33 72
results$mean$accuracy
## [1] 0.7723881
#global median
results$median$cm
## Actual
## Predicted 0 1
## 0 135 28
## 1 33 72
results$median$accuracy
## [1] 0.7723881
#grouped median (Pclass x Sex)
results$grouped$cm
## Actual
## Predicted 0 1
## 0 138 26
## 1 30 74
results$grouped$accuracy
## [1] 0.7910448
#KNN
results$knn$cm
## Actual
## Predicted 0 1
## 0 135 26
## 1 33 74
results$knn$accuracy
## [1] 0.7798507
Each confusion matrix shows:

- Rows = predicted class (0 = died, 1 = survived)
- Columns = actual class

Accuracy is the proportion of correct predictions on the test set.
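Accuracy alone can hide class-specific errors. Sensitivity and specificity can be read off any of these confusion matrices; a minimal sketch for the grouped-median model, using the Predicted-by-Actual layout above:

cm <- results$grouped$cm
sensitivity <- cm["1", "1"] / sum(cm[, "1"]) # survivors correctly identified
specificity <- cm["0", "0"] / sum(cm[, "0"]) # deaths correctly identified
c(sensitivity = sensitivity, specificity = specificity)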
age_train_long <- bind_rows(
tibble(Method = "Mean", Age = imp_mean$train$Age),
tibble(Method = "Median", Age = imp_median$train$Age),
tibble(Method = "Grouped", Age = imp_grouped$train$Age),
tibble(Method = "KNN (scaled)", Age = imp_knn$train$Age)
)
ggplot(age_train_long, aes(x = Age)) +
geom_histogram(bins = 30, color = "white") +
facet_wrap(~ Method, scales = "free") +
labs(
title = "Age distributions in the TRAIN set after different imputations",
x = "Age (raw or scaled)",
y = "Count"
)
# a scatterplot of Fare vs Age for the KNN-imputed data, colored by Sex.
ggplot(imp_knn$train, aes(x = Fare, y = Age, color = Sex)) +
geom_point(alpha = 0.7) +
labs(
title = "KNN-imputed Age (scaled) vs Fare, colored by Sex (TRAIN set)",
x = "Fare",
y = "Age (scaled z-score)"
)
## Summary
performance_summary <- tibble(
Method = c(
"Global mean Age",
"Global median Age",
"Grouped median (Pclass x Sex)",
"KNN (caret, k = 5, scaled Age)"
),
Accuracy = c(
results$mean$accuracy,
results$median$accuracy,
results$grouped$accuracy,
results$knn$accuracy
)
)
performance_summary
## # A tibble: 4 × 2
## Method Accuracy
## <chr> <dbl>
## 1 Global mean Age 0.772
## 2 Global median Age 0.772
## 3 Grouped median (Pclass x Sex) 0.791
## 4 KNN (caret, k = 5, scaled Age) 0.780
Based on the summary table above, the accuracy for each method is:

- Global mean Age: 0.772
- Global median Age: 0.772
- Grouped median (Pclass × Sex): 0.791
- KNN (caret, k = 5, scaled Age): 0.780

The best-performing method in terms of test-set accuracy is:
accs <- c(results$mean$accuracy,
results$median$accuracy,
results$grouped$accuracy,
results$knn$accuracy)
best_idx <- which.max(accs)
best_name <- performance_summary$Method[best_idx]
best_name
## [1] "Grouped median (Pclass x Sex)"
Insight:

- Global mean/median imputation is very simple but ignores important structure in the data. Every missing Age is replaced by the same value, which shrinks variability and can distort the relationship between Age and survival.
- Grouped median imputation (Pclass × Sex) uses information known at prediction time (class and sex) to provide a more realistic guess for Age, and better preserves differences in age distributions across groups (e.g., women and children in higher classes vs. male passengers in lower classes).
- KNN imputation uses multiple variables (class, sex, family size, fare, embarkation) to find similar passengers and borrow their ages. This can preserve more complex relationships between Age and the other predictors, but it is more computationally intensive and less transparent.

Because our goal is prediction, using this richer structure helps: on this split, the grouped median gives the best test-set accuracy (0.791), with KNN close behind (0.780), and both ahead of the global mean and median (0.772), which is reflected in the confusion matrices above.