The goal of this analysis is to build a predictive model to estimate survival on the Titanic and to see how different imputation methods for missing Age values affect model performance. We will:

- Load the Titanic data and explore missingness.
- Split the data into training and test sets.
- Try four different imputation strategies for Age:
  1. Global mean imputation
  2. Global median imputation
  3. Grouped median imputation (median Age within Pclass × Sex groups)
  4. KNN imputation (using preProcess(method = "knnImpute"))
- For each imputation, fit a logistic regression model predicting Survived.
- Compare performance on the test set using confusion matrices and accuracy.

Our analytics goal is predictive performance, so the "best" imputation method is the one that leads to the best out-of-sample classification results.
#install.packages("tidyverse")
#install.packages("caret")
library(tidyverse)
library(caret)
library(dplyr)
titanic <- read.csv("Titanic.csv", stringsAsFactors = FALSE)
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(titanic)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
titanic <- titanic %>%
mutate(
Pclass = factor(Pclass),
Sex = factor(Sex),
Embarked = factor(Embarked)
) %>%
select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
colSums(is.na(titanic))
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 177 0 0 0 0
Insight: Age is missing for 177 of 891 passengers (about 20%). The other selected variables are complete.
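To check whether the missingness is related to the other variables, we can tabulate it by group. A minimal sketch using only columns already in the data:

mean(is.na(titanic$Age)) # overall share of missing Age (177/891)
titanic %>%
  group_by(Pclass, Sex) %>%
  summarize(
    n_missing = sum(is.na(Age)),     # count of missing Age per group
    prop_missing = mean(is.na(Age)), # share of missing Age per group
    .groups = "drop"
  )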
To compare the imputation methods fairly, we first split the data into a training set and a test set, and then apply each imputation method on top of this fixed split.
set.seed(123) # for reproducibility
n <- nrow(titanic)
# (Pclass, Sex, and Embarked are already factors from the mutate() above)
train_index <- sample(seq_len(n), size = 0.7 * n)
train <- titanic[train_index, ]
test <- titanic[-train_index, ]
for (v in c("Pclass", "Sex", "Embarked")) {
base_levels <- levels(titanic[[v]]) # global levels from full data
train[[v]] <- factor(train[[v]], levels = base_levels)
test[[v]] <- factor(test[[v]], levels = base_levels)
}
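As an aside, sample() gives a simple random split. If we wanted the survival rate preserved in both sets, caret's createDataPartition() stratifies on the outcome; a sketch (not run here, since it would replace the split above):

# train_index_strat <- caret::createDataPartition(titanic$Survived, p = 0.7, list = FALSE)
# train <- titanic[train_index_strat, ]
# test <- titanic[-train_index_strat, ]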
colSums(is.na(train))
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 126 0 0 0 0
colSums(is.na(test))
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 51 0 0 0 0
impute_age_mean <- function(train, test) {
mean_age <- mean(train$Age, na.rm = TRUE)
train$Age[is.na(train$Age)] <- mean_age
test$Age[is.na(test$Age)] <- mean_age
list(train = train, test = test)
}
impute_age_median <- function(train, test) {
median_age <- median(train$Age, na.rm = TRUE)
train$Age[is.na(train$Age)] <- median_age
test$Age[is.na(test$Age)] <- median_age
list(train = train, test = test)
}
Grouped median imputation replaces each missing Age with the median Age of the passenger's Pclass × Sex group in the training data. This respects known structure in the data (e.g., first-class women tend to be older than third-class children), so we expect it to be more realistic than a single global number; a quick look at the group medians follows the function definition below.
impute_age_grouped_median <- function(train, test) {
age_by_group <- train %>%
group_by(Pclass, Sex) %>%
summarize(median_age = median(Age, na.rm = TRUE), .groups = "drop")
train_imp <- train %>%
left_join(age_by_group, by = c("Pclass", "Sex")) %>%
mutate(Age = ifelse(is.na(Age), median_age, Age)) %>%
select(-median_age)
test_imp <- test %>%
left_join(age_by_group, by = c("Pclass", "Sex")) %>%
mutate(Age = ifelse(is.na(Age), median_age, Age)) %>%
select(-median_age)
list(train = train_imp, test = test_imp)
}
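To see the structure this function exploits, we can inspect the group medians it computes from the training set (a quick check; exact values depend on the random split):

train %>%
  group_by(Pclass, Sex) %>%
  summarize(median_age = median(Age, na.rm = TRUE), .groups = "drop")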
For the fourth method we use caret's preProcess() with method = "knnImpute". Since KNN needs numeric predictors, we create numeric versions of the factor variables.
impute_age_knn <- function(train, test, k = 5) {
make_knn_data <- function(df) {
df %>%
mutate(
Pclass_num = as.numeric(Pclass),
Sex_num = ifelse(Sex == "male", 1, 0),
Embarked_num = as.numeric(Embarked)
) %>%
select(Age, Pclass_num, Sex_num, SibSp, Parch, Fare, Embarked_num)
}
train_knn <- make_knn_data(train)
test_knn <- make_knn_data(test)
combined <- bind_rows(train_knn, test_knn)
preProcVal <- caret::preProcess(combined, method = "knnImpute", k = k)
combined_imp <- predict(preProcVal, combined) # returns scaled (z-score) values
age_all <- combined_imp$Age
n_train <- nrow(train)
train$Age <- age_all[1:n_train]
test$Age <- age_all[(n_train + 1):length(age_all)]
list(train = train, test = test)
}
Note: KNN imputation with caret centers and scales all variables, so the imputed Age used in this method is on a z-score scale rather than in years. This does not affect the model comparison: logistic regression is invariant to a linear rescaling of a predictor (only the Age coefficient changes, not the fitted probabilities). One caveat is that preProcess() above is fit on the combined train and test rows, so test information can influence the imputer; a stricter variant is sketched below.
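Two optional refinements, shown as comments because they belong inside impute_age_knn(). The first assumes, as caret's preProcess object does for centered/scaled data, that the centering and scaling values are stored in preProcVal$mean and preProcVal$std:

# 1. Undo the z-scaling so Age is back in years:
# age_years <- combined_imp$Age * preProcVal$std["Age"] + preProcVal$mean["Age"]

# 2. Fit the imputer on the training rows only, then apply it to both sets,
#    so the test set cannot influence the imputation model:
# preProcTrain <- caret::preProcess(train_knn, method = "knnImpute", k = k)
# train_imp <- predict(preProcTrain, train_knn)
# test_imp  <- predict(preProcTrain, test_knn)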
fit_and_evaluate <- function(train, test, threshold = 0.45){
# Logistic regression model
model <- glm(
Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare,
data = train,
family = binomial
)
# Align factor levels between train and test
for (v in names(train)) {
if (is.factor(train[[v]])) {
test[[v]] <- factor(test[[v]], levels = levels(train[[v]]))
}
}
# Predictions
probs <- predict(model, newdata = test, type = "response")
preds <- ifelse(probs >= threshold, 1, 0)
cm <- table(Predicted = preds, Actual = test$Survived)
accuracy <- sum(diag(cm)) / sum(cm)
list(
model = model,
cm = cm,
accuracy = accuracy
)
}
results <- list()
## Method 1: Global mean age
imp_mean <- impute_age_mean(train, test)
res_mean <- fit_and_evaluate(imp_mean$train, imp_mean$test)
results$mean <- res_mean
## Method 2: Global median age
imp_median <- impute_age_median(train, test)
res_median <- fit_and_evaluate(imp_median$train, imp_median$test)
results$median <- res_median
## Method 3: Grouped median age (Pclass x Sex)
imp_grouped <- impute_age_grouped_median(train, test)
res_grouped <- fit_and_evaluate(imp_grouped$train, imp_grouped$test)
results$grouped <- res_grouped
## Method 4: KNN
imp_knn <- impute_age_knn(train, test, k = 5)
res_knn <- fit_and_evaluate(imp_knn$train, imp_knn$test)
results$knn <- res_knn
#global mean
results$mean$cm
## Actual
## Predicted 0 1
## 0 135 28
## 1 33 72
results$mean$accuracy
## [1] 0.7723881
#global median
results$median$cm
## Actual
## Predicted 0 1
## 0 135 28
## 1 33 72
results$median$accuracy
## [1] 0.7723881
#grouped median (Pclass x Sex)
results$grouped$cm
## Actual
## Predicted 0 1
## 0 138 26
## 1 30 74
results$grouped$accuracy
## [1] 0.7910448
#KNN
results$knn$cm
## Actual
## Predicted 0 1
## 0 135 26
## 1 33 74
results$knn$accuracy
## [1] 0.7798507
Each confusion matrix shows:

- Rows = predicted class (0 = died, 1 = survived)
- Columns = actual class

Accuracy is the proportion of correct predictions on the test set.
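Accuracy alone can hide class-specific errors. Sensitivity and specificity can be read off any of these confusion matrices; a minimal sketch for the grouped-median model, using the Predicted-by-Actual layout above:

cm <- results$grouped$cm
sensitivity <- cm["1", "1"] / sum(cm[, "1"]) # survivors correctly identified
specificity <- cm["0", "0"] / sum(cm[, "0"]) # deaths correctly identified
c(sensitivity = sensitivity, specificity = specificity)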
age_train_long <- bind_rows(
tibble(Method = "Mean", Age = imp_mean$train$Age),
tibble(Method = "Median", Age = imp_median$train$Age),
tibble(Method = "Grouped", Age = imp_grouped$train$Age),
tibble(Method = "KNN (scaled)", Age = imp_knn$train$Age)
)
ggplot(age_train_long, aes(x = Age)) +
geom_histogram(bins = 30, color = "white") +
facet_wrap(~ Method, scales = "free") +
labs(
title = "Age distributions in the TRAIN set after different imputations",
x = "Age (raw or scaled)",
y = "Count"
)
# a scatterplot of Fare vs Age for the KNN-imputed data, colored by Sex.
ggplot(imp_knn$train, aes(x = Fare, y = Age, color = Sex)) +
geom_point(alpha = 0.7) +
labs(
title = "KNN-imputed Age (scaled) vs Fare, colored by Sex (TRAIN set)",
x = "Fare",
y = "Age (scaled z-score)"
)
## Summary
performance_summary <- tibble(
Method = c(
"Global mean Age",
"Global median Age",
"Grouped median (Pclass x Sex)",
"KNN (caret, k = 5, scaled Age)"
),
Accuracy = c(
results$mean$accuracy,
results$median$accuracy,
results$grouped$accuracy,
results$knn$accuracy
)
)
performance_summary
## # A tibble: 4 × 2
## Method Accuracy
## <chr> <dbl>
## 1 Global mean Age 0.772
## 2 Global median Age 0.772
## 3 Grouped median (Pclass x Sex) 0.791
## 4 KNN (caret, k = 5, scaled Age) 0.780
Based on the summary table above, the accuracy for each method is:

- Global mean Age: 0.772
- Global median Age: 0.772
- Grouped median (Pclass × Sex): 0.791
- KNN (caret, k = 5, scaled Age): 0.780

The best-performing method in terms of test-set accuracy is:
accs <- c(results$mean$accuracy,
results$median$accuracy,
results$grouped$accuracy,
results$knn$accuracy)
best_idx <- which.max(accs)
best_name <- performance_summary$Method[best_idx]
best_name
## [1] "Grouped median (Pclass x Sex)"
Insight:

- Global mean/median imputation is very simple but ignores important structure in the data. Every missing Age is replaced by the same value, which shrinks variability and can distort the relationship between Age and survival.
- Grouped median imputation (Pclass × Sex) uses information known at prediction time (class and sex) to provide a more realistic guess for Age, and better preserves differences in age distributions across groups (e.g., women and children in higher classes vs. male passengers in lower classes).
- KNN imputation uses multiple variables (class, sex, family size, fare, embarkation) to find similar passengers and borrow their ages. This can preserve more complex relationships between Age and the other predictors, but it is more computationally intensive and less transparent.

Because our goal is prediction, using this richer structure helps: on this split, the grouped median gives the best test-set accuracy (0.791), with KNN close behind (0.780), and both ahead of the global mean and median (0.772), which is reflected in the confusion matrices above.