Logistic Regression on Titanic Dataset

🧠 Introduction

This document walks through how to use logistic regression to model survival on the Titanic. Logistic regression is used when the outcome variable is binary (e.g., survived or not). The goal is to predict the probability of survival based on variables like sex, age, and passenger class.

📦 Load and Clean the Data

We use the titanic package, clean missing data, and prepare our variables.

library(titanic)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(rsample)
library(ggplot2)
library(caret)

## Loading required package: lattice

library(broom)
library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

data("titanic_train")

df <- titanic_train %>%
  select(Survived, Sex, Age, Pclass) %>%
  filter(!is.na(Age)) %>%
  mutate(
    Survived = factor(Survived),
    Sex = factor(Sex),
    Pclass = factor(Pclass)
  )

🧪 Train/Test Split

We split the data into 70% training and 30% testing for model evaluation.

set.seed(42)
split <- initial_split(df, prop = 0.7, strata = Survived)
train_data <- training(split)
test_data <- testing(split)

🔧 Fit the Logistic Regression Model

We build a logistic model using glm() with a binomial family.

log_model <- glm(Survived ~ Sex + Age + Pclass, data = train_data, family = "binomial")
summary(log_model)

## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass, family = "binomial", 
##     data = train_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.372949   0.455281   7.408 1.28e-13 ***
## Sexmale     -2.472427   0.243842 -10.139  < 2e-16 ***
## Age         -0.028393   0.008902  -3.189 0.001426 ** 
## Pclass2     -1.247157   0.326906  -3.815 0.000136 ***
## Pclass3     -2.333740   0.327017  -7.136 9.58e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 674.33  on 498  degrees of freedom
## Residual deviance: 461.24  on 494  degrees of freedom
## AIC: 471.24
## 
## Number of Fisher Scoring iterations: 4

📊 Interpret Model Coefficients

We exponentiate the coefficients to interpret them as odds ratios.

tidy(log_model, exponentiate = TRUE, conf.int = TRUE)

## # A tibble: 5 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)  29.2      0.455        7.41 1.28e-13  12.3       73.6  
## 2 Sexmale       0.0844   0.244      -10.1  3.69e-24   0.0517     0.135
## 3 Age           0.972    0.00890     -3.19 1.43e- 3   0.955      0.989
## 4 Pclass2       0.287    0.327       -3.82 1.36e- 4   0.150      0.540
## 5 Pclass3       0.0969   0.327       -7.14 9.58e-13   0.0501     0.181

An odds ratio < 1 means reduced odds of survival.
An odds ratio > 1 means increased odds of survival.

🔮 Predict on Test Data

We predict probabilities and convert them into predicted classes.

test_data$predicted_prob <- predict(log_model, newdata = test_data, type = "response")
test_data$predicted_class <- ifelse(test_data$predicted_prob > 0.5, "1", "0")
test_data$predicted_class <- factor(test_data$predicted_class, levels = c("0", "1"))

✅ Evaluate Model Performance

We use a confusion matrix to see how well the model performs.

confusionMatrix(test_data$predicted_class, test_data$Survived, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 108  25
##          1  20  62
##                                          
##                Accuracy : 0.7907         
##                  95% CI : (0.7301, 0.843)
##     No Information Rate : 0.5953         
##     P-Value [Acc > NIR] : 9.242e-10      
##                                          
##                   Kappa : 0.5616         
##                                          
##  Mcnemar's Test P-Value : 0.551          
##                                          
##             Sensitivity : 0.7126         
##             Specificity : 0.8438         
##          Pos Pred Value : 0.7561         
##          Neg Pred Value : 0.8120         
##              Prevalence : 0.4047         
##          Detection Rate : 0.2884         
##    Detection Prevalence : 0.3814         
##       Balanced Accuracy : 0.7782         
##                                          
##        'Positive' Class : 1              
##

Accuracy: Overall correct predictions
Sensitivity: Correctly identified survivors
Specificity: Correctly identified non-survivors

📈 ROC Curve and AUC

roc_obj <- roc(test_data$Survived, test_data$predicted_prob)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_obj, col = "blue", main = "ROC Curve")

auc(roc_obj)

## Area under the curve: 0.8682

AUC close to 1 → great model
AUC = 0.5 → no better than guessing

🤖 Predict Survival for a New Passenger

Suppose we want to estimate the probability of survival for a 25-year-old male passenger in 3rd class:

new_passenger <- data.frame(
  Sex = factor("male", levels = levels(df$Sex)),
  Age = 25,
  Pclass = factor("3", levels = levels(df$Pclass))
)

predict(log_model, newdata = new_passenger, type = "response")

##         1 
## 0.1049837

This will return the predicted probability of survival (a value between 0 and 1) for that passenger.

📝 Summary

Logistic regression models the log odds of a binary outcome
We used Sex, Age, and Pclass to predict survival
Evaluated performance with accuracy, confusion matrix, and AUC
Logistic regression provides insight and prediction

You can adjust the threshold (default = 0.5) or try more predictors to improve the model!