🧠 Introduction

This document walks through using logistic regression to model survival on the Titanic. Logistic regression is appropriate when the outcome variable is binary (e.g., survived or did not survive); the goal here is to predict the probability of survival from sex, age, and passenger class.
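
Under the hood, the model estimates coefficients on the log-odds scale and maps the linear predictor to a probability with the logistic (inverse-logit) function. A minimal base-R illustration of that mapping:

plogis(0)    # log-odds of 0 corresponds to a probability of 0.5
plogis(2)    # log-odds of 2 corresponds to a probability of about 0.88
qlogis(0.5)  # qlogis() is the inverse: probability 0.5 maps back to log-odds 0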


📦 Load and Clean the Data

We load the titanic package, keep the variables we need (Survived, Sex, Age, Pclass), drop rows with missing Age, and convert Survived, Sex, and Pclass to factors.

library(titanic)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rsample)
library(ggplot2)
library(caret)
## Loading required package: lattice
library(broom)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
data("titanic_train")

df <- titanic_train %>%
  select(Survived, Sex, Age, Pclass) %>%
  filter(!is.na(Age)) %>%
  mutate(
    Survived = factor(Survived),
    Sex = factor(Sex),
    Pclass = factor(Pclass)
  )
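
As an optional sanity check, you can confirm that the filtering and type conversions did what we expect:

str(df)  # Survived, Sex, and Pclass should now be factors; Age numeric with no missing values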

🧪 Train/Test Split

We split the data into 70% training and 30% testing, stratified by Survived, so the model can be evaluated on held-out data.

set.seed(42)
split <- initial_split(df, prop = 0.7, strata = Survived)
train_data <- training(split)
test_data <- testing(split)
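
To verify the split, a quick look at the row counts and class balance in each partition works:

nrow(train_data)
nrow(test_data)
prop.table(table(train_data$Survived))  # stratification should keep these proportions close
prop.table(table(test_data$Survived))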

🔧 Fit the Logistic Regression Model

We fit the logistic regression model with glm() and family = "binomial", which uses the logit link by default.

log_model <- glm(Survived ~ Sex + Age + Pclass, data = train_data, family = "binomial")
summary(log_model)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass, family = "binomial", 
##     data = train_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.372949   0.455281   7.408 1.28e-13 ***
## Sexmale     -2.472427   0.243842 -10.139  < 2e-16 ***
## Age         -0.028393   0.008902  -3.189 0.001426 ** 
## Pclass2     -1.247157   0.326906  -3.815 0.000136 ***
## Pclass3     -2.333740   0.327017  -7.136 9.58e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 674.33  on 498  degrees of freedom
## Residual deviance: 461.24  on 494  degrees of freedom
## AIC: 471.24
## 
## Number of Fisher Scoring iterations: 4

📊 Interpret Model Coefficients

We exponentiate the coefficients to interpret them as odds ratios.

tidy(log_model, exponentiate = TRUE, conf.int = TRUE)
## # A tibble: 5 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)  29.2      0.455        7.41 1.28e-13  12.3       73.6  
## 2 Sexmale       0.0844   0.244      -10.1  3.69e-24   0.0517     0.135
## 3 Age           0.972    0.00890     -3.19 1.43e- 3   0.955      0.989
## 4 Pclass2       0.287    0.327       -3.82 1.36e- 4   0.150      0.540
## 5 Pclass3       0.0969   0.327       -7.14 9.58e-13   0.0501     0.181
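
These odds ratios are simply the exponentiated log-odds coefficients from the model summary. For example, the Sexmale estimate can be reproduced directly:

exp(coef(log_model)["Sexmale"])  # ~0.084: far lower odds of survival for male passengers, holding Age and Pclass fixed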

🔮 Predict on Test Data

We predict survival probabilities on the test set and convert them to predicted classes using a 0.5 threshold.

test_data$predicted_prob <- predict(log_model, newdata = test_data, type = "response")
test_data$predicted_class <- ifelse(test_data$predicted_prob > 0.5, "1", "0")
test_data$predicted_class <- factor(test_data$predicted_class, levels = c("0", "1"))
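
If you want to eyeball a few predictions alongside the observed outcomes:

head(test_data[, c("Survived", "predicted_prob", "predicted_class")])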

✅ Evaluate Model Performance

We use caret's confusionMatrix() to see how well the predicted classes match the observed outcomes on the test set.

confusionMatrix(test_data$predicted_class, test_data$Survived, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 108  25
##          1  20  62
##                                          
##                Accuracy : 0.7907         
##                  95% CI : (0.7301, 0.843)
##     No Information Rate : 0.5953         
##     P-Value [Acc > NIR] : 9.242e-10      
##                                          
##                   Kappa : 0.5616         
##                                          
##  Mcnemar's Test P-Value : 0.551          
##                                          
##             Sensitivity : 0.7126         
##             Specificity : 0.8438         
##          Pos Pred Value : 0.7561         
##          Neg Pred Value : 0.8120         
##              Prevalence : 0.4047         
##          Detection Rate : 0.2884         
##    Detection Prevalence : 0.3814         
##       Balanced Accuracy : 0.7782         
##                                          
##        'Positive' Class : 1              
## 
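
The headline accuracy can also be recovered directly from the predictions, which is a useful cross-check:

mean(test_data$predicted_class == test_data$Survived)  # should match the reported accuracy of ~0.79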

📈 ROC Curve and AUC

We plot the ROC curve and compute the AUC on the test-set probabilities to see how well the model separates survivors from non-survivors across all possible thresholds.

roc_obj <- roc(test_data$Survived, test_data$predicted_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_obj, col = "blue", main = "ROC Curve")

auc(roc_obj)
## Area under the curve: 0.8682

🤖 Predict Survival for a New Passenger

Suppose we want to estimate the probability of survival for a 25-year-old male passenger in 3rd class:

new_passenger <- data.frame(
  Sex = factor("male", levels = levels(df$Sex)),
  Age = 25,
  Pclass = factor("3", levels = levels(df$Pclass))
)

predict(log_model, newdata = new_passenger, type = "response")
##         1 
## 0.1049837

The model estimates roughly a 10% chance of survival for this passenger (predicted probabilities always fall between 0 and 1).
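
As a sketch of what predict() does here, the same number can be assembled by hand from the fitted coefficients shown earlier, passed through the logistic function:

# intercept + Sexmale + 25 * Age coefficient + Pclass3, mapped to a probability with plogis()
plogis(3.372949 - 2.472427 - 0.028393 * 25 - 2.333740)  # ~0.105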


📝 Summary

With only sex, age, and passenger class as predictors, the model reaches roughly 79% accuracy and an AUC of 0.87 on the test set. You can adjust the classification threshold (0.5 by default) or add more predictors to try to improve it.
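
For example, you could lower the threshold to 0.4 (an arbitrary choice here) to favor sensitivity over specificity and re-run the evaluation:

pred_class_04 <- factor(ifelse(test_data$predicted_prob > 0.4, "1", "0"), levels = c("0", "1"))
confusionMatrix(pred_class_04, test_data$Survived, positive = "1")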