This document walks through how to use logistic regression to model survival on the Titanic. Logistic regression is used when the outcome variable is binary (e.g., survived or not). The goal is to predict the probability of survival based on variables like sex, age, and passenger class.
We use the titanic package, clean missing data, and
prepare our variables.
library(titanic)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rsample)
library(ggplot2)
library(caret)
## Loading required package: lattice
library(broom)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
data("titanic_train")
df <- titanic_train %>%
select(Survived, Sex, Age, Pclass) %>%
filter(!is.na(Age)) %>%
mutate(
Survived = factor(Survived),
Sex = factor(Sex),
Pclass = factor(Pclass)
)
We split the data into 70% training and 30% testing for model evaluation.
set.seed(42)
split <- initial_split(df, prop = 0.7, strata = Survived)
train_data <- training(split)
test_data <- testing(split)
We build a logistic model using glm() with a binomial
family.
log_model <- glm(Survived ~ Sex + Age + Pclass, data = train_data, family = "binomial")
summary(log_model)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass, family = "binomial",
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.372949 0.455281 7.408 1.28e-13 ***
## Sexmale -2.472427 0.243842 -10.139 < 2e-16 ***
## Age -0.028393 0.008902 -3.189 0.001426 **
## Pclass2 -1.247157 0.326906 -3.815 0.000136 ***
## Pclass3 -2.333740 0.327017 -7.136 9.58e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 674.33 on 498 degrees of freedom
## Residual deviance: 461.24 on 494 degrees of freedom
## AIC: 471.24
##
## Number of Fisher Scoring iterations: 4
We exponentiate the coefficients to interpret them as odds ratios.
tidy(log_model, exponentiate = TRUE, conf.int = TRUE)
## # A tibble: 5 ร 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 29.2 0.455 7.41 1.28e-13 12.3 73.6
## 2 Sexmale 0.0844 0.244 -10.1 3.69e-24 0.0517 0.135
## 3 Age 0.972 0.00890 -3.19 1.43e- 3 0.955 0.989
## 4 Pclass2 0.287 0.327 -3.82 1.36e- 4 0.150 0.540
## 5 Pclass3 0.0969 0.327 -7.14 9.58e-13 0.0501 0.181
We predict probabilities and convert them into predicted classes.
test_data$predicted_prob <- predict(log_model, newdata = test_data, type = "response")
test_data$predicted_class <- ifelse(test_data$predicted_prob > 0.5, "1", "0")
test_data$predicted_class <- factor(test_data$predicted_class, levels = c("0", "1"))
We use a confusion matrix to see how well the model performs.
confusionMatrix(test_data$predicted_class, test_data$Survived, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 108 25
## 1 20 62
##
## Accuracy : 0.7907
## 95% CI : (0.7301, 0.843)
## No Information Rate : 0.5953
## P-Value [Acc > NIR] : 9.242e-10
##
## Kappa : 0.5616
##
## Mcnemar's Test P-Value : 0.551
##
## Sensitivity : 0.7126
## Specificity : 0.8438
## Pos Pred Value : 0.7561
## Neg Pred Value : 0.8120
## Prevalence : 0.4047
## Detection Rate : 0.2884
## Detection Prevalence : 0.3814
## Balanced Accuracy : 0.7782
##
## 'Positive' Class : 1
##
roc_obj <- roc(test_data$Survived, test_data$predicted_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_obj, col = "blue", main = "ROC Curve")
auc(roc_obj)
## Area under the curve: 0.8682
Suppose we want to estimate the probability of survival for a 25-year-old male passenger in 3rd class:
new_passenger <- data.frame(
Sex = factor("male", levels = levels(df$Sex)),
Age = 25,
Pclass = factor("3", levels = levels(df$Pclass))
)
predict(log_model, newdata = new_passenger, type = "response")
## 1
## 0.1049837
This will return the predicted probability of survival (a value between 0 and 1) for that passenger.
Sex, Age, and Pclass
to predict survivalYou can adjust the threshold (default = 0.5) or try more predictors to improve the model!