Part 1
After reading, I concluded that including interaction terms in a logit regression model can be tricky because the effects are not as straightforward as in a linear model. In a linear regression, an interaction term simply adjusts how one variable influences another. However, in a logit model, since everything operates on a transformed (log-odds) scale, the interaction effect is not just the coefficient—it changes depending on the values of other variables. Ai and Norton (2003) highlight how this can easily lead to misinterpretation if coefficients are read the same way as in a standard regression.
A simulation-based approach helps overcome this challenge. Instead of relying solely on raw coefficients, methods like those discussed by Zelner (2009) and King et al. (2000) allow for the simulation of different scenarios to visualize how probabilities change based on interaction effects. This makes it easier to grasp the real-world impact of relationships rather than deciphering abstract numbers. Additionally, simulations help capture uncertainty around estimates, leading to clearer and more reliable conclusions.
Part 2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
## Warning: package 'broom' was built under R version 4.3.3
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
if (!requireNamespace("titanic", quietly = TRUE)) {
install.packages("titanic")
}
data("titanic_train", package = "titanic")
head(data)
##
## 1 function (..., list = character(), package = NULL, lib.loc = NULL,
## 2 verbose = getOption("verbose"), envir = .GlobalEnv, overwrite = TRUE)
## 3 {
## 4 fileExt <- function(x) {
## 5 db <- grepl("\\\\.[^.]+\\\\.(gz|bz2|xz)$", x)
## 6 ans <- sub(".*\\\\.", "", x)
titanic_data <- as.data.frame(titanic_train)
colSums(is.na(titanic_train))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
titanic_data <- titanic_train %>%
dplyr::select(Survived, Pclass, Sex, Age, Fare) %>%
drop_na()
print(dim(titanic_data))
## [1] 714 5
model2 <- glm(Survived ~ Pclass + Sex, data = titanic_data, family = binomial)
model3 <- glm(Survived ~ Pclass + Sex + Age + Fare, data = titanic_data, family = binomial)
nrow(model.frame(model2))
## [1] 714
nrow(model.frame(model3))
## [1] 714
anova(model2, model3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Survived ~ Pclass + Sex
## Model 2: Survived ~ Pclass + Sex + Age + Fare
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 711 672.51
## 2 709 647.23 2 25.281 3.239e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(model2, model3)
## df AIC
## model2 3 678.5106
## model3 5 657.2298
BIC(model2, model3)
## df BIC
## model2 3 692.2232
## model3 5 680.0842
summary(model3)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + Fare, family = binomial,
## data = titanic_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.9880403 0.5721894 8.717 < 2e-16 ***
## Pclass -1.2697410 0.1586252 -8.005 1.20e-15 ***
## Sexmale -2.5181969 0.2078562 -12.115 < 2e-16 ***
## Age -0.0367073 0.0076795 -4.780 1.75e-06 ***
## Fare 0.0005373 0.0021821 0.246 0.805
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 647.23 on 709 degrees of freedom
## AIC: 657.23
##
## Number of Fisher Scoring iterations: 5
exp(coef(model3))
## (Intercept) Pclass Sexmale Age Fare
## 146.64875622 0.28090437 0.08060482 0.96395823 1.00053749
The analysis compared two logistic regression models to predict survival on the Titanic, with Model 2 including Pclass and Sex as predictors and Model 3 adding Age and Fare. The Likelihood Ratio Test showed a statistically significant improvement in Model 3 (p < 0.001), indicating that adding Age and Fare provided a better fit. Additionally, AIC and BIC values were lower for Model 3 (AIC = 657.23, BIC = 680.08) compared to Model 2 (AIC = 678.51, BIC = 692.22), further confirming that Model 3 is preferable.
Interpreting the results, passenger class and sex are the strongest predictors of survival. Being in a lower class significantly reduced survival odds, with each class drop decreasing the odds by ~72%. Males had about 92% lower survival odds compared to females, highlighting the “women and children first” evacuation pattern. Age had a small but significant negative effect, with each additional year slightly reducing survival chances by ~3.6%. However, Fare was not a significant predictor, suggesting that ticket price alone did not influence survival when controlling for class and other factors. Overall, Model 3 provides a more accurate representation of survival probabilities on the Titanic.