Part 1

After reading, I concluded that including interaction terms in a logit regression model can be tricky because the effects are not as straightforward as in a linear model. In a linear regression, an interaction term simply adjusts how one variable influences another. However, in a logit model, since everything operates on a transformed (log-odds) scale, the interaction effect is not just the coefficient—it changes depending on the values of other variables. Ai and Norton (2003) highlight how this can easily lead to misinterpretation if coefficients are read the same way as in a standard regression.

A simulation-based approach helps overcome this challenge. Instead of relying solely on raw coefficients, methods like those discussed by Zelner (2009) and King et al. (2000) allow for the simulation of different scenarios to visualize how probabilities change based on interaction effects. This makes it easier to grasp the real-world impact of relationships rather than deciphering abstract numbers. Additionally, simulations help capture uncertainty around estimates, leading to clearer and more reliable conclusions.

Part 2

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'tibble' was built under R version 4.3.3

## Warning: package 'tidyr' was built under R version 4.3.3

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'purrr' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

## Warning: package 'stringr' was built under R version 4.3.3

## Warning: package 'forcats' was built under R version 4.3.3

## Warning: package 'lubridate' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)

## Warning: package 'broom' was built under R version 4.3.3

library(ggplot2)    
library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

if (!requireNamespace("titanic", quietly = TRUE)) {
  install.packages("titanic")
}

Load data

data("titanic_train", package = "titanic")
head(data)

##                                                                             
## 1 function (..., list = character(), package = NULL, lib.loc = NULL,        
## 2     verbose = getOption("verbose"), envir = .GlobalEnv, overwrite = TRUE) 
## 3 {                                                                         
## 4     fileExt <- function(x) {                                              
## 5         db <- grepl("\\\\.[^.]+\\\\.(gz|bz2|xz)$", x)                     
## 6         ans <- sub(".*\\\\.", "", x)

titanic_data <- as.data.frame(titanic_train)
colSums(is.na(titanic_train))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

titanic_data <- titanic_train %>%
  dplyr::select(Survived, Pclass, Sex, Age, Fare) %>%
  drop_na()  

print(dim(titanic_data))

## [1] 714   5

model2 <- glm(Survived ~ Pclass + Sex, data = titanic_data, family = binomial)
model3 <- glm(Survived ~ Pclass + Sex + Age + Fare, data = titanic_data, family = binomial)

nrow(model.frame(model2))

## [1] 714

nrow(model.frame(model3))

## [1] 714

anova(model2, model3, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Survived ~ Pclass + Sex
## Model 2: Survived ~ Pclass + Sex + Age + Fare
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       711     672.51                          
## 2       709     647.23  2   25.281 3.239e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Compare AIC and BIC

AIC(model2, model3)

##        df      AIC
## model2  3 678.5106
## model3  5 657.2298

BIC(model2, model3)

##        df      BIC
## model2  3 692.2232
## model3  5 680.0842

summary(model3)

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + Fare, family = binomial, 
##     data = titanic_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.9880403  0.5721894   8.717  < 2e-16 ***
## Pclass      -1.2697410  0.1586252  -8.005 1.20e-15 ***
## Sexmale     -2.5181969  0.2078562 -12.115  < 2e-16 ***
## Age         -0.0367073  0.0076795  -4.780 1.75e-06 ***
## Fare         0.0005373  0.0021821   0.246    0.805    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 647.23  on 709  degrees of freedom
## AIC: 657.23
## 
## Number of Fisher Scoring iterations: 5

exp(coef(model3))

##  (Intercept)       Pclass      Sexmale          Age         Fare 
## 146.64875622   0.28090437   0.08060482   0.96395823   1.00053749

The analysis compared two logistic regression models to predict survival on the Titanic, with Model 2 including Pclass and Sex as predictors and Model 3 adding Age and Fare. The Likelihood Ratio Test showed a statistically significant improvement in Model 3 (p < 0.001), indicating that adding Age and Fare provided a better fit. Additionally, AIC and BIC values were lower for Model 3 (AIC = 657.23, BIC = 680.08) compared to Model 2 (AIC = 678.51, BIC = 692.22), further confirming that Model 3 is preferable.

Interpreting the results, passenger class and sex are the strongest predictors of survival. Being in a lower class significantly reduced survival odds, with each class drop decreasing the odds by ~72%. Males had about 92% lower survival odds compared to females, highlighting the “women and children first” evacuation pattern. Age had a small but significant negative effect, with each additional year slightly reducing survival chances by ~3.6%. However, Fare was not a significant predictor, suggesting that ticket price alone did not influence survival when controlling for class and other factors. Overall, Model 3 provides a more accurate representation of survival probabilities on the Titanic.

Assignment 4

Ritasha Sethi

2025-03-10

Load data

Compare AIC and BIC