Titanic Analysis

Author

Ryan Murray

STA-6013 Final Project: Generalized Linear Models (GLMs)

“The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.”

credit: <https://www.kaggle.com/competitions/titanic/overview>

General Guidelines:

clearly define a response variable and predictors,
apply appropriate GLM methodology,
include model fitting, testing,
justify your modeling decisions and interpretations.

Dataset - Titanic

Data Dictionary

Variable	Definition	Key
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
sex	Sex
Age	Age in years
sibsp	# of siblings / spouses aboard the Titanic
parch	# of parents / children aboard the Titanic
ticket	Ticket number
fare	Passenger fare
cabin	Cabin number
embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1.

sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

Libraries

Code

pacman::p_load(tidyverse, janitor, ggplot2, car)

Load the dataset

Code

df_all <- read.csv("train.csv")
summary(df_all)

  PassengerId       Survived          Pclass          Name          
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
 Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
 Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
                                                                    
     Sex                 Age            SibSp           Parch       
 Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
 Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
 Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
                    Mean   :29.70   Mean   :0.523   Mean   :0.3816  
                    3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
                    Max.   :80.00   Max.   :8.000   Max.   :6.0000  
                    NA's   :177                                     
    Ticket               Fare           Cabin             Embarked        
 Length:891         Min.   :  0.00   Length:891         Length:891        
 Class :character   1st Qu.:  7.91   Class :character   Class :character  
 Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
                    Mean   : 32.20                                        
                    3rd Qu.: 31.00                                        
                    Max.   :512.33

Code

str(df_all)

'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...

Cleaning the data

Code

table(df_all$Survived)


  0   1 
549 342

Code

str(df_all$Survived)

 int [1:891] 0 1 1 1 0 0 0 0 1 1 ...

Code

df_all$Pclass <- factor(df_all$Pclass)
df_all$Sex <- factor(df_all$Sex)
df_all$Embarked <- factor(df_all$Embarked)
str(df_all)

'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

After examining the structure and summary of the dataset, I began the data cleaning process by checking the distribution and type of the response variable (Survived) and converting relevant categorical variables (Pclass, Sex, and Embarked) into factors. This step ensures that the model correctly treats these variables as categorical rather than numeric, which is proper for interpretation and accurate estimation in logistic regression.

Code

colSums(is.na(df_all))

PassengerId    Survived      Pclass        Name         Sex         Age 
          0           0           0           0           0         177 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
          0           0           0           0           0           0

Code

df_all$Age[is.na(df_all$Age)] <- median(df_all$Age, na.rm = TRUE)
df_all$Fare[is.na(df_all$Fare)] <- median(df_all$Fare, na.rm = TRUE)
head(df_all)

  PassengerId Survived Pclass
1           1        0      3
2           2        1      1
3           3        1      3
4           4        1      1
5           5        0      3
6           6        0      3
                                                 Name    Sex Age SibSp Parch
1                             Braund, Mr. Owen Harris   male  22     1     0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3                              Heikkinen, Miss. Laina female  26     0     0
4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
5                            Allen, Mr. William Henry   male  35     0     0
6                                    Moran, Mr. James   male  28     0     0
            Ticket    Fare Cabin Embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
4           113803 53.1000  C123        S
5           373450  8.0500              S
6           330877  8.4583              Q

After checking for missing values, I found that Age and Fare contained some NA values, so I handled them by imputing the median of each variable. This approach preserves the dataset size while providing a reasonable central value that is less sensitive to outliers than the mean, allowing the model to run without errors due to missing data.

Code

df_all$PassengerId <- NULL
df_all$Name <- NULL
df_all$Ticket <- NULL

df_all$FamilySize <- df_all$SibSp + df_all$Parch + 1

df_all$Embarked <- factor(df_all$Embarked)
head(df_all)

  Survived Pclass    Sex Age SibSp Parch    Fare Cabin Embarked FamilySize
1        0      3   male  22     1     0  7.2500              S          2
2        1      1 female  38     1     0 71.2833   C85        C          2
3        1      3 female  26     0     0  7.9250              S          1
4        1      1 female  35     1     0 53.1000  C123        S          2
5        0      3   male  35     0     0  8.0500              S          1
6        0      3   male  28     0     0  8.4583              Q          1

Lastly, I removed variables such as PassengerId, Name, and Ticket, as they do not provide meaningful predictive value for the model. I then created a new variable, FamilySize, by combining SibSp and Parch to better capture the overall family context of each passenger in a single, more interpretable feature. This helps simplify the model while still retaining the underlying information about family relationships. Finally, I ensured Embarked was properly formatted as a categorical variable so it would be treated correctly in the analysis, before beginning EDA.

Visuals for EDA

Code

titanic_plot <- df_all %>%
  mutate(
    Survived = factor(
      Survived,
      levels = c(0, 1),
      labels = c("Did not survive", "Survived")
    ),
    Pclass = factor(Pclass),
    Sex = factor(Sex, levels = c("male", "female"))
  )

ggplot(
  titanic_plot %>% filter(!is.na(Age)),
  aes(x = Age, fill = Survived, color = Survived)
) +
  geom_density(
    alpha = 0.35,
    linewidth = 1.1
  ) +
  
  scale_fill_manual(
    values = c(
      "Did not survive" = "#5C7A89",
      "Survived" = "#E3B23C"
    )
  ) +
  
  scale_color_manual(
    values = c(
      "Did not survive" = "#8AA6B3",
      "Survived" = "#FFD166"
    )
  ) +
  
  labs(
    title = "Age Distribution by Survival Status",
    subtitle = "Survival patterns change across age, supporting a nonlinear age effect",
    x = "Age",
    y = "Density",
    fill = "",
    color = ""
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#0B1D2A", color = NA),
    panel.background = element_rect(fill = "#0B1D2A", color = NA),
    
    text = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "#E3B23C"),
    
    panel.grid.major = element_line(color = "#1C2E3A"),
    panel.grid.minor = element_blank(),
    
    legend.background = element_rect(fill = "#0B1D2A"),
    legend.text = element_text(color = "white"),
    legend.title = element_blank(),
    
    axis.ticks = element_blank()
  )

Code

ggplot(df_all, aes(x = Sex, fill = factor(Survived))) +
  
  geom_bar(position = "fill", color = "#0B1D2A") +
  
  scale_fill_manual(
    values = c(
      "0" = "#5C7A89",   # did not survive (cold, muted)
      "1" = "#E3B23C"    # survived (warm, gold)
    ),
    labels = c("Did not survive", "Survived")
  ) +
  
  labs(
    title = "Survival by Gender",
    subtitle = "Women had dramatically higher survival rates",
    x = "Gender",
    y = "Proportion",
    fill = ""
  ) +
  
  scale_y_continuous(labels = scales::percent) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#0B1D2A", color = NA),
    panel.background = element_rect(fill = "#0B1D2A", color = NA),
    
    text = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "#E3B23C"),
    
    panel.grid.major = element_line(color = "#1C2E3A"),
    panel.grid.minor = element_blank(),
    
    legend.background = element_rect(fill = "#0B1D2A"),
    legend.text = element_text(color = "white"),
    
    axis.ticks = element_blank()
  )

Code

ggplot(df_all, aes(x = Pclass, fill = factor(Survived))) +
  
  geom_bar(position = "fill", color = "#0B1D2A") +
  
  scale_fill_manual(
    values = c(
      "0" = "#5C7A89",   # did not survive (cold blue-gray)
      "1" = "#E3B23C"    # survived (warm gold)
    ),
    labels = c("Did not survive", "Survived")
  ) +
  
  labs(
    title = "Survival by Passenger Class",
    subtitle = "Higher class passengers had significantly greater survival rates",
    x = "Passenger Class",
    y = "Proportion",
    fill = ""
  ) +
  
  scale_y_continuous(labels = scales::percent) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#0B1D2A", color = NA),
    panel.background = element_rect(fill = "#0B1D2A", color = NA),
    
    text = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "#E3B23C"),
    
    panel.grid.major = element_line(color = "#1C2E3A"),
    panel.grid.minor = element_blank(),
    
    legend.background = element_rect(fill = "#0B1D2A"),
    legend.text = element_text(color = "white"),
    
    axis.ticks = element_blank()
  )

Code

titanic_plot <- df_all %>%
  mutate(
    Survived = factor(
      Survived,
      levels = c(0, 1),
      labels = c("Did not survive", "Survived")
    ),
    Pclass = factor(Pclass),
    Sex = factor(Sex, levels = c("male", "female"))
  )


ggplot(
  titanic_plot,
  aes(x = Sex, fill = Survived)
) +
  
  geom_bar(position = "fill", color = "#0B1D2A") +
  
  facet_wrap(~ Pclass) +
  
  scale_fill_manual(
    values = c(
      "Did not survive" = "#5C7A89",  # cold loss
      "Survived" = "#E3B23C"          # warm survival
    )
  ) +
  
  scale_y_continuous(labels = scales::percent) +
  
  labs(
    title = "Survival Depends on Both Class and Gender",
    subtitle = "The advantage of being female changes across passenger class",
    x = "Gender",
    y = "Proportion",
    fill = ""
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#0B1D2A", color = NA),
    panel.background = element_rect(fill = "#0B1D2A", color = NA),
    
    text = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "#E3B23C"),
    
    panel.grid.major = element_line(color = "#1C2E3A"),
    panel.grid.minor = element_blank(),
    
    legend.background = element_rect(fill = "#0B1D2A"),
    legend.text = element_text(color = "white"),
    
    strip.background = element_rect(fill = "#1C2E3A", color = NA),
    strip.text = element_text(color = "#E3B23C", face = "bold"),
    
    axis.ticks = element_blank()
  )

Code

ggplot(df_all, aes(x = FamilySize, y = Survived)) +
  
  # Jittered points (background data)
  geom_jitter(
    width = 0.25,
    height = 0.05,
    alpha = 0.25,
    color = "#6C8EA4"   # muted steel (fits your palette)
  ) +
  
  # Logistic curve (MAIN STORY)
  geom_smooth(
    method = "glm",
    method.args = list(family = "binomial"),
    color = "#E3B23C",   # survival gold
    size = 1.4,
    se = TRUE,
    fill = "#E3B23C",
    alpha = 0.15
  ) +
  
  labs(
    title = "Family Size and Survival",
    subtitle = "Larger families faced lower odds of survival",
    x = "Family Size",
    y = "Probability of Survival"
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#0B1D2A", color = NA),
    panel.background = element_rect(fill = "#0B1D2A", color = NA),
    
    text = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "#E3B23C"),
    
    panel.grid.major = element_line(color = "#1C2E3A"),
    panel.grid.minor = element_blank()
  )

Family Size Interpretation:

Small groups or individuals may have had mobility advantages
Large families may have had difficulty coordinating escape

Logistic Regression Model

\[ \text{logit}(P(Y=1)) = \ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k \]

Based on the data and the outcome of our Y variable, ‘Survived’ has a binary outcome of either 1 or 0, we are using the logistic regression model. Logistic regression is tailored for binary outcomes measuring probability of either outcome which is called log-odds and represented as:

\[ \text{logit}(P(Y=1)) = \ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) \]

Now, we have our full model to begin our analysis:

\[ \begin{split} \ln\left(\frac{P(\text{survival}=1)}{1-P(\text{survival}=1)}\right) = \beta_0 + \beta_1(\text{pclass}) + \beta_2(\text{sex}) + \beta_3(\text{Age}) \ + \\ \beta_4(\text{fare}) + \beta_5(\text{embarked}) + \beta_6(\text{FamilySize}) \end{split} \]

Running the full model

Code

full_mod <- glm(Survived ~ Pclass + Sex + Age + Fare + Embarked + FamilySize, data = df_all, family = "binomial")

summary(full_mod)


Call:
glm(formula = Survived ~ Pclass + Sex + Age + Fare + Embarked + 
    FamilySize, family = "binomial", data = df_all)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  16.546634 611.180599   0.027  0.97840    
Pclass2      -0.887693   0.296238  -2.997  0.00273 ** 
Pclass3      -2.124560   0.297076  -7.152 8.58e-13 ***
Sexmale      -2.728333   0.200349 -13.618  < 2e-16 ***
Age          -0.038075   0.007853  -4.849 1.24e-06 ***
Fare          0.002501   0.002471   1.012  0.31151    
EmbarkedC   -12.280651 611.180480  -0.020  0.98397    
EmbarkedQ   -12.368549 611.180532  -0.020  0.98385    
EmbarkedS   -12.736976 611.180468  -0.021  0.98337    
FamilySize   -0.218108   0.068112  -3.202  0.00136 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.66  on 890  degrees of freedom
Residual deviance:  786.04  on 881  degrees of freedom
AIC: 806.04

Number of Fisher Scoring iterations: 13

Code

exp(coef(full_mod))

 (Intercept)      Pclass2      Pclass3      Sexmale          Age         Fare 
1.535013e+07 4.116041e-01 1.194856e-01 6.532812e-02 9.626407e-01 1.002504e+00 
   EmbarkedC    EmbarkedQ    EmbarkedS   FamilySize 
4.640675e-06 4.250181e-06 2.940368e-06 8.040390e-01

Interpreting coefficients

All interpretations are:

Significant predictors

Pclass2 (-0.89)

Holding all other variables constant, passengers in 2nd class have lower odds of survival than 1st class.
Holding all other variables constant, passengers in 2nd class have 58.9% lower odds of survival than those in 1st class.

Pclass3 (-2.12)

Passengers in 3rd class have much lower odds of survival than 1st class, holding all other variables constant.
Passengers in 3rd class have 87.9% lower odds of survival than those in 1st class, holding all other variables constant.

Sexmale (-2.73)

Males have substantially lower odds of survival than females, holding all other variables constant.
Males have 93.5% lower odds of survival than females, holding all other variables constant.

Age (-0.038)

Each additional year of age reduces the odds of survival slightly, holding all other variables constant.
Each additional year of age reduces the odds of survival by approximately 3.7%, holding all other variables constant.

FamilySize (-0.22)

Larger families are associated with lower odds of survival, holding all other variables constant.
Larger families are associated with 19.7% lower odds of survival per additional family member, holding all other variables constant.

Not significant

Fare (p = 0.31)

Fare does not show a clear independent effect once other variables are included, holding all other variables constant.

ANOVA model to check reject or fail to reject the null hypothesis

Code

# Creating the null model to compare
null_mod <- glm(Survived ~ 1, data= df_all, family = "binomial")

anova(null_mod, full_mod, test = "Chisq")

Analysis of Deviance Table

Model 1: Survived ~ 1
Model 2: Survived ~ Pclass + Sex + Age + Fare + Embarked + FamilySize
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       890    1186.66                          
2       881     786.04  9   400.61 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null Hypothesis

H₀: the additional predictors do not improve the model (null model is sufficient)

\[ \ln(\mu) = \beta_0 \]

Results:

A deviance reduction of 400.61 on 9 degrees of freedom indicates a substantial improvement in model fit, meaning the added predictors greatly increase the model’s ability to explain the outcome.

p-value < 2.2e-16 is extremely small so we reject the null hypothesis ( \(H_0\)) .

The predictors (Pclass, Sex, Age, Fare, Embarked, FamilySize) collectively improve the model significantly compared to having no predictors at all.

VIF check for multicollinearity

Code

vif(full_mod)

               GVIF Df GVIF^(1/(2*Df))
Pclass     2.054066  2        1.197164
Sex        1.202613  1        1.096637
Age        1.291333  1        1.136368
Fare       1.605060  1        1.266910
Embarked   1.232556  3        1.035463
FamilySize 1.303258  1        1.141603

Code

 # 1. Calculate VIF
vif_values <- vif(full_mod)

# 2. Convert matrix to data frame correctly
vif_df <- as.data.frame(vif_values)

# 3. Add the variable names as a column
vif_df$Variable <- rownames(vif_df)
vif_df$Status <- ifelse(vif_df$GVIF > 5, "High", "Low")


ggplot(vif_df, aes(x = reorder(Variable, GVIF), y = GVIF, fill = Status)) +
  
  geom_bar(stat = "identity", width = 0.7, color = "#0B1D2A") +
  
  scale_fill_manual(
    values = c(
      "Low" = "#6C8EA4",   # muted steel-blue
      "High" = "#C94C4C"   # restrained warning red
    )
  ) +
  
  geom_hline(
    yintercept = 5,
    linetype = "dashed",
    color = "#E3B23C",
    linewidth = 1
  ) +
  
  coord_flip() +
  
  labs(
    title = "VIF Check: Multicollinearity",
    subtitle = "All predictors remain below the concern threshold",
    caption = "Dashed gold line marks VIF = 5",
    x = NULL,
    y = "VIF Value"
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#0B1D2A", color = NA),
    panel.background = element_rect(fill = "#0B1D2A", color = NA),
    
    text = element_text(color = "white"),
    axis.text = element_text(color = "white", face = "bold"),
    axis.title = element_text(color = "white"),
    
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "#E3B23C"),
    plot.caption = element_text(color = "#A7A9AC", face = "italic"),
    
    panel.grid.major = element_line(color = "#1C2E3A"),
    panel.grid.minor = element_blank(),
    
    legend.position = "none",
    axis.ticks = element_blank()
  )

An initial check of VIF shows that none of the variables exceed the threshold of 5, suggesting there are no serious multicollinearity issues and each variable is largely contributing independently to the outcome.

The only notable point from the VIF values is a slight relationship between Pclass and Fare. Since it’s reasonable to assume that Fare is influenced by Pclass, we can remove Fare without losing meaningful information about the dependent variable. Using this reasoning, we construct our first reduced model by keeping only the significant variables from the full model, excluding Fare.

Reduced Model

Code

reduced_model <- glm(
  Survived ~ Pclass + Sex + Age + FamilySize,
  data = df_all,
  family = "binomial"
)

summary(reduced_model)


Call:
glm(formula = Survived ~ Pclass + Sex + Age + FamilySize, family = "binomial", 
    data = df_all)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.247858   0.430886   9.858  < 2e-16 ***
Pclass2     -1.169760   0.260580  -4.489 7.15e-06 ***
Pclass3     -2.349265   0.242962  -9.669  < 2e-16 ***
Sexmale     -2.783225   0.197773 -14.073  < 2e-16 ***
Age         -0.039109   0.007793  -5.018 5.21e-07 ***
FamilySize  -0.215382   0.064242  -3.353    8e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.66  on 890  degrees of freedom
Residual deviance:  792.83  on 885  degrees of freedom
AIC: 804.83

Number of Fisher Scoring iterations: 5

Reduced model summary

Our reduced model is a simple linear model.

Pclass2 (-1.17)

Holding other variables constant, passengers in 2nd class have lower odds of survival than 1st class.
Pclass3 (-2.35)

Holding other variables constant, passengers in 3rd class have much lower odds of survival than 1st class.
Sexmale (-2.78)

Holding other variables constant, males have dramatically lower odds of survival than females.
Age (-0.039)

Holding other variables constant, each additional year of age reduces survival odds slightly.
FamilySize (-0.215)

Holding other variables constant, larger family size reduces survival odds.

All predictors in the reduced model are now highly statistically significant, with p-values well below our conventional thresholds, indicating that each variable contributes meaningfully to explaining our dependent variable (Survived). The coefficient signs are also consistent with expectations, survival decreases for lower passenger classes, males, older individuals, and larger family sizes. The substantial drop from the null deviance (1186.66) to the residual deviance (792.83) further suggests that the model captures a large portion of the variation in the outcome.

Next, we will compare this reduced model to the full model using AIC and BIC to confirm that simplifying the model did not come at the cost of overall model quality.

Model Comparison AIC/BIC

Code

AIC(full_mod, reduced_model)

              df      AIC
full_mod      10 806.0443
reduced_model  6 804.8306

Code

BIC(full_mod, reduced_model)

              df      BIC
full_mod      10 853.9677
reduced_model  6 833.5847

Model quality comparison

Results:

Full model AIC: 806.04
Reduced model AIC: 804.83

The reduced model has a slightly lower AIC (804.83 vs. 806.04), indicating a better overall fit compared to the full model. While the residual deviance increases slightly (786 to 792), this change is minimal and expected when simplifying a model. By removing less stable or noisy variables, we’ve made the model more efficient without sacrificing meaningful explanatory power. Overall, this is the ideal outcome: a simpler model that performs just as well.

Anova Check

Code

anova(reduced_model, full_mod, test = "Chisq")

Analysis of Deviance Table

Model 1: Survived ~ Pclass + Sex + Age + FamilySize
Model 2: Survived ~ Pclass + Sex + Age + Fare + Embarked + FamilySize
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       885     792.83                     
2       881     786.04  4   6.7863   0.1476

Interpreting ANOVA

Model 1 (reduced):

Survived ~ Pclass + Sex + Age + FamilySize
Model 2 (full):

adds Fare + Embarked

Result:

Deviance difference = 6.79
df difference = 4
p-value = 0.1476

Adding Fare and Embarked to the model does not significantly improve model fit compared to the reduced model (χ²(4) = 6.79, p = 0.148).

Interpretation

The improvement in fit is small relative to the extra complexity
The p-value > 0.05 meaning it fails to reject the null hypothesis
So the reduced model is preferred

The likelihood ratio test indicates that the additional predictors (Fare and Embarked) do not significantly improve the model, supporting the use of the more parsimonious reduced model.

Improving the Reduced Model

First checking

Code

#crPlots(reduced_model)
par(
  bg = "#6C777F",
  col.axis = "white",
  col.lab = "white",
  col.main = "white",
  fg = "white",          # axes/lines
  las = 1                # horizontal axis labels
)

crPlots(reduced_model)

Code

ggplot(df_all, aes(x = Age, y = Survived)) +
  
  geom_jitter(
    width = 0,
    height = 0.05,
    alpha = 0.25,
    color = "#6C8EA4"
  ) +
  
  geom_smooth(
    method = "glm",
    method.args = list(family = "binomial"),
    formula = y ~ poly(x, 2),
    se = TRUE,
    color = "#E3B23C",
    fill = "#E3B23C",
    alpha = 0.15,
    linewidth = 1.4
  ) +
  
  labs(
    title = "Age and Survival",
    subtitle = "A curved pattern supports adding a nonlinear age term",
    x = "Age",
    y = "Survival Probability"
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#0B1D2A", color = NA),
    panel.background = element_rect(fill = "#0B1D2A", color = NA),
    
    text = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "#E3B23C"),
    
    panel.grid.major = element_line(color = "#1C2E3A"),
    panel.grid.minor = element_blank(),
    
    axis.ticks = element_blank()
  )

Obsevations:

The categorical variables show clear separation in their effects. For Pclass, survival decreases consistently from 1st to 3rd class, reinforcing that it is a strong predictor. Similarly, Sex shows a pronounced difference, with females having much higher survival than males, confirming its strong influence in the model.

For the continuous variables, the relationships are more subtle. The Age plot shows a slight downward trend but also some curvature, suggesting the relationship may not be strictly linear and supporting the use of a polynomial term.

A polynomial term is preferred over a log transform because the relationship between Age and survival shows curvature rather than diminishing returns. The quadratic term allows the model to capture changes in direction, which a log transformation cannot.

The FamilySize plot appears relatively flat with a mild negative trend, indicating a weaker effect and possibly limited contribution compared to other predictors.

Overall, there are no major violations of linearity or extreme patterns in the residuals, but the slight curvature in Age stands out as the main area where a nonlinear specification could improve the model.

Code

library(patchwork)

p_linear <- ggplot(df_all, aes(x = Age, y = Survived)) +
  
  geom_jitter(
    height = 0.05,
    width = 0,
    alpha = 0.25,
    color = "#6C8EA4"
  ) +
  
  geom_smooth(
    method = "glm",
    method.args = list(family = "binomial"),
    se = FALSE,
    color = "#5C7A89",   # colder, less emphasis
    linewidth = 1.2
  ) +
  
  labs(
    title = "Linear Fit",
    subtitle = "Assumes survival changes at a constant rate",
    x = "Age",
    y = "Survival Probability"
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#0B1D2A", color = NA),
    panel.background = element_rect(fill = "#0B1D2A", color = NA),
    
    text = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 11, color = "#5C7A89"),
    
    panel.grid.major = element_line(color = "#1C2E3A"),
    panel.grid.minor = element_blank(),
    
    axis.ticks = element_blank()
  )

p_poly <- ggplot(df_all, aes(x = Age, y = Survived)) +
  
  geom_jitter(
    height = 0.05,
    width = 0,
    alpha = 0.25,
    color = "#6C8EA4"
  ) +
  
  geom_smooth(
    method = "glm",
    method.args = list(family = "binomial"),
    formula = y ~ poly(x, 2),
    se = TRUE,
    color = "#E3B23C",   # gold = important
    fill = "#E3B23C",
    alpha = 0.15,
    linewidth = 1.4
  ) +
  
  labs(
    title = "Nonlinear (Quadratic) Fit",
    subtitle = "Captures the true pattern in survival",
    x = "Age",
    y = "Survival Probability"
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#0B1D2A", color = NA),
    panel.background = element_rect(fill = "#0B1D2A", color = NA),
    
    text = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 11, color = "#E3B23C"),
    
    panel.grid.major = element_line(color = "#1C2E3A"),
    panel.grid.minor = element_blank(),
    
    axis.ticks = element_blank()
  )



p_linear + p_poly

I wanted to have a visual confirmation of what adding a polynomial transform would do to the Age variable.

Both graphs support the idea that Age has a non-linear relationship with survival, but when I add the poly(Age, 2) transform, it reveals a curved “U-shape” that a straight line smooths over. The linear fit suggests a simple downward trend, implying survival just decreases with age, whereas the quadratic fit shows that this isn’t the full story. It highlights that children have relatively higher survival rates, survival drops through middle ages, and then levels off or slightly increases again at older ages.

Now, we have a justified reason to apply a poly transform to Age, and then check the reduced model variables for any influential outliers using Cook’s distance.

Checking outliers (Cook’s Distance)

Code

#plot(reduced_model, which = 4)  # Cook's distance

# Extract Cook's distance
cook_df <- data.frame(
  index = 1:nrow(df_all),
  cooks = cooks.distance(reduced_model)
)

# Threshold (common rule of thumb)
threshold <- 4 / nrow(df_all)

cook_df$flag <- cook_df$cooks > threshold

ggplot(cook_df, aes(x = index, y = cooks)) +
  
  geom_col(aes(fill = flag)) +
  scale_fill_manual(
  values = c("FALSE" = "#6C8EA4", "TRUE" = "#C94C4C")
  ) +
  
  geom_hline(
    yintercept = threshold,
    linetype = "dashed",
    color = "#E3B23C",
    linewidth = 1
  ) +
  
  labs(
    title = "Influential Observations (Cook's Distance)",
    subtitle = "A few observations show higher influence, but none dominate the model",
    x = "Observation Index",
    y = "Cook's Distance"
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#0B1D2A", color = NA),
    panel.background = element_rect(fill = "#0B1D2A", color = NA),
    
    text = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "#E3B23C"),
    
    panel.grid.major = element_line(color = "#1C2E3A"),
    panel.grid.minor = element_blank(),
    
    axis.ticks = element_blank()
  )

The Cook’s distance plot identifies a small number of observations with higher influence on the model (observations 262, 298, and 631). While these exceed the common threshold of 4/n, their values remain relatively low overall, indicating moderate rather than extreme influence. As a result, there is no strong evidence that these points unduly distort the model, and no immediate need to remove them, though they may warrant further inspection.

Since we are not removing any outliers and we have good justification for applying a polynomial transform to Age, a final check on the reduced model is to assess goodness of fit using the Hosmer–Lemeshow test.

Code

library(ResourceSelection)
hoslem.test(df_all$Survived, fitted(reduced_model))


    Hosmer and Lemeshow goodness of fit (GOF) test

data:  df_all$Survived, fitted(reduced_model)
X-squared = 25.55, df = 8, p-value = 0.001254

The Hosmer–Lemeshow test is statistically significant (p = 0.001254), leading us to reject the null hypothesis of good fit. This indicates that the model’s predicted probabilities do not align well with the observed outcomes, suggesting a lack of fit and that further model refinement may be needed.

The next step is to apply the polynomial transform.

Polynomial transform model and AIC comparison to the reduced model

Code

model_poly <- glm(
  Survived ~ Pclass + Sex + poly(Age, 2) + FamilySize,
  data = df_all,
  family = binomial
)

# comparing models with AIC
AIC(reduced_model, model_poly)

              df      AIC
reduced_model  6 804.8306
model_poly     7 800.2359

Code

df_all$FamilyGroup <- cut(df_all$FamilySize,
                         breaks = c(0,1,4,10),
                         labels = c("Alone","Small","Large"))


# baseline
model1 <- reduced_model

# nonlinear age
model2 <- glm(Survived ~ Pclass + Sex + poly(Age,2) + FamilySize, family="binomial", data=df_all)

# interaction
model3 <- glm(Survived ~ Pclass * Sex + Age + FamilySize, family="binomial", data=df_all)

# both
model4 <- glm(Survived ~ Pclass * Sex + poly(Age,2) + FamilySize, family="binomial", data=df_all)

AIC(model1, model2, model3, model4)

       df      AIC
model1  6 804.8306
model2  7 800.2359
model3  8 779.2358
model4  9 773.1571

The polynomial model improves the overall fit, with a lower AIC (800.24 vs. 804.83), indicating that adding a nonlinear term for Age better captures the relationship with survival. This supports the earlier visual evidence that Age is not strictly linear and benefits from a quadratic specification.

From here, I explored multiple model variations to ensure we are not missing important structure in the data. In particular, the component + residual plots suggested a possible interaction between Pclass and Sex, so I tested models with and without this interaction, as well as with the polynomial Age term, to compare their overall fit and determine the most appropriate specification.

Model 4 has the lowest AIC (773.16), indicating it provides the best overall fit among the models considered. Although it is the most complex model (Df = 9), the improvement in fit more than justifies the additional parameters, suggesting that both the interaction between Pclass and Sex and the nonlinear effect of Age meaningfully contribute to explaining survival.

Fit the best model

Code

model_final <- glm(
  Survived ~ Pclass * Sex + poly(Age, 2) + FamilySize,
  data = df_all,
  family = binomial
)

summary(model_final)


Call:
glm(formula = Survived ~ Pclass * Sex + poly(Age, 2) + FamilySize, 
    family = binomial, data = df_all)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)       4.27469    0.62019   6.893 5.48e-12 ***
Pclass2          -1.11729    0.73094  -1.529 0.126370    
Pclass3          -3.84351    0.62575  -6.142 8.14e-10 ***
Sexmale          -4.06840    0.62215  -6.539 6.18e-11 ***
poly(Age, 2)1   -16.58290    3.14577  -5.271 1.35e-07 ***
poly(Age, 2)2     8.02330    2.75294   2.914 0.003563 ** 
FamilySize       -0.26851    0.07302  -3.677 0.000236 ***
Pclass2:Sexmale  -0.44946    0.80564  -0.558 0.576922    
Pclass3:Sexmale   2.08906    0.66412   3.146 0.001657 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.66  on 890  degrees of freedom
Residual deviance:  755.16  on 882  degrees of freedom
AIC: 773.16

Number of Fisher Scoring iterations: 6

The final model, which includes both the polynomial Age term and the interaction between Pclass and Sex, provides the best overall fit (AIC = 773.16) and a substantial reduction in deviance from the null model. Most predictors are highly significant, including the nonlinear Age terms and FamilySize, confirming their importance in explaining survival. The interaction term for Pclass3 and Sex is also significant, indicating that the effect of gender on survival differs meaningfully across passenger classes. While not every interaction term is significant, the overall improvement in model fit suggests that including both nonlinearity and interaction captures important structure in the data.

Before concluding that this is our final model, the next step is to evaluate its classification accuracy on a train/test split to see how well it generalizes to unseen data and to provide a practical check on its predictive performance.

Function: classification accuracy for one train/test split

Loading the test data

Code

test_data <- read.csv("test.csv")

test_data$Pclass <- factor(test_data$Pclass)
test_data$Sex <- factor(test_data$Sex)
test_data$Embarked <- factor(test_data$Embarked)

colSums(is.na(test_data))

PassengerId      Pclass        Name         Sex         Age       SibSp 
          0           0           0           0          86           0 
      Parch      Ticket        Fare       Cabin    Embarked 
          0           0           1           0           0

Code

test_data$Age[is.na(test_data$Age)] <- median(test_data$Age, na.rm = TRUE)
test_data$Fare[is.na(test_data$Fare)] <- median(test_data$Fare, na.rm = TRUE)

test_data$PassengerId <- NULL
test_data$Name <- NULL
test_data$Ticket <- NULL

test_data$FamilySize <- test_data$SibSp + test_data$Parch + 1

test_data$Embarked <- factor(test_data$Embarked)

str(test_data)

'data.frame':   418 obs. of  9 variables:
 $ Pclass    : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
 $ Sex       : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
 $ Age       : num  34.5 47 62 27 22 14 30 26 18 21 ...
 $ SibSp     : int  0 1 0 0 1 0 0 1 0 2 ...
 $ Parch     : int  0 0 0 0 1 0 0 1 0 0 ...
 $ Fare      : num  7.83 7 9.69 8.66 12.29 ...
 $ Cabin     : chr  "" "" "" "" ...
 $ Embarked  : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
 $ FamilySize: num  1 2 1 1 3 1 1 3 1 3 ...

Code

str(df_all)

'data.frame':   891 obs. of  11 variables:
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 28 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
 $ FamilySize : num  2 2 1 2 1 1 1 5 3 2 ...
 $ FamilyGroup: Factor w/ 3 levels "Alone","Small",..: 2 2 1 2 1 1 1 3 2 2 ...

The external test set provided by Kaggle was reserved for final validation; however, since ground-truth labels (the known outcomes for passenger survival) were unavailable, local cross-validation and a 20% holdout split will be utilized to assess generalization.

Creating a test data set from original data set

Code

# renaming the original dataset
train_data <- df_all

set.seed(123)
n <- nrow(train_data)
# 80% to train, 20% to test
train_id <- sample(seq_len(n), size = floor(0.8 * n))

local_train <- train_data[train_id, ]
local_test  <- train_data[-train_id, ]

Code

classification_accuracy <- function(train_data, test_data, formula_obj,
                                    threshold = 0.5) {
  fit <- glm(formula_obj, data = train_data, family = binomial)
  prob <- predict(fit, newdata = test_data, type = "response")
  pred <- ifelse(prob > threshold, 1, 0)
  mean(pred == test_data$Survived)
}

## Example single split
set.seed(123)
n <- nrow(train_data)
train_id <- sample(seq_len(n), size = floor(0.7 * n), replace = FALSE)

train_dat <- train_data[train_id, ]
test_dat  <- train_data[-train_id, ]

classification_accuracy(train_dat, test_dat, Survived ~ Pclass + Sex + Age + FamilySize)

[1] 0.7910448

Code

classification_accuracy(train_dat, test_dat, Survived ~ Pclass * Sex + Age + FamilySize)

[1] 0.7835821

Code

classification_accuracy(train_dat, test_dat, Survived ~ Pclass * Sex + poly(Age,2) + FamilySize)

[1] 0.8022388

These accuracy scores are all fairly close (around 78–80%), indicating that the models perform similarly in terms of predictive accuracy, with a slight improvement in the best model (~80.2%). The small differences suggest incremental gains rather than dramatic improvements, reinforcing that model refinements are helping, but not drastically changing overall performance.

Here the classification accuracy shows the final model as the best. Because this score tends to be a bit optimistic we will conduct a k-fold cross-validation accuracy to test the models further. Normally, this would be done with an ‘unseen’ dataset, but due to the issues mentioned before, the training set will be split instead to perform the test.

Function: K-fold CV accuracy for logistic regression

Code

cv_accuracy_glm <- function(data, formula_obj, K = 5, threshold = 0.5,
                            seed = 1234) {
  set.seed(seed)
  n <- nrow(data)
  fold_id <- sample(rep(1:K, length.out = n))
  
  acc_vec <- numeric(K)
  
  for (k in 1:K) {
    train_data <- data[fold_id != k, , drop = FALSE]
    valid_data <- data[fold_id == k, , drop = FALSE]
    
    fit <- glm(formula_obj, data = train_data, family = binomial)
    prob <- predict(fit, newdata = valid_data, type = "response")
    pred <- ifelse(prob > threshold, 1, 0)
    acc_vec[k] <- mean(pred == valid_data$Survived)
  }
  
  mean(acc_vec)
}
# simple model
result1 <- cv_accuracy_glm(df_all, Survived ~ Pclass + Sex + Age + FamilySize)
print(result1)

[1] 0.793497

Code

# winning model with the interaction and polynomial
result2 <- cv_accuracy_glm(df_all, Survived ~ Pclass * Sex + poly(Age, 2) + FamilySize)
print(result2)

[1] 0.7923796

The cross-validation accuracy scores are nearly identical (~0.793 vs. ~0.792), indicating that both models perform essentially the same when evaluated on unseen data. This suggests that the added complexity in the final model does not provide a meaningful improvement in predictive performance.

After running the simple linear model against the final model we see there is not a statistical difference to justify using the more complex model. Next, we will check if the interaction between Pclass and Sex is doing the heavy lifting and making any statistical difference to keep that model instead of the simple linear one.

Visually checking the interaction between class and gender

Code

ggplot(df_all, aes(x = Pclass, fill = Sex)) +
  
  geom_bar(position = "fill", color = "#0B1D2A") +
  
  facet_wrap(
    ~ Survived,
    labeller = labeller(
      Survived = c(
        "0" = "Did Not Survive",
        "1" = "Survived"
      )
    )
  ) +
  
  scale_fill_manual(
    values = c(
      "male" = "#6C8EA4",     # muted steel blue
      "female" = "#E3B23C"    # survival gold / protected group
    ),
    labels = c("Male", "Female")
  ) +
  
  scale_y_continuous(labels = scales::percent) +
  
  labs(
    title = "Class and Gender Shaped Survival Together",
    subtitle = "Survival patterns change when passenger class and gender are viewed jointly",
    x = "Passenger Class",
    y = "Proportion",
    fill = ""
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#0B1D2A", color = NA),
    panel.background = element_rect(fill = "#0B1D2A", color = NA),
    
    text = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "#E3B23C"),
    
    panel.grid.major = element_line(color = "#1C2E3A"),
    panel.grid.minor = element_blank(),
    
    strip.background = element_rect(fill = "#1C2E3A", color = NA),
    strip.text = element_text(color = "#E3B23C", face = "bold"),
    
    legend.background = element_rect(fill = "#0B1D2A"),
    legend.text = element_text(color = "white"),
    
    axis.ticks = element_blank()
  )

Survival depends on structured relationships:

gender effects vary by class
age has a nonlinear impact
family size matters independently

CV Accuracy check with the Pclass * Sex and linear Age + FamilySize

Code

print(result1)

[1] 0.793497

Code

result3 <- cv_accuracy_glm(df_all, Survived ~ Pclass * Sex + Age + FamilySize)
print(result3)

[1] 0.7912811

Again, we see the same insignificant difference to justify using the any version other than the simple linear model which we first created after looking at the full model.

Plotting the ROC curves of the Simple Model and the Poly/Interaction Model

Code

library(pROC)
library(dplyr)

# 1. Generate probabilities for the Simple Model
fit_linear <- glm(
  Survived ~ Pclass + Sex + Age + FamilySize,
  data = train_dat,
  family = binomial
)

prob_linear <- predict(fit_linear, newdata = test_dat, type = "response")

# 2. Generate probabilities for the Poly/Interaction Model
fit_poly <- glm(
  Survived ~ Pclass * Sex + poly(Age, 2) + FamilySize,
  data = train_dat,
  family = binomial
)

prob_poly <- predict(fit_poly, newdata = test_dat, type = "response")

# 3. Create ROC objects
roc_linear <- roc(test_dat$Survived, prob_linear)
roc_poly <- roc(test_dat$Survived, prob_poly)

# 4. Convert ROC objects to data frames
roc_df <- bind_rows(
  data.frame(
    specificity = roc_linear$specificities,
    sensitivity = roc_linear$sensitivities,
    Model = "Simple Model"
  ),
  data.frame(
    specificity = roc_poly$specificities,
    sensitivity = roc_poly$sensitivities,
    Model = "Final Model"
  )
) %>%
  mutate(
    false_positive_rate = 1 - specificity
  )

# 5. AUC labels
auc_linear <- round(as.numeric(auc(roc_linear)), 3)
auc_poly <- round(as.numeric(auc(roc_poly)), 3)

# 6. Plot
ggplot(
  roc_df,
  aes(x = false_positive_rate, y = sensitivity, color = Model)
) +
  geom_line(linewidth = 1.4) +
  
  geom_abline(
    slope = 1,
    intercept = 0,
    linetype = "dashed",
    color = "#5C7A89",
    linewidth = 0.9
  ) +
  
  scale_color_manual(
    values = c(
      "Simple Model" = "#6C8EA4",
      "Final Model" = "#E3B23C"
    ),
    labels = c(
      paste0("Simple Model: AUC = ", auc_linear),
      paste0("Final Model: AUC = ", auc_poly)
    )
  ) +
  
  labs(
    title = "ROC Curve Comparison",
    subtitle = "Both models show excellent classification performance",
    x = "False Positive Rate",
    y = "True Positive Rate",
    color = ""
  ) +
  
  coord_equal() +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#0B1D2A", color = NA),
    panel.background = element_rect(fill = "#0B1D2A", color = NA),
    
    text = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "#E3B23C"),
    
    panel.grid.major = element_line(color = "#1C2E3A"),
    panel.grid.minor = element_blank(),
    
    legend.background = element_rect(fill = "#0B1D2A"),
    legend.text = element_text(color = "white"),
    
    axis.ticks = element_blank()
  )

Code

roc.test(roc_linear, roc_poly)


    DeLong's test for two correlated ROC curves

data:  roc_linear and roc_poly
Z = -0.44447, p-value = 0.6567
alternative hypothesis: true difference in AUC is not equal to 0
95 percent confidence interval:
 -0.02576035  0.01623654
sample estimates:
AUC of roc1 AUC of roc2 
  0.8534226   0.8581845

AUC Score:

0.5: No better than a coin flip.

0.7–0.8: Good model.

0.8–0.9: Excellent model.

There is no statistical difference between the simple model and the complex one.

1. The Verdict: p-value = 0.6567

In statistics, we look for a p-value less than 0.05 to claim a significant difference. The p-value is 0.6567, which is very high. This means the slight increase in AUC (from 0.853 to 0.858) is likely just due to random noise in the data rather than the polynomial or interaction terms actually being better.

2. The AUC Scores: 0.853 vs. 0.858

Both models are Excellent: Any AUC above 0.85 is considered a very strong model for the Titanic dataset.
The Difference is Negligible: The “Poly” model is technically higher by 0.0047, but as the test shows, and that is not enough to matter.

3. The Confidence Interval: -0.025 to 0.016

This interval represents the likely range of the “true” difference between the models. Because the interval crosses zero (it goes from negative to positive), it tells us that the simple model could actually be better than the poly model, or vice-versa, but we can’t say for sure.

Rerun the Hosmer-Lemeshow test

Code

hoslem.test(df_all$Survived, fitted(reduced_model))


    Hosmer and Lemeshow goodness of fit (GOF) test

data:  df_all$Survived, fitted(reduced_model)
X-squared = 25.55, df = 8, p-value = 0.001254

Code

hoslem.test(df_all$Survived, fitted(model_final))


    Hosmer and Lemeshow goodness of fit (GOF) test

data:  df_all$Survived, fitted(model_final)
X-squared = 10.447, df = 8, p-value = 0.235

We next ran a Hosmer-Lemeshow test to check for ‘goodness of fit’, and hopefully come to a decisive conclusion about which model we prefer.

We need to check if predictive probabilities align with our model outcomes.

Unlike most tests, a small p-value in the Hosmer–Lemeshow test is undesirable because it indicates lack of fit. We prefer a larger p-value that suggests the model’s predictions align with the observed data.

Reduced model: p-value = 0.0013

The model’s predictions don’t line up well with reality, thus poor fit or miscalibration

Final model: p-value = 0.235

The model’s predictions are reasonably consistent with observed outcomes, thus adequate fit.

The reduced model rejects the null hypothesis and it is a poor fit. The final model fails to reject the null hypothesis meaning it is not significant, and that it’s output aligns with the observed results.

Our Conclusion:

The final model is the better model for predictive power.

Final Model (model_final):

\[ \text{Survived} \sim \text{Pclass} \times \text{Sex} + \text{poly}(\text{Age}, 2) + \text{FamilySize} \]

Residual Plot for Model_final

Code

library(arm)

par(
  bg = "#6C777F",
  col.axis = "white",
  col.lab = "white",
  col.main = "white",
  fg = "white"
)

binnedplot(
  fitted(model_final),
  residuals(model_final, type = "response"),
  main = "Binned Residual Plot: Final Model",
  xlab = "Expected Survival Probability",
  ylab = "Average Residuals",
  
  col.int = "#E3B23C",     # gold bands
  col.points = "#FFFFFF",  # FORCE white points
  pch = 16,                # solid circle
  cex = 1.3                # increase size so they stand out
)

The plot above confirms our Halsem-Lemeshow test.

All points are within the boundaries: Every single dot (representing bins of passengers) falls inside the grey 95% confidence lines. This is the “gold standard” for a logistic regression diagnostic. It means there are no groups where the model is consistently failing.
No “Shape” or Trend: The dots are scattered randomly above and below the zero line. There is no “U-shape” or diagonal slant, which proves that poly(Age, 2) and the interaction terms successfully captured the non-linearities in the data.
Performance across the spectrum: Whether the predicted probability of survival is low (0.1), medium (0.5), or high (0.9), the model remains accurate. It isn’t “guessing” better at one end than the other.

Overall model summary

The final logistic regression model indicates that survival on the Titanic is significantly influenced by passenger class, sex, age (modeled nonlinearly), and family size, with an important interaction between sex and passenger class. This model represents a substantial improvement over the null, with residual deviance decreasing from 1186 to 755, and achieves the lowest AIC (773.16) among all models considered, indicating the best overall fit.

A key result of this model is the interaction between sex and passenger class, which shows that these variables cannot be interpreted independently. Using female passengers in 1st class as the baseline, the results show that males in 1st class have dramatically lower odds of survival, and that being in 3rd class significantly reduces survival among females. However, the positive and significant interaction term for 3rd class males suggests that the negative effect of being male is less severe in 3rd class than in 1st class. In practical terms, this indicates that while a gender gap in survival exists across all classes, it is most pronounced in 1st class and somewhat reduced in 3rd class, likely reflecting differences in how evacuation protocols such as “women and children first” were enforced. The interaction for 2nd class is not statistically significant, suggesting no meaningful difference from 1st class in this respect.

The effect of age is also an important component of the model and is clearly nonlinear, as both polynomial terms are statistically significant. Rather than a simple linear decline, the relationship between age and survival follows a curved pattern, where children have higher survival probabilities, survival decreases through adulthood, and then levels off at older ages. This confirms that a linear specification would fail to capture the true structure of the relationship.

Family size also shows a significant negative effect, indicating that larger family groups are associated with lower survival probabilities, though the magnitude of this effect is smaller compared to class and sex.

Summary Conclusion

This analysis demonstrates that while predictive accuracy (AUC ≈ 0.85) can be achieved with a parsimonious linear model, such a model fails fundamental goodness-of-fit requirements (Hosmer-Lemeshow p < 0.001). By incorporating a quadratic term for Age and a Pclass/Sex interaction, we achieved a model that is not only accurate but statistically well-calibrated (p = 0.235).

Our findings confirm that Titanic survival was not merely a linear function of demographics. The survival advantage of being female was significantly moderated by socio-economic status, and the effect of age was non-linear, likely reflecting the “women and children first” protocol. Ultimately, the more complex model is preferred as it captures the nuanced, structured relationships inherent in this historic event.

Code

library(effects)
library(lattice)

eff <- allEffects(model_final)

trellis.par.set(
  background = list(col = "#0B1D2A"),
  panel.background = list(col = "#6C777F"),
  axis.line = list(col = "white"),
  axis.text = list(col = "white"),
  par.xlab.text = list(col = "white"),
  par.ylab.text = list(col = "white"),
  par.main.text = list(col = "white"),
  strip.background = list(col = "#6C777F"),
  strip.border = list(col = "#1C2E3A")
)

plot(
  eff,
  main = "Estimated Effects from Final Model",
  colors = "#E3B23C",
  band.colors = "#E3B23C",
  rug = FALSE
)

Model	Complexity	AIC	HL Test (p-value)	Verdict
Full Model	High	806.4	-	Over fit/Noisey
Reduced Model	Low (Best AIC)	804.83	0.001 (fail)	Poor fit
Final Model	Medium	773.16	0.235 (pass)	Winner