Code
pacman::p_load(tidyverse, janitor, ggplot2, car)“The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.”
credit: <https://www.kaggle.com/competitions/titanic/overview>
General Guidelines:
clearly define a response variable and predictors,
apply appropriate GLM methodology,
include model fitting, testing,
justify your modeling decisions and interpretations.
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1.
sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
pacman::p_load(tidyverse, janitor, ggplot2, car)df_all <- read.csv("train.csv")
summary(df_all) PassengerId Survived Pclass Name
Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
Median :446.0 Median :0.0000 Median :3.000 Mode :character
Mean :446.0 Mean :0.3838 Mean :2.309
3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :891.0 Max. :1.0000 Max. :3.000
Sex Age SibSp Parch
Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
Mode :character Median :28.00 Median :0.000 Median :0.0000
Mean :29.70 Mean :0.523 Mean :0.3816
3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
Max. :80.00 Max. :8.000 Max. :6.0000
NA's :177
Ticket Fare Cabin Embarked
Length:891 Min. : 0.00 Length:891 Length:891
Class :character 1st Qu.: 7.91 Class :character Class :character
Mode :character Median : 14.45 Mode :character Mode :character
Mean : 32.20
3rd Qu.: 31.00
Max. :512.33
str(df_all)'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
table(df_all$Survived)
0 1
549 342
str(df_all$Survived) int [1:891] 0 1 1 1 0 0 0 0 1 1 ...
df_all$Pclass <- factor(df_all$Pclass)
df_all$Sex <- factor(df_all$Sex)
df_all$Embarked <- factor(df_all$Embarked)
str(df_all)'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
After examining the structure and summary of the dataset, I began the data cleaning process by checking the distribution and type of the response variable (Survived) and converting relevant categorical variables (Pclass, Sex, and Embarked) into factors. This step ensures that the model correctly treats these variables as categorical rather than numeric, which is proper for interpretation and accurate estimation in logistic regression.
colSums(is.na(df_all))PassengerId Survived Pclass Name Sex Age
0 0 0 0 0 177
SibSp Parch Ticket Fare Cabin Embarked
0 0 0 0 0 0
df_all$Age[is.na(df_all$Age)] <- median(df_all$Age, na.rm = TRUE)
df_all$Fare[is.na(df_all$Fare)] <- median(df_all$Fare, na.rm = TRUE)
head(df_all) PassengerId Survived Pclass
1 1 0 3
2 2 1 1
3 3 1 3
4 4 1 1
5 5 0 3
6 6 0 3
Name Sex Age SibSp Parch
1 Braund, Mr. Owen Harris male 22 1 0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
3 Heikkinen, Miss. Laina female 26 0 0
4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
5 Allen, Mr. William Henry male 35 0 0
6 Moran, Mr. James male 28 0 0
Ticket Fare Cabin Embarked
1 A/5 21171 7.2500 S
2 PC 17599 71.2833 C85 C
3 STON/O2. 3101282 7.9250 S
4 113803 53.1000 C123 S
5 373450 8.0500 S
6 330877 8.4583 Q
After checking for missing values, I found that Age and Fare contained some NA values, so I handled them by imputing the median of each variable. This approach preserves the dataset size while providing a reasonable central value that is less sensitive to outliers than the mean, allowing the model to run without errors due to missing data.
df_all$PassengerId <- NULL
df_all$Name <- NULL
df_all$Ticket <- NULL
df_all$FamilySize <- df_all$SibSp + df_all$Parch + 1
df_all$Embarked <- factor(df_all$Embarked)
head(df_all) Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked FamilySize
1 0 3 male 22 1 0 7.2500 S 2
2 1 1 female 38 1 0 71.2833 C85 C 2
3 1 3 female 26 0 0 7.9250 S 1
4 1 1 female 35 1 0 53.1000 C123 S 2
5 0 3 male 35 0 0 8.0500 S 1
6 0 3 male 28 0 0 8.4583 Q 1
Lastly, I removed variables such as PassengerId, Name, and Ticket, as they do not provide meaningful predictive value for the model. I then created a new variable, FamilySize, by combining SibSp and Parch to better capture the overall family context of each passenger in a single, more interpretable feature. This helps simplify the model while still retaining the underlying information about family relationships. Finally, I ensured Embarked was properly formatted as a categorical variable so it would be treated correctly in the analysis, before beginning EDA.
titanic_plot <- df_all %>%
mutate(
Survived = factor(
Survived,
levels = c(0, 1),
labels = c("Did not survive", "Survived")
),
Pclass = factor(Pclass),
Sex = factor(Sex, levels = c("male", "female"))
)
ggplot(
titanic_plot %>% filter(!is.na(Age)),
aes(x = Age, fill = Survived, color = Survived)
) +
geom_density(
alpha = 0.35,
linewidth = 1.1
) +
scale_fill_manual(
values = c(
"Did not survive" = "#5C7A89",
"Survived" = "#E3B23C"
)
) +
scale_color_manual(
values = c(
"Did not survive" = "#8AA6B3",
"Survived" = "#FFD166"
)
) +
labs(
title = "Age Distribution by Survival Status",
subtitle = "Survival patterns change across age, supporting a nonlinear age effect",
x = "Age",
y = "Density",
fill = "",
color = ""
) +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#0B1D2A", color = NA),
panel.background = element_rect(fill = "#0B1D2A", color = NA),
text = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.title = element_text(color = "white"),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12, color = "#E3B23C"),
panel.grid.major = element_line(color = "#1C2E3A"),
panel.grid.minor = element_blank(),
legend.background = element_rect(fill = "#0B1D2A"),
legend.text = element_text(color = "white"),
legend.title = element_blank(),
axis.ticks = element_blank()
)ggplot(df_all, aes(x = Sex, fill = factor(Survived))) +
geom_bar(position = "fill", color = "#0B1D2A") +
scale_fill_manual(
values = c(
"0" = "#5C7A89", # did not survive (cold, muted)
"1" = "#E3B23C" # survived (warm, gold)
),
labels = c("Did not survive", "Survived")
) +
labs(
title = "Survival by Gender",
subtitle = "Women had dramatically higher survival rates",
x = "Gender",
y = "Proportion",
fill = ""
) +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#0B1D2A", color = NA),
panel.background = element_rect(fill = "#0B1D2A", color = NA),
text = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.title = element_text(color = "white"),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12, color = "#E3B23C"),
panel.grid.major = element_line(color = "#1C2E3A"),
panel.grid.minor = element_blank(),
legend.background = element_rect(fill = "#0B1D2A"),
legend.text = element_text(color = "white"),
axis.ticks = element_blank()
)ggplot(df_all, aes(x = Pclass, fill = factor(Survived))) +
geom_bar(position = "fill", color = "#0B1D2A") +
scale_fill_manual(
values = c(
"0" = "#5C7A89", # did not survive (cold blue-gray)
"1" = "#E3B23C" # survived (warm gold)
),
labels = c("Did not survive", "Survived")
) +
labs(
title = "Survival by Passenger Class",
subtitle = "Higher class passengers had significantly greater survival rates",
x = "Passenger Class",
y = "Proportion",
fill = ""
) +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#0B1D2A", color = NA),
panel.background = element_rect(fill = "#0B1D2A", color = NA),
text = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.title = element_text(color = "white"),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12, color = "#E3B23C"),
panel.grid.major = element_line(color = "#1C2E3A"),
panel.grid.minor = element_blank(),
legend.background = element_rect(fill = "#0B1D2A"),
legend.text = element_text(color = "white"),
axis.ticks = element_blank()
)titanic_plot <- df_all %>%
mutate(
Survived = factor(
Survived,
levels = c(0, 1),
labels = c("Did not survive", "Survived")
),
Pclass = factor(Pclass),
Sex = factor(Sex, levels = c("male", "female"))
)
ggplot(
titanic_plot,
aes(x = Sex, fill = Survived)
) +
geom_bar(position = "fill", color = "#0B1D2A") +
facet_wrap(~ Pclass) +
scale_fill_manual(
values = c(
"Did not survive" = "#5C7A89", # cold loss
"Survived" = "#E3B23C" # warm survival
)
) +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Survival Depends on Both Class and Gender",
subtitle = "The advantage of being female changes across passenger class",
x = "Gender",
y = "Proportion",
fill = ""
) +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#0B1D2A", color = NA),
panel.background = element_rect(fill = "#0B1D2A", color = NA),
text = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.title = element_text(color = "white"),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12, color = "#E3B23C"),
panel.grid.major = element_line(color = "#1C2E3A"),
panel.grid.minor = element_blank(),
legend.background = element_rect(fill = "#0B1D2A"),
legend.text = element_text(color = "white"),
strip.background = element_rect(fill = "#1C2E3A", color = NA),
strip.text = element_text(color = "#E3B23C", face = "bold"),
axis.ticks = element_blank()
)ggplot(df_all, aes(x = FamilySize, y = Survived)) +
# Jittered points (background data)
geom_jitter(
width = 0.25,
height = 0.05,
alpha = 0.25,
color = "#6C8EA4" # muted steel (fits your palette)
) +
# Logistic curve (MAIN STORY)
geom_smooth(
method = "glm",
method.args = list(family = "binomial"),
color = "#E3B23C", # survival gold
size = 1.4,
se = TRUE,
fill = "#E3B23C",
alpha = 0.15
) +
labs(
title = "Family Size and Survival",
subtitle = "Larger families faced lower odds of survival",
x = "Family Size",
y = "Probability of Survival"
) +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#0B1D2A", color = NA),
panel.background = element_rect(fill = "#0B1D2A", color = NA),
text = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.title = element_text(color = "white"),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12, color = "#E3B23C"),
panel.grid.major = element_line(color = "#1C2E3A"),
panel.grid.minor = element_blank()
)Small groups or individuals may have had mobility advantages
Large families may have had difficulty coordinating escape
\[ \text{logit}(P(Y=1)) = \ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k \]
Based on the data and the outcome of our Y variable, ‘Survived’ has a binary outcome of either 1 or 0, we are using the logistic regression model. Logistic regression is tailored for binary outcomes measuring probability of either outcome which is called log-odds and represented as:
\[ \text{logit}(P(Y=1)) = \ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) \]
Now, we have our full model to begin our analysis:
\[ \begin{split} \ln\left(\frac{P(\text{survival}=1)}{1-P(\text{survival}=1)}\right) = \beta_0 + \beta_1(\text{pclass}) + \beta_2(\text{sex}) + \beta_3(\text{Age}) \ + \\ \beta_4(\text{fare}) + \beta_5(\text{embarked}) + \beta_6(\text{FamilySize}) \end{split} \]
full_mod <- glm(Survived ~ Pclass + Sex + Age + Fare + Embarked + FamilySize, data = df_all, family = "binomial")
summary(full_mod)
Call:
glm(formula = Survived ~ Pclass + Sex + Age + Fare + Embarked +
FamilySize, family = "binomial", data = df_all)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 16.546634 611.180599 0.027 0.97840
Pclass2 -0.887693 0.296238 -2.997 0.00273 **
Pclass3 -2.124560 0.297076 -7.152 8.58e-13 ***
Sexmale -2.728333 0.200349 -13.618 < 2e-16 ***
Age -0.038075 0.007853 -4.849 1.24e-06 ***
Fare 0.002501 0.002471 1.012 0.31151
EmbarkedC -12.280651 611.180480 -0.020 0.98397
EmbarkedQ -12.368549 611.180532 -0.020 0.98385
EmbarkedS -12.736976 611.180468 -0.021 0.98337
FamilySize -0.218108 0.068112 -3.202 0.00136 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1186.66 on 890 degrees of freedom
Residual deviance: 786.04 on 881 degrees of freedom
AIC: 806.04
Number of Fisher Scoring iterations: 13
exp(coef(full_mod)) (Intercept) Pclass2 Pclass3 Sexmale Age Fare
1.535013e+07 4.116041e-01 1.194856e-01 6.532812e-02 9.626407e-01 1.002504e+00
EmbarkedC EmbarkedQ EmbarkedS FamilySize
4.640675e-06 4.250181e-06 2.940368e-06 8.040390e-01
All interpretations are:
Pclass2 (-0.89)
Holding all other variables constant, passengers in 2nd class have lower odds of survival than 1st class.
Holding all other variables constant, passengers in 2nd class have 58.9% lower odds of survival than those in 1st class.
Pclass3 (-2.12)
Passengers in 3rd class have much lower odds of survival than 1st class, holding all other variables constant.
Passengers in 3rd class have 87.9% lower odds of survival than those in 1st class, holding all other variables constant.
Sexmale (-2.73)
Males have substantially lower odds of survival than females, holding all other variables constant.
Males have 93.5% lower odds of survival than females, holding all other variables constant.
Age (-0.038)
Each additional year of age reduces the odds of survival slightly, holding all other variables constant.
Each additional year of age reduces the odds of survival by approximately 3.7%, holding all other variables constant.
FamilySize (-0.22)
Larger families are associated with lower odds of survival, holding all other variables constant.
Larger families are associated with 19.7% lower odds of survival per additional family member, holding all other variables constant.
Fare (p = 0.31)
# Creating the null model to compare
null_mod <- glm(Survived ~ 1, data= df_all, family = "binomial")
anova(null_mod, full_mod, test = "Chisq")Analysis of Deviance Table
Model 1: Survived ~ 1
Model 2: Survived ~ Pclass + Sex + Age + Fare + Embarked + FamilySize
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 890 1186.66
2 881 786.04 9 400.61 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H₀: the additional predictors do not improve the model (null model is sufficient)
\[ \ln(\mu) = \beta_0 \]
Results:
A deviance reduction of 400.61 on 9 degrees of freedom indicates a substantial improvement in model fit, meaning the added predictors greatly increase the model’s ability to explain the outcome.
p-value < 2.2e-16 is extremely small so we reject the null hypothesis ( \(H_0\)) .
The predictors (Pclass, Sex, Age, Fare, Embarked, FamilySize) collectively improve the model significantly compared to having no predictors at all.
vif(full_mod) GVIF Df GVIF^(1/(2*Df))
Pclass 2.054066 2 1.197164
Sex 1.202613 1 1.096637
Age 1.291333 1 1.136368
Fare 1.605060 1 1.266910
Embarked 1.232556 3 1.035463
FamilySize 1.303258 1 1.141603
# 1. Calculate VIF
vif_values <- vif(full_mod)
# 2. Convert matrix to data frame correctly
vif_df <- as.data.frame(vif_values)
# 3. Add the variable names as a column
vif_df$Variable <- rownames(vif_df)
vif_df$Status <- ifelse(vif_df$GVIF > 5, "High", "Low")
ggplot(vif_df, aes(x = reorder(Variable, GVIF), y = GVIF, fill = Status)) +
geom_bar(stat = "identity", width = 0.7, color = "#0B1D2A") +
scale_fill_manual(
values = c(
"Low" = "#6C8EA4", # muted steel-blue
"High" = "#C94C4C" # restrained warning red
)
) +
geom_hline(
yintercept = 5,
linetype = "dashed",
color = "#E3B23C",
linewidth = 1
) +
coord_flip() +
labs(
title = "VIF Check: Multicollinearity",
subtitle = "All predictors remain below the concern threshold",
caption = "Dashed gold line marks VIF = 5",
x = NULL,
y = "VIF Value"
) +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#0B1D2A", color = NA),
panel.background = element_rect(fill = "#0B1D2A", color = NA),
text = element_text(color = "white"),
axis.text = element_text(color = "white", face = "bold"),
axis.title = element_text(color = "white"),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12, color = "#E3B23C"),
plot.caption = element_text(color = "#A7A9AC", face = "italic"),
panel.grid.major = element_line(color = "#1C2E3A"),
panel.grid.minor = element_blank(),
legend.position = "none",
axis.ticks = element_blank()
)An initial check of VIF shows that none of the variables exceed the threshold of 5, suggesting there are no serious multicollinearity issues and each variable is largely contributing independently to the outcome.
The only notable point from the VIF values is a slight relationship between Pclass and Fare. Since it’s reasonable to assume that Fare is influenced by Pclass, we can remove Fare without losing meaningful information about the dependent variable. Using this reasoning, we construct our first reduced model by keeping only the significant variables from the full model, excluding Fare.
reduced_model <- glm(
Survived ~ Pclass + Sex + Age + FamilySize,
data = df_all,
family = "binomial"
)
summary(reduced_model)
Call:
glm(formula = Survived ~ Pclass + Sex + Age + FamilySize, family = "binomial",
data = df_all)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.247858 0.430886 9.858 < 2e-16 ***
Pclass2 -1.169760 0.260580 -4.489 7.15e-06 ***
Pclass3 -2.349265 0.242962 -9.669 < 2e-16 ***
Sexmale -2.783225 0.197773 -14.073 < 2e-16 ***
Age -0.039109 0.007793 -5.018 5.21e-07 ***
FamilySize -0.215382 0.064242 -3.353 8e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1186.66 on 890 degrees of freedom
Residual deviance: 792.83 on 885 degrees of freedom
AIC: 804.83
Number of Fisher Scoring iterations: 5
Our reduced model is a simple linear model.
Pclass2 (-1.17)
Holding other variables constant, passengers in 2nd class have lower odds of survival than 1st class.
Pclass3 (-2.35)
Holding other variables constant, passengers in 3rd class have much lower odds of survival than 1st class.
Sexmale (-2.78)
Holding other variables constant, males have dramatically lower odds of survival than females.
Age (-0.039)
Holding other variables constant, each additional year of age reduces survival odds slightly.
FamilySize (-0.215)
Holding other variables constant, larger family size reduces survival odds.
All predictors in the reduced model are now highly statistically significant, with p-values well below our conventional thresholds, indicating that each variable contributes meaningfully to explaining our dependent variable (Survived). The coefficient signs are also consistent with expectations, survival decreases for lower passenger classes, males, older individuals, and larger family sizes. The substantial drop from the null deviance (1186.66) to the residual deviance (792.83) further suggests that the model captures a large portion of the variation in the outcome.
Next, we will compare this reduced model to the full model using AIC and BIC to confirm that simplifying the model did not come at the cost of overall model quality.
AIC(full_mod, reduced_model) df AIC
full_mod 10 806.0443
reduced_model 6 804.8306
BIC(full_mod, reduced_model) df BIC
full_mod 10 853.9677
reduced_model 6 833.5847
Results:
Full model AIC: 806.04
Reduced model AIC: 804.83
The reduced model has a slightly lower AIC (804.83 vs. 806.04), indicating a better overall fit compared to the full model. While the residual deviance increases slightly (786 to 792), this change is minimal and expected when simplifying a model. By removing less stable or noisy variables, we’ve made the model more efficient without sacrificing meaningful explanatory power. Overall, this is the ideal outcome: a simpler model that performs just as well.
anova(reduced_model, full_mod, test = "Chisq")Analysis of Deviance Table
Model 1: Survived ~ Pclass + Sex + Age + FamilySize
Model 2: Survived ~ Pclass + Sex + Age + Fare + Embarked + FamilySize
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 885 792.83
2 881 786.04 4 6.7863 0.1476
Model 1 (reduced):
Survived ~ Pclass + Sex + Age + FamilySize
Model 2 (full):
adds Fare + Embarked
Result:
Deviance difference = 6.79
df difference = 4
p-value = 0.1476
Adding Fare and Embarked to the model does not significantly improve model fit compared to the reduced model (χ²(4) = 6.79, p = 0.148).
Interpretation
The improvement in fit is small relative to the extra complexity
The p-value > 0.05 meaning it fails to reject the null hypothesis
So the reduced model is preferred
The likelihood ratio test indicates that the additional predictors (Fare and Embarked) do not significantly improve the model, supporting the use of the more parsimonious reduced model.
#crPlots(reduced_model)
par(
bg = "#6C777F",
col.axis = "white",
col.lab = "white",
col.main = "white",
fg = "white", # axes/lines
las = 1 # horizontal axis labels
)
crPlots(reduced_model)ggplot(df_all, aes(x = Age, y = Survived)) +
geom_jitter(
width = 0,
height = 0.05,
alpha = 0.25,
color = "#6C8EA4"
) +
geom_smooth(
method = "glm",
method.args = list(family = "binomial"),
formula = y ~ poly(x, 2),
se = TRUE,
color = "#E3B23C",
fill = "#E3B23C",
alpha = 0.15,
linewidth = 1.4
) +
labs(
title = "Age and Survival",
subtitle = "A curved pattern supports adding a nonlinear age term",
x = "Age",
y = "Survival Probability"
) +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#0B1D2A", color = NA),
panel.background = element_rect(fill = "#0B1D2A", color = NA),
text = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.title = element_text(color = "white"),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12, color = "#E3B23C"),
panel.grid.major = element_line(color = "#1C2E3A"),
panel.grid.minor = element_blank(),
axis.ticks = element_blank()
)Obsevations:
The categorical variables show clear separation in their effects. For Pclass, survival decreases consistently from 1st to 3rd class, reinforcing that it is a strong predictor. Similarly, Sex shows a pronounced difference, with females having much higher survival than males, confirming its strong influence in the model.
For the continuous variables, the relationships are more subtle. The Age plot shows a slight downward trend but also some curvature, suggesting the relationship may not be strictly linear and supporting the use of a polynomial term.
A polynomial term is preferred over a log transform because the relationship between Age and survival shows curvature rather than diminishing returns. The quadratic term allows the model to capture changes in direction, which a log transformation cannot.
The FamilySize plot appears relatively flat with a mild negative trend, indicating a weaker effect and possibly limited contribution compared to other predictors.
Overall, there are no major violations of linearity or extreme patterns in the residuals, but the slight curvature in Age stands out as the main area where a nonlinear specification could improve the model.
library(patchwork)
p_linear <- ggplot(df_all, aes(x = Age, y = Survived)) +
geom_jitter(
height = 0.05,
width = 0,
alpha = 0.25,
color = "#6C8EA4"
) +
geom_smooth(
method = "glm",
method.args = list(family = "binomial"),
se = FALSE,
color = "#5C7A89", # colder, less emphasis
linewidth = 1.2
) +
labs(
title = "Linear Fit",
subtitle = "Assumes survival changes at a constant rate",
x = "Age",
y = "Survival Probability"
) +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#0B1D2A", color = NA),
panel.background = element_rect(fill = "#0B1D2A", color = NA),
text = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.title = element_text(color = "white"),
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 11, color = "#5C7A89"),
panel.grid.major = element_line(color = "#1C2E3A"),
panel.grid.minor = element_blank(),
axis.ticks = element_blank()
)
p_poly <- ggplot(df_all, aes(x = Age, y = Survived)) +
geom_jitter(
height = 0.05,
width = 0,
alpha = 0.25,
color = "#6C8EA4"
) +
geom_smooth(
method = "glm",
method.args = list(family = "binomial"),
formula = y ~ poly(x, 2),
se = TRUE,
color = "#E3B23C", # gold = important
fill = "#E3B23C",
alpha = 0.15,
linewidth = 1.4
) +
labs(
title = "Nonlinear (Quadratic) Fit",
subtitle = "Captures the true pattern in survival",
x = "Age",
y = "Survival Probability"
) +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#0B1D2A", color = NA),
panel.background = element_rect(fill = "#0B1D2A", color = NA),
text = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.title = element_text(color = "white"),
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 11, color = "#E3B23C"),
panel.grid.major = element_line(color = "#1C2E3A"),
panel.grid.minor = element_blank(),
axis.ticks = element_blank()
)
p_linear + p_polyI wanted to have a visual confirmation of what adding a polynomial transform would do to the Age variable.
Both graphs support the idea that Age has a non-linear relationship with survival, but when I add the poly(Age, 2) transform, it reveals a curved “U-shape” that a straight line smooths over. The linear fit suggests a simple downward trend, implying survival just decreases with age, whereas the quadratic fit shows that this isn’t the full story. It highlights that children have relatively higher survival rates, survival drops through middle ages, and then levels off or slightly increases again at older ages.
Now, we have a justified reason to apply a poly transform to Age, and then check the reduced model variables for any influential outliers using Cook’s distance.
#plot(reduced_model, which = 4) # Cook's distance
# Extract Cook's distance
cook_df <- data.frame(
index = 1:nrow(df_all),
cooks = cooks.distance(reduced_model)
)
# Threshold (common rule of thumb)
threshold <- 4 / nrow(df_all)
cook_df$flag <- cook_df$cooks > threshold
ggplot(cook_df, aes(x = index, y = cooks)) +
geom_col(aes(fill = flag)) +
scale_fill_manual(
values = c("FALSE" = "#6C8EA4", "TRUE" = "#C94C4C")
) +
geom_hline(
yintercept = threshold,
linetype = "dashed",
color = "#E3B23C",
linewidth = 1
) +
labs(
title = "Influential Observations (Cook's Distance)",
subtitle = "A few observations show higher influence, but none dominate the model",
x = "Observation Index",
y = "Cook's Distance"
) +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#0B1D2A", color = NA),
panel.background = element_rect(fill = "#0B1D2A", color = NA),
text = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.title = element_text(color = "white"),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12, color = "#E3B23C"),
panel.grid.major = element_line(color = "#1C2E3A"),
panel.grid.minor = element_blank(),
axis.ticks = element_blank()
)The Cook’s distance plot identifies a small number of observations with higher influence on the model (observations 262, 298, and 631). While these exceed the common threshold of 4/n, their values remain relatively low overall, indicating moderate rather than extreme influence. As a result, there is no strong evidence that these points unduly distort the model, and no immediate need to remove them, though they may warrant further inspection.
Since we are not removing any outliers and we have good justification for applying a polynomial transform to Age, a final check on the reduced model is to assess goodness of fit using the Hosmer–Lemeshow test.
library(ResourceSelection)
hoslem.test(df_all$Survived, fitted(reduced_model))
Hosmer and Lemeshow goodness of fit (GOF) test
data: df_all$Survived, fitted(reduced_model)
X-squared = 25.55, df = 8, p-value = 0.001254
The Hosmer–Lemeshow test is statistically significant (p = 0.001254), leading us to reject the null hypothesis of good fit. This indicates that the model’s predicted probabilities do not align well with the observed outcomes, suggesting a lack of fit and that further model refinement may be needed.
The next step is to apply the polynomial transform.
model_poly <- glm(
Survived ~ Pclass + Sex + poly(Age, 2) + FamilySize,
data = df_all,
family = binomial
)
# comparing models with AIC
AIC(reduced_model, model_poly) df AIC
reduced_model 6 804.8306
model_poly 7 800.2359
df_all$FamilyGroup <- cut(df_all$FamilySize,
breaks = c(0,1,4,10),
labels = c("Alone","Small","Large"))
# baseline
model1 <- reduced_model
# nonlinear age
model2 <- glm(Survived ~ Pclass + Sex + poly(Age,2) + FamilySize, family="binomial", data=df_all)
# interaction
model3 <- glm(Survived ~ Pclass * Sex + Age + FamilySize, family="binomial", data=df_all)
# both
model4 <- glm(Survived ~ Pclass * Sex + poly(Age,2) + FamilySize, family="binomial", data=df_all)
AIC(model1, model2, model3, model4) df AIC
model1 6 804.8306
model2 7 800.2359
model3 8 779.2358
model4 9 773.1571
The polynomial model improves the overall fit, with a lower AIC (800.24 vs. 804.83), indicating that adding a nonlinear term for Age better captures the relationship with survival. This supports the earlier visual evidence that Age is not strictly linear and benefits from a quadratic specification.
From here, I explored multiple model variations to ensure we are not missing important structure in the data. In particular, the component + residual plots suggested a possible interaction between Pclass and Sex, so I tested models with and without this interaction, as well as with the polynomial Age term, to compare their overall fit and determine the most appropriate specification.
Model 4 has the lowest AIC (773.16), indicating it provides the best overall fit among the models considered. Although it is the most complex model (Df = 9), the improvement in fit more than justifies the additional parameters, suggesting that both the interaction between Pclass and Sex and the nonlinear effect of Age meaningfully contribute to explaining survival.
model_final <- glm(
Survived ~ Pclass * Sex + poly(Age, 2) + FamilySize,
data = df_all,
family = binomial
)
summary(model_final)
Call:
glm(formula = Survived ~ Pclass * Sex + poly(Age, 2) + FamilySize,
family = binomial, data = df_all)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.27469 0.62019 6.893 5.48e-12 ***
Pclass2 -1.11729 0.73094 -1.529 0.126370
Pclass3 -3.84351 0.62575 -6.142 8.14e-10 ***
Sexmale -4.06840 0.62215 -6.539 6.18e-11 ***
poly(Age, 2)1 -16.58290 3.14577 -5.271 1.35e-07 ***
poly(Age, 2)2 8.02330 2.75294 2.914 0.003563 **
FamilySize -0.26851 0.07302 -3.677 0.000236 ***
Pclass2:Sexmale -0.44946 0.80564 -0.558 0.576922
Pclass3:Sexmale 2.08906 0.66412 3.146 0.001657 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1186.66 on 890 degrees of freedom
Residual deviance: 755.16 on 882 degrees of freedom
AIC: 773.16
Number of Fisher Scoring iterations: 6
The final model, which includes both the polynomial Age term and the interaction between Pclass and Sex, provides the best overall fit (AIC = 773.16) and a substantial reduction in deviance from the null model. Most predictors are highly significant, including the nonlinear Age terms and FamilySize, confirming their importance in explaining survival. The interaction term for Pclass3 and Sex is also significant, indicating that the effect of gender on survival differs meaningfully across passenger classes. While not every interaction term is significant, the overall improvement in model fit suggests that including both nonlinearity and interaction captures important structure in the data.
Before concluding that this is our final model, the next step is to evaluate its classification accuracy on a train/test split to see how well it generalizes to unseen data and to provide a practical check on its predictive performance.
test_data <- read.csv("test.csv")
test_data$Pclass <- factor(test_data$Pclass)
test_data$Sex <- factor(test_data$Sex)
test_data$Embarked <- factor(test_data$Embarked)
colSums(is.na(test_data))PassengerId Pclass Name Sex Age SibSp
0 0 0 0 86 0
Parch Ticket Fare Cabin Embarked
0 0 1 0 0
test_data$Age[is.na(test_data$Age)] <- median(test_data$Age, na.rm = TRUE)
test_data$Fare[is.na(test_data$Fare)] <- median(test_data$Fare, na.rm = TRUE)
test_data$PassengerId <- NULL
test_data$Name <- NULL
test_data$Ticket <- NULL
test_data$FamilySize <- test_data$SibSp + test_data$Parch + 1
test_data$Embarked <- factor(test_data$Embarked)
str(test_data)'data.frame': 418 obs. of 9 variables:
$ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
$ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
$ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
$ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
$ Fare : num 7.83 7 9.69 8.66 12.29 ...
$ Cabin : chr "" "" "" "" ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
$ FamilySize: num 1 2 1 1 3 1 1 3 1 3 ...
str(df_all)'data.frame': 891 obs. of 11 variables:
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 28 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ FamilySize : num 2 2 1 2 1 1 1 5 3 2 ...
$ FamilyGroup: Factor w/ 3 levels "Alone","Small",..: 2 2 1 2 1 1 1 3 2 2 ...
The external test set provided by Kaggle was reserved for final validation; however, since ground-truth labels (the known outcomes for passenger survival) were unavailable, local cross-validation and a 20% holdout split will be utilized to assess generalization.
# renaming the original dataset
train_data <- df_all
set.seed(123)
n <- nrow(train_data)
# 80% to train, 20% to test
train_id <- sample(seq_len(n), size = floor(0.8 * n))
local_train <- train_data[train_id, ]
local_test <- train_data[-train_id, ]classification_accuracy <- function(train_data, test_data, formula_obj,
threshold = 0.5) {
fit <- glm(formula_obj, data = train_data, family = binomial)
prob <- predict(fit, newdata = test_data, type = "response")
pred <- ifelse(prob > threshold, 1, 0)
mean(pred == test_data$Survived)
}
## Example single split
set.seed(123)
n <- nrow(train_data)
train_id <- sample(seq_len(n), size = floor(0.7 * n), replace = FALSE)
train_dat <- train_data[train_id, ]
test_dat <- train_data[-train_id, ]
classification_accuracy(train_dat, test_dat, Survived ~ Pclass + Sex + Age + FamilySize)[1] 0.7910448
classification_accuracy(train_dat, test_dat, Survived ~ Pclass * Sex + Age + FamilySize)[1] 0.7835821
classification_accuracy(train_dat, test_dat, Survived ~ Pclass * Sex + poly(Age,2) + FamilySize)[1] 0.8022388
These accuracy scores are all fairly close (around 78–80%), indicating that the models perform similarly in terms of predictive accuracy, with a slight improvement in the best model (~80.2%). The small differences suggest incremental gains rather than dramatic improvements, reinforcing that model refinements are helping, but not drastically changing overall performance.
Here the classification accuracy shows the final model as the best. Because this score tends to be a bit optimistic we will conduct a k-fold cross-validation accuracy to test the models further. Normally, this would be done with an ‘unseen’ dataset, but due to the issues mentioned before, the training set will be split instead to perform the test.
cv_accuracy_glm <- function(data, formula_obj, K = 5, threshold = 0.5,
seed = 1234) {
set.seed(seed)
n <- nrow(data)
fold_id <- sample(rep(1:K, length.out = n))
acc_vec <- numeric(K)
for (k in 1:K) {
train_data <- data[fold_id != k, , drop = FALSE]
valid_data <- data[fold_id == k, , drop = FALSE]
fit <- glm(formula_obj, data = train_data, family = binomial)
prob <- predict(fit, newdata = valid_data, type = "response")
pred <- ifelse(prob > threshold, 1, 0)
acc_vec[k] <- mean(pred == valid_data$Survived)
}
mean(acc_vec)
}
# simple model
result1 <- cv_accuracy_glm(df_all, Survived ~ Pclass + Sex + Age + FamilySize)
print(result1)[1] 0.793497
# winning model with the interaction and polynomial
result2 <- cv_accuracy_glm(df_all, Survived ~ Pclass * Sex + poly(Age, 2) + FamilySize)
print(result2)[1] 0.7923796
The cross-validation accuracy scores are nearly identical (~0.793 vs. ~0.792), indicating that both models perform essentially the same when evaluated on unseen data. This suggests that the added complexity in the final model does not provide a meaningful improvement in predictive performance.
After running the simple linear model against the final model we see there is not a statistical difference to justify using the more complex model. Next, we will check if the interaction between Pclass and Sex is doing the heavy lifting and making any statistical difference to keep that model instead of the simple linear one.
ggplot(df_all, aes(x = Pclass, fill = Sex)) +
geom_bar(position = "fill", color = "#0B1D2A") +
facet_wrap(
~ Survived,
labeller = labeller(
Survived = c(
"0" = "Did Not Survive",
"1" = "Survived"
)
)
) +
scale_fill_manual(
values = c(
"male" = "#6C8EA4", # muted steel blue
"female" = "#E3B23C" # survival gold / protected group
),
labels = c("Male", "Female")
) +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Class and Gender Shaped Survival Together",
subtitle = "Survival patterns change when passenger class and gender are viewed jointly",
x = "Passenger Class",
y = "Proportion",
fill = ""
) +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#0B1D2A", color = NA),
panel.background = element_rect(fill = "#0B1D2A", color = NA),
text = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.title = element_text(color = "white"),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12, color = "#E3B23C"),
panel.grid.major = element_line(color = "#1C2E3A"),
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = "#1C2E3A", color = NA),
strip.text = element_text(color = "#E3B23C", face = "bold"),
legend.background = element_rect(fill = "#0B1D2A"),
legend.text = element_text(color = "white"),
axis.ticks = element_blank()
)Survival depends on structured relationships:
gender effects vary by class
age has a nonlinear impact
family size matters independently
print(result1)[1] 0.793497
result3 <- cv_accuracy_glm(df_all, Survived ~ Pclass * Sex + Age + FamilySize)
print(result3)[1] 0.7912811
Again, we see the same insignificant difference to justify using the any version other than the simple linear model which we first created after looking at the full model.
library(pROC)
library(dplyr)
# 1. Generate probabilities for the Simple Model
fit_linear <- glm(
Survived ~ Pclass + Sex + Age + FamilySize,
data = train_dat,
family = binomial
)
prob_linear <- predict(fit_linear, newdata = test_dat, type = "response")
# 2. Generate probabilities for the Poly/Interaction Model
fit_poly <- glm(
Survived ~ Pclass * Sex + poly(Age, 2) + FamilySize,
data = train_dat,
family = binomial
)
prob_poly <- predict(fit_poly, newdata = test_dat, type = "response")
# 3. Create ROC objects
roc_linear <- roc(test_dat$Survived, prob_linear)
roc_poly <- roc(test_dat$Survived, prob_poly)
# 4. Convert ROC objects to data frames
roc_df <- bind_rows(
data.frame(
specificity = roc_linear$specificities,
sensitivity = roc_linear$sensitivities,
Model = "Simple Model"
),
data.frame(
specificity = roc_poly$specificities,
sensitivity = roc_poly$sensitivities,
Model = "Final Model"
)
) %>%
mutate(
false_positive_rate = 1 - specificity
)
# 5. AUC labels
auc_linear <- round(as.numeric(auc(roc_linear)), 3)
auc_poly <- round(as.numeric(auc(roc_poly)), 3)
# 6. Plot
ggplot(
roc_df,
aes(x = false_positive_rate, y = sensitivity, color = Model)
) +
geom_line(linewidth = 1.4) +
geom_abline(
slope = 1,
intercept = 0,
linetype = "dashed",
color = "#5C7A89",
linewidth = 0.9
) +
scale_color_manual(
values = c(
"Simple Model" = "#6C8EA4",
"Final Model" = "#E3B23C"
),
labels = c(
paste0("Simple Model: AUC = ", auc_linear),
paste0("Final Model: AUC = ", auc_poly)
)
) +
labs(
title = "ROC Curve Comparison",
subtitle = "Both models show excellent classification performance",
x = "False Positive Rate",
y = "True Positive Rate",
color = ""
) +
coord_equal() +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#0B1D2A", color = NA),
panel.background = element_rect(fill = "#0B1D2A", color = NA),
text = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.title = element_text(color = "white"),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12, color = "#E3B23C"),
panel.grid.major = element_line(color = "#1C2E3A"),
panel.grid.minor = element_blank(),
legend.background = element_rect(fill = "#0B1D2A"),
legend.text = element_text(color = "white"),
axis.ticks = element_blank()
)roc.test(roc_linear, roc_poly)
DeLong's test for two correlated ROC curves
data: roc_linear and roc_poly
Z = -0.44447, p-value = 0.6567
alternative hypothesis: true difference in AUC is not equal to 0
95 percent confidence interval:
-0.02576035 0.01623654
sample estimates:
AUC of roc1 AUC of roc2
0.8534226 0.8581845
AUC Score:
0.5: No better than a coin flip.
0.7–0.8: Good model.
0.8–0.9: Excellent model.
There is no statistical difference between the simple model and the complex one.
1. The Verdict: p-value = 0.6567
In statistics, we look for a p-value less than 0.05 to claim a significant difference. The p-value is 0.6567, which is very high. This means the slight increase in AUC (from 0.853 to 0.858) is likely just due to random noise in the data rather than the polynomial or interaction terms actually being better.
2. The AUC Scores: 0.853 vs. 0.858
Both models are Excellent: Any AUC above 0.85 is considered a very strong model for the Titanic dataset.
The Difference is Negligible: The “Poly” model is technically higher by 0.0047, but as the test shows, and that is not enough to matter.
3. The Confidence Interval: -0.025 to 0.016
This interval represents the likely range of the “true” difference between the models. Because the interval crosses zero (it goes from negative to positive), it tells us that the simple model could actually be better than the poly model, or vice-versa, but we can’t say for sure.
hoslem.test(df_all$Survived, fitted(reduced_model))
Hosmer and Lemeshow goodness of fit (GOF) test
data: df_all$Survived, fitted(reduced_model)
X-squared = 25.55, df = 8, p-value = 0.001254
hoslem.test(df_all$Survived, fitted(model_final))
Hosmer and Lemeshow goodness of fit (GOF) test
data: df_all$Survived, fitted(model_final)
X-squared = 10.447, df = 8, p-value = 0.235
We next ran a Hosmer-Lemeshow test to check for ‘goodness of fit’, and hopefully come to a decisive conclusion about which model we prefer.
We need to check if predictive probabilities align with our model outcomes.
Unlike most tests, a small p-value in the Hosmer–Lemeshow test is undesirable because it indicates lack of fit. We prefer a larger p-value that suggests the model’s predictions align with the observed data.
Reduced model: p-value = 0.0013
Final model: p-value = 0.235
The reduced model rejects the null hypothesis and it is a poor fit. The final model fails to reject the null hypothesis meaning it is not significant, and that it’s output aligns with the observed results.
Our Conclusion:
The final model is the better model for predictive power.
\[ \text{Survived} \sim \text{Pclass} \times \text{Sex} + \text{poly}(\text{Age}, 2) + \text{FamilySize} \]
library(arm)
par(
bg = "#6C777F",
col.axis = "white",
col.lab = "white",
col.main = "white",
fg = "white"
)
binnedplot(
fitted(model_final),
residuals(model_final, type = "response"),
main = "Binned Residual Plot: Final Model",
xlab = "Expected Survival Probability",
ylab = "Average Residuals",
col.int = "#E3B23C", # gold bands
col.points = "#FFFFFF", # FORCE white points
pch = 16, # solid circle
cex = 1.3 # increase size so they stand out
)All points are within the boundaries: Every single dot (representing bins of passengers) falls inside the grey 95% confidence lines. This is the “gold standard” for a logistic regression diagnostic. It means there are no groups where the model is consistently failing.
No “Shape” or Trend: The dots are scattered randomly above and below the zero line. There is no “U-shape” or diagonal slant, which proves that poly(Age, 2) and the interaction terms successfully captured the non-linearities in the data.
Performance across the spectrum: Whether the predicted probability of survival is low (0.1), medium (0.5), or high (0.9), the model remains accurate. It isn’t “guessing” better at one end than the other.
The final logistic regression model indicates that survival on the Titanic is significantly influenced by passenger class, sex, age (modeled nonlinearly), and family size, with an important interaction between sex and passenger class. This model represents a substantial improvement over the null, with residual deviance decreasing from 1186 to 755, and achieves the lowest AIC (773.16) among all models considered, indicating the best overall fit.
A key result of this model is the interaction between sex and passenger class, which shows that these variables cannot be interpreted independently. Using female passengers in 1st class as the baseline, the results show that males in 1st class have dramatically lower odds of survival, and that being in 3rd class significantly reduces survival among females. However, the positive and significant interaction term for 3rd class males suggests that the negative effect of being male is less severe in 3rd class than in 1st class. In practical terms, this indicates that while a gender gap in survival exists across all classes, it is most pronounced in 1st class and somewhat reduced in 3rd class, likely reflecting differences in how evacuation protocols such as “women and children first” were enforced. The interaction for 2nd class is not statistically significant, suggesting no meaningful difference from 1st class in this respect.
The effect of age is also an important component of the model and is clearly nonlinear, as both polynomial terms are statistically significant. Rather than a simple linear decline, the relationship between age and survival follows a curved pattern, where children have higher survival probabilities, survival decreases through adulthood, and then levels off at older ages. This confirms that a linear specification would fail to capture the true structure of the relationship.
Family size also shows a significant negative effect, indicating that larger family groups are associated with lower survival probabilities, though the magnitude of this effect is smaller compared to class and sex.
This analysis demonstrates that while predictive accuracy (AUC ≈ 0.85) can be achieved with a parsimonious linear model, such a model fails fundamental goodness-of-fit requirements (Hosmer-Lemeshow p < 0.001). By incorporating a quadratic term for Age and a Pclass/Sex interaction, we achieved a model that is not only accurate but statistically well-calibrated (p = 0.235).
Our findings confirm that Titanic survival was not merely a linear function of demographics. The survival advantage of being female was significantly moderated by socio-economic status, and the effect of age was non-linear, likely reflecting the “women and children first” protocol. Ultimately, the more complex model is preferred as it captures the nuanced, structured relationships inherent in this historic event.
library(effects)
library(lattice)
eff <- allEffects(model_final)
trellis.par.set(
background = list(col = "#0B1D2A"),
panel.background = list(col = "#6C777F"),
axis.line = list(col = "white"),
axis.text = list(col = "white"),
par.xlab.text = list(col = "white"),
par.ylab.text = list(col = "white"),
par.main.text = list(col = "white"),
strip.background = list(col = "#6C777F"),
strip.border = list(col = "#1C2E3A")
)
plot(
eff,
main = "Estimated Effects from Final Model",
colors = "#E3B23C",
band.colors = "#E3B23C",
rug = FALSE
)| Model | Complexity | AIC | HL Test (p-value) | Verdict |
|---|---|---|---|---|
| Full Model | High | 806.4 | - | Over fit/Noisey |
| Reduced Model | Low (Best AIC) | 804.83 | 0.001 (fail) | Poor fit |
| Final Model | Medium | 773.16 | 0.235 (pass) | Winner |