Introduction
The RMS Titanic was a British luxury passenger liner that sank during
its maiden voyage en route to New York City from Southampton, England,
killing about 1,500 passengers and ship personnel. It is one of the most
famous tragedies in modern history. It has inspired numerous stories,
several films, and a musical, and has been the subject of much
scholarship and scientific speculation.
This report will continue the scientific exploration of the Titanic.
The estimated 2,224 people aboard each have their own story of escape or
death in the wreck. Studying the underlying information about the
passengers, however, may provide insight into why certain people
survived.
Data Source
The data set used in this report is sourced from Kaggle.com. It was
originally created for a machine learning competition hosted by Kaggle.
The link for this data set is https://www.kaggle.com/c/titanic. The data was originally
split into separate training and test sets for the purposes of the
competition. This report uses the training set because it has more
observations and the complete set of variables.
The data was uploaded to a GitHub repository for perpetual online
access. This report sources the data from that repository (URL: https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv).
This data contains 891 observations on 12 variables. Observations with
missing values (NA) in any variable of interest are removed. The variables are:
- PassengerId: A unique identifier
- Survived: The survival status of the passenger (0 = No, 1 = Yes)
- Pclass: The passenger’s ticket class (1st, 2nd, or 3rd)
- Name: The passenger’s name
- Sex: The passenger’s sex
- Age: The passenger’s age in years
- SibSp: Number of siblings/spouses aboard
- Parch: Number of parents/children aboard
- Ticket: Passenger’s ticket number
- Fare: Passenger’s boarding fare
- Cabin: The passenger’s cabin
- Embarked: Port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
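library(psych)   # pairs.panels()
library(pander)  # pander()
library(car)     # vif()
library(MASS)    # stepAIC()
url <- "https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv"
titanic <- read.csv(url)
data <- na.omit(dplyr::select(titanic, Survived, Pclass, Age, Sex, SibSp, Parch, Fare))  # keep variables of interest, then drop rows with NA
data$Pclass <- as.factor(data$Pclass)  # treat ticket class as categorical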
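Because na.omit() drops every row that has a missing value in any column, the order of the two steps above matters: selecting the variables of interest first keeps rows whose only gaps are in unused columns. The sketch below illustrates this on a small made-up data frame (the values are invented and are not taken from the Titanic data).

```r
# Toy data frame with invented values, standing in for the real data
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1),
  Age      = c(22, NA, 26, 35, NA),
  Cabin    = c(NA, "C85", NA, "C123", NA),
  stringsAsFactors = FALSE
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Dropping incomplete rows across all columns keeps very little
clean_all <- na.omit(toy)

# Selecting the variables of interest first keeps more rows
clean_sel <- na.omit(toy[, c("Survived", "Age")])

print(c(all_columns = nrow(clean_all), selected_columns = nrow(clean_sel)))
```

Running colSums(is.na(titanic)) on the real data before filtering would show, in the same way, which variables are responsible for the removed observations.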
Research Question
This report investigates the relationship between a passenger’s age,
ticket class, and sex with their likelihood of survival on the Titanic.
To address this, a multiple logistic regression model is applied, with
survival (survived vs. not survived) as the binary outcome variable and
passenger age, sex, and ticket class as the predictors. This approach
allows us to evaluate whether these factors significantly affected
survival odds, estimate the direction and magnitude of the effects, and
assess the model’s ability to explain variation in survival
outcomes.
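As a sketch of the model form, the logistic regression can be written as below. The β symbols are notation introduced here for illustration, and the coding assumes R's default treatment contrasts, with 1st class and female as the reference levels.

```latex
\log\left(\frac{p}{1-p}\right)
  = \beta_0
  + \beta_1\,\text{Pclass2}
  + \beta_2\,\text{Pclass3}
  + \beta_3\,\text{Age}
  + \beta_4\,\text{Sexmale},
\qquad p = P(\text{Survived} = 1)
```

Here Pclass2, Pclass3, and Sexmale are 0/1 indicator variables, and exp(β_j) is the multiplicative change in the odds of survival for a one-unit increase in the corresponding predictor, holding the others fixed.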
Exploratory Data Analysis
All observations with missing values in any of the selected variables
were removed, leaving 714 observations. Their distributions and pairwise
correlations are shown in the figure below.
pairs.panels(data,            # all 7 selected variables (data has no 9th column to drop)
             method = "pearson",      # correlation method
             hist.col = "royalblue",
             density = TRUE,          # show density plots
             ellipses = TRUE          # show correlation ellipses
)

We can see that there are more deaths than survivals, more 3rd class
passengers than 1st or 2nd class, and more men than women. Age is slightly
right-skewed, but no transformation will be applied. SibSp, Parch, and
Fare are highly skewed. PassengerId, Name, Ticket, Cabin, and Embarked
were excluded as irrelevant to the research question.
Multiple Logistic Regression
The variables Sex, Age, and Pclass will be used for logistic
regression. Fare was not selected because ticket price is already
reflected in Pclass. Parch and SibSp were not selected because each
combines two different relationships. For example, a passenger could have
at most 2 parents aboard, but any realistic number of children.
Similarly, a passenger can have at most one spouse aboard, but any
number of siblings. A given value of either variable therefore has
several possible interpretations, which would obscure its effect in a
model.
Full Model
The first model contains all three predictors.
full.model <- glm(Survived ~ Pclass + Age + Sex,
                  family = binomial(link = "logit"),
                  data = data)
pander(summary(full.model)$coef,
       caption = "Summary of inferential statistics of the full model")
Summary of inferential statistics of the full model
| Term        | Estimate | Std. Error | z value | p-value   |
|-------------|----------|------------|---------|-----------|
| (Intercept) | 3.777    | 0.4011     | 9.416   | 4.682e-21 |
| Pclass2     | -1.31    | 0.2781     | -4.71   | 2.472e-06 |
| Pclass3     | -2.581   | 0.2814     | -9.169  | 4.761e-20 |
| Age         | -0.03699 | 0.007656   | -4.831  | 1.359e-06 |
| Sexmale     | -2.523   | 0.2074     | -12.16  | 4.811e-34 |
All coefficients are statistically significant (every p-value is below
0.001). Before creating a reduced model, we check for potential
multicollinearity among the predictor variables. The following table
shows the generalized variance inflation factor (GVIF) for each predictor.
pander(vif(full.model))
| Predictor | GVIF  | Df | GVIF^(1/(2*Df)) |
|-----------|-------|----|-----------------|
| Pclass    | 1.417 | 2  | 1.091           |
| Age       | 1.333 | 1  | 1.155           |
| Sex       | 1.073 | 1  | 1.036           |
All GVIF values are under 1.5, which suggests no multicollinearity
issues among these predictors.
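For a predictor with more than one degree of freedom, such as the three-level Pclass factor, the adjusted value GVIF^(1/(2*Df)) is the quantity usually compared across predictors. As a small check, the sketch below recomputes it from the GVIF and Df values reported above; the vector names are illustrative only.

```r
# GVIF and Df values copied from the vif() output above
gvif <- c(Pclass = 1.417, Age = 1.333, Sex = 1.073)
df   <- c(Pclass = 2,     Age = 1,     Sex = 1)

# Adjusted GVIF, comparable across predictors with different Df
adj_gvif <- gvif^(1 / (2 * df))
print(round(adj_gvif, 3))

# Squaring the adjusted value puts it back on the ordinary VIF scale
print(round(adj_gvif^2, 3))
```

For Age and Sex, which each have one degree of freedom, the adjusted value is simply the square root of the GVIF.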
Reduced Model
The reduced model uses Sex as its only predictor, selected based on its
correlation with the response variable during EDA. The Sex coefficient
remains similar to its value in the full model and has the same
direction (negative for males).
reduced.model <- glm(Survived ~ Sex,
                     family = binomial(link = "logit"),
                     data = data)
pander(summary(reduced.model)$coef)
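| Term        | Estimate | Std. Error | z value | p-value   |
|-------------|----------|------------|---------|-----------|
| (Intercept) | 1.124    | 0.1439     | 7.814   | 5.524e-15 |
| Sexmale     | -2.478   | 0.185      | -13.39  | 6.701e-41 |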
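With a single binary predictor, the fitted coefficients translate directly into a survival probability for each sex. A minimal sketch, using the rounded coefficients reported above and illustrative variable names:

```r
# Coefficients copied from the reduced model output above
b0     <- 1.124   # intercept: log-odds of survival for females (reference level)
b_male <- -2.478  # change in log-odds for males

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
p_female <- plogis(b0)
p_male   <- plogis(b0 + b_male)

print(round(c(female = p_female, male = p_male), 3))
```

Under these coefficients the predicted survival probability is about 0.75 for women and about 0.21 for men.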
Automatic Model Selection
The reduced model may use insufficient information, resulting in
underfitting (missing important effects), while the full model can lead
to overfitting (modeling noise and failing to generalize). To address
this, we use automatic variable selection to find a simpler, more
interpretable, and better-performing model by intelligently choosing a
subset of predictors that includes the important variables from the
reduced model while avoiding the noise of the full model.
final.model.forward <- stepAIC(reduced.model,
                               scope = list(lower = formula(reduced.model),
                                            upper = formula(full.model)),
                               direction = "forward",  # forward selection
                               trace = 0               # do not show the details
)
pander(summary(final.model.forward)$coef,
       caption = "Summary of inferential statistics of the final model")
Summary of inferential statistics of the final model
| Term        | Estimate | Std. Error | z value | p-value   |
|-------------|----------|------------|---------|-----------|
| (Intercept) | 3.777    | 0.4011     | 9.416   | 4.682e-21 |
| Sexmale     | -2.523   | 0.2074     | -12.16  | 4.811e-34 |
| Pclass2     | -1.31    | 0.2781     | -4.71   | 2.472e-06 |
| Pclass3     | -2.581   | 0.2814     | -9.169  | 4.761e-20 |
| Age         | -0.03699 | 0.007656   | -4.831  | 1.359e-06 |
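To show how forward selection with a lower and upper scope behaves, the sketch below repeats the same stepAIC() call on simulated data. The data frame sim, its columns x1, x2, and noise, and the chosen coefficients are all hypothetical and exist only for this illustration.

```r
library(MASS)  # provides stepAIC()

set.seed(321)
n <- 500
sim <- data.frame(
  x1    = rnorm(n),
  x2    = rnorm(n),
  noise = rnorm(n)  # unrelated to the outcome by construction
)
# Simulated binary outcome that depends on x1 and x2 only
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * sim$x1 + 1.5 * sim$x2))

small <- glm(y ~ x1, family = binomial, data = sim)
large <- glm(y ~ x1 + x2 + noise, family = binomial, data = sim)

# Forward selection: start from the small model, always keep x1,
# and add a candidate term only when doing so lowers the AIC
selected <- stepAIC(small,
                    scope = list(lower = formula(small), upper = formula(large)),
                    direction = "forward",
                    trace = 0)

print(formula(selected))
print(AIC(small, large, selected))
```

Setting trace = 1 instead prints the AIC of each candidate term at every step, which makes it easier to see why a term was or was not added.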
Final Model
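Forward selection added both Pclass and Age to the reduced model, so the
final model has the same terms and estimates as the full model. Its
coefficients are exponentiated into odds ratios for interpretation. The
exponentiated intercept, 43.69, is the baseline odds of survival (not an
odds ratio) for the reference passenger: a 1st class female at age 0.
These odds are multiplied by about 0.08 for a male passenger, 0.27 for a
2nd class ticket, and 0.076 for a 3rd class ticket. Each additional year
of age multiplies the odds by about 0.964, a reduction of roughly 3.6%.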
# Odds ratio
model.coef.stats = summary(final.model.forward)$coef
odds.ratio = exp(coef(final.model.forward))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)
pander(out.stats,caption = "Summary Stats with Odds Ratios")
Summary Stats with Odds Ratios
| Term        | Estimate | Std. Error | z value | p-value   | odds.ratio |
|-------------|----------|------------|---------|-----------|------------|
| (Intercept) | 3.777    | 0.4011     | 9.416   | 4.682e-21 | 43.69      |
| Sexmale     | -2.523   | 0.2074     | -12.16  | 4.811e-34 | 0.08024    |
| Pclass2     | -1.31    | 0.2781     | -4.71   | 2.472e-06 | 0.2699     |
| Pclass3     | -2.581   | 0.2814     | -9.169  | 4.761e-20 | 0.07573    |
| Age         | -0.03699 | 0.007656   | -4.831  | 1.359e-06 | 0.9637     |
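Odds ratios are often easier to read as percent changes, and the per-year age effect compounds multiplicatively over larger age gaps. The sketch below works from the rounded coefficients reported above; the object names are illustrative.

```r
# Coefficients copied from the final model output above (rounded)
coefs <- c(Sexmale = -2.523, Pclass2 = -1.31, Pclass3 = -2.581, Age = -0.03699)

odds_ratio <- exp(coefs)

# Percent change in survival odds for a one-unit increase in each predictor
pct_change <- 100 * (odds_ratio - 1)
print(round(pct_change, 1))

# Age compounds multiplicatively: odds ratio for a 10-year age difference
or_10yr <- exp(10 * coefs[["Age"]])
print(round(or_10yr, 3))
```

Because the coefficients here are rounded, the recomputed odds ratios can differ from the table in the last digit.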
Analysis
The final model allows us to identify the demographics most strongly
associated with survival. Among the passengers analyzed, younger 1st
class women had the best odds of survival and older 3rd class men had
the worst. This offers insight into which factors mattered most in
determining whose survival was prioritized.
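To make the contrast concrete, the sketch below converts the final model's rounded coefficients into predicted survival probabilities for two hypothetical passenger profiles. The helper survival_prob() and the chosen ages are assumptions made for illustration and are not part of the original analysis.

```r
# Coefficients copied from the final model output (rounded)
b <- c(intercept = 3.777, male = -2.523, class2 = -1.31, class3 = -2.581, age = -0.03699)

# Hypothetical helper: predicted survival probability for a passenger profile
survival_prob <- function(age, male, pclass) {
  eta <- b[["intercept"]] +
    b[["male"]]   * male +
    b[["class2"]] * (pclass == 2) +
    b[["class3"]] * (pclass == 3) +
    b[["age"]]    * age
  plogis(eta)  # inverse logit turns log-odds into a probability
}

# Two illustrative profiles with ages chosen arbitrarily
profiles <- data.frame(
  label  = c("20-year-old 1st class woman", "50-year-old 3rd class man"),
  age    = c(20, 50),
  male   = c(0, 1),
  pclass = c(1, 3)
)
profiles$prob <- with(profiles, survival_prob(age, male, pclass))
print(profiles)
```

With the fitted model object available, predict(final.model.forward, newdata = ..., type = "response") would give the same probabilities from the unrounded coefficients.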
A popular phrase associated with the Titanic’s sinking is “women and
children first”. At a glance, the data supports this, as both women and
younger passengers had better odds of survival. Additionally, higher
class passengers had higher odds of survival. One can surmise
that this narrative is less popular than the relatively heroic
“women and children first” rhetoric. In many situations wealth helps
people survive, and it seems the wreck of the Titanic was no
exception.
This model will be tweaked for prediction in a later report.
