Introduction
The RMS Titanic was a British luxury passenger liner that sank during
its maiden voyage en route to New York City from Southampton, England,
killing about 1,500 passengers and ship personnel. It is one of the most
famous tragedies in modern history. It has inspired numerous stories,
several films, and a musical, and has been the subject of much
scholarship and scientific speculation.
This report continues the scientific exploration of the Titanic.
Each of the estimated 2,224 people aboard has their own story of escape or
death in the wreck. It is believed, however, that studying the
underlying information about these passengers can provide insight into why
certain people survived.
[Figure: RMS Titanic]
Data Source
The data set used in this report is sourced from Kaggle.com. It was
originally created for a machine learning competition hosted by Kaggle.
The link for this data set is https://www.kaggle.com/c/titanic. The data is originally
split into separate training and testing data sets for the purposes of the
competition. The training data is used in this report because it has more observations
and the complete set of variables.
The data was uploaded to a GitHub repository for perpetual online
access. This report sources the data from this repository (URL: https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv).
This data contains 891 observations on 12 variables. Any observations
with missing values (NA) in the variables of interest will be removed. The variables are:
- PassengerId: A unique identifier
- Survived: The survival status of the passenger (0 = No, 1 = Yes)
- Pclass: The passenger’s ticket class (1st, 2nd, or 3rd)
- Name: The passenger’s name
- Sex: The passenger’s sex
- Age: The passenger’s age in years
- SibSp: Number of siblings/spouses aboard
- Parch: Number of parents/children aboard
- Ticket: The passenger’s ticket number
- Fare: The passenger’s boarding fare in USD
- Cabin: The passenger’s cabin
- Embarked: Port at which the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
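library(dplyr)   # provides select()
library(pander)  # provides pander() for formatted tables
url <- "https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv"
titanic <- read.csv(url)
# Keep only the variables of interest, then drop rows with missing values
data <- na.omit(select(titanic, Survived, Age, Sex, Pclass))
data$Pclass <- as.factor(data$Pclass)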
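The effect of the missing-value filter can be illustrated with a minimal sketch on a small hypothetical data frame. The values in `toy` are invented for illustration and are not taken from the Titanic file. The sketch uses only base R and shows why the variables of interest are selected before `na.omit()` is applied.

```r
# Hypothetical toy data (not from the Titanic file): Age and Cabin have gaps
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Age      = c(22, NA, 35, 54),
  Cabin    = c(NA, "C85", NA, "E46"),
  stringsAsFactors = FALSE
)

# Selecting the variables of interest first means a missing Cabin
# does not cost a row; only the missing Age does
kept <- na.omit(toy[, c("Survived", "Age")])
nrow(kept)

# Filtering on every column instead discards more rows
nrow(na.omit(toy))
```

Filtering only on the modelled variables keeps as many observations as possible for the regression.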
Research Question
In many retellings of the Titanic’s story, the phrase “women and children
first” comes up in accounts of loading the lifeboats for escape. This
suggests that age could factor greatly into one’s survival of the
catastrophe. This report investigates the relationship between a
passenger’s age and their likelihood of survival on the Titanic. To
address this, a simple logistic regression model is applied, with
survival (survived vs. not survived) as the binary outcome variable and
passenger age as the predictor. This approach allows us to evaluate
whether age significantly affected survival odds, estimate the direction
and magnitude of the effect, and assess the model’s ability to explain
variation in survival outcomes.
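Written out, the simple logistic regression described above takes the following form. Here \(p\) denotes the probability of survival, \(\beta_1\) is the coefficient for age, and \(\beta_0\) (the intercept) is notation assumed for illustration:

\[
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{Age}
\]

Under this form, \(e^{\beta_1}\) is the multiplicative change in the odds of survival for each additional year of age.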
Exploratory Data Analysis
Simple logistic regression examines only one predictor, so
a simple check of that variable’s distribution provides insight into
any issues of skew.
Analysis
The histogram suggests a slight right skew, which is to be expected
of age distributions in most situations. No transformation will be
applied.
The frequency of survival does not indicate a class imbalance that would
bias the model.
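A minimal sketch of these two checks follows, assuming a small hypothetical sample. The `age` and `survived` vectors are invented for illustration and are not drawn from the Titanic data; with the real data, the same calls would be applied to `data$Age` and `data$Survived`.

```r
# Hypothetical values for illustration only
age      <- c(2, 18, 21, 22, 24, 26, 28, 30, 35, 40, 54, 71)
survived <- c(1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0)

# Right skew shows up as a mean pulled above the median
mean(age) > median(age)

# Bin counts behind a histogram (plot = FALSE skips the drawing step)
age_hist <- hist(age, breaks = seq(0, 80, by = 10), plot = FALSE)
age_hist$counts

# Share of each outcome; a very lopsided split would flag class imbalance
class_share <- prop.table(table(survived))
class_share
```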
Logistic Regression Model
Standard Model
Because non-survival is coded as 0, R treats it as the reference level,
and survival (coded as 1) is the event level.
# Build GLM
s.logit = glm(Survived ~ Age,
              family = binomial(link = "logit"),  # binomial family; logit(p) = log(p / (1 - p))
              data = data)
result = summary(s.logit)
pander(result)
|             | Estimate | Std. Error | z value | p-value |
|-------------|----------|------------|---------|---------|
| (Intercept) | -0.05672 | 0.1736     | -0.3268 | 0.7438  |
| Age         | -0.01096 | 0.00533    | -2.057  | 0.03969 |

(Dispersion parameter for binomial family taken to be 1)

| Null deviance:     | 964.5 on 713 degrees of freedom |
| Residual deviance: | 960.2 on 712 degrees of freedom |
pander(confint(s.logit))
|             | 2.5 %    | 97.5 %     |
|-------------|----------|------------|
| (Intercept) | -0.3971  | 0.2841     |
| Age         | -0.02151 | -0.0005832 |
Age shows a negative relationship with a passenger’s survival, with
coefficient \(\beta_1\) = -0.01096 and p = 0.03969.
This is supported by the 95% confidence interval for \(\beta_1\) of
[-0.02151, -0.0005832], which excludes the null value of zero.
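As a rough cross-check, the reported z value and p-value can be reproduced from the rounded estimate and standard error for Age. The sketch below also computes a Wald interval for comparison. This is an added illustration: `confint()` on a `glm` object reports profile-likelihood intervals, so the two intervals need not match exactly.

```r
# Rounded values copied from the coefficient table for Age
est <- -0.01096
se  <- 0.00533

z_value <- est / se                   # Wald z statistic
p_value <- 2 * pnorm(-abs(z_value))   # two-sided p-value

# Wald 95% interval: estimate +/- about 1.96 standard errors
wald_ci <- est + c(-1, 1) * qnorm(0.975) * se

round(c(z = z_value, p = p_value), 4)
round(wald_ci, 5)
```

Both intervals lie entirely below zero, so the conclusion about the sign of the age effect does not depend on which interval is used.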
Odds Ratio Model
For more practical use, the coefficient has been transformed into an
odds ratio.
# Odds ratio
model.coef.stats = summary(s.logit)$coef
odds.ratio = exp(coef(s.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)
pander(out.stats,caption = "Simple Logistic Regression Model with Odds Ratios")
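Simple Logistic Regression Model with Odds Ratios

|             | Estimate | Std. Error | z value | p-value | odds.ratio |
|-------------|----------|------------|---------|---------|------------|
| (Intercept) | -0.05672 | 0.1736     | -0.3268 | 0.7438  | 0.9449     |
| Age         | -0.01096 | 0.00533    | -2.057  | 0.03969 | 0.9891     |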
The odds ratio for age is 0.9891. This indicates that each additional
year of age reduces one’s odds of survival by about 1.1%. This may not
make a great difference for people a few years apart, but it certainly
indicates a difference between a child and an elder.
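Because the effect is multiplicative on the odds scale, an age gap of \(k\) years corresponds to an odds ratio of \(e^{k\beta_1}\). As a worked illustration using the rounded coefficient, with a 50-year gap chosen arbitrarily:

\[
e^{50 \times (-0.01096)} = e^{-0.548} \approx 0.578
\]

Under this model, the odds of survival for a passenger 50 years older are roughly 58% of those for the younger passenger.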
Analysis
While a statistically significant relationship was found, it is
apparent that age alone is not a strong predictor of one’s odds of
surviving this catastrophe. The slight drop in deviance (from 964.5 to 960.2) suggests that
age can be a useful predictor; however, it is likely not the only
predictor of a passenger’s fate. Further research using
multiple logistic regression will likely find that multiple factors,
including age, are important to predicting one’s survival.
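The size of the deviance drop can be made concrete with a short sketch based on the rounded deviances from the model summary. The likelihood-ratio test and McFadden's pseudo R-squared are added here as illustrations and are not part of the original analysis.

```r
# Rounded deviances copied from the model summary
null_dev  <- 964.5   # intercept-only model, 713 df
resid_dev <- 960.2   # model with Age, 712 df

# Likelihood-ratio statistic: drop in deviance from adding Age (1 df)
lr_stat <- null_dev - resid_dev
lr_p    <- pchisq(lr_stat, df = 713 - 712, lower.tail = FALSE)

# McFadden's pseudo R-squared; for 0/1 outcomes the deviance equals
# -2 * log-likelihood, so the ratio of deviances can be used directly
pseudo_r2 <- 1 - resid_dev / null_dev

round(c(lr_stat = lr_stat, lr_p = lr_p, pseudo_r2 = pseudo_r2), 4)
```

A pseudo R-squared below 0.01 is consistent with the conclusion that age alone explains very little of the variation in survival.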
Application
Using the logistic regression model, we can determine the odds of
survival for a few hypothetical passengers at different ages. This helps
conceptualize the differences between these ages, assuming age is the only
cause of difference in survival odds.
# Create a new dataframe with some example ages
new_passengers <- data.frame(Age = c(5, 20, 30, 50, 70))
# Predict probabilities of survival
pred_probs <- predict(s.logit, newdata = new_passengers, type = "response")
# Convert probabilities to odds
pred_odds <- pred_probs / (1 - pred_probs)
# Combine results into a summary table
predictions <- data.frame(
  Age = new_passengers$Age,
  Probability_of_Survival = round(pred_probs, 3),
  Odds_of_Survival = round(pred_odds, 3)
)
pander(predictions)
| Age | Probability_of_Survival | Odds_of_Survival |
|-----|-------------------------|------------------|
| 5   | 0.472                   | 0.894            |
| 20  | 0.431                   | 0.759            |
| 30  | 0.405                   | 0.68             |
| 50  | 0.353                   | 0.546            |
| 70  | 0.305                   | 0.439            |
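The predicted values can also be reproduced by hand from the rounded coefficients, which makes the link between the linear predictor, the probability, and the odds explicit. This is a sketch using the rounded estimates from the model summary, so results may differ slightly from `predict()` in the last decimal place.

```r
# Rounded coefficients copied from the simple model summary
b0 <- -0.05672   # intercept
b1 <- -0.01096   # Age

age <- c(5, 20, 30, 50, 70)

log_odds <- b0 + b1 * age      # linear predictor on the logit scale
prob     <- plogis(log_odds)   # inverse logit: 1 / (1 + exp(-log_odds))
odds     <- exp(log_odds)      # equivalent to prob / (1 - prob)

data.frame(Age = age,
           Probability_of_Survival = round(prob, 3),
           Odds_of_Survival = round(odds, 3))
```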
Discussion
Age was found to have an impact on a passenger’s odds of survival.
Each additional year of age decreased the odds of survival by about
1.1%. It is important to note that survival of this
tragedy depended on many complex factors. It is expected, however,
that the negative relationship between age and survival odds will
persist when other factors are considered. It may also be worth noting that the relationship may not
be entirely linear, and that survival factors may change at certain
ages. For example, the youngest passenger in the data was Master Assad
Alexander Thomas at about 5 months old (recorded age 0.42 years). This passenger survived, but
their survival was certainly dependent on those around them, and not
their own actions. Consider also the ages at which a child may be seen as
“independent” yet is certainly not capable of ensuring their own survival (such
as 1-year-old Miss Maria (“Mary”) Nakid).
Multiple Logistic Regression
Standard Model
As before, survival is coded as the event level. As a qualitative
factor, passenger class was coded with first class as the baseline.
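Written out, the multiple logistic regression fitted below has the following form. The symbols \(\beta_0, \dots, \beta_4\) and the indicator names are notation assumed here for illustration. Each indicator equals 1 when the passenger is male, in second class, or in third class respectively, and 0 otherwise, so the intercept corresponds to a female first-class passenger at age 0:

\[
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Male} + \beta_3\,\text{Class2} + \beta_4\,\text{Class3}
\]

Each coefficient is therefore read as a change in the log-odds of survival relative to the baseline, holding the other predictors fixed.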
# Build GLM
m.logit = glm(Survived ~ Age + Sex + Pclass,
              family = binomial(link = "logit"),  # binomial family; logit(p) = log(p / (1 - p))
              data = data)
mresult = summary(m.logit)
pander(mresult)
|             | Estimate | Std. Error | z value | p-value   |
|-------------|----------|------------|---------|-----------|
| (Intercept) | 3.777    | 0.4011     | 9.416   | 4.682e-21 |
| Age         | -0.03699 | 0.007656   | -4.831  | 1.359e-06 |
| Sexmale     | -2.523   | 0.2074     | -12.16  | 4.811e-34 |
| Pclass2     | -1.31    | 0.2781     | -4.71   | 2.472e-06 |
| Pclass3     | -2.581   | 0.2814     | -9.169  | 4.761e-20 |

(Dispersion parameter for binomial family taken to be 1)

| Null deviance:     | 964.5 on 713 degrees of freedom |
| Residual deviance: | 647.3 on 709 degrees of freedom |
pander(confint(m.logit))
|             | 2.5 %    | 97.5 %   |
|-------------|----------|----------|
| (Intercept) | 3.015    | 4.589    |
| Age         | -0.05229 | -0.02223 |
| Sexmale     | -2.939   | -2.125   |
| Pclass2     | -1.863   | -0.7718  |
| Pclass3     | -3.147   | -2.042   |
Odds-Ratio Model
Odds-ratio conversions allow for easier interpretation.
# Odds ratio
m.model.coef.stats = summary(m.logit)$coef
m.odds.ratio = exp(coef(m.logit))
m.out.stats = cbind(m.model.coef.stats, odds.ratio = m.odds.ratio)
pander(m.out.stats,caption = "Multiple Logistic Regression Model with Odds Ratios")
Multiple Logistic Regression Model with Odds Ratios

|             | Estimate | Std. Error | z value | p-value   | odds.ratio |
|-------------|----------|------------|---------|-----------|------------|
| (Intercept) | 3.777    | 0.4011     | 9.416   | 4.682e-21 | 43.69      |
| Age         | -0.03699 | 0.007656   | -4.831  | 1.359e-06 | 0.9637     |
| Sexmale     | -2.523   | 0.2074     | -12.16  | 4.811e-34 | 0.08024    |
| Pclass2     | -1.31    | 0.2781     | -4.71   | 2.472e-06 | 0.2699     |
| Pclass3     | -2.581   | 0.2814     | -9.169  | 4.761e-20 | 0.07573    |
Discussion
All predictors were found to be significant. Holding the other
predictors constant, each additional year of age
reduces a passenger’s odds of survival by about 4%. Being male reduces the
odds of survival by about 92%. Compared to first-class passengers, second-class
passengers had about 73% lower odds of survival, and third-class
passengers had about 92% lower odds of survival.
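Percentages of this kind follow from the odds ratios as \((e^{\beta} - 1) \times 100\). The sketch below reproduces them from the rounded coefficients and then, as an added illustration, combines the coefficients for two hypothetical 30-year-old passengers. These two profiles are invented for the example and do not refer to actual passengers.

```r
# Rounded coefficients copied from the multiple model summary
coefs <- c(Intercept = 3.777, Age = -0.03699, Sexmale = -2.523,
           Pclass2 = -1.31, Pclass3 = -2.581)

# Percent change in the odds of survival per unit increase in each predictor
pct_change <- (exp(coefs[-1]) - 1) * 100
round(pct_change, 1)

# Hypothetical profiles: 30-year-old woman in 1st class
# versus 30-year-old man in 3rd class
eta_f1 <- coefs[["Intercept"]] + coefs[["Age"]] * 30
eta_m3 <- eta_f1 + coefs[["Sexmale"]] + coefs[["Pclass3"]]

# Convert the log-odds to predicted probabilities of survival
round(plogis(c(female_1st = eta_f1, male_3rd = eta_m3)), 3)
```

The contrast between the two profiles shows how the individual odds ratios combine: the effects add on the log-odds scale and multiply on the odds scale.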
