1 Introduction

The RMS Titanic was a British luxury passenger liner that sank during its maiden voyage en route to New York City from Southampton, England, killing about 1,500 passengers and ship personnel. It is one of the most famous tragedies in modern history; it has inspired numerous stories, several films, and a musical, and has been the subject of much scholarship and scientific speculation.

This report continues the scientific exploration of the Titanic. Each of the estimated 2,224 passengers and crew has their own story of escape or death in the wreck. Studying the underlying information about these passengers, however, can provide insight into why certain people survived.

Figure: RMS Titanic

2 Data Source

The data set used in this report is sourced from Kaggle.com, where it was originally created for a machine learning competition. The link for this data set is https://www.kaggle.com/c/titanic. For the purposes of the competition, the data is split into separate training and testing sets. This report uses the training set because it has more observations and the complete set of variables.

The data was uploaded to a GitHub repository for persistent online access, and this report reads the data from that repository (URL: https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv).

The data contains 891 observations on 12 variables. Observations with missing values (NA) in the variables of interest are removed. The variables are:

  1. PassengerId: A unique identifier
  2. Survived: The survival status of the passenger (0 = No, 1 = Yes)
  3. Pclass: The passenger’s ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  4. Name: The passenger’s name
  5. Sex: The passenger’s sex
  6. Age: The passenger’s age in years
  7. SibSp: Number of siblings/spouses aboard
  8. Parch: Number of parents/children aboard
  9. Ticket: The passenger’s ticket number
  10. Fare: The passenger’s boarding fare in USD
  11. Cabin: The passenger’s cabin
  12. Embarked: Port at which the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
url = "https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv"
titanic <- read.csv(url)
data <- na.omit(select(titanic, Survived, Age, Sex, Pclass))
data$Pclass <- as.factor(data$Pclass)

3 Research Question

In many retellings of the Titanic’s story, the phrase “women and children first” comes up in connection with loading the lifeboats. This suggests that age could have factored greatly into a passenger’s survival of the catastrophe. This report investigates the relationship between a passenger’s age and their likelihood of survival on the Titanic. To address this, a simple logistic regression model is applied, with survival (survived vs. not survived) as the binary outcome variable and passenger age as the predictor. This approach allows us to evaluate whether age significantly affected survival odds, estimate the direction and magnitude of the effect, and assess the model’s ability to explain variation in survival outcomes.

4 Exploratory Data Analysis

Simple logistic regression uses a single predictor, so a quick check of that variable’s distribution gives insight into any issues of skew.

4.1 Figures

# Histogram of the predictor (Age) with a kernel density overlay
hist(data$Age, probability = TRUE, 
     main = "Age Distribution", 
     xlab="Age", 
     col = "lightgray", 
     border="black")
lines(density(data$Age, adjust=2), col="maroon") 

pander(summary(data$Age))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.42 20.12 28 29.7 38 80
label <- factor(data$Survived,
                        levels = c(0, 1),
                        labels = c("Did Not Survive", "Survived"))
summary_table <- cbind(
  Frequency = table(label),
  Proportion = prop.table(table(label)))

pander(summary_table)
                  Frequency   Proportion
Did Not Survive   424         0.5938
Survived          290         0.4062

4.2 Analysis

The histogram suggests slight right-skew, which is to be expected with age distributions in most situations. No transformation will be applied.

With about 59% of passengers not surviving and 41% surviving, the outcome classes are not imbalanced enough to bias the model.

5 Logistic Regression Model

5.1 Standard Model

Because non-survival is coded as 0, R treats it as the reference level, and survival (coded as 1) is the event level being modeled.

#Build GLM
s.logit = glm(Survived ~ Age, 
              family = binomial(link = "logit"),  # binomial family with logit link: logit(p) = log(p / (1 - p))
              data = data)
result = summary(s.logit)
pander(result)
              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)   -0.05672   0.1736       -0.3268   0.7438
Age           -0.01096   0.00533      -2.057    0.03969

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 964.5 on 713 degrees of freedom
Residual deviance: 960.2 on 712 degrees of freedom
pander(confint(s.logit))
              2.5 %      97.5 %
(Intercept)   -0.3971    0.2841
Age           -0.02151   -0.0005832

Age shows a negative relationship with a passenger’s survival, with coefficient \(\beta_1\) = -0.01096 (p = 0.03969). This is supported by the 95% confidence interval for \(\beta_1\) of [-0.02151, -0.0005832], which excludes the null value of zero.

5.2 Odds Ratio Model

For more practical use, the coefficient has been transformed into an odds ratio.

# Odds ratio
model.coef.stats = summary(s.logit)$coef
odds.ratio = exp(coef(s.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
pander(out.stats,caption = "Simple Logistic Regression Model with Odds Ratios")
Simple Logistic Regression Model with Odds Ratios
              Estimate   Std. Error   z value   Pr(>|z|)   odds.ratio
(Intercept)   -0.05672   0.1736       -0.3268   0.7438     0.9449
Age           -0.01096   0.00533      -2.057    0.03969    0.9891

The odds ratio for age is 0.989. This indicates that each additional year of age reduces a passenger’s odds of survival by about 1.1%. This may not make a great difference for people a few years apart, but it compounds to a substantial difference between a child and an elderly passenger.

5.3 Analysis

While a statistically significant relationship was found, it is apparent that age alone is not a strong predictor of a passenger’s odds of surviving this catastrophe. The slight drop in deviance (from 964.5 to 960.2) suggests that age can be a useful predictor; however, it is likely not the only predictor of a passenger’s fate. Further research using multiple logistic regression will likely find that multiple factors, including age, are important in predicting survival.

5.4 Application

Using the logistic regression model, we can estimate the odds of survival for a few hypothetical passengers of different ages. This helps conceptualize the differences between these ages, assuming age is the only cause of difference in survival odds.

# Create a new dataframe with some example ages
new_passengers <- data.frame(Age = c(5, 20, 30, 50, 70))

# Predict probabilities of survival
pred_probs <- predict(s.logit, newdata = new_passengers, type = "response")

# Convert probabilities to odds
pred_odds <- pred_probs / (1 - pred_probs)

# Combine results into a summary table
predictions <- data.frame(
  Age = new_passengers$Age,
  Probability_of_Survival = round(pred_probs, 3),
  Odds_of_Survival = round(pred_odds, 3)
)

pander(predictions)

Age Probability_of_Survival Odds_of_Survival
5 0.472 0.894
20 0.431 0.759
30 0.405 0.68
50 0.353 0.546
70 0.305 0.439

5.5 Discussion

Age was found to have an impact on a passenger’s odds of survival. Each additional year of age decreased the odds of survival by about 1.1%. It is important to note that survival of this tragedy depended on many complex factors. It is expected, however, that age will persist with a negative relationship to survival odds. It may also be worth noting that the relationship may not be entirely linear, and that survival factors may change at certain ages. For example, the youngest passenger in the data was Master Assad Alexander Thomas at about 5 months old (0.42 years). This passenger survived, but their survival was certainly dependent on those around them, and not their own actions. The same applies to children old enough to be seen as somewhat “independent” but certainly not capable of securing their own survival, such as 1-year-old Miss Maria (“Mary”) Nakid.

6 Multiple Logistic Regression

6.1 Standard Model

As before, survival is coded as the event level. Passenger class, a qualitative factor, is coded with first class as the baseline, and female is the baseline level for sex.

#Build GLM
m.logit = glm(Survived ~ Age + Sex + Pclass, 
              family = binomial(link = "logit"),  # binomial family with logit link: logit(p) = log(p / (1 - p))
              data = data)
mresult = summary(m.logit)
pander(mresult)
              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)   3.777      0.4011       9.416     4.682e-21
Age           -0.03699   0.007656     -4.831    1.359e-06
Sexmale       -2.523     0.2074       -12.16    4.811e-34
Pclass2       -1.31      0.2781       -4.71     2.472e-06
Pclass3       -2.581     0.2814       -9.169    4.761e-20

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 964.5 on 713 degrees of freedom
Residual deviance: 647.3 on 709 degrees of freedom
pander(confint(m.logit))
              2.5 %      97.5 %
(Intercept)   3.015      4.589
Age           -0.05229   -0.02223
Sexmale       -2.939     -2.125
Pclass2       -1.863     -0.7718
Pclass3       -3.147     -2.042

6.2 Odds-Ratio Model

Odds-ratio conversions allow for easier interpretation.

# Odds ratio
m.model.coef.stats = summary(m.logit)$coef
m.odds.ratio = exp(coef(m.logit))
m.out.stats = cbind(m.model.coef.stats, odds.ratio = m.odds.ratio)                 
pander(m.out.stats,caption = "Multiple Logistic Regression Model with Odds Ratios")
Multiple Logistic Regression Model with Odds Ratios
              Estimate   Std. Error   z value   Pr(>|z|)    odds.ratio
(Intercept)   3.777      0.4011       9.416     4.682e-21   43.69
Age           -0.03699   0.007656     -4.831    1.359e-06   0.9637
Sexmale       -2.523     0.2074       -12.16    4.811e-34   0.08024
Pclass2       -1.31      0.2781       -4.71     2.472e-06   0.2699
Pclass3       -2.581     0.2814       -9.169    4.761e-20   0.07573

6.3 Discussion

All predictors were found to be significant. Holding the other predictors constant, each additional year of age reduced a passenger’s odds of survival by about 4%, and being male reduced the odds of survival by about 92%. Compared to first class passengers, 2nd class passengers had about 73% lower odds of survival, and 3rd class passengers had about 92% lower odds of survival.

---
title: "Logistic Regression Models: Titanic"
author: "Noah Brechbill"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 4
    fig_width: 6
    fig_height: 6
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
---

```{=html}
<style type="text/css">
h1.title {
  font-size: 20px;
  color: DarkRed;
  text-align: center;
}
h4.author { /* Header 4 - and the author and data headers use this too  */
    font-size: 18px;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}
h2 { /* Header 2 */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 */
    font-size: 15px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}
</style>
```

------------------------------------------------------------------------

```{r setup, include=FALSE}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}
if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}
if (!require("pander")) {
  install.packages("pander")
  library(pander)
}
knitr::opts_chunk$set(echo = TRUE,      
                      warning = FALSE,   
                      message = FALSE,  
                      results = TRUE,
                      comment = NA,
                      fig.align = "center"
                      )   
```

# Introduction

The RMS Titanic was a British luxury passenger liner that sank during its maiden voyage en route to New York City from Southampton, England, killing about 1,500 passengers and ship personnel. It is one of the most famous tragedies in modern history; it has inspired numerous stories, several films, and a musical, and has been the subject of much scholarship and scientific speculation.

This report continues the scientific exploration of the Titanic. Each of the estimated 2,224 passengers and crew has their own story of escape or death in the wreck. Studying the underlying information about these passengers, however, can provide insight into why certain people survived.

![RMS Titanic](images/clipboard-2041768422.png)

# Data Source

The data set used in this report is sourced from Kaggle.com, where it was originally created for a machine learning competition. The link for this data set is <https://www.kaggle.com/c/titanic>. For the purposes of the competition, the data is split into separate training and testing sets. This report uses the training set because it has more observations and the complete set of variables.

The data was uploaded to a GitHub repository for persistent online access, and this report reads the data from that repository (URL: <https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv>).

The data contains 891 observations on 12 variables. Observations with missing values (NA) in the variables of interest are removed. The variables are:

1.  **PassengerId**: A unique identifier
2.  **Survived**: The survival status of the passenger (0 = No, 1 = Yes)
3.  **Pclass**: The passenger's ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
4.  **Name**: The passenger's name
5.  **Sex**: The passenger's sex
6.  **Age**: The passenger's age in years
7.  **SibSp**: Number of siblings/spouses aboard
8.  **Parch**: Number of parents/children aboard
9.  **Ticket**: The passenger's ticket number
10. **Fare**: The passenger's boarding fare in USD
11. **Cabin**: The passenger's cabin
12. **Embarked**: Port at which the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

```{r}
url = "https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv"
titanic <- read.csv(url)
data <- na.omit(select(titanic, Survived, Age, Sex, Pclass))
data$Pclass <- as.factor(data$Pclass)
```
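Before dropping rows, it can help to see how many values are missing in each variable of interest. The following is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative assumptions, not the Titanic data); it uses base R's `complete.cases()` in place of `select()` and `na.omit()` so that it runs without extra packages.

```r
# Toy data frame standing in for the variables of interest (hypothetical values)
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1),
  Age      = c(22, NA, 26, 35, NA),
  Sex      = c("male", "female", "female", "male", "female"),
  Pclass   = c(3, 1, 3, 1, 2)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))

# Keep only rows with no missing values, then treat class as a factor
toy_complete <- toy[complete.cases(toy), ]
toy_complete$Pclass <- as.factor(toy_complete$Pclass)

print(na_counts)
print(nrow(toy_complete))
```

Applied to the real data, the same kind of check would show which of the selected variables account for the rows that are dropped.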

# Research Question

In many retellings of the Titanic's story, the phrase "women and children first" comes up in connection with loading the lifeboats. This suggests that age could have factored greatly into a passenger's survival of the catastrophe. This report investigates the relationship between a passenger's age and their likelihood of survival on the Titanic. To address this, a simple logistic regression model is applied, with survival (survived vs. not survived) as the binary outcome variable and passenger age as the predictor. This approach allows us to evaluate whether age significantly affected survival odds, estimate the direction and magnitude of the effect, and assess the model's ability to explain variation in survival outcomes.

# Exploratory Data Analysis

Simple logistic regression uses a single predictor, so a quick check of that variable's distribution gives insight into any issues of skew.

## Figures

```{r}
# Histogram of the predictor (Age) with a kernel density overlay
hist(data$Age, probability = TRUE, 
     main = "Age Distribution", 
     xlab="Age", 
     col = "lightgray", 
     border="black")
lines(density(data$Age, adjust=2), col="maroon") 
pander(summary(data$Age))



label <- factor(data$Survived,
                        levels = c(0, 1),
                        labels = c("Did Not Survive", "Survived"))
summary_table <- cbind(
  Frequency = table(label),
  Proportion = prop.table(table(label)))

pander(summary_table)
```

## Analysis

The histogram suggests slight right-skew, which is to be expected with age distributions in most situations. No transformation will be applied.

With about 59% of passengers not surviving and 41% surviving, the outcome classes are not imbalanced enough to bias the model.

# Logistic Regression Model

## Standard Model

Because non-survival is coded as 0, R treats it as the reference level, and survival (coded as 1) is the event level being modeled.

```{r}
#Build GLM
s.logit = glm(Survived ~ Age, 
              family = binomial(link = "logit"),  # binomial family with logit link: logit(p) = log(p / (1 - p))
              data = data)
result = summary(s.logit)
pander(result)
pander(confint(s.logit))
```

Age shows a negative relationship with a passenger's survival, with coefficient $\beta_1$ = -0.01096 (p = 0.03969). This is supported by the 95% confidence interval for $\beta_1$ of [-0.02151, -0.0005832], which excludes the null value of zero.
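
For reference, the fitted model and the Wald statistic behind the reported p-value can be written out using the rounded estimates from the summary table. The Wald interval in the second line is an approximation added here for comparison; `confint()` on a `glm` object reports a profile-likelihood interval, which is why its bounds differ slightly.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{Age}, \qquad \hat\beta_0 = -0.05672, \quad \hat\beta_1 = -0.01096
$$

$$
z = \frac{\hat\beta_1}{\mathrm{SE}(\hat\beta_1)} = \frac{-0.01096}{0.00533} \approx -2.06, \qquad \hat\beta_1 \pm 1.96\,\mathrm{SE}(\hat\beta_1) \approx [-0.0214,\ -0.0005]
$$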

## Odds Ratio Model

For more practical use, the coefficient has been transformed into an odds ratio.

```{r}
# Odds ratio
model.coef.stats = summary(s.logit)$coef
odds.ratio = exp(coef(s.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
pander(out.stats,caption = "Simple Logistic Regression Model with Odds Ratios")
```

The odds ratio for age is 0.989. This indicates that each additional year of age reduces a passenger's odds of survival by about **1.1%**. This may not make a great difference for people a few years apart, but it compounds to a substantial difference between a child and an elderly passenger.
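
Because odds ratios multiply, the per-year effect compounds over larger age gaps. The sketch below works from the rounded age coefficient reported in the model summary rather than from the fitted object; the variable names and the chosen age gaps (10 and 65 years) are illustrative assumptions.

```r
# Rounded age coefficient from the simple model summary
beta_age <- -0.01096

# Odds ratio for a one-year increase in age
or_1yr <- exp(beta_age)

# Odds ratios for larger gaps: exp(k * beta) equals exp(beta)^k
or_10yr <- exp(10 * beta_age)
or_65yr <- exp(65 * beta_age)  # e.g. a 70-year-old compared with a 5-year-old

# Percent change in odds implied by each odds ratio
pct_change <- (c(or_1yr, or_10yr, or_65yr) - 1) * 100

print(round(c(or_1yr, or_10yr, or_65yr), 4))
print(round(pct_change, 1))
```

Under this model, a 65-year age gap roughly halves the odds of survival, even though the one-year effect is only about 1.1%.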

## Analysis

While a statistically significant relationship was found, it is apparent that age alone is not a strong predictor of a passenger's odds of surviving this catastrophe. The slight drop in deviance (from 964.5 to 960.2) suggests that age can be a useful predictor; however, it is likely not the only predictor of a passenger's fate. Further research using multiple logistic regression will likely find that multiple factors, including age, are important in predicting survival.
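
One way to quantify the drop in deviance is a likelihood-ratio test together with McFadden's pseudo-R². Both are additions for illustration, computed here from the rounded null and residual deviances printed by the model summary (964.5 on 713 degrees of freedom and 960.2 on 712), not from the fitted object.

```r
# Deviances reported by the simple model summary (rounded)
null_dev  <- 964.5  # intercept-only model, 713 degrees of freedom
resid_dev <- 960.2  # model with Age, 712 degrees of freedom

# Likelihood-ratio statistic and its degrees of freedom
lr_stat <- null_dev - resid_dev
lr_df   <- 713 - 712

# p-value from the chi-squared reference distribution
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo-R^2; for a 0/1 outcome the deviance is -2 * log-likelihood,
# so this is the proportional reduction in deviance
mcfadden_r2 <- 1 - resid_dev / null_dev

print(c(lr_stat = lr_stat, lr_p = lr_p, mcfadden_r2 = mcfadden_r2))
```

With the fitted object available, `anova(s.logit, test = "Chisq")` carries out the same test without rounding. A pseudo-R² well below 0.01 is consistent with the conclusion that age alone explains very little of the variation in survival.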

## Application

Using the logistic regression model, we can estimate the odds of survival for a few hypothetical passengers of different ages. This helps conceptualize the differences between these ages, assuming age is the only cause of difference in survival odds.

```{r}
# Create a new dataframe with some example ages
new_passengers <- data.frame(Age = c(5, 20, 30, 50, 70))

# Predict probabilities of survival
pred_probs <- predict(s.logit, newdata = new_passengers, type = "response")

# Convert probabilities to odds
pred_odds <- pred_probs / (1 - pred_probs)

# Combine results into a summary table
predictions <- data.frame(
  Age = new_passengers$Age,
  Probability_of_Survival = round(pred_probs, 3),
  Odds_of_Survival = round(pred_odds, 3)
)

pander(predictions)
```
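
The values returned by `predict(..., type = "response")` can be reproduced by hand, which makes the link between the linear predictor, the probability, and the odds explicit. This sketch uses the rounded coefficients from the model summary, so `b0`, `b1`, and `ages` are stand-ins rather than the fitted model, and small rounding differences are possible.

```r
# Rounded coefficients from the simple model summary
b0 <- -0.05672  # intercept
b1 <- -0.01096  # age coefficient
ages <- c(5, 20, 30, 50, 70)

# Linear predictor on the log-odds scale
eta <- b0 + b1 * ages

# Inverse logit gives the probability; exponentiating gives the odds
prob <- plogis(eta)
odds <- exp(eta)

manual <- data.frame(
  Age = ages,
  Probability_of_Survival = round(prob, 3),
  Odds_of_Survival = round(odds, 3)
)
print(manual)
```

Here `exp(eta)` and `prob / (1 - prob)` are the same quantity, which is why the odds column can be computed either way.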

## Discussion

Age was found to have an impact on a passenger's odds of survival. Each additional year of age decreased the odds of survival by about **1.1%**. It is important to note that survival of this tragedy depended on many complex factors. It is expected, however, that age will persist with a negative relationship to survival odds. It may also be worth noting that the relationship may not be entirely linear, and that survival factors may change at certain ages. For example, the youngest passenger in the data was Master Assad Alexander Thomas at about 5 months old (0.42 years). This passenger survived, but their survival was certainly dependent on those around them, and not their own actions. The same applies to children old enough to be seen as somewhat "independent" but certainly not capable of securing their own survival, such as 1-year-old Miss Maria ("Mary") Nakid.

# Multiple Logistic Regression

## Standard Model

As before, survival is coded as the event level. Passenger class, a qualitative factor, is coded with first class as the baseline, and female is the baseline level for sex.

```{r}
#Build GLM
m.logit = glm(Survived ~ Age + Sex + Pclass, 
              family = binomial(link = "logit"),  # binomial family with logit link: logit(p) = log(p / (1 - p))
              data = data)
mresult = summary(m.logit)
pander(mresult)
pander(confint(m.logit))
```
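
Written out, the model fitted above has the following form. The indicator terms follow from R's default treatment coding of `Sex` and `Pclass`, which produces the coefficients `Sexmale`, `Pclass2`, and `Pclass3`; the symbols themselves are notation introduced here for illustration.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\mathbf{1}\{\text{Sex} = \text{male}\} + \beta_3\,\mathbf{1}\{\text{Pclass} = 2\} + \beta_4\,\mathbf{1}\{\text{Pclass} = 3\}
$$

Under this coding, $\beta_0$ is the log-odds of survival for a first class female passenger at age zero, and each remaining coefficient is a shift in log-odds relative to that baseline.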

## Odds-Ratio Model

Odds-ratio conversions allow for easier interpretation.

```{r}
# Odds ratio
m.model.coef.stats = summary(m.logit)$coef
m.odds.ratio = exp(coef(m.logit))
m.out.stats = cbind(m.model.coef.stats, odds.ratio = m.odds.ratio)                 
pander(m.out.stats,caption = "Multiple Logistic Regression Model with Odds Ratios")
```

## Discussion

All predictors were found to be significant. Holding the other predictors constant, each additional year of age reduced a passenger's odds of survival by about 4%, and being male reduced the odds of survival by about 92%. Compared to first class passengers, 2nd class passengers had about 73% lower odds of survival, and 3rd class passengers had about 92% lower odds of survival.
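
To make these odds ratios concrete, the sketch below converts the rounded coefficients from the multiple regression summary into predicted survival probabilities for two hypothetical 30-year-old passengers. The profiles and the helper `predict_prob()` are illustrative assumptions, not part of the original analysis.

```r
# Rounded coefficients from the multiple logistic regression summary
coefs <- c(intercept = 3.777, age = -0.03699, male = -2.523,
           class2 = -1.31, class3 = -2.581)

# Hypothetical helper: predicted survival probability for one passenger
# male is 0 or 1; pclass is 1, 2, or 3 (first class is the baseline)
predict_prob <- function(age, male, pclass) {
  eta <- coefs[["intercept"]] +
    coefs[["age"]] * age +
    coefs[["male"]] * male +
    coefs[["class2"]] * (pclass == 2) +
    coefs[["class3"]] * (pclass == 3)
  plogis(eta)  # inverse logit
}

p_female_first <- predict_prob(age = 30, male = 0, pclass = 1)
p_male_third   <- predict_prob(age = 30, male = 1, pclass = 3)

# Percent change in odds implied by each non-intercept coefficient
pct_change <- (exp(coefs[-1]) - 1) * 100

print(round(c(female_first = p_female_first, male_third = p_male_third), 3))
print(round(pct_change, 1))
```

Comparing the two profiles shows how the separate effects of sex and class combine on the log-odds scale before being converted back to a probability.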
