Introduction
The RMS Titanic was a British luxury passenger liner that sank during
its maiden voyage en route to New York City from Southampton, England,
killing about 1,500 passengers and ship personnel. It is one of the most
famous tragedies in modern history, it inspired numerous stories,
several films, and a musical and has been the subject of much
scholarship and scientific speculation.
This report will continue the scientific exploration of the Titanic.
The estimated 2,224 passengers all have their own stories of escape or
death in the wreck. It is believed, however, that studying the
underlying information about these passengers can provide insight on why
certain people survived.
Data Source
The data set used in this report is sourced from Kaggle.com. It was
orignially created for a machine learning competition hosted by Kaggle.
The link for this data set is https://www.kaggle.com/c/titanic. The data is originally
split into seperate train and testing data sets for purposes of the
competition. The train data is used in this report for more observations
and the complete set of variables.
The data was uploaded to a GitHub repository for perpetual online
access. This report will source the data from this repository (URL, https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv)
This data contains 891 observations on 12 variables. Any observations
of N/A in variables of interest will be removed. The variables are:
- PassengerID: A unique identifier
- Survived: The survival status of the passenger (0 =
No, 1 = Yes)
- Pclass: The passenger’s ticket class (1st, 2nd, or
3rd)
- Name: The passenger’s name
- Sex: The passenger’s sex
- Age: The passenger’s age in years
- sibSp: Number of siblings/spouse aboard
- Parch: Number of parents/children aboard
- Ticket: Passenger’s ticket number
- Fare: Passenger’s boarding fare in USD
- Cabin: The passenger’s cabin
- Embarked: Location which passenger embarked (C =
Cherbourg, Q = Queenstown, S = Southampton)
url = "https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv"
titanic <- read.csv(url)
data <- na.omit(dplyr::select(titanic, Survived, Pclass, Age, Sex, SibSp, Parch, Fare))
data$Survived <- as.factor(data$Survived)
data$Pclass <- as.factor(data$Pclass)
Research Question
This report builds a logistic regression model to predict the
survival of the Titanic passengers using demographic information of a
sample of survivors
Exploratory Data
Analysis
All observations with missing values in any of the variables were
removed. This leaves 714 observations. Scatter plots were built to
display potential issues with predictor variables. The scatter plots are
displayed below.
pairs.panels(data[,-9],
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)

We can see that there are more deaths than survivals, more high class
passengers than middle or low, and more men than women. Age is slightly
skewed right, but no transformation will be applied. SibSP, Parch, and
Fare are highly skewed. PassengerID, Name, Ticket, Cabin, and Embarked
were not included by irrelevance.
Predictive Logistic
Regression
Split: Training and
Testing Data
The data is randomly split into two unique subsets. 60% of the data
is a training subset on which the model will be constructed. The
remaining 40% will be used for testing the model and providing
performance indications. The randomization seed will be set for
reproducibility of the samples.
set.seed(123)
train <- createDataPartition(data$Survived, p=0.6, list=FALSE)
training <- data[train,]
testing <- data[-train,]
pander(head(training), caption = "First 5 entries of the training data subset")
First 5 entries of the training data subset
| 3 |
1 |
3 |
26 |
female |
0 |
0 |
7.925 |
| 4 |
1 |
1 |
35 |
female |
1 |
0 |
53.1 |
| 5 |
0 |
3 |
35 |
male |
0 |
0 |
8.05 |
| 8 |
0 |
3 |
2 |
male |
3 |
1 |
21.07 |
| 10 |
1 |
2 |
14 |
female |
1 |
0 |
30.07 |
| 11 |
1 |
3 |
4 |
female |
1 |
1 |
16.7 |
Multiple Logistic
Regression
Candidate Model
#1
The first candidate model is built upon the variables age, class, and
sex. 10-fold cross validation will be used in this models training. The
coefficients will be displayed with an exponential transformation.
ctrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
model <- train(Survived ~ Age + Pclass + Sex,
data=training,
method="glm",
family="binomial",
trControl=ctrl,
tuneLength = 10)
pander(exp(coef(model$finalModel)))
| 32.41 |
0.9697 |
0.2259 |
0.07259 |
0.1157 |
Candidate Model
#2
The second candidate model is built upon the variable age. This is
because age is a continuous variable that will have more predictive
power than the categorical variables. The coefficients are again
displayed with exponential transformation.
model.reduced <- train(Survived ~ Age,
data=training,
method="glm",
family="binomial",
trControl=ctrl,
tuneLength = 10)
pander(exp(coef(model.reduced$finalModel)))
We use a likelihood ratio test to determine if using the larger
candidate model is demonstrably improved upon the simpler, smaller
model. P <0.001, and we thus gain evidence in favor of the larger
model.
model.glm <- glm(Survived ~ Age + Sex + Pclass,
data=training,
family="binomial")
model.reduced.glm <- glm(Survived ~ Age,
data=training,
family="binomial")
anova(model.glm, model.reduced.glm, test="Chisq")
Analysis of Deviance Table
Model 1: Survived ~ Age + Sex + Pclass
Model 2: Survived ~ Age
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 424 409.31
2 427 578.62 -3 -169.3 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variable
Importance
We assess the importance of each variable in the multiple logistic
regression model.
varImp(model)
glm variable importance
Overall
Sexmale 100.00
Pclass3 80.04
Pclass2 19.41
Age 0.00
Classification
Rate
Using the testing data subset, we can assess the accuracy of the
model. This model had an accuracy of 81.05%. The model resulted in an
equal amount of false-survivals and false-deaths. The sensitivity was
84.02% and specificity was 76.72%.
pred = predict(model, newdata=testing)
accuracy <- table(pred, testing[,"Survived"])
confusionMatrix(data=pred, testing$Survived)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 142 27
1 27 89
Accuracy : 0.8105
95% CI : (0.7601, 0.8543)
No Information Rate : 0.593
P-Value [Acc > NIR] : 3.435e-15
Kappa : 0.6075
Mcnemar's Test P-Value : 1
Sensitivity : 0.8402
Specificity : 0.7672
Pos Pred Value : 0.8402
Neg Pred Value : 0.7672
Prevalence : 0.5930
Detection Rate : 0.4982
Detection Prevalence : 0.5930
Balanced Accuracy : 0.8037
'Positive' Class : 0
---
title: "Predictive Logistic Analysis of RMS Titanic Survivors"
author: "Noah Brechbill"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 4
    fig_width: 6
    fig_height: 6
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
---

```{css, echo = FALSE}
/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  font-weight: bold;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: system-ui;
  font-weight: bold;
    color: navy;
    text-align: left;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }
```

```{r packages, include=FALSE}
load_packages <- function(pkg_list) {
  for (pkg in pkg_list) {
    if (!require(pkg, character.only = TRUE)) {
      install.packages(pkg, dependencies = TRUE)
      library(pkg, character.only = TRUE)
    }
  }
}

packages <- c("tidyverse", "GGally", "pander", "psych", "MASS", "caret")
load_packages(packages)

knitr::opts_chunk$set(echo = TRUE,      
                      warning = FALSE,   
                      message = FALSE,  
                      results = TRUE,
                      comment = NA,
                      fig.align = "center"
                      )   
```

# Introduction

The RMS Titanic was a British luxury passenger liner that sank during its maiden voyage en route to New York City from Southampton, England, killing about 1,500 passengers and ship personnel. It is one of the most famous tragedies in modern history, it inspired numerous stories, several films, and a musical and has been the subject of much scholarship and scientific speculation.

This report will continue the scientific exploration of the Titanic. The estimated 2,224 passengers all have their own stories of escape or death in the wreck. It is believed, however, that studying the underlying information about these passengers can provide insight on why certain people survived.

# Data Source

The data set used in this report is sourced from Kaggle.com. It was orignially created for a machine learning competition hosted by Kaggle. The link for this data set is <https://www.kaggle.com/c/titanic>. The data is originally split into seperate train and testing data sets for purposes of the competition. The train data is used in this report for more observations and the complete set of variables.

The data was uploaded to a GitHub repository for perpetual online access. This report will source the data from this repository (URL, <https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv>)

This data contains 891 observations on 12 variables. Any observations of N/A in variables of interest will be removed. The variables are:

1.  **PassengerID**: A unique identifier
2.  **Survived**: The survival status of the passenger (0 = No, 1 = Yes)
3.  **Pclass**: The passenger's ticket class (1st, 2nd, or 3rd)
4.  **Name**: The passenger's name
5.  **Sex**: The passenger's sex
6.  **Age**: The passenger's age in years
7.  **sibSp**: Number of siblings/spouse aboard
8.  **Parch**: Number of parents/children aboard
9.  **Ticket**: Passenger's ticket number
10. **Fare**: Passenger's boarding fare in USD
11. **Cabin**: The passenger's cabin
12. **Embarked**: Location which passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

```{r data loading}
url = "https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv"
titanic <- read.csv(url)
data <- na.omit(dplyr::select(titanic, Survived, Pclass, Age, Sex, SibSp, Parch, Fare))
data$Survived <- as.factor(data$Survived)
data$Pclass <- as.factor(data$Pclass)
```

# Research Question

This report builds a logistic regression model to predict the survival of the Titanic passengers using demographic information of a sample of survivors

# Exploratory Data Analysis

All observations with missing values in any of the variables were removed. This leaves 714 observations. Scatter plots were built to display potential issues with predictor variables. The scatter plots are displayed below.

```{r}
pairs.panels(data[,-9], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )
```

We can see that there are more deaths than survivals, more high class passengers than middle or low, and more men than women. Age is slightly skewed right, but no transformation will be applied. SibSP, Parch, and Fare are highly skewed. PassengerID, Name, Ticket, Cabin, and Embarked were not included by irrelevance.

# Predictive Logistic Regression

## Split: Training and Testing Data

The data is randomly split into two unique subsets. 60% of the data is a training subset on which the model will be constructed. The remaining 40% will be used for testing the model and providing performance indications. The randomization seed will be set for reproducibility of the samples.

```{r}
set.seed(123)
train <- createDataPartition(data$Survived, p=0.6, list=FALSE)
training <- data[train,]
testing <- data[-train,]
pander(head(training), caption = "First 5 entries of the training data subset")
```

## Multiple Logistic Regression

### Candidate Model #1

The first candidate model is built upon the variables age, class, and sex. 10-fold cross validation will be used in this models training. The coefficients will be displayed with an exponential transformation.

```{r}
ctrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
model <- train(Survived ~ Age + Pclass + Sex,
               data=training,
               method="glm",
               family="binomial",
               trControl=ctrl,
               tuneLength = 10)
pander(exp(coef(model$finalModel)))
```

### Candidate Model #2 

The second candidate model is built upon the variable age. This is because age is a continuous variable that will have more predictive power than the categorical variables. The coefficients are again displayed with exponential transformation.

```{r}
model.reduced <- train(Survived ~ Age,
                       data=training,
                       method="glm",
                       family="binomial",
                       trControl=ctrl,
                       tuneLength = 10)
pander(exp(coef(model.reduced$finalModel)))
```

We use a likelihood ratio test to determine if using the larger candidate model is demonstrably improved upon the simpler, smaller model. P \<0.001, and we thus gain evidence in favor of the larger model.

```{r}
model.glm <- glm(Survived ~ Age + Sex + Pclass,
               data=training,
               family="binomial")
model.reduced.glm <- glm(Survived ~ Age,
                     data=training,
                     family="binomial")
anova(model.glm, model.reduced.glm, test="Chisq")
```

## Variable Importance

We assess the importance of each variable in the multiple logistic regression model.

```{r}
varImp(model)
```

## Classification Rate

Using the testing data subset, we can assess the accuracy of the model. This model had an accuracy of 81.05%. The model resulted in an equal amount of false-survivals and false-deaths. The sensitivity was 84.02% and specificity was 76.72%.

```{r}
pred = predict(model, newdata=testing)
accuracy <- table(pred, testing[,"Survived"])
confusionMatrix(data=pred, testing$Survived)
```
