1 Introduction

The RMS Titanic was a British luxury passenger liner that sank during its maiden voyage en route to New York City from Southampton, England, killing about 1,500 passengers and ship personnel. It is one of the most famous tragedies in modern history, it inspired numerous stories, several films, and a musical and has been the subject of much scholarship and scientific speculation.

This report will continue the scientific exploration of the Titanic. The estimated 2,224 passengers all have their own stories of escape or death in the wreck. It is believed, however, that studying the underlying information about these passengers can provide insight on why certain people survived.

2 Data Source

The data set used in this report is sourced from Kaggle.com. It was orignially created for a machine learning competition hosted by Kaggle. The link for this data set is https://www.kaggle.com/c/titanic. The data is originally split into seperate train and testing data sets for purposes of the competition. The train data is used in this report for more observations and the complete set of variables.

The data was uploaded to a GitHub repository for perpetual online access. This report will source the data from this repository (URL, https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv)

This data contains 891 observations on 12 variables. Any observations of N/A in variables of interest will be removed. The variables are:

  1. PassengerID: A unique identifier
  2. Survived: The survival status of the passenger (0 = No, 1 = Yes)
  3. Pclass: The passenger’s ticket class (1st, 2nd, or 3rd)
  4. Name: The passenger’s name
  5. Sex: The passenger’s sex
  6. Age: The passenger’s age in years
  7. sibSp: Number of siblings/spouse aboard
  8. Parch: Number of parents/children aboard
  9. Ticket: Passenger’s ticket number
  10. Fare: Passenger’s boarding fare in USD
  11. Cabin: The passenger’s cabin
  12. Embarked: Location which passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
url = "https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv"
titanic <- read.csv(url)
data <- na.omit(dplyr::select(titanic, Survived, Pclass, Age, Sex, SibSp, Parch, Fare))
data$Survived <- as.factor(data$Survived)
data$Pclass <- as.factor(data$Pclass)

3 Research Question

This report builds a logistic regression model to predict the survival of the Titanic passengers using demographic information of a sample of survivors

4 Exploratory Data Analysis

All observations with missing values in any of the variables were removed. This leaves 714 observations. Scatter plots were built to display potential issues with predictor variables. The scatter plots are displayed below.

pairs.panels(data[,-9], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

We can see that there are more deaths than survivals, more high class passengers than middle or low, and more men than women. Age is slightly skewed right, but no transformation will be applied. SibSP, Parch, and Fare are highly skewed. PassengerID, Name, Ticket, Cabin, and Embarked were not included by irrelevance.

5 Predictive Logistic Regression

5.1 Split: Training and Testing Data

The data is randomly split into two unique subsets. 60% of the data is a training subset on which the model will be constructed. The remaining 40% will be used for testing the model and providing performance indications. The randomization seed will be set for reproducibility of the samples.

set.seed(123)
train <- createDataPartition(data$Survived, p=0.6, list=FALSE)
training <- data[train,]
testing <- data[-train,]
pander(head(training), caption = "First 5 entries of the training data subset")
First 5 entries of the training data subset
  Survived Pclass Age Sex SibSp Parch Fare
3 1 3 26 female 0 0 7.925
4 1 1 35 female 1 0 53.1
5 0 3 35 male 0 0 8.05
8 0 3 2 male 3 1 21.07
10 1 2 14 female 1 0 30.07
11 1 3 4 female 1 1 16.7

5.2 Multiple Logistic Regression

5.2.1 Candidate Model #1

The first candidate model is built upon the variables age, class, and sex. 10-fold cross validation will be used in this models training. The coefficients will be displayed with an exponential transformation.

ctrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
model <- train(Survived ~ Age + Pclass + Sex,
               data=training,
               method="glm",
               family="binomial",
               trControl=ctrl,
               tuneLength = 10)
pander(exp(coef(model$finalModel)))
(Intercept) Age Pclass2 Pclass3 Sexmale
32.41 0.9697 0.2259 0.07259 0.1157

5.2.2 Candidate Model #2

The second candidate model is built upon the variable age. This is because age is a continuous variable that will have more predictive power than the categorical variables. The coefficients are again displayed with exponential transformation.

model.reduced <- train(Survived ~ Age,
                       data=training,
                       method="glm",
                       family="binomial",
                       trControl=ctrl,
                       tuneLength = 10)
pander(exp(coef(model.reduced$finalModel)))
(Intercept) Age
0.8075 0.9943

We use a likelihood ratio test to determine if using the larger candidate model is demonstrably improved upon the simpler, smaller model. P <0.001, and we thus gain evidence in favor of the larger model.

model.glm <- glm(Survived ~ Age + Sex + Pclass,
               data=training,
               family="binomial")
model.reduced.glm <- glm(Survived ~ Age,
                     data=training,
                     family="binomial")
anova(model.glm, model.reduced.glm, test="Chisq")
Analysis of Deviance Table

Model 1: Survived ~ Age + Sex + Pclass
Model 2: Survived ~ Age
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       424     409.31                          
2       427     578.62 -3   -169.3 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.3 Variable Importance

We assess the importance of each variable in the multiple logistic regression model.

varImp(model)
glm variable importance

        Overall
Sexmale  100.00
Pclass3   80.04
Pclass2   19.41
Age        0.00

5.4 Classification Rate

Using the testing data subset, we can assess the accuracy of the model. This model had an accuracy of 81.05%. The model resulted in an equal amount of false-survivals and false-deaths. The sensitivity was 84.02% and specificity was 76.72%.

pred = predict(model, newdata=testing)
accuracy <- table(pred, testing[,"Survived"])
confusionMatrix(data=pred, testing$Survived)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 142  27
         1  27  89
                                          
               Accuracy : 0.8105          
                 95% CI : (0.7601, 0.8543)
    No Information Rate : 0.593           
    P-Value [Acc > NIR] : 3.435e-15       
                                          
                  Kappa : 0.6075          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.8402          
            Specificity : 0.7672          
         Pos Pred Value : 0.8402          
         Neg Pred Value : 0.7672          
             Prevalence : 0.5930          
         Detection Rate : 0.4982          
   Detection Prevalence : 0.5930          
      Balanced Accuracy : 0.8037          
                                          
       'Positive' Class : 0               
                                          
---
title: "Predictive Logistic Analysis of RMS Titanic Survivors"
author: "Noah Brechbill"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 4
    fig_width: 6
    fig_height: 6
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
---

```{css, echo = FALSE}
/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  font-weight: bold;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: system-ui;
  font-weight: bold;
    color: navy;
    text-align: left;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }
```

```{r packages, include=FALSE}
load_packages <- function(pkg_list) {
  for (pkg in pkg_list) {
    if (!require(pkg, character.only = TRUE)) {
      install.packages(pkg, dependencies = TRUE)
      library(pkg, character.only = TRUE)
    }
  }
}

packages <- c("tidyverse", "GGally", "pander", "psych", "MASS", "caret")
load_packages(packages)

knitr::opts_chunk$set(echo = TRUE,      
                      warning = FALSE,   
                      message = FALSE,  
                      results = TRUE,
                      comment = NA,
                      fig.align = "center"
                      )   
```

# Introduction

The RMS Titanic was a British luxury passenger liner that sank during its maiden voyage en route to New York City from Southampton, England, killing about 1,500 passengers and ship personnel. It is one of the most famous tragedies in modern history, it inspired numerous stories, several films, and a musical and has been the subject of much scholarship and scientific speculation.

This report will continue the scientific exploration of the Titanic. The estimated 2,224 passengers all have their own stories of escape or death in the wreck. It is believed, however, that studying the underlying information about these passengers can provide insight on why certain people survived.

# Data Source

The data set used in this report is sourced from Kaggle.com. It was orignially created for a machine learning competition hosted by Kaggle. The link for this data set is <https://www.kaggle.com/c/titanic>. The data is originally split into seperate train and testing data sets for purposes of the competition. The train data is used in this report for more observations and the complete set of variables.

The data was uploaded to a GitHub repository for perpetual online access. This report will source the data from this repository (URL, <https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv>)

This data contains 891 observations on 12 variables. Any observations of N/A in variables of interest will be removed. The variables are:

1.  **PassengerID**: A unique identifier
2.  **Survived**: The survival status of the passenger (0 = No, 1 = Yes)
3.  **Pclass**: The passenger's ticket class (1st, 2nd, or 3rd)
4.  **Name**: The passenger's name
5.  **Sex**: The passenger's sex
6.  **Age**: The passenger's age in years
7.  **sibSp**: Number of siblings/spouse aboard
8.  **Parch**: Number of parents/children aboard
9.  **Ticket**: Passenger's ticket number
10. **Fare**: Passenger's boarding fare in USD
11. **Cabin**: The passenger's cabin
12. **Embarked**: Location which passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

```{r data loading}
url = "https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/titanic.csv"
titanic <- read.csv(url)
data <- na.omit(dplyr::select(titanic, Survived, Pclass, Age, Sex, SibSp, Parch, Fare))
data$Survived <- as.factor(data$Survived)
data$Pclass <- as.factor(data$Pclass)
```

# Research Question

This report builds a logistic regression model to predict the survival of the Titanic passengers using demographic information of a sample of survivors

# Exploratory Data Analysis

All observations with missing values in any of the variables were removed. This leaves 714 observations. Scatter plots were built to display potential issues with predictor variables. The scatter plots are displayed below.

```{r}
pairs.panels(data[,-9], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )
```

We can see that there are more deaths than survivals, more high class passengers than middle or low, and more men than women. Age is slightly skewed right, but no transformation will be applied. SibSP, Parch, and Fare are highly skewed. PassengerID, Name, Ticket, Cabin, and Embarked were not included by irrelevance.

# Predictive Logistic Regression

## Split: Training and Testing Data

The data is randomly split into two unique subsets. 60% of the data is a training subset on which the model will be constructed. The remaining 40% will be used for testing the model and providing performance indications. The randomization seed will be set for reproducibility of the samples.

```{r}
set.seed(123)
train <- createDataPartition(data$Survived, p=0.6, list=FALSE)
training <- data[train,]
testing <- data[-train,]
pander(head(training), caption = "First 5 entries of the training data subset")
```

## Multiple Logistic Regression

### Candidate Model #1

The first candidate model is built upon the variables age, class, and sex. 10-fold cross validation will be used in this models training. The coefficients will be displayed with an exponential transformation.

```{r}
ctrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
model <- train(Survived ~ Age + Pclass + Sex,
               data=training,
               method="glm",
               family="binomial",
               trControl=ctrl,
               tuneLength = 10)
pander(exp(coef(model$finalModel)))
```

### Candidate Model #2 

The second candidate model is built upon the variable age. This is because age is a continuous variable that will have more predictive power than the categorical variables. The coefficients are again displayed with exponential transformation.

```{r}
model.reduced <- train(Survived ~ Age,
                       data=training,
                       method="glm",
                       family="binomial",
                       trControl=ctrl,
                       tuneLength = 10)
pander(exp(coef(model.reduced$finalModel)))
```

We use a likelihood ratio test to determine if using the larger candidate model is demonstrably improved upon the simpler, smaller model. P \<0.001, and we thus gain evidence in favor of the larger model.

```{r}
model.glm <- glm(Survived ~ Age + Sex + Pclass,
               data=training,
               family="binomial")
model.reduced.glm <- glm(Survived ~ Age,
                     data=training,
                     family="binomial")
anova(model.glm, model.reduced.glm, test="Chisq")
```

## Variable Importance

We assess the importance of each variable in the multiple logistic regression model.

```{r}
varImp(model)
```

## Classification Rate

Using the testing data subset, we can assess the accuracy of the model. This model had an accuracy of 81.05%. The model resulted in an equal amount of false-survivals and false-deaths. The sensitivity was 84.02% and specificity was 76.72%.

```{r}
pred = predict(model, newdata=testing)
accuracy <- table(pred, testing[,"Survived"])
confusionMatrix(data=pred, testing$Survived)
```
