Titanic Survival Analysis

Cole Campbell

2025-06-12

1. Introduction & Analysis Plan

1.1 A Brief History of the Titanic

On April 10, 1912, the White Star Line’s RMS Titanic, then the largest moving man-made object on earth, set out from Southampton, England, on her maiden voyage to New York City. Billed as the pinnacle of maritime luxury and “practically unsinkable,” she carried an estimated 2,224 passengers and crew. Late on April 14 she struck an iceberg, and in the early hours of April 15 she foundered, with the loss of an estimated 1,517 lives. The tragedy exposed the inequalities of early-20th-century society: “women and children first” shaped who reached the lifeboats, but so did first-class privilege, while too few lifeboats and poor emergency planning cost lives across every deck.

1.2 The Kaggle Titanic Dataset

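The Kaggle Titanic dataset distills the disaster into tabular form. Each row represents one passenger, and the training file has twelve columns: PassengerId, Survived (0 = died, 1 = survived), Pclass (ticket class, 1 to 3), Name, Sex, Age, SibSp (siblings/spouses aboard), Parch (parents/children aboard), Ticket, Fare, Cabin, and Embarked (port of embarkation).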

By structuring that manifest into a modern data frame, we can interrogate who survived, and why, in a way that blends history, ethics, and predictive modeling.

1.3 Analysis Plan

To turn raw passenger records into insights, we will:

  1. Clean & Impute: fill missing ages by group medians; extract titles (Mr., Miss.) and cabin decks; engineer a ‘FamilySize’ feature.

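  2. Exploratory Data Analysis: compare survival rates by sex and class, age distributions by survival status, and survival rates by family size.

  3. Statistical Modeling: fit a logistic regression on a stratified 80/20 train/test split and evaluate it with accuracy and Cohen’s Kappa.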

Together, these steps will reveal which social and demographic factors were most strongly associated with survival aboard the Titanic, and test our hypotheses against the data.

2. Data Wrangling & Cleaning

2.1 Load the Data

Before diving into transformations, we read both CSV files and immediately call skim() to profile row counts, data types, and missingness. This ensures we know exactly how many NAs we’ll need to handle in Age, Cabin, and other fields before we start any downstream steps.

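# Load the packages used throughout (all are listed in the Appendix session info)
library(tidyverse)
library(skimr)
library(caret)
library(ggthemes)

# Read raw training and test sets
train <- read_csv("train.csv")
test  <- read_csv("test.csv")

# Inspect structure and missingness
skim(train)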
Data summary

Name                      train
Number of rows            891
Number of columns         12
Column type frequency:
  character               5
  numeric                 7
Group variables           None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
Name                   0           1.00   12   82      0       891           0
Sex                    0           1.00    4    6      0         2           0
Ticket                 0           1.00    3   18      0       681           0
Cabin                687           0.23    1   15      0       147           0
Embarked               2           1.00    1    1      0         3           0

Variable type: numeric

skim_variable  n_missing  complete_rate    mean      sd    p0     p25     p50    p75    p100  hist
PassengerId            0            1.0  446.00  257.35  1.00  223.50  446.00  668.5  891.00  ▇▇▇▇▇
Survived               0            1.0    0.38    0.49  0.00    0.00    0.00    1.0    1.00  ▇▁▁▁▅
Pclass                 0            1.0    2.31    0.84  1.00    2.00    3.00    3.0    3.00  ▃▁▃▁▇
Age                  177            0.8   29.70   14.53  0.42   20.12   28.00   38.0   80.00  ▂▇▅▂▁
SibSp                  0            1.0    0.52    1.10  0.00    0.00    0.00    1.0    8.00  ▇▁▁▁▁
Parch                  0            1.0    0.38    0.81  0.00    0.00    0.00    0.0    6.00  ▇▁▁▁▁
Fare                   0            1.0   32.20   49.69  0.00    7.91   14.45   31.0  512.33  ▇▁▁▁▁

After skimming, we see 891 observations, with roughly 20% of ages missing and over three-quarters of cabin entries blank—information that will guide our imputation and feature-engineering strategy.
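The missingness figures above can also be cross-checked without skimr. The following is a minimal base-R sketch on a hypothetical toy data frame (`toy` and its values are invented for illustration and are not rows from train.csv); it computes the same two quantities that skim() reports as n_missing and complete_rate.

```r
# Hypothetical toy data standing in for a few columns of train.csv
toy <- data.frame(
  Age      = c(22, NA, 38, NA, 35),
  Cabin    = c(NA, "C85", NA, NA, "E46"),
  Embarked = c("S", "C", NA, "S", "Q"),
  stringsAsFactors = FALSE
)

# Count missing values per column, then convert to a completeness share
n_missing     <- colSums(is.na(toy))
complete_rate <- 1 - n_missing / nrow(toy)

# One row per column, mirroring the layout of the skim() summary
missing_summary <- data.frame(n_missing, complete_rate)
print(missing_summary)
```

Running the same two lines on the real `train` object should reproduce the 177 missing ages and 687 missing cabins reported above.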

2.2 Impute & Feature Engineering

In the next chunk we systematically turn raw strings and blanks into model-ready features. We fill the two missing Embarked values with “S” (the most common port), shrink each Cabin to its deck letter, defaulting to “U” (Unknown), and extract passenger titles from names, remapping variants like “Mlle” to familiar labels. By grouping on (Pclass, Sex) to impute missing ages with local medians, we preserve demographic trends rather than flatten them with a single global value. Finally, we calculate FamilySize as the sum of siblings/spouses and parents/children plus one, convert every categorical variable into a factor, and drop the raw text columns that have served their purpose.

train2 <- train %>%
  # Fill missing Embarked with most common port 'S'
  mutate(Embarked = replace_na(Embarked, "S")) %>%
  # Extract deck letter from Cabin; missing → 'U'
  mutate(Deck = ifelse(is.na(Cabin), "U", str_sub(Cabin, 1, 1))) %>%
  # Parse Title from Name (Mr., Mrs., Miss., etc.)
  mutate(
    Title = str_extract(Name, "(?<=, )[A-Za-z]+\\."),
    Title = str_remove(Title, "\\."),
    Title = recode(Title,
                   "Ms"   = "Miss",
                   "Mlle" = "Miss",
                   "Mme"  = "Mrs",
                   .default = Title)
  ) %>%
  # Impute missing Age by median within Sex × Pclass groups
  group_by(Pclass, Sex) %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  ungroup() %>%
  # Engineer FamilySize and convert key columns to factors
  mutate(
    FamilySize = SibSp + Parch + 1,
    Survived   = factor(Survived),
    Pclass     = factor(Pclass),
    Sex        = factor(Sex),
    Embarked   = factor(Embarked),
    Deck       = factor(Deck),
    Title      = factor(Title)
  ) %>%
  # Drop unused columns
  select(-Name, -Ticket, -Cabin)

Now train2 has no missing Embarked or Age values, and each passenger carries a Deck and Title encoding of social cues. Imputing age by class and sex retains key demographic structure, while creating FamilySize lets us later test whether traveling alone or in a small group altered survival odds.
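To make the title parsing and group-median imputation concrete, here is a small base-R analogue of those two steps. It is a sketch rather than the pipeline used above: the data frame `toy`, its names, and its ages are all hypothetical, and `regexpr()`/`ave()` stand in for the stringr and dplyr calls.

```r
# Hypothetical passengers; names only imitate the "Surname, Title. Given" layout
toy <- data.frame(
  Name   = c("Doe, Mr. John", "Doe, Mr. James", "Doe, Mr. Jack",
             "Roe, Mlle. Anne", "Roe, Mrs. Mary", "Roe, Ms. Eve",
             "Poe, the Countess. of (Jane)"),
  Sex    = c("male", "male", "male", "female", "female", "female", "female"),
  Pclass = c(3, 3, 3, 1, 1, 1, 1),
  Age    = c(20, NA, 30, 40, NA, 50, 36),
  stringsAsFactors = FALSE
)

# Title = the single word between ", " and the following "."
pos   <- regexpr("(?<=, )[A-Za-z]+(?=\\.)", toy$Name, perl = TRUE)
title <- rep(NA_character_, nrow(toy))
title[pos > 0] <- regmatches(toy$Name, pos)

# Collapse rare variants onto common labels
recode_map <- c(Ms = "Miss", Mlle = "Miss", Mme = "Mrs")
hit <- title %in% names(recode_map)
title[hit] <- unname(recode_map[title[hit]])
toy$Title <- title

# Replace each missing Age with the median of its Pclass x Sex group
toy$Age <- ave(toy$Age, toy$Pclass, toy$Sex, FUN = function(a) {
  a[is.na(a)] <- median(a, na.rm = TRUE)
  a
})

print(toy[, c("Title", "Sex", "Pclass", "Age")])
```

Note that a single-word pattern of this kind yields NA for a multi-word title such as “the Countess”, so it may be worth checking `sum(is.na(train2$Title))` after running the real pipeline.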

3. Exploratory Data Analysis

3.1 Survival Rate by Sex & Class

We begin our EDA by collapsing the data to average survival rates for each (Sex, Pclass) combination. This directly addresses our hypotheses that women survived more often than men (H1) and that first-class passengers fared better than those in lower classes (H2).

df_sex_class <- train2 %>%
  group_by(Sex, Pclass) %>%
  summarize(survival_rate = mean(as.numeric(as.character(Survived))),
            .groups = "drop")

ggplot(df_sex_class, aes(x = Pclass, y = survival_rate, fill = Sex)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal() +
  labs(
    title = "Titanic Survival Rate by Sex & Class",
    x = "Passenger Class",
    y = "Survival Rate"
  )

Nearly all first-class women survived (around 97%), while men hovered near 35% even in first class. The steep decline from class 1 to class 3 across both sexes supports both (H1) and (H2).

3.2 Age Distribution by Survival

To check the “children first” protocol, we overlay histograms of age for survivors versus non-survivors. This visualization helps reveal whether younger passengers truly had a disproportionate advantage.

ggplot(train2, aes(x = Age, fill = Survived)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  scale_fill_manual(values = c("steelblue", "tomato")) +
  theme_economist() +
  labs(
    title = "Age Distribution by Survival Status",
    x = "Age",
    y = "Count"
  )

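A clear spike of survivors appears between ages 20 and 25, while non-survivors dominate the mid-adult range. Read this cautiously: the bars are raw counts rather than survival rates, and the 177 imputed ages all sit on six group medians, which can exaggerate spikes. The plot alone therefore does not establish that children or young adults reached the lifeboats first; the negative Age coefficient in Section 4.2 is a more direct check.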
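One way to test “children first” more directly than with overlaid counts is to compare survival rates across age bands. The sketch below assumes a hypothetical toy data frame with a numeric 0/1 `Survived` column (in train2 it is a factor and would need converting first); the band boundaries and labels are illustrative choices, not part of the original analysis.

```r
# Hypothetical ages and outcomes, invented for illustration only
toy <- data.frame(
  Age      = c(2, 8, 15, 22, 24, 30, 41, 55, 63, 70),
  Survived = c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
)

# Left-closed bands: [0,12), [12,18), [18,40), [40,60), [60,Inf)
toy$AgeBand <- cut(
  toy$Age,
  breaks = c(0, 12, 18, 40, 60, Inf),
  labels = c("Child", "Teen", "Adult", "Middle-aged", "Senior"),
  right  = FALSE
)

# Share of survivors within each band (a rate, not a count)
rate_by_band <- tapply(toy$Survived, toy$AgeBand, mean)
print(round(rate_by_band, 2))
```

Because each band's rate is normalized by its own size, a large band of young adults no longer looks like an advantage simply because it contains more people.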

3.3 Survival Rate by Family Size

Finally, we bucket FamilySize into “Alone,” “Small” (2–4 people), and “Large” (5+). This checks whether traveling in a small family cluster offered better odds than going solo or being part of a very large group (H3).

family_df <- train2 %>%
  mutate(FamilyGroup = case_when(
    FamilySize == 1 ~ "Alone",
    FamilySize <= 4 ~ "Small",
    TRUE            ~ "Large"
  )) %>%
  group_by(FamilyGroup) %>%
  summarize(survival_rate = mean(as.numeric(as.character(Survived))),
            .groups = "drop")

ggplot(family_df, aes(x = FamilyGroup, y = survival_rate, fill = FamilyGroup)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal() +
  labs(
    title = "Titanic Survival Rate by Family Group",
    x = "Family Group",
    y = "Survival Rate"
  )

Small family units peak at around a 60% survival rate, while passengers traveling alone or in very large groups fare worse, which is consistent with (H3).

4. Logistic Regression Modeling

4.1 Train/Test Split

We carve out 80% of train2 for fitting and hold back 20% for an honest evaluation, using createDataPartition() to keep the same survivor proportion in both sets. This keeps our accuracy assessment fair.

set.seed(2025)
split_idx <- createDataPartition(train2$Survived, p = 0.8, list = FALSE)
mod_train <- train2[split_idx, ]
mod_test  <- train2[-split_idx, ]

# Check proportions
prop.table(table(mod_train$Survived))
## 
##         0         1 
## 0.6162465 0.3837535
prop.table(table(mod_test$Survived))
## 
##         0         1 
## 0.6158192 0.3841808

Both subsets remain at roughly 38% survivors, so our model will not be biased by a lopsided split.
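The idea behind a stratified split can be illustrated in a few lines of base R: sample the same fraction of rows within each outcome class, then pool the results. This is only a sketch of the principle on a hypothetical outcome vector `y`; it is not a reimplementation of createDataPartition(), whose internals may differ.

```r
set.seed(2025)

# Hypothetical outcome: 60 non-survivors ("0") and 40 survivors ("1")
y <- factor(rep(c("0", "1"), times = c(60, 40)))

# Row indices grouped by class, then 80% sampled inside each class
rows_by_class <- split(seq_along(y), y)
train_idx <- unlist(
  lapply(rows_by_class, function(rows) {
    sample(rows, size = round(0.8 * length(rows)))
  }),
  use.names = FALSE
)

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Both subsets keep the original 40% survivor share
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```

A plain `sample()` over all rows would only match these proportions on average, which is why stratification matters more as the hold-out set gets smaller.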

4.2 Fit the Model

With our engineering complete, we fit a logistic regression on mod_train, predicting Survived from all factors plus continuous variables like Age and Fare. The coefficient signs and p-values will tell us which variables wield the strongest influence.

glm_fit <- glm(
  Survived ~ Sex + Pclass + Age + Fare + Embarked + Deck + Title + FamilySize,
  data   = mod_train,
  family = binomial(link = "logit")
)
summary(glm_fit)
## 
## Call:
## glm(formula = Survived ~ Sex + Pclass + Age + Fare + Embarked + 
##     Deck + Title + FamilySize, family = binomial(link = "logit"), 
##     data = mod_train)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    2.610e+00  3.393e+03   0.001  0.99939    
## Sexmale       -1.562e+01  2.400e+03  -0.007  0.99481    
## Pclass2       -3.951e-01  5.438e-01  -0.727  0.46747    
## Pclass3       -1.493e+00  5.461e-01  -2.733  0.00627 ** 
## Age           -3.414e-02  1.141e-02  -2.992  0.00277 ** 
## Fare           3.276e-03  3.153e-03   1.039  0.29877    
## EmbarkedQ     -1.522e-02  4.396e-01  -0.035  0.97238    
## EmbarkedS     -4.345e-01  2.905e-01  -1.496  0.13477    
## DeckB          5.191e-01  8.615e-01   0.603  0.54682    
## DeckC         -2.784e-01  7.888e-01  -0.353  0.72410    
## DeckD          1.120e+00  9.441e-01   1.186  0.23568    
## DeckE          1.652e+00  8.750e-01   1.888  0.05904 .  
## DeckF          1.516e-01  1.167e+00   0.130  0.89667    
## DeckG         -1.866e+00  1.497e+00  -1.247  0.21253    
## DeckT         -1.572e+01  2.400e+03  -0.007  0.99477    
## DeckU         -6.696e-01  7.977e-01  -0.839  0.40127    
## TitleCol       1.593e+01  2.400e+03   0.007  0.99470    
## TitleDon      -1.117e+00  3.393e+03   0.000  0.99974    
## TitleDr        1.535e+01  2.400e+03   0.006  0.99489    
## TitleJonkheer -6.602e-01  3.393e+03   0.000  0.99984    
## TitleLady      1.645e+01  4.156e+03   0.004  0.99684    
## TitleMajor     1.538e+01  2.400e+03   0.006  0.99489    
## TitleMaster    1.763e+01  2.400e+03   0.007  0.99414    
## TitleMiss      1.850e+00  3.393e+03   0.001  0.99957    
## TitleMr        1.451e+01  2.400e+03   0.006  0.99517    
## TitleMrs       2.772e+00  3.393e+03   0.001  0.99935    
## TitleRev      -9.307e-02  2.621e+03   0.000  0.99997    
## TitleSir       3.205e+01  3.393e+03   0.009  0.99246    
## FamilySize    -4.946e-01  9.662e-02  -5.119 3.08e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 950.86  on 713  degrees of freedom
## Residual deviance: 556.48  on 685  degrees of freedom
## AIC: 614.48
## 
## Number of Fisher Scoring iterations: 15

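Only three terms are statistically significant at the 5% level: third class (Pclass3, -1.49), Age (-0.034 per year), and FamilySize (-0.49 per additional member), all of which lower the log-odds of survival. Fare is not significant once class is in the model (p ≈ 0.30). The Sex and Title coefficients cannot be interpreted individually: their standard errors are in the thousands and their p-values are all close to 1, the signature of (quasi-)complete separation. Title almost fully determines Sex and several titles occur only a handful of times, which is likely also why Fisher scoring needed 15 iterations. The sex effect seen in the EDA is therefore spread across the Sex and Title terms rather than absent.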
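Log-odds coefficients are easier to read once exponentiated into odds ratios. The sketch below copies three estimates by hand from the summary above into a vector named `log_odds` (a name introduced here for illustration); with the fitted object available, `exp(coef(glm_fit))` would give the same quantities for every term.

```r
# Estimates transcribed from the summary(glm_fit) output above
log_odds <- c(Pclass3 = -1.493, Age = -0.03414, FamilySize = -0.4946)

# exp() turns an additive change in log-odds into a multiplicative change in odds
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 3))

# Effects compound multiplicatively: odds multiplier for being ten years older
ten_year_ratio <- exp(10 * log_odds[["Age"]])
print(round(ten_year_ratio, 3))
```

An odds ratio below 1 means lower odds of survival: holding the other predictors fixed, third class multiplies the odds by roughly 0.22 relative to first class, and each additional family member by roughly 0.61.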

4.3 Model Performance

To evaluate, we align the Title factor levels between train and test, predict survival probabilities on mod_test, threshold at 0.5, and then compute both Accuracy and Cohen’s Kappa. Kappa measures how far the agreement between predictions and truth exceeds what chance alone would produce given the class proportions.

# Pull the training-set Title levels out
train_levels <- levels(mod_train$Title)

# Re-factor the test set; any title unseen in training becomes NA
mod_test$Title <- factor(mod_test$Title, levels = train_levels)

# Predict probabilities and classes (levels fixed so both classes are always present)
test_prob <- predict(glm_fit, newdata = mod_test, type = "response")
test_pred <- factor(if_else(test_prob > 0.5, "1", "0"),
                    levels = levels(mod_test$Survived))

# Confusion matrix & metrics
conf_mat <- confusionMatrix(test_pred, mod_test$Survived)
conf_mat$overall[c("Accuracy", "Kappa")]
##  Accuracy     Kappa 
## 0.8181818 0.6187898

An accuracy of about 82% with a Kappa near 0.62 shows that the model captures genuine structure: it is well above the 62% baseline of always guessing “did not survive.”
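Both metrics can be computed by hand from a 2×2 confusion matrix, which makes the comparison with the majority-class baseline explicit. The counts in `cm` below are hypothetical, chosen only to resemble a 177-row hold-out set; they are not the actual confusion matrix from this analysis.

```r
# Hypothetical confusion matrix (rows = predicted, columns = actual)
cm <- matrix(
  c(95, 14,   # actual 0: predicted 0, predicted 1
    18, 50),  # actual 1: predicted 0, predicted 1
  nrow = 2,
  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1"))
)

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column marginals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: observed agreement beyond chance, rescaled to a maximum of 1
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Baseline: always predict the most common actual class
baseline <- max(colSums(cm)) / n

print(round(c(accuracy = accuracy, kappa = cohen_kappa, baseline = baseline), 3))
```

Kappa is lower than accuracy because a classifier that ignores the features would already agree with the truth a fair share of the time on imbalanced classes.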

5. Conclusions

Hypothesis results:

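The exploratory analysis supports all three hypotheses: women survived more often than men (H1), first-class passengers fared best (H2), and small families enjoyed the highest survival rates (H3). The logistic regression backs the class effect and shows survival odds falling as family size grows, but its Sex and Title coefficients are too unstable to quantify the sex effect on their own.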

Appendix

sessionInfo()
## R version 4.5.0 (2025-04-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Phoenix
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggthemes_5.1.0  caret_7.0-1     lattice_0.22-6  skimr_2.1.5    
##  [5] lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
##  [9] purrr_1.0.4     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
## [13] ggplot2_3.5.2   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     timeDate_4041.110    farver_2.1.2        
##  [4] fastmap_1.2.0        pROC_1.18.5          digest_0.6.37       
##  [7] rpart_4.1.24         timechange_0.3.0     lifecycle_1.0.4     
## [10] survival_3.8-3       magrittr_2.0.3       compiler_4.5.0      
## [13] rlang_1.1.6          sass_0.4.10          tools_4.5.0         
## [16] yaml_2.3.10          data.table_1.17.4    knitr_1.50          
## [19] labeling_0.4.3       bit_4.6.0            plyr_1.8.9          
## [22] repr_1.1.7           RColorBrewer_1.1-3   withr_3.0.2         
## [25] nnet_7.3-20          grid_4.5.0           stats4_4.5.0        
## [28] e1071_1.7-16         future_1.58.0        globals_0.18.0      
## [31] scales_1.4.0         iterators_1.0.14     MASS_7.3-65         
## [34] cli_3.6.5            rmarkdown_2.29       crayon_1.5.3        
## [37] generics_0.1.4       rstudioapi_0.17.1    future.apply_1.20.0 
## [40] reshape2_1.4.4       tzdb_0.5.0           proxy_0.4-27        
## [43] cachem_1.1.0         splines_4.5.0        parallel_4.5.0      
## [46] base64enc_0.1-3      vctrs_0.6.5          hardhat_1.4.1       
## [49] Matrix_1.7-3         jsonlite_2.0.0       hms_1.1.3           
## [52] bit64_4.6.0-1        listenv_0.9.1        foreach_1.5.2       
## [55] gower_1.0.2          jquerylib_0.1.4      recipes_1.3.1       
## [58] glue_1.8.0           parallelly_1.45.0    codetools_0.2-20    
## [61] stringi_1.8.7        gtable_0.3.6         pillar_1.10.2       
## [64] htmltools_0.5.8.1    ipred_0.9-15         lava_1.8.1          
## [67] R6_2.6.1             vroom_1.6.5          evaluate_1.0.3      
## [70] bslib_0.9.0          class_7.3-23         Rcpp_1.0.14         
## [73] nlme_3.1-168         prodlim_2025.04.28   xfun_0.52           
## [76] pkgconfig_2.0.3      ModelMetrics_1.2.2.2