Cole Campbell
2025-06-12
On April 10, 1912, the White Star Line’s RMS Titanic, then the largest moving man-made object on earth, set out from Southampton, England, on her maiden voyage to New York City. Billed as the pinnacle of maritime luxury and “practically unsinkable,” she carried 2,224 passengers and crew. Late on April 14 she struck an iceberg, and in the early hours of April 15 she foundered, resulting in the loss of 1,517 lives. The tragedy exposed the inequalities of early-20th-century society: “women and children first” governed the lifeboats, but so did first-class privilege, and inadequate lifeboats and emergency planning compounded the losses.
The Titanic dataset distills the disaster into tabular form. Each row represents a passenger, and the key fields are:
Survived: 0 = No, 1 = Yes (our target)
Pclass: Passenger class (1-3)
Sex, Age: Demographic attributes
SibSp, Parch: Number of siblings/spouses & parents/children aboard
Fare, Embarked: Ticket price & port of embarkation
Cabin, Name, Ticket: Richer metadata
By structuring that manifest into a modern data frame, we can interrogate who survived, and why, in a way that blends history, ethics, and predictive modeling.
To turn raw passenger records into insights, we will:
Clean & Impute: fill missing ages by group medians; extract titles (Mr., Miss.) and cabin decks; engineer a ‘FamilySize’ feature.
Exploratory Data Analysis:
Compare survival rates by sex and class (tests H1 and H2).
Examine age distributions to capture “children first.”
Measure the impact of traveling alone vs. in a family (tests H3).
Fit a logistic regression to quantify each factor’s effect on survival.
Evaluate with an 80/20 train/test split and report accuracy & Cohen’s Kappa.
Before diving into transformations, we read both CSV files and immediately call skim() to profile row counts, data types, and missingness. This ensures we know exactly how many NAs we’ll need to handle in Age, Cabin, and other fields before we start any downstream steps.
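library(tidyverse)  # read_csv(), dplyr verbs, stringr helpers, ggplot2
library(skimr)      # skim()
library(caret)      # createDataPartition(), confusionMatrix()
library(ggthemes)   # theme_economist()

# Read raw training and test sets
train <- read_csv("train.csv")
test <- read_csv("test.csv")

# Inspect structure and missingness
skim(train)

| Name | train |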
| Number of rows | 891 |
| Number of columns | 12 |
| ______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Name | 0 | 1.00 | 12 | 82 | 0 | 891 | 0 |
| Sex | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| Ticket | 0 | 1.00 | 3 | 18 | 0 | 681 | 0 |
| Cabin | 687 | 0.23 | 1 | 15 | 0 | 147 | 0 |
| Embarked | 2 | 1.00 | 1 | 1 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 0 | 1.0 | 446.00 | 257.35 | 1.00 | 223.50 | 446.00 | 668.5 | 891.00 | ▇▇▇▇▇ |
| Survived | 0 | 1.0 | 0.38 | 0.49 | 0.00 | 0.00 | 0.00 | 1.0 | 1.00 | ▇▁▁▁▅ |
| Pclass | 0 | 1.0 | 2.31 | 0.84 | 1.00 | 2.00 | 3.00 | 3.0 | 3.00 | ▃▁▃▁▇ |
| Age | 177 | 0.8 | 29.70 | 14.53 | 0.42 | 20.12 | 28.00 | 38.0 | 80.00 | ▂▇▅▂▁ |
| SibSp | 0 | 1.0 | 0.52 | 1.10 | 0.00 | 0.00 | 0.00 | 1.0 | 8.00 | ▇▁▁▁▁ |
| Parch | 0 | 1.0 | 0.38 | 0.81 | 0.00 | 0.00 | 0.00 | 0.0 | 6.00 | ▇▁▁▁▁ |
| Fare | 0 | 1.0 | 32.20 | 49.69 | 0.00 | 7.91 | 14.45 | 31.0 | 512.33 | ▇▁▁▁▁ |
After skimming, we see 891 observations, with roughly 20% of ages missing and over three-quarters of cabin entries blank—information that will guide our imputation and feature-engineering strategy.
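The shares quoted above follow directly from the counts in the skim() tables. As a minimal check in base R, using only the row count and n_missing values reported there (the variable names are ours, chosen for illustration):

```r
# Counts copied from the skim() output above
n_rows    <- 891
n_missing <- c(Age = 177, Cabin = 687, Embarked = 2)

# Share of rows with a missing value, per column
missing_share <- n_missing / n_rows
round(missing_share, 3)
```

Age comes out just under 20% and Cabin at about 77%, which matches the "roughly 20%" and "over three-quarters" in the text.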
In the next chunk we systematically turn raw strings and blanks into model-ready features. We fill the two missing Embarked values with “S” (the most common port), shrink each Cabin to its deck letter, defaulting to “U” (Unknown), and extract passenger titles from names, remapping variants like “Mlle” to familiar labels. By grouping on (Pclass, Sex) to impute missing ages with local medians, we preserve demographic trends rather than distort them with a single global median. Finally, we calculate FamilySize as the sum of siblings/spouses and parents/children plus one, convert every categorical variable into a factor, and drop the raw text columns that have served their purpose.
train2 <- train %>%
  # Fill missing Embarked with most common port 'S'
  mutate(Embarked = replace_na(Embarked, "S")) %>%
  # Extract deck letter from Cabin; missing → 'U'
  mutate(Deck = ifelse(is.na(Cabin), "U", str_sub(Cabin, 1, 1))) %>%
  # Parse Title from Name (Mr., Mrs., Miss., etc.)
  mutate(
    Title = str_extract(Name, "(?<=, )[A-Za-z]+\\."),
    Title = str_remove(Title, "\\."),
    Title = recode(Title, "Ms" = "Miss", "Mlle" = "Miss", "Mme" = "Mrs", .default = Title)
  ) %>%
  # Impute missing Age by median within Sex × Pclass groups
  group_by(Pclass, Sex) %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  ungroup() %>%
  # Engineer FamilySize and convert key columns to factors
  mutate(
    FamilySize = SibSp + Parch + 1,
    Survived = factor(Survived),
    Pclass = factor(Pclass),
    Sex = factor(Sex),
    Embarked = factor(Embarked),
    Deck = factor(Deck),
    Title = factor(Title)
  ) %>%
  # Drop unused columns
  select(-Name, -Ticket, -Cabin)

Now train2 has no missing Embarked or Age values, and each passenger carries a Deck and Title encoding of social cues. Imputing age by class and sex retains key demographic structure, while creating FamilySize lets us later test whether traveling alone or in a small group altered survival odds.
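To make the grouped imputation concrete, here is a small sketch on made-up rows. It uses base R's ave() rather than the dplyr pipeline, and the toy data frame plus the AgeImputed and WasImputed columns are hypothetical names introduced only for this illustration.

```r
# Hypothetical toy data, not rows from the Titanic file
toy <- data.frame(
  Pclass = c(1, 1, 1, 3, 3, 3),
  Sex    = c("female", "female", "female", "male", "male", "male"),
  Age    = c(30, NA, 40, 20, 24, NA)
)

# Median of the observed ages within each Pclass x Sex group
group_median <- ave(toy$Age, toy$Pclass, toy$Sex,
                    FUN = function(x) median(x, na.rm = TRUE))

# Keep observed ages; fill only the gaps with the group median
toy$AgeImputed <- ifelse(is.na(toy$Age), group_median, toy$Age)

# Flag which rows were filled, so later plots can separate them
toy$WasImputed <- is.na(toy$Age)
toy
```

Keeping a flag such as WasImputed is optional, but it makes it easy to check later whether a pattern in the age distribution comes from observed or imputed values.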
We begin our EDA by collapsing the data to average survival rates for each (Sex, Pclass) combination. This directly addresses our hypotheses that women survived more often than men (H1) and that first-class passengers fared better than those in lower classes (H2).
df_sex_class <- train2 %>%
  group_by(Sex, Pclass) %>%
  summarize(survival_rate = mean(as.numeric(as.character(Survived))), .groups = "drop")

ggplot(df_sex_class, aes(x = Pclass, y = survival_rate, fill = Sex)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal() +
  labs(
    title = "Titanic Survival Rate by Sex & Class",
    x = "Passenger Class",
    y = "Survival Rate"
  )
Here we see that nearly all first-class women survived (around 97%), while men, even in first class, hovered near 35%. The steep decline from class 1 to class 3 across both sexes supports both H1 and H2.
To check the “children first” protocol, we overlay histograms of age for survivors versus non-survivors. This visualization helps reveal whether younger passengers truly had a disproportionate advantage.
ggplot(train2, aes(x = Age, fill = Survived)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  scale_fill_manual(values = c("steelblue", "tomato")) +
  theme_economist() +
  labs(
    title = "Age Distribution by Survival Status",
    x = "Age",
    y = "Count"
  )
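The histogram shows a clear spike of survivors between ages 20 and 25, while non-survivors dominate the mid-adult range. Part of that spike may come from the imputation step, which places all 177 missing ages at no more than six group medians, so the plot suggests an advantage for younger passengers but does not by itself prove a “children first” effect.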
Finally, we bucket FamilySize into “Alone,” “Small” (2–4 people), and “Large” (5+). This checks whether traveling in a small family cluster offered better odds than going solo or being part of a very large group (H3).
family_df <- train2 %>%
  mutate(FamilyGroup = case_when(
    FamilySize == 1 ~ "Alone",
    FamilySize <= 4 ~ "Small",
    TRUE ~ "Large"
  )) %>%
  group_by(FamilyGroup) %>%
  summarize(survival_rate = mean(as.numeric(as.character(Survived))), .groups = "drop")

ggplot(family_df, aes(x = FamilyGroup, y = survival_rate, fill = FamilyGroup)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal() +
  labs(
    title = "Titanic Survival Rate by Family Group",
    x = "Family Group",
    y = "Survival Rate"
  )
Small family units indeed peak around a 60% survival rate, while those traveling alone or in very large groups fare worse, which is consistent with H3.
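The same three buckets can also be built with base R's cut(), which returns a factor whose levels keep the order Alone, Small, Large. This is an alternative sketch rather than the approach used in this report, and the family_size values are invented for illustration.

```r
# Hypothetical family sizes (SibSp + Parch + 1), not taken from the data
family_size <- c(1, 2, 4, 5, 11)

# Right-closed intervals: (0,1] = Alone, (1,4] = Small, (4,Inf] = Large
family_group <- cut(
  family_size,
  breaks = c(0, 1, 4, Inf),
  labels = c("Alone", "Small", "Large")
)
table(family_group)
```

Because case_when() yields a character column, ggplot2 orders the bars alphabetically (Alone, Large, Small); a factor with explicit level order, as produced here, would plot the groups in size order instead.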
We carve out 80% of train2 for fitting and hold back 20% for an honest evaluation, using createDataPartition() to keep the same survivor proportion in both sets. This keeps our accuracy assessment fair.
set.seed(2025)
split_idx <- createDataPartition(train2$Survived, p = 0.8, list = FALSE)
mod_train <- train2[split_idx, ]
mod_test <- train2[-split_idx, ]

# Check proportions in both subsets
prop.table(table(mod_train$Survived))
##
##         0         1
## 0.6162465 0.3837535
prop.table(table(mod_test$Survived))
##
##         0         1
## 0.6158192 0.3841808
Both subsets remain at roughly 38% survivors, so our model will not be biased by a lopsided split.
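The idea behind a stratified split can be sketched in a few lines of base R: sample 80% of the row indices within each outcome class separately, then combine them. This is a simplified illustration of the principle, not caret's actual implementation, and the outcome vector below is hypothetical.

```r
set.seed(2025)

# Hypothetical outcome vector with roughly the same 62/38 balance
outcome <- factor(c(rep("0", 62), rep("1", 38)))

# Sample 80% of the row indices within each class separately
idx_by_class <- split(seq_along(outcome), outcome)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

train_y <- outcome[train_idx]
test_y  <- outcome[-train_idx]

# Class shares stay close to the original balance in both parts
prop.table(table(train_y))
prop.table(table(test_y))
```

Sampling within each class is what keeps the survivor share nearly identical across the two subsets, whereas a plain random split can drift by a few percentage points on a dataset this small.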
With our engineering complete, we fit a logistic regression on mod_train, predicting Survived from all factors plus continuous variables like Age and Fare. The coefficient signs and p-values will tell us which variables wield the strongest influence.
glm_fit <- glm(
Survived ~ Sex + Pclass + Age + Fare + Embarked + Deck + Title + FamilySize,
data = mod_train,
family = binomial(link = "logit")
)
summary(glm_fit)
##
## Call:
## glm(formula = Survived ~ Sex + Pclass + Age + Fare + Embarked +
## Deck + Title + FamilySize, family = binomial(link = "logit"),
## data = mod_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.610e+00 3.393e+03 0.001 0.99939
## Sexmale -1.562e+01 2.400e+03 -0.007 0.99481
## Pclass2 -3.951e-01 5.438e-01 -0.727 0.46747
## Pclass3 -1.493e+00 5.461e-01 -2.733 0.00627 **
## Age -3.414e-02 1.141e-02 -2.992 0.00277 **
## Fare 3.276e-03 3.153e-03 1.039 0.29877
## EmbarkedQ -1.522e-02 4.396e-01 -0.035 0.97238
## EmbarkedS -4.345e-01 2.905e-01 -1.496 0.13477
## DeckB 5.191e-01 8.615e-01 0.603 0.54682
## DeckC -2.784e-01 7.888e-01 -0.353 0.72410
## DeckD 1.120e+00 9.441e-01 1.186 0.23568
## DeckE 1.652e+00 8.750e-01 1.888 0.05904 .
## DeckF 1.516e-01 1.167e+00 0.130 0.89667
## DeckG -1.866e+00 1.497e+00 -1.247 0.21253
## DeckT -1.572e+01 2.400e+03 -0.007 0.99477
## DeckU -6.696e-01 7.977e-01 -0.839 0.40127
## TitleCol 1.593e+01 2.400e+03 0.007 0.99470
## TitleDon -1.117e+00 3.393e+03 0.000 0.99974
## TitleDr 1.535e+01 2.400e+03 0.006 0.99489
## TitleJonkheer -6.602e-01 3.393e+03 0.000 0.99984
## TitleLady 1.645e+01 4.156e+03 0.004 0.99684
## TitleMajor 1.538e+01 2.400e+03 0.006 0.99489
## TitleMaster 1.763e+01 2.400e+03 0.007 0.99414
## TitleMiss 1.850e+00 3.393e+03 0.001 0.99957
## TitleMr 1.451e+01 2.400e+03 0.006 0.99517
## TitleMrs 2.772e+00 3.393e+03 0.001 0.99935
## TitleRev -9.307e-02 2.621e+03 0.000 0.99997
## TitleSir 3.205e+01 3.393e+03 0.009 0.99246
## FamilySize -4.946e-01 9.662e-02 -5.119 3.08e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 950.86 on 713 degrees of freedom
## Residual deviance: 556.48 on 685 degrees of freedom
## AIC: 614.48
##
## Number of Fisher Scoring iterations: 15
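Only three terms in the output are clearly distinguishable from zero: third class (estimate -1.49, p = 0.006), age (-0.034 per year, p = 0.003), and family size (-0.49 per additional member, p < 0.001). The Sexmale coefficient is large and negative, as expected, but its standard error is in the thousands and its p-value is near 1, as is the case for every Title term. That pattern, together with the 15 Fisher scoring iterations, suggests that Sex and Title are nearly collinear (most titles imply a sex) and that some rare titles perfectly predict the outcome, so this fit cannot quantify the sex or title effects reliably. Fare is not significant here (p = 0.30), plausibly because class already carries most of the same information.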
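Log-odds coefficients are easier to read as odds ratios. The sketch below exponentiates three estimates copied from the summary(glm_fit) output above; the vector names are ours, and the calculation assumes the other terms in the model are held fixed.

```r
# Coefficients copied from the summary(glm_fit) output above
log_odds <- c(Pclass3 = -1.493, Age = -0.03414, FamilySize = -0.4946)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratio <- exp(log_odds)
round(odds_ratio, 3)

# Effect of a 10-year age difference, holding other terms fixed
exp(10 * log_odds[["Age"]])
```

Read this way, third-class passengers have roughly 0.22 times the survival odds of first-class passengers, and each additional family member multiplies the odds by about 0.61.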
To evaluate, we align factor levels between train and test, predict survival probabilities on mod_test, threshold at 0.5, and then compute both Accuracy and Cohen’s Kappa. Kappa measures how far the agreement between predictions and truth exceeds what the class frequencies alone would produce by chance.
# Pull the training-set Title levels out
train_levels <- levels(mod_train$Title)
# Re-factor the test set so Title uses exactly the training levels
mod_test$Title <- factor(mod_test$Title, levels = train_levels)
# Predict probabilities and classes
test_prob <- predict(glm_fit, newdata = mod_test, type = "response")
test_pred <- factor(if_else(test_prob > 0.5, "1", "0"), levels = levels(mod_test$Survived))
# Confusion matrix & metrics
conf_mat <- confusionMatrix(test_pred, mod_test$Survived)
conf_mat$overall[c("Accuracy", "Kappa")]
##  Accuracy     Kappa
## 0.8181818 0.6187898
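Achieving about 82% accuracy with a Kappa of 0.62 shows that our model captures genuine structure, well above the 62% baseline of always guessing “did not survive.” An accuracy of 0.8181818 corresponds to 144 correct out of 176, so one of the 177 held-out passengers appears to have received no prediction, most likely because its Title was missing or absent from the training levels.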
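For readers who want to see how Kappa is derived, it can be computed by hand from any confusion matrix as (observed agreement minus chance agreement) divided by (1 minus chance agreement). The helper cohen_kappa() and the counts below are hypothetical and serve only to illustrate the formula; they are not this report's confusion matrix.

```r
# Cohen's Kappa from a square confusion matrix (rows = predicted, cols = actual)
cohen_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                     # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Hypothetical counts for illustration only
cm <- matrix(c(40, 10,
                5, 45),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))
cohen_kappa(cm)
```

In this toy matrix the accuracy is 0.85 and chance agreement is 0.50, giving a Kappa of 0.70. The same arithmetic explains why a Kappa near 0.6 is meaningful even though accuracy alone looks only moderately above the majority-class baseline.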
All three hypotheses hold: women survived more often, first-class passengers fared best, and small families enjoyed the highest survival rates.
## R version 4.5.0 (2025-04-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Phoenix
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggthemes_5.1.0 caret_7.0-1 lattice_0.22-6 skimr_2.1.5
## [5] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [9] purrr_1.0.4 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [13] ggplot2_3.5.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4041.110 farver_2.1.2
## [4] fastmap_1.2.0 pROC_1.18.5 digest_0.6.37
## [7] rpart_4.1.24 timechange_0.3.0 lifecycle_1.0.4
## [10] survival_3.8-3 magrittr_2.0.3 compiler_4.5.0
## [13] rlang_1.1.6 sass_0.4.10 tools_4.5.0
## [16] yaml_2.3.10 data.table_1.17.4 knitr_1.50
## [19] labeling_0.4.3 bit_4.6.0 plyr_1.8.9
## [22] repr_1.1.7 RColorBrewer_1.1-3 withr_3.0.2
## [25] nnet_7.3-20 grid_4.5.0 stats4_4.5.0
## [28] e1071_1.7-16 future_1.58.0 globals_0.18.0
## [31] scales_1.4.0 iterators_1.0.14 MASS_7.3-65
## [34] cli_3.6.5 rmarkdown_2.29 crayon_1.5.3
## [37] generics_0.1.4 rstudioapi_0.17.1 future.apply_1.20.0
## [40] reshape2_1.4.4 tzdb_0.5.0 proxy_0.4-27
## [43] cachem_1.1.0 splines_4.5.0 parallel_4.5.0
## [46] base64enc_0.1-3 vctrs_0.6.5 hardhat_1.4.1
## [49] Matrix_1.7-3 jsonlite_2.0.0 hms_1.1.3
## [52] bit64_4.6.0-1 listenv_0.9.1 foreach_1.5.2
## [55] gower_1.0.2 jquerylib_0.1.4 recipes_1.3.1
## [58] glue_1.8.0 parallelly_1.45.0 codetools_0.2-20
## [61] stringi_1.8.7 gtable_0.3.6 pillar_1.10.2
## [64] htmltools_0.5.8.1 ipred_0.9-15 lava_1.8.1
## [67] R6_2.6.1 vroom_1.6.5 evaluate_1.0.3
## [70] bslib_0.9.0 class_7.3-23 Rcpp_1.0.14
## [73] nlme_3.1-168 prodlim_2025.04.28 xfun_0.52
## [76] pkgconfig_2.0.3 ModelMetrics_1.2.2.2