The sinking of the Titanic remains one of the most famous disasters in history. Of the roughly 2,200 passengers aboard its maiden voyage just over 700 would survive—about one third. The movie inspired by the disaster is a fictional story about the plebeian Jack, played by Leonardo DiCaprio, falling in love with the aristocratic Rose, played by Kate Winslet. While their characters and romance are fictional the stark presence of socioeconomic classes aboard the ship is not.
In an assignment for my Causal Inference class we use a dataset regarding Titanic passengers to examine the effects of socioeconomic class on survival. Were first class passengers more likely to survive?
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(magrittr) # for %$% pipes
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract() masks magrittr::extract()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::set_names() masks magrittr::set_names()
library(haven)
library(RTextTools)
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
##
## The following object is masked from 'package:base':
##
## backsolve
This is simply a function written to pull in the dataset. The assignment was initially done in Stata thus the .dta file type. The main question at hand here revolves around class. As such we immediately setup the class variable as a dummy variable where 1 = first class and 0 = not first class both denoted as d.
read_data <- function(df)
{
full_path <- paste("https://raw.github.com/scunning1975/mixtape/master/",
df, sep = "")
df <- read_dta(full_path)
return(df)
}
titanic <- read_data("titanic.dta") %>%
mutate(d = case_when(class == 1 ~ 1, TRUE ~ 0))
head(titanic)
## # A tibble: 6 × 5
## class age sex survived d
## <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl>
## 1 1 [1st class] 1 [adults] 1 [man] 1 [yes] 1
## 2 1 [1st class] 1 [adults] 1 [man] 1 [yes] 1
## 3 1 [1st class] 1 [adults] 1 [man] 1 [yes] 1
## 4 1 [1st class] 1 [adults] 1 [man] 1 [yes] 1
## 5 1 [1st class] 1 [adults] 1 [man] 1 [yes] 1
## 6 1 [1st class] 1 [adults] 1 [man] 1 [yes] 1
By filtering for first class passengers who survived and getting the mean of the survived variable which is binary we are essentially getting the survival rate of first class passenger.
ey1 -> survival rate of first class passengers -> ~62%
Likewise ey0 follow the same methodology to calculate the survival rate of passengers who were not first class.
ey0 -> survival rate of non-first class passengers -> ~27%
A simple difference in outcome (sdo) can then be calculated. First class passengers were more than twice as likely to survive with a survival rate over 35 points higher.
Is this simple difference in outcomes enough to prove causation between first class and survival and quantify the average treatment effect? No. Like most things in this world there are other variables which could be affecting survival rates across classes. Based off the data here age and sex could be possible confounders.
#Simple Difference in Outcome
ey1 <- titanic %>%
filter(d == 1) %>%
pull(survived) %>%
mean()
ey1
## [1] 0.6246154
ey0 <- titanic %>%
filter(d == 0) %>%
pull(survived) %>%
mean()
ey0
## [1] 0.2707889
sdo <- ey1 - ey0
sdo
## [1] 0.3538265
Here we partition all passengers into four groups denoted as s
Female Adults
Female Children
Male Adults
Male Children
titanic %<>%
mutate(s = case_when(sex == 0 & age == 1 ~ 1,
sex == 0 & age == 0 ~ 2,
sex == 1 & age == 1 ~ 3,
sex == 1 & age == 0 ~ 4,
TRUE ~ 0))
head(titanic)
## # A tibble: 6 × 6
## class age sex survived d s
## <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl>
## 1 1 [1st class] 1 [adults] 1 [man] 1 [yes] 1 3
## 2 1 [1st class] 1 [adults] 1 [man] 1 [yes] 1 3
## 3 1 [1st class] 1 [adults] 1 [man] 1 [yes] 1 3
## 4 1 [1st class] 1 [adults] 1 [man] 1 [yes] 1 3
## 5 1 [1st class] 1 [adults] 1 [man] 1 [yes] 1 3
## 6 1 [1st class] 1 [adults] 1 [man] 1 [yes] 1 3
Having grouped passengers in into one of four groups, s, we can now use the class dummy variable setup, d, to get the survival rates for each of the groups. In doing so we are splitting the four groups we just made into 8 subgroups. Since survived is a dummy variable of either 1 or 0 we can get the mean of survived in each subgroup to determine survival rates.
| Group | Survival Rate (mean of survived as %) |
|---|---|
| First Class Female Adult (ey11) | 97.2% |
| Lower Class Female Adult (ey10) | 62.6% |
| First Class Female Child (ey21) | 100% |
| Lower Class Female Child (ey20) | 61.4% |
| First Class Male Adult (ey31) | 32.6% |
| Lower Class Male Adult (ey30) | 18.8% |
| First Class Male Children (ey41) | 100% |
| Lower Class Male Children (ey40) | 40.6% |
#Subclassification
ey11 <- titanic %>%
filter(s == 1 & d == 1) %$%
mean(survived)
ey11
## [1] 0.9722222
ey10 <- titanic %>%
filter(s == 1 & d == 0) %$%
mean(survived)
ey10
## [1] 0.6263345
ey21 <- titanic %>%
filter(s == 2 & d == 1) %$%
mean(survived)
ey21
## [1] 1
ey20 <- titanic %>%
filter(s == 2 & d == 0) %$%
mean(survived)
ey20
## [1] 0.6136364
ey31 <- titanic %>%
filter(s == 3 & d == 1) %$%
mean(survived)
ey31
## [1] 0.3257143
ey30 <- titanic %>%
filter(s == 3 & d == 0) %$%
mean(survived)
ey30
## [1] 0.1883378
ey41 <- titanic %>%
filter(s == 4 & d == 1) %$%
mean(survived)
ey41
## [1] 1
ey40 <- titanic %>%
filter(s == 4 & d == 0) %$%
mean(survived)
ey40
## [1] 0.4067797
The survival rate differences are then calculated and defined. Class had the smallest effect on Adult Males and the largest effect on Male Children while the survival rate difference among females was comparable regardless of age.
| Subgroup | Survival Rate Difference |
|---|---|
| Female Adults | .345 |
| Female Children | .386 |
| Male Adults | .137 |
| Male Children | .593 |
diff1 = ey11 - ey10
diff2 = ey21 - ey20
diff3 = ey31 - ey30
diff4 = ey41 - ey40
diff1
## [1] 0.3458877
diff2
## [1] 0.3863636
diff3
## [1] 0.1373765
diff4
## [1] 0.5932203
The simple difference in outcome calculated earlier was a bit misleading. It just threw everyone together and ignored sex and age. Now that we see sex and age played a role in survival rates we can determine weights to use in calculating a weighted average treatment effect. This is done by taking the amount of passengers in each of our four subgroups which were lower-class and did not survive and dividing by the total number of lower-class passengers. The resulting proportions are our weights for to attempt at getting a more accurate average treatment effect. Note how the weights add up to 1.
| Subgroup | Weights |
|---|---|
| Female Adults | 0.149 |
| Female Children | 0.023 |
| Male Adults | 0.795 |
| Male Children | 0.031 |
obs = nrow(titanic %>% filter(d == 0))
wt1 <- titanic %>%
filter(s == 1 & d == 0) %$%
nrow(.)/obs
wt1
## [1] 0.1497868
wt2 <- titanic %>%
filter(s == 2 & d == 0) %$%
nrow(.)/obs
wt2
## [1] 0.02345416
wt3 <- titanic %>%
filter(s == 3 & d == 0) %$%
nrow(.)/obs
wt3
## [1] 0.7953092
wt4 <- titanic %>%
filter(s == 4 & d == 0) %$%
nrow(.)/obs
wt4
## [1] 0.03144989
Having now controlled for the effects of age and gender as well as class when analyzing survival rates on the Titanic we can calculate an average treatment effects which accounts for the weights or the importance of our age/gender subgroups. The wate comes in at 0.188 or 18.8 percentage points.
wate = diff1*wt1 + diff2*wt2 + diff3*wt3 + diff4*wt4
wate
## [1] 0.1887847
stargazer(wate, sdo, type = "text")
##
## =====
## 0.189
## -----
##
## =====
## 0.354
## -----
Initially our average treatment effect (sdo) calculated a 35.4 percentage point increase in the probability of survival based on whether a passenger was first-class. However, upon subclassifying passengers into groups which allowed us to control for sex and age as well as class the average treatment effect (wate) is only 18.8 percentage points–nearly half of our initial estimate.
While being a passenger in first class did increase the probability of surviving the sinking of the Titanic, it was not the only factor.