About this project

For three reasons I spent a lot of time on this project. First, data engineering is my wheelhouse. I am an experienced, full time lead/principal data engineer for NBC. Second, I am an amateur historian for the period between about 1890 and 1950. And lastly, I have used these two base data sets to practice programming and plotting before. That said, I figured even if I only scratched the surface of something, it better be well wrangled, interesting, and provocative.

I didn’t know what I was going to do with this project when I started. I came up with the idea to standardize and union because I have never done that with these two datasheets before. I figured if I transformed it into a single data set, I might be able to find or discover something different.

Sure enough, after all the wrangling, I saw something right of the gate. From there I titled my paper, and spent a lot of time trying to get the graphs to show exactly what I saw.

About thie data, an overiew of the Titanic and the Lusitania

The two data sets in this project are the passenger and crew manifests of two ships, the RMS Titanic and the RMS Lusitania.

We all know the basics of both stories. On 15 April 1912, the Titanic sank in the Atlantic after hitting an iceberg. Of the 2,224 people on board, about 1,500 or 67% of them died. The sinking of the unsinkable Titanic was big deal. The titanic was supposedly so unsinkable that the makers and operators of the ocean liner didn’t think they needed many lifeboats. This error in thought created one of the worst civilian maritime disasters ever.

And then two years later on May 7th 1915, a German U Boat sank the Lusitania off the southern coast of Ireland. There were about 1,962 people on that boat, 761 or 38% of them died. This brutal passenger ship sinking was a major factor for the United States entering World War One.

Biases

The main one I glossed over for this paper is, simply stated, rich people data was probably more accurate than poor people data.

Conclusions and closing thoughts

With only two charts, I believe this project shows a clear window into different social norms back then. Racism and sexism were gross and everywhere and normal, but that was also the time when during times of peril, men tried to take the brunt of difficult situations on behalf of the species. When you look at the Titanic, where the population had to self-select survivorship much further and longer down pick list, the common thread shows through.

After I built these two charts, I immediately fond myself asking the question, “If the Titanic disaster happened in an almost identical way one week from today, would these two charts look the same?. Or better yet, should they?”

Load, Clean and Standardize Data

# First load data. 
library(sqldf)
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)

g_repo = "https://raw.githubusercontent.com/tonythor/cuny-datascience/" 
lf =  paste(g_repo,"develop/bridge/R/data/lusitania_manifest.csv", sep = "")
tf =  paste(g_repo,"develop/bridge/R/data/titanic_lifeboats.csv", sep="")
titanic <- read.csv(tf)
lusitania <- read.csv(lf)

## Clean and standardize the titanic data set
titanic[c('last_name', 'first_name')]<- str_split_fixed(titanic$name, ',', 2)
replacements  = c('Miss.'='','Master.'='', 'Mrs.'='', 'Mister.'='', 'Mr.'='')
titanic$first_name<- str_replace_all(titanic$first_name, replacements )
titanic <- dplyr::rename(titanic, c("gender" = "sex"), c("lifeboat" = "boat"))
titanic_boat= c("titanic")
titanic$boat = titanic_boat
titanic <- titanic %>% mutate_at(c('age'), ~replace_na(.,0))
titanic <- titanic %>%   mutate(age = as.numeric(age))
# Could be any column name to filter not. Last row is empty and needs to be dropped. 
titanic <- titanic %>% filter(!last_name=='')
titanic_clean <- titanic %>% select(boat,last_name, first_name, 
                                    survived, gender, age, lifeboat)
## Clean and standardize the Lusitania data set
lusitania_boat= c("lusitania")
lusitania$boat = lusitania_boat
lusitania <- dplyr::rename(lusitania, c("first_name" = "Personal.name"), 
                           c("last_name" = "Family.name"),
                           c("lifeboat" = "Lifeboat"),
                           c("age" = "Age"),
                           c("gender" = "Sex"))
lusitania <- lusitania %>%  mutate(age = as.numeric(age))
lusitania <- lusitania %>% mutate_at(c('age'), ~replace_na(.,0))
lusitania = lusitania %>% mutate(survived = case_when(grepl("Saved", Fate) ~ 1,
                                                      .default = 0)) 
lusitania$gender <- tolower(lusitania$gender)
lusitania$last_name <- str_to_title(lusitania$last_name)
lusitania_clean <- lusitania %>% select(boat,last_name, first_name, survived, 
                                        gender, age, lifeboat)
## uninion the data sets into one, add a numerical categorical flag for gender
boats = union(lusitania_clean,titanic_clean)

The general distribution and survivors, then by age

par(mar = c(4, 4, .1, .1))
ggplot(data=boats %>% filter(age > 1) ,aes(x=age, fill = boat)) +
  geom_histogram(bins=25, show.legend = FALSE) +
  ylim(0, 325) +
  xlim(0, 90) + 
  scale_y_continuous(n.breaks = 15) +
  scale_x_continuous(n.breaks = 10) +
  ylab("All Passangeers") + xlab ("Age")
ggplot(data=boats %>% filter(survived == 1) %>% filter(age >1) ,aes(x=age, fill = boat)) +
  geom_histogram(bins=25) +
  ylim(0, 325) +
  xlim(0, 90) + 
  scale_x_continuous(n.breaks = 10) +
  ylab("Survivors") + xlab ("Age")

Look at the children from the titanic

The first that stands out is that of the survivors. Children of the titanic under the age of 10 were so well represented. Not all of them lived, but most seemed to.

Looking at gender

#survived, by boat / gender!  
sbb  =  boats %>% filter(survived == 1)  %>%  group_by(boat,gender) %>%   
  summarise(total_count=n(), .groups = 'drop') 
# agg, get boat totals  
sbb_agg = sbb %>% group_by(boat) %>% summarise(sum=sum(total_count), 
                                               .groups = 'drop')
survivers_lus = sbb_agg %>% filter(boat=="lusitania") %>% select("sum")
survivers_tit = sbb_agg %>% filter(boat=="titanic") %>% select("sum")

# add by boat survivors as a column, let's see if we can compare sex to survivorship
sbb = sbb %>% mutate(survivors = case_when(grepl("titanic", boat) 
                                           ~ survivers_tit, .default = survivers_lus)) 
sbb <- sbb %>% mutate(survivorship_percentage = (total_count / survivors$sum) * 100)

## As we expected, when you're short lifeboats, womena and children first! 
g <- ggplot(sbb, aes(x=boat,  y=survivorship_percentage))
g + geom_col(aes(fill = gender)) +  ylab("Survivorship Percentage") + xlab ("Boat")

Look at gender

More men were on the boats than women, sure, but when a population starts choosing who has to live or die, and when scarcity of those available slots grows by a lot, the choice is clear. Save the women.