The R.M.S. Titanic has captivated the world since it struck an iceberg on the night of April 14, 1912, and sank in the early hours of April 15. While I was scrolling through the available data sets, the word “Titanic” alone was so evocative that I had to select it, if only to see what it contained.
The Titanic disaster was a turning point for marine transportation, and its effects are still felt today. Many rules, laws, and safety procedures that stemmed from the investigations and recommendations following the sinking remain in place. The question I wish to address using the current data set is:
What did the typical survivor and victim look like?
Perhaps obviously, my assumption before doing much analysis is that women, children, and 1st class passengers would be the most likely survivors, and older, male, and 3rd class passengers the most likely victims, in keeping with the “women and children first” trope. I wish this data set had additional categories, such as crew or passenger status, national origin, and room and deck assignments.
I downloaded the data set, re-uploaded it to my GitHub, and confirmed that it reads successfully from my account.
TiSurURL <- "https://raw.githubusercontent.com/iscostello/Rbridgedata/master/TitanicSurvival.csv"
TiSur <- read.csv(file = TiSurURL, header = TRUE, sep = ",")
head(TiSur)
## X survived sex age passengerClass
## 1 Allen, Miss. Elisabeth Walton yes female 29.0000 1st
## 2 Allison, Master. Hudson Trevor yes male 0.9167 1st
## 3 Allison, Miss. Helen Loraine no female 2.0000 1st
## 4 Allison, Mr. Hudson Joshua Crei no male 30.0000 1st
## 5 Allison, Mrs. Hudson J C (Bessi no female 25.0000 1st
## 6 Anderson, Mr. Harry yes male 48.0000 1st
summary(TiSur)
## X survived sex age
## Length:1309 Length:1309 Length:1309 Min. : 0.1667
## Class :character Class :character Class :character 1st Qu.:21.0000
## Mode :character Mode :character Mode :character Median :28.0000
## Mean :29.8811
## 3rd Qu.:39.0000
## Max. :80.0000
## NA's :263
## passengerClass
## Length:1309
## Class :character
## Mode :character
##
##
##
##
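The summary statistics for the entire data set do not offer much insight into my question, because the character columns are summarized only by length and class. However, there are some interesting observations: ages range from about two months (0.1667 years) to 80 with a median of 28, and 263 of the 1,309 records have no recorded age.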
# Logical flags: survivedNew is TRUE for survivors, sexNew is TRUE for women
TiSur$survivedNew <- TiSur$survived == "yes"
TiSur$sexNew <- TiSur$sex == "female"
# Recode class as a factor with levels 1, 2, 3; any other value becomes NA
TiSur$passClassNew <- factor(TiSur$passengerClass,
                             levels = c("1st", "2nd", "3rd"),
                             labels = c(1, 2, 3))
summary(TiSur)
## X survived sex age
## Length:1309 Length:1309 Length:1309 Min. : 0.1667
## Class :character Class :character Class :character 1st Qu.:21.0000
## Mode :character Mode :character Mode :character Median :28.0000
## Mean :29.8811
## 3rd Qu.:39.0000
## Max. :80.0000
## NA's :263
## passengerClass survivedNew sexNew passClassNew
## Length:1309 Mode :logical Mode :logical 1:323
## Class :character FALSE:809 FALSE:843 2:277
## Mode :character TRUE :500 TRUE :466 3:709
##
##
##
##
The two columns “survived” and “sex” are stored in the original data set as characters, so summary() reports only their length and class. To make them summarizable, and more useful for other analysis, I created two new columns. At first, I started to recode the values as numbers, “1” for a “yes” response and “0” for “no.” Because these values are binary, I could instead store them as booleans (TRUE/FALSE): survivedNew is TRUE for survivors and sexNew is TRUE for women.
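One practical benefit of the boolean encoding is that sum() and mean() give counts and proportions directly. The sketch below illustrates this on a small stand-in data frame, here called demo (a name introduced only for this example), built from the six rows of the head() output shown earlier; on the full data the same calls would apply to TiSur.

```r
# Stand-in data: the six rows printed by head(TiSur) earlier in this report
demo <- data.frame(
  survived = c("yes", "yes", "no", "no", "no", "yes"),
  sex      = c("female", "male", "female", "male", "female", "male"),
  stringsAsFactors = FALSE
)

# A comparison already returns TRUE/FALSE, so no ifelse() is needed
demo$survivedNew <- demo$survived == "yes"
demo$sexNew      <- demo$sex == "female"

# Logicals count as 1/0 in arithmetic
n_survived    <- sum(demo$survivedNew)   # number of survivors
rate_survived <- mean(demo$survivedNew)  # share of survivors
share_female  <- mean(demo$sexNew)       # share of women

print(c(n_survived = n_survived,
        rate_survived = rate_survived,
        share_female = share_female))
```

Because these six rows are only the top of the file, the resulting numbers say nothing about the ship as a whole; they just show the mechanics.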
Perhaps the easiest solution is simply to convert sex and survivor status to factors, as I’ve done for class. Viewing class this way and comparing it with passenger summaries found online, I see that the 3rd class count (709) matches the official tally exactly, while this data set is missing seven 2nd class passengers and one 1st class passenger.
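A minimal sketch of that factor conversion, assuming the same column names as the data set and using a stand-in data frame called demo (a hypothetical name) built from the six head() rows shown earlier. Once the columns are factors, summary() and table() report counts per level rather than just a length.

```r
# Stand-in data: the six rows printed by head(TiSur) earlier in this report
demo <- data.frame(
  survived       = c("yes", "yes", "no", "no", "no", "yes"),
  sex            = c("female", "male", "female", "male", "female", "male"),
  passengerClass = rep("1st", 6),
  stringsAsFactors = FALSE
)

# Convert the character columns to factors; stating the levels explicitly
# keeps classes that happen to be absent from a subset (count 0)
demo$survived       <- factor(demo$survived, levels = c("no", "yes"))
demo$sex            <- factor(demo$sex)
demo$passengerClass <- factor(demo$passengerClass,
                              levels = c("1st", "2nd", "3rd"))

# summary() now prints counts per level
print(summary(demo))

# Cross-tabulate sex against survival
sex_by_survival <- table(demo$sex, demo$survived)
print(sex_by_survival)
```

Unlike the boolean columns, factors keep the original labels ("female", "yes"), which tends to make plot legends and tables easier to read.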
TiSurOnly <- subset(TiSur, survivedNew)
TiSurVictim <- subset(TiSur, !survivedNew)
A scatter plot is probably not the best visual here. The data set is pretty light on continuous data (age is the only continuous variable), so the resulting rows of points are a bit tricky to interpret. I much prefer the histograms below to demonstrate these data.
library(ggplot2)
ggplot(TiSur, aes(x=age, y=passClassNew, color=survivedNew)) +
geom_point()
## Warning: Removed 263 rows containing missing values (geom_point).
library(ggplot2)
ggplot(TiSur, aes(x=survivedNew, y=age)) +
geom_boxplot()
## Warning: Removed 263 rows containing non-finite values (stat_boxplot).
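The comparison the box plot draws can also be read off numerically. The sketch below computes the median age per survival group with tapply(), passing na.rm = TRUE so that missing ages are dropped in the same way the plot drops them. It runs on a stand-in data frame called demo (a hypothetical name) built from the six head() rows shown earlier, so the medians it returns are not those of the full data set.

```r
# Stand-in data: the six rows printed by head(TiSur) earlier in this report
demo <- data.frame(
  survived = c("yes", "yes", "no", "no", "no", "yes"),
  age      = c(29, 0.9167, 2, 30, 25, 48),
  stringsAsFactors = FALSE
)

# Median age within each survival group; na.rm = TRUE skips missing ages
median_age <- tapply(demo$age, demo$survived, median, na.rm = TRUE)
print(median_age)

# How many ages are missing in each group (none in this small sample)
missing_age <- tapply(is.na(demo$age), demo$survived, sum)
print(missing_age)
```

On the full data, replacing demo with TiSur would give the group medians that the two boxes display.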
library(ggplot2)
ggplot(TiSurOnly, aes(x=age, color=sex)) +
geom_histogram(fill="white", alpha=0.1, position="stack")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 73 rows containing non-finite values (stat_bin).
library(ggplot2)
ggplot(TiSurVictim, aes(x=age, color=sex)) +
geom_histogram(fill="white", alpha=0.1, position="stack")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 190 rows containing non-finite values (stat_bin).
library(ggplot2)
ggplot(TiSur, aes(x=age, color=survivedNew)) +
geom_histogram(fill="white", alpha=0.1, position="stack")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 263 rows containing non-finite values (stat_bin).
Though the data set was fairly limited, I was able to draw a few conclusions based on the summary statistics and graphics created.
The typical survivor was most likely 1st class and a woman, while the typical victim was likely 3rd class and a man.
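One way to put numbers behind this conclusion is to compute the survival rate for each combination of sex and class. The following is a sketch under stated assumptions rather than part of the original analysis: it uses aggregate() on a stand-in data frame called demo (a hypothetical name) built from the six head() rows shown earlier. Those rows are all 1st class and far too few to be representative, so the rates they produce should not be read as findings; substituting TiSur for demo would give the full-data rates.

```r
# Stand-in data: the six rows printed by head(TiSur) earlier in this report
demo <- data.frame(
  survived       = c("yes", "yes", "no", "no", "no", "yes"),
  sex            = c("female", "male", "female", "male", "female", "male"),
  passengerClass = rep("1st", 6),
  stringsAsFactors = FALSE
)
demo$survivedNew <- demo$survived == "yes"

# The mean of a logical column is the proportion of TRUE values,
# so this yields the survival rate per sex/class group
rates <- aggregate(survivedNew ~ sex + passengerClass, data = demo, FUN = mean)
names(rates)[names(rates) == "survivedNew"] <- "survivalRate"
print(rates)

# Group sizes, to judge how much weight each rate deserves
sizes <- aggregate(survivedNew ~ sex + passengerClass, data = demo, FUN = length)
print(sizes)
```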