R Bridge Week 3 - Titanic Survivors

The R.M.S. Titanic has captivated the world since it sank on the night of April 14-15, 1912. While scrolling through the available data sets, I found the word “Titanic” so evocative that I had to select it, if only to see what it contained.

Analytical Question

The Titanic disaster was a turning point for marine transportation whose effects are still felt today. Many rules, laws, and safety procedures that stemmed from the investigations and recommendations following the sinking remain in place. The question I wish to address using the current data set is:

What did the typical survivor and victim look like?

Perhaps obviously, my assumption before doing much analysis is that women, children, and 1st class passengers would be the most likely survivors, and that older, male, and 3rd class passengers would be the most likely victims, in keeping with the “women and children first” convention. I wish this data set had additional fields, such as crew or passenger status, national origin, and room and deck assignments.

Summary Statistics

I downloaded the data set, re-uploaded it to my GitHub account, and read it into R from there.

TiSurURL <- "https://raw.githubusercontent.com/iscostello/Rbridgedata/master/TitanicSurvival.csv"
TiSur <- read.csv(file = TiSurURL, header = TRUE, sep = ",")

head(TiSur)

##                                 X survived    sex     age passengerClass
## 1   Allen, Miss. Elisabeth Walton      yes female 29.0000            1st
## 2  Allison, Master. Hudson Trevor      yes   male  0.9167            1st
## 3    Allison, Miss. Helen Loraine       no female  2.0000            1st
## 4 Allison, Mr. Hudson Joshua Crei       no   male 30.0000            1st
## 5 Allison, Mrs. Hudson J C (Bessi       no female 25.0000            1st
## 6             Anderson, Mr. Harry      yes   male 48.0000            1st

summary(TiSur)

##       X               survived             sex                 age         
##  Length:1309        Length:1309        Length:1309        Min.   : 0.1667  
##  Class :character   Class :character   Class :character   1st Qu.:21.0000  
##  Mode  :character   Mode  :character   Mode  :character   Median :28.0000  
##                                                           Mean   :29.8811  
##                                                           3rd Qu.:39.0000  
##                                                           Max.   :80.0000  
##                                                           NA's   :263      
##  passengerClass    
##  Length:1309       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

The summary statistics for the entire data set do not offer much insight into my question. However, there are some interesting observations.

The data set is incomplete
- Titanic had 2,208 people on board, of whom 1,317 were passengers
- This data set has only 1,309 records, so I assume it covers passengers only
The people in this data set trend young
- Median age of 28 (mean of about 29.9)
- Missing (NA) age records (263) are abundant enough to perhaps affect these statistics
Survival and sex are both stored as characters, so the summary function reports only their length and class rather than counting each value
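The observations above can be checked directly. The following is a minimal sketch, assuming only base R: it rebuilds the six rows printed by head(TiSur) as a stand-in data frame (the name sample_rows is mine, not part of the original analysis), counts the values of the character columns with table(), and handles missing ages explicitly.

```r
# Stand-in data: the six rows printed by head(TiSur) above.
# The real data set has 1,309 rows; sample_rows is a hypothetical name.
sample_rows <- data.frame(
  survived = c("yes", "yes", "no", "no", "no", "yes"),
  sex = c("female", "male", "female", "male", "female", "male"),
  age = c(29, 0.9167, 2, 30, 25, 48),
  passengerClass = rep("1st", 6),
  stringsAsFactors = FALSE
)

# table() counts each distinct value, which summary() does not do
# for character columns; useNA = "ifany" also reports missing values
survived_counts <- table(sample_rows$survived, useNA = "ifany")
sex_counts <- table(sample_rows$sex, useNA = "ifany")

# Count missing ages explicitly, then take the median without them
n_missing_age <- sum(is.na(sample_rows$age))
median_age <- median(sample_rows$age, na.rm = TRUE)

print(survived_counts)
print(sex_counts)
print(n_missing_age)
print(median_age)
```

Running the same calls on TiSur instead of sample_rows should give the counts for the full data set.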

Data Wrangling

# Logical flags: survivedNew is TRUE for survivors, sexNew is TRUE for women
TiSur$survivedNew <- TiSur$survived == "yes"
TiSur$sexNew <- TiSur$sex == "female"
# Recode class as a factor with levels 1, 2, 3; any other value becomes NA
TiSur$passClassNew <- factor(TiSur$passengerClass,
                             levels = c("1st", "2nd", "3rd"),
                             labels = c(1, 2, 3))

summary(TiSur)

##       X               survived             sex                 age         
##  Length:1309        Length:1309        Length:1309        Min.   : 0.1667  
##  Class :character   Class :character   Class :character   1st Qu.:21.0000  
##  Mode  :character   Mode  :character   Mode  :character   Median :28.0000  
##                                                           Mean   :29.8811  
##                                                           3rd Qu.:39.0000  
##                                                           Max.   :80.0000  
##                                                           NA's   :263      
##  passengerClass     survivedNew       sexNew        passClassNew
##  Length:1309        Mode :logical   Mode :logical   1:323       
##  Class :character   FALSE:809       FALSE:843       2:277       
##  Mode  :character   TRUE :500       TRUE :466       3:709       
##                                                                 
##                                                                 
##                                                                 
##

The two columns “survived” and “sex” are stored in the original data set as characters. To make them summarizable with the summary function and more useful for other analysis, I created two new columns. At first, I started to recode the values as numbers: “1” for a “yes” response and “0” for “no.” Because these values are binary, I could instead store them as booleans, that is, TRUE/FALSE (for sexNew, TRUE means female).

Perhaps the easiest solution is simply to convert sex and survivor status to factors, as I’ve done for class. Viewing class this way and comparing it to the passenger summaries available online, I see that the 3rd class count (709) matches the official tally exactly. This data set is missing seven 2nd class passengers and one 1st class passenger.
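The factor approach mentioned above might look like the following. This is a sketch rather than what the analysis actually ran: the column names survivedFac, sexFac, and classFac are hypothetical, and the data frame is again a stand-in built from the six rows printed by head(TiSur).

```r
# Stand-in data: the six rows printed by head(TiSur) above
sample_rows <- data.frame(
  survived = c("yes", "yes", "no", "no", "no", "yes"),
  sex = c("female", "male", "female", "male", "female", "male"),
  passengerClass = rep("1st", 6),
  stringsAsFactors = FALSE
)

# Explicit levels fix the display order and keep levels that happen to be
# absent from a subset (here "2nd" and "3rd") visible with a count of 0
sample_rows$survivedFac <- factor(sample_rows$survived, levels = c("no", "yes"))
sample_rows$sexFac <- factor(sample_rows$sex, levels = c("female", "male"))
sample_rows$classFac <- factor(sample_rows$passengerClass,
                               levels = c("1st", "2nd", "3rd"))

# summary() now tabulates each column instead of printing Length/Class/Mode
print(summary(sample_rows[, c("survivedFac", "sexFac", "classFac")]))
```

Keeping the original labels ("yes", "female", "1st") as factor levels avoids having to remember what TRUE or 1 stands for in later plots.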

# Split into survivors and victims using the logical flag directly
TiSurOnly <- subset(TiSur, survivedNew)
TiSurVictim <- subset(TiSur, !survivedNew)

Scatter Plot of Passengers by Age and Class

A scatter plot is probably not the best visual for these data. The data set is light on continuous variables, so with class on the y-axis the points collapse into three horizontal lines that are tricky to interpret. I much prefer the histograms below for showing these data.

library(ggplot2)

ggplot(TiSur, aes(x=age, y=passClassNew, color=survivedNew)) +
  geom_point()

## Warning: Removed 263 rows containing missing values (geom_point).

Box Plot of Age by Survivor Status

library(ggplot2)
ggplot(TiSur, aes(x=survivedNew, y=age)) + 
  geom_boxplot()

## Warning: Removed 263 rows containing non-finite values (stat_boxplot).
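A numeric companion to the box plot can make the comparison easier to read. The sketch below uses base R's tapply() to compute the median age within each survival group; it assumes the same stand-in rows from head(TiSur), and the names sample_rows and median_age_by_group are mine.

```r
# Stand-in data: survival status and age from the six rows of head(TiSur)
sample_rows <- data.frame(
  survived = c("yes", "yes", "no", "no", "no", "yes"),
  age = c(29, 0.9167, 2, 30, 25, 48)
)

# tapply() applies median() to age within each survival group;
# na.rm = TRUE is passed through to median() so missing ages are skipped
median_age_by_group <- tapply(sample_rows$age, sample_rows$survived,
                              median, na.rm = TRUE)
print(median_age_by_group)
```

With only six rows the result says nothing about the ship as a whole; applied to TiSur$age and TiSur$survived it would give the group medians that the box plot draws as its center lines.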

Histogram of Survivors ONLY by Sex (Gender)

library(ggplot2)
ggplot(TiSurOnly, aes(x=age, color=sex)) +
  geom_histogram(fill="white", alpha=0.1, position="stack")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 73 rows containing non-finite values (stat_bin).
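The stat_bin() message above appears because no bin width was chosen. One possible way to address it, sketched here under my own assumptions (5-year bins, and a three-row stand-in for TiSurOnly taken from the survivors shown by head(TiSur)), is to set binwidth explicitly and drop missing ages with na.rm.

```r
library(ggplot2)

# Stand-in data: the three survivors among the six rows of head(TiSur)
sample_rows <- data.frame(
  survived = c("yes", "yes", "no", "no", "no", "yes"),
  sex = c("female", "male", "female", "male", "female", "male"),
  age = c(29, 0.9167, 2, 30, 25, 48)
)
survivors <- subset(sample_rows, survived == "yes")

# binwidth = 5 groups ages into 5-year bins (an assumed choice);
# boundary = 0 aligns the bin edges at 0, 5, 10, ...;
# na.rm = TRUE drops missing ages without the "Removed ... rows" warning
p <- ggplot(survivors, aes(x = age, fill = sex)) +
  geom_histogram(binwidth = 5, boundary = 0, color = "white",
                 position = "stack", na.rm = TRUE)
print(p)
```

Mapping sex to fill rather than color shades the whole bar, which tends to be easier to read in a stacked histogram than outlined bars.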

Histogram of Victims ONLY by Sex (Gender)

library(ggplot2)
ggplot(TiSurVictim, aes(x=age, color=sex)) +
  geom_histogram(fill="white", alpha=0.1, position="stack")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 190 rows containing non-finite values (stat_bin).

Histogram of ALL Passengers by Age and Survivor Status

library(ggplot2)
ggplot(TiSur, aes(x=age, color=survivedNew)) +
  geom_histogram(fill="white", alpha=0.1, position="stack")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 263 rows containing non-finite values (stat_bin).

Conclusions

Though the data set was fairly limited, I was able to draw a few conclusions based on the summary statistics and graphics created.

3rd class passengers were more likely victims
- There were many more 3rd class passengers than 1st or 2nd
- While these data are not in this set, it is well known that 3rd class passengers were disadvantaged by their location on the ship and, likely in part, by their social status
Women were more likely survivors
- No surprise that my initial guess panned out
Age does not appear determinative
- While the box plot shows that survivors were slightly younger, it also has a number of outliers, which are the oldest passengers in the set

The typical survivor was most likely 1st class and a woman, while the typical victim was most likely 3rd class and a man.
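These conclusions could be backed with survival rates computed per group. The sketch below is one way to do that in base R, not the method used above: it assumes a logical survivedNew column like the one created earlier and uses aggregate() to average it by class and sex. The data frame is again a stand-in built from the six rows of head(TiSur), all of them 1st class, so its rates are illustrative only.

```r
# Stand-in data: the six rows printed by head(TiSur), all 1st class
sample_rows <- data.frame(
  survived = c("yes", "yes", "no", "no", "no", "yes"),
  sex = c("female", "male", "female", "male", "female", "male"),
  passengerClass = rep("1st", 6),
  stringsAsFactors = FALSE
)
sample_rows$survivedNew <- sample_rows$survived == "yes"

# The mean of a logical column is the share of TRUE values,
# so averaging survivedNew within each group gives its survival rate
survival_rates <- aggregate(survivedNew ~ passengerClass + sex,
                            data = sample_rows, FUN = mean)
print(survival_rates)
```

Run against the full TiSur data frame, the same call would return one row per class and sex combination, which is the comparison the conclusions rest on.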

R_Week_3

Ian Costello

8/2/2020