The Assignment

For this assignment, I analyzed factors related to surviving the Titanic sinking using data from http://www.personal.psu.edu/dlp/w540/datasets/titanicsurvival.csv. [1]

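The dataset contained the following four variables, each coded as an integer, and had no missing data: Class (0 = crew, 1 = first, 2 = second, 3 = third), Age (0 = child, 1 = adult), Sex (0 = female, 1 = male), and Survive (0 = no, 1 = yes).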

Findings

1. Obtaining the dataset

To obtain the dataset and prepare R for the analysis, I loaded the R packages I needed in RStudio, then read the file directly from the web:

require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(datasets)
require(ggvis)
## Loading required package: ggvis
require(magrittr)
## Loading required package: magrittr
titanic <- read.csv(file = "http://www.personal.psu.edu/dlp/w540/datasets/titanicsurvival.csv", header = TRUE, sep = ",")

2. Calculate the total number of passengers in the dataset.

I calculated this by using a simple “tally” of the observations in the dataset.

totalpassengers <- tally(titanic)
totalpassengers
##      n
## 1 2201

The result showed me that there were 2,201 observations in the dataset, which, in this case, means 2,201 passenger records. (Interesting to note that this is lower than the 2,224 passengers and crew reported in Dr. Passmore’s assignment introduction.)

3. Calculate the total proportion of passengers surviving.

To determine this, I had to filter the dataset down to the survivors (Survive == 1), then calculate what percent the survivors represent of the total number of passengers. To do this, I used the following:

titanic_df <- tbl_df(titanic)
survivors <- filter(titanic_df, Survive == 1)
totalsurvivors <- tally(survivors)
totalsurvivors
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1   711
round((totalsurvivors/totalpassengers)*100)
##    n
## 1 32

The result showed me that there were 711 survivors, which, when rounded, meant that 32% of the ship’s passengers survived. (Again, interesting to note that this does not correspond with the information in Dr. Passmore’s assignment introduction, but since the number of observations in this dataset was lower than his reported 2,224, it makes sense that the number of survivors indicated in this dataset would also be slightly lower.)
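As a cross-check that does not depend on the course CSV, the same totals can be reproduced from the `Titanic` table in R's `datasets` package. This is only a sketch: it assumes the built-in table records the same 2,201 people as the CSV (the counts agree), and the variable names are my own. Note that the built-in table uses labels such as `Survived = "Yes"` rather than 0/1 codes.

```r
# Built-in 4-way contingency table: Class x Sex x Age x Survived.
# Assumption: it holds the same counts as the course CSV.
titanic_table <- datasets::Titanic

total_people <- sum(titanic_table)                       # every cell is a head count
by_survival  <- margin.table(titanic_table, margin = 4)  # collapse to Survived only
total_survivors <- as.numeric(by_survival["Yes"])

survival_pct <- 100 * total_survivors / total_people
round(survival_pct, 1)
```

Working from cell counts avoids the intermediate one-row data frames that `tally()` returns, so the division yields a plain number.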

4. Calculate the proportion of passengers surviving for each class of passenger.

To determine this, I created a table that shows the number of survivors by class, then added a new column that showed what percent the survivors in that class represented of the total number of passengers in that class. To do this, I used the following:

titanic_df %>%
  group_by (Class, Survive) %>%
  summarise (number = n()) %>%
  mutate (freq = (number / sum(number)*100))
## Source: local data frame [8 x 4]
## Groups: Class [4]
## 
##   Class Survive number     freq
##   (int)   (int)  (int)    (dbl)
## 1     0       0    673 76.04520
## 2     0       1    212 23.95480
## 3     1       0    122 37.53846
## 4     1       1    203 62.46154
## 5     2       0    167 58.59649
## 6     2       1    118 41.40351
## 7     3       0    528 74.78754
## 8     3       1    178 25.21246

The result showed me that 24.0% of crew, 62.5% of first class, 41.4% of second class, and 25.2% of third class passengers survived.
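The percentages in the pipeline above are within-class because `summarise()` drops the last grouping variable (`Survive`), leaving the result grouped by `Class`, as the `Groups: Class [4]` line shows. The same row-wise percentages can be sketched in base R with `prop.table()`. This assumes the built-in `datasets::Titanic` table matches the course CSV; the object names are illustrative.

```r
# Assumption: datasets::Titanic holds the same counts as the course CSV.
# Dimension 1 is Class, dimension 4 is Survived.
class_by_survival <- margin.table(datasets::Titanic, margin = c(1, 4))

# margin = 1 divides each row by its row total, so each class sums to 100%
class_pct <- 100 * prop.table(class_by_survival, margin = 1)
round(class_pct, 1)
```

The `"Yes"` column of `class_pct` corresponds to the `Survive == 1` rows of the dplyr table.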

5. Calculate the proportion of passengers surviving for each sex category. Which sex had the highest survival rate?

To determine this, I created a table that shows the number of survivors by sex, then added a new column that showed what percent the survivors of each sex represented of the total number of passengers of that sex. To do this, I used the following:

titanic_df %>%
  group_by (Sex, Survive) %>%
  summarise (number = n()) %>%
  mutate (freq = (number / sum(number)*100))
## Source: local data frame [4 x 4]
## Groups: Sex [2]
## 
##     Sex Survive number     freq
##   (int)   (int)  (int)    (dbl)
## 1     0       0    126 26.80851
## 2     0       1    344 73.19149
## 3     1       0   1364 78.79838
## 4     1       1    367 21.20162

The result showed me that 21.2% of male passengers and 73.2% of female passengers survived the sinking of the Titanic. Female passengers had the highest survival rate.

6. Calculate the proportion of passengers surviving for each age category. Which age had the lowest survival rate?

To determine this, I created a table that shows the number of survivors by age, then added a new column that showed what percent the survivors of each age category represented of the total number of passengers of that age category. To do this, I used the following:

titanic_df %>%
  group_by (Age, Survive) %>%
  summarise (number = n()) %>%
  mutate (freq = (number / sum(number)*100))
## Source: local data frame [4 x 4]
## Groups: Age [2]
## 
##     Age Survive number     freq
##   (int)   (int)  (int)    (dbl)
## 1     0       0     52 47.70642
## 2     0       1     57 52.29358
## 3     1       0   1438 68.73805
## 4     1       1    654 31.26195

The result showed me that 31.3% of adults and 52.3% of children survived the sinking of the Titanic. Adults had the lowest survival rate.

7. Calculate the proportion of passengers surviving for each age/sex category (i.e., for adult males, child males, adult females, child females). Which group was most likely to survive? Least likely?

To determine this, I created a table that shows the number of survivors by age and sex, then added a new column that showed what percent the survivors of each age/sex category represented of the total number of passengers of that age/sex category. To do this, I used the following:

titanic_df %>%
  group_by (Age, Sex, Survive) %>%
  summarise (number = n()) %>%
  mutate (freq = (number / sum(number)*100))
## Source: local data frame [8 x 5]
## Groups: Age, Sex [4]
## 
##     Age   Sex Survive number     freq
##   (int) (int)   (int)  (int)    (dbl)
## 1     0     0       0     17 37.77778
## 2     0     0       1     28 62.22222
## 3     0     1       0     35 54.68750
## 4     0     1       1     29 45.31250
## 5     1     0       0    109 25.64706
## 6     1     0       1    316 74.35294
## 7     1     1       0   1329 79.72406
## 8     1     1       1    338 20.27594

There were the following survival rates among men, women, boys, and girls:

  • Men - 20.3%
  • Women - 74.4%
  • Boys - 45.3%
  • Girls - 62.2%

This told me that women were the most likely to survive, and men were the least likely to survive.
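The same two-variable breakdown can be sketched in base R by conditioning `prop.table()` on two margins at once. This assumes the built-in `datasets::Titanic` table matches the course CSV; the object names are my own.

```r
# Assumption: datasets::Titanic holds the same counts as the course CSV.
# Keep Sex (2), Age (3), and Survived (4); sum over Class.
sex_age_survival <- margin.table(datasets::Titanic, margin = c(2, 3, 4))

# Percentages within each Sex/Age combination (the first two dimensions)
pct <- 100 * prop.table(sex_age_survival, margin = c(1, 2))

# Survival percentage for each Sex/Age group
survived_pct <- pct[, , "Yes"]
round(survived_pct, 1)
```

Here the rows of `survived_pct` are `Male` and `Female` and the columns are `Child` and `Adult`, so the four cells correspond to boys, girls, men, and women.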

8. Calculate the proportion of passengers surviving for each age/sex/class category. Which group had the highest mortality in this disaster? Why?

To determine this, I created four different tables, one for each class. Each showed the number of survivors by age/sex for that class. I also added a new column to each table that showed what percent the survivors of each age/sex category represented of the total number of passengers of that age/sex category. The code was as follows:

survivors <- filter(titanic_df, Class == 1) 
survivors %>%
  group_by (Age, Sex, Survive) %>%
  summarise (number = n()) %>%
  mutate (freq = (number / sum(number)*100))
## Source: local data frame [6 x 5]
## Groups: Age, Sex [4]
## 
##     Age   Sex Survive number       freq
##   (int) (int)   (int)  (int)      (dbl)
## 1     0     0       1      1 100.000000
## 2     0     1       1      5 100.000000
## 3     1     0       0      4   2.777778
## 4     1     0       1    140  97.222222
## 5     1     1       0    118  67.428571
## 6     1     1       1     57  32.571429
survivors <- filter(titanic_df, Class == 2) 
survivors %>%
  group_by (Age, Sex, Survive) %>%
  summarise (number = n()) %>%
  mutate (freq = (number / sum(number)*100))
## Source: local data frame [6 x 5]
## Groups: Age, Sex [4]
## 
##     Age   Sex Survive number       freq
##   (int) (int)   (int)  (int)      (dbl)
## 1     0     0       1     13 100.000000
## 2     0     1       1     11 100.000000
## 3     1     0       0     13  13.978495
## 4     1     0       1     80  86.021505
## 5     1     1       0    154  91.666667
## 6     1     1       1     14   8.333333
survivors <- filter(titanic_df, Class == 3) 
survivors %>%
  group_by (Age, Sex, Survive) %>%
  summarise (number = n()) %>%
  mutate (freq = (number / sum(number)*100))
## Source: local data frame [8 x 5]
## Groups: Age, Sex [4]
## 
##     Age   Sex Survive number     freq
##   (int) (int)   (int)  (int)    (dbl)
## 1     0     0       0     17 54.83871
## 2     0     0       1     14 45.16129
## 3     0     1       0     35 72.91667
## 4     0     1       1     13 27.08333
## 5     1     0       0     89 53.93939
## 6     1     0       1     76 46.06061
## 7     1     1       0    387 83.76623
## 8     1     1       1     75 16.23377
survivors <- filter(titanic_df, Class == 0) 
survivors %>%
  group_by (Age, Sex, Survive) %>%
  summarise (number = n()) %>%
  mutate (freq = (number / sum(number)*100))
## Source: local data frame [4 x 5]
## Groups: Age, Sex [2]
## 
##     Age   Sex Survive number     freq
##   (int) (int)   (int)  (int)    (dbl)
## 1     1     0       0      3 13.04348
## 2     1     0       1     20 86.95652
## 3     1     1       0    670 77.72622
## 4     1     1       1    192 22.27378

These tables showed me that there were the following survival rates for each age, sex, and class category:

  • First Class
    • Girls - 100%
    • Boys - 100%
    • Women - 97.2%
    • Men - 32.6%
  • Second Class
    • Girls - 100%
    • Boys - 100%
    • Women - 86%
    • Men - 8.3%
  • Third Class
    • Girls - 45.2%
    • Boys - 27.1%
    • Women - 46.1%
    • Men - 16.2%
  • Crew
    • Women - 87%
    • Men - 22.3%
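Instead of filtering one class at a time, all age/sex/class groups can be ranked in a single pass. The following is a minimal base R sketch, assuming the built-in `datasets::Titanic` table matches the course CSV; the column and object names (`rates`, `mortality_pct`, and so on) are hypothetical.

```r
# Assumption: datasets::Titanic holds the same counts as the course CSV.
long <- as.data.frame(datasets::Titanic)  # columns: Class, Sex, Age, Survived, Freq

# Total people and total survivors in each Class/Sex/Age group
totals   <- aggregate(Freq ~ Class + Sex + Age, data = long, FUN = sum)
survived <- aggregate(Freq ~ Class + Sex + Age,
                      data = long[long$Survived == "Yes", ], FUN = sum)

rates <- merge(totals, survived, by = c("Class", "Sex", "Age"),
               suffixes = c("_total", "_survived"))
rates <- rates[rates$Freq_total > 0, ]  # drop empty groups, e.g. child crew
rates$mortality_pct <- round(100 * (1 - rates$Freq_survived / rates$Freq_total), 1)

# Highest mortality first
rates <- rates[order(-rates$mortality_pct), ]
rates
```

Sorting the combined table makes the highest-mortality group the first row, which is easier to verify than reading four separate tables.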

Adult male passengers in second class had the highest mortality rate in this disaster (91.7%). That was not completely surprising, as the previous calculations showed that adults and males had the lowest survival rates. However, taken as a whole class, the crew had the highest fatality rate, not second class. This meant I had further analysis to conduct if I wanted to explain why second class men had the highest mortality rate.

Based on what I know about the sinking of the Titanic, it made sense that men had the lowest survival rates overall. There were not enough lifeboats, and a general “women and children first” policy could account for the fact that men, in general, had a higher mortality rate. But why did second class men have the highest mortality rate?

First, the men in first class likely had the best access to the lifeboats and the most assistance for themselves, their wives, and their children. My analysis above also showed that there were more children in second class than in first class. The second class men may have lost more lives in a valiant attempt to help “women and children first.”

To analyze why, then, second class male passengers had the highest mortality rate, I looked at the raw numbers of men and women in each class, both in table form and as a bar plot:

  adults <- filter(titanic_df, Age == 1) 
  adults %>%
    group_by (Class, Sex) %>%
    summarise (number = n()) %>%
    mutate (freq = (number / sum(number)*100))
## Source: local data frame [8 x 4]
## Groups: Class [4]
## 
##   Class   Sex number     freq
##   (int) (int)  (int)    (dbl)
## 1     0     0     23  2.59887
## 2     0     1    862 97.40113
## 3     1     0    144 45.14107
## 4     1     1    175 54.85893
## 5     2     0     93 35.63218
## 6     2     1    168 64.36782
## 7     3     0    165 26.31579
## 8     3     1    462 73.68421
adults <- filter(titanic_df, Age == 1)
counts <- table(adults$Sex, adults$Class)
barplot(counts, main="Adult Female and Male Passengers by Class",
    xlab="Class", col=c("darkblue","red"),
    legend = rownames(counts), beside=TRUE)

Then I compared the number of men in each class:

men <- filter(titanic_df, Sex == 1, Age == 1)
counts <- table(men$Class)
barplot(counts, main="Adult Male Passengers by Class",
  xlab="Class", col=c("darkblue"))

As I looked at this data, I next considered the higher survival rate of adult men in the crew compared to the adult men in second and third class. The men on the crew were loading and even piloting the lifeboats, which could explain why the men in the crew had a higher survival rate than the men in second or third class.

That left me with the question of why the percentage of men in third class who survived was higher than the percentage of men in second class who survived. The following stacked bar plot illustrated for me the comparative survival rates:

# Stacked Bar Plot with Colors and Legend (to get stack, no "beside=TRUE")
survivors <- filter(titanic_df, Sex == 1, Age == 1)
counts <- table(survivors$Survive, survivors$Class)
barplot(counts, main="Adult Male Survival Rates by Class",
  xlab="Class", col=c("darkblue","red"),
  legend = rownames(counts))

From this, it was easier for me to see that second class had the fewest adult male passengers overall (168). While the men in second class had the lowest survival rate, the raw number of second class adult male deaths (154) was still small compared to the number of men lost in third class (387) and among the crew (670).
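The contrast between death rates and raw death counts for adult men can be put side by side. This is a sketch that assumes the built-in `datasets::Titanic` table matches the course CSV; the object names are illustrative.

```r
# Assumption: datasets::Titanic holds the same counts as the course CSV.
# Slice out adult men, leaving a Class x Survived table of counts.
adult_men <- datasets::Titanic[, "Male", "Adult", ]

deaths         <- adult_men[, "No"]                           # raw deaths per class
death_rate_pct <- round(100 * deaths / rowSums(adult_men), 1) # deaths as % of class

cbind(deaths, death_rate_pct)
```

Reading the two columns together shows which class has the highest rate and which has the largest absolute loss, which need not be the same class.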

9. Write a summary of your findings. Your summary may contain no more than 60 words.

The sinking of the Titanic was a disaster of enormous proportions. Only 32% survived, with the highest percent of fatalities among the crew (76%). Females were more likely to survive than males (73% compared to 21%), and children were more likely to survive than adults (52% compared to 31%). Men in second class suffered the most, with only 8.3% surviving.


[1] The dataset used in this assignment is from “Report on the Loss of the ‘Titanic’ (S.S.)” (1990), British Board of Trade Inquiry Report (reprint), Gloucester, UK: Allan Sutton Publishing, and is discussed in Dawson, R. J. M. (1995). The ‘unusual episode’ data revisited. Journal of Statistics Education [on-line] 3(3). (http://www.amstat.org/publications/jse/v3n3/datasets.dawson.html).