Assignment 1: TitanicAnalysis

Initial Step: Load the packages needed for examining data

I find that a helpful first step in completing any assignment in “R” is to load the packages needed for examining and manipulating the data required for the task. (The “require” calls below load packages that are already installed; they do not install them.)

To do this, I ran the following code:

require(datasets)
require(ggvis)

## Loading required package: ggvis

require(dplyr)

## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

require(magrittr)

## Loading required package: magrittr

require(knitr)

## Loading required package: knitr
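
Because “require” only loads packages, a quick check for packages that are not yet installed can save some confusion. The following is a minimal sketch in base R; the helper name missing_packages is my own invention for illustration, and the install line is left commented out so nothing is installed by accident.

```r
# Packages this report loads (datasets ships with R itself)
needed <- c("ggvis", "dplyr", "magrittr", "knitr")

# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

Installation is a one-time step per machine, while loading has to be repeated in every new R session.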

Item 1: Read data from Web and look at it

To complete the questions posed in this assignment I first needed to download the data from the identified web address and get a feel for what it included. So I ran the following steps:

TitanicData <- read.csv(file="http://www.personal.psu.edu/dlp/w540/datasets/titanicsurvival.csv", header=TRUE, sep=",")
glimpse(TitanicData)

## Observations: 2,201
## Variables: 4
## $ Class   (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Age     (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Sex     (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Survive (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

summary(TitanicData)

##      Class            Age              Sex            Survive     
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.000  
##  Median :1.000   Median :1.0000   Median :1.0000   Median :0.000  
##  Mean   :1.369   Mean   :0.9505   Mean   :0.7865   Mean   :0.323  
##  3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.000  
##  Max.   :3.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000

Using the “glimpse” function, I was able to view a small part of this relatively large data set. It also reports the number of observations, the number of variables recorded for each observation, the names of the variables, and how those variables are stored (as integers). The “summary” output lists no “NA's” count for any column, which shows that no values are missing; missing values would otherwise have to be accounted for in later calculations. Since my data set is complete, I don’t have to handle or remove any “NAs”.
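
A more explicit way to confirm that nothing is missing is to count NAs per column. The sketch below uses a small made-up data frame in the same layout as the CSV, with one NA planted on purpose, so the values are illustrative only.

```r
# Toy data in the CSV's layout; the NA is planted for illustration
toy <- data.frame(
  Class   = c(0L, 1L, 2L, 3L),
  Age     = c(1L, 1L, 0L, NA),
  Sex     = c(1L, 0L, 1L, 0L),
  Survive = c(0L, 1L, 1L, 0L)
)

na_per_column <- colSums(is.na(toy))  # number of missing values in each column
na_per_column

any(is.na(toy))                       # TRUE if anything at all is missing

# Keep only rows with no missing values
complete_rows <- toy[complete.cases(toy), ]
nrow(complete_rows)
```

Run against TitanicData, a result of all zeros from colSums(is.na(TitanicData)) would confirm what the summary output suggests.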

This information could also be viewed by creating a new data frame from our original dataset by running the following command:

TitanicDataDF <- tbl_df(TitanicData)
TitanicDataDF

## Source: local data frame [2,201 x 4]
## 
##    Class   Age   Sex Survive
##    (int) (int) (int)   (int)
## 1      1     1     1       1
## 2      1     1     1       1
## 3      1     1     1       1
## 4      1     1     1       1
## 5      1     1     1       1
## 6      1     1     1       1
## 7      1     1     1       1
## 8      1     1     1       1
## 9      1     1     1       1
## 10     1     1     1       1
## ..   ...   ...   ...     ...

This new data frame also shows that there are 2,201 observations of 4 variables (Class, Age, Sex, Survive).

The data analyzed in this report are from “Report on the Loss of the ‘Titanic’ (S.S.)” (1990), British Board of Trade Inquiry Report (reprint), Gloucester, UK: Allan Sutton Publishing and are discussed in Dawson, R. J. M. (1995). The ‘unusual episode’ data revisited. Journal of Statistics Education [on-line] 3(3).
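
R's “datasets” package, loaded at the start of this report, includes a contingency table named Titanic that appears to hold the same data in tabulated form. Assuming it matches the downloaded file, it offers an offline cross-check of the headline totals. This is a sketch, not part of the assignment's required steps.

```r
library(datasets)

# Titanic is a 4-way table of counts: Class x Sex x Age x Survived
dim(Titanic)
dimnames(Titanic)

total_people    <- sum(Titanic)               # everyone on board
total_survivors <- sum(Titanic[, , , "Yes"])  # only the Survived = "Yes" slice
total_people
total_survivors
```

Note that the built-in table uses labels (for example “1st”, “Crew”, “Male”) instead of the integer codes used in the CSV.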

Item 2: Calculate the total number of passengers in the dataset

Answer: 2,201 total passengers.

Both views give the answer to Item #2 in the assignment: there were a total of 2,201 people aboard the Titanic (the data set counts crew as well as passengers), which corresponds to the total number of observations in the data set regardless of class, age, sex or survival.

Looking ahead at other items in the assignment, I knew it would be helpful to have this total saved as a variable that could be reused when calculating proportions. To do this, I extracted the “Class” column as a vector and stored its length in a new variable called TP (Total Passengers) using the following procedure:

tempTP <- TitanicData[,1] #Take only the first column (the variable "Class", though any column would do) as a new vector called "tempTP"
TP <- length(tempTP) #Store the number of observations in the vector "tempTP" in a new variable "TP"
TP

## [1] 2201

This corresponds to the total number of passengers, which is now stored in the variable “TP” for use in further calculations.
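
Counting rows through the length of a column works for a plain data frame, but it depends on the column being dropped to a vector. The sketch below, on a small made-up data frame, shows that nrow() gives the count directly and what happens when the column is not dropped (tibbles, such as the one created with tbl_df above, behave like the drop = FALSE case).

```r
# Toy data frame; values are made up for illustration
toy <- data.frame(
  Class   = c(0L, 1L, 2L, 3L, 3L),
  Survive = c(0L, 1L, 1L, 0L, 0L)
)

nrow(toy)                       # 5: number of observations
length(toy[, 1])                # 5: [, 1] drops to a vector, so length counts rows
length(toy[, 1, drop = FALSE])  # 1: still a data frame, so length counts columns
```

For that reason nrow(TitanicData) may be the safer way to obtain TP if the data is ever converted to a tibble.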

Item 3: Calculate the total proportion of passengers surviving

Answer: Thirty-two percent of passengers survived.

To calculate the total proportion of passengers who survived the sinking, I first need the total number of people who survived, which I then divide by the total number of passengers (TP).

I find the total number of people who survived by filtering for the rows in which the “Survive” variable equals 1 (1 = yes, 0 = no). I can do this using the following code:

survivors <- filter(TitanicData, Survive == 1)
glimpse(survivors)

## Observations: 711
## Variables: 4
## $ Class   (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Age     (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Sex     (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Survive (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

This shows us that the total number of survivors is 711, which corresponds to the total number of filtered observations. I can also save the number of observations to a new variable TS (Total Survivors) using a similar procedure to what I did above, but using my newly created data set for the calculation:

tempTS <- survivors[,4]
TS <- length(tempTS)
TS

## [1] 711

However, I can also skip creating intermediate variables when computing TS by using piped commands. For example, the following command produces the same value for TS:

TS <- length((TitanicData %>% filter(Survive == 1))[,4])
TS

## [1] 711

Whatever method I choose to use, I now have the information I need to calculate the total proportion of survivors during the disaster, which is found by dividing the total number of survivors by the total number of passengers (TS/TP). This is easily done by the following command:

TS/TP

## [1] 0.323035

So my calculations show there was a 32% survival rate for passengers on the Titanic.

One of the things I like about “R” is that many of the steps I just outlined above can be accomplished in a more straightforward manner without having to create a number of different variables. By combining piped commands and basic functions, we can receive the same results, as is demonstrated by the single command line below:

length((TitanicData %>% filter(Survive == 1))[,4])/length(TitanicData[,1])

## [1] 0.323035
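
The same proportion can also be reached without filtering at all, because a logical comparison can be summed directly. The sketch below rebuilds a 0/1 vector from the counts reported above (711 survivors out of 2,201) so that it runs without the download; with the real data, TitanicData$Survive would take its place.

```r
# 0/1 vector rebuilt from the counts reported above
Survive <- rep(c(1L, 0L), times = c(711, 2201 - 711))

n_survivors <- sum(Survive == 1)   # TRUE counts as 1, so this counts survivors
n_survivors / length(Survive)      # proportion surviving

# Shares of both outcomes at once (0 = died, 1 = survived)
outcome_share <- prop.table(table(Survive))
outcome_share
```

The prop.table() form has the small advantage of showing the share who died next to the share who survived.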

I can use similar methods for calculating the remaining items in the assignment. As I move forward, the “group_by” function will become helpful.

I wanted to mention one other way to easily find the proportion of survivors on the Titanic. Since I know from our “summary” of the data that there isn’t any missing data (NAs) and that the data in the relevant variable “Survive” is stored as either a 1 or 0, I can easily calculate the mean of that variable through the following calculation, which also corresponds to the proportion of surviving passengers:

summarise(TitanicData, ProportionOfSurvivors=mean(Survive))

##   ProportionOfSurvivors
## 1              0.323035

It should be noted that this method may not work for more complex data sets with multiple different values other than 1 and 0. For example, this may not work if our variable also included an additional value for “Survived, but Injured”, etc.
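
To make that caveat concrete, here is a small sketch with a made-up coding in which 2 stands for “Survived, but Injured”. The coding and the values are hypothetical; the point is only that the mean of the raw codes stops being a proportion, while the mean of a logical test still is one.

```r
# Hypothetical coding: 0 = died, 1 = survived, 2 = survived but injured
status <- c(1, 0, 2, 1)

mean(status)               # 1.0: the average of the codes, not a proportion
mean(status == 1)          # 0.5: share coded exactly 1
mean(status %in% c(1, 2))  # 0.75: share who survived in either sense
```

Writing the condition out explicitly, as in the last two lines, keeps the calculation correct even if the coding scheme grows.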

Item 4: Calculate the proportion of passengers surviving for each class of passenger

Answers: Crew = 24%; First-class = 62%; Second-class = 41%; Third-class = 25%.

To do these calculations, I must first use the “group_by” function to gather information on each individual class of passengers. When combined with the calculations to estimate proportion of survivors above, I can readily show the proportion of survivors for each class. I can do this by running the following piped command line.

TitanicData %>% group_by(Class) %>% summarise(Survivors_by_Class=mean(Survive))

## Source: local data frame [4 x 2]
## 
##   Class Survivors_by_Class
##   (int)              (dbl)
## 1     0          0.2395480
## 2     1          0.6246154
## 3     2          0.4140351
## 4     3          0.2521246

Since I’m new to using “R”, I wanted to make sure that this piped command was returning the information I was looking for. I can unpack (and check) these calculations by, first, calculating the total number of people in each class and, next, dividing the number of survivors in each class by the total I just calculated. I can do this by running the following commands based on the methods shown above:

CrewMembers <- filter(TitanicData, Class == 0) #Create data set of only crew members
TCM <- length(CrewMembers[,1]) #Calculate the total number of observations in that data set
TCM #Total number of Crew Members

## [1] 885

CrewSurvivors <- filter(CrewMembers, Survive == 1) #Create data set of only surviving crew members
SCM <- length(CrewSurvivors[,1]) #Calculate the total number of observations in that data set
SCM #Total number of surviving Crew Members

## [1] 212

SCM/TCM

## [1] 0.239548

Therefore, if you were a crew member, you had a 24% chance of surviving the disaster. I can use this same calculation for each other class of passenger, which I have done below:

FirstClassMembers <- filter(TitanicData, Class == 1) #Create data set of only first-class members
FCP <- length(FirstClassMembers[,1]) #Calculate the total number of observations in that data set
FCP #Total number of First-Class Members

## [1] 325

FirstClassSurvivors <- filter(FirstClassMembers, Survive == 1) #Create data set of only surviving First-Class members
SFCP <- length(FirstClassSurvivors[,1]) #Calculate the total number of observations in that data set
SFCP #Total number of surviving First-Class Members

## [1] 203

SFCP/FCP

## [1] 0.6246154

Therefore, if you were a first-class passenger, you had a 62% chance of surviving the disaster.

SecondClassMembers <- filter(TitanicData, Class == 2) #Create data set of only Second-class members
SCP <- length(SecondClassMembers[,1]) #Calculate the total number of observations in that data set
SCP #Total number of Second-Class Passengers

## [1] 285

SecondClassSurvivors <- filter(SecondClassMembers, Survive == 1) #Create data set of only surviving Second-Class members
SSCP <- length(SecondClassSurvivors[,1]) #Calculate the total number of observations in that data set
SSCP #Total number of surviving Second-Class Passengers

## [1] 118

SSCP/SCP

## [1] 0.4140351

Therefore, if you were a second-class passenger, you had a 41% chance of surviving the disaster.

ThirdClassPassengers <- filter(TitanicData, Class == 3) #Create data set of only Third-Class Passengers
TCP <- length(ThirdClassPassengers[,1]) #Calculate the total number of observations in that data set
TCP #Total number of Third-Class Passengers

## [1] 706

ThirdClassSurvivors <- filter(ThirdClassPassengers, Survive == 1) #Create data set of only surviving Third-Class Passengers
STCP <- length(ThirdClassSurvivors[,1]) #Calculate the total number of observations in that data set
STCP #Total number of surviving Third-Class Passengers

## [1] 178

STCP/TCP

## [1] 0.2521246

Therefore, if you were a third-class passenger, you had a 25% chance of surviving the disaster.

Now that I’m more comfortable that the “summarise” function is returning the information I’m looking for, I will primarily use it to complete the rest of the tasks in the assignment.
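
The four manual checks above can also be condensed into a single vectorized division. This sketch copies the totals and survivor counts from the output above into named vectors (the vector names are my own), and adds a consistency check against the overall totals from Items 2 and 3.

```r
# Counts copied from the output above
class_totals    <- c(Crew = 885, First = 325, Second = 285, Third = 706)
class_survivors <- c(Crew = 212, First = 203, Second = 118, Third = 178)

# The class counts should add up to the overall totals found earlier
stopifnot(sum(class_totals) == 2201, sum(class_survivors) == 711)

# Element-wise division gives all four survival rates at once
class_rates <- class_survivors / class_totals
round(class_rates, 3)
```

Because the vectors are named, the result keeps the class labels, which makes it easy to compare against the grouped summary.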

Item 5: Calculate the proportion of passengers surviving for each sex category. Which sex had the highest survival rate?

Answer: Females had the highest rate of survival.

By using the “summarise” command when “grouping by” the variable “Sex”, I’m able to calculate the following:

TitanicData %>% group_by(Sex) %>% summarise(Survivors_by_Sex=mean(Survive))

## Source: local data frame [2 x 2]
## 
##     Sex Survivors_by_Sex
##   (int)            (dbl)
## 1     0        0.7319149
## 2     1        0.2120162

Women (female = 0) had a 73% chance of surviving the sinking, while men (male = 1) had only a 21% chance.

Item 6: Calculate the proportion of passengers surviving for each age category. Which age had the lowest survival rate?

Answer: Adults had the lowest survival rate.

Using the same method as in Item 5, but changing the “group by” variable, I can find the answer by the following command:

TitanicData %>% group_by(Age) %>% summarise(Survivors_by_Age=mean(Survive))

## Source: local data frame [2 x 2]
## 
##     Age Survivors_by_Age
##   (int)            (dbl)
## 1     0        0.5229358
## 2     1        0.3126195

Children (Child = 0) had a 52% chance of surviving the disaster, while adults (Adult = 1) only had a 31% chance.
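
Since the integer codes have to be decoded by hand each time (0 = female, 1 = male; 0 = child, 1 = adult), it may be easier to attach labels once with factor(). The sketch below uses a few made-up rows in the CSV's coding and base R's tapply() in place of the piped summary, so the rates it prints are illustrative only.

```r
# Toy rows in the CSV's coding; values are made up for illustration
toy <- data.frame(
  Age     = c(1L, 1L, 0L, 0L, 1L),
  Sex     = c(0L, 1L, 1L, 0L, 1L),
  Survive = c(1L, 0L, 1L, 1L, 0L)
)

# Attach readable labels to the integer codes
toy$Age <- factor(toy$Age, levels = c(0, 1), labels = c("Child", "Adult"))
toy$Sex <- factor(toy$Sex, levels = c(0, 1), labels = c("Female", "Male"))

# Grouped mean of Survive, now reported under the labels
rates_by_sex <- tapply(toy$Survive, toy$Sex, mean)
rates_by_sex
```

The same two factor() lines applied to TitanicData would make the grouped output read “Female” and “Male” in place of 0 and 1.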

Item 7: Calculate the proportion of passengers surviving for each age/sex category (i.e., for adult males, child males, adult females, child females). Which group was most likely to survive? Least likely?

Answers: Adult women were the most likely to survive (74%), and adult men were the least likely to survive (20%).

Using piped commands with an additional “group by” variable, calculating these rates was straightforward:

TitanicData %>% group_by(Age, Sex) %>% summarise(Survivors_by_Age_Sex=mean(Survive))

## Source: local data frame [4 x 3]
## Groups: Age [?]
## 
##     Age   Sex Survivors_by_Age_Sex
##   (int) (int)                (dbl)
## 1     0     0            0.6222222
## 2     0     1            0.4531250
## 3     1     0            0.7435294
## 4     1     1            0.2027594

I find it interesting to note that while children, overall, had a higher chance of survival than adults, adult women were saved at a higher rate than either sex of children taken individually. This initially suggests a number of hypotheses:

Women without children were able to find seats in lifeboats more quickly.
Women with multiple children chose to stay on Titanic with their children rather than take a seat on the lifeboat if only one was available and/or chose to place one of their children in that seat if available.
There were more women in third-class who had multiple children and their higher individual mortality rate also led to a higher mortality rate for their entire family.
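
Before weighing hypotheses like these, it may help to see how many people stand behind each rate. The sketch below uses the Titanic table from R's “datasets” package, which I am assuming holds the same data as the downloaded file, to show group sizes alongside survival rates for each age/sex group.

```r
library(datasets)

# Titanic is a 4-way table of counts: Class x Sex x Age x Survived
group_sizes     <- apply(Titanic, c(3, 2), sum)               # people per Age x Sex group
group_survivors <- apply(Titanic[, , , "Yes"], c(3, 2), sum)  # survivors per group
group_rates     <- group_survivors / group_sizes

group_sizes
round(group_rates, 3)
```

Children make up only 109 of the 2,201 people, so the child rates rest on much smaller groups than the adult rates and deserve more caution.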

Item 8: Calculate the proportion of passengers surviving for each age/sex/class category. Which group had the highest mortality in this disaster? Why?

Answer: Adult Second-Class Males had the highest mortality rate in the disaster, with only an 8% chance of surviving.

I can easily add a third variable to my “group by” command line to gather information on survival probability within a specific class of passenger.

TitanicData %>% group_by(Age, Sex, Class) %>% summarise(Survivors_by_Age_Sex_Class=mean(Survive))

## Source: local data frame [14 x 4]
## Groups: Age, Sex [?]
## 
##      Age   Sex Class Survivors_by_Age_Sex_Class
##    (int) (int) (int)                      (dbl)
## 1      0     0     1                 1.00000000
## 2      0     0     2                 1.00000000
## 3      0     0     3                 0.45161290
## 4      0     1     1                 1.00000000
## 5      0     1     2                 1.00000000
## 6      0     1     3                 0.27083333
## 7      1     0     0                 0.86956522
## 8      1     0     1                 0.97222222
## 9      1     0     2                 0.86021505
## 10     1     0     3                 0.46060606
## 11     1     1     0                 0.22273782
## 12     1     1     1                 0.32571429
## 13     1     1     2                 0.08333333
## 14     1     1     3                 0.16233766

I found this result surprising, as I would have thought that Adult Third-Class Males and/or Male Crew members would have had a higher rate of mortality. My thinking is based on the idea that social privilege would have seen upper-class members taking seats on lifeboats first, and that the crew would be the last to leave the ship, by which time those seats would already be taken.

However, when I look at this result in conjunction with the fact that all second-class children survived, along with a large proportion of second-class females, it would indicate that many second-class males gave up seats in lifeboats to their families, or women with children.

The numbers may also indicate that there were a large number of men without families in third class. This is supported by the fact that a disproportionate number of third-class children lost their lives. I think it is reasonable to suggest that if a father had to choose between himself and his children, he would opt to give a seat to his child. However, this is a hypothesis that would need further examination.
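
The lowest-survival group in Item 8 can also be located programmatically, together with its size. This is a sketch using the Titanic table from R's “datasets” package, assuming it holds the same data as the downloaded file; the column names Freq_total and Freq_survived come from the suffixes I chose for the merge.

```r
library(datasets)

# One row per Class/Sex/Age/Survived combination, with a Freq count
counts <- as.data.frame(Titanic)

# People and survivors per Class/Sex/Age group
totals <- aggregate(Freq ~ Class + Sex + Age, data = counts, FUN = sum)
alive  <- aggregate(Freq ~ Class + Sex + Age,
                    data = counts[counts$Survived == "Yes", ], FUN = sum)

groups <- merge(totals, alive, by = c("Class", "Sex", "Age"),
                suffixes = c("_total", "_survived"))
groups <- groups[groups$Freq_total > 0, ]  # drop empty groups (no child crew)
groups$rate <- groups$Freq_survived / groups$Freq_total

# Group with the lowest survival rate
lowest <- groups[which.min(groups$rate), ]
lowest
```

Seeing the group size next to the rate helps when judging how much weight a rate can bear, since some groups, such as first-class children, are very small.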

Item 9: Write a summary of your findings. Your summary may contain no more than 60 words.

The data show the crew died at the highest rate and in the greatest number. Third-class fared slightly better and accounted for most deaths among paying passengers, including children. First- and second-class passengers had the greatest chance of survival; the vast majority of their deaths were among men, and no children were lost in these classes.