Titanic is a data set that provides information on the survival of passengers on the fatal maiden voyage of the ocean liner “Titanic”. Titanic is one of my favorite movies, so I thought it would be interesting to explore some of the data behind the historic event. In this report, we will explore whether personal identifiers played a role in survival on the day the Titanic hit the iceberg and sank.
The data set contains 12 columns: Passenger ID, Survived, Class (Pclass), Name, Sex, Age, Siblings/Spouses (SibSp), Parents/Children (Parch), Ticket, Fare, Cabin, and Embarked. The set contains a sample of 891 passengers who were on board the Titanic when it embarked on its only voyage.
Number of columns in the data set:
colnames(df_titanic) # 12 columns
##  [1] "PassengerId" "Survived"
##  [3] "Pclass"      "Name"
##  [5] "Sex"         "Age"
##  [7] "SibSp"       "Parch"
##  [9] "Ticket"      "Fare"
## [11] "Cabin"       "Embarked"
Summary of fields included in the Titanic data set:
summary(df_titanic)
##   PassengerId       Survived          Pclass
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000
##  Median :446.0   Median :0.0000   Median :3.000
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000
##
##      Name               Sex                 Age            SibSp
##  Length:891         Length:891         Min.   : 0.42   Min.   :0.000
##  Class :character   Class :character   1st Qu.:20.12   1st Qu.:0.000
##  Mode  :character   Mode  :character   Median :28.00   Median :0.000
##                                        Mean   :29.70   Mean   :0.523
##                                        3rd Qu.:38.00   3rd Qu.:1.000
##                                        Max.   :80.00   Max.   :8.000
##                                        NA's   :177
##      Parch           Ticket               Fare           Cabin
##  Min.   :0.0000   Length:891         Min.   :  0.00   Length:891
##  1st Qu.:0.0000   Class :character   1st Qu.:  7.91   Class :character
##  Median :0.0000   Mode  :character   Median : 14.45   Mode  :character
##  Mean   :0.3816                      Mean   : 32.20
##  3rd Qu.:0.0000                      3rd Qu.: 31.00
##  Max.   :6.0000                      Max.   :512.33
##
##    Embarked
##  Length:891
##  Class :character
##  Mode  :character
In reviewing the data set values, I decided to exclude several columns from my analysis (Cabin, Fare) as they had a significant number of missing values. I also removed the 177 records with a missing (NA) Age so that the age-based plots only use passengers whose age is known. I verified that the numeric columns were stored as numeric types and the text columns as character types. Finally, I enhanced the data set by adding descriptive columns for Age Bracket, Survival, and Class.
Below we will examine each factor's role in the data set and its impact on survival rate.
The plot below displays the outcome for passengers by sex. From the plot we can determine that about twice as many women as men survived the Titanic. Proportionately, women also had a much better chance of survival than men. This supports the notion that women (possibly mothers in particular, as suggested by the high survival in the 20-35 year old range in the second visual) boarded the early set of lifeboats to safety.
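The preparation steps described above can be sketched in base R as follows. This is only an illustration under stated assumptions: `toy_titanic` and its four rows are made up for the example, and the mapping from `Pclass` to the "First Class"/"Second Class"/"Third Class" labels is assumed from the labels used in the plots later in this report.

```r
# Toy stand-in for df_titanic: four made-up rows, not real passengers.
toy_titanic <- data.frame(
  Survived = c(1L, 0L, 1L, 0L),
  Pclass   = c(1L, 3L, 2L, 3L),
  Age      = c(38, NA, 4, 22)
)

# Descriptive survival label (1 = survived, 0 = did not survive).
toy_titanic$Survived_Iceberg <- factor(
  toy_titanic$Survived,
  levels = c(1, 0),
  labels = c("Survived", "Did not Survive")
)

# Descriptive class label (assumed mapping from the numeric Pclass column).
toy_titanic$Class <- factor(
  toy_titanic$Pclass,
  levels = 1:3,
  labels = c("First Class", "Second Class", "Third Class")
)

# Drop rows with a missing Age; this plays the role of new_df_titanic.
new_toy_titanic <- toy_titanic[!is.na(toy_titanic$Age), ]
```

Keeping the unfiltered and the age-filtered data frames separate means the plots that do not involve Age can still use every passenger.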
ggplot(df_titanic, aes(x = Sex, fill = Survived_Iceberg)) +
  labs(title = "Survival by Sex", x = "Sex", y = "Passenger Count") +
  geom_bar(position = position_dodge(width = 0.9)) +
  scale_fill_manual(values = c("Survived" = "#3d7185",
                               "Did not Survive" = "#b52618")) +
  # after_stat() replaces the deprecated stat(); the dodge width matches the bars
  geom_text(stat = "count",
            aes(label = after_stat(count)),
            position = position_dodge(width = 0.9), vjust = -0.5)
ggplot(new_df_titanic, aes(x = Age, color = Survived_Iceberg)) +
  geom_histogram(fill = "white", alpha = 0.5, position = "identity", bins = 30) +
  facet_grid(Sex ~ .) +
  # named values keep the colors consistent with the other plots
  scale_color_manual(values = c("Survived" = "#3d7185",
                                "Did not Survive" = "#b52618")) +
  labs(title = "Survival Counts by Age and Gender", x = "Age", y = "Count of Passengers") +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "bottom")
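To put numbers behind the "proportionately" claim, the sketch below rebuilds a sex-by-outcome table and converts it to row percentages with `prop.table()`. The counts are an assumption: they are the figures commonly reported for the 891-row Titanic training file (314 women of whom 233 survived, 577 men of whom 109 survived) and were not computed from `df_titanic` here.

```r
# Assumed counts for the 891-row file; not derived from df_titanic in this sketch.
sex <- rep(c("female", "male"), times = c(314, 577))
outcome <- c(
  rep(c("Survived", "Did not Survive"), times = c(233, 81)),   # women
  rep(c("Survived", "Did not Survive"), times = c(109, 468))   # men
)

# Cross-tabulate, then express each row (sex) as percentages of its own total.
sex_table <- table(sex, outcome)
sex_rates <- round(100 * prop.table(sex_table, margin = 1), 1)
print(sex_rates)
```

With real data, the same two calls could be applied directly to the `Sex` and `Survived_Iceberg` columns.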
When reviewing the passenger age distribution, we see that the largest age bracket of passengers aboard the Titanic was 20-30 years old. This contradicts my original expectation that a majority of the passengers were older or that there was a significant number of children on the ship.
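# bin ages into brackets of ten years; intervals are right-closed, e.g. (20,30]
new_df_titanic$bin_age <- cut(new_df_titanic$Age, c(0, 10, 20, 30, 40, 50, 60, 70, 80, 100))
# refer to columns by name inside aes(); fill needs a discrete column to dodge by
ggplot(new_df_titanic, aes(x = bin_age, fill = Survived_Iceberg)) +
  geom_bar(position = position_dodge(width = 0.9)) +
  geom_text(stat = "count", aes(label = after_stat(count)),
            position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(title = "Passenger Age Distribution", x = "Age Bracket")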
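A small sketch of how `cut()` assigns ages to brackets may help when reading the plot. The `ages` vector is hypothetical (0.42 and 80 are the minimum and maximum from the summary above); the breaks are the ones used for `bin_age`.

```r
# Hypothetical ages chosen to land on or near bracket boundaries.
ages <- c(0.42, 10, 28, 30, 80)
breaks <- c(0, 10, 20, 30, 40, 50, 60, 70, 80, 100)

# By default cut() builds right-closed intervals, so 10 falls in (0,10]
# and 30 falls in (20,30].
brackets <- cut(ages, breaks)
print(as.character(brackets))
```

Because the oldest passenger is 80, the final (80,100] bracket stays empty with these breaks.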
The plots below show the density of the age distribution across the total
passenger population, followed by a breakout of age density by
survival. There is a noticeable peak between 0 and 10
years old in the survivor density chart, supporting the idea that
children were among the first passengers to board the lifeboats
as the Titanic began to sink.
ggplot(new_df_titanic, aes(x = Age)) +
labs(title = "Passenger Age Density", x = "Age") +
geom_density(fill='#dbba2f')
# the trailing comma makes the row filter explicit
plota <- ggplot(new_df_titanic[new_df_titanic$Survived == 1, ], aes(x = Age)) +
  labs(title = "Survivors", x = "Age") +
  geom_density(fill = "#3d7185")
plotb <- ggplot(new_df_titanic[new_df_titanic$Survived == 0, ], aes(x = Age)) +
  labs(title = "Did not Survive", x = "Age") +
  geom_density(fill = "#b52618")
grid.arrange(plota, plotb, ncol=2)
The following visual illustrates survival counts broken out by class. The largest group of passengers that did not survive were Third Class ticket holders. Most of the survivors came from First or Second Class. The class with the largest survival proportion is First Class, where 63% of passengers survived, in contrast to 24% of Third Class passengers.
# split passengers by outcome (the trailing comma makes the row filter explicit)
Survived_Titanic <- df_titanic[df_titanic$Survived == 1, ]
NoSurvived_Titanic <- df_titanic[df_titanic$Survived == 0, ]
# count() here is plyr::count, which returns a `freq` column
PassengerClassCount1 <- data.frame(count(Survived_Titanic, "Class"))
# order each table by its own freq column
PassengerClassCount1 <- na.omit(PassengerClassCount1[order(PassengerClassCount1$freq, decreasing = TRUE), ])
PassengerClassCount0 <- data.frame(count(NoSurvived_Titanic, "Class"))
PassengerClassCount0 <- na.omit(PassengerClassCount0[order(PassengerClassCount0$freq, decreasing = TRUE), ])
#convert passenger ID to character value
df_titanic$PassengerId <- as.character(df_titanic$PassengerId)
#survived
plot1 <- ggplot(PassengerClassCount1, aes(x=Class, y=freq, fill = freq)) +
geom_bar(stat="identity") +
labs(title="Survived") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_gradient2(low = "#3d7185", mid = "#dbba2f", high = "#b52618",
midpoint = 120)
#did not survive
plot0 <- ggplot(PassengerClassCount0, aes(x=Class, y=freq, fill = freq)) +
geom_bar(stat="identity") +
labs(title="Did not Survive") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_gradient2(low = "#3d7185", mid = "#dbba2f", high = "#b52618",
midpoint = 120)
grid.arrange(plot0, plot1, ncol=2)
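The survival percentages quoted for each class can be checked with a few lines of arithmetic. The counts below are an assumption: they are the figures commonly reported for the 891-row Titanic training file and were not computed from `df_titanic` here; `class_counts` is a name introduced only for this sketch.

```r
# Assumed survivor and passenger counts per class for the 891-row file.
class_counts <- data.frame(
  Class    = c("First Class", "Second Class", "Third Class"),
  survived = c(136, 87, 119),
  total    = c(216, 184, 491)
)

# Survival rate per class as a whole-number percentage.
class_counts$survival_rate <- round(100 * class_counts$survived / class_counts$total)
print(class_counts)
```

Under these counts the rates come out at 63% for First Class and 24% for Third Class, matching the figures quoted above, with Second Class in between at 47%.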
Comparative Survival Rates by Class
# One ring per class. With `values` omitted, a plotly pie trace sizes each
# sector by counting the occurrences of its label.
plot_ly(hole = 0.75) %>%
  layout(title = "Survival Rate by Class") %>%
  add_pie(labels = as.character(df_titanic$Survived_Iceberg[df_titanic$Class == "First Class"]),
          textposition = "inside",
          marker = list(colors = c("#3d7185", "#b52618")),
          hovertemplate = "Class:First Class<br>Percent:%{percent}<br>Passenger Count:%{value}<extra>/</extra>") %>%
  add_pie(labels = as.character(df_titanic$Survived_Iceberg[df_titanic$Class == "Second Class"]),
          textposition = "inside",
          hovertemplate = "Class:Second Class<br>Percent:%{percent}<br>Passenger Count:%{value}<extra>/</extra>",
          domain = list(
            x = c(0.16, 0.84),
            y = c(0.16, 0.84)
          )) %>%
  add_pie(labels = as.character(df_titanic$Survived_Iceberg[df_titanic$Class == "Third Class"]),
          textposition = "inside",
          hovertemplate = "Class:Third Class<br>Percent:%{percent}<br>Passenger Count:%{value}<extra>/</extra>",
          domain = list(
            x = c(0.27, 0.73),
            y = c(0.27, 0.73)
          ))
The trellis (faceted) representation of the factors impacting survival shows us more depth. Here we can examine the role each factor plays in survival. The most notable observation is the Third Class male plot, where survival is proportionately the lowest of the six groups. We can also see that females had a better rate of survival than males in Second Class.
new_df_titanic %>%
  ggplot(aes(x = Age, fill = Survived_Iceberg)) +
  geom_histogram(bins = 30) +
  facet_wrap(~Sex + Class) +
  scale_fill_manual(values = c("Survived" = "#3d7185",
                               "Did not Survive" = "#b52618")) +
  # Age is on the x axis and the histogram count is on the y axis
  labs(title = "Survival by Age, Sex and Passenger Class", x = "Age", y = "Passenger Count")
We can conclude from the visualizations above that Age, Sex, and Class all played significant roles in survival aboard the Titanic. In line with the movie's depiction, women (mothers) and children appear to have been loaded into the lifeboats first, as they have a high survival rate. Following this, it seems the next groups to board the lifeboats were the upper-class passengers: First Class passengers had a higher chance of survival than Second and Third Class passengers. Based on our findings, Third Class males fared the worst in this disastrous event.