library(tidyverse)
library(ggplot2)
library(ggpie)
knitr::include_graphics("titanicsink.jpg")
Here we will explore data about the passengers of a renowned ship, the RMS Titanic. By the end of this project, we may have uncovered some facts about its passengers. To start, let’s import the data!
tnc <- read.csv("Titanic.csv")
dim(tnc)
## [1] 418 12
head(tnc)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | | S |
| 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | | Q |
| 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | | S |
| 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | | S |
| 897 | 0 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.2250 | | S |

tail(tnc)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 413 | 1304 | 1 | 3 | Henriksson, Miss. Jenny Lovisa | female | 28.0 | 0 | 0 | 347086 | 7.7750 | | S |
| 414 | 1305 | 0 | 3 | Spector, Mr. Woolf | male | NA | 0 | 0 | A.5. 3236 | 8.0500 | | S |
| 415 | 1306 | 1 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
| 416 | 1307 | 0 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | | S |
| 417 | 1308 | 0 | 3 | Ware, Mr. Frederick | male | NA | 0 | 0 | 359309 | 8.0500 | | S |
| 418 | 1309 | 0 | 3 | Peter, Master. Michael J | male | NA | 1 | 1 | 2668 | 22.3583 | | C |

str(tnc)
## 'data.frame': 418 obs. of 12 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Survived : int 0 1 0 0 1 0 1 0 1 0 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : chr "Q" "S" "Q" "S" ...
anyNA(tnc)
## [1] TRUE
colSums(is.na(tnc))
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0           0           0           0           0          86
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           1           0           0
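One caveat on these counts: `is.na()` only detects true `NA` values, so the empty strings in `Cabin` (visible in the `str()` output) are reported as 0 missing. Below is a minimal sketch of counting blanks as well. The `toy` data frame and its values are illustrative assumptions, not rows read from `Titanic.csv`.

```r
# Toy data frame (hypothetical values) with both kinds of missingness:
# NA in a numeric column and empty strings in a character column
toy <- data.frame(
  Age   = c(34.5, NA, 62, NA),
  Cabin = c("", "C105", "", ""),
  stringsAsFactors = FALSE
)

# True NA values per column: Cabin looks complete here
na_counts <- colSums(is.na(toy))

# Empty strings per column, ignoring NA entries
blank_counts <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

print(na_counts)
print(blank_counts)
```

If blanks should be treated as missing from the start, `read.csv()` accepts `na.strings = c("NA", "")`. Note that dropping every row with any `NA` would then also drop all passengers without a recorded cabin.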
tnc <- na.exclude(tnc)
dim(tnc)
## [1] 331 12
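The row count is consistent with the missing-value counts: 86 rows lack `Age` and 1 lacks `Fare`, and 418 - 87 = 331, so no row is missing both. Since `na.exclude()` drops a row for an `NA` in any column, a more targeted alternative is to filter on chosen columns only. This is a sketch on a made-up `toy` data frame (hypothetical values), not the author's method.

```r
# Toy data frame (hypothetical values): one missing Age, one missing Fare
toy <- data.frame(
  Age  = c(34.5, NA, 62, 27),
  Fare = c(7.83, 7.00, NA, 8.66),
  Name = c("a", "b", "c", "d")
)

# TRUE for rows where every selected column is non-missing
keep <- complete.cases(toy[, c("Age", "Fare")])

# Keep only the complete rows
cleaned <- toy[keep, ]
print(nrow(cleaned))
```

On the real data the same idea would read `tnc[complete.cases(tnc[, c("Age", "Fare")]), ]`.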
summary(tnc)
## PassengerId Survived Pclass Name
## Min. : 892.0 Min. :0.0000 Min. :1.000 Length:331
## 1st Qu.: 992.5 1st Qu.:0.0000 1st Qu.:1.000 Class :character
## Median :1100.0 Median :0.0000 Median :2.000 Mode :character
## Mean :1100.2 Mean :0.3837 Mean :2.142
## 3rd Qu.:1210.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1307.0 Max. :1.0000 Max. :3.000
## Sex Age SibSp Parch
## Length:331 Min. : 0.17 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median :27.00 Median :0.0000 Median :0.0000
## Mean :30.18 Mean :0.4834 Mean :0.3988
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :76.00 Max. :8.0000 Max. :6.0000
## Ticket Fare Cabin Embarked
## Length:331 Min. : 0.00 Length:331 Length:331
## Class :character 1st Qu.: 8.05 Class :character Class :character
## Mode :character Median : 16.00 Mode :character Mode :character
## Mean : 40.98
## 3rd Qu.: 40.63
## Max. :512.33
sapply(tnc, n_distinct)
## PassengerId    Survived      Pclass        Name         Sex         Age
##         331           2           3         331           2          78
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           7           7         284         148          73           3
unique(tnc$Survived)
## [1] 0 1
unique(tnc$Pclass)
## [1] 3 2 1
unique(tnc$Sex)
## [1] "male" "female"
unique(tnc$Embarked)
## [1] "Q" "S" "C"
Here are some facts and explanations about the data:
- PassengerId is unique for each passenger; no two passengers share an id
- Survived has 2 categories: 0 = did not survive, 1 = survived
- Pclass is the passenger class (1 = first, 2 = second, 3 = third)
- Name, Sex, and Age are self-explanatory
- SibSp is the number of siblings/spouses on board the RMS Titanic
- Parch is the number of parents/children on board the RMS Titanic
- Ticket is the ticket number. A ticket can be shared: there are 284 distinct tickets among the 331 passengers
- Fare is the ticket price. The data does not state a currency. The mean fare is 40.98 and the most expensive ticket cost 512.33
- Cabin is the cabin number the passenger stayed in (blank if none is recorded)
- Embarked is the port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton

Now I will remove the columns SibSp and Parch, as they tell us little about the relationships between the other variables or observations.
tnc <- tnc[,-c(7:8)]
names(tnc)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"
##  [6] "Age"         "Ticket"      "Fare"        "Cabin"       "Embarked"
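Dropping columns by position (`-c(7:8)`) works here, but it silently removes the wrong columns if the column order ever changes. A minimal sketch of dropping by name instead, on a made-up `toy` data frame (hypothetical values; `drop_cols` and `trimmed` are illustrative names):

```r
# Toy data frame (hypothetical values) with the two columns to remove
toy <- data.frame(
  PassengerId = 892:893,
  SibSp       = c(0L, 1L),
  Parch       = c(0L, 0L),
  Fare        = c(7.8292, 7.0)
)

# Columns to drop, identified by name rather than position
drop_cols <- c("SibSp", "Parch")

# Keep every column whose name is not in drop_cols
trimmed <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(trimmed))
```

Since the tidyverse is already loaded in this document, `select(tnc, -SibSp, -Parch)` would be an equivalent dplyr spelling.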
head(tnc)
| PassengerId | Survived | Pclass | Name | Sex | Age | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 330911 | 7.8292 | | Q |
| 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 363272 | 7.0000 | | S |
| 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 240276 | 9.6875 | | Q |
| 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 315154 | 8.6625 | | S |
| 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 3101298 | 12.2875 | | S |
| 897 | 0 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 7538 | 9.2250 | | S |

ggpie3D(data = tnc, group_key = "Sex",
count_type = "full",
tilt_degrees = 8,
label_size=2) +
ggtitle("Percentage Between Males and Females On Board") +
theme(plot.title = element_text(hjust = 0.5))
Of the passengers recorded in the data, 62% are male (204 passengers)
and 38% are female (127 passengers).
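The percentages in the pie chart can be checked directly from the two counts. A small sketch, rebuilding a `sex` vector from the totals quoted above (the vector is reconstructed for illustration, not read from `Titanic.csv`):

```r
# Rebuild the Sex column from the counts quoted in the text
sex <- c(rep("male", 204), rep("female", 127))

# Count passengers per sex
counts <- table(sex)

# Convert the counts to whole-number percentages
shares <- round(100 * prop.table(counts))

print(counts)
print(shares)
```

On the real data, `round(100 * prop.table(table(tnc$Sex)))` would give the same kind of summary.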
ggplot(tnc, aes(Pclass, Fare))+
geom_col(aes(fill= Sex), position = "dodge")+
geom_jitter(aes(col=Survived, size=Fare))+
labs(title = "Ticket Fare per Class of Titanic Passenger",
x = "Passenger Class")+
theme(plot.title = element_text(hjust = 0.5))
From the graph above we can see that:
1. the most expensive ticket was a first-class ticket, bought by a female passenger for just over 500
2. third-class tickets were the cheapest overall; the tallest third-class bar, for female passengers, reaches just under 50
3. one first-class passenger did not pay a single penny
4. third-class passengers form the largest group in the data
5. some first-class tickets cost the same as second- and third-class tickets, under 100
6. all second- and third-class fares are under 100

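A note on reading the bar heights: as far as I understand ggplot2's dodging, with `position = "dodge"` and one row per passenger the bars within each class and sex are drawn over one another rather than summed, so the visible height is the largest fare in that group, not a total. A minimal sketch of computing those per-group maxima explicitly, on a made-up `toy` data frame (hypothetical fares chosen for illustration):

```r
# Toy data frame (hypothetical values): a few fares per class and sex
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 3, 3),
  Sex    = c("female", "male", "male", "female", "female", "male"),
  Fare   = c(512.33, 0, 82.27, 13.00, 12.29, 7.83)
)

# Largest fare within each class/sex combination
top_fares <- aggregate(Fare ~ Pclass + Sex, data = toy, FUN = max)

# Order by class, then sex, for easier reading
top_fares <- top_fares[order(top_fares$Pclass, top_fares$Sex), ]
print(top_fares)
```

Swapping `FUN = max` for `FUN = min` or `FUN = median` gives the cheapest or typical fare per group in the same shape.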
Based on insight number 3, let’s find out who boarded the RMS Titanic without paying a single penny.
filter(tnc, Pclass == 1, Fare == 0)
| PassengerId | Survived | Pclass | Name | Sex | Age | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 1264 | 0 | 1 | Ismay, Mr. Joseph Bruce | male | 49 | 112058 | 0 | B52 B54 B56 | S |

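The free rider with a first-class ticket is Mr. Joseph Bruce Ismay, who was probably the owner, a special guest, or an important crew member of the RMS Titanic. (Historically, J. Bruce Ismay was chairman of the White Star Line, the company that operated the ship.)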
ggplot(tnc, aes(Sex, Age))+
geom_boxplot(outlier.shape=NA, aes(fill=Sex), col="Blue")+
geom_jitter(alpha=0.5, col="orange")+
labs(title = "Passengers Age")+
theme(plot.title = element_text(hjust = 0.5))
From the plot above we can see that:
- for both sexes, the most common age range is 20-40, followed by 41-60 and then 0-20
- there are babies on board
- the oldest male passenger is under 70
- the oldest female passenger is above 70
- the median age is almost the same for both sexes

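The age ranges above are read off the plot by eye; they can also be counted by binning. A minimal sketch using `cut()` on the first ten `Age` values shown in the `str()` output (only a ten-value sample, so the counts illustrate the method, not the full data; the band labels are my own choice):

```r
# First ten Age values from the str() output earlier in the document
ages <- c(34.5, 47, 62, 27, 22, 14, 30, 26, 18, 21)

# Bin into bands; intervals are right-closed: (0,20], (20,40], (40,60], (60,Inf]
age_band <- cut(
  ages,
  breaks = c(0, 20, 40, 60, Inf),
  labels = c("0-20", "21-40", "41-60", "61+")
)

# Count how many ages fall into each band
band_counts <- table(age_band)
print(band_counts)
```

On the real data, `table(tnc$Sex, cut(tnc$Age, breaks = c(0, 20, 40, 60, Inf)))` would give the same breakdown per sex.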
ggplot(tnc, aes(Sex, Survived))+
labs(title = "Passenger Survived",
x = "Sex",
y = "Survived")+
geom_bar(stat = "identity", aes(fill= Sex))+
coord_polar("y", start=0, direction = 1)+
theme_void()
As the pie chart above shows, all surviving passengers of the Titanic
accident in this data are female. To check this, let’s look at the
tables.
filter(tnc, Survived == 1, Sex == "male")
| PassengerId | Survived | Pclass | Name | Sex | Age | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|

filter(tnc, Survived == 0, Sex == "female")
| PassengerId | Survived | Pclass | Name | Sex | Age | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|

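From the data above, we can conclude that all females recorded in the data survived the accident and all males recorded in the data did not. This perfect split holds only for the 331 records in this file; it suggests the Survived values may have been assigned by sex rather than observed, so it should not be read as a historical finding.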
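The two empty filters can be replaced by a single cross-tabulation, which shows all four combinations at once. A sketch that rebuilds the two columns from the counts reported in this document (204 males, none surviving; 127 females, all surviving); the `records` data frame is reconstructed for illustration, not read from `Titanic.csv`:

```r
# Rebuild Sex and Survived from the totals reported in the text
records <- data.frame(
  Sex      = c(rep("male", 204), rep("female", 127)),
  Survived = c(rep(0, 204), rep(1, 127))
)

# Rows are sexes, columns are Survived values
cross <- table(records$Sex, records$Survived)
print(cross)
```

On the real data the equivalent call is `table(tnc$Sex, tnc$Survived)`; zeros in the off-diagonal cells correspond to the two empty tables above.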
ggrosepie(data = tnc, group_key = c("Embarked", "Sex"),
count_type = "full",
label_info = "all",
show_tick = F,
donut_frac = NULL,
donut_label_size = "5") +
ggtitle("Percentage of Embarkation In Each Port") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(tnc, aes(Embarked, Sex))+
geom_col(aes(fill = Sex), position = "fill",
show.legend = F)+
labs(title = "Proportion Based On Sex In Each Embarkation Port",
x = "Port of Embarkation")+
theme(plot.title = element_text(hjust = 0.5))
From the 2 graphs above, the data show that:
- Southampton has the most embarkations, with 69% of all passengers, the majority of them male
- Cherbourg has the second most, with 25% of all passengers, the majority of them male
- Queenstown has the fewest, with 7% of all passengers. However, its proportion of females is the highest of the three ports

The “Percentage of Embarkation In Each Port” graph shows the share of passengers that embarked at each port. Unfortunately, it does not show the split between males and females, so for that we need the “Proportion Based On Sex In Each Embarkation Port” graph. Using 2 graphs is rather inefficient, since both pieces of information can be shown in a single graph. The next visualization does exactly that, although it shows the proportion of males and females in each passenger class rather than in each port.
ggnestedpie(tnc, group_key = c("Pclass", "Sex"), count_type = "full",
inner_label_info = "all",
inner_label_split = NULL,
inner_label_size = 2,
outer_label_type = "circle",
outer_label_pos = "in",
outer_label_info = "all")+
labs(title = "Proportions of Males And Females In Each Class")+
theme(plot.title = element_text(hjust=0.5))
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
From the nested pie chart above we can see that (all percentages are shares of all passengers across the three classes):
- 1st class is the second-largest group at 29.6% of all passengers: 15.11% male and 14.50% female
- 2nd class is the smallest group at 26.6% of all passengers: 17.82% male and 8.76% female
- 3rd class is the largest group at 43.8% of all passengers: 28.70% male and 15.11% female
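As a quick consistency check, the male and female shares quoted above should add up to the class totals, and everything together should come to 100%. A small sketch using only the percentages reported in this document (the `shares` matrix is typed in by hand, not computed from `Titanic.csv`):

```r
# Percentages of all passengers, by class (rows) and sex (columns),
# copied from the nested pie chart readings above
shares <- matrix(
  c(15.11, 14.50,
    17.82,  8.76,
    28.70, 15.11),
  nrow = 3, byrow = TRUE,
  dimnames = list(Pclass = c("1", "2", "3"), Sex = c("male", "female"))
)

# Share of all passengers in each class
class_totals <- rowSums(shares)

# Share of all passengers of each sex
sex_totals <- colSums(shares)

print(round(class_totals, 1))
print(round(sex_totals, 1))
print(sum(shares))
```

The class totals round to 29.6, 26.6, and 43.8, and the sex totals round to about 62% male and 38% female, which agrees with the first pie chart.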