The Titanic was a British passenger liner that sank on April 15, 1912, while sailing from Southampton to New York City, United States. Of the roughly 2,224 passengers and crew on board, more than 1,500 died and only about 705 survived. In this post, I will conduct an Exploratory Data Analysis (EDA) on the Titanic passenger data.
Let's start!
These are the columns in the dataset:
1 = PassengerId
2 = Survived
3 = Pclass
4 = Name
5 = Sex
6 = Age
7 = SibSp
8 = Parch
9 = Ticket
10 = Fare
11 = Cabin
12 = Embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
# Read the data and check its structure
titanic <- read.csv("train.csv")
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
# Dimensions
dim(titanic)
## [1] 891 12
# Check for NA values
colSums(is.na(titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
From the data above:
I will not use the PassengerId, Name, SibSp, Parch, Ticket, and Cabin variables.
The Age column has 177 NA values, so I won't use it either.
I will convert the integer Survived and Pclass columns into factors.
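One caveat about the NA check: the `str()` output shows that Cabin starts with `""` values, and `is.na()` does not treat empty strings as missing, so a count of 0 does not mean a column is complete. The sketch below shows one way to count empty strings separately. It uses a small made-up data frame (`toy`, with illustrative values only) rather than the real file, so the numbers it prints say nothing about the Titanic data.

```r
# Toy data frame standing in for the Titanic data (values are illustrative only)
toy <- data.frame(
  Age      = c(22, NA, 26),
  Cabin    = c("", "C85", ""),
  Embarked = c("S", "", "C"),
  stringsAsFactors = FALSE
)

# is.na() only counts true NA values
na_counts <- colSums(is.na(toy))

# Count empty strings column by column; NA entries are excluded explicitly
empty_counts <- sapply(toy, function(col) sum(!is.na(col) & col == ""))

print(na_counts)
print(empty_counts)
```

Running the same `sapply()` call on `titanic` would show how many Cabin and Embarked entries are blank. Alternatively, `read.csv("train.csv", na.strings = c("", "NA"))` converts empty strings to NA at load time, so that `colSums(is.na(...))` catches them.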
# Drop the columns that will not be used
titanic1 <- titanic[,-c(1, 4, 6, 7, 8, 9, 11)]
# Check the titanic1 data
str(titanic1)
## 'data.frame': 891 obs. of 5 variables:
## $ Survived: int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: chr "S" "C" "S" "S" ...
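Dropping columns by position works here, but it silently breaks if the column order of the file ever changes. As an alternative, the columns to keep can be selected by name. This is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical); the column names match the dataset.

```r
# Toy data frame with a few of the Titanic columns (values are illustrative only)
toy <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0, 1, 1),
  Pclass      = c(3, 1, 3),
  Name        = c("A", "B", "C"),
  Sex         = c("male", "female", "female"),
  Fare        = c(7.25, 71.28, 7.92),
  Embarked    = c("S", "C", "S")
)

# Keep columns by name instead of dropping them by index
keep <- c("Survived", "Pclass", "Sex", "Fare", "Embarked")
toy1 <- toy[, keep]

print(names(toy1))
```

On the real data, `titanic[, keep]` should give the same five columns as the index-based subset above.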
# Convert Survived into a labelled factor
titanic1$Survived <- factor(titanic1$Survived,
                            levels = c(0, 1),
                            labels = c("Not Survived", "Survived"))
# Give each Pclass value a descriptive label
titanic1$Pclass <- factor(titanic1$Pclass,
                          levels = c(1, 2, 3),
                          labels = c("1st Class", "2nd Class", "3rd Class"))
library(ggplot2)
# Point size shows how many passengers fall into each Survived/Sex combination
ggplot(data = titanic1, mapping = aes(x = Survived,
                                      y = Sex)) +
  geom_count(aes(color = Sex))
Summary from the chart above:
Male passengers were less likely to survive than female passengers.
Among the passengers who survived, women outnumbered men.
More passengers died than survived.
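The chart encodes counts as point sizes, which are hard to compare by eye. A cross-tabulation gives the same information as numbers. The sketch below uses `table()` for counts and `prop.table()` for row-wise proportions on a made-up data frame (`toy`, illustrative values only, not the Titanic figures).

```r
# Toy data (illustrative values only, not the real Titanic counts)
toy <- data.frame(
  Sex      = c("male", "male", "male", "female", "female", "male"),
  Survived = c("Not Survived", "Not Survived", "Survived",
               "Survived", "Survived", "Not Survived")
)

# Counts for each Sex/Survived combination
counts <- table(toy$Sex, toy$Survived)

# Row-wise proportions: the share of each sex that did or did not survive
rates <- prop.table(counts, margin = 1)

print(counts)
print(round(rates, 2))
```

Applied to the real data, `table(titanic1$Sex, titanic1$Survived)` would give the counts behind the three statements above.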
titanic_survived <- titanic1[titanic1$Survived == "Survived",]
head(titanic_survived)
## Survived Pclass Sex Fare Embarked
## 2 Survived 1st Class female 71.2833 C
## 3 Survived 3rd Class female 7.9250 S
## 4 Survived 1st Class female 53.1000 S
## 9 Survived 3rd Class female 11.1333 S
## 10 Survived 2nd Class female 30.0708 C
## 11 Survived 3rd Class female 16.7000 S
levels(titanic_survived$Pclass)
## [1] "1st Class" "2nd Class" "3rd Class"
# Number of surviving passengers in each passenger class
# geom_bar() counts the rows per class, so no y aesthetic is needed
ggplot(data = titanic_survived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Pclass), show.legend = FALSE) +
  labs(title = "Surviving Passengers by Class",
       x = "Passenger Class",
       y = "Number of Survivors",
       caption = "Titanic")
Summary:
- First class had the most survivors.
- Second class had the fewest survivors.
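Survivor counts depend on how many passengers each class carried, so the class with the fewest survivors is not necessarily the class with the lowest survival rate. The sketch below computes both with `tapply()`. It uses a made-up data frame (`toy`) whose values are chosen only to illustrate the difference and are not the real figures.

```r
# Toy data (illustrative values only): 4 first-class, 2 second-class,
# and 8 third-class passengers
toy <- data.frame(
  Pclass   = rep(c("1st Class", "2nd Class", "3rd Class"), times = c(4, 2, 8)),
  Survived = c(rep("Survived", 3), "Not Survived",
               "Survived", "Not Survived",
               rep("Survived", 2), rep("Not Survived", 6))
)

survived <- toy$Survived == "Survived"

# Number of survivors per class (sum of TRUE values)
survived_count <- tapply(survived, toy$Pclass, sum)

# Survival rate per class (mean of TRUE values)
survival_rate <- tapply(survived, toy$Pclass, mean)

print(survived_count)
print(survival_rate)
```

In this toy example second class has the fewest survivors, while third class has the lowest rate. Running the same two `tapply()` calls on `titanic1` would show whether that also holds for the real data.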
# Surviving passengers per class, split by sex
ggplot(data = titanic_survived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Sex), show.legend = TRUE) +
  labs(title = "Surviving Passengers by Class and Sex",
       x = "Passenger Class",
       y = "Number of Survivors",
       caption = "Titanic")
# Surviving passengers per class, split by port of embarkation
ggplot(data = titanic_survived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Embarked), show.legend = TRUE) +
  labs(title = "Surviving Passengers by Class and Port of Embarkation",
       x = "Passenger Class",
       y = "Number of Survivors",
       caption = "Titanic")
# Subset the passengers who did not survive
titanic_notsurvived <- titanic1[titanic1$Survived == "Not Survived",]
head(titanic_notsurvived)
## Survived Pclass Sex Fare Embarked
## 1 Not Survived 3rd Class male 7.2500 S
## 5 Not Survived 3rd Class male 8.0500 S
## 6 Not Survived 3rd Class male 8.4583 Q
## 7 Not Survived 1st Class male 51.8625 S
## 8 Not Survived 3rd Class male 21.0750 S
## 13 Not Survived 3rd Class male 8.0500 S
# Number of passengers who did not survive in each passenger class
ggplot(data = titanic_notsurvived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Pclass), show.legend = FALSE) +
  labs(title = "Passengers Who Did Not Survive by Class",
       x = "Passenger Class",
       y = "Number of Passengers",
       caption = "Titanic")
Summary:
- Third class had the most passengers who did not survive.
- First class had the fewest passengers who did not survive.
# Passengers who did not survive per class, split by sex
ggplot(data = titanic_notsurvived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Sex), show.legend = TRUE) +
  labs(title = "Passengers Who Did Not Survive by Class and Sex",
       x = "Passenger Class",
       y = "Number of Passengers",
       caption = "Titanic")
# Passengers who did not survive per class, split by port of embarkation
ggplot(data = titanic_notsurvived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Embarked), show.legend = TRUE) +
  labs(title = "Passengers Who Did Not Survive by Class and Port of Embarkation",
       x = "Passenger Class",
       y = "Number of Passengers",
       caption = "Titanic")
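The stacked charts split each class by one extra variable at a time. To read the exact numbers behind them, a three-way contingency table may help. The sketch below builds one with `xtabs()` and flattens it with `ftable()`, again on a made-up data frame (`toy`, illustrative values only).

```r
# Toy data (illustrative values only, not the real Titanic counts)
toy <- data.frame(
  Pclass   = c("1st Class", "1st Class", "3rd Class", "3rd Class", "3rd Class"),
  Sex      = c("female", "male", "female", "male", "male"),
  Survived = c("Survived", "Not Survived", "Survived",
               "Not Survived", "Not Survived")
)

# Three-way contingency table: class x sex x survival
counts3 <- xtabs(~ Pclass + Sex + Survived, data = toy)

# ftable() prints the three dimensions as one flat table
print(ftable(counts3))
```

With the real data, `xtabs(~ Pclass + Sex + Survived, data = titanic1)` covers both the survivor and non-survivor subsets in a single table.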