Introduction

The Titanic was a British passenger ship that sank on April 15, 1912, while sailing to New York, United States. An estimated 2,224 passengers and crew were on board. More than 1,500 people died, while only about 705 survived. Here, I will conduct an Exploratory Data Analysis of the Titanic passenger data.

Let’s start!

Data Wrangling

Check Data

This is the column description of the dataset:

1 = PassengerId

2 = Survived

3 = Pclass

4 = Name

5 = Sex

6 = Age

7 = SibSp

8 = Parch

9 = Ticket

10 = Fare

11 = Cabin

12 = Embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

# Read the data and check its structure

titanic <- read.csv("train.csv")
str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
# Dimensions

dim(titanic)
## [1] 891  12
# Check for NA values

colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
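Note that the str() output shows empty strings ("") in the Cabin column, and is.na() does not count those as missing, which is why Cabin reports 0 here. The following is a minimal sketch of how blank fields could be read as NA instead, assuming the na.strings argument of read.csv is used. The inline CSV is a made-up four-row sample, not the real train.csv.

```r
# Tiny inline sample standing in for train.csv (hypothetical rows, not real data)
csv_text <- "PassengerId,Age,Cabin,Embarked
1,22,,S
2,38,C85,C
3,,,
4,35,C123,S"

# Default read: blank fields in character columns stay as "" and are not NA
raw <- read.csv(text = csv_text)
na_default <- colSums(is.na(raw))

# Treat blank fields as NA while reading
clean <- read.csv(text = csv_text, na.strings = c("", "NA"))
na_clean <- colSums(is.na(clean))

print(na_default)
print(na_clean)
```

With the real file, the same idea would be read.csv("train.csv", na.strings = c("", "NA")), after which colSums(is.na(...)) would also report the blank Cabin entries.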

Based on the output of str(titanic) and colSums(is.na(titanic)):

  1. I will not use the PassengerId, Name, SibSp, Parch, Ticket, or Cabin variables.

  2. The Age column has 177 NA values, so I won’t use it either.

  3. I will convert the integer Survived and Pclass columns into factors.
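The three steps above can be sketched on a small example. This is only an illustration under stated assumptions: the toy data frame and its values are made up, and it selects columns by name rather than by position, which is one possible alternative to index-based subsetting.

```r
# Toy data frame with some of the train.csv column names (values are made up)
toy <- data.frame(
  PassengerId = 1:4,
  Survived    = c(0, 1, 1, 0),
  Pclass      = c(3, 1, 2, 3),
  Name        = c("A", "B", "C", "D"),
  Sex         = c("male", "female", "female", "male"),
  Age         = c(22, NA, 26, 35),
  Fare        = c(7.25, 71.28, 13.00, 8.05),
  Embarked    = c("S", "C", "S", "Q")
)

# Steps 1 and 2: keep only the columns used in the analysis, by name
keep <- c("Survived", "Pclass", "Sex", "Fare", "Embarked")
toy1 <- toy[, keep]

# Step 3: convert the integer codes into labeled factors
toy1$Survived <- factor(toy1$Survived,
                        levels = c(0, 1),
                        labels = c("Not Survived", "Survived"))
toy1$Pclass <- factor(toy1$Pclass,
                      levels = c(1, 2, 3),
                      labels = c("1st Class", "2nd Class", "3rd Class"))

str(toy1)
```

Selecting by name keeps working if the column order of the input file changes, whereas negative positions silently drop different columns.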

Subsetting Data

# Drop the columns that will not be used

titanic1 <- titanic[,-c(1, 4, 6, 7, 8, 9, 11)]
# Check the titanic1 data

str(titanic1)
## 'data.frame':    891 obs. of  5 variables:
##  $ Survived: int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass  : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : chr  "male" "female" "female" "female" ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: chr  "S" "C" "S" "S" ...
# Convert Survived into a labeled factor

titanic1$Survived <- factor(titanic1$Survived,
                            levels = c(0, 1),
                            labels = c("Not Survived", "Survived"))
# Give each Pclass a descriptive label
titanic1$Pclass <- factor(titanic1$Pclass,
                          levels = c(1, 2, 3),
                          labels = c("1st Class", "2nd Class", "3rd Class"))

Data Analysis

Correlation between Gender and Survival

library(ggplot2)
ggplot(data = titanic1, mapping = aes(x = Survived,
                                      y = Sex)) +
  geom_count(aes(color = Sex))

Summary from the chart above:

  • Most male passengers did not survive, whereas most female passengers did

  • Among the passengers who survived, women outnumbered men

  • There are more passengers who did not survive than those who survived
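The points above can also be checked numerically with a contingency table. The following is a minimal sketch on a made-up toy data frame (six hypothetical rows, not the Titanic data), assuming base R's table() and prop.table() are used. With the real data, the same calls could be applied to titanic1$Sex and titanic1$Survived.

```r
# Toy data: six hypothetical passengers (values are made up)
toy <- data.frame(
  Sex = c("male", "male", "male", "female", "female", "female"),
  Survived = factor(c("Not Survived", "Not Survived", "Survived",
                      "Survived", "Survived", "Not Survived"))
)

# Counts of each Sex / Survived combination
counts <- table(toy$Sex, toy$Survived)

# Row-wise proportions: the share of each sex in each outcome
rates <- prop.table(counts, margin = 1)

print(counts)
print(round(rates, 2))
```

Row-wise proportions make the two sexes comparable even when one group is much larger than the other.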

Correlation between Passenger Class and Survival

Survived Passengers

titanic_survived <- titanic1[titanic1$Survived == "Survived",]
head(titanic_survived)
##    Survived    Pclass    Sex    Fare Embarked
## 2  Survived 1st Class female 71.2833        C
## 3  Survived 3rd Class female  7.9250        S
## 4  Survived 1st Class female 53.1000        S
## 9  Survived 3rd Class female 11.1333        S
## 10 Survived 2nd Class female 30.0708        C
## 11 Survived 3rd Class female 16.7000        S

Survived Passengers based on Passenger Class

levels(titanic_survived$Pclass)
## [1] "1st Class" "2nd Class" "3rd Class"
# Plot the number of survivors in each passenger class
# geom_bar() counts the rows in each class, so no y aesthetic is needed
ggplot(data = titanic_survived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Pclass), show.legend = FALSE) +
  labs(title = "Survived Passengers based on Class",
       x = "Passenger Class",
       y = "Number of Survivors",
       caption = "Titanic")

Summary:

  • 1st class had the most survivors

  • 2nd class had the fewest survivors

Survived Passengers based on Passenger Class and Gender

# geom_bar() counts the rows in each class; fill splits each bar by gender
ggplot(data = titanic_survived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Sex), show.legend = TRUE) +
  labs(title = "Survived Passengers based on Passenger Class and Gender",
       x = "Passenger Class",
       y = "Number of Survivors",
       caption = "Titanic")

  • In every passenger class, more female passengers survived than male passengers

Survived Passengers based on Embarked

# geom_bar() counts the rows in each class; fill splits each bar by port of embarkation
ggplot(data = titanic_survived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Embarked), show.legend = TRUE) +
  labs(title = "Survived Passengers based on Embarked",
       x = "Passenger Class",
       y = "Number of Survivors",
       caption = "Titanic")

  • The majority of surviving passengers embarked at S (Southampton)

Non-Surviving Passengers

# Subset the passengers who did not survive
titanic_notsurvived <- titanic1[titanic1$Survived == "Not Survived",]
head(titanic_notsurvived)
##        Survived    Pclass  Sex    Fare Embarked
## 1  Not Survived 3rd Class male  7.2500        S
## 5  Not Survived 3rd Class male  8.0500        S
## 6  Not Survived 3rd Class male  8.4583        Q
## 7  Not Survived 1st Class male 51.8625        S
## 8  Not Survived 3rd Class male 21.0750        S
## 13 Not Survived 3rd Class male  8.0500        S

Non-Surviving Passengers based on Passenger Class

# geom_bar() counts the rows in each class, so no y aesthetic is needed
ggplot(data = titanic_notsurvived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Pclass), show.legend = FALSE) +
  labs(title = "Non-Surviving Passengers based on Passenger Class",
       x = "Passenger Class",
       y = "Number of Non-Survivors",
       caption = "Titanic")

Summary:

  • 3rd class had the most passengers who did not survive

  • 1st class had the fewest passengers who did not survive

Non-Surviving Passengers based on Passenger Class and Gender

# geom_bar() counts the rows in each class; fill splits each bar by gender
ggplot(data = titanic_notsurvived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Sex), show.legend = TRUE) +
  labs(title = "Non-Surviving Passengers based on Passenger Class and Gender",
       x = "Passenger Class",
       y = "Number of Non-Survivors",
       caption = "Titanic")

  • In every passenger class, more male passengers than female passengers did not survive

Non-Surviving Passengers based on Embarked

# geom_bar() counts the rows in each class; fill splits each bar by port of embarkation
ggplot(data = titanic_notsurvived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Embarked), show.legend = TRUE) +
  labs(title = "Non-Surviving Passengers based on Embarked",
       x = "Passenger Class",
       y = "Number of Non-Survivors",
       caption = "Titanic")

  • Most of the passengers who did not survive embarked at Southampton.
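The chart above shows counts, which also depend on how many passengers boarded at each port. The following is a minimal sketch of how counts and rates by port could be compared, assuming base R's tapply() is used. The toy data frame is made up (13 hypothetical passengers) and does not reflect the Titanic figures.

```r
# Toy data: 3 passengers from C, 2 from Q, 8 from S (values are made up)
toy <- data.frame(
  Embarked = rep(c("C", "Q", "S"), times = c(3, 2, 8)),
  Survived = c(TRUE, TRUE, FALSE,
               TRUE, FALSE,
               TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

died <- !toy$Survived

# Number of non-survivors per port (what a count-based bar chart shows)
deaths_by_port <- tapply(died, toy$Embarked, sum)

# Share of each port's passengers who did not survive
death_rate_by_port <- tapply(died, toy$Embarked, mean)

print(deaths_by_port)
print(round(death_rate_by_port, 3))
```

A port can have the most non-survivors simply because it had the most passengers, so the rate is the figure to check before saying that passengers from one port were more likely to die.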