1 Titanic Dataset Analysis

1.1 Introduction

RMS Titanic was a British passenger liner, operated by the White Star Line, that sank in the North Atlantic Ocean on 15 April 1912 after striking an iceberg during her maiden voyage from Southampton, England, to New York City, United States. RMS Titanic was the largest ship afloat at the time she entered service.

The sinking of the Titanic is one of the most infamous shipwrecks in history. More than 1,500 of the estimated 2,224 passengers and crew aboard died, making it the deadliest sinking of a single ship up to that time.

The dataset used in this LBB Project of Programming for Data Science (with R) is provided by Kaggle.

1.2 Dataset Overview

Kaggle provides two similar datasets that include passenger information such as name, age, gender, and socio-economic class.

One dataset is titled train.csv and the other is titled test.csv.

The train.csv dataset contains the details of a subset of the passengers on board (891 to be exact) and, importantly, reveals whether they survived or not, also known as the “ground truth”.

The test.csv dataset contains the same kind of information as train.csv but does not disclose the “ground truth” for each passenger.

For the purpose of this LBB Project, the dataset used for further analysis is the combination of the two datasets above, without the “ground truth” information.

2 Data Wrangling Process

Before we can perform further analysis on our dataset, the first step is to prepare the dataset itself.

2.1 Import Libraries

We set the groundwork by importing the necessary packages.

# Load the necessary packages (they must already be installed)
library(readr)
library(dplyr)
library(gtools)
library(lubridate)

2.2 Exploration of the Original Kaggle Datasets

Before we can create the dataset, we will check the information contained in each of the original Kaggle datasets, train.csv and test.csv.

2.2.1 Train Dataset

# Read the first dataset
train <- read.csv("data_input/train.csv")

# Check the structure of first dataset
str(train)
#> 'data.frame':    891 obs. of  12 variables:
#>  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
#>  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
#>  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#>  $ Sex        : chr  "male" "female" "female" "female" ...
#>  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
#>  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
#>  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
#>  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#>  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
#>  $ Cabin      : chr  "" "C85" "" "C123" ...
#>  $ Embarked   : chr  "S" "C" "S" "S" ...

This dataset contains information for a total of 891 passengers.

2.2.2 Test Dataset

# Read the second dataset
test <- read.csv("data_input/test.csv")

# Check the structure of second dataset
str(test)
#> 'data.frame':    418 obs. of  11 variables:
#>  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
#>  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
#>  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
#>  $ Sex        : chr  "male" "female" "male" "male" ...
#>  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
#>  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
#>  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
#>  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
#>  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
#>  $ Cabin      : chr  "" "" "" "" ...
#>  $ Embarked   : chr  "Q" "S" "Q" "S" ...

This dataset contains information for a total of 418 passengers.

2.3 Dataset Preparation

Based on our exploration of the two original Kaggle datasets above, the additional “ground truth” column in the Train Dataset is named Survived. To create the dataset for this LBB Project, we will merge the two datasets above without the column Survived.

Using the dplyr function select(), we take all columns of the Train Dataset except Survived, then merge the result with the Test Dataset and save the new dataset as titanic, which should contain information for a total of 1,309 passengers.

# Merge the two datasets and save it to a new dataframe object named titanic
titanic <- rbind(train %>% select(-Survived), # Select Train Dataset without column Survived 
                 test)

# Check the structure of the new dataset
str(titanic)
#> 'data.frame':    1309 obs. of  11 variables:
#>  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
#>  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#>  $ Sex        : chr  "male" "female" "female" "female" ...
#>  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
#>  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
#>  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
#>  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#>  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
#>  $ Cabin      : chr  "" "C85" "" "C123" ...
#>  $ Embarked   : chr  "S" "C" "S" "S" ...
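rbind() on data frames matches columns by name and stops with an error if the names differ, so a quick check before merging can be useful. The following is only a minimal sketch of that check: toy_train and toy_test are small made-up data frames standing in for train and test (they are not the Kaggle data), and it uses base R indexing in place of select().

```r
# Toy stand-ins for the two Kaggle files (values are illustrative only)
toy_train <- data.frame(PassengerId = 1:3,
                        Survived    = c(0L, 1L, 1L),
                        Pclass      = c(3L, 1L, 3L))
toy_test  <- data.frame(PassengerId = 4:5,
                        Pclass      = c(3L, 2L))

# Drop the ground-truth column with base R, mirroring select(-Survived)
toy_train_no_truth <- toy_train[, setdiff(names(toy_train), "Survived")]

# Confirm both data frames now share the same columns before stacking them
stopifnot(identical(names(toy_train_no_truth), names(toy_test)))

toy_merged <- rbind(toy_train_no_truth, toy_test)

# The merged row count should equal the sum of the two inputs
nrow(toy_merged) == nrow(toy_train) + nrow(toy_test)
```

Applied to the real data, the same comparison would check that 891 + 418 gives the 1,309 rows reported by str(titanic).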

Data Description:

  • PassengerId: Row number
  • Pclass: Ticket class and a proxy for socio-economic status (1 = 1st class (Upper), 2 = 2nd class (Middle), 3 = 3rd class (Lower))
  • Name: Name of the passenger
  • Sex: Gender of the passenger (male / female)
  • Age: Age of the passenger in years. It is fractional if less than 1, and takes the form xx.5 if the age is estimated
  • SibSp: Number of Siblings / Spouses aboard the Titanic with family relations as follows
    • Sibling = brother, sister, stepbrother, stepsister
    • Spouse = husband, wife (mistresses and fiancés were ignored)
  • Parch: Number of Parents / Children aboard the Titanic with family relations as follows
    • Parent = mother, father
    • Child = daughter, son, stepdaughter, stepson
    • Some children travelled only with a nanny, therefore Parch = 0 for them
  • Ticket: Ticket Number
  • Fare: Passenger Fare
  • Cabin: Cabin number
  • Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

2.4 Dataset Exploration

2.4.1 Check Categorical Columns

Based on the Data Description of our dataset titanic, there are several columns that we suspect to be categorical, namely Pclass, Sex, SibSp, Parch, and Embarked.

Before we convert those columns to factors with as.factor(), let us confirm the unique values in each of them.

2.4.1.1 Pclass

unique(titanic$Pclass)
#> [1] 3 1 2

There are 3 unique values in column Pclass.

2.4.1.2 Sex

unique(titanic$Sex) 
#> [1] "male"   "female"

There are 2 unique values in column Sex.

2.4.1.3 SibSp

unique(titanic$SibSp)
#> [1] 1 0 3 4 2 5 8

There are 7 unique values in column SibSp.

2.4.1.4 Parch

unique(titanic$Parch)
#> [1] 0 1 2 5 3 4 6 9

There are 8 unique values in column Parch.

2.4.1.5 Embarked

unique(titanic$Embarked)
#> [1] "S" "C" "Q" ""

There are 4 unique values in column Embarked, one of which is an empty string, meaning that no port was recorded for those passengers.
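The per-column checks above can also be collapsed into a single call. The sketch below is one possible way to do it with base R. It runs on a small made-up data frame named toy (a hypothetical stand-in, not the titanic data), and candidate_cols is a name introduced here for illustration.

```r
# Toy data frame with the same kind of columns (values are illustrative only)
toy <- data.frame(Pclass   = c(3L, 1L, 3L, 2L),
                  Sex      = c("male", "female", "female", "male"),
                  Embarked = c("S", "C", "", "S"),
                  stringsAsFactors = FALSE)

# Columns we suspect are categorical
candidate_cols <- c("Pclass", "Sex", "Embarked")

# Count the distinct values of every candidate column in one call
unique_counts <- sapply(toy[candidate_cols], function(col) length(unique(col)))
unique_counts
```

Note that an empty string counts as a distinct value here, just as it does in the output of unique(titanic$Embarked).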

2.4.2 Change the Data Type of the Above Columns

Using base R, we will use explicit coercion to change the data type of columns Pclass, Sex, SibSp, Parch, and Embarked to factor.

# Explicit coercion to factor using as.factor()
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$SibSp <- as.factor(titanic$SibSp)
titanic$Parch <- as.factor(titanic$Parch)
titanic$Embarked <- as.factor(titanic$Embarked)

# Re-Check the structure of the dataset after explicit coercion
str(titanic)
#> 'data.frame':    1309 obs. of  11 variables:
#>  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
#>  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#>  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
#>  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
#>  $ SibSp      : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
#>  $ Parch      : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
#>  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#>  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
#>  $ Cabin      : chr  "" "C85" "" "C123" ...
#>  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
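When several columns need the same coercion, the repeated assignments can be written more compactly. The sketch below shows one base R option using lapply(). The data frame toy and the vector factor_cols are hypothetical names used only for this illustration.

```r
# Toy data frame (values are illustrative only)
toy <- data.frame(Pclass = c(3L, 1L, 3L, 2L),
                  Sex    = c("male", "female", "female", "male"),
                  Age    = c(22, 38, 26, 35),
                  stringsAsFactors = FALSE)

# Columns to convert; every other column is left untouched
factor_cols <- c("Pclass", "Sex")

# Apply as.factor() to each listed column and assign the results back
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Inspect the resulting class of every column
sapply(toy, class)
```

One thing to keep in mind with this design: once a numeric column such as Pclass is a factor, as.numeric() returns the internal level codes, so recovering the original numbers requires as.numeric(as.character(x)).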

2.4.3 Data Exploration with Descriptive Statistics

We will use descriptive statistics to describe our dataset in general before proceeding to the next step of our Exploratory Data Analysis (EDA).

summary(titanic)
#>   PassengerId   Pclass      Name               Sex           Age        SibSp  
#>  Min.   :   1   1:323   Length:1309        female:466   Min.   : 0.17   0:891  
#>  1st Qu.: 328   2:277   Class :character   male  :843   1st Qu.:21.00   1:319  
#>  Median : 655   3:709   Mode  :character                Median :28.00   2: 42  
#>  Mean   : 655                                           Mean   :29.88   3: 20  
#>  3rd Qu.: 982                                           3rd Qu.:39.00   4: 22  
#>  Max.   :1309                                           Max.   :80.00   5:  6  
#>                                                         NA's   :263     8:  9  
#>      Parch         Ticket               Fare            Cabin          
#>  0      :1002   Length:1309        Min.   :  0.000   Length:1309       
#>  1      : 170   Class :character   1st Qu.:  7.896   Class :character  
#>  2      : 113   Mode  :character   Median : 14.454   Mode  :character  
#>  3      :   8                      Mean   : 33.295                     
#>  4      :   6                      3rd Qu.: 31.275                     
#>  5      :   6                      Max.   :512.329                     
#>  (Other):   4                      NA's   :1                           
#>  Embarked
#>   :  2   
#>  C:270   
#>  Q:123   
#>  S:914   
#>          
#>          
#> 

Several initial insights about our dataset titanic can be summarized from the descriptive statistics as follows:

  • There are more male (843 passengers) than female (466 passengers) passengers aboard the Titanic.
  • Passengers with Lower socio-economic status (3rd class, 709 passengers) form the largest group aboard the Titanic.
  • There are missing values (NA) in columns Age (263) and Fare (1). In addition, 2 passengers have a blank value in column Embarked.

2.4.3.1 Alternative to Check Missing Value

Alternatively, we can use the is.na() function to flag the missing values and combine it with the colSums() function to count them in each column.

colSums(is.na(titanic))
#> PassengerId      Pclass        Name         Sex         Age       SibSp 
#>           0           0           0           0         263           0 
#>       Parch      Ticket        Fare       Cabin    Embarked 
#>           0           0           1           0           0

The result above matches the summary() output, which reported 263 and 1 missing values in columns Age and Fare respectively.

Even though column Age has a substantial number of missing values (263 of 1,309 rows, about 20%), we will keep the dataset as it is for now.
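is.na() only detects NA, so empty strings such as the ones visible in columns Cabin and Embarked are not counted by colSums(is.na(titanic)). The sketch below shows one possible way to count both kinds of gaps at once. It uses a small made-up data frame named toy (not the titanic data), and treating "" as missing is an assumption of this example.

```r
# Toy data frame mixing NA and empty strings (values are illustrative only)
toy <- data.frame(Age      = c(22, NA, 35),
                  Cabin    = c("", "C85", ""),
                  Embarked = c("S", "", "C"),
                  stringsAsFactors = FALSE)

# Count values that are either NA or an empty string, column by column.
# as.character() lets the same comparison work for numeric and factor columns.
missing_counts <- sapply(toy, function(col) {
  sum(is.na(col) | as.character(col) == "")
})
missing_counts
```

In this toy example Age has one gap (the NA), Cabin has two (both empty strings), and Embarked has one.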

2.4.4 Data Exploration with Business Cases

1. What is the total number of male and female passengers in each socio-economic class?

We would like to know more about our passengers based on their gender (Sex) within each socio-economic class (Pclass).

table(titanic$Sex, titanic$Pclass)
#>         
#>            1   2   3
#>   female 144 106 216
#>   male   179 171 493

The insight we can draw is that the largest group, with a total of 493 passengers, consists of male passengers in the Lower socio-economic or 3rd class.
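Raw counts are harder to compare when the classes differ in size, so it can help to convert them into shares. The sketch below is a small illustration using prop.table() from base R on a made-up data frame named toy (not the titanic data).

```r
# Toy data frame (values are illustrative only)
toy <- data.frame(Sex    = c("male", "female", "male", "male", "female", "male"),
                  Pclass = c(3, 1, 3, 2, 3, 1))

# Cross-tabulate sex against class, as in the table() call above
counts <- table(toy$Sex, toy$Pclass)

# Share of each sex within every class: margin = 2 makes each column sum to 1
shares <- prop.table(counts, margin = 2)
round(shares, 2)
```

Using margin = 1 instead would give the distribution of classes within each sex.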

2. What is the total Fare paid by the passengers of each socio-economic class (Pclass)?

xtabs(formula = Fare~Pclass, data = titanic)
#> Pclass
#>         1         2         3 
#> 28265.404  5866.637  9418.445

The insight we can draw is that the largest total fare, 28,265.40, was paid by the 323 passengers of the Upper socio-economic or 1st class onboard the Titanic.
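A class total depends on how many passengers the class contains, so an average fare per passenger can complement it. The sketch below shows one way to compute both figures with tapply() from base R. The data frame toy is made up for illustration (not the titanic data), and passing na.rm = TRUE to skip missing fares is a choice made for this example.

```r
# Toy data frame with one missing fare (values are illustrative only)
toy <- data.frame(Pclass = factor(c(1, 1, 2, 3, 3, 3)),
                  Fare   = c(80, 60, 20, 8, NA, 7))

# Total fare per class, ignoring missing fares
total_fare <- tapply(toy$Fare, toy$Pclass, sum, na.rm = TRUE)

# Average fare per passenger in each class, ignoring missing fares
mean_fare <- tapply(toy$Fare, toy$Pclass, mean, na.rm = TRUE)

total_fare
mean_fare
```

In this toy example the 3rd class has the most passengers but the lowest average fare, which a total alone would not show.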

3 References

  1. Titanic Dataset: (https://www.kaggle.com/competitions/titanic/)
  2. Titanic Wikipedia: (https://en.wikipedia.org/wiki/Titanic)