1 Introduction

1.1 Data Source

I am going to use the Titanic dataset from https://www.kaggle.com/competitions/titanic/data. To be exact, I am going to use the train.csv file from that dataset.

1.2 Big Question

What were the factors that affected a passenger’s survival in the sinking of the Titanic?

2 Data Pre-Processing

2.1 Read and Inspect Data

Before we begin analyzing anything, let's first read the data.

titanic <- read.csv("data_input/titanic/train.csv")

Next, we can inspect the data:

head(titanic)

tail(titanic)

str(titanic)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

From the three outputs shown above, we can see that:

There are 891 rows and 12 columns.
PassengerId is simply the row number.
There are some missing values.
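
These observations can also be checked directly rather than read off the printed output. The following is a minimal sketch that uses a small made-up data frame (`toy`, a hypothetical stand-in for `titanic`) so that it runs without the CSV file:

```r
# Hypothetical stand-in for the first rows of train.csv
toy <- data.frame(
  PassengerId = 1:5,
  Survived    = c(0, 1, 1, 1, 0),
  Age         = c(22, 38, 26, NA, 35)
)

# Number of rows and columns
dim(toy)

# TRUE if PassengerId is just the row number
id_is_row <- all(toy$PassengerId == seq_len(nrow(toy)))
id_is_row

# TRUE if at least one value is missing
has_na <- anyNA(toy)
has_na
```

Replacing `toy` with `titanic` should apply the same checks to the real data.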

2.1.1 Columns Description

From the same source as the dataset, here are the descriptions of most of the columns:

Survived: 0 = Dead, 1 = Survived
Pclass: Ticket class with 1 = 1st class, 2 = 2nd class, 3 = 3rd class
Sex: Gender of male or female
Age: Age in years
SibSp: Number of siblings/spouses aboard the Titanic
Parch: Number of parents/children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of Embarkation with C = Cherbourg, Q = Queenstown, S = Southampton

2.2 Data Selection and Cleansing

Referring to the Big Question, I think that the columns PassengerId, Name, Ticket, and Cabin are irrelevant and can be removed. The reasons are: PassengerId doesn’t carry any meaning, Name is only an identifier, Ticket should also be a unique identifier, and Cabin doesn’t mean anything without knowing the layout of the Titanic.

titanic <- titanic[,c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")]
head(titanic)

Next, for the data to be processed properly, each column must have the correct data type. In this case, every column should be a factor except those shown as <dbl> (the numeric columns Age and Fare).

titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$SibSp <- as.factor(titanic$SibSp)
titanic$Parch <- as.factor(titanic$Parch)
titanic$Embarked <- as.factor(titanic$Embarked)
str(titanic)

## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
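
The repeated `as.factor()` calls above could also be written as a single step. The following is only a sketch of that alternative, shown on a made-up data frame (`toy` and `factor_cols` are hypothetical names, not part of the original analysis):

```r
# Hypothetical stand-in for the selected columns
toy <- data.frame(
  Survived = c(0, 1, 1),
  Pclass   = c(3, 1, 3),
  Sex      = c("male", "female", "female"),
  Age      = c(22, 38, 26),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical; SibSp, Parch and Embarked would be added the same way
factor_cols <- c("Survived", "Pclass", "Sex")

# Convert every listed column to a factor, leaving the numeric columns untouched
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```

The result should be the same as converting the columns one by one; the list of column names just makes it easier to see which columns are categorical.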

Check whether there are any missing values:

anyNA(titanic)

## [1] TRUE

Check which column(s) have missing values:

colSums(is.na(titanic))

## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0      177        0        0        0        0

The Age column has 177 missing values.
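
Note that `is.na()` only counts true `NA` values. The `str()` output above shows an empty-string level `""` in Embarked, and such blanks are not counted as missing. A hedged sketch of how blanks could be read as `NA` instead, using the `na.strings` argument of `read.csv()` on a small inline example (the CSV text and the names `csv_text`, `raw`, and `fixed` are made up for illustration):

```r
# Made-up CSV text with one blank Age value and one blank Embarked value
csv_text <- "Age,Embarked
22,S
,C
26,
"

# Default behaviour: blank numeric fields become NA, blank text fields stay ""
raw <- read.csv(text = csv_text)
colSums(is.na(raw))

# Treat empty strings as NA in every column
fixed <- read.csv(text = csv_text, na.strings = c("", "NA"))
colSums(is.na(fixed))
```

Whether the blank Embarked values should be treated as missing is a judgment call; this analysis keeps them as their own level.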

There are two treatments that can be used:

Remove the whole Age column so that no rows are lost.
Remove the rows with missing values in the Age column so that Age can still be used in the analysis.

titanic_no_age <- titanic[,c(1:3, 5:8)] # Only Age column removed
titanic_clean <- na.omit(titanic) # Only rows with missing values removed

str(titanic_no_age)

## 'data.frame':    891 obs. of  7 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

str(titanic_clean)

## 'data.frame':    714 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
##  $ Age     : num  22 38 26 35 35 54 2 27 14 4 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 4 1 2 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 2 3 1 2 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 4 4 4 2 4 ...
##  - attr(*, "na.action")= 'omit' Named int [1:177] 6 18 20 27 29 30 32 33 37 43 ...
##   ..- attr(*, "names")= chr [1:177] "6" "18" "20" "27" ...

3 Data Exploration

A brief overview of the data:

summary(titanic_clean)

##  Survived Pclass      Sex           Age        SibSp   Parch        Fare       
##  0:424    1:186   female:261   Min.   : 0.42   0:471   0:521   Min.   :  0.00  
##  1:290    2:173   male  :453   1st Qu.:20.12   1:183   1:110   1st Qu.:  8.05  
##           3:355                Median :28.00   2: 25   2: 68   Median : 15.74  
##                                Mean   :29.70   3: 12   3:  5   Mean   : 34.69  
##                                3rd Qu.:38.00   4: 18   4:  4   3rd Qu.: 33.38  
##                                Max.   :80.00   5:  5   5:  5   Max.   :512.33  
##                                                8:  0   6:  1                   
##  Embarked
##   :  2   
##  C:130   
##  Q: 28   
##  S:554   
##          
##          
##

summary(titanic_no_age)

##  Survived Pclass      Sex      SibSp   Parch        Fare        Embarked
##  0:549    1:216   female:314   0:608   0:678   Min.   :  0.00    :  2   
##  1:342    2:184   male  :577   1:209   1:118   1st Qu.:  7.91   C:168   
##           3:491                2: 28   2: 80   Median : 14.45   Q: 77   
##                                3: 16   3:  5   Mean   : 32.20   S:644   
##                                4: 18   4:  4   3rd Qu.: 31.00           
##                                5:  5   5:  5   Max.   :512.33           
##                                8:  7   6:  1

3.1 Simple Interpretation

I will use titanic_no_age for most of the interpretation and titanic_clean only for the Age column.

According to the summary above:

Most passengers had a 3rd class ticket (491 of 891).
Most passengers travelled without relatives: SibSp is 0 for 608 passengers and Parch is 0 for 678.
Most passengers embarked from Southampton (644 of 891).
Fewer than half of the passengers in the dataset survived (342 of 891).
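
Statements about "most" or "less than half" are easier to verify as proportions than as raw counts. A minimal sketch using `prop.table()` on a made-up data frame (`toy` is hypothetical; the real call would use `titanic_no_age`):

```r
# Hypothetical stand-in with already-converted factors
toy <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 0)),
  Pclass   = factor(c(3, 1, 3, 3, 2))
)

# Share of each category as a fraction of all rows
surv_share  <- prop.table(table(toy$Survived))
class_share <- prop.table(table(toy$Pclass))

round(surv_share, 2)
round(class_share, 2)
```

Applied to the summary above, 342 survivors out of 891 passengers is a share of roughly 0.38.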

Columns that contain numerical values are best interpreted with a boxplot.

boxplot(titanic_no_age$Fare)

Column Fare Interpretation

According to the boxplot above, there are many outliers above the upper whisker (the whisker is not the maximum value; the maximum is itself one of the outliers). I can only speculate that these outliers were caused by scalpers reselling tickets or by tickets bought through an auction.
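
The points that a boxplot draws as outliers can be listed with `boxplot.stats()`, which uses the same rule as `boxplot()`. A small sketch on made-up fares (the vector `fare` is hypothetical; the real call would use `titanic_no_age$Fare`):

```r
# Made-up fares: mostly low values plus two very high ones
fare <- c(7.25, 7.9, 8.05, 13, 15, 26, 31, 53.1, 263, 512.33)

stats <- boxplot.stats(fare)

# End of the upper whisker: the largest value that is not an outlier
upper_whisker <- stats$stats[5]

# Values drawn as individual points on the boxplot
outliers <- stats$out

upper_whisker
outliers
length(outliers)
```

Counting the outliers this way gives a number to go with "many", which the plot alone does not provide.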

boxplot(titanic_clean$Age)

Column Age Interpretation

According to the boxplot above and the output of summary:

There are some outliers above the age of 60.
The mean age is 29.7.
There is at least one infant on board, since the lowest age is 0.42.

4 Data Analysis

4.1 Columns: Survived, Pclass, Sex

Q1: What can be interpreted from the following figure?

plot(xtabs(~Survived + Pclass + Sex, titanic_no_age))

A: Passengers with Sex = male and a 3rd class ticket form the largest group that did not survive. Passengers with Sex = female survived more often than those with Sex = male regardless of ticket class. This suggests that Sex = female had some influence on surviving the sinking. Pclass = 3 also seems to have had some influence on the priority to be saved.
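
The mosaic plot shows areas, which are hard to compare by eye. The same cross-tabulation can be turned into survival rates per group. A minimal sketch on a made-up data frame (`toy` is hypothetical; the real call would use `titanic_no_age`):

```r
# Hypothetical stand-in with the three columns used in the figure
toy <- data.frame(
  Survived = factor(c(0, 1, 1, 1, 0, 0, 0, 1)),
  Pclass   = factor(c(3, 1, 3, 1, 3, 1, 2, 3)),
  Sex      = factor(c("male", "female", "female", "female",
                      "male", "male", "female", "male"))
)

# Counts by Sex (rows) and Survived (columns)
counts <- xtabs(~ Sex + Survived, data = toy)

# margin = 1 makes each row sum to 1, so column "1" is the survival rate per sex
rates_by_sex <- prop.table(counts, margin = 1)
rates_by_sex

# The same counts split further by ticket class, printed as a flat table
ftable(xtabs(~ Sex + Pclass + Survived, data = toy))
```

Row-wise proportions make groups of different sizes comparable, which the raw counts in the figure do not.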

4.2 Columns: Survived, SibSp, Parch

Q2: What can be interpreted from the following figure?

plot(xtabs(~Survived + SibSp + Parch, titanic_no_age))

A: Looking at the figure roughly (it is not very clear), the rectangles for those who survived and those who did not survive are of similar size. From the previous summary, we know that fewer than half of those on board the Titanic survived, which the figure roughly mirrors. Based on this, I think that the number of siblings/spouses and the number of parents/children do not have an effect on surviving the sinking.
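
Because the figure is hard to read, a per-level survival rate may be a clearer check. A sketch using `tapply()` on made-up data (`toy` is hypothetical, and Survived is kept numeric here so that `mean()` returns a rate):

```r
# Hypothetical stand-in; Survived is numeric 0/1 rather than a factor
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0),
  SibSp    = factor(c(0, 0, 1, 1, 0, 1))
)

# Survival rate for each SibSp level
rate_by_sibsp <- tapply(toy$Survived, toy$SibSp, mean)

# Group sizes, since rates from very small groups are unreliable
n_by_sibsp <- table(toy$SibSp)

rate_by_sibsp
n_by_sibsp
```

With the real data, Survived is a factor, so it would first need converting back to numbers, for example with `as.integer(as.character(titanic_no_age$Survived))`. The same approach applies to Parch.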

4.3 Column: Age

Q: What about age? How do you check if age is a factor or not?

The age ranges that follow are my own choice, based on the previous boxplot.

A: Looking at the frequency plots, since the number of rows in each category (YOUNG, MIDDLE, OLD) is different, we can only compare survival within each category. All age groups show a similar comparison, with non-survival being the higher bar in each category. Unfortunately, there is no pattern indicating whether age is a factor or not. Since a different grouping could give a different result, I abstain from deciding whether age is a factor or not.

4.3.1 Age: 0 - 20 (AKA: YOUNG)

age_cond <- titanic_clean[titanic_clean$Age < 20,]
barplot(xtabs(~Survived, age_cond))

4.3.2 Age: 20 - 40 (AKA: MIDDLE)

age_cond <- titanic_clean[(titanic_clean$Age >= 20) & (titanic_clean$Age <= 40),]
barplot(xtabs(~Survived, age_cond))

4.3.3 Age: > 40 (AKA: OLD)

age_cond <- titanic_clean[titanic_clean$Age > 40,] # OLD: strictly older than 40
barplot(xtabs(~Survived, age_cond))
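
The three age groups can also be built as a single column, so that one table shows the survival rate of every group side by side. A minimal sketch on made-up data (`toy` and `AgeGroup` are hypothetical names; the boundaries follow the headings in this section):

```r
# Hypothetical stand-in for titanic_clean
toy <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 0, 1, 0)),
  Age      = c(2, 19, 20, 40, 41, 35, 80)
)

# YOUNG: below 20, MIDDLE: 20 to 40 inclusive, OLD: above 40
toy$AgeGroup <- factor(
  ifelse(toy$Age < 20, "YOUNG",
         ifelse(toy$Age <= 40, "MIDDLE", "OLD")),
  levels = c("YOUNG", "MIDDLE", "OLD")
)

# Row-wise proportions make groups of different sizes comparable
age_rates <- prop.table(xtabs(~ AgeGroup + Survived, data = toy), margin = 1)
age_rates
```

Changing the two cut-off values is then enough to try a different grouping.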

4.4 Column: Fare

Q: What about fare?

The fare ranges that follow are my own choice, based on the previous boxplot.

A: The resulting bar plots are similar to the previous ones, so I abstain from deciding whether fare is a factor or not.

4.4.1 Fare: 0 - 8 (AKA: CHEAP)

# The condition and the subset must come from the same data frame;
# titanic_no_age keeps all 891 rows, so it is used for Fare.
fare_cond <- titanic_no_age[titanic_no_age$Fare < 8,]
barplot(xtabs(~Survived, fare_cond))

4.4.2 Fare: 8 - 33 (AKA: NORMAL)

fare_cond <- titanic_no_age[(titanic_no_age$Fare >= 8) & (titanic_no_age$Fare <= 33),]
barplot(xtabs(~Survived, fare_cond))

4.4.3 Fare: > 33 (AKA: EXPENSIVE)

fare_cond <- titanic_no_age[titanic_no_age$Fare > 33,]
barplot(xtabs(~Survived, fare_cond))
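
Since the conclusion depends on how the fare ranges are chosen, one alternative is to compare the fare distribution of survivors and non-survivors directly, without any cut-offs. This is only a sketch of that idea on made-up data (`toy` is hypothetical), not part of the original analysis:

```r
# Hypothetical stand-in for titanic_no_age
toy <- data.frame(
  Survived = factor(c(0, 0, 0, 1, 1, 1)),
  Fare     = c(7.25, 8.05, 13, 26, 71.28, 7.92)
)

# Median fare within each survival outcome
fare_medians <- aggregate(Fare ~ Survived, data = toy, FUN = median)
fare_medians

# Side-by-side boxplots show both distributions at once
boxplot(Fare ~ Survived, data = toy)
```

A clear difference between the two medians on the real data would hint that fare is related to survival, although fare is also tied to ticket class.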

5 Conclusion

Exploratory Data Analysis alone is not enough to answer the Big Question under all conditions. Since the data source itself frames this as a machine learning problem, it is better to continue in that direction. Otherwise, the conclusion of this EDA is that Sex = female probably had a strong influence on survival (judging from the figure) and Pclass = 3 probably had a small influence (again from the figure), while the remaining columns either have no effect or are undecided.

Exploratory Data Analysis - Titanic dataset

Muhammad Hanif Ibrahim

2022-05-23