I am going to use the Titanic dataset from https://www.kaggle.com/competitions/titanic/data. To be exact, I am going to use the train.csv file from that dataset.
What were the factors that affected a passenger’s survival in the sinking of the Titanic?
Before we begin analyzing anything, let’s first read the data.
titanic <- read.csv("data_input/titanic/train.csv")
Next, we can inspect the data:
head(titanic)
tail(titanic)
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
From the 3 outputs shown above, we can see that:
From the same source as the dataset, here are the descriptions of most of the columns:
Referring to the Big Question, I think that the columns PassengerId, Name, Ticket, and Cabin are irrelevant and can therefore be removed. As for the reasons: PassengerId doesn’t carry any meaning, Name is only an identifier, Ticket should be a unique identifier, and Cabin doesn’t mean anything without knowing the layout of the Titanic.
titanic <- titanic[,c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")]
head(titanic)
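The assumption that a column such as Ticket is a unique identifier can be checked rather than taken for granted. The following is a minimal sketch on made-up toy data, not on train.csv; the data frame `toy` and the helper `is_unique_id` are hypothetical names introduced only for illustration.

```r
# Toy stand-in for two columns of train.csv (values are made up)
toy <- data.frame(
  PassengerId = 1:5,
  Ticket = c("A1", "B2", "B2", "C3", "D4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when no value in the column is repeated.
# anyDuplicated() returns 0 when there are no duplicates.
is_unique_id <- function(x) anyDuplicated(x) == 0

id_check <- sapply(toy, is_unique_id)
print(id_check)
```

Running the same helper on the real columns, for example `is_unique_id(titanic$Ticket)` before the column is dropped, would show whether the reasoning holds for this dataset.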
Next, in order to be processed properly, each column must have the
correct data type. In this case, every column should be a factor
except those shown as <dbl> (the numeric columns Age and Fare).
titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$SibSp <- as.factor(titanic$SibSp)
titanic$Parch <- as.factor(titanic$Parch)
titanic$Embarked <- as.factor(titanic$Embarked)
str(titanic)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
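The repeated `as.factor()` calls can also be written as one step over a vector of column names. This is only a sketch on a made-up toy data frame; the names `toy` and `factor_cols` are assumptions for illustration and the values are not taken from train.csv.

```r
# Toy data frame with a mix of column types (values are made up)
toy <- data.frame(
  Survived = c(0, 1, 1),
  Pclass = c(3, 1, 3),
  Age = c(22, 38, 26),
  Sex = c("male", "female", "female"),
  stringsAsFactors = FALSE
)

# Columns that should be treated as categories
factor_cols <- c("Survived", "Pclass", "Sex")

# Convert all listed columns in one step; the other columns are untouched
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)
str(toy)
```

The result is the same as converting each column separately; the list of names simply makes it easier to see which columns are treated as categorical.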
Check if there are any missing values
anyNA(titanic)
## [1] TRUE
Check which column(s) have missing values
colSums(is.na(titanic))
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 177 0 0 0 0
It seems that the column Age has 177 missing values.
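Note that `is.na()` only counts true `NA` values. The `str()` output above lists `""` as one of the levels of Embarked, which suggests that some values are blank rather than `NA`, and such blanks are not counted by `colSums(is.na(...))`. A minimal sketch of counting both kinds, using a made-up toy data frame (the names `toy`, `na_counts` and `blank_counts` are assumptions for illustration):

```r
# Toy data: one NA in Age, two blank strings in Embarked (values are made up)
toy <- data.frame(
  Age = c(22, NA, 26, 35),
  Embarked = factor(c("S", "", "C", ""))
)

# NA values per column: blank strings are not counted here
na_counts <- colSums(is.na(toy))

# Blank strings per column, comparing the values as text
blank_counts <- sapply(toy, function(col) sum(as.character(col) == "", na.rm = TRUE))

print(na_counts)
print(blank_counts)
```

Whether blank values should be treated as missing is a separate decision; the sketch only shows how to find them.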
There are two treatments that can be used:
1. Remove the Age column so that no rows are lost.
2. Remove the rows with missing values in the Age column so that it can still be used in the analysis.
titanic_no_age <- titanic[,c(1:3, 5:8)] # Only Age column removed
titanic_clean <- na.omit(titanic) # Only rows with missing values removed
str(titanic_no_age)
## 'data.frame': 891 obs. of 7 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
str(titanic_clean)
## 'data.frame': 714 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
## $ Age : num 22 38 26 35 35 54 2 27 14 4 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 4 1 2 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 2 3 1 2 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 4 4 4 2 4 ...
## - attr(*, "na.action")= 'omit' Named int [1:177] 6 18 20 27 29 30 32 33 37 43 ...
## ..- attr(*, "names")= chr [1:177] "6" "18" "20" "27" ...
Brief Overview of the data
summary(titanic_clean)
## Survived Pclass Sex Age SibSp Parch Fare
## 0:424 1:186 female:261 Min. : 0.42 0:471 0:521 Min. : 0.00
## 1:290 2:173 male :453 1st Qu.:20.12 1:183 1:110 1st Qu.: 8.05
## 3:355 Median :28.00 2: 25 2: 68 Median : 15.74
## Mean :29.70 3: 12 3: 5 Mean : 34.69
## 3rd Qu.:38.00 4: 18 4: 4 3rd Qu.: 33.38
## Max. :80.00 5: 5 5: 5 Max. :512.33
## 8: 0 6: 1
## Embarked
## : 2
## C:130
## Q: 28
## S:554
##
##
##
summary(titanic_no_age)
## Survived Pclass Sex SibSp Parch Fare Embarked
## 0:549 1:216 female:314 0:608 0:678 Min. : 0.00 : 2
## 1:342 2:184 male :577 1:209 1:118 1st Qu.: 7.91 C:168
## 3:491 2: 28 2: 80 Median : 14.45 Q: 77
## 3: 16 3: 5 Mean : 32.20 S:644
## 4: 18 4: 4 3rd Qu.: 31.00
## 5: 5 5: 5 Max. :512.33
## 8: 7 6: 1
I will use titanic_no_age for most of the interpretation
and titanic_clean for the Age column only.
According to the summary above:
Columns that contain numerical values are best interpreted with a boxplot.
boxplot(titanic_no_age$Fare)
Column Fare Interpretation
According to the boxplot above, there seem to be many outliers above the upper whisker. I can only assume that those outliers were caused by scalpers reselling tickets, or by tickets bought through an auction.
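By default, base R's `boxplot()` draws as outliers the points lying more than 1.5 times the interquartile range beyond the box. With the quartiles from `summary(titanic_no_age)`, that gives an upper limit of roughly 31.00 + 1.5 × (31.00 − 7.91) ≈ 65.6 for Fare; `boxplot()` uses hinges, which can differ slightly from the quartiles reported by `summary()`. The sketch below shows the calculation on a made-up toy vector, not on the real column; the variable names are assumptions for illustration.

```r
# Toy fare values (made up, loosely in the range of the real data)
fare <- c(7.25, 7.9, 8.05, 13, 14.45, 26, 31, 53.1, 71.28, 512.33)

# Quartile-based upper limit: Q3 + 1.5 * IQR
q1 <- quantile(fare, 0.25, names = FALSE)
q3 <- quantile(fare, 0.75, names = FALSE)
upper_fence <- q3 + 1.5 * (q3 - q1)

# Values above the limit
outliers <- fare[fare > upper_fence]

# The points boxplot() itself would draw as outliers
plotted_outliers <- boxplot.stats(fare)$out

print(upper_fence)
print(outliers)
print(plotted_outliers)
```

On the real data, `boxplot.stats(titanic_no_age$Fare)$out` would list the outlying fares so they can be inspected directly.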
boxplot(titanic_clean$Age)
Column Age Interpretation
According to the boxplot above and the output of summary:
Q1: What can be interpreted from the following figure?
plot(xtabs(~Survived + Pclass + Sex, titanic_no_age))
A: Male passengers with a 3rd class ticket (Sex = male, Pclass = 3) form the largest group of those who did not survive. Female passengers survived more often than male passengers regardless of ticket class. This suggests that Sex has some influence on survival of the Titanic sinking. Pclass = 3 also seems to have some influence on the priority to be saved.
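Rectangle sizes in a mosaic plot can be hard to compare by eye, so the same cross-tabulation can also be read as proportions. The following is a minimal sketch with `prop.table()` on made-up toy data; the data frame `toy` and its values are assumptions for illustration only.

```r
# Toy data: 4 male and 4 female passengers (values are made up)
toy <- data.frame(
  Survived = factor(c(0, 0, 0, 1, 1, 1, 0, 1)),
  Sex = factor(c("male", "male", "male", "male",
                 "female", "female", "female", "female"))
)

# Counts of survival outcome per sex
counts <- xtabs(~ Sex + Survived, toy)

# Row-wise proportions: each row (sex) sums to 1
rates <- prop.table(counts, margin = 1)
print(rates)
```

For the real data, something like `prop.table(xtabs(~ Sex + Pclass + Survived, titanic_no_age), margin = c(1, 2))` would give the survival share within each combination of sex and class.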
Q2: What can be interpreted from the following figure?
plot(xtabs(~Survived + SibSp + Parch, titanic_no_age))
A: Looking at the figure roughly (even though it is not very clear), we can see that the rectangles for those who survived and those who did not survive are similar in size. From the previous summary, we know that fewer than half of those on board the Titanic survived, which the figure roughly mirrors. Based on this, I think that the number of siblings/spouses and the number of parents/children do not affect survival of the Titanic sinking.
Q: What about age? How do you check if age is a factor or not?
The age ranges that follow are my own interpretation, based on the previous boxplot.
A: Looking at the frequency plots, since the number of rows in each category (YOUNG, MIDDLE, OLD) is different, we can only compare survival within each category. All age groups show a similar comparison, with non-survival being the larger bar in each category. Unfortunately, there is no pattern indicating whether age is a factor or not. Since another grouping could give a different picture, I will abstain from deciding whether age is a factor or not.
# YOUNG: younger than 20
age_cond <- titanic_clean[titanic_clean$Age < 20,]
barplot(xtabs(~Survived, age_cond))
# MIDDLE: 20 to 40 (inclusive)
age_cond <- titanic_clean[(titanic_clean$Age >= 20) & (titanic_clean$Age <= 40),]
barplot(xtabs(~Survived, age_cond))
# OLD: older than 40
age_cond <- titanic_clean[titanic_clean$Age > 40,]
barplot(xtabs(~Survived, age_cond))
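The three separate filters can also be expressed as a single grouping column, which makes the categories easier to compare in one table. This is a sketch on made-up toy data, assuming the groups are meant to be under 20 (YOUNG), 20 to 40 inclusive (MIDDLE) and over 40 (OLD); the helper `age_group` and the data frame `toy` are hypothetical names.

```r
# Hypothetical helper: assign each age to one of the three groups
age_group <- function(age) {
  factor(
    ifelse(age < 20, "YOUNG", ifelse(age <= 40, "MIDDLE", "OLD")),
    levels = c("YOUNG", "MIDDLE", "OLD")
  )
}

# Toy data (values are made up)
toy <- data.frame(
  Age = c(2, 14, 19, 20, 27, 35, 40, 54, 80),
  Survived = factor(c(1, 1, 0, 0, 1, 0, 0, 0, 1))
)
toy$AgeGroup <- age_group(toy$Age)

# One table with a row per age group and a column per outcome
counts <- xtabs(~ AgeGroup + Survived, toy)
print(counts)
```

A call such as `barplot(t(counts), beside = TRUE)` would then draw all three groups side by side in a single figure.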
Q: What about fare?
The fare ranges that follow are my own interpretation, based on the previous boxplot.
A: It seems that the resulting bar plots are similar to the previous ones, so I will abstain from deciding whether fare is a factor or not.
# Fare has no missing values, so the full data (titanic_no_age) is used.
# The condition is built from the same data frame that is being filtered.
fare_cond <- titanic_no_age[titanic_no_age$Fare < 8,]
barplot(xtabs(~Survived, fare_cond))
fare_cond <- titanic_no_age[(titanic_no_age$Fare >= 8) & (titanic_no_age$Fare <= 33),]
barplot(xtabs(~Survived, fare_cond))
fare_cond <- titanic_no_age[titanic_no_age$Fare > 33,]
barplot(xtabs(~Survived, fare_cond))
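One thing to watch for when filtering: the logical condition should be built from the same data frame that is being subset. The two data frames here have different numbers of rows (891 and 714), so a condition computed on one does not line up with the rows of the other. A minimal sketch of the effect on made-up toy data (the names `full`, `clean`, `matched` and `mismatched` are assumptions for illustration):

```r
# Toy data: 5 rows, two of them with a missing Age (values are made up)
full <- data.frame(
  Fare = c(5, 50, 7, 60, 6),
  Age = c(30, NA, 25, 35, NA)
)
clean <- na.omit(full)  # keeps original rows 1, 3 and 4

# Condition and data frame match: only fares below 8 are kept
matched <- clean[clean$Fare < 8, ]

# Condition comes from the longer data frame: positions no longer line up
mismatched <- clean[full$Fare < 8, ]

print(matched)
print(mismatched)
```

In the mismatched case the selection is made by position, so rows that do not satisfy the condition, or rows filled with `NA`, can end up in the result without any error being raised.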
Exploratory Data Analysis alone is not enough to answer the Big Question for all conditions. Since the data source itself frames this as a machine learning problem, it is better to continue in that direction. Otherwise, the conclusion of this EDA is that Sex = female probably had a strong influence on survival (judging from the figure) and Pclass = 3 probably had a small influence (again judging from the figure), while the remaining columns either have no effect or are undecided.