1 Brief Description

This is my first R Markdown (RMD) document, written as an exercise in what I learned in the P4DS (Programming for Data Science) session of the Algoritma bootcamp.

I use the Titanic data from Kaggle (https://www.kaggle.com/competitions/titanic/overview) to analyze the train data and, at the end, use machine learning to build a model that predicts which passengers in the test data survived the Titanic shipwreck.

2 About Titanic

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

3 Getting the Data Ready

3.1 Data Input and Checking Data

titanic <- read.csv("data_input/train.csv")
str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Description:
Survived: whether the passenger survived (0=Did not survive, 1=Survived)
Pclass: ticket class (1=1st, 2=2nd, 3=3rd)
Sex: sex of the passenger
Age: age of the passenger (in years)
SibSp: number of siblings/spouses aboard the ship
Parch: number of parents/children aboard the ship
Ticket: ticket number
Fare: passenger fare
Cabin: cabin number
Embarked: port of embarkation (C=Cherbourg, Q=Queenstown, S=Southampton)

3.2 Inspecting Data and Data Cleansing

titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$SibSp <- as.factor(titanic$SibSp)
titanic$Parch <- as.factor(titanic$Parch)
str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch      : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

3.3 Check for Missing Values & Eliminate Data with NA

colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
Titanic <- titanic[!is.na(titanic$Age),]
anyNA(Titanic)
## [1] FALSE
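
One thing `is.na()` does not catch is an empty string: in the `str()` output above, Cabin and Embarked contain `""` values, which count as ordinary text rather than as NA. Below is a minimal sketch of how blanks could be counted separately. The data frame `toy` and its values are made up for illustration and are not rows from train.csv.

```r
# Toy data frame standing in for a few columns of train.csv (values are made up)
toy <- data.frame(
  Age      = c(22, NA, 35),
  Cabin    = c("", "C85", ""),
  Embarked = c("S", "", "C"),
  stringsAsFactors = FALSE
)

# is.na() only sees the missing Age; the empty strings are not counted
na_counts <- colSums(is.na(toy))

# Count empty strings separately in every column
blank_counts <- sapply(toy, function(x) sum(!is.na(x) & x == ""))

# Optionally recode blanks as NA so later checks treat them as missing
toy$Cabin[toy$Cabin == ""] <- NA
toy$Embarked[toy$Embarked == ""] <- NA
na_after <- colSums(is.na(toy))

na_counts
blank_counts
na_after
```

Another option would be to read the file with `read.csv(..., na.strings = c("", "NA"))`, so that blanks become NA at import time.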

3.4 Data Explanation

nrow(Titanic)
## [1] 714
summary(Titanic)
##   PassengerId       Survived      Pclass      Name               Sex     
##  Min.   :  1.0   Min.   :0.0000   1:186   Length:714         female:261  
##  1st Qu.:222.2   1st Qu.:0.0000   2:173   Class :character   male  :453  
##  Median :445.0   Median :0.0000   3:355   Mode  :character               
##  Mean   :448.6   Mean   :0.4062                                          
##  3rd Qu.:677.8   3rd Qu.:1.0000                                          
##  Max.   :891.0   Max.   :1.0000                                          
##                                                                          
##       Age        SibSp   Parch      Ticket               Fare       
##  Min.   : 0.42   0:471   0:521   Length:714         Min.   :  0.00  
##  1st Qu.:20.12   1:183   1:110   Class :character   1st Qu.:  8.05  
##  Median :28.00   2: 25   2: 68   Mode  :character   Median : 15.74  
##  Mean   :29.70   3: 12   3:  5                      Mean   : 34.69  
##  3rd Qu.:38.00   4: 18   4:  4                      3rd Qu.: 33.38  
##  Max.   :80.00   5:  5   5:  5                      Max.   :512.33  
##                  8:  0   6:  1                                      
##     Cabin           Embarked
##  Length:714          :  2   
##  Class :character   C:130   
##  Mode  :character   Q: 28   
##                     S:554   
##                             
##                             
## 

Summary:
1. There are 714 passengers, aged from about 5 months (0.42 years) to 80 years, with a mean age of about 30 (29.7)
2. Class 3 has the most passengers and Class 2 has the fewest
3. Most of the passengers are male
4. Among passengers with siblings/spouses aboard, most have only 1 sibling/spouse
5. Among passengers with parents/children aboard, most have only 1 parent/child
6. The majority of passengers have no siblings/spouses aboard, and the majority have no parents/children aboard
7. Most passengers embarked at port S (Southampton) and the fewest at port Q (Queenstown)

3.5 Creating New Variable for Group Age

Titanic$Group.Age[Titanic$Age<6] <- "<6 yo" 
Titanic$Group.Age[Titanic$Age>=6&Titanic$Age<12] <- "6-11 yo"
Titanic$Group.Age[Titanic$Age>=12&Titanic$Age<18] <- "12-17 yo"
Titanic$Group.Age[Titanic$Age>=18&Titanic$Age<25] <- "18-24 yo"
Titanic$Group.Age[Titanic$Age>=25&Titanic$Age<35] <- "25-34 yo"
Titanic$Group.Age[Titanic$Age>=35&Titanic$Age<45] <- "35-44 yo"
Titanic$Group.Age[Titanic$Age>=45&Titanic$Age<55] <- "45-54 yo"
Titanic$Group.Age[Titanic$Age>=55&Titanic$Age<65] <- "55-64 yo"
Titanic$Group.Age[Titanic$Age>=65&Titanic$Age<75] <- "65-74 yo"
Titanic$Group.Age[Titanic$Age>=75] <- ">75 yo"

Titanic$Group.Age <- as.factor(Titanic$Group.Age)
levels(Titanic$Group.Age)
##  [1] "<6 yo"    ">75 yo"   "12-17 yo" "18-24 yo" "25-34 yo" "35-44 yo"
##  [7] "45-54 yo" "55-64 yo" "6-11 yo"  "65-74 yo"
str(Titanic)
## 'data.frame':    714 obs. of  13 variables:
##  $ PassengerId: int  1 2 3 4 5 7 8 9 10 11 ...
##  $ Survived   : int  0 1 1 1 0 0 0 1 1 1 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
##  $ Age        : num  22 38 26 35 35 54 2 27 14 4 ...
##  $ SibSp      : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 4 1 2 2 ...
##  $ Parch      : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 2 3 1 2 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 4 4 4 2 4 ...
##  $ Group.Age  : Factor w/ 10 levels "<6 yo",">75 yo",..: 4 6 5 6 6 7 1 5 3 1 ...
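
Because `as.factor()` sorts the labels alphabetically, the levels above are not in age order ("6-11 yo" comes after "55-64 yo"). As an alternative sketch, `cut()` can build the same groups in one call and keep the levels in the order given. The ages below are hypothetical, and the last label is written as ">=75 yo" (my own choice) because the group includes passengers who are exactly 75.

```r
# Hypothetical ages for illustration only
age <- c(0.42, 5, 6, 11.5, 17, 18, 24.9, 34, 44, 54, 64, 74, 75, 80)

breaks <- c(0, 6, 12, 18, 25, 35, 45, 55, 65, 75, Inf)
labels <- c("<6 yo", "6-11 yo", "12-17 yo", "18-24 yo", "25-34 yo",
            "35-44 yo", "45-54 yo", "55-64 yo", "65-74 yo", ">=75 yo")

# right = FALSE makes each interval [lower, upper),
# matching conditions such as Age >= 6 & Age < 12
group_age <- cut(age, breaks = breaks, labels = labels, right = FALSE)

levels(group_age)  # levels follow the order of `labels`, youngest first
table(group_age)
```

With the levels in age order, tables and plots by age group list the groups from youngest to oldest without extra sorting.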

4 Exploratory Data Analysis

table(Titanic$Sex,Titanic$Group.Age)
##         
##          <6 yo >75 yo 12-17 yo 18-24 yo 25-34 yo 35-44 yo 45-54 yo 55-64 yo
##   female    21      0       23       62       63       45       26       10
##   male      23      1       22      103      138       75       47       21
##         
##          6-11 yo 65-74 yo
##   female      11        0
##   male        13       10
prop.table(table(Titanic$Sex,Titanic$Group.Age))*100
##         
##              <6 yo    >75 yo  12-17 yo  18-24 yo  25-34 yo  35-44 yo  45-54 yo
##   female  2.941176  0.000000  3.221289  8.683473  8.823529  6.302521  3.641457
##   male    3.221289  0.140056  3.081232 14.425770 19.327731 10.504202  6.582633
##         
##           55-64 yo   6-11 yo  65-74 yo
##   female  1.400560  1.540616  0.000000
##   male    2.941176  1.820728  1.400560

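The largest group is male passengers aged 18-24 and 25-34 (around 34% of all passengers). Together with female passengers of the same ages, passengers aged 18-34 make up about 51%, which means the majority of the Titanic’s passengers in this data are young adults.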
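
The share quoted above can be reproduced from the table counts. This is only a small arithmetic check; the counts are copied from the Sex by Group.Age table and the total of 714 from `nrow(Titanic)`.

```r
# Counts copied from the Sex x Group.Age table above
female <- c("18-24 yo" = 62,  "25-34 yo" = 63)
male   <- c("18-24 yo" = 103, "25-34 yo" = 138)
n_total <- 714

male_share <- sum(male) / n_total * 100                  # men aged 18-34
both_share <- (sum(male) + sum(female)) / n_total * 100  # everyone aged 18-34

round(c(male = male_share, both = both_share), 1)
```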

table(Titanic$Group.Age,Titanic$Pclass)
##           
##              1   2   3
##   <6 yo      3  13  28
##   >75 yo     1   0   0
##   12-17 yo   8   6  31
##   18-24 yo  27  35 103
##   25-34 yo  34  62 105
##   35-44 yo  46  28  46
##   45-54 yo  40  17  16
##   55-64 yo  21   6   4
##   6-11 yo    1   4  19
##   65-74 yo   5   2   3

Most teenagers and young adults hold 3rd class tickets, which suggests they were mostly ordinary people. 1st class, by contrast, is dominated by older passengers: those aged 35 and above account for 113 of its 186 passengers.

pclass1 <- Titanic[Titanic$Pclass==1,]
pclass2 <- Titanic[Titanic$Pclass==2,]
pclass3 <- Titanic[Titanic$Pclass==3,]

var(pclass1$Fare)
## [1] 6537.885
var(pclass2$Fare)
## [1] 173.9083
var(pclass3$Fare)
## [1] 100.865
sd(pclass1$Fare)
## [1] 80.85719
sd(pclass2$Fare)
## [1] 13.18743
sd(pclass3$Fare)
## [1] 10.04316

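Ticket fare for 1st Class varies the most (standard deviation of about 80.9), while ticket fare for 3rd Class varies the least (about 10.0). The data alone cannot show why passengers picked a class, but the lower and more predictable 3rd class fare may be one reason it has the most passengers.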
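
The per-class spread can also be computed without creating three subsets. The sketch below uses `tapply()` on a made-up data frame called `toy` (hypothetical fares, not the real ones); on the real data the same calls would take `Titanic$Fare` and `Titanic$Pclass`.

```r
# Toy fares (made-up values) standing in for Titanic$Fare and Titanic$Pclass
toy <- data.frame(
  Pclass = factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3)),
  Fare   = c(30, 80, 250, 10, 13, 26, 7, 8, 9)
)

# One call per statistic, applied within each class
fare_var <- tapply(toy$Fare, toy$Pclass, var)
fare_sd  <- tapply(toy$Fare, toy$Pclass, sd)

# Rows are statistics, columns are ticket classes
round(rbind(var = fare_var, sd = fare_sd), 2)
```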

4.1 Histogram

Since the ticket fares for all classes are skewed to the right (most fares are low, with a long tail of high fares), the median is the more suitable measure of center.
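
A quick numeric check of the skew direction is to compare the mean with the median: in `summary(Titanic)` the mean Fare (34.69) is well above the median (15.74), which points to a long tail of high fares. The sketch below shows the same check on made-up fares; the vector `fare` is hypothetical.

```r
# Made-up fares with a long tail of expensive tickets (illustration only)
fare <- c(7, 8, 8, 9, 10, 13, 15, 26, 80, 250)

fare_mean   <- mean(fare)
fare_median <- median(fare)

# Mean above median: the tail is on the high side of the histogram,
# so a few expensive tickets pull the mean up while the median stays put
c(mean = fare_mean, median = fare_median)
fare_mean > fare_median
```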

4.1.1 First Class

hist(pclass1$Fare)
abline(v = mean(pclass1$Fare), col = "red", lwd = 2)
abline(v = median(pclass1$Fare), col = "blue", lwd = 2)

4.1.2 Second Class

hist(pclass2$Fare)
abline(v = mean(pclass2$Fare), col = "red", lwd = 2)
abline(v = median(pclass2$Fare), col = "blue", lwd = 2)

4.1.3 Third Class

hist(pclass3$Fare)
abline(v = mean(pclass3$Fare), col = "red", lwd = 2)
abline(v = median(pclass3$Fare), col = "blue", lwd = 2)

4.2 Central Tendency

median(pclass1$Fare)
## [1] 69.3
median(pclass2$Fare)
## [1] 15.0458
median(pclass3$Fare)
## [1] 8.05
boxplot(Fare ~ Pclass, data = Titanic)

The boxplot shows that 1st class ticket fares are more widely spread, with more extreme outliers, than 3rd class ticket fares. 3rd class has the lowest median fare and the smallest variance, which may help explain why it has the most passengers.

boxplot(Fare ~ Embarked, data = pclass1)
boxplot(Fare ~ Embarked, data = pclass2)
boxplot(Fare ~ Embarked, data = pclass3)

Based on port of embarkation, the boxplots suggest that:
- 1st class: port S (Southampton) has the lowest fares and the fewest outliers
- 2nd class: port Q (Queenstown) has the lowest fares, with no outliers
- 3rd class: port Q (Queenstown) has the lowest fares and the fewest outliers

boxplot(Fare ~ Group.Age, data = pclass1, las = 2)
boxplot(Fare ~ Group.Age, data = pclass2, las = 2)
boxplot(Fare ~ Group.Age, data = pclass3, las = 2)

For 1st and 2nd class, the younger the passenger, the more expensive the ticket fare, except that the 35-44 age range has a slightly higher fare than this trend would suggest.
For 3rd class, the ticket fare is similar across almost all age ranges, except that children (the <6 and 6-11 age groups) have higher fares.

4.3 Find Correlation

Is there any correlation between Age and Ticket Fare?

Calculate the correlation value:

cor(Titanic$Age,Titanic$Fare)
## [1] 0.09606669

Visualize the correlation between Age and Ticket Fare:

plot(Titanic$Age,Titanic$Fare,
     col=Titanic$Pclass,
     xlab="Passenger Age",
     ylab="Ticket Fare",
     pch=18)
abline(lm(Titanic$Fare~Titanic$Age),
       lwd=1,
       lty=3)
legend("topright",
       legend=levels(Titanic$Pclass),
       fill=1:3)
title("Correlation between Age and Ticket Fare")

There is a very weak positive correlation between Age and Ticket Fare (r ≈ 0.10), so Age explains very little of the variation in Ticket Fare.
From the plot above, it seems that:
- Class 1 ticket fares are widely spread, ranging from the lowest to the highest fare
- Class 2 and 3 ticket fares fall within a narrow range at the lower end

4.4 Passenger Survival

xtabs(formula=Survived~Sex+Pclass,data=Titanic)
##         Pclass
## Sex       1  2  3
##   female 82 68 47
##   male   40 15 38
xtabs(formula=Survived~Group.Age,data=Titanic)
## Group.Age
##    <6 yo   >75 yo 12-17 yo 18-24 yo 25-34 yo 35-44 yo 45-54 yo 55-64 yo 
##       31        1       22       57       78       51       30       12 
##  6-11 yo 65-74 yo 
##        8        0

Based on the tables above and the charts below, we can see that the Titanic survivors are mostly female, from 1st Class, and in the 25-34 age range.
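
The numbers above are counts of survivors, so large groups naturally score high. A survival rate (survivors divided by passengers in the group) gives a fairer comparison. The sketch below uses only figures already shown in this document: survivors from the `xtabs()` output and passengers per age group from the Sex by Group.Age table (female plus male).

```r
# Survivors per age group, copied from the xtabs() output above
survivors <- c("<6 yo" = 31, "6-11 yo" = 8, "12-17 yo" = 22, "18-24 yo" = 57,
               "25-34 yo" = 78, "35-44 yo" = 51, "45-54 yo" = 30,
               "55-64 yo" = 12, "65-74 yo" = 0, ">75 yo" = 1)

# Passengers per age group: female + male counts from the Sex x Group.Age table
passengers <- c("<6 yo" = 44, "6-11 yo" = 24, "12-17 yo" = 45, "18-24 yo" = 165,
                "25-34 yo" = 201, "35-44 yo" = 120, "45-54 yo" = 73,
                "55-64 yo" = 31, "65-74 yo" = 10, ">75 yo" = 1)

# Survival rate in percent, rounded to one decimal
rate <- round(survivors / passengers * 100, 1)
rate
```

On these numbers the 25-34 group has the most survivors mainly because it is the largest group: its survival rate is about 39%, well below the roughly 70% for children under 6.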

4.4.1 By Gender

graphics::pie(xtabs(formula=Survived~Sex,data=Titanic))

4.4.2 By Class

graphics::pie(xtabs(formula=Survived~Pclass,data=Titanic))

4.4.3 By Group Age

graphics::pie(xtabs(formula=Survived~Group.Age,data=Titanic))
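
Pie slices show each group's share of all survivors, which depends on how big the group is. As an alternative sketch, a bar chart of survival rates puts every group on the same 0-100 scale. The survivor counts come from the `xtabs()` table by Sex and Pclass, and the passenger totals from `summary(Titanic)`.

```r
# Survivors from the xtabs() table; passenger totals from summary(Titanic)
surv_sex   <- c(female = 82 + 68 + 47, male = 40 + 15 + 38)
n_sex      <- c(female = 261, male = 453)
surv_class <- c("1" = 82 + 40, "2" = 68 + 15, "3" = 47 + 38)
n_class    <- c("1" = 186, "2" = 173, "3" = 355)

rate_sex   <- surv_sex / n_sex * 100
rate_class <- surv_class / n_class * 100

# Bars on a common 0-100 scale are easier to compare than pie slices
barplot(c(rate_sex, rate_class), ylim = c(0, 100),
        ylab = "Survival rate (%)",
        main = "Survival rate by sex and ticket class")
```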

4.5 Data Analysis

Which ticket class and gender have the most survivors?

library(ggplot2)
# create new object for survived and fare aggregation
require(data.table)
## Loading required package: data.table
dt <- data.table(Titanic)
Titanic.Survived <- dt[,list(m.Fare=median(Fare),s.Survived=sum(Survived)),
             by=c("Pclass","Sex","SibSp","Parch","Embarked","Age","Group.Age")]
Titanic.Survived
ggplot(data=Titanic.Survived,mapping=aes(x=Sex,y=s.Survived)) +
  geom_boxplot(outlier.shape=NA, col="black", fill="#b7bab6") +
  facet_wrap(~Pclass) +
  geom_jitter(aes(size=m.Fare, col=Pclass), alpha=0.5) +
  labs(
    title="Survival Passengers Characteristic in Titanic",
    subtitle="By Ticket Class 1, 2 and 3",
    caption="Source: https://www.kaggle.com/c/titanic",
    x=NULL,
    y="Total Survival",
    size="Ticket Fare",
    col="Ticket Class"
  ) +
  theme(
    plot.title = element_text(face="bold", size=14),
    plot.caption = element_text(vjust=-2)
  )

Based on the boxplot above, First Class has the most surviving passengers, and in all 3 ticket classes female passengers account for most of the survivors.

Additional information (based on the jitter point size): Ticket Fare is higher in First Class than in the others, while there is no notable difference between Second and Third Class ticket fares. It also seems that most surviving first class female passengers paid higher ticket fares.

Which age group has the most survivors?

ggplot(data=Titanic.Survived,mapping=aes(x=s.Survived,y=Group.Age)) +
  geom_boxplot(outlier.shape=NA, col="black", fill="#b7bab6") +
  facet_wrap(~Sex) +
  geom_jitter(aes(size=m.Fare, col=Sex), alpha=0.5) +
  labs(
    title="Survival Passengers Characteristic in Titanic",
    subtitle="By Group Age",
    caption="Source: https://www.kaggle.com/c/titanic",
    x="Total Survival",
    y=NULL,
    size="Ticket Fare",
    col="Gender"
  ) +
  theme(
    plot.title = element_text(face="bold", size=14),
    plot.caption = element_text(vjust=-2)
  )

Most survivors are female passengers in the 18-24, 25-34 and 35-44 age ranges, who also paid higher ticket fares than others. Most male passengers did not survive.

Which embarkation port has the most survivors?

ggplot(data=Titanic.Survived,mapping=aes(x=Embarked,y=s.Survived)) +
  geom_boxplot(outlier.shape=NA, col="black", fill="#b7bab6") +
  facet_wrap(~Pclass) +
  geom_jitter(aes(size=m.Fare, col=Pclass),alpha=0.5) +
  labs(
    title="Survival Passengers Characteristic in Titanic",
    subtitle="By Embarkation Port C=Cherbourg, Q=Queenstown, S=Southampton",
    caption="Source: https://www.kaggle.com/c/titanic",
    x=NULL,
    y="Total Survival",
    size="Ticket Fare",
    col="Pclass"
  ) +
  theme(
    plot.title = element_text(face="bold", size=14),
    plot.caption = element_text(vjust=-2)
  )

Most passengers from all classes embarked at Southampton, and many First Class passengers also embarked at Cherbourg. Therefore most survivors come from Southampton, with higher ticket fares for First Class than for the other classes.

Did passengers with siblings/spouses aboard also survive?

ggplot(data=Titanic.Survived,mapping=aes(x=s.Survived,y=SibSp)) +
  geom_boxplot(outlier.shape=NA, col="black", fill="#b7bab6") +
  facet_wrap(~Sex) +
  geom_jitter(aes(size=m.Fare, col=Pclass),alpha=0.5) +
  labs(
    title="Survival Passengers Characteristic in Titanic",
    subtitle="Passengers who have siblings",
    caption="Source: https://www.kaggle.com/c/titanic",
    x="Total Survival",
    y="Total Siblings",
    size="Ticket Fare",
    col="Pclass"
  ) +
  theme(
    plot.title = element_text(face="bold", size=14),
    plot.caption = element_text(vjust=-2)
  )

Most survivors are female passengers without siblings/spouses aboard, the majority of them from First Class, who paid higher ticket fares than others. The rest are mainly female passengers with 1 or 2 siblings/spouses.

Did passengers with parents and/or children aboard also survive?

ggplot(data=Titanic.Survived,mapping=aes(x=Parch,y=s.Survived)) +
  geom_boxplot(outlier.shape=NA, col="black", fill="#b7bab6") +
  facet_wrap(~Sex) +
  geom_jitter(aes(size=m.Fare, col=Pclass),alpha=0.5) +
  labs(
    title="Survival Passengers Characteristic in Titanic",
    subtitle="Passengers with Parents and/or Children",
    caption="Source: https://www.kaggle.com/c/titanic",
    x="No of Parents and/or Children",
    y="Total Survival",
    size="Ticket Fare",
    col="Pclass"
  ) +
  theme(
    plot.title = element_text(face="bold", size=14),
    plot.caption = element_text(vjust=-2)
  )

Most female passengers without parents and/or children aboard survived, the majority of them from First Class, who paid higher ticket fares than others. Passengers with only 1 or 2 parents and/or children make up the next largest group of survivors.

From the analysis above, we know that the 18-24, 25-34 and 35-44 age groups have the most survivors. Let’s look at the average ticket fare for those 3 age groups.

# subset data from Titanic
Titanic.fare <- Titanic[Titanic$Group.Age %in% c("18-24 yo","25-34 yo","35-44 yo"),]

# check levels
levels(Titanic.fare$Group.Age)
##  [1] "<6 yo"    ">75 yo"   "12-17 yo" "18-24 yo" "25-34 yo" "35-44 yo"
##  [7] "45-54 yo" "55-64 yo" "6-11 yo"  "65-74 yo"
# drop levels
Titanic.fare$Group.Age <- droplevels(Titanic.fare$Group.Age)

# check data
head(Titanic.fare)
# create new object for survived and fare aggregation
require(data.table)
dt <- data.table(Titanic.fare)
Titanic.fare.agg <- dt[,list(Fare=mean(Fare),Survived=sum(Survived)),
             by=c("Group.Age")]
Titanic.fare.agg
# transform data from wide to long
library(tidyr)

Titanic.fare.pivot <- pivot_longer(data=Titanic.fare.agg,
                                   cols=c("Fare","Survived"),
                                   names_to="parameter")
Titanic.fare.pivot
ggplot(data=Titanic.fare.pivot, mapping=aes(x=Group.Age, y=value)) +
  geom_col(aes(fill=parameter),position="dodge") +
  labs(
    title="Average Ticket Fare and Total Survival by Age Group",
    x=NULL,
    y=NULL
  ) +
  geom_label(mapping=aes(label=sprintf("%0.2f", round(value, digits = 2))),
             col="black") +
  theme(
    plot.title = element_text(face="bold", size=14)
  ) 

The 35-44 age group has the highest average ticket fare of the three age groups, while the 25-34 age group has the most survivors and the lowest average ticket fare.

Is there any correlation between Ticket Fare and Number of Survivors?

Titanic.fare.agg.df <- as.data.frame(Titanic.fare.agg)
cor(Titanic.fare.agg.df$Fare,Titanic.fare.agg.df$Survived)
## [1] -0.8291379

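A value of about -0.83 indicates a strong negative correlation, not the absence of one: across these three age groups, a higher average Ticket Fare goes with a lower Number of Survivors. However, the value is computed from only three aggregated points, so it is not reliable evidence of a relationship between Ticket Fare and survival for individual passengers.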
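
To see how little three points can tell us, here is a minimal sketch with `cor.test()`. The survivor counts are the ones from the `xtabs()` output for these three age groups, but the mean fares are placeholders that I made up, not the values in `Titanic.fare.agg`.

```r
# Survivor counts are from the xtabs() output; the mean fares are placeholders,
# NOT the values in Titanic.fare.agg
survived  <- c("18-24 yo" = 57, "25-34 yo" = 78, "35-44 yo" = 51)
mean_fare <- c("18-24 yo" = 30, "25-34 yo" = 25, "35-44 yo" = 40)

res <- cor.test(mean_fare, survived)

res$estimate  # strongly negative coefficient
res$p.value   # but far from the usual 0.05 threshold with only 3 points
```

With n = 3 the test has a single degree of freedom, so even a coefficient of -0.83 gives a p-value of roughly 0.38. Running the analysis per passenger instead of on three aggregated rows would be a more informative check.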