This is my first R Markdown (RMD) document, written as an exercise in what I learned in the P4DS (Programming for Data Science) session of the Algoritma bootcamp.
I use the Titanic data from Kaggle (https://www.kaggle.com/competitions/titanic/overview) to analyze the train data and, at the end, use machine learning to build a model that predicts which passengers survived the Titanic shipwreck based on the test data.
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
titanic <- read.csv("data_input/train.csv")
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Description:
- Survived: passenger survival (0 = did not survive, 1 = survived)
- Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Sex: gender of the passenger
- Age: age of the passenger (in years)
- SibSp: number of siblings/spouses aboard the ship
- Parch: number of parents/children aboard the ship
- Ticket: ticket number
- Fare: passenger fare
- Cabin: cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$SibSp <- as.factor(titanic$SibSp)
titanic$Parch <- as.factor(titanic$Parch)
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
colSums(is.na(titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
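Note that `is.na()` only counts values that R already stores as NA. The `str()` output above shows empty strings in Cabin and an empty level `""` in Embarked, and those are not counted here. The following is a minimal sketch of one way to catch them, assuming empty strings should be treated as missing. The CSV text is made up for illustration and is not the Kaggle file.

```r
# Toy CSV text standing in for train.csv (values are invented for illustration)
csv_text <- "PassengerId,Age,Cabin,Embarked
1,22,,S
2,38,C85,C
3,,,
4,35,C123,S"

# Default read: blank numeric fields become NA, blank text fields stay as ""
raw <- read.csv(text = csv_text)
colSums(is.na(raw))

# Assumption: an empty string means "missing", so declare it in na.strings
clean <- read.csv(text = csv_text, na.strings = c("", "NA"))
colSums(is.na(clean))
```

With the real file, the same `na.strings` argument could be passed to the `read.csv("data_input/train.csv")` call.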
Titanic <- titanic[!is.na(titanic$Age),]
anyNA(Titanic)
## [1] FALSE
nrow(Titanic)
## [1] 714
summary(Titanic)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 Min. :0.0000 1:186 Length:714 female:261
## 1st Qu.:222.2 1st Qu.:0.0000 2:173 Class :character male :453
## Median :445.0 Median :0.0000 3:355 Mode :character
## Mean :448.6 Mean :0.4062
## 3rd Qu.:677.8 3rd Qu.:1.0000
## Max. :891.0 Max. :1.0000
##
## Age SibSp Parch Ticket Fare
## Min. : 0.42 0:471 0:521 Length:714 Min. : 0.00
## 1st Qu.:20.12 1:183 1:110 Class :character 1st Qu.: 8.05
## Median :28.00 2: 25 2: 68 Mode :character Median : 15.74
## Mean :29.70 3: 12 3: 5 Mean : 34.69
## 3rd Qu.:38.00 4: 18 4: 4 3rd Qu.: 33.38
## Max. :80.00 5: 5 5: 5 Max. :512.33
## 8: 0 6: 1
## Cabin Embarked
## Length:714 : 2
## Class :character C:130
## Mode :character Q: 28
## S:554
##
##
##
Summary:
1. There are 714 passengers with a known age, ranging from about 5 months to 80 years old (29.7 years on average)
2. Class 3 has the most passengers and Class 2 has the fewest
3. Most of the passengers are male
4. Among passengers travelling with siblings/spouses, most have only 1 sibling/spouse aboard
5. Among passengers travelling with parents/children, most have only 1 parent/child aboard
6. The majority of passengers travel without siblings/spouses, and the majority travel without parents/children
7. Most passengers embarked at port S (Southampton) and the fewest at port Q (Queenstown)
Titanic$Group.Age[Titanic$Age<6] <- "<6 yo"
Titanic$Group.Age[Titanic$Age>=6&Titanic$Age<12] <- "6-11 yo"
Titanic$Group.Age[Titanic$Age>=12&Titanic$Age<18] <- "12-17 yo"
Titanic$Group.Age[Titanic$Age>=18&Titanic$Age<25] <- "18-24 yo"
Titanic$Group.Age[Titanic$Age>=25&Titanic$Age<35] <- "25-34 yo"
Titanic$Group.Age[Titanic$Age>=35&Titanic$Age<45] <- "35-44 yo"
Titanic$Group.Age[Titanic$Age>=45&Titanic$Age<55] <- "45-54 yo"
Titanic$Group.Age[Titanic$Age>=55&Titanic$Age<65] <- "55-64 yo"
Titanic$Group.Age[Titanic$Age>=65&Titanic$Age<75] <- "65-74 yo"
Titanic$Group.Age[Titanic$Age>=75] <- ">75 yo"
Titanic$Group.Age <- as.factor(Titanic$Group.Age)
levels(Titanic$Group.Age)
## [1] "<6 yo" ">75 yo" "12-17 yo" "18-24 yo" "25-34 yo" "35-44 yo"
## [7] "45-54 yo" "55-64 yo" "6-11 yo" "65-74 yo"
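The levels above are sorted alphabetically, so "12-17 yo" comes before "6-11 yo" and ">75 yo" sits second. As an alternative sketch (not the method used in this document), `cut()` can build the same bins in one call and keep them in age order. The ages below are hypothetical, and the last label is written as ">=75 yo" here only because the condition used is `>= 75`.

```r
# Hypothetical ages chosen to sit on or near the bin edges
age <- c(0.42, 5, 6, 11.5, 12, 17, 18, 24, 25, 34,
         35, 44, 45, 54, 55, 64, 65, 74, 75, 80)

breaks <- c(0, 6, 12, 18, 25, 35, 45, 55, 65, 75, Inf)
labels <- c("<6 yo", "6-11 yo", "12-17 yo", "18-24 yo", "25-34 yo",
            "35-44 yo", "45-54 yo", "55-64 yo", "65-74 yo", ">=75 yo")

# right = FALSE makes every bin [lower, upper), matching the >= / < conditions
group_age <- cut(age, breaks = breaks, labels = labels, right = FALSE)

levels(group_age)  # levels follow the order of `labels`, not the alphabet
table(group_age)
```

Because the levels are in age order, tables and plots built from such a factor list the groups from youngest to oldest.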
str(Titanic)
## 'data.frame': 714 obs. of 13 variables:
## $ PassengerId: int 1 2 3 4 5 7 8 9 10 11 ...
## $ Survived : int 0 1 1 1 0 0 0 1 1 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
## $ Age : num 22 38 26 35 35 54 2 27 14 4 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 4 1 2 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 2 3 1 2 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 4 4 4 2 4 ...
## $ Group.Age : Factor w/ 10 levels "<6 yo",">75 yo",..: 4 6 5 6 6 7 1 5 3 1 ...
table(Titanic$Sex,Titanic$Group.Age)
##
## <6 yo >75 yo 12-17 yo 18-24 yo 25-34 yo 35-44 yo 45-54 yo 55-64 yo
## female 21 0 23 62 63 45 26 10
## male 23 1 22 103 138 75 47 21
##
## 6-11 yo 65-74 yo
## female 11 0
## male 13 10
prop.table(table(Titanic$Sex,Titanic$Group.Age))*100
##
## <6 yo >75 yo 12-17 yo 18-24 yo 25-34 yo 35-44 yo 45-54 yo
## female 2.941176 0.000000 3.221289 8.683473 8.823529 6.302521 3.641457
## male 3.221289 0.140056 3.081232 14.425770 19.327731 10.504202 6.582633
##
## 55-64 yo 6-11 yo 65-74 yo
## female 1.400560 1.540616 0.000000
## male 2.941176 1.820728 1.400560
The largest group is male passengers aged 18-24 and 25-34 (about 34% of passengers with a known age). Together with female passengers of the same ages, the 18-34 range accounts for about 51%, which means the Titanic’s passengers are mostly young adults.
table(Titanic$Group.Age,Titanic$Pclass)
##
## 1 2 3
## <6 yo 3 13 28
## >75 yo 1 0 0
## 12-17 yo 8 6 31
## 18-24 yo 27 35 103
## 25-34 yo 34 62 105
## 35-44 yo 46 28 46
## 45-54 yo 40 17 16
## 55-64 yo 21 6 4
## 6-11 yo 1 4 19
## 65-74 yo 5 2 3
Most of the passengers aged 18-34 hold 3rd class tickets, which suggests they are mostly ordinary people, while 1st class tickets are dominated by mature adults and the elderly.
pclass1 <- Titanic[Titanic$Pclass==1,]
pclass2 <- Titanic[Titanic$Pclass==2,]
pclass3 <- Titanic[Titanic$Pclass==3,]
var(pclass1$Fare)
## [1] 6537.885
var(pclass2$Fare)
## [1] 173.9083
var(pclass3$Fare)
## [1] 100.865
sd(pclass1$Fare)
## [1] 80.85719
sd(pclass2$Fare)
## [1] 13.18743
sd(pclass3$Fare)
## [1] 10.04316
Ticket fares vary the most in 1st class (standard deviation of about 80.9) and the least in 3rd class (about 10.0). 3rd class is also the class with the most passengers.
Since the ticket fares in all classes are skewed to the right (a few very high fares pull the mean above the median), the median is the more suitable measure of center.
hist(pclass1$Fare)
abline(v = mean(pclass1$Fare), col = "red", lwd = 2)
abline(v = median(pclass1$Fare), col = "blue", lwd = 2)
hist(pclass2$Fare)
abline(v = mean(pclass2$Fare), col = "red", lwd = 2)
abline(v = median(pclass2$Fare), col = "blue", lwd = 2)
hist(pclass3$Fare)
abline(v = mean(pclass3$Fare), col = "red", lwd = 2)
abline(v = median(pclass3$Fare), col = "blue", lwd = 2)
median(pclass1$Fare)
## [1] 69.3
median(pclass2$Fare)
## [1] 15.0458
median(pclass3$Fare)
## [1] 8.05
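The per-class statistics above are computed one call at a time. As a compact sketch, `split()` and `sapply()` can produce mean, median and standard deviation for every class at once. The fares below are invented toy values, used only so the example runs on its own.

```r
# Toy data (invented): a few fares per class, each with one high value
fares <- data.frame(
  Pclass = factor(c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)),
  Fare   = c(30, 60, 80, 500, 10, 13, 26, 7, 8, 8, 20)
)

# One column per class, one row per statistic
stats <- sapply(split(fares$Fare, fares$Pclass), function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x))
})
stats

# A mean above the median is a quick hint of right skew
stats["mean", ] > stats["median", ]
```

On the real data the same call would be `split(Titanic$Fare, Titanic$Pclass)`, without creating `pclass1`, `pclass2` and `pclass3` first.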
boxplot(Fare~Pclass,data=Titanic)
The boxplot shows that 1st class fares are more spread out and have more extreme outliers than 3rd class fares. 3rd class has the lowest fares and the smallest variance, which may be why most passengers bought 3rd class tickets.
boxplot(Fare~Embarked,data=pclass1)
boxplot(Fare~Embarked,data=pclass2)
boxplot(Fare~Embarked,data=pclass3)
Based on port of embarkation, it seems that:
- 1st class passengers would choose port S (Southampton) because it has the lowest fare and the fewest outliers
- 2nd class passengers would choose port Q (Queenstown) because it has the lowest fare and no outliers
- 3rd class passengers would choose port Q (Queenstown) because it has the lowest fare and the fewest outliers
boxplot(Fare~Group.Age,data=pclass1,las=2)
boxplot(Fare~Group.Age,data=pclass2,las=2)
boxplot(Fare~Group.Age,data=pclass3,las=2)
For 1st and 2nd class, the younger the passenger, the more expensive the ticket fare, except that the 35-44 age group pays slightly more than this trend would suggest.
For 3rd class, the ticket fare is similar across almost all age groups, except for children (under 12 years old), who have higher ticket fares.
Is there any correlation between Age and Ticket Fare?
Calculate the correlation value:
cor(Titanic$Age,Titanic$Fare)
## [1] 0.09606669
Visualize the correlation between Age and Ticket Fare:
plot(Titanic$Age,Titanic$Fare,
col=Titanic$Pclass,
xlab="Passenger Age",
ylab="Ticket Fare",
pch=18)
abline(lm(Titanic$Fare~Titanic$Age),
lwd=1,
lty=3)
legend("topright",
legend=levels(Titanic$Pclass),
fill=1:3)
title("Correlation between Age and Ticket Fare")
The correlation between Age and Ticket Fare is very weak (about 0.10), so Age explains almost none of the variation in Ticket Fare.
From the plot above, it seems that:
- Class 1 ticket fares are widely spread, ranging from the lowest to the highest fare
- Class 2 and 3 ticket fares fall within a narrow range at the lower end
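To put the Age and Fare correlation in perspective, the sketch below uses only the numbers reported above (r = 0.09606669 from `cor()` and n = 714 rows). It assumes the standard t-test for a Pearson correlation. With the data loaded, `cor.test(Titanic$Age, Titanic$Fare)` runs the same test directly.

```r
r <- 0.09606669  # correlation between Age and Fare reported above
n <- 714         # rows in Titanic after dropping missing ages

# Share of the variance in Fare that a straight line in Age accounts for
r_squared <- r^2
r_squared

# t statistic and two-sided p-value for H0: true correlation is zero
t_stat  <- r * sqrt((n - 2) / (1 - r^2))
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
t_stat
p_value
```

The p-value comes out at about 0.01, so with this many rows the relationship is detectable, yet Age still accounts for under 1% of the variance in Fare.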
xtabs(formula=Survived~Sex+Pclass,data=Titanic)
## Pclass
## Sex 1 2 3
## female 82 68 47
## male 40 15 38
xtabs(formula=Survived~Group.Age,data=Titanic)
## Group.Age
## <6 yo >75 yo 12-17 yo 18-24 yo 25-34 yo 35-44 yo 45-54 yo 55-64 yo
## 31 1 22 57 78 51 30 12
## 6-11 yo 65-74 yo
## 8 0
Based on the charts below, most Titanic survivors are female, travelled in 1st Class, and are 25-34 years old.
graphics::pie(xtabs(formula=Survived~Sex,data=Titanic))
graphics::pie(xtabs(formula=Survived~Pclass,data=Titanic))
graphics::pie(xtabs(formula=Survived~Group.Age,data=Titanic))
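Survivor counts depend on how many passengers each group had to begin with. As a rough sketch, the survival rate per group can be worked out from numbers already printed in this document: survivor counts from the `xtabs()` output and group sizes from `summary(Titanic)`.

```r
# Survivors by sex and by class, summed from the xtabs output above
survived_sex   <- c(female = 82 + 68 + 47, male = 40 + 15 + 38)
survived_class <- c("1" = 82 + 40, "2" = 68 + 15, "3" = 47 + 38)

# Group sizes taken from summary(Titanic)
total_sex   <- c(female = 261, male = 453)
total_class <- c("1" = 186, "2" = 173, "3" = 355)

# Survival rate in percent
rate_sex   <- round(survived_sex / total_sex * 100, 1)
rate_class <- round(survived_class / total_class * 100, 1)
rate_sex    # female 75.5, male 20.5
rate_class  # 1st 65.6, 2nd 48.0, 3rd 23.9
```

With the data frame loaded, `aggregate(Survived ~ Sex, data = Titanic, FUN = mean)` should give the same proportions without typing the counts by hand.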
Which ticket class and gender have the most survivors?
library(ggplot2)
# create new object for survived and fare aggregation
require(data.table)
## Loading required package: data.table
dt <- data.table(Titanic)
Titanic.Survived <- dt[,list(m.Fare=median(Fare),s.Survived=sum(Survived)),
                       by=c("Pclass","Sex","SibSp","Parch","Embarked","Age","Group.Age")]
Titanic.Survived
ggplot(data=Titanic.Survived,mapping=aes(x=Sex,y=s.Survived)) +
geom_boxplot(outlier.shape=NA, col="black", fill="#b7bab6") +
facet_wrap(~Pclass) +
geom_jitter(aes(size=m.Fare, col=Pclass), alpha=0.5) +
labs(
title="Survival Passengers Characteristic in Titanic",
subtitle="By Ticket Class 1, 2 and 3",
caption="Source: https://www.kaggle.com/c/titanic",
x=NULL,
y="Total Survival",
size="Ticket Fare",
col="Ticket Class"
) +
theme(
plot.title = element_text(face="bold", size=14),
plot.caption = element_text(vjust=-2)
)
Based on the boxplot above, First Class has the most survivors, and in all 3 ticket classes female passengers make up most of the survivors.
Additional information (based on the jitter size) is that Ticket Fare is higher in First Class than in the others, while there is no significant difference between Second and Third Class ticket fares. It also seems that most surviving First Class female passengers paid more for their tickets.
Which age group has the most survivors?
ggplot(data=Titanic.Survived,mapping=aes(x=s.Survived,y=Group.Age)) +
geom_boxplot(outlier.shape=NA, col="black", fill="#b7bab6") +
facet_wrap(~Sex) +
geom_jitter(aes(size=m.Fare, col=Sex), alpha=0.5) +
labs(
title="Survival Passengers Characteristic in Titanic",
subtitle="By Group Age",
caption="Source: https://www.kaggle.com/c/titanic",
x="Total Survival",
y=NULL,
size="Ticket Fare",
col="Gender"
) +
theme(
plot.title = element_text(face="bold", size=14),
plot.caption = element_text(vjust=-2)
)
Most survivors are female passengers aged 18-24, 25-34 and 35-44, who also paid higher ticket fares than others. Most male passengers did not survive.
Which embarkation port has the most survivors?
ggplot(data=Titanic.Survived,mapping=aes(x=Embarked,y=s.Survived)) +
geom_boxplot(outlier.shape=NA, col="black", fill="#b7bab6") +
facet_wrap(~Pclass) +
geom_jitter(aes(size=m.Fare, col=Pclass),alpha=0.5) +
labs(
title="Survival Passengers Characteristic in Titanic",
subtitle="By Embarkation Port C=Cherbourg, Q=Queenstown, S=Southampton",
caption="Source: https://www.kaggle.com/c/titanic",
x=NULL,
y="Total Survival",
size="Ticket Fare",
col="Pclass"
) +
theme(
plot.title = element_text(face="bold", size=14),
plot.caption = element_text(vjust=-2)
)
Most passengers in every class embarked at Southampton, and many First Class passengers also embarked at Cherbourg. As a result, most survivors embarked at Southampton, and First Class fares are higher than those of the other classes.
Did passengers with siblings/spouses also survive?
ggplot(data=Titanic.Survived,mapping=aes(x=s.Survived,y=SibSp)) +
geom_boxplot(outlier.shape=NA, col="black", fill="#b7bab6") +
facet_wrap(~Sex) +
geom_jitter(aes(size=m.Fare, col=Pclass),alpha=0.5) +
labs(
title="Survival Passengers Characteristic in Titanic",
subtitle="Passengers who have siblings",
caption="Source: https://www.kaggle.com/c/titanic",
x="Total Survival",
y="Total Siblings",
size="Ticket Fare",
col="Pclass"
) +
theme(
plot.title = element_text(face="bold", size=14),
plot.caption = element_text(vjust=-2)
)
Most survivors are female passengers without siblings/spouses aboard, the majority of them from First Class, who paid higher ticket fares than others. The rest are mostly female passengers with 1 or 2 siblings/spouses.
Did passengers with parents and/or children also survive?
ggplot(data=Titanic.Survived,mapping=aes(x=Parch,y=s.Survived)) +
geom_boxplot(outlier.shape=NA, col="black", fill="#b7bab6") +
facet_wrap(~Sex) +
geom_jitter(aes(size=m.Fare, col=Pclass),alpha=0.5) +
labs(
title="Survival Passengers Characteristic in Titanic",
subtitle="Passengers with Parents and/or Children",
caption="Source: https://www.kaggle.com/c/titanic",
x="No of Parents and/or Children",
y="Total Survival",
size="Ticket Fare",
col="Pclass"
) +
theme(
plot.title = element_text(face="bold", size=14),
plot.caption = element_text(vjust=-2)
)
Most of the survivors are female passengers without parents and/or children aboard, the majority of them from First Class, who paid higher ticket fares than others. Passengers with 1 or 2 parents and/or children are the next largest group of survivors.
From the analysis above, we know that the 18-24, 25-34 and 35-44 age groups have the most survivors. Let’s look at the average ticket fare for those 3 age groups.
# subset data from Titanic
Titanic.fare <- Titanic[Titanic$Group.Age %in% c("18-24 yo","25-34 yo","35-44 yo"),]
# check levels
levels(Titanic.fare$Group.Age)
## [1] "<6 yo" ">75 yo" "12-17 yo" "18-24 yo" "25-34 yo" "35-44 yo"
## [7] "45-54 yo" "55-64 yo" "6-11 yo" "65-74 yo"
# drop levels
Titanic.fare$Group.Age <- droplevels(Titanic.fare$Group.Age)
# check data
head(Titanic.fare)
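Subsetting a factor keeps every original level, which is why the level check still lists all ten groups. Here is a small standalone sketch of what `droplevels()` changes, using a hypothetical four-value factor rather than the real column:

```r
# Hypothetical factor with one group that the subset will exclude
g <- factor(c("18-24 yo", "65-74 yo", "25-34 yo", "35-44 yo"))

# Subsetting removes the value but not its level
sub <- g[g %in% c("18-24 yo", "25-34 yo", "35-44 yo")]
nlevels(sub)         # still 4: "65-74 yo" remains as an empty level

# droplevels() discards levels that no longer occur
sub_clean <- droplevels(sub)
nlevels(sub_clean)   # 3
levels(sub_clean)
```

Unused levels would otherwise show up as empty groups in tables and as empty positions on plot axes.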
# create new object for survived and fare aggregation
require(data.table)
dt <- data.table(Titanic.fare)
Titanic.fare.agg <- dt[,list(Fare=mean(Fare),Survived=sum(Survived)),
                       by=c("Group.Age")]
Titanic.fare.agg
# transform data from wide to long
library(tidyr)
Titanic.fare.pivot <- pivot_longer(data=Titanic.fare.agg,
                                   cols=c("Fare","Survived"),
                                   names_to="parameter")
Titanic.fare.pivot
ggplot(data=Titanic.fare.pivot, mapping=aes(x=Group.Age, y=value)) +
geom_col(aes(fill=parameter),position="dodge") +
labs(
title="Average Ticket Fare and Total Survival by Age Group",
x=NULL,
y=NULL
) +
#scale_fill_gradient(low = "#0000ff",high = "#ffa500") +
geom_label(mapping=aes(label=sprintf("%0.2f", round(value, digits = 2))),
col="black") +
theme(
plot.title = element_text(face="bold", size=14)
)
The 35-44 age group has the highest average ticket fare of the three, while the 25-34 age group has the most survivors and the lowest average ticket fare.
Is there any correlation between Ticket Fare and Number of Survivors?
Titanic.fare.agg.df <- as.data.frame(Titanic.fare.agg)
cor(Titanic.fare.agg.df$Fare,Titanic.fare.agg.df$Survived)
## [1] -0.8291379
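The value of -0.83 is a strong negative correlation: across these three age groups, a higher average Ticket Fare goes with a lower Number of Survivors. However, it is computed from only three data points, so it is weak evidence of any real relationship between fare and survival.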
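A correlation computed from three aggregated rows is hard to interpret on its own. The following is a minimal sketch of how `cor.test()` could be used to check it. The survivor counts are the ones from the `xtabs()` output earlier, but the average fares are hypothetical placeholders, since the aggregated fares are not printed in this document.

```r
# Survived: counts for the three age groups from the xtabs output earlier.
# Fare: HYPOTHETICAL placeholder averages, not values from the data.
agg <- data.frame(
  Group.Age = c("18-24 yo", "25-34 yo", "35-44 yo"),
  Fare      = c(33, 30, 40),
  Survived  = c(57, 78, 51)
)

# Pearson correlation test on three points (1 degree of freedom)
res <- cor.test(agg$Fare, agg$Survived)
res$estimate  # strongly negative
res$p.value   # well above 0.05 despite the large estimate
```

On the real aggregate the call would be `cor.test(Titanic.fare.agg.df$Fare, Titanic.fare.agg.df$Survived)`. With only three rows, even a correlation near -0.8 is not statistically distinguishable from zero.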