This is my first RMD, written as an exercise in what I’ve learned in the P4DS (Programming for Data Science) session of the Algoritma bootcamp.
I use the Titanic data from Kaggle (https://www.kaggle.com/competitions/titanic/overview) to analyze the train data and, at the end, use machine learning to create a model that predicts which passengers survived the Titanic shipwreck based on the test data.
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
titanic <- read.csv("data_input/train.csv")
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Description:
- Survived: survival status (0 = did not survive, 1 = survived)
- Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Sex: sex of the passenger
- Age: age of the passenger (in years)
- SibSp: number of siblings/spouses aboard the ship
- Parch: number of parents/children aboard the ship
- Ticket: ticket number
- Fare: passenger fare
- Cabin: cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$SibSp <- as.factor(titanic$SibSp)
titanic$Parch <- as.factor(titanic$Parch)
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
colSums(is.na(titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
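A note on the check above: `is.na()` only counts true `NA` values, so the empty strings visible in `Cabin` and `Embarked` are reported as 0 missing. The sketch below shows one way to have them counted, using the `na.strings` argument of `read.csv()`. The four rows in `csv_text` are a made-up miniature in the style of the data, not rows taken from train.csv.

```r
# Illustrative mini CSV (hypothetical rows): blanks in Age, Cabin and Embarked
csv_text <- "Age,Cabin,Embarked
22,,S
38,C85,C
,,Q
35,C123,"

# Default read: blank numeric fields become NA, blank text fields stay ""
raw <- read.csv(text = csv_text)
na_raw <- colSums(is.na(raw))

# Treat empty strings as missing as well
clean <- read.csv(text = csv_text, na.strings = c("", "NA"))
na_clean <- colSums(is.na(clean))

na_raw
na_clean
```

Applied to the real file, the same argument would be passed to `read.csv("data_input/train.csv", ...)`; whether blank cabins should count as missing is a modelling choice.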
Titanic <- titanic[!is.na(titanic$Age),]
str(Titanic)
## 'data.frame': 714 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 7 8 9 10 11 ...
## $ Survived : int 0 1 1 1 0 0 0 1 1 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
## $ Age : num 22 38 26 35 35 54 2 27 14 4 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 4 1 2 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 2 3 1 2 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 4 4 4 2 4 ...
nrow(Titanic)
## [1] 714
summary(Titanic)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 Min. :0.0000 1:186 Length:714 female:261
## 1st Qu.:222.2 1st Qu.:0.0000 2:173 Class :character male :453
## Median :445.0 Median :0.0000 3:355 Mode :character
## Mean :448.6 Mean :0.4062
## 3rd Qu.:677.8 3rd Qu.:1.0000
## Max. :891.0 Max. :1.0000
##
## Age SibSp Parch Ticket Fare
## Min. : 0.42 0:471 0:521 Length:714 Min. : 0.00
## 1st Qu.:20.12 1:183 1:110 Class :character 1st Qu.: 8.05
## Median :28.00 2: 25 2: 68 Mode :character Median : 15.74
## Mean :29.70 3: 12 3: 5 Mean : 34.69
## 3rd Qu.:38.00 4: 18 4: 4 3rd Qu.: 33.38
## Max. :80.00 5: 5 5: 5 Max. :512.33
## 8: 0 6: 1
## Cabin Embarked
## Length:714 : 2
## Class :character C:130
## Mode :character Q: 28
## S:554
##
##
##
Summary:
1. There are 714 passengers with a known age, ranging from about 5 months (0.42 years) to 80 years old, with a mean of 29.7 years.
2. 3rd class has the most passengers (355) and 2nd class has the fewest (173).
3. Most of the passengers are male (453 of 714).
4. Among the passengers who have siblings/spouses aboard, most have only 1 sibling/spouse (183).
5. Among the passengers who have parents/children aboard, most have only 1 parent/child (110).
6. The majority of passengers have no siblings/spouses aboard (471), and the majority also have no parents/children aboard (521).
7. Most passengers embarked at port S (Southampton, 554) and the fewest at port Q (Queenstown, 28).
Titanic$Group.Age[Titanic$Age<6] <- "<6 yo"
Titanic$Group.Age[Titanic$Age>=6&Titanic$Age<12] <- "6-11 yo"
Titanic$Group.Age[Titanic$Age>=12&Titanic$Age<18] <- "12-17 yo"
Titanic$Group.Age[Titanic$Age>=18&Titanic$Age<25] <- "18-24 yo"
Titanic$Group.Age[Titanic$Age>=25&Titanic$Age<35] <- "25-34 yo"
Titanic$Group.Age[Titanic$Age>=35&Titanic$Age<45] <- "35-44 yo"
Titanic$Group.Age[Titanic$Age>=45&Titanic$Age<55] <- "45-54 yo"
Titanic$Group.Age[Titanic$Age>=55&Titanic$Age<65] <- "55-64 yo"
Titanic$Group.Age[Titanic$Age>=65&Titanic$Age<75] <- "65-74 yo"
Titanic$Group.Age[Titanic$Age>=75] <- ">75 yo"
Titanic$Group.Age <- as.factor(Titanic$Group.Age)
levels(Titanic$Group.Age)
## [1] "<6 yo" ">75 yo" "12-17 yo" "18-24 yo" "25-34 yo" "35-44 yo"
## [7] "45-54 yo" "55-64 yo" "6-11 yo" "65-74 yo"
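The levels above come out in alphabetical order, so ">75 yo" lands second and "6-11 yo" near the end. As an alternative sketch, `cut()` can build the same bins in one call and keep the levels in age order. The `ages` vector holds the first ten ages shown by `str(Titanic)`; the names `breaks`, `labels` and `group_age`, and the label "75+ yo" (for ages 75 and over), are choices made here for illustration.

```r
# First ten non-missing ages, as displayed by str(Titanic)
ages <- c(22, 38, 26, 35, 35, 54, 2, 27, 14, 4)

# Same bin edges as the manual assignments; right = FALSE gives [lower, upper)
breaks <- c(0, 6, 12, 18, 25, 35, 45, 55, 65, 75, Inf)
labels <- c("<6 yo", "6-11 yo", "12-17 yo", "18-24 yo", "25-34 yo",
            "35-44 yo", "45-54 yo", "55-64 yo", "65-74 yo", "75+ yo")

group_age <- cut(ages, breaks = breaks, labels = labels, right = FALSE)

levels(group_age)  # levels follow the order of `labels`, not the alphabet
table(group_age)
```

With the levels in age order, tables and boxplots by age group read from youngest to oldest without extra sorting.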
str(Titanic)
## 'data.frame': 714 obs. of 13 variables:
## $ PassengerId: int 1 2 3 4 5 7 8 9 10 11 ...
## $ Survived : int 0 1 1 1 0 0 0 1 1 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
## $ Age : num 22 38 26 35 35 54 2 27 14 4 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 4 1 2 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 2 3 1 2 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 4 4 4 2 4 ...
## $ Group.Age : Factor w/ 10 levels "<6 yo",">75 yo",..: 4 6 5 6 6 7 1 5 3 1 ...
table(Titanic$Sex,Titanic$Group.Age)
##
## <6 yo >75 yo 12-17 yo 18-24 yo 25-34 yo 35-44 yo 45-54 yo 55-64 yo
## female 21 0 23 62 63 45 26 10
## male 23 1 22 103 138 75 47 21
##
## 6-11 yo 65-74 yo
## female 11 0
## male 13 10
prop.table(table(Titanic$Sex,Titanic$Group.Age))*100
##
## <6 yo >75 yo 12-17 yo 18-24 yo 25-34 yo 35-44 yo 45-54 yo
## female 2.941176 0.000000 3.221289 8.683473 8.823529 6.302521 3.641457
## male 3.221289 0.140056 3.081232 14.425770 19.327731 10.504202 6.582633
##
## 55-64 yo 6-11 yo 65-74 yo
## female 1.400560 1.540616 0.000000
## male 2.941176 1.820728 1.400560
The largest groups are men aged 18-24 and 25-34 (about 34% of the passengers combined). Including women, just over half of the passengers (about 51%) are aged 18-34, which means the Titanic’s passengers were mostly young adults.
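The share of passengers aged 18-34 can be re-derived from the counts in the table above. This is only an arithmetic sketch: the matrix `young` is typed in from the 18-24 and 25-34 columns, and the variable names are chosen here for illustration.

```r
# Counts copied from the Sex x Group.Age table (18-24 and 25-34 columns only)
young <- matrix(c(62, 63, 103, 138), nrow = 2, byrow = TRUE,
                dimnames = list(Sex = c("female", "male"),
                                Group.Age = c("18-24 yo", "25-34 yo")))
n_total <- 714  # passengers with a known age

# Share of all passengers aged 18-34, by sex and overall (in percent)
share_by_sex <- round(rowSums(young) / n_total * 100, 1)
share_all <- round(sum(young) / n_total * 100, 1)

share_by_sex
share_all
```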
table(Titanic$Group.Age,Titanic$Pclass)
##
## 1 2 3
## <6 yo 3 13 28
## >75 yo 1 0 0
## 12-17 yo 8 6 31
## 18-24 yo 27 35 103
## 25-34 yo 34 62 105
## 35-44 yo 46 28 46
## 45-54 yo 40 17 16
## 55-64 yo 21 6 4
## 6-11 yo 1 4 19
## 65-74 yo 5 2 3
Among the passengers aged 18-34, most have 3rd class tickets (208 of 366), which suggests they were mostly ordinary people. 1st class, in contrast, is dominated by mature adults and the elderly: 113 of its 186 passengers are 35 or older.
pclass1 <- Titanic[Titanic$Pclass==1,]
pclass2 <- Titanic[Titanic$Pclass==2,]
pclass3 <- Titanic[Titanic$Pclass==3,]
var(pclass1$Fare)
## [1] 6537.885
var(pclass2$Fare)
## [1] 173.9083
var(pclass3$Fare)
## [1] 100.865
sd(pclass1$Fare)
## [1] 80.85719
sd(pclass2$Fare)
## [1] 13.18743
sd(pclass3$Fare)
## [1] 10.04316
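As a side note, the per-class statistics can also be computed without splitting the data into three data frames, by grouping with `tapply()`. The sketch below uses a hypothetical five-row data frame, `sample_fares`, filled with the first five `Pclass` and `Fare` values displayed by `str(titanic)` (fares as rounded in that display), so its results are not the full-data figures above.

```r
# First five Pclass/Fare pairs as displayed by str(titanic); illustrative only
sample_fares <- data.frame(
  Pclass = factor(c(3, 1, 3, 1, 3)),
  Fare   = c(7.25, 71.28, 7.92, 53.1, 8.05)
)

# One grouped call per statistic instead of one subset per class
fare_median <- tapply(sample_fares$Fare, sample_fares$Pclass, median)
fare_sd     <- tapply(sample_fares$Fare, sample_fares$Pclass, sd)

fare_median
fare_sd
```

On the full data the same pattern would read `tapply(Titanic$Fare, Titanic$Pclass, sd)`, returning one value per class.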
1st class fares vary the most (standard deviation of about 80.9), while 3rd class fares vary the least (about 10.0). This stability may be one reason most of the passengers bought 3rd class tickets.
Since the ticket fares for all classes are skewed to the right (a few very high fares pull the mean above the median), the median is the more suitable measure of center.
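A quick rule of thumb for the direction of skew is to compare the mean with the median: a mean above the median usually points to a long right tail, a mean below it to a long left tail. The helper below, `skew_direction()`, is a hypothetical function written for this note, not part of base R, and the comparison is a heuristic rather than a formal skewness measure.

```r
# Heuristic: infer the direction of skew from mean vs. median
skew_direction <- function(x) {
  m   <- mean(x, na.rm = TRUE)
  med <- median(x, na.rm = TRUE)
  if (m > med) "right" else if (m < med) "left" else "symmetric"
}

# Illustrative input: the five Fare values displayed by str(titanic)
example_fares <- c(7.25, 71.28, 7.92, 53.1, 8.05)
direction <- skew_direction(example_fares)
direction
```

The histograms that follow show the same comparison visually, with the mean drawn in red and the median in blue.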
hist(pclass1$Fare)
abline(v = mean(pclass1$Fare), col = "red", lwd = 2)
abline(v = median(pclass1$Fare), col = "blue", lwd = 2)
hist(pclass2$Fare)
abline(v = mean(pclass2$Fare), col = "red", lwd = 2)
abline(v = median(pclass2$Fare), col = "blue", lwd = 2)
hist(pclass3$Fare)
abline(v = mean(pclass3$Fare), col = "red", lwd = 2)
abline(v = median(pclass3$Fare), col = "blue", lwd = 2)
median(pclass1$Fare)
## [1] 69.3
median(pclass2$Fare)
## [1] 15.0458
median(pclass3$Fare)
## [1] 8.05
boxplot(formula=Titanic$Fare~Titanic$Pclass,data=Titanic)
The boxplot shows that 1st class fares are more spread out and have more extreme outliers than 3rd class fares. 3rd class has the cheapest fares and the smallest variance, which may be why passengers were more likely to buy 3rd class tickets.
boxplot(formula=pclass1$Fare~pclass1$Embarked,data=pclass1)
boxplot(formula=pclass2$Fare~pclass2$Embarked,data=pclass2)
boxplot(formula=pclass3$Fare~pclass3$Embarked,data=pclass3)
Based on port of embarkation, it seems that:
- 1st class passengers would choose port S (Southampton) because it has the lowest fares and the fewest outliers
- 2nd class passengers would choose port Q (Queenstown) because it has the lowest fares and no outliers
- 3rd class passengers would choose port Q (Queenstown) because it has the lowest fares and the fewest outliers
boxplot(formula=pclass1$Fare~pclass1$Group.Age,data=pclass1,las=2)
boxplot(formula=pclass2$Fare~pclass2$Group.Age,data=pclass2,las=2)
boxplot(formula=pclass3$Fare~pclass3$Group.Age,data=pclass3,las=2)
For 1st and 2nd class, the younger the passenger, the more expensive the fare tends to be, except that the 35-44 age group has a slightly higher fare than this pattern would suggest.
For 3rd class, fares are similar across almost all age groups, except for children (under 12 years old), who have higher fares.
xtabs(formula=Survived~Sex+Pclass,data=Titanic)
## Pclass
## Sex 1 2 3
## female 82 68 47
## male 40 15 38
xtabs(formula=Survived~Group.Age,data=Titanic)
## Group.Age
## <6 yo >75 yo 12-17 yo 18-24 yo 25-34 yo 35-44 yo 45-54 yo 55-64 yo
## 31 1 22 57 78 51 30 12
## 6-11 yo 65-74 yo
## 8 0
Based on the charts below, we can see that the Titanic survivors are mostly female, from 1st class, and aged 25-34 years old.
graphics::pie(xtabs(formula=Survived~Sex,data=Titanic))
graphics::pie(xtabs(formula=Survived~Pclass,data=Titanic))
graphics::pie(xtabs(formula=Survived~Group.Age,data=Titanic))
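The `xtabs()` results and pie charts count survivors, so larger groups tend to look better simply because they contain more people. A complementary view is the survival rate within each group. The sketch below recomputes it from counts already printed in this document (group sizes from `summary(Titanic)`, survivors from the `xtabs()` output); the variable names are chosen here for illustration.

```r
# Group sizes from summary(Titanic); survivor counts from the xtabs() output
total_sex      <- c(female = 261, male = 453)
survived_sex   <- c(female = 82 + 68 + 47, male = 40 + 15 + 38)

total_class    <- c("1" = 186, "2" = 173, "3" = 355)
survived_class <- c("1" = 82 + 40, "2" = 68 + 15, "3" = 47 + 38)

# Survival rate in percent within each group
rate_sex   <- round(survived_sex / total_sex * 100, 1)
rate_class <- round(survived_class / total_class * 100, 1)

rate_sex
rate_class
```

On the data frame itself, `tapply(Titanic$Survived, Titanic$Sex, mean)` should give the same proportions, since `Survived` is coded 0/1.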