The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, I need to build predictive models that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).
setwd("/Users/shanshantong/Desktop/Skills/1.Tools/R/6.Datasets")
titanic.train <- read.csv("titanic.train.csv")
titanic.test <- read.csv("titanic.test.csv")
dim(titanic.train) # 891 rows, 12 columns
dim(titanic.test) # 418 rows, 11 columns
Use str() function to display the structure of the training and test datasets, and summary() function to summarize the datasets.
str(titanic.train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
str(titanic.test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
summary(titanic.train) # summarize the data
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
summary(titanic.test) # display the structure of the data
## PassengerId Pclass
## Min. : 892.0 Min. :1.000
## 1st Qu.: 996.2 1st Qu.:1.000
## Median :1100.5 Median :3.000
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Name Sex
## Abbott, Master. Eugene Joseph : 1 female:152
## Abelseth, Miss. Karen Marie : 1 male :266
## Abelseth, Mr. Olaus Jorgensen : 1
## Abrahamsson, Mr. Abraham August Johannes : 1
## Abrahim, Mrs. Joseph (Sophie Halaut Easu): 1
## Aks, Master. Philip Frank : 1
## (Other) :412
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 PC 17608: 5
## 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.0000 113503 : 4
## Median :27.00 Median :0.0000 Median :0.0000 CA. 2343: 4
## Mean :30.27 Mean :0.4474 Mean :0.3923 16966 : 3
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.0000 220845 : 3
## Max. :76.00 Max. :8.0000 Max. :9.0000 347077 : 3
## NA's :86 (Other) :396
## Fare Cabin Embarked
## Min. : 0.000 :327 C:102
## 1st Qu.: 7.896 B57 B59 B63 B66: 3 Q: 46
## Median : 14.454 A34 : 2 S:270
## Mean : 35.627 B45 : 2
## 3rd Qu.: 31.500 C101 : 2
## Max. :512.329 C116 : 2
## NA's :1 (Other) : 80
Now, I merged the independent variables of training and test sets into a full data set, which is convenient for me to manipulate the data on both sets.
titanic.full <- rbind(titanic.train[,-c(1,2)], titanic.test[,-1])
summary(titanic.full)
## Pclass Name Sex
## Min. :1.000 Connolly, Miss. Kate : 2 female:466
## 1st Qu.:2.000 Kelly, Mr. James : 2 male :843
## Median :3.000 Abbing, Mr. Anthony : 1
## Mean :2.295 Abbott, Mr. Rossmore Edward : 1
## 3rd Qu.:3.000 Abbott, Mrs. Stanton (Rosa Hunt): 1
## Max. :3.000 Abelson, Mr. Samuel : 1
## (Other) :1301
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.000 CA. 2343: 11
## 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000 1601 : 8
## Median :28.00 Median :0.0000 Median :0.000 CA 2144 : 8
## Mean :29.88 Mean :0.4989 Mean :0.385 3101295 : 7
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000 347077 : 7
## Max. :80.00 Max. :8.0000 Max. :9.000 347082 : 7
## NA's :263 (Other) :1261
## Fare Cabin Embarked
## Min. : 0.000 :1014 : 2
## 1st Qu.: 7.896 C23 C25 C27 : 6 C:270
## Median : 14.454 B57 B59 B63 B66: 5 Q:123
## Mean : 33.295 G6 : 5 S:914
## 3rd Qu.: 31.275 B96 B98 : 4
## Max. :512.329 C22 C26 : 4
## NA's :1 (Other) : 271
In this section, I will utilize the idea of feature engineering, basically try to create additional relevent features from the existing raw features in the data, and to increase the predictive power of the learning algorithms.
Based on the description of the Data in Part2, Pclass is an ordered factor variable, therefore, I will transform it from a numeric variable to a categorical variable.
titanic.full$Pclass <- factor(titanic.full$Pclass,
levels = c("3", "2", "1"),
labels = c("3rd", "2nd", "1st"),
ordered = T)
table(titanic.full$Pclass)
##
## 3rd 2nd 1st
## 709 277 323
The idea of binning the Age variable is that I believe children and the old people are more likely to be saved, resulting higher survival rate.
library(dplyr)
titanic.full <- titanic.full %>%
mutate(Aged = case_when(
Age < 13 ~ "Child",
between(Age, 13, 18) ~ "Teenager",
between(Age, 19, 50) ~ "Adult",
Age > 50 ~ "Old"
))
We could not get any useful information directly from variable Name, however, we could extract the Title from Name to support further analysis.
titanic.full$Title <- gsub("^.*, (.*?)\\..*$", "\\1", titanic.full$Name)
table(titanic.full$Title)
##
## Capt Col Don Dona Dr
## 1 4 1 1 8
## Jonkheer Lady Major Master Miss
## 1 1 2 61 260
## Mlle Mme Mr Mrs Ms
## 2 1 757 197 2
## Rev Sir the Countess
## 8 1 1
Based on the result above, the most frequent variables are “Master”, “Miss”, “Mr”, and “Mrs”. Therefore, I will explore the observations with rare titles and check if I could group them into the 4 most frequent titles.
titanic.full %>%
filter(!Title %in% c("Master", "Miss", "Mr", "Mrs")) %>%
select(Pclass, Title, Sex, Age, SibSp, Parch, Fare, Embarked) %>%
arrange(Sex)
## Pclass Title Sex Age SibSp Parch Fare Embarked
## 1 1st Mme female 24 0 0 69.3000 C
## 2 2nd Ms female 28 0 0 13.0000 S
## 3 1st Lady female 48 1 0 39.6000 C
## 4 1st Mlle female 24 0 0 69.3000 C
## 5 1st Mlle female 24 0 0 49.5042 C
## 6 1st the Countess female 33 0 0 86.5000 S
## 7 1st Dr female 49 0 0 25.9292 S
## 8 3rd Ms female NA 0 0 7.7500 Q
## 9 1st Dona female 39 0 0 108.9000 C
## 10 1st Don male 40 0 0 27.7208 C
## 11 2nd Rev male 42 0 0 13.0000 S
## 12 2nd Rev male 51 0 0 12.5250 S
## 13 1st Dr male 44 2 0 90.0000 Q
## 14 2nd Rev male 54 1 0 26.0000 S
## 15 2nd Dr male 54 0 0 14.0000 S
## 16 2nd Dr male 23 0 0 10.5000 S
## 17 1st Major male 52 0 0 30.5000 S
## 18 1st Major male 45 0 0 26.5500 S
## 19 1st Sir male 49 1 0 56.9292 C
## 20 2nd Rev male 57 0 0 12.3500 Q
## 21 1st Dr male 32 0 0 30.5000 C
## 22 1st Col male 56 0 0 35.5000 C
## 23 1st Dr male 50 2 0 133.6500 S
## 24 1st Col male 60 0 0 26.5500 S
## 25 1st Capt male 70 1 1 71.0000 S
## 26 1st Dr male NA 0 0 39.6000 C
## 27 1st Jonkheer male 38 0 0 0.0000 S
## 28 2nd Rev male 28 0 1 33.0000 S
## 29 2nd Rev male 27 0 0 13.0000 S
## 30 1st Col male 53 0 0 28.5000 C
## 31 2nd Rev male 30 1 1 26.0000 S
## 32 2nd Rev male 41 0 0 13.0000 S
## 33 1st Col male 47 1 0 227.5250 C
## 34 1st Dr male 53 1 1 81.8583 S
library(forcats)
titanic.full <- titanic.full %>%
mutate(Title = fct_collapse(Title,
"Miss" = c("Miss", "Ms", "Lady", "Mlle", "Mme"),
"Mr" = c("Mr", "Sir"),
"Mrs" = "Mrs",
"Master" = "Master",
)) %>%
mutate(Title = fct_other(Title, keep = c("Miss", "Mr", "Mrs", "Master"), other_level = "rare title"))
table(titanic.full$Title)
##
## Miss Master Mr Mrs rare title
## 266 61 758 197 27
Since siblings, spouse, parents, children all express “family size” information, I add a new categorical variable “FamilySize” to display this information.
titanic.full <- titanic.full %>%
mutate(FamilySize = SibSp + Parch + 1,) %>%
mutate(FamilySized = case_when(
FamilySize == 1 ~ "single",
between(FamilySize, 2, 4) ~ "small",
FamilySize >= 5 ~ "large"
)) %>%
mutate(FamilySized = fct_relevel(FamilySized, "single", "small", "large"))
table(titanic.full$FamilySize)
##
## 1 2 3 4 5 6 7 8 11
## 790 235 159 43 22 25 16 8 11
table(titanic.full$FamilySized)
##
## single small large
## 790 437 82
For Ticket variable, it’s similary as Name variable, meaning that we could not directly extract useful information from it. However, I found that sometimes the ticket number only shows once, while other times, ticket numbers shows multiple times. Therefore, I will add a new categorical variable named TicketType that shows the frequency of the Ticket number.
ticket.unique <- rep(0, nrow(titanic.full))
tickets <- unique(titanic.full$Ticket) # list each unique the Ticket number 1 time
for (i in 1:length(tickets)) {
current.ticket <- tickets[i]
party.indexes <- which(titanic.full$Ticket == current.ticket)
for (k in 1:length(party.indexes)) {
ticket.unique[party.indexes[k]] <- length(party.indexes)
}
}
titanic.full <- titanic.full %>%
mutate(TicketGroup = ticket.unique,) %>%
mutate(TicketTyped = case_when(
TicketGroup == 1 ~ "single",
between(TicketGroup, 2, 4) ~ "small",
TicketGroup >= 5 ~ "large"
)) %>%
mutate(TicketTyped = fct_relevel(TicketTyped, "single", "small", "large"))
table(titanic.full$TicketGroup)
##
## 1 2 3 4 5 6 7 8 11
## 713 264 147 64 35 24 35 16 11
table(titanic.full$TicketTyped)
##
## single small large
## 713 475 121
First, check number of missing values/empty values for each predictor
nrow(titanic.full[!complete.cases(titanic.full),]) # 264 incomplete cases
## [1] 267
# There are some empty values, so first transform them to NA so as to count NAs
titanic.full[titanic.full[1:nrow(titanic.full), ] == ""] <- NA
library(VIM)
aggr(titanic.full, plot = F)
##
## Missings in variables:
## Variable Count
## Age 263
## Fare 1
## Cabin 1014
## Embarked 2
## Aged 266
To sum up: * Age has 263 missing values * Fare has 1 missing values * Cabin has 1014 missing values * Embarked has 2 missing values
Since Age has so many missing values, I will use multiple imputation method in mice package to fill in the missing values.
# check observations with missing values of Embarked or Fare
library(mice)
set.seed(1234)
mice.mod <- mice(titanic.full[, !names(titanic.full) %in% c("Name", "SibSp", "Parch", "Ticket", "Cabin")], method = "pmm")
##
## iter imp variable
## 1 1 Age Fare Embarked
## 1 2 Age Fare Embarked
## 1 3 Age Fare Embarked
## 1 4 Age Fare Embarked
## 1 5 Age Fare Embarked
## 2 1 Age Fare Embarked
## 2 2 Age Fare Embarked
## 2 3 Age Fare Embarked
## 2 4 Age Fare Embarked
## 2 5 Age Fare Embarked
## 3 1 Age Fare Embarked
## 3 2 Age Fare Embarked
## 3 3 Age Fare Embarked
## 3 4 Age Fare Embarked
## 3 5 Age Fare Embarked
## 4 1 Age Fare Embarked
## 4 2 Age Fare Embarked
## 4 3 Age Fare Embarked
## 4 4 Age Fare Embarked
## 4 5 Age Fare Embarked
## 5 1 Age Fare Embarked
## 5 2 Age Fare Embarked
## 5 3 Age Fare Embarked
## 5 4 Age Fare Embarked
## 5 5 Age Fare Embarked
## Warning: Number of logged events: 51
mice.output <- complete(mice.mod)
Now, check the distribution of Age before and after the imputation. The two distributions are highly similar to each other, therefore, it’s an valid method to utilize.
par(mfrow = c(1,2))
hist(titanic.full$Age, freq = F, main = 'Age: Original Data',
col='skyblue', ylim=c(0,0.04))
hist(mice.output$Age, freq = F, main = 'Age: MICE Output',
col='lightblue', ylim=c(0, 0.04))
par(mfrow=c(1,1))
titanic.full$Age <- mice.output$Age
First, figure out the observation with missing value.
titanic.full %>%
filter(is.na(Fare))
## Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## 1 3rd Storey, Mr. Thomas male 60.5 0 0 3701 NA <NA>
## Embarked Aged Title FamilySize FamilySized TicketGroup TicketTyped
## 1 S Old Mr 1 single 1 single
This passenger embarked on S, and belongs to the 3rd Pclass. Next, I summarised the data with the median Fare value sliced by Pclass and Embarked, the result is as below:
titanic.full %>%
group_by(Pclass, Embarked) %>%
summarise(median = median(Fare, na.rm = T))
## # A tibble: 10 x 3
## # Groups: Pclass [3]
## Pclass Embarked median
## <ord> <fct> <dbl>
## 1 3rd C 7.90
## 2 3rd Q 7.75
## 3 3rd S 8.05
## 4 2nd C 15.3
## 5 2nd Q 12.4
## 6 2nd S 15.4
## 7 1st C 76.7
## 8 1st Q 90
## 9 1st S 52
## 10 1st <NA> 80
# the median Fare price for passengers with Embarked == S, Pclass = 3rd is 8.05
Then I made a density plot of passengers with Pclass = 3rd and Embarked = S and fill in the missing value with the median Fare price of this group of passengers.
# visualization
library(ggplot2)
titanic.full %>%
filter(Pclass == "3rd", Embarked == "S") %>%
ggplot(aes(x = Fare)) +
geom_density(fill = "skyblue", alpha = 0.4) +
geom_vline(aes(xintercept = median(Fare, na.rm = T)),
col = "red", linetype = 2, size = 0.5)
# fill in the missing value using median
titanic.full$Fare[is.na(titanic.full$Fare)] <- 8.05
As we know, Cabin has 1014 missing values, meaning that more than 75% of the information was missed. Therefore, I simply deleted this variable.
titanic.full$Cabin <- NULL
First, figure out the observation with missing value.
titanic.full %>%
filter(is.na(Embarked))
## Pclass Name Sex Age SibSp Parch
## 1 1st Icard, Miss. Amelie female 38 0 0
## 2 1st Stone, Mrs. George Nelson (Martha Evelyn) female 62 0 0
## Ticket Fare Embarked Aged Title FamilySize FamilySized TicketGroup
## 1 113572 80 <NA> Adult Miss 1 single 2
## 2 113572 80 <NA> Old Mrs 1 single 2
## TicketTyped
## 1 small
## 2 small
For these two observations, both of them were in the 1st class and their Fare prices were both 80.
Then, show the table of median Fare values sliced by Pclass and Embarked again:
titanic.full %>%
group_by(Pclass, Embarked) %>%
summarise(median = median(Fare, na.rm = T))
## # A tibble: 10 x 3
## # Groups: Pclass [3]
## Pclass Embarked median
## <ord> <fct> <dbl>
## 1 3rd C 7.90
## 2 3rd Q 7.75
## 3 3rd S 8.05
## 4 2nd C 15.3
## 5 2nd Q 12.4
## 6 2nd S 15.4
## 7 1st C 76.7
## 8 1st Q 90
## 9 1st S 52
## 10 1st <NA> 80
Based on the result, it’s not hard to find that the median Fare price of passengers Embarked on C and were in the 1st class is 76.7, which is quite similar to 80. Therefore, it’s safe to fill in the missing value of Embarked information as “C” as below.
titanic.full$Embarked[which(is.na(titanic.full$Embarked))] <- "C"
summary(titanic.full$Embarked) # double check
## C Q S
## 0 272 123 914
# there's an unused level, therefore, drop it as below:
titanic.full$Embarked <- droplevels(titanic.full$Embarked)
Finally, delete useless variables and look at the structure and the summary information of the full data again.
titanic.full <- titanic.full %>%
select(Pclass, Title, Sex, Age, Aged,
SibSp, Parch, FamilySize, FamilySized,
TicketGroup, TicketTyped, Fare, Embarked)
str(titanic.full)
## 'data.frame': 1309 obs. of 13 variables:
## $ Pclass : Ord.factor w/ 3 levels "3rd"<"2nd"<"1st": 1 3 1 3 1 1 3 1 1 2 ...
## $ Title : Factor w/ 5 levels "Miss","Master",..: 3 4 1 4 3 3 3 2 4 4 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 50 54 2 27 14 ...
## $ Aged : chr "Adult" "Adult" "Adult" "Adult" ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ FamilySize : num 2 2 1 2 1 1 1 5 3 2 ...
## $ FamilySized: Factor w/ 3 levels "single","small",..: 2 2 1 2 1 1 1 3 2 2 ...
## $ TicketGroup: num 1 2 1 2 1 1 2 5 3 2 ...
## $ TicketTyped: Factor w/ 3 levels "single","small",..: 1 2 1 2 1 1 2 3 2 2 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
summary(titanic.full)
## Pclass Title Sex Age
## 3rd:709 Miss :266 female:466 Min. : 0.17
## 2nd:277 Master : 61 male :843 1st Qu.:21.00
## 1st:323 Mr :758 Median :29.00
## Mrs :197 Mean :30.29
## rare title: 27 3rd Qu.:39.00
## Max. :80.00
## Aged SibSp Parch FamilySize
## Length:1309 Min. :0.0000 Min. :0.000 Min. : 1.000
## Class :character 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.: 1.000
## Mode :character Median :0.0000 Median :0.000 Median : 1.000
## Mean :0.4989 Mean :0.385 Mean : 1.884
## 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.: 2.000
## Max. :8.0000 Max. :9.000 Max. :11.000
## FamilySized TicketGroup TicketTyped Fare Embarked
## single:790 Min. : 1.000 single:713 Min. : 0.000 C:272
## small :437 1st Qu.: 1.000 small :475 1st Qu.: 7.896 Q:123
## large : 82 Median : 1.000 large :121 Median : 14.454 S:914
## Mean : 2.102 Mean : 33.276
## 3rd Qu.: 3.000 3rd Qu.: 31.275
## Max. :11.000 Max. :512.329
Before performing EDA, I need to create the cleaned train and test data sets and perform EDA on the train dataset, because we only have the Survived information of the train dataset.
titanic.train.clean <- cbind(titanic.train$Survived, titanic.full[1:891,])
names(titanic.train.clean)[1] <- "Survived"
titanic.train.clean$Survived <- as.factor(titanic.train.clean$Survived)
titanic.test.clean <- titanic.full[892:1309,]
titanic.train.clean %>%
ggplot(aes(x = Pclass, y = ..count.., fill = Survived)) +
geom_bar(stat = "count", position = "dodge") +
labs(x = "Ticket Class",
y = "Number of Passengers",
title = "Effect of Ticket Class on Survival Rate") +
theme_classic() + scale_fill_brewer(palette = "Dark2") +
geom_text(stat = "count",
aes(label = ..count..),
position=position_dodge(width = 1),
vjust=-0.5)
Based on the visualization, more than 50% the passengers in the 1st class survived, about 50% of the passengers in the 2nd class survived, while most of the passenger in the 3rd class did not survive.
Next, calculate WOE and IV as below:
library(InformationValue)
WOETable(X = titanic.train.clean$Pclass, Y = titanic.train.clean$Survived)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 3rd 119 372 491 0.3479532 0.6775956 -0.6664827 0.21970095
## 2 2nd 87 97 184 0.2543860 0.1766849 0.3644848 0.02832087
## 3 1st 136 80 216 0.3976608 0.1457195 1.0039160 0.25292792
IV(X = titanic.train.clean$Pclass, Y = titanic.train.clean$Survived)
## [1] 0.5009497
## attr(,"howgood")
## [1] "Highly Predictive"
According to a 0.5 IV value, and the “Highly Predictive” result, we could regard Pclass as one of the features that are used to model the data.
titanic.train.clean %>%
ggplot(aes(x = Title, y = ..count.., fill = Survived)) +
geom_bar(stat = "count", position = "stack") +
labs(x = "Titles",
y = "Number of Passengers",
title = "Effect of Titles on Survival Rate") +
theme_classic() + scale_fill_brewer(palette = "Dark2") +
geom_text(stat = "count",
aes(label = ..count..),
position = position_stack(vjust = 0.5))
According to the plot, most of the passengers with the title of “Mr” did not survive, while passengers with the title of “Miss” and “Mrs” have higher survival rate.
Next, calculate WOE and IV as below:
WOETable(X = titanic.train.clean$Title, Y = titanic.train.clean$Survived)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 Miss 132 55 187 0.38596491 0.10018215 1.3487564 0.385451342
## 2 Master 23 17 40 0.06725146 0.03096539 0.7755686 0.028142336
## 3 Mr 82 436 518 0.23976608 0.79417122 -1.1976353 0.663975160
## 4 Mrs 99 26 125 0.28947368 0.04735883 1.8103110 0.438303180
## 5 rare title 6 15 21 0.01754386 0.02732240 -0.4430030 0.004331925
IV(X = titanic.train.clean$Title, Y = titanic.train.clean$Survived)
## [1] 1.520204
## attr(,"howgood")
## [1] "Highly Predictive"
The IV value is as high as 1.52, therefore, we could regard Title as one of the features that are used to model the data.
titanic.train.clean %>%
ggplot(aes(x = Sex, y = ..count.., fill = Survived)) +
geom_bar(stat = "count", position = "dodge") +
labs(x = "Sex",
y = "Number of Passengers",
title = "Effect of Sex on Survival Rate") +
theme_classic() + scale_fill_brewer(palette = "Dark2") +
geom_text(stat = "count",
aes(label = ..count..),
position = position_dodge(width = 1), vjust = -0.5)
According to the plot, female passengers are much morel likely to survive than male passengers.
Next, calculate WOE and IV as below:
WOETable(X = titanic.train.clean$Sex, Y = titanic.train.clean$Survived)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 female 233 81 314 0.6812865 0.147541 1.5298770 0.8165651
## 2 male 109 468 577 0.3187135 0.852459 -0.9838327 0.5251163
IV(X = titanic.train.clean$Sex, Y = titanic.train.clean$Survived)
## [1] 1.341681
## attr(,"howgood")
## [1] "Highly Predictive"
The IV value is as high as 1.34, therefore, we could regard Sex as one of the features that are used to model the data.
First, perform WOE and IV analysis on both Age and Aged variable to decide which one to use.
WOETable(X = as.factor(titanic.train.clean$Age), Y = titanic.train.clean$Survived)
# since the table is too big, I will not show it here.
IV(X = as.factor(titanic.train.clean$Age), Y = titanic.train.clean$Survived)
## [1] 0.2999792
## attr(,"howgood")
## [1] "Highly Predictive"
WOETable(X = titanic.train.clean$Aged, Y = titanic.train.clean$Survived)
## [1] GOODS BADS TOTAL PCT_G PCT_B WOE IV
## <0 rows> (or 0-length row.names)
IV(X = titanic.train.clean$Aged, Y = titanic.train.clean$Survived)
## [1] 0
## attr(,"howgood")
## [1] "Not Predictive"
Based on the result, I will use the Age variable rather than Aged. Below is the visualization,
titanic.train.clean %>%
ggplot(aes(x = Age, color = Survived)) +
geom_line(aes(label=..count..), stat = 'bin') +
labs(x = "Age",
y = "Number of Passengers",
title = "Effect of Age on Survival Rate") +
theme_classic() + scale_color_brewer(palette = "Dark2")
From the plot, we could conclude that yound passengers have higher survival rate than old passengers.
Again, utilize WOE and IV to select the variable first.
WOETable(X = as.factor(titanic.train.clean$SibSp), Y = titanic.train.clean$Survived)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 0 210 398 608 0.593220339 0.724954463 -0.2005429 0.026418349
## 2 1 112 97 209 0.316384181 0.176684882 0.5825894 0.081387334
## 3 2 13 15 28 0.036723164 0.027322404 0.2957007 0.002779811
## 4 3 4 12 16 0.011299435 0.021857923 -0.6598108 0.006966604
## 5 4 3 15 18 0.008474576 0.027322404 -1.1706364 0.022063953
## 6 5 5 5 5 0.014124294 0.009107468 0.4388015 0.002201391
## 7 8 7 7 7 0.019774011 0.012750455 0.4388015 0.003081947
IV(X = as.factor(titanic.train.clean$SibSp), Y = titanic.train.clean$Survived)
## [1] 0.1448994
## attr(,"howgood")
## [1] "Highly Predictive"
WOETable(X = as.factor(titanic.train.clean$Parch), Y = titanic.train.clean$Survived)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 0 233 445 678 0.671469741 0.810564663 -0.1882622 0.026186312
## 2 1 65 53 118 0.187319885 0.096539162 0.6628690 0.060175728
## 3 2 40 40 80 0.115273775 0.072859745 0.4587737 0.019458440
## 4 3 3 2 5 0.008645533 0.003642987 0.8642388 0.004323394
## 5 4 4 4 4 0.011527378 0.007285974 0.4587737 0.001945844
## 6 5 1 4 5 0.002881844 0.007285974 -0.9275207 0.004084922
## 7 6 1 1 1 0.002881844 0.001821494 0.4587737 0.000486461
IV(X = as.factor(titanic.train.clean$Parch), Y = titanic.train.clean$Survived)
## [1] 0.1166611
## attr(,"howgood")
## [1] "Highly Predictive"
WOETable(X = as.factor(titanic.train.clean$FamilySize), Y = titanic.train.clean$Survived)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 1 163 374 537 0.459154930 0.68123862 -0.3945249 0.0876175539
## 2 2 89 72 161 0.250704225 0.13114754 0.6479509 0.0774668616
## 3 3 59 43 102 0.166197183 0.07832423 0.7523180 0.0661084057
## 4 4 21 8 29 0.059154930 0.01457195 1.4010615 0.0624634998
## 5 5 3 12 15 0.008450704 0.02185792 -0.9503137 0.0127410643
## 6 6 3 19 22 0.008450704 0.03460838 -1.4098460 0.0368782940
## 7 7 4 8 12 0.011267606 0.01457195 -0.2571665 0.0008497665
## 8 8 6 6 6 0.016901408 0.01092896 0.4359807 0.0026038712
## 9 11 7 7 7 0.019718310 0.01275046 0.4359807 0.0030378497
IV(X = as.factor(titanic.train.clean$FamilySize), Y = titanic.train.clean$Survived)
## [1] 0.3497672
## attr(,"howgood")
## [1] "Highly Predictive"
WOETable(X = titanic.train.clean$FamilySized, Y = titanic.train.clean$Survived)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 single 163 374 537 0.47660819 0.68123862 -0.3572179 0.07309765
## 2 small 169 123 292 0.49415205 0.22404372 0.7910021 0.21365625
## 3 large 10 52 62 0.02923977 0.09471767 -1.1753709 0.07696082
IV(X = titanic.train.clean$FamilySized, Y = titanic.train.clean$Survived)
## [1] 0.3637147
## attr(,"howgood")
## [1] "Highly Predictive"
Based on the highest IV criteria, I will select FamilySized here.
titanic.train.clean %>%
ggplot(aes(x = FamilySized, y = ..count.., fill = Survived)) +
geom_bar(stat = "count", position = "dodge") +
labs(x = "Family Size",
y = "Number of Passengers",
title = "Effect of Family Size on Survival Rate") +
theme_classic() + scale_fill_brewer(palette = "Dark2") +
geom_text(stat = "count",
aes(label = ..count..),
position = position_dodge(width = 1), vjust = -0.5)
According to the plot, passengers with small family are the group with the highest survival rate.
As before, apply WOE and IV to select the variable first.
WOETable(X = as.factor(titanic.train.clean$TicketGroup), Y = titanic.train.clean$Survived)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 1 130 351 481 0.37249284 0.63934426 -0.54022525 1.441599e-01
## 2 2 93 88 181 0.26647564 0.16029144 0.50828920 5.397228e-02
## 3 3 66 35 101 0.18911175 0.06375228 1.08733320 1.363075e-01
## 4 4 32 12 44 0.09169054 0.02185792 1.43385577 1.001299e-01
## 5 5 7 14 21 0.02005731 0.02550091 -0.24012066 1.307122e-03
## 6 6 4 15 19 0.01146132 0.02732240 -0.86872932 1.377899e-02
## 7 7 5 19 24 0.01432665 0.03460838 -0.88197455 1.788797e-02
## 8 8 5 8 13 0.01432665 0.01457195 -0.01697711 4.164509e-06
## 9 11 7 7 7 0.02005731 0.01275046 0.45302652 3.310197e-03
IV(X = as.factor(titanic.train.clean$TicketGroup), Y = titanic.train.clean$Survived)
## [1] 0.470858
## attr(,"howgood")
## [1] "Highly Predictive"
WOETable(X = titanic.train.clean$TicketTyped, Y = titanic.train.clean$Survived)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 single 130 351 481 0.38011696 0.6393443 -0.5199641 0.13478888
## 2 small 191 135 326 0.55847953 0.2459016 0.8202864 0.25640338
## 3 large 21 63 84 0.06140351 0.1147541 -0.6253246 0.03336144
IV(X = titanic.train.clean$TicketTyped, Y = titanic.train.clean$Survived)
## [1] 0.4245537
## attr(,"howgood")
## [1] "Highly Predictive"
Based on the highest IV criteria, I will select TicketGroup here.
titanic.train.clean %>%
ggplot(aes(x = TicketGroup, y = ..count.., fill = Survived)) +
geom_bar(stat = "count", position = "dodge") +
labs(x = "Ticket Group",
y = "Number of Passengers",
title = "Effect of Ticket Type on Survival Rate") +
theme_classic() + scale_fill_brewer(palette = "Dark2") +
scale_x_continuous(breaks = 1:11) +
geom_text(stat = "count",
aes(label = ..count..),
position = position_dodge(width = 1), vjust = -0.5)
Based on the result, passengers with ticket group of 2,3,4 have the highest survival rate.
titanic.train.clean %>%
ggplot(aes(x = Fare, color = Survived)) +
geom_line(aes(label=..count..), stat = 'bin') +
labs(x = "Fare Price",
y = "Number of Passengers",
title = "Effect of Fare Price on Survival Rate") +
theme_classic() + scale_color_brewer(palette = "Dark2")
Accordingly, when the price is below 200 dollars, the higher the fare price is, the higher the survival rate it, when the price is higher than 200 dollars, there isn’t significant difference.
titanic.train.clean %>%
ggplot(aes(x = Embarked, y = ..count.., fill = Survived)) +
geom_bar(stat = "count", position = "dodge") +
labs(x = "Embarked Place",
y = "Number of Passengers",
title = "Effect of Embarked Place on Survival Rate") +
theme_classic() + scale_fill_brewer(palette = "Dark2") +
geom_text(stat = "count",
aes(label = ..count..),
position = position_dodge(width = 1), vjust = -0.5)
Based on the plot, the survival rate for passengers embarked on C is the highest, while the survival rate for passengers embarked on S is the lowest.
WOETable(X = titanic.train.clean$Embarked, Y = titanic.train.clean$Survived)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 C 95 75 170 0.2777778 0.1366120 0.70967648 1.001820e-01
## 2 Q 30 47 77 0.0877193 0.0856102 0.02433748 5.133014e-05
## 3 S 217 427 644 0.6345029 0.7777778 -0.20359896 2.917061e-02
IV(X = titanic.train.clean$Embarked, Y = titanic.train.clean$Survived)
## [1] 0.129404
## attr(,"howgood")
## [1] "Highly Predictive"
Accordingly, Embarked could be regarded as one of the features used to predict the survival because of the high IV value.
To sum up, I will use Pclass, Title, Sex, Age, FamilySized, TicketGroup, Fare, and Embarked as predictors.
titanic.train.final <- titanic.train.clean %>%
select(Survived, Pclass, Title, Sex, Age, FamilySized,
TicketGroup, Fare, Embarked)
titanic.test.final <- titanic.test.clean %>%
select(Pclass, Title, Sex, Age, FamilySized,
TicketGroup, Fare, Embarked)
All later steps will using titanic.train.final and titanic.test.final datasets.
library(corrplot)
titanic.train.final %>%
select(Age, TicketGroup, Fare) %>%
cor() %>%
corrplot.mixed()
Here, I will utilize 6 classification methods, which are Stepwise Logistic Regression Model, the k-Nearest-Neighbors (KNN), Naive Bayes Classifier, Classification and Regression Trees (CART), Random Forest Model, and Ensembles (Bagging and Boosting).
titanic.null <- glm(Survived ~ 1, data = titanic.train.final, family = 'binomial')
titanic.full <- glm(Survived ~ ., data = titanic.train.final, family = 'binomial')
titanic.step <- step(titanic.null, scope = list(lower = titanic.null, upper = titanic.full),
direction = "both")
## Start: AIC=1188.66
## Survived ~ 1
##
## Df Deviance AIC
## + Title 4 886.64 896.64
## + Sex 1 917.80 921.80
## + Pclass 2 1083.11 1089.11
## + FamilySized 2 1111.56 1117.56
## + Fare 1 1117.57 1121.57
## + Embarked 2 1159.30 1165.30
## + Age 1 1176.89 1180.89
## + TicketGroup 1 1182.96 1186.96
## <none> 1186.66 1188.66
##
## Step: AIC=896.64
## Survived ~ Title
##
## Df Deviance AIC
## + Pclass 2 782.74 796.74
## + FamilySized 2 813.72 827.72
## + Fare 1 855.86 867.86
## + Embarked 2 865.42 879.42
## + TicketGroup 1 873.60 885.60
## + Sex 1 881.07 893.07
## <none> 886.64 896.64
## + Age 1 885.39 897.39
## - Title 4 1186.66 1188.66
##
## Step: AIC=796.74
## Survived ~ Title + Pclass
##
## Df Deviance AIC
## + FamilySized 2 729.80 747.80
## + TicketGroup 1 765.23 781.23
## + Embarked 2 774.44 792.44
## + Age 1 777.38 793.38
## + Sex 1 778.19 794.19
## <none> 782.74 796.74
## + Fare 1 782.73 798.73
## - Pclass 2 886.64 896.64
## - Title 4 1083.11 1089.11
##
## Step: AIC=747.8
## Survived ~ Title + Pclass + FamilySized
##
## Df Deviance AIC
## + Age 1 719.92 739.92
## + Sex 1 725.46 745.46
## + Fare 1 725.58 745.58
## + TicketGroup 1 727.24 747.24
## <none> 729.80 747.80
## + Embarked 2 726.66 748.66
## - FamilySized 2 782.74 796.74
## - Pclass 2 813.72 827.72
## - Title 4 1039.15 1049.15
##
## Step: AIC=739.92
## Survived ~ Title + Pclass + FamilySized + Age
##
## Df Deviance AIC
## + Sex 1 716.08 738.08
## + Fare 1 716.41 738.41
## <none> 719.92 739.92
## + TicketGroup 1 718.59 740.59
## + Embarked 2 716.85 740.85
## - Age 1 729.80 747.80
## - FamilySized 2 777.38 793.38
## - Pclass 2 813.60 829.60
## - Title 4 993.79 1005.79
##
## Step: AIC=738.08
## Survived ~ Title + Pclass + FamilySized + Age + Sex
##
## Df Deviance AIC
## + Fare 1 712.64 736.64
## <none> 716.08 738.08
## + Embarked 2 712.60 738.60
## + TicketGroup 1 714.91 738.91
## - Sex 1 719.92 739.92
## - Age 1 725.46 745.46
## - Title 4 765.13 779.13
## - FamilySized 2 773.13 791.13
## - Pclass 2 807.83 825.83
##
## Step: AIC=736.64
## Survived ~ Title + Pclass + FamilySized + Age + Sex + Fare
##
## Df Deviance AIC
## <none> 712.64 736.64
## + Embarked 2 710.05 738.05
## - Fare 1 716.08 738.08
## - Sex 1 716.41 738.41
## + TicketGroup 1 712.53 738.53
## - Age 1 721.34 743.34
## - Pclass 2 758.30 778.30
## - Title 4 763.32 779.32
## - FamilySized 2 773.12 793.12
summary(titanic.step)
##
## Call:
## glm(formula = Survived ~ Title + Pclass + FamilySized + Age +
## Sex + Fare, family = "binomial", data = titanic.train.final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7387 -0.5224 -0.3774 0.5415 2.5146
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.271461 0.344879 6.586 4.51e-11 ***
## TitleMaster 15.829109 617.587831 0.026 0.97955
## TitleMr 12.279537 617.587655 0.020 0.98414
## TitleMrs 0.678671 0.364734 1.861 0.06278 .
## Titlerare title 12.018373 617.587399 0.019 0.98447
## Pclass.L 1.488218 0.222807 6.679 2.40e-11 ***
## Pclass.Q 0.140950 0.198604 0.710 0.47789
## FamilySizedsmall -0.378234 0.242526 -1.560 0.11886
## FamilySizedlarge -3.268710 0.509555 -6.415 1.41e-10 ***
## Age -0.025656 0.008868 -2.893 0.00381 **
## Sexmale -15.223227 617.587597 -0.025 0.98033
## Fare 0.004527 0.002655 1.705 0.08818 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 712.64 on 879 degrees of freedom
## AIC: 736.64
##
## Number of Fisher Scoring iterations: 13
Have a look at the classification table of the stepwise logistic regression model:
library(QuantPsyc)
ClassLog(titanic.step, titanic.train.final$Survived, cut = .5)
## $rawtab
## resp
## 0 1
## FALSE 483 85
## TRUE 66 257
##
## $classtab
## resp
## 0 1
## FALSE 0.8797814 0.2485380
## TRUE 0.1202186 0.7514620
##
## $overall
## [1] 0.8305275
##
## $mcFadden
## [1] 0.399458
Plot the ROC curve:
library(InformationValue)
predicted <- plogis(predict(titanic.step, titanic.train.final))
plotROC(titanic.train.final$Survived, predicted)
Check the optimal cutoff value, and use it to build the classification table again:
(optCutOff <- optimalCutoff(titanic.train.final$Survived, predicted))
## [1] 0.6499997
ClassLog(titanic.step, titanic.train.final$Survived, cut = .65)
## $rawtab
## resp
## 0 1
## FALSE 510 105
## TRUE 39 237
##
## $classtab
## resp
## 0 1
## FALSE 0.92896175 0.30701754
## TRUE 0.07103825 0.69298246
##
## $overall
## [1] 0.8383838
##
## $mcFadden
## [1] 0.399458
Predict on the test data:
pred.logistic <- predict(titanic.step, titanic.test.final, type = "response")
pred.logistic <- ifelse(pred.logistic > 0.65, 1, 0)
set.seed(1)
library(randomForest)
titanic.rf <- randomForest(Survived ~ ., data = titanic.train.final)
library(caret)
confusionMatrix(titanic.rf$predicted, titanic.train.final$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 497 93
## 1 52 249
##
## Accuracy : 0.8373
## 95% CI : (0.8114, 0.8609)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.648
##
## Mcnemar's Test P-Value : 0.0008943
##
## Sensitivity : 0.9053
## Specificity : 0.7281
## Pos Pred Value : 0.8424
## Neg Pred Value : 0.8272
## Prevalence : 0.6162
## Detection Rate : 0.5578
## Detection Prevalence : 0.6622
## Balanced Accuracy : 0.8167
##
## 'Positive' Class : 0
##
Predict on the test data:
pred.rf <- predict(titanic.rf, titanic.test.final, type = "response")
Bagging:
set.seed(1)
library(adabag)
titanic.bag <- bagging(Survived ~ ., data = titanic.train.final)
confusionMatrix(as.factor(titanic.bag$class), titanic.train.final$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 512 97
## 1 37 245
##
## Accuracy : 0.8496
## 95% CI : (0.8244, 0.8725)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6712
##
## Mcnemar's Test P-Value : 3.454e-07
##
## Sensitivity : 0.9326
## Specificity : 0.7164
## Pos Pred Value : 0.8407
## Neg Pred Value : 0.8688
## Prevalence : 0.6162
## Detection Rate : 0.5746
## Detection Prevalence : 0.6835
## Balanced Accuracy : 0.8245
##
## 'Positive' Class : 0
##
Predict on the test data:
pred.bag <- predict(titanic.bag, titanic.test.final, type = "response")
Boosting:
set.seed(1)
library(adabag)
titanic.boost <- boosting(Survived ~ ., data = titanic.train.final)
confusionMatrix(as.factor(titanic.boost$class), titanic.train.final$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 543 14
## 1 6 328
##
## Accuracy : 0.9776
## 95% CI : (0.9655, 0.9862)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9523
##
## Mcnemar's Test P-Value : 0.1175
##
## Sensitivity : 0.9891
## Specificity : 0.9591
## Pos Pred Value : 0.9749
## Neg Pred Value : 0.9820
## Prevalence : 0.6162
## Detection Rate : 0.6094
## Detection Prevalence : 0.6251
## Balanced Accuracy : 0.9741
##
## 'Positive' Class : 0
##
Predict on the test data:
pred.boost <- predict(titanic.boost, titanic.test.final, type = "response")
write files into csv and upload the prediction result to Kaggle:
titanic.test$pred.logistic <- pred.logistic
titanic.test$pred.rf <- pred.rf
titanic.test$pred.bag <- pred.bag$class
titanic.test$pred.boost <- pred.boost$class
setwd("/Users/shanshantong/Desktop/Skills/1.Tools/R/5.Projects/2.Titanic")
write.csv(titanic.test[,c(1,12)], file = "pred_logistic.csv", row.names = F)
write.csv(titanic.test[,c(1,13)], file = "pred_rf.csv", row.names = F)
write.csv(titanic.test[,c(1,14)], file = "pred_bag.csv", row.names = F)
write.csv(titanic.test[,c(1,15)], file = "pred_boost.csv", row.names = F)
Kaggle Score with Logistic Regression - 0.7751 Accuracy with Random Forest - 0.7895 Accuracy with Bagging Model - 0.8134 Accuracy with Boosting Model - 0.7416
The Bagging model generates the highest accuracy with Kaggle score of 0.8134. Besides, for this specific task, there are still many other algorithms that could be utilized to make prediction but I will not discuss more details here.