The Titanic left the port of Queenstown (now Cobh), Ireland, on April 11, 1912, with 2224 passengers and crew on board 1. In the early morning of Monday, 15 April 1912, the “unsinkable” ship sank after a collision with an iceberg in the Atlantic Ocean; consequently, more than two thirds of the people on board lost their lives, as only a very limited number of them were able to get into the available lifeboats.
Given the insufficient number of means of survival, getting a place in one of these lifeboats was governed by considerations ranging from the ethical “women and children first” (based on the physical and personal situation of each passenger) to the inequitable “Only money can build roads over the seas/oceans” 2 (based on the social class or economic situation of each passenger).
Based on these considerations, the purpose of this project is to predict the survival of a test subset of passengers using machine learning. To this end, I “dive” into the training dataset provided by kaggle.com within the framework of its prediction competitions to identify and engineer the most important predictors.
trainVarTypes <- c("integer", "factor", "factor", "character", "factor", "numeric", "integer", "integer", "character", "numeric", "character", "factor") # respectively: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
testVarTypes <- trainVarTypes[-2]
train <- read.csv("train.csv", colClasses = trainVarTypes, na.strings = c("NA", ""))
test <- read.csv("test.csv", colClasses = testVarTypes, na.strings = c("NA", ""))
train$Survived <- factor(train$Survived, levels = c(0,1), labels = c("no", "yes"))
The train dataset consists of 891 observations of 12 variables describing each passenger. From these variables we know: whether the passenger survived (yes/no), his/her class on board (1st, 2nd or 3rd), name (including the title), sex (female/male), age, ticket number, fare, cabin number, and where he/she embarked.
In addition, the variables SibSp and Parch represent, respectively, the number of siblings/spouses and the number of parents/children on board. Both variables indicate implicitly whether a passenger had family members on board and how many. (More details can be found here.)
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "no","yes": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
head(train, n = 4)
## PassengerId Survived Pclass
## 1 1 no 3
## 2 2 yes 1
## 3 3 yes 3
## 4 4 yes 1
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 <NA> S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 <NA> S
## 4 0 113803 53.1000 C123 S
summary(train)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 no :549 1:216 Length:891 female:314
## 1st Qu.:223.5 yes:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 C :168
## 1st Qu.: 7.91 Class :character Q : 77
## Median : 14.45 Mode :character S :644
## Mean : 32.20 NA's: 2
## 3rd Qu.: 31.00
## Max. :512.33
##
One important aspect when predicting using machine learning is the assessment of the completeness of the data (the proportion of observations with one or more missing values).
missmap(train, main = "Missing Values Analysis", col = c("red", "gray")) # missmap() from the Amelia package: map of missing observations
sapply(train, function(x) sum(is.na(x))) # count of missing values per variable
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
The analysis shows that two variables suffer strongly from missing data: Cabin and Age.
While a missing Cabin number simply means that the passenger had no cabin, the variable Age is crucial and its missing observations should be imputed.
The preliminary data analysis also highlights the fact that the passengers’ titles are embedded in the variable Name. While the passengers’ names are not of great importance here, their titles are, in contrast, very informative and could represent an important predictor for machine learning.
I thus chose to isolate the titles and classify them in four categories according to their meanings: “Miss” for young ladies, “Mrs” for older/married ones, “Mil” for military passengers and “Noble” for aristocratic titles. The categories “Rev”, “Mr”, “Dr” and “Master” are kept unchanged.
# 1. train$Name - Replace titles in Name
x <- sub(".*, ", "", as.character(train$Name)) # drop everything up to "Surname, "
train$NameCat <- sub("\\..*", "", x) # keep the title: everything before the first "."
train$NameCat[train$NameCat %in% c("Miss", "Mlle", "Ms")] <- "Miss"
train$NameCat[train$NameCat %in% c("Mme", "Mrs")] <- "Mrs"
train$NameCat[train$NameCat %in% c("Capt", "Col", "Major")] <- "Mil"
train$NameCat[train$NameCat %in% c("Don", "Jonkheer", "Lady", "Sir", "the Countess")] <- "Noble"
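A quick tabulation (added as a check; it is not part of the original output) confirms that only the intended categories remain:
table(train$NameCat) # expect: Dr, Master, Mil, Miss, Mr, Mrs, Noble, Rev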
Note, however, that contrary to what one might think, the title “Master” was used at the time to identify male children.
ggplot(train, aes(x = NameCat, y = Age)) + geom_boxplot(aes(fill=Survived)) + theme_bw()
For efficiency, it is more appropriate to classify the passengers into specific age ranges based on the variable Age. In this way, one can directly identify whether a passenger is a “child”, “adult” or “senior”.
For passengers with no age information, I temporarily use the category “missing”.
Note also that, even though the Titanic’s Certificates of Clearance indicate that passengers over 14 were considered adults 3, I consider that during the evacuation of the Titanic only visible signs of puberty could be used to tell whether a passenger was a child or not. For this reason, I chose the age of 14 as the limit distinguishing a child from an adult passenger.
# 2. train$Age - Categories
train$AgeCat[train$Age > 0 & train$Age <= 14 & !is.na(train$Age)] <- "child"
train$AgeCat[train$Age > 14 & train$Age <= 60 & !is.na(train$Age)] <- "adult"
train$AgeCat[train$Age > 60 & !is.na(train$Age)] <- "senior"
train$AgeCat[is.na(train$Age)] <- "missing"
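Tabulating the new variable shows how the passengers are distributed over the brackets; the count for “missing” should match the 177 missing Age values seen above (an added check):
table(train$AgeCat) # "missing" should equal the 177 NA's in Age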
A new categorical variable is set according to whether a passenger has a cabin or not.
# 3. train$Cabin
train$HasCab[!is.na(train$Cabin)] <- "yes"
train$HasCab[is.na(train$Cabin)] <- "no"
It is known that many passengers were crossing the Atlantic with other members of their families. When leaving the sinking Titanic, children would obviously have been accompanied into the lifeboats by at least one of their parents. Thus, being on board with other family members could play an important role.
# 4. train$HasFam
train$HasFam[train$SibSp != 0 | train$Parch != 0] <- "yes"
train$HasFam[train$SibSp == 0 & train$Parch == 0] <- "no"
Nevertheless, the question is: to what extent? Did big families have the same chance of getting into lifeboats as smaller ones?
# 5. train$FamCat
# Check the relationship between SibSp, Parch and survival
ggplot(train, aes(y = SibSp, x = Parch)) +
geom_jitter(aes(color = Survived)) +
theme_bw() +
geom_vline(xintercept = 3, color = 'black', lty=5) +
geom_hline(yintercept = 3, color = 'black', lty=5)
It appears here that passengers with SibSp or Parch greater than 3 (i.e. passengers belonging to families of more than 4 members) had a very small chance to survive. I thus introduce another predictor variable which classifies the passengers according to the count of their relatives on board.
train$FamCat[train$SibSp + train$Parch <= 4] <- "small" # small family (note: 0 is overwritten as "single" below)
train$FamCat[train$SibSp + train$Parch > 4] <- "big" # big family
train$FamCat[train$SibSp + train$Parch == 0] <- "single" # the passenger was alone
Finally, all these engineered variables are converted to factors, as they will be used as categorical predictors:
train$NameCat <- factor(train$NameCat)
train$AgeCat <- factor(train$AgeCat)
train$HasCab <- factor(train$HasCab)
train$HasFam <- factor(train$HasFam)
train$FamCat <- factor(train$FamCat)
According to the train dataset, the proportion of survivors is a bit greater than the global one (about one third). Not surprisingly, more women and children survived the shipwreck than men.
One can also notice that the overall proportion of survivors in the category “missing” is very close to that of the category “adult”, which leads me to believe (even if subjectively) that their mean ages would not be very different.
# barplot: survived or not
p1 <- ggplot(data = train, aes(Survived, fill = Survived)) + geom_bar() + guides(fill = FALSE) + theme_bw()
# barplot: survival according to sex
p2 <- ggplot(data = train, aes(Sex, fill = Survived)) + geom_bar(position = "dodge") + guides(fill = FALSE) + theme_bw()
# barplot: survival or not according to age category
p3 <- ggplot(data = train, aes(x = AgeCat, fill = Survived)) + geom_bar(position = "dodge") + theme_bw()
grid.arrange(p1, p2, p3, layout_matrix = cbind(c(1,3), c(2,3)), widths=c(2,3))
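The figures quoted below come from a chunk whose code is not echoed in the report; a sketch that would reproduce them, using the same aggregate() pattern as the survival tables later on:
paste0("The percentage of survival is about ", round(mean(train$Survived == "yes") * 100, 1), "%")
aggregate(train$Survived ~ train$Sex, FUN = function(x) sum(x == "yes") / length(x) * 100)
aggregate(train$Survived ~ train$AgeCat, FUN = function(x) sum(x == "yes") / length(x) * 100)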
## [1] "The percentage of survival is about 38.4%"
## train$Sex train$Survived
## 1 female 74.20382
## 2 male 18.89081
## train$AgeCat train$Survived
## 1 adult 39.02439
## 2 child 58.44156
## 3 missing 29.37853
## 4 senior 22.72727
In light of the first plot, it seems that having family members on board the Titanic was not crucial for survival. However, when looking at the second plot, it appears clearly that being a member of a small family was in fact very important for survival, mainly because small children were always accompanied by at least one of their parents.
# barplot: survived or not according to has a family on board or not
p1 <- ggplot(data = train, aes(x = HasFam, fill = Survived)) + geom_bar(position = "dodge") + theme_bw()
# barplot: survived or not according to the category of the family
p2 <- ggplot(data = train, aes(x = FamCat, fill = Survived)) + geom_bar(position = "dodge") + guides(fill = FALSE) + theme_bw()
grid.arrange(p1, p2)
## train$HasFam train$Survived
## 1 no 30.35382
## 2 yes 50.56497
## train$FamCat train$Survived
## 1 big 14.89362
## 2 single 30.35382
## 3 small 56.02606
Wealth also seems to have an important impact on survival. The first two plots show that having a cabin or being a first-class passenger clearly increased the chance of survival.
# barplot: survived or not according to having a cabin or not
p2 <- ggplot(data = train, aes(x = HasCab, fill = Survived)) + geom_bar(position = "dodge") + theme_bw()
# barplot: survival or not according to passenger class
p1 <- ggplot(data = train, aes(x = Pclass, fill = Survived)) + geom_bar(position = "dodge") + guides(fill = FALSE) + theme_bw()
grid.arrange(p1, p2)
## train$Pclass train$Survived
## 1 1 62.96296
## 2 2 47.28261
## 3 3 24.23625
## train$HasCab train$Survived
## 1 no 29.98544
## 2 yes 66.66667
On the other hand, it also appears clearly that the overwhelming majority of women who paid more than 50 pounds survived. Since this is not the case for men, the reason could be that the women from first class were the first to leave the Titanic on lifeboats, while the men of first class were forced to wait for women and children from the other classes.
# Had the Fare any impact?
ggplot(train, aes(x = PassengerId, y = Fare, color = Survived)) + geom_point() + theme_bw() + facet_grid(. ~ Sex)
Concerning the passengers’ titles, the proportions of survival seem to be related to age categories and sex. One can also notice that certain categories such as doctors, servicemen or clerics did not (for the most part) even try to leave the Titanic.
# Relationship between survival and name categories
ggplot(data = train, aes(x = NameCat, fill = Survived)) + geom_bar(position = "dodge") + theme_bw()
aggregate(train$Survived ~ train$NameCat, FUN= function(x) sum(x == "yes") / length(x) * 100)
## train$NameCat train$Survived
## 1 Dr 42.85714
## 2 Master 57.50000
## 3 Mil 40.00000
## 4 Miss 70.27027
## 5 Mr 15.66731
## 6 Mrs 79.36508
## 7 Noble 60.00000
## 8 Rev 0.00000
For the machine learning it is important to impute all missing values of the predictor variables. There are many imputation methods, each with advantages and drawbacks. In this project, two variables show missing values: Embarked and Age. Since only two values are missing in Embarked, I do a simple investigation to deduce them based on the ticket number, which should be related to the place where it was issued.
# 1. train$Embarked
# passengerID with missing "Embarked"
train$PassengerId[is.na(train$Embarked)]
## [1] 62 830
# both passengers have the same ticket: 113572 => they embarked together on the same ticket
train$Ticket[train$PassengerId == 62 | train$PassengerId == 830]
## [1] "113572" "113572"
# check other passengers whose tickets share the same prefix => they most probably bought their tickets at the same embarkation point
train$Embarked[grep("^1135", train$Ticket)]
## [1] C <NA> S S S S C S <NA>
## Levels: C Q S
# since most of these cases have Embarked = "S"
train$Embarked[train$PassengerId %in% c(62, 830)] <- "S" # matching on PassengerId (identical to the row index here) is safer than hard-coded rows
For the machine learning, I decide to use AgeCat instead of Age. Its missing values are themselves imputed using machine learning, since, as seen previously, the age category of a passenger is directly related to many other variables in our dataset. Of course, for the prediction of missing AgeCat, the variable Survived is excluded when training the model.
The chosen predictor variables are Pclass, Sex, Embarked, NameCat, HasCab, HasFam, and FamCat, while the target variable is “AgeCat” with three different levels: “child”, “adult”, and “senior”.
The trainControl was set to use k-fold cross-validation, as it represents a robust method to estimate the model’s accuracy. The choice of k = 5 has been empirically shown to avoid both high bias and high variance when estimating the test error rate 4.
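Note that allowParallel = TRUE in the chunks below only takes effect if a parallel backend has been registered beforehand; the setup chunk is not shown, so the following is only a minimal sketch using the doParallel package (an assumption on my part):
library(doParallel) # assumed available; provides a foreach backend for caret
cl <- makeCluster(max(1, detectCores() - 1)) # leave one core free for the OS
registerDoParallel(cl) # caret's allowParallel = TRUE now dispatches to these workers
# ... run the train() calls ...
# stopCluster(cl) # release the workers once training is finished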
# train$AgeCat
# The missing categories are imputed by prediction
# 1. Select the predictors - "Survived" should not be selected
trainAge <- train[c(1, 3, 5, 12, 13, 14, 15, 16, 17)] # PassengerId, Pclass, Sex, Embarked, NameCat, AgeCat, HasCab, HasFam, FamCat
# keep only observations with AgeCat = "missing" in the test subset
trainAgetest <- trainAge[trainAge$AgeCat == "missing",]
# exclude AgeCat from the test subset
trainAgetest <- trainAgetest[, -6]
# exclude observations with AgeCat = "missing" in the train subset
trainAge <- trainAge[trainAge$AgeCat != "missing",]
trainAge <- trainAge[, -1]
# drop the unused level "missing"
trainAge$AgeCat <- factor(trainAge$AgeCat)
# Subset the train data set
set.seed(1962)
subsets <- createDataPartition(y=trainAge$AgeCat, p=0.75, list=FALSE)
subTraining <- trainAge[subsets, ]
subTesting <- trainAge[-subsets, ]
# train the model
x <- subTraining[, -5] # column 5 is the target variable
y <- subTraining[, 5]
control <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
modelGBM <- train(x, y, method = "rf", trControl = control) # note: method = "rf" (random forest), despite the object name
#modelGBM$results
predGBM <- predict(modelGBM, subTesting)
cfMatGBM <- confusionMatrix(subTesting$AgeCat, predGBM)
cfMatGBM$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9378531 0.6888782 0.8915243 0.9685713 0.9152542
## AccuracyPValue McnemarPValue
## 0.1730900 NaN
cfMatGBM$table
## Reference
## Prediction adult child senior
## adult 152 1 0
## child 5 14 0
## senior 5 0 0
A gradient boosting (GBM) model was also tested for this imputation; the best accuracy, however, was obtained with the random forest shown above, which is kept despite its modelGBM name.
The missing values of AgeCat in the train data set are then predicted using this model.
# predict for the missing AgeCat in train data set
predAge <- predict(modelGBM, trainAgetest)
# values are then injected into train
train$AgeCat[train$PassengerId %in% trainAgetest$PassengerId] <- predAge
train$AgeCat <- factor(train$AgeCat)
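A quick check (added here) confirms that the category “missing” has disappeared after imputation:
table(train$AgeCat) # the level "missing" should no longer appear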
In order to do the machine learning, it is mandatory to apply the same modifications to the test data set. The same steps applied to the train data are thus applied to the test data.
sapply(test, function(x) sum(is.na(x))) # how many observations are missing
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 86 0
## Parch Ticket Fare Cabin Embarked
## 0 0 1 327 0
We notice that there are missing values in Age and Fare. Concerning Fare, since only one value is missing, one can simply assign to this observation the median Fare of the corresponding class.
The missing values of AgeCat are imputed using the exact same method as for the train data.
Many titles seen in the train file are absent from the test file. At the same time, a new one, Dona, appears and is classified as a noble title.
# Data Engineering
# 1. test$Name
x <- sub(".*, ", "", as.character(test$Name))
test$NameCat <- sub("\\..*", "", x)
test$NameCat[test$NameCat %in% c("Miss", "Ms")] <- "Miss"
test$NameCat[test$NameCat == "Col"] <- "Mil" # use the same label as in train
test$NameCat[test$NameCat == "Dona"] <- "Noble" # use the same label as in train
# 2. test$Age
test$AgeCat[test$Age > 0 & test$Age <= 14 & !is.na(test$Age)] <- "child"
test$AgeCat[test$Age > 14 & test$Age <= 60 & !is.na(test$Age)] <- "adult"
test$AgeCat[test$Age > 60 & !is.na(test$Age)] <- "senior"
test$AgeCat[is.na(test$Age)] <- "missing"
# 3. test$Cabin
test$HasCab[!is.na(test$Cabin)] <- "yes" # same labels as in train ("yes"/"no")
test$HasCab[is.na(test$Cabin)] <- "no"
# 4. Add test$HasFam
test$HasFam[test$SibSp != 0 | test$Parch != 0] <- "yes"
test$HasFam[test$SibSp == 0 & test$Parch == 0] <- "no"
# 5. Add test$Famcat
test$FamCat[test$SibSp + test$Parch <= 4] <- "small"
test$FamCat[test$SibSp + test$Parch > 4] <- "big"
test$FamCat[test$SibSp + test$Parch == 0] <- "single"
# 6. missing Fare: PassengerId = 1044 with Pclass = 3 -> use the median Fare of this class
test$Fare[test$PassengerId == 1044] <- median(test$Fare[test$Pclass == 3], na.rm = TRUE)
test$NameCat <- factor(test$NameCat)
test$AgeCat <- factor(test$AgeCat)
test$HasCab <- factor(test$HasCab)
test$HasFam <- factor(test$HasFam)
test$FamCat <- factor(test$FamCat)
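Since the final model will be trained on train and used to predict on test, the engineered factors must carry the same labels in both data sets (hence “yes”/“no” above rather than 1/0). A small added sanity check:
# added check: engineered factor levels must agree between train and test
stopifnot(identical(levels(train$HasCab), levels(test$HasCab)),
          identical(levels(train$HasFam), levels(test$HasFam)),
          identical(levels(train$FamCat), levels(test$FamCat)))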
# Imputation of missing test$AgeCat (same procedure as for the train data)
testAge <- test[c(1, 2, 4, 11, 12, 13, 14, 15, 16)]
testAgetest <- testAge[testAge$AgeCat == "missing",]
testAgetest <- testAgetest[, -6] # test file
testAge <- testAge[testAge$AgeCat != "missing",]
testAge <- testAge[, -1] # train file
testAge$AgeCat <- factor(testAge$AgeCat) # in case of different levels
set.seed(1962)
subsets <- createDataPartition(y=testAge$AgeCat, p=0.75, list=FALSE)
subTraining <- testAge[subsets, ]
subTesting <- testAge[-subsets, ]
x <- subTraining[, -5] # column 5 is the target variable
y <- subTraining[, 5]
control <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
modelGBMtest <- train(x, y, method = "rf", trControl = control) # again a random forest, despite the object name
#modelGBMtest$results
predGBMtest <- predict(modelGBMtest, subTesting)
cfMatGBMtest <- confusionMatrix(subTesting$AgeCat, predGBMtest)
cfMatGBMtest$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.8641975 0.4095427 0.7699904 0.9302113 0.8518519
## AccuracyPValue McnemarPValue
## 0.4523181 NaN
cfMatGBMtest$table
## Reference
## Prediction adult child senior
## adult 65 7 0
## child 2 5 0
## senior 2 0 0
predAgetest <- predict(modelGBMtest, testAgetest)
test$AgeCat[test$PassengerId %in% testAgetest$PassengerId] <- predAgetest
test$AgeCat <- factor(test$AgeCat)
Now that all the data are processed and the missing values imputed, let’s build the final model and predict survival among the passengers of the test set.
The predictor variables are: “Pclass”, “Sex”, “Embarked”, “NameCat”, “AgeCat”, “HasCab”, “HasFam”, and “FamCat”.
The target variable is “Survived”.
The train data consists of 891 observations, while the test data consists of 418.
# RF for train and test
trainFin <- train[c(2, 3, 5, 12, 13, 14, 15, 16, 17)]
testFin <- test[c(1, 2, 4, 11, 12, 13, 14, 15, 16)]
set.seed(1962)
subsets <- createDataPartition(y=trainFin$AgeCat, p=0.75, list=FALSE) # note: the split is stratified on AgeCat rather than on the target Survived
subTraining <- trainFin[subsets, ]
subTesting <- trainFin[-subsets, ]
x <- subTraining[, -1] # column 1 is the target variable
y <- subTraining[, 1]
control = trainControl(method = "boot", number = 5, allowParallel = TRUE)
modelGBMFin <- train(x, y, method = "gbm", trControl = control)
modelGBMFin$results
## shrinkage interaction.depth n.minobsinnode n.trees Accuracy Kappa
## 1 0.1 1 10 50 0.8144104 0.5807197
## 4 0.1 2 10 50 0.8135026 0.5825964
## 7 0.1 3 10 50 0.8144249 0.5835697
## 2 0.1 1 10 100 0.8129420 0.5808665
## 5 0.1 2 10 100 0.8159136 0.5892643
## 8 0.1 3 10 100 0.8239166 0.6035873
## 3 0.1 1 10 150 0.8203495 0.6037245
## 6 0.1 2 10 150 0.8206616 0.5987273
## 9 0.1 3 10 150 0.8278934 0.6097493
## AccuracySD KappaSD
## 1 0.017017093 0.04787700
## 4 0.005442876 0.01228053
## 7 0.015013440 0.03670176
## 2 0.020246036 0.05357968
## 5 0.006297675 0.02207858
## 8 0.001446437 0.01990235
## 3 0.021378438 0.05007471
## 6 0.008710776 0.02909173
## 9 0.007754714 0.03187313
predGBMFin <- predict(modelGBMFin, subTesting)
cfMatGBMFin <- confusionMatrix(subTesting$Survived, predGBMFin)
cfMatGBMFin$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.692308e-01 5.340829e-01 7.079997e-01 8.230873e-01 5.610860e-01
## AccuracyPValue McnemarPValue
## 8.964825e-11 5.754030e-01
cfMatGBMFin$table
## Reference
## Prediction no yes
## no 96 23
## yes 28 74
#importanceGBMFin <- varImp(modelGBMFin, scale = FALSE)
#plot(importanceGBMFin, top = 10)
predGBMFin <- predict(modelGBMFin, testFin)
#sol <- ifelse(predGBMFin == "yes", "1", "0")
#z <- data.frame(test$PassengerId, sol)
#colnames(z) <- c("PassengerId", "Survived")
#write.csv(z, "Res.csv", row.names=FALSE)
1. Mersey, Lord (1999) [1912]. The Loss of the Titanic, 1912. The Stationery Office. ISBN 978-0-11-702403-8.↩
2. Algerian proverb.↩
3. https://www.encyclopedia-titanica.org/titanic-statistics.html↩
4. http://www.sthda.com/english/articles/38-regression-model-validation/157-cross-validation-essentials-in-r/↩