Load Required Packages

library(mice)    # missing data investigation
library(dplyr)   # data manipulation and exploratory data analysis
library(caret)   # machine learning models
library(e1071)   # caret dependency
library(ggplot2) # plotting (also attached by caret, loaded explicitly here)
library(doMC)    # parallel processing for caret models
library(knitr)   # formatted tables
registerDoMC(cores = 3) # register 3 processor cores

Introduction

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. [1]

This report provides an analysis of the main factors that could lead to the survival of a specific person, based on their personal features. In addition, a few models were built in order to automatically predict whether a person would survive. The data dictionary can be found on the challenge’s website (reference 1).

First step: read the datasets and join the training and testing sets so that data transformations are applied to both. In the subsequent steps, imputation statistics were computed from the training dataset only.

# Setting work directory and reading training data
setwd('/Users/caiomiyashiro/Downloads/Mytaxy Challenge')
training <- read.csv('Titanic - train.csv', header = TRUE)
training$mode <- 'Training'

testing <- read.csv('Titanic - test.csv', header = TRUE)
testing$mode <- 'Testing'
testing$Survived <- NA

dataset <- rbind(training, testing)
dataset$mode <- as.factor(dataset$mode)

Data Cleaning and Preparation

Data cleaning procedures are important to make sure we are working with the correct data, since a few problems can come up when we obtain data from different sources:

# turning empty values into NA
dataset[which(dataset$Cabin == ''),"Cabin"] = NA
dataset[which(dataset$Embarked == ''),"Embarked"] = NA
# transforming PClass into factor
dataset[,'Pclass'] <- as.factor(dataset[,'Pclass'])

# turning response from 0-1 to 'N' 'Y'
dataset[1:nrow(training),'Survived'] <- ifelse(dataset[1:nrow(training),'Survived'] == 0, 'N', 'Y')
dataset[,'Survived'] <- as.factor(dataset[,'Survived'])

Data Imputation

Many of the models we will build do not accept missing values during training. A possible approach is to artificially generate each missing value. However, we first need to determine what kind of missing value we have:

  • Missing Completely at Random (Totally at Random): values are missing due to a data workflow error, a bad connection or something else unrelated to the business we are dealing with.
  • Missing at Random: missing values in a column might be related to another variable inside the database. For example, there are strong clues that most tickets recovered after the disaster belonged to first-class passengers; so, if a person was not in the first class, there is a high probability that their ticket number is missing.
  • Missing Not at Random: in this case, a missing value is a value in itself. For example, if only people from the third class had missing tickets, an NA in the ticket column would by itself indicate that person’s social class.

In order to do this kind of evaluation, we inspect the missing-data pattern of the dataset using the MICE package [2]. It provides a way to check whether missing values in one column co-occur with missing values in other columns; it does not check whether data is missing in relation to another variable’s value. Below, we find that Cabin has a lot of missing values and that they do not depend on other columns’ missing values. We also find a significant count of rows where both Age and Cabin are missing, so we checked whether there is a relation between their values. We see that there are more missing values in the [18-24] age range, but looking at the Age histogram, most of the passengers fall in this range, so it is acceptable that more data in a range also means more missing data there.

training <- dataset[dataset$mode == 'Training',]

# Cabin - 529 solo missing (Totally at random)
md.pattern(training)
##     PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare mode
## 183           1        1      1    1   1     1     1      1    1    1
##  19           1        1      1    1   1     1     1      1    1    1
## 529           1        1      1    1   1     1     1      1    1    1
##   2           1        1      1    1   1     1     1      1    1    1
## 158           1        1      1    1   1     1     1      1    1    1
##               0        0      0    0   0     0     0      0    0    0
##     Embarked Age Cabin    
## 183        1   1     1   0
##  19        1   0     1   1
## 529        1   1     0   1
##   2        0   1     1   1
## 158        1   0     0   2
##            2 177   687 866

To read this table, we search for the value 0, the indicator of a missing value. The first line, with all 1s, indicates there are 183 complete rows, while there are 19 rows with only Age missing, 529 with only Cabin missing, 2 with only Embarked missing and 158 with both Age and Cabin missing (the pattern counts are mutually exclusive).

# Lead: 158 rows with both Age and Cabin missing (missing at random?)
missingCabinByAge <- table(training$Age, is.na(training$Cabin))

# Peak at young ages (18-24 years), but this is also the most common passenger age range
par(mfrow = c(1,2)) # Configure R to plot 2 graphs per line
plot(row.names(missingCabinByAge), missingCabinByAge[,2],
     xlab = 'Age', ylab = 'Amount of missing Cabins',
     main = 'Nbr of Missing Cabins per Age')
hist(training$Age, xlab = 'Age', main = 'Histogram of Age')

par(mfrow = c(1,1)) # Configure plotting area back to normal

Age Column Imputation: Focusing on the Age column, we tried to evaluate whether there was any relation between it and the other columns in the dataset. We checked the Age distribution grouped by Sex and Pclass individually and by both variables together. As can be seen below, we got the best Age separation when conditioning on both Sex and Pclass, so we extracted the mean and median age of each segment. To keep the imputation consistent with each segment, every time we find an NA value in Age we impute it with the mean age of the corresponding Sex/Pclass segment.

# age Evaluation by Sex
training %>% group_by(Sex) %>% 
  summarise(meanAge = mean(Age,na.rm = TRUE), medianAge = median(Age,na.rm = TRUE))
## # A tibble: 2 x 3
##      Sex  meanAge medianAge
##   <fctr>    <dbl>     <dbl>
## 1 female 27.91571        27
## 2   male 30.72664        29
# age Evaluation by Pclass
training %>% group_by(Pclass) %>% 
  summarise(meanAge = mean(Age,na.rm = TRUE), medianAge = median(Age,na.rm = TRUE))
## # A tibble: 3 x 3
##   Pclass  meanAge medianAge
##   <fctr>    <dbl>     <dbl>
## 1      1 38.23344        37
## 2      2 29.87763        29
## 3      3 25.14062        24
# Will use this for data imputation (Contains only TRAINING statistics)
ageEvaluationSex_Pclass <- training %>% group_by(Sex, Pclass) %>% 
  summarise(meanAge = mean(Age,na.rm = TRUE), medianAge = median(Age,na.rm = TRUE))
ageEvaluationSex_Pclass
## # A tibble: 6 x 4
## # Groups:   Sex [?]
##      Sex Pclass  meanAge medianAge
##   <fctr> <fctr>    <dbl>     <dbl>
## 1 female      1 34.61176      35.0
## 2 female      2 28.72297      28.0
## 3 female      3 21.75000      21.5
## 4   male      1 41.28139      40.0
## 5   male      2 30.74071      30.0
## 6   male      3 26.50759      25.0
### Age imputation grouping by Sex and Pclass
ageImputBySex_Pclass <- function(dataset, ageStatsBySexPclass){
  calculateImputedAge <- function(dfRow, ageStats){
    # pick the mean age of the matching Sex/Pclass segment
    filterIndex <- ageStats$Sex == dfRow['Sex'] & ageStats$Pclass == dfRow['Pclass']
    ageStats[filterIndex, ]$meanAge
  }
  # send each row with a missing Age for imputation
  dataset$Age[is.na(dataset$Age)] <- apply(dataset[is.na(dataset$Age), ], 1,
                                           calculateImputedAge, ageStatsBySexPclass)
  invisible(dataset)
}

dataset <- ageImputBySex_Pclass(dataset, ageEvaluationSex_Pclass)

Fare Column Imputation: For the other numerical column, Fare, we followed a similar idea: we calculated the mean and standard deviation on the training dataset and, whenever a missing value appeared, whether in the training or the testing dataset, we sampled a replacement from a normal distribution with those statistics.

# Will use this for data imputation (Contains only TRAINING statistics)
fareStats <- training %>% summarise(avgFare = mean(Fare), stdFare = sd(Fare))
dataset$Fare[is.na(dataset$Fare)] <- rnorm(length(dataset$Fare[is.na(dataset$Fare)]),
                                           fareStats$avgFare,
                                           fareStats$stdFare)

Embarked Column Imputation: The Embarked column only had two missing values in the training set, so I decided to simply impute them with the mode, i.e. the most frequent value in the column.

### Embarked imputation
# As there are only 2 NAs, replace them with the mode
mostEmbarkedPort <- sort(table(training$Embarked),decreasing = TRUE)[1] #S
dataset$Embarked[is.na(dataset$Embarked)] <- names(mostEmbarkedPort)

Remove Uninteresting Columns: For this analysis, we decided to remove a few columns from the dataset. It is important to notice that they are not useless; I only dropped them because of the time needed to deliver the analysis. With further investigation, we could probably extract good features from these variables.

###### Cabin -- Too many NA, remove it
dataset$Cabin <- NULL
###### Ticket will also remove
dataset$Ticket <- NULL
###### PassengerId will also remove
dataset$PassengerId <- NULL

Having finished imputing values in both the training and testing portions, we isolate the training dataset again, because we will extract from it the statistics used to standardize the entire dataset.

training <- dataset[dataset$mode == 'Training',]
testing <- dataset[dataset$mode == 'Testing',]

Feature Standardization: As stated by LeCun et al., 1998 [3], feature normalization sometimes plays a crucial role in the execution of some machine learning algorithms. As a neural network specialist, he provided examples where non-normalized features could make the traditional sigmoid activation function used in logistic regression and neural networks saturate, making learning really slow. Because of these notes, I standardized the two numerical columns we had, transforming them to zero mean and unit standard deviation.

####Standardize numerical data
ageStats <- training %>% summarise(avgAge = mean(Age), stdAge = sd(Age))

dataset$Age <- (dataset$Age - ageStats$avgAge)/ageStats$stdAge
hist(dataset$Age, xlab = 'Age Standardized', main = 'Age Histogram')

# Fare - Used previously calculated avg and sd
dataset$Fare <- (dataset$Fare - fareStats$avgFare)/fareStats$stdFare
hist(dataset$Fare, xlab = 'Fare St. With Outlier', main = 'Fare Histogram')

For the Fare column, I also noticed that some people paid absurdly more than the rest to board the ship, and these values clearly do not belong to the same Fare distribution we are analysing. Therefore, I decided to cap the ticket price: anyone who paid more than three standard deviations above the mean had their fare set to that ceiling of three standard deviations.

dataset$Fare[which(dataset$Fare > 3)] <- 3
hist(dataset$Fare, xlab = 'Fare St. Without Outlier', main = 'Fare Histogram') # that is better!

Feature Engineering

As defined by Moreira, Gabriel, 2017 [4], feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in better performance. Evaluating the Titanic dataset, we can try to come up with a few other variables that could enhance the algorithms’ power to detect surviving passengers.

Name Title Extraction: If we take a look at the Name column, we will see there are different features inside it, such as first name, surname and title. We will extract the person’s title, which could indicate a person’s educational status, in addition to the economic status already captured by the Pclass column.

titles <- apply(training,1,function(row){
  strsplit(strsplit(as.character(row['Name']),', ')[[1]][2],'\\.')[[1]][1]
})
summary(as.factor(titles))
##         Capt          Col          Don           Dr     Jonkheer 
##            1            2            1            7            1 
##         Lady        Major       Master         Miss         Mlle 
##            1            2           40          182            2 
##          Mme           Mr          Mrs           Ms          Rev 
##            1          517          125            1            6 
##          Sir the Countess 
##            1            1

There is some serious imbalance between the titles, as we can see above. To deal with that, I merged a few titles that indicate the same thing and, for titles without any equivalent, I converted them to a separate category, ‘Rare Title’. The titles ‘Dr’, ‘Master’, ‘Miss’, ‘Mr’ and ‘Mrs’ were kept, and the titles that were replaced can be seen below.

After extracting the title, we have no more use for the Name column, so I removed it.

## preprocessing function
extractAndConvertTitles <- function(dataset){
  titles <- apply(dataset,1,function(row){
    strsplit(strsplit(as.character(row['Name']),', ')[[1]][2],'\\.')[[1]][1]
  })
  keep_titles <- c('Dr','Master', 'Miss', 'Mr', 'Mrs')
  replacementTitles <- list(Mlle = 'Miss', Mme = 'Mrs', Sir = 'Mr', Ms = 'Miss')
  for(r_title in names(replacementTitles)){
    titles[titles == r_title] <- replacementTitles[[r_title]]
  }
  # convert all the other remaining titles to 'Rare Title'
  titles[!titles %in% keep_titles] = 'Rare Title'
  dataset$Title <- as.factor(titles)
  invisible(dataset)
}
dataset <- extractAndConvertTitles(dataset)
dataset$Name <- NULL
summary(dataset$Title[dataset$mode == 'Training']) # Better!
##         Dr     Master       Miss         Mr        Mrs Rare Title 
##          7         40        185        518        126         15

Family Size Classification: Two columns, SibSp and Parch, indicate how many siblings/spouses and parents/children were travelling with a given passenger. With that, I created an extra feature indicating the size of the family that person was travelling with. The theory is that family size could be an indication of that family’s chance of survival: small and medium sized families would have higher chances of survival than big families. The family size is defined as the number of siblings/spouses + the number of parents/children + the passenger themselves (+1), and each family size interval is classified as a small, medium or big family.

### Family Size Classification
familySize <- dataset$SibSp + dataset$Parch + 1
familySizeClass = array(dim = length(familySize))
familySizeClass[familySize == 1] = 'Small'
familySizeClass[familySize >= 2 & familySize <= 4] = 'Medium'
familySizeClass[familySize > 4] = 'Big'

dataset$FamilySize <- as.factor(familySizeClass)

Young people travelling alone: We saw how young adults were prevalent among the Titanic passengers, i.e. a bulge in the Age distribution for people between 18 and 24 years old. But how many of them were travelling alone? Could this help us predict whether they had a (I believe) lower chance of survival?

# Note: Age was standardized above, so the 25-year cutoff is expressed on the standardized scale
dataset$youngTravelAl <- dataset$Age < (25 - ageStats$avgAge)/ageStats$stdAge & dataset$Parch == 0

Numeric Transformation: Numerical transformations can bring the non-linearity some models need [5]. In this way we create new features with new distributions that can help our models fit the data’s true distribution better. As we had only two numerical columns, I used both to create power transformations of them.

dataset.numeric <- dataset[c('Age', 'Fare')]

dataset[c('AgePw2','FarePw2')] <- dataset.numeric ^ 2
dataset[c('AgePw3','FarePw3')] <- dataset.numeric ^ 3

Exploratory Data Analysis

In order to better understand the dataset, and to answer a few initial questions about the data, I did some exploratory data analysis.

First, as we did after the imputations, we update the training dataset so it reflects the new transformations.

training <- dataset[dataset$mode == 'Training',]
training$mode <- NULL

1) Is Age correlated with the response? To check whether Age is really a potential feature for our objective, we can test whether there is a significant difference in average age between people who survived and those who did not make it. This is a quick verification, so I made some assumptions here, such as both populations having the same variance and both being relatively normally distributed.

t.test(training$Age[training$Survived == 'Y'],
       training$Age[training$Survived == 'N'],
       var.equal = TRUE, paired = FALSE)
## 
##  Two Sample t-test
## 
## data:  training$Age[training$Survived == "Y"] and training$Age[training$Survived == "N"]
## t = -2.0167, df = 889, p-value = 0.04402
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.273657376 -0.003720438
## sample estimates:
##   mean of x   mean of y 
## -0.08545478  0.05323413

As we can see above, the 95% confidence interval for the difference in mean standardized age between survivors and non-survivors is [-0.2737, -0.0037]. Since the interval does not contain zero (p-value ≈ 0.044), survivors were on average slightly younger, and we have reasonable evidence that Age can be a useful feature for our survival prediction. But what about the power transformations we did with the Age column? Let’s check them visually:

par(mfrow = c(1,2))
boxplot(AgePw2 ~ Survived, data = training,
        main = 'Dist of AgePw2 and Survival',
        xlab = 'Survival',
        ylab = 'AgePw2')
boxplot(AgePw3 ~ Survived, data = training,
        main = 'Dist of AgePw3 and Survival',
        xlab = 'Survival',
        ylab = 'AgePw3')

par(mfrow = c(1,1))

From the figures, the 25th and 75th percentiles look quite similar between the two groups, and I did not see any remarkable differentiation capacity. Let’s see later how these features do within the models.

2) Do Sex and Social Class (Pclass) seem to influence Survival?

Let’s make some count tables relating these variables to see if there is any evidence of a relationship:

table(training$Sex, training$Pclass, training$Survived)
## , ,  = N
## 
##         
##            1   2   3
##   female   3   6  72
##   male    77  91 300
## 
## , ,  = Y
## 
##         
##            1   2   3
##   female  91  70  72
##   male    45  17  47
logistTest <- train(Survived ~ Sex + Pclass, training,
                    method = 'glm', family = binomial())
coef(summary(logistTest))
##               Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  2.2971232  0.2189839  10.489917 9.611309e-26
## Sexmale     -2.6418754  0.1840954 -14.350577 1.056551e-46
## Pclass2     -0.8379523  0.2447436  -3.423797 6.175280e-04
## Pclass3     -1.9054951  0.2141410  -8.898319 5.669887e-19

From the count table, we can see some interesting things, such as:

  • Nearly all first- and second-class women survived (91 of 94 and 70 of 76, respectively), while only half of the third-class women did.
  • Men had much lower survival rates in every class; only about 13% of third-class men survived (47 of 347).

To have a quantitative evaluation of the importance of Sex and Pclass in predicting survival, we fit a small logistic regression with just these two independent variables. As the p-values indicate, both variables explain a significant part of the survival variability.

3) Family Size: one of the theories was that, if your family was no bigger than mid-sized, your chances of survival were higher. Is that so?

ggplot(training, aes(FamilySize, fill = Survived)) +
  geom_bar(position = 'fill') +
  ggtitle('Family Size Impact on Survival') + 
  labs(y = '%')

From the plot above, we see that more than 50% of the passengers in mid-sized families (2 to 4 people) survived the accident, a higher rate than for passengers travelling alone (‘Small’) and more than twice the rate of big families (> 4 people). This might be a good indication that, if you had a few family members with you, you had a preference to go and save yourself with them; passengers travelling alone had that option too, but the boarding preference may have favoured family groups. However, with more than 4 people in the group, it would be difficult to fit and plan for everyone to get into a lifeboat.

4) Young people travelling alone. The idea was that, if you were young and were not with a family, your chances of survival were smaller.

ggplot(training, aes(youngTravelAl, fill = Survived)) +
  geom_bar(position = 'fill') +
  ggtitle('youngTravelAl Impact on Survival') + 
  labs(y = '%')

We have some indication that, if you were young and alone, your priority for getting into a lifeboat was not so high. For a quantitative evaluation, let’s run a chi-square test to see whether these two variables are independent:

chisq.test(table(training$Survived, training$youngTravelAl))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(training$Survived, training$youngTravelAl)
## X-squared = 18.656, df = 1, p-value = 1.565e-05

Statistically speaking, there is a significant dependency between these two variables. Let’s see later whether this variable provides good predictive power within the models.

Modelling

Do we need subsampling?

Before building the models, we must look at the distribution of the dependent variable, as high class imbalance can bias the models, forcing them to always pick the majority class without actually learning anything.

Below, we see there is a little class imbalance, as more people died in the accident than survived. However, considering the small amount of data, I will accept this distribution and will not downsample the majority (‘N’) class.

barchart(training$Survived)

Establish training control - 10-fold cross validation, ROC curve evaluation and hyperparameter search space

Next, we establish a strategy to evaluate model performance. All the metrics will be calculated as the average over 10-fold cross validation, an estimate of the model’s generalization capacity, rather than relying on the biased metrics of the training dataset.

Besides, even though the challenge is scored on pure accuracy, i.e. the number of correctly assigned examples, I chose to evaluate model performance with the ROC curve, sensitivity and specificity. This approach still takes into account the number of correctly classified instances, but separates the metrics between the positive and negative classes. So, if a model is biased towards the majority class, we will be able to detect it, something we could not do with accuracy alone [6].

train_control <- trainControl(method = 'cv', number = 10,
                              classProbs = TRUE, summaryFunction = twoClassSummary)

All the models below, with the exception of the logistic regression, have additional hyperparameters that can enhance model performance. There is no rule book for choosing the parameter values, so instead we do a grid search. Taking advantage of the cross-validation defined before, we run a cross-validation for each candidate parameter value we pre-defined. Each model has its own tunable hyperparameters, so we plan the search inside each model’s training with the tuneGrid and tuneLength parameters, as in the small sketch below.
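As an illustration only (a minimal sketch, not part of the fitted models; the mtry values below are arbitrary), this is how an explicit grid could be passed to caret’s train for the random forest instead of letting tuneLength generate one:

# Hypothetical example: explicit hyperparameter grid for a random forest.
# mtry (number of variables sampled at each split) is the only tunable
# parameter exposed by caret's 'rf' method; the candidate values are arbitrary.
rfGrid <- expand.grid(.mtry = c(2, 4, 6, 8))
# train() would then cross-validate each candidate value:
# train(Survived ~ ., data = training, trControl = train_control,
#       method = 'rf', tuneGrid = rfGrid, metric = 'ROC')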

Basic Linear Modelling

All the models were developed using the caret package [7].

The first model produced was a logistic regression. Although it does not have all the flexibility of modern machine learning algorithms, it is probably the model most used in production today. Its simplicity and, above all, its interpretability make it a natural first choice when starting the modelling.

logRegModel <- train(Survived ~ ., data = training, trControl = train_control,
                     method = 'glm', family = binomial(),
                     metric = 'ROC')
# cross validation results
print(logRegModel)
## Generalized Linear Model 
## 
## 891 samples
##  14 predictor
##   2 classes: 'N', 'Y' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 802, 802, 802, 803, 802, 802, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.8653757  0.8705387  0.7428571
coef(summary(logRegModel))
##                        Estimate   Std. Error     z value     Pr(>|z|)
## (Intercept)        11.891867732 503.41456370  0.02362241 9.811538e-01
## Pclass2            -0.979293041   0.38972211 -2.51279826 1.197778e-02
## Pclass3            -1.833120163   0.44482658 -4.12097715 3.772689e-05
## Sexmale           -15.373609829 503.41235497 -0.03053880 9.756373e-01
## Age                -0.393081670   0.19273723 -2.03946931 4.140321e-02
## SibSp              -0.139490309   0.22654831 -0.61571992 5.380794e-01
## Parch               0.189371686   0.30722262  0.61639889 5.376313e-01
## Fare                0.955920210   0.48227655  1.98209972 4.746808e-02
## EmbarkedC           0.313521090   0.25836598  1.21347666 2.249476e-01
## EmbarkedQ           0.306066714   0.35000637  0.87446042 3.818676e-01
## TitleMaster         3.549358152   1.14784388  3.09219592 1.986817e-03
## TitleMiss         -12.537109694 503.41285928 -0.02490423 9.801314e-01
## TitleMr            -0.108947011   0.95563539 -0.11400479 9.092340e-01
## TitleMrs          -11.920609016 503.41289186 -0.02367959 9.811082e-01
## `TitleRare Title`  -0.625835731   1.23879735 -0.50519622 6.134210e-01
## FamilySizeMedium    2.798611499   0.89423406  3.12961856 1.750334e-03
## FamilySizeSmall     3.223420836   1.13265567  2.84589653 4.428658e-03
## youngTravelAlTRUE   0.327950094   0.52045705  0.63011941 5.286165e-01
## AgePw2             -0.003930222   0.08896402 -0.04417766 9.647628e-01
## FarePw2            -0.929251626   0.56093897 -1.65660024 9.760032e-02
## AgePw3              0.004205360   0.04101266  0.10253809 9.183296e-01
## FarePw3             0.219999767   0.16382578  1.34288857 1.793081e-01

Model Interpretation: Coefficient interpretation for logistic regression is a little tricky, but we can still get a basic sense of the model.

  • For each unit increase in a numerical variable, the odds of belonging to the success class are multiplied by exp(coefficient).
  • For each level of a categorical variable, the odds of belonging to the success class are multiplied by exp(coefficient) relative to the base level.

The largest coefficient (in absolute value) was for Sex, but it also has a very high standard error, so it is not very trustworthy. If you were in the third class, your odds of surviving were multiplied by exp(-1.8331) ≈ 0.16, i.e. less than 1/6 of the odds of a first-class passenger.

Family size: the odds multipliers for small and medium families were, respectively, exp(3.2234) ≈ 25 and exp(2.7986) ≈ 16 times the odds of a big family. A surprise, considering we were thinking mid-sized families had the highest chances of survival.
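As a quick sanity check, the odds ratios quoted above can be recomputed directly from the coefficient table:

# Odds ratios implied by the coefficients quoted above
exp(-1.8331)            # Pclass3: ~0.16 times the odds of first class
exp(c(3.2234, 2.7986))  # FamilySizeSmall, FamilySizeMedium: ~25 and ~16 times a big family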

Naive Bayes

Naive Bayes is another model that, despite its simplicity and its ‘naive’ assumption of variable independence, usually achieves good results, so I will also add it to the model stack.

nbModel <- train(Survived ~ ., data = training, trControl = train_control,
                 method = 'nb', tuneLength = 10,
                 metric = 'ROC')

Random Forest - Variable Importance

For the next model, I chose one that is really versatile: the random forest. It can deal with missing values and class imbalance, has a feature selection process built in, and provides a way to visualize variable importance. It is a good way to select variables that can then be used in other models.

rfModel <- train(Survived ~ ., data = training, trControl = train_control,
                 method = 'rf', tuneLength = 10,
                 metric = 'ROC')
varImp(rfModel)
## rf variable importance
## 
##   only 20 most important variables shown (out of 22)
## 
##                   Overall
## TitleMr           100.000
## Sexmale            86.550
## Fare               53.855
## FarePw3            52.857
## AgePw2             52.155
## FarePw2            41.824
## Pclass3            39.367
## AgePw3             38.098
## Age                37.819
## SibSp              16.540
## TitleRare Title     6.250
## FamilySizeMedium    6.069
## EmbarkedS           5.738
## TitleMaster         4.524
## Parch               4.232
## EmbarkedC           3.392
## Pclass2             2.718
## EmbarkedQ           2.418
## FamilySizeSmall     2.353
## youngTravelAlTRUE   1.998

Variable Selection: From these results, we can see a few things:

  • Sex is confirmed to be a great feature, but it might also be highly correlated with the Title feature.
  • All the other variables we chose proved to be useful in the random forest model, so we did a good job selecting and extracting variables.

Neural Networks

Lastly, we also use the classic neural network, one of the most powerful algorithms today. I tried two variants: the classic feed-forward network and a more elaborate one, where we apply the neural network to the Principal Components of the dataset, which can make the classes easier to separate in the transformed feature space.

nnGrid <- expand.grid(.size = c(1,2,3,4,5,6,7),
                      .decay = c(0, .01, .1, .2, .3, .4, .5, 1, 2))

nnModel <- train(Survived ~ ., data = training, trControl = train_control,
                 method = 'nnet', tuneGrid = nnGrid,
                 metric = 'ROC', trace = FALSE)

pcaNNModel <- train(Survived ~ ., data = training, trControl = train_control,
                 method = 'pcaNNet', tuneGrid = nnGrid,
                 metric = 'ROC', trace = FALSE)

Models Comparison

Sensitivity and Specificity: Before visualizing the models’ performance, it is better to understand the different metrics we are dealing with. Suppose we were dealing with an imbalanced class problem, say a 1:10 imbalance for class 1. If we ‘trained’ a model to always predict 0, we would reach 90% accuracy! That looks good numerically, but practically it is not. So we approach the problem with the following metrics:

  • Sensitivity: Also called the true positive rate, it measures the proportion of positives that are correctly identified as such; in this case, the power to classify a person who survived as a survivor.
  • Specificity: Also called the true negative rate, it measures the proportion of negatives that are correctly identified as such; in this case, the power to classify a person who died as such.

  • Receiver Operating Characteristic (ROC): The third metric, yes there is a third, combines the two above. It is a graphical plot that shows how sensitivity and 1 - specificity (the false positive rate) vary as we vary the probability threshold output by the models. In the models created, the ROC value returned is actually the Area Under the Curve (AUC) metric. A short illustrative computation follows after the figure below.

ROC Curve Example
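For a concrete, if optimistic, illustration of sensitivity and specificity (a sketch only, reusing the training data rather than held-out data; it is not part of the submitted pipeline), caret’s confusionMatrix can compute both from a set of predictions:

# Illustration only: confusion matrix on the training data (optimistic estimate).
# 'N' is the first factor level, so it is treated as the positive/event class,
# matching how sensitivity and specificity are reported for the models above.
trainPreds <- predict(logRegModel, newdata = training)
confusionMatrix(trainPreds, training$Survived, positive = 'N')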

After training the different models, we want an idea of how each model performed against the others. We can compare the metrics each model gave us, but we also want to visualize how stable each metric is, which we can do through the standard deviation of the metrics across the cross-validation folds. A small standard deviation indicates that, for new unseen data, the model would probably produce a metric close to what we saw in the cross validation, whereas a model with a high standard deviation could produce a wider range of results. Let’s take a look at the metrics and their standard deviations:

results <- resamples(list(logistic = logRegModel, nb = nbModel, 
                        rf = rfModel, nn = nnModel, pcann = pcaNNModel))

bwplot(results)

dotplot(results)

After seeing these results, I realized the configuration of ‘success’ in the dataset was inverted: sensitivity ended up being the ability to classify an example as ‘N’ when that was the case, and specificity the opposite. Considering this inversion, we got an expected response: all of the models had a better capacity to detect a true ‘N’ than a true ‘Y’, because of the class imbalance.
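If we preferred sensitivity and specificity to refer to survival (‘Y’) directly, one option (a sketch only; the results above were not recomputed this way) would be to reorder the factor levels before training, since caret’s twoClassSummary treats the first level as the event of interest:

# Hypothetical re-run: make 'Y' the first factor level so that survival is
# treated as the event class by twoClassSummary (not applied in this report)
training$Survived <- factor(training$Survived, levels = c('Y', 'N'))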

The plots show the random forest and the PCA neural network with the best AUC, but also with the highest standard deviation. To decide between the two, we would probably need more data.

A curious result shows Naive Bayes as the best model at correctly identifying whether a person survived. Considering the class imbalance, and depending on the business objectives, it could be the chosen model to detect the rarer event, i.e. survival.

Powering Up - Ensembles

A last attempt to improve performance is to aggregate the models’ probabilistic outputs and serve them as input to another machine learning model; this is called stacking. For this, we collect all the models’ outputs, concatenate them into a new dataframe and train a new model using these probabilities as input. Below, I stacked the probabilities into two new models, a logistic regression and a random forest. Let’s compare how these stacked models perform against the previous ones.

models <- list(logistic = logRegModel, nb = nbModel, 
     rf = rfModel, nn = nnModel, pcann = pcaNNModel)
models_preds <- lapply(models, predict, newdata = training, type = 'prob')
models_probs <- as.data.frame(sapply(models_preds, function(df){1 - df$N}))
models_probs$Survived <- training$Survived


stack.glm <- train(Survived ~ ., data = models_probs,
                  trControl = train_control, metric = 'ROC', 
                  method = 'glm', family = binomial())

stack.rf <- train(Survived ~ ., data = models_probs,
                  trControl = train_control, method = 'rf',
                  metric = 'ROC')

finalModels <- resamples(list(logistic = logRegModel, nb = nbModel, 
               rf = rfModel, nn = nnModel, pcann = pcaNNModel,
               stack.glm = stack.glm, stack.rf = stack.rf))
bwplot(finalModels)

dotplot(finalModels)

The results are promising. The two stacked models seem to outperform the simple models in all three metrics, and with a lower standard deviation. Of course, with this small amount of data, they might not generalize as well as they would if we had more data.

Generate Test Scores for each Model

In this last part, we calculate the predictions for the testing dataset in order to submit them to Kaggle and evaluate the score.

testing <- dataset[dataset$mode == 'Testing',]
testing$mode <- NULL
testing$Survived <- NULL

logRegModelTest <- data.frame(PassengerId = 892:1309,
                              Survived = ifelse(predict(logRegModel, testing) == 'Y',1,0))
nbModelTest <- data.frame(PassengerId = 892:1309,
                              Survived = ifelse(predict(nbModel, testing) == 'Y',1,0))
rfModelTest <- data.frame(PassengerId = 892:1309,
                              Survived = ifelse(predict(rfModel, testing) == 'Y',1,0))
nnModelTest <- data.frame(PassengerId = 892:1309,
                              Survived = ifelse(predict(nnModel, testing) == 'Y',1,0))
pcaNNModelTest <- data.frame(PassengerId = 892:1309,
                              Survived = ifelse(predict(pcaNNModel, testing) == 'Y',1,0))
write.csv(logRegModelTest, 'logRegModel.csv',row.names = FALSE)
write.csv(nbModelTest, 'nbModel.csv',row.names = FALSE)
write.csv(rfModelTest, 'rfModel.csv',row.names = FALSE)
write.csv(nnModelTest, 'nnModel.csv',row.names = FALSE)
write.csv(pcaNNModelTest, 'pcaNNModel.csv',row.names = FALSE)

models <- list(logistic = logRegModel, nb = nbModel, 
               rf = rfModel, nn = nnModel, pcann = pcaNNModel)
models_preds <- lapply(models, predict, newdata = testing, type = 'prob')
models_probs <- as.data.frame(sapply(models_preds, function(df){1 - df$N}))

stack.glm_PredTest <- data.frame(PassengerId = 892:1309,
                              Survived = ifelse(predict(stack.glm, models_probs) == 'Y',1,0))
stack.rf_PredTest <- data.frame(PassengerId = 892:1309,
                              Survived = ifelse(predict(stack.rf, models_probs) == 'Y',1,0))
write.csv(stack.glm_PredTest, 'stack_glm_Pred.csv',row.names = FALSE)
write.csv(stack.rf_PredTest, 'stack_rf_Pred.csv',row.names = FALSE)

Conclusion

The test results can be seen below. The feed-forward neural network had the best performance, with ~78% accuracy. The stacked models did not do as well as imagined, but my guess is that with more data these models would outperform the others.

results <- data.frame(LogRegr = 0.77511, NaiveBayes = 0.77033, NeuralNet = 0.77990, pcaNNet = 0.77511, StackGLM = 0.7272, StackRF = 0.77511)
row.names(results) <- 'Accuracy'
kable(results, caption = 'Test Results')
Test Results

          LogRegr  NaiveBayes  NeuralNet  pcaNNet  StackGLM  StackRF
Accuracy  0.77511  0.77033     0.77990    0.77511  0.72720   0.77511

Next Directions

Of course we could try to create new features or fine-tune our models, but comparing the training performance (not shown) with the cross-validation results shows a little overfitting, i.e. the training scores were considerably higher than the validation scores. So my next steps would be:

0 - I would consider obtaining more data. Normally these non-linear models do better with a large amount of data but, as we are definitely limited to the data provided by the challenge, this is not possible.

1 - Try regularized models: regularization decreases the model’s variance a little, aiming to reduce the overfitting we detected. For the models we tried, we could use regularized logistic regression, limit tree depth in the random forest or use extremely randomized trees [8], and add weight regularization and dropout [9] to the neural networks. A small sketch of the first option follows below.
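As an illustration of that first direction, a minimal sketch, not run in this report and with arbitrary alpha/lambda candidates, of an elastic-net regularized logistic regression using caret’s glmnet method and the same training control:

# Sketch: elastic-net regularized logistic regression (requires the glmnet package).
# alpha mixes ridge (0) and lasso (1); lambda controls the penalty strength.
glmnetGrid <- expand.grid(.alpha = c(0, 0.5, 1),
                          .lambda = 10 ^ seq(-4, 0, length.out = 10))
regLogModel <- train(Survived ~ ., data = training, trControl = train_control,
                     method = 'glmnet', tuneGrid = glmnetGrid, metric = 'ROC')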

2 - Fine-tune the existing algorithms: we could take the class imbalance into account and deal with it explicitly, for example with class-specific error penalties or by resampling inside the training folds (see the sketch below). But, in my opinion, with this amount of data this step would not add much.
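For the resampling idea, recent versions of caret expose a sampling argument in trainControl that resamples inside each cross-validation fold; a sketch (not run here) of down-sampling the majority class:

# Sketch: down-sample the majority class within each CV fold (not run in this report)
train_control_down <- trainControl(method = 'cv', number = 10,
                                   classProbs = TRUE,
                                   summaryFunction = twoClassSummary,
                                   sampling = 'down')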

References

[1] - https://www.kaggle.com/c/titanic

[2] - https://cran.r-project.org/web/packages/mice/mice.pdf

[3] - LeCun, Yann, et al., 1998 - Efficient BackProp

[4] - Moreira, Gabriel, 2017 - Feature Engineering: Getting the Most out of Data for Predictive Models (presentation)

[5] - Chatterjee, Samprit; Hadi, Ali, 2006 - Regression Analysis by Example, 4th Edition

[6] - Witten, Ian, et al., 2016 - Data Mining: Practical Machine Learning Tools and Techniques, 4th Edition

[7] - Kuhn, Max, 2008 - Building Predictive Models in R Using the caret Package

[8] - Geurts, Pierre; Ernst, Damien; Wehenkel, Louis, 2006 - Extremely Randomized Trees

[9] - Srivastava, Nitish; Hinton, Geoffrey, et al., 2014 - Dropout: A Simple Way to Prevent Neural Networks from Overfitting
