Comparing Modeling Techniques with Kaggle’s Titanic Data

Loading and Cleaning the data

The first step is loading the data and inspecting it to see what needs to be cleaned up and rangled. Kaggle provides a description of the data here

titanic_train <- read_csv("data/train.csv")
titanic_submission <- read_csv("data/test.csv")

Inspecting the data we can see that generally it looks pretty good, but there are some areas of concern. The first is that a lot of the data on cabins is missing. In fact we only have cabin data on about 23% of our records.We also notice that we have a significant amount of records where age is missing.

It looks like there is a correlation between our dependent variable and whether or not we have cabin data on the person. We will want to make sure we include these records and handle them in a way that takes advantage of that correlation.

Name titanic_train
Number of rows 891
Number of columns 12
_______________________
Column type frequency:
character 5
numeric 7
________________________
Group variables None

Data summary

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Name 0 1.00 12 82 0 891 0
Sex 0 1.00 4 6 0 2 0
Ticket 0 1.00 3 18 0 681 0
Cabin 687 0.23 1 15 0 147 0
Embarked 2 1.00 1 1 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
PassengerId 0 1.0 446.00 257.35 1.00 223.50 446.00 668.5 891.00 ▇▇▇▇▇
Survived 0 1.0 0.38 0.49 0.00 0.00 0.00 1.0 1.00 ▇▁▁▁▅
Pclass 0 1.0 2.31 0.84 1.00 2.00 3.00 3.0 3.00 ▃▁▃▁▇
Age 177 0.8 29.70 14.53 0.42 20.12 28.00 38.0 80.00 ▂▇▅▂▁
SibSp 0 1.0 0.52 1.10 0.00 0.00 0.00 1.0 8.00 ▇▁▁▁▁
Parch 0 1.0 0.38 0.81 0.00 0.00 0.00 0.0 6.00 ▇▁▁▁▁
Fare 0 1.0 32.20 49.69 0.00 7.91 14.45 31.0 512.33 ▇▁▁▁▁
## `summarise()` ungrouping output (override with `.groups` argument)

Looking at Age, we can see that there also seems to be some correlation with wheter or not the passenger survived, albeit weaker than the cabin data. We will likely want to make sure not to exclude individuals with NA in their age field.

## `summarise()` ungrouping output (override with `.groups` argument)

Embarked is the only other field with missing values and it only has two. There should be no significant effect on the model if we keep them.

We will begin with the cabin data. We will convert all the NA’s in this column to “unknown”. Because the submission data set will be the data the model is used on, any changes made to the train data should also be made on the submission data.

sum(is.na(titanic_train$Cabin))
## [1] 687
titanic_train <- titanic_train %>%
  mutate(Cabin = replace_na(Cabin, "unknown"))
sum(is.na(titanic_train$Cabin))
## [1] 0
titanic_submission <- titanic_submission %>%
  mutate(Cabin = replace_na(Cabin, "unknown"))

Now that the NAs in Cabin are taken care of, we want to take a closer look at the cabin data to see if there is any additional information we can glean from it. Not including unknown, there are 147 unique values for Cabin and they all seem to follow the pattern of having a letter followed by a number.

length(unique(titanic_train$Cabin))
## [1] 148
head(titanic_train$Cabin, 25)
##  [1] "unknown" "C85"     "unknown" "C123"    "unknown" "unknown" "E46"    
##  [8] "unknown" "unknown" "unknown" "G6"      "C103"    "unknown" "unknown"
## [15] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
## [22] "D56"     "unknown" "A6"      "unknown"

We learn from this link that the letter at the beginning indicates the deck. We will follow the work from this link and use this to create a new column indicating the deck of the passenger. First we want to check and make sure that the passengers assigned to multiple cabins are all on the same deck. we can confirm that they are. This link helps explain some of the more confusing values.

titanic_train <- titanic_train %>%
  mutate(deck = str_sub(Cabin, end = 1))
titanic_submission <- titanic_submission %>%
  mutate(deck = str_sub(Cabin, end = 1))

We could assume it’s obvious that any “u” stands for “unknown”, but we want it to be more clear than that so we change them back to “unknown”.

titanic_train <- titanic_train %>%
  mutate(deck = recode(deck, "u" = "unknown"))

titanic_submission <- titanic_submission %>%
  mutate(deck = recode(deck, "u" = "unknown"))

Now we move on to deal with the NA values in age. There doesn’t seem to be a significant difference in age between genders, although the age for men seems to be slightly more skewed to the left than for women. We do have two variables that may help indicate age a little more though. SibSp indicates how many siblings or spouses the passenger has on board, while Parch indicates how many parents or children the passenger has on board.

The Parch seems to give some pretty good insight into age at around 4 and above. Th SibSp seems to be a little more telling as mores of the time in the Anglo world would make if very unlikely for anyone to be traveling with more than one spouse, but children would could easily be traveling with more than one sibling.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

With this knowledge we will impute age for missing values based on the median value by SibSp. Because we are adding values for age rather than marking them unknown, we will create a new column to keep track of which records had age and which were missing age.

titanic_train <- titanic_train %>%
  group_by(SibSp) %>%
  mutate(age_original = !is.na(Age),
         Age = ifelse(is.na(Age), round(median(Age, na.rm = TRUE)), Age)) %>%
  ungroup()

titanic_submission <- titanic_submission %>%
  group_by(SibSp) %>%
  mutate(age_original = !is.na(Age),
         Age = ifelse(is.na(Age), 
                      ifelse(SibSp == 0, 29,
                             ifelse(SibSp == 1, 30,
                                    ifelse(SibSp == 2, 23,
                                           ifelse(SibSp == 3, 10,
                                                  ifelse(SibSp == 4, 6,
                                                         ifelse(SibSp == 5, 11,
                                                                Age)))))), Age))%>%
  ungroup()

unfortunately there are in the train data set we all of the records where SibSp is 8 have NA for age, so we can’t calculate a median age for them. They all seem to be from the same family. This is kind of a real tossup as far as what to do. It doesn’t really make sense to assign them all to one age because we know they are from the same family, but at the same time, we can’t assume that all future passengers in the submission set will be from the same family, so we need to have a consistent policy. The decision has been made to just impute those NA values with the same median age for SibSp == 5, which is 11.

titanic_train <- titanic_train %>%
  mutate(Age = ifelse(is.na(Age) & SibSp == 8, 11, Age))
  
titanic_submission <- titanic_submission %>%
  mutate(Age = ifelse(is.na(Age) & SibSp == 8, 11, Age))

Finally we will replace NA in embarked with “unknown” and since we’re coming to the end of cleaning up our data we will also use the janitor package to clean the names. We will also convert some of the character variables into factor variables.

titanic_train <- titanic_train %>%
  mutate(Embarked = replace_na(Embarked, "unknown"))

titanic_submission <- titanic_submission %>%
  mutate(Embarked = replace_na(Embarked, "unknown"))

titanic_train <- janitor::clean_names(titanic_train) %>%
  mutate(sex = factor(sex),
         cabin = factor(cabin),
         embarked = factor(embarked),
         deck = factor(deck),
         age_original = factor(age_original),
         survived = factor(survived))

titanic_submission <- janitor::clean_names(titanic_submission) %>%
  mutate(sex = factor(sex),
         cabin = factor(cabin),
         embarked = factor(embarked),
         deck = factor(deck),
         age_original = factor(age_original))

Now we have a value for every variable and every record, Even if that value is only there to indicate unknown. Now let’s take a look at the relationships between the Survived variable (our dependent variable) and all other numeric variables. We can see that the only numeric variable Survived seems to have any significant relationship with is the Pclass variable.

Looking at the relationship with some of the non-numeric variables we can see that there is certainly a strong relationship beween Sex and Survived and there also seem to be relationships with Embarked and deck as well.

Now that we have a pretty good looking dataframe we can remove a few of the variables we won’t be using in the models. Name, Ticket, and PassengerID all seem to be forms of unique identifiers that probably won’t be as helpful in these models. In one of the links above, another model builder used Name to create a new varibale called title, but we will not be doing that.

titanic_train <- titanic_train %>%
  select(-name, -ticket, -passenger_id, -cabin)

titanic_submission <- titanic_submission %>%
  select(-name, -ticket, -cabin)

Building Models

The first thing we’re going to want to do is split the train data set so that we have a test set to test our models on before making predictions on the data to be submitted.

set.seed(3)

ttnc_smp_size <- floor(0.8 * nrow(titanic_train))

ttnc_train_ind <- sample(seq_len(nrow(titanic_train)), size = ttnc_smp_size)

ttnc_model_train <- titanic_train[ttnc_train_ind, ]
ttnc_model_test <- titanic_train[-ttnc_train_ind, ]

Logistic Model

The first model we will build is a logistic model. This generalized linear model may be one of the simplest models for classification, but it will likely still be powerful and provide a good baseline for us to compare other models to.

We train the model using caret and calli summary on the model to see how it turned out. Looking at this the model seems to show that sex, pclass, and age are all variables with very significant coefficients. the coefficients for sib_sp and passengers who embarked out of Southampton both seem to be signficant at the 0.01 level.

glm_pred <- train(survived ~ ., ttnc_model_train, method = "glm",
                  family = "binomial")

summary(glm_pred)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3387  -0.5971  -0.4064   0.6265   2.3552  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        4.755849   0.868458   5.476 4.35e-08 ***
## pclass            -0.854241   0.199759  -4.276 1.90e-05 ***
## sexmale           -2.611924   0.223529 -11.685  < 2e-16 ***
## age               -0.036316   0.008951  -4.057 4.97e-05 ***
## sib_sp            -0.296231   0.119302  -2.483   0.0130 *  
## parch             -0.172056   0.130103  -1.322   0.1860    
## fare               0.002478   0.003397   0.729   0.4658    
## embarkedQ         -0.047436   0.429969  -0.110   0.9122    
## embarkedS         -0.611588   0.275879  -2.217   0.0266 *  
## embarkedunknown   11.547077 612.197238   0.019   0.9850    
## deckB              0.396844   0.824102   0.482   0.6301    
## deckC             -0.676247   0.746859  -0.905   0.3652    
## deckD              0.341089   0.845597   0.403   0.6867    
## deckE              0.893621   0.843845   1.059   0.2896    
## deckF              0.605808   1.119443   0.541   0.5884    
## deckG             -1.956018   1.473134  -1.328   0.1842    
## deckT            -14.089307 882.743609  -0.016   0.9873    
## deckunknown       -0.455669   0.699316  -0.652   0.5147    
## age_originalTRUE   0.391413   0.280725   1.394   0.1632    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 950.78  on 711  degrees of freedom
## Residual deviance: 630.85  on 693  degrees of freedom
## AIC: 668.85
## 
## Number of Fisher Scoring iterations: 13

Now we check to see how this first model does on new data. We get the predicited values for the new data and see how often they are correct. Looks like this model predicts correctly about 82% of the time on the new data! However, this isnt the best measure. We could, for get above a 50% correct prediction rate if we just guessed everyone did not survive. Because of this we’ll want to plot an ROC curve and calculate the area under the curve. A 1 would mean a perfect model predicitng correctly every time. We have an AUC of 0.8957!! Our baseline model seems to be very strong. We’ll have to see if we can improve on this model at all.

ttnc_preds <- predict(glm_pred, ttnc_model_test)

mean(ttnc_model_test$survived == ttnc_preds)
## [1] 0.8212291
ttnc_preds <- predict(glm_pred, ttnc_model_test, type = "prob")

log_ROC <- roc(ttnc_model_test$survived, ttnc_preds$`1`)
## Setting levels: control = 0, case = 1

## Setting direction: controls < cases
plot(log_ROC, col = "blue")

log_auc <- auc(log_ROC)
log_auc
## Area under the curve: 0.8957

Naive Bayes Model

The next model we’ll use to try and predicted survival on the titanic is Naive Bayes.

nb_predict <- train(survived ~ ., ttnc_model_train, method = "naive_bayes",
                    tuneGrid = data.frame(laplace = 0, usekernel = TRUE, 
                                          adjust = TRUE))

After training the model we check it’s results on new data. Not only is this model less accurate than the logisitic model on the new data, but it also has a lower AUC. Looks like the Naive Bayes model doesn’t perform as well on new data as the Logistic model.

ttnc_preds <- predict(nb_predict, ttnc_model_test)

mean(ttnc_model_test$survived == ttnc_preds)
## [1] 0.6648045
ttnc_preds <- predict(nb_predict, ttnc_model_test, type = "prob")

nb_ROC <- roc(ttnc_model_test$survived, ttnc_preds$`1`)
## Setting levels: control = 0, case = 1

## Setting direction: controls < cases
plot(nb_ROC, col = "blue")

nb_auc <- auc(nb_ROC)
nb_auc
## Area under the curve: 0.8439

Random Forest Model

For our third model we will create a random forest model using the ranger package. In order to include the probabilites using this model we will need to make some changes to the coding of the survived variable.

ttnc_model_train <- ttnc_model_train %>%
  mutate(survived = recode(survived, `0` = "didnt_survive",
                           `1` = "survived"))
rf_predict <- train(survived ~ ., ttnc_model_train, method = "ranger",
                    tuneLength = 15, 
                    trControl = trainControl(classProbs = TRUE))

Again, we check the model’s performance on new data. This model looks really good! Outperforming even our logisitic regression score in both accuracy and auc. Where approaching 84% accuracy and an auc of 0.9279. That’s 0.074 better than our logisitic model’s auc.

ttnc_model_test <- ttnc_model_test %>%
  mutate(survived = recode(survived, `0` = "didnt_survive",
                           `1` = "survived"))

ttnc_preds <- predict(rf_predict, ttnc_model_test)

mean(ttnc_model_test$survived == ttnc_preds)
## [1] 0.8435754
ttnc_preds <- predict(rf_predict, ttnc_model_test, type= "prob")

rf_ROC <- roc(ttnc_model_test$survived, ttnc_preds$survived)
## Setting levels: control = didnt_survive, case = survived

## Setting direction: controls < cases
plot(rf_ROC, col = "blue")

rf_auc <- auc(rf_ROC)
rf_auc
## Area under the curve: 0.9217

Gradient Boost Model

Finally, we build a gradient boost model.

Looking at the results of this final model on new data we see that it performs very well, but not quite as well as our random forest model. It’s accuracy is 83% and auc is 0.9114. It looks like the model we will be using on our submission data will be the random forest model.

ttnc_model_test <- ttnc_model_test %>%
  mutate(survived = recode(survived, `0` = "didnt_survive",
                           `1` = "survived"))

ttnc_preds <- predict(gbm_predict, ttnc_model_test)

mean(ttnc_model_test$survived == ttnc_preds)
## [1] 0.8324022
ttnc_preds <- predict(gbm_predict, ttnc_model_test, type= "prob")

gbm_ROC <- roc(ttnc_model_test$survived, ttnc_preds$survived)
## Setting levels: control = didnt_survive, case = survived

## Setting direction: controls < cases
plot(gbm_ROC, col = "blue")

gbm_auc <- auc(gbm_ROC)
gbm_auc
## Area under the curve: 0.9059

Building our final model and submission

So now that we know our random forest model performed the best on new data we will be using that model on our submission data. First, we need to train it on all of our training data. Because we used a tunelength to have caret use multiple setting for the random forest model we need to inspect it to see what its final setting were. We see that the setting were mtry = 4, splitule = gini and min.node.size = 1. We set these in the tuneGrid in caret and retrain this model on the entire dataset we have.

rf_predict
## Random Forest 
## 
## 712 samples
##   9 predictor
##   2 classes: 'didnt_survive', 'survived' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 712, 712, 712, 712, 712, 712, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa    
##    2    gini        0.8000553  0.5625751
##    2    extratrees  0.7950035  0.5496135
##    3    gini        0.8091946  0.5874376
##    3    extratrees  0.8047130  0.5755170
##    4    gini        0.8139886  0.6002962
##    4    extratrees  0.8034914  0.5746161
##    5    gini        0.8115231  0.5966845
##    5    extratrees  0.8032077  0.5756824
##    6    gini        0.8098023  0.5933574
##    6    extratrees  0.8038111  0.5785518
##    7    gini        0.8065610  0.5874703
##    7    extratrees  0.8047132  0.5815899
##    8    gini        0.8020022  0.5788444
##    8    extratrees  0.8016775  0.5757492
##   10    gini        0.7915662  0.5580663
##   10    extratrees  0.7979620  0.5688669
##   11    gini        0.7884934  0.5522858
##   11    extratrees  0.7954896  0.5643153
##   12    gini        0.7858531  0.5473523
##   12    extratrees  0.7922505  0.5580490
##   13    gini        0.7847093  0.5448121
##   13    extratrees  0.7896716  0.5531682
##   14    gini        0.7820927  0.5398599
##   14    extratrees  0.7863984  0.5466854
##   15    gini        0.7832345  0.5422270
##   15    extratrees  0.7852720  0.5449786
##   16    gini        0.7820115  0.5395912
##   16    extratrees  0.7842289  0.5429885
##   18    gini        0.7787147  0.5327663
##   18    extratrees  0.7805860  0.5352416
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, splitrule = gini
##  and min.node.size = 1.
rf_predict <- train(survived ~ ., titanic_train, method = "ranger",
                    tuneGrid = data.frame(mtry = 4, splitrule = "gini",
                                          min.node.size = 1))

Now that we have the model trained we will use it to make predictions on the final submission data. We find that while there were no NAs for fare in the train dataset, the submission dataset has one value where fare is NA. For this we impute it with the median fare before running predict.

titanic_submission <- titanic_submission %>%
  mutate(fare = replace_na(fare, median(fare, na.rm = TRUE)))

prediction <- predict(rf_predict, titanic_submission)

titanic_submission <- titanic_submission %>%
  bind_cols(prediction = prediction) %>%
  mutate(PassengerId = passenger_id, Survived = prediction) %>%
  select(PassengerId, Survived) 

After submitting these prediction my accuracy was 0.75598, which ranked me at 19,516 on the leaderboard for this contest on Kaggle.com. Thank you very much for view my work and I look forward to any feedback or advice anyone would like to provide.