This assignment uses the appended train and test data sets to build a random forest model that predicts the survival of Titanic passengers.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(e1071)
library(ROCR)
library(partykit) 
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(ggplot2)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ggthemes)

  1. First, we need to import and prepare the data set.
train <- read.csv("titanic_train.csv")
rmarkdown::paged_table(train)

The train data has 891 observations and 12 columns. We will use this data to build the models later.

test <- read.csv("titanic_test.csv")
rmarkdown::paged_table(test)

The test data has 418 rows and 11 columns. Comparing the two, the train data has 12 columns while the test data has 11: the difference is that the train data contains the Survived variable and the test data does not. The Survived labels for the test data are provided in gender_submission.csv.

survive <- read.csv("gender_submission.csv")
rmarkdown::paged_table(survive)

To make it easier to work with, let's join the test data and survive, so that we only deal with two data sets.

test1 <- cbind(test, Survived = survive$Survived)
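
As a side note, matching rows by key would be more robust than cbind if the two files were ever reordered. A minimal sketch with dplyr, assuming both tables carry PassengerId (the name test1_joined is just illustrative):

# Equivalent join keyed on PassengerId, robust to row-order changes
test1_joined <- test %>%
    left_join(survive, by = "PassengerId")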

  2. Second, we need to clean the data.

colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

From the train data set we find 177 missing values in the Age column, around 19% of the data. Instead of removing those rows, let's replace the missing ages with the mean age.

train_clean <- train %>%
    mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age))
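
As an aside, a sketch of a gentler alternative: impute Age with the median within each Pclass and Sex group, which respects the skewed Age distribution (train_clean2 is an illustrative name, not used later):

# Illustrative alternative: group-wise median imputation by Pclass and Sex
train_clean2 <- train %>%
    group_by(Pclass, Sex) %>%
    mutate(Age = if_else(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
    ungroup()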

colSums(is.na(train_clean))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Then, we do the same check for missing values in the test data.

colSums(is.na(test))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          86           0 
##       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           1           0           0

The test data has 86 missing values in Age and one in Fare. To keep NA values from affecting the prediction results, we will drop the rows that contain them.

test_clean <- test1 %>% 
    na.omit()
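
Dropping rows discards those test observations. A hedged alternative, sketched below, is to impute the test set with statistics learned on the train set, which keeps every row and avoids leaking test information (test_imputed is an illustrative name, not used later):

# Illustrative alternative: fill test NAs with train-set statistics
test_imputed <- test1 %>%
    mutate(Age  = if_else(is.na(Age),  mean(train$Age, na.rm = TRUE), Age),
           Fare = if_else(is.na(Fare), median(train$Fare, na.rm = TRUE), Fare))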
  3. Next, we need to do exploratory analysis and data processing.

In this case, Survived is the target variable, and some variables do not contain much information for modeling. Since our objective is to figure out which features influence survival, we will dig into the data to explore the relationship between each attribute and survival.

Sex vs. Survival

# create a bar chart to show the effect of Sex on survival
ggplot(train, aes(Sex, fill = factor(Survived))) +
    geom_bar()

# calculate survival rate
tapply(train$Survived,train$Sex,mean)
##    female      male 
## 0.7420382 0.1889081
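
The same rates can also be computed with dplyr, shown here as an equivalent formulation:

# Equivalent survival rates per Sex using dplyr
train %>%
    group_by(Sex) %>%
    summarise(survival_rate = mean(Survived))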

From the table and chart above we can tell that the survival rate of females (0.74) is greater than that of males (0.19).

Pclass vs. Survival

# make a bar chart
ggplot(train, aes(Pclass, fill = factor(Survived))) +
    geom_bar()

# calculate survival rate
tapply(train$Survived,train$Pclass,mean)
##         1         2         3 
## 0.6296296 0.4728261 0.2423625

From the table and chart above we can tell that the survival rate of Pclass = 1 (0.63) is greater than that of the other classes.

Fare vs. Survival

# make a histogram
ggplot(train, aes(Fare,fill = factor(Survived))) +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We notice that passengers whose fare is below 50 have a relatively lower survival rate, while passengers whose fare is extremely high (500-550) have a very high survival rate.
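
Because Fare is heavily right-skewed, a log-style axis makes this pattern easier to see. A minimal sketch (the log1p transform keeps the zero fares plottable):

# Boxplot of Fare by survival on a log1p scale to tame the skew
ggplot(train, aes(factor(Survived), Fare)) +
    geom_boxplot() +
    scale_y_continuous(trans = "log1p")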

Now, we can select the predictor variables and convert the data types.

data_train <- train_clean %>% 
          select(-c(PassengerId, Name, Ticket, Cabin)) %>% 
          mutate(Survived = as.factor(Survived),
           Pclass = as.factor(Pclass),
           Sex = as.factor(Sex),
           SibSp = as.factor(SibSp),
           Parch = as.factor(Parch),
           Embarked = as.factor(Embarked))

data_test <- test_clean %>% 
          select(-c(PassengerId, Name, Ticket, Cabin)) %>% 
          mutate(Survived = as.factor(Survived),
           Pclass = as.factor(Pclass),
           Sex = as.factor(Sex),
           SibSp = as.factor(SibSp),
           Parch = as.factor(Parch),
           Embarked = as.factor(Embarked))
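
One caveat, sketched below as a guard rather than as part of the original pipeline: because train and test are converted to factors independently, their levels can drift apart, and tree models will refuse to predict on levels unseen in training. Re-leveling the test factors against the train levels prevents this:

# Align test factor levels with the train levels (illustrative guard)
for (col in c("Pclass", "Sex", "SibSp", "Parch", "Embarked")) {
    data_test[[col]] <- factor(data_test[[col]], levels = levels(data_train[[col]]))
}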

str(data_train)
## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Before we continue to modeling, let’s check the proportion of our data in class target.

prop.table(table(data_train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

So we can see that in the train data around 61.6% of passengers did not survive. Let's do down-sampling to balance the proportions of the target variable.

set.seed(267)
data_trained <- downSample(x = data_train[, -1], y = as.factor(data_train[, 1]), yname = "Survived")
rmarkdown::paged_table(data_trained)
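
caret also offers the mirror-image approach: as a sketch, upSample() balances the classes by duplicating minority-class rows instead of discarding majority-class rows (left commented out since we proceed with the down-sampled data):

# Illustrative alternative: up-sample the minority class instead
# data_trained_up <- upSample(x = data_train[, -1],
#                             y = data_train$Survived, yname = "Survived")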

Let’s check the proportion again in the trained data.

prop.table(table(data_trained$Survived))
## 
##   0   1 
## 0.5 0.5
  4. Modeling

Decision Tree
model_dt <- ctree(Survived~ .,data_trained)
plot(model_dt, type="simple")

From the decision tree model we can get some insights: (a) Sex is the root node, the variable that matters most when determining the target value; (b) Pclass, Fare, and Age are interior nodes, used when the first branch is not sufficient to determine the target; (c) the rest are terminal nodes, which carry the predicted class.

model_dt
## 
## Model formula:
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
## Fitted party:
## [1] root
## |   [2] Sex in female
## |   |   [3] Pclass in 1, 2: 1 (n = 165, err = 2.4%)
## |   |   [4] Pclass in 3
## |   |   |   [5] Fare <= 23.25: 1 (n = 104, err = 33.7%)
## |   |   |   [6] Fare > 23.25: 0 (n = 19, err = 15.8%)
## |   [7] Sex in male
## |   |   [8] Pclass in 1: 0 (n = 91, err = 49.5%)
## |   |   [9] Pclass in 2, 3
## |   |   |   [10] Age <= 12
## |   |   |   |   [11] SibSp in 0, 1, 2: 1 (n = 17, err = 0.0%)
## |   |   |   |   [12] SibSp in 3, 4, 5: 0 (n = 9, err = 11.1%)
## |   |   |   [13] Age > 12: 0 (n = 279, err = 16.5%)
## 
## Number of inner nodes:    6
## Number of terminal nodes: 7

From the output above, it is also clear that Sex, Pclass, Fare, and Age are the major variables used for classification.

#Predict Data
pred_test_dt <- predict(model_dt, newdata = data_test, 
                          type = "response")
head(pred_test_dt)
## 1 2 3 4 5 6 
## 0 1 0 0 1 0 
## Levels: 0 1
#Model evaluation by using confusion matrix.
mat1 <- confusionMatrix(data = pred_test_dt, reference = data_test$Survived, positive = "1")
mat1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 195   2
##          1   9 125
##                                           
##                Accuracy : 0.9668          
##                  95% CI : (0.9413, 0.9833)
##     No Information Rate : 0.6163          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9305          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.9843          
##             Specificity : 0.9559          
##          Pos Pred Value : 0.9328          
##          Neg Pred Value : 0.9898          
##              Prevalence : 0.3837          
##          Detection Rate : 0.3776          
##    Detection Prevalence : 0.4048          
##       Balanced Accuracy : 0.9701          
##                                           
##        'Positive' Class : 1               
## 
# Model Evaluation by ROC
prob_survive_dt <- predict(model_dt, data_test, type = "prob")

data_roc1 <- data.frame(prob = prob_survive_dt[,2], # probability of positive class(survived)
                       labels = as.numeric(data_test$Survived == "1")) #get the label as the test data who survived

dt_roc <- ROCR::prediction(data_roc1$prob, data_roc1$labels) 

# ROC curve
plot(performance(dt_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
     main = "ROC")
abline(a = 0, b = 1)

# Model Evaluation by AUC
dt_auc <- performance(dt_roc, measure = "auc")
dt_auc@y.values
## [[1]]
## [1] 0.9410607

Random Forest

Before fitting the model, let's check whether any of the predictor columns have near-zero variance and could be dropped.

nearZeroVar(data_trained)
## integer(0)

As there are no near-zero variance predictors, we can fit a random forest model.

set.seed(267)

ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3) # k-fold cross validation
survive_forest <- train(Survived ~ ., data = data_trained, method = "rf", trControl = ctrl)
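
By default, caret searches a small grid of mtry values. As a hedged sketch, the grid could also be specified explicitly (commented out since we keep the default search):

# Illustrative explicit mtry grid for caret's "rf" method
# tune_grid <- expand.grid(mtry = c(2, 6, 11, 16, 20))
# survive_forest <- train(Survived ~ ., data = data_trained, method = "rf",
#                         trControl = ctrl, tuneGrid = tune_grid)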

From the model summary (printed further below), we know that sampling 11 variables at each split (mtry = 11) gives the highest accuracy. Note that the formula interface dummy-encodes the factor predictors, so mtry refers to those expanded columns and can exceed the 7 raw predictors.

plot(survive_forest$finalModel)
legend("topright", colnames(survive_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

#Predict Data
pred_test_rf <- predict(survive_forest, newdata = data_test, type = "raw")
#Model Evaluation by using Confusion Matrix
mat2 <- confusionMatrix(data = pred_test_rf, reference = data_test$Survived, positive = "1")
mat2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 168  20
##          1  36 107
##                                          
##                Accuracy : 0.8308         
##                  95% CI : (0.786, 0.8696)
##     No Information Rate : 0.6163         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.6506         
##                                          
##  Mcnemar's Test P-Value : 0.04502        
##                                          
##             Sensitivity : 0.8425         
##             Specificity : 0.8235         
##          Pos Pred Value : 0.7483         
##          Neg Pred Value : 0.8936         
##              Prevalence : 0.3837         
##          Detection Rate : 0.3233         
##    Detection Prevalence : 0.4320         
##       Balanced Accuracy : 0.8330         
##                                          
##        'Positive' Class : 1              
## 
# Model Evaluation by ROC
prob_survive_rf <- predict(survive_forest, data_test, type = "prob")

data_roc2 <- data.frame(prob = prob_survive_rf[,2], # probability of positive class(survived)
                       labels = as.numeric(data_test$Survived == "1")) #get the label as the test data who survived
rf_roc <- ROCR::prediction(data_roc2$prob, data_roc2$labels) 

# ROC curve
plot(performance(rf_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
     main = "ROC")
abline(a = 0, b = 1)

# Model Evaluation by AUC
rf_auc <- performance(rf_roc, measure = "auc")
rf_auc@y.values
## [[1]]
## [1] 0.9202563

Finally, let's print the fitted model to inspect the cross-validation results across the tuning grid.

# Call the model
survive_forest
## Random Forest 
## 
## 684 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 546, 547, 548, 548, 547, 548, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7738311  0.5476843
##   11    0.8001697  0.6003162
##   20    0.7831127  0.5662113
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.

From the model summary, we know that the optimal number of variables considered for splitting at each tree node is 11. We can also inspect the importance of each variable used in our random forest, as sketched below.
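
A minimal sketch using caret's varImp(), which works directly on the trained object:

# Variable importance from the caret-trained random forest
varImp(survive_forest)
plot(varImp(survive_forest))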

  5. Conclusion

Let's compare the decision tree model and the random forest model for this case.

#get the accuracy and AUC of the decision tree model
m_dt <- data.frame(Model = "Decision Tree",
           Accuracy = round((mat1$table[4] + mat1$table[1]) / sum(mat1$table),4),
          AUC = round(as.numeric(dt_auc@y.values),4))

#get the accuracy and AUC of the random forest model
m_rf <- data.frame(Model = "Random Forest",
           Accuracy = round((mat2$table[4] + mat2$table[1]) / sum(mat2$table), 4),
           AUC = round(as.numeric(rf_auc@y.values), 4))

rbind(m_dt, m_rf)
##           Model Accuracy    AUC
## 1 Decision Tree   0.9668 0.9411
## 2 Random Forest   0.8308 0.9203
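
As an aside, the same accuracy figures can be read directly from the confusionMatrix objects:

# Equivalent accuracy lookups
mat1$overall["Accuracy"]
mat2$overall["Accuracy"]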

We can see from the table above that the decision tree may be the better model for predicting the survival of Titanic passengers: it reaches 96.68% accuracy and an AUC of 0.9411, while the random forest gets 83.08% accuracy and an AUC of 0.9203.