This assignment uses the appended train and test data sets to build a random forest model that predicts the survival of Titanic passengers.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(e1071)
library(ROCR)
library(partykit) 
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(ggplot2)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ggthemes)

  1. First, we need to import and prepare the data set.
train <- read.csv("titanic_train.csv")
rmarkdown::paged_table(train)

The train data has 891 observations and 12 columns. We will use this data to build the models later.

test <- read.csv("titanic_test.csv")
rmarkdown::paged_table(test)

The test data has 418 rows and 11 columns. Comparing the two, the train data has 12 columns while the test data has 11: the difference is that the train data contains the Survived variable and the test data does not. The Survived labels for the test data are provided in gender_submission.csv.

survive <- read.csv("gender_submission.csv")
rmarkdown::paged_table(survive)

To make it easier to work with, let's join the test data and survive, so that we only deal with two data sets.

test1 <- cbind(test, Survived = survive$Survived)
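
As a side note, matching rows by key would be more robust than cbind if the two files were ever reordered. A minimal sketch with dplyr, assuming both tables carry PassengerId (the name test1_joined is just illustrative):

# Equivalent join keyed on PassengerId, robust to row-order changes
test1_joined <- test %>%
    left_join(survive, by = "PassengerId")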

  2. Second, we need to clean the data.

colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

From the train data set we find 177 missing values in the Age column, around 19% of the data. Instead of removing those rows, let's replace the missing ages with the mean age.

train_clean <- train %>%
    mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age))
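
As an aside, a sketch of a gentler alternative: impute Age with the median within each Pclass and Sex group, which respects the skewed Age distribution (train_clean2 is an illustrative name, not used later):

# Illustrative alternative: group-wise median imputation by Pclass and Sex
train_clean2 <- train %>%
    group_by(Pclass, Sex) %>%
    mutate(Age = if_else(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
    ungroup()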

colSums(is.na(train_clean))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Then, we do the same check for missing values in the test data.

colSums(is.na(test))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          86           0 
##       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           1           0           0

The test data has 86 missing values in Age and one in Fare. To keep NA values from affecting the prediction results, we will drop the rows that contain them.

test_clean <- test1 %>% 
    na.omit()
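
Dropping rows discards those test observations. A hedged alternative, sketched below, is to impute the test set with statistics learned on the train set, which keeps every row and avoids leaking test information (test_imputed is an illustrative name, not used later):

# Illustrative alternative: fill test NAs with train-set statistics
test_imputed <- test1 %>%
    mutate(Age  = if_else(is.na(Age),  mean(train$Age, na.rm = TRUE), Age),
           Fare = if_else(is.na(Fare), median(train$Fare, na.rm = TRUE), Fare))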
  3. Next, we need to do exploratory analysis and data processing.

In this case, Survived is the target variable, and some variables do not contain much information for modeling. Since our objective is to figure out which features influence survival, we will dig into the data to explore the relationship between each attribute and survival.

Sex vs. Survival

# create a bar chart to show the effect of Sex on survival
ggplot(train, aes(Sex, fill = factor(Survived))) +
    geom_bar()

# calculate survival rate
tapply(train$Survived,train$Sex,mean)
##    female      male 
## 0.7420382 0.1889081
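
The same rates can also be computed with dplyr, shown here as an equivalent formulation:

# Equivalent survival rates per Sex using dplyr
train %>%
    group_by(Sex) %>%
    summarise(survival_rate = mean(Survived))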

From the table and chart above we can tell that the survival rate of females (0.74) is greater than that of males (0.19).

Pclass vs. Survival

# make a bar chart
ggplot(train, aes(Pclass, fill = factor(Survived))) +
    geom_bar()

# calculate survival rate
tapply(train$Survived,train$Pclass,mean)
##         1         2         3 
## 0.6296296 0.4728261 0.2423625

From the table and chart above we can tell that the survival rate of Pclass = 1 (0.63) is greater than that of the other classes.

Fare vs. Survival

# make a histogram
ggplot(train, aes(Fare,fill = factor(Survived))) +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We notice that passengers whose fare is below 50 have a relatively lower survival rate, while passengers whose fare is extremely high (500-550) have a very high survival rate.
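
Because Fare is heavily right-skewed, a log-style axis makes this pattern easier to see. A minimal sketch (the log1p transform keeps the zero fares plottable):

# Boxplot of Fare by survival on a log1p scale to tame the skew
ggplot(train, aes(factor(Survived), Fare)) +
    geom_boxplot() +
    scale_y_continuous(trans = "log1p")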

Now, we can select the predictor variables and convert the data types.

data_train <- train_clean %>% 
          select(-c(PassengerId, Name, Ticket, Cabin)) %>% 
          mutate(Survived = as.factor(Survived),
           Pclass = as.factor(Pclass),
           Sex = as.factor(Sex),
           SibSp = as.factor(SibSp),
           Parch = as.factor(Parch),
           Embarked = as.factor(Embarked))

data_test <- test_clean %>% 
          select(-c(PassengerId, Name, Ticket, Cabin)) %>% 
          mutate(Survived = as.factor(Survived),
           Pclass = as.factor(Pclass),
           Sex = as.factor(Sex),
           SibSp = as.factor(SibSp),
           Parch = as.factor(Parch),
           Embarked = as.factor(Embarked))
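
One caveat, sketched below as a guard rather than as part of the original pipeline: because train and test are converted to factors independently, their levels can drift apart, and tree models will refuse to predict on levels unseen in training. Re-leveling the test factors against the train levels prevents this:

# Align test factor levels with the train levels (illustrative guard)
for (col in c("Pclass", "Sex", "SibSp", "Parch", "Embarked")) {
    data_test[[col]] <- factor(data_test[[col]], levels = levels(data_train[[col]]))
}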

str(data_train)
## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Before we continue to modeling, let’s check the proportion of our data in class target.

prop.table(table(data_train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

So we can see that in the train data around 61.6% of passengers did not survive. Let's do down-sampling to balance the proportions of the target variable.

set.seed(267)
data_trained <- downSample(x = data_train[, -1], y = as.factor(data_train[, 1]), yname = "Survived")
rmarkdown::paged_table(data_trained)
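
caret also offers the mirror-image approach: as a sketch, upSample() balances the classes by duplicating minority-class rows instead of discarding majority-class rows (left commented out since we proceed with the down-sampled data):

# Illustrative alternative: up-sample the minority class instead
# data_trained_up <- upSample(x = data_train[, -1],
#                             y = data_train$Survived, yname = "Survived")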

Let’s check the proportion again in the trained data.

prop.table(table(data_trained$Survived))
## 
##   0   1 
## 0.5 0.5
  4. Modeling

Decision Tree
model_dt <- ctree(Survived~ .,data_trained)
plot(model_dt, type="simple")

From the decision tree model we can get some insights: (a) Sex is the root node, the variable that matters most when determining the target value; (b) Pclass, Fare, and Age are interior nodes, used when the first branch is not sufficient to determine the target; (c) the rest are terminal nodes, which carry the predicted class.

model_dt
## 
## Model formula:
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
## Fitted party:
## [1] root
## |   [2] Sex in female
## |   |   [3] Pclass in 1, 2: 1 (n = 165, err = 2.4%)
## |   |   [4] Pclass in 3
## |   |   |   [5] Fare <= 23.25: 1 (n = 104, err = 33.7%)
## |   |   |   [6] Fare > 23.25: 0 (n = 19, err = 15.8%)
## |   [7] Sex in male
## |   |   [8] Pclass in 1: 0 (n = 91, err = 49.5%)
## |   |   [9] Pclass in 2, 3
## |   |   |   [10] Age <= 12
## |   |   |   |   [11] SibSp in 0, 1, 2: 1 (n = 17, err = 0.0%)
## |   |   |   |   [12] SibSp in 3, 4, 5: 0 (n = 9, err = 11.1%)
## |   |   |   [13] Age > 12: 0 (n = 279, err = 16.5%)
## 
## Number of inner nodes:    6
## Number of terminal nodes: 7

From the output above, it is also clear that Sex, Pclass, Fare, and Age are the major variables used for classification.

#Predict Data
pred_test_dt <- predict(model_dt, newdata = data_test, 
                          type = "response")
head(pred_test_dt)
## 1 2 3 4 5 6 
## 0 1 0 0 1 0 
## Levels: 0 1
#Model evaluation by using confusion matrix.
mat1 <- confusionMatrix(data = pred_test_dt, reference = data_test$Survived, positive = "1")
mat1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 195   2
##          1   9 125
##                                           
##                Accuracy : 0.9668          
##                  95% CI : (0.9413, 0.9833)
##     No Information Rate : 0.6163          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9305          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.9843          
##             Specificity : 0.9559          
##          Pos Pred Value : 0.9328          
##          Neg Pred Value : 0.9898          
##              Prevalence : 0.3837          
##          Detection Rate : 0.3776          
##    Detection Prevalence : 0.4048          
##       Balanced Accuracy : 0.9701          
##                                           
##        'Positive' Class : 1               
## 
# Model Evaluation by ROC
prob_survive_dt <- predict(model_dt, data_test, type = "prob")

data_roc1 <- data.frame(prob = prob_survive_dt[,2], # probability of positive class(survived)
                       labels = as.numeric(data_test$Survived == "1")) #get the label as the test data who survived

dt_roc <- ROCR::prediction(data_roc1$prob, data_roc1$labels) 

# ROC curve
plot(performance(dt_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
     main = "ROC")
abline(a = 0, b = 1)

# Model Evaluation by AUC
dt_auc <- performance(dt_roc, measure = "auc")
dt_auc@y.values
## [[1]]
## [1] 0.9410607

Random Forest

Before fitting the model, let's check whether any of the predictor columns have near-zero variance and could be dropped.

nearZeroVar(data_trained)
## integer(0)

As there are no near-zero variance predictors, we can fit a random forest model.

set.seed(267)

ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3) # k-fold cross validation
survive_forest <- train(Survived ~ ., data = data_trained, method = "rf", trControl = ctrl)
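
By default, caret searches a small grid of mtry values. As a hedged sketch, the grid could also be specified explicitly (commented out since we keep the default search):

# Illustrative explicit mtry grid for caret's "rf" method
# tune_grid <- expand.grid(mtry = c(2, 6, 11, 16, 20))
# survive_forest <- train(Survived ~ ., data = data_trained, method = "rf",
#                         trControl = ctrl, tuneGrid = tune_grid)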

From the model summary (printed further below), we know that sampling 11 variables at each split (mtry = 11) gives the highest accuracy. Note that the formula interface dummy-encodes the factor predictors, so mtry refers to those expanded columns and can exceed the 7 raw predictors.

plot(survive_forest$finalModel)
legend("topright", colnames(survive_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

#Predict Data
pred_test_rf <- predict(survive_forest, newdata = data_test, type = "raw")
#Model Evaluation by using Confusion Matrix
mat2 <- confusionMatrix(data = pred_test_rf, reference = data_test$Survived, positive = "1")
mat2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 168  20
##          1  36 107
##                                          
##                Accuracy : 0.8308         
##                  95% CI : (0.786, 0.8696)
##     No Information Rate : 0.6163         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.6506         
##                                          
##  Mcnemar's Test P-Value : 0.04502        
##                                          
##             Sensitivity : 0.8425         
##             Specificity : 0.8235         
##          Pos Pred Value : 0.7483         
##          Neg Pred Value : 0.8936         
##              Prevalence : 0.3837         
##          Detection Rate : 0.3233         
##    Detection Prevalence : 0.4320         
##       Balanced Accuracy : 0.8330         
##                                          
##        'Positive' Class : 1              
## 
# Model Evaluation by ROC
prob_survive_rf <- predict(survive_forest, data_test, type = "prob")

data_roc2 <- data.frame(prob = prob_survive_rf[,2], # probability of positive class(survived)
                       labels = as.numeric(data_test$Survived == "1")) #get the label as the test data who survived
rf_roc <- ROCR::prediction(data_roc2$prob, data_roc2$labels) 

# ROC curve
plot(performance(rf_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
     main = "ROC")
abline(a = 0, b = 1)

# Model Evaluation by AUC
rf_auc <- performance(rf_roc, measure = "auc")
rf_auc@y.values
## [[1]]
## [1] 0.9202563

Finally, let's print the fitted model to inspect the cross-validation results across the tuning grid.

# Call the model
survive_forest
## Random Forest 
## 
## 684 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 546, 547, 548, 548, 547, 548, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7738311  0.5476843
##   11    0.8001697  0.6003162
##   20    0.7831127  0.5662113
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.

From the model summary, we know that the optimal number of variables considered for splitting at each tree node is 11. We can also inspect the importance of each variable used in our random forest, as sketched below.
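
A minimal sketch using caret's varImp(), which works directly on the trained object:

# Variable importance from the caret-trained random forest
varImp(survive_forest)
plot(varImp(survive_forest))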

  5. Conclusion

Let's compare the decision tree model and the random forest model for this case.

#get the accuracy and AUC of the decision tree model
m_dt <- data.frame(Model = "Decision Tree",
           Accuracy = round((mat1$table[4] + mat1$table[1]) / sum(mat1$table),4),
          AUC = round(as.numeric(dt_auc@y.values),4))

#get the accuracy and AUC of the random forest model
m_rf <- data.frame(Model = "Random Forest",
           Accuracy = round((mat2$table[4] + mat2$table[1]) / sum(mat2$table), 4),
           AUC = round(as.numeric(rf_auc@y.values), 4))

rbind(m_dt, m_rf)
##           Model Accuracy    AUC
## 1 Decision Tree   0.9668 0.9411
## 2 Random Forest   0.8308 0.9203
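
As an aside, the same accuracy figures can be read directly from the confusionMatrix objects:

# Equivalent accuracy lookups
mat1$overall["Accuracy"]
mat2$overall["Accuracy"]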

We can see from the table above that the decision tree may be the better model for predicting the survival of Titanic passengers: it reaches 96.68% accuracy and an AUC of 0.9411, while the random forest gets 83.08% accuracy and an AUC of 0.9203.