We want to build a random forest (RF) model that classifies passengers of the Titanic, the ship that sank in 1912, as having survived or not, based on a number of predictors (features). This is a Kaggle beginner challenge. Both the train and test datasets are obtained here.
rm(list = ls(all.names = TRUE))
library(dplyr) # For data wrangling
library(mice) # For imputation
library(VIM) # For imputation
library(caret) # For ML
library(randomForest) # For RF
Train <- read.csv("D:\\Data Science\\Hackathons\\Kaggle\\Titanic--Machine Learning from Disaster\\train.csv") %>% select(-c(PassengerId, Name, Ticket, Cabin))
Test <- read.csv("D:\\Data Science\\Hackathons\\Kaggle\\Titanic--Machine Learning from Disaster\\test.csv") %>% select(-c(Name, Ticket, Cabin))
set.seed(1111)
Train %>% sample_n(5) # View 5 random rows
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 1 0 3 male 39 0 0 24.1500 S
## 2 1 2 female 24 2 3 18.7500 S
## 3 0 3 female 18 1 0 17.8000 S
## 4 0 1 male NA 0 0 30.6958 C
## 5 1 1 female 19 1 0 91.0792 C
We should ensure that categorical variables such as Survived are stored as type factor. Note that read.csv() in R versions before 4.0 converts character columns to factors by default, which is why Sex and Embarked already appear as factors below.
str(Train)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The results above show that Survived and Pclass are not of the appropriate class: both are stored as integers rather than factors. We correct this below.
Embarked_Names <- c(C = "Cherbourg", Q = "Queenstown", S = "Southampton")
Train <- within(Train, {
  Survived <- factor(Survived) # 0/1 integers to a two-level factor
  Pclass <- factor(Pclass,
                   levels = c(1, 2, 3),
                   labels = c("1st", "2nd", "3rd"))
  Embarked <- recode(Embarked, !!!Embarked_Names) %>% factor() # Expand port codes
})
Test <- within(Test, {
  Pclass <- factor(Pclass,
                   levels = c(1, 2, 3),
                   labels = c("1st", "2nd", "3rd"))
  Embarked <- recode(Embarked, !!!Embarked_Names) %>% factor()
})
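A quick sanity check that the conversions took effect:
sapply(Train, class) # Class of each column after conversion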
We need to ensure that our dataset has no missing values, whether in the dependent variable (DV) or any of the independent variables (IVs).
anyNA(Train)
## [1] TRUE
The result above shows that there are missing instances in some columns.
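Before examining pairwise missingness with mice, a simple base-R count per column:
colSums(is.na(Train)) # Number of NAs in each column
Note that this only counts NA values; the blank Embarked entries remain as an empty-string factor level rather than NA, so they are not flagged here.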
md.pairs(Train)
## $rr
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## Survived 891 891 891 714 891 891 891 891
## Pclass 891 891 891 714 891 891 891 891
## Sex 891 891 891 714 891 891 891 891
## Age 714 714 714 714 714 714 714 714
## SibSp 891 891 891 714 891 891 891 891
## Parch 891 891 891 714 891 891 891 891
## Fare 891 891 891 714 891 891 891 891
## Embarked 891 891 891 714 891 891 891 891
##
## $rm
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## Survived 0 0 0 177 0 0 0 0
## Pclass 0 0 0 177 0 0 0 0
## Sex 0 0 0 177 0 0 0 0
## Age 0 0 0 0 0 0 0 0
## SibSp 0 0 0 177 0 0 0 0
## Parch 0 0 0 177 0 0 0 0
## Fare 0 0 0 177 0 0 0 0
## Embarked 0 0 0 177 0 0 0 0
##
## $mr
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## Survived 0 0 0 0 0 0 0 0
## Pclass 0 0 0 0 0 0 0 0
## Sex 0 0 0 0 0 0 0 0
## Age 177 177 177 0 177 177 177 177
## SibSp 0 0 0 0 0 0 0 0
## Parch 0 0 0 0 0 0 0 0
## Fare 0 0 0 0 0 0 0 0
## Embarked 0 0 0 0 0 0 0 0
##
## $mm
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## Survived 0 0 0 0 0 0 0 0
## Pclass 0 0 0 0 0 0 0 0
## Sex 0 0 0 0 0 0 0 0
## Age 0 0 0 177 0 0 0 0
## SibSp 0 0 0 0 0 0 0 0
## Parch 0 0 0 0 0 0 0 0
## Fare 0 0 0 0 0 0 0 0
## Embarked 0 0 0 0 0 0 0 0
In the output above, $rr counts pairs of variables where both are observed, $rm pairs where the row variable is observed but the column variable is missing, $mr the reverse, and $mm pairs where both are missing. We can visualise the missing-data pattern as follows.
md.pattern(Train)
## Survived Pclass Sex SibSp Parch Fare Embarked Age
## 714 1 1 1 1 1 1 1 1 0
## 177 1 1 1 1 1 1 1 0 1
## 0 0 0 0 0 0 0 177 177
The pattern above reveals that only Age has missing values, with 177 instances. We can impute these as follows.
Impute_train <- mice(Train, m = 3, seed = 1111)
##
## iter imp variable
## 1 1 Age
## 1 2 Age
## 1 3 Age
## 2 1 Age
## 2 2 Age
## 2 3 Age
## 3 1 Age
## 3 2 Age
## 3 3 Age
## 4 1 Age
## 4 2 Age
## 4 3 Age
## 5 1 Age
## 5 2 Age
## 5 3 Age
Impute_train
## Class: mids
## Number of multiple imputations: 3
## Imputation methods:
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## "" "" "" "pmm" "" "" "" ""
## PredictorMatrix:
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## Survived 0 1 1 1 1 1 1 1
## Pclass 1 0 1 1 1 1 1 1
## Sex 1 1 0 1 1 1 1 1
## Age 1 1 1 0 1 1 1 1
## SibSp 1 1 1 1 0 1 1 1
## Parch 1 1 1 1 1 0 1 1
## Fare 1 1 1 1 1 1 0 1
## Embarked 1 1 1 1 1 1 1 0
The function mice() generates three random imputations (since m = 3) and iterates 5 times by default. The default method for imputing numeric variables is pmm, which stands for predictive mean matching; binary factors default to logreg (logistic regression) and unordered factors with more than two levels to polyreg (polytomous, i.e. multinomial, logistic regression).
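If we wanted a different method for a given column, we could override the method vector. A minimal sketch, purely illustrative (Impute_alt is a hypothetical name; norm is Bayesian linear regression):
meth <- make.method(Train) # mice's default method for each column
meth["Age"] <- "norm" # Swap pmm for Bayesian linear regression
Impute_alt <- mice(Train, m = 3, method = meth, seed = 1111, printFlag = FALSE)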
We can look at the first 5 imputed values using:
Impute_train$imp$Age[1:5, ]
## 1 2 3
## 6 46 47 27
## 18 31 43 18
## 20 39 24 2
## 27 25 36 39
## 29 35 25 2
We can obtain the completed dataset from the first imputation by:
Train <- complete(Impute_train, 1)
anyNA(Train)
## [1] FALSE
To visualise the imputed values, we can draw a strip plot:
stripplot(Impute_train, pch = 20, cex = 1.2)
In the plot above, 0 on the x-axis represents the original data, while 1, 2, and 3 are the 1st, 2nd, and 3rd imputations respectively.
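Another common diagnostic overlays the density of the imputed Age values on that of the observed values (in mice's default colouring, blue is observed, red is imputed):
densityplot(Impute_train, ~ Age)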
There are only three main hyperparameters involved in an RF, namely the number of trees ntree, the number of variables tried at each split mtry, and the minimum size of terminal nodes nodesize, which indirectly controls tree depth: the larger the nodesize, the shallower the tree and the fewer the leaves.
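Rather than guessing mtry, the randomForest package provides tuneRF(), which starts from a default mtry and inflates or deflates it by stepFactor for as long as the OOB error keeps improving. A minimal sketch, assuming Survived is still the first column of Train as in the str() output above:
set.seed(1111)
tuneRF(x = Train[, -1],    # Predictors: everything except Survived
       y = Train$Survived, # Response
       ntreeTry = 500,     # Trees grown for each mtry tried
       stepFactor = 1.5,   # Factor by which mtry changes at each step
       improve = 0.01)     # Minimum relative OOB improvement to continue
For now we fit the model with the defaults, i.e. ntree = 500 and, for classification, mtry = floor(sqrt(p)):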
RF_Model <- randomForest(Survived ~ .,
data = Train,
importance = TRUE)
RF_Model
##
## Call:
## randomForest(formula = Survived ~ ., data = Train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.72%
## Confusion matrix:
## 0 1 class.error
## 0 499 50 0.09107468
## 1 99 243 0.28947368
Since this is a classification problem, we measure model performance using accuracy rather than regression metrics such as MAE, RMSE, or MSE.
We can obtain accuracy by summing the diagonal elements of the confusion matrix above and dividing by the table total:
(499+243)/(499+243+50+99)
## [1] 0.8327722
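The same figure can be read off the fitted object: RF_Model$confusion stores these counts along with a class.error column, which we drop before summing:
conf <- RF_Model$confusion[, 1:2] # Keep the counts, drop class.error
sum(diag(conf)) / sum(conf)       # OOB accuracy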
To view variable importance:
RF_Model$importance
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## Pclass 0.03002979 0.103039035 0.05799803 34.10667
## Sex 0.12693107 0.209098819 0.15822907 103.52928
## Age 0.04770524 0.038393866 0.04406702 60.64767
## SibSp 0.03170089 0.004421977 0.02119980 17.41862
## Parch 0.01544175 0.008230806 0.01265676 12.21988
## Fare 0.04080752 0.069731258 0.05183229 62.69512
## Embarked 0.00844533 0.013936855 0.01058102 11.40967
Here MeanDecreaseAccuracy measures how much OOB accuracy drops when a variable's values are randomly permuted, while MeanDecreaseGini measures the total reduction in node impurity that splits on the variable achieve across all trees. We can also visualise variable importance:
varImpPlot(RF_Model,
pch = 20,
col = "green",
main = "Feature Importance")
Ensure the testing set does not have mismatching factor levels or variable names, otherwise predict() fails with the error Type of predictors in new data do not match that of the training data.
Impute_test <- mice(Test, m = 3, seed = 2222)
##
## iter imp variable
## 1 1 Age Fare
## 1 2 Age Fare
## 1 3 Age Fare
## 2 1 Age Fare
## 2 2 Age Fare
## 2 3 Age Fare
## 3 1 Age Fare
## 3 2 Age Fare
## 3 3 Age Fare
## 4 1 Age Fare
## 4 2 Age Fare
## 4 3 Age Fare
## 5 1 Age Fare
## 5 2 Age Fare
## 5 3 Age Fare
Test <- complete(Impute_test, 1)
common <- intersect(names(Train), names(Test))
for (p in common) {
  if (is.factor(Train[[p]])) {
    # Re-encode with the training levels; factor() matches values by label,
    # whereas assigning with levels<- renames levels by position and can
    # silently scramble the data when the two orderings differ
    Test[[p]] <- factor(Test[[p]], levels = levels(Train[[p]]))
  }
}
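A defensive check before predicting, which stops if any of the factors still disagree:
stopifnot(identical(levels(Test$Pclass), levels(Train$Pclass)),
          identical(levels(Test$Sex), levels(Train$Sex)),
          identical(levels(Test$Embarked), levels(Train$Embarked)))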
Pred <- predict(RF_Model, Test)
We can now save the results for submission.
Submit <- data.frame(PassengerId = Test$PassengerId, Survived = Pred)
write.csv(Submit,
row.names = FALSE,
file = "C:\\Users\\User\\Desktop\\Desktop\\Cory_02042020.csv")
The above submission gives me a score of 0.75119, which is a decent result for a first attempt.