# You may set your own, mine is:
setwd("E:/Titanic_ML/")
This script trains a Random Forest model based on the data, saves a sample submission, and plots the relative importance of the variables in making predictions
Download my_rf_submission.csv from the output below and submit it through https://www.kaggle.com/c/titanic-gettingStarted/submissions/attach to enter this competition!
This analysis uses the Titanic survivor data provided by Kaggle at http://www.kaggle.com/c/titanic/data
I am using R for this analysis.
According to the description on Kaggle Titanic Competition, the variable descriptions are as follows:
survival: Survival (0 = No; 1 = Yes)
pclass: Passenger Class (proxy for socio-economic status 1 = 1st; 2 = 2nd; 3 = 3rd)
name: Name
sex: Sex
age: Age
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare
cabin: Cabin
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
The purpose of this ananlysis is to build a model that selects the variables that are able to predctive the passenger's survival with better
prediction accuracy.
For this analysis, since I have survival labels (whether survived or not) I'll apply a supervised learning algorithm to build the model.
First let us load the required libraries in R and also data in the working environment.
library(ggplot2)
library(randomForest)
library(rpart)
library(rpart.plot)
library(pander)
library(caTools)
library(rattle)
library(RColorBrewer)
library(plyr)
set.seed(1)
train <- read.csv("./input/train.csv", stringsAsFactors=FALSE)
test <- read.csv("./input/test.csv", stringsAsFactors=FALSE)
Training Data:
# Print few rows of training dataet to the console
m <- head(train)
knitr::kable(m, digits = 2, caption = "Training Data: First Few Rows of Data")
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | S | |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.92 | S | |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.10 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | S | |
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.46 | Q |
Testing Data:
# Print few rows of testing dataet to the console
n <- head(test)
knitr::kable(n, digits = 2, caption = "Testing Data: First Few Rows of Data")
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.83 | Q | |
893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.00 | S | |
894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.69 | Q | |
895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.66 | S | |
896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.29 | S | |
897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.22 | S |
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
str(test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : chr "Q" "S" "Q" "S" ...
From the above results, it is clear that the 'Survived' variable is loaded as integer by default.
Therefore I'll convert it to a categorical varaible (factor).
I'll also convert other categorical variables to factors as well (which are not loaded as factors but actually are).
train2 <- train
train2$Survived <- as.factor(train2$Survived)
train2$Pclass <- as.factor(train2$Pclass)
As previously noted, I'll split the training data into a set that will be used to train the model and a set that will be used to evaluate the model.
Test data set, as created from the above process, will contain 75% of randomly selected observations.
Training data set will contain the rest 25% obesvations (in original training set) which are exluded by newly created test data set.
set.seed(1446)
split <- sample.split(train2$Survived, SplitRatio = 0.75)
sub_training_data <- subset(train2, split == TRUE)
sub_testing_data <- subset(train2, split == FALSE)
# Males & Females that Survived vs. Males & Females that passed away
table(train2$Sex, train2$Survived)
##
## 0 1
## female 81 233
## male 468 109
# As row-wise proportions
prop.table(table(train$Sex, train$Survived), margin = 1)
##
## 0 1
## female 0.2579618 0.7420382
## male 0.8110919 0.1889081
I'll first consider a simple model based on gender (Sex variable) of the passengers vs. survival from the disaster.
k <- ggplot(train2, aes(Survived))
k + geom_bar(aes( fill = Sex), width=.85, colour="darkgreen") + scale_fill_brewer() +
ylab("Survival Count (Genderwise)") +
xlab("Survived: No = 0, Yes = 1") +
ggtitle("Titanic Disaster: Gender Vs. Survival (Training Dataset")
The following plot reveals that a majority of the passengers who survived Ire females.
Referring this idea, I can use Sex attribute for a basic model to predict whether a particular person survived or not.
Using R's "rpart" library, I will develop a simple decision tree (model):
formula <- Survived ~ Sex
# Build the decision tree
dtree <- rpart(formula, data=sub_training_data, method="class")
dtree_tr_predict <- predict(dtree, newdata=sub_training_data, type="class")
dtree_tr_predict.t <- table(sub_training_data$Survived, dtree_tr_predict)
# Model Accuracy
dtree_tr_accuracy <- (dtree_tr_predict.t[1, 1] + dtree_tr_predict.t[2, 2]) / sum(dtree_tr_predict.t)
# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy)
## Model Accuracy on Sub sample on training data: 0.7919162
dtree_te_predict <- predict(dtree, newdata=sub_testing_data, type="class")
dtree_te_predict.t <- table(sub_testing_data$Survived, dtree_te_predict)
# Model Accuracy
dtree_testing_accuracy <- (dtree_te_predict.t[1, 1] + dtree_te_predict.t[2, 2]) / sum(dtree_te_predict.t)
# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy)
## Model Accuracy in Prediction: 0.7713004
fancyRpartPlot(dtree)
For the revised model, I'll consider "Age" and "Passenger Class" as additional variables.
Note that the "Passenger Class" variable is a proxy for socio-economic status with:
1 = 1st,
2 = 2nd, and
3 = 3rd class of passenger's catagory
formula2 <- Survived ~ Sex + Pclass + Age
# Build the decision tree
dtree2 <- rpart(formula2, data = sub_training_data, method="class")
dtree_tr_predict2 <- predict(dtree2, newdata = sub_training_data, type="class")
dtree_tr_predict.t2 <- table(sub_training_data$Survived, dtree_tr_predict2)
# Model Accuracy
dtree_tr_accuracy2 <- (dtree_tr_predict.t2[1, 1] + dtree_tr_predict.t2[2, 2]) / sum(dtree_tr_predict.t2)
# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy2)
## Model Accuracy on Sub sample on training data: 0.8128743
dtree_te_predict2 <- predict(dtree2, newdata = sub_testing_data, type="class")
dtree_te_predict.t2 <- table(sub_testing_data$Survived, dtree_te_predict2)
# Model Accuracy
dtree_testing_accuracy2 <- (dtree_te_predict.t2[1, 1] + dtree_te_predict.t2[2, 2]) / sum(dtree_te_predict.t2)
# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy2)
## Model Accuracy in Prediction: 0.7847534
fancyRpartPlot(dtree2)
Since Random Forest funciton in randomForest package throughs an error if there is any variable that has NAs in it and is used to develop decision tree using this function.
We, therefore, first find out which are rows in the training dataset that has the NAs.
If complete row has NAs for all the variables in the training dataset, then we will simple remove it.
Or else we will replace the NAs with suitable imputation method.
# Function to count number of missing values
nmissing <- function(x) sum(is.na(x))
# Apply the above function to every column in a train dataset
L <- colwise(nmissing)(train)
knitr::kable(L, digits = 2, caption = "Training Data: NA's")
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 177 | 0 | 0 | 0 | 0 | 0 | 0 |
As we can see that there are total of 176 rows in varaiable "Age" column that has NAs.
None other variable has NAs.
This makes our life simple, we can replace these NAs using imputation.
Question is which method will be suitable for replacing NAs for a continuous variable?
AaHa! PREDICTION.. that's what are we doing until now, and will do again.
I'll make a prediction of an individual passenger's "Age"" using the other variables and a decision tree model.
But, this time I will use "anova", why? because I am trying to predict a continuous variable, as I said earlier.
So let's do is using R's "rpart" package. But why "rpart"? Becasue it's function take care of NAs in the data to build up prediciton model.
names(train) # use all of these
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
predicted_age <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Survived, data = train[!is.na(train$Age), ], method = "anova")
train1 <- train
test1 <- test
#Summary before imputaiton
summary(train1$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
train1$Age[is.na(train1$Age)] <- predict(predicted_age, train[is.na(train$Age), ])
#Summary after imputaiton
summary(train1$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 22.00 27.84 29.57 37.00 80.00
k1 <- colwise(nmissing)(train1)
knitr::kable(k1, digits = 2, caption = "Training Data: NA's")
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Therefore, we can observe from the above summary results (before and after inputation) that change effects due to imputation are almost negligiable on the dataset.
We now can use randomForest package to build RF model on train1 data:
First convert variables into factors.
train1$Survived <- as.factor(train1$Survived)
#train1$Pclass <- as.factor(train1$Pclass)
train1$Sex <- as.factor(train1$Sex)
#train1$Parch <- as.factor(train1$Parch)
train1$Embarked <- as.factor(train1$Embarked)
#train1$SibSp <- as.factor(train1$SibSp)
#test1$Pclass <- as.factor(test$Pclass)
test1$Sex <- as.factor(test1$Sex)
#test1$Parch <- as.factor(test1$Parch)
test1$Embarked <- as.factor(test1$Embarked)
#test1$SibSp <- as.factor(test1$SibSp)
sapply(train1, class)
## PassengerId Survived Pclass Name Sex Age
## "integer" "factor" "integer" "character" "factor" "numeric"
## SibSp Parch Ticket Fare Cabin Embarked
## "integer" "integer" "character" "numeric" "character" "factor"
sapply(test1, class)
## PassengerId Pclass Name Sex Age SibSp
## "integer" "integer" "character" "factor" "numeric" "integer"
## Parch Ticket Fare Cabin Embarked
## "integer" "character" "numeric" "character" "factor"
Apply randomForest to builde RF tree:
formula2 <- Survived ~ Pclass + Sex + Age
# Build the decision tree
my_rf_model <- randomForest(formula2, data = train1, importance=TRUE, ntree=2000)
#my_rf_model <- randomForest(formula2, data = train1, importance=TRUE, ntree=2000,proximity=TRUE)
Variable importance:
importance(my_rf_model)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## Pclass 42.37356 48.41536 52.11166 37.32574
## Sex 94.66900 116.35814 109.42921 102.67206
## Age 39.05505 30.48756 51.39344 23.96749
Plot Variable importance:
varImpPlot(my_rf_model, main = "RF: Variable Importance")
Make Prediciton:
# Make your prediction using the test set
my_RF_prediction <- predict(my_rf_model, test1)
Modeling Errors (Confusion Matrix):
options('digits'= 3)
conf_matrix <- my_rf_model$confusion
knitr::kable(conf_matrix, digits = 2, caption = "Prediciton Errors: ")
0 | 1 | class.error | |
---|---|---|---|
0 | 493 | 56 | 0.10 |
1 | 112 | 230 | 0.33 |
Accuracy of the model:
options('digits'= 3)
my_rf_model$confusion[, 'class.error']
## 0 1
## 0.102 0.327
# Look at OOB errors
my_rf_model
##
## Call:
## randomForest(formula = formula2, data = train1, importance = TRUE, ntree = 2000)
## Type of random forest: classification
## Number of trees: 2000
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 18.9%
## Confusion matrix:
## 0 1 class.error
## 0 493 56 0.102
## 1 112 230 0.327
Out of bag (OOB) error is the overall classification error in the above model and as we can see, OOB error still is more than 15%.
This means that we still can reduce the OOB error in the model.
# Create a data frame with two columns: PassengerId & Survived.
## Survived contains your predictions
my_solution_rf <- data.frame(PassengerId = test1$PassengerId, Survived = my_RF_prediction)
# Write your solution away to a csv file with the name my_solution.csv
write.csv(my_solution_rf, file = "my_solution_rf.csv" , row.names=FALSE)