This analysis features the prediction of Survival of Passengers in Titanic based on the data given by kaggle for “Titanic: Machine Learning from Disaster” competition.

library(data.table, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(randomForest, quietly = TRUE)
## Warning: package 'randomForest' was built under R version 3.2.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(vcd, quietly = TRUE)
## Warning: package 'vcd' was built under R version 3.2.3
require(Hmisc, quietly = TRUE)
## Warning: package 'Hmisc' was built under R version 3.2.3
## Warning: package 'Formula' was built under R version 3.2.3
## 
## Attaching package: 'Hmisc'
## 
## The following object is masked from 'package:randomForest':
## 
##     combine
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
library(party, quietly = TRUE)
## Warning: package 'party' was built under R version 3.2.3
## Warning: package 'mvtnorm' was built under R version 3.2.3
## Warning: package 'modeltools' was built under R version 3.2.3
## Warning: package 'strucchange' was built under R version 3.2.3
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Warning: package 'sandwich' was built under R version 3.2.3
require(Amelia)
## Loading required package: Amelia
## Warning: package 'Amelia' was built under R version 3.2.3
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2015 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

Getting the Data

The data size is pretty small and any package (data.table/base) can be used for reading the files. The files are read such that empty data is replaced by NA.

train <- fread("train.csv",na.strings = c("NA",""))
test <- fread("test.csv",na.strings = c("NA",""))

By using the Amelia package, we can get the sense of the percentage of data missing in the train data set. based on that we can proceed to impute the data.

missmap(train,col=c("yellow", "black"))
## Warning in if (class(obj) == "amelia") {: the condition has length > 1 and
## only the first element will be used

 sum(is.na(train$Age))/nrow(train) #NApercent_age <-
## [1] 0.1986532
 sum(is.na(train$Cabin))/nrow(train) # NApercent_cabin <-
## [1] 0.7710438

It can be observed 80% of the cabin data and 20% of the Age data is missing. So we need to impute the Age missing values. Before that we can check the various factors influencing the survival through plots. This also give us insight to pick our variables for imputation. The following factors are first converted to factor variables.

train <- train[,Survived:= as.factor(Survived)] #survive factor
train <- train[,Sex:= as.factor(Sex)]
train <- train[,Pclass:= as.factor(Pclass)]
train <- train[,Embarked:= as.factor(Embarked)]

test <- test[,Sex:= as.factor(Sex)]
test <- test[,Pclass:= as.factor(Pclass)]
test <- test[,Embarked:= as.factor(Embarked)]


save(train,test,file = "rawdata.RData")

Summary of the raw train and test data

str(train)
## Classes 'data.table' and 'data.frame':   891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  NA "C85" NA "C123" ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(test)
## Classes 'data.table' and 'data.frame':   418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  NA NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
##  - attr(*, ".internal.selfref")=<externalptr>
ggplot(data = train,aes(Survived)) + geom_bar(fill = "black") + ggtitle("Passengers Survived/Dead")

 ggplot(data = train,aes(Pclass)) + geom_bar(fill = "black") + ggtitle("Passengers Class")

 ggplot(data = train,aes(Age)) + geom_histogram(fill = "black") + ggtitle("Age of Passengers")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

 ggplot(data = train,aes(Sex)) + geom_histogram(fill = "black") + ggtitle("Gender of Passengers")

 ggplot(data = train,aes(SibSp)) + geom_histogram(fill = "black") + ggtitle("Sibings + Spouse")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

 ggplot(data = train,aes(Parch)) + geom_histogram(fill = "black") + ggtitle("Parents and Kids")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

 ggplot(data = train,aes(Fare)) + geom_histogram(fill = "black") + ggtitle("Fare")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

  ggplot(data = train,aes(Embarked)) + geom_bar(fill = "black") + ggtitle("Place of Embarkment")

  boxplot(train$Age ~ train$Survived, main="Passenger Fate by Age",
        xlab="Survived", ylab="Age")

“0” indicates passenger as Dead and 1 as “Survived”. It can be observed that more passengers died than survived. The number of people travelling by Third class is way more compared to other classes. Male passengers completely outnumbered the female passengers. Most of the passengers embarked at Southampton. Most of the passengers are middle aged.

mosaicplot(train$Pclass  ~ train$Survived, main =  "Passenger fate by class",color=TRUE)

  mosaicplot(train$Sex  ~ train$Survived, main =  "Passenger fate by Gender",color=TRUE)

  mosaicplot(train$Embarked  ~ train$Survived, main =  "Passenger fate by Embarkment",color=TRUE)

Mosaic plots are used to understand the effect of variables on Survival.It can be observed that more third class passengers perished and also the male passengers. It is known that females and children were first allowed to get on to the life boats. The embarkment place does not have much influence on the survival.

Imputing missing values of Age

summary(train$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

It was observed that 20% of the Age data are NA’s which is huge. There are many ways of imputing and the simplest way is by using the impute function of the hmisc package. But if we replace all the NA’s by the mean of the age, it may not be very accurate since the age groups are varying and the data set is small. So we can find the variables that influence the age.

boxplot(train$Age~train$Pclass, main="Passenger Class by Age",
xlab="Passenger class", ylab="Age")

It can be seen that the average age of the First class is more than the other two classes. It is quiet innate that it took some years for the people to become rich.

Next variable is Name. The way of addresing can in some way determine the age of the person like “Master” for a child and “Mr” for an adult. The following function is used to extract the title from the Name variable . Source

getTitle <- function(d){
  d$Title <- sapply(d$Name,FUN = function(x){
    strsplit(x,split = '[,.]')[[1]][2]
  })
  return(d$Title)
}
train$Title <- getTitle(train)
test$Title <- getTitle(test)
unique(train$Title)
##  [1] " Mr"           " Mrs"          " Miss"         " Master"      
##  [5] " Don"          " Rev"          " Dr"           " Mme"         
##  [9] " Ms"           " Major"        " Lady"         " Sir"         
## [13] " Mlle"         " Col"          " Capt"         " the Countess"
## [17] " Jonkheer"

By uisng bystas function in the hmisc package, we can know how many Titles have missing NA’s of age and use these titles only to impute the age.

bystats(train$Age,train$Title)
## 
##  Mean of train$Age by  
## 
##                 N Missing      Mean
##  Capt           1       0 70.000000
##  Col            2       0 58.000000
##  Don            1       0 40.000000
##  Dr             6       1 42.000000
##  Jonkheer       1       0 38.000000
##  Lady           1       0 48.000000
##  Major          2       0 48.500000
##  Master        36       4  4.574167
##  Miss         146      36 21.773973
##  Mlle           2       0 24.000000
##  Mme            1       0 24.000000
##  Mr           398     119 32.368090
##  Mrs          108      17 35.898148
##  Ms             1       0 28.000000
##  Rev            6       0 43.166667
##  Sir            1       0 49.000000
##  the Countess   1       0 33.000000
## ALL           714     177 29.699118
bystats(test$Age, test$Title)
## 
##  Mean of test$Age by  
## 
##           N Missing      Mean
##  Col      2       0 50.000000
##  Dona     1       0 39.000000
##  Dr       1       0 53.000000
##  Master  17       4  7.406471
##  Miss    64      14 21.774844
##  Mr     183      57 32.000000
##  Mrs     62      10 38.903226
##  Ms       0       1        NA
##  Rev      2       0 35.500000
## ALL     332      86 30.272590

Thus, only “Dr”,“Master”,“Miss”,“Mr”,“Mrs” have unknown ages and are used to impute the age by replacing the NA with the mean of the age of the corresponding Title.

title.na_age <- c("Dr","Master","Miss","Mr","Mrs")
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
train$Title <- trim(train$Title)
 test$Title <- trim(test$Title)
imputeMedian <- function(impute.var, filter.var, var.levels) {
  for (v in var.levels) {
    impute.var[ which( filter.var == v)] <- impute(impute.var[ 
      which( filter.var == v)])
  }
  return (impute.var)
}

#train[,"Age"] <-
train$Age <- imputeMedian(train$Age,train$Title,title.na_age)
title.na_age_test <- c("Master","Miss","Mr","Mrs","Ms")
test$Age <- as.integer(imputeMedian(test$Age, test$Title, 
                           title.na_age_test))
test$Age[which(is.na(test$Age))] <- as.integer(20)
summary(train$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   21.00   30.00   29.39   35.00   80.00
summary(test$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   22.00   28.00   29.55   36.00   76.00

Before proceeding to predict, we can further reduce the titles such that both the test and train datasets have same levels of titles.

test$Title[which(test$Title == "Dona" )] <- "Don"
train$Title[which(train$Title == "Mme" )] <- "Miss"
train$Title[which(train$Title == "Major" )] <- "Mr"
train$Title[which(train$Title == "Lady" )] <- "Mrs"
train$Title[which(train$Title == "Sir" )] <- "Rev"
train$Title[which(train$Title == "Mlle" )] <- "Miss"
train$Title[which(train$Title == "Capt" )] <- "Col"
train$Title[which(train$Title == "the Countess" )] <- "Mrs"
train$Title[which(train$Title == "Jonkheer" )] <- "Mr"

Fare has been give zero in few places. This can be converted to Na and replaced by the imputeMedian function by using class a sthe base for calculating the mean.

train$Fare[which(train$Fare == 0)]<- NA
train$Fare <- imputeMedian(train$Fare,train$Pclass,as.numeric(levels(train$Pclass)))
test$Fare[which(test$Fare == 0)]<- NA
test$Fare <- imputeMedian(test$Fare,test$Pclass,as.numeric(levels(test$Pclass)))
summary(train$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.012   7.925  14.500  32.690  31.280 512.300
train$Embarked[which(is.na(train$Embarked))] <- 'S'

Embarked coloumn has very few NA’s and can be replaced by the highest (Southampton).

summary(train$Embarked)
##   C   Q   S 
## 168  77 646
test$Embarked[which(is.na(test$Embarked))] <- 'S'
summary(test$Embarked)
##   C   Q   S 
## 102  46 270

Random Forest Prediction

Before prediction, let the required variables be grouped and converted to the required class.

train <- train[,Title := as.factor(Title)]
test <- test[,Title := as.factor(Title)]
str(train)
## Classes 'data.table' and 'data.frame':   891 obs. of  13 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 30 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  NA "C85" NA "C123" ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
##  $ Title      : Factor w/ 9 levels "Col","Don","Dr",..: 6 7 5 7 6 6 6 4 7 7 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(test)
## Classes 'data.table' and 'data.frame':   418 obs. of  12 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : int  34 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  NA NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
##  $ Title      : Factor w/ 9 levels "Col","Don","Dr",..: 6 7 6 6 7 6 5 6 7 6 ...
##  - attr(*, ".internal.selfref")=<externalptr>
getFeatures <- function(data){
  features <- c("Pclass",
                "Age",
                "Sex",
                "Parch",
                "SibSp",
                "Fare",
                "Embarked",
                "Title")
  f <- data[,features, with = FALSE]
  
  return(f)
}

The getfeatures function is used to extract the required features from the datasets. Random Forest function is used for the prediction fit. The importance = TRUE argument enables us to inspect the impotance of the difference variables via varImpPlot.

rf <- randomForest(getFeatures(train),as.factor(train$Survived), ntree = 1000, importance = TRUE)
#str(rf)
varImpPlot(rf)

The accuracy tests helps to see how worse the model performs without each variable, so a high decrease in accuracy would be expected for very predictive variables. The Gini one digs into the mathematics behind decision trees, but essentially measures how pure the nodes are at the end of the tree. Again it tests to see the result if each variable is taken out and a high score means the variable was important. source

The Passengerclass, Fare and the Title can be seen as very influential parameters.The random forest fit isused to predict the survival and the out put is written in the format required for the submission.

sub <- data.table(PassengerId = test$PassengerId)
sub$Survived <- predict(rf,getFeatures(test))

write.csv(sub,file = "rf.csv",row.names=FALSE)