Predicting Survival on the Titanic

Author: Sriram Sampath

This is one of the first prediction problems that I had attempted in Kaggle. The objective of the competition is to predict whether a passenger would survive or not.

The evaluation metric is the Model Accuracy.

NOTE: I have jotted down only those portions that I have included in the final model. The variables that do not feature here were part of the model iterations but were deemed to be less important.

Initial Profiling

Setting your working directory:

setwd("C:/Users/sriram.sampath/Desktop/Kaggle/Titanic")

Read the dataset into R:

train <- read.csv("train.csv")

The dataset consits of 891 rows and 12 columns.

Let's see what these Columns are:

colnames(train)

##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

So there are 12 columns altogether. Our dependent variable is Survived. Let's run a simple frequency table of the dependent variable. Or in other words, let's see how many of them have survived, from the data in “train”

table(train$Survived)

## 
##   0   1 
## 549 342

Looks like 342 people have survived and 549 have died. The event rate, which is the rate of survival, is around 38%.

How is the distribution of people across classes? The Kaggle page says that “Pclass is a proxy for socio-economic status (SES) 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower”

table(train$Pclass)

## 
##   1   2   3 
## 216 184 491

Majority of the passengers are in the third class. But does the passenger class have an impact on survival? Let's check.

train$Pclass <- as.factor(train$Pclass)
table(train$Pclass,train$Survived)

##    
##       0   1
##   1  80 136
##   2  97  87
##   3 372 119

Look's like 75% of the people in class 3 have perished. To get a more visual feel of this, lets try to plot this.

train$Survived <- as.factor(train$Survived)
plot(train$Pclass,train$Survived)

plot of chunk unnamed-chunk-7

Note: Please ensure that you convert the concerned variables to factors.

It seems obvious that people in the upper classes have a higher rate of survival.

Let's move to the variable “Sex” - we will repeat the same exercise as with Pclass.

train$Sex <- as.factor(train$Sex)
table(train$Sex)

## 
## female   male 
##    314    577

Does Sex impact the rate of survival?

table(train$Sex, train$Survived)

##         
##            0   1
##   female  81 233
##   male   468 109

From a quick glance, it is evident that females have a better survival rate. ~75% of the women have survived when compared to ~18% of survival among men.

Again, let's plot this one.

plot(train$Sex, train$Survived)

plot of chunk unnamed-chunk-10

Let's now move on to AGE.

summary(train$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

There are 177 NA's in AGE. We will definitely have to do something about this as Age might be an important variable - so let's come back to it at a later stage.

FARE Variable.

The first thing that I did was to check whether there is any correlation between Fare and Age.

cor(train$Fare, train$Age, use="complete.obs")

## [1] 0.09606669

use=“complete.obs” ensures that the correlation is calculated for cases that have both Fare and Age. Remember, we have 177 NA's in AGE.

The correlation coefficient is very less, though.

Let's do a summary of Fare.

summary(train$Fare)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.91   14.45   32.20   31.00  512.30

Zero Fare? How can someone ride without paying a fare. Let's investigate.

subset(train, Fare < 5, select = c(Age, Fare))

##     Age   Fare
## 180  36 0.0000
## 264  40 0.0000
## 272  25 0.0000
## 278  NA 0.0000
## 303  19 0.0000
## 379  20 4.0125
## 414  NA 0.0000
## 467  NA 0.0000
## 482  NA 0.0000
## 598  49 0.0000
## 634  NA 0.0000
## 675  NA 0.0000
## 733  NA 0.0000
## 807  39 0.0000
## 816  NA 0.0000
## 823  38 0.0000

There might be some error here. What I have done is to substitute the zero fares with the median values of the fares corresponding to each passenger class. Also, we see that zero fares are not corresponding to infants, whom I assumed to be allowed to travel free of cost.

aggregate(Fare~Pclass,train,median)

##   Pclass    Fare
## 1      1 60.2875
## 2      2 14.2500
## 3      3  8.0500

I took median because it is a more robust measure than the mean.

train$Fare <- ifelse( (round(train$Fare==0) & as.numeric(train$Pclass)==1),60.2875,
                    ifelse( (round(train$Fare==0) & as.numeric(train$Pclass)==2),14.25,
                            ifelse( (round(train$Fare==0) & as.numeric(train$Pclass)==3),8.05,train$Fare)))

The subsitution is done! Let's check the summary once again.

summary(train$Fare)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.012   7.925  14.500  32.670  31.280 512.300

Okay, it's time to move on.

There is a variable called “Name” - I'm going to use this variable to extract another feature called “Title”, which I believe might be helpful to predict the missing age values.

train$Title <-  ifelse(grepl("mr", tolower(train$Name)), 'Mr', 
                       ifelse(grepl("miss", tolower(train$Name)), 'Miss', 
                              ifelse(grepl("mrs", tolower(train$Name)), 'Mrs', 
                                     ifelse(grepl("master", tolower(train$Name)), 'Master','Unknown'))))

The default value for Title has been set to Unknown as there are many other titles that occur sparsely, such as Major., Dr. etc…

NOW IT'S TIME TO BUILD AN INTERIM MODEL TO PREDICT THE MISSING AGE VALUES. As you might have guessed, this will be a linear regression problem as we need to predict a continuous variable.

After a few iterations, I decided to stick to the following independent variables

Title
Fare
SibSp
Sex

age.model<-lm(Age ~ Title + Fare + SibSp + Sex -1, data = train)
summary(age.model)

## 
## Call:
## lm(formula = Age ~ Title + Fare + SibSp + Sex - 1, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.144  -8.575  -1.509   6.544  46.532 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## TitleMaster  11.399455   2.765468   4.122 4.20e-05 ***
## TitleMiss    21.204678   1.129183  18.779  < 2e-16 ***
## TitleMr      34.875953   1.224548  28.481  < 2e-16 ***
## TitleUnknown 43.689547   2.746480  15.907  < 2e-16 ***
## Fare          0.043236   0.008738   4.948 9.38e-07 ***
## SibSp        -2.568739   0.547869  -4.689 3.30e-06 ***
## Sexmale      -2.704734   1.281596  -2.110   0.0352 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.03 on 707 degrees of freedom
##   (177 observations deleted due to missingness)
## Multiple R-squared:  0.8689, Adjusted R-squared:  0.8676 
## F-statistic: 669.4 on 7 and 707 DF,  p-value: < 2.2e-16

for(i in 1:nrow(train))
{
  if(is.na(train[i,"Age"]))
  {
    train[i,"Age"] <- predict(age.model, newdata=train[i,])
  }

}

What this code will do is to insert the predicted age values in places where the age is missing.

-1 in the equation indicates that it is a no-intercept model.

How good is my model?

summary(age.model)

## 
## Call:
## lm(formula = Age ~ Title + Fare + SibSp + Sex - 1, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.144  -8.575  -1.509   6.544  46.532 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## TitleMaster  11.399455   2.765468   4.122 4.20e-05 ***
## TitleMiss    21.204678   1.129183  18.779  < 2e-16 ***
## TitleMr      34.875953   1.224548  28.481  < 2e-16 ***
## TitleUnknown 43.689547   2.746480  15.907  < 2e-16 ***
## Fare          0.043236   0.008738   4.948 9.38e-07 ***
## SibSp        -2.568739   0.547869  -4.689 3.30e-06 ***
## Sexmale      -2.704734   1.281596  -2.110   0.0352 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.03 on 707 degrees of freedom
##   (177 observations deleted due to missingness)
## Multiple R-squared:  0.8689, Adjusted R-squared:  0.8676 
## F-statistic: 669.4 on 7 and 707 DF,  p-value: < 2.2e-16

But let me do a summary of the predicted values.

summary(train$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -8.848  21.000  30.000  29.620  36.000  80.000

The minimum value is -8. Age cannot be negative. Let me see how many such values are there.

length(which(train$Age <= 0))

## [1] 1

Only one value - I'm going to just take the modulo of the negative value.

train$Age_1 <- ifelse(train$Age<0, (-1*train$Age), train$Age)

Now it's time for some “feature engineering”. I'm going to create some additional variables based on my assumptions and profiling.

train$Child <- ifelse(train$Age_1<=18,1,0)

So anyone who is less than 18 years are considered to be children. Children are mosty likely to be rescued first - let's check it out.

table(train$Child, train$Survived)

##    
##       0   1
##   0 469 269
##   1  80  73

Looks like it doesn't it - there is a ~50% chance that you survive if you are a child.

table(train$Child, train$Survived, by=train$Sex)

## , , by = female
## 
##    
##       0   1
##   0  53 186
##   1  28  47
## 
## , , by = male
## 
##    
##       0   1
##   0 416  83
##   1  52  26

From here, it's also somewhat evident that female children are more likely to survive.

Creating an indicator for Woman

train$Woman <- ifelse(train$Sex=="female",1,0)

Creating another indicator for Child or Woman

train$CoW <- ifelse(train$Child==1 | train$Woman==1,1,0)

NOW COMING TO THE VARIABLES RELATED TO FAMILY.

train$FamilySize <- train$SibSp + train$Parch + 1
train$FamilyCat <- cut(train$FamilySize, c(0,1,4,12))

What is the family size of the individual? That's what FamilySize gives you. In the next step, I have split the Family Sizes into 3 groups - those who are travelling alone, those whose family size is 2-4 and the third group of individuals whose family size is greater than 4.

To demonstrate how this works, let's tabulate the FamilyCat and survival

table(train$FamilyCat, train$Survived)

##         
##            0   1
##   (0,1]  374 163
##   (1,4]  123 169
##   (4,12]  52  10

Individuals travelling alone and those with very high family sizes have a bad survival rate.

Now a variable for tracking down the 3rd class passengers alone - the other two passenger classes had a fairly similar survival rate.

train$PC3 <- ifelse(train$Pclass=="3",1,0)

Finally, moving on to the modelling part. I have chosen randomForest over logistic regression because the former performed much better with the same set of variables and fetched much higher accuracy.

In order to use randomForest in R, we must first install the randomForest package and load the library into the R environment.

library(randomForest)

## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

After numerous iterations, I have decided to go with the following independent variables:

Pclass # Passenger class - the proxy for socio-economic status
FamilyCat # Family Category - Family Size=1, Family Size 2-4, Family Size >4
Sex
CoW # Is the passenger a Child or a woman?
Title

I'm loading only these required variables into a new data frame:

df_train <- train[,c("Survived","FamilyCat","Pclass","Sex","CoW","Title")]

The next step is to split the data into a Training Set & a Validation Set

dataTrain<-df_train[1:700,]
dataVal<-df_train[701:nrow(df_train),]

What the above code means is that the first 700 rows of data are loaded into the data frame object “dataTrain” and the remaining rows are loaded into “dataVal”, which is going to be my validation set.

All my variables are categorical, so I'm converting them to factors.

dataTrain$Survived <- as.factor(dataTrain$Survived)  
dataTrain$Sex <- as.factor(dataTrain$Sex)
dataTrain$Title <- as.factor(dataTrain$Title)       
dataTrain$Pclass <- as.factor(dataTrain$Pclass)
dataTrain$CoW <- as.factor(dataTrain$CoW)   
dataTrain$FamilyCat <- as.factor(dataTrain$FamilyCat)

Using randomForest package, I'm going to determine how many nodes to split at each level.

bestmtry <- tuneRF(dataTrain[-1],dataTrain$Survived, ntreeTry=100, stepFactor=1.5,improve=0.01, trace=TRUE, plot=TRUE, dobest=FALSE)

## mtry = 2  OOB error = 17.57% 
## Searching left ...
## Searching right ...
## mtry = 3     OOB error = 18.14% 
## -0.03252033 0.01

plot of chunk unnamed-chunk-37

The best mtry value seems to be 2, so this will be used during the model process.

rf_model<-randomForest(Survived~Pclass+Sex+CoW+FamilyCat+Title,data=dataTrain,type=classification, ntree=100,mtry=2)

It's time to move to the validation set. It is essential that we ensure that the datatypes of the input variables are the same across the training and validation sets.

dataVal$Survived <- as.factor(dataVal$Survived)
dataVal$Title <- as.factor(dataVal$Title)
dataVal$Sex <- as.factor(dataVal$Sex)
dataVal$CoW <- as.factor(dataVal$CoW)
dataVal$Pclass <- as.factor(dataVal$Pclass)
dataVal$FamilyCat <- as.factor(dataVal$FamilyCat)

Predicting the probabilities using the validation set:

rf.pred = predict(rf_model,type="prob",newdata=dataVal)[,2]

The predict function gives the output (probabilities) in two columns. We require the second column, hence the [,2] condition is imposed.

Alright, now we have the probabilities as well. In order to understand the accuracy at different thresholds, I have written a looping structure that would give me the threshold at which the maximum accuracy can be achieved by the model.

For this purpose, I have used the accuracy function from the package SDMTools.

library(SDMTools)
df = data.frame(matrix(vector(), 0, 3, dimnames=list(c(), c("threshold","auc","accuracy"))), stringsAsFactors=F)
for ( i in seq(0,1,0.01)) {
d <- list()
d$threshold <- as.numeric(accuracy(dataVal$Survived,rf.pred,threshold=i)[1])
d$auc <- as.numeric(accuracy(dataVal$Survived,rf.pred,threshold=i)[2])
d$accuracy <- as.numeric(accuracy(dataVal$Survived,rf.pred,threshold=i)[6])
df <- rbind(df, c(d$threshold, d$auc, d$accuracy))
}
write.csv(df, "exp random.csv", row.names=FALSE)

The output is stored in a CSV file. On sorting the accuracy values in descending order, the following metrics were observed:

Threshold = 0.31, AUC = 0.8485, Accuracy = 0.8638

These are the metrics corresponding to maximum accuracy that could be achieved by the model. I will have to use this threshold while scoring the model in the test set.

Preparing the TEST SET

The test set should possess the same variable definitions and data types as the train and validation set.

test<-read.csv("test.csv")
test$Sex <- as.factor(test$Sex)
test$Pclass <- as.factor(test$Pclass)
test$Title <-  ifelse(grepl("mr", tolower(test$Name)), 'Mr', 
                        ifelse(grepl("miss", tolower(test$Name)), 'Miss', 
                              ifelse(grepl("mrs", tolower(test$Name)), 'Mrs', 
                                      ifelse(grepl("master", tolower(test$Name)), 'Master','Unknown'))))

test$Title <- as.factor(test$Title)
age.model<-lm(Age ~ Title + Fare + SibSp + Sex -1, data = test)
summary(age.model)

## 
## Call:
## lm(formula = Age ~ Title + Fare + SibSp + Sex - 1, data = test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.332  -8.041  -1.074   7.185  36.326 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## TitleMaster   8.57906    3.55286   2.415   0.0163 *  
## TitleMiss    18.42300    1.49292  12.340  < 2e-16 ***
## TitleMr      33.91184    1.59662  21.240  < 2e-16 ***
## TitleUnknown 41.30887    4.86263   8.495 7.28e-16 ***
## Fare          0.07505    0.01033   7.265 2.81e-12 ***
## SibSp        -0.15555    0.75529  -0.206   0.8370    
## Sexmale      -4.16846    1.66472  -2.504   0.0128 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.05 on 324 degrees of freedom
##   (87 observations deleted due to missingness)
## Multiple R-squared:  0.8922, Adjusted R-squared:  0.8899 
## F-statistic: 383.2 on 7 and 324 DF,  p-value: < 2.2e-16

for(i in 1:nrow(test))
{
  if(is.na(test[i,"Age"]))
  {
    test[i,"Age"] <- predict(age.model, newdata=test[i,])
  }

}
summary(test$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.17   22.00   30.00   29.89   36.00   76.00

test$Age_1 <- ifelse(test$Age<0, (-1*test$Age), test$Age)
test$Child <- ifelse(test$Age_1<=18,1,0)
test$Woman <- ifelse(test$Sex=="female",1,0)
test$CoW <- ifelse(test$Child==1 | test$Woman==1,1,0) #Child or Woman
test$CoW <- as.factor(test$CoW)
test$FamilySize <- test$SibSp + test$Parch + 1
test$FamilyCat <- cut(test$FamilySize, c(0,1,4,12))
test$FamilyCat <- as.factor(test$FamilyCat)

Finally, all the probabilities greater than or equal to 0.31 will classify as a 1, as per the previously obtained threshold value.

test$Prob <- predict(rf_model,type="prob",newdata=test)[,2]
test$Survived <- ifelse(test$Prob >= 0.31, 1, 0)

The last step of the exercise! Write out the file!

The file consisting only of two columns: PassengerId & Survived has to be submitted:

write.csv(test[,c(1,20)],"sub7_rf_revised.csv", row.names=FALSE)

The model fetched me an accuracy of 0.79426 on the leaderboard. At the time of submission, my rank was 603, which is in the top 25 percentile. Before I could complete this document, my rank slid down to 611!

Anyway, 0.79426 is not all that bad for this competition!