This project involves analyzing and predicting which passengers were most likely to survive the sinking of the British liner Titanic, which went down in the North Atlantic Ocean in the early morning of 15 April 1912 after colliding with an iceberg during her maiden voyage from Southampton, UK, to New York City, US. Titanic carried lifeboats for only 1,178 people, slightly more than half of the number on board and about one third of her total capacity.

First, read in the data needed for the rest of the analysis.

#set the seed for reproducibility; the train data is used for training, while the test data is used to evaluate the model
seed <- as.numeric(as.Date("2015-12-12"))
set.seed(seed)
setwd("C:/Users/s7c/Documents/titanic")
train <- read.csv('C:/Users/s7c/Documents/titanic/train.csv', stringsAsFactors = FALSE)
test <- read.csv('C:/Users/s7c/Documents/titanic/test.csv', stringsAsFactors = FALSE)
#I'm going to need this later for the submission file
PassengerId <- test$PassengerId

Data Exploration

Exploring the data to find patterns and ideas that could help develop the prediction model.

# a quick look at the data
View(train)
# we can see we are dealing with missing values, which appear in the data as NA. How many, and in which columns?
sum(is.na(train))
## [1] 177
colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
# Age variable has 177 missing values

# get a detailed view of our variables' structure and composition
str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Only around 38% of the passengers survived overall.

library(RColorBrewer)
barplot(prop.table(table(train$Survived)), col  = brewer.pal(3,"Set1"))

There were almost twice as many men as women; in the plot below, surviving men and perished women each make up only around 10% of all passengers. About 25% traveled in first class, around 20% in second class and around 55% in third class. Almost 65% survived in first class, around 45% in second class and just 25% in third class.
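
These figures can be checked numerically; a quick sketch, assuming train is loaded as above:

# proportions quoted above, computed directly
prop.table(table(train$Sex))                       # share of men vs. women
prop.table(table(train$Sex, train$Survived), 1)    # survival rate within each sex
prop.table(table(train$Pclass))                    # share of passengers per class
prop.table(table(train$Pclass, train$Survived), 1) # survival rate within each class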

barplot(prop.table(table(train$Survived, train$Sex)),col  = brewer.pal(3,"Set1"))

library(ggplot2)

qplot(factor(Pclass), data=train, geom="bar", fill=factor(Survived))

An interesting and more detailed view, which confirms some of the above: in second class, the men curiously had the lowest number of survivors. The largest count of surviving men was in third class, but as a percentage the survival rate for men was highest in first class. For the women, both the count and the percentage of survivors were highest in first class, while the other two classes had approximately similar numbers of female survivors.

mosaicplot(table(train$Pclass, train$Sex,train$Survived), sort = c(3,1,2), color = T)

A quick visualization of survival by age

library(gridExtra)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
cutSurviving <- cut2(train$Survived,g=2)
qplot(cutSurviving,Age, data=train,fill=cutSurviving,
           geom=c("boxplot"))
## Warning: Removed 177 rows containing non-finite values (stat_boxplot).

So the survivors are slightly younger on average. Not necessarily something to work with, besides perhaps the fact that most of the people were between roughly 19 and 38 years old.
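
The 19 to 38 range mentioned above is roughly the interquartile range of Age; a one-line check (sketch):

summary(train$Age) # the 1st and 3rd quartiles bracket the bulk of the ages; note the 177 NAs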

A different approach:

qplot(Pclass,Age,colour=Survived,data=train, na.rm=TRUE)+geom_smooth(method='lm',formula=y~x)
## Warning: Removed 177 rows containing non-finite values (stat_smooth).

qplot(factor(Pclass), Age, data = train, geom = "boxplot")
## Warning: Removed 177 rows containing non-finite values (stat_boxplot).

So now we can clearly see that a person's age matters a lot for survival once it is correlated with the Pclass variable. The younger passengers had a better chance, and if we follow the smoothed line we can see that age is inversely proportional to the Pclass number. Correlating all this with the fact that the highest median age was in first class, while the youngest passengers were in third class, makes me think the road ahead will be pretty challenging.

Looking further at the Fare variable, we can clearly see that those who paid more for a ticket had a better chance of survival. This is mainly because the most expensive tickets were in first class, which had the highest survival rate.

qplot(cutSurviving,Fare, data=train,fill=cutSurviving,
                         geom=c("boxplot"))

qplot(Fare, Age, data=train, colour=factor(Pclass), na.rm = TRUE)
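
As a quick numeric check of the claim that the most expensive tickets were in first class (a sketch, nothing more):

tapply(train$Fare, train$Pclass, median) # median fare per class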

Data Mining

Being done with the data exploration part, I will now dive in and manipulate the remaining variables so that they can be used properly in the prediction model.

One conclusion is that we can work with the Ticket variable to improve our prediction by manipulating it to reflect the corresponding class. The code below infers a person's class from the first numbers or letters of the ticket, using the grep function. The mapping is right about 90% of the time; in the remaining 10% the patterns I found do not match the class, but I think it is still a big improvement.

library(data.table)
dt1<- as.data.table(train)

dt1$Ticket[grep('^1|^57|^69|^P', dt1$Ticket)] <- 1
dt1$Ticket[grep('^2|^SW|^SO|^SC|^S.W|^S.P|^S.O|^S.C|^F', dt1$Ticket)] <- 2
dt1$Ticket[grep('^3|^C|^26|^4|^54|^65|^7|^8|^9|^A|LINE|SOTON|STON|^W|^160|^14', dt1$Ticket)] <- 3
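
A rough sanity check of the prefix mapping (sketch): comparing the derived ticket class against the actual Pclass should show agreement of roughly the 90% claimed above.

mean(dt1$Ticket == as.character(dt1$Pclass)) # fraction of tickets mapped to the right class
table(dt1$Ticket, dt1$Pclass)                # where the mismatches fall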

We saw previously that the Age variable has a lot of missing values. A quick way to solve this would be to replace them with the average of the remaining Age values, which is about 29.7. Considering the context, though, this would be a poor decision, and a more complex approach is necessary, because the NAs clearly appear in different groups of people with different ages, and therefore different averages. I will handle the missing values in the Age variable by grouping people into age categories: I will take the honorific term or prefix that appears in the name and replace it with something more standard.
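
For reference, the naive global mean discussed above is a one-liner (sketch):

round(mean(train$Age, na.rm = TRUE), 1) # ~29.7, the value a naive imputation would use for every NA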

dt1$Name[grep('Master.', dt1$Name, fixed=TRUE)] = 'Master'
dt1$Name[grep('Mr.', dt1$Name, fixed=TRUE)] = 'Mr' #fixed=TRUE is needed, otherwise '.' acts as a regex wildcard and 'Mrs' gets overwritten with 'Mr'
dt1$Name[grep('Miss.|Mlle', dt1$Name)] = 'Miss'
dt1$Name[grep('Mrs.|Mme.|Ms.|Lady.|Countess.', dt1$Name)] = 'Mrs'
dt1$Name[grep("Dr.|Col.|Rev.|Capt.|Major.|Sir.|Don.|Jonkheer", dt1$Name)] = 'Sir'

#create a data table with the average age per name group
dt_0<-dt1[, mean(Age, na.rm = TRUE), by = Name]

#create a new data table containing just the missing ages and the corresponding names
dt0 <- dt1[is.na(Age),.(Age, Name)]
#merge the two into a single data table
dt <- merge(dt0, dt_0, by = 'Name')
#fill in the missing values, then remove the helper V1 column
dt <- dt[, Age := round(V1, 2)]
dt <- dt[, V1 := NULL]
unique(dt)
##      Name   Age
## 1: Master  4.57
## 2:   Miss 21.80
## 3:     Mr 32.37
## 4:    Mrs 35.80
## 5:    Sir 46.05
#and now let's replace these missing values in our initial data. merge() sorted dt alphabetically by Name, so a naive assignment would fill the NAs in row order and ignore the name allocation: the first missing 'Mr' would receive the first value in dt, which belongs to another group. Setting the key of dt1 to Name sorts dt1 alphabetically as well, so the missing values for each name group line up correctly.
setkey(dt1, Name)
dt1$Age[is.na(dt1$Age)] <- dt[,Age]
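
A quick sanity check on the imputation (sketch): no NAs should remain, and the per-group averages should match the table above.

sum(is.na(dt1$Age))                                         # expect 0
dt1[, .(mean_age = round(mean(Age), 1), n = .N), by = Name] # group means after imputation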

Having finished handling the missing values, let's go back to the data exploration part for a bit and see how the age groups relate to the Name variable:

qplot(Age,colour=Name,data=dt1,geom="density")

#getting into more detail, we can see that the passengers named Master and Miss were mostly younger than 25, so they had a bigger chance of surviving. Should that imply a new variable?

OK, we saw the relation between Age and Name. But how is the Name variable represented across the travel classes? The answer, confirming some of the conclusions above, is that Master, Miss and Mrs had a very high chance of surviving if they were located in the first two classes: the women and the children! It is hard to draw a conclusion for the third class. Another overall conclusion confirmed here is that the men, i.e. Mr, had the lowest chance of survival.

table(dt1$Pclass, dt1$Name)
##    
##     Master Miss  Mr Mrs Sir
##   1      3   48 107  45  13
##   2      9   34  91  42   8
##   3     28  102 319  42   0
table(dt1$Pclass, dt1$Name,dt1$Survived)
## , ,  = 0
## 
##    
##     Master Miss  Mr Mrs Sir
##   1      0    2  70   1   7
##   2      0    2  83   4   8
##   3     17   51 283  21   0
## 
## , ,  = 1
## 
##    
##     Master Miss  Mr Mrs Sir
##   1      3   46  37  44   6
##   2      9   32   8  38   0
##   3     11   51  36  21   0
ggplot(dt1, aes(Name, Pclass)) + geom_bar(stat = 'identity') + facet_grid(Survived~Pclass) +ylim(0, 300)

Checking whether any variables are correlated or have near-zero variability, in order to avoid the burden of overfitting the model. We get a mild correlation between the Fare and Pclass variables, which theoretically shouldn't have an effect, but I will look into it at the right time.

library(caret)
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:survival':
## 
##     cluster
findCorrelation(cor(train[,-c(1, 4, 5, 9, 11, 12)]), cutoff = .7, names = T)
## character(0)
#there is no correlation that matters.
#The only notable correlation is, unsurprisingly, between class and the fare paid:
# cor(train$Pclass, train$Fare)
# [1] -0.5494996

#I can also check visually for near-zero-variance variables, which would just deteriorate the prediction model.
table(train$Sex) #for example
## 
## female   male 
##    314    577
#so we have many instances of each of the two Sex values, which means Sex is not a near-zero-variance variable. The next function checks the whole data set for this type of situation:
nzv <- nearZeroVar(dt1, saveMetrics=TRUE)
if (any(nzv$nzv)) nzv else message("No variables with near zero variance")
## No variables with near zero variance
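
For reference, the metrics nearZeroVar computes per column can be inspected directly (sketch): freqRatio is the ratio of the most common value's frequency to the second most common, and percentUnique is the percentage of distinct values.

head(nzv) # one row per variable, with freqRatio, percentUnique, zeroVar and nzv flags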

Feature Engineering and additional exploration

I tried different approaches to reflect the ideas from the exploration part, but most of them didn't improve the model. Some of them are presented below; each conclusion reflects a model that was tried with unsatisfactory results.

#if we take a look at the columns description:
#                sibsp           Number of Siblings/Spouses Aboard
#                parch           Number of Parents/Children Aboard
#With respect to family relations, it follows that family size could be an important factor for survival. Not sure in which way though...
#So I tried to create a new variable formed by the sum of sibsp and parch
# dt1$Family <- dt1$Parch + dt1$SibSp + 1 #I added 1 because we are considering the family here, and it's common sense to include the person itself
# dt1[,c('Parch', 'SibSp') := NULL]      
# unique(dt1$Family)
# ggplot(dt1, aes(x = Family, fill = factor(Survived))) + geom_bar(position = "fill") # 1 means alive / 0 means dead
#This confirms the statement above: large families didn't survive.
#But the feature proved a bad idea, showing no improvement. An explanation is that it's easier for the model to identify the females and kids...

#Even though we can derive some clear hypotheses from the charts related to the Fare variable, using Fare directly, or an approach like splitting it into groups, apparently just overfits the model. An explanation is that the mild correlation we saw with the Pclass variable weighs hard on the model.

#the one which proved to have a significant impact:
#it reflects the idea that the name groups Master, Miss and Mrs had a high/medium/acceptable chance of survival when correlated with first/second/third class. But because these name groups imply a wide age variation, I restrict them as below.
dt1 <- dt1[, New_group := 'Low'] #initialize as a character column, so the loop below doesn't have to coerce an integer one
for (i in 1:nrow(dt1)) {
 if(((dt1$Sex[i] == 'female' | dt1$Age[i] < 18)   &  dt1$Pclass[i] == 2) | (dt1$Pclass[i] == 1)){
     dt1$New_group[i] = 'High'
  } else if((dt1$Sex[i] == 'female' | dt1$Age[i] < 18) &  dt1$Pclass[i] == 3){
     dt1$New_group[i] = 'Acceptable'
 }     else{
     dt1$New_group[i] = 'Low'
}
}
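
The same grouping can be written vectorized in data.table, which avoids the row-by-row loop; a sketch producing the identical New_group column:

dt1[, New_group := 'Low']
dt1[(Sex == 'female' | Age < 18) & Pclass == 3, New_group := 'Acceptable']
dt1[((Sex == 'female' | Age < 18) & Pclass == 2) | Pclass == 1, New_group := 'High']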

Perform similar transformations on the test data

test <- test[, -c(1, 9:11)]               
dt2<- as.data.table(test)
#strangely, I got an error here at first, and on a closer look:
colSums(is.na(test))
## Pclass   Name    Sex    Age  SibSp  Parch Ticket 
##      0      0      0     86      0      0      0
#besides the expected Age NAs, the test data turned out to contain a missing Fare value, which the train data does not have.
# dt2$Fare <- ifelse(is.na(dt2$Fare), mean(dt2$Fare, na.rm = TRUE), dt2$Fare) #this would fix the problem (Fare was dropped above, so it stays commented out)

dt2$Ticket[grep('^1|^57|^69|^68|^P', dt2$Ticket)] <- 1
dt2$Ticket[grep('^2|^SW|^SO|^SC|^S.W|^S.P|^S.O|^S.C|^F', dt2$Ticket)] <- 2
dt2$Ticket[grep('^3|^C|^26|^4|^54|^65|^7|^8|^9|^A|LINE|SOTON|STON|^W|^160|^14|^LP', dt2$Ticket)] <- 3
#apparently the Ticket variable in the test data has new levels, so I changed the patterns above to handle them

dt2$Name[grep('Master.', dt2$Name, fixed=TRUE)] = 'Master'
dt2$Name[grep('Mr.', dt2$Name, fixed=TRUE)] = 'Mr' #fixed=TRUE is needed, otherwise '.' acts as a regex wildcard and 'Mrs' gets overwritten with 'Mr'
dt2$Name[grep('Miss.|Mlle', dt2$Name)] = 'Miss'
dt2$Name[grep('Mrs.|Mme.|Ms.|Lady.|Countess.', dt2$Name)] = 'Mrs'
dt2$Name[grep("Dr.|Col.|Rev.|Capt.|Major.|Sir.|Don.|Jonkheer", dt2$Name)] = 'Sir'            

dt_00<-dt2[, mean(Age, na.rm = TRUE), by = Name]
dt00 <- dt2[is.na(Age),.(Age, Name)]
dtt <- merge(dt00, dt_00, by = 'Name')
dtt <- dtt[, Age := round(V1, 2)]
dtt <- dtt[, V1 := NULL]
unique(dtt)
##      Name   Age
## 1: Master  7.41
## 2:   Miss 21.77
## 3:     Mr 32.00
## 4:    Mrs 38.90
setkey(dt2, Name)
dt2$Age[is.na(dt2$Age)] <- dtt[,Age]              
dt2 <- dt2[, New_group := 'Low'] #initialize as character, as for the train data
 for (i in 1:nrow(dt2)) {
 if(((dt2$Sex[i] == 'female' | dt2$Age[i] < 18)   &  dt2$Pclass[i] == 2) | (dt2$Pclass[i] == 1)){
     dt2$New_group[i] = 'High'
  } else if((dt2$Sex[i] == 'female' | dt2$Age[i] < 18) &  dt2$Pclass[i] == 3){
     dt2$New_group[i] = 'Acceptable'
 }     else{
     dt2$New_group[i] = 'Low'
}
}

Machine Learning

Logistic regression

Binary models are frequently used to model outcomes that take two values, as in our case: alive vs. dead. Linear regression is not appropriate for a qualitative response. Before any feature engineering I fitted a model just to see how it behaves and got an AIC of 739, which will be my reference for any future approaches. Furthermore, I tried to work with the Fare variable in the way shown as a comment in the feature engineering part. Then I went further and created a new Family variable by merging the SibSp and Parch columns, but neither of these approaches improved the model. A first conclusion is that the model is better without the Fare column at all.

Some of the ideas from the data exploration part I wasn't able to use properly in the feature engineering part to improve my model. Consider, though, that most of the people were in fact located in third class, which implies a relation between family and class, with most families with children in third class. We also saw a relation between females and their class location: more females than males survived, and far more people from first class survived compared, in particular, with third class. Add to this that the younger a person was, the better their chance of surviving; for that age group the representative names are Master and Miss. So although it didn't work as a plain feature engineering approach, given the obvious relations here, let's try to reflect these ideas with some interactions and hard-coding:

#we start by applying a backward stepwise selection approach, beginning with all the predictors:
glm<-glm(Survived ~ Pclass + Name  + Sex + New_group  + Age +  Ticket+ SibSp+Parch + Fare + Cabin + Embarked,  data=dt1, family = binomial(link = 'logit'))
#the conclusions: residual deviance of 550.45 on 729 degrees of freedom and AIC: 874.45
#So it becomes clear that, given the p-value of .99 for most Cabin levels, far above the 0.05 threshold, we fail to reject the null hypothesis. This implies that the fate of the passengers did not depend on cabin type!
#I will continue to follow this idea while also trying to reflect some of the work done so far...

I will keep just those variables that truly help the prediction process.

dt1[, c("PassengerId","Fare", 'Cabin', 'Embarked') := NULL]
#the rest I will handle by grouping the people, using common sense pretty much.

#I end up with this model:
glm<-glm(Survived ~ Pclass + I(Name == 'Master') + Sex + New_group  + Age +  Ticket+ SibSp+Parch + SibSp * Pclass,  data=dt1, family = binomial(link = 'logit'))
#with the following summary:
#Residual deviance:  685.1  on 879  degrees of freedom and
#AIC: 709.1
#the coefficients are interpreted as log odds ratios; as a measure of model quality, consider the Akaike Information Criterion (AIC)
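#as a quick aid to interpretation (a sketch, not part of the model selection itself):
#exponentiating the coefficients turns log odds ratios into plain odds ratios
round(exp(coef(glm)), 3)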

glm_p <- predict(glm, dt2, type = 'response')# predict() gives the probability that each passenger survived
#type = "response" tells R to output probabilities of the form P(Y = 1|X), as opposed to other quantities such as the logit

#preparing for submission
Survivor <- vector()
for(i in 1:length(glm_p)){
  if(glm_p[i] > .9){
    Survivor[i] = 1
  } else {
    Survivor[i] = 0
  }
}


titanic_off <- cbind(PassengerId,Survivor)
colnames(titanic_off) <- c('PassengerId', "Survived")
write.csv(titanic_off, file = "titanic.csv", row.names = FALSE)

Random Forest

“A random forest is an ensemble of decision trees which will output a prediction value, in this case survival. Each decision tree is constructed by using a random subset of the training data. After you have trained your forest, you can then pass each test row through it, in order to output a prediction.”, as stated very nicely on the Kaggle site.

Some extra feature engineering is needed to avoid further errors, taking into account the requirements of tree predictions. To start, I will draw a nice graphical representation of the decision tree process.

library(rpart)
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.0.5 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Ticket+Name+New_group, data=dt1, method="class")

fancyRpartPlot(fit)
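
Besides the plot, the complexity parameter table gives a compact text summary of the fitted tree (a sketch using plain rpart tooling):

printcp(fit) # cross-validated error at each split, useful for deciding where to prune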

Let’s dive into deep waters

#Use this just for tree predictions with the caret package.
#These changes are needed because the class levels (0/1 and 1/2/3) are not valid R variable names. That's why I make the following conversions:
for (i in 1:nrow(dt1)){
  if(dt1$Pclass[i] == 1){
    dt1$Pclass[i] = 'FirstClass'
  } else if(dt1$Pclass[i] == 2){
    dt1$Pclass[i] = 'SecondClass'
    }else{
      dt1$Pclass[i] = 'ThirdClass' 
    
  }
}



for (i in 1:nrow(dt1)){
  if(dt1$Survived[i] == 1){
    dt1$Survived[i] = 'Alive'
  }else{
    dt1$Survived[i] = 'Dead' 
    
  }
}
dt1$Survived <- as.factor(dt1$Survived)
dt1$Name <- as.factor(dt1$Name)
dt1$Sex <- as.factor(dt1$Sex)
dt1$Pclass <- as.factor(dt1$Pclass)
dt1$New_group <- as.factor(dt1$New_group)
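
For the record, the recoding above can be done without loops; a sketch on the untouched train copy, since dt1 has already been converted in place:

pc <- factor(train$Pclass, levels = 1:3, labels = c('FirstClass', 'SecondClass', 'ThirdClass'))
sv <- factor(train$Survived, levels = c(1, 0), labels = c('Alive', 'Dead'))
table(pc, sv) # same counts as the converted dt1 columns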


#just for rf

for (i in 1:nrow(dt2)){
  if(dt2$Pclass[i] == 1){
    dt2$Pclass[i] = 'FirstClass'
  } else if(dt2$Pclass[i] == 2){
    dt2$Pclass[i] = 'SecondClass'
  }else{
    dt2$Pclass[i] = 'ThirdClass' 
    
  }
}


dt2$Name <- as.factor(dt2$Name)
dt2$Sex <- as.factor(dt2$Sex)            
dt2$Pclass <- as.factor(dt2$Pclass)  
dt2$New_group <- as.factor(dt2$New_group)
library(caret)
plus <- trainControl(method = "cv",#use cross-validation with
                           number = 10,#10 folds; note that 'repeats' would only apply to method = "repeatedcv", so it is not set here
                           # Estimate class probabilities
                           classProbs = TRUE,
                           # Evaluate performance using 
                           # the following function
                           summaryFunction = twoClassSummary)

rf <- train(Survived ~ ., 
              data = dt1, method = 'rf',# Use the "random forest" algorithm
              trControl = plus,
              metric = "Accuracy", 
              tuneGrid = data.frame(.mtry = 2),  
              importance =TRUE)#The argument mtry = 2 indicates that just 2 predictors should be considered for each split of the tree. The reason is that by default, randomForest()
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:Hmisc':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was
## not in the result set. ROC will be used instead.
#uses p/3 variables when building a random forest of regression trees, and sqrt(p) variables when building a random forest of classification trees.

rf
## Random Forest 
## 
## 891 samples
##   8 predictor
##   2 classes: 'Alive', 'Dead' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 802, 802, 802, 803, 802, 801, ... 
## Resampling results
## 
##   ROC        Sens       Spec       ROC SD      Sens SD     Spec SD   
##   0.8673231  0.6985714  0.9016835  0.04049663  0.08149195  0.04461756
## 
## Tuning parameter 'mtry' was held constant at a value of 2
## 
#In the present case 'Alive' is the first factor level, so caret treats it as the positive class: sensitivity ("Sens") is the probability that the model correctly predicts survival for the passengers who lived, while specificity ("Spec") is the probability that it correctly predicts death for those who perished.

confusionMatrix(rf)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentages of table totals)
##  
##           Reference
## Prediction Alive Dead
##      Alive  26.8  6.1
##      Dead   11.6 55.6
#so we can see that we get above 80% cross-validated accuracy on the training data, but the training error is typically smaller than the test error.
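
The 80%-plus figure can be read off the diagonal of that matrix (sketch; the entries are percentages of the table total):

cm <- confusionMatrix(rf)
sum(diag(cm$table)) / 100 # cross-validated accuracy, ~0.82 here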

var_importance <- varImp(rf, scale=FALSE)
plot(var_importance, top=5)

#predict on the test data using the trained model
rf_p <- predict(rf,dt2)

#preparing for submission

Survivor <- vector()
for(i in 1:length(rf_p)){
  if(rf_p[i] == 'Alive'){
    Survivor[i] = 1
  } else {
    Survivor[i] = 0
  }
}

titanic_off <- cbind(PassengerId,Survivor)
colnames(titanic_off) <- c('PassengerId', "Survived")
write.csv(titanic_off, file = "titanic.csv", row.names = FALSE)