The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
The Aim of this project is to predict the survival of passengers on titanic using Machine learning algorithms. The model should be able to predict whether the passenger survived the sinking of Titanic or not.
The data has been downloaded from Kaggle.
Variable
1.survival
(Survival) 0 = No, 1 = Yes
2.pclass
(Ticket class) 1 = 1st, 2 = 2nd, 3 = 3rd
3.sex
(Sex)
4.Age
(Age in years)
5.sibsp
(# of siblings / spouses aboard the Titanic)
6.parch
(# of parents / children aboard the Titanic)
7.ticket
(Ticket number)
8.fare
(Passenger fare)
9.cabin
(Cabin number)
10.embarked
(Port of Embarkation) C = Cherbourg, Q = Queenstown, S = Southampton
library(dplyr) #data manipulation
library(ggplot2) #visualization
library(naniar) #missing value visualization
library(mice) #missing value imputation
library(randomForest) #classification algorithm
library(rpart) #classification algorithm
library(rpart.plot) #decision tree visualization
library(caTools) #splitting data into training and test set
#Load data
titanic_data<-read.csv("titanic.csv",stringsAsFactors = FALSE)
#Checking structure of the data
str(titanic_data)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(titanic_data)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
head(titanic_data)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
The data has 891 observations and 12 variables. The names of the variables are as follows:
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
vis_miss(titanic_data)
Age
has 19.9%
NA values, as shown from the above plot.
ggplot(titanic_data, aes(x=Age, fill=factor(Survived))) +
geom_histogram(bins=30)+
facet_grid(.~Sex)+
ggtitle("Age vs Survived")+
scale_fill_discrete(name="Survived")
As we can see, chanced of survival of females is greater than that of males. Males below age 18 had considerable survival rate due to tha fact that they were children. ##Survival as a function of Sex
ggplot(titanic_data, aes(Sex, fill = factor(Survived))) +
geom_bar(position = 'fill')+
xlab("Sex") +
ylab("Frequency") +
scale_fill_discrete(name = "Survived") +
ggtitle("Sex vs Survived")
This plot illustrated that females had survival rate of ~75%, whereas men had survival rate < 25%.
ggplot(titanic_data, aes(x=Pclass, fill=factor(Survived))) +
geom_bar(position="fill")+
ylab("Frequency")+
scale_fill_discrete(name="Survived")+
ggtitle("Pclass vs Survived")
From this plot, we can observe that passengers in 1st class have survival rate of more than 50%. 2nd class passengers have survival rate of close to 50%. For 3rd class passengers, survival rate is just 25%.
ggplot(titanic_data, aes(x=Fare, fill=factor(Survived)))+
geom_histogram()+
ggtitle("Fare vs survived")
From this graph, it looks like there is a strong correlation between fare and survival rate. People who had paid < $50 have less than 50% survival rate. As you move towards right towards higher fare prices, we see a higher survival rate.
#exploring name variable
head(titanic_data$Name)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
#grab title from passenger's name
titanic_data$Title<-gsub('(.*,)|(\\..*)','',titanic_data$Name)
unique(titanic_data$Title)
## [1] " Mr" " Mrs" " Miss" " Master"
## [5] " Don" " Rev" " Dr" " Mme"
## [9] " Ms" " Major" " Lady" " Sir"
## [13] " Mlle" " Col" " Capt" " the Countess"
## [17] " Jonkheer"
As we can see, We have 17 unique Titles.
table(titanic_data$Sex,titanic_data$Title)
##
## Capt Col Don Dr Jonkheer Lady Major Master Miss Mlle
## female 0 0 0 1 0 1 0 0 182 2
## male 1 2 1 6 1 0 2 40 0 0
##
## Mme Mr Mrs Ms Rev Sir the Countess
## female 1 0 125 1 0 0 1
## male 0 517 0 0 6 1 0
We see, there can be some typing mistakes in titles (Mlle , Mme, Ms).
titanic_data$Title[titanic_data$Title==' Mlle']<-' Miss'
titanic_data$Title[titanic_data$Title==' Ms']<-' Miss'
titanic_data$Title[titanic_data$Title==' Mme']<-' Mrs'
table(titanic_data$Sex,titanic_data$Title)
##
## Capt Col Don Dr Jonkheer Lady Major Master Miss Mr Mrs
## female 0 0 0 1 0 1 0 0 185 0 126
## male 1 2 1 6 1 0 2 40 0 517 0
##
## Rev Sir the Countess
## female 0 0 1
## male 6 1 0
Also, there are number of rare titles, that we see here. We can combine all such titles to “Rare title”
rare_titles<-c(" Capt"," Col"," Don"," Dr"," Rev", " Jonkheer"," Lady"," Major"," the Countess"," Sir")
titanic_data$Title[titanic_data$Title %in% rare_titles] <- " Rare title"
table(titanic_data$Sex,titanic_data$Title)
##
## Master Miss Mr Mrs Rare title
## female 0 185 0 126 3
## male 40 0 517 0 20
ggplot(titanic_data, aes(x=factor(Title), fill=factor(Survived)))+
geom_bar()+
xlab("Title")+
scale_fill_discrete(name = "Survived") +
ggtitle("Title vs survived")
Children (Titles Miss and Master) have high survival rate. Women (Title Mrs) also have great chances of survival. For men (Title Mr), survival rate is pretty low.
titanic_data$Familysize<-titanic_data$SibSp+titanic_data$Parch+1
# Visualize the relationship between Familysize & survival
ggplot(titanic_data, aes(x = Familysize, fill = factor(Survived))) +
geom_bar(stat='count', position='dodge') +
scale_x_continuous(breaks=c(1:12)) +
labs(x = 'Family Size')
We can see, there is a survival penalty to solo travellers and those with family size above 4.
head(titanic_data$Cabin,30)
## [1] "" "C85" "" "C123" ""
## [6] "" "E46" "" "" ""
## [11] "G6" "C103" "" "" ""
## [16] "" "" "" "" ""
## [21] "" "D56" "" "A6" ""
## [26] "" "" "C23 C25 C27" "" ""
sum(titanic_data$Cabin=="")
## [1] 687
Due to huge number of missing values, cabin variable will not be included for model building.
unique(titanic_data$Embarked)
## [1] "S" "C" "Q" ""
which(titanic_data$Embarked=="")
## [1] 62 830
titanic_data[c(62,830),]
## PassengerId Survived Pclass Name
## 62 62 1 1 Icard, Miss. Amelie
## 830 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn)
## Sex Age SibSp Parch Ticket Fare Cabin Embarked Title Familysize
## 62 female 38 0 0 113572 80 B28 Miss 1
## 830 female 62 0 0 113572 80 B28 Mrs 1
There are 2 missing values for Embarked variable, for passenger 62 and 830. Both are solo travellers.
#visualizing embarked , fare and Pclass
fare_embark_data<-titanic_data %>% filter(PassengerId !=62 & PassengerId!=830)
str(fare_embark_data)
## 'data.frame': 889 obs. of 14 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
## $ Title : chr " Mr" " Mrs" " Miss" " Mrs" ...
## $ Familysize : num 2 2 1 2 1 1 1 5 3 2 ...
ggplot(fare_embark_data, aes(x=Embarked,y=Fare,fill=factor(Pclass)))+
geom_boxplot()+
geom_hline(aes(yintercept=80))
As we can see from the graph, The median fare for 1st class passenger departing from Charbourg (‘C’) coincides nicely with our missing embarkment passengers. Hence, For Pclass 1 and fare $80, the closest option is Embarkment port ‘C’.
#imputing missing values for Embarked
titanic_data$Embarked[c(62,830)]<-'C'
titanic_data[c(62,830),]
## PassengerId Survived Pclass Name
## 62 62 1 1 Icard, Miss. Amelie
## 830 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn)
## Sex Age SibSp Parch Ticket Fare Cabin Embarked Title Familysize
## 62 female 38 0 0 113572 80 B28 C Miss 1
## 830 female 62 0 0 113572 80 B28 C Mrs 1
Now, Embarked column is complete.
sum(is.na(titanic_data$Age))
## [1] 177
There are 177 missing Age values. For Age imputation, Multiple imputation using chained equations from Mice
package.
#Factorizing variables
titanic_data$Pclass<-factor(titanic_data$Pclass)
titanic_data$Sex<-factor(titanic_data$Sex)
titanic_data$Embarked<-factor(titanic_data$Embarked)
titanic_data$Title<-factor(titanic_data$Title)
titanic_data$Familysize<-factor(titanic_data$Familysize)
titanic_data$Survived<-factor(titanic_data$Survived)
#Set a random seed
set.seed(128)
mice_model<-mice(titanic_data[,names(titanic_data) %in% c('Pclass','Sex','Embarked','Title','Age','SibSp','Parch','Fare')],method='rf')
##
## iter imp variable
## 1 1 Age
## 1 2 Age
## 1 3 Age
## 1 4 Age
## 1 5 Age
## 2 1 Age
## 2 2 Age
## 2 3 Age
## 2 4 Age
## 2 5 Age
## 3 1 Age
## 3 2 Age
## 3 3 Age
## 3 4 Age
## 3 5 Age
## 4 1 Age
## 4 2 Age
## 4 3 Age
## 4 4 Age
## 4 5 Age
## 5 1 Age
## 5 2 Age
## 5 3 Age
## 5 4 Age
## 5 5 Age
mice_output<-complete(mice_model)
par(mfrow=c(1,2))
hist(titanic_data$Age, freq=F,main="Original data : Age", col="dark blue", ylim=c(0,0.04))
hist(mice_output$Age, freq=F,main="MICE output : Age", col="blue", ylim=c(0,0.04))
We can see the original age distribution is somewhat similar to the mice age distribution. Things look good, Hence, we can replace the age with mice output.
titanic_data$Age<-mice_output$Age
#Checking for missing values
sum(is.na(titanic_data$Age))
## [1] 0
No missing values in Age
# Visualizing to check for any missing values
vis_miss(titanic_data)
All the missing values have been handled. Data is ready for model building for prediction of survival.
#Keeping only desired variables for model building
#Age,Survived,Pclass,Sex,Fare,Familysize,Embarked and Title
titanic_sub<-titanic_data[,c(2,3,5,6,10,12,13,14)]
#Set a random seed
set.seed(123)
split = sample.split(titanic_sub$Survived, SplitRatio = 0.75)
training_set = subset(titanic_sub, split == TRUE)
test_set = subset(titanic_sub, split == FALSE)
str(training_set)
## 'data.frame': 668 obs. of 8 variables:
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 2 1 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 3 1 3 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 1 1 2 2 ...
## $ Age : num 22 38 26 35 11 27 4 58 20 39 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.46 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 2 3 3 3 3 3 ...
## $ Title : Factor w/ 5 levels " Master"," Miss",..: 3 4 2 4 3 4 2 2 3 3 ...
## $ Familysize: Factor w/ 9 levels "1","2","3","4",..: 2 2 1 2 1 3 3 1 1 7 ...
dec_tree_model<-rpart(Survived~ Age+Sex+Pclass+Familysize+Embarked+Title+Fare, data=training_set,method = "class")
pred_survived<-predict(dec_tree_model,newdata=test_set[-1],type="class")
#confusion matrix
table(test_set$Survived,pred_survived)
## pred_survived
## 0 1
## 0 121 16
## 1 33 53
#Evaluating accuracy
mean(test_set$Survived==pred_survived)
## [1] 0.7802691
So, decision tree model gave us an accuracy of 78.03%
#Plotting decision tree
rpart.plot(dec_tree_model)
plotcp(dec_tree_model)
set.seed(123)
rf_model<-randomForest(Survived~Age+Sex+Pclass+Familysize+Embarked+Title+Fare,data = training_set)
#prediction
rf_pred<-predict(rf_model,test_set)
mean(test_set$Survived==rf_pred)
## [1] 0.8071749
print(rf_model)
##
## Call:
## randomForest(formula = Survived ~ Age + Sex + Pclass + Familysize + Embarked + Title + Fare, data = training_set)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 15.42%
## Confusion matrix:
## 0 1 class.error
## 0 379 33 0.08009709
## 1 70 186 0.27343750
With this model, accuracy has improved to 80.72%. The summary shows an estimated error of 15.42% for the model.
plot(rf_model)
legend('topright', colnames(rf_model$err.rate),col=1:3,fill=1:3)
The black line shows the overall error rate which falls below 20%. The red and green lines show the error rate for ‘died’ and ‘survived’ respectively.
varImpPlot(rf_model,main="RF_Model")
This shows the relative variable importance by plotting the mean decrease in Gini calclulated among all the trees.
importance <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance),
Importance = round(importance[ ,'MeanDecreaseGini'],2))
# Create a rank variable based on importance
rankImportance <- varImportance %>%
mutate(Rank = paste0('#',dense_rank(desc(Importance))))
# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance),
y = Importance, fill = Importance)) +
geom_bar(stat='identity') +
geom_text(aes(x = Variables, y = 0.5, label = Rank),
hjust=0, vjust=0.55, size = 4, colour = 'white') +
labs(x = 'Variables') +
coord_flip()
Title, Fare and Sex are the topmost important variables in our prediction of survival.
Making th final predictions for test set using Random Forest classifier
test_set$pred_survived<-rf_pred
head(test_set)
## Survived Pclass Sex Age Fare Embarked Title Familysize
## 5 0 3 male 35 8.0500 S Mr 1
## 7 0 1 male 54 51.8625 S Mr 1
## 8 0 3 male 2 21.0750 S Master 5
## 10 1 2 female 14 30.0708 C Mrs 2
## 15 0 3 female 14 7.8542 S Miss 1
## 21 0 2 male 35 26.0000 S Mr 1
## pred_survived
## 5 0
## 7 0
## 8 0
## 10 1
## 15 1
## 21 0