Titanic Ship- Survival Prediction using Logistic Regression

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

Here I have taken the Titanic passenger data and applied machine learning tools to predict which passengers survived the tragedy.

Logistic regression allows us to predict a categorical outcome using categorical and numeric data. It answers the question "How likely is it?" Here I use it to predict how likely a passenger was to survive the tragedy.
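As a minimal sketch of the idea (not part of the original analysis), the logistic function turns any linear score into a probability between 0 and 1. The helper name `logistic` and the example scores are assumptions made for illustration; base R's `plogis()` computes the same quantity.

```r
# Illustrative sketch: the logistic (inverse-logit) function.
# `logistic` and `scores` are hypothetical names/values, not from the analysis.
logistic <- function(z) 1 / (1 + exp(-z))

scores <- c(-3, 0, 3)        # made-up linear-predictor values
probs  <- logistic(scores)   # each value lies strictly between 0 and 1

# Base R's plogis() gives the same result
print(all.equal(probs, plogis(scores)))
print(round(probs, 3))
```

A score of 0 corresponds to a probability of 0.5; larger scores push the probability towards 1 and smaller scores towards 0.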

#1. Reading the file

setwd("D:/Raviteja/Raviteja Professional/Data Science/EDA_Course_Materials")

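# stringsAsFactors = TRUE keeps Sex and Embarked as factors, which contrasts() needs later (the default became FALSE in R 4.0.0)
titanic <- read.csv('titanic.csv', header = TRUE, na.strings = c(""), stringsAsFactors = TRUE)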

View(titanic)

#2. Checking how many data values are missing

sapply(titanic, function(x) sum(is.na(x)))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
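Raw counts are easier to judge as a share of all rows. The sketch below uses a small made-up data frame (`toy` is a hypothetical stand-in, not the Titanic file) to show the same `sapply()` count next to the fraction missing per column.

```r
# Toy data frame standing in for the Titanic data (values are made up)
toy <- data.frame(
  Age      = c(22, NA, 38, NA),
  Cabin    = c(NA, NA, "C85", NA),
  Embarked = c("S", "C", NA, "S")
)

# Number of missing values per column, as in the step above
na_count <- sapply(toy, function(x) sum(is.na(x)))

# Fraction of missing values per column
na_share <- colMeans(is.na(toy))

print(na_count)
print(na_share)
```

On the real data, `colMeans(is.na(titanic))` would give the corresponding fractions, which can help decide whether a column is worth imputing or is better left out.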

library(Amelia)

## Warning: package 'Amelia' was built under R version 3.2.3

## Loading required package: Rcpp

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2016 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

missmap(titanic, main = "Missing values vs observed")

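#3. Selecting the variables required for the model

# Columns 2, 3, 5, 6, 7, 8, 10, 12 are Survived, Pclass, Sex, Age, SibSp, Parch, Fare and Embarked
titanic <- subset(titanic, select = c(2, 3, 5, 6, 7, 8, 10, 12))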

View(titanic)

#4. Filling in the missing values

# Replace missing ages with the mean of the observed ages
titanic$Age[is.na(titanic$Age)] <- mean(titanic$Age, na.rm = TRUE)

contrasts(titanic$Sex)

##        male
## female    0
## male      1

contrasts(titanic$Embarked)

##   Q S
## C 0 0
## Q 1 0
## S 0 1

titanic<- titanic[!is.na(titanic$Embarked),]
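The two clean-up steps above (mean imputation for Age, dropping rows with no Embarked value) can be sketched on a tiny made-up data frame. The name `toy` and its values are assumptions for illustration only.

```r
# Toy data (made-up values) illustrating the two clean-up steps
toy <- data.frame(
  Age      = c(20, NA, 40, 30),
  Embarked = factor(c("S", "C", "Q", NA))
)

# Step 1: replace missing ages with the mean of the observed ages (here 30)
toy$Age[is.na(toy$Age)] <- mean(toy$Age, na.rm = TRUE)

# Step 2: drop the rows where Embarked is missing
toy <- toy[!is.na(toy$Embarked), ]

print(toy)
```

Mean imputation leaves the average age unchanged but reduces its spread, so it is a simple choice rather than the only reasonable one.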

#5. Building a Logistic Regression model:

model<- glm(formula = Survived~ ., family = binomial(link=logit), data = titanic)

summary(model)

## 
## Call:
## glm(formula = Survived ~ ., family = binomial(link = logit), 
##     data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6446  -0.5907  -0.4230   0.6220   2.4431  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.285188   0.564778   9.358  < 2e-16 ***
## Pclass      -1.100058   0.143529  -7.664 1.80e-14 ***
## Sexmale     -2.718695   0.200783 -13.540  < 2e-16 ***
## Age         -0.039901   0.007854  -5.080 3.77e-07 ***
## SibSp       -0.325777   0.109384  -2.978   0.0029 ** 
## Parch       -0.092602   0.118708  -0.780   0.4353    
## Fare         0.001918   0.002376   0.807   0.4194    
## EmbarkedQ   -0.034076   0.381936  -0.089   0.9289    
## EmbarkedS   -0.418817   0.236794  -1.769   0.0769 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1182.82  on 888  degrees of freedom
## Residual deviance:  784.19  on 880  degrees of freedom
## AIC: 802.19
## 
## Number of Fisher Scoring iterations: 5
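The estimates in the summary are on the log-odds scale, which is hard to read directly. As a hedged sketch, the block below copies four of the estimates from the output above and exponentiates them to get odds ratios; the vector name `coefs` is an assumption, and with the fitted model available `exp(coef(model))` would do the same for all terms.

```r
# Coefficient estimates copied from the summary output above
coefs <- c(Pclass = -1.100058, Sexmale = -2.718695,
           Age = -0.039901, SibSp = -0.325777)

# exp() converts a log-odds coefficient into an odds ratio
odds_ratios <- exp(coefs)

print(round(odds_ratios, 3))
```

An odds ratio below 1 means the odds of survival fall as the variable increases. For example, with the other variables held fixed, being male multiplies the odds of survival by roughly 0.07.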

anova(model,test="Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Survived
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       888    1182.82              
## Pclass    1  100.179       887    1082.64 < 2.2e-16 ***
## Sex       1  255.814       886     826.82 < 2.2e-16 ***
## Age       1   22.101       885     804.72 2.587e-06 ***
## SibSp     1   14.423       884     790.30  0.000146 ***
## Parch     1    0.497       883     789.80  0.480798    
## Fare      1    1.578       882     788.22  0.209046    
## Embarked  2    4.036       880     784.19  0.132904    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#From the anova table we can see that the p-values of Parch, Fare and Embarked are greater than the usual significance level of 0.05. Adding these variables after the others does not significantly reduce the residual deviance, so they contribute very little to explaining the outcome.
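Each p-value in the anova table is the upper-tail probability of a chi-squared distribution evaluated at the drop in deviance for that term. The sketch below recomputes three of them from the numbers printed in the table; the vector names are assumptions chosen for illustration.

```r
# Deviance reductions and degrees of freedom copied from the anova table
deviance_drop <- c(Parch = 0.497, Fare = 1.578, Embarked = 4.036)
deg_freedom   <- c(Parch = 1,     Fare = 1,     Embarked = 2)

# Upper-tail chi-squared probability = p-value of each sequential test
p_values <- pchisq(deviance_drop, df = deg_freedom, lower.tail = FALSE)

print(round(p_values, 4))
```

The results agree with the Pr(>Chi) column up to the rounding of the printed deviances. Because the terms are added sequentially, these tests depend on the order of the variables in the formula.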

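#6. Testing the accuracy of the model

# predict() expects the data in the 'newdata' argument; an argument named 'data' is silently ignored
fitted.results <- predict(model, newdata = titanic, type = 'response')

# Classify a passenger as a survivor when the predicted probability exceeds 0.5
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)

misClassificError <- mean(fitted.results != titanic$Survived)

print(paste('Accuracy', 1 - misClassificError))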

## [1] "Accuracy 0.800899887514061"

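#The model classifies about 80% of the passengers correctly (accuracy 0.8009). Because this is measured on the same data that was used to fit the model, it is probably an optimistic estimate.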
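A single accuracy figure hides which kind of mistake the model makes. As a minimal sketch on made-up labels and probabilities (none of these values come from the Titanic data), a confusion matrix can be built with `table()` and the accuracy read off its diagonal.

```r
# Made-up labels and predicted probabilities, for illustration only
actual         <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted_prob <- c(0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3)

# Same 0.5 cut-off as in the step above
predicted <- ifelse(predicted_prob > 0.5, 1, 0)

# Rows are the true classes, columns the predicted classes
confusion <- table(Actual = actual, Predicted = predicted)

# Accuracy is the share of cases on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

On the real data the equivalent call would be `table(Actual = titanic$Survived, Predicted = fitted.results)`.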

#7. Using the model to predict a passenger's chance of survival:

newdata= data.frame("Pclass"=2, "Sex"= 'female', "Age"= 30,"SibSp"=3,"Parch"=0,"Fare"= 80,"Embarked"= 'C')

predict(model,newdata,type= "response")

##         1 
## 0.7434887

newdata= data.frame("Pclass"=2, "Sex"= 'male', "Age"= 30,"SibSp"=3,"Parch"=0,"Fare"= 80,"Embarked"= 'C')

predict(model,newdata,type= "response")

##         1 
## 0.1604998

# We can see that, with all other factors held the same, a woman's chance of survival (about 0.74) is far better than a man's (about 0.16).

newdata= data.frame("Pclass"=2, "Sex"= 'female', "Age"= 18,"SibSp"=3,"Parch"=0,"Fare"= 80,"Embarked"= 'C')

predict(model,newdata,type= "response")

##         1 
## 0.8238988

# We can see that, with all other factors held the same, an 18-year-old woman's chance of survival (about 0.82) is better than that of a 30-year-old woman (about 0.74).
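The three predictions above can be reproduced by hand from the coefficient table, which shows what `predict(..., type = "response")` does internally. This is a sketch under stated assumptions: the vector `b` holds the estimates copied from the summary output, and `survival_prob` is a hypothetical helper written for this illustration. Embarked 'C' is the baseline level, so both Embarked dummy variables are 0.

```r
# Coefficient estimates copied from the summary output of the fitted model
b <- c(Intercept = 5.285188, Pclass = -1.100058, Sexmale = -2.718695,
       Age = -0.039901, SibSp = -0.325777, Parch = -0.092602,
       Fare = 0.001918, EmbarkedQ = -0.034076, EmbarkedS = -0.418817)

# Hypothetical helper: predicted survival probability for one passenger
survival_prob <- function(pclass, male, age, sibsp, parch, fare,
                          embarked_q = 0, embarked_s = 0) {
  x <- c(1, pclass, male, age, sibsp, parch, fare, embarked_q, embarked_s)
  plogis(sum(b * x))  # inverse logit of the linear predictor
}

p_female_30 <- survival_prob(2, male = 0, age = 30, sibsp = 3, parch = 0, fare = 80)
p_male_30   <- survival_prob(2, male = 1, age = 30, sibsp = 3, parch = 0, fare = 80)
p_female_18 <- survival_prob(2, male = 0, age = 18, sibsp = 3, parch = 0, fare = 80)

print(round(c(p_female_30, p_male_30, p_female_18), 4))
```

The values match the `predict()` output up to the rounding of the printed coefficients.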

raviteja

January 26, 2016