The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.
Here I have taken the Titanic passenger data and applied machine learning tools to predict which passengers survived the tragedy.
Logistic regression allows us to predict a categorical outcome using categorical and numeric data. It answers the question “How likely is it?” Here I use it to predict how likely a passenger was to survive the tragedy.
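As a minimal sketch of the "How likely is it?" idea: logistic regression models the log-odds of the outcome as a linear combination of the predictors and then maps that value to a probability with the logistic function. The helper name `logistic` and the example log-odds values are illustrative only and are not part of the analysis.

```r
# Logistic (inverse-logit) function: maps any log-odds value to a probability in (0, 1)
logistic <- function(eta) 1 / (1 + exp(-eta))

log_odds <- c(-2, 0, 2)          # illustrative values only
probs <- logistic(log_odds)
print(round(probs, 3))

# Base R provides the same function as plogis()
print(all.equal(probs, plogis(log_odds)))
```

A log-odds of 0 corresponds to a probability of 0.5, which is also the cut-off used later to turn predicted probabilities into a survived / did-not-survive label.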
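#1. Reading the file
setwd("D:/Raviteja/Raviteja Professional/Data Science/EDA_Course_Materials")
# Treat empty strings as NA. stringsAsFactors = TRUE keeps Sex and Embarked as factors,
# which contrasts() requires (since R 4.0 strings are no longer converted by default).
titanic <- read.csv('titanic.csv', header = TRUE, na.strings = c(""), stringsAsFactors = TRUE)
View(titanic)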
#2. Checking how many data values are missing
sapply(titanic, function(x) sum(is.na(x)))
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0           0           0           0           0         177
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           0         687           2
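To put these counts in proportion, here is a small sketch that converts them into percentages. The row count of 891 is an assumption: it is consistent with the 888 null-deviance degrees of freedom reported by the fitted model (889 rows) plus the two rows that are dropped for a missing Embarked value.

```r
# Missing-value counts copied from the output above
missing_counts <- c(Age = 177, Cabin = 687, Embarked = 2)

# Assumption: titanic.csv has 891 rows (889 used in the fit + 2 dropped)
n_rows <- 891

missing_pct <- round(100 * missing_counts / n_rows, 1)
print(missing_pct)
```

Under that assumption roughly a fifth of the Age values and more than three quarters of the Cabin values are missing, which would explain why Age is imputed while Cabin is simply left out of the model.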
library(Amelia)
## Warning: package 'Amelia' was built under R version 3.2.3
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2016 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(titanic, main = "Missing values vs observed")
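#3. Selecting the columns needed for the model
# Same columns as indices 2, 3, 5, 6, 7, 8, 10 and 12, selected by name for readability
titanic <- subset(titanic, select = c(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked))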
View(titanic)
#4. Filling in the missing values
# Replace missing ages with the mean of the observed ages
titanic$Age[is.na(titanic$Age)] <- mean(titanic$Age, na.rm = TRUE)
contrasts(titanic$Sex)
##        male
## female    0
## male      1
contrasts(titanic$Embarked)
##   Q S
## C 0 0
## Q 1 0
## S 0 1
titanic<- titanic[!is.na(titanic$Embarked),]
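The contrasts tables describe how R turns a factor into 0/1 dummy columns before fitting. As an illustrative sketch on a toy factor (the four values below are made up and only share the level names of Embarked), `model.matrix()` shows the columns a model would actually receive:

```r
# Toy factor with the same three levels as Embarked (illustrative data only)
embarked <- factor(c("C", "Q", "S", "S"))

# Treatment contrasts: the first level ("C") is the baseline and gets no column
print(contrasts(embarked))

# model.matrix() shows the 0/1 columns built from those contrasts
design <- model.matrix(~ embarked)
print(design)
```

This is why the model summary reports coefficients named EmbarkedQ and EmbarkedS but none for C: each coefficient is a difference relative to the baseline level.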
#5. Building a Logistic Regression model:
model<- glm(formula = Survived~ ., family = binomial(link=logit), data = titanic)
summary(model)
##
## Call:
## glm(formula = Survived ~ ., family = binomial(link = logit),
## data = titanic)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6446  -0.5907  -0.4230   0.6220   2.4431
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  5.285188   0.564778   9.358  < 2e-16 ***
## Pclass      -1.100058   0.143529  -7.664 1.80e-14 ***
## Sexmale     -2.718695   0.200783 -13.540  < 2e-16 ***
## Age         -0.039901   0.007854  -5.080 3.77e-07 ***
## SibSp       -0.325777   0.109384  -2.978   0.0029 **
## Parch       -0.092602   0.118708  -0.780   0.4353
## Fare         0.001918   0.002376   0.807   0.4194
## EmbarkedQ   -0.034076   0.381936  -0.089   0.9289
## EmbarkedS   -0.418817   0.236794  -1.769   0.0769 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1182.82 on 888 degrees of freedom
## Residual deviance: 784.19 on 880 degrees of freedom
## AIC: 802.19
##
## Number of Fisher Scoring iterations: 5
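The estimates in the summary are on the log-odds scale, which is hard to read directly. A common way to interpret them is to exponentiate them into odds ratios. The sketch below hard-codes the four significant estimates from the summary so that it runs on its own; with the fitted object available, `exp(coef(model))` would give the same quantities for all terms.

```r
# Estimates copied from the summary(model) output above
estimates <- c(Pclass = -1.100058, Sexmale = -2.718695,
               Age = -0.039901, SibSp = -0.325777)

# exp(estimate) = multiplicative change in the odds of survival
# for a one-unit increase in the predictor, other predictors held fixed
odds_ratios <- exp(estimates)
print(round(odds_ratios, 3))
```

For example, the odds ratio for Sexmale is about 0.066, meaning the odds of survival for a man are estimated at roughly 7% of those for a woman with otherwise identical values. Note that this is a statement about odds, not about probabilities.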
anova(model,test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Survived
##
## Terms added sequentially (first to last)
##
##
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
## NULL                       888    1182.82
## Pclass    1  100.179       887    1082.64 < 2.2e-16 ***
## Sex       1  255.814       886     826.82 < 2.2e-16 ***
## Age       1   22.101       885     804.72 2.587e-06 ***
## SibSp     1   14.423       884     790.30  0.000146 ***
## Parch     1    0.497       883     789.80  0.480798
## Fare      1    1.578       882     788.22  0.209046
## Embarked  2    4.036       880     784.19  0.132904
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#From the anova table we can see that the p-values for Parch, Fare and Embarked are greater than the usual significance level (alpha) of 0.05. Because the terms are added sequentially, this means these variables reduce the deviance very little once Pclass, Sex, Age and SibSp are already in the model, so they have little impact on the predicted outcome.
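The p-values in the deviance table can be reproduced by hand, which makes it clearer what the test does: each drop in deviance is compared against a chi-squared distribution with the listed degrees of freedom. The sketch below uses only numbers copied from the table and from the model summary; the variable names are illustrative.

```r
# Deviance drops and degrees of freedom copied from the anova table above
deviance_drop <- c(Parch = 0.497, Fare = 1.578, Embarked = 4.036)
df_added      <- c(Parch = 1,     Fare = 1,     Embarked = 2)

# Upper-tail chi-squared probability = p-value of each sequential test
p_values <- pchisq(deviance_drop, df = df_added, lower.tail = FALSE)
print(round(p_values, 3))

# Whole model against the intercept-only model:
# null deviance minus residual deviance, on 888 - 880 = 8 degrees of freedom
overall_p <- pchisq(1182.82 - 784.19, df = 888 - 880, lower.tail = FALSE)
print(overall_p)
```

The three p-values should agree with the Pr(>Chi) column up to rounding of the printed deviances, and the overall test is far below any conventional threshold, so the model as a whole fits much better than the intercept alone.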
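#6. Testing the accuracy of the model
# predict() takes new data through `newdata`; `data` is not an argument of predict.glm()
fitted.results <- predict(model, newdata = titanic, type = 'response')
# Classify as survived (1) when the predicted probability exceeds 0.5
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misClassificError <- mean(fitted.results != titanic$Survived)
print(paste('Accuracy', 1 - misClassificError))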
## [1] "Accuracy 0.800899887514061"
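#The model classifies about 80% of the passengers correctly (accuracy 0.8009). Note that this is measured on the same data the model was fitted on, so it is likely an optimistic estimate.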
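Accuracy computed on the data used for fitting tends to flatter the model. One common alternative is to hold out part of the data and score the model only on that part. The sketch below illustrates the idea on simulated data so that it runs on its own; the data frame `toy`, its columns, the 80/20 split and the seed are all assumptions for illustration, not the Titanic data or the author's procedure.

```r
set.seed(42)                                   # arbitrary seed for reproducibility

# Simulated data: one numeric predictor and a binary outcome (illustrative only)
n <- 500
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * toy$x))

# Hold out 20% of the rows for testing
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows only, then classify the held-out rows at a 0.5 cut-off
fit  <- glm(y ~ x, family = binomial(link = "logit"), data = train)
pred <- ifelse(predict(fit, newdata = test, type = "response") > 0.5, 1, 0)

holdout_accuracy <- mean(pred == test$y)
print(holdout_accuracy)

# Confusion matrix: rows are predicted classes, columns are actual classes
print(table(predicted = pred, actual = test$y))
```

Applied to the Titanic data, the same pattern would mean fitting `glm()` on a subset of the rows and computing the accuracy on the remaining rows.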
#7. Using the model to predict a passenger's chance of survival:
newdata= data.frame("Pclass"=2, "Sex"= 'female', "Age"= 30,"SibSp"=3,"Parch"=0,"Fare"= 80,"Embarked"= 'C')
predict(model,newdata,type= "response")
## 1
## 0.7434887
newdata= data.frame("Pclass"=2, "Sex"= 'male', "Age"= 30,"SibSp"=3,"Parch"=0,"Fare"= 80,"Embarked"= 'C')
predict(model,newdata,type= "response")
## 1
## 0.1604998
# We can observe that the chance of survival for a woman is far better than that for a man, all other factors being the same.
newdata= data.frame("Pclass"=2, "Sex"= 'female', "Age"= 18,"SibSp"=3,"Parch"=0,"Fare"= 80,"Embarked"= 'C')
predict(model,newdata,type= "response")
## 1
## 0.8238988
# We can observe that the chance of survival for an 18-year-old woman is better than that for a 30-year-old woman, all other factors being the same.
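To see where these probabilities come from, the three predictions can be rebuilt by hand from the coefficients in the model summary: multiply each coefficient by the passenger's value, add them up, and apply the inverse logit. The helper `survival_prob` below is hypothetical and exists only for this illustration. Because it uses the rounded coefficients as printed, small differences from the `predict()` output in the last decimals are expected.

```r
# Coefficients copied from the summary(model) output
coefs <- c("(Intercept)" = 5.285188, Pclass = -1.100058, Sexmale = -2.718695,
           Age = -0.039901, SibSp = -0.325777, Parch = -0.092602,
           Fare = 0.001918, EmbarkedQ = -0.034076, EmbarkedS = -0.418817)

# Hypothetical helper: builds the predictor vector in the same order as coefs.
# male, q and s are 0/1 dummies; q = s = 0 means Embarked = "C" (the baseline).
survival_prob <- function(pclass, male, age, sibsp, parch, fare, q = 0, s = 0) {
  x <- c(1, pclass, male, age, sibsp, parch, fare, q, s)
  plogis(sum(coefs * x))   # inverse logit of the linear predictor
}

p_female_30 <- survival_prob(2, male = 0, age = 30, sibsp = 3, parch = 0, fare = 80)
p_male_30   <- survival_prob(2, male = 1, age = 30, sibsp = 3, parch = 0, fare = 80)
p_female_18 <- survival_prob(2, male = 0, age = 18, sibsp = 3, parch = 0, fare = 80)

print(c(female_30 = p_female_30, male_30 = p_male_30, female_18 = p_female_18))
```

The only difference between the first two passengers is the Sexmale term (-2.718695 on the log-odds scale), and the only difference between the first and third is 12 years of age multiplied by the Age coefficient, which is what drives the gaps seen above.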