Survival Analysis Report of Titanic data.
library(carData) #provides the TitanicSurvival data set
titanic_s=TitanicSurvival
head(titanic_s)
str(titanic_s)
'data.frame': 1309 obs. of 4 variables:
$ survived : Factor w/ 2 levels "no","yes": 2 2 1 1 1 2 2 1 2 1 ...
$ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
$ age : num 29 0.917 2 30 25 ...
$ passengerClass: Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
summary(titanic_s)
 survived      sex           age          passengerClass
 no :809   female:466   Min.   : 0.1667   1st:323
 yes:500   male  :843   1st Qu.:21.0000   2nd:277
                        Median :28.0000   3rd:709
                        Mean   :29.8811
                        3rd Qu.:39.0000
                        Max.   :80.0000
                        NA's   :263
From the summary table we learn that:
1) The age variable has 263 missing (NA) values.
2) Several variables (survived, sex and passengerClass) are factors rather than numeric.
So we need to resolve both of these problems before beginning the analysis.
As the table above shows, the missing values and the non-numeric variables have now been resolved.
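The report does not show how the two problems were resolved, so the following is only a minimal sketch under stated assumptions. It uses a small toy data frame (hypothetical values, not the Titanic data), fills missing ages with the median (an assumed method), and converts the factors to their integer codes, which is consistent with the single numeric coefficients for sex and passengerClass in the model summaries later in this report.

```r
# Toy stand-in with the same column names as titanic_s (hypothetical values)
toy <- data.frame(
  survived = factor(c("yes", "no", "no", "yes"), levels = c("no", "yes")),
  sex = factor(c("female", "male", "male", "female"), levels = c("female", "male")),
  age = c(29, NA, 40, 2),
  passengerClass = factor(c("1st", "3rd", "2nd", "1st"), levels = c("1st", "2nd", "3rd"))
)

# Assumption: impute missing ages with the median of the observed ages
toy$age[is.na(toy$age)] <- median(toy$age, na.rm = TRUE)

# Convert factors to numeric codes
toy$sex <- as.numeric(toy$sex)                       # female = 1, male = 2
toy$passengerClass <- as.numeric(toy$passengerClass) # 1st = 1, 2nd = 2, 3rd = 3
toy$survived <- as.numeric(toy$survived) - 1         # no = 0, yes = 1

str(toy)
```

Imputing rather than dropping rows matches the row counts reported later (982 training rows plus 327 test rows equals the full 1309), but the exact imputation rule is a guess.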
To better understand the relationships among the variables and their influence on survival, here are some graphical representations.

The above distribution graph shows survival across different age groups.

The above graphs show the number of people who survived or died, by age, sex and passenger class.

The boxplot shows the age distribution within each passenger class.
We now move on to the analysis of the Titanic data.
1) First, we split the data into training and test sets.
2) Next, we fit a logistic regression model to the training data, with survived as the dependent variable and the remaining variables as independent variables.
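The two steps above can be sketched as follows. This is an illustration on simulated data, not the report's actual code: the data frame toy, the names train_toy, test_toy and model_toy, the seed, and the 75/25 split ratio are all assumptions (the ratio is inferred from the 982 training and 327 test rows reported below).

```r
set.seed(1)
n <- 400

# Simulated stand-in for the prepared data (hypothetical values)
toy <- data.frame(
  sex = sample(1:2, n, replace = TRUE),            # 1 = female, 2 = male
  age = round(runif(n, 1, 70)),
  passengerClass = sample(1:3, n, replace = TRUE)
)
# Outcome drawn from an arbitrary logistic relationship, for illustration only
toy$survived <- rbinom(
  n, 1,
  plogis(4 - 2 * toy$sex - 0.02 * toy$age - 0.8 * toy$passengerClass)
)

# 1) Random 75/25 split into training and test rows
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train_toy <- toy[train_idx, ]
test_toy <- toy[-train_idx, ]

# 2) Logistic regression with survived as the response and all other columns as predictors
model_toy <- glm(survived ~ ., family = binomial(link = "logit"), data = train_toy)

# Predicted survival probabilities for the held-out rows
pred_toy <- predict(model_toy, newdata = test_toy, type = "response")
```

Fitting on the training rows only keeps the test rows unseen until the accuracy check.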
summary(model_ti)
Call:
glm(formula = survived ~ ., family = binomial(link = "logit"),
data = train_ti)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3103 -0.6880 -0.4285 0.6478 2.3842
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      7.01299    0.52146  13.449  < 2e-16 ***
sex             -2.48680    0.17326 -14.353  < 2e-16 ***
age             -0.03117    0.00700  -4.453 8.48e-06 ***
passengerClass  -1.15000    0.11230 -10.240  < 2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1308.86 on 981 degrees of freedom
Residual deviance: 913.05 on 978 degrees of freedom
AIC: 921.05
Number of Fisher Scoring iterations: 4
From the table above, the p-values show that all the independent variables are significant. In logistic regression, however, the p-values are not the only criterion: we also have to look at the residual deviance and the AIC.
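As a quick check on how the residual deviance and the AIC relate, the sketch below recomputes the AIC of model_ti from the numbers in its summary. For a binomial model fitted to 0/1 outcomes, the AIC equals the residual deviance plus twice the number of estimated coefficients. The variable names are illustrative only.

```r
# Values copied from summary(model_ti)
residual_deviance <- 913.05
n_coefficients <- 4   # intercept, sex, age, passengerClass

# AIC = residual deviance + 2 * number of estimated parameters
aic_model_ti <- residual_deviance + 2 * n_coefficients
aic_model_ti
```

The result agrees with the AIC of 921.05 printed in the summary, so a lower residual deviance only lowers the AIC if it outweighs the penalty of 2 per extra coefficient.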
summary(model_ti3)
Call:
glm(formula = survived ~ . - age, family = "binomial", data = train_ti)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1543 -0.7119 -0.4603 0.7040 2.1436
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      5.67741    0.40067  14.170   <2e-16 ***
sex             -2.51197    0.17128 -14.666   <2e-16 ***
passengerClass  -0.94837    0.09964  -9.518   <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1308.86 on 981 degrees of freedom
Residual deviance: 933.71 on 979 degrees of freedom
AIC: 939.71
Number of Fisher Scoring iterations: 4
After excluding each variable in turn, we found that none of them can be dropped, because excluding a variable increases both the residual deviance and the AIC. For example, dropping age raises the residual deviance from 913.05 to 933.71 and the AIC from 921.05 to 939.71.
So the best-fitting model for this data is “model_ti”, which includes all the independent variables.
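The comparison between model_ti and model_ti3 can also be framed as a likelihood-ratio test on the drop in residual deviance. The sketch below uses only the deviances and degrees of freedom printed in the two summaries; the variable names are illustrative, and the test itself is an addition, not something the report ran.

```r
# Residual deviance and residual degrees of freedom from the two summaries
dev_full <- 913.05      # model_ti (with age)
df_full <- 978
dev_reduced <- 933.71   # model_ti3 (without age)
df_reduced <- 979

# Dropping age increases the deviance by this amount, on 1 degree of freedom
dev_diff <- dev_reduced - dev_full
df_diff <- df_reduced - df_full

# Chi-squared p-value for the hypothesis that age adds nothing
p_value <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)
c(dev_diff = dev_diff, df_diff = df_diff, p_value = p_value)
```

With both fitted model objects available, anova(model_ti3, model_ti, test = "Chisq") should give the same comparison directly.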
Now we predict the survival probability for the test data.
head(pred_ti)
Allison, Miss. Helen Loraine
0.9649007
Andrews, Miss. Kornelia Theodos
0.8041507
Astor, Col. John Jacob
0.3599375
Barkworth, Mr. Algernon Henry W
0.1673867
Bazzani, Miss. Albina
0.9151907
Behr, Mr. Karl Howell
0.5197262
The output above shows the predicted probability of survival for passengers in the test data.
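To make clear where these probabilities come from, here is a hedged worked example that applies the fitted coefficients of model_ti by hand. The helper predict_survival is hypothetical. The inputs are assumptions as well: sex coded female = 1 and male = 2, passenger class coded 1 to 3, and the first passenger listed taken to be a 2-year-old female in first class (values from the TitanicSurvival data, not shown in this report).

```r
# Coefficients copied from summary(model_ti)
b <- c(intercept = 7.01299, sex = -2.48680, age = -0.03117, passengerClass = -1.15000)

# Hypothetical helper: linear predictor (log-odds) passed through the inverse logit
predict_survival <- function(sex, age, passengerClass) {
  log_odds <- b[["intercept"]] +
    b[["sex"]] * sex +
    b[["age"]] * age +
    b[["passengerClass"]] * passengerClass
  plogis(log_odds)  # 1 / (1 + exp(-log_odds))
}

# Assumed inputs: female (1), age 2, first class (1)
p_first <- predict_survival(sex = 1, age = 2, passengerClass = 1)
p_first
```

Under these assumptions the result is close to the 0.9649 shown for the first passenger, which supports the assumed coding.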
table_ti
predictedvalue
actualvalue FALSE TRUE
0 180 25
1 38 84
The confusion matrix above shows the number of correctly and incorrectly predicted values:
A11 = TRUE NEGATIVE (correctly predicted as “not survived”)
A22 = TRUE POSITIVE (correctly predicted as “survived”)
A12 = FALSE POSITIVE (incorrectly predicted as “survived”)
A21 = FALSE NEGATIVE (incorrectly predicted as “not survived”)
Now, based on the chosen threshold, we check the accuracy and look for the threshold value that maximises it.
#determining accuracy of the prediction
acc=((180+84)/(180+84+38+25))
acc
[1] 0.8073394
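Accuracy alone does not show how the errors are split between the two classes. As a small sketch, the counts from table_ti can be used to derive sensitivity, specificity and precision as well. The matrix is rebuilt by hand here so the example is self-contained, and the variable names are illustrative.

```r
# Confusion matrix counts copied from table_ti (filled column by column)
table_ti <- matrix(
  c(180, 38, 25, 84), nrow = 2,
  dimnames = list(actualvalue = c("0", "1"), predictedvalue = c("FALSE", "TRUE"))
)

tn <- table_ti["0", "FALSE"]  # A11: died, predicted died
fp <- table_ti["0", "TRUE"]   # A12: died, predicted survived
fn <- table_ti["1", "FALSE"]  # A21: survived, predicted died
tp <- table_ti["1", "TRUE"]   # A22: survived, predicted survived

accuracy <- (tn + tp) / sum(table_ti)
sensitivity <- tp / (tp + fn)   # share of actual survivors found
specificity <- tn / (tn + fp)   # share of actual non-survivors found
precision <- tp / (tp + fp)     # share of predicted survivors that are correct

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

The model identifies non-survivors more reliably than survivors, which the single accuracy figure hides.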
#accuracy is about 80.7%; choosing an optimal threshold may improve it
library(ROCR)
rocr_ti=prediction(pred_ti,test_ti$survived)
rocr_ti_per=performance(rocr_ti,"acc")
rocr_ti_per1=performance(rocr_ti,"tpr","fpr")
plot(rocr_ti_per)

plot(rocr_ti_per1,colorize=TRUE)
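The accuracy-versus-cutoff curve plotted by ROCR can also be computed by hand, which makes it easier to read off the best cutoff. The sketch below uses base R on simulated probabilities and outcomes (hypothetical values, not pred_ti or test_ti), and the names actual, pred, cutoffs and best_cutoff are illustrative.

```r
set.seed(42)

# Simulated 0/1 outcomes and predicted probabilities (hypothetical stand-ins)
actual <- rbinom(200, 1, 0.4)
pred <- plogis(rnorm(200, mean = ifelse(actual == 1, 1, -1)))

# Accuracy at each candidate cutoff: predict "survived" when pred > cutoff
cutoffs <- seq(0.05, 0.95, by = 0.01)
acc <- sapply(cutoffs, function(k) mean((pred > k) == (actual == 1)))

# Cutoff with the highest accuracy (first one if there are ties)
best_cutoff <- cutoffs[which.max(acc)]
best_cutoff
max(acc)
```

A cutoff chosen this way is tuned to the data it was computed on, so the resulting accuracy is likely to be slightly optimistic.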

From the above charts, the threshold should lie between 0.5 and 0.6, so we rebuild the table with a threshold in that range.
table_ti1=table(actualvalue=test_ti$survived,predictedvalue=pred_ti>0.51)
table_ti1
predictedvalue
actualvalue FALSE TRUE
0 181 24
1 38 84
accuracy_ti1=((181+84)/(181+84+24+38))
accuracy_ti1
[1] 0.8103976
plot(table_ti1,col=c("red","green"))

1) Above is the confusion matrix for the chosen threshold and the accuracy corresponding to it. For threshold 0.51 we get the maximum accuracy, about 81.0%.
2) The graph shows the table graphically, with green for predicted “survived” and red for predicted “died”.

The above graph shows that:
1) Passengers in a higher class had a lower chance of dying than those in a lower passenger class.
2) Age has a comparatively small effect on the survival rate.
Hence we predict the test data with about 81.0% accuracy.
