Survival Analysis Report on the Titanic Data

library(carData)  #the TitanicSurvival data set ships with the carData package
titanic_s=TitanicSurvival
head(titanic_s)

str(titanic_s)
'data.frame':   1309 obs. of  4 variables:
 $ survived      : Factor w/ 2 levels "no","yes": 2 2 1 1 1 2 2 1 2 1 ...
 $ sex           : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age           : num  29 0.917 2 30 25 ...
 $ passengerClass: Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
summary(titanic_s)
 survived      sex           age          passengerClass
 no :809   female:466   Min.   : 0.1667   1st:323       
 yes:500   male  :843   1st Qu.:21.0000   2nd:277       
                        Median :28.0000   3rd:709       
                        Mean   :29.8811                 
                        3rd Qu.:39.0000                 
                        Max.   :80.0000                 
                        NA's   :263                     

From the summary table we learn that:

1) There are 263 missing (NA) values in the age variable.

2) Several variables (survived, sex and passengerClass) are factors rather than numeric.

So we need to resolve both of these problems before beginning the analysis.

As the table above shows, the missing values and the non-numeric variables have been resolved.
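
The report does not show how the cleaning was done. The sketch below is one possible approach, run on a small made-up data frame with the same column names. Median imputation for age, the integer coding of the factors, and the helper name `clean_titanic` are all assumptions made for illustration, not the method used in this report.

```r
# Hypothetical helper: impute missing ages and convert factors to numbers.
clean_titanic <- function(df) {
  # Assumption: replace missing ages with the median of the observed ages.
  df$age[is.na(df$age)] <- median(df$age, na.rm = TRUE)
  # Assumption: code survived as 0/1 and the other factors by level order.
  df$survived       <- as.integer(df$survived == "yes")
  df$sex            <- as.integer(df$sex)             # female = 1, male = 2
  df$passengerClass <- as.integer(df$passengerClass)  # 1st = 1, 2nd = 2, 3rd = 3
  df
}

# Toy rows for illustration only (not taken from the real data).
toy <- data.frame(
  survived       = factor(c("yes", "no", "no", "yes"), levels = c("no", "yes")),
  sex            = factor(c("female", "male", "male", "female"),
                          levels = c("female", "male")),
  age            = c(29, NA, 40, 2),
  passengerClass = factor(c("1st", "3rd", "2nd", "1st"),
                          levels = c("1st", "2nd", "3rd"))
)

toy_clean <- clean_titanic(toy)
str(toy_clean)
```

With the real data, the same helper could be applied as `clean_titanic(titanic_s)`. Dropping the rows with missing age would be an alternative, at the cost of losing 263 observations.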

To better understand the relationships among the variables and their influence on survival, here are some graphical representations.

The distribution graph above shows survival across the different age groups.

The graphs above show the number of people who survived or died by age, sex and passenger class.

The boxplot shows the age distribution within each passenger class.

Now we move on to the analysis of the Titanic data.

1) First we split the data into training and test sets.

2) Then we fit a logistic regression model to the training data, with survived as the dependent variable and the remaining variables as independent variables.
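
The two steps above can be sketched as follows. The split ratio is not stated in the report; the 981 null degrees of freedom in the model output and the 327 cases in the confusion matrix suggest roughly 75/25, so that ratio is assumed here. The data are simulated stand-ins, and the names `demo`, `train_demo`, `test_demo` and `model_demo` are hypothetical.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Synthetic stand-in for the cleaned data (values are simulated, not real).
n <- 400
demo <- data.frame(
  sex            = sample(1:2, n, replace = TRUE),
  age            = round(runif(n, 1, 70)),
  passengerClass = sample(1:3, n, replace = TRUE)
)
# Simulate survival from an arbitrary logistic relationship.
lin <- 4 - 2 * demo$sex - 0.02 * demo$age - 0.8 * demo$passengerClass
demo$survived <- rbinom(n, 1, plogis(lin))

# Assumption: a 75/25 random split into training and test rows.
train_idx  <- sample(seq_len(n), size = floor(0.75 * n))
train_demo <- demo[train_idx, ]
test_demo  <- demo[-train_idx, ]

# Same formula and family as in the model call reported in this document.
model_demo <- glm(survived ~ ., family = binomial(link = "logit"),
                  data = train_demo)
summary(model_demo)
```

Fixing the seed before sampling keeps the training and test sets the same between runs, which makes the later accuracy figures reproducible.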

summary(model_ti)

Call:
glm(formula = survived ~ ., family = binomial(link = "logit"), 
    data = train_ti)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3103  -0.6880  -0.4285   0.6478   2.3842  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     7.01299    0.52146  13.449  < 2e-16 ***
sex            -2.48680    0.17326 -14.353  < 2e-16 ***
age            -0.03117    0.00700  -4.453 8.48e-06 ***
passengerClass -1.15000    0.11230 -10.240  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1308.86  on 981  degrees of freedom
Residual deviance:  913.05  on 978  degrees of freedom
AIC: 921.05

Number of Fisher Scoring iterations: 4

From the table above, the p-values show that all the independent variables are significant. In logistic regression, however, the p-values are not the only criterion: we also have to look at the residual deviance and the AIC.

summary(model_ti3)

Call:
glm(formula = survived ~ . - age, family = "binomial", data = train_ti)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1543  -0.7119  -0.4603   0.7040   2.1436  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     5.67741    0.40067  14.170   <2e-16 ***
sex            -2.51197    0.17128 -14.666   <2e-16 ***
passengerClass -0.94837    0.09964  -9.518   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1308.86  on 981  degrees of freedom
Residual deviance:  933.71  on 979  degrees of freedom
AIC: 939.71

Number of Fisher Scoring iterations: 4

After excluding each variable in turn, we found that none of them can be dropped, because excluding any of them increased the residual deviance and the AIC. For example, dropping age raised the residual deviance from 913.05 to 933.71 and the AIC from 921.05 to 939.71.

So the best-fitting model for this data is “model_ti”, which includes all the independent variables.
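
The comparison between the two reported models can also be expressed as a likelihood-ratio test. The sketch below uses only the deviances and degrees of freedom printed in the two summaries; the variable names are hypothetical. It also checks that, for a 0/1 response, the AIC equals the residual deviance plus twice the number of coefficients.

```r
# Values copied from summary(model_ti) and summary(model_ti3).
dev_full    <- 913.05   # residual deviance with age
dev_reduced <- 933.71   # residual deviance without age
df_full     <- 978
df_reduced  <- 979

# Likelihood-ratio statistic: the increase in deviance when age is dropped.
lr_stat <- dev_reduced - dev_full
df_diff <- df_reduced - df_full

# Under the null hypothesis (age has no effect) the statistic is
# approximately chi-squared with df_diff degrees of freedom.
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# AIC check: deviance + 2 * (number of estimated coefficients).
aic_full    <- dev_full + 2 * 4
aic_reduced <- dev_reduced + 2 * 3

c(lr_stat = lr_stat, p_value = p_value,
  aic_full = aic_full, aic_reduced = aic_reduced)
```

With the fitted objects available, `anova(model_ti3, model_ti, test = "Chisq")` would carry out the same test directly.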

Now we predict the survival probability for each passenger in the test data.

head(pred_ti)
   Allison, Miss. Helen Loraine 
                      0.9649007 
Andrews, Miss. Kornelia Theodos 
                      0.8041507 
         Astor, Col. John Jacob 
                      0.3599375 
Barkworth, Mr. Algernon Henry W 
                      0.1673867 
          Bazzani, Miss. Albina 
                      0.9151907 
          Behr, Mr. Karl Howell 
                      0.5197262 

The output above shows the predicted probability of survival for the first few passengers in the test data.
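
A predicted probability can be checked by hand from the coefficients of model_ti. The sketch below assumes that sex is coded female = 1 and male = 2, that passenger class is coded 1 to 3, and that the first passenger listed is the 2-year-old first-class girl in the third row of the str() output. These codings are not stated in the report, although they are consistent with the printed value.

```r
# Coefficients copied from summary(model_ti) in this report.
b <- c(intercept = 7.01299, sex = -2.48680,
       age = -0.03117, passengerClass = -1.15000)

# Assumed inputs for "Allison, Miss. Helen Loraine":
# female (coded 1), age 2, 1st class (coded 1).
x <- c(intercept = 1, sex = 1, age = 2, passengerClass = 1)

log_odds <- sum(b * x)        # linear predictor on the logit scale
prob     <- plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
round(prob, 3)                # 0.965, in line with the 0.9649007 printed above
```

With the fitted model available, `predict(model_ti, newdata = test_ti, type = "response")` returns these probabilities for every test row at once.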

table_ti
           predictedvalue
actualvalue FALSE TRUE
          0   180   25
          1    38   84

The confusion matrix above shows the number of correct and incorrect predictions:

A11 = TRUE NEGATIVE (correctly predicted as “not survived”)

A22 = TRUE POSITIVE (correctly predicted as “survived”)

A12 = FALSE POSITIVE (incorrectly predicted as “survived”)

A21 = FALSE NEGATIVE (incorrectly predicted as “not survived”)
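
The four cells also show how the errors are divided between the two classes. The sketch below rebuilds the matrix from the counts in table_ti and derives sensitivity and specificity alongside accuracy; the object name `conf_mat` is hypothetical.

```r
# Counts copied from table_ti above (matrix() fills column by column).
conf_mat <- matrix(c(180, 38, 25, 84), nrow = 2,
                   dimnames = list(actualvalue    = c("0", "1"),
                                   predictedvalue = c("FALSE", "TRUE")))

tn <- conf_mat["0", "FALSE"]  # A11: true negatives
fp <- conf_mat["0", "TRUE"]   # A12: false positives
fn <- conf_mat["1", "FALSE"]  # A21: false negatives
tp <- conf_mat["1", "TRUE"]   # A22: true positives

accuracy    <- (tn + tp) / sum(conf_mat)  # share of all cases predicted correctly
sensitivity <- tp / (tp + fn)             # share of actual survivors found
specificity <- tn / (tn + fp)             # share of actual non-survivors found

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

Here the model identifies non-survivors more reliably than survivors, which overall accuracy alone does not reveal.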

Next we check the accuracy at the current threshold and then look for the threshold value that gives the highest accuracy.

#determining accuracy of the prediction
acc=((180+84)/(180+84+38+25))
acc
[1] 0.8073394
#accuracy is about 80.7%; we can try to increase it by selecting an optimal threshold
library(ROCR)
rocr_ti=prediction(pred_ti,test_ti$survived)
rocr_ti_per=performance(rocr_ti,"acc")
rocr_ti_per1=performance(rocr_ti,"tpr","fpr")
plot(rocr_ti_per)

plot(rocr_ti_per1,colorize=TRUE)

From the accuracy chart above, the best threshold lies between 0.5 and 0.6, so we rebuild the table with a threshold in that range.
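
Instead of reading the threshold off the chart, it can be located numerically. The sketch below scans a grid of cutoffs in base R on simulated probabilities; the helper `accuracy_at`, the grid, and the simulated data are assumptions for illustration and do not reproduce the report's results.

```r
# Hypothetical helper: accuracy of the rule "predict 1 when prob > cutoff".
accuracy_at <- function(cutoff, prob, actual) {
  mean((prob > cutoff) == (actual == 1))
}

set.seed(1)  # arbitrary seed for a reproducible illustration
# Simulated outcomes and predicted probabilities (not the Titanic data).
actual <- rbinom(200, 1, 0.4)
prob   <- plogis(rnorm(200, mean = ifelse(actual == 1, 1, -1)))

# Evaluate accuracy on a grid of candidate cutoffs and keep the best one.
cutoffs     <- seq(0.05, 0.95, by = 0.01)
acc         <- sapply(cutoffs, accuracy_at, prob = prob, actual = actual)
best_cutoff <- cutoffs[which.max(acc)]

c(best_cutoff = best_cutoff, best_accuracy = max(acc))
```

With the ROCR objects used in this report, the cutoffs and accuracies behind the chart should be available in the `x.values` and `y.values` slots of `rocr_ti_per`. A threshold tuned on the test data tends to give a slightly optimistic accuracy, so a separate validation set is the safer place to choose it.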

table_ti1=table(actualvalue=test_ti$survived,predictedvalue=pred_ti>0.51)
table_ti1
           predictedvalue
actualvalue FALSE TRUE
          0   181   24
          1    38   84
accuracy_ti1=((181+84)/(181+84+24+38))
accuracy_ti1
[1] 0.8103976
plot(table_ti1,col=c("red","green"))

1) The confusion matrix above uses the chosen threshold of 0.51. At this threshold we get the maximum accuracy, 81.0% (265 of 327 test cases predicted correctly).

2) The plot shows the same table graphically: green marks passengers predicted as “survived” and red marks those predicted as “died”.

The graphs above show that:

1) Passengers in a higher class had a lower chance of dying than those in a lower passenger class.

2) Age has a comparatively small effect on the survival rate.

Hence the model predicts survival for the test data with about 81% accuracy.

