Objective: Determine Survival on the Titanic

Data Source: Kaggle | https://www.kaggle.com/c/titanic

Data Description: https://www.kaggle.com/c/titanic/data

Approach: Using binomial logistics regression to predict Survival of PAX based on Pclass, Sex, Age, SibSp, Parch, Fare, Embarked

Setting up environment

## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2015 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ## 
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

1.) Quick data exploration

summary(Data_orig)
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked  
##  B96 B98    :  4   C   :168  
##  C23 C25 C27:  4   Q   : 77  
##  G6         :  4   S   :644  
##  C22 C26    :  3   NA's:  2  
##  D          :  3             
##  (Other)    :186             
##  NA's       :687
# Check for missing values per variable
# summary(is.na(Data_orig))                   # alternative
sapply(Data_orig, function(x) sum(is.na(x)))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
# Check for unique values per variable
sapply(Data_orig, function(x) length(unique(x)))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##         891           2           3         891           2          89 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           7           7         681         248         148           4
# A much nicer and conlidated approach
missmap(Data_orig, main = "Missing Values vs. Observed Values")

2.) Filter out useful data (omit [PassengerId], [Name], [Ticket], [Cabin])

Data_master <- Data_orig %>% 
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)

# Check for missing values per variable
sapply(Data_master, function(x) sum(is.na(x)))
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0      177        0        0        0        2
# Omit records whose [Embarked]'s value == NA (since there are only 2 observations)
Data_master <- Data_master %>% 
  filter(!is.na(Embarked))

# Replace [Age] == NA with MEAN(Data_orig$Age)
Data_master <- Data_master %>% 
  mutate(Age = replace(Age, is.na(Age), mean(Data_master$Age, na.rm = TRUE)))

# Confirm categorical variables are Factor types and check its dummy-variables.
contrasts(Data_master$Sex)
##        male
## female    0
## male      1
contrasts(Data_master$Embarked)
##   Q S
## C 0 0
## Q 1 0
## S 0 1

3.) Split dataset into Training and Testing

training.data <- Data_master[1:800, ]
test.data <- Data_master[801:889, ]

4.) Fit a Logistics model using GLM()

model1 <- glm(Survived ~ ., family = binomial(link = "logit"), data = training.data)
summary(model1)
## 
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"), 
##     data = training.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6063  -0.5953  -0.4253   0.6221   2.4165  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.136491   0.594907   8.634  < 2e-16 ***
## Pclass      -1.087130   0.151170  -7.191 6.41e-13 ***
## Sexmale     -2.756859   0.212023 -13.003  < 2e-16 ***
## Age         -0.037252   0.008195  -4.545 5.48e-06 ***
## SibSp       -0.292862   0.114627  -2.555   0.0106 *  
## Parch       -0.116467   0.128121  -0.909   0.3633    
## Fare         0.001529   0.002353   0.650   0.5158    
## EmbarkedQ   -0.003548   0.400870  -0.009   0.9929    
## EmbarkedS   -0.318538   0.252954  -1.259   0.2079    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1065.39  on 799  degrees of freedom
## Residual deviance:  709.41  on 791  degrees of freedom
## AIC: 727.41
## 
## Number of Fisher Scoring iterations: 5

RESULT: First of all, we can see that SibSp, Fare and Embarked are not statistically significant. As for the statistically significant variables, sex has the lowest p-value suggesting a strong association of the sex of the passenger with the probability of having survived. The negative coefficient for this predictor suggests that all other variables being equal, the male passenger is less likely to have survived. Remember that in the logit model the response variable is log odds: ln(odds) = ln(p/(1-p)) = ax1 + bx2 + … + z*xn. Since male is a dummy variable, being male reduces the log odds by 2.75 while a unit increase in age reduces the log odds by 0.037.

# Analyze the Table of Deviance for model's fit
anova(model1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Survived
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       799    1065.39              
## Pclass    1   83.607       798     981.79 < 2.2e-16 ***
## Sex       1  240.014       797     741.77 < 2.2e-16 ***
## Age       1   17.490       796     724.28 2.888e-05 ***
## SibSp     1   10.838       795     713.44 0.0009943 ***
## Parch     1    0.860       794     712.58 0.3536034    
## Fare      1    0.994       793     711.59 0.3187354    
## Embarked  2    2.180       791     709.41 0.3362274    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

RESULT: The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better. Analyzing the table we can see the drop in deviance when adding each variable one at a time. Again, adding Pclass, Sex and Age significantly reduces the residual deviance. The other variables seem to improve the model less even though SibSp has a low p-value. A large p-value here indicates that the model without the variable explains more or less the same amount of variation. Ultimately what you would like to see is a significant drop in deviance and the AIC.

# Analyze McFadden R^2 index for model's fit
library(pscl)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Loading required package: lattice
## Classes and Methods for R developed in the
## 
## Political Science Computational Laboratory
## 
## Department of Political Science
## 
## Stanford University
## 
## Simon Jackman
## 
## hurdle and zeroinfl functions by Achim Zeileis
pR2(model1)
##          llh      llhNull           G2     McFadden         r2ML 
## -354.7044627 -532.6961008  355.9832762    0.3341335    0.3591623 
##         r2CU 
##    0.4880038

RESULT: McFadden R^2 = 0.3341335

5.) Assessing the predictive ability of the model

NOTE: Decision boundary is 0.5. If P(y=1|X) > 0.5 then y = 1 otherwise y=0.

# Model's Predictive Accuracy
# test.data.IV <- test.data %>% 
#   select(-Survived)
test.data.IV <- subset(test.data, select = c(2:8))

fitted.results <- predict(model1, newdata = test.data.IV, type = "response")
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)

fitted.results.errors <- mean(fitted.results != test.data$Survived)
fitted.results.accuracy <- 1 - fitted.results.errors
print(paste('Model Accuracy =', round(fitted.results.accuracy*100, 2), "%"))
## [1] "Model Accuracy = 84.27 %"

RESULT: The 0.84 accuracy on the test set is quite a good result. However, keep in mind that this result is somewhat dependent on the manual split of the data that I made earlier, therefore if you wish for a more precise score, you would be better off running some kind of cross validation such as k-fold cross validation.

ROC curve and calculate the AUC as performance measurement of a binary classifier model.

NOTE: The ROC is a curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5.

library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess
p <- predict(model1, newdata = test.data.IV, type = "response")
prd <- prediction(p, test.data$Survived)
prf <- performance(prd, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(prd, measure = "auc")
auc <- auc@y.values[[1]]
print(paste("Model AUC =", auc))
## [1] "Model AUC = 0.864718614718615"