The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Data Cleaning

TitanicData <- read.csv("/Users/vidarithchan/Desktop/DuringBreakDataScience/TitanicMLFromDisaster/Train.csv", header=T)
colnames(TitanicData)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"
nrow(TitanicData)
## [1] 891
ncol(TitanicData)
## [1] 12
sapply(TitanicData, function(x) sum(is.na(x))) # check to see if there are any blank columns. 
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
median(TitanicData$Age, na.rm=TRUE)
## [1] 28
TitanicData$Age[which(is.na(TitanicData$Age))] <- median(TitanicData$Age,na.rm=TRUE) # Replace blank age cell with the median.

TitanicDataSurvived <- subset(TitanicData,Survived ==1)

Exploratory Data Analysis

You can also embed plots, for example:

TitanicData$Sex <- ifelse(TitanicData$Sex=="female",1,0)
summary(TitanicData$Survived)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3838  1.0000  1.0000
summary(TitanicData$Sex)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3524  1.0000  1.0000
library(plyr)
count(TitanicData,'Sex')
##   Sex freq
## 1   0  577
## 2   1  314
count(TitanicData,'Survived')
##   Survived freq
## 1        0  549
## 2        1  342
count(TitanicData,'Embarked')
##   Embarked freq
## 1             2
## 2        C  168
## 3        Q   77
## 4        S  644
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
confusionMatrix(TitanicData$Survived,TitanicData$Sex) # one way to know the crosstab count
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 468  81
##          1 109 233
##                                           
##                Accuracy : 0.7868          
##                  95% CI : (0.7584, 0.8132)
##     No Information Rate : 0.6476          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5421          
##  Mcnemar's Test P-Value : 0.05014         
##                                           
##             Sensitivity : 0.8111          
##             Specificity : 0.7420          
##          Pos Pred Value : 0.8525          
##          Neg Pred Value : 0.6813          
##              Prevalence : 0.6476          
##          Detection Rate : 0.5253          
##    Detection Prevalence : 0.6162          
##       Balanced Accuracy : 0.7766          
##                                           
##        'Positive' Class : 0               
## 
####### OR
library(gmodels)
CrossTable(TitanicData$Survived,TitanicData$Sex) 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  891 
## 
##  
##                      | TitanicData$Sex 
## TitanicData$Survived |         0 |         1 | Row Total | 
## ---------------------|-----------|-----------|-----------|
##                    0 |       468 |        81 |       549 | 
##                      |    35.583 |    65.386 |           | 
##                      |     0.852 |     0.148 |     0.616 | 
##                      |     0.811 |     0.258 |           | 
##                      |     0.525 |     0.091 |           | 
## ---------------------|-----------|-----------|-----------|
##                    1 |       109 |       233 |       342 | 
##                      |    57.120 |   104.962 |           | 
##                      |     0.319 |     0.681 |     0.384 | 
##                      |     0.189 |     0.742 |           | 
##                      |     0.122 |     0.262 |           | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |       577 |       314 |       891 | 
##                      |     0.648 |     0.352 |           | 
## ---------------------|-----------|-----------|-----------|
## 
## 
## The number of survived male passengers are higher than that of female passengers. 
## What is the correlation then? 
CorSurvived <- cor(TitanicData$Survived,TitanicData$Sex)
CorSurvived # .543 . You have a higher chance of surviving if you are female passenger. 
## [1] 0.5433514
## What is the average age of all passengers? 
summary(TitanicData$Age) # 29 years old.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   22.00   28.00   29.36   35.00   80.00
## What is the average age of survived passengers? 
summary(TitanicDataSurvived$Age) # 28 years old.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   21.00   28.00   28.29   35.00   80.00
## Age distribution of passengers and survived passengers:
histogram(TitanicData$Age,TitanicData,
          main="Passengers'Ages Distribution",
          xlab="Age of Passenger",
          ylab="Percent of Total")
## Warning in histogram.numeric(TitanicData$Age, TitanicData, main =
## "Passengers'Ages Distribution", : explicit 'data' specification ignored

histogram(TitanicDataSurvived$Age,TitanicDataSurvived,
          main="Passengers'Ages Distribution",
          xlab="Age of Passenger",
          ylab="Percent of Total")
## Warning in histogram.numeric(TitanicDataSurvived$Age,
## TitanicDataSurvived, : explicit 'data' specification ignored

histogram(TitanicData$Sex,TitanicData,
          main="Percent of Passenger By Sex",
          xlab="Sex")
## Warning in histogram.numeric(TitanicData$Sex, TitanicData, main = "Percent
## of Passenger By Sex", : explicit 'data' specification ignored

TitanicData$AgeRank <- ifelse(TitanicData$Age < 18,1,
                              ifelse(TitanicData$Age>50,3,2))
TitanicDataSurvived <- subset(TitanicData,Survived ==1) # Rerun this code so that the sub dataset has the newly created variable 'AgeRank' included.

count(TitanicData$AgeRank)
##   x freq
## 1 1  113
## 2 2  714
## 3 3   64
CrossTable(TitanicData$Survived,TitanicData$AgeRank)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  891 
## 
##  
##                      | TitanicData$AgeRank 
## TitanicData$Survived |         1 |         2 |         3 | Row Total | 
## ---------------------|-----------|-----------|-----------|-----------|
##                    0 |        52 |       455 |        42 |       549 | 
##                      |     4.462 |     0.516 |     0.167 |           | 
##                      |     0.095 |     0.829 |     0.077 |     0.616 | 
##                      |     0.460 |     0.637 |     0.656 |           | 
##                      |     0.058 |     0.511 |     0.047 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##                    1 |        61 |       259 |        22 |       342 | 
##                      |     7.163 |     0.828 |     0.268 |           | 
##                      |     0.178 |     0.757 |     0.064 |     0.384 | 
##                      |     0.540 |     0.363 |     0.344 |           | 
##                      |     0.068 |     0.291 |     0.025 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##         Column Total |       113 |       714 |        64 |       891 | 
##                      |     0.127 |     0.801 |     0.072 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
## 
## 
CrossTable(TitanicData$Sex,TitanicData$AgeRank)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  891 
## 
##  
##                 | TitanicData$AgeRank 
## TitanicData$Sex |         1 |         2 |         3 | Row Total | 
## ----------------|-----------|-----------|-----------|-----------|
##               0 |        58 |       472 |        47 |       577 | 
##                 |     3.148 |     0.200 |     0.744 |           | 
##                 |     0.101 |     0.818 |     0.081 |     0.648 | 
##                 |     0.513 |     0.661 |     0.734 |           | 
##                 |     0.065 |     0.530 |     0.053 |           | 
## ----------------|-----------|-----------|-----------|-----------|
##               1 |        55 |       242 |        17 |       314 | 
##                 |     5.784 |     0.368 |     1.368 |           | 
##                 |     0.175 |     0.771 |     0.054 |     0.352 | 
##                 |     0.487 |     0.339 |     0.266 |           | 
##                 |     0.062 |     0.272 |     0.019 |           | 
## ----------------|-----------|-----------|-----------|-----------|
##    Column Total |       113 |       714 |        64 |       891 | 
##                 |     0.127 |     0.801 |     0.072 |           | 
## ----------------|-----------|-----------|-----------|-----------|
## 
## 
count(TitanicDataSurvived$AgeRank)
##   x freq
## 1 1   61
## 2 2  259
## 3 3   22
CrossTable(TitanicData$Survived,TitanicData$Pclass) 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  891 
## 
##  
##                      | TitanicData$Pclass 
## TitanicData$Survived |         1 |         2 |         3 | Row Total | 
## ---------------------|-----------|-----------|-----------|-----------|
##                    0 |        80 |        97 |       372 |       549 | 
##                      |    21.178 |     2.365 |    15.950 |           | 
##                      |     0.146 |     0.177 |     0.678 |     0.616 | 
##                      |     0.370 |     0.527 |     0.758 |           | 
##                      |     0.090 |     0.109 |     0.418 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##                    1 |       136 |        87 |       119 |       342 | 
##                      |    33.997 |     3.796 |    25.603 |           | 
##                      |     0.398 |     0.254 |     0.348 |     0.384 | 
##                      |     0.630 |     0.473 |     0.242 |           | 
##                      |     0.153 |     0.098 |     0.134 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##         Column Total |       216 |       184 |       491 |       891 | 
##                      |     0.242 |     0.207 |     0.551 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
## 
## 
# First class: Survived more
# Middle class: About the same
# Lower class: More death than survival

## What is the demographic of each class by sex and age? 
## Why were there more first class survivals than others? 
## Why were there so many death in the lower class? Less safety-preventive tools available to them? 

CrossTable(TitanicData$Sex,TitanicData$Pclass)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  891 
## 
##  
##                 | TitanicData$Pclass 
## TitanicData$Sex |         1 |         2 |         3 | Row Total | 
## ----------------|-----------|-----------|-----------|-----------|
##               0 |       122 |       108 |       347 |       577 | 
##                 |     2.285 |     1.044 |     2.651 |           | 
##                 |     0.211 |     0.187 |     0.601 |     0.648 | 
##                 |     0.565 |     0.587 |     0.707 |           | 
##                 |     0.137 |     0.121 |     0.389 |           | 
## ----------------|-----------|-----------|-----------|-----------|
##               1 |        94 |        76 |       144 |       314 | 
##                 |     4.199 |     1.919 |     4.872 |           | 
##                 |     0.299 |     0.242 |     0.459 |     0.352 | 
##                 |     0.435 |     0.413 |     0.293 |           | 
##                 |     0.105 |     0.085 |     0.162 |           | 
## ----------------|-----------|-----------|-----------|-----------|
##    Column Total |       216 |       184 |       491 |       891 | 
##                 |     0.242 |     0.207 |     0.551 |           | 
## ----------------|-----------|-----------|-----------|-----------|
## 
## 
CrossTable(TitanicData$AgeRank,TitanicData$Pclass)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  891 
## 
##  
##                     | TitanicData$Pclass 
## TitanicData$AgeRank |         1 |         2 |         3 | Row Total | 
## --------------------|-----------|-----------|-----------|-----------|
##                   1 |        12 |        23 |        78 |       113 | 
##                     |     8.651 |     0.005 |     3.973 |           | 
##                     |     0.106 |     0.204 |     0.690 |     0.127 | 
##                     |     0.056 |     0.125 |     0.159 |           | 
##                     |     0.013 |     0.026 |     0.088 |           | 
## --------------------|-----------|-----------|-----------|-----------|
##                   2 |       165 |       146 |       403 |       714 | 
##                     |     0.378 |     0.014 |     0.231 |           | 
##                     |     0.231 |     0.204 |     0.564 |     0.801 | 
##                     |     0.764 |     0.793 |     0.821 |           | 
##                     |     0.185 |     0.164 |     0.452 |           | 
## --------------------|-----------|-----------|-----------|-----------|
##                   3 |        39 |        15 |        10 |        64 | 
##                     |    35.548 |     0.241 |    18.104 |           | 
##                     |     0.609 |     0.234 |     0.156 |     0.072 | 
##                     |     0.181 |     0.082 |     0.020 |           | 
##                     |     0.044 |     0.017 |     0.011 |           | 
## --------------------|-----------|-----------|-----------|-----------|
##        Column Total |       216 |       184 |       491 |       891 | 
##                     |     0.242 |     0.207 |     0.551 |           | 
## --------------------|-----------|-----------|-----------|-----------|
## 
## 
CrossTable(TitanicData$Survived,TitanicData$Embarked)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  891 
## 
##  
##                      | TitanicData$Embarked 
## TitanicData$Survived |           |         C |         Q |         S | Row Total | 
## ---------------------|-----------|-----------|-----------|-----------|-----------|
##                    0 |         0 |        75 |        47 |       427 |       549 | 
##                      |     1.232 |     7.855 |     0.004 |     2.297 |           | 
##                      |     0.000 |     0.137 |     0.086 |     0.778 |     0.616 | 
##                      |     0.000 |     0.446 |     0.610 |     0.663 |           | 
##                      |     0.000 |     0.084 |     0.053 |     0.479 |           | 
## ---------------------|-----------|-----------|-----------|-----------|-----------|
##                    1 |         2 |        93 |        30 |       217 |       342 | 
##                      |     1.978 |    12.609 |     0.007 |     3.688 |           | 
##                      |     0.006 |     0.272 |     0.088 |     0.635 |     0.384 | 
##                      |     1.000 |     0.554 |     0.390 |     0.337 |           | 
##                      |     0.002 |     0.104 |     0.034 |     0.244 |           | 
## ---------------------|-----------|-----------|-----------|-----------|-----------|
##         Column Total |         2 |       168 |        77 |       644 |       891 | 
##                      |     0.002 |     0.189 |     0.086 |     0.723 |           | 
## ---------------------|-----------|-----------|-----------|-----------|-----------|
## 
## 
## Average ticket fare by class

aggregate(TitanicData$Fare,list(TitanicData$Pclass),mean)
##   Group.1        x
## 1       1 84.15469
## 2       2 20.66218
## 3       3 13.67555
# OR use dplyr package
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
TitanicData %>% group_by(Pclass) %>% summarise_each(funs(mean(., na.rm=TRUE)), -c(PassengerId,Name,Ticket,Cabin,Embarked))
## `summarise_each()` is deprecated.
## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
## To map `funs` over a selection of variables, use `summarise_at()`
## # A tibble: 3 x 8
##   Pclass  Survived       Sex      Age     SibSp     Parch     Fare
##    <int>     <dbl>     <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
## 1      1 0.6296296 0.4351852 36.81213 0.4166667 0.3564815 84.15469
## 2      2 0.4728261 0.4130435 29.76538 0.4021739 0.3804348 20.66218
## 3      3 0.2423625 0.2932790 25.93263 0.6150713 0.3930754 13.67555
## # ... with 1 more variables: AgeRank <dbl>

Split the data into training and testing data:

set.seed(42)
Train <- createDataPartition(TitanicData$Survived,p=0.7, list = FALSE)
Training <- TitanicData[Train,]
Testing <- TitanicData[-Train,]

Logistic Regression Model Evaluation

### Modelling: Logistic Regression for training set:

Model <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Cabin + Embarked,data=Training, family=binomial)
StepModel <- step(Model,direction = 'both')
## Start:  AIC=688.33
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Cabin + Embarked
## 
##             Df Deviance    AIC
## - Cabin    118   552.15 570.15
## - Embarked   2   436.28 686.28
## - Parch      1   435.03 687.03
## <none>           434.33 688.33
## - SibSp      1   444.01 696.01
## - Age        1   444.18 696.18
## - Pclass     1   453.56 705.56
## - Sex        1   558.12 810.12
## 
## Step:  AIC=570.15
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##             Df Deviance    AIC
## - Embarked   3   557.99 569.99
## - Parch      1   554.05 570.05
## <none>           552.15 570.15
## - SibSp      1   559.14 575.14
## - Age        1   567.68 583.68
## - Pclass     1   607.48 623.48
## + Cabin    119   434.34 690.34
## - Sex        1   717.24 733.24
## 
## Step:  AIC=569.99
## Survived ~ Pclass + Sex + Age + SibSp + Parch
## 
##             Df Deviance    AIC
## <none>           557.99 569.99
## + Embarked   3   552.15 570.15
## - Parch      1   560.32 570.32
## - SibSp      1   565.99 575.99
## - Age        1   573.20 583.20
## - Pclass     1   621.29 631.29
## + Cabin    119   436.28 686.28
## - Sex        1   736.33 746.33
summary(StepModel)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch, 
##     family = binomial, data = Training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5104  -0.5922  -0.4351   0.5968   2.3308  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.229126   0.516344   4.317 1.58e-05 ***
## Pclass      -1.068930   0.141638  -7.547 4.46e-14 ***
## Sex          2.826355   0.240071  11.773  < 2e-16 ***
## Age         -0.035498   0.009371  -3.788 0.000152 ***
## SibSp       -0.338214   0.131672  -2.569 0.010211 *  
## Parch       -0.235070   0.157061  -1.497 0.134478    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 840.24  on 623  degrees of freedom
## Residual deviance: 557.99  on 618  degrees of freedom
## AIC: 569.99
## 
## Number of Fisher Scoring iterations: 5
FinalModel <- glm(Survived ~ Pclass + Age + Sex + SibSp, data=Training)
summary(FinalModel)
## 
## Call:
## glm(formula = Survived ~ Pclass + Age + Sex + SibSp, data = Training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.07125  -0.18180  -0.09441   0.20995   0.97332  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.781446   0.075918  10.293  < 2e-16 ***
## Pclass      -0.169659   0.020070  -8.453  < 2e-16 ***
## Age         -0.005070   0.001311  -3.866 0.000122 ***
## Sex          0.521518   0.033017  15.796  < 2e-16 ***
## SibSp       -0.051919   0.014247  -3.644 0.000291 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1456713)
## 
##     Null deviance: 149.840  on 623  degrees of freedom
## Residual deviance:  90.171  on 619  degrees of freedom
## AIC: 575.74
## 
## Number of Fisher Scoring iterations: 2
### Use the fitted model to do prediction for the test data:
Testing$model_pred_probs <- predict(FinalModel, Testing, type="response")
Testing$model_pred_probs_test <- ifelse(Testing$model_pred_probs>=0.5,1,0)

### Create the confusion matrix, and compute the misclassification rate for the test data:
confusionMatrix(Testing$model_pred_probs_test,Testing$Survived) # 78.65% is pretty good. 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 147  30
##          1  28  62
##                                           
##                Accuracy : 0.7828          
##                  95% CI : (0.7284, 0.8307)
##     No Information Rate : 0.6554          
##     P-Value [Acc > NIR] : 3.892e-06       
##                                           
##                   Kappa : 0.5166          
##  Mcnemar's Test P-Value : 0.8955          
##                                           
##             Sensitivity : 0.8400          
##             Specificity : 0.6739          
##          Pos Pred Value : 0.8305          
##          Neg Pred Value : 0.6889          
##              Prevalence : 0.6554          
##          Detection Rate : 0.5506          
##    Detection Prevalence : 0.6629          
##       Balanced Accuracy : 0.7570          
##                                           
##        'Positive' Class : 0               
## 
### 10-Fold Cross-Validation for the test data:
### When evaluating the models, we often want to assess how well it performs in predicting the target variables on different subsets of the data.
library(caret)
set.seed(42)
FinalModelCross <- train(as.factor(Survived) ~ Pclass + Age + Sex + SibSp, 
                         TitanicData, 
                         method="glm",
                         family="binomial",
                         trControl=trainControl(
                           method="cv", number = 10,
                           verboseIter = TRUE)
)
## + Fold01: parameter=none 
## - Fold01: parameter=none 
## + Fold02: parameter=none 
## - Fold02: parameter=none 
## + Fold03: parameter=none 
## - Fold03: parameter=none 
## + Fold04: parameter=none 
## - Fold04: parameter=none 
## + Fold05: parameter=none 
## - Fold05: parameter=none 
## + Fold06: parameter=none 
## - Fold06: parameter=none 
## + Fold07: parameter=none 
## - Fold07: parameter=none 
## + Fold08: parameter=none 
## - Fold08: parameter=none 
## + Fold09: parameter=none 
## - Fold09: parameter=none 
## + Fold10: parameter=none 
## - Fold10: parameter=none 
## Aggregating results
## Fitting final model on full training set
FinalModelCross$finalModel
## 
## Call:  NULL
## 
## Coefficients:
## (Intercept)       Pclass          Age          Sex        SibSp  
##     2.43755     -1.17565     -0.03955      2.73948     -0.35443  
## 
## Degrees of Freedom: 890 Total (i.e. Null);  886 Residual
## Null Deviance:       1187 
## Residual Deviance: 791.2     AIC: 801.2
summary(FinalModelCross)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6817  -0.6029  -0.4159   0.6161   2.4327  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.437548   0.423758   5.752 8.81e-09 ***
## Pclass      -1.175654   0.120073  -9.791  < 2e-16 ***
## Age         -0.039553   0.007761  -5.096 3.47e-07 ***
## Sex          2.739477   0.193984  14.122  < 2e-16 ***
## SibSp       -0.354433   0.103392  -3.428 0.000608 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  791.23  on 886  degrees of freedom
## AIC: 801.23
## 
## Number of Fisher Scoring iterations: 5
pred <- predict(FinalModelCross,newdata=Testing)
confusionMatrix(data=pred,Testing$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 141  25
##          1  34  67
##                                           
##                Accuracy : 0.779           
##                  95% CI : (0.7244, 0.8273)
##     No Information Rate : 0.6554          
##     P-Value [Acc > NIR] : 7.416e-06       
##                                           
##                   Kappa : 0.5219          
##  Mcnemar's Test P-Value : 0.2976          
##                                           
##             Sensitivity : 0.8057          
##             Specificity : 0.7283          
##          Pos Pred Value : 0.8494          
##          Neg Pred Value : 0.6634          
##              Prevalence : 0.6554          
##          Detection Rate : 0.5281          
##    Detection Prevalence : 0.6217          
##       Balanced Accuracy : 0.7670          
##                                           
##        'Positive' Class : 0               
## 
TitanicData$predictkfold <- predict(FinalModelCross,newdata=TitanicData)

Model: Logistic Regression for the main data set:

TitanicData$predict <- predict(FinalModel,newdata=TitanicData,type="response")
TitanicData$predictbinary <- ifelse(TitanicData$predict>=.5,1,0)

library(caret)
confusionMatrix(TitanicData$predictbinary,TitanicData$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 474 107
##          1  75 235
##                                           
##                Accuracy : 0.7957          
##                  95% CI : (0.7677, 0.8218)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5604          
##  Mcnemar's Test P-Value : 0.02157         
##                                           
##             Sensitivity : 0.8634          
##             Specificity : 0.6871          
##          Pos Pred Value : 0.8158          
##          Neg Pred Value : 0.7581          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5320          
##    Detection Prevalence : 0.6521          
##       Balanced Accuracy : 0.7753          
##                                           
##        'Positive' Class : 0               
## 
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following object is masked from 'package:gmodels':
## 
##     ci
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
TitanicData$Survived <- as.factor(TitanicData$Survived)

ROCRpredwhole <- prediction(TitanicData$predict,TitanicData$Survived)
ROCRperfwhole <- performance(ROCRpredwhole,'tpr','fpr')
plot(ROCRperfwhole)

aucwhole <- performance(ROCRpredwhole, measure="auc")
aucwhole <- aucwhole@y.values[[1]]
aucwhole
## [1] 0.8516867
plot(ROCRperfwhole, main= "AUC=85%",xlab="1-Specificity",
     ylab="Sensitivity", colorsize = TRUE, text.adj = c(-.2,1.7))
abline(a=0,b=1)

Model: Random Forest

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
randommodel <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp
                              , data= TitanicData,importance = TRUE, ntree=1000)
randommodel
## 
## Call:
##  randomForest(formula = as.factor(Survived) ~ Pclass + Sex + Age +      SibSp, data = TitanicData, importance = TRUE, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 17.96%
## Confusion matrix:
##     0   1 class.error
## 0 493  56   0.1020036
## 1 104 238   0.3040936
library(ROCR)
predictions = as.vector(randommodel$votes[,2])
TitanicData$predictRF <- predictions
ROCRpredwholeRandom <- prediction(TitanicData$predictRF,TitanicData$Survived)
ROCRperfwholeRandom <- performance(ROCRpredwholeRandom,'tpr','fpr') 
plot(ROCRperfwholeRandom)

aucwholeRandom <- performance(ROCRpredwholeRandom, measure="auc")
aucwholeRandom <- aucwholeRandom@y.values[[1]]
aucwholeRandom
## [1] 0.8628208
plot(ROCRperfwholeRandom, colorsize = TRUE, text.adj = c(-.2,1.7))
abline(a=0,b=1)

Model: Neural Network

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(neuralnet)
## 
## Attaching package: 'neuralnet'
## The following object is masked from 'package:ROCR':
## 
##     prediction
## The following object is masked from 'package:dplyr':
## 
##     compute
TitanicData$Survived <- ifelse(TitanicData$Survived=="1",1,0)
TitanicData <- TitanicData[, sapply(TitanicData, is.numeric)]
maxs <- apply(TitanicData[,2:9], 2, max) 
mins <- apply(TitanicData[,2:9], 2, min)
scaled <- as.data.frame(scale(TitanicData[,2:9], center = mins, scale = maxs-mins))
dataneuralnet <- cbind(TitanicData$PassengerId,scaled)

neuralnetmodel <- neuralnet(Survived ~ Pclass + Sex + Age + SibSp, data=dataneuralnet, hidden=5,linear.output=FALSE)
summary(neuralnetmodel)
##                     Length Class      Mode    
## call                   5   -none-     call    
## response             891   -none-     numeric 
## covariate           3564   -none-     numeric 
## model.list             2   -none-     list    
## err.fct                1   -none-     function
## act.fct                1   -none-     function
## linear.output          1   -none-     logical 
## data                   9   data.frame list    
## net.result             1   -none-     list    
## weights                1   -none-     list    
## startweights           1   -none-     list    
## generalized.weights    1   -none-     list    
## result.matrix         34   -none-     numeric
newdataneuralnet <- dataneuralnet[c("Pclass","Sex","Age","SibSp")]
pr.nn <- compute(neuralnetmodel,newdataneuralnet[,1:4])

neuralpredict <- ifelse(pr.nn$net.result >=0.5,1,0)

TitanicData$predictedneural <- pr.nn$net.result
TitanicData$predictedneuralbinary <- neuralpredict

detach("package:neuralnet", unload=TRUE)
library(ROCR)
ROCRpredneural <- prediction(TitanicData$predictedneural,TitanicData$Survived)
ROCRperfpredneural <- performance(ROCRpredneural,'tpr','fpr') 
plot(ROCRperfpredneural)

aucpredneural <- performance(ROCRpredneural, measure="auc")
aucpredneural <- aucpredneural@y.values[[1]]
aucpredneural
## [1] 0.881179497
plot(ROCRperfpredneural, colorsize = TRUE, text.adj = c(-.2,1.7))
abline(a=0,b=1)

library(caret)
library(e1071)
confusionMatrix(TitanicData$predictedneuralbinary,TitanicData$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 506  97
##          1  43 245
##                                                   
##                Accuracy : 0.8428732               
##                  95% CI : (0.8172892, 0.8661648)  
##     No Information Rate : 0.6161616               
##     P-Value [Acc > NIR] : < 0.00000000000000022204
##                                                   
##                   Kappa : 0.6576256               
##  Mcnemar's Test P-Value : 0.000007488206          
##                                                   
##             Sensitivity : 0.9216758               
##             Specificity : 0.7163743               
##          Pos Pred Value : 0.8391376               
##          Neg Pred Value : 0.8506944               
##              Prevalence : 0.6161616               
##          Detection Rate : 0.5679012               
##    Detection Prevalence : 0.6767677               
##       Balanced Accuracy : 0.8190250               
##                                                   
##        'Positive' Class : 0                       
## 

Ensembled Model - Averaging the three models:

TitanicData$predavg <- (TitanicData$predictedneural + TitanicData$predict + TitanicData$predictRF)/3
library(ROCR)
ROCRpredavg <- prediction(TitanicData$predavg,TitanicData$Survived)
ROCRperfpredavg <- performance(ROCRpredavg,'tpr','fpr')
plot(ROCRperfpredavg)

aucpredavg <- performance(ROCRpredavg,measure ="auc")
aucpredavg <- aucpredavg@y.values[[1]]
aucpredavg
## [1] 0.8763914187

Comparing AUC of the 4 Models:

# Stepwise Logistic Regression Model
aucwhole * 100
## [1] 85.16867457
# Random Forest Model
aucwholeRandom *100
## [1] 86.28207586
# Neural Network Model
aucpredneural *100
## [1] 88.1179497
# Ensembled Model- Averaging
aucpredavg *100
## [1] 87.63914187

Plotting the 4 AUC Curves:

plot(ROCRperfwhole, col="red",colorsize = TRUE, text.adj = c(-.2,1.7), main="AUC Curves - Individual Model ")
plot(ROCRperfwholeRandom,add=TRUE,col="green", colorsize = TRUE, text.adj = c(-.2,1.7))
plot(ROCRperfpredneural,add=TRUE,col="blue", colorsize = TRUE, text.adj = c(-.2,1.7))
plot(ROCRperfpredavg,add=TRUE,col="black", colorsize = TRUE, text.adj = c(-.2,1.7))
labels <- c("GLM: AUC=85%","Random Forest: AUC=86%","Neural Net: AUC=88%", "Ensembled Average: AUC=87%")
legend("bottomright",inset=c(0,0), title="Legend",labels,bty="n",lwd=2,col=c("red","green","blue","black"))
abline(a=0,b=1)

Conclusion:

There are many different machine learning technqiues that one can use to predict who survives from the Titanic disaster. In this problem, I use AUC value as a sole metric to evaluate which techique gives us the higest predictability.

Comparing all 4 techniques, neural network provides us the most accurate model to predict survival rates of Titanic passengers followed by an ensembled averaging model. Surprisingly, the simple stepwise logistic regression model provides the lowest AUC value.

Also, one can expand this project by looking at different ensembled techniques such as majority voting, or combination of different technqiues. I only did the averaging model here but please feel free to do others.