The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
TitanicData <- read.csv("/Users/vidarithchan/Desktop/DuringBreakDataScience/TitanicMLFromDisaster/Train.csv", header=T)
colnames(TitanicData)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
nrow(TitanicData)
## [1] 891
ncol(TitanicData)
## [1] 12
sapply(TitanicData, function(x) sum(is.na(x))) # check to see if there are any blank columns.
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
median(TitanicData$Age, na.rm=TRUE)
## [1] 28
TitanicData$Age[which(is.na(TitanicData$Age))] <- median(TitanicData$Age,na.rm=TRUE) # Replace blank age cell with the median.
TitanicDataSurvived <- subset(TitanicData,Survived ==1)
You can also embed plots, for example:
TitanicData$Sex <- ifelse(TitanicData$Sex=="female",1,0)
summary(TitanicData$Survived)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3838 1.0000 1.0000
summary(TitanicData$Sex)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3524 1.0000 1.0000
library(plyr)
count(TitanicData,'Sex')
## Sex freq
## 1 0 577
## 2 1 314
count(TitanicData,'Survived')
## Survived freq
## 1 0 549
## 2 1 342
count(TitanicData,'Embarked')
## Embarked freq
## 1 2
## 2 C 168
## 3 Q 77
## 4 S 644
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
confusionMatrix(TitanicData$Survived,TitanicData$Sex) # one way to know the crosstab count
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 468 81
## 1 109 233
##
## Accuracy : 0.7868
## 95% CI : (0.7584, 0.8132)
## No Information Rate : 0.6476
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5421
## Mcnemar's Test P-Value : 0.05014
##
## Sensitivity : 0.8111
## Specificity : 0.7420
## Pos Pred Value : 0.8525
## Neg Pred Value : 0.6813
## Prevalence : 0.6476
## Detection Rate : 0.5253
## Detection Prevalence : 0.6162
## Balanced Accuracy : 0.7766
##
## 'Positive' Class : 0
##
####### OR
library(gmodels)
CrossTable(TitanicData$Survived,TitanicData$Sex)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 891
##
##
## | TitanicData$Sex
## TitanicData$Survived | 0 | 1 | Row Total |
## ---------------------|-----------|-----------|-----------|
## 0 | 468 | 81 | 549 |
## | 35.583 | 65.386 | |
## | 0.852 | 0.148 | 0.616 |
## | 0.811 | 0.258 | |
## | 0.525 | 0.091 | |
## ---------------------|-----------|-----------|-----------|
## 1 | 109 | 233 | 342 |
## | 57.120 | 104.962 | |
## | 0.319 | 0.681 | 0.384 |
## | 0.189 | 0.742 | |
## | 0.122 | 0.262 | |
## ---------------------|-----------|-----------|-----------|
## Column Total | 577 | 314 | 891 |
## | 0.648 | 0.352 | |
## ---------------------|-----------|-----------|-----------|
##
##
## The number of survived male passengers are higher than that of female passengers.
## What is the correlation then?
CorSurvived <- cor(TitanicData$Survived,TitanicData$Sex)
CorSurvived # .543 . You have a higher chance of surviving if you are female passenger.
## [1] 0.5433514
## What is the average age of all passengers?
summary(TitanicData$Age) # 29 years old.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 22.00 28.00 29.36 35.00 80.00
## What is the average age of survived passengers?
summary(TitanicDataSurvived$Age) # 28 years old.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 21.00 28.00 28.29 35.00 80.00
## Age distribution of passengers and survived passengers:
histogram(TitanicData$Age,TitanicData,
main="Passengers'Ages Distribution",
xlab="Age of Passenger",
ylab="Percent of Total")
## Warning in histogram.numeric(TitanicData$Age, TitanicData, main =
## "Passengers'Ages Distribution", : explicit 'data' specification ignored
histogram(TitanicDataSurvived$Age,TitanicDataSurvived,
main="Passengers'Ages Distribution",
xlab="Age of Passenger",
ylab="Percent of Total")
## Warning in histogram.numeric(TitanicDataSurvived$Age,
## TitanicDataSurvived, : explicit 'data' specification ignored
histogram(TitanicData$Sex,TitanicData,
main="Percent of Passenger By Sex",
xlab="Sex")
## Warning in histogram.numeric(TitanicData$Sex, TitanicData, main = "Percent
## of Passenger By Sex", : explicit 'data' specification ignored
TitanicData$AgeRank <- ifelse(TitanicData$Age < 18,1,
ifelse(TitanicData$Age>50,3,2))
TitanicDataSurvived <- subset(TitanicData,Survived ==1) # Rerun this code so that the sub dataset has the newly created variable 'AgeRank' included.
count(TitanicData$AgeRank)
## x freq
## 1 1 113
## 2 2 714
## 3 3 64
CrossTable(TitanicData$Survived,TitanicData$AgeRank)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 891
##
##
## | TitanicData$AgeRank
## TitanicData$Survived | 1 | 2 | 3 | Row Total |
## ---------------------|-----------|-----------|-----------|-----------|
## 0 | 52 | 455 | 42 | 549 |
## | 4.462 | 0.516 | 0.167 | |
## | 0.095 | 0.829 | 0.077 | 0.616 |
## | 0.460 | 0.637 | 0.656 | |
## | 0.058 | 0.511 | 0.047 | |
## ---------------------|-----------|-----------|-----------|-----------|
## 1 | 61 | 259 | 22 | 342 |
## | 7.163 | 0.828 | 0.268 | |
## | 0.178 | 0.757 | 0.064 | 0.384 |
## | 0.540 | 0.363 | 0.344 | |
## | 0.068 | 0.291 | 0.025 | |
## ---------------------|-----------|-----------|-----------|-----------|
## Column Total | 113 | 714 | 64 | 891 |
## | 0.127 | 0.801 | 0.072 | |
## ---------------------|-----------|-----------|-----------|-----------|
##
##
CrossTable(TitanicData$Sex,TitanicData$AgeRank)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 891
##
##
## | TitanicData$AgeRank
## TitanicData$Sex | 1 | 2 | 3 | Row Total |
## ----------------|-----------|-----------|-----------|-----------|
## 0 | 58 | 472 | 47 | 577 |
## | 3.148 | 0.200 | 0.744 | |
## | 0.101 | 0.818 | 0.081 | 0.648 |
## | 0.513 | 0.661 | 0.734 | |
## | 0.065 | 0.530 | 0.053 | |
## ----------------|-----------|-----------|-----------|-----------|
## 1 | 55 | 242 | 17 | 314 |
## | 5.784 | 0.368 | 1.368 | |
## | 0.175 | 0.771 | 0.054 | 0.352 |
## | 0.487 | 0.339 | 0.266 | |
## | 0.062 | 0.272 | 0.019 | |
## ----------------|-----------|-----------|-----------|-----------|
## Column Total | 113 | 714 | 64 | 891 |
## | 0.127 | 0.801 | 0.072 | |
## ----------------|-----------|-----------|-----------|-----------|
##
##
count(TitanicDataSurvived$AgeRank)
## x freq
## 1 1 61
## 2 2 259
## 3 3 22
CrossTable(TitanicData$Survived,TitanicData$Pclass)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 891
##
##
## | TitanicData$Pclass
## TitanicData$Survived | 1 | 2 | 3 | Row Total |
## ---------------------|-----------|-----------|-----------|-----------|
## 0 | 80 | 97 | 372 | 549 |
## | 21.178 | 2.365 | 15.950 | |
## | 0.146 | 0.177 | 0.678 | 0.616 |
## | 0.370 | 0.527 | 0.758 | |
## | 0.090 | 0.109 | 0.418 | |
## ---------------------|-----------|-----------|-----------|-----------|
## 1 | 136 | 87 | 119 | 342 |
## | 33.997 | 3.796 | 25.603 | |
## | 0.398 | 0.254 | 0.348 | 0.384 |
## | 0.630 | 0.473 | 0.242 | |
## | 0.153 | 0.098 | 0.134 | |
## ---------------------|-----------|-----------|-----------|-----------|
## Column Total | 216 | 184 | 491 | 891 |
## | 0.242 | 0.207 | 0.551 | |
## ---------------------|-----------|-----------|-----------|-----------|
##
##
# First class: Survived more
# Middle class: About the same
# Lower class: More death than survival
## What is the demographic of each class by sex and age?
## Why were there more first class survivals than others?
## Why were there so many death in the lower class? Less safety-preventive tools available to them?
CrossTable(TitanicData$Sex,TitanicData$Pclass)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 891
##
##
## | TitanicData$Pclass
## TitanicData$Sex | 1 | 2 | 3 | Row Total |
## ----------------|-----------|-----------|-----------|-----------|
## 0 | 122 | 108 | 347 | 577 |
## | 2.285 | 1.044 | 2.651 | |
## | 0.211 | 0.187 | 0.601 | 0.648 |
## | 0.565 | 0.587 | 0.707 | |
## | 0.137 | 0.121 | 0.389 | |
## ----------------|-----------|-----------|-----------|-----------|
## 1 | 94 | 76 | 144 | 314 |
## | 4.199 | 1.919 | 4.872 | |
## | 0.299 | 0.242 | 0.459 | 0.352 |
## | 0.435 | 0.413 | 0.293 | |
## | 0.105 | 0.085 | 0.162 | |
## ----------------|-----------|-----------|-----------|-----------|
## Column Total | 216 | 184 | 491 | 891 |
## | 0.242 | 0.207 | 0.551 | |
## ----------------|-----------|-----------|-----------|-----------|
##
##
CrossTable(TitanicData$AgeRank,TitanicData$Pclass)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 891
##
##
## | TitanicData$Pclass
## TitanicData$AgeRank | 1 | 2 | 3 | Row Total |
## --------------------|-----------|-----------|-----------|-----------|
## 1 | 12 | 23 | 78 | 113 |
## | 8.651 | 0.005 | 3.973 | |
## | 0.106 | 0.204 | 0.690 | 0.127 |
## | 0.056 | 0.125 | 0.159 | |
## | 0.013 | 0.026 | 0.088 | |
## --------------------|-----------|-----------|-----------|-----------|
## 2 | 165 | 146 | 403 | 714 |
## | 0.378 | 0.014 | 0.231 | |
## | 0.231 | 0.204 | 0.564 | 0.801 |
## | 0.764 | 0.793 | 0.821 | |
## | 0.185 | 0.164 | 0.452 | |
## --------------------|-----------|-----------|-----------|-----------|
## 3 | 39 | 15 | 10 | 64 |
## | 35.548 | 0.241 | 18.104 | |
## | 0.609 | 0.234 | 0.156 | 0.072 |
## | 0.181 | 0.082 | 0.020 | |
## | 0.044 | 0.017 | 0.011 | |
## --------------------|-----------|-----------|-----------|-----------|
## Column Total | 216 | 184 | 491 | 891 |
## | 0.242 | 0.207 | 0.551 | |
## --------------------|-----------|-----------|-----------|-----------|
##
##
CrossTable(TitanicData$Survived,TitanicData$Embarked)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 891
##
##
## | TitanicData$Embarked
## TitanicData$Survived | | C | Q | S | Row Total |
## ---------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 0 | 75 | 47 | 427 | 549 |
## | 1.232 | 7.855 | 0.004 | 2.297 | |
## | 0.000 | 0.137 | 0.086 | 0.778 | 0.616 |
## | 0.000 | 0.446 | 0.610 | 0.663 | |
## | 0.000 | 0.084 | 0.053 | 0.479 | |
## ---------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 2 | 93 | 30 | 217 | 342 |
## | 1.978 | 12.609 | 0.007 | 3.688 | |
## | 0.006 | 0.272 | 0.088 | 0.635 | 0.384 |
## | 1.000 | 0.554 | 0.390 | 0.337 | |
## | 0.002 | 0.104 | 0.034 | 0.244 | |
## ---------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 2 | 168 | 77 | 644 | 891 |
## | 0.002 | 0.189 | 0.086 | 0.723 | |
## ---------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Average ticket fare by class
aggregate(TitanicData$Fare,list(TitanicData$Pclass),mean)
## Group.1 x
## 1 1 84.15469
## 2 2 20.66218
## 3 3 13.67555
# OR use dplyr package
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
TitanicData %>% group_by(Pclass) %>% summarise_each(funs(mean(., na.rm=TRUE)), -c(PassengerId,Name,Ticket,Cabin,Embarked))
## `summarise_each()` is deprecated.
## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
## To map `funs` over a selection of variables, use `summarise_at()`
## # A tibble: 3 x 8
## Pclass Survived Sex Age SibSp Parch Fare
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.6296296 0.4351852 36.81213 0.4166667 0.3564815 84.15469
## 2 2 0.4728261 0.4130435 29.76538 0.4021739 0.3804348 20.66218
## 3 3 0.2423625 0.2932790 25.93263 0.6150713 0.3930754 13.67555
## # ... with 1 more variables: AgeRank <dbl>
set.seed(42)
Train <- createDataPartition(TitanicData$Survived,p=0.7, list = FALSE)
Training <- TitanicData[Train,]
Testing <- TitanicData[-Train,]
### Modelling: Logistic Regression for training set:
Model <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Cabin + Embarked,data=Training, family=binomial)
StepModel <- step(Model,direction = 'both')
## Start: AIC=688.33
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Cabin + Embarked
##
## Df Deviance AIC
## - Cabin 118 552.15 570.15
## - Embarked 2 436.28 686.28
## - Parch 1 435.03 687.03
## <none> 434.33 688.33
## - SibSp 1 444.01 696.01
## - Age 1 444.18 696.18
## - Pclass 1 453.56 705.56
## - Sex 1 558.12 810.12
##
## Step: AIC=570.15
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - Embarked 3 557.99 569.99
## - Parch 1 554.05 570.05
## <none> 552.15 570.15
## - SibSp 1 559.14 575.14
## - Age 1 567.68 583.68
## - Pclass 1 607.48 623.48
## + Cabin 119 434.34 690.34
## - Sex 1 717.24 733.24
##
## Step: AIC=569.99
## Survived ~ Pclass + Sex + Age + SibSp + Parch
##
## Df Deviance AIC
## <none> 557.99 569.99
## + Embarked 3 552.15 570.15
## - Parch 1 560.32 570.32
## - SibSp 1 565.99 575.99
## - Age 1 573.20 583.20
## - Pclass 1 621.29 631.29
## + Cabin 119 436.28 686.28
## - Sex 1 736.33 746.33
summary(StepModel)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch,
## family = binomial, data = Training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5104 -0.5922 -0.4351 0.5968 2.3308
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.229126 0.516344 4.317 1.58e-05 ***
## Pclass -1.068930 0.141638 -7.547 4.46e-14 ***
## Sex 2.826355 0.240071 11.773 < 2e-16 ***
## Age -0.035498 0.009371 -3.788 0.000152 ***
## SibSp -0.338214 0.131672 -2.569 0.010211 *
## Parch -0.235070 0.157061 -1.497 0.134478
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 840.24 on 623 degrees of freedom
## Residual deviance: 557.99 on 618 degrees of freedom
## AIC: 569.99
##
## Number of Fisher Scoring iterations: 5
FinalModel <- glm(Survived ~ Pclass + Age + Sex + SibSp, data=Training)
summary(FinalModel)
##
## Call:
## glm(formula = Survived ~ Pclass + Age + Sex + SibSp, data = Training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.07125 -0.18180 -0.09441 0.20995 0.97332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.781446 0.075918 10.293 < 2e-16 ***
## Pclass -0.169659 0.020070 -8.453 < 2e-16 ***
## Age -0.005070 0.001311 -3.866 0.000122 ***
## Sex 0.521518 0.033017 15.796 < 2e-16 ***
## SibSp -0.051919 0.014247 -3.644 0.000291 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1456713)
##
## Null deviance: 149.840 on 623 degrees of freedom
## Residual deviance: 90.171 on 619 degrees of freedom
## AIC: 575.74
##
## Number of Fisher Scoring iterations: 2
### Use the fitted model to do prediction for the test data:
Testing$model_pred_probs <- predict(FinalModel, Testing, type="response")
Testing$model_pred_probs_test <- ifelse(Testing$model_pred_probs>=0.5,1,0)
### Create the confusion matrix, and compute the misclassification rate for the test data:
confusionMatrix(Testing$model_pred_probs_test,Testing$Survived) # 78.65% is pretty good.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 147 30
## 1 28 62
##
## Accuracy : 0.7828
## 95% CI : (0.7284, 0.8307)
## No Information Rate : 0.6554
## P-Value [Acc > NIR] : 3.892e-06
##
## Kappa : 0.5166
## Mcnemar's Test P-Value : 0.8955
##
## Sensitivity : 0.8400
## Specificity : 0.6739
## Pos Pred Value : 0.8305
## Neg Pred Value : 0.6889
## Prevalence : 0.6554
## Detection Rate : 0.5506
## Detection Prevalence : 0.6629
## Balanced Accuracy : 0.7570
##
## 'Positive' Class : 0
##
### 10-Fold Cross-Validation for the test data:
### When evaluating the models, we often want to assess how well it performs in predicting the target variables on different subsets of the data.
library(caret)
set.seed(42)
FinalModelCross <- train(as.factor(Survived) ~ Pclass + Age + Sex + SibSp,
TitanicData,
method="glm",
family="binomial",
trControl=trainControl(
method="cv", number = 10,
verboseIter = TRUE)
)
## + Fold01: parameter=none
## - Fold01: parameter=none
## + Fold02: parameter=none
## - Fold02: parameter=none
## + Fold03: parameter=none
## - Fold03: parameter=none
## + Fold04: parameter=none
## - Fold04: parameter=none
## + Fold05: parameter=none
## - Fold05: parameter=none
## + Fold06: parameter=none
## - Fold06: parameter=none
## + Fold07: parameter=none
## - Fold07: parameter=none
## + Fold08: parameter=none
## - Fold08: parameter=none
## + Fold09: parameter=none
## - Fold09: parameter=none
## + Fold10: parameter=none
## - Fold10: parameter=none
## Aggregating results
## Fitting final model on full training set
FinalModelCross$finalModel
##
## Call: NULL
##
## Coefficients:
## (Intercept) Pclass Age Sex SibSp
## 2.43755 -1.17565 -0.03955 2.73948 -0.35443
##
## Degrees of Freedom: 890 Total (i.e. Null); 886 Residual
## Null Deviance: 1187
## Residual Deviance: 791.2 AIC: 801.2
summary(FinalModelCross)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6817 -0.6029 -0.4159 0.6161 2.4327
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.437548 0.423758 5.752 8.81e-09 ***
## Pclass -1.175654 0.120073 -9.791 < 2e-16 ***
## Age -0.039553 0.007761 -5.096 3.47e-07 ***
## Sex 2.739477 0.193984 14.122 < 2e-16 ***
## SibSp -0.354433 0.103392 -3.428 0.000608 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 791.23 on 886 degrees of freedom
## AIC: 801.23
##
## Number of Fisher Scoring iterations: 5
pred <- predict(FinalModelCross,newdata=Testing)
confusionMatrix(data=pred,Testing$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 141 25
## 1 34 67
##
## Accuracy : 0.779
## 95% CI : (0.7244, 0.8273)
## No Information Rate : 0.6554
## P-Value [Acc > NIR] : 7.416e-06
##
## Kappa : 0.5219
## Mcnemar's Test P-Value : 0.2976
##
## Sensitivity : 0.8057
## Specificity : 0.7283
## Pos Pred Value : 0.8494
## Neg Pred Value : 0.6634
## Prevalence : 0.6554
## Detection Rate : 0.5281
## Detection Prevalence : 0.6217
## Balanced Accuracy : 0.7670
##
## 'Positive' Class : 0
##
TitanicData$predictkfold <- predict(FinalModelCross,newdata=TitanicData)
TitanicData$predict <- predict(FinalModel,newdata=TitanicData,type="response")
TitanicData$predictbinary <- ifelse(TitanicData$predict>=.5,1,0)
library(caret)
confusionMatrix(TitanicData$predictbinary,TitanicData$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 474 107
## 1 75 235
##
## Accuracy : 0.7957
## 95% CI : (0.7677, 0.8218)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5604
## Mcnemar's Test P-Value : 0.02157
##
## Sensitivity : 0.8634
## Specificity : 0.6871
## Pos Pred Value : 0.8158
## Neg Pred Value : 0.7581
## Prevalence : 0.6162
## Detection Rate : 0.5320
## Detection Prevalence : 0.6521
## Balanced Accuracy : 0.7753
##
## 'Positive' Class : 0
##
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:gmodels':
##
## ci
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
TitanicData$Survived <- as.factor(TitanicData$Survived)
ROCRpredwhole <- prediction(TitanicData$predict,TitanicData$Survived)
ROCRperfwhole <- performance(ROCRpredwhole,'tpr','fpr')
plot(ROCRperfwhole)
aucwhole <- performance(ROCRpredwhole, measure="auc")
aucwhole <- aucwhole@y.values[[1]]
aucwhole
## [1] 0.8516867
plot(ROCRperfwhole, main= "AUC=85%",xlab="1-Specificity",
ylab="Sensitivity", colorsize = TRUE, text.adj = c(-.2,1.7))
abline(a=0,b=1)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
randommodel <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp
, data= TitanicData,importance = TRUE, ntree=1000)
randommodel
##
## Call:
## randomForest(formula = as.factor(Survived) ~ Pclass + Sex + Age + SibSp, data = TitanicData, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 17.96%
## Confusion matrix:
## 0 1 class.error
## 0 493 56 0.1020036
## 1 104 238 0.3040936
library(ROCR)
predictions = as.vector(randommodel$votes[,2])
TitanicData$predictRF <- predictions
ROCRpredwholeRandom <- prediction(TitanicData$predictRF,TitanicData$Survived)
ROCRperfwholeRandom <- performance(ROCRpredwholeRandom,'tpr','fpr')
plot(ROCRperfwholeRandom)
aucwholeRandom <- performance(ROCRpredwholeRandom, measure="auc")
aucwholeRandom <- aucwholeRandom@y.values[[1]]
aucwholeRandom
## [1] 0.8628208
plot(ROCRperfwholeRandom, colorsize = TRUE, text.adj = c(-.2,1.7))
abline(a=0,b=1)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(neuralnet)
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:ROCR':
##
## prediction
## The following object is masked from 'package:dplyr':
##
## compute
TitanicData$Survived <- ifelse(TitanicData$Survived=="1",1,0)
TitanicData <- TitanicData[, sapply(TitanicData, is.numeric)]
maxs <- apply(TitanicData[,2:9], 2, max)
mins <- apply(TitanicData[,2:9], 2, min)
scaled <- as.data.frame(scale(TitanicData[,2:9], center = mins, scale = maxs-mins))
dataneuralnet <- cbind(TitanicData$PassengerId,scaled)
neuralnetmodel <- neuralnet(Survived ~ Pclass + Sex + Age + SibSp, data=dataneuralnet, hidden=5,linear.output=FALSE)
summary(neuralnetmodel)
## Length Class Mode
## call 5 -none- call
## response 891 -none- numeric
## covariate 3564 -none- numeric
## model.list 2 -none- list
## err.fct 1 -none- function
## act.fct 1 -none- function
## linear.output 1 -none- logical
## data 9 data.frame list
## net.result 1 -none- list
## weights 1 -none- list
## startweights 1 -none- list
## generalized.weights 1 -none- list
## result.matrix 34 -none- numeric
newdataneuralnet <- dataneuralnet[c("Pclass","Sex","Age","SibSp")]
pr.nn <- compute(neuralnetmodel,newdataneuralnet[,1:4])
neuralpredict <- ifelse(pr.nn$net.result >=0.5,1,0)
TitanicData$predictedneural <- pr.nn$net.result
TitanicData$predictedneuralbinary <- neuralpredict
detach("package:neuralnet", unload=TRUE)
library(ROCR)
ROCRpredneural <- prediction(TitanicData$predictedneural,TitanicData$Survived)
ROCRperfpredneural <- performance(ROCRpredneural,'tpr','fpr')
plot(ROCRperfpredneural)
aucpredneural <- performance(ROCRpredneural, measure="auc")
aucpredneural <- aucpredneural@y.values[[1]]
aucpredneural
## [1] 0.881179497
plot(ROCRperfpredneural, colorsize = TRUE, text.adj = c(-.2,1.7))
abline(a=0,b=1)
library(caret)
library(e1071)
confusionMatrix(TitanicData$predictedneuralbinary,TitanicData$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 506 97
## 1 43 245
##
## Accuracy : 0.8428732
## 95% CI : (0.8172892, 0.8661648)
## No Information Rate : 0.6161616
## P-Value [Acc > NIR] : < 0.00000000000000022204
##
## Kappa : 0.6576256
## Mcnemar's Test P-Value : 0.000007488206
##
## Sensitivity : 0.9216758
## Specificity : 0.7163743
## Pos Pred Value : 0.8391376
## Neg Pred Value : 0.8506944
## Prevalence : 0.6161616
## Detection Rate : 0.5679012
## Detection Prevalence : 0.6767677
## Balanced Accuracy : 0.8190250
##
## 'Positive' Class : 0
##
TitanicData$predavg <- (TitanicData$predictedneural + TitanicData$predict + TitanicData$predictRF)/3
library(ROCR)
ROCRpredavg <- prediction(TitanicData$predavg,TitanicData$Survived)
ROCRperfpredavg <- performance(ROCRpredavg,'tpr','fpr')
plot(ROCRperfpredavg)
aucpredavg <- performance(ROCRpredavg,measure ="auc")
aucpredavg <- aucpredavg@y.values[[1]]
aucpredavg
## [1] 0.8763914187
# Stepwise Logistic Regression Model
aucwhole * 100
## [1] 85.16867457
# Random Forest Model
aucwholeRandom *100
## [1] 86.28207586
# Neural Network Model
aucpredneural *100
## [1] 88.1179497
# Ensembled Model- Averaging
aucpredavg *100
## [1] 87.63914187
plot(ROCRperfwhole, col="red",colorsize = TRUE, text.adj = c(-.2,1.7), main="AUC Curves - Individual Model ")
plot(ROCRperfwholeRandom,add=TRUE,col="green", colorsize = TRUE, text.adj = c(-.2,1.7))
plot(ROCRperfpredneural,add=TRUE,col="blue", colorsize = TRUE, text.adj = c(-.2,1.7))
plot(ROCRperfpredavg,add=TRUE,col="black", colorsize = TRUE, text.adj = c(-.2,1.7))
labels <- c("GLM: AUC=85%","Random Forest: AUC=86%","Neural Net: AUC=88%", "Ensembled Average: AUC=87%")
legend("bottomright",inset=c(0,0), title="Legend",labels,bty="n",lwd=2,col=c("red","green","blue","black"))
abline(a=0,b=1)
There are many different machine learning technqiues that one can use to predict who survives from the Titanic disaster. In this problem, I use AUC value as a sole metric to evaluate which techique gives us the higest predictability.
Comparing all 4 techniques, neural network provides us the most accurate model to predict survival rates of Titanic passengers followed by an ensembled averaging model. Surprisingly, the simple stepwise logistic regression model provides the lowest AUC value.
Also, one can expand this project by looking at different ensembled techniques such as majority voting, or combination of different technqiues. I only did the averaging model here but please feel free to do others.