Business Intelligence

Agenda

Case Study: Titanic Survival
Import Data
Exploration and Data Preparation
Build Logistic Model
Predict with Logistic Model
Evaluate Logistic Model

Case Study - Titanic Survival Prediction

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Step 1: Load data

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

#You can also laod the datasets from Titanic dataset
## Load Titanic library to get the dataset
#install.packages("titanic")
library(titanic)

## Load the datasets
data("titanic_train")
data("titanic_test")

Step 1: Load data

Since this dataset is from a Kaggle competition, we don’t have the outcome variable for the test dataset, we can add additional column then combine the two datasets.

## Setting Survived column for test data to NA
titanic_test$Survived <- NA

## Combining Training and Testing dataset
complete_data <- rbind(titanic_train, titanic_test)

## Check data structure
str(complete_data)

## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Step 2: Exploring and Preparing Data

We can first check whether there are missing and empty values

## Let's check for any missing values in the data
colSums(is.na(complete_data))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0         418           0           0           0         263 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           1           0           0

## Checking for empty values
colSums(complete_data=='')

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0          NA           0           0           0          NA 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0          NA        1014           2

Step 2: Exploring and Preparing Data

Since there are some categorical data, we will first check how many categories each have to decide keep them in the model. (sapply()is used to apply a function over a list/vector).

For the ones with a few values and relevant to our prediction of survival, we can keep them and convert them as factor and dummy variables.

## Check number of unique values for each of the column to find out columns which we can convert to factors
sapply(complete_data, function(x) length(unique(x)))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##        1309           3           3        1307           2          99 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           7           8         929         282         187           4

Dealing with Categorical Variables

Note: Only convert categorical variables, do not covert numeric variables into factor types.

## Removing Cabin as it has very high missing values, passengerId, Ticket and Name are not required
library(dplyr)
titanic_data <- complete_data %>% select(-c(Cabin, PassengerId, Ticket, Name))

## Converting "Survived","Pclass","Sex","Embarked" to factors
for (i in c("Survived","Pclass","Sex","Embarked")){
  titanic_data[,i]=as.factor(titanic_data[,i])
}

Impute Missing Values

Then we can impute/fill in missing value, as well as discard those variable not relevant

 ## Missing values imputation
complete_data$Embarked[complete_data$Embarked==""] <- "S"

complete_data$Age[is.na(complete_data$Age)] <- median(complete_data$Age,na.rm=T)

Spliting data into training and testing datasets

We will split our data into two portions:

a training dataset to build the model (75%)
a test dataset to evaluate the model performance (25%)

train <- titanic_data[1:667,]
test <- titanic_data[668:889,]

Step 3- Training a Logistic Model

We will use the generalized linear models in the glm package to train our logistic regression model.

## Model Creation
model <- glm(Survived ~Pclass+Sex+Age+SibSp+Parch+Fare,
             family=binomial(link='logit'),
             data=train)
## Model Summary
summary(model)

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
##     Fare, family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5419  -0.6929  -0.4115   0.6621   2.3492  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.992484   0.593485   6.727 1.73e-11 ***
## Pclass2     -1.254120   0.381916  -3.284  0.00102 ** 
## Pclass3     -2.458470   0.403688  -6.090 1.13e-09 ***
## Sexmale     -2.554458   0.247865 -10.306  < 2e-16 ***
## Age         -0.036871   0.009375  -3.933 8.39e-05 ***
## SibSp       -0.298151   0.140182  -2.127  0.03343 *  
## Parch       -0.080783   0.146222  -0.552  0.58063    
## Fare        -0.001773   0.003361  -0.528  0.59776    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 715.14  on 527  degrees of freedom
## Residual deviance: 491.63  on 520  degrees of freedom
##   (139 observations deleted due to missingness)
## AIC: 507.63
## 
## Number of Fisher Scoring iterations: 4

Model Evaluation

Then we can see the detailed result of the logistic model by summary(). And anova stands for analysis of variance, helps determine whether the differences between group means of each feature are statistically significant.

## Using anova() to analyze the table of devaiance
anova(model, test="Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Survived
## 
## Terms added sequentially (first to last)
## 
## 
##        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                     527     715.14              
## Pclass  2   60.114       525     655.02 8.839e-14 ***
## Sex     1  144.474       524     510.55 < 2.2e-16 ***
## Age     1   11.386       523     499.16 0.0007398 ***
## SibSp   1    6.779       522     492.39 0.0092252 ** 
## Parch   1    0.493       521     491.89 0.4824373    
## Fare    1    0.264       520     491.63 0.6072937    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predict with Logistic Model

## Predicting Test Data
result <- predict(model,newdata=test,type='response')
result <- ifelse(result > 0.5,1,0)

result<-as.factor(result)
test$Survived<- as.factor(test$Survived)

Evaluate with Confusion Matrix

## Confusion matrix and statistics
#install.packages("caret")
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

confusionMatrix(data=result, reference=test$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 96 19
##          1 16 53
##                                           
##                Accuracy : 0.8098          
##                  95% CI : (0.7455, 0.8638)
##     No Information Rate : 0.6087          
##     P-Value [Acc > NIR] : 3.613e-09       
##                                           
##                   Kappa : 0.5977          
##                                           
##  Mcnemar's Test P-Value : 0.7353          
##                                           
##             Sensitivity : 0.8571          
##             Specificity : 0.7361          
##          Pos Pred Value : 0.8348          
##          Neg Pred Value : 0.7681          
##              Prevalence : 0.6087          
##          Detection Rate : 0.5217          
##    Detection Prevalence : 0.6250          
##       Balanced Accuracy : 0.7966          
##                                           
##        'Positive' Class : 0               
##

Predict with Your Own Data

## Predicting with our own data, see whether you will survive or not? 
predict(model,
        newdata=data.frame(Pclass="3",Sex="female",
                                 Age=29,SibSp=2,Parch=2,
                                 Fare=50,Embarked = "S"),
        type='response')

##         1 
## 0.4056977