Case Study: Titanic Survival
Import Data
Exploration and Data Preparation
Build Logistic Model
Predict with Logistic Model
Evaluate Logistic Model
Case Study: Titanic Survival
Import Data
Exploration and Data Preparation
Build Logistic Model
Predict with Logistic Model
Evaluate Logistic Model
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).
#You can also laod the datasets from Titanic dataset
## Load Titanic library to get the dataset
#install.packages("titanic")
library(titanic)
## Load the datasets
data("titanic_train")
data("titanic_test")
Since this dataset is from a Kaggle competition, we don’t have the outcome variable for the test dataset, we can add additional column then combine the two datasets.
## Setting Survived column for test data to NA titanic_test$Survived <- NA ## Combining Training and Testing dataset complete_data <- rbind(titanic_train, titanic_test) ## Check data structure str(complete_data)
## 'data.frame': 1309 obs. of 12 variables: ## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ... ## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ... ## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ... ## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ... ## $ Sex : chr "male" "female" "female" "female" ... ## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ... ## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ... ## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ... ## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ... ## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ... ## $ Cabin : chr "" "C85" "" "C123" ... ## $ Embarked : chr "S" "C" "S" "S" ...
We can first check whether there are missing and empty values
## Let's check for any missing values in the data colSums(is.na(complete_data))
## PassengerId Survived Pclass Name Sex Age ## 0 418 0 0 0 263 ## SibSp Parch Ticket Fare Cabin Embarked ## 0 0 0 1 0 0
## Checking for empty values colSums(complete_data=='')
## PassengerId Survived Pclass Name Sex Age ## 0 NA 0 0 0 NA ## SibSp Parch Ticket Fare Cabin Embarked ## 0 0 0 NA 1014 2
Since there are some categorical data, we will first check how many categories each have to decide keep them in the model. (sapply()is used to apply a function over a list/vector).
For the ones with a few values and relevant to our prediction of survival, we can keep them and convert them as factor and dummy variables.
## Check number of unique values for each of the column to find out columns which we can convert to factors sapply(complete_data, function(x) length(unique(x)))
## PassengerId Survived Pclass Name Sex Age ## 1309 3 3 1307 2 99 ## SibSp Parch Ticket Fare Cabin Embarked ## 7 8 929 282 187 4
Note: Only convert categorical variables, do not covert numeric variables into factor types.
## Removing Cabin as it has very high missing values, passengerId, Ticket and Name are not required
library(dplyr)
titanic_data <- complete_data %>% select(-c(Cabin, PassengerId, Ticket, Name))
## Converting "Survived","Pclass","Sex","Embarked" to factors
for (i in c("Survived","Pclass","Sex","Embarked")){
titanic_data[,i]=as.factor(titanic_data[,i])
}
Then we can impute/fill in missing value, as well as discard those variable not relevant
## Missing values imputation complete_data$Embarked[complete_data$Embarked==""] <- "S" complete_data$Age[is.na(complete_data$Age)] <- median(complete_data$Age,na.rm=T)
We will split our data into two portions:
a training dataset to build the model (75%)
a test dataset to evaluate the model performance (25%)
train <- titanic_data[1:667,] test <- titanic_data[668:889,]
We will use the generalized linear models in the glm package to train our logistic regression model.
## Model Creation
model <- glm(Survived ~Pclass+Sex+Age+SibSp+Parch+Fare,
family=binomial(link='logit'),
data=train)
## Model Summary
summary(model)
## ## Call: ## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + ## Fare, family = binomial(link = "logit"), data = train) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.5419 -0.6929 -0.4115 0.6621 2.3492 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 3.992484 0.593485 6.727 1.73e-11 *** ## Pclass2 -1.254120 0.381916 -3.284 0.00102 ** ## Pclass3 -2.458470 0.403688 -6.090 1.13e-09 *** ## Sexmale -2.554458 0.247865 -10.306 < 2e-16 *** ## Age -0.036871 0.009375 -3.933 8.39e-05 *** ## SibSp -0.298151 0.140182 -2.127 0.03343 * ## Parch -0.080783 0.146222 -0.552 0.58063 ## Fare -0.001773 0.003361 -0.528 0.59776 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 715.14 on 527 degrees of freedom ## Residual deviance: 491.63 on 520 degrees of freedom ## (139 observations deleted due to missingness) ## AIC: 507.63 ## ## Number of Fisher Scoring iterations: 4
Then we can see the detailed result of the logistic model by summary(). And anova stands for analysis of variance, helps determine whether the differences between group means of each feature are statistically significant.
## Using anova() to analyze the table of devaiance anova(model, test="Chisq")
## Analysis of Deviance Table ## ## Model: binomial, link: logit ## ## Response: Survived ## ## Terms added sequentially (first to last) ## ## ## Df Deviance Resid. Df Resid. Dev Pr(>Chi) ## NULL 527 715.14 ## Pclass 2 60.114 525 655.02 8.839e-14 *** ## Sex 1 144.474 524 510.55 < 2.2e-16 *** ## Age 1 11.386 523 499.16 0.0007398 *** ## SibSp 1 6.779 522 492.39 0.0092252 ** ## Parch 1 0.493 521 491.89 0.4824373 ## Fare 1 0.264 520 491.63 0.6072937 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Predicting Test Data result <- predict(model,newdata=test,type='response') result <- ifelse(result > 0.5,1,0) result<-as.factor(result) test$Survived<- as.factor(test$Survived)
## Confusion matrix and statistics
#install.packages("caret")
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
confusionMatrix(data=result, reference=test$Survived)
## Confusion Matrix and Statistics ## ## Reference ## Prediction 0 1 ## 0 96 19 ## 1 16 53 ## ## Accuracy : 0.8098 ## 95% CI : (0.7455, 0.8638) ## No Information Rate : 0.6087 ## P-Value [Acc > NIR] : 3.613e-09 ## ## Kappa : 0.5977 ## ## Mcnemar's Test P-Value : 0.7353 ## ## Sensitivity : 0.8571 ## Specificity : 0.7361 ## Pos Pred Value : 0.8348 ## Neg Pred Value : 0.7681 ## Prevalence : 0.6087 ## Detection Rate : 0.5217 ## Detection Prevalence : 0.6250 ## Balanced Accuracy : 0.7966 ## ## 'Positive' Class : 0 ##
## Predicting with our own data, see whether you will survive or not?
predict(model,
newdata=data.frame(Pclass="3",Sex="female",
Age=29,SibSp=2,Parch=2,
Fare=50,Embarked = "S"),
type='response')
## 1 ## 0.4056977