Introduction

The purpose of this report is to summarize the steps followed to predict passenger survival aboard the Titanic. I will cover the following topics:

  • Data Importing
  • Data Cleaning
  • Feature Engineering
  • Data Visualizations
  • Modeling, Prediction and Model Comparison

Online tutorials, posts and coursework at UC were extremely helpful throughout this project.

Loading Required Packages

The following packages are required for this analysis:

library(data.table) # for importing data
library(dplyr) # for data manipulation
library(ggplot2) # for data visualization
library(rpart) # for using decision trees
library(rpart.plot) # for plotting decision trees
library(knitr) # for displaying data frame
library(randomForest) # for using randomforest
library(caret) # for building confusion matrix

Loading Data

The input data was downloaded from Kaggle.

input <- fread("train.csv") # read csv file
input <- as.data.frame(input) # convert to dataframe

Summarizing Data

# dimension of data
dim(input)
## [1] 891  12
# structure of data
str(input)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
# summary of data
summary(input)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 

The data summary suggests two things:

  • Variables such as Sex, Survived and Embarked should be factors.
  • The Age variable has 177 missing values.

These are addressed in the Data Cleaning section.
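
As a quick check of the second point, the sketch below counts missing values per column. It treats both NA and empty strings as missing, since character columns such as Cabin store blanks as "". The data frame `toy` and the helper `count_missing` are hypothetical stand-ins with illustrative values, not part of the original analysis.

```r
# toy data standing in for a few Titanic columns (values are illustrative)
toy <- data.frame(
  Age      = c(22, 38, NA, 35, NA),
  Cabin    = c("", "C85", "", "C123", ""),
  Embarked = c("S", "C", "S", "", "Q"),
  stringsAsFactors = FALSE
)

# hypothetical helper: count NA or empty-string entries in one column
count_missing <- function(x) sum(is.na(x) | x == "")

missing_counts <- sapply(toy, count_missing)
print(missing_counts)
```

Run against the real `input` data frame, a check like this would flag Age as well as the blank entries in Cabin and Embarked, which `summary()` does not report for character columns.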

The data contains 12 variables. Apart from PassengerId and Name, they are described below:

  • survival: Survival 0 = No, 1 = Yes
  • pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
  • sex: Sex
  • Age: Age in years
  • sibsp: # of siblings / spouses aboard the Titanic
  • parch: # of parents / children aboard the Titanic
  • ticket: Ticket number
  • fare: Passenger fare
  • cabin: Cabin number
  • embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

Data Cleaning

Factor Conversion

# convert the variables below to factors
col_name <- c('Pclass', 'Embarked', 'Sex', 'Survived')
input[col_name] <- lapply(input[col_name], function(x) as.factor(x))
levels(input$Survived) <- c("Died", "Survived")

# verify factor levels; the Embarked variable has 2 values that are empty strings
summary(input)
##   PassengerId        Survived   Pclass      Name               Sex     
##  Min.   :  1.0   Died    :549   1:216   Length:891         female:314  
##  1st Qu.:223.5   Survived:342   2:184   Class :character   male  :577  
##  Median :446.0                  3:491   Mode  :character               
##  Mean   :446.0                                                         
##  3rd Qu.:668.5                                                         
##  Max.   :891.0                                                         
##                                                                        
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##  NA's   :177                                                        
##       Fare           Cabin           Embarked
##  Min.   :  0.00   Length:891          :  2   
##  1st Qu.:  7.91   Class :character   C:168   
##  Median : 14.45   Mode  :character   Q: 77   
##  Mean   : 32.20                      S:644   
##  3rd Qu.: 31.00                              
##  Max.   :512.33                              
## 

Missing Values Imputation

First, let us look at the Embarked variable, which has 2 missing values. Fare and passenger class are visualized by port of embarkation to estimate the missing values.

# check passengerclass and fare for missing embarked variable values
input %>% filter(Embarked == "")
##   PassengerId Survived Pclass                                      Name
## 1          62 Survived      1                       Icard, Miss. Amelie
## 2         830 Survived      1 Stone, Mrs. George Nelson (Martha Evelyn)
##      Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 1 female  38     0     0 113572   80   B28         
## 2 female  62     0     0 113572   80   B28
# Compare 1st class fare of missing values with others in embarked variable
ggplot(input, aes(x = Embarked, y = Fare, fill = Pclass)) + geom_boxplot() + 
  ggtitle('Passenger Fare Based on Class and Port of Embarkation')

# set missing Embarked to "C": the fare of both passengers (80) matches the median 1st class fare from Cherbourg
input[input$Embarked == "", 'Embarked'] <- "C"
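
The comparison read off the boxplot can also be done numerically. The sketch below is a minimal illustration: it computes the median fare per port and class and picks the port closest to the fare of 80. The `toy` data frame and its fares are made-up values for demonstration, and `fare_medians` and `closest_port` are hypothetical names.

```r
library(dplyr)

# toy 1st class fares by port (illustrative values, not the real data)
toy <- data.frame(
  Embarked = c("C", "C", "C", "S", "S", "S", "Q"),
  Pclass   = c(1, 1, 1, 1, 1, 1, 1),
  Fare     = c(76, 80, 90, 50, 52, 60, 90)
)

# median fare for each port/class combination
fare_medians <- toy %>%
  group_by(Embarked, Pclass) %>%
  summarise(MedianFare = median(Fare), .groups = "drop")

# port whose median fare is closest to the fare of 80 paid by both passengers
closest_port <- fare_medians$Embarked[which.min(abs(fare_medians$MedianFare - 80))]
print(closest_port)
```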

Next, before imputing the missing values in the Age variable, check its distribution.

# impute missing age values with median age
input[which(is.na(input$Age)),'Age'] <- median(input$Age, na.rm = TRUE)
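
To see what median imputation does to the distribution, here is a small sketch using the first ten ages shown in the str() output above. The variable names `age`, `age_median` and `age_imputed` are mine and only serve this illustration.

```r
# first ten ages from the str() output above, including one missing value
age <- c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14)

# numeric view of the distribution before imputation
print(summary(age))
age_median <- median(age, na.rm = TRUE)

# replace missing ages with the median of the observed ages
age_imputed <- ifelse(is.na(age), age_median, age)

# the median is unchanged, but the spread shrinks because values are added at the center
print(c(sd_before = sd(age, na.rm = TRUE), sd_after = sd(age_imputed)))
```

With 177 of 891 ages missing in the full data, this shrinking of the spread is worth keeping in mind when interpreting the effect of Age in the models.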

Data Visualizations

Visualize survival by age, sex, siblings and parents to identify patterns. The plots show that:

  • Age and sex play a role in survival
  • Survival rate decreases as the number of family members increases
  • Survival rate differs across passenger classes when separated by sex
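
A plot of the last kind could be produced along the following lines. This is only a sketch of one possible chart, not necessarily the one used for this report, and the `toy` data frame contains illustrative values rather than real passengers.

```r
library(ggplot2)

# toy data (illustrative values, not the real passengers)
toy <- data.frame(
  Sex      = factor(c("female", "female", "male", "male", "female", "male")),
  Pclass   = factor(c(1, 3, 1, 3, 2, 2)),
  Survived = factor(c("Survived", "Died", "Died", "Died", "Survived", "Survived"))
)

# share of survivors by sex, one panel per passenger class
p <- ggplot(toy, aes(x = Sex, fill = Survived)) +
  geom_bar(position = "fill") +
  facet_wrap(~ Pclass) +
  ylab("Proportion") +
  ggtitle("Survival by Sex and Passenger Class")
print(p)
```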

Feature Engineering

This section focuses on creating new variables from existing ones. First, the variables Familysize and Familytype are created as follows:

### create familysize var
input <- input %>%
  mutate(Familysize = SibSp + Parch + 1)

### create familytype variable
input <- input %>%
  mutate(Familytype = case_when(
    Familysize <= 1 ~ "Single",
    Familysize >= 2 & Familysize <= 4 ~ "Medium",
    Familysize >= 5 ~ "Large"
  ))

input$Familytype <- as.factor(input$Familytype)
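
The same binning can be written with base R's cut(), shown here as an alternative sketch rather than the method used above. The vector `family_size` holds example values chosen to hit the boundaries of each group.

```r
# example family sizes, including the boundary values 1, 2, 4 and 5
family_size <- c(1, 2, 4, 5, 11)

# intervals are (0,1], (1,4] and (4,Inf], matching the case_when() rules for whole numbers
family_type <- cut(family_size,
                   breaks = c(0, 1, 4, Inf),
                   labels = c("Single", "Medium", "Large"))
print(data.frame(family_size, family_type))
```

One difference is that cut() keeps the levels in the order given (Single, Medium, Large), whereas as.factor() on a character vector sorts them alphabetically.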

Next, some useful information can be extracted from the Name variable. For example, survival could be affected by title.

### add title variable
input <- input %>% 
  mutate(Title = substr(Name, regexpr(", ", Name) + 2, regexpr("\\. ", Name) - 1))

# count by title
table(input$Title, input$Sex)
##               
##                female male
##   Capt              0    1
##   Col               0    2
##   Don               0    1
##   Dr                1    6
##   Jonkheer          0    1
##   Lady              1    0
##   Major             0    2
##   Master            0   40
##   Miss            182    0
##   Mlle              2    0
##   Mme               1    0
##   Mr                0  517
##   Mrs             125    0
##   Ms                1    0
##   Rev               0    6
##   Sir               0    1
##   the Countess      1    0
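
To make the extraction step concrete, the sketch below applies the same substr() and regexpr() logic to three names from the str() output above. The intermediate variables `comma_pos` and `period_pos` are introduced here only for readability.

```r
# three names taken from the str() output above
name <- c("Braund, Mr. Owen Harris",
          "Heikkinen, Miss. Laina",
          "Futrelle, Mrs. Jacques Heath (Lily May Peel)")

# position of the first ", " and the first ". " in each name
comma_pos  <- regexpr(", ", name)
period_pos <- regexpr("\\. ", name)

# the title sits between the two: skip the ", " and stop before the "."
title <- substr(name, comma_pos + 2, period_pos - 1)
print(title)
```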

There are many distinct titles, and most of them are rare. The list can be condensed by combining titles.

# reassign title names
input$Title[input$Title == "Ms"] <- "Miss"
input$Title[input$Title == "Mme"] <- "Mrs"
input$Title[input$Title == "Mlle"] <- "Miss"
t_vector <- c("Capt", "Col", "Dr","Major")
input$Title[input$Title %in% t_vector] <- "Titleprof"
t_vector <- c("Don","Jonkheer","Lady","Rev","the Countess","Sir","Dona")
input$Title[input$Title %in% t_vector] <- "Titleother"

# convert title to factors
input$Title <- as.factor(input$Title)

# count by title
table(input$Title, input$Sex)
##             
##              female male
##   Master          0   40
##   Miss          185    0
##   Mr              0  517
##   Mrs           126    0
##   Titleother      2    9
##   Titleprof       1   11

Each non-empty Cabin value starts with a letter. A new variable, Cabintype, is created by extracting this first letter; passengers with no recorded cabin get an empty Cabintype.

# create Cabintype variable
input <- input %>%
  mutate(Cabintype = substr(Cabin, 1, 1))

input$Cabintype <- as.factor(input$Cabintype)
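
The sketch below shows the extraction on a few cabin values that appear in the output above. Relabelling empty entries as "Unknown" is my assumption for readability and is not done in the analysis itself, where the empty string is kept as its own level.

```r
# cabin values as they appear in the str() and filter() output above
cabin <- c("", "C85", "", "C123", "B28")

# first character of each cabin; empty cabins stay empty
cabin_type <- substr(cabin, 1, 1)

# assumption: label empty entries explicitly so the factor level is readable
cabin_type[cabin_type == ""] <- "Unknown"
cabin_type <- as.factor(cabin_type)
print(table(cabin_type))
```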

Modeling and Prediction

Train Test Data Split

Split the input data into train and test sets using a 70:30 ratio.

set.seed(42)
# shuffle rows
input <- input[sample(1:nrow(input)),]

train <-  input[1:round(nrow(input) * 0.7),]
test <- input[(round(nrow(input) * 0.7) + 1):nrow(input),]
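
As a sanity check on the split arithmetic, the sketch below applies the same index logic to row numbers only. The names `idx`, `n_train`, `train_idx` and `test_idx` are hypothetical and do not appear in the analysis.

```r
set.seed(42)
n <- 891  # number of rows in the input data

# shuffled row indices, then the first 70% for training and the rest for testing
idx       <- sample(n)
n_train   <- round(n * 0.7)
train_idx <- idx[1:n_train]
test_idx  <- idx[(n_train + 1):n]

print(c(train = length(train_idx), test = length(test_idx)))
```

This gives 624 training rows and 267 test rows with no overlap, which agrees with the 267 predictions in the GLM confusion matrix.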

Using GLM

First, let’s fit a generalized linear model (logistic regression, via the binomial family) on the training data. Note that the model was built using forward variable selection; only the final model, reached after multiple trials, is listed below.

# build model
mod_glm <- glm(Survived ~ Age + Sex + Pclass + Familysize + Title + Embarked + Familytype + Cabintype, 
               train,
               family = "binomial")

# predict survival in test data using above model
pred1 <- predict(mod_glm, test, type = "response") 

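# use a 0.6 threshold to convert predicted probability into predicted class
pred_class <- factor(ifelse(pred1 < 0.6, "Died", "Survived"), levels = levels(test$Survived))

# build confusion matrix to find accuracy, specificity, etc.
# (recent caret versions require both arguments to be factors with the same levels)
confusionMatrix(pred_class, test$Survived)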
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Died Survived
##   Died      156       28
##   Survived   19       64
##                                           
##                Accuracy : 0.824           
##                  95% CI : (0.7729, 0.8677)
##     No Information Rate : 0.6554          
##     P-Value [Acc > NIR] : 7.434e-10       
##                                           
##                   Kappa : 0.601           
##  Mcnemar's Test P-Value : 0.2432          
##                                           
##             Sensitivity : 0.8914          
##             Specificity : 0.6957          
##          Pos Pred Value : 0.8478          
##          Neg Pred Value : 0.7711          
##              Prevalence : 0.6554          
##          Detection Rate : 0.5843          
##    Detection Prevalence : 0.6891          
##       Balanced Accuracy : 0.7935          
##                                           
##        'Positive' Class : Died            
## 
# store accuracy for comparison with other models
mod_glm_accuracy <- sum(pred_class == test$Survived) / nrow(test)
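
The headline statistics can be reproduced by hand from the four cell counts in the confusion matrix above. The sketch below does this; the names `tp`, `fp`, `fn` and `tn` are my shorthand, with "Died" as the positive class as reported by caret.

```r
# counts from the confusion matrix above ("Died" is the positive class)
tp <- 156  # predicted Died, actually Died
fp <- 28   # predicted Died, actually Survived
fn <- 19   # predicted Survived, actually Died
tn <- 64   # predicted Survived, actually Survived

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)  # share of actual deaths predicted correctly
specificity <- tn / (tn + fp)  # share of actual survivors predicted correctly

print(round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4))
```

Because "Died" is the positive class, the lower specificity means the model is weaker at identifying survivors than at identifying deaths.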

Using Decision Tree

Next, a decision tree is used to build a model and predict survival. The tree can be plotted to understand the important variables and the flow of decisions.

# build model
mod_dt <- rpart(Survived ~ Age + Pclass + Title + Familytype , data = train, method = "class")

# predict survival using above model
pred2 <- predict(mod_dt, test, type = "class")

# store accuracy for comparison with other models
mod_dt_accuracy <- sum(pred2 == test$Survived) / nrow(test) 

# plot decision tree to visualize the decision flow
rpart.plot(mod_dt)

The tree shows that the Title and Familytype variables created in the feature engineering section are extremely useful in making decisions. Also, the probability of survival is relatively low for certain titles and large families.
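
Besides reading the plot, an rpart fit also stores a numeric importance score per variable. The sketch below shows how to access it, using the kyphosis data set that ships with rpart as a stand-in for the Titanic training set, so the variables and scores are not those of this analysis.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the Titanic training set here
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# named vector of importance scores, one per variable used by the tree
print(fit$variable.importance)
```

The same element on `mod_dt` would give a numeric counterpart to the visual reading of the tree.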

Using Random Forest

The last model to try is a random forest.

# build model
mod_rf <- randomForest(Survived ~ Age + Sex + Pclass + Familysize + Title + Fare + Embarked + Familytype + Cabintype, train)

# predict survival in test data using above model
pred3 <- predict(mod_rf, test, type = "response")

# store accuracy for comparison with other models
mod_rf_accuracy <- sum(pred3 == test$Survived) / nrow(test)

Model Comparison

The three models are compared on prediction accuracy, measured on the test set.

Model            Accuracy in %
GLM              82.40
Decision Tree    82.02
Random Forest    81.27

Thus, the generalized linear model (GLM) achieved the highest test accuracy for this problem, although all three models are within about 1.1 percentage points of each other.
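
A comparison table like the one above can be assembled from the stored accuracies. In the sketch below the values are copied from the table; in the analysis they would come from mod_glm_accuracy, mod_dt_accuracy and mod_rf_accuracy multiplied by 100. The data frame name `comparison` is hypothetical.

```r
library(knitr)

# accuracies in % copied from the table above
comparison <- data.frame(
  Model    = c("GLM", "Decision Tree", "Random Forest"),
  Accuracy = c(82.40, 82.02, 81.27)
)

# sort from most to least accurate and render as a table
comparison <- comparison[order(-comparison$Accuracy), ]
print(kable(comparison, row.names = FALSE, col.names = c("Model", "Accuracy in %")))
```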

Summary

The input data, containing 891 rows and 12 variables, is imported. First, data cleaning is performed to impute missing values and convert data types to suitable ones.
Next, the data is visualized to uncover patterns in survival. Passenger class, family size, age and sex seemed to impact survival. New variables are also created to extract more information from existing variables. For example, passenger title is extracted from name and cabin type is extracted from cabin number.
Once the data is cleaned and the new features are created, the next step is to try out various models using multiple variables selected via variable selection. The models built are a generalized linear model, a decision tree and a random forest. The best model is selected based on prediction accuracy. The generalized linear model (GLM) performed best for this problem. The important variables used in this model are:

  • Title
  • Family size and type
  • Cabin type
  • Passenger class
  • Age
  • Sex
  • Port of Embarkation