This report summarizes the steps followed to predict passenger survival aboard the Titanic. I will cover the following topics:
Guidance from online tutorials, posts and coursework at UC was extremely helpful throughout this project.
The following packages are required for this analysis:
library(data.table) # for importing data
library(dplyr) # for data manipulation
library(ggplot2) # for data visualization
library(rpart) # for using decision trees
library(rpart.plot) # for plotting decision trees
library(knitr) # for displaying data frame
library(randomForest) # for using randomforest
library(caret) # for building confusion matrix
Input data is downloaded from Kaggle.
input <- fread("train.csv") # read csv file
input <- as.data.frame(input) # convert to dataframe
# dimension of data
dim(input)
## [1] 891 12
# structure of data
str(input)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
# summary of data
summary(input)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
The data summary suggests 2 things:
These will be addressed in the data cleaning section.
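Note that `summary()` reports `NA` counts for numeric columns (such as Age above) but says nothing about empty strings in character columns such as Cabin. A per-column count of both can make the gaps easier to spot. The following is a minimal sketch on a toy data frame; `toy` and its values are illustrative assumptions, not rows from the Kaggle file.

```r
# Toy stand-in for the Titanic data (illustrative values only)
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", "C123"),
  Embarked = c("S", "", "S", "C"),
  stringsAsFactors = FALSE
)

# Count NA values per column
na_count <- sapply(toy, function(x) sum(is.na(x)))

# Count empty strings per column (only meaningful for character columns)
empty_count <- sapply(toy, function(x) {
  if (is.character(x)) sum(x == "", na.rm = TRUE) else 0L
})

print(rbind(na_count, empty_count))
```

Applied to the real data, the same two `sapply()` calls could be run on `input` right after import, before any factor conversion.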
The data contains 12 variables. Below is the description:
# convert the variables below to factors
col_name <- c('Pclass', 'Embarked', 'Sex', 'Survived')
input[col_name] <- lapply(input[col_name], as.factor)
levels(input$Survived) <- c("Died", "Survived")
# verify factor levels; the Embarked variable has 2 values that are empty strings
summary(input)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 Died :549 1:216 Length:891 female:314
## 1st Qu.:223.5 Survived:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 : 2
## 1st Qu.: 7.91 Class :character C:168
## Median : 14.45 Mode :character Q: 77
## Mean : 32.20 S:644
## 3rd Qu.: 31.00
## Max. :512.33
##
First, let us look into the Embarked variable, which has 2 missing values. Fare and passenger class are visualized by port of embarkation to estimate the missing values.
# check passenger class and fare for the rows with missing Embarked values
input %>% filter(Embarked == "")
## PassengerId Survived Pclass Name
## 1 62 Survived 1 Icard, Miss. Amelie
## 2 830 Survived 1 Stone, Mrs. George Nelson (Martha Evelyn)
## Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 1 female 38 0 0 113572 80 B28
## 2 female 62 0 0 113572 80 B28
# Compare 1st class fare of missing values with others in embarked variable
ggplot(input, aes(x = Embarked, y = Fare, fill = Pclass)) + geom_boxplot() +
ggtitle('Passenger Fare Based on Class and Port of Embarkation')
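# change missing Embarked values to C, as the fare of the missing rows matches the median 1st class fare from Cherbourg
input[input$Embarked == "", 'Embarked'] <- "C"
# drop the now-unused empty-string level from the factor
input$Embarked <- droplevels(input$Embarked)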
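The comparison behind this choice can also be made numerically rather than from the boxplot alone. The sketch below computes the median fare per port and class, then picks the port whose 1st class median is closest to the fare of 80 paid by the two passengers. The `toy` data frame and its fares are made-up assumptions for illustration; on the real data the same `aggregate()` call would be run on the rows of `input` with a known port.

```r
# Toy fares by port and class (made-up values, not the Kaggle data)
toy <- data.frame(
  Embarked = c("C", "C", "C", "S", "S", "S", "Q", "Q"),
  Pclass   = c(1, 1, 3, 1, 1, 3, 3, 3),
  Fare     = c(75, 85, 8, 50, 54, 8, 7, 8)
)

# Median fare for every port/class combination
med <- aggregate(Fare ~ Embarked + Pclass, data = toy, FUN = median)

# Keep 1st class only and find the port whose median is closest to the observed fare
observed_fare <- 80
first_class <- med[med$Pclass == 1, ]
closest <- first_class$Embarked[which.min(abs(first_class$Fare - observed_fare))]

print(med)
print(as.character(closest))
```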
Next, before imputing missing values in the Age variable, check its distribution.
# impute missing age values with median age
input[which(is.na(input$Age)),'Age'] <- median(input$Age, na.rm = TRUE)
Visualize survival by age, sex, number of siblings/spouses and number of parents/children to identify patterns. The following is evident from the plots:
This section focuses on creating new variables from existing ones. First, the variables Familysize and Familytype are created as follows:
### create Familysize variable
input <- input %>%
  mutate(Familysize = SibSp + Parch + 1)
### create Familytype variable
input <- input %>%
  mutate(Familytype = case_when(
    Familysize <= 1 ~ "Single",
    Familysize >= 2 & Familysize <= 4 ~ "Medium",
    Familysize >= 5 ~ "Large"
  ))
input$Familytype <- as.factor(input$Familytype)
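The same binning can be expressed with base R's `cut()`, which may be easier to adjust if the size thresholds change. This is an alternative sketch, not the code used for the results in this report; `family_size` holds illustrative values.

```r
# Illustrative family sizes (1 = travelling alone)
family_size <- c(1, 2, 4, 5, 11)

# Intervals are right-closed by default: (0,1], (1,4], (4,Inf]
family_type <- cut(
  family_size,
  breaks = c(0, 1, 4, Inf),
  labels = c("Single", "Medium", "Large")
)

print(data.frame(family_size, family_type))
```

One difference worth noting: `cut()` keeps the levels in the order Single, Medium, Large, whereas `as.factor()` on a character vector orders them alphabetically.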
Next, some useful information can be extracted from the Name variable. For example, survival could be affected by title.
### add Title variable: the text between ", " and the first ". " in Name
input <- input %>%
  mutate(Title = substr(Name, regexpr(", ", Name) + 2, regexpr("\\. ", Name) - 1))
# count by title
table(input$Title, input$Sex)
##
## female male
## Capt 0 1
## Col 0 2
## Don 0 1
## Dr 1 6
## Jonkheer 0 1
## Lady 1 0
## Major 0 2
## Master 0 40
## Miss 182 0
## Mlle 2 0
## Mme 1 0
## Mr 0 517
## Mrs 125 0
## Ms 1 0
## Rev 0 6
## Sir 0 1
## the Countess 1 0
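To make the extraction step concrete, the sketch below applies the same `substr()`/`regexpr()` logic to three names that appear in the output above, and compares it with a single `sub()` call. The helper name `extract_title` and the `sub()` pattern are assumptions introduced for illustration.

```r
# Same logic as the mutate() call: take the text between ", " and the first ". "
extract_title <- function(name) {
  start <- regexpr(", ", name) + 2
  end <- regexpr("\\. ", name) - 1
  substr(name, start, end)
}

names_sample <- c(
  "Braund, Mr. Owen Harris",
  "Heikkinen, Miss. Laina",
  "Stone, Mrs. George Nelson (Martha Evelyn)"
)

titles <- extract_title(names_sample)

# Alternative: capture the title with one regular expression
titles_alt <- sub("^[^,]*, ([^.]*)\\. .*$", "\\1", names_sample)

print(titles)      # "Mr" "Miss" "Mrs"
print(titles_alt)  # same result
```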
There are many distinct titles, and most of them are rare. The list can be condensed by combining titles.
# reassign title names
input$Title[input$Title == "Ms"] <- "Miss"
input$Title[input$Title == "Mme"] <- "Mrs"
input$Title[input$Title == "Mlle"] <- "Miss"
t_vector <- c("Capt", "Col", "Dr","Major")
input$Title[input$Title %in% t_vector] <- "Titleprof"
t_vector <- c("Don","Jonkheer","Lady","Rev","the Countess","Sir","Dona")
input$Title[input$Title %in% t_vector] <- "Titleother"
# convert title to factors
input$Title <- as.factor(input$Title)
# count by title
table(input$Title, input$Sex)
##
## female male
## Master 0 40
## Miss 185 0
## Mr 0 517
## Mrs 126 0
## Titleother 2 9
## Titleprof 1 11
Each non-empty Cabin value starts with a letter. A new variable Cabintype is created by extracting the first letter from the Cabin variable.
# create Cabintype variable
input <- input %>%
  mutate(Cabintype = substr(Cabin, 1, 1))
input$Cabintype <- as.factor(input$Cabintype)
Split the input data into training and test sets in a 70:30 ratio.
set.seed(42)
# shuffle rows
input <- input[sample(1:nrow(input)),]
train <- input[1:round(nrow(input) * 0.7),]
test <- input[(round(nrow(input) * 0.7) + 1):nrow(input),]
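As a quick check of the arithmetic, 70% of 891 rows is 623.7, which `round()` turns into 624 training rows and leaves 267 test rows. The sketch below reproduces the split on row indices only; the names `idx`, `train_idx` and `test_idx` are illustrative.

```r
n <- 891                      # rows in the input data
n_train <- round(n * 0.7)     # 624
n_test <- n - n_train         # 267

set.seed(42)
idx <- sample(seq_len(n))     # shuffled row indices

train_idx <- idx[1:n_train]
test_idx <- idx[(n_train + 1):n]

# The two index sets should not overlap and should cover every row
stopifnot(length(intersect(train_idx, test_idx)) == 0)
stopifnot(length(train_idx) + length(test_idx) == n)

print(c(train = n_train, test = n_test))
```

The 267 test rows match the total of the confusion matrix reported for the GLM.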
First, let’s fit a generalized linear model on the training data. Note that the model was built using forward variable selection; only the final model, reached after multiple trials, is listed below.
# build model
mod_glm <- glm(Survived ~ Age + Sex + Pclass + Familysize + Title + Embarked + Familytype + Cabintype,
               data = train,
               family = "binomial")
# predict survival in test data using above model
pred1 <- predict(mod_glm, test, type = "response")
# use threshold to convert predicted probability into predicted class
pred_class <- ifelse(pred1 < 0.6, "Died", "Survived")
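# build confusion matrix to find accuracy, specificity, etc.
# confusionMatrix() expects factors with the same levels, so convert the predictions first
pred_class <- factor(pred_class, levels = levels(test$Survived))
confusionMatrix(pred_class, test$Survived)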
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 156 28
## Survived 19 64
##
## Accuracy : 0.824
## 95% CI : (0.7729, 0.8677)
## No Information Rate : 0.6554
## P-Value [Acc > NIR] : 7.434e-10
##
## Kappa : 0.601
## Mcnemar's Test P-Value : 0.2432
##
## Sensitivity : 0.8914
## Specificity : 0.6957
## Pos Pred Value : 0.8478
## Neg Pred Value : 0.7711
## Prevalence : 0.6554
## Detection Rate : 0.5843
## Detection Prevalence : 0.6891
## Balanced Accuracy : 0.7935
##
## 'Positive' Class : Died
##
# store accuracy for comparison with other models
mod_glm_accuracy <- sum(pred_class == test$Survived) / nrow(test)
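The headline statistics can be recomputed by hand from the four counts in the confusion matrix, which helps clarify what each one means when "Died" is the positive class. This is a worked example using only the counts printed above; the object name `cm` is illustrative.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(
  c(156, 19, 28, 64),
  nrow = 2,
  dimnames = list(
    Prediction = c("Died", "Survived"),
    Reference  = c("Died", "Survived")
  )
)

accuracy    <- sum(diag(cm)) / sum(cm)                          # correct / all
sensitivity <- cm["Died", "Died"] / sum(cm[, "Died"])           # deaths correctly predicted
specificity <- cm["Survived", "Survived"] / sum(cm[, "Survived"])  # survivors correctly predicted
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
# accuracy 0.8240, sensitivity 0.8914, specificity 0.6957, balanced 0.7935
```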
Next, use a decision tree to build a model and predict survival. Here, the tree can be plotted to understand the important variables and the flow of decisions.
# build model
mod_dt <- rpart(Survived ~ Age + Pclass + Title + Familytype , data = train, method = "class")
#predict survival using above model
pred2 <- predict(mod_dt, test, type = "class")
# store accuracy for comparison with other models
mod_dt_accuracy <- sum(pred2 == test$Survived) / nrow(test)
# plot decision tree to visualize the decision flow
rpart.plot(mod_dt)
The tree shows that the Title and Familytype variables created in the feature engineering section are extremely useful in making decisions. Also, the probability of survival is relatively low for certain titles and large families.
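Besides reading the plot, a fitted `rpart` object stores a `variable.importance` component that ranks predictors numerically. The sketch below shows how it can be inspected; it uses the `kyphosis` data set shipped with `rpart` as a stand-in so that it runs on its own, so the numbers say nothing about the Titanic model. For this report the equivalent call would be `mod_dt$variable.importance`.

```r
library(rpart)

# Stand-in model on rpart's built-in kyphosis data (not the Titanic data)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Named numeric vector, sorted from most to least important
imp <- fit$variable.importance

# Express importance as a share of the total for easier comparison
print(round(imp / sum(imp), 3))
```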
The last model to try is random forest.
mod_rf <- randomForest(Survived ~ Age + Sex + Pclass + Familysize + Title + Fare + Embarked + Familytype + Cabintype, data = train)
pred3 <- predict(mod_rf, test, type = "response")
mod_rf_accuracy <- sum(pred3 == test$Survived) / nrow(test)
The 3 models will be compared based on prediction accuracy. Prediction was done on the test set.
| Model | Accuracy in % |
|---|---|
| GLM | 82.40 |
| Decision Trees | 82.02 |
| Random Forest | 81.27 |
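Thus, the generalized linear model (GLM) achieved the highest test accuracy for this problem, although the margin is small: on the 267-passenger test set it classified one more passenger correctly than the decision tree and three more than the random forest.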
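To put the accuracies in the table in perspective, each one can be turned back into a count of correct predictions out of 267 test passengers and given an exact binomial confidence interval with `binom.test()`. The counts below are derived from the table (for example 82.40% of 267 is 220); the object names are illustrative.

```r
n_test <- 267

# Correct predictions implied by the accuracy table
correct <- c(GLM = 220, DecisionTree = 219, RandomForest = 217)

# Exact 95% binomial confidence interval for each model's accuracy
ci <- t(sapply(correct, function(k) binom.test(k, n_test)$conf.int))
colnames(ci) <- c("lower", "upper")

print(round(cbind(accuracy = correct / n_test, ci), 4))
```

The GLM interval reproduces the 95% CI of (0.7729, 0.8677) in the `confusionMatrix()` output. If the three intervals overlap heavily, the ranking of the models on this single split should be read as tentative.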
The input data containing 891 rows and 12 variables is imported. First, data cleaning is performed to impute missing values and convert data types to suitable ones.
Next, the data is visualized to uncover patterns in survival. Passenger class, family size, age and sex seemed to impact survival. New variables are also created to extract more information from existing variables. For example, passenger title is extracted from name and cabin type is extracted from cabin number.
Once the data is cleaned and new features are created, the next step is to try out various models using variables selected via variable selection. The models built are a generalized linear model, a decision tree and a random forest. The best model is selected based on prediction accuracy. The generalized linear model (GLM) performed best for this problem to predict survival. The important variables used in this model are: