Background

Introduction

The data we’re exploring is the Titanic passenger dataset from Kaggle. The goal is to build a logistic regression model that classifies whether a passenger survived or not. We’re not using kNN for this one because most of the predictor variables are categorical, and kNN works best with numerical variables.

Library

library(tidyverse)
library(stringr)
library(GGally)
library(lmtest)
library(car)
library(MLmetrics)
library(effects)
library(magrittr)
library(caret)
library(rsample)

Data Setup

passenger <- read.csv("data_input/train.csv")

knitr::kable(head(passenger))

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	0	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.2500		S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.9250		S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35	0	373450	8.0500		S
6	0	3	Moran, Mr. James	male	NA	0	330877	8.4583		Q

Data Explanation:
- survived = Whether the passenger survived (1) or not (0)
- pclass = Ticket class
- sex = Sex
- age = Age in years
- sibsp = # of siblings/spouses aboard
- parch = # of parents/children aboard
- ticket = Ticket number
- fare = Passenger fare
- cabin = Cabin number
- embarked = Port of embarkation

Data Wrangling

names(passenger) <- str_to_lower(names(passenger))

Check for NA

colSums(is.na(passenger))

## passengerid    survived      pclass        name         sex         age 
##           0           0           0           0           0         177 
##       sibsp       parch      ticket        fare       cabin    embarked 
##           0           0           0           0           0           0

Quite a lot of data is missing in the age column. I don’t think it’s a good idea either to drop those rows or to replace every missing value with the mean of the whole group, since the 177 missing ages make up almost 20% of the data.
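To illustrate the concern with a single overall mean, here is a small sketch on made-up ages (the vector `toy_age` is hypothetical, not the Titanic data). It suggests that filling every gap with one value keeps the mean unchanged but shrinks the spread, because all imputed points sit at the center of the distribution.

```r
# Toy ages with 2 of 10 values missing (hypothetical data, not from train.csv)
toy_age <- c(22, 38, 26, 35, NA, 54, 2, 27, NA, 14)

# Share of missing values: the same check as colSums(is.na(...)), as a proportion
missing_share <- mean(is.na(toy_age))

# Fill every gap with the single overall mean
overall_mean <- mean(toy_age, na.rm = TRUE)
filled <- ifelse(is.na(toy_age), overall_mean, toy_age)

# The mean is preserved, but the standard deviation drops
sd_observed <- sd(toy_age, na.rm = TRUE)
sd_filled <- sd(filled)
c(observed = sd_observed, filled = sd_filled)
```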

Cross Validation

Splitting the data to train and test

set.seed(400)
index <- initial_split(passenger, prop = .8)

passenger_train <- training(index)
passenger_test <- testing(index)

Fixing Age NA data

Using a single mean value to fill in all 177 missing ages would distort the age distribution by piling those rows onto one value. So instead, we’ll calculate the mean age separately for several groups in order to spread the imputed ages out. The variables we’re using for this are pclass, sex, sibsp, parch, and the title ‘Master’ in the name.
These grouping variables were chosen based on our case analysis, in the hope that the mean age within each group is as close as possible to the true age of the passengers whose age is missing.
But before we do that, let’s check the distribution of our target variable.

prop.table(table(passenger$survived))

## 
##         0         1 
## 0.6161616 0.3838384

prop.table(table(passenger_train$survived))

## 
##         0         1 
## 0.6086957 0.3913043

prop.table(table(passenger_test$survived))

## 
##         0         1 
## 0.6460674 0.3539326
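The train and test proportions above differ slightly from each other. One possible way to keep them closer is a stratified split, which samples within each class separately. The following is a minimal base R sketch on a hypothetical data frame `toy`, not the split used in this report; rsample's `initial_split()` also accepts a `strata` argument for the same purpose.

```r
set.seed(400)

# Toy outcome with a 62/38 class balance (hypothetical data)
toy <- data.frame(id = 1:100, survived = rep(c(0, 1), times = c(62, 38)))

# Sample 80% of the row indices within each class separately
rows_by_class <- split(seq_len(nrow(toy)), toy$survived)
train_rows <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.8 * length(rows)))]
}))

toy_train <- toy[train_rows, ]
toy_test <- toy[-train_rows, ]

# Class proportions in both parts stay close to the original 62/38
prop.table(table(toy_train$survived))
prop.table(table(toy_test$survived))
```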

We bin sibsp and parch (capping both at 2) and create a new master indicator so that the mean age can be calculated per group.

passenger_train <- passenger_train %>% 
  mutate(sibsp = case_when(sibsp >= 2 ~ 2,
                           sibsp == 1 ~ 1,
                           sibsp == 0 ~ 0)) %>% 
  mutate(parch = case_when(parch >= 2 ~ 2,
                           parch == 1 ~ 1,
                           parch == 0 ~ 0)) %>% 
  mutate(master = as.numeric(str_detect(name,"Master")))
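
For non-negative whole counts, the capping above can also be written more compactly. This is a base R sketch on hypothetical vectors (`sibsp_raw` and `toy_names` are made up for illustration); it is meant as an equivalent formulation, not a replacement for the pipeline above.

```r
# Hypothetical sibling/spouse counts
sibsp_raw <- c(0, 1, 2, 3, 5, 8)

# Cap at 2: same result as the case_when() binning for non-negative whole counts
sibsp_binned <- pmin(sibsp_raw, 2)

# Title flag: 1 if the name contains "Master", else 0
# (base R counterpart of str_detect())
toy_names <- c("Doe, Mr. John", "Doe, Master. James")
master_flag <- as.numeric(grepl("Master", toy_names, fixed = TRUE))
```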

Splitting our train data into NA and non-NA.

passenger_train_na <- passenger_train %>% 
  filter(is.na(age))

passenger_train_no_na <- passenger_train %>% 
  filter(!is.na(age))

Calculating the mean passenger age for each combination of the grouping variables, filling the missing ages with it, and merging the non-NA and the fixed NA datasets back together.

passenger_mean_age <- passenger_train_no_na %>% 
  group_by(pclass, sex, parch, sibsp, master) %>% 
  summarise(mean_age = mean(age))


passenger_train_na_2 <- passenger_train_na %>% 
  left_join(passenger_mean_age, by = c("pclass", "sex", "parch", "master", "sibsp")) %>% 
  mutate(age = mean_age) %>% 
  select(-mean_age)

passenger_train_complete <- merge(passenger_train_no_na, passenger_train_na_2, all = TRUE)
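
The group-mean imputation can be sketched on a small hypothetical data frame. Note one assumption that is not part of the code above: the sketch adds a fallback to the overall mean for groups that have no observed age at all, since a plain left join would leave those rows as NA.

```r
# Toy passengers (hypothetical): two groups have observed ages, one has none
toy <- data.frame(
  pclass = c(1, 1, 1, 3, 3, 3, 2),
  sex    = c("female", "female", "female", "male", "male", "male", "male"),
  age    = c(30, 40, NA, 20, 24, NA, NA)
)

# Mean age per (pclass, sex) group; aggregate() drops rows with NA age
group_means <- aggregate(age ~ pclass + sex, data = toy, FUN = mean)
names(group_means)[names(group_means) == "age"] <- "mean_age"

# Left join the group means back onto every row
joined <- merge(toy, group_means, by = c("pclass", "sex"), all.x = TRUE)

# Fill missing ages with the group mean, then fall back to the overall mean
# when the group has no observed age (assumption; here: pclass 2, male)
overall_mean <- mean(toy$age, na.rm = TRUE)
joined$age <- ifelse(is.na(joined$age), joined$mean_age, joined$age)
joined$age[is.na(joined$age)] <- overall_mean
joined$mean_age <- NULL
```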

Checking our target variable proportion. 62 to 38 still looks proportional.

prop.table(table(passenger$survived))

## 
##         0         1 
## 0.6161616 0.3838384

Changing pclass, sibsp, and parch to factors
Removing the name, ticket, and cabin columns

passenger_filter <- passenger_train_complete %>% 
  mutate(pclass = as.factor(pclass),
         sibsp = as.factor(sibsp),
         parch = as.factor(parch)
         ) %>% 
  select(-c(name, ticket, cabin))

Checking split train data proportion

prop.table(table(passenger_filter$survived))

## 
##         0         1 
## 0.6086957 0.3913043

Modelling

Version 1 - All Variables

model_pass_v1 <- glm(formula = survived ~., data = passenger_filter, family = "binomial")
summary(model_pass_v1)

## 
## Call:
## glm(formula = survived ~ ., family = "binomial", data = passenger_filter)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4322  -0.6682  -0.3816   0.5230   2.8295  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.551e+01  6.206e+02   0.025 0.980062    
## passengerid  1.103e-04  4.015e-04   0.275 0.783613    
## pclass2     -6.952e-01  3.475e-01  -2.001 0.045436 *  
## pclass3     -1.989e+00  3.578e-01  -5.558 2.72e-08 ***
## sexmale     -3.259e+00  2.662e-01 -12.239  < 2e-16 ***
## age         -2.397e-02  9.911e-03  -2.419 0.015559 *  
## sibsp1      -1.740e-01  2.639e-01  -0.659 0.509781    
## sibsp2      -1.635e+00  4.221e-01  -3.873 0.000107 ***
## parch1      -2.919e-01  3.496e-01  -0.835 0.403776    
## parch2      -6.591e-01  3.808e-01  -1.731 0.083536 .  
## fare         2.810e-03  2.616e-03   1.074 0.282768    
## embarkedC   -1.170e+01  6.206e+02  -0.019 0.984959    
## embarkedQ   -1.192e+01  6.206e+02  -0.019 0.984678    
## embarkedS   -1.218e+01  6.206e+02  -0.020 0.984338    
## master       3.041e+00  5.637e-01   5.395 6.87e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 954.46  on 712  degrees of freedom
## Residual deviance: 591.86  on 698  degrees of freedom
## AIC: 621.86
## 
## Number of Fisher Scoring iterations: 13

Version 2 - Significance Priority

For this version we prioritize statistical significance, which leaves us with pclass, sex, age, parch, and master as predictor variables.

model_pass_v2 <- glm(formula = survived ~ pclass + sex + age + parch + master, 
                     family = "binomial", data = passenger_filter)
summary(model_pass_v2)

## 
## Call:
## glm(formula = survived ~ pclass + sex + age + parch + master, 
##     family = "binomial", data = passenger_filter)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2244  -0.7078  -0.3860   0.5588   2.4145  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.477738   0.466716   7.452 9.23e-14 ***
## pclass2     -0.900225   0.295931  -3.042  0.00235 ** 
## pclass3     -2.293407   0.296077  -7.746 9.48e-15 ***
## sexmale     -3.186800   0.254635 -12.515  < 2e-16 ***
## age         -0.019038   0.009364  -2.033  0.04204 *  
## parch1      -0.462991   0.322233  -1.437  0.15077    
## parch2      -1.053591   0.332283  -3.171  0.00152 ** 
## master       2.575605   0.504774   5.102 3.35e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 954.46  on 712  degrees of freedom
## Residual deviance: 615.52  on 705  degrees of freedom
## AIC: 631.52
## 
## Number of Fisher Scoring iterations: 5
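
The coefficients above are on the log-odds scale, which is hard to read directly. As a rough sketch, they can be exponentiated into odds ratios and plugged into the logistic function. The coefficient values are copied from the summary above; the example passenger is hypothetical.

```r
# Coefficients copied from the summary(model_pass_v2) output above
coefs <- c(
  intercept = 3.477738, pclass2 = -0.900225, pclass3 = -2.293407,
  sexmale = -3.186800, age = -0.019038,
  parch1 = -0.462991, parch2 = -1.053591, master = 2.575605
)

# Odds ratios: below 1 lowers the odds of survival, above 1 raises them
odds_ratios <- exp(coefs[-1])
round(odds_ratios, 3)

# Predicted survival probability for a hypothetical 22-year-old man in
# third class, travelling without parents or children, not a "Master"
log_odds <- coefs[["intercept"]] + coefs[["pclass3"]] +
  coefs[["sexmale"]] + coefs[["age"]] * 22
prob <- plogis(log_odds)
```

With the fitted object available, `exp(coef(model_pass_v2))` would give the same odds ratios without copying numbers by hand.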

Prediction

Passenger Test Wrangling

The test data gets the same treatment as above, but we reuse the group mean ages calculated from the train data instead of calculating new ones from the test data.

passenger_test <- passenger_test %>% 
  mutate(sibsp = case_when(sibsp >= 2 ~ 2,
                           sibsp == 1 ~ 1,
                           sibsp == 0 ~ 0)) %>% 
  mutate(parch = case_when(parch >= 2 ~ 2,
                           parch == 1 ~ 1,
                           parch == 0 ~ 0)) %>% 
  mutate(master = as.numeric(str_detect(name,"Master")))

passenger_test_na <- passenger_test %>% 
  filter(is.na(age))

passenger_test_no_na <- passenger_test %>% 
  filter(!is.na(age))

passenger_test_na_2 <- passenger_test_na %>% 
  # join on the same five keys that define passenger_mean_age
  left_join(passenger_mean_age, by = c("pclass", "sex", "parch", "sibsp", "master")) %>% 
  mutate(age = mean_age) %>% 
  select(-mean_age) %>% 
  distinct(passengerid, .keep_all = TRUE)

# combine test rows only: the rows with known ages plus the imputed test rows
passenger_test_complete <- merge(passenger_test_no_na, passenger_test_na_2, all = TRUE)


colSums(is.na(passenger_test_complete))

## passengerid    survived      pclass        name         sex         age 
##           0           0           0           0           0           0 
##       sibsp       parch      ticket        fare       cabin    embarked 
##           0           0           0           0           0           0 
##      master 
##           0

passenger_test_final <- passenger_test_complete %>% 
  mutate(pclass = as.factor(pclass),
         sibsp = as.factor(sibsp),
         parch = as.factor(parch),
         survived = as.factor(survived)
         ) %>% 
  select(-c(name, ticket, cabin))

We will be using model version 2 for our prediction because nearly all of its predictors are statistically significant, even though its AIC (631.52) is slightly higher than that of version 1 (621.86).
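
Because version 2 uses a subset of the version 1 predictors, the two models can also be compared with a likelihood-ratio test. The sketch below only uses the residual deviances and degrees of freedom printed in the two summaries above; a small p-value would suggest that the dropped terms jointly carry some information, so choosing version 2 trades a little fit for a simpler model.

```r
# Residual deviances and degrees of freedom copied from the two summaries
dev_v1 <- 591.86; df_v1 <- 698   # version 1, all variables
dev_v2 <- 615.52; df_v2 <- 705   # version 2, reduced model

# Likelihood-ratio statistic and its chi-squared p-value
lr_stat <- dev_v2 - dev_v1
lr_df <- df_v2 - df_v1
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p = p_value)
```

With the fitted objects available, `anova(model_pass_v2, model_pass_v1, test = "Chisq")` should run the same comparison directly.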

passenger_test_final$pred.risk <- predict(object = model_pass_v2, newdata = passenger_test_final, type = "response")

passenger_test_final$pred.label <- ifelse(passenger_test_final$pred.risk > .4, "1", "0")

passenger_test_final$pred.label <- as.factor(passenger_test_final$pred.label)
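
The cutoff of 0.4 decides how predicted probabilities turn into labels. A minimal sketch with hypothetical probabilities shows how a different cutoff changes the labels. The helper `label_at()` is made up for illustration; it also fixes the factor levels so that both classes exist even when every prediction falls on one side.

```r
# Hypothetical predicted probabilities
pred_risk <- c(0.05, 0.39, 0.41, 0.80)

# Label as survivor ("1") when the probability exceeds the cutoff
label_at <- function(p, cutoff) {
  factor(ifelse(p > cutoff, "1", "0"), levels = c("0", "1"))
}

labels_04 <- label_at(pred_risk, 0.4)   # cutoff used in this report
labels_05 <- label_at(pred_risk, 0.5)   # common default cutoff
```

A lower cutoff labels more passengers as survivors, which tends to raise sensitivity at the cost of specificity.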

Confusion Matrix

confusionMatrix(data = passenger_test_final$pred.label,
                reference = passenger_test_final$survived,
                positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 154  23
##          1  26  70
##                                           
##                Accuracy : 0.8205          
##                  95% CI : (0.7697, 0.8642)
##     No Information Rate : 0.6593          
##     P-Value [Acc > NIR] : 2.466e-09       
##                                           
##                   Kappa : 0.6035          
##                                           
##  Mcnemar's Test P-Value : 0.7751          
##                                           
##             Sensitivity : 0.7527          
##             Specificity : 0.8556          
##          Pos Pred Value : 0.7292          
##          Neg Pred Value : 0.8701          
##              Prevalence : 0.3407          
##          Detection Rate : 0.2564          
##    Detection Prevalence : 0.3516          
##       Balanced Accuracy : 0.8041          
##                                           
##        'Positive' Class : 1               
##
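
The headline statistics can be recomputed by hand from the four cell counts, which helps when reading the output. The counts below are copied from the matrix above; the variable names are only for illustration.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference)
tn <- 154; fn <- 23   # predicted 0
fp <- 26;  tp <- 70   # predicted 1

total <- tn + fn + fp + tp
accuracy    <- (tp + tn) / total
sensitivity <- tp / (tp + fn)   # share of actual survivors predicted as survivors
specificity <- tn / (tn + fp)   # share of actual non-survivors predicted as such
precision   <- tp / (tp + fp)   # reported by caret as "Pos Pred Value"

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```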

Conclusion

Our logistic regression model reaches about 82% accuracy according to the confusion matrix above. The statistically significant predictors in model version 2 are pclass, sex, age, parch, and master.
I think we can try other classification methods such as Decision Tree or Random Forest to see if we can improve that number.

Titanic_LogReg_LBB

Deo Ivan Mareza

2/21/2020