Background

Introduction

The data we’re exploring is the Titanic passenger dataset from Kaggle. The goal is to build a logistic regression model that classifies whether a passenger survived or not. We’re not using kNN for this one because most of the predictor variables are categorical, and kNN works best with numerical variables.

Data Setup

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | | S
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q

Data Explanation:
- survived = whether the passenger survived (0 = no, 1 = yes)
- pclass = ticket class
- sex = sex
- age = age in years
- sibsp = number of siblings/spouses aboard
- parch = number of parents/children aboard
- ticket = ticket number
- fare = passenger fare
- cabin = cabin number
- embarked = port of embarkation

Data Wrangling

Check for NA

## passengerid    survived      pclass        name         sex         age 
##           0           0           0           0           0         177 
##       sibsp       parch      ticket        fare       cabin    embarked 
##           0           0           0           0           0           0

Age has 177 missing values, which is almost 20% of the data. Dropping those rows would throw away too much data, and replacing every missing value with the overall mean age would distort the age distribution, so I don’t think either option is a good idea.
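A minimal sketch of how such an NA check can be done in base R. The `passenger` data frame here is a tiny stand-in built from the first rows shown earlier, not the real dataset, and the 891-row total assumes the usual Kaggle training file.

```r
# Toy stand-in for the passenger data (only a few columns and rows).
passenger <- data.frame(
  survived = c(0, 1, 1, 1, 0, 0),
  age      = c(22, 38, 26, 35, 35, NA),
  fare     = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583)
)

# Count the missing values in each column.
na_counts <- colSums(is.na(passenger))
na_counts

# Share of missing ages in the report: 177 rows out of an assumed 891.
missing_share <- 177 / 891
round(missing_share, 3)
```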

Cross Validation

Splitting the data to train and test
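The split itself could look roughly like the sketch below. The seed, the 80/20 ratio, and the names `passenger_train` and `passenger_test` are assumptions for illustration; the toy data frame stands in for the real passenger table.

```r
set.seed(123)  # hypothetical seed, used only to make the split reproducible

# Toy data frame standing in for the full passenger table.
passenger <- data.frame(
  passengerid = 1:20,
  survived    = rep(c(0, 1), times = 10)
)

# Draw 80% of the row indices for training; the remaining rows form the test set.
train_idx       <- sample(nrow(passenger), size = floor(0.8 * nrow(passenger)))
passenger_train <- passenger[train_idx, ]
passenger_test  <- passenger[-train_idx, ]

# Compare the class balance of the two splits.
prop.table(table(passenger_train$survived))
prop.table(table(passenger_test$survived))
```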

Fixing Age NA data

Using a single mean value to fill in all 177 missing ages would pile the imputed values onto one point and distort the age distribution. Instead, we’ll calculate the mean age separately for groups of similar passengers so that the imputed ages are more spread out. The variables we’re using to form the groups are pclass, sex, sibsp, parch, and the title ‘Master’ (used for young boys) in the name.
These variables were chosen on the expectation that the mean age of each group is as close as possible to the true age of the passengers whose age is missing.
But before we do that, let’s check the distribution of our target variable in the full data, the train data, and the test data.

## 
##         0         1 
## 0.6161616 0.3838384
## 
##         0         1 
## 0.6086957 0.3913043
## 
##         0         1 
## 0.6460674 0.3539326

We’re binning some of the variables and creating a new master variable, which flags passengers whose name contains the title ‘Master’, so that we can calculate the group mean ages.
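One way to sketch this step in base R is shown below. The names are partly hypothetical, and capping sibsp and parch at 2 is an assumption based on the factor levels that appear in the model output, not necessarily the exact binning used here.

```r
# Example names; "Doe, Master. John" is hypothetical and only the title matters.
name  <- c("Braund, Mr. Owen Harris", "Doe, Master. John", "Heikkinen, Miss. Laina")
sibsp <- c(1, 4, 0)
parch <- c(0, 2, 5)

# Flag passengers whose name contains the literal title "Master.".
master <- as.integer(grepl("Master.", name, fixed = TRUE))

# Assumed binning: collapse every count of 2 or more into a single level 2.
sibsp_bin <- pmin(sibsp, 2)
parch_bin <- pmin(parch, 2)

data.frame(name, master, sibsp_bin, parch_bin)
```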

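Splitting our train data into rows with a missing age (NA) and rows with a known age (non-NA).

Calculating the mean passenger age for each combination of the grouping variables, filling in the missing ages with it, and merging the non-NA and the fixed NA datasets back together.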
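The split, group mean, and merge steps can be sketched as follows. This is a simplified illustration on toy data with only three grouping variables; the names `has_age`, `no_age`, and `train_fixed` are hypothetical.

```r
# Toy train data with two missing ages.
train <- data.frame(
  pclass = c(1, 1, 3, 3, 3, 3),
  sex    = c("female", "female", "male", "male", "male", "male"),
  age    = c(38, NA, 22, 30, NA, 4),
  master = c(0, 0, 0, 0, 0, 1)
)

# Split into rows with a known age and rows with a missing age.
has_age <- train[!is.na(train$age), ]
no_age  <- train[is.na(train$age), ]

# Mean age for every combination of the grouping variables.
group_means <- aggregate(age ~ pclass + sex + master, data = has_age, FUN = mean)

# Drop the empty age column, then attach the matching group mean.
no_age$age   <- NULL
no_age_fixed <- merge(no_age, group_means,
                      by = c("pclass", "sex", "master"), all.x = TRUE)

# Stack both parts back together with the same column order.
train_fixed <- rbind(has_age, no_age_fixed[, names(has_age)])
rownames(train_fixed) <- NULL
train_fixed
```

A group that has no passenger with a known age would still come out as NA after the merge, so a fallback such as a coarser grouping may be needed on real data.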

Checking our target variable proportion again. The roughly 62 to 38 split between non-survivors and survivors is unchanged.

## 
##         0         1 
## 0.6161616 0.3838384

  • Changing survived, pclass, sibsp, embarked, and parch to factor
  • Removing name, ticket, and cabin column
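A minimal sketch of these two steps in base R, using a two-row toy data frame. The name `passenger_filter` matches the data argument in the model calls, but the code itself is an assumption about how it was built.

```r
# Two-row toy version of the passenger data.
passenger <- data.frame(
  survived = c(0, 1),
  pclass   = c(3, 1),
  name     = c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"),
  age      = c(22, 26),
  sibsp    = c(1, 0),
  parch    = c(0, 0),
  ticket   = c("A/5 21171", "STON/O2. 3101282"),
  cabin    = c("", ""),
  embarked = c("S", "S")
)

# Convert the categorical columns to factors.
factor_cols <- c("survived", "pclass", "sibsp", "embarked", "parch")
passenger[factor_cols] <- lapply(passenger[factor_cols], as.factor)

# Drop the free-text columns that will not be used as predictors.
drop_cols        <- c("name", "ticket", "cabin")
passenger_filter <- passenger[, !(names(passenger) %in% drop_cols)]

str(passenger_filter)
```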

Checking the target variable proportion in the train split

## 
##         0         1 
## 0.6086957 0.3913043

Modelling

Version 1 - All Variables

## 
## Call:
## glm(formula = survived ~ ., family = "binomial", data = passenger_filter)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4322  -0.6682  -0.3816   0.5230   2.8295  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.551e+01  6.206e+02   0.025 0.980062    
## passengerid  1.103e-04  4.015e-04   0.275 0.783613    
## pclass2     -6.952e-01  3.475e-01  -2.001 0.045436 *  
## pclass3     -1.989e+00  3.578e-01  -5.558 2.72e-08 ***
## sexmale     -3.259e+00  2.662e-01 -12.239  < 2e-16 ***
## age         -2.397e-02  9.911e-03  -2.419 0.015559 *  
## sibsp1      -1.740e-01  2.639e-01  -0.659 0.509781    
## sibsp2      -1.635e+00  4.221e-01  -3.873 0.000107 ***
## parch1      -2.919e-01  3.496e-01  -0.835 0.403776    
## parch2      -6.591e-01  3.808e-01  -1.731 0.083536 .  
## fare         2.810e-03  2.616e-03   1.074 0.282768    
## embarkedC   -1.170e+01  6.206e+02  -0.019 0.984959    
## embarkedQ   -1.192e+01  6.206e+02  -0.019 0.984678    
## embarkedS   -1.218e+01  6.206e+02  -0.020 0.984338    
## master       3.041e+00  5.637e-01   5.395 6.87e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 954.46  on 712  degrees of freedom
## Residual deviance: 591.86  on 698  degrees of freedom
## AIC: 621.86
## 
## Number of Fisher Scoring iterations: 13

Version 2 - Significance Priority

We prioritized statistical significance for this one and ended up with pclass, sex, age, parch, and master as predictor variables.

## 
## Call:
## glm(formula = survived ~ pclass + sex + age + parch + master, 
##     family = "binomial", data = passenger_filter)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2244  -0.7078  -0.3860   0.5588   2.4145  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.477738   0.466716   7.452 9.23e-14 ***
## pclass2     -0.900225   0.295931  -3.042  0.00235 ** 
## pclass3     -2.293407   0.296077  -7.746 9.48e-15 ***
## sexmale     -3.186800   0.254635 -12.515  < 2e-16 ***
## age         -0.019038   0.009364  -2.033  0.04204 *  
## parch1      -0.462991   0.322233  -1.437  0.15077    
## parch2      -1.053591   0.332283  -3.171  0.00152 ** 
## master       2.575605   0.504774   5.102 3.35e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 954.46  on 712  degrees of freedom
## Residual deviance: 615.52  on 705  degrees of freedom
## AIC: 631.52
## 
## Number of Fisher Scoring iterations: 5
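The coefficients above are on the log-odds scale, so it can help to exponentiate them into odds ratios. The sketch below simply copies the version 2 estimates by hand into a vector (`coefs` is a hypothetical name) instead of reading them from the fitted model.

```r
# Version 2 estimates copied from the summary output above.
coefs <- c(
  pclass2 = -0.900225,
  pclass3 = -2.293407,
  sexmale = -3.186800,
  age     = -0.019038,
  parch2  = -1.053591,
  master  =  2.575605
)

# exp() turns a log-odds coefficient into an odds ratio.
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

Read this way, the odds of survival for a male passenger are roughly 4% of those of an otherwise identical female passenger, while carrying the title ‘Master’ multiplies the odds by about 13.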

Prediction

Passenger Test Wrangling

The test data gets the same treatment as above, but its missing ages are filled in with the group mean ages calculated from the train data instead of new means calculated from the test data.

## passengerid    survived      pclass        name         sex         age 
##           0           0           0           0           0           0 
##       sibsp       parch      ticket        fare       cabin    embarked 
##           0           0           0           0           0           0 
##      master 
##           0

We will be using model version 2 for our prediction because almost all of its coefficients are statistically significant, even though its AIC (631.52) is slightly higher than that of version 1 (621.86).
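The prediction step can be sketched as below on simulated data. Everything here is an assumption for illustration: the toy predictors, the simulated survival, and especially the 0.5 cut-off used to turn probabilities into classes.

```r
set.seed(42)  # hypothetical seed for the simulated data
n <- 200

# Simulated stand-in for the passenger data.
toy <- data.frame(
  sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  age = round(runif(n, 1, 70))
)

# Simulate survival with lower odds for males and for older passengers.
log_odds     <- 1.5 - 2.5 * (toy$sex == "male") - 0.02 * toy$age
toy$survived <- factor(rbinom(n, 1, plogis(log_odds)))

toy_train <- toy[1:150, ]
toy_test  <- toy[151:200, ]

# Fit the logistic regression on the train rows only.
model <- glm(survived ~ sex + age, family = "binomial", data = toy_train)

# Predicted survival probabilities for the unseen test rows.
prob <- predict(model, newdata = toy_test, type = "response")

# Assumed cut-off: classify as survived when the probability exceeds 0.5.
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))

cm_toy <- table(Prediction = pred, Reference = toy_test$survived)
cm_toy
```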

Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 154  23
##          1  26  70
##                                           
##                Accuracy : 0.8205          
##                  95% CI : (0.7697, 0.8642)
##     No Information Rate : 0.6593          
##     P-Value [Acc > NIR] : 2.466e-09       
##                                           
##                   Kappa : 0.6035          
##                                           
##  Mcnemar's Test P-Value : 0.7751          
##                                           
##             Sensitivity : 0.7527          
##             Specificity : 0.8556          
##          Pos Pred Value : 0.7292          
##          Neg Pred Value : 0.8701          
##              Prevalence : 0.3407          
##          Detection Rate : 0.2564          
##    Detection Prevalence : 0.3516          
##       Balanced Accuracy : 0.8041          
##                                           
##        'Positive' Class : 1               
## 
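The headline statistics can be recomputed by hand from the four counts in the confusion matrix. This base R sketch treats 1 (survived) as the positive class, as in the output above.

```r
# Counts copied from the confusion matrix above (filled column by column).
cm <- matrix(
  c(154, 26, 23, 70), nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

tp <- cm["1", "1"]  # predicted survived, actually survived
tn <- cm["0", "0"]  # predicted not survived, actually not survived
fp <- cm["1", "0"]  # predicted survived, actually not survived
fn <- cm["0", "1"]  # predicted not survived, actually survived

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)  # share of actual survivors that were found
specificity <- tn / (tn + fp)  # share of actual non-survivors that were found
precision   <- tp / (tp + fp)  # reported as Pos Pred Value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```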

Conclusion

Our logistic regression model reaches 82% accuracy on the test data. The statistically significant predictors in the model are pclass, sex, age, parch, and master.
I think we can try other classification methods, such as a decision tree or a random forest, to see if we can improve that number.