The data we’re exploring is the Titanic passenger dataset from Kaggle. The goal is to build a logistic regression model that classifies whether a passenger survived or not. We’re not using kNN for this one, because most of the predictor variables are categorical, and kNN works better with numerical variables.
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q |
Data explanation:
- survived = Whether the passenger survived (1) or not (0)
- pclass = Ticket class
- sex = Sex
- age = Age in years
- sibsp = # of siblings/spouses aboard
- parch = # of parents/children aboard
- ticket = Ticket number
- fare = Passenger fare
- cabin = Cabin number
- embarked = Port of embarkation
Check for NA
## passengerid survived pclass name sex age
## 0 0 0 0 0 177
## sibsp parch ticket fare cabin embarked
## 0 0 0 0 0 0
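Counts like the ones above can be produced with `colSums(is.na(...))`, the same pattern used later in this post for the test set. Here is a minimal sketch on a made-up data frame (the `toy` data and its values are assumptions for illustration, not the Titanic data):

```r
# Toy data frame (made-up values) with a few missing entries
toy <- data.frame(
  age  = c(22, NA, 26, NA),
  fare = c(7.25, 71.28, NA, 8.05)
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Share of missing values in each column
na_share <- na_counts / nrow(toy)
na_share
```

Dividing by `nrow()` gives the share of missing values per column, which is how a figure like "almost 20%" can be checked.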
Quite a lot of age values are missing. I don’t think it’s a good idea either to drop those rows or to replace every missing value with the mean of the whole group, since the missing ages make up almost 20% of the total data.
Splitting the data into train and test sets
library(rsample)  # initial_split(), training(), testing()
set.seed(400)
index <- initial_split(passenger, prop = .8)
passenger_train <- training(index)
passenger_test <- testing(index)
Filling all 177 missing ages with a single mean value would distort the age distribution, because every imputed passenger would get exactly the same age. Instead, we compute the mean age within groups of passengers so that the imputed ages are more spread out. The grouping variables are pclass, sex, sibsp, parch, and whether the name contains the title ‘Master’.
The grouping variables were chosen based on our analysis of the data, in the hope that the mean age of each group is as close as possible to the true age of the passengers whose age is missing.
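To see why a single mean is a problem, here is a small sketch comparing the two approaches on made-up ages. The `toy` data, the `group` column, and the use of base R’s `ave()` are assumptions for illustration; the post itself uses a dplyr join for the real data.

```r
# Toy ages (made-up values, not from the Titanic data)
toy <- data.frame(
  group = c("child", "child", "child", "adult", "adult", "adult"),
  age   = c(4, 6, NA, 30, NA, 40)
)

# Option 1: fill every NA with the single overall mean
overall_mean <- mean(toy$age, na.rm = TRUE)
age_single <- ifelse(is.na(toy$age), overall_mean, toy$age)

# Option 2: fill each NA with the mean of its own group
group_mean <- ave(toy$age, toy$group, FUN = function(x) mean(x, na.rm = TRUE))
age_grouped <- ifelse(is.na(toy$age), group_mean, toy$age)

data.frame(toy, age_single, age_grouped)
```

With the overall mean, the missing child would be given the same age as the missing adult, while the group means keep the two apart.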
But before we do that, let’s check the distribution of our target variable in the full data, the train data, and the test data.
##
## 0 1
## 0.6161616 0.3838384
##
## 0 1
## 0.6086957 0.3913043
##
## 0 1
## 0.6460674 0.3539326
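The code behind these proportion tables is not shown in the post. One common way to get this kind of output is `prop.table(table(...))`; the sketch below uses a made-up `survived` vector, so treat it as an assumption about how such tables can be produced rather than the exact call used here.

```r
# Made-up target vector: 3 passengers who died (0) and 2 who survived (1)
survived <- c(0, 0, 0, 1, 1)

# table() counts each class, prop.table() turns the counts into shares
class_share <- prop.table(table(survived))
class_share
```

Running the same call on the full data, the train set, and the test set makes it easy to see whether the split kept the class balance roughly intact.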
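We’re binning sibsp and parch into three levels (0, 1, and 2 or more) and creating a new master variable in order to compute the group mean ages.
library(dplyr)    # mutate(), case_when(), filter(), group_by(), left_join()
library(stringr)  # str_detect()
passenger_train <- passenger_train %>%
  mutate(sibsp = case_when(sibsp >= 2 ~ 2,
                           sibsp == 1 ~ 1,
                           sibsp == 0 ~ 0)) %>%
  mutate(parch = case_when(parch >= 2 ~ 2,
                           parch == 1 ~ 1,
                           parch == 0 ~ 0)) %>%
  # master is 1 when the name contains the title "Master", otherwise 0
  mutate(master = as.numeric(str_detect(name, "Master")))
Splitting our train data into rows with and without a missing age.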
passenger_train_na <- passenger_train %>%
  filter(is.na(age))
passenger_train_no_na <- passenger_train %>%
  filter(!is.na(age))
Calculating the mean age for each combination of the grouping variables, filling in the missing ages, and merging the two subsets back together.
# mean age per combination of the grouping variables, from rows with a known age
passenger_mean_age <- passenger_train_no_na %>%
  group_by(pclass, sex, parch, sibsp, master) %>%
  summarise(mean_age = mean(age))
# look up the group mean for every row with a missing age
passenger_train_na_2 <- passenger_train_na %>%
  left_join(passenger_mean_age, by = c("pclass", "sex", "parch", "master", "sibsp")) %>%
  mutate(age = mean_age) %>%
  select(-mean_age)
passenger_train_complete <- merge(passenger_train_no_na, passenger_train_na_2, all = TRUE)
Checking our target variable proportion. Roughly 62% to 38% still looks reasonably balanced.
##
## 0 1
## 0.6161616 0.3838384
passenger_filter <- passenger_train_complete %>%
  mutate(pclass = as.factor(pclass),
         sibsp = as.factor(sibsp),
         parch = as.factor(parch)) %>%
  select(-c(name, ticket, cabin))
Checking the train data proportion.
##
## 0 1
## 0.6086957 0.3913043
model_pass_v1 <- glm(formula = survived ~ ., data = passenger_filter, family = "binomial")
summary(model_pass_v1)
##
## Call:
## glm(formula = survived ~ ., family = "binomial", data = passenger_filter)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4322 -0.6682 -0.3816 0.5230 2.8295
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.551e+01 6.206e+02 0.025 0.980062
## passengerid 1.103e-04 4.015e-04 0.275 0.783613
## pclass2 -6.952e-01 3.475e-01 -2.001 0.045436 *
## pclass3 -1.989e+00 3.578e-01 -5.558 2.72e-08 ***
## sexmale -3.259e+00 2.662e-01 -12.239 < 2e-16 ***
## age -2.397e-02 9.911e-03 -2.419 0.015559 *
## sibsp1 -1.740e-01 2.639e-01 -0.659 0.509781
## sibsp2 -1.635e+00 4.221e-01 -3.873 0.000107 ***
## parch1 -2.919e-01 3.496e-01 -0.835 0.403776
## parch2 -6.591e-01 3.808e-01 -1.731 0.083536 .
## fare 2.810e-03 2.616e-03 1.074 0.282768
## embarkedC -1.170e+01 6.206e+02 -0.019 0.984959
## embarkedQ -1.192e+01 6.206e+02 -0.019 0.984678
## embarkedS -1.218e+01 6.206e+02 -0.020 0.984338
## master 3.041e+00 5.637e-01 5.395 6.87e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 954.46 on 712 degrees of freedom
## Residual deviance: 591.86 on 698 degrees of freedom
## AIC: 621.86
##
## Number of Fisher Scoring iterations: 13
We prioritized statistical significance for this one and ended up with pclass, sex, age, parch, and master as predictor variables.
model_pass_v2 <- glm(formula = survived ~ pclass + sex + age + parch + master,
                     family = "binomial", data = passenger_filter)
summary(model_pass_v2)
##
## Call:
## glm(formula = survived ~ pclass + sex + age + parch + master,
## family = "binomial", data = passenger_filter)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2244 -0.7078 -0.3860 0.5588 2.4145
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.477738 0.466716 7.452 9.23e-14 ***
## pclass2 -0.900225 0.295931 -3.042 0.00235 **
## pclass3 -2.293407 0.296077 -7.746 9.48e-15 ***
## sexmale -3.186800 0.254635 -12.515 < 2e-16 ***
## age -0.019038 0.009364 -2.033 0.04204 *
## parch1 -0.462991 0.322233 -1.437 0.15077
## parch2 -1.053591 0.332283 -3.171 0.00152 **
## master 2.575605 0.504774 5.102 3.35e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 954.46 on 712 degrees of freedom
## Residual deviance: 615.52 on 705 degrees of freedom
## AIC: 631.52
##
## Number of Fisher Scoring iterations: 5
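The coefficients above are on the log-odds scale, which is hard to read directly. As a rough sketch, we can exponentiate them to get odds ratios, and we can also reproduce the reported AIC values by hand. The numbers below are copied from the two summaries in this post; the formula AIC = residual deviance + 2 × (number of coefficients) is the usual one for a binary-response logistic regression.

```r
# Coefficients copied from the model_pass_v2 summary (log-odds scale)
coefs <- c(pclass2 = -0.900225, pclass3 = -2.293407, sexmale = -3.186800,
           age = -0.019038, parch1 = -0.462991, parch2 = -1.053591,
           master = 2.575605)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratios <- exp(coefs)
round(odds_ratios, 3)

# AIC = residual deviance + 2 * number of estimated coefficients
aic_v1 <- 591.86 + 2 * 15  # model 1: 15 coefficients including the intercept
aic_v2 <- 615.52 + 2 * 8   # model 2: 8 coefficients including the intercept
c(v1 = aic_v1, v2 = aic_v2)
```

The odds ratio for sexmale comes out at about 0.04, meaning that with the other predictors held fixed, the odds of survival for a male passenger are about 4% of those for a female passenger. The AIC values match the summaries (621.86 and 631.52), so the simpler model pays a small price in fit.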
We apply the same treatment to the test data, but we fill the missing ages with the group mean ages calculated from the train data instead of calculating new mean ages from the test data.
passenger_test <- passenger_test %>%
  mutate(sibsp = case_when(sibsp >= 2 ~ 2,
                           sibsp == 1 ~ 1,
                           sibsp == 0 ~ 0)) %>%
  mutate(parch = case_when(parch >= 2 ~ 2,
                           parch == 1 ~ 1,
                           parch == 0 ~ 0)) %>%
  mutate(master = as.numeric(str_detect(name, "Master")))
passenger_test_na <- passenger_test %>%
  filter(is.na(age))
passenger_test_no_na <- passenger_test %>%
  filter(!is.na(age))
passenger_test_na_2 <- passenger_test_na %>%
  # join on the same five keys that were used to build passenger_mean_age
  left_join(passenger_mean_age, by = c("pclass", "sex", "parch", "sibsp", "master")) %>%
  mutate(age = mean_age) %>%
  select(-mean_age) %>%
  distinct(passengerid, .keep_all = TRUE)
# merge the imputed test rows (passenger_test_na_2) so that no training rows enter the test set
passenger_test_complete <- merge(passenger_test_no_na, passenger_test_na_2, all = TRUE)
colSums(is.na(passenger_test_complete))
## passengerid survived pclass name sex age
## 0 0 0 0 0 0
## sibsp parch ticket fare cabin embarked
## 0 0 0 0 0 0
## master
## 0
passenger_test_final <- passenger_test_complete %>%
  mutate(pclass = as.factor(pclass),
         sibsp = as.factor(sibsp),
         parch = as.factor(parch),
         survived = as.factor(survived)) %>%
  select(-c(name, ticket, cabin))
We will use model version 2 for our prediction, because almost all of its terms are statistically significant (only parch1 is not), even though its AIC (631.52) is slightly higher than that of version 1 (621.86).
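The step that creates the pred.label column is not shown in this post. A common way to do it is to predict probabilities with `predict(..., type = "response")` and then apply a cutoff. The sketch below does this on made-up data; the `toy_*` names and the 0.5 cutoff are assumptions, since the post does not say which cutoff was used.

```r
# Toy data (made up for illustration, not the Titanic data)
toy_train <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                        y = c(0, 0, 1, 0, 1, 0, 1, 1))
toy_test <- data.frame(x = c(1.5, 4.5, 7.5),
                       y = factor(c(0, 1, 1), levels = c(0, 1)))

toy_model <- glm(y ~ x, data = toy_train, family = "binomial")

# type = "response" returns predicted probabilities instead of log-odds
toy_test$pred.prob <- predict(toy_model, newdata = toy_test, type = "response")

# 0.5 is an assumed cutoff; probabilities above it are labelled as class 1
toy_test$pred.label <- factor(ifelse(toy_test$pred.prob > 0.5, 1, 0),
                              levels = c(0, 1))

# Cross-tabulate predicted labels against the true labels
table(Prediction = toy_test$pred.label, Reference = toy_test$y)
```

Giving pred.label the same factor levels as the reference column matters, because `confusionMatrix()` expects both factors to share their levels.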
# confusionMatrix() comes from the caret package
confusionMatrix(data = passenger_test_final$pred.label,
                reference = passenger_test_final$survived,
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 154 23
## 1 26 70
##
## Accuracy : 0.8205
## 95% CI : (0.7697, 0.8642)
## No Information Rate : 0.6593
## P-Value [Acc > NIR] : 2.466e-09
##
## Kappa : 0.6035
##
## Mcnemar's Test P-Value : 0.7751
##
## Sensitivity : 0.7527
## Specificity : 0.8556
## Pos Pred Value : 0.7292
## Neg Pred Value : 0.8701
## Prevalence : 0.3407
## Detection Rate : 0.2564
## Detection Prevalence : 0.3516
## Balanced Accuracy : 0.8041
##
## 'Positive' Class : 1
##
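The main statistics in this output can be checked by hand from the four cell counts. Here is a short sketch using the counts from the confusion matrix above (the variable names `tp`, `tn`, `fp`, and `fn` are just labels picked for this example):

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference, positive class = "1")
tn <- 154  # predicted 0, actually 0
fn <- 23   # predicted 0, actually 1
fp <- 26   # predicted 1, actually 0
tp <- 70   # predicted 1, actually 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # share of actual survivors that were found
specificity <- tn / (tn + fp)  # share of actual non-survivors that were found
ppv         <- tp / (tp + fp)  # Pos Pred Value
npv         <- tn / (tn + fn)  # Neg Pred Value

metrics <- round(c(accuracy = accuracy, sensitivity = sensitivity,
                   specificity = specificity, ppv = ppv, npv = npv), 4)
metrics
```

These reproduce the Accuracy, Sensitivity, Specificity, Pos Pred Value, and Neg Pred Value lines of the output.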
Our current logistic regression model reaches about 82% accuracy (224 of the 273 rows in the confusion matrix above are classified correctly). The predictors in the model that are statistically significant are pclass, sex, age, parch, and master.
I think we can try other classification methods, such as a decision tree or a random forest, to see if we can improve that number.