LBB Logistic Regression & k-NN using Titanic Kaggle Train Dataset
Since the train and test data didn’t have the same variables, I decided to use only train data for Machine Learning Classification LBB. For the Question, I want to know which variables contributes significantly to make passenger survived.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ stringr 1.4.0
## ✓ tidyr 1.0.2 ✓ forcats 0.4.0
## ✓ readr 1.3.1
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
titanic <- read.csv("/Users/dinnah/Documents/ALGORITMA Machine Learning/LBB Machine Learning/titanic/train.csv")
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Information: PassengerId: Passenger’s Identification Number
Survived: Passenger’s survival, 1 = Yes, 0 = No
Pclass: Socio Economy Status, 1st = Higher, 2nd = Middle, 3rd = Lower
Name: Passenger’s name
Sex: Passenger’s gender
Age: Passenger’s age
SibSp: sibsp: The dataset defines family relations in this way
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
Parch: parch: The dataset defines family relations in this way
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them
Ticket: Passenger’s ticket ID
Fare: The fare of the ticket
Cabin: Cabin number
Embarked: Port of emabarkation. C = Cherbourg, Q = Queenstown, S = Southampton
# Data Wrangling
titanic_fin <- titanic %>%
dplyr::select(-c(PassengerId, Name, Cabin, Ticket)) %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass)) %>%
mutate(Age = replace_na(Age, replace = mean(Age, na.rm = T)))
str(titanic_fin)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
colSums(is.na(titanic_fin))
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 0 0 0 0 0
Since the PassengerID, Name, Cabin, and Ticket has unique levels and aren’t meaningful, we take out those variables. The Age data represents only around 81% of train data so we fill the missing value with the mean data.
Let see if the proportion of eah class is considered balance for our Classification Model
prop.table(table(titanic_fin$Survived))
##
## 0 1
## 0.6161616 0.3838384
I consider it is balance.
Splitting the data
set.seed(777)
intrain <- sample(nrow(titanic_fin), nrow(titanic_fin)*0.8)
titanic_train <- titanic_fin[intrain,]
titanic_test <- titanic_fin[-intrain,]
table(titanic_train$Survived)
##
## 0 1
## 443 269
titanic_m_all <- glm(Survived~., titanic_train, family = "binomial")
summary(titanic_m_all)
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3771 -0.5866 -0.4158 0.6094 2.4470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 14.931596 535.411331 0.028 0.97775
## Pclass2 -1.067305 0.340786 -3.132 0.00174 **
## Pclass3 -2.181663 0.340487 -6.407 0.000000000148 ***
## Sexmale -2.740152 0.225756 -12.138 < 0.0000000000000002 ***
## Age -0.042590 0.009052 -4.705 0.000002536299 ***
## SibSp -0.354001 0.122614 -2.887 0.00389 **
## Parch -0.083305 0.132741 -0.628 0.53028
## Fare 0.003161 0.002766 1.143 0.25313
## EmbarkedC -10.953789 535.411266 -0.020 0.98368
## EmbarkedQ -10.822691 535.411350 -0.020 0.98387
## EmbarkedS -11.061172 535.411257 -0.021 0.98352
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 944.09 on 711 degrees of freedom
## Residual deviance: 617.87 on 701 degrees of freedom
## AIC: 639.87
##
## Number of Fisher Scoring iterations: 12
From this summary when we create model with all variables, the Parch factor as well as Fare and Embarkation are not significant. Let us see with Feature Engineering with Stepwise.
step(titanic_m_all, direction = "backward")
## Start: AIC=639.87
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
## Df Deviance AIC
## - Embarked 3 618.52 634.52
## - Parch 1 618.27 638.27
## - Fare 1 619.34 639.34
## <none> 617.87 639.87
## - SibSp 1 627.95 647.95
## - Age 1 642.11 662.11
## - Pclass 2 663.39 681.39
## - Sex 1 803.56 823.56
##
## Step: AIC=634.52
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
##
## Df Deviance AIC
## - Parch 1 619.05 633.05
## - Fare 1 620.21 634.21
## <none> 618.52 634.52
## - SibSp 1 629.32 643.32
## - Age 1 642.99 656.99
## - Pclass 2 665.53 677.53
## - Sex 1 812.87 826.87
##
## Step: AIC=633.05
## Survived ~ Pclass + Sex + Age + SibSp + Fare
##
## Df Deviance AIC
## - Fare 1 620.43 632.43
## <none> 619.05 633.05
## - SibSp 1 632.89 644.89
## - Age 1 643.38 655.38
## - Pclass 2 668.62 678.62
## - Sex 1 815.52 827.52
##
## Step: AIC=632.43
## Survived ~ Pclass + Sex + Age + SibSp
##
## Df Deviance AIC
## <none> 620.43 632.43
## - SibSp 1 633.16 643.16
## - Age 1 645.99 655.99
## - Pclass 2 708.40 716.40
## - Sex 1 821.16 831.16
##
## Call: glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
## data = titanic_train)
##
## Coefficients:
## (Intercept) Pclass2 Pclass3 Sexmale Age SibSp
## 4.17317 -1.28645 -2.39060 -2.75570 -0.04331 -0.37014
##
## Degrees of Freedom: 711 Total (i.e. Null); 706 Residual
## Null Deviance: 944.1
## Residual Deviance: 620.4 AIC: 632.4
using step backwards, the optimum AIC (this means, the smallest value of AIC), is using the variables: Pclass, Sex, Age, and SibSp.
titanic_m_back <- glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
data = titanic_train)
summary(titanic_m_back)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial",
## data = titanic_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3597 -0.5797 -0.4058 0.6045 2.4692
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.173168 0.459195 9.088 < 0.0000000000000002 ***
## Pclass2 -1.286446 0.296515 -4.339 0.00001434 ***
## Pclass3 -2.390600 0.277284 -8.621 < 0.0000000000000002 ***
## Sexmale -2.755700 0.218720 -12.599 < 0.0000000000000002 ***
## Age -0.043313 0.008982 -4.822 0.00000142 ***
## SibSp -0.370135 0.116381 -3.180 0.00147 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 944.09 on 711 degrees of freedom
## Residual deviance: 620.43 on 706 degrees of freedom
## AIC: 632.43
##
## Number of Fisher Scoring iterations: 5
titanic_m_back2 <- stepAIC(titanic_m_all, direction = "backward")
## Start: AIC=639.87
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
## Df Deviance AIC
## - Embarked 3 618.52 634.52
## - Parch 1 618.27 638.27
## - Fare 1 619.34 639.34
## <none> 617.87 639.87
## - SibSp 1 627.95 647.95
## - Age 1 642.11 662.11
## - Pclass 2 663.39 681.39
## - Sex 1 803.56 823.56
##
## Step: AIC=634.52
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
##
## Df Deviance AIC
## - Parch 1 619.05 633.05
## - Fare 1 620.21 634.21
## <none> 618.52 634.52
## - SibSp 1 629.32 643.32
## - Age 1 642.99 656.99
## - Pclass 2 665.53 677.53
## - Sex 1 812.87 826.87
##
## Step: AIC=633.05
## Survived ~ Pclass + Sex + Age + SibSp + Fare
##
## Df Deviance AIC
## - Fare 1 620.43 632.43
## <none> 619.05 633.05
## - SibSp 1 632.89 644.89
## - Age 1 643.38 655.38
## - Pclass 2 668.62 678.62
## - Sex 1 815.52 827.52
##
## Step: AIC=632.43
## Survived ~ Pclass + Sex + Age + SibSp
##
## Df Deviance AIC
## <none> 620.43 632.43
## - SibSp 1 633.16 643.16
## - Age 1 645.99 655.99
## - Pclass 2 708.40 716.40
## - Sex 1 821.16 831.16
using stepAIC, the optimum AIC (this means, the smallest value of AIC), is using the variables: Pclass, Sex, Age, and SibSp. Same value with the step backwards, so next we only use the step backwards.
titanic_predict <- predict(titanic_m_back, titanic_test)
class(titanic_predict)
## [1] "numeric"
titanic_test$titanic_prob <- predict(titanic_m_back, titanic_test, type = "response")
ggplot(titanic_test, aes(x=titanic_prob))+geom_density(lwd=0.5)+theme_minimal()
on this plot, the probability on our test data is skewed to 0, means our data distribute to Not Survived.
titanic_test$titanic_predict <- factor(ifelse(titanic_test$titanic_prob > 0.5, "1", "0"))
head(titanic_test[1:10, c("titanic_predict", "Survived")])
## titanic_predict Survived
## 3 1 1
## 4 1 1
## 20 1 1
## 28 0 0
## 30 0 0
## 40 1 1
confusionMatrix(titanic_test$titanic_predict, titanic_test$Survived, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85 21
## 1 21 52
##
## Accuracy : 0.7654
## 95% CI : (0.6964, 0.8254)
## No Information Rate : 0.5922
## P-Value [Acc > NIR] : 0.0000008154
##
## Kappa : 0.5142
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7123
## Specificity : 0.8019
## Pos Pred Value : 0.7123
## Neg Pred Value : 0.8019
## Prevalence : 0.4078
## Detection Rate : 0.2905
## Detection Prevalence : 0.4078
## Balanced Accuracy : 0.7571
##
## 'Positive' Class : 1
##
Model Interpretation: the ability of the model predicting the Y target is correct, (Accuracy of the model to predict ‘Survived’ and ‘Not Survived’) is at 76.5%. The ability of the model predicting the Actual Positive data (Sensitivity) is correct, is at 71.2%. And the ability of the model predicitng the Actual Negative data (Specificity) is at 80.2%.
titanicnum <- titanic %>%
dplyr::select(-c(PassengerId, Name, Cabin, Ticket)) %>%
mutate(Sex = as.numeric(Sex),
Embarked = as.numeric(Embarked)) %>%
mutate(Age = replace_na(Age, replace = mean(Age, na.rm = T)))
# assigning label for target
titanicnum$Survived <- factor(titanicnum$Survived, levels = c("0", "1"), labels = c("NotSurvived", "Survived"))
str(titanicnum)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "NotSurvived",..: 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : num 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: num 4 2 4 4 4 3 4 4 4 2 ...
# target class proportion
prop.table(table(titanicnum$Survived))
##
## NotSurvived Survived
## 0.6161616 0.3838384
summary(titanicnum)
## Survived Pclass Sex Age
## NotSurvived:549 Min. :1.000 Min. :1.000 Min. : 0.42
## Survived :342 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:22.00
## Median :3.000 Median :2.000 Median :29.70
## Mean :2.309 Mean :1.648 Mean :29.70
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:35.00
## Max. :3.000 Max. :2.000 Max. :80.00
## SibSp Parch Fare Embarked
## Min. :0.000 Min. :0.0000 Min. : 0.00 Min. :1.00
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 7.91 1st Qu.:3.00
## Median :0.000 Median :0.0000 Median : 14.45 Median :4.00
## Mean :0.523 Mean :0.3816 Mean : 32.20 Mean :3.53
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.: 31.00 3rd Qu.:4.00
## Max. :8.000 Max. :6.0000 Max. :512.33 Max. :4.00
colSums(is.na(titanicnum))
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 0 0 0 0 0
set.seed(249)
splitting <- initial_split(data = titanicnum, prop = 0.75, strata = "Survived")
titanic_train <- training(splitting)
titanic_test <- testing(splitting)
prop.table(table(titanic_train$Survived))
##
## NotSurvived Survived
## 0.6158445 0.3841555
# for predictor data
titanic_train_p <- scale(titanic_train[,-1])
titanic_test_p <- scale(titanic_test[,-1],
center = attr(titanic_train_p, "scaled:center"),
scale = attr(titanic_train_p, "scaled:scale"))
# for label (target) data
titanic_train_l <- titanic_train[,1]
titanic_test_l <- titanic_test[,1]
# find optimum k
round(sqrt(nrow(titanic_train)),0)
## [1] 26
since the result is an even number ‘26’, we will use ‘27’ as the k value.
titanic_knn_predict <- knn(train = titanic_train_p, test = titanic_test_p, cl = titanic_train_l, k = 27)
head(titanic_knn_predict)
## [1] NotSurvived Survived NotSurvived Survived NotSurvived NotSurvived
## Levels: NotSurvived Survived
class(titanic_knn_predict)
## [1] "factor"
confusionMatrix(data = titanic_knn_predict, reference = titanic_test_l, positive = "Survived")
## Confusion Matrix and Statistics
##
## Reference
## Prediction NotSurvived Survived
## NotSurvived 129 33
## Survived 8 52
##
## Accuracy : 0.8153
## 95% CI : (0.7579, 0.8641)
## No Information Rate : 0.6171
## P-Value [Acc > NIR] : 0.0000000001314
##
## Kappa : 0.5861
##
## Mcnemar's Test P-Value : 0.0001781
##
## Sensitivity : 0.6118
## Specificity : 0.9416
## Pos Pred Value : 0.8667
## Neg Pred Value : 0.7963
## Prevalence : 0.3829
## Detection Rate : 0.2342
## Detection Prevalence : 0.2703
## Balanced Accuracy : 0.7767
##
## 'Positive' Class : Survived
##
Model Interpretation: the ability of the variable data predicting the Y target is correct, (Accuracy of the model to predict ‘Survived’ and ‘Not Survived’) is at 80.2%. The ability of the model predicting the Actual Positive data (Sensitivity) is correct, is at 60.0%. And the ability of the model predicitng the Actual Negative data (Specificity) is at 97.7%.
From the two method of Logistic Regression and k-Nearest Neighbor, the ability to predict the Actual Positive (True Positive) of Survived person in Titanic Cruise, is with Logistic Regression with predictors: Pclass, Sex, Age, and SibSp. It is better because it has the Recall/Sensitivity value at 71.2%, higher than the Recall/Sensitivity from k-NN at 60%.