Learning to use a logistic regression model on the “Titanic - Machine Learning from Disaster” dataset from Kaggle. We want to use the data to build a model that predicts which passengers survived the Titanic shipwreck.
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 of the 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. Thus, we would like to predict what sorts of people were more likely to survive.
There are two datasets: train.csv and test.csv. train.csv contains the details of a subset of the passengers on board (891, to be exact) and, importantly, reveals whether each of them survived, also known as the “ground truth”. test.csv contains similar information but does not disclose the ground truth for each passenger.
Load the required packages.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gtools)
library(ggplot2)
library(caret)
## Loading required package: lattice
library(class)
Load the dataset.
titanic <- read.csv("train.csv")
head(titanic)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
The titanic data has 891 rows and 12 columns, described below:

- PassengerId: passenger ID
- Survived: survival (0 = No, 1 = Yes)
- Pclass: ticket class (1 = 1st (upper), 2 = 2nd (middle), 3 = 3rd (lower))
- Sex: sex
- Age: age in years
- SibSp: number of siblings/spouses aboard the Titanic
- Parch: number of parents/children aboard the Titanic (parent = mother/father, child = daughter/son/stepdaughter/stepson; children who travelled with a nanny have Parch = 0)
- Ticket: ticket number
- Fare: passenger fare
- Cabin: cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Based on the data, the target variable is Survived (1 = survived, 0 = did not survive).
First, we drop the columns that are unique identifiers or that we will not use as predictors: PassengerId, Name, Ticket, Cabin, and Embarked. We also need to change each remaining column to its correct data type.
titanic <- titanic %>%
  select(-PassengerId, -Name, -Ticket, -Cabin, -Embarked) %>%
  mutate(Survived = as.factor(Survived))
Next, we check whether there are any missing values.
colSums(is.na(titanic))
## Survived Pclass Sex Age SibSp Parch Fare
## 0 0 0 177 0 0 0
The Age column has 177 missing values; the other columns are complete. We leave the missing ages untreated for now: glm() drops those rows automatically during modelling (note the “observations deleted due to missingness” in the model summaries below), and we remove them explicitly before fitting KNN.
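If we wanted to keep those rows instead, a minimal sketch of median imputation (not applied in this analysis, so all results below use the original data) could look like this:

# sketch only: fill missing Age with the median age so no rows are lost
titanic_imputed <- titanic %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age))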
For the Sex column, we recode the categories into a factor, with 0 = male and 1 = female.
titanic <- titanic %>%
mutate(Sex = recode_factor(Sex, "male" = 0, "female" = 1))
We can also bin the Age values into categories.
titanic <- titanic %>%
  mutate(Age = case_when(
    # note: ages below 1 year or between bin edges (e.g. 24.5) become NA
    between(Age, 1, 24) ~ "< 25 years",
    between(Age, 25, 34) ~ "25-34 years",
    between(Age, 35, 44) ~ "35-44 years",
    between(Age, 45, 54) ~ "45-54 years",
    between(Age, 55, 64) ~ "55-64 years",
    between(Age, 65, 84) ~ ">65 years"
  ))
Pclass indicates the ticket class, an ordered category, so it can be changed into a factor. We do the same for SibSp.
titanic <- titanic %>%
mutate(Pclass = as.factor(Pclass),
SibSp = as.factor(SibSp))
prop.table(table(titanic$Survived))
##
## 0 1
## 0.6161616 0.3838384
table(titanic$Survived)
##
## 0 1
## 549 342
Based on the data, the proportion of not survived to survived is roughly 60:40, which is still reasonably balanced. Thus, we don’t need an additional pre-processing step to balance the two classes.
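Had the classes been heavily imbalanced, one option would be caret’s downSample(), which randomly drops majority-class rows; a minimal sketch, not needed here:

# sketch only: rebalance by downsampling the majority class
titanic_balanced <- downSample(x = titanic %>% select(-Survived),
                               y = titanic$Survived,
                               yname = "Survived")
prop.table(table(titanic_balanced$Survived))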
Next, we split the data into training and test sets. The goal is to use the training data for modelling, while the test data is used to evaluate the model on unseen data.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
intrain <- sample(x = nrow(titanic), size = nrow(titanic)*0.8)
titanic_train <- titanic[intrain,]
titanic_test <- titanic[-intrain,]
prop.table(table(titanic_train$Survived))
##
## 0 1
## 0.6095506 0.3904494
The training data shows the same roughly 60:40 proportion of not survived to survived, which is still reasonably balanced, so no balancing step is needed here either.
Next, we build the logistic regression model using the glm() function. To choose the predictor variables, we fit a full model with all predictors (model_forward) and compare it with a model obtained by backward stepwise elimination (model_backward).
model_forward <- glm(formula = Survived ~ ., data = titanic_train, family = "binomial")
summary(model_forward)
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.329499 0.390323 0.844 0.398575
## Pclass2 -1.218327 0.362402 -3.362 0.000774 ***
## Pclass3 -2.080475 0.369029 -5.638 1.72e-08 ***
## Sex1 2.514220 0.243402 10.330 < 2e-16 ***
## Age>65 years -1.980675 1.125207 -1.760 0.078361 .
## Age25-34 years -0.163390 0.280440 -0.583 0.560149
## Age35-44 years -0.663440 0.335326 -1.978 0.047873 *
## Age45-54 years -1.030226 0.414610 -2.485 0.012962 *
## Age55-64 years -1.238150 0.608824 -2.034 0.041984 *
## SibSp1 -0.129255 0.262322 -0.493 0.622200
## SibSp2 0.278595 0.731193 0.381 0.703193
## SibSp3 -1.541956 0.793829 -1.942 0.052085 .
## SibSp4 -0.850582 0.791018 -1.075 0.282240
## SibSp5 -14.941010 715.196527 -0.021 0.983333
## Parch -0.121038 0.131457 -0.921 0.357187
## Fare 0.002659 0.002670 0.996 0.319290
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 747.74 on 551 degrees of freedom
## Residual deviance: 519.35 on 536 degrees of freedom
## (160 observations deleted due to missingness)
## AIC: 551.35
##
## Number of Fisher Scoring iterations: 14
model_backward <- step(object = model_forward, direction = "backward", trace = F)
summary(model_backward)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age, family = "binomial",
## data = titanic_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.39265 0.31135 1.261 0.2073
## Pclass2 -1.32110 0.31537 -4.189 2.80e-05 ***
## Pclass3 -2.29285 0.30663 -7.478 7.57e-14 ***
## Sex1 2.37398 0.22729 10.445 < 2e-16 ***
## Age>65 years -1.97138 1.12175 -1.757 0.0788 .
## Age25-34 years -0.07324 0.27291 -0.268 0.7884
## Age35-44 years -0.56470 0.32453 -1.740 0.0819 .
## Age45-54 years -0.93976 0.40518 -2.319 0.0204 *
## Age55-64 years -1.11315 0.59190 -1.881 0.0600 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 747.74 on 551 degrees of freedom
## Residual deviance: 529.12 on 543 degrees of freedom
## (160 observations deleted due to missingness)
## AIC: 547.12
##
## Number of Fisher Scoring iterations: 4
Based on the model above (model_backward), we can interpret that:

- The variables significant for the target variable: Pclass, Sex, Age
- The variable that increases the survival chance: Sex1 (female)
- The variables that decrease the survival chance: Pclass2 (second class), Pclass3 (third class), Age 45-54 years
# odds ratio: exponentiated coefficient of Pclass2
exp(-1.32110)
## [1] 0.2668416
# odds ratio: exponentiated coefficient of Pclass3
exp(-2.29285)
## [1] 0.1009783
# odds ratio: exponentiated coefficient of Sex1
exp(2.37398)
## [1] 10.74005
# odds ratio: exponentiated coefficient of Age 45-54 years
exp(-0.93976)
## [1] 0.3907216
Based on the exponentiated coefficients (odds ratios) of the significant predictors, we can interpret that:

- The odds of a Pclass2 (second class) passenger surviving are about 0.27 times those of a first-class passenger (the reference level).
- The odds of a Pclass3 (third class) passenger surviving are about 0.10 times those of a first-class passenger.
- The odds of a female passenger (Sex1) surviving are about 10.7 times those of a male passenger.
- The odds of a passenger aged 45-54 years surviving are about 0.39 times those of a passenger under 25 (the reference level).
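For reference, the odds ratios for every coefficient can also be computed in one step:

# odds ratios for all coefficients of the backward model at once
exp(coef(model_backward))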
Using the backward model, we now predict on the test data and examine the distribution of the predicted probabilities.
titanic_test$prob_Survived <- predict(object = model_backward,
newdata = titanic_test,
type = "response")
ggplot(titanic_test, aes(x = prob_Survived)) +
  geom_density(lwd = 0.5) +
  labs(title = "Distribution of Predicted Probabilities") +
  theme_minimal()
## Warning: Removed 26 rows containing non-finite values (`stat_density()`).
The plot shows that the predicted probabilities are concentrated near 0, meaning most passengers are predicted not to survive.
titanic_test$pred_Survived <- factor(ifelse(titanic_test$prob_Survived > 0.5, "Survived", "Not Survived"))
titanic_test[1:10, c("pred_Survived", "Survived")]
## pred_Survived Survived
## 3 Survived 1
## 4 Survived 1
## 5 Not Survived 0
## 7 Not Survived 0
## 8 Not Survived 0
## 17 Not Survived 0
## 21 Not Survived 0
## 23 Survived 1
## 35 Survived 0
## 38 Not Survived 0
Based on the syntax above, test observations with a predicted probability above 0.5 are labelled “Survived” and the rest “Not Survived”.
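The 0.5 cutoff is not fixed; raising it trades recall for precision. A sketch with an arbitrary, purely illustrative cutoff of 0.7:

# sketch only: a stricter cutoff labels fewer passengers as "Survived",
# but those labels are more likely to be correct
pred_strict <- factor(ifelse(titanic_test$prob_Survived > 0.7,
                             "Survived", "Not Survived"))
table(pred_strict)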
To evaluate the model, we build a confusion matrix.
First, we need to check the unique levels in both vectors.
unique(titanic_test$pred_Survived)
## [1] Survived Not Survived <NA>
## Levels: Not Survived Survived
unique(titanic_test$Survived)
## [1] 1 0
## Levels: 0 1
pred_Survived and Survived have different levels, so we need to set the levels of one to match the other, as done in the code below. (The relabelling is safe here because the level order corresponds: “Not Survived” maps to 0 and “Survived” to 1.)
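A minimal alternative sketch (not used here) avoids the relabelling by building the predictions on the target’s own levels from the start:

# sketch only: predict directly onto the "0"/"1" levels of Survived
pred_01 <- factor(ifelse(titanic_test$prob_Survived > 0.5, "1", "0"),
                  levels = levels(titanic_test$Survived))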
levels(titanic_test$pred_Survived) <- levels(titanic_test$Survived)
titanic_conf <- confusionMatrix(data = titanic_test$pred_Survived, reference = titanic_test$Survived, positive = "1")
titanic_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 13
## 1 11 43
##
## Accuracy : 0.8431
## 95% CI : (0.7757, 0.8968)
## No Information Rate : 0.634
## P-Value [Acc > NIR] : 9.294e-09
##
## Kappa : 0.6594
##
## Mcnemar's Test P-Value : 0.8383
##
## Sensitivity : 0.7679
## Specificity : 0.8866
## Pos Pred Value : 0.7963
## Neg Pred Value : 0.8687
## Prevalence : 0.3660
## Detection Rate : 0.2810
## Detection Prevalence : 0.3529
## Balanced Accuracy : 0.8272
##
## 'Positive' Class : 1
##
Positive = Survived, Negative = Not Survived.

- FP = predicted survived (+), but did not survive (-)
- FN = predicted not survived (-), but survived (+)

In this case we want to minimize false positives, so we focus on “precision”.
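Precision (Pos Pred Value) can also be read directly from the stored caret object rather than from the printed matrix:

# precision (Pos Pred Value) pulled from the stored confusion matrix
titanic_conf$byClass["Pos Pred Value"]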
Before building the KNN model, we check the value range of each variable to see whether any further pre-processing is needed. The summary below looks reasonable, so we proceed without additional transformation, although for a distance-based method such as KNN, rescaling the wide-ranging Fare column would usually be advisable.
summary(titanic)
## Survived Pclass Sex Age SibSp Parch
## 0:549 1:216 0:577 Length:891 0:608 Min. :0.0000
## 1:342 2:184 1:314 Class :character 1:209 1st Qu.:0.0000
## 3:491 Mode :character 2: 28 Median :0.0000
## 3: 16 Mean :0.3816
## 4: 18 3rd Qu.:0.0000
## 5: 5 Max. :6.0000
## 8: 7
## Fare
## Min. : 0.00
## 1st Qu.: 7.91
## Median : 14.45
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
glimpse(titanic)
## Rows: 891
## Columns: 7
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Sex <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0…
## $ Age <chr> "< 25 years", "35-44 years", "25-34 years", "35-44 years", "3…
## $ SibSp <fct> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
titanic_knn <- titanic
dummy <- dummyVars("~.", data = titanic_knn)
dummy <- data.frame(predict(dummy, newdata = titanic_knn))
glimpse(dummy)
## Rows: 891
## Columns: 22
## $ Survived.0 <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1…
## $ Survived.1 <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0…
## $ Pclass.1 <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ Pclass.2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0…
## $ Pclass.3 <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1…
## $ Sex.0 <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0…
## $ Sex.1 <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1…
## $ Age..25.years <dbl> 1, 0, 0, 0, 0, NA, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, NA,…
## $ Age.65.years <dbl> 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA,…
## $ Age25.34.years <dbl> 0, 0, 1, 0, 0, NA, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, NA,…
## $ Age35.44.years <dbl> 0, 1, 0, 1, 1, NA, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, NA,…
## $ Age45.54.years <dbl> 0, 0, 0, 0, 0, NA, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA,…
## $ Age55.64.years <dbl> 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, NA,…
## $ SibSp.0 <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0…
## $ SibSp.1 <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1…
## $ SibSp.2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SibSp.3 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SibSp.4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ SibSp.5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SibSp.8 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86…
Removing rows with missing values (NA).
anyNA(dummy)
## [1] TRUE
dummy <- na.omit(dummy)
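Since KNN is distance-based, the wide-ranging Fare column (0 to 512) would dominate the distances; standardizing the numeric columns is often advisable. A minimal sketch, not applied here so the results below match the unscaled data:

# sketch only: standardize the numeric columns before computing distances
dummy_scaled <- dummy %>%
  mutate(Parch = as.numeric(scale(Parch)),
         Fare = as.numeric(scale(Fare)))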
Splitting the data into training and test sets.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
intrain <- sample(x = nrow(dummy), size = nrow(dummy)*0.8)
dummy_train <- dummy[intrain,]
dummy_test <- dummy[-intrain,]
glimpse(dummy)
## Rows: 705
## Columns: 22
## $ Survived.0 <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0…
## $ Survived.1 <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1…
## $ Pclass.1 <dbl> 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Pclass.2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1…
## $ Pclass.3 <dbl> 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0…
## $ Sex.0 <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
## $ Sex.1 <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0…
## $ Age..25.years <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0…
## $ Age.65.years <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Age25.34.years <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1…
## $ Age35.44.years <dbl> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0…
## $ Age45.54.years <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Age55.64.years <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0…
## $ SibSp.0 <dbl> 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1…
## $ SibSp.1 <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0…
## $ SibSp.2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SibSp.3 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SibSp.4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ SibSp.5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SibSp.8 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Parch <dbl> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0…
# Predictor
# note: column 1 is Survived.0, the complement of the target; to avoid
# target leakage, the predictors should arguably be columns 3:22 only
dummy_train_x <- dummy_train[, c(1, 3:22)]
dummy_test_x <- dummy_test[, c(1, 3:22)]
# Target
dummy_train_y <- dummy_train[, 2]
dummy_test_y <- dummy_test[, 2]
We need to find the optimal k.
sqrt(nrow(titanic_train))
## [1] 26.68333
The number of target classes is 2 (survived and not survived). A common rule of thumb sets k near the square root of the number of training rows, here 26.68, which we round up to 27; an odd k avoids ties between the two classes. (Strictly, the square root should be taken from nrow(dummy_train), the data actually fed to KNN, rather than titanic_train.)
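Before committing to k = 27, we could sanity-check the rule of thumb by comparing a few odd values of k on the test split (a sketch; the exact accuracies depend on the seed and split):

# sketch only: test-set accuracy for a handful of odd k values
for (k in c(5, 15, 27, 41)) {
  pred <- class::knn(train = dummy_train_x, test = dummy_test_x,
                     cl = dummy_train_y, k = k)
  print(paste("k =", k, "accuracy =",
              round(mean(pred == as.factor(dummy_test_y)), 4)))
}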
titanic_pred <- class::knn(train = dummy_train_x,
                           test = dummy_test_x,
                           cl = dummy_train_y,
                           k = 27)
pred_knn_conf <- confusionMatrix(as.factor(titanic_pred),
as.factor(dummy_test_y), "1")
pred_knn_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 71 21
## 1 9 40
##
## Accuracy : 0.7872
## 95% CI : (0.7104, 0.8516)
## No Information Rate : 0.5674
## P-Value [Acc > NIR] : 3.584e-08
##
## Kappa : 0.5562
##
## Mcnemar's Test P-Value : 0.04461
##
## Sensitivity : 0.6557
## Specificity : 0.8875
## Pos Pred Value : 0.8163
## Neg Pred Value : 0.7717
## Prevalence : 0.4326
## Detection Rate : 0.2837
## Detection Prevalence : 0.3475
## Balanced Accuracy : 0.7716
##
## 'Positive' Class : 1
##
Metrics summary of both models:

1. Recall/Sensitivity: Logistic Regression 0.7679, KNN 0.6557
2. Specificity: Logistic Regression 0.8866, KNN 0.8875
3. Accuracy: Logistic Regression 0.8431, KNN 0.7872
4. Pos Pred Value/Precision: Logistic Regression 0.7963, KNN 0.8163

Comparing the two, KNN has the better precision when predicting who survived, while logistic regression scores higher on recall and overall accuracy. Since precision is the metric we chose in order to minimize false positives, KNN is preferable if we want to be confident that passengers predicted to survive really did survive.
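For reference, these headline metrics can be pulled side by side from the two stored caret objects:

# side-by-side metrics from the two confusion-matrix objects
rbind(logistic = titanic_conf$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")],
      knn = pred_knn_conf$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")])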