Logistic Regression and K-NN on Titanic
Introduction
Logistic regression is a classification algorithm used to fit a regression curve, y=f(x), where y is a categorical variable. When y is binary (1 for yes, 0 for no) we also call the model binomial logistic regression where in cases of y assuming more than 2 values you’ll sometimes hear the model being referred to as a class of multinomial logistic regression. We can think of logistic regression as a special case of linear regression, except we’re using log of odds as our target variable.
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbors. This algorithms segregates unlabeled data points into well defined groups. Choosing the number of nearest neighbors i.e. determining the value of k plays a significant role in determining the efficacy of the model. Thus, selection of k will determine how well the data can be utilized to generalize the results of the kNN algorithm. A large k value has benefits which include reducing the variance due to the noisy data; the side effect being developing a bias due to which the learner tends to ignore the smaller patterns which may have useful insights.
For this case, we are going to predict the survival of the Titanic passenger. The logistic regression and K-Nearest Neighbor (K-NN) would be used as the algorithm classification method. Data collecting from titanic. We are going to briefly go through the data, perform data preprocessing (if applicable), built and evaluate the models.
Data Wrangling
Load the required package
library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(kableExtra)
library(scales)Load Dataset
titanic.train <- read.csv("data_input/train.csv")
titanic.test <- read.csv("data_input/test.csv")# Combining all data
titanic <- bind_rows(titanic.train,titanic.test)# check data structure
glimpse(titanic)Rows: 1,309
Columns: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
$ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
$ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
$ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
$ Sex <chr> "male", "female", "female", "female", "male", "male", "mal~
$ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
$ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
$ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
$ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
$ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
$ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
$ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
The data has 1309 rows and 12 columns, and herewith the description of each column in titanic dataset :
PassengerId: ID of the passenger.
Survived: people whose survive, 0 = No, 1 = Yes.
Pclass: Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd.
Name: name of the passenger.
Sex: sex, male and female.
Age: Age in years.
SibSp: of siblings / spouses aboard the Titanic
Parch: of parents / children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton.
Before we go further, first we need to make sure that our data is clean and will be useful, and we have to check which column is not correct for the data type. We have to change the data type of each column which not correct.
Exploratory Data Analysis
## Let's check for any missing values in the data
colSums(is.na(titanic))PassengerId Survived Pclass Name Sex Age
0 418 0 0 0 263
SibSp Parch Ticket Fare Cabin Embarked
0 0 0 1 0 0
## Checking for empty values
colSums(titanic=='')PassengerId Survived Pclass Name Sex Age
0 NA 0 0 0 NA
SibSp Parch Ticket Fare Cabin Embarked
0 0 0 NA 1014 2
There are 5 variables which have missing values Survived, Age, Fare, Cabin and Embarked.
We will not impute the Cabin variable as it has too much missing value, also PassengerId, Ticket, and Name, we do not really needed for this column information, so we can remove those column, and I will remove the missing value in Survived column. The variables which need to be treated are Age, Fare, and Embarked.
First, we check in which observation has missing Fare value :
# Find which observation has missing fare value
which(is.na(titanic$Fare))[1] 1044
titanic[1044,] PassengerId Survived Pclass Name Sex Age SibSp Parch
1044 1044 NA 3 Storey, Mr. Thomas male 60.5 0 0
Ticket Fare Cabin Embarked
1044 3701 NA S
The Passenger which has missing fare value belongs to third class passenger who departed from Southampton (‘S’). Let’s try to visualize Fares among all others sharing their class and embarkment.
ggplot(titanic[titanic$Pclass == '3' & titanic$Embarked == 'S', ],
aes(x = Fare)) +
geom_density(fill = '#99d6ff', alpha=0.4) +
geom_vline(aes(xintercept=median(Fare, na.rm=T)),
colour='red', linetype='dashed', lwd=1) +
scale_x_continuous(labels=scales::dollar_format()) +
theme_minimal()From the visualization, I think we can replace the missing value from the median fare for Pclass and Embarked which is $8.05
# Replace fare missing value with the median fare for Pclass and Embarked
titanic$Fare[1044] <- median(titanic[titanic$Pclass == 3 & titanic$Embarked == 'S', ]$Fare,na.rm = TRUE)let’s continue, to check Embarked missing value :
titanic %>% dplyr::filter(Embarked=='') PassengerId Survived Pclass Name Sex
1 62 1 1 Icard, Miss. Amelie female
2 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female
Age SibSp Parch Ticket Fare Cabin Embarked
1 38 0 0 113572 80 B28
2 62 0 0 113572 80 B28
From this information it seems clear that there are 2 significant similarities between these two passengers, both belongs to class 1 and have paid $80 as fare. So we can visualize the port of embarkation sharing similar information.
plot.data <- titanic[-c(62,830),]
ggplot(plot.data, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
geom_boxplot() +
geom_hline(aes(yintercept=80), colour='red', linetype='dashed', lwd=2) +
scale_y_continuous(labels=scales::dollar) +
theme_minimal()This visualization gives a clear picture of the median fare for a first class passenger departing from Charbourg (C) which coincides nicely with the $80 paid by our embarkment-deficient passengers. So the missing embarkement values can be replaced by C.
# replace missing value in Embarked with "C"
titanic$Embarked[c(62,830)] <- "C"Now we will be dealing with Age variable imputation as this variable has 263 missing value. As for this column I will replace the missing value with the mean of Age column
# replace using mean value of the age
titanic$Age[is.na(titanic$Age)] <- mean(titanic$Age, na.rm = T)Next, Survived missing value, I will remove missing value, because I can’t impute the value, due to the target variable I think it will impact the prediction .
titanic <- titanic[complete.cases(titanic), ] Next, remove Cabin, PassengerId, Ticket, and Name column :
# remove cabin, passengerid, ticket, and name column
titanic <- titanic %>%
dplyr::select(-c(Cabin, PassengerId, Ticket, Name))ok, I think is done to clean the missing value variable, let’s check again if there is still any missing value in dataset
colSums(is.na(titanic))Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 0 0 0 0 0 0
Then, I will check for the categorical data.
# Check number of uniques values for each of the column to find out columns which we can convert to factors
sapply(titanic, function(x) length(unique(x)))Survived Pclass Sex Age SibSp Parch Fare Embarked
2 3 2 89 7 7 248 3
Before we check uniques values for each column to find out which columns we can convert to factors, as from the result Survived,Pclass,Sex, and Embarked variable not match with the data type we should change it into factor data type
# change data type into categorical & create labels for target variable
titanic <- titanic %>%
mutate(Survived = factor(Survived, levels = c(0,1), labels = c("No","Yes")),
Pclass = factor(Pclass),
Sex = factor(Sex),
Embarked = factor(Embarked))
glimpse(titanic)Rows: 891
Columns: 8
$ Survived <fct> No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, Yes, Yes, No, No~
$ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2~
$ Sex <fct> male, female, female, female, male, male, male, male, female,~
$ Age <dbl> 22.00000, 38.00000, 26.00000, 35.00000, 35.00000, 29.88114, 5~
$ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0~
$ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0~
$ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21~
$ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C, S~
## Create dummy variables for categorical variables
library(dummies)
titanic<- dummy.data.frame(titanic, names=c("Pclass","Sex","Embarked"), sep="_")Modeling
Splitting Data
Before we train our model, first we check if there is a class imbalance in target variable . Class imbalance can affect our model ability to classify the output.
prop.table(table(titanic$Survived))
No Yes
0.6161616 0.3838384
It is quite balanced, but we can balancing the data class for improvement later. Then, split the dataset into train and test dataset. Purposing is to check if the model is capable to classify new data that has not been seen by the model. The data will be split with ratio of 80/20 (80% data will be used to train, 20% to test).
RNGkind(sample.kind = "Rounding")
set.seed(417)
index <- sample(nrow(titanic), nrow(titanic)*0.8)
train <- titanic[index, ]
test <- titanic[-index, ]# Check proportion of titanic data train
prop.table(table(train$Survived))
No Yes
0.6179775 0.3820225
Logistic Regression
This is the equation of a logistic regression model:
\[log(\frac{p(x)}{1-p(x)}) = \beta_0 + \beta_1(x)\]
The left-hand side is called the log-odds or logit. On the right side, the _0 is the model intercept and _1 is the coefficient of feature x.
titanic_glm <- glm(Survived ~.,family=binomial(link='logit'),data=train)
summary(titanic_glm)
Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"),
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5716 -0.5690 -0.4233 0.6464 2.3842
Coefficients: (3 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.247653 0.307499 -4.057 0.000049617360 ***
Pclass_1 2.140342 0.331617 6.454 0.000000000109 ***
Pclass_2 1.004680 0.262437 3.828 0.000129 ***
Pclass_3 NA NA NA NA
Sex_female 2.715390 0.223782 12.134 < 0.0000000000000002 ***
Sex_male NA NA NA NA
Age -0.034517 0.008877 -3.888 0.000101 ***
SibSp -0.385369 0.128480 -2.999 0.002705 **
Parch -0.118518 0.137710 -0.861 0.389440
Fare 0.002327 0.002626 0.886 0.375602
Embarked_C 0.429872 0.256911 1.673 0.094281 .
Embarked_Q 0.361507 0.374045 0.966 0.333804
Embarked_S NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 947.02 on 711 degrees of freedom
Residual deviance: 631.32 on 702 degrees of freedom
AIC: 651.32
Number of Fisher Scoring iterations: 5
From the output there are 5 variable indicated significant value due to the p-value (< 0.05), which is : Pclass_1, Pclass_2, Sex_female, Age, and SibSp. Let’s try to interpreted this variable with the target variable , but before we interpret, first we need to convert the log off odds into odds or probability value, because log of odss value can not be interpreted.
paste("Pclass_1:", round(exp(2.112772), 4))[1] "Pclass_1: 8.2711"
paste("Pclass_2:", round(exp(1.203143), 4))[1] "Pclass_2: 3.3306"
paste("Sex_female:", round(exp(2.833890), 4))[1] "Sex_female: 17.0115"
paste("Age:", round(exp(-0.033508), 4))[1] "Age: 0.967"
paste("SibSp:", round(exp(-0.474637), 4))[1] "SibSp: 0.6221"
- The probability of the survive person from the Pclass_1 is 8.27 times possible than person from the Pclass_3
- The probability of the survive person from the Pclass_2 is 3.33 times possible than person from the Pclass_3
- The probability of the survive person from the female gender is 17.01 times possible than male gender
- The probability of the survive person who has siblings or spouse who went with him/her in the boat is 0.62 times possible than those who have no siblings/spouse
In the first modeling, there are still many predictor variables that are not significant to the target variable, hence we try to implemented with automatic feature selection step wise regression using backward elimination. Starting from using all features, the backward elimination process will iteratively discard some and evaluate the model until it finds one with the lowest Akaike Information Criterion (AIC). Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models based on information loss. Lower AIC means better model. We’ll use step() function to apply backward elimination.
library(MASS)
titanic_glm2 <- stepAIC(titanic_glm, direction = "backward", trace = F)
summary(titanic_glm2)
Call:
glm(formula = Survived ~ Pclass_1 + Pclass_2 + Sex_female + Age +
SibSp + Embarked_C, family = binomial(link = "logit"), data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5723 -0.5669 -0.4245 0.6493 2.4844
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.200849 0.301000 -3.990 0.000066204794888496 ***
Pclass_1 2.241372 0.279150 8.029 0.000000000000000981 ***
Pclass_2 0.965744 0.252964 3.818 0.000135 ***
Sex_female 2.708151 0.217295 12.463 < 0.0000000000000002 ***
Age -0.034134 0.008836 -3.863 0.000112 ***
SibSp -0.409213 0.121698 -3.363 0.000772 ***
Embarked_C 0.413600 0.251434 1.645 0.099976 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 947.02 on 711 degrees of freedom
Residual deviance: 633.85 on 705 degrees of freedom
AIC: 647.85
Number of Fisher Scoring iterations: 5
The same as first model, there are 5 variable indicated significant value due to the p-value (< 0.05), which is : Pclass_1, Pclass_2, Sex_female, Age, and SibSp. Let’s try to interpreted this variable, the log off odds value is quite similar with the first model, lets’ convert and interpret.
paste("Pclass_1:", round(exp(2.241372), 4))[1] "Pclass_1: 9.4062"
paste("Pclass_2:", round(exp(0.965744), 4))[1] "Pclass_2: 2.6267"
paste("Sex_female:", round(exp(2.708151), 4))[1] "Sex_female: 15.0015"
paste("Age:", round(exp(-0.034134), 4))[1] "Age: 0.9664"
paste("SibSp:", round(exp(-0.409213), 4))[1] "SibSp: 0.6642"
- The probability of the survive person from the Pclass_1 is 9.40 times possible than person from the Pclass_3
- The probability of the survive person from the Pclass_2 is 2.62 times possible than person from the Pclass_3
- The probability of the survive person from the female gender is 15.00 times possible than male gender
- The probability of the survive person who has siblings or spouse who went with him/her in the boat is 0.66 times possible than those who have no siblings/spouse
K-NN
K-Nearest Neighbor is a method that uses a supervised learning algorithm where the results of the new instance are classified based on the majority of the k-nearest neighbor category. The K-Nearest Neighbor algorithm uses the Neighborhood Classification as the predictive value of the new instance value. The working principle of K-Nearest Neighbor (KNN) is to find the shortest distance between the data to be evaluated and the k closest neighbors in the training data. Where k is the number of nearest neighbors,to find near or far the distance between points in class k is usually calculated using the Euclidean distance. Euclidean distance is a formula for finding the distance between 2 points in two-dimensional space.
Let’s try to classify using K-NN, in the end I would like to compare between this two algorithm, which one that we can use it better to predict .
In the k-Nearest Neighbor algorithm, we need to do one additional data pre-processing stage. For every train and test data we have, omit the categorical variables except the Survived variable. Separate predictor and target variables from train and test data.
Keep in mind for the distance measurement in K-NN is highly dependent on the scale of the data from the predictor variables which have to be input of the model. The existence of predictors that have a very different range of values from other predictors can cause problems in the classification model. Therefore, we have to normalize the data to equalize the scale of each predictor variable so that it has a standard range of values.
# check range of the data
summary(train) Survived Pclass_1 Pclass_2 Pclass_3 Sex_female
No :440 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
Yes:272 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :1.0000 Median :0.0000
Mean :0.2346 Mean :0.2037 Mean :0.5618 Mean :0.3539
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
Sex_male Age SibSp Parch
Min. :0.0000 Min. : 0.42 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:22.00 1st Qu.:0.0000 1st Qu.:0.0000
Median :1.0000 Median :29.88 Median :0.0000 Median :0.0000
Mean :0.6461 Mean :29.77 Mean :0.5211 Mean :0.3778
3rd Qu.:1.0000 3rd Qu.:35.00 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :1.0000 Max. :80.00 Max. :8.0000 Max. :6.0000
Fare Embarked_C Embarked_Q Embarked_S
Min. : 0.000 Min. :0.0000 Min. :0.00000 Min. :0.0000
1st Qu.: 7.896 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
Median : 14.454 Median :0.0000 Median :0.00000 Median :1.0000
Mean : 32.669 Mean :0.2079 Mean :0.08848 Mean :0.7037
3rd Qu.: 30.696 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :512.329 Max. :1.0000 Max. :1.00000 Max. :1.0000
The range of the data is different , so we ned to scale the data , as bellow :
# omit categorical variables on predictors & normalize data using scale function in train data
train_x_scaled <- train %>%
select_if(is.numeric) %>%
scale()
# omit categorical variables on predictors & normalize data using scale function in test data
test_x_scaled <- test %>%
select_if(is.numeric) %>%
scale(center = attr(train_x_scaled, "scaled:center"),
scale = attr(train_x_scaled, "scaled:scale"))# separate target variable for train data
train_y <- train$Survived
# separate target variable for test data
test_y <- test$Survived# find the optimum k value using square root by number of row
k <- round(sqrt(nrow(train_x_scaled)),2)
k[1] 26.68
k must be odd if the number of target classes is even, and k must be even if the number of target classes is odd. This is to avoid a draw when majority voting. From the output result k value is 26, it’s even value, I will use 25 of k value for K-NN predict, due to our classes is even.
# create the K-NN model
library(class)
titanic_knn <- knn(train = train_x_scaled, test = test_x_scaled, cl = train_y, k=25)rmarkdown::paged_table(head(as.data.frame(titanic_knn), 10))We’ve successfully classify the data test with 25 neighbour.
Evaluation
Evaluation of the model will be done with confusion matrix. Confusion matrix is a table that shows four different category: True Positive, True Negative, False Positive, and False Negative.
The performance will be the Accuracy, Sensitivity/Recall, Specificity, and Precision (Saito and Rehmsmeier, 2015). Accuracy measures how many of our data is correctly predicted. Sensitivity measures out of all positive outcome, how many are correctly predicted. Specificty measure how many negative outcome is correctly predicted. Precision measures how many of our positive prediction is correct.
\[Accuracy = \frac{TP + TN}{TP + TN + FP + FN }\]
\[Sensitivity = \frac{TP}{TP + FN }\]
\[Specificity = \frac{TN}{TN + FP }\]
\[Precision = \frac{TP}{TP + FP }\]
Logistic Regression
Let’s predict our logistics regression model, in the previous modeling. I have created two model in logistic regression, first model using all predictor and the second one using step wise regression backward elimination method. I will compare both model. Logistic regression return value within range of [0,1] and not a binary class. The value is an estimate of the probability that the data will belong to the positive class (Yes or No).
# predict of model 1
pred_logit1 <- predict(titanic_glm, test, type = "response")
rmarkdown::paged_table(head(as.data.frame(pred_logit1),10))Because the prediction results in the logistic regression model are in the form of probabilities, we must convert these values into our target category/class using threshold value. Any values above the threshold value will be classified as positive class. By default, the threshold value is 0.5.
# determine the class based on the threshold 0.5
pred_class1 <- as.factor(if_else(pred_logit1 > 0.5, "Yes", "No"))
# confusion matrix
library(caret)
log_conf1 <- confusionMatrix(data = pred_class1, reference = test$Survived, positive = "Yes")
log_conf1Confusion Matrix and Statistics
Reference
Prediction No Yes
No 98 18
Yes 11 52
Accuracy : 0.838
95% CI : (0.7757, 0.8887)
No Information Rate : 0.6089
P-Value [Acc > NIR] : 0.00000000002273
Kappa : 0.6536
Mcnemar's Test P-Value : 0.2652
Sensitivity : 0.7429
Specificity : 0.8991
Pos Pred Value : 0.8254
Neg Pred Value : 0.8448
Prevalence : 0.3911
Detection Rate : 0.2905
Detection Prevalence : 0.3520
Balanced Accuracy : 0.8210
'Positive' Class : Yes
Let’s check for model 2 :
# predict of model 2
pred_logit2 <- predict(titanic_glm2, test, type = "response")
rmarkdown::paged_table(head(as.data.frame(pred_logit2),10))# determine the class based on the threshold 0.5
pred_class2 <- as.factor(if_else(pred_logit2 > 0.5, "Yes", "No"))
# confusion matrix
library(caret)
log_conf2 <- confusionMatrix(data = pred_class2, reference = test$Survived, positive = "Yes")
log_conf2Confusion Matrix and Statistics
Reference
Prediction No Yes
No 92 19
Yes 17 51
Accuracy : 0.7989
95% CI : (0.7326, 0.855)
No Information Rate : 0.6089
P-Value [Acc > NIR] : 0.00000004117
Kappa : 0.5755
Mcnemar's Test P-Value : 0.8676
Sensitivity : 0.7286
Specificity : 0.8440
Pos Pred Value : 0.7500
Neg Pred Value : 0.8288
Prevalence : 0.3911
Detection Rate : 0.2849
Detection Prevalence : 0.3799
Balanced Accuracy : 0.7863
'Positive' Class : Yes
# combine the performance data
logit1 <- data_frame(Accuracy = round(log_conf1$overall[1],4),
Sensitivity = round (log_conf1$byClass[1],4),
Specificity = round (log_conf1$byClass[2],4),
Precision = round (log_conf1$byClass[3],4))
logit2 <- data_frame(Accuracy = round(log_conf2$overall[1],4),
Sensitivity = round (log_conf2$byClass[1],4),
Specificity = round (log_conf2$byClass[2],4),
Precision = round (log_conf2$byClass[3],4))
performance <- bind_rows(logit1,logit2)| Accuracy | Sensitivity | Specificity | Precision |
|---|---|---|---|
| 0.8380 | 0.7429 | 0.8991 | 0.8254 |
| 0.7989 | 0.7286 | 0.8440 | 0.7500 |
The result from both model shows that the first model performance is better than the second one, our first logistic regression model has accuracy of 83.8% on test dataset, meaning that 83.8% of our data is correctly classified. The value of sensitivity and specificity is 74.29% and 89.91%. This indicate that most of positive outcomes are correctly classified . The precision predicted value is 82.54%, meaning that 82.54% of our positive prediction is correct.
K-NN
# create confusion matrix from KNN predict
knn_conf <- confusionMatrix(data = titanic_knn, reference = test_y, positive = "Yes")
knn_confConfusion Matrix and Statistics
Reference
Prediction No Yes
No 96 28
Yes 13 42
Accuracy : 0.7709
95% CI : (0.7024, 0.8303)
No Information Rate : 0.6089
P-Value [Acc > NIR] : 0.000003124
Kappa : 0.4999
Mcnemar's Test P-Value : 0.02878
Sensitivity : 0.6000
Specificity : 0.8807
Pos Pred Value : 0.7636
Neg Pred Value : 0.7742
Prevalence : 0.3911
Detection Rate : 0.2346
Detection Prevalence : 0.3073
Balanced Accuracy : 0.7404
'Positive' Class : Yes
The result shows that our K-NN with K = 25 has accuracy of 77.09% on test dataset, meaning that 77.09% of our data is correctly classified. The value of sensitivity and specificity is 60.00% and 88.07%. This indicate that our sensitivity value is quite low to predict the positive target. The precision predicted value is 76.36%, meaning that 76.36 % of our positive prediction is correct.
Model Improvement
We want to improve the performance of our model. Here I try to making a more balanced of the class. Actually there are a lot of method to handle imbalanced data. The simple technique to reduce the negative impact of this problem is by subsampling the data. The common subsampling methods used in practice are the following. Below methods should be applied only on the training set.
Upsampling: this method increases the size of the minority class by sampling with replacement so that the classes will have the same size.
Downsampling: in contrast to the above method, this one decreases the size of the majority class to be the same or closer to the minority class size by just taking out a random sample.
Hybrid methods : The well known hybrid methods are ROSE (Random oversampling examples), and SMOTE (Synthetic minority oversampling technique), they downsample the majority class, and creat new artificial points in the minority class.
Here I try using downsampling for balancing titanic data train :
# resampling data train using downsampling
library(caret)
RNGkind(sample.kind = "Rounding")
set.seed(417)
train_down <- downSample(x = train %>% dplyr::select(-Survived),
y = train$Survived,
yname = "Survived")
# check class proportion(
prop.table(table(train_down$Survived))
No Yes
0.5 0.5
From the proportion output class already balance, next I will use train_down to make a model again. I will use all predictor, because in the first step, the evaluation result is better when use all predictor.
Logistic Regression
# Fitting the model
titanic_down <- glm(Survived ~.,family=binomial(link='logit'),data=train_down)
summary(titanic_down)
Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"),
data = train_down)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4763 -0.6704 -0.0043 0.6920 2.1682
Coefficients: (3 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.772742 0.343486 -2.250 0.024468 *
Pclass_1 1.964450 0.377692 5.201 0.000000198 ***
Pclass_2 0.990516 0.292520 3.386 0.000709 ***
Pclass_3 NA NA NA NA
Sex_female 2.656155 0.256650 10.349 < 0.0000000000000002 ***
Sex_male NA NA NA NA
Age -0.033602 0.010259 -3.276 0.001055 **
SibSp -0.435804 0.148191 -2.941 0.003273 **
Parch -0.101009 0.156637 -0.645 0.519015
Fare 0.004278 0.003883 1.102 0.270543
Embarked_C 0.472748 0.285512 1.656 0.097764 .
Embarked_Q 0.361876 0.424357 0.853 0.393791
Embarked_S NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 754.14 on 543 degrees of freedom
Residual deviance: 504.74 on 534 degrees of freedom
AIC: 524.74
Number of Fisher Scoring iterations: 5
# Predict the models
pred_down <- predict(titanic_down, test, type = "response")
# determine the class based on the threshold 0.5
pred_class_down <- as.factor(if_else(pred_down > 0.5, "Yes", "No"))
# Confusion Matrix
perf_down <- confusionMatrix(data = pred_class_down, reference = test$Survived, positive = "Yes")
perf_downConfusion Matrix and Statistics
Reference
Prediction No Yes
No 90 16
Yes 19 54
Accuracy : 0.8045
95% CI : (0.7387, 0.8599)
No Information Rate : 0.6089
P-Value [Acc > NIR] : 0.0000000157
Kappa : 0.5926
Mcnemar's Test P-Value : 0.7353
Sensitivity : 0.7714
Specificity : 0.8257
Pos Pred Value : 0.7397
Neg Pred Value : 0.8491
Prevalence : 0.3911
Detection Rate : 0.3017
Detection Prevalence : 0.4078
Balanced Accuracy : 0.7986
'Positive' Class : Yes
K-NN
# omit categorical variables on predictors & normalize data using scale function in train
train_x_scaled_dwn <- train_down %>%
select_if(is.numeric) %>%
scale()
# omit categorical variables on predictors & normalize data using scale function in test data
test_x_scaled_dwn <- test %>%
select_if(is.numeric) %>%
scale(center = attr(train_x_scaled_dwn, "scaled:center"),
scale = attr(train_x_scaled_dwn, "scaled:scale"))# separate target variable for train data
train_y_dwn <- train_down$Survived# find the optimum k value using square root by number of row
k_down <- round(sqrt(nrow(train_x_scaled_dwn)),2)
k_down[1] 23.32
# create the K-NN model
titanic_knn_dwn <- knn(train = train_x_scaled_dwn, test = test_x_scaled_dwn, cl = train_y_dwn, k=23)# create confusion matrix from KNN predict
perf_knn_dwn <- confusionMatrix(data = titanic_knn_dwn, reference = test_y, positive = "Yes")
perf_knn_dwnConfusion Matrix and Statistics
Reference
Prediction No Yes
No 85 16
Yes 24 54
Accuracy : 0.7765
95% CI : (0.7084, 0.8353)
No Information Rate : 0.6089
P-Value [Acc > NIR] : 0.0000014
Kappa : 0.5402
Mcnemar's Test P-Value : 0.2684
Sensitivity : 0.7714
Specificity : 0.7798
Pos Pred Value : 0.6923
Neg Pred Value : 0.8416
Prevalence : 0.3911
Detection Rate : 0.3017
Detection Prevalence : 0.4358
Balanced Accuracy : 0.7756
'Positive' Class : Yes
Conclusion
Herewith the comparison between initial model logistic regression and K-NN and improve model using downsampling for balancing the class data.
| Model | Accuracy | Sensitivity | Specificity | Precision |
|---|---|---|---|---|
| logit_initial | 0.8380 | 0.7429 | 0.8991 | 0.8254 |
| knn_initial | 0.7709 | 0.6000 | 0.8807 | 0.7636 |
| logit_improve | 0.8045 | 0.7714 | 0.8257 | 0.7397 |
| knn_improve | 0.7765 | 0.7714 | 0.7798 | 0.6923 |
The metric I use to get the best parameter value to predict whose survived in titanic accident is the accuracy rate, I choose to use accuracy due to check how accurated model predicted class of the target. In this case the best value of accuracy is about 83.80% (logit_initial), it is from the logistic regression model one which used all predictor to create the model. But when we improve the logistic model using downsampling method, the value decrease to 80.45% (logit_improve). Contrary in K-NN model the initial KNN model value of accuracy is 77.09 obtained at k=25 lower than K-NN improve model, when we do improvement using downsampling method in data train and do scaling again the accuracy a bit increased in K-NN improve to 77.65% obtained at k=23.
I can conclude that so far the initial model from the logistic regression algorithm is better than others with a 83.80% accuracy, and if we want to use K-NN model , we can use the improve model K-NN with a 77.65% accuracy. This does not tells us whether the model specification is optimal, we can try in obtaining a better model performance using others algorithm, and maybe we can check for the recall/sensitivity value for comparison.