We apply both logistic regression and the K-Nearest Neighbors (K-NN)
algorithm to see which best fits this dataset.
Load the data
Heart <- read_csv("heart_kag.csv", col_types = "nffnnffnfnff")
Heart
## # A tibble: 918 × 12
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## <dbl> <fct> <fct> <dbl> <dbl> <fct> <fct> <dbl>
## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## 7 45 F ATA 130 237 0 Normal 170
## 8 54 M ATA 110 208 0 Normal 142
## 9 37 M ASY 140 207 0 Normal 130
## 10 48 F ATA 120 284 0 Normal 120
## # ℹ 908 more rows
## # ℹ 4 more variables: ExerciseAngina <fct>, Oldpeak <dbl>, ST_Slope <fct>,
## # HeartDisease <fct>
head(Heart)
## # A tibble: 6 × 12
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## <dbl> <fct> <fct> <dbl> <dbl> <fct> <fct> <dbl>
## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## # ℹ 4 more variables: ExerciseAngina <fct>, Oldpeak <dbl>, ST_Slope <fct>,
## # HeartDisease <fct>
Exploratory Data Analysis
glimpse(Heart)
## Rows: 918
## Columns: 12
## $ Age <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex <fct> M, F, M, F, M, M, F, M, M, F, F, M, M, M, F, F, M, F, M…
## $ ChestPainType <fct> ATA, NAP, ATA, ASY, NAP, NAP, ATA, ATA, ASY, ATA, NAP, …
## $ RestingBP <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG <fct> Normal, Normal, ST, Normal, Normal, Normal, Normal, Nor…
## $ MaxHR <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <fct> N, N, N, Y, N, N, N, N, Y, N, N, Y, N, Y, N, N, N, N, N…
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope <fct> Up, Flat, Up, Flat, Up, Up, Up, Up, Flat, Up, Up, Flat,…
## $ HeartDisease <fct> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…
We can find the number of rows and columns
dim(Heart)
## [1] 918 12
### structure of the data
str(Heart)
## spc_tbl_ [918 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : num [1:918] 40 49 37 48 54 39 45 54 37 48 ...
## $ Sex : Factor w/ 2 levels "M","F": 1 2 1 2 1 1 2 1 1 2 ...
## $ ChestPainType : Factor w/ 4 levels "ATA","NAP","ASY",..: 1 2 1 3 2 2 1 1 3 1 ...
## $ RestingBP : num [1:918] 140 160 130 138 150 120 130 110 140 120 ...
## $ Cholesterol : num [1:918] 289 180 283 214 195 339 237 208 207 284 ...
## $ FastingBS : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ RestingECG : Factor w/ 3 levels "Normal","ST",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ MaxHR : num [1:918] 172 156 98 108 122 170 170 142 130 120 ...
## $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
## $ Oldpeak : num [1:918] 0 1 0 1.5 0 0 0 0 1.5 0 ...
## $ ST_Slope : Factor w/ 3 levels "Up","Flat","Down": 1 2 1 2 1 1 1 1 2 1 ...
## $ HeartDisease : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 2 1 ...
## - attr(*, "spec")=
## .. cols(
## .. Age = col_number(),
## .. Sex = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. ChestPainType = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. RestingBP = col_number(),
## .. Cholesterol = col_number(),
## .. FastingBS = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. RestingECG = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. MaxHR = col_number(),
## .. ExerciseAngina = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. Oldpeak = col_number(),
## .. ST_Slope = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. HeartDisease = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
## .. )
## - attr(*, "problems")=<externalptr>
Description of the data with describe() from the psych package
describe(Heart)
## vars n mean sd median trimmed mad min max range
## Age 1 918 53.51 9.43 54.0 53.71 10.38 28.0 77.0 49.0
## Sex* 2 918 1.21 0.41 1.0 1.14 0.00 1.0 2.0 1.0
## ChestPainType* 3 918 2.45 0.85 3.0 2.50 0.00 1.0 4.0 3.0
## RestingBP 4 918 132.40 18.51 130.0 131.50 14.83 0.0 200.0 200.0
## Cholesterol 5 918 198.80 109.38 223.0 204.41 68.20 0.0 603.0 603.0
## FastingBS* 6 918 1.23 0.42 1.0 1.17 0.00 1.0 2.0 1.0
## RestingECG* 7 918 1.60 0.81 1.0 1.51 0.00 1.0 3.0 2.0
## MaxHR 8 918 136.81 25.46 138.0 137.23 26.69 60.0 202.0 142.0
## ExerciseAngina* 9 918 1.40 0.49 1.0 1.38 0.00 1.0 2.0 1.0
## Oldpeak 10 918 0.89 1.07 0.6 0.74 0.89 -2.6 6.2 8.8
## ST_Slope* 11 918 1.64 0.61 2.0 1.59 0.00 1.0 3.0 2.0
## HeartDisease* 12 918 1.55 0.50 2.0 1.57 0.00 1.0 2.0 1.0
## skew kurtosis se
## Age -0.20 -0.40 0.31
## Sex* 1.42 0.02 0.01
## ChestPainType* -0.52 -0.75 0.03
## RestingBP 0.18 3.23 0.61
## Cholesterol -0.61 0.10 3.61
## FastingBS* 1.26 -0.41 0.01
## RestingECG* 0.84 -0.95 0.03
## MaxHR -0.14 -0.46 0.84
## ExerciseAngina* 0.39 -1.85 0.02
## Oldpeak 1.02 1.18 0.04
## ST_Slope* 0.38 -0.67 0.02
## HeartDisease* -0.21 -1.96 0.02
The total count of missing values in each column
### total missing values
sum(is.na(Heart))
## [1] 0
sapply(Heart, function(x) sum(is.na(x)))
## Age Sex ChestPainType RestingBP Cholesterol
## 0 0 0 0 0
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## 0 0 0 0 0
## ST_Slope HeartDisease
## 0 0
Summary of the data (no cleaning was needed, as there are no missing values)
summary(Heart)
## Age Sex ChestPainType RestingBP Cholesterol
## Min. :28.00 M:725 ATA:173 Min. : 0.0 Min. : 0.0
## 1st Qu.:47.00 F:193 NAP:203 1st Qu.:120.0 1st Qu.:173.2
## Median :54.00 ASY:496 Median :130.0 Median :223.0
## Mean :53.51 TA : 46 Mean :132.4 Mean :198.8
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:267.0
## Max. :77.00 Max. :200.0 Max. :603.0
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## 0:704 Normal:552 Min. : 60.0 N:547 Min. :-2.6000
## 1:214 ST :178 1st Qu.:120.0 Y:371 1st Qu.: 0.0000
## LVH :188 Median :138.0 Median : 0.6000
## Mean :136.8 Mean : 0.8874
## 3rd Qu.:156.0 3rd Qu.: 1.5000
## Max. :202.0 Max. : 6.2000
## ST_Slope HeartDisease
## Up :395 0:410
## Flat:460 1:508
## Down: 63
##
##
##
Looking at the factor summary
Heart %>% keep(is.factor) %>% summary()
## Sex ChestPainType FastingBS RestingECG ExerciseAngina ST_Slope
## M:725 ATA:173 0:704 Normal:552 N:547 Up :395
## F:193 NAP:203 1:214 ST :178 Y:371 Flat:460
## ASY:496 LVH :188 Down: 63
## TA : 46
## HeartDisease
## 0:410
## 1:508
##
##
Looking at the numeric summary
Heart %>% keep(is.numeric) %>% summary()
## Age RestingBP Cholesterol MaxHR
## Min. :28.00 Min. : 0.0 Min. : 0.0 Min. : 60.0
## 1st Qu.:47.00 1st Qu.:120.0 1st Qu.:173.2 1st Qu.:120.0
## Median :54.00 Median :130.0 Median :223.0 Median :138.0
## Mean :53.51 Mean :132.4 Mean :198.8 Mean :136.8
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:267.0 3rd Qu.:156.0
## Max. :77.00 Max. :200.0 Max. :603.0 Max. :202.0
## Oldpeak
## Min. :-2.6000
## 1st Qu.: 0.0000
## Median : 0.6000
## Mean : 0.8874
## 3rd Qu.: 1.5000
## Max. : 6.2000
Data Visualization
### looking at the histogram
Heart %>% keep(is.numeric) %>% gather() %>% ggplot(aes(value, color = key)) +
  facet_wrap(~ key, scales = "free") + geom_histogram(binwidth = 10)

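As an aside, gather() has been superseded in current tidyr; an equivalent plot with pivot_longer() (a sketch; the output is assumed identical):
### equivalent plot with the newer pivot_longer()
Heart %>% keep(is.numeric) %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  ggplot(aes(value, color = key)) +
  facet_wrap(~ key, scales = "free") + geom_histogram(binwidth = 10)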
Boxplot of Age by Sex
Boxplot of Cholesterol by Sex
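The code behind these two boxplots is not shown in the output above; a minimal ggplot2 sketch that would produce them:
### sketch of the boxplot code (not shown above)
Heart %>% ggplot(aes(Sex, Age, fill = Sex)) + geom_boxplot()
Heart %>% ggplot(aes(Sex, Cholesterol, fill = Sex)) + geom_boxplot()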
Frequency polygon for ChestPainType
Heart %>% ggplot(aes(Age, color = ChestPainType)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Splitting the data for training and testing samples
set.seed(1234)
sample_index <- sample(nrow(Heart), round(nrow(Heart) *.75), replace = FALSE)
Heart_train <- Heart[sample_index, ]
Heart_test <- Heart[-sample_index, ]
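As a quick sanity check (a sketch; output not shown), the class balance of the outcome should be similar across the two splits:
### compare the outcome distribution across the splits
prop.table(table(Heart_train$HeartDisease))
prop.table(table(Heart_test$HeartDisease))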
Building the logistic regression model
Heart_mod <- glm(data = Heart_train, family = binomial, formula = HeartDisease ~.)
summary(Heart_mod)
##
## Call:
## glm(formula = HeartDisease ~ ., family = binomial, data = Heart_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.031247 1.553377 -1.951 0.051010 .
## Age 0.020260 0.015606 1.298 0.194223
## SexF -1.419177 0.320981 -4.421 0.00000981 ***
## ChestPainTypeNAP 0.002548 0.415528 0.006 0.995107
## ChestPainTypeASY 1.770723 0.378740 4.675 0.00000294 ***
## ChestPainTypeTA 0.584068 0.593111 0.985 0.324745
## RestingBP 0.003665 0.006633 0.553 0.580589
## Cholesterol -0.003931 0.001245 -3.156 0.001597 **
## FastingBS1 0.877243 0.309374 2.836 0.004575 **
## RestingECGST -0.145835 0.332853 -0.438 0.661288
## RestingECGLVH 0.042329 0.318417 0.133 0.894243
## MaxHR -0.002407 0.005767 -0.417 0.676391
## ExerciseAnginaY 0.981308 0.285845 3.433 0.000597 ***
## Oldpeak 0.275234 0.136395 2.018 0.043600 *
## ST_SlopeFlat 2.598182 0.294012 8.837 < 0.0000000000000002 ***
## ST_SlopeDown 1.398246 0.523420 2.671 0.007554 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 943.49 on 687 degrees of freedom
## Residual deviance: 440.10 on 672 degrees of freedom
## AIC: 472.1
##
## Number of Fisher Scoring iterations: 5
Computing McFadden's R^2 for the model. The value of 0.5335 is high,
indicating that the model fits the data very well and has strong
predictive power.
pscl::pR2(Heart_mod)["McFadden"]
## fitting null model for pseudo-r2
## McFadden
## 0.5335352
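The same value can be recovered by hand from the deviances in the model summary, since for a binary-response glm McFadden's R^2 equals 1 - (residual deviance / null deviance):
### manual check: should reproduce ~0.5335
1 - Heart_mod$deviance / Heart_mod$null.deviance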
Variable Importance
Higher values indicate greater importance and closely match the
significant variables in the model summary above.
varImp(Heart_mod)
## Overall
## Age 1.2981875
## SexF 4.4213769
## ChestPainTypeNAP 0.0061325
## ChestPainTypeASY 4.6753023
## ChestPainTypeTA 0.9847530
## RestingBP 0.5525242
## Cholesterol 3.1564154
## FastingBS1 2.8355456
## RestingECGST 0.4381362
## RestingECGLVH 0.1329373
## MaxHR 0.4173930
## ExerciseAnginaY 3.4330036
## Oldpeak 2.0179154
## ST_SlopeFlat 8.8369857
## ST_SlopeDown 2.6713631
Correlation among the numeric variables
Heart %>% keep(is.numeric) %>% cor() %>% corrplot(method = "pie")

Checking for multicollinearity; the GVIF values below are all close to 1, so multicollinearity is not a concern
vif(Heart_mod)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.299780 1 1.140079
## Sex 1.114065 1 1.055493
## ChestPainType 1.345008 3 1.050641
## RestingBP 1.124971 1 1.060647
## Cholesterol 1.203686 1 1.097126
## FastingBS 1.083395 1 1.040863
## RestingECG 1.264973 2 1.060523
## MaxHR 1.369727 1 1.170353
## ExerciseAngina 1.234974 1 1.111294
## Oldpeak 1.381056 1 1.175183
## ST_Slope 1.574567 2 1.120186
Calculating the predicted probability of HeartDisease
test_pred <- predict(Heart_mod, Heart_test, type = "response")
Setting the probability cutoff for predicting HeartDisease
test_pred <- ifelse(test_pred >= 0.5, 1, 0)
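The 0.5 cutoff is the conventional choice; as an exploratory sketch (not part of the original analysis), accuracy at a few alternative thresholds could be compared like this:
### accuracy at a few alternative cutoffs (exploratory sketch)
sapply(c(0.3, 0.4, 0.5, 0.6), function(cut) {
  pred <- ifelse(predict(Heart_mod, Heart_test, type = "response") >= cut, 1, 0)
  mean(pred == Heart_test$HeartDisease)  # the factor is compared by its labels
})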
Model diagnostics: the prediction table for the test data
test_pred_table <- table(Heart_test$HeartDisease, test_pred)
test_pred_table
## test_pred
## 0 1
## 0 90 18
## 1 14 108
Accuracy computed from the confusion matrix
sum(diag(test_pred_table)) / nrow(Heart_test)
## [1] 0.8608696
Calculating the Sensitivity
test_pred <- as.factor(test_pred)
### sensitivity
sensitivity(Heart_test$HeartDisease, test_pred)
## [1] 0.8653846
Calculating the Specificity
### specificity
specificity(Heart_test$HeartDisease, test_pred)
## [1] 0.8571429
Classification Error
### Classification Error
misclassification_Error <- 1 - sum(diag(test_pred_table)) / nrow(Heart_test)
misclassification_Error
## [1] 0.1391304
Calculating the area under the ROC curve (AUC)
test_pred <- as.numeric(test_pred)
auc(Heart_test$HeartDisease, test_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8593
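Note that the AUC above is computed from the hard 0/1 labels rather than the underlying probabilities, which understates the model's discrimination; a sketch using the raw predicted probabilities instead (assuming the pROC package, which the messages above suggest is in use):
### ROC curve and AUC from raw probabilities (assumes pROC)
test_prob <- predict(Heart_mod, Heart_test, type = "response")
roc_obj <- roc(Heart_test$HeartDisease, test_prob)
plot(roc_obj)
auc(roc_obj)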
Now applying the K-NN algorithm to the same data
library(fastDummies)
### load the data
Heart <- read_csv("heart_kag.csv", col_types = "nffnnffnfnff")
Normalize the numeric variables
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
Heart$Age <- normalize(Heart$Age)
Heart$RestingBP <- normalize(Heart$RestingBP)
Heart$Cholesterol <- normalize(Heart$Cholesterol)
Heart$MaxHR <- normalize(Heart$MaxHR)
Heart$Oldpeak <- normalize(Heart$Oldpeak)
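The five assignments above can be written more compactly with dplyr's across() (a sketch; the result is assumed equivalent):
### equivalent one-liner with dplyr
Heart <- Heart %>% mutate(across(where(is.numeric), normalize))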
Convert to data.frame
Heart <- as.data.frame(Heart)
### select the response variable from the data
Heart_label <- Heart %>% select(HeartDisease)
### now remove the response from the feature data
Heart <- Heart %>% select(-HeartDisease)
Create dummy variables.
Heart <- dummy_cols(Heart)
Drop the original factor columns, keeping only their dummy-coded
versions.
Heart <- Heart %>% select(-Sex,-ChestPainType, -FastingBS, -RestingECG,-ST_Slope, -ExerciseAngina)
dim(Heart)
## [1] 918 21
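As an aside, fastDummies can drop the source columns in the same call via its remove_selected_columns argument (check that your installed version supports it), replacing the dummy_cols() and select() steps above:
### one-step alternative in recent fastDummies versions
Heart <- dummy_cols(Heart, remove_selected_columns = TRUE)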
Splitting the data for training and testing
set.seed(1234)
sample_Heart <- sample(nrow(Heart), round(nrow(Heart)* .75), replace = FALSE)
Heart_train <- Heart[sample_Heart, ]
Heart_test <- Heart[-sample_Heart, ]
### change the response variable to factors after splitting
Heart_train_label <- as.factor(Heart_label[sample_Heart, ])
Heart_test_label <- as.factor(Heart_label[-sample_Heart, ])
table(Heart_train_label)
## Heart_train_label
## 0 1
## 302 386
Building the K-NN model for the data
library(class)
Heart_pred <- knn(train = Heart_train, test = Heart_test, cl = Heart_train_label, k = 26)
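The starting value k = 26 likely follows the common rule of thumb k ≈ sqrt(n) for the training set size:
### rule-of-thumb starting point for k
round(sqrt(nrow(Heart_train)))   # 26 for the 688-row training set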
Evaluating the model
### evaluating the model
Heart_pred_eval <- table(Heart_test_label, Heart_pred)
Heart_pred_eval
## Heart_pred
## Heart_test_label 0 1
## 0 89 19
## 1 16 106
Checking the accuracy with the confusion matrix
### accuracy
sum(diag(Heart_pred_eval)) / nrow(Heart_test)
## [1] 0.8478261
Improving the model by changing the value of k
Heart_Pred2 <- knn(train = Heart_train, test = Heart_test, cl = Heart_train_label, k = 10)
Heart_pred_eval2 <- table(Heart_test_label, Heart_Pred2)
Heart_pred_eval2
## Heart_Pred2
## Heart_test_label 0 1
## 0 89 19
## 1 15 107
sum(diag(Heart_pred_eval2)) / nrow(Heart_test)
## [1] 0.8521739
Checking the new accuracy
### use caret library for confusion matrix
confusionMatrix(Heart_Pred2, Heart_test_label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 15
## 1 19 107
##
## Accuracy : 0.8522
## 95% CI : (0.7996, 0.8954)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.7026
##
## Mcnemar's Test P-Value : 0.6069
##
## Sensitivity : 0.8241
## Specificity : 0.8770
## Pos Pred Value : 0.8558
## Neg Pred Value : 0.8492
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4522
## Balanced Accuracy : 0.8506
##
## 'Positive' Class : 0
##
Define a range of k values to find the optimum value of k.
k_selection <- c(1, 3, 5, 7, 9, 15, 25, 26, 27, 29, 31, 33, 40)
### Initialize a list to store confusion matrices
k_results <- list()
### Loop through k values and store the results
for (k in k_selection) {
knn_pred <- knn(train = Heart_train, test = Heart_test, cl = Heart_train_label, k = k)
k_results[[paste0("k = ", k)]] <- confusionMatrix(knn_pred, Heart_test_label)
}
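One caveat: class::knn() breaks ties at random, so results for even values of k can vary between runs. A reproducible version of the sweep (a sketch; the seed value is arbitrary) would re-seed first:
### reproducible version of the sweep: seed first, then loop
set.seed(1234)
k_results <- list()
for (k in k_selection) {
  knn_pred <- knn(train = Heart_train, test = Heart_test, cl = Heart_train_label, k = k)
  k_results[[paste0("k = ", k)]] <- confusionMatrix(knn_pred, Heart_test_label)
}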
The results for the various values of k:
k_results
## $`k = 1`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 84 28
## 1 24 94
##
## Accuracy : 0.7739
## 95% CI : (0.7143, 0.8263)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : 0.00000000000001867
##
## Kappa : 0.5471
##
## Mcnemar's Test P-Value : 0.6774
##
## Sensitivity : 0.7778
## Specificity : 0.7705
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.7966
## Prevalence : 0.4696
## Detection Rate : 0.3652
## Detection Prevalence : 0.4870
## Balanced Accuracy : 0.7741
##
## 'Positive' Class : 0
##
##
## $`k = 3`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 83 19
## 1 25 103
##
## Accuracy : 0.8087
## 95% CI : (0.7518, 0.8574)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6147
##
## Mcnemar's Test P-Value : 0.451
##
## Sensitivity : 0.7685
## Specificity : 0.8443
## Pos Pred Value : 0.8137
## Neg Pred Value : 0.8047
## Prevalence : 0.4696
## Detection Rate : 0.3609
## Detection Prevalence : 0.4435
## Balanced Accuracy : 0.8064
##
## 'Positive' Class : 0
##
##
## $`k = 5`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 84 17
## 1 24 105
##
## Accuracy : 0.8217
## 95% CI : (0.766, 0.8689)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6408
##
## Mcnemar's Test P-Value : 0.3487
##
## Sensitivity : 0.7778
## Specificity : 0.8607
## Pos Pred Value : 0.8317
## Neg Pred Value : 0.8140
## Prevalence : 0.4696
## Detection Rate : 0.3652
## Detection Prevalence : 0.4391
## Balanced Accuracy : 0.8192
##
## 'Positive' Class : 0
##
##
## $`k = 7`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 16
## 1 22 106
##
## Accuracy : 0.8348
## 95% CI : (0.7804, 0.8804)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6673
##
## Mcnemar's Test P-Value : 0.4173
##
## Sensitivity : 0.7963
## Specificity : 0.8689
## Pos Pred Value : 0.8431
## Neg Pred Value : 0.8281
## Prevalence : 0.4696
## Detection Rate : 0.3739
## Detection Prevalence : 0.4435
## Balanced Accuracy : 0.8326
##
## 'Positive' Class : 0
##
##
## $`k = 9`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 16
## 1 19 106
##
## Accuracy : 0.8478
## 95% CI : (0.7948, 0.8917)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.694
##
## Mcnemar's Test P-Value : 0.7353
##
## Sensitivity : 0.8241
## Specificity : 0.8689
## Pos Pred Value : 0.8476
## Neg Pred Value : 0.8480
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4565
## Balanced Accuracy : 0.8465
##
## 'Positive' Class : 0
##
##
## $`k = 15`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 17
## 1 17 105
##
## Accuracy : 0.8522
## 95% CI : (0.7996, 0.8954)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.7032
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8426
## Specificity : 0.8607
## Pos Pred Value : 0.8426
## Neg Pred Value : 0.8607
## Prevalence : 0.4696
## Detection Rate : 0.3957
## Detection Prevalence : 0.4696
## Balanced Accuracy : 0.8516
##
## 'Positive' Class : 0
##
##
## $`k = 25`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 17
## 1 19 105
##
## Accuracy : 0.8435
## 95% CI : (0.79, 0.8879)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6855
##
## Mcnemar's Test P-Value : 0.8676
##
## Sensitivity : 0.8241
## Specificity : 0.8607
## Pos Pred Value : 0.8396
## Neg Pred Value : 0.8468
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4609
## Balanced Accuracy : 0.8424
##
## 'Positive' Class : 0
##
##
## $`k = 26`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 16
## 1 19 106
##
## Accuracy : 0.8478
## 95% CI : (0.7948, 0.8917)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.694
##
## Mcnemar's Test P-Value : 0.7353
##
## Sensitivity : 0.8241
## Specificity : 0.8689
## Pos Pred Value : 0.8476
## Neg Pred Value : 0.8480
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4565
## Balanced Accuracy : 0.8465
##
## 'Positive' Class : 0
##
##
## $`k = 27`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 15
## 1 19 107
##
## Accuracy : 0.8522
## 95% CI : (0.7996, 0.8954)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.7026
##
## Mcnemar's Test P-Value : 0.6069
##
## Sensitivity : 0.8241
## Specificity : 0.8770
## Pos Pred Value : 0.8558
## Neg Pred Value : 0.8492
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4522
## Balanced Accuracy : 0.8506
##
## 'Positive' Class : 0
##
##
## $`k = 29`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 15
## 1 19 107
##
## Accuracy : 0.8522
## 95% CI : (0.7996, 0.8954)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.7026
##
## Mcnemar's Test P-Value : 0.6069
##
## Sensitivity : 0.8241
## Specificity : 0.8770
## Pos Pred Value : 0.8558
## Neg Pred Value : 0.8492
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4522
## Balanced Accuracy : 0.8506
##
## 'Positive' Class : 0
##
##
## $`k = 31`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 16
## 1 19 106
##
## Accuracy : 0.8478
## 95% CI : (0.7948, 0.8917)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.694
##
## Mcnemar's Test P-Value : 0.7353
##
## Sensitivity : 0.8241
## Specificity : 0.8689
## Pos Pred Value : 0.8476
## Neg Pred Value : 0.8480
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4565
## Balanced Accuracy : 0.8465
##
## 'Positive' Class : 0
##
##
## $`k = 33`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 16
## 1 19 106
##
## Accuracy : 0.8478
## 95% CI : (0.7948, 0.8917)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.694
##
## Mcnemar's Test P-Value : 0.7353
##
## Sensitivity : 0.8241
## Specificity : 0.8689
## Pos Pred Value : 0.8476
## Neg Pred Value : 0.8480
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4565
## Balanced Accuracy : 0.8465
##
## 'Positive' Class : 0
##
##
## $`k = 40`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 17
## 1 20 105
##
## Accuracy : 0.8391
## 95% CI : (0.7851, 0.8841)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6765
##
## Mcnemar's Test P-Value : 0.7423
##
## Sensitivity : 0.8148
## Specificity : 0.8607
## Pos Pred Value : 0.8381
## Neg Pred Value : 0.8400
## Prevalence : 0.4696
## Detection Rate : 0.3826
## Detection Prevalence : 0.4565
## Balanced Accuracy : 0.8377
##
## 'Positive' Class : 0
##
### Extract accuracy for each value of k
k_accuracy <- sapply(k_results, function(x) x$overall['Accuracy'])
### Create a data frame for plotting
accuracy_dataframe <- data.frame(k = k_selection, Accuracy = k_accuracy)
Plot the graph for k_accuracy.
### Plot the graph for k_accuracy
ggplot(accuracy_dataframe, aes(x = k, y = Accuracy)) +
geom_line() +
geom_point(color = "red") +
labs(title = "K-NN Accuracy vs. k selections", x = "k selections", y = "Accuracy") +
theme_dark()

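To read the best values of k off programmatically rather than from the plot (a sketch):
### k values achieving the highest accuracy
accuracy_dataframe[accuracy_dataframe$Accuracy == max(accuracy_dataframe$Accuracy), ]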
In conclusion, the logistic regression model achieved an accuracy of
86.09%, while the best K-NN accuracy was 85.22%, reached at k = 15, 27,
and 29; increasing k beyond that did not improve the model. We can
therefore conclude that logistic regression is the better classification
algorithm for this data.