We apply both logistic regression and the K-Nearest Neighbors (K-NN) algorithm to see which performs better on this dataset.
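
The code below calls functions from several packages that are assumed to have been loaded beforehand (the original chunks do not show the setup); a minimal setup sketch:

### packages assumed to be loaded for the analysis below
library(tidyverse)    # read_csv(), dplyr, ggplot2, purrr::keep(), tidyr::gather()
library(psych)        # describe()
library(caret)        # varImp(), sensitivity(), specificity(), confusionMatrix()
library(pscl)         # pR2() (also called as pscl::pR2 below)
library(car)          # vif()
library(corrplot)     # corrplot()
library(pROC)         # auc()
library(class)        # knn() (also loaded again later)
library(fastDummies)  # dummy_cols() (also loaded again later)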

Load the data

Heart <- read_csv("heart_kag.csv", col_types = "nffnnffnfnff")
Heart
## # A tibble: 918 × 12
##      Age Sex   ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
##    <dbl> <fct> <fct>             <dbl>       <dbl> <fct>     <fct>      <dbl>
##  1    40 M     ATA                 140         289 0         Normal       172
##  2    49 F     NAP                 160         180 0         Normal       156
##  3    37 M     ATA                 130         283 0         ST            98
##  4    48 F     ASY                 138         214 0         Normal       108
##  5    54 M     NAP                 150         195 0         Normal       122
##  6    39 M     NAP                 120         339 0         Normal       170
##  7    45 F     ATA                 130         237 0         Normal       170
##  8    54 M     ATA                 110         208 0         Normal       142
##  9    37 M     ASY                 140         207 0         Normal       130
## 10    48 F     ATA                 120         284 0         Normal       120
## # ℹ 908 more rows
## # ℹ 4 more variables: ExerciseAngina <fct>, Oldpeak <dbl>, ST_Slope <fct>,
## #   HeartDisease <fct>
head(Heart)
## # A tibble: 6 × 12
##     Age Sex   ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
##   <dbl> <fct> <fct>             <dbl>       <dbl> <fct>     <fct>      <dbl>
## 1    40 M     ATA                 140         289 0         Normal       172
## 2    49 F     NAP                 160         180 0         Normal       156
## 3    37 M     ATA                 130         283 0         ST            98
## 4    48 F     ASY                 138         214 0         Normal       108
## 5    54 M     NAP                 150         195 0         Normal       122
## 6    39 M     NAP                 120         339 0         Normal       170
## # ℹ 4 more variables: ExerciseAngina <fct>, Oldpeak <dbl>, ST_Slope <fct>,
## #   HeartDisease <fct>

Exploratory Data Analysis

glimpse(Heart)
## Rows: 918
## Columns: 12
## $ Age            <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex            <fct> M, F, M, F, M, M, F, M, M, F, F, M, M, M, F, F, M, F, M…
## $ ChestPainType  <fct> ATA, NAP, ATA, ASY, NAP, NAP, ATA, ATA, ASY, ATA, NAP, …
## $ RestingBP      <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol    <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG     <fct> Normal, Normal, ST, Normal, Normal, Normal, Normal, Nor…
## $ MaxHR          <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <fct> N, N, N, Y, N, N, N, N, Y, N, N, Y, N, Y, N, N, N, N, N…
## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope       <fct> Up, Flat, Up, Flat, Up, Up, Up, Up, Flat, Up, Up, Flat,…
## $ HeartDisease   <fct> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…

We can find the number of rows and columns and inspect the structure of the data

dim(Heart)
## [1] 918  12
###
str(Heart)
## spc_tbl_ [918 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age           : num [1:918] 40 49 37 48 54 39 45 54 37 48 ...
##  $ Sex           : Factor w/ 2 levels "M","F": 1 2 1 2 1 1 2 1 1 2 ...
##  $ ChestPainType : Factor w/ 4 levels "ATA","NAP","ASY",..: 1 2 1 3 2 2 1 1 3 1 ...
##  $ RestingBP     : num [1:918] 140 160 130 138 150 120 130 110 140 120 ...
##  $ Cholesterol   : num [1:918] 289 180 283 214 195 339 237 208 207 284 ...
##  $ FastingBS     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ RestingECG    : Factor w/ 3 levels "Normal","ST",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ MaxHR         : num [1:918] 172 156 98 108 122 170 170 142 130 120 ...
##  $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
##  $ Oldpeak       : num [1:918] 0 1 0 1.5 0 0 0 0 1.5 0 ...
##  $ ST_Slope      : Factor w/ 3 levels "Up","Flat","Down": 1 2 1 2 1 1 1 1 2 1 ...
##  $ HeartDisease  : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 2 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_number(),
##   ..   Sex = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   ChestPainType = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   RestingBP = col_number(),
##   ..   Cholesterol = col_number(),
##   ..   FastingBS = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   RestingECG = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   MaxHR = col_number(),
##   ..   ExerciseAngina = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   Oldpeak = col_number(),
##   ..   ST_Slope = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   HeartDisease = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
##   .. )
##  - attr(*, "problems")=<externalptr>

Description of the data

describe(Heart)
##                 vars   n   mean     sd median trimmed   mad  min   max range
## Age                1 918  53.51   9.43   54.0   53.71 10.38 28.0  77.0  49.0
## Sex*               2 918   1.21   0.41    1.0    1.14  0.00  1.0   2.0   1.0
## ChestPainType*     3 918   2.45   0.85    3.0    2.50  0.00  1.0   4.0   3.0
## RestingBP          4 918 132.40  18.51  130.0  131.50 14.83  0.0 200.0 200.0
## Cholesterol        5 918 198.80 109.38  223.0  204.41 68.20  0.0 603.0 603.0
## FastingBS*         6 918   1.23   0.42    1.0    1.17  0.00  1.0   2.0   1.0
## RestingECG*        7 918   1.60   0.81    1.0    1.51  0.00  1.0   3.0   2.0
## MaxHR              8 918 136.81  25.46  138.0  137.23 26.69 60.0 202.0 142.0
## ExerciseAngina*    9 918   1.40   0.49    1.0    1.38  0.00  1.0   2.0   1.0
## Oldpeak           10 918   0.89   1.07    0.6    0.74  0.89 -2.6   6.2   8.8
## ST_Slope*         11 918   1.64   0.61    2.0    1.59  0.00  1.0   3.0   2.0
## HeartDisease*     12 918   1.55   0.50    2.0    1.57  0.00  1.0   2.0   1.0
##                  skew kurtosis   se
## Age             -0.20    -0.40 0.31
## Sex*             1.42     0.02 0.01
## ChestPainType*  -0.52    -0.75 0.03
## RestingBP        0.18     3.23 0.61
## Cholesterol     -0.61     0.10 3.61
## FastingBS*       1.26    -0.41 0.01
## RestingECG*      0.84    -0.95 0.03
## MaxHR           -0.14    -0.46 0.84
## ExerciseAngina*  0.39    -1.85 0.02
## Oldpeak          1.02     1.18 0.04
## ST_Slope*        0.38    -0.67 0.02
## HeartDisease*   -0.21    -1.96 0.02

The total count of missing values in each column

###
sum(is.na(Heart))
## [1] 0
sapply(Heart, function(x) sum(is.na(x)))
##            Age            Sex  ChestPainType      RestingBP    Cholesterol 
##              0              0              0              0              0 
##      FastingBS     RestingECG          MaxHR ExerciseAngina        Oldpeak 
##              0              0              0              0              0 
##       ST_Slope   HeartDisease 
##              0              0
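
For reference, the same per-column counts can be obtained with a base R one-liner:

### equivalent base R one-liner
colSums(is.na(Heart))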

Summary of the data (no missing values were found, so no further cleaning is needed)

summary(Heart)
##       Age        Sex     ChestPainType   RestingBP      Cholesterol   
##  Min.   :28.00   M:725   ATA:173       Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:47.00   F:193   NAP:203       1st Qu.:120.0   1st Qu.:173.2  
##  Median :54.00           ASY:496       Median :130.0   Median :223.0  
##  Mean   :53.51           TA : 46       Mean   :132.4   Mean   :198.8  
##  3rd Qu.:60.00                         3rd Qu.:140.0   3rd Qu.:267.0  
##  Max.   :77.00                         Max.   :200.0   Max.   :603.0  
##  FastingBS  RestingECG      MaxHR       ExerciseAngina    Oldpeak       
##  0:704     Normal:552   Min.   : 60.0   N:547          Min.   :-2.6000  
##  1:214     ST    :178   1st Qu.:120.0   Y:371          1st Qu.: 0.0000  
##            LVH   :188   Median :138.0                  Median : 0.6000  
##                         Mean   :136.8                  Mean   : 0.8874  
##                         3rd Qu.:156.0                  3rd Qu.: 1.5000  
##                         Max.   :202.0                  Max.   : 6.2000  
##  ST_Slope   HeartDisease
##  Up  :395   0:410       
##  Flat:460   1:508       
##  Down: 63               
##                         
##                         
## 

Looking at the summary of the factor variables

Heart %>%  keep(is.factor) %>% summary()
##  Sex     ChestPainType FastingBS  RestingECG  ExerciseAngina ST_Slope  
##  M:725   ATA:173       0:704     Normal:552   N:547          Up  :395  
##  F:193   NAP:203       1:214     ST    :178   Y:371          Flat:460  
##          ASY:496                 LVH   :188                  Down: 63  
##          TA : 46                                                       
##  HeartDisease
##  0:410       
##  1:508       
##              
## 

Looking at the summary of the numeric variables

Heart %>%  keep(is.numeric) %>% summary()
##       Age          RestingBP      Cholesterol        MaxHR      
##  Min.   :28.00   Min.   :  0.0   Min.   :  0.0   Min.   : 60.0  
##  1st Qu.:47.00   1st Qu.:120.0   1st Qu.:173.2   1st Qu.:120.0  
##  Median :54.00   Median :130.0   Median :223.0   Median :138.0  
##  Mean   :53.51   Mean   :132.4   Mean   :198.8   Mean   :136.8  
##  3rd Qu.:60.00   3rd Qu.:140.0   3rd Qu.:267.0   3rd Qu.:156.0  
##  Max.   :77.00   Max.   :200.0   Max.   :603.0   Max.   :202.0  
##     Oldpeak       
##  Min.   :-2.6000  
##  1st Qu.: 0.0000  
##  Median : 0.6000  
##  Mean   : 0.8874  
##  3rd Qu.: 1.5000  
##  Max.   : 6.2000

Data Visualization

### looking at the histograms of the numeric variables
Heart %>% keep(is.numeric) %>% gather() %>% ggplot(aes(value, color = key)) + 
  facet_wrap(~ key, scales = "free") + geom_histogram(binwidth = 10)

Boxplot with age according to sex

The median age of males appears to be higher than that of females

Heart %>% ggplot(aes(Sex, Age, color = Sex)) + geom_boxplot()  

Boxplot with Cholesterol according to sex

Females appear to have a higher median cholesterol, with a few outliers for both sexes

Heart %>% ggplot(aes(Sex, Cholesterol, color = Sex)) + geom_boxplot() 

Across all chest pain types, females have a higher median cholesterol, while the IQR of ASY for males is widely spread.

Heart %>% ggplot(aes(ChestPainType, Cholesterol, color = Sex)) + geom_boxplot() 

Frequency polygon of Age by ChestPainType

Heart %>% ggplot(aes(Age, color = ChestPainType)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Splitting the data for training and testing samples

##names(Heart)
set.seed(1234)

sample_index <- sample(nrow(Heart), round(nrow(Heart) *.75), replace = FALSE)

Heart_train <- Heart[sample_index, ]
Heart_test <- Heart[-sample_index, ]
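
It is worth confirming that the outcome is reasonably balanced in both splits; a quick check (output omitted here):

### quick check of the class balance in each split
table(Heart_train$HeartDisease)
table(Heart_test$HeartDisease)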

Building the logistic regression model

Heart_mod <- glm(data = Heart_train, family = binomial, formula = HeartDisease ~.)

summary(Heart_mod)
## 
## Call:
## glm(formula = HeartDisease ~ ., family = binomial, data = Heart_train)
## 
## Coefficients:
##                   Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)      -3.031247   1.553377  -1.951             0.051010 .  
## Age               0.020260   0.015606   1.298             0.194223    
## SexF             -1.419177   0.320981  -4.421           0.00000981 ***
## ChestPainTypeNAP  0.002548   0.415528   0.006             0.995107    
## ChestPainTypeASY  1.770723   0.378740   4.675           0.00000294 ***
## ChestPainTypeTA   0.584068   0.593111   0.985             0.324745    
## RestingBP         0.003665   0.006633   0.553             0.580589    
## Cholesterol      -0.003931   0.001245  -3.156             0.001597 ** 
## FastingBS1        0.877243   0.309374   2.836             0.004575 ** 
## RestingECGST     -0.145835   0.332853  -0.438             0.661288    
## RestingECGLVH     0.042329   0.318417   0.133             0.894243    
## MaxHR            -0.002407   0.005767  -0.417             0.676391    
## ExerciseAnginaY   0.981308   0.285845   3.433             0.000597 ***
## Oldpeak           0.275234   0.136395   2.018             0.043600 *  
## ST_SlopeFlat      2.598182   0.294012   8.837 < 0.0000000000000002 ***
## ST_SlopeDown      1.398246   0.523420   2.671             0.007554 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 943.49  on 687  degrees of freedom
## Residual deviance: 440.10  on 672  degrees of freedom
## AIC: 472.1
## 
## Number of Fisher Scoring iterations: 5

Computing McFadden’s R^2 for the model.

A value of 0.5335352 is relatively high, indicating that the model fits the data well and has good predictive power.

pscl::pR2(Heart_mod)["McFadden"]
## fitting null model for pseudo-r2
##  McFadden 
## 0.5335352
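
As a sanity check, McFadden's R^2 can also be computed by hand as 1 minus the ratio of the fitted model's log-likelihood to that of an intercept-only model; a minimal sketch that should reproduce the value above:

### McFadden's pseudo R^2 by hand: 1 - logLik(model) / logLik(null model)
null_mod <- glm(HeartDisease ~ 1, data = Heart_train, family = binomial)
1 - as.numeric(logLik(Heart_mod)) / as.numeric(logLik(null_mod))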

Variable Importance

Higher values indicate greater importance and closely match the significant variables in the model summary

varImp(Heart_mod)
##                    Overall
## Age              1.2981875
## SexF             4.4213769
## ChestPainTypeNAP 0.0061325
## ChestPainTypeASY 4.6753023
## ChestPainTypeTA  0.9847530
## RestingBP        0.5525242
## Cholesterol      3.1564154
## FastingBS1       2.8355456
## RestingECGST     0.4381362
## RestingECGLVH    0.1329373
## MaxHR            0.4173930
## ExerciseAnginaY  3.4330036
## Oldpeak          2.0179154
## ST_SlopeFlat     8.8369857
## ST_SlopeDown     2.6713631
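
For a glm, caret's varImp() appears to report the absolute value of each coefficient's Wald z statistic, which is why the numbers above mirror the z values in the model summary; this can be checked directly (a small sketch):

### the same values, pulled straight from the coefficient table
abs(coef(summary(Heart_mod))[-1, "z value"])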

Correlation among the numeric variables

Heart %>% keep(is.numeric) %>% cor() %>%  corrplot(method = "pie")

Checking for multicollinearity. All GVIF^(1/(2*Df)) values are close to 1, so multicollinearity is not a concern.

vif(Heart_mod)
##                    GVIF Df GVIF^(1/(2*Df))
## Age            1.299780  1        1.140079
## Sex            1.114065  1        1.055493
## ChestPainType  1.345008  3        1.050641
## RestingBP      1.124971  1        1.060647
## Cholesterol    1.203686  1        1.097126
## FastingBS      1.083395  1        1.040863
## RestingECG     1.264973  2        1.060523
## MaxHR          1.369727  1        1.170353
## ExerciseAngina 1.234974  1        1.111294
## Oldpeak        1.381056  1        1.175183
## ST_Slope       1.574567  2        1.120186

Calculating the predicted probability of HeartDisease for the test data

test_pred <- predict(Heart_mod, Heart_test, type = "response")

Applying a probability cutoff of 0.5 to classify HeartDisease

test_pred <- ifelse(test_pred >= 0.5, 1, 0)
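
The 0.5 cutoff is a common default; other thresholds could be compared on the same predicted probabilities if desired (a quick sketch, not part of the original analysis):

### (sketch) test-set accuracy over a small grid of cutoffs
probs <- predict(Heart_mod, Heart_test, type = "response")
sapply(seq(0.3, 0.7, by = 0.1), function(ct)
  mean(ifelse(probs >= ct, "1", "0") == as.character(Heart_test$HeartDisease)))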

Prediction table for the test data (rows are the actual classes, columns the predictions)

test_pred_table <- table(Heart_test$HeartDisease, test_pred)
test_pred_table
##    test_pred
##       0   1
##   0  90  18
##   1  14 108

Accuracy computed from the confusion matrix

sum(diag(test_pred_table)) / nrow(Heart_test)
## [1] 0.8608696

Calculating the Sensitivity

test_pred <- as.factor(test_pred)
### sensitivity

sensitivity(Heart_test$HeartDisease, test_pred)
## [1] 0.8653846

Calculating the Specificity

### specificity

specificity(Heart_test$HeartDisease, test_pred)
## [1] 0.8571429

Classification Error

### Classification Error

misclassification_Error <- 1 - sum(diag(test_pred_table)) / nrow(Heart_test)

misclassification_Error
## [1] 0.1391304

Calculating the area under the curve (AUC)

test_pred <- as.numeric(test_pred)



auc(Heart_test$HeartDisease, test_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8593
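
The full ROC curve can also be drawn from the raw probabilities rather than the 0/1 predictions; a minimal sketch, assuming pROC is loaded:

### (sketch) ROC curve from the fitted probabilities
roc_obj <- roc(Heart_test$HeartDisease, predict(Heart_mod, Heart_test, type = "response"))
plot(roc_obj, print.auc = TRUE)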

Now applying the K-NN algorithm to the same data

library(fastDummies)
### load the data
Heart <- read_csv("heart_kag.csv", col_types = "nffnnffnfnff")

Normalize the numeric variables

normalize <- function(x){
  return((x - min(x)) / (max(x) - min(x)))
}

Heart$Age  <- normalize(Heart$Age)
Heart$RestingBP <- normalize(Heart$RestingBP)
Heart$Cholesterol <- normalize(Heart$Cholesterol)
Heart$MaxHR  <- normalize(Heart$MaxHR)
Heart$Oldpeak <- normalize(Heart$Oldpeak)
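
The five assignments above can equivalently be written in one step with dplyr's across() (an alternative, not meant to be run in addition):

### equivalent one-step normalization with dplyr
Heart <- Heart %>%
  mutate(across(c(Age, RestingBP, Cholesterol, MaxHR, Oldpeak), normalize))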

Convert to data.frame

Heart <- as.data.frame(Heart)
### select the response variable from the data

Heart_label <- Heart %>% select(HeartDisease)
### now remove the response variable from the feature set

Heart <- Heart %>% select(-HeartDisease)

Create dummy variables.

Heart <- dummy_cols(Heart)

Remove the original factor columns, keeping the numeric features and the dummy columns (5 numeric + 16 dummies = 21 features)

Heart <- Heart %>% select(-Sex,-ChestPainType, -FastingBS, -RestingECG,-ST_Slope, -ExerciseAngina)
dim(Heart)
## [1] 918  21

Splitting the data for training and testing

set.seed(1234)
sample_Heart <- sample(nrow(Heart), round(nrow(Heart)* .75), replace = FALSE)

Heart_train <- Heart[sample_Heart, ]
Heart_test <- Heart[-sample_Heart, ]
### change the response variable to factors after splitting
Heart_train_label  <- as.factor(Heart_label[sample_Heart, ])
Heart_test_label <- as.factor(Heart_label[-sample_Heart, ])
table(Heart_train_label)
## Heart_train_label
##   0   1 
## 302 386

Building the K-NN model for the data, using k = 26 (approximately the square root of the 688 training observations, a common rule of thumb)

library(class)
Heart_pred <- knn(train = Heart_train, test = Heart_test, cl = Heart_train_label, k= 26)

Evaluating the model

### evaluating the model

Heart_pred_eval <- table(Heart_test_label, Heart_pred)
Heart_pred_eval
##                 Heart_pred
## Heart_test_label   0   1
##                0  89  19
##                1  16 106

Checking the accuracy from the confusion matrix

### accuracy

sum(diag(Heart_pred_eval)) / nrow(Heart_test)
## [1] 0.8478261

Improving the model by changing the value of k

Heart_Pred2 <- knn(train = Heart_train, test = Heart_test, cl = Heart_train_label, k= 10)

Heart_pred_eval2 <- table(Heart_test_label, Heart_Pred2)
Heart_pred_eval2
##                 Heart_Pred2
## Heart_test_label   0   1
##                0  89  19
##                1  15 107
sum(diag(Heart_pred_eval2)) / nrow(Heart_test)
## [1] 0.8521739

Checking the new accuracy with caret's confusionMatrix()

### use caret library for confusion matrix

confusionMatrix(Heart_Pred2, Heart_test_label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  89  15
##          1  19 107
##                                              
##                Accuracy : 0.8522             
##                  95% CI : (0.7996, 0.8954)   
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.7026             
##                                              
##  Mcnemar's Test P-Value : 0.6069             
##                                              
##             Sensitivity : 0.8241             
##             Specificity : 0.8770             
##          Pos Pred Value : 0.8558             
##          Neg Pred Value : 0.8492             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3870             
##    Detection Prevalence : 0.4522             
##       Balanced Accuracy : 0.8506             
##                                              
##        'Positive' Class : 0                  
## 

Define a range of k values to find the optimum value of k.

k_selection <- c(1, 3, 5, 7, 9, 15, 25, 26, 27, 29, 31, 33,  40)
### Initialize a list to store confusion matrices
k_results <- list()
### Loop through k values and store the results
for (k in k_selection) {
  knn_pred <- knn(train = Heart_train, test = Heart_test, cl = Heart_train_label, k = k)
  k_results[[paste0("k = ", k)]] <- confusionMatrix(knn_pred, Heart_test_label)
}
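
An alternative to the manual loop would be to let caret tune k with cross-validation on the training data; a minimal sketch over the same grid (not run here):

### (sketch) 10-fold cross-validated tuning of k over the same grid
knn_cv <- train(x = Heart_train, y = Heart_train_label,
                method = "knn",
                trControl = trainControl(method = "cv", number = 10),
                tuneGrid = data.frame(k = k_selection))
knn_cv$bestTune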

Showing the results for the various values of k

k_results
## $`k = 1`
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 84 28
##          1 24 94
##                                              
##                Accuracy : 0.7739             
##                  95% CI : (0.7143, 0.8263)   
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : 0.00000000000001867
##                                              
##                   Kappa : 0.5471             
##                                              
##  Mcnemar's Test P-Value : 0.6774             
##                                              
##             Sensitivity : 0.7778             
##             Specificity : 0.7705             
##          Pos Pred Value : 0.7500             
##          Neg Pred Value : 0.7966             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3652             
##    Detection Prevalence : 0.4870             
##       Balanced Accuracy : 0.7741             
##                                              
##        'Positive' Class : 0                  
##                                              
## 
## $`k = 3`
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  83  19
##          1  25 103
##                                              
##                Accuracy : 0.8087             
##                  95% CI : (0.7518, 0.8574)   
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.6147             
##                                              
##  Mcnemar's Test P-Value : 0.451              
##                                              
##             Sensitivity : 0.7685             
##             Specificity : 0.8443             
##          Pos Pred Value : 0.8137             
##          Neg Pred Value : 0.8047             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3609             
##    Detection Prevalence : 0.4435             
##       Balanced Accuracy : 0.8064             
##                                              
##        'Positive' Class : 0                  
##                                              
## 
## $`k = 5`
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  84  17
##          1  24 105
##                                              
##                Accuracy : 0.8217             
##                  95% CI : (0.766, 0.8689)    
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.6408             
##                                              
##  Mcnemar's Test P-Value : 0.3487             
##                                              
##             Sensitivity : 0.7778             
##             Specificity : 0.8607             
##          Pos Pred Value : 0.8317             
##          Neg Pred Value : 0.8140             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3652             
##    Detection Prevalence : 0.4391             
##       Balanced Accuracy : 0.8192             
##                                              
##        'Positive' Class : 0                  
##                                              
## 
## $`k = 7`
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  86  16
##          1  22 106
##                                              
##                Accuracy : 0.8348             
##                  95% CI : (0.7804, 0.8804)   
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.6673             
##                                              
##  Mcnemar's Test P-Value : 0.4173             
##                                              
##             Sensitivity : 0.7963             
##             Specificity : 0.8689             
##          Pos Pred Value : 0.8431             
##          Neg Pred Value : 0.8281             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3739             
##    Detection Prevalence : 0.4435             
##       Balanced Accuracy : 0.8326             
##                                              
##        'Positive' Class : 0                  
##                                              
## 
## $`k = 9`
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  89  16
##          1  19 106
##                                              
##                Accuracy : 0.8478             
##                  95% CI : (0.7948, 0.8917)   
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.694              
##                                              
##  Mcnemar's Test P-Value : 0.7353             
##                                              
##             Sensitivity : 0.8241             
##             Specificity : 0.8689             
##          Pos Pred Value : 0.8476             
##          Neg Pred Value : 0.8480             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3870             
##    Detection Prevalence : 0.4565             
##       Balanced Accuracy : 0.8465             
##                                              
##        'Positive' Class : 0                  
##                                              
## 
## $`k = 15`
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  91  17
##          1  17 105
##                                              
##                Accuracy : 0.8522             
##                  95% CI : (0.7996, 0.8954)   
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.7032             
##                                              
##  Mcnemar's Test P-Value : 1                  
##                                              
##             Sensitivity : 0.8426             
##             Specificity : 0.8607             
##          Pos Pred Value : 0.8426             
##          Neg Pred Value : 0.8607             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3957             
##    Detection Prevalence : 0.4696             
##       Balanced Accuracy : 0.8516             
##                                              
##        'Positive' Class : 0                  
##                                              
## 
## $`k = 25`
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  89  17
##          1  19 105
##                                              
##                Accuracy : 0.8435             
##                  95% CI : (0.79, 0.8879)     
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.6855             
##                                              
##  Mcnemar's Test P-Value : 0.8676             
##                                              
##             Sensitivity : 0.8241             
##             Specificity : 0.8607             
##          Pos Pred Value : 0.8396             
##          Neg Pred Value : 0.8468             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3870             
##    Detection Prevalence : 0.4609             
##       Balanced Accuracy : 0.8424             
##                                              
##        'Positive' Class : 0                  
##                                              
## 
## $`k = 26`
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  89  16
##          1  19 106
##                                              
##                Accuracy : 0.8478             
##                  95% CI : (0.7948, 0.8917)   
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.694              
##                                              
##  Mcnemar's Test P-Value : 0.7353             
##                                              
##             Sensitivity : 0.8241             
##             Specificity : 0.8689             
##          Pos Pred Value : 0.8476             
##          Neg Pred Value : 0.8480             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3870             
##    Detection Prevalence : 0.4565             
##       Balanced Accuracy : 0.8465             
##                                              
##        'Positive' Class : 0                  
##                                              
## 
## $`k = 27`
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  89  15
##          1  19 107
##                                              
##                Accuracy : 0.8522             
##                  95% CI : (0.7996, 0.8954)   
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.7026             
##                                              
##  Mcnemar's Test P-Value : 0.6069             
##                                              
##             Sensitivity : 0.8241             
##             Specificity : 0.8770             
##          Pos Pred Value : 0.8558             
##          Neg Pred Value : 0.8492             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3870             
##    Detection Prevalence : 0.4522             
##       Balanced Accuracy : 0.8506             
##                                              
##        'Positive' Class : 0                  
##                                              
## 
## $`k = 29`
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  89  15
##          1  19 107
##                                              
##                Accuracy : 0.8522             
##                  95% CI : (0.7996, 0.8954)   
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.7026             
##                                              
##  Mcnemar's Test P-Value : 0.6069             
##                                              
##             Sensitivity : 0.8241             
##             Specificity : 0.8770             
##          Pos Pred Value : 0.8558             
##          Neg Pred Value : 0.8492             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3870             
##    Detection Prevalence : 0.4522             
##       Balanced Accuracy : 0.8506             
##                                              
##        'Positive' Class : 0                  
##                                              
## 
## $`k = 31`
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  89  16
##          1  19 106
##                                              
##                Accuracy : 0.8478             
##                  95% CI : (0.7948, 0.8917)   
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.694              
##                                              
##  Mcnemar's Test P-Value : 0.7353             
##                                              
##             Sensitivity : 0.8241             
##             Specificity : 0.8689             
##          Pos Pred Value : 0.8476             
##          Neg Pred Value : 0.8480             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3870             
##    Detection Prevalence : 0.4565             
##       Balanced Accuracy : 0.8465             
##                                              
##        'Positive' Class : 0                  
##                                              
## 
## $`k = 33`
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  89  16
##          1  19 106
##                                              
##                Accuracy : 0.8478             
##                  95% CI : (0.7948, 0.8917)   
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.694              
##                                              
##  Mcnemar's Test P-Value : 0.7353             
##                                              
##             Sensitivity : 0.8241             
##             Specificity : 0.8689             
##          Pos Pred Value : 0.8476             
##          Neg Pred Value : 0.8480             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3870             
##    Detection Prevalence : 0.4565             
##       Balanced Accuracy : 0.8465             
##                                              
##        'Positive' Class : 0                  
##                                              
## 
## $`k = 40`
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  88  17
##          1  20 105
##                                              
##                Accuracy : 0.8391             
##                  95% CI : (0.7851, 0.8841)   
##     No Information Rate : 0.5304             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.6765             
##                                              
##  Mcnemar's Test P-Value : 0.7423             
##                                              
##             Sensitivity : 0.8148             
##             Specificity : 0.8607             
##          Pos Pred Value : 0.8381             
##          Neg Pred Value : 0.8400             
##              Prevalence : 0.4696             
##          Detection Rate : 0.3826             
##    Detection Prevalence : 0.4565             
##       Balanced Accuracy : 0.8377             
##                                              
##        'Positive' Class : 0                  
## 
### Extract accuracy for each value of k
k_accuracy <- sapply(k_results, function(x) x$overall['Accuracy'])
###  Create a data frame for plotting
accuracy_dataframe <- data.frame(k = k_selection, Accuracy = k_accuracy)

Plot the graph for k_accuracy.

### Plot the graph for k_accuracy


ggplot(accuracy_dataframe, aes(x = k, y = Accuracy)) +
  geom_line() +
  geom_point(color = "red") +
  labs(title = "K-NN Accuracy vs. k selections", x = "k selections", y = "Accuracy") +
  theme_dark()
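
The best-performing k can also be read off programmatically:

### k with the highest test-set accuracy in the grid
accuracy_dataframe[which.max(accuracy_dataframe$Accuracy), ]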

In conclusion, the logistic regression model achieved an accuracy of 86.09%, while the best K-NN accuracy was 85.22% (at k = 15, 27, and 29), and increasing k further did not improve the model. Logistic regression is therefore the better classification algorithm for this dataset.