We apply both logistic regression and the K-Nearest Neighbors (K-NN)
algorithm to see which best fits this dataset.
Load the data
Heart <- read_csv("heart_kag.csv", col_types = "nffnnffnfnff")
Heart
## # A tibble: 918 × 12
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## <dbl> <fct> <fct> <dbl> <dbl> <fct> <fct> <dbl>
## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## 7 45 F ATA 130 237 0 Normal 170
## 8 54 M ATA 110 208 0 Normal 142
## 9 37 M ASY 140 207 0 Normal 130
## 10 48 F ATA 120 284 0 Normal 120
## # ℹ 908 more rows
## # ℹ 4 more variables: ExerciseAngina <fct>, Oldpeak <dbl>, ST_Slope <fct>,
## # HeartDisease <fct>
head(Heart)
## # A tibble: 6 × 12
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## <dbl> <fct> <fct> <dbl> <dbl> <fct> <fct> <dbl>
## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## # ℹ 4 more variables: ExerciseAngina <fct>, Oldpeak <dbl>, ST_Slope <fct>,
## # HeartDisease <fct>
Exploratory Data Analysis
glimpse(Heart)
## Rows: 918
## Columns: 12
## $ Age <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex <fct> M, F, M, F, M, M, F, M, M, F, F, M, M, M, F, F, M, F, M…
## $ ChestPainType <fct> ATA, NAP, ATA, ASY, NAP, NAP, ATA, ATA, ASY, ATA, NAP, …
## $ RestingBP <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG <fct> Normal, Normal, ST, Normal, Normal, Normal, Normal, Nor…
## $ MaxHR <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <fct> N, N, N, Y, N, N, N, N, Y, N, N, Y, N, Y, N, N, N, N, N…
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope <fct> Up, Flat, Up, Flat, Up, Up, Up, Up, Flat, Up, Up, Flat,…
## $ HeartDisease <fct> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…
We can find the number of rows and columns
dim(Heart)
## [1] 918 12
### structure of the data
str(Heart)
## spc_tbl_ [918 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : num [1:918] 40 49 37 48 54 39 45 54 37 48 ...
## $ Sex : Factor w/ 2 levels "M","F": 1 2 1 2 1 1 2 1 1 2 ...
## $ ChestPainType : Factor w/ 4 levels "ATA","NAP","ASY",..: 1 2 1 3 2 2 1 1 3 1 ...
## $ RestingBP : num [1:918] 140 160 130 138 150 120 130 110 140 120 ...
## $ Cholesterol : num [1:918] 289 180 283 214 195 339 237 208 207 284 ...
## $ FastingBS : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ RestingECG : Factor w/ 3 levels "Normal","ST",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ MaxHR : num [1:918] 172 156 98 108 122 170 170 142 130 120 ...
## $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
## $ Oldpeak : num [1:918] 0 1 0 1.5 0 0 0 0 1.5 0 ...
## $ ST_Slope : Factor w/ 3 levels "Up","Flat","Down": 1 2 1 2 1 1 1 1 2 1 ...
## $ HeartDisease : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 2 1 ...
## - attr(*, "spec")=
## .. cols(
## .. Age = col_number(),
## .. Sex = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. ChestPainType = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. RestingBP = col_number(),
## .. Cholesterol = col_number(),
## .. FastingBS = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. RestingECG = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. MaxHR = col_number(),
## .. ExerciseAngina = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. Oldpeak = col_number(),
## .. ST_Slope = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. HeartDisease = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
## .. )
## - attr(*, "problems")=<externalptr>
Description of the data with describe() from the psych package
describe(Heart)
## vars n mean sd median trimmed mad min max range
## Age 1 918 53.51 9.43 54.0 53.71 10.38 28.0 77.0 49.0
## Sex* 2 918 1.21 0.41 1.0 1.14 0.00 1.0 2.0 1.0
## ChestPainType* 3 918 2.45 0.85 3.0 2.50 0.00 1.0 4.0 3.0
## RestingBP 4 918 132.40 18.51 130.0 131.50 14.83 0.0 200.0 200.0
## Cholesterol 5 918 198.80 109.38 223.0 204.41 68.20 0.0 603.0 603.0
## FastingBS* 6 918 1.23 0.42 1.0 1.17 0.00 1.0 2.0 1.0
## RestingECG* 7 918 1.60 0.81 1.0 1.51 0.00 1.0 3.0 2.0
## MaxHR 8 918 136.81 25.46 138.0 137.23 26.69 60.0 202.0 142.0
## ExerciseAngina* 9 918 1.40 0.49 1.0 1.38 0.00 1.0 2.0 1.0
## Oldpeak 10 918 0.89 1.07 0.6 0.74 0.89 -2.6 6.2 8.8
## ST_Slope* 11 918 1.64 0.61 2.0 1.59 0.00 1.0 3.0 2.0
## HeartDisease* 12 918 1.55 0.50 2.0 1.57 0.00 1.0 2.0 1.0
## skew kurtosis se
## Age -0.20 -0.40 0.31
## Sex* 1.42 0.02 0.01
## ChestPainType* -0.52 -0.75 0.03
## RestingBP 0.18 3.23 0.61
## Cholesterol -0.61 0.10 3.61
## FastingBS* 1.26 -0.41 0.01
## RestingECG* 0.84 -0.95 0.03
## MaxHR -0.14 -0.46 0.84
## ExerciseAngina* 0.39 -1.85 0.02
## Oldpeak 1.02 1.18 0.04
## ST_Slope* 0.38 -0.67 0.02
## HeartDisease* -0.21 -1.96 0.02
The total count of missing values in each column
### total missing values
sum(is.na(Heart))
## [1] 0
sapply(Heart, function(x) sum(is.na(x)))
## Age Sex ChestPainType RestingBP Cholesterol
## 0 0 0 0 0
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## 0 0 0 0 0
## ST_Slope HeartDisease
## 0 0
Summary of the data (no cleaning was needed, as there are no missing values)
summary(Heart)
## Age Sex ChestPainType RestingBP Cholesterol
## Min. :28.00 M:725 ATA:173 Min. : 0.0 Min. : 0.0
## 1st Qu.:47.00 F:193 NAP:203 1st Qu.:120.0 1st Qu.:173.2
## Median :54.00 ASY:496 Median :130.0 Median :223.0
## Mean :53.51 TA : 46 Mean :132.4 Mean :198.8
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:267.0
## Max. :77.00 Max. :200.0 Max. :603.0
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## 0:704 Normal:552 Min. : 60.0 N:547 Min. :-2.6000
## 1:214 ST :178 1st Qu.:120.0 Y:371 1st Qu.: 0.0000
## LVH :188 Median :138.0 Median : 0.6000
## Mean :136.8 Mean : 0.8874
## 3rd Qu.:156.0 3rd Qu.: 1.5000
## Max. :202.0 Max. : 6.2000
## ST_Slope HeartDisease
## Up :395 0:410
## Flat:460 1:508
## Down: 63
##
##
##
Looking at the factor summary
Heart %>% keep(is.factor) %>% summary()
## Sex ChestPainType FastingBS RestingECG ExerciseAngina ST_Slope
## M:725 ATA:173 0:704 Normal:552 N:547 Up :395
## F:193 NAP:203 1:214 ST :178 Y:371 Flat:460
## ASY:496 LVH :188 Down: 63
## TA : 46
## HeartDisease
## 0:410
## 1:508
##
##
Looking at the numeric summary
Heart %>% keep(is.numeric) %>% summary()
## Age RestingBP Cholesterol MaxHR
## Min. :28.00 Min. : 0.0 Min. : 0.0 Min. : 60.0
## 1st Qu.:47.00 1st Qu.:120.0 1st Qu.:173.2 1st Qu.:120.0
## Median :54.00 Median :130.0 Median :223.0 Median :138.0
## Mean :53.51 Mean :132.4 Mean :198.8 Mean :136.8
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:267.0 3rd Qu.:156.0
## Max. :77.00 Max. :200.0 Max. :603.0 Max. :202.0
## Oldpeak
## Min. :-2.6000
## 1st Qu.: 0.0000
## Median : 0.6000
## Mean : 0.8874
## 3rd Qu.: 1.5000
## Max. : 6.2000
Data Visualization
### looking at the histogram
Heart %>% keep(is.numeric) %>% gather() %>% ggplot(aes(value, color = key)) +
  facet_wrap(~ key, scales = "free") + geom_histogram(binwidth = 10)

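As an aside, gather() has been superseded in current tidyr; an equivalent plot with pivot_longer() (a sketch; the output is assumed identical):
### equivalent plot with the newer pivot_longer()
Heart %>% keep(is.numeric) %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  ggplot(aes(value, color = key)) +
  facet_wrap(~ key, scales = "free") + geom_histogram(binwidth = 10)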
Boxplot of Age by Sex
Boxplot of Cholesterol by Sex
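The code behind these two boxplots is not shown in the output above; a minimal ggplot2 sketch that would produce them:
### sketch of the boxplot code (not shown above)
Heart %>% ggplot(aes(Sex, Age, fill = Sex)) + geom_boxplot()
Heart %>% ggplot(aes(Sex, Cholesterol, fill = Sex)) + geom_boxplot()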
Frequency polygon for ChestPainType
Heart %>% ggplot(aes(Age, color = ChestPainType)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Splitting the data for training and testing samples
set.seed(1234)
sample_index <- sample(nrow(Heart), round(nrow(Heart) *.75), replace = FALSE)
Heart_train <- Heart[sample_index, ]
Heart_test <- Heart[-sample_index, ]
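As a quick sanity check (a sketch; output not shown), the class balance of the outcome should be similar across the two splits:
### compare the outcome distribution across the splits
prop.table(table(Heart_train$HeartDisease))
prop.table(table(Heart_test$HeartDisease))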
Building the logistic regression model
Heart_mod <- glm(data = Heart_train, family = binomial, formula = HeartDisease ~.)
summary(Heart_mod)
##
## Call:
## glm(formula = HeartDisease ~ ., family = binomial, data = Heart_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.031247 1.553377 -1.951 0.051010 .
## Age 0.020260 0.015606 1.298 0.194223
## SexF -1.419177 0.320981 -4.421 0.00000981 ***
## ChestPainTypeNAP 0.002548 0.415528 0.006 0.995107
## ChestPainTypeASY 1.770723 0.378740 4.675 0.00000294 ***
## ChestPainTypeTA 0.584068 0.593111 0.985 0.324745
## RestingBP 0.003665 0.006633 0.553 0.580589
## Cholesterol -0.003931 0.001245 -3.156 0.001597 **
## FastingBS1 0.877243 0.309374 2.836 0.004575 **
## RestingECGST -0.145835 0.332853 -0.438 0.661288
## RestingECGLVH 0.042329 0.318417 0.133 0.894243
## MaxHR -0.002407 0.005767 -0.417 0.676391
## ExerciseAnginaY 0.981308 0.285845 3.433 0.000597 ***
## Oldpeak 0.275234 0.136395 2.018 0.043600 *
## ST_SlopeFlat 2.598182 0.294012 8.837 < 0.0000000000000002 ***
## ST_SlopeDown 1.398246 0.523420 2.671 0.007554 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 943.49 on 687 degrees of freedom
## Residual deviance: 440.10 on 672 degrees of freedom
## AIC: 472.1
##
## Number of Fisher Scoring iterations: 5
Computing McFadden's R^2 for the model. The value of 0.5335 is high,
indicating that the model fits the data very well and has strong
predictive power.
pscl::pR2(Heart_mod)["McFadden"]
## fitting null model for pseudo-r2
## McFadden
## 0.5335352
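The same value can be recovered by hand from the deviances in the model summary, since for a binary-response glm McFadden's R^2 equals 1 - (residual deviance / null deviance):
### manual check: should reproduce ~0.5335
1 - Heart_mod$deviance / Heart_mod$null.deviance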
Variable Importance
Higher values indicate greater importance and closely match the
significant variables in the model summary above.
varImp(Heart_mod)
## Overall
## Age 1.2981875
## SexF 4.4213769
## ChestPainTypeNAP 0.0061325
## ChestPainTypeASY 4.6753023
## ChestPainTypeTA 0.9847530
## RestingBP 0.5525242
## Cholesterol 3.1564154
## FastingBS1 2.8355456
## RestingECGST 0.4381362
## RestingECGLVH 0.1329373
## MaxHR 0.4173930
## ExerciseAnginaY 3.4330036
## Oldpeak 2.0179154
## ST_SlopeFlat 8.8369857
## ST_SlopeDown 2.6713631
Correlation among the numeric variables
Heart %>% keep(is.numeric) %>% cor() %>% corrplot(method = "pie")

Checking for multicollinearity; the GVIF values below are all close to 1, so multicollinearity is not a concern
vif(Heart_mod)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.299780 1 1.140079
## Sex 1.114065 1 1.055493
## ChestPainType 1.345008 3 1.050641
## RestingBP 1.124971 1 1.060647
## Cholesterol 1.203686 1 1.097126
## FastingBS 1.083395 1 1.040863
## RestingECG 1.264973 2 1.060523
## MaxHR 1.369727 1 1.170353
## ExerciseAngina 1.234974 1 1.111294
## Oldpeak 1.381056 1 1.175183
## ST_Slope 1.574567 2 1.120186
Calculating the predicted probability of HeartDisease
test_pred <- predict(Heart_mod, Heart_test, type = "response")
Setting the probability cutoff for predicting HeartDisease
test_pred <- ifelse(test_pred >= 0.5, 1, 0)
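The 0.5 cutoff is the conventional choice; as an exploratory sketch (not part of the original analysis), accuracy at a few alternative thresholds could be compared like this:
### accuracy at a few alternative cutoffs (exploratory sketch)
sapply(c(0.3, 0.4, 0.5, 0.6), function(cut) {
  pred <- ifelse(predict(Heart_mod, Heart_test, type = "response") >= cut, 1, 0)
  mean(pred == Heart_test$HeartDisease)  # the factor is compared by its labels
})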
Model diagnostics: the prediction table for the test data
test_pred_table <- table(Heart_test$HeartDisease, test_pred)
test_pred_table
## test_pred
## 0 1
## 0 90 18
## 1 14 108
Accuracy computed from the confusion matrix
sum(diag(test_pred_table)) / nrow(Heart_test)
## [1] 0.8608696
Calculating the Sensitivity
test_pred <- as.factor(test_pred)
### sensitivity
sensitivity(Heart_test$HeartDisease, test_pred)
## [1] 0.8653846
Calculating the Specificity
### specificity
specificity(Heart_test$HeartDisease, test_pred)
## [1] 0.8571429
Classification Error
### Classification Error
misclassification_Error <- 1 - sum(diag(test_pred_table)) / nrow(Heart_test)
misclassification_Error
## [1] 0.1391304
Calculating the area under the ROC curve (AUC)
test_pred <- as.numeric(test_pred)
auc(Heart_test$HeartDisease, test_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8593
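Note that the AUC above is computed from the hard 0/1 labels rather than the underlying probabilities, which understates the model's discrimination; a sketch using the raw predicted probabilities instead (assuming the pROC package, which the messages above suggest is in use):
### ROC curve and AUC from raw probabilities (assumes pROC)
test_prob <- predict(Heart_mod, Heart_test, type = "response")
roc_obj <- roc(Heart_test$HeartDisease, test_prob)
plot(roc_obj)
auc(roc_obj)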
Now applying the K-NN algorithm to the same data
library(fastDummies)
### load the data
Heart <- read_csv("heart_kag.csv", col_types = "nffnnffnfnff")
Normalize the numeric variables
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
Heart$Age <- normalize(Heart$Age)
Heart$RestingBP <- normalize(Heart$RestingBP)
Heart$Cholesterol <- normalize(Heart$Cholesterol)
Heart$MaxHR <- normalize(Heart$MaxHR)
Heart$Oldpeak <- normalize(Heart$Oldpeak)
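The five assignments above can be written more compactly with dplyr's across() (a sketch; the result is assumed equivalent):
### equivalent one-liner with dplyr
Heart <- Heart %>% mutate(across(where(is.numeric), normalize))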
Convert to data.frame
Heart <- as.data.frame(Heart)
### select the response variable from the data
Heart_label <- Heart %>% select(HeartDisease)
### now remove the response from the feature data
Heart <- Heart %>% select(-HeartDisease)
Create dummy variables.
Heart <- dummy_cols(Heart)
Drop the original factor columns, keeping only their dummy-coded
versions.
Heart <- Heart %>% select(-Sex,-ChestPainType, -FastingBS, -RestingECG,-ST_Slope, -ExerciseAngina)
dim(Heart)
## [1] 918 21
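As an aside, fastDummies can drop the source columns in the same call via its remove_selected_columns argument (check that your installed version supports it), replacing the dummy_cols() and select() steps above:
### one-step alternative in recent fastDummies versions
Heart <- dummy_cols(Heart, remove_selected_columns = TRUE)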
Splitting the data for training and testing
set.seed(1234)
sample_Heart <- sample(nrow(Heart), round(nrow(Heart)* .75), replace = FALSE)
Heart_train <- Heart[sample_Heart, ]
Heart_test <- Heart[-sample_Heart, ]
### change the response variable to factors after splitting
Heart_train_label <- as.factor(Heart_label[sample_Heart, ])
Heart_test_label <- as.factor(Heart_label[-sample_Heart, ])
table(Heart_train_label)
## Heart_train_label
## 0 1
## 302 386
Building the K-NN model for the data
library(class)
Heart_pred <- knn(train = Heart_train, test = Heart_test, cl = Heart_train_label, k = 26)
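The starting value k = 26 likely follows the common rule of thumb k ≈ sqrt(n) for the training set size:
### rule-of-thumb starting point for k
round(sqrt(nrow(Heart_train)))   # 26 for the 688-row training set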
Evaluating the model
### evaluating the model
Heart_pred_eval <- table(Heart_test_label, Heart_pred)
Heart_pred_eval
## Heart_pred
## Heart_test_label 0 1
## 0 89 19
## 1 16 106
Checking the accuracy with the confusion matrix
### accuracy
sum(diag(Heart_pred_eval)) / nrow(Heart_test)
## [1] 0.8478261
Improving the model by changing the value of k
Heart_Pred2 <- knn(train = Heart_train, test = Heart_test, cl = Heart_train_label, k = 10)
Heart_pred_eval2 <- table(Heart_test_label, Heart_Pred2)
Heart_pred_eval2
## Heart_Pred2
## Heart_test_label 0 1
## 0 89 19
## 1 15 107
sum(diag(Heart_pred_eval2)) / nrow(Heart_test)
## [1] 0.8521739
Checking the new accuracy
### use caret library for confusion matrix
confusionMatrix(Heart_Pred2, Heart_test_label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 15
## 1 19 107
##
## Accuracy : 0.8522
## 95% CI : (0.7996, 0.8954)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.7026
##
## Mcnemar's Test P-Value : 0.6069
##
## Sensitivity : 0.8241
## Specificity : 0.8770
## Pos Pred Value : 0.8558
## Neg Pred Value : 0.8492
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4522
## Balanced Accuracy : 0.8506
##
## 'Positive' Class : 0
##
Define a range of k values to find the optimum value of k.
k_selection <- c(1, 3, 5, 7, 9, 15, 25, 26, 27, 29, 31, 33, 40)
### Initialize a list to store confusion matrices
k_results <- list()
### Loop through k values and store the results
for (k in k_selection) {
knn_pred <- knn(train = Heart_train, test = Heart_test, cl = Heart_train_label, k = k)
k_results[[paste0("k = ", k)]] <- confusionMatrix(knn_pred, Heart_test_label)
}
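One caveat: class::knn() breaks ties at random, so results for even values of k can vary between runs. A reproducible version of the sweep (a sketch; the seed value is arbitrary) would re-seed first:
### reproducible version of the sweep: seed first, then loop
set.seed(1234)
k_results <- list()
for (k in k_selection) {
  knn_pred <- knn(train = Heart_train, test = Heart_test, cl = Heart_train_label, k = k)
  k_results[[paste0("k = ", k)]] <- confusionMatrix(knn_pred, Heart_test_label)
}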
The results for the various values of k:
k_results
## $`k = 1`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 84 28
## 1 24 94
##
## Accuracy : 0.7739
## 95% CI : (0.7143, 0.8263)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : 0.00000000000001867
##
## Kappa : 0.5471
##
## Mcnemar's Test P-Value : 0.6774
##
## Sensitivity : 0.7778
## Specificity : 0.7705
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.7966
## Prevalence : 0.4696
## Detection Rate : 0.3652
## Detection Prevalence : 0.4870
## Balanced Accuracy : 0.7741
##
## 'Positive' Class : 0
##
##
## $`k = 3`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 83 19
## 1 25 103
##
## Accuracy : 0.8087
## 95% CI : (0.7518, 0.8574)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6147
##
## Mcnemar's Test P-Value : 0.451
##
## Sensitivity : 0.7685
## Specificity : 0.8443
## Pos Pred Value : 0.8137
## Neg Pred Value : 0.8047
## Prevalence : 0.4696
## Detection Rate : 0.3609
## Detection Prevalence : 0.4435
## Balanced Accuracy : 0.8064
##
## 'Positive' Class : 0
##
##
## $`k = 5`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 84 17
## 1 24 105
##
## Accuracy : 0.8217
## 95% CI : (0.766, 0.8689)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6408
##
## Mcnemar's Test P-Value : 0.3487
##
## Sensitivity : 0.7778
## Specificity : 0.8607
## Pos Pred Value : 0.8317
## Neg Pred Value : 0.8140
## Prevalence : 0.4696
## Detection Rate : 0.3652
## Detection Prevalence : 0.4391
## Balanced Accuracy : 0.8192
##
## 'Positive' Class : 0
##
##
## $`k = 7`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 16
## 1 22 106
##
## Accuracy : 0.8348
## 95% CI : (0.7804, 0.8804)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6673
##
## Mcnemar's Test P-Value : 0.4173
##
## Sensitivity : 0.7963
## Specificity : 0.8689
## Pos Pred Value : 0.8431
## Neg Pred Value : 0.8281
## Prevalence : 0.4696
## Detection Rate : 0.3739
## Detection Prevalence : 0.4435
## Balanced Accuracy : 0.8326
##
## 'Positive' Class : 0
##
##
## $`k = 9`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 16
## 1 19 106
##
## Accuracy : 0.8478
## 95% CI : (0.7948, 0.8917)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.694
##
## Mcnemar's Test P-Value : 0.7353
##
## Sensitivity : 0.8241
## Specificity : 0.8689
## Pos Pred Value : 0.8476
## Neg Pred Value : 0.8480
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4565
## Balanced Accuracy : 0.8465
##
## 'Positive' Class : 0
##
##
## $`k = 15`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 17
## 1 17 105
##
## Accuracy : 0.8522
## 95% CI : (0.7996, 0.8954)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.7032
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8426
## Specificity : 0.8607
## Pos Pred Value : 0.8426
## Neg Pred Value : 0.8607
## Prevalence : 0.4696
## Detection Rate : 0.3957
## Detection Prevalence : 0.4696
## Balanced Accuracy : 0.8516
##
## 'Positive' Class : 0
##
##
## $`k = 25`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 17
## 1 19 105
##
## Accuracy : 0.8435
## 95% CI : (0.79, 0.8879)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6855
##
## Mcnemar's Test P-Value : 0.8676
##
## Sensitivity : 0.8241
## Specificity : 0.8607
## Pos Pred Value : 0.8396
## Neg Pred Value : 0.8468
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4609
## Balanced Accuracy : 0.8424
##
## 'Positive' Class : 0
##
##
## $`k = 26`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 16
## 1 19 106
##
## Accuracy : 0.8478
## 95% CI : (0.7948, 0.8917)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.694
##
## Mcnemar's Test P-Value : 0.7353
##
## Sensitivity : 0.8241
## Specificity : 0.8689
## Pos Pred Value : 0.8476
## Neg Pred Value : 0.8480
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4565
## Balanced Accuracy : 0.8465
##
## 'Positive' Class : 0
##
##
## $`k = 27`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 15
## 1 19 107
##
## Accuracy : 0.8522
## 95% CI : (0.7996, 0.8954)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.7026
##
## Mcnemar's Test P-Value : 0.6069
##
## Sensitivity : 0.8241
## Specificity : 0.8770
## Pos Pred Value : 0.8558
## Neg Pred Value : 0.8492
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4522
## Balanced Accuracy : 0.8506
##
## 'Positive' Class : 0
##
##
## $`k = 29`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 15
## 1 19 107
##
## Accuracy : 0.8522
## 95% CI : (0.7996, 0.8954)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.7026
##
## Mcnemar's Test P-Value : 0.6069
##
## Sensitivity : 0.8241
## Specificity : 0.8770
## Pos Pred Value : 0.8558
## Neg Pred Value : 0.8492
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4522
## Balanced Accuracy : 0.8506
##
## 'Positive' Class : 0
##
##
## $`k = 31`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 16
## 1 19 106
##
## Accuracy : 0.8478
## 95% CI : (0.7948, 0.8917)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.694
##
## Mcnemar's Test P-Value : 0.7353
##
## Sensitivity : 0.8241
## Specificity : 0.8689
## Pos Pred Value : 0.8476
## Neg Pred Value : 0.8480
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4565
## Balanced Accuracy : 0.8465
##
## 'Positive' Class : 0
##
##
## $`k = 33`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 16
## 1 19 106
##
## Accuracy : 0.8478
## 95% CI : (0.7948, 0.8917)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.694
##
## Mcnemar's Test P-Value : 0.7353
##
## Sensitivity : 0.8241
## Specificity : 0.8689
## Pos Pred Value : 0.8476
## Neg Pred Value : 0.8480
## Prevalence : 0.4696
## Detection Rate : 0.3870
## Detection Prevalence : 0.4565
## Balanced Accuracy : 0.8465
##
## 'Positive' Class : 0
##
##
## $`k = 40`
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 17
## 1 20 105
##
## Accuracy : 0.8391
## 95% CI : (0.7851, 0.8841)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6765
##
## Mcnemar's Test P-Value : 0.7423
##
## Sensitivity : 0.8148
## Specificity : 0.8607
## Pos Pred Value : 0.8381
## Neg Pred Value : 0.8400
## Prevalence : 0.4696
## Detection Rate : 0.3826
## Detection Prevalence : 0.4565
## Balanced Accuracy : 0.8377
##
## 'Positive' Class : 0
##
### Extract accuracy for each value of k
k_accuracy <- sapply(k_results, function(x) x$overall['Accuracy'])
### Create a data frame for plotting
accuracy_dataframe <- data.frame(k = k_selection, Accuracy = k_accuracy)
Plot the graph for k_accuracy.
### Plot the graph for k_accuracy
ggplot(accuracy_dataframe, aes(x = k, y = Accuracy)) +
geom_line() +
geom_point(color = "red") +
labs(title = "K-NN Accuracy vs. k selections", x = "k selections", y = "Accuracy") +
theme_dark()

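To read the best values of k off programmatically rather than from the plot (a sketch):
### k values achieving the highest accuracy
accuracy_dataframe[accuracy_dataframe$Accuracy == max(accuracy_dataframe$Accuracy), ]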
In conclusion, the logistic regression model achieved an accuracy of
86.09%, while the best K-NN accuracy was 85.22%, reached at k = 15, 27,
and 29; increasing k beyond that did not improve the model. We can
therefore conclude that logistic regression is the better classification
algorithm for this data.