Introduction

Did you know that in the United States, someone experiences a heart attack every 40 seconds? Additionally, approximately 647,000 people in the country die from heart disease each year. Heart disease is currently the leading cause of death for men, women, and individuals from various racial and ethnic backgrounds.

For our project, we aim to analyze the factors associated with heart disease and develop a model that predicts the occurrence of heart disease events. To achieve this, we will utilize the Heart Disease Data Set from the UCI Machine Learning Repository, which consists of patient data collected from Cleveland, Hungary, Switzerland, and Long Beach.

To summarize our Goals for this dataset:

Conduct a comprehensive Exploratory Data Analysis on both our independent and dependent variables.
Build machine learning models.
Utilize our models to predict heart disease events in unseen data.
Evaluate the performance of our models.
Select the best-performing model.

Data Understanding

Before conducting our Exploratory Data Analysis, we will first examine our dataset to identify any anomalies that can be rectified within it. It is crucial to recognize that the quality of insights relies heavily on the integrity of the underlying data. Consequently, ensuring that the data is clean and in a usable format is of utmost importance.

Columnwise Description

heart <- read.csv("heart.csv")
rmarkdown::paged_table(heart)

Each column in the given list corresponds to specific information in the dataset. Here’s a description of each column:

Age: Patient’s Age in years.
Sex: Patient’s Gender. (M = Male, F = Female)
ChestPainType: Chest Pain type. (4 values: ATA, NAP, ASY, TA)
RestingBP: Resting Blood Pressure. (in mm Hg)
Cholesterol: Serum Cholesterol. (in mg/dl)
FastingBS: Fasting Blood Sugar > 120 mg/dl. (1 = True, 0 = False)
RestingECG: Resting Electrocardiographic result. (values: Normal, ST, LVH)
MaxHR: Maximum Heart Rate achieved.
ExerciseAngina: Exercise induced Angina. (N = No, Y = Yes)
Oldpeak: ST Depression induced by Exercise relative to rest.
ST_Slope: Slope of the peak exercise ST segment. (values: Up, Flat, Down)
HeartDisease: Heart Disease occurred. (0 = No, 1 = Yes)

Data Wrangling

Check the Data Types and Data Content:

glimpse(heart)

## Rows: 918
## Columns: 12
## $ Age            <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex            <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType  <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP      <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol    <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG     <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR          <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope       <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease   <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…

As the output above shows, several variables are stored with a data type that does not match their content. Some changes need to be made, namely:

Change the following columns to the factor data type: Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, and ST_Slope.
Change the values of the Sex column so that “M” becomes “Male” and “F” becomes “Female”.
Change the values of the FastingBS column so that “0” becomes “< 120 mg/dl” and “1” becomes “> 120 mg/dl”.
Change the values of the ExerciseAngina column so that “N” becomes “No” and “Y” becomes “Yes”.

heart <- 
  heart %>% 
  # recode first: case_when() returns character vectors
  mutate(Sex = case_when(
    Sex == "M" ~ "Male",
    Sex == "F" ~ "Female"),
    FastingBS = case_when(
    FastingBS == 0 ~ "< 120 mg/dl",
    FastingBS == 1 ~ "> 120 mg/dl"),
    ExerciseAngina = case_when(
    ExerciseAngina == "N" ~ "No",
    ExerciseAngina == "Y" ~ "Yes")) %>% 
  # then convert the categorical columns to factors so the types stick
  mutate(across(c(Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope), as.factor))

Check for Missing Values:

anyNA(heart)

## [1] FALSE

Our dataset does not contain any missing values.
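One caveat is that anyNA() only detects values coded as NA. Clinical datasets sometimes record a missing measurement as 0 instead, so it can be worth counting zeros in columns where 0 is not a plausible value. The sketch below uses a small made-up data frame (the values are hypothetical, not taken from heart.csv) to show the idea.

```r
# Toy data with hypothetical values; 0 stands in for "not measured"
toy <- data.frame(
  RestingBP   = c(140, 0, 130, 120),
  Cholesterol = c(289, 0, 0, 214)
)

# anyNA() is FALSE because zeros are not NA
has_na <- anyNA(toy)

# Count zeros per column to spot placeholder values
zero_counts <- colSums(toy == 0)

has_na
zero_counts
```

The same colSums() call could be run on the numeric columns of heart to see whether any of them need this kind of attention.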

Check Multicollinearity Indications

We will assess the Pearson Correlation among our numerical variables to determine whether there exists any initial multicollinearity.

variable_num <- heart %>% 
  select(Age, RestingBP, Cholesterol, MaxHR, Oldpeak, HeartDisease)

format <- round(cor(variable_num),2)

get_lower_tri<-function(format){
  format[upper.tri(format)] <- NA
  return(format)
  }

get_upper_tri <- function(format){
  format[lower.tri(format)]<- NA
  return(format)
}

lower_tri <- get_lower_tri(format)

heart_cor_melt <- reshape2::melt(lower_tri, na.rm = TRUE)
heart_cor_melt <- heart_cor_melt %>% 
  mutate(toltip = glue("{Var1} ~ {Var2}"))

plot_pm <- ggplot(heart_cor_melt,aes(Var1, Var2, text = toltip)) +
  geom_tile(aes(fill = value)) +
  geom_text(aes(label = round(value, 1)), alpha=0.5, size = 3, color = "White") + 
  scale_fill_gradientn(colors = c("#A8D8E0","#428793","#27636D"),
                      values = rescale(c(-1,0,1)),
                      limits = c(-1,1)) +
  labs(x = NULL,
      y = NULL,
      fill = "Pearson Corr:") +
  theme(legend.background = element_rect(fill = "#0E264E", color = "#0E264E"),
        plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
        panel.background = element_rect(fill = "#0E264E"),
        panel.grid = element_line(colour = "#0E264E"),
        panel.grid.major.x = element_line(colour = "#0E264E"),
        panel.grid.minor.x = element_line(colour = "#0E264E"),
        legend.title = element_text(colour = "#1B5249", face ="bold", family = "Times New Roman"),
        legend.text = element_text(colour = "#1B5249", face ="bold", family = "Times New Roman"),
        legend.justification = c(1, 0),
        legend.position = c(0.6, 0.7),
        legend.direction = "horizontal",
        axis.text.x = element_text(color = "#1B5249", family = "Times New Roman",
                                angle = 45, vjust = 1, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks = element_blank()) +
        guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                  title.position = "top", title.hjust = 0.5))

ggplotly(plot_pm, tooltip = "text") %>% 
  layout(hoverlabel = list( bgcolor = "rgba(255,255,255,0.75)",
                         font = list(
                           color = "Black",
                           family = "Cardo",
                           size = 12
                         )))

Based on the correlation matrix above, we see no strong indication of multicollinearity: no pair of variables has a Pearson correlation beyond roughly ±0.4.
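Reading a heatmap by eye can miss a pair, so the same check can also be done programmatically. The following is a minimal base R sketch on made-up vectors (a, b, and c are hypothetical, and the 0.4 cut-off simply mirrors the threshold used in the text) that lists every pair whose absolute correlation exceeds the cut-off.

```r
# Hypothetical data: b is almost a multiple of a, c is unrelated to both
toy <- data.frame(
  a = 1:10,
  b = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 19.9),
  c = c(1, -1, 1, -1, 1, -1, 1, -1, 1, -1)
)

cor_mat <- cor(toy)
# Blank the upper triangle and diagonal so each pair appears once
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA

# Positions of correlations beyond the threshold (NA entries are ignored)
high <- which(abs(cor_mat) > 0.4, arr.ind = TRUE)

flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
flagged
```

In this toy example only the pair (b, a) is flagged. Applied to variable_num, an empty result would support the conclusion drawn from the plot.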

Exploratory Data Analysis

Before doing modeling, we need to look at the proportion of the target variable that we have in the HeartDisease column.

proptarget <- paste(round(prop.table((table(heart$HeartDisease)))*100,2),"%")
proptarget_df <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptarget[1], proptarget[2]))
rmarkdown::paged_table(proptarget_df)

Judging by the proportions of the two classes, the target is reasonably balanced, so we don’t need additional pre-processing (such as up-sampling or down-sampling) to balance the two classes of the target variable.

Build Models

Logistic Regression

Logistic regression is used to describe data and the relationship between one dependent variable and one or more independent variables. The independent variables can be nominal, ordinal, or of interval type.

Logistic regression returns a probability between 0 and 1 rather than a binary class. We therefore have to choose a threshold ourselves to convert that probability into a binary class.

This is the equation of a logistic regression model:

\[ \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X \]

The left-hand side is called the log-odds or logit. On the right-hand side, \(\beta_0\) is the model intercept and \(\beta_1\) is the coefficient of feature X.
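To make the link between log-odds and probability concrete, here is a small sketch that evaluates the right-hand side for a few values of X and converts the result back to a probability with the inverse logit. The intercept and coefficient are hypothetical numbers chosen for illustration, not estimates from our data.

```r
# Hypothetical intercept and coefficient (illustration only)
b0 <- -1.0
b1 <- 0.5

x <- c(0, 2, 4)

# Linear predictor: the log-odds
log_odds <- b0 + b1 * x

# Inverse logit: p = 1 / (1 + exp(-log_odds)); plogis() computes the same thing
p <- plogis(log_odds)

data.frame(x = x, log_odds = log_odds, p = round(p, 3))
```

A log-odds of 0 (here at x = 2) corresponds to a probability of exactly 0.5, which is why 0.5 is a natural default threshold.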

Cross Validation

Before modeling, our first step is to set aside data for validation. Why is this necessary? Validation lets us assess how well a machine learning algorithm predicts data it has not seen. Here we use the simplest scheme, a single hold-out split: we allocate 80% of our dataset for training our model and reserve 20% for testing its predictions, in order to evaluate the performance of the model.

heart <- heart %>% 
  mutate(HeartDisease = as.factor(HeartDisease))

RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(heart), size = nrow(heart)*0.80)
heart_train <- heart[index,] #take 80% data
heart_test <- heart[-index,] #take 20% data
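The split above is a single hold-out split. For comparison, k-fold cross validation repeats the idea k times so that every row is used for testing exactly once. The sketch below only shows how the fold indices could be built in base R; n_rows and k are hypothetical values, and the model fitting step is left as a comment.

```r
set.seed(100)
n_rows <- 20   # hypothetical dataset size
k      <- 5    # hypothetical number of folds

# Assign each row to one of k folds, in shuffled order
folds <- sample(rep(seq_len(k), length.out = n_rows))

for (i in seq_len(k)) {
  test_idx  <- which(folds == i)   # rows held out in this round
  train_idx <- which(folds != i)   # rows used for fitting
  # fit the model on train_idx and evaluate it on test_idx here
}

# Each fold holds the same number of rows
table(folds)
```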

Re-check Class-Imbalance:

prop_train <- paste(round(prop.table((table(heart_train$HeartDisease)))*100,2),"%")
prop_train_df <- data.frame(HeartDisease = c("No","Yes"), Prop = c(prop_train[1], prop_train[2]))
rmarkdown::paged_table(prop_train_df)

As with the full dataset, the classes in the training data remain roughly evenly distributed.

Modelling

Once we have divided our data, our next step is to proceed with training our models using the heart_train dataset. Initially, we will utilize all variables available as independent variables (X).

model_all <- glm(formula = HeartDisease ~ .,
             family = "binomial",
             data = heart_train)
summary(model_all)

## 
## Call:
## glm(formula = HeartDisease ~ ., family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6670  -0.3924   0.1772   0.4543   2.4893  
## 
## Coefficients:
##                        Estimate Std. Error z value    Pr(>|z|)    
## (Intercept)          -0.4576065  1.5628513  -0.293    0.769673    
## Age                   0.0163795  0.0143858   1.139    0.254874    
## SexMale               1.3481197  0.3145406   4.286 0.000018192 ***
## ChestPainTypeATA     -1.7436885  0.3785043  -4.607 0.000004089 ***
## ChestPainTypeNAP     -1.5622087  0.2945054  -5.305 0.000000113 ***
## ChestPainTypeTA      -1.6574256  0.4578414  -3.620    0.000295 ***
## RestingBP             0.0006014  0.0064614   0.093    0.925841    
## Cholesterol          -0.0045547  0.0012139  -3.752    0.000175 ***
## FastingBS> 120 mg/dl  1.3006319  0.3083736   4.218 0.000024679 ***
## RestingECGNormal     -0.2035720  0.3007997  -0.677    0.498552    
## RestingECGST         -0.1711756  0.3847466  -0.445    0.656389    
## MaxHR                -0.0085998  0.0055707  -1.544    0.122650    
## ExerciseAnginaYes     0.5911453  0.2750255   2.149    0.031601 *  
## Oldpeak               0.5434655  0.1339077   4.059 0.000049387 ***
## ST_SlopeFlat          1.8358401  0.4773591   3.846    0.000120 ***
## ST_SlopeUp           -0.4232906  0.5047039  -0.839    0.401643    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1007.44  on 733  degrees of freedom
## Residual deviance:  482.99  on 718  degrees of freedom
## AIC: 514.99
## 
## Number of Fisher Scoring iterations: 5

Fitting

After our initial modeling, we see that several predictors are not statistically significant for our target variable. Consequently, we will refit the model using stepwise selection, which adds and removes predictors to minimize the AIC.

model_both <- step(object = model_all,
                   direction = "both",
                   trace = 0)

summary(model_both)

## 
## Call:
## glm(formula = HeartDisease ~ Sex + ChestPainType + Cholesterol + 
##     FastingBS + MaxHR + ExerciseAngina + Oldpeak + ST_Slope, 
##     family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6794  -0.3886   0.1805   0.4567   2.5841  
## 
## Coefficients:
##                       Estimate Std. Error z value    Pr(>|z|)    
## (Intercept)           0.609076   0.902756   0.675    0.499876    
## SexMale               1.318423   0.313121   4.211 0.000025471 ***
## ChestPainTypeATA     -1.767439   0.374885  -4.715 0.000002422 ***
## ChestPainTypeNAP     -1.539844   0.292127  -5.271 0.000000136 ***
## ChestPainTypeTA      -1.563223   0.447991  -3.489    0.000484 ***
## Cholesterol          -0.004450   0.001179  -3.774    0.000161 ***
## FastingBS> 120 mg/dl  1.325977   0.303953   4.362 0.000012862 ***
## MaxHR                -0.010429   0.005166  -2.019    0.043499 *  
## ExerciseAnginaYes     0.594821   0.272666   2.182    0.029146 *  
## Oldpeak               0.571182   0.132291   4.318 0.000015771 ***
## ST_SlopeFlat          1.794946   0.473894   3.788    0.000152 ***
## ST_SlopeUp           -0.465411   0.499941  -0.931    0.351889    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1007.44  on 733  degrees of freedom
## Residual deviance:  485.34  on 722  degrees of freedom
## AIC: 509.34
## 
## Number of Fisher Scoring iterations: 5

Interpretation of the result above:

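Variables that increase the odds of heart disease (positive coefficients) are: SexMale, FastingBS> 120 mg/dl, ExerciseAnginaYes, Oldpeak, and ST_SlopeFlat.
Variables that reduce the odds of heart disease (negative coefficients) are: ChestPainTypeATA, ChestPainTypeNAP, ChestPainTypeTA, Cholesterol, MaxHR, and ST_SlopeUp. For factor variables, each coefficient is relative to the reference level of that factor, and ST_SlopeUp is not statistically significant.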
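Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for three coefficients copied from the model_both summary above; with the fitted model in memory, exp(coef(model_both)) would give the full set.

```r
# Coefficients copied from the model_both summary above
coefs <- c(SexMale = 1.318423, Oldpeak = 0.571182, MaxHR = -0.010429)

# Odds ratio: multiplicative change in the odds per one-unit increase
odds_ratios <- exp(coefs)

round(odds_ratios, 2)
```

An odds ratio above 1 means the odds of heart disease increase with the variable, and a value below 1 means they decrease, holding the other predictors fixed.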

Note that stepwise selection dropped Age, RestingBP, and RestingECG, and the AIC improved from 514.99 to 509.34 (lower is better).
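The AIC values can be checked by hand from the two summaries. For a logistic regression on 0/1 outcomes, the residual deviance equals minus twice the log-likelihood, so the AIC is the residual deviance plus twice the number of estimated coefficients. The helper name aic_from_deviance is made up for this sketch; the deviances and coefficient counts come from the outputs above.

```r
# AIC = residual deviance + 2 * number of coefficients
# (valid here because the response is 0/1, so deviance = -2 * logLik)
aic_from_deviance <- function(deviance, n_coef) {
  deviance + 2 * n_coef
}

aic_all  <- aic_from_deviance(482.99, 16)  # model_all: 16 coefficients
aic_both <- aic_from_deviance(485.34, 12)  # model_both: 12 coefficients

c(model_all = aic_all, model_both = aic_both)
```

The smaller model fits slightly worse (higher deviance) but pays a smaller penalty for its number of coefficients, which is why its AIC is lower.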

Predicting

Using our model_both, we will make predictions on our heart_test dataset.

heart_test$pred_risk <- predict(object = model_both,
                                newdata = heart_test,
                                type = "response")
heart_train$pred_risk <- predict(object = model_both,
                                newdata = heart_train,
                                type = "response")
plot_lr <- ggplot(heart_test, aes(x=pred_risk)) +
  geom_density(lwd=0.5, color = "white", fill = "#1B5249", adjust = 0.5) +
  labs(title = "Distribution of Probability Prediction Data", colours = "red") +
  xlab("Pred Risk") +
  ylab("Density") +
  theme(legend.position = "none",
        plot.title = element_text(color = "red", hjust = 0.5, face = "bold"),
        plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
        panel.background = element_rect(fill = "#0E264E"),
        panel.grid = element_line(colour = "white"),
        panel.grid.major.x = element_line(colour = "white"),
        panel.grid.minor.x = element_line(colour = "white"),
        axis.text.x = element_text(color = "white", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "white", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

ggarrange(plot_lr, nrow = 1, ncol = 1)

The interpretation of the graph above is that the predicted probabilities are skewed toward 1, which means more predictions favour a positive heart disease condition (Yes).

Classify our Probability Prediction Data with threshold of 0.5:

heart_test$pred_label <- ifelse(test = heart_test$pred_risk > 0.5,
                                    yes = "1",
                                    no = "0")

heart_train$pred_label <- ifelse(test = heart_train$pred_risk > 0.5,
                                    yes = "1",
                                    no = "0")

heart_test <- heart_test %>% 
  mutate(pred_label = as.factor(pred_label))

rmarkdown::paged_table(heart_test[1:20, c("HeartDisease", "pred_label", "pred_risk")])
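The 0.5 threshold is a choice rather than a fixed rule. Lowering it labels more patients as positive, which tends to raise sensitivity at the cost of more false positives. The sketch below illustrates this on made-up probabilities and labels (pred_risk_toy, actual_toy, and sens_at are hypothetical names used only here).

```r
# Hypothetical predicted probabilities and true labels
pred_risk_toy <- c(0.10, 0.35, 0.45, 0.55, 0.80, 0.95)
actual_toy    <- c(0,    0,    1,    0,    1,    1)

# Sensitivity at a given threshold: TP / (TP + FN)
sens_at <- function(threshold) {
  pred <- as.integer(pred_risk_toy > threshold)
  sum(pred == 1 & actual_toy == 1) / sum(actual_toy == 1)
}

sens_05 <- sens_at(0.5)  # two of three positives are caught
sens_04 <- sens_at(0.4)  # all three positives are caught

c(threshold_0.5 = sens_05, threshold_0.4 = sens_04)
```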

K-NN

K-nearest neighbors (K-NN) is a non-parametric, supervised learning classifier that uses proximity to classify an individual data point. The K-NN model works on the assumption that similar points can be found near one another.

K-NN measures proximity with a distance function; here, the Euclidean distance between two points x and y:

\[ d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} \]

We will classify using a common rule-of-thumb value for K: the square root of the number of observations in the training dataset.
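To illustrate how the distance drives the classification, here is a small by-hand sketch in base R. It computes one Euclidean distance, then finds the three nearest neighbours of a query point and takes a majority vote. All points and labels are made up, and the variable names are hypothetical.

```r
# Euclidean distance between two points
x <- c(1, 2)
y <- c(4, 6)
d <- sqrt(sum((x - y)^2))   # sqrt(3^2 + 4^2)

# Hypothetical training points (one per row) and their class labels
train  <- matrix(c(0, 0,
                   1, 1,
                   5, 5,
                   6, 5,
                   6, 6), ncol = 2, byrow = TRUE)
labels <- c("0", "0", "1", "1", "1")
query  <- c(5, 6)

# Distance from the query to every training point
dists <- sqrt(rowSums(sweep(train, 2, query)^2))

# Majority vote among the 3 nearest neighbours
nearest <- order(dists)[1:3]
vote    <- names(which.max(table(labels[nearest])))

d
vote
```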

Cross Validation

The predictor variables for K-NN must be in numerical data formats.

model_knn <- heart %>% 
  select(HeartDisease, Age, RestingBP, Cholesterol, MaxHR)

RNGkind(sample.kind = "Rounding")
set.seed(100)

# index sampling
index_knn <-  sample(nrow(model_knn), size = 0.8*nrow(model_knn))
  
# splitting
knn_train <- model_knn[index_knn, ]
knn_test <- model_knn[-index_knn, ]

Re-check Class-Imbalance:

prop_trainkkn <- paste(round(prop.table((table(knn_train$HeartDisease)))*100,2),"%")
prop_trainkkn_df <- data.frame(HeartDisease = c("No","Yes"), Prop = c(prop_trainkkn[1], prop_trainkkn[2]))
rmarkdown::paged_table(prop_trainkkn_df)

As demonstrated earlier, the class remains evenly distributed.

Data Pre-Processing & Scaling

For the K-NN model, we need to separate the dependent variable from the independent variables.

knn_train_x <- select(knn_train, -HeartDisease)
knn_train_y <- knn_train %>% select(HeartDisease)

knn_test_x <- select(knn_test, -HeartDisease)
knn_test_y <- knn_test %>% select(HeartDisease)

We will standardize our independent variables/predictors using Z-score normalization because it ensures that all of our data ranges are within the same scale. The equation is as follows:

\[ Z = \frac{x - \text{mean}}{\text{sd}} \]

Here’s how to do it in R.

heart_train_x_scale <- scale(x = knn_train_x)

heart_test_x_scale <- scale(x = knn_test_x,
                           center = attr(heart_train_x_scale, "scaled:center"),
                           scale  = attr(heart_train_x_scale, "scaled:scale"))
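The test set is scaled with the mean and standard deviation of the training set so that no information from the test data leaks into pre-processing. The toy sketch below (with hypothetical columns age and hr) shows what this produces for a single test row.

```r
# Hypothetical training and test predictors
train_x <- data.frame(age = c(40, 50, 60), hr = c(100, 150, 200))
test_x  <- data.frame(age = 50, hr = 175)

# Scale the training data; scale() stores the centers and scales as attributes
train_s <- scale(train_x)

# Reuse the training centers and scales on the test data
test_s <- scale(test_x,
                center = attr(train_s, "scaled:center"),
                scale  = attr(train_s, "scaled:scale"))

test_s
```

Here the training means are 50 and 150 and the standard deviations are 10 and 50, so the test row becomes 0 for age and 0.5 for hr.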

Predicting

K-NN has no separate training step, so we can proceed directly to predicting. Before that, however, we need to choose a value of K for our training data.

sqrt(nrow(knn_train))

## [1] 27.09243

As shown above, rounding the square root gives K = 27. An odd K is also convenient for a two-class problem because it avoids tied votes.
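The rule of thumb can be written as a short helper so that K stays odd whatever the training size is. This is only a sketch of the heuristic used here, not a tuned choice; n_train is taken from the 734 training rows reported in the model summaries, and the helper logic is an assumption added for illustration.

```r
n_train <- 734   # number of training rows

# Square-root rule of thumb, rounded to the nearest integer
k <- round(sqrt(n_train))

# Bump even values to the next odd number to avoid tied votes
if (k %% 2 == 0) {
  k <- k + 1
}

k
```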

Predict

knn_pred <- knn(train = heart_train_x_scale,
                     test = heart_test_x_scale,
                     cl = knn_train_y$HeartDisease, 
                     k = 27,
                     prob = T)

knn_pred_train <- knn(train = heart_train_x_scale,
                     test = heart_train_x_scale,
                     cl = knn_train_y$HeartDisease, 
                     k = 27,
                     prob = T)

Evaluation

To assess performance, we will use a confusion matrix and the ROC/AUC, which give us:

Accuracy: of all observations, the proportion the model classifies correctly. \[ \frac {TP + TN} {TP + TN + FP + FN} \]
Recall or Sensitivity: of all actual positives, the proportion the model correctly predicts as positive. \[ \frac {TP} {TP + FN} \]
Specificity: of all actual negatives, the proportion the model correctly predicts as negative. \[ \frac {TN} {TN + FP} \]
Precision: of all positive predictions, the proportion that are actually positive. \[ \frac {TP} {TP + FP} \]
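As a worked example of these formulas, the sketch below computes all four metrics from a set of hypothetical confusion-matrix counts. The counts are invented for illustration and are not the results of our models.

```r
# Hypothetical confusion-matrix counts (not results from this analysis)
TP <- 40
FN <- 10
TN <- 45
FP <- 5

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
recall      <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)

round(c(accuracy = accuracy, recall = recall,
        specificity = specificity, precision = precision), 4)
```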

Logistic Regression

Evaluation Summary

con_mat_test <- confusionMatrix(data = heart_test$pred_label, 
                               reference = heart_test$HeartDisease, 
                               positive = "1")

con_mat_train <- confusionMatrix(data = as.factor(heart_train$pred_label), 
                                 reference = heart_train$HeartDisease, 
                                 positive = "1")

performance <- cbind.data.frame(Accuracy = c(con_mat_train$overall[["Accuracy"]], con_mat_test$overall[["Accuracy"]]),
                                Recall = c(con_mat_train$byClass[["Sensitivity"]], con_mat_test$byClass[["Sensitivity"]]),
                                Precision = c(con_mat_train$byClass[["Pos Pred Value"]], con_mat_test$byClass[["Pos Pred Value"]]),
                                Specificity = c(con_mat_train$byClass[["Specificity"]], con_mat_test$byClass[["Specificity"]]))

rownames(performance) <- c("On Training Data", "On Unseen Data")

rmarkdown::paged_table(performance)

There are minimal disparities in the Accuracy, Sensitivity, Specificity, and Precision scores between our Unseen and Training Data.
This indicates that there are no signs of overfitting or underfitting in our Logistic Regression model.

Actual <- factor(c("No", "No", "Yes", "Yes"))
Predicted <- factor(c("No", "Yes", "No", "Yes"))
y      <- c(con_mat_test$table[2], con_mat_test$table[1], con_mat_test$table[4], con_mat_test$table[3])


roc <-  roc(predictor = heart_test$pred_risk,
            response = heart_test$HeartDisease,
            levels = c("0", "1"),
            percent = TRUE)
df_log <-  data.frame(Specificity=roc$specificities, Sensitivity=roc$sensitivities)



plot_log <- ggplot(data = df_log, aes(x = Specificity, y = Sensitivity))+
  with_outer_glow(geom_path(colour = '#1F6E8C', size = 1), colour = "#1B5249", sigma = 10, expand = 1)+
  scale_x_reverse() +
  geom_abline(intercept = 100, slope = 1, color="#1F6E8C", linetype = "longdash") +
  annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc$auc,1), '%'), size = 4, color = "#1B5249", family = "Times New Roman")+
  ylab('Sensitivity (%)')+
  xlab('Specificity (%)')+
  scale_fill_gradient(low = "#1B5249", high = "#1F6E8C") +
  scale_color_gradient(low = "#1B5249", high = "#1F6E8C") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
        panel.background = element_rect(fill = "#0E264E"),
        panel.grid = element_line(colour = "white"),
        panel.grid.major.x = element_line(colour = "white"),
        panel.grid.minor.x = element_line(colour = "white"),
        axis.text.x = element_text(color = "white", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "white", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

ggarrange(plot_log, nrow = 1, ncol = 1)

Our logistic regression model achieves an accuracy of 89.13% on new, unseen data, meaning it correctly classifies 89.13% of the test observations.
The sensitivity is 88.77%, meaning the model correctly identifies the majority of positive outcomes.
The specificity is 89.53%, meaning the model correctly identifies the majority of negative outcomes.
The precision is 90.62%, indicating that 90.62% of our positive predictions are correct.
The ROC curve shows strong separation, with an AUC of 93.4%. This indicates that our model is effective at distinguishing between the two classes we are targeting.
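One way to read the AUC is as the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. The sketch below computes it that way by comparing every positive and negative pair. The scores and labels are made up, and this is an illustration of the definition rather than a replacement for roc().

```r
# Hypothetical predicted risks and true labels
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0,    0,    1,    1)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Share of (positive, negative) pairs ranked correctly; ties count as half
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

auc
```

Three of the four pairs are ordered correctly here, so the AUC is 0.75. A value of 0.5 corresponds to random ranking.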

K-NN

Evaluation Summary

con_mat_knn_test <- confusionMatrix(data = knn_pred, 
                               reference = knn_test_y$HeartDisease, 
                               positive = "1")

con_mat_knn_train <- confusionMatrix(data = knn_pred_train, 
                               reference = knn_train_y$HeartDisease, 
                               positive = "1")

performanceknn <- cbind.data.frame(Accuracy = c(con_mat_knn_train$overall[["Accuracy"]], con_mat_knn_test$overall[["Accuracy"]]),
                                Sensitivity = c(con_mat_knn_train$byClass[["Sensitivity"]], con_mat_knn_test$byClass[["Sensitivity"]]),
                                Specificity = c(con_mat_knn_train$byClass[["Specificity"]], con_mat_knn_test$byClass[["Specificity"]]),
                                Precision = c(con_mat_knn_train$byClass[["Pos Pred Value"]], con_mat_knn_test$byClass[["Pos Pred Value"]]))

rownames(performanceknn) <- c("On Training Data", "On Unseen Data")

rmarkdown::paged_table(performanceknn)

There are minimal discrepancies observed in Accuracy, Sensitivity, Specificity, and Precision scores between our Unseen and Training Data.
This implies that there are no signs of Overfitting or Underfitting in our K-NN model.

Actual1 <- factor(c("No", "No", "Yes", "Yes"))
Predicted1 <- factor(c("No", "Yes", "No", "Yes"))
y1      <- c(con_mat_knn_test$table[2], con_mat_knn_test$table[1], con_mat_knn_test$table[4], con_mat_knn_test$table[3])

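# Note: the "prob" attribute returned by knn() is the vote share of the
# winning class, not the probability of class "1"
roc1 <- roc(predictor = attributes(knn_pred)$prob,
            response = knn_test_y$HeartDisease,
            levels = c("0", "1"),
            percent = TRUE)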

df_knn <- data.frame(Specificity=roc1$specificities, Sensitivity=roc1$sensitivities)



plot_knn <- ggplot(data = df_knn, aes(x = Specificity, y = Sensitivity))+
  with_outer_glow(geom_path(colour = '#1F6E8C', size = 1), colour = "#1B5249", sigma = 10, expand = 1)+
  scale_x_reverse() +
  geom_abline(intercept = 100, slope = 1, color="#1F6E8C", linetype = "longdash") +
  annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc1$auc,1), '%'), size = 4, color = "#1B5249", family = "Times New Roman")+
  ylab('Sensitivity (%)')+
  xlab('Specificity (%)')+
  scale_fill_gradient(low = "#1B5249", high = "#1F6E8C") +
  scale_color_gradient(low = "#1B5249", high = "#1F6E8C") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
        panel.background = element_rect(fill = "#0E264E"),
        panel.grid = element_line(colour = "white"),
        panel.grid.major.x = element_line(colour = "white"),
        panel.grid.minor.x = element_line(colour = "white"),
        axis.text.x = element_text(color = "white", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "white", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

ggarrange(plot_knn, nrow = 1, ncol = 1)

Our K-NN model achieves an accuracy of 74.45% on new, unseen data, meaning it correctly classifies 74.45% of the test observations.
The sensitivity is 77.55%, meaning the model correctly identifies the majority of positive outcomes.
The specificity is 70.93%, meaning the model correctly identifies the majority of negative outcomes.
The precision is 75.24%, indicating that 75.24% of our positive predictions are correct.
The ROC curve shows weak separation, with an AUC of 58.6%, only modestly above the 50% expected from random guessing. Note that the prob attribute returned by knn() is the vote share of the winning class rather than the probability of the positive class, so this AUC probably understates how well the model ranks patients.
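Since knn() reports the proportion of votes for the winning class, a score that can be used for ranking needs to be converted to the proportion of votes for class "1". The sketch below shows one way to do the conversion on made-up values (pred_class and win_share are hypothetical stand-ins for knn_pred and its prob attribute).

```r
# Hypothetical predicted classes and winning-class vote shares
pred_class <- factor(c("1", "0", "1", "0"), levels = c("0", "1"))
win_share  <- c(0.9, 0.8, 0.6, 0.7)

# Vote share for class "1": keep the share when "1" won, flip it when "0" won
p_pos <- ifelse(pred_class == "1", win_share, 1 - win_share)

p_pos
```

Passing a score like p_pos to roc() instead of the raw prob attribute would give an AUC that is comparable with the one from logistic regression; the resulting value would need to be recomputed on the actual predictions.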

Conclusion

Since we are discussing health diagnosis prediction, our goal is to minimize false negatives, that is, patients with heart disease who are predicted to be healthy. When evaluating the performance of our models, we therefore aim to select the model with the highest sensitivity score.

Of the two models we developed, the Logistic Regression model has the higher sensitivity score (88.77%, versus 77.55% for K-NN). Therefore, it can be concluded that in this scenario the Logistic Regression model is superior in classifying heart disease compared to the K-NN model.

This concludes my attempt to classify heart disease using classification models. As time progresses, I will incorporate additional models that have not yet been included in this project. Nonetheless, I sincerely hope that this study will assist individuals dealing with similar cases in the future.

A work by Ahmad Fauzi

wrk.ahmadfauzi@gmail.com

with R Language

Heart Disease Classification

Logistic Regression and K-NN Models
