Did you know that in the United States, someone experiences a heart attack every 40 seconds? Approximately 647,000 people in the country die from heart disease each year. Heart disease is currently the leading cause of death for men, women, and individuals from various racial and ethnic backgrounds.
For our project, we aim to analyze the factors associated with heart disease and develop a model that predicts the occurrence of heart disease events. To achieve this, we will utilize the Heart Disease Data Set from the UCI Machine Learning Repository, which consists of patient data collected from Cleveland, Hungary, Switzerland, and Long Beach.
To summarize our Goals for this dataset:
Before conducting our Exploratory Data Analysis, we will first examine our dataset to identify any anomalies that can be rectified within it. It is crucial to recognize that the quality of insights relies heavily on the integrity of the underlying data. Consequently, ensuring that the data is clean and in a usable format is of utmost importance.
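# Packages assumed by the code below (the original setup chunk is not shown)
invisible(lapply(c("dplyr", "glue", "ggplot2", "plotly", "scales", "class", "caret", "pROC", "ggfx", "ggpubr"), library, character.only = TRUE))
heart <- read.csv("heart.csv")
rmarkdown::paged_table(heart)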
Each column in the dataset holds the following information:
Age: Patient’s age in years.
Sex: Patient’s sex (M = Male, F = Female).
ChestPainType: Chest pain type (4 values: ATA, NAP, ASY, TA).
RestingBP: Resting blood pressure (in mm Hg).
Cholesterol: Serum cholesterol (in mg/dl).
FastingBS: Whether fasting blood sugar exceeds 120 mg/dl (1 = True, 0 = False).
RestingECG: Resting electrocardiographic result (values: Normal, ST, LVH).
MaxHR: Maximum heart rate achieved.
ExerciseAngina: Exercise-induced angina (N = No, Y = Yes).
Oldpeak: ST depression induced by exercise relative to rest.
ST_Slope: Slope of the peak exercise ST segment (values: Up, Flat, Down).
HeartDisease: Whether heart disease occurred (0 = No, 1 = Yes).
glimpse(heart)
## Rows: 918
## Columns: 12
## $ Age <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…
As the output above shows, several variables are not yet stored with an appropriate data type. The following changes are needed:
Convert these columns to the factor data type: Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, and ST_Slope.
Recode the Sex column so that “M” becomes “Male” and “F” becomes “Female”.
Recode the FastingBS column so that “0” becomes “< 120 mg/dl” and “1” becomes “> 120 mg/dl”.
Recode the ExerciseAngina column so that “N” becomes “No” and “Y” becomes “Yes”.
heart <-
  heart %>%
  # Recode first: case_when() returns character, so the factor conversion comes last
  mutate(Sex = case_when(
           Sex == "M" ~ "Male",
           Sex == "F" ~ "Female"),
         FastingBS = case_when(
           FastingBS == 0 ~ "< 120 mg/dl",
           FastingBS == 1 ~ "> 120 mg/dl"),
         ExerciseAngina = case_when(
           ExerciseAngina == "N" ~ "No",
           ExerciseAngina == "Y" ~ "Yes")) %>%
  mutate_at(vars(Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope), as.factor)
anyNA(heart)
## [1] FALSE
Our dataset does not contain any missing values.
We will assess the Pearson Correlation among our numerical variables to determine whether there exists any initial multicollinearity.
variable_num <- heart %>%
  select(Age, RestingBP, Cholesterol, MaxHR, Oldpeak, HeartDisease)
# Pearson correlation matrix, rounded to two decimals
cor_mat <- round(cor(variable_num), 2)
# Keep only the lower triangle so that each pair of variables appears once
get_lower_tri <- function(mat) {
  mat[upper.tri(mat)] <- NA
  mat
}
lower_tri <- get_lower_tri(cor_mat)
heart_cor_melt <- reshape2::melt(lower_tri, na.rm = TRUE)
heart_cor_melt <- heart_cor_melt %>%
  mutate(tooltip = glue("{Var1} ~ {Var2}"))
plot_pm <- ggplot(heart_cor_melt, aes(Var1, Var2, text = tooltip)) +
geom_tile(aes(fill = value)) +
geom_text(aes(label = round(value, 1)), alpha=0.5, size = 3, color = "White") +
scale_fill_gradientn(colors = c("#A8D8E0","#428793","#27636D"),
values = rescale(c(-1,0,1)),
limits = c(-1,1)) +
labs(x = NULL,
y = NULL,
fill = "Pearson Corr:") +
theme(legend.background = element_rect(fill = "#0E264E", color = "#0E264E"),
plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
panel.background = element_rect(fill = "#0E264E"),
panel.grid = element_line(colour = "#0E264E"),
panel.grid.major.x = element_line(colour = "#0E264E"),
panel.grid.minor.x = element_line(colour = "#0E264E"),
legend.title = element_text(colour = "#1B5249", face ="bold", family = "Times New Roman"),
legend.text = element_text(colour = "#1B5249", face ="bold", family = "Times New Roman"),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal",
axis.text.x = element_text(color = "#1B5249", family = "Times New Roman",
angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_blank(),
axis.ticks = element_blank()) +
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
ggplotly(plot_pm, tooltip = "text") %>%
layout(hoverlabel = list( bgcolor = "rgba(255,255,255,0.75)",
font = list(
color = "Black",
family = "Cardo",
size = 12
)))
Based on the correlation matrix, there is no evidence of strong multicollinearity: no correlation coefficient exceeds 0.4 in absolute value.
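Pairwise correlations can miss collinearity that involves three or more predictors at once. A complementary check is the variance inflation factor (VIF). The following is a minimal sketch on simulated data, not on the heart dataset; the helper `vif_manual` and all variable names are hypothetical and introduced only for illustration.

```r
# Simulated predictors (not the heart data): x3 is nearly a combination of x1 and x2
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.3)
x4 <- rnorm(n)
sim <- data.frame(x1, x2, x3, x4)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vif_values <- vif_manual(sim)
print(round(vif_values, 2))
```

A VIF close to 1 means a predictor is nearly unrelated to the others, while values above roughly 5 to 10 are commonly treated as a warning sign.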
Before modeling, we need to look at the proportion of each class of the target variable in the HeartDisease column.
proptarget <- paste(round(prop.table((table(heart$HeartDisease)))*100,2),"%")
proptarget_df <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptarget[1], proptarget[2]))
rmarkdown::paged_table(proptarget_df)
The two classes are reasonably balanced, so we do not need additional pre-processing to balance the target variable.
Logistic regression describes the relationship between one binary dependent variable and one or more independent variables. The independent variables can be nominal, ordinal, or of interval type.
Logistic regression returns a probability between 0 and 1 rather than a class label, so we later need to choose a cutoff to convert that probability into a binary class.
This is the equation of a logistic regression model:
\[ \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X \]
The left-hand side is called the log-odds or logit. On the right-hand side, \(\beta_0\) is the model intercept and \(\beta_1\) is the coefficient of feature \(X\).
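To make the link between log-odds and probability concrete, here is a minimal sketch in base R. The intercept `b0`, the slope `b1`, and the values of `x` are made up for illustration and are not estimates from this dataset.

```r
# Hypothetical coefficients, chosen only for illustration
b0 <- -1.5
b1 <- 0.8
x  <- c(0, 1, 2, 3)

# Linear predictor = log-odds (logit)
log_odds <- b0 + b1 * x

# The inverse logit maps log-odds to a probability in (0, 1)
p_manual  <- 1 / (1 + exp(-log_odds))
p_builtin <- plogis(log_odds)  # base R equivalent

# Applying the logit to the probabilities recovers the log-odds
recovered <- log(p_manual / (1 - p_manual))

print(round(p_manual, 3))
```

Each unit increase in `x` adds `b1` to the log-odds, which multiplies the odds by `exp(b1)`.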
Before modeling, we first split our dataset into training and test sets, a simple hold-out form of cross-validation. Why is this necessary? Evaluating a model on data it has not seen during training shows how well it predicts unfamiliar data. Here we allocate 80% of our dataset for training our model and reserve 20% for testing our predictions, in order to evaluate the performance of the model.
heart <- heart %>%
mutate(HeartDisease = as.factor(HeartDisease))
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(heart), size = nrow(heart)*0.80)
heart_train <- heart[index,] #take 80% data
heart_test <- heart[-index,] #take 20% data
prop_train <- paste(round(prop.table((table(heart_train$HeartDisease)))*100,2),"%")
prop_train_df <- data.frame(HeartDisease = c("No","Yes"), Prop = c(prop_train[1], prop_train[2]))
rmarkdown::paged_table(prop_train_df)
As the table above shows, the classes remain evenly distributed in the training data.
Once we have divided our data, our next step is to proceed with training our models using the heart_train dataset. Initially, we will utilize all variables available as independent variables (X).
model_all <- glm(formula = HeartDisease ~ .,
family = "binomial",
data = heart_train)
summary(model_all)
##
## Call:
## glm(formula = HeartDisease ~ ., family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6670 -0.3924 0.1772 0.4543 2.4893
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4576065 1.5628513 -0.293 0.769673
## Age 0.0163795 0.0143858 1.139 0.254874
## SexMale 1.3481197 0.3145406 4.286 0.000018192 ***
## ChestPainTypeATA -1.7436885 0.3785043 -4.607 0.000004089 ***
## ChestPainTypeNAP -1.5622087 0.2945054 -5.305 0.000000113 ***
## ChestPainTypeTA -1.6574256 0.4578414 -3.620 0.000295 ***
## RestingBP 0.0006014 0.0064614 0.093 0.925841
## Cholesterol -0.0045547 0.0012139 -3.752 0.000175 ***
## FastingBS> 120 mg/dl 1.3006319 0.3083736 4.218 0.000024679 ***
## RestingECGNormal -0.2035720 0.3007997 -0.677 0.498552
## RestingECGST -0.1711756 0.3847466 -0.445 0.656389
## MaxHR -0.0085998 0.0055707 -1.544 0.122650
## ExerciseAnginaYes 0.5911453 0.2750255 2.149 0.031601 *
## Oldpeak 0.5434655 0.1339077 4.059 0.000049387 ***
## ST_SlopeFlat 1.8358401 0.4773591 3.846 0.000120 ***
## ST_SlopeUp -0.4232906 0.5047039 -0.839 0.401643
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1007.44 on 733 degrees of freedom
## Residual deviance: 482.99 on 718 degrees of freedom
## AIC: 514.99
##
## Number of Fisher Scoring iterations: 5
After our initial modeling, several predictors turn out not to be statistically significant for our target variable. We therefore refit the model using stepwise selection, which adds or removes predictors to lower the AIC.
model_both <- step(object = model_all,
direction = "both",
trace = 0)
summary(model_both)
##
## Call:
## glm(formula = HeartDisease ~ Sex + ChestPainType + Cholesterol +
## FastingBS + MaxHR + ExerciseAngina + Oldpeak + ST_Slope,
## family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6794 -0.3886 0.1805 0.4567 2.5841
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.609076 0.902756 0.675 0.499876
## SexMale 1.318423 0.313121 4.211 0.000025471 ***
## ChestPainTypeATA -1.767439 0.374885 -4.715 0.000002422 ***
## ChestPainTypeNAP -1.539844 0.292127 -5.271 0.000000136 ***
## ChestPainTypeTA -1.563223 0.447991 -3.489 0.000484 ***
## Cholesterol -0.004450 0.001179 -3.774 0.000161 ***
## FastingBS> 120 mg/dl 1.325977 0.303953 4.362 0.000012862 ***
## MaxHR -0.010429 0.005166 -2.019 0.043499 *
## ExerciseAnginaYes 0.594821 0.272666 2.182 0.029146 *
## Oldpeak 0.571182 0.132291 4.318 0.000015771 ***
## ST_SlopeFlat 1.794946 0.473894 3.788 0.000152 ***
## ST_SlopeUp -0.465411 0.499941 -0.931 0.351889
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1007.44 on 733 degrees of freedom
## Residual deviance: 485.34 on 722 degrees of freedom
## AIC: 509.34
##
## Number of Fisher Scoring iterations: 5
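Interpretation of the result above:
Variables that increase the probability of heart disease (positive coefficients) are: SexMale, FastingBS> 120 mg/dl, ExerciseAnginaYes, Oldpeak, and ST_SlopeFlat.
Variables that reduce the probability of heart disease (negative coefficients) are: ChestPainTypeATA, ChestPainTypeNAP, ChestPainTypeTA, Cholesterol, MaxHR, and ST_SlopeUp. The ST_SlopeUp coefficient is not statistically significant (p = 0.35).
Coefficients for categorical variables are relative to the reference level that does not appear in the table (for example, ASY for ChestPainType and Down for ST_Slope).
Note that stepwise selection dropped Age, RestingBP, and RestingECG, and the AIC improved (decreased) from 514.99 to 509.34.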
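Two quick checks help in reading these summaries. The sketch below, using only numbers copied from the output above, converts a few log-odds coefficients into odds ratios and reproduces both AIC values from the residual deviance. It assumes the usual identity for a 0/1 response, AIC = residual deviance + 2 × (number of estimated coefficients).

```r
# Coefficients copied from the model_both summary above (log-odds scale)
coefs <- c(SexMale = 1.318423, Oldpeak = 0.571182, MaxHR = -0.010429)

# exp() turns a log-odds coefficient into an odds ratio:
# above 1 raises the odds of heart disease, below 1 lowers them
odds_ratio <- exp(coefs)
print(round(odds_ratio, 3))

# AIC check: residual deviance + 2 * number of estimated coefficients
aic_all  <- 482.99 + 2 * 16  # model_all: 16 coefficients including the intercept
aic_both <- 485.34 + 2 * 12  # model_both: 12 coefficients including the intercept
print(c(model_all = aic_all, model_both = aic_both))
```

Holding the other predictors fixed, the SexMale odds ratio says that the odds of heart disease for male patients are several times those for female patients in this model.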
Using our model_both, we will make predictions on our heart_test dataset.
heart_test$pred_risk <- predict(object = model_both,
newdata = heart_test,
type = "response")
heart_train$pred_risk <- predict(object = model_both,
newdata = heart_train,
type = "response")
plot_lr <- ggplot(heart_test, aes(x=pred_risk)) +
geom_density(lwd=0.5, color = "white", fill = "#1B5249", adjust = 0.5) +
labs(title = "Distribution of Probability Prediction Data", colours = "red") +
xlab("Pred Risk") +
ylab("Density") +
theme(legend.position = "none",
plot.title = element_text(color = "red", hjust = 0.5, face = "bold"),
plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
panel.background = element_rect(fill = "#0E264E"),
panel.grid = element_line(colour = "white"),
panel.grid.major.x = element_line(colour = "white"),
panel.grid.minor.x = element_line(colour = "white"),
axis.text.x = element_text(color = "white", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "white", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
ggarrange(plot_lr, nrow = 1, ncol = 1)
The graph above shows that the predicted probabilities are concentrated towards 1, meaning that more predictions favour a positive heart disease outcome (Yes).
heart_test$pred_label <- ifelse(test = heart_test$pred_risk > 0.5,
yes = "1",
no = "0")
heart_train$pred_label <- ifelse(test = heart_train$pred_risk > 0.5,
yes = "1",
no = "0")
heart_test <- heart_test %>%
mutate(pred_label = as.factor(pred_label))
rmarkdown::paged_table(heart_test[1:20, c("HeartDisease", "pred_label", "pred_risk")])
The k-nearest neighbors (K-NN) algorithm is a non-parametric, supervised learning classifier that uses proximity to classify an individual data point. It works on the assumption that similar points lie near one another.
K-NN measures proximity with a distance function. The Euclidean distance between two points x and y is:
\[ d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} \]
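As a small worked example of this distance, the sketch below computes it by hand and with base R's `dist()`. The two observations are made-up values, not rows from the heart dataset.

```r
# Two hypothetical, already-standardized observations (values are made up)
x <- c(0.5, -1.2, 0.3, 1.0)
y <- c(-0.5, 0.8, 0.3, 0.0)

# Euclidean distance: square root of the summed squared differences
d_manual <- sqrt(sum((x - y)^2))

# Same result via base R's dist(), which defaults to the Euclidean metric
d_builtin <- as.numeric(dist(rbind(x, y)))

print(d_manual)
```

Because every feature contributes its squared difference directly, features measured on larger scales would dominate the distance, which is why the predictors are standardized before K-NN is applied.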
We will classify using a common rule of thumb for K: the square root of the number of observations in the training dataset.
The predictor variables for K-NN must be in numerical data formats.
model_knn <- heart %>%
select(HeartDisease, Age, RestingBP, Cholesterol, MaxHR)
RNGkind(sample.kind = "Rounding")
set.seed(100)
# index sampling
index_knn <- sample(nrow(model_knn), size = 0.8*nrow(model_knn))
# splitting
knn_train <- model_knn[index_knn, ]
knn_test <- model_knn[-index_knn, ]
prop_trainkkn <- paste(round(prop.table((table(knn_train$HeartDisease)))*100,2),"%")
prop_trainkkn_df <- data.frame(HeartDisease = c("No","Yes"), Prop = c(prop_trainkkn[1], prop_trainkkn[2]))
rmarkdown::paged_table(prop_trainkkn_df)
As the table above shows, the classes remain evenly distributed in the K-NN training data.
For the K-NN model, we need to separate the dependent variable from the independent variables.
knn_train_x <- select(knn_train, -HeartDisease)
knn_train_y <- knn_train %>% select(HeartDisease)
knn_test_x <- select(knn_test, -HeartDisease)
knn_test_y <- knn_test %>% select(HeartDisease)
We will standardize our independent variables/predictors using Z-score normalization so that all of them are on the same scale. The test set is scaled with the mean and standard deviation of the training set. The equation is as follows:
\[ Z = \frac{x - \text{mean}}{\text{sd}} \]
heart_train_x_scale <- scale(x = knn_train_x)
heart_test_x_scale <- scale(x = knn_test_x,
center = attr(heart_train_x_scale, "scaled:center"),
scale = attr(heart_train_x_scale, "scaled:scale"))
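The reason for reusing the training centre and scale on the test set can be seen in a small sketch. The blood-pressure values below are made up for illustration, and the object names are hypothetical.

```r
# Made-up training and test values for a single predictor
train <- matrix(c(120, 130, 140, 150, 160), ncol = 1, dimnames = list(NULL, "RestingBP"))
test  <- matrix(c(125, 155), ncol = 1, dimnames = list(NULL, "RestingBP"))

train_scaled <- scale(train)
mu    <- attr(train_scaled, "scaled:center")
sigma <- attr(train_scaled, "scaled:scale")

# Test data reuse the training mean and sd, so nothing is learned from the test set
test_scaled <- scale(test, center = mu, scale = sigma)

# Manual check of the formula Z = (x - mean) / sd
z_manual <- (test[, 1] - mean(train[, 1])) / sd(train[, 1])

print(round(as.numeric(test_scaled), 3))
```

Scaling the test set with its own mean and standard deviation would place the two sets on slightly different scales and would leak information about the test data into preprocessing.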
K-NN has no separate training step: predictions are made directly from the stored training data. Before that, however, we need to choose a value of K for our training data.
sqrt(nrow(knn_train))
## [1] 27.09243
Rounding the square root down gives K = 27. An odd K also avoids tied votes between the two classes.
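The square-root rule is a starting point rather than a guarantee of the best K. One way to check it is to compare several values of K with leave-one-out cross-validation, sketched below using `knn.cv()` from the `class` package. The data are simulated (not the heart dataset), and the names `k_rule`, `candidates`, and `loo_acc` are hypothetical.

```r
library(class)  # provides knn.cv()

set.seed(7)
# Simulated, already-scaled predictors with two classes (not the heart data)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 1.5), ncol = 2))
y <- factor(rep(c("0", "1"), each = 50))

# Rule-of-thumb starting point: square root of the training size, forced to be odd
k_rule <- floor(sqrt(nrow(x)))
if (k_rule %% 2 == 0) k_rule <- k_rule + 1

# Leave-one-out accuracy for a few odd candidate values of k
candidates <- seq(1, 21, by = 2)
loo_acc <- sapply(candidates, function(k) mean(knn.cv(train = x, cl = y, k = k) == y))
names(loo_acc) <- candidates

print(k_rule)
print(round(loo_acc, 3))
```

If the cross-validated accuracy is clearly higher at another K than at the rule-of-thumb value, that K may be the better choice.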
knn_pred <- knn(train = heart_train_x_scale,
test = heart_test_x_scale,
cl = knn_train_y$HeartDisease,
k = 27,
prob = T)
knn_pred_train <- knn(train = heart_train_x_scale,
test = heart_train_x_scale,
cl = knn_train_y$HeartDisease,
k = 27,
prob = T)
To assess performance, we will use the confusion matrix and the ROC curve with its AUC, which together provide accuracy, sensitivity (recall), specificity, precision, and AUC.
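Accuracy, sensitivity, specificity, and precision all derive from the four cells of a confusion matrix. The sketch below computes them by hand from hypothetical counts, which are not results from this project.

```r
# Hypothetical confusion-matrix counts (not results from this project)
tp <- 40  # predicted 1, actual 1
fn <- 10  # predicted 0, actual 1
fp <- 5   # predicted 1, actual 0
tn <- 45  # predicted 0, actual 0

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all cases classified correctly
sensitivity <- tp / (tp + fn)                   # share of actual positives that were found
specificity <- tn / (tn + fp)                   # share of actual negatives that were found
precision   <- tp / (tp + fp)                   # share of positive predictions that are right

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 3))
```

In caret's `confusionMatrix()` output, precision for the positive class is reported as "Pos Pred Value".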
con_mat_test <- confusionMatrix(data = heart_test$pred_label,
reference = heart_test$HeartDisease,
positive = "1")
con_mat_train <- confusionMatrix(data = as.factor(heart_train$pred_label),
reference = heart_train$HeartDisease,
positive = "1")
# Extract metrics by name rather than by position
performance <- cbind.data.frame(Accuracy = c(con_mat_train$overall[["Accuracy"]], con_mat_test$overall[["Accuracy"]]),
                                Sensitivity = c(con_mat_train$byClass[["Sensitivity"]], con_mat_test$byClass[["Sensitivity"]]),
                                Precision = c(con_mat_train$byClass[["Pos Pred Value"]], con_mat_test$byClass[["Pos Pred Value"]]),
                                Specificity = c(con_mat_train$byClass[["Specificity"]], con_mat_test$byClass[["Specificity"]]))
rownames(performance) <- c("On Training Data", "On Unseen Data")
rmarkdown::paged_table(performance)
The Accuracy, Sensitivity, Specificity, and Precision scores differ only slightly between our Training and Unseen Data.
This suggests that there are no signs of Overfitting or Underfitting in our Logistic Regression model.
Actual <- factor(c("No", "No", "Yes", "Yes"))
Predicted <- factor(c("No", "Yes", "No", "Yes"))
y <- c(con_mat_test$table[2], con_mat_test$table[1], con_mat_test$table[4], con_mat_test$table[3])
roc <- roc(predictor = heart_test$pred_risk,
response = heart_test$HeartDisease,
levels = c("0", "1"),
percent = TRUE)
df_log <- data.frame(Specificity=roc$specificities, Sensitivity=roc$sensitivities)
plot_log <- ggplot(data = df_log, aes(x = Specificity, y = Sensitivity))+
with_outer_glow(geom_path(colour = '#1F6E8C', size = 1), colour = "#1B5249", sigma = 10, expand = 1)+
scale_x_reverse() +
geom_abline(intercept = 100, slope = 1, color="#1F6E8C", linetype = "longdash") +
annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc$auc,1), '%'), size = 4, color = "#1B5249", family = "Times New Roman")+
ylab('Sensitivity (%)')+
xlab('Specificity (%)')+
scale_fill_gradient(low = "#1B5249", high = "#1F6E8C") +
scale_color_gradient(low = "#1B5249", high = "#1F6E8C") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
panel.background = element_rect(fill = "#0E264E"),
panel.grid = element_line(colour = "white"),
panel.grid.major.x = element_line(colour = "white"),
panel.grid.minor.x = element_line(colour = "white"),
axis.text.x = element_text(color = "white", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "white", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
ggarrange(plot_log, nrow = 1, ncol = 1)
Our logistic regression model achieves an accuracy of 89.13% on new, unseen data, meaning that it correctly classifies 89.13% of the test observations.
The sensitivity of 88.77% shows that the large majority of positive cases are classified correctly.
The specificity of 89.53% shows that the large majority of negative cases are classified correctly.
The precision of 90.62% indicates that 90.62% of our positive predictions are correct.
The ROC curve shows strong separation between the classes, with an AUC of 93.4%. This indicates that our model is highly effective at distinguishing patients with heart disease from those without.
con_mat_knn_test <- confusionMatrix(data = knn_pred,
reference = knn_test_y$HeartDisease,
positive = "1")
con_mat_knn_train <- confusionMatrix(data = knn_pred_train,
reference = knn_train_y$HeartDisease,
positive = "1")
# Extract metrics by name rather than by position
performanceknn <- cbind.data.frame(Accuracy = c(con_mat_knn_train$overall[["Accuracy"]], con_mat_knn_test$overall[["Accuracy"]]),
                                   Sensitivity = c(con_mat_knn_train$byClass[["Sensitivity"]], con_mat_knn_test$byClass[["Sensitivity"]]),
                                   Specificity = c(con_mat_knn_train$byClass[["Specificity"]], con_mat_knn_test$byClass[["Specificity"]]),
                                   Precision = c(con_mat_knn_train$byClass[["Pos Pred Value"]], con_mat_knn_test$byClass[["Pos Pred Value"]]))
rownames(performanceknn) <- c("On Training Data", "On Unseen Data")
rmarkdown::paged_table(performanceknn)
There are minimal discrepancies observed in Accuracy, Sensitivity, Specificity, and Precision scores between our Unseen and Training Data.
This implies that there are no signs of Overfitting or Underfitting in our K-NN model.
Actual1 <- factor(c("No", "No", "Yes", "Yes"))
Predicted1 <- factor(c("No", "Yes", "No", "Yes"))
y1 <- c(con_mat_knn_test$table[2], con_mat_knn_test$table[1], con_mat_knn_test$table[4], con_mat_knn_test$table[3])
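# Caution: attributes(knn_pred)$prob is the vote share of the winning class,
# not the probability of class "1", so the resulting AUC should be read with care
roc1 <- roc(predictor = attributes(knn_pred)$prob,
            response = knn_test_y$HeartDisease,
            levels = c("0", "1"),
            percent = TRUE)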
df_knn <- data.frame(Specificity=roc1$specificities, Sensitivity=roc1$sensitivities)
plot_knn <- ggplot(data = df_knn, aes(x = Specificity, y = Sensitivity))+
with_outer_glow(geom_path(colour = '#1F6E8C', size = 1), colour = "#1B5249", sigma = 10, expand = 1)+
scale_x_reverse() +
geom_abline(intercept = 100, slope = 1, color="#1F6E8C", linetype = "longdash") +
annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc1$auc,1), '%'), size = 4, color = "#1B5249", family = "Times New Roman")+
ylab('Sensitivity (%)')+
xlab('Specificity (%)')+
scale_fill_gradient(low = "#1B5249", high = "#1F6E8C") +
scale_color_gradient(low = "#1B5249", high = "#1F6E8C") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
panel.background = element_rect(fill = "#0E264E"),
panel.grid = element_line(colour = "white"),
panel.grid.major.x = element_line(colour = "white"),
panel.grid.minor.x = element_line(colour = "white"),
axis.text.x = element_text(color = "white", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "white", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
ggarrange(plot_knn, nrow = 1, ncol = 1)
Our K-NN model achieves an accuracy of 74.45% on new, unseen data, meaning that it correctly classifies 74.45% of the test observations.
The sensitivity of 77.55% shows that most positive cases are classified correctly.
The specificity of 70.93% shows that most negative cases are classified correctly.
The precision of 75.24% indicates that 75.24% of our positive predictions are correct.
The ROC curve yields an AUC of 58.6%, only slightly above the 50% expected from random guessing, so by this measure the K-NN model separates the two classes poorly. Note that the score passed to roc() here is the winning-class vote share returned by knn(), not the estimated probability of heart disease, so this AUC should be read with caution.
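When `knn()` from the `class` package is called with `prob = TRUE`, the `"prob"` attribute holds the proportion of votes for the winning class, whichever class that is. A score suitable for a ROC curve is the estimated probability of the positive class. The sketch below shows one way to derive it; it uses simulated data rather than the heart dataset, and the names `win_share` and `p_positive` are hypothetical.

```r
library(class)  # provides knn()

# Simulated two-class data (not the heart data)
set.seed(42)
train_x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
                 matrix(rnorm(60, mean = 2), ncol = 2))
train_y <- factor(rep(c("0", "1"), each = 30))
test_x  <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                 matrix(rnorm(20, mean = 2), ncol = 2))

pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5, prob = TRUE)

# Vote share of the winning class; with two classes it is never below 0.5
win_share <- attr(pred, "prob")

# Convert to the estimated probability of the positive class "1"
p_positive <- ifelse(pred == "1", win_share, 1 - win_share)

print(round(p_positive, 2))
```

Passing a score like `p_positive` as the predictor to `roc()` would rank observations by how strongly the neighbours vote for heart disease, which is what the AUC is meant to measure.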
Because this is a health diagnosis problem, our priority is to minimize false negatives, that is, patients with heart disease who are predicted to be healthy. We therefore prefer the model with the higher sensitivity.
Of the two models we developed, the Logistic Regression model has the higher sensitivity (88.77% versus 77.55% for K-NN). We therefore conclude that in this scenario the Logistic Regression model is superior to the K-NN model for classifying heart disease.
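Sensitivity also depends on the cutoff used to turn predicted probabilities into labels, which was fixed at 0.5 in this project. The sketch below illustrates the trade-off on made-up risks and labels; the helper `rates_at` is hypothetical.

```r
# Hypothetical predicted risks and true labels (made up for illustration)
pred_risk <- c(0.95, 0.80, 0.62, 0.45, 0.35, 0.70, 0.40, 0.20, 0.10, 0.05)
actual    <- c(1,    1,    1,    1,    1,    0,    0,    0,    0,    0)

# Hypothetical helper: sensitivity and specificity at a given cutoff
rates_at <- function(cutoff) {
  pred <- as.integer(pred_risk > cutoff)
  c(sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
}

at_050 <- rates_at(0.5)
at_030 <- rates_at(0.3)
print(rbind(`cutoff 0.5` = at_050, `cutoff 0.3` = at_030))
```

In this toy example, lowering the cutoff from 0.5 to 0.3 catches more of the positive cases at the cost of more false alarms, which may be an acceptable trade in a screening setting.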
This concludes my attempt to classify heart disease using classification models. As time progresses, I will incorporate additional models that have not yet been included in this project. Nonetheless, I sincerely hope that this study will assist individuals dealing with similar cases in the future.
A work by Ahmad Fauzi
wrk.ahmadfauzi@gmail.com
with R Language