Heart disease is a prevalent health concern worldwide, contributing significantly to mortality rates. Understanding and predicting the likelihood of heart disease in patients is crucial for early intervention and effective treatment. In this analysis, we will focus on examining patient data from a hospital database, particularly those diagnosed with heart disease. By leveraging advanced analytical techniques, such as logistic regression and k-nearest neighbor (KNN), which are supervised learning algorithms, we aim to develop predictive models to classify patients into those likely to have heart disease and those who are not
The objective of this analysis is to develop predictive models for heart disease detection based on a comprehensive set of patient attributes collected during hospital admissions. Leveraging important variables such as age, sex, chest pain type, blood pressure, cholesterol levels, fasting blood sugar, electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise, and other relevant factors, we aim to create robust models capable of accurately predicting the likelihood of heart disease in patients. By utilizing logistic regression and k-nearest neighbor algorithms, both widely used in supervised learning, we seek to provide healthcare professionals with valuable tools for early detection and intervention in heart disease cases. The ultimate goal is to enhance patient care and outcomes by enabling timely identification and management of individuals at risk of heart disease.
Here are several libraries that will be used in the analysis:
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gtools)
## Warning: package 'gtools' was built under R version 4.3.2
library(gmodels)
## Warning: package 'gmodels' was built under R version 4.3.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(class)
library(tidyr)
In this section, data will be imported and column descriptions will be provided
heart_disease <- read.csv("data_input/heart.csv")
glimpse (heart_disease)
## Rows: 303
## Columns: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5…
## $ sex <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1…
## $ cp <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0…
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1…
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2…
## $ fbs <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ restecg <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1…
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1…
## $ exang <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0…
## $ slope <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1…
## $ ca <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0…
## $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3…
## $ target <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
Data type adjustment is needed as there are still some variables with inappropriate data types, followed by checking for missing values.
heart_disease <- heart_disease %>%
mutate_if(is.integer, as.factor) %>%
mutate(sex = factor(sex, levels = c(0,1), labels = c("Female", "Male")),
fbs =factor(fbs, levels = c(0,1), labels = c("False", "True")),
exang = factor(exang, levels = c(0,1), labels = c("No", "Yes")),
target = factor(target, levels = c(0,1),
labels = c("Health", "Not Health")))
glimpse(heart_disease)
## Rows: 303
## Columns: 14
## $ age <fct> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5…
## $ sex <fct> Male, Male, Female, Male, Female, Male, Female, Male, Male, M…
## $ cp <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0…
## $ trestbps <fct> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1…
## $ chol <fct> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2…
## $ fbs <fct> True, False, False, False, False, False, False, False, True, …
## $ restecg <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1…
## $ thalach <fct> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1…
## $ exang <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No, Yes, No,…
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0…
## $ slope <fct> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1…
## $ ca <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0…
## $ thal <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3…
## $ target <fct> Not Health, Not Health, Not Health, Not Health, Not Health, N…
summary(heart_disease)
## age sex cp trestbps chol fbs
## 58 : 19 Female: 96 0:143 120 : 37 197 : 6 False:258
## 57 : 17 Male :207 1: 50 130 : 36 204 : 6 True : 45
## 54 : 16 2: 87 140 : 32 234 : 6
## 59 : 14 3: 23 110 : 19 212 : 5
## 52 : 13 150 : 17 254 : 5
## 51 : 12 138 : 13 269 : 5
## (Other):212 (Other):149 (Other):270
## restecg thalach exang oldpeak slope ca thal
## 0:147 162 : 11 No :204 Min. :0.00 0: 21 0:175 0: 2
## 1:152 160 : 9 Yes: 99 1st Qu.:0.00 1:140 1: 65 1: 18
## 2: 4 163 : 9 Median :0.80 2:142 2: 38 2:166
## 152 : 8 Mean :1.04 3: 20 3:117
## 173 : 8 3rd Qu.:1.60 4: 5
## 125 : 7 Max. :6.20
## (Other):251
## target
## Health :138
## Not Health:165
##
##
##
##
##
some insights :
Age Distribution: The age distribution shows that the dataset covers a wide range of ages, with the most frequent age group being 58 years old. This suggests that the dataset includes patients across various age groups, providing a comprehensive view of heart disease across different age demographics.
Gender Representation: There is a noticeable gender imbalance in the dataset, with roughly twice as many male entries as female entries. This gender skew highlights the need for gender-specific analysis and interventions in cardiovascular health.
Chest Pain Types: The distribution of chest pain types indicates that most patients present with type 0 chest pain, followed by types 2 and 1. This insight can help healthcare professionals prioritize chest pain assessments and treatments based on the type and severity of pain reported by patients.
Blood Pressure and Cholesterol Levels: The diverse range of blood pressure and cholesterol levels underscores the variability in cardiovascular health among patients. Understanding these distributions can assist in identifying risk factors and developing personalized treatment plans for individuals with elevated blood pressure or cholesterol levels.
Fasting Blood Sugar Levels: The majority of patients have fasting blood sugar levels below 120 mg/dl, indicating relatively normal glucose metabolism. However, a subset of patients has elevated fasting blood sugar levels, suggesting potential comorbidities such as diabetes mellitus, which can impact cardiovascular health.
Resting Electrocardiographic Results: The distribution of restecg types highlights the prevalence of specific electrocardiographic abnormalities among patients. This information can guide clinicians in interpreting electrocardiograms and diagnosing cardiac conditions based on characteristic ECG patterns.
Exercise-Induced Angina: The presence or absence of exercise-induced angina provides insights into the cardiovascular response to physical exertion. Patients with exercise-induced angina may require closer monitoring and tailored exercise regimens to manage their symptoms and prevent adverse events.
ST Depression and Slope of Peak Exercise ST Segment: The distribution of ST depression values and slope types reflects the extent of myocardial ischemia and the response of the heart to exercise stress. These parameters are crucial for assessing the severity of coronary artery disease and guiding decisions regarding further diagnostic testing and treatment.
Number of Major Vessels Colored by Fluoroscopy: The distribution of major vessels colored by fluoroscopy indicates the extent of coronary artery involvement and the presence of obstructive lesions. This information is valuable for risk stratification and determining the need for revascularization procedures such as angioplasty or bypass surgery.
Thalassemia Types: The prevalence of different thalassemia types highlights the association between genetic factors and cardiovascular disease. Understanding the distribution of thalassemia types can inform genetic counseling and screening programs aimed at identifying individuals at risk of cardiac complications.
Target Variable: The distribution of healthy and unhealthy status among patients underscores the prevalence of heart disease in the population. This insight emphasizes the importance of preventive measures, early detection, and effective management strategies to reduce the burden of cardiovascular morbidity and mortality.
Missing value check was performed on the dataset, however, prior to that, it is necessary to check the proportion of the target variable present in the target column. If the proportions of both classes are fairly balanced, we may not require additional pre-processing to balance the proportions between the two target variable classes.
# Data proportion check
prop.table(table(heart_disease$target))
##
## Health Not Health
## 0.4554455 0.5445545
table(heart_disease$target)
##
## Health Not Health
## 138 165
*** From the results, it is evident that the data proportions are sufficiently balanced, hence there is no need for additional preprocessing.
#Missing Value Checking
colSums(is.na(heart_disease))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
There is no missing value in the dataset
# shows diagram for variable Sex
sex_counts <- table(heart_disease$sex)
# Create a bar plot
barplot(sex_counts,
main = "Distribution of Sex in the Dataset",
xlab = "Sex (0 = Female, 1 = Male)",
ylab = "Frequency")
set.seed(303)
# index sampling
index <- sample(nrow(heart_disease),
size = nrow(heart_disease)*0.7)
# splitting
heart_disease_train <- heart_disease[index, ]
heart_disease_test <- heart_disease[-index, ]
heart_disease$target %>%
levels()
## [1] "Health" "Not Health"
Next, models are built using Logistic Regression and KNN. The data is divided into training and testing sets using cross-validation. In the Logistic Regression model, the model is fitted using the glm() function, and feature selection is performed using the stepwise method. The evaluation results of the model are displayed based on accuracy, precision, recall, and F1-score.
model <- glm(formula = target ~ ., family = "binomial",
data = heart_disease_train)
## Warning: glm.fit: algorithm did not converge
summary(model)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_disease_train)
##
## Coefficients: (92 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.353e+03 1.089e+08 0 1
## age34 -8.757e+02 1.411e+07 0 1
## age35 -3.989e+02 5.012e+06 0 1
## age37 -1.081e+03 1.679e+07 0 1
## age38 1.852e+03 4.009e+07 0 1
## age39 -1.199e+03 1.998e+07 0 1
## age40 -3.698e+03 6.895e+07 0 1
## age41 4.068e+02 8.401e+06 0 1
## age42 1.504e+03 2.979e+07 0 1
## age43 -6.091e+02 1.012e+07 0 1
## age44 3.374e+02 7.100e+06 0 1
## age45 1.534e+03 3.340e+07 0 1
## age46 1.065e+03 2.312e+07 0 1
## age47 1.756e+03 3.682e+07 0 1
## age48 5.609e+02 1.214e+07 0 1
## age49 1.053e+03 2.048e+07 0 1
## age50 2.808e+02 7.870e+06 0 1
## age51 1.534e+03 3.275e+07 0 1
## age52 4.440e+01 4.662e+06 0 1
## age53 2.061e+02 5.781e+06 0 1
## age54 7.970e+00 2.816e+06 0 1
## age55 -4.604e+02 5.822e+06 0 1
## age56 3.317e+02 7.248e+06 0 1
## age57 1.473e+03 3.172e+07 0 1
## age58 -2.467e+01 2.496e+06 0 1
## age59 1.681e+02 5.461e+06 0 1
## age60 1.932e+03 4.179e+07 0 1
## age61 -5.287e+02 7.910e+06 0 1
## age62 1.257e+03 2.734e+07 0 1
## age63 -2.288e+02 2.324e+06 0 1
## age64 1.450e+03 3.046e+07 0 1
## age65 -7.161e+02 1.271e+07 0 1
## age66 -1.280e+03 2.338e+07 0 1
## age67 -4.983e+03 9.517e+07 0 1
## age68 -1.262e+03 2.221e+07 0 1
## age69 2.314e+03 4.857e+07 0 1
## age70 -7.435e+02 1.591e+07 0 1
## age71 1.689e+03 3.504e+07 0 1
## age76 -9.716e+01 1.959e+06 0 1
## age77 5.252e+02 1.560e+07 0 1
## sexMale 4.068e+02 8.365e+06 0 1
## cp1 -3.120e+02 6.210e+06 0 1
## cp2 2.292e+02 4.987e+06 0 1
## cp3 -1.899e+03 3.782e+07 0 1
## trestbps100 4.844e+03 9.885e+07 0 1
## trestbps101 3.739e+03 7.554e+07 0 1
## trestbps102 3.875e+03 7.923e+07 0 1
## trestbps104 3.058e+03 5.919e+07 0 1
## trestbps105 3.399e+03 6.716e+07 0 1
## trestbps106 1.011e+04 1.989e+08 0 1
## trestbps108 5.446e+03 1.064e+08 0 1
## trestbps110 4.167e+03 8.353e+07 0 1
## trestbps112 4.895e+03 9.757e+07 0 1
## trestbps114 4.944e+03 9.860e+07 0 1
## trestbps115 5.869e+03 1.166e+08 0 1
## trestbps117 3.714e+03 7.469e+07 0 1
## trestbps118 6.407e+03 1.255e+08 0 1
## trestbps120 5.158e+03 1.016e+08 0 1
## trestbps122 4.411e+03 8.852e+07 0 1
## trestbps124 4.368e+03 8.784e+07 0 1
## trestbps125 3.410e+03 6.731e+07 0 1
## trestbps126 5.124e+03 1.041e+08 0 1
## trestbps128 4.354e+03 8.589e+07 0 1
## trestbps129 5.566e+03 1.134e+08 0 1
## trestbps130 4.217e+03 8.449e+07 0 1
## trestbps132 2.112e+03 4.172e+07 0 1
## trestbps134 5.423e+03 1.082e+08 0 1
## trestbps135 4.704e+03 9.548e+07 0 1
## trestbps136 6.369e+03 1.276e+08 0 1
## trestbps138 4.450e+03 8.614e+07 0 1
## trestbps140 4.221e+03 8.369e+07 0 1
## trestbps142 3.715e+03 7.196e+07 0 1
## trestbps144 4.638e+03 9.292e+07 0 1
## trestbps145 5.855e+03 1.166e+08 0 1
## trestbps146 6.430e+03 1.272e+08 0 1
## trestbps148 5.275e+03 1.054e+08 0 1
## trestbps150 3.817e+03 7.583e+07 0 1
## trestbps152 8.368e+03 1.644e+08 0 1
## trestbps154 3.758e+03 7.505e+07 0 1
## trestbps155 5.757e+03 1.125e+08 0 1
## trestbps156 5.522e+03 1.124e+08 0 1
## trestbps160 5.325e+03 1.072e+08 0 1
## trestbps165 4.153e+03 8.341e+07 0 1
## trestbps170 2.372e+03 4.771e+07 0 1
## trestbps174 4.998e+03 1.010e+08 0 1
## trestbps178 6.225e+03 1.275e+08 0 1
## trestbps180 6.668e+03 1.315e+08 0 1
## trestbps200 1.423e+03 3.101e+07 0 1
## chol131 -9.310e+02 1.824e+07 0 1
## chol141 -7.936e+02 1.209e+07 0 1
## chol157 -4.406e+02 6.675e+06 0 1
## chol160 -8.979e+02 1.877e+07 0 1
## chol164 -1.255e+03 2.580e+07 0 1
## chol166 9.983e+02 2.241e+07 0 1
## chol167 5.910e+03 1.140e+08 0 1
## chol169 -7.363e+02 1.077e+07 0 1
## chol174 -3.522e+02 3.352e+06 0 1
## chol175 -1.718e+03 3.345e+07 0 1
## chol177 3.704e+02 9.182e+06 0 1
## chol178 1.321e+03 2.425e+07 0 1
## chol180 -1.138e+03 2.133e+07 0 1
## chol182 1.340e+03 2.597e+07 0 1
## chol183 1.168e+03 2.396e+07 0 1
## chol184 3.096e+03 6.582e+07 0 1
## chol186 4.199e+02 9.076e+06 0 1
## chol187 9.275e+02 1.853e+07 0 1
## chol192 1.112e+03 1.960e+07 0 1
## chol193 2.089e+03 4.166e+07 0 1
## chol195 1.539e+03 3.037e+07 0 1
## chol196 -1.264e+03 2.807e+07 0 1
## chol197 1.027e+03 2.006e+07 0 1
## chol198 NA NA NA NA
## chol199 6.189e+03 1.209e+08 0 1
## chol200 NA NA NA NA
## chol201 -3.427e+02 8.824e+06 0 1
## chol203 1.290e+01 2.350e+06 0 1
## chol204 1.067e+03 2.234e+07 0 1
## chol205 1.432e+03 2.845e+07 0 1
## chol206 7.439e+02 1.467e+07 0 1
## chol207 1.227e+03 2.465e+07 0 1
## chol208 3.799e+02 8.863e+06 0 1
## chol209 -4.061e+02 9.204e+06 0 1
## chol210 NA NA NA NA
## chol211 1.254e+03 2.447e+07 0 1
## chol212 1.305e+03 2.633e+07 0 1
## chol213 1.928e+03 3.839e+07 0 1
## chol214 1.192e+02 3.886e+06 0 1
## chol215 9.131e+02 1.644e+07 0 1
## chol216 5.898e+02 1.292e+07 0 1
## chol217 9.989e+02 1.962e+07 0 1
## chol218 -5.579e+01 2.221e+06 0 1
## chol219 -4.490e+02 8.193e+06 0 1
## chol220 1.739e+03 3.486e+07 0 1
## chol221 7.355e+02 1.524e+07 0 1
## chol222 NA NA NA NA
## chol223 8.900e+01 3.861e+06 0 1
## chol225 3.685e+03 7.383e+07 0 1
## chol226 -9.124e+02 1.564e+07 0 1
## chol227 3.049e+03 6.030e+07 0 1
## chol228 9.279e+02 1.674e+07 0 1
## chol229 3.427e+02 8.921e+06 0 1
## chol230 -1.799e+02 4.531e+06 0 1
## chol231 -3.518e+02 6.427e+06 0 1
## chol232 NA NA NA NA
## chol233 2.846e+01 2.228e+06 0 1
## chol234 -6.034e+01 3.498e+06 0 1
## chol235 5.499e+02 1.244e+07 0 1
## chol236 -5.199e+01 1.672e+06 0 1
## chol237 4.584e+03 9.140e+07 0 1
## chol239 5.837e+02 1.198e+07 0 1
## chol240 1.369e+03 2.659e+07 0 1
## chol242 -1.754e+03 3.135e+07 0 1
## chol243 -2.672e+03 5.030e+07 0 1
## chol244 9.222e+01 3.784e+06 0 1
## chol245 5.063e+02 1.037e+07 0 1
## chol246 1.027e+03 2.016e+07 0 1
## chol248 1.468e+03 2.970e+07 0 1
## chol249 NA NA NA NA
## chol250 1.447e+03 2.500e+07 0 1
## chol252 6.744e+02 1.024e+07 0 1
## chol253 -1.390e+03 2.849e+07 0 1
## chol254 9.310e+02 1.775e+07 0 1
## chol255 3.605e+02 8.051e+06 0 1
## chol256 8.478e+02 1.863e+07 0 1
## chol257 -1.462e+03 2.772e+07 0 1
## chol258 -4.228e+02 8.980e+06 0 1
## chol260 1.298e+02 4.573e+06 0 1
## chol261 -7.700e+02 1.546e+07 0 1
## chol263 -3.712e+02 4.986e+06 0 1
## chol264 9.560e+02 1.895e+07 0 1
## chol265 NA NA NA NA
## chol266 -1.461e+02 2.184e+06 0 1
## chol267 -1.656e+03 3.113e+07 0 1
## chol268 -1.514e+02 3.455e+06 0 1
## chol269 -5.123e+01 2.550e+06 0 1
## chol270 4.782e+02 5.938e+06 0 1
## chol271 -9.446e+02 1.645e+07 0 1
## chol273 1.325e+03 2.583e+07 0 1
## chol274 -9.867e+00 2.800e+06 0 1
## chol275 -1.004e+03 2.006e+07 0 1
## chol277 -5.619e+02 1.103e+07 0 1
## chol278 NA NA NA NA
## chol281 1.932e+03 3.788e+07 0 1
## chol282 -2.868e+03 5.788e+07 0 1
## chol283 4.193e+03 8.346e+07 0 1
## chol284 9.792e+01 3.856e+06 0 1
## chol286 4.578e+03 8.841e+07 0 1
## chol288 4.278e+03 8.538e+07 0 1
## chol289 NA NA NA NA
## chol293 -1.233e+03 2.513e+07 0 1
## chol295 -1.542e+03 3.062e+07 0 1
## chol298 -9.951e+02 1.872e+07 0 1
## chol299 5.058e+03 9.681e+07 0 1
## chol302 1.095e+03 2.236e+07 0 1
## chol303 -4.478e+02 8.876e+06 0 1
## chol304 9.841e+02 1.797e+07 0 1
## chol305 -5.848e+02 1.126e+07 0 1
## chol306 NA NA NA NA
## chol307 NA NA NA NA
## chol308 -6.042e+02 1.287e+07 0 1
## chol309 -3.298e+02 5.142e+06 0 1
## chol311 -5.264e+02 9.399e+06 0 1
## chol315 9.750e+02 1.857e+07 0 1
## chol318 NA NA NA NA
## chol319 NA NA NA NA
## chol325 -7.818e+01 2.931e+06 0 1
## chol326 2.379e+03 4.766e+07 0 1
## chol327 -8.820e+02 1.762e+07 0 1
## chol330 1.638e+03 3.218e+07 0 1
## chol340 -1.435e+02 3.267e+06 0 1
## chol341 4.530e+03 9.175e+07 0 1
## chol342 3.879e+03 7.608e+07 0 1
## chol354 -1.412e+03 2.705e+07 0 1
## chol360 5.416e+02 9.535e+06 0 1
## chol394 -9.829e+01 3.485e+06 0 1
## chol407 1.738e+03 3.469e+07 0 1
## chol417 2.352e+03 4.741e+07 0 1
## fbsTrue -7.067e+02 1.451e+07 0 1
## restecg1 1.602e+02 2.630e+06 0 1
## restecg2 NA NA NA NA
## thalach96 -2.715e+03 5.165e+07 0 1
## thalach97 -1.662e+02 6.559e+06 0 1
## thalach99 NA NA NA NA
## thalach103 -3.115e+03 6.005e+07 0 1
## thalach105 -6.205e+02 1.418e+07 0 1
## thalach108 NA NA NA NA
## thalach109 6.892e+02 1.141e+07 0 1
## thalach111 NA NA NA NA
## thalach112 NA NA NA NA
## thalach114 -1.460e+03 2.823e+07 0 1
## thalach115 NA NA NA NA
## thalach116 NA NA NA NA
## thalach117 NA NA NA NA
## thalach118 NA NA NA NA
## thalach120 NA NA NA NA
## thalach122 NA NA NA NA
## thalach124 NA NA NA NA
## thalach125 NA NA NA NA
## thalach126 NA NA NA NA
## thalach127 NA NA NA NA
## thalach130 NA NA NA NA
## thalach131 NA NA NA NA
## thalach132 NA NA NA NA
## thalach133 NA NA NA NA
## thalach134 NA NA NA NA
## thalach136 NA NA NA NA
## thalach138 NA NA NA NA
## thalach140 NA NA NA NA
## thalach141 NA NA NA NA
## thalach142 NA NA NA NA
## thalach143 NA NA NA NA
## thalach144 NA NA NA NA
## thalach145 NA NA NA NA
## thalach146 NA NA NA NA
## thalach147 NA NA NA NA
## thalach148 NA NA NA NA
## thalach149 NA NA NA NA
## thalach150 NA NA NA NA
## thalach151 NA NA NA NA
## thalach152 NA NA NA NA
## thalach153 NA NA NA NA
## thalach154 NA NA NA NA
## thalach155 NA NA NA NA
## thalach156 NA NA NA NA
## thalach157 NA NA NA NA
## thalach158 NA NA NA NA
## thalach159 NA NA NA NA
## thalach160 NA NA NA NA
## thalach161 NA NA NA NA
## thalach162 NA NA NA NA
## thalach163 NA NA NA NA
## thalach164 NA NA NA NA
## thalach165 NA NA NA NA
## thalach166 NA NA NA NA
## thalach168 NA NA NA NA
## thalach169 NA NA NA NA
## thalach170 NA NA NA NA
## thalach171 NA NA NA NA
## thalach172 NA NA NA NA
## thalach173 NA NA NA NA
## thalach174 NA NA NA NA
## thalach175 NA NA NA NA
## thalach178 NA NA NA NA
## thalach179 NA NA NA NA
## thalach180 NA NA NA NA
## thalach181 NA NA NA NA
## thalach182 NA NA NA NA
## thalach184 NA NA NA NA
## thalach186 NA NA NA NA
## thalach187 NA NA NA NA
## thalach190 NA NA NA NA
## thalach192 NA NA NA NA
## thalach202 NA NA NA NA
## exangYes NA NA NA NA
## oldpeak NA NA NA NA
## slope1 NA NA NA NA
## slope2 NA NA NA NA
## ca1 NA NA NA NA
## ca2 NA NA NA NA
## ca3 NA NA NA NA
## ca4 NA NA NA NA
## thal1 NA NA NA NA
## thal2 NA NA NA NA
## thal3 NA NA NA NA
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.9200e+02 on 211 degrees of freedom
## Residual deviance: 1.2299e-09 on 0 degrees of freedom
## AIC: 424
##
## Number of Fisher Scoring iterations: 25
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
model_both <- step(model, direction = "both")
## Start: AIC=424
## target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach +
## exang + oldpeak + slope + ca + thal
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
##
## Step: AIC=424
## target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach +
## exang + oldpeak + slope + ca
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
##
## Step: AIC=424
## target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach +
## exang + oldpeak + slope
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
##
## Step: AIC=424
## target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach +
## exang + oldpeak
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
##
## Step: AIC=424
## target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach +
## exang
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
##
## Step: AIC=424
## target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
##
## Step: AIC=424
## target ~ age + sex + cp + trestbps + chol + fbs + thalach
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
##
## Step: AIC=424
## target ~ age + sex + cp + trestbps + chol + thalach
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
##
## Step: AIC=424
## target ~ age + sex + cp + chol + thalach
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
##
## Step: AIC=424
## target ~ age + sex + chol + thalach
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
##
## Step: AIC=424
## target ~ age + chol + thalach
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Df Deviance AIC
## - thalach 45 16.64 350.64
## <none> 0.00 424.00
## - age 25 576.70 950.70
## - chol 101 2739.32 2961.32
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=350.64
## target ~ age + chol
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - chol 127 224.763 304.76
## + exang 1 0.000 336.00
## + slope 2 0.000 338.00
## + thal 3 0.000 340.00
## + ca 4 0.000 342.00
## + restecg 1 13.863 349.86
## + oldpeak 1 14.030 350.03
## <none> 16.636 350.64
## + fbs 1 15.608 351.61
## + sex 1 16.636 352.64
## + cp 3 16.636 356.64
## - age 37 113.328 373.33
## + trestbps 33 0.000 400.00
## + thalach 45 0.000 424.00
##
## Step: AIC=304.76
## target ~ age
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## + thal 3 173.43 259.43
## + exang 1 180.50 262.50
## + slope 2 179.31 263.31
## + oldpeak 1 181.98 263.98
## + cp 3 182.23 268.23
## + ca 4 184.01 272.01
## + sex 1 209.63 291.63
## - age 39 292.00 294.00
## + restecg 2 219.66 303.66
## <none> 224.76 304.76
## + fbs 1 224.61 306.61
## + trestbps 44 159.15 327.15
## + chol 127 16.64 350.64
## + thalach 71 2739.32 2961.32
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=259.43
## target ~ age + thal
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - age 39 228.17 236.17
## + exang 1 148.19 236.19
## + ca 4 145.42 239.42
## + oldpeak 1 151.72 239.72
## + slope 2 150.12 240.12
## + cp 3 150.04 242.04
## + restecg 2 167.53 257.53
## <none> 173.43 259.43
## + sex 1 172.91 260.91
## + fbs 1 172.94 260.94
## + trestbps 44 114.25 288.25
## - thal 3 224.76 304.76
## + chol 127 0.00 340.00
## + thalach 71 1874.27 2102.27
##
## Step: AIC=236.17
## target ~ thal
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## + cp 3 189.913 203.91
## + exang 1 196.208 206.21
## + ca 4 196.340 212.34
## + oldpeak 1 205.129 215.13
## + slope 2 207.760 219.76
## + restecg 2 221.373 233.37
## <none> 228.167 236.17
## + fbs 1 227.598 237.60
## + sex 1 227.978 237.98
## + age 39 173.431 259.43
## + trestbps 44 179.023 275.02
## + thalach 73 129.583 283.58
## - thal 3 292.005 294.00
## + chol 129 69.537 335.54
##
## Step: AIC=203.91
## target ~ thal + cp
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## + ca 4 169.119 191.12
## + exang 1 175.415 191.41
## + oldpeak 1 175.715 191.72
## + slope 2 174.344 192.34
## + restecg 2 185.394 203.39
## <none> 189.913 203.91
## + sex 1 188.591 204.59
## + fbs 1 189.912 205.91
## - thal 3 227.904 235.90
## - cp 3 228.167 236.17
## + age 39 150.042 242.04
## + trestbps 44 145.113 247.11
## + thalach 73 104.027 264.03
## + chol 129 40.884 312.88
##
## Step: AIC=191.12
## target ~ thal + cp + ca
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## + slope 2 147.676 173.68
## + oldpeak 1 155.914 179.91
## + exang 1 157.116 181.12
## <none> 169.119 191.12
## + restecg 2 165.360 191.36
## + sex 1 168.039 192.04
## + fbs 1 168.623 192.62
## - ca 4 189.913 203.91
## - cp 3 196.340 212.34
## - thal 3 202.057 218.06
## + age 39 126.899 226.90
## + trestbps 44 126.356 236.36
## + thalach 73 91.595 259.60
## + chol 129 31.826 311.83
##
## Step: AIC=173.68
## target ~ thal + cp + ca + slope
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## + exang 1 140.069 168.07
## + oldpeak 1 143.277 171.28
## + sex 1 144.822 172.82
## <none> 147.676 173.68
## + fbs 1 147.396 175.40
## + restecg 2 146.629 176.63
## - cp 3 170.338 190.34
## - slope 2 169.119 191.12
## - thal 3 171.344 191.34
## - ca 4 174.344 192.34
## + age 39 103.911 207.91
## + trestbps 44 106.963 220.96
## + thalach 73 81.588 253.59
## + chol 129 17.707 301.71
##
## Step: AIC=168.07
## target ~ thal + cp + ca + slope + exang
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## + sex 1 137.413 167.41
## + oldpeak 1 137.625 167.62
## <none> 140.069 168.07
## + fbs 1 139.498 169.50
## + restecg 2 139.411 171.41
## - exang 1 147.676 173.68
## - cp 3 153.773 175.77
## - thal 3 159.092 181.09
## - slope 2 157.116 181.12
## - ca 4 163.431 183.43
## + age 39 97.173 203.17
## + trestbps 44 98.560 214.56
## + thalach 73 78.590 252.59
## + chol 129 17.705 303.70
##
## Step: AIC=167.41
## target ~ thal + cp + ca + slope + exang + sex
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## + oldpeak 1 135.091 167.09
## <none> 137.413 167.41
## - sex 1 140.069 168.07
## + fbs 1 136.799 168.80
## + restecg 2 136.496 170.50
## - thal 3 147.838 171.84
## - exang 1 144.822 172.82
## - cp 3 152.339 176.34
## - slope 2 156.360 182.36
## - ca 4 161.964 183.96
## + age 39 91.053 199.05
## + trestbps 44 96.107 214.11
## + thalach 73 76.558 252.56
## + chol 129 0.000 288.00
##
## Step: AIC=167.09
## target ~ thal + cp + ca + slope + exang + sex + oldpeak
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## <none> 135.091 167.09
## - oldpeak 1 137.413 167.41
## - sex 1 137.625 167.62
## + fbs 1 134.566 168.57
## + restecg 2 134.646 170.65
## - exang 1 140.773 170.77
## - thal 3 144.842 170.84
## - slope 2 147.524 175.52
## - cp 3 150.659 176.66
## - ca 4 157.844 181.84
## + age 39 90.021 200.02
## + trestbps 44 93.004 213.00
## + thalach 73 62.934 240.93
## + chol 129 0.000 290.00
model$aic
## [1] 424
model_both$aic
## [1] 167.0912
heart_disease_test$prob_heart<-predict(model_both, type = "response", newdata = heart_disease_test)
heart_disease_test$pred_heart <- factor(ifelse(heart_disease_test$prob_heart > 0.5, "Not Health","Health"))
heart_disease_test[1:10, c("pred_heart", "target")]
## pred_heart target
## 1 Not Health Not Health
## 4 Not Health Not Health
## 6 Health Not Health
## 7 Not Health Not Health
## 9 Not Health Not Health
## 10 Not Health Not Health
## 12 Not Health Not Health
## 18 Not Health Not Health
## 19 Not Health Not Health
## 24 Not Health Not Health
From the result we know that when the data test probability more than 0.5, means the patient Not Health
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
conf_matrix <- confusionMatrix(heart_disease_test$pred_heart, heart_disease_test$target, positive = "Not Health")
accuracy <- conf_matrix$overall['Accuracy']
recall <- conf_matrix$byClass['Recall']
precision <- conf_matrix$byClass['Precision']
f1_score <- conf_matrix$byClass['F1']
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.8571429
cat("Precision:", precision, "\n")
## Precision: 0.8333333
cat("Recall:", recall, "\n")
## Recall: 0.9183673
cat("F1 Score:", f1_score, "\n")
## F1 Score: 0.8737864
library(dplyr)
exp(model_both$coefficients) %>%
data.frame()
## .
## (Intercept) 9.653188e-01
## thal1 3.886517e+00
## thal2 8.431730e+00
## thal3 1.797427e+00
## cp1 1.676569e+00
## cp2 7.621062e+00
## cp3 6.714734e+00
## ca1 1.522204e-01
## ca2 6.192590e-02
## ca3 1.726038e-01
## ca4 5.986371e+06
## slope1 5.323100e-01
## slope2 3.316779e+00
## exangYes 2.962824e-01
## sexMale 3.977241e-01
## oldpeak 6.768212e-01
Based on the confusionMatrix result above, we observe that the model’s overall accuracy in predicting the target variable (Health and Not Health) is 86%. Furthermore, among the total actual instances where individuals are not healthy, the model correctly predicts around 89%. Additionally, among the instances predicted as positive by the model, the proportion of true positives is 85%.
The KNN model is also constructed. In this stage, numeric predictor filtering and class proportion checking are performed. The data is also presented in summary form to understand the range of predictor variable values. Training and testing data are generated by cross-validation.
Create dummy variables from the categoric data in classification
dummy <- dummyVars("~target+sex+cp+fbs+exang+oldpeak+slope+ca+thal", data = heart_disease)
dummy <- data.frame(predict(dummy, newdata = heart_disease))
glimpse(dummy)
## Rows: 303
## Columns: 25
## $ target.Health <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ target.Not.Health <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ sex.Female <dbl> 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1…
## $ sex.Male <dbl> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0…
## $ cp.0 <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ cp.1 <dbl> 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ cp.2 <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0…
## $ cp.3 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ fbs.False <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1…
## $ fbs.True <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ exang.No <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1…
## $ exang.Yes <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.…
## $ slope.0 <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ slope.1 <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0…
## $ slope.2 <dbl> 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0…
## $ ca.0 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ ca.1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ca.2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ca.3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ca.4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ thal.0 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ thal.1 <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ thal.2 <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ thal.3 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Delete the dummy variable which the previous variable has two categories
dummy$target.Health <- NULL
dummy$sex.Female <- NULL
dummy$fbs.False <- NULL
dummy$exang.No <- NULL
prop.table(table(dummy$target))
##
## 0 1
## 0.4554455 0.5445545
summary(dummy)
## target.Not.Health sex.Male cp.0 cp.1
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :1.0000 Median :1.0000 Median :0.0000 Median :0.000
## Mean :0.5446 Mean :0.6832 Mean :0.4719 Mean :0.165
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000
## cp.2 cp.3 fbs.True exang.Yes
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.2871 Mean :0.07591 Mean :0.1485 Mean :0.3267
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## oldpeak slope.0 slope.1 slope.2
## Min. :0.00 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.00 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.80 Median :0.00000 Median :0.000 Median :0.0000
## Mean :1.04 Mean :0.06931 Mean :0.462 Mean :0.4686
## 3rd Qu.:1.60 3rd Qu.:0.00000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :6.20 Max. :1.00000 Max. :1.000 Max. :1.0000
## ca.0 ca.1 ca.2 ca.3
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :1.0000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.5776 Mean :0.2145 Mean :0.1254 Mean :0.06601
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## ca.4 thal.0 thal.1 thal.2
## Min. :0.0000 Min. :0.000000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.000000 Median :0.00000 Median :1.0000
## Mean :0.0165 Mean :0.006601 Mean :0.05941 Mean :0.5479
## 3rd Qu.:0.0000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000000 Max. :1.00000 Max. :1.0000
## thal.3
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3861
## 3rd Qu.:1.0000
## Max. :1.0000
set.seed(300)
index_dmy <- sample(x = nrow(dummy), size = nrow(dummy) * 0.8)
heartdmy_train <- dummy[index_dmy, ]
heartdmy_test <- dummy[-index_dmy, ]
heartdmy_train_label <- dummy[index_dmy,1]
heartdmy_test_label <- dummy[-index_dmy,1]
KNN_Pred <- class::knn(train = heartdmy_train,
test = heartdmy_test,
cl = heartdmy_train_label,
k = 17)
KNN_Pred_Coef <- confusionMatrix(as.factor(KNN_Pred), as.factor(heartdmy_test_label),"1")
KNN_Pred_Coef
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25 1
## 1 2 33
##
## Accuracy : 0.9508
## 95% CI : (0.8629, 0.9897)
## No Information Rate : 0.5574
## P-Value [Acc > NIR] : 6.295e-12
##
## Kappa : 0.8999
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9706
## Specificity : 0.9259
## Pos Pred Value : 0.9429
## Neg Pred Value : 0.9615
## Prevalence : 0.5574
## Detection Rate : 0.5410
## Detection Prevalence : 0.5738
## Balanced Accuracy : 0.9483
##
## 'Positive' Class : 1
##
cat("Accuracy:", KNN_Pred_Coef$overall["Accuracy"], "\n")
## Accuracy: 0.9508197
cat("Precision:", KNN_Pred_Coef$byClass["Pos Pred Value"], "\n")
## Precision: 0.9428571
cat("Recall:", KNN_Pred_Coef$byClass["Sensitivity"], "\n")
## Recall: 0.9705882
cat("F1 Score:", KNN_Pred_Coef$byClass["F1"], "\n")
## F1 Score: 0.9565217
# Evaluasi Model Logistic Regression
eval_logit <- data.frame(Accuracy = conf_matrix$overall["Accuracy"],
Recall = conf_matrix$byClass["Sensitivity"],
Specificity = conf_matrix$byClass["Specificity"],
Precision = conf_matrix$byClass["Pos Pred Value"])
# Evaluasi Model K-NN
eval_knn <- data.frame(Accuracy = KNN_Pred_Coef$overall["Accuracy"],
Recall = KNN_Pred_Coef$byClass["Sensitivity"],
Specificity = KNN_Pred_Coef$byClass["Specificity"],
Precision = KNN_Pred_Coef$byClass["Pos Pred Value"])
eval_logit
## Accuracy Recall Specificity Precision
## Accuracy 0.8571429 0.9183673 0.7857143 0.8333333
eval_knn
## Accuracy Recall Specificity Precision
## Accuracy 0.9508197 0.9705882 0.9259259 0.9428571
**** Insights : Based on the Recall value, the K-NN model has a higher Recall value (0.972973) compared to the Logistic Regression model (0.9285714). This indicates that the K-NN model has better capability in predicting patients who are actually Not Health.
Therefore, from these evaluation results, it can be concluded that using the K-NN method is more recommended for predicting patients who are actually sick and not sick, due to its higher Recall value.
Based on the evaluation conducted, it can be concluded that the Logistic Regression model performs better in predicting passengers who actually survived as not survived. This is indicated by the higher recall value in the Logistic Regression model compared to the KNN model.
In conclusion, it is recommended to use the KNN model as the optimal model for predicting the tendency of patients with heart disease based on the performance evaluation conducted.
Thus, this report provides an overview of the classification analysis process using the heart disease dataset, model selection, performance evaluation, and the recommended model for use.