Case Information

Heart disease is a prevalent and serious medical condition that requires accurate diagnosis and effective prediction methods. In this analysis, two popular machine learning algorithms, Logistic Regression and K-Nearest Neighbors (KNN), were utilized to examine and predict the occurrence of heart disease.

Logistic Regression, a binary classification algorithm, was employed to model the relationship between various input features and the presence or absence of heart disease. By estimating the probabilities, Logistic Regression can provide insights into the likelihood of an individual having heart disease based on the given predictors. Through iterative optimization techniques, the model learns the optimal coefficients for each feature, allowing it to make predictions on new, unseen data.

On the other hand, K-Nearest Neighbors (KNN) is a non-parametric algorithm that relies on the similarity of instances to classify new data points. KNN considers the k closest neighbors of a given sample and assigns the most prevalent class among them. In the context of heart disease analysis, KNN can identify patterns in the dataset and classify new instances based on the similarity of their features to those in the training set.

Both Logistic Regression and KNN offer valuable insights and predictions for heart disease analysis. The choice between these models depends on the specific requirements of the analysis, the nature of the dataset, and the desired interpretability of the results. By leveraging these algorithms, researchers and healthcare professionals can enhance their understanding of heart disease risk factors and contribute to more accurate diagnosis and treatment strategies.

Importing Libraries

library(dplyr)
library(GGally)
library(ggplot2)
library(ggcorrplot)
library(corrplot)
library(reshape2)
library(gmodels)
library(class)
library(tidyr)
library(treemapify)
library(viridis)
library(caret)
library(performance)

Data Preparation

Data preparation is an important step in data analysis that transforms raw data into a cleaner, more suitable format for further analysis.

Reading the Heart Data

The data is a collection of information related to patients examined for heart disease. Each row represents one patient, and each column contains a feature relevant to the analysis of heart health problems.

df_heart <- read.csv("data_input/heart.csv", stringsAsFactors = F)
head(df_heart)

Data Exploration

By conducting ‘Data Exploration,’ we can gain initial insights into the characteristics of the data and determine the next steps in the analysis and modeling. Let’s use the glimpse() function to inspect the structure of the dataset: the number of rows and columns, the data type of each column, and a preview of the values in each column. With glimpse(), we can quickly see important information about the dataset and confirm that the data has been read correctly before further analysis.

glimpse(df_heart)
#> Rows: 1,025
#> Columns: 14
#> $ age      <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex      <int> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol     <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs      <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg  <int> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach  <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang    <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak  <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope    <int> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca       <int> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal     <int> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target   <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
  • age: age in years

  • sex: 1 = male; 0 = female

  • cp : chest pain type

      `0`: Typical angina: chest pain related to decreased blood supply to the heart
      `1`: Atypical angina: chest pain not related to heart
      `2`: Non-anginal pain: typically esophageal spasms (non heart related)
      `3`: Asymptomatic: chest pain not showing signs of disease
  • trestbps: resting blood pressure (in mm Hg on admission to the hospital); anything above 130-140 is typically cause for concern

  • chol: serum cholesterol in mg/dl

      `serum` = LDL + HDL + .2 * triglycerides
      `above 200` is cause for concern
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

      `>126` mg/dL signals diabetes
  • restecg: resting electrocardiographic results

      `0`: Nothing to note
      `1`: ST-T Wave abnormality
          - can range from mild symptoms to severe problems
          - signals non-normal heart beat
      `2`: Possible or definite left ventricular hypertrophy
          - Enlarged heart's main pumping chamber
  • thalach: maximum heart rate achieved

  • exang: exercise induced angina (1 = yes; 0 = no)

  • oldpeak: ST depression induced by exercise relative to rest; reflects the stress on the heart during exercise (an unhealthy heart will show more stress)

  • slope: the slope of the peak exercise ST segment

      `0`: Upsloping: better heart rate with exercise (uncommon)
      `1`: Flatsloping: minimal change (typical healthy heart)
      `2`: Downsloping: signs of an unhealthy heart
  • ca: number of major vessels (0-3) colored by fluoroscopy

      - colored vessel means the doctor can see the blood passing through
      - the more blood movement the better (no clots)
  • thal: thallium stress test result

      `1,3`: normal
      `6`: fixed defect: used to be a defect but is fine now
      `7`: reversible defect: no proper blood movement when exercising
  • target: have disease or not (1=yes, 0=no) (= the predicted attribute)

The “target” column indicates whether the patient has heart disease or not. If the “target” value is 1, the patient has heart disease (positive condition), while if the “target” value is 0, the patient does not have heart disease (negative condition).

Checking Missing Values (NA)

We inspect the NA values in each column, so that we can understand the data we have and determine what actions need to be taken.

colSums(is.na(df_heart))
#>      age      sex       cp trestbps     chol      fbs  restecg  thalach 
#>        0        0        0        0        0        0        0        0 
#>    exang  oldpeak    slope       ca     thal   target 
#>        0        0        0        0        0        0

After the check, no NA values were found in this dataset. This indicates that the data is complete, without any missing values, and can be further processed for analysis and modeling with more confidence.
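No action is needed here, but for completeness, a minimal sketch of how missing values could be handled if they were present (hypothetical, since this dataset has none) is shown below; whether to drop rows or impute depends on how much data would be lost.

# Hypothetical handling of NAs (not needed for this dataset)
# Option 1: drop incomplete rows
df_complete <- df_heart %>% tidyr::drop_na()

# Option 2: impute a numeric column with its median, e.g. chol
df_imputed <- df_heart %>%
  mutate(chol = ifelse(is.na(chol), median(chol, na.rm = TRUE), chol))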

Checking Data Distribution

summary(df_heart)
#>       age             sex               cp            trestbps    
#>  Min.   :29.00   Min.   :0.0000   Min.   :0.0000   Min.   : 94.0  
#>  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:120.0  
#>  Median :56.00   Median :1.0000   Median :1.0000   Median :130.0  
#>  Mean   :54.43   Mean   :0.6956   Mean   :0.9424   Mean   :131.6  
#>  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:140.0  
#>  Max.   :77.00   Max.   :1.0000   Max.   :3.0000   Max.   :200.0  
#>       chol          fbs            restecg          thalach     
#>  Min.   :126   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
#>  1st Qu.:211   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:132.0  
#>  Median :240   Median :0.0000   Median :1.0000   Median :152.0  
#>  Mean   :246   Mean   :0.1493   Mean   :0.5298   Mean   :149.1  
#>  3rd Qu.:275   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
#>  Max.   :564   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
#>      exang           oldpeak          slope             ca        
#>  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
#>  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
#>  Median :0.0000   Median :0.800   Median :1.000   Median :0.0000  
#>  Mean   :0.3366   Mean   :1.072   Mean   :1.385   Mean   :0.7541  
#>  3rd Qu.:1.0000   3rd Qu.:1.800   3rd Qu.:2.000   3rd Qu.:1.0000  
#>  Max.   :1.0000   Max.   :6.200   Max.   :2.000   Max.   :4.0000  
#>       thal           target      
#>  Min.   :0.000   Min.   :0.0000  
#>  1st Qu.:2.000   1st Qu.:0.0000  
#>  Median :2.000   Median :1.0000  
#>  Mean   :2.324   Mean   :0.5132  
#>  3rd Qu.:3.000   3rd Qu.:1.0000  
#>  Max.   :3.000   Max.   :1.0000
# Select numeric columns
numeric_cols <- sapply(df_heart, is.numeric)

# Plot histograms for numeric columns
par(mfrow = c(3, 5))  # Adjust the layout of subplots if needed
for (col in names(df_heart)[numeric_cols]) {
  hist(df_heart[[col]], main = col, xlab = "Value")
}
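Since ggplot2 and tidyr are already loaded, the same distributions can also be drawn as a single faceted plot, which some readers may find easier to scan than base-graphics subplots; a sketch:

# Faceted histograms of all numeric columns with ggplot2
df_heart %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "white") +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal()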

Data Wrangling

Changing Data Types

The following columns are categorical, so they will be converted to factors:

  • sex
  • cp
  • fbs
  • restecg
  • exang
  • slope
  • ca
  • thal
  • target
df_heart <- df_heart %>%
  mutate(sex = as.factor(sex),
         cp = as.factor(cp),
         fbs = as.factor(fbs),
         restecg = as.factor(restecg),
         exang = as.factor(exang),
         slope = as.factor(slope),
         ca = as.factor(ca),
         thal = as.factor(thal),
         target = as.factor(target))

glimpse(df_heart)
#> Rows: 1,025
#> Columns: 14
#> $ age      <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex      <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol     <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs      <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg  <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach  <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang    <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak  <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope    <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca       <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal     <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target   <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
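As a side note, the same conversion can be written more compactly with across() (available in dplyr >= 1.0); a sketch that produces an equivalent result:

# Equivalent factor conversion using across()
cat_cols <- c("sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "target")
df_heart <- df_heart %>%
  mutate(across(all_of(cat_cols), as.factor))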

Exploratory Data Analysis

Correlation Matrix

“Correlation Matrix” is a matrix that shows the correlation level between each pair of features in the dataset. This matrix is used to analyze the linear relationship between the features. Each cell in the matrix displays the correlation coefficient between two features, with values ranging from -1 to 1. A value of 1 indicates a perfect positive correlation, a value of -1 indicates a perfect negative correlation, and a value of 0 indicates no correlation between the features. The “Correlation Matrix” is an important tool in data exploration to understand the relationships between existing features and can aid in feature selection or identifying interesting correlation patterns in the dataset.

ggcorr(df_heart, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

The correlation values shown in the plot above indicate the relationships between several pairs of numeric variables in the df_heart dataset. Here is a brief explanation of each correlation value:

  • The correlation value between ‘oldpeak’ and ‘thalach’ is -0.3. This indicates a weak negative correlation between the ST depression induced by exercise relative to rest (oldpeak) and the maximum heart rate achieved (thalach).

  • The correlation value between ‘thalach’ and ‘chol’ is 0. This indicates that there is no linear correlation between the maximum heart rate achieved (thalach) and the cholesterol level (chol).

  • The correlation value between ‘chol’ and ‘oldpeak’ is 0.1. This indicates a weak positive correlation between the cholesterol level (chol) and the ST depression induced by exercise relative to rest (oldpeak).

  • The correlation value between ‘trestbps’ and ‘oldpeak’ is 0.2. This indicates a weak positive correlation between the resting blood pressure (trestbps) and the ST depression induced by exercise relative to rest (oldpeak).

  • The correlation value between ‘trestbps’ and ‘thalach’ is 0. This indicates that there is no linear correlation between the resting blood pressure (trestbps) and the maximum heart rate achieved (thalach).

  • The correlation value between ‘trestbps’ and ‘chol’ is 0.1. This indicates a weak positive correlation between the resting blood pressure (trestbps) and the cholesterol level (chol).

  • The correlation value between ‘age’ and ‘oldpeak’ is 0.2. This indicates a weak positive correlation between the patient’s age (age) and the ST depression induced by exercise relative to rest (oldpeak).

  • The correlation value between ‘age’ and ‘thalach’ is -0.4. This indicates a moderately strong negative correlation between the patient’s age (age) and the maximum heart rate achieved (thalach).

  • The correlation value between ‘age’ and ‘chol’ is 0.2. This indicates a weak positive correlation between the patient’s age (age) and the cholesterol level (chol).

  • The correlation value between ‘age’ and ‘trestbps’ is 0.3. This indicates a weak positive correlation between the patient’s age (age) and the resting blood pressure (trestbps).

Please remember that correlation values only describe the linear relationship between these variables. Correlation does not imply causation, and non-linear relationships between variables are not reflected in these correlation values. It is essential to combine domain knowledge and further statistical analysis to gain a more comprehensive understanding of the relationships between variables in the dataset.
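For a numeric view of the same relationships, the correlation matrix of the remaining numeric columns can also be computed directly and visualized with corrplot (already loaded); a sketch:

# Correlation matrix of the numeric predictors
num_vars <- df_heart %>% select(where(is.numeric))
cor_mat <- cor(num_vars)
round(cor_mat, 2)

# Heatmap-style view with coefficients printed in each cell
corrplot(cor_mat, method = "color", type = "lower", addCoef.col = "black")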

Target Proportion

We will examine the target proportion in this dataset. By looking at the target values in the ‘target’ column, we can determine how many patients have heart problems (target=1) and how many patients are without heart problems (target=0) in the dataset.

# Calculate proportions
target_prop <- prop.table(table(df_heart$target))
# Create data frame for pie chart
df_target_prop <- data.frame(Category = names(target_prop), Proportion = target_prop)
df_heart$target %>% 
  levels()
#> [1] "0" "1"
library(ggplot2)
# Plot pie chart
ggplot(df_target_prop, aes(x = "", y = Proportion.Freq, fill = Category)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(fill = "Target", y = NULL) +
  ggtitle("Proportion of Target Categories") +
  theme_minimal()

table(df_heart$target)
#> 
#>   0   1 
#> 499 526

The target proportion in this dataset is as follows:

Patients without heart problems (target=0) account for 48.68%. Patients with heart problems (target=1) account for 51.32%. This means that approximately 48.68% of the data represents patients without heart problems, while approximately 51.32% represents patients with heart problems in the dataset. These proportions can help us understand the distribution of the target class in the dataset and can be used in further analysis and modeling.

Cross Validation

Cross-validation is a commonly used technique in data analysis and statistical modeling for evaluating model performance: the data is divided into several subsets (folds), and the model is trained on some subsets and tested on the others. In this analysis we use a simpler holdout scheme instead: a training set is sampled from each target class with custom class proportions, and the rows not selected for training form the test set.
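For reference, an actual k-fold cross-validation setup with caret (already loaded) might look roughly like the sketch below; it is not used in the rest of this analysis:

# Sketch of 5-fold cross-validation for a logistic regression model (not run further)
ctrl <- trainControl(method = "cv", number = 5)
cv_model <- train(target ~ ., data = df_heart,
                  method = "glm", family = "binomial",
                  trControl = ctrl)
cv_model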

# Set the seed for reproducibility
set.seed(123)

# Create custom class ratios
class_ratios <- c("0" = 0.35, "1" = 0.65)

# Split the data into two subsets based on the target class
subset_0 <- df_heart[df_heart$target == 0, ]
subset_1 <- df_heart[df_heart$target == 1, ]

# Sample a subset of class 0 data with the desired ratio
num_samples_0 <- round(nrow(subset_0) * class_ratios["0"])
sampled_subset_0 <- subset_0[sample(nrow(subset_0), num_samples_0), ]

# Sample a subset of class 1 data with the desired ratio
num_samples_1 <- round(nrow(subset_1) * class_ratios["1"])
sampled_subset_1 <- subset_1[sample(nrow(subset_1), num_samples_1), ]

# Combine the sampled subsets to create the balanced training set
train_set <- rbind(sampled_subset_0, sampled_subset_1)

# Create the test set by excluding the samples used in the training set
test_set <- df_heart[!rownames(df_heart) %in% rownames(train_set), ]
prop.table(table(train_set$target))
#> 
#>         0         1 
#> 0.3384913 0.6615087

The result shows that in the training set, approximately 33.85% of the samples belong to class 0 (patients without heart problems) and approximately 66.15% belong to class 1 (patients with heart problems).

Modeling

Modeling is a crucial stage in data analysis, where we use specific algorithms or statistical methods to build predictive models based on the preprocessed data. In this stage, we utilize the training set to train the model, aiming for the model to learn from patterns and characteristics present in the data. Once the model is trained, we use the test set to evaluate the model’s performance and measure how well it can make predictions on unseen data. By employing techniques like Cross Validation, we can ensure that the model performs well and can be effectively generalized to new data.

Logistic Regression

Logistic Regression is a statistical method used to perform regression analysis on data with a binary target variable (categorical with two values, for example 0 and 1). This method predicts the probability that an observation falls into one of the target categories based on the predictor (independent) variables. In logistic regression, the logistic function is used to model the relationship between the predictors and the probability of the target category. This model is highly valuable in many fields, such as decision support, risk analysis, and data classification.
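At the core of the model is the logistic (sigmoid) function, which maps any linear combination of the predictors to a probability between 0 and 1; a minimal illustration:

# The logistic function maps a linear predictor eta to a probability in (0, 1)
sigmoid <- function(eta) 1 / (1 + exp(-eta))

eta <- seq(-6, 6, by = 2)
data.frame(eta = eta,
           probability = sigmoid(eta),
           plogis = plogis(eta))  # base R computes the same quantity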

Using All Predictors

At this stage, we will use all the predictors (independent variables) available in the dataset to perform modeling using the glm() function (Generalized Linear Model). The glm() function allows us to apply the logistic regression method to our data, illustrating the relationship between these predictors and the probability of the binary target category. Thus, we can generate a predictive model that can aid in classifying data and understanding the influence of each independent variable on the target variable.

model_all <- glm(formula = target~., family = "binomial", 
             data = train_set)
summary(model_all)
#> 
#> Call:
#> glm(formula = target ~ ., family = "binomial", data = train_set)
#> 
#> Coefficients:
#>                Estimate  Std. Error z value      Pr(>|z|)    
#> (Intercept)   -0.505797    2.985845  -0.169      0.865483    
#> age            0.047606    0.022478   2.118      0.034186 *  
#> sex1          -2.180179    0.506720  -4.303 0.00001688572 ***
#> cp1            0.929595    0.473473   1.963      0.049605 *  
#> cp2            1.810263    0.433880   4.172 0.00003015858 ***
#> cp3            2.036468    0.584518   3.484      0.000494 ***
#> trestbps      -0.021870    0.009779  -2.236      0.025323 *  
#> chol          -0.005465    0.003537  -1.545      0.122327    
#> fbs1           0.353047    0.532846   0.663      0.507606    
#> restecg1       0.579798    0.335002   1.731      0.083500 .  
#> restecg2      -1.353763    1.977315  -0.685      0.493567    
#> thalach        0.024318    0.010046   2.421      0.015494 *  
#> exang1        -0.620167    0.392702  -1.579      0.114283    
#> oldpeak       -0.359396    0.208676  -1.722      0.085021 .  
#> slope1        -1.276452    0.725975  -1.758      0.078703 .  
#> slope2        -0.205562    0.776098  -0.265      0.791112    
#> ca1           -2.116843    0.423836  -4.994 0.00000058994 ***
#> ca2           -3.868565    0.637217  -6.071 0.00000000127 ***
#> ca3           -3.001089    0.917125  -3.272      0.001067 ** 
#> ca4           15.647928 1078.184692   0.015      0.988421    
#> thal1          2.734564    2.027630   1.349      0.177449    
#> thal2          2.388933    1.944157   1.229      0.219156    
#> thal3          0.753209    1.947557   0.387      0.698944    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 661.79  on 516  degrees of freedom
#> Residual deviance: 273.77  on 494  degrees of freedom
#> AIC: 319.77
#> 
#> Number of Fisher Scoring iterations: 16

Predictor variables with a significant influence (null hypothesis rejected at the 5% level):

  • Sex (sex1): the estimated coefficient is -2.180 with a p-value < 0.001. The null hypothesis (H0) that gender does not affect the likelihood of having heart problems is rejected, so gender has a significant influence.

  • Chest pain type (cp1, cp2, cp3): each level has a positive estimated coefficient with a p-value < 0.05, so chest pain type has a significant influence on the likelihood of having heart problems.

  • Age (age): the estimated coefficient is 0.048 with a p-value of 0.034, so age has a significant (positive) influence at the 5% level.

  • Resting blood pressure (trestbps, p = 0.025), maximum heart rate achieved (thalach, p = 0.015), and the number of major vessels (ca1, ca2, ca3, all p < 0.01) are also significant.

Predictor variables without a significant influence (null hypothesis not rejected):

  • Cholesterol level (chol, p = 0.122), fasting blood sugar (fbs1), resting ECG (restecg1, restecg2), exercise-induced angina (exang1), oldpeak, the slope levels, ca4, and the thal levels all have p-values above 0.05, so their individual influence is not statistically significant in this model. Further analysis or more in-depth research would be needed to better understand their role.

In conclusion, in this logistic regression model, sex, chest pain type, age, resting blood pressure, maximum heart rate, and the number of major vessels (ca) are the predictors with a significant influence on the likelihood of patients having heart problems, while the remaining predictors are not individually significant.

Using Step-Wise (Backward)

Step-Wise (Backward) is a feature selection method used to build a simpler and more efficient model. The method works by iteratively removing one feature at a time from the model; each time a feature is removed, the model’s performance is evaluated. If removing a feature improves the model’s performance or has no significant impact, the feature stays out of the model. This process continues until no further removal improves the model.

By using Step-Wise (Backward), we can build a model that is more concise and efficient by retaining only the most influential features in predicting the target variable. This helps reduce model complexity and minimizes the risk of overfitting, allowing the model to better generalize to unseen data.

library(MASS)
model_backward <- stepAIC(model_all, direction = "backward")
#> Start:  AIC=319.77
#> target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + 
#>     exang + oldpeak + slope + ca + thal
#> 
#>            Df Deviance    AIC
#> - fbs       1   274.22 318.22
#> - restecg   2   277.55 319.55
#> <none>          273.77 319.77
#> - chol      1   276.06 320.06
#> - exang     1   276.23 320.23
#> - oldpeak   1   276.80 320.80
#> - age       1   278.39 322.39
#> - trestbps  1   278.94 322.94
#> - thalach   1   279.91 323.91
#> - slope     2   283.11 325.11
#> - cp        3   298.56 338.56
#> - sex       1   295.59 339.59
#> - thal      3   300.41 340.41
#> - ca        4   346.91 384.91
#> 
#> Step:  AIC=318.22
#> target ~ age + sex + cp + trestbps + chol + restecg + thalach + 
#>     exang + oldpeak + slope + ca + thal
#> 
#>            Df Deviance    AIC
#> - restecg   2   278.15 318.15
#> <none>          274.22 318.22
#> - chol      1   276.44 318.44
#> - exang     1   276.62 318.62
#> - oldpeak   1   277.61 319.61
#> - age       1   278.87 320.87
#> - trestbps  1   279.06 321.06
#> - thalach   1   280.63 322.63
#> - slope     2   283.29 323.29
#> - sex       1   295.82 337.82
#> - cp        3   301.20 339.20
#> - thal      3   301.75 339.75
#> - ca        4   348.11 384.11
#> 
#> Step:  AIC=318.15
#> target ~ age + sex + cp + trestbps + chol + thalach + exang + 
#>     oldpeak + slope + ca + thal
#> 
#>            Df Deviance    AIC
#> <none>          278.15 318.15
#> - exang     1   280.20 318.20
#> - chol      1   281.10 319.10
#> - oldpeak   1   281.60 319.60
#> - age       1   282.37 320.37
#> - thalach   1   284.79 322.79
#> - trestbps  1   285.30 323.30
#> - slope     2   287.45 323.45
#> - sex       1   299.44 337.44
#> - thal      3   304.76 338.76
#> - cp        3   306.42 340.42
#> - ca        4   353.06 385.06

In the Step-Wise (Backward) process of model building, less significant features are gradually removed from the model based on the AIC (Akaike Information Criterion). AIC is used to compare different models; the lower the AIC, the better the trade-off between goodness of fit and model complexity.

At the initial stage, the full model has an AIC of 319.77. During the iterations, ‘fbs’ is removed first (AIC 318.22) and then ‘restecg’ (AIC 318.15), at which point no further removal lowers the AIC.

The conclusion of the Step-Wise (Backward) process is that the best model for the heart data can be built using the features ‘age’, ‘sex’, ‘cp’, ‘trestbps’, ‘chol’, ‘thalach’, ‘exang’, ‘oldpeak’, ‘slope’, ‘ca’, and ‘thal’. These features provide sufficient information to predict the target effectively, while ‘fbs’ and ‘restecg’ are considered less important in explaining the variability of the data. The final model has the lowest AIC, making it a simpler and more efficient model for this heart disease case.
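To double-check which predictors survived the backward elimination and to compare the two models directly, the final formula and the AIC values can be inspected; a short check:

# Formula of the model selected by stepAIC()
formula(model_backward)

# AIC of the full model versus the backward-selected model
AIC(model_all, model_backward)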

Prediction

# Make predictions on test set
test_set$predict <- predict(model_backward, newdata = test_set, type = "response")

# Compare predictions with actual values
comparison <- data.frame(Actual = test_set$target, Predicted = test_set$predict)

# View the predicted probabilities
head(comparison)
# Plot histogram of predicted probabilities
hist(test_set$predict, breaks = 20, col = "skyblue", main = "Distribution of Predicted Probabilities", xlab = "Predicted Probability")

# Plot density of probability predictions
ggplot(test_set, aes(x = predict)) +
  geom_density(lwd = 0.5, fill = "skyblue", alpha = 0.5) +
  labs(title = "Distribution of Probability Predictions") +
  theme_minimal()

Model Evaluation

Model Evaluation is the process of evaluating the performance of a prediction model based on unseen test data that the model has not encountered before. The goal of this evaluation is to measure how accurately the model can make predictions and how reliable it is in generalizing to new data.

library(caret)

# Convert predicted probabilities to factor variable with the same levels as target
test_set$predict <- factor(ifelse(test_set$predict >= 0.5, "1", "0"), levels = levels(test_set$target))

# Compute confusion matrix
log_conf <- confusionMatrix(test_set$target, test_set$predict, positive = "1")

# View the confusion matrix
log_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 262  62
#>          1  13 171
#>                                                
#>                Accuracy : 0.8524               
#>                  95% CI : (0.8185, 0.8821)     
#>     No Information Rate : 0.5413               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.6978               
#>                                                
#>  Mcnemar's Test P-Value : 0.00000002981        
#>                                                
#>             Sensitivity : 0.7339               
#>             Specificity : 0.9527               
#>          Pos Pred Value : 0.9293               
#>          Neg Pred Value : 0.8086               
#>              Prevalence : 0.4587               
#>          Detection Rate : 0.3366               
#>    Detection Prevalence : 0.3622               
#>       Balanced Accuracy : 0.8433               
#>                                                
#>        'Positive' Class : 1                    
#> 
  • Recall/Sensitivity: It measures the proportion of actual positive cases that are correctly identified as positive by the model. In other words, it represents the model’s ability to accurately detect positive cases. A high recall indicates a low rate of false negatives.

  • Specificity: It measures the proportion of actual negative cases that are correctly identified as negative by the model. It represents the model’s ability to accurately detect negative cases. A high specificity indicates a low rate of false positives.

  • Accuracy: It measures the overall correctness of the model’s predictions, regardless of the class. It represents the proportion of correct predictions (both true positives and true negatives) out of all the predictions made by the model.

  • Precision: It measures the proportion of positive predictions that are actually correct. It represents the model’s ability to avoid false positives. A high precision indicates a low rate of false positives.

# Extract values from the confusion matrix.
# Note: confusionMatrix() was called with the actual labels as `data` and the
# predictions as `reference`, so the actual classes are in the rows of the
# table; the indices below account for that orientation.
TP <- log_conf$table[2, 2]
TN <- log_conf$table[1, 1]
FP <- log_conf$table[1, 2]
FN <- log_conf$table[2, 1]

# Calculate recall/sensitivity
Recall <- TP / (TP + FN)

# Calculate specificity
Specificity <- TN / (TN + FP)

# Calculate accuracy
Accuracy <- (TP + TN) / (TP + TN + FP + FN)

# Calculate precision
Precision <- TP / (TP + FP)
performance <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance

Insight

Based on the manually computed values above, the evaluation metrics are as follows (these use the conventional definitions; the Sensitivity and Specificity printed by caret are transposed here because the actual labels were passed to confusionMatrix() as data and the predictions as reference):

  • Accuracy: The model has an accuracy of approximately 85.24%, indicating that it correctly predicts around 85.24% of the cases.

  • Recall (Sensitivity): The model has a recall of approximately 92.93%, indicating that it is able to correctly detect around 92.93% of the positive cases.

  • Precision: The model has a precision of approximately 73.39%, indicating that out of all the cases predicted as positive by the model, around 73.39% are true positive cases.

  • Specificity: The model has a specificity of approximately 80.86%, indicating that it is able to accurately classify around 80.86% of the negative cases.

In the context of predicting heart disease, achieving high recall (sensitivity) is prioritized. Recall measures the model’s ability to correctly identify the actual positive cases, reducing the likelihood of missing any cases of heart disease (false negatives).

For patients who are actually healthy but diagnosed as sick, further diagnostic steps can be taken. In this case, favoring recall (sensitivity) helps ensure that patients who may have the disease are not missed out on receiving appropriate medical attention.
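If recall needed to be pushed even higher, one option (a sketch only, not applied in this analysis) is to lower the 0.5 probability cutoff, accepting more false positives in exchange for fewer false negatives:

# Re-classify with a lower cutoff (e.g. 0.3) to favour recall -- sketch only
prob_pred <- predict(model_backward, newdata = test_set, type = "response")
pred_03 <- factor(ifelse(prob_pred >= 0.3, "1", "0"),
                  levels = levels(test_set$target))
confusionMatrix(pred_03, test_set$target, positive = "1")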

K-Nearest Neighbour

K-nearest neighbors (KNN) is one of the methods in data analysis used for classification and regression. In the heart dataset, KNN can be used to classify patients based on the existing features in the data, such as age, gender, blood pressure, cholesterol levels, and other features. The KNN method will search for K nearest data points from the test data and determine its class based on the majority class of the nearest neighbors. By using KNN, we can predict whether a patient has a heart problem or not based on the given feature data.
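Before applying it to the heart data, the majority-vote idea can be illustrated on a tiny made-up dataset (the values below are purely hypothetical):

# Toy illustration of KNN majority voting (hypothetical data, two clear clusters)
toy_train  <- data.frame(x1 = c(1, 2, 1.5, 8, 9, 8.5),
                         x2 = c(1, 1.5, 2, 8, 8.5, 9))
toy_labels <- factor(c(0, 0, 0, 1, 1, 1))
toy_new    <- data.frame(x1 = c(1.8, 8.2),
                         x2 = c(1.2, 8.8))

# Each new point takes the majority class of its 3 nearest neighbours
knn(train = toy_train, test = toy_new, cl = toy_labels, k = 3)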

summary(df_heart)
#>       age        sex     cp         trestbps          chol     fbs     restecg
#>  Min.   :29.00   0:312   0:497   Min.   : 94.0   Min.   :126   0:872   0:497  
#>  1st Qu.:48.00   1:713   1:167   1st Qu.:120.0   1st Qu.:211   1:153   1:513  
#>  Median :56.00           2:284   Median :130.0   Median :240           2: 15  
#>  Mean   :54.43           3: 77   Mean   :131.6   Mean   :246                  
#>  3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:275                  
#>  Max.   :77.00                   Max.   :200.0   Max.   :564                  
#>     thalach      exang      oldpeak      slope   ca      thal    target 
#>  Min.   : 71.0   0:680   Min.   :0.000   0: 74   0:578   0:  7   0:499  
#>  1st Qu.:132.0   1:345   1st Qu.:0.000   1:482   1:226   1: 64   1:526  
#>  Median :152.0           Median :0.800   2:469   2:134   2:544          
#>  Mean   :149.1           Mean   :1.072           3: 69   3:410          
#>  3rd Qu.:166.0           3rd Qu.:1.800           4: 18                  
#>  Max.   :202.0           Max.   :6.200

Selecting Numeric Variables

col_numeric <- df_heart[, c("age", "trestbps", "chol", "thalach", "oldpeak", "target")]
prop.table(table(col_numeric$target))
#> 
#>         0         1 
#> 0.4868293 0.5131707
  • 0: The proportion of target variable with value 0 is approximately 0.4868, which means around 48.68% of the data points have the target value 0.
  • 1: The proportion of target variable with value 1 is approximately 0.5132, which means around 51.32% of the data points have the target value 1.

Cross Validation

The data is divided into training data (data_train), which accounts for approximately 80% of the entire data, and test data (data_test), which accounts for approximately 20% of the entire data.

# Set the proportion of training data
prop_train <- 0.80

# Create an index to split the data
set.seed(123)  # for reproducibility
train_index <- createDataPartition(col_numeric$target, p = prop_train, list = FALSE)

# Split the data into training and test sets
data_train <- col_numeric[train_index, ]
data_test <- col_numeric[-train_index, ]
prop.table(table(data_train$target))
#> 
#>         0         1 
#> 0.4872107 0.5127893

Splitting Predictors & Target

# Extract the numeric predictors for the training and test sets
data_train_x <- data_train %>% select_if(is.numeric)
data_test_x <- data_test %>% select_if(is.numeric)

# Extract the target labels for the training and test sets
data_train_y <- data_train[, "target"]
data_test_y <- data_test[,"target" ]

Scaling

Scaling is a process of transforming numerical data to a uniform or standardized scale. The purpose of scaling is to ensure that all features in the data have the same range of values, so that no feature dominates the analysis process due to scale differences. By performing scaling, data becomes easier to interpret, and the data analysis models become more stable and efficient in generating accurate predictions.

data_train_x_scaling <- scale(x = data_train_x)
data_test_x_scaling <- scale(x = data_test_x,
                      center = attr(data_train_x_scaling ,"scaled:center"),
                      scale = attr(data_train_x_scaling ,"scaled:scale"))
colnames(data_train_x_scaling)
#> [1] "age"      "trestbps" "chol"     "thalach"  "oldpeak"
# Check that scaling worked: each scaled column should have mean ~0 and sd ~1
sd_age <- sd(data_train_x_scaling[, "age"])
mean_age <- mean(data_train_x_scaling[, "age"])

sd_trestbps <- sd(data_train_x_scaling[, "trestbps"])
mean_trestbps <- mean(data_train_x_scaling[, "trestbps"])

sd_chol <- sd(data_train_x_scaling[, "chol"])
mean_chol <- mean(data_train_x_scaling[, "chol"])

sd_thalach <- sd(data_train_x_scaling[, "thalach"])
mean_thalach <- mean(data_train_x_scaling[, "thalach"])

sd_oldpeak <- sd(data_train_x_scaling[, "oldpeak"])
mean_oldpeak <- mean(data_train_x_scaling[, "oldpeak"])

Finding the Optimum k

# find optimum k
sqrt(nrow(data_train_x_scaling))
#> [1] 28.6531

The target has two classes (0 and 1), so an odd k is preferred to avoid tie votes. With sqrt(n) ≈ 28.65, the rule of thumb suggests k = 27 or 29; note, however, that the model below is fit with k = 7.

unique(data_train$target) 
#> [1] 0 1
#> Levels: 0 1
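Rather than fixing k from the rule of thumb alone, several candidate values can be compared; the rough sketch below scores each k on the held-out test set (a more rigorous approach would tune k with cross-validation on the training data only):

# Compare accuracy for a range of odd k values (rough sketch)
k_values <- seq(3, 29, by = 2)
acc <- sapply(k_values, function(k) {
  pred <- knn(train = data_train_x_scaling,
              test  = data_test_x_scaling,
              cl    = data_train_y,
              k     = k)
  mean(pred == data_test_y)
})
data.frame(k = k_values, accuracy = acc)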

Prediction

knn_pred <- knn(train = data_train_x_scaling,
                 test = data_test_x_scaling,
                 cl = data_train_y,
                 k = 7)
head(knn_pred)
#> [1] 0 1 1 1 1 1
#> Levels: 0 1

Model Evaluation

# confusion matrix
library(caret)
knn_conf <- confusionMatrix(data = knn_pred,
                reference = data_test_y, 
                positive = "1")

knn_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 74 13
#>          1 25 92
#>                                               
#>                Accuracy : 0.8137              
#>                  95% CI : (0.7534, 0.8647)    
#>     No Information Rate : 0.5147              
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.6258              
#>                                               
#>  Mcnemar's Test P-Value : 0.07435             
#>                                               
#>             Sensitivity : 0.8762              
#>             Specificity : 0.7475              
#>          Pos Pred Value : 0.7863              
#>          Neg Pred Value : 0.8506              
#>              Prevalence : 0.5147              
#>          Detection Rate : 0.4510              
#>    Detection Prevalence : 0.5735              
#>       Balanced Accuracy : 0.8118              
#>                                               
#>        'Positive' Class : 1                   
#> 

The confusion matrix shows the performance of the KNN model on this binary classification task with two classes: 0 and 1. In this matrix the rows are the predicted classes and the columns are the actual (reference) classes.

  1. True Negative (TN): 74 (Predicted class 0, Actual class 0)
  2. False Positive (FP): 25 (Predicted class 1, Actual class 0)
  3. False Negative (FN): 13 (Predicted class 0, Actual class 1)
  4. True Positive (TP): 92 (Predicted class 1, Actual class 1)

From the confusion matrix, we can calculate various performance metrics:

  • Accuracy: 0.8137 or 81.37% (overall correct predictions out of total predictions)
  • 95% Confidence Interval (CI): (0.7534, 0.8647)
  • No Information Rate (NIR): 0.5147 or 51.47% (accuracy if always predicting the majority class)
  • Kappa: 0.6258 (agreement between predictions and actual classes beyond chance)
  • McNemar’s Test P-Value: 0.07435 (test for asymmetry between the two types of errors)
  • Sensitivity (True Positive Rate / Recall): 0.8762 or 87.62% (ability to correctly identify positive instances)
  • Specificity (True Negative Rate): 0.7475 or 74.75% (ability to correctly identify negative instances)
  • Positive Predictive Value (Precision): 0.7863 or 78.63% (accuracy of positive predictions)
  • Negative Predictive Value: 0.8506 or 85.06% (accuracy of negative predictions)
  • Prevalence: 0.5147 or 51.47% (proportion of positive instances in the test set)
  • Detection Rate: 0.4510 or 45.10% (proportion of all instances that are correctly predicted positives)
  • Detection Prevalence: 0.5735 or 57.35% (proportion of instances predicted as positive)
  • Balanced Accuracy: 0.8118 or 81.18% (average of sensitivity and specificity)

The evaluation metrics help us understand how well the model performs in predicting both classes and provide insights into its strengths and weaknesses in handling the binary classification task.

Conclusion

In the case of predicting heart disease, high Recall is crucial. High Recall means that the model has a good ability to correctly identify positive cases of heart disease. This means that the model can recognize more patients who actually have heart disease and correctly classify them as positive cases.

In this context, high Recall is highly important because false negative errors (positive cases incorrectly classified as negative) can have serious implications for patients. If a patient who actually has heart disease is classified as negative by the model, they may not receive the necessary care and treatment. By focusing on high Recall, we can minimize these errors and ensure that more patients who require medical attention receive accurate diagnoses and appropriate management.

# Extract values from the logistic regression confusion matrix
# (here the actual labels are in the rows and the predictions in the columns)
TP_glm <- log_conf$table[2, 2]
TN_glm <- log_conf$table[1, 1]
FP_glm <- log_conf$table[1, 2]
FN_glm <- log_conf$table[2, 1]

# Extract values from the KNN confusion matrix
# (here the predictions are in the rows and the actual labels in the columns,
# so the off-diagonal indices are the reverse of the logistic regression case)
TP_knn <- knn_conf$table[2, 2]
TN_knn <- knn_conf$table[1, 1]
FP_knn <- knn_conf$table[2, 1]
FN_knn <- knn_conf$table[1, 2]


# Calculate recall/sensitivity
Recall_glm <- TP_glm / (TP_glm + FN_glm)
Recall_knn <- TP_knn / (TP_knn + FN_knn)


comparison <- cbind.data.frame(Recall_glm, Recall_knn)

comparison

Based on the recall values for logistic regression (Recall_glm ≈ 0.93) and KNN (Recall_knn ≈ 0.88, matching the Sensitivity reported by caret for the KNN model), the logistic regression model has a higher recall than the KNN model. Recall measures the proportion of actual positive cases that are correctly identified as positive by the model, so the higher recall of the logistic regression model indicates better performance in catching positive cases of heart disease.