1 Summary

In building a machine learning model to predict whether a patient has heart disease, this research follows several steps: data preprocessing, data wrangling, exploratory data analysis, model building, assumption testing, model comparison and evaluation, and conclusion. In the first round of analysis, the model model_logistic_all is suitable for making predictions because it has the lowest RMSE and a lower AIC than model_logistic_fs. However, the model model_logistic_step is appropriate for examining the significance of the predictors. Next, the logistic regression model (model_logistic_all) is better than the KNN model (knn_pred), due to its higher sensitivity score. Finally, future research is recommended to target higher sensitivity and accuracy scores (above 90%), since patients' lives are at stake.


2 Preface

2.1 Background

The top three global causes of death, in terms of the number of lives lost, were cardiovascular conditions (ischaemic heart disease, stroke), respiratory conditions (chronic obstructive pulmonary disease, lower respiratory infections), and neonatal conditions, which include birth asphyxia and birth trauma, neonatal sepsis and infections, and preterm birth complications.

This research explores heart disease by analyzing the dataset and building a predictive model. The aim is to raise awareness of healthy lifestyles and to provide an ML model that can flag people who potentially have heart disease, so that they can seek prevention or treatment as early as possible.


2.2 Research Objectives

This research aims to provide several insights that can benefit readers, especially society and hospitals. It takes a deep dive into heart disease by analyzing the dataset. The research objectives are:
- Analyzing the medical history background of those with heart disease and without heart disease.
- Analyzing the numerical variable relationship.
- Building a model to predict whether a patient has heart disease.


2.3 Data Source

[Heart Disease Dataset](https://www.kaggle.com/johnsmith88/heart-disease-dataset/)


2.4 Data Description

No. Feature Description Value
1. age Patient’s age in years 29-77
2. sex Patient’s gender (1)Male (0)Female
3. cp Chest pain type (0)Typical angina - TA (1)Atypical angina - ATA (2)Non-anginal pain - NAP (3)Asymptomatic - ASY.
4. trestbps Resting blood pressure (in mm Hg) 94-200
5. chol Serum cholesterol in mg/dl 126-564
6. fbs Fasting blood sugar > 120 mg/dl (1)True (0)False
7. restecg Resting electrocardiographic results (0)Normal (1)ST-T wave abnormality (2)Showing probable or definite left ventricular hypertrophy by Estes’ criteria
8. thalach Maximum heart rate achieved 71-202
9. exang Exercise induced angina (1)Yes (0)No
10. oldpeak ST depression induced by exercise relative to rest 0-6.2
11. slope The slope of the peak exercise ST segment (1)Upsloping (2)Flat (3)Downsloping
12. ca Number of major vessels (0-3) colored by fluoroscopy 0,1,2,3
13. thal Thalassemia (3)Normal (6)Fixed defect (no blood flow in some parts of the heart) (7)Reversible defect (blood flow is observed but it is not normal)
14. target Diagnosis of heart disease (0)Heart disease not present (1)Heart disease present

2.5 List Packages

# data cleaning
library(readr) 
library(tidyverse)
library(dplyr)

#data analysis
library(GGally)

# data visualization
library(ggplot2) 
library(scales)
library(echarts4r)

# Confusion Matrix
library(caret)

# knn
library(class)

#ROC 
library(pROC)

#Glow
library(ggfx)

3 Data Preprocessing

3.1 Reading & Extracting Data


This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments refer to a subset of 14 of them. The “target” field refers to the presence of heart disease in the patient; it is integer valued, with 0 = no disease and 1 = disease. The data set (“kaggle_4city.csv”) is loaded and assigned to the heart_disease object, so that it is ready for the data wrangling process.

heart_disease <- read.csv("data_input/kaggle_4city.csv")
heart_disease

4 Data Wrangling

4.1 Data Inspection


This step inspects the top 6 and the bottom 6 observations of the data set using the head() and tail() functions, so that the data set's content can be recognized at a glance.

# Top 6 data
head(heart_disease)
# Bottom 6 data
tail(heart_disease)

4.2 Change The Data Type


Perform data type inspection to ensure the data type of each column is appropriate, using the glimpse() function from the dplyr package. Several variables are then converted to factors: sex, cp, fbs, restecg, exang, slope, ca, thal, and target.

# Change Data Type to Factor

# df for model
heart_disease <- heart_disease %>% 
  mutate_at(vars(sex,cp,fbs,restecg,exang,slope,ca,thal,target),as.factor)

# for visualization & ggcorr
heart <- heart_disease %>% 
  mutate(target=as.integer(target))

glimpse(heart_disease)
#> Rows: 1,025
#> Columns: 14
#> $ age      <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex      <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol     <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs      <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg  <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach  <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang    <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak  <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope    <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca       <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal     <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target   <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…


Notes: target is kept as an integer (in the heart dataframe) for correlation analysis with ggcorr() and for visualization purposes, and as a factor (in heart_disease) for machine learning modelling.


4.3 Check Missing Values


In this step, missing values are checked so that any missing value treatment can be applied and the data is ready for analysis.

anyNA(heart_disease)
#> [1] FALSE
colSums(is.na(heart_disease))
#>      age      sex       cp trestbps     chol      fbs  restecg  thalach 
#>        0        0        0        0        0        0        0        0 
#>    exang  oldpeak    slope       ca     thal   target 
#>        0        0        0        0        0        0

It shows that there are no missing values in the heart_disease dataframe.


4.4 Check Duplicate Data


Checking whether any rows contain identical values, i.e. duplicate observations.

heart_disease %>% 
  duplicated() %>% 
  sum()
#> [1] 723
# Containing duplicated data
heart_disease[duplicated(heart_disease),] 

Insight: It is possible for different patients (observations) to legitimately share identical values, so the duplicated rows are kept and no changes are made to this dataframe.


5 EDA - Exploratory Data Analysis

5.1 Data Status


Inspecting the five-number summary plus the mean in order to get insight into the data distribution.

summary(heart_disease)
#>       age        sex     cp         trestbps          chol     fbs     restecg
#>  Min.   :29.00   0:312   0:497   Min.   : 94.0   Min.   :126   0:872   0:497  
#>  1st Qu.:48.00   1:713   1:167   1st Qu.:120.0   1st Qu.:211   1:153   1:513  
#>  Median :56.00           2:284   Median :130.0   Median :240           2: 15  
#>  Mean   :54.43           3: 77   Mean   :131.6   Mean   :246                  
#>  3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:275                  
#>  Max.   :77.00                   Max.   :200.0   Max.   :564                  
#>     thalach      exang      oldpeak      slope   ca      thal    target 
#>  Min.   : 71.0   0:680   Min.   :0.000   0: 74   0:578   0:  7   0:499  
#>  1st Qu.:132.0   1:345   1st Qu.:0.000   1:482   1:226   1: 64   1:526  
#>  Median :152.0           Median :0.800   2:469   2:134   2:544          
#>  Mean   :149.1           Mean   :1.072           3: 69   3:410          
#>  3rd Qu.:166.0           3rd Qu.:1.800           4: 18                  
#>  Max.   :202.0           Max.   :6.200

Insight:

  • The minimum age is 29 years, the maximum age is 77 years, and the mean age is around 54.43 years.
  • There are 713 male observations and 312 female observations.
  • 50% of the trestbps (resting blood pressure) data lies in the range 120-140 mm Hg, and the mean is 131.6 mm Hg.
  • 50% of the chol (cholesterol) data lies in the range 211-275 mg/dl, and the mean is 246 mg/dl.
  • 50% of the thalach (maximum heart rate achieved) data lies in the range 132-166; the maximum heart rate is 202 and the minimum is 71.
  • 50% of the oldpeak (ST depression induced by exercise relative to rest) data lies in the range 0.0-1.8, and the mean is 1.072.

5.2 Check Imbalanced Data Set


Checking whether the data set is imbalanced on the target variable.

prop.table(table(heart_disease$target))
#> 
#>         0         1 
#> 0.4868293 0.5131707


As a rule of thumb, a data set can be considered balanced when the class proportions fall within roughly a 60%:40% range. Here the split is about 48.7%:51.3%, so the data can be concluded to be balanced.


5.3 Correlation


Using the ggcorr() function from the GGally package to assess the strength of correlation among the numeric variables.

ggcorr(heart,label = T,label_round = 2, nbreaks = 4)

Insight:
- Positive correlation

  • The thalach variable has a moderate positive correlation with the target variable.

  • The trestbps, chol, and age variables have a weak positive correlation with the oldpeak variable.

- Negative correlation

  • The age, trestbps, and chol variables have a weak negative correlation with the target variable.
  • The oldpeak variable has a moderate negative correlation with the target variable.
  • The chol and trestbps variables have a weak negative correlation with the thalach variable.

5.4 Patient Background Analysis

5.4.1 Categoric Variables


The categorical variables are analyzed using visualizations built with the ggplot() function from the ggplot2 package.
- Sex

agg2 <- heart_disease %>% 
  group_by(target, sex) %>% 
  summarise(Tot=n())

ggplot(agg2,aes(x=target,y=Tot)) + geom_col(aes(fill=sex),color="red") + facet_wrap(facets = "sex") + labs(title = "Total Patient Based on Sex & Disease", x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -10)+theme_dark() 


Insight:
- Most female patients have heart disease, with a total of 226 patients.
- Most male patients have no heart disease, with a total of 413 patients.
- The largest group of patients with heart disease is male, with a total of 300 patients.

  • Chest Pain
agg1 <- heart_disease %>% 
  group_by(target, cp) %>% 
  summarise(Tot=n())

ggplot(agg1,aes(x=target,y=Tot)) + geom_col(aes(fill=cp),color="blue") + facet_wrap(facets = "cp") + labs(title = "Total Patient Based on Chest Pain & Disease", x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -12) +theme_dark()


Insight:
- Most patients with heart disease have chest pain type 1 (atypical angina) or type 2 (non-anginal pain - NAP).
- Most patients without heart disease have chest pain type 0 (typical angina).
- For chest pain type 3 (asymptomatic), more patients have heart disease (51 patients) than do not.

  • Fasting blood sugar > 120 mg/dl
agg3 <- heart_disease %>% 
  group_by(target, fbs) %>% 
  summarise(Tot=n())

ggplot(agg3,aes(x=target,y=Tot)) + geom_col(aes(fill=fbs),color="blue") + facet_wrap(facets = "fbs") + labs(title = "Total Patient Based on Fasting Blood Sugar", x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -12) +theme_dark()


Insight:
- Most patients, with or without heart disease, have fasting blood sugar < 120 mg/dl.
- Among patients with fasting blood sugar > 120 mg/dl, more are without heart disease than with heart disease.

  • Resting electrocardiographic results
agg4 <- heart_disease %>% 
  group_by(target, restecg) %>% 
  summarise(Tot=n())

ggplot(agg4,aes(x=target,y=Tot)) + geom_col(aes(fill=restecg),color="red") + facet_wrap(facets = "restecg") + labs(title = "Total Patient Based on Resting Electrocardiographic", x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -5) +theme_dark()


Insight:
- Most patients with heart disease have resting electrocardiographic result level 1.
- Most patients without heart disease have a normal result.
- Only 3 patients with heart disease and 12 patients without heart disease show probable or definite left ventricular hypertrophy by Estes’ criteria.

  • Exercise induced angina
agg5 <- heart_disease %>% 
  group_by(target, exang) %>% 
  summarise(Tot=n())

ggplot(agg5,aes(x=target,y=Tot)) + geom_col(aes(fill=exang),color="blue") + facet_wrap(facets = "exang") + labs(title = "Total Patient Based on Exercise Induced Angina", x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -12) +theme_dark()


Insight:
- Most patients with heart disease do not have exercise-induced angina.
- Among patients without heart disease, those with exercise-induced angina slightly outnumber those without it.

  • The slope of the peak exercise ST segment
agg6 <- heart_disease %>% 
  group_by(target, slope) %>% 
  summarise(Tot=n())

ggplot(agg6,aes(x=target,y=Tot)) + geom_col(aes(fill=slope),color="red") + facet_wrap(facets = "slope") + labs(title = "Total Patient Based on Slope", x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -12) +theme_dark()


Insight:
- Most patients with heart disease have slope level 2, followed by level 1 and then level 0.
- Most patients without heart disease have slope level 1, followed by level 2 and then level 0.

  • Number of major vessels (0-3) colored by fluoroscopy
agg7 <- heart_disease %>% 
  group_by(target, ca) %>% 
  summarise(Tot=n())

ggplot(agg7,aes(x=target,y=Tot)) + geom_col(aes(fill=ca),color="blue") + facet_wrap(facets = "ca") + labs(title = "Total Patient", subtitle = "Based on Number of Major Vessels (0-3) Colored by Fluoroscopy",x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -13) +theme_dark()


Insight:
- Most patients with heart disease have ca level 0.
- Most patients without heart disease have ca level 0 or level 1.

  • Thalassemia
agg8 <- heart_disease %>% 
  group_by(target, thal) %>% 
  summarise(Tot=n())

ggplot(agg8,aes(x=target,y=Tot)) + geom_col(aes(fill=thal),color="red") + facet_wrap(facets = "thal") + labs(title = "Total Patient Based on Thalassemia",x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -16) + theme_dark()


Insight:
- Most patients with heart disease have thalassemia level 2.
- Most patients without heart disease have thalassemia level 3.


5.4.2 Numeric Variables


The numeric variables are analyzed using interactive visualizations built with the echarts4r package.
- Heart Disease by Age

heart_age <- heart_disease %>%
  mutate(HeartDisease = case_when(
    target == 0 ~ "No",
    target == 1 ~ "Yes")) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  group_by(HeartDisease, age) %>% 
  summarize(count = n(), .groups = 'drop') %>% 
  pivot_wider(names_from = HeartDisease, values_from = count) %>% 
  mutate_if(is.integer, ~replace(., is.na(.), 0)) %>% 
  arrange(Yes)

linear_gradient <- htmlwidgets::JS(
  "new echarts.graphic.LinearGradient(
    0, 0, 0, 1,
    [
      { offset: 0, color: '#B83F3E' },
      { offset: 1, color: '#FB6A63' }
    ])"
)

linear_gradient2 <- htmlwidgets::JS(
  "new echarts.graphic.LinearGradient(
    0, 0, 0, 1,
    [
      { offset: 0, color: '#E08F83' },
      { offset: 1, color: '#FFE5DB' }
    ])"
)

heart_age %>% 
  e_charts(age) %>% 
    e_line(Yes,
         smooth = T,
         itemStyle = list(
           opacity = 0.5,
           color = linear_gradient),
         areaStyle = list(
           opacity = 1,
           color = linear_gradient),
         emphasis = list(
           focus = "series")) %>% 
    e_line(No,
           smooth = T,
           itemStyle = list(
             opacity = 0.5,
             color = linear_gradient2),
           areaStyle = list(
             opacity = 1,
             color = linear_gradient2),
           emphasis = list(
             focus = "series")) %>%
    e_x_axis(name = "Age",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             ),
           min = 27) %>%
    e_y_axis(name = "Freq",
             nameTextStyle = list(
               color = "White",
               fontFamily = "Cardo",
               fontSize = 14,
               fontWeight = "bold"
               )) %>% 
    e_title(text = "Heart Disease by Age",
            textStyle = list(
              color = "#DFC18F",
              fontStyle = "italic"
            ),
            right = "10%",
            left = "50%") %>% 
    e_tooltip(axis = "trigger",
              showDelay = 0,
              axisPointer = list(
                type = "cross",
                label = list(
                  backgroundColor = "#333D47",
                  fontFamily = "Cardo",
                  color = "#DFC18F"
                ),
                crossStyle = list(
                  color = "#DFC18F"
                )),
                formatter = htmlwidgets::JS("
                                          function(params)
                                          {
                                              return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
                                                      <br/><strong style='color:black;'>Age:</strong> ${params.value[0]}
                                                      <br/><strong style='color:black;'>Count:</strong> ${params.value[1]}`
                                          }  ")) %>% e_theme("dark")  %>% 
    e_legend(
      orient = "horizontal",
      right = "10%",
      left = "45%",
      bottom = "2%",
      textStyle = list(
        color = "#DFC18F"
      )
    ) 


Insight:
- Patients without heart disease tend to be older.
- Patients with heart disease are spread across ages 29 to 76.
- The highest count of heart disease is recorded at age 54.


  • Age VS Resting blood pressure (trestbps)
max <- list(
  name = "Max",
  type = "max"
)

min <- list(
  name = "Min",
  type = "min"
)

avg <- list(
  type = "average",
  name = "AVG"
)

heart_trestbps <- heart_disease %>% 
  mutate(HeartDisease = case_when(
    target == 0 ~ "No",
    target == 1 ~ "Yes")) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  select(trestbps, age, HeartDisease) %>%
  group_by(HeartDisease) %>% 
  arrange(HeartDisease,age)



heart_trestbps %>% 
  e_charts(age) %>% 
  e_scatter(trestbps, HeartDisease,
            emphasis = list(
              focus = "series"
            )) %>%
  e_x_axis(name = "Age",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             ),
           min = 27) %>% 
  e_y_axis(name = "RBP",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             )) %>%
  e_mark_area(data = list(
                          list(xAxis = "min", yAxis = "min"), 
                          list(xAxis = "max", yAxis = "max")
                          ),
              itemStyle = list(
                color = "transparent",
                borderWidth = 1,
                borderType = "dashed")) %>% 
  e_mark_line(data = avg,
              lineStyle = list(
                type = "solid"
              )) %>% 
  e_title(text = "Heart Disease: Resting Blood Pressure vs. Age",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%") %>%   
  e_tooltip(axis = "trigger",
            showDelay = 0,
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#333D47",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = "#DFC18F"
              )),
              formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
                                                    <br/><strong style='color:Black;'>Age:</strong> ${params.value[0]}
                                                    <br/><strong style='color:Black;'>RestingBP:</strong> ${params.value[1]}mm/hg`
                                        }  ")) %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "45%",
    bottom = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  ) %>% e_theme("dark")


Insight:
- Patients without heart disease tend to have higher RestingBP as they get older.
- For patients with heart disease, most observations are distributed between 120-140 mm Hg.
- The mean resting blood pressure (trestbps) is 129.25 mm Hg for patients with heart disease.
- The mean resting blood pressure (trestbps) is 134.11 mm Hg for patients without heart disease.
- No implausible RestingBP values (such as 0 mm Hg) appear; the minimum recorded value is 94 mm Hg.


  • Age VS Cholesterol (chol)
heart_ch <- heart_disease %>% 
  mutate(HeartDisease = case_when(
    target == 0 ~ "No",
    target == 1 ~ "Yes")) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  select(chol, age, HeartDisease) %>% 
  group_by(HeartDisease) 

heart_ch %>% 
  e_charts(age) %>% 
  e_scatter(chol, HeartDisease,
            emphasis = list(
              focus = "series"
            )) %>%
  e_x_axis(name = "Age",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             ),
           min = 27) %>% 
  e_y_axis(name = "Cholesterol (mg/dl)",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             )) %>%
  e_mark_area(data = list(
                          list(xAxis = "min", yAxis = "min"), 
                          list(xAxis = "max", yAxis = "max")
                          ),
              itemStyle = list(
                color = "transparent",
                borderWidth = 1,
                borderType = "dashed")) %>% 
  e_mark_line(data = avg,
              lineStyle = list(
                type = "solid"
              )) %>% 
  e_title(text = "Heart Disease: Cholesterol vs. Age",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%") %>%   
  e_tooltip(axis = "trigger",
            showDelay = 0,
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#333D47",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = "#DFC18F"
              )),
              formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
                                                    <br/><strong style='color:Black;'>Age:</strong> ${params.value[0]}
                                                    <br/><strong style='color:Black;'>Cholesterol:</strong> ${params.value[1]}mg/dL`
                                        }  ")) %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "45%",
    bottom = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  ) %>% e_theme("dark")


Insight:
- Patients with heart disease potentially have higher cholesterol as they get older.
- The middle 50% of the cholesterol distribution is higher for patients without heart disease than for patients with heart disease.
- The mean cholesterol is also higher for patients without heart disease.


  • Age VS Maximum heart rate achieved (thalach)
heart_thalach <- heart_disease %>% 
  mutate(HeartDisease = case_when(
    target == 0 ~ "No",
    target == 1 ~ "Yes")) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>%
  select(thalach, age, HeartDisease) %>% 
  group_by(HeartDisease)

heart_thalach %>% 
  e_charts(age) %>% 
  e_scatter(thalach, HeartDisease,
            emphasis = list(
              focus = "series"
            )) %>%
  e_x_axis(name = "Age",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             ),
           min = 27) %>% 
  e_y_axis(name = "Max Heart Rate",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             )) %>%
  e_mark_area(data = list(
                          list(xAxis = "min", yAxis = "min"), 
                          list(xAxis = "max", yAxis = "max")
                          ),
              itemStyle = list(
                color = "transparent",
                borderWidth = 1,
                borderType = "dashed")) %>% 
  e_mark_line(data = avg,
              lineStyle = list(
                type = "solid"
              )) %>% 
  e_title(text = "Heart Disease: Max Heart Rate vs. Age",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%") %>%   
  e_tooltip(axis = "trigger",
            showDelay = 0,
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#333D47",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = "#DFC18F"
              )),
              formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
                                                    <br/><strong style='color:Black;'>Age:</strong> ${params.value[0]}
                                                    <br/><strong style='color:Black;'>MaxHR:</strong> ${params.value[1]}`
                                        }  ")) %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "45%",
    bottom = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  ) %>% e_theme("dark")


Insight:
- Patients with heart disease tend to have a lower max heart rate than other patients as they get older.


  • Age VS ST depression induced by exercise relative to rest (oldpeak)
heart_oldpeak <- heart_disease %>% 
  mutate(HeartDisease = case_when(
    target == 0 ~ "No",
    target == 1 ~ "Yes")) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  select(oldpeak, age, HeartDisease) %>%
  group_by(HeartDisease)

heart_oldpeak %>% 
  e_charts(age) %>% 
  e_scatter(oldpeak, HeartDisease,
            emphasis = list(
              focus = "series"
            )) %>%
  e_x_axis(name = "Age",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             ),
           min = 27) %>% 
  e_y_axis(name = "Oldpeak",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             ),
           min = -3,
           max = 7) %>%
  e_mark_area(data = list(
                          list(xAxis = "min", yAxis = "min"), 
                          list(xAxis = "max", yAxis = "max")
                          ),
              itemStyle = list(
                color = "transparent",
                borderWidth = 1,
                borderType = "dashed")) %>% 
  e_mark_line(data = avg,
              lineStyle = list(
                type = "solid"
              )) %>% 
  e_title(text = "Heart Disease: Oldpeak vs. Age",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%") %>%   
  e_tooltip(axis = "trigger",
            showDelay = 0,
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#333D47",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = "#DFC18F"
              )),
              formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
                                                    <br/><strong style='color:Black;'>Age:</strong> ${params.value[0]}
                                                    <br/><strong style='color:Black;'>Oldpeak:</strong> ${params.value[1]}`
                                        }  ")) %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "45%",
    bottom = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  ) %>% e_theme("dark")


Insight:
- Patients without heart disease tend to have higher oldpeak values.
- For patients with heart disease, many observations have oldpeak values of 0, which appear as outliers relative to the rest of that group.


6 Logistic Regression

6.1 Explanation


Logistic regression is used to describe the data and the relationship between one dependent variable (target) and one or more independent variables (predictors). The independent variables can be numeric, nominal categorical, ordinal categorical, or interval type, while the dependent variable (target) must be categorical.
This is the equation of a logistic regression model:
\[ \log \frac{p(X)}{1-p(X)} = \beta_0 + \beta_1 X \]
The left-hand side is called the log-odds or logit. On the right-hand side, beta_0 is the model intercept and beta_1 is the coefficient of feature X.
The purpose of logistic regression is to predict a probability using a linear model (which can then be used for classification). The raw output of logistic regression is the log of odds, which can be converted to a probability using inv.logit() from the gtools package, as illustrated below.
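
As a minimal illustration (using a made-up log-odds value, not output from the models built later in this research), the conversion from log-odds to probability looks like this:

# convert a log-odds value to a probability (illustrative value only)
library(gtools)

log_odds <- 1.5          # hypothetical log-odds returned by a logistic model
inv.logit(log_odds)      # 1 / (1 + exp(-1.5)) ~ 0.82

# base R's plogis() gives the same result without an extra package
plogis(log_odds)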


6.2 Cross Validation


In the cross-validation step, the data is divided into train data (80% of the original data) and test data (20% of the original data), the latter serving as unseen data.

# Set seed to lock the random
RNGkind(sample.kind = "Rounding")
set.seed(123)

# index sampling
index <- sample(x = nrow(heart_disease), size = nrow(heart_disease)*0.8)

# splitting 
heart_disease_train <- heart_disease[index,]
heart_disease_test <- heart_disease[-index,]

6.3 Imbalance Treatment


Re-checking the class balance of the target variable.

prop.table(table(heart_disease$target))
#> 
#>         0         1 
#> 0.4868293 0.5131707

After checking the target variable, there is no imbalance (following the rule of thumb that proportions within a 60%:40% range are acceptable). It can therefore be concluded that no treatment such as up-sampling, down-sampling, or SMOTE is needed. If such a treatment were required, it would be conducted after cross validation and only on the train data, because the test data must remain unseen.
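
For reference only, a hedged sketch of how such a treatment could look with caret (not run in this research, since the classes are already balanced):

# illustrative only: down-sample the majority class on the train data
set.seed(123)
balanced_train <- downSample(
  x = heart_disease_train %>% select(-target),  # predictors only
  y = heart_disease_train$target,               # factor target
  yname = "target"                              # keep the original column name
)
prop.table(table(balanced_train$target))        # class proportions become 50:50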

6.4 Modelling


- 1. All predictors. The model model_logistic_all involves all predictors.

model_logistic_all <- glm(formula = target~.,
                      data = heart_disease_train,
                      family = "binomial")

summary(model_logistic_all)
#> 
#> Call:
#> glm(formula = target ~ ., family = "binomial", data = heart_disease_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.9216  -0.2565   0.0834   0.4147   3.2579  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -1.459568   2.182499  -0.669 0.503648    
#> age          0.033384   0.015994   2.087 0.036862 *  
#> sex1        -2.027022   0.355146  -5.708 1.15e-08 ***
#> cp1          0.671228   0.352382   1.905 0.056802 .  
#> cp2          2.116855   0.329346   6.427 1.30e-10 ***
#> cp3          2.003688   0.437533   4.580 4.66e-06 ***
#> trestbps    -0.024883   0.007631  -3.261 0.001111 ** 
#> chol        -0.003915   0.002650  -1.477 0.139644    
#> fbs1         0.679642   0.384122   1.769 0.076837 .  
#> restecg1     0.426159   0.247949   1.719 0.085663 .  
#> restecg2    -0.682235   1.727780  -0.395 0.692945    
#> thalach      0.027101   0.007383   3.671 0.000242 ***
#> exang1      -0.877426   0.284421  -3.085 0.002036 ** 
#> oldpeak     -0.471277   0.156134  -3.018 0.002541 ** 
#> slope1      -0.530315   0.569302  -0.932 0.351585    
#> slope2       0.766945   0.602481   1.273 0.203026    
#> ca1         -2.492669   0.339801  -7.336 2.21e-13 ***
#> ca2         -3.950260   0.543148  -7.273 3.52e-13 ***
#> ca3         -2.741024   0.631727  -4.339 1.43e-05 ***
#> ca4          0.585698   1.227425   0.477 0.633236    
#> thal1        3.279136   1.522859   2.153 0.031297 *  
#> thal2        2.380597   1.449903   1.642 0.100611    
#> thal3        0.985519   1.452798   0.678 0.497544    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1135.00  on 819  degrees of freedom
#> Residual deviance:  465.66  on 797  degrees of freedom
#> AIC: 511.66
#> 
#> Number of Fisher Scoring iterations: 6
library(car)
# vif(model_logistic_all)


Insight:
- The AIC score of model_logistic_all is 511.66. AIC estimates the relative information lost by a model; lower is better (see the formula below).
- The predictor variables that are significant in model_logistic_all are age, sex, cp, trestbps, thalach, exang, oldpeak, ca, and thal.
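
For reference, AIC is computed from the number of estimated parameters k and the maximized likelihood of the model; a lower AIC indicates less information loss:

\[ AIC = 2k - 2\ln(\hat{L}) \]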

  • 2. Feature Selection.
    In feature selection, both a business perspective and a statistical perspective are used to determine the predictor variables for model_logistic_fs.
model_logistic_fs <- glm(formula = target~ age + sex + cp + restecg + trestbps + oldpeak + thalach + ca + exang,
                      data = heart_disease_train,
                      family = "binomial")

summary(model_logistic_fs)
#> 
#> Call:
#> glm(formula = target ~ age + sex + cp + restecg + trestbps + 
#>     oldpeak + thalach + ca + exang, family = "binomial", data = heart_disease_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.5349  -0.3613   0.1084   0.4907   2.7215  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -1.040850   1.511124  -0.689 0.490954    
#> age          0.021410   0.014800   1.447 0.148009    
#> sex1        -1.899376   0.277665  -6.841 7.89e-12 ***
#> cp1          1.098357   0.329627   3.332 0.000862 ***
#> cp2          2.114731   0.292065   7.241 4.47e-13 ***
#> cp3          1.708673   0.396645   4.308 1.65e-05 ***
#> restecg1     0.422447   0.227058   1.861 0.062811 .  
#> restecg2    -0.382087   1.545124  -0.247 0.804687    
#> trestbps    -0.018547   0.006935  -2.674 0.007487 ** 
#> oldpeak     -0.662163   0.126862  -5.220 1.79e-07 ***
#> thalach      0.030077   0.006466   4.652 3.29e-06 ***
#> ca1         -2.059008   0.295008  -6.979 2.96e-12 ***
#> ca2         -3.046294   0.442163  -6.890 5.60e-12 ***
#> ca3         -2.287338   0.519786  -4.401 1.08e-05 ***
#> ca4         -0.107496   1.041383  -0.103 0.917785    
#> exang1      -1.123481   0.260860  -4.307 1.66e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1135.0  on 819  degrees of freedom
#> Residual deviance:  535.6  on 804  degrees of freedom
#> AIC: 567.6
#> 
#> Number of Fisher Scoring iterations: 6
# vif(model_logistic_fs)
# exp(model_logistic_fs$coefficient)


Insight:
- The AIC score of model_logistic_fs is 567.6, higher than that of model_logistic_all.
- The predictor variables that are significant in model_logistic_fs are sex, cp, trestbps, oldpeak, thalach, exang, and ca.

6.5 Fitting


Using the stepwise regression method with backward elimination to find a suitable model with the lowest AIC.

model_logistic_step <- step(object = model_logistic_all,direction = "backward",trace = F)
summary(model_logistic_step)
#> 
#> Call:
#> glm(formula = target ~ age + sex + cp + trestbps + chol + fbs + 
#>     thalach + exang + oldpeak + slope + ca + thal, family = "binomial", 
#>     data = heart_disease_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.8665  -0.2657   0.0883   0.4322   3.3248  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -0.837560   2.242458  -0.374 0.708776    
#> age          0.030826   0.015822   1.948 0.051376 .  
#> sex1        -2.023557   0.353388  -5.726 1.03e-08 ***
#> cp1          0.691878   0.352217   1.964 0.049489 *  
#> cp2          2.131196   0.329865   6.461 1.04e-10 ***
#> cp3          1.983810   0.435012   4.560 5.11e-06 ***
#> trestbps    -0.025880   0.007517  -3.443 0.000575 ***
#> chol        -0.004489   0.002610  -1.720 0.085435 .  
#> fbs1         0.680454   0.381298   1.785 0.074331 .  
#> thalach      0.026522   0.007392   3.588 0.000333 ***
#> exang1      -0.848557   0.284039  -2.987 0.002813 ** 
#> oldpeak     -0.468524   0.152622  -3.070 0.002142 ** 
#> slope1      -0.522503   0.552676  -0.945 0.344452    
#> slope2       0.812816   0.586968   1.385 0.166123    
#> ca1         -2.528997   0.338454  -7.472 7.89e-14 ***
#> ca2         -3.865166   0.532280  -7.262 3.83e-13 ***
#> ca3         -2.869306   0.643210  -4.461 8.16e-06 ***
#> ca4          0.499374   1.166787   0.428 0.668658    
#> thal1        3.342386   1.651537   2.024 0.042990 *  
#> thal2        2.414516   1.585634   1.523 0.127822    
#> thal3        1.067257   1.587569   0.672 0.501419    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1135.0  on 819  degrees of freedom
#> Residual deviance:  468.9  on 799  degrees of freedom
#> AIC: 510.9
#> 
#> Number of Fisher Scoring iterations: 6


Insight:
- The AIC score of model_logistic_step is 510.9, the lowest of the three models.
- The predictor variables that are not significant in model_logistic_step are age, chol, fbs, and slope.

6.6 Predicting


First, a model performance comparison is conducted to select the best model for prediction.

# Model Performance Comparison 
library(performance)
comparison <- compare_performance(model_logistic_all,model_logistic_fs, model_logistic_step)

as.data.frame(comparison)


Insight:
The best logistic regression model is as follows:
- Based on the lowest RMSE: model_logistic_all
- Based on the lowest AIC: model_logistic_step


Second, the prediction is conducted with the model model_logistic_all.

# Test Data Prediction

heart_disease_test$PredRisk_HeartDisease <- predict(
  object = model_logistic_all,
  newdata = heart_disease_test, 
  type = "response" # for probability results
)

heart_disease_test$Pred_Label_HeartDisease <- 
  ifelse(heart_disease_test$PredRisk_HeartDisease < 0.5, 0, 1) %>% 
  as.factor()

Pred_Result <- heart_disease_test %>% 
  select(target,PredRisk_HeartDisease,Pred_Label_HeartDisease) 

colnames(Pred_Result) <- c("Actual","Prob_Pred","Result_Pred")
Pred_Result
# Train Data Prediction
hd_train <- heart_disease_train
hd_train$PredRisk_HeartDisease <- predict(
  object = model_logistic_all,
  newdata = hd_train, 
  type = "response" # for probability results
)

hd_train$Pred_Label_HeartDisease <- 
  ifelse(hd_train$PredRisk_HeartDisease < 0.5, 0, 1) %>% 
  as.factor()

Pred_Result_tr <- hd_train %>% 
  select(target,PredRisk_HeartDisease,Pred_Label_HeartDisease) 

colnames(Pred_Result_tr) <- c("Actual","Prob_Pred","Result_Pred")
Pred_Result_tr

6.7 Model Assumption Test


In this section, it is important to test for multicollinearity among the predictors in order to meet the assumptions of logistic regression.
- 1. Multicollinearity.
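The variance inflation factor computed below is based on R_j^2, the coefficient of determination obtained by regressing predictor j on all the other predictors; values near 1 indicate no multicollinearity, while values above roughly 5-10 are usually treated as problematic:

\[ VIF_j = \frac{1}{1 - R_j^2} \]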

vif(model_logistic_all)
#>              GVIF Df GVIF^(1/(2*Df))
#> age      1.503591  1        1.226210
#> sex      1.694283  1        1.301646
#> cp       1.972136  3        1.119840
#> trestbps 1.347567  1        1.160848
#> chol     1.202662  1        1.096660
#> fbs      1.238444  1        1.112854
#> restecg  1.151628  2        1.035924
#> thalach  1.538287  1        1.240277
#> exang    1.183068  1        1.087689
#> oldpeak  1.578693  1        1.256461
#> slope    1.855658  2        1.167144
#> ca       2.193381  4        1.103162
#> thal     1.604339  3        1.081972
  • 2. Independence of Observation.
ggcorr(heart_disease_train, label = T)


Insight:
- All (G)VIF values in the output above are close to 1 and well below the common threshold of 5, so there is no indication of multicollinearity among the predictors.
- There is no strong correlation (close to +1 or -1) among the predictor variables, so the independence-of-observation assumption is considered met.

7 K-NN

7.1 Explanation


The k-nearest neighbors algorithm is a non-parametric, supervised learning classifier that uses proximity to classify an individual data point. K-NN works on the assumption that similar points can be found near one another. In other words, k-nearest neighbors classifies new data by comparing the characteristics of the new data (test data) with the existing data (train data). The proximity of these characteristics is measured with the Euclidean distance. The k nearest neighbors of the new data are then selected, and the class is determined by majority voting, as illustrated in the sketch after the distance equation below.


This is the Euclidean distance equation used by the K-NN model:

\[ d(x,y) = \sqrt{\sum_i (x_i - y_i)^2} \]
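
To make these mechanics concrete, here is a small hand-rolled sketch on toy data (not the heart disease data): compute the Euclidean distance from one new observation to every training observation, take the k closest, and classify by majority vote.

# toy illustration of the k-NN mechanics (hypothetical data)
toy_train_x <- matrix(c(1, 1,
                        2, 1,
                        3, 2,
                        8, 8,
                        9, 9), ncol = 2, byrow = TRUE)
toy_train_y <- factor(c("0", "0", "0", "1", "1"))
new_point   <- c(2.5, 1.5)

# Euclidean distance from the new point to every training point
dist_to_new <- apply(toy_train_x, 1, function(p) sqrt(sum((p - new_point)^2)))

# classes of the k = 3 nearest neighbours, then a majority vote
k <- 3
nearest_class <- toy_train_y[order(dist_to_new)[1:k]]
table(nearest_class)   # class "0" gets all 3 votes, so the new point is labelled "0"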

7.2 Cross Validation


In the cross-validation step, the data is divided into train data (80% of the original data) and test data (20% of the original data), the latter as unseen data. The train and test data are then split into predictors and target, and only the numeric predictor variables are selected for both sets.

knn_train <- heart_disease_train
knn_test <- heart_disease_test

# Numeric Variable Predictors in train data
knn_train_x <- select(knn_train,-c(sex,cp,fbs,restecg,exang,slope,ca,thal,target))
# Numeric Variable Predictors in test data
knn_test_x <-  select(knn_test,-c(sex,cp,fbs,restecg,exang,slope,ca,thal,target,PredRisk_HeartDisease,Pred_Label_HeartDisease))

# Target Variable in train data
knn_train_y <- knn_train[,"target"]
# Target Variable in test data
knn_test_y <- knn_test[,"target"]

7.3 Scale


Generalizing the range of the predictor variables by using z-score standardization, as formulated below.
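
Each numeric predictor is rescaled with the mean and standard deviation of the training data:

\[ z = \frac{x - \bar{x}_{train}}{s_{train}} \]

In the code below, the test data is scaled with the center and scale attributes taken from the scaled training data, so both sets share the same reference scale.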

knn_train_x_scale <- scale(x = knn_train_x)
knn_test_x_scale <- 
  scale(x = knn_test_x,
        center = attr(knn_train_x_scale, "scaled:center"),
        scale = attr(knn_train_x_scale, "scaled:scale"))

7.4 Predict

Define the optimum k for the k-NN model before conducting the prediction; a common heuristic is the (rounded) square root of the number of training observations. Then create the k-NN model using the knn() function.

# find optimum k
round(sqrt(nrow(knn_train)))
#> [1] 29
knn_pred <- knn(train = knn_train_x_scale, # Train data predictors after scale
                 test = knn_test_x_scale, # Test data predictors after scale
                 cl = knn_train_y, # Only Data Train Target
                 k = 29)

8 Evaluating


The evaluation section consists of two steps: comparing the logistic regression model (model_logistic_all) with the k-NN model (knn_pred), and then using the better model to predict the test and train data in order to assess model performance, followed by analysis and visualization of the ROC curve and AUC score.

# Evaluation Test Data Prediction
eva_log <- confusionMatrix(data = Pred_Result$Result_Pred, 
                               reference = Pred_Result$Actual, 
                               positive = "1")

# KNN
eva_knn <- confusionMatrix(data = knn_pred,
                           reference = knn_test_y,
                           positive = "1")

performance_comparison <- cbind.data.frame(Accuracy=c(eva_log$overall[1],eva_knn$overall[1]),
                 Sensitivity = c(eva_log$byClass[[1]], eva_knn$byClass[[1]]),
                 Specificity = c(eva_log$byClass[[2]], eva_knn$byClass[[2]]),
                 Precision = c(eva_log$byClass[[3]], eva_knn$byClass[[3]])
                 )
rownames(performance_comparison) <- c("Logistic Regression (model_logistic_all)", "K-NN (knn_pred)")

performance_comparison
# Evaluation Train Data Prediction
eva_log_tr <- confusionMatrix(data=Pred_Result_tr$Result_Pred,
                              reference = Pred_Result_tr$Actual, 
                              positive = "1")
performance_log <- 
  cbind.data.frame(Accuracy=c(eva_log$overall[1],eva_log_tr$overall[1]),
                 Sensitivity = c(eva_log$byClass[[1]], eva_log_tr$byClass[[1]]),
                 Specificity = c(eva_log$byClass[[2]], eva_log_tr$byClass[[2]]),
                 Precision = c(eva_log$byClass[[3]], eva_log_tr$byClass[[3]]))
rownames(performance_log) <- c("On Test Data - Log.Reg.", "On Train Data - Log.Reg.")
performance_log
Actual <- factor(c("No", "No", "Yes", "Yes"))
Predicted <- factor(c("No", "Yes", "No", "Yes"))
y      <- c(eva_log$table[1], eva_log$table[2], eva_log$table[3], eva_log$table[4])
df1_log <- data.frame(Actual, Predicted, y)

# Plot 1
plot_log1 <- ggplot(data =  df1_log, mapping = aes(x = Actual, y = Predicted, col = y)) +
  geom_tile(aes(fill = y)) +
  geom_text(aes(label = sprintf("%1.0f", y)), color = "Black", size = 5, family = "Times New Roman") +
  scale_fill_gradient(low = "#91FFBF", high = "#3FF48B") +
  scale_color_gradient(low = "#91FFBF", high = "#3FF48B") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
        panel.background = element_rect(fill = "#2C343C"),
        panel.grid = element_line(colour = "#2C343C"),
        panel.grid.major.x = element_line(colour = "#2C343C"),
        panel.grid.minor.x = element_line(colour = "#2C343C"),
        axis.text.x = element_text(color = "#7BFEB2", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "#7BFEB2", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.ticks = element_blank()) 

plot_log1

roc <-  roc(predictor = Pred_Result$Prob_Pred, #Probability Prediction Result
            response = Pred_Result$Actual, # Target Actual Data Test
            levels = c("0", "1"),
            percent = TRUE)

df1_log <- data.frame(Specificity=roc$specificities, Sensitivity=roc$sensitivities)

plot_roc <- ggplot(data = df1_log, aes(x = Specificity, y = Sensitivity))+
  with_outer_glow(geom_path(colour = '#FFE5DB', size = 1), colour = "#E08F83", sigma = 10, expand = 1)+
  scale_x_reverse() +
  geom_abline(intercept = 100, slope = 1, color="#B8332E", linetype = "longdash") +
  annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc$auc,1), '%'), size = 4, color = "#DFC18F", family = "Times New Roman")+
  ylab('Sensitivity (%)')+
  xlab('Specificity (%)')+
  scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
  scale_color_gradient(low = "#E08F83", high = "#B8332E") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
        panel.background = element_rect(fill = "#2C343C"),
        panel.grid = element_line(colour = "#2C343C"),
        panel.grid.major.x = element_line(colour = "#2C343C"),
        panel.grid.minor.x = element_line(colour = "#2C343C"),
        axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

plot_roc


MODEL: Logistic Regression (model_logistic_all)

1. Defined Class
- Positive Class : 1 (Observed Heart Disease)
- Negative Class : 0 (No Heart Disease)

2. Defined Risk
- FP / False Positive: heart disease is predicted, but the actual condition is no heart disease.
- FN / False Negative: no heart disease is predicted, but the actual condition is heart disease.

3. Defined Intolerable Risk & Model Performance Metric
- The intolerable risk is the FN / false negative risk. Hence, this research uses sensitivity as the model performance metric; the sketch below shows how these metrics follow from the confusion-matrix counts.
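
As a reference sketch (assuming caret’s convention that the rows of eva_log$table are predictions and the columns are the actual labels, with “1” as the positive class), the reported metrics can be recomputed directly from the confusion-matrix counts:

# recompute the headline metrics from the confusion-matrix counts
TN <- eva_log$table[1]   # predicted 0, actual 0
FP <- eva_log$table[2]   # predicted 1, actual 0
FN <- eva_log$table[3]   # predicted 0, actual 1
TP <- eva_log$table[4]   # predicted 1, actual 1

c(
  Accuracy    = (TP + TN) / (TP + TN + FP + FN),
  Sensitivity = TP / (TP + FN),   # recall: share of actual heart-disease cases caught
  Specificity = TN / (TN + FP),
  Precision   = TP / (TP + FP)
)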


Insight:

- The sensitivity / recall of the logistic regression model is higher than that of the K-NN model.
- Logistic Regression (model_logistic_all) Accuracy: 0.8585366, meaning 85.85% of the data is classified correctly.
- Logistic Regression (model_logistic_all) Sensitivity: 0.8659794, meaning 86.59% of the actual positive class is classified correctly.
- Logistic Regression (model_logistic_all) Specificity: 0.8518519, meaning 85.18% of the actual negative class is classified correctly.
- Logistic Regression (model_logistic_all) Precision: 0.8400, meaning 84% of the predicted positive class is actually positive.
- There are no big differences in Accuracy, Sensitivity, Specificity, and Precision between “On Test Data - Log.Reg.” and “On Train Data - Log.Reg.”.
- This means there is no indication of overfitting or underfitting for the logistic regression model.
- The ROC curve also shows very good separation, with an AUC score of 91.9%, meaning the model is close to excellent at separating the target classes.

9 Conclusion


  • Based on the results of the stepwise regression method for the logistic regression model, several predictors significantly influence a person’s likelihood of having heart disease: sex, cp, trestbps, thalach, exang, oldpeak, ca, and thal.

  • The model model_logistic_all is suitable for making predictions, while model_logistic_step is appropriate for examining the significance of the predictors. Note that stepwise regression is a greedy algorithm that finds good results in a relatively short time but does not necessarily give the most optimal result. Therefore, this research uses the result of stepwise regression, model_logistic_step, as the recommended model, and there is still room for improvement.

  • The logistic regression model (model_logistic_all) is better than the KNN model (knn_pred), due to its higher sensitivity and AUC scores.