Introduction

Did you know? someone has a heart attack every 40 seconds in the United States. And in the same region, about 647,000 people die from heart disease every year. Heart disease is the leading cause of death for men, women, and people of most racial and ethnic groups until now.

In this project, we want to analyze about the cause of heart attack and also we want to create a model that can be useful to predict the heart disease event. The dataset that we’re going to use is the Heart Disease Data Set from UCI Machine Learning Repository. In which the patients data itself was collected from Cleveland, Hungary, Switzerland, and Long Beach.

To summarize our objectives, here are what we’re going to do in this dataset:


Understanding our Data

Before we’re doing our Exploratory Data Analysis, we will examine our dataset first to see if there are any anomalies that we can fix within the dataset. It is important that insights are only as good as the data that informs them. This means it’s vital for the data to be clean and in the shape of usable form.

Description

There are 14 attributes in this dataset and here are their descriptions:

  • Age: Patient’s Age in years.
  • Sex: Patient’s Gender. (M = Male, F = Female)
  • ChestPainType: Chest Pain type. (4 values: ATA, NAP, ASY, TA)
  • RestingBP: resting Blood Pressure. ( in mm Hg )
  • Cholesterol: Serum Cholesterol. ( in mg/dl )
  • FastingBS: Fasting Blood Sugar > 120 mg/dl. (0 = True, 1 = False)
  • RestingECG: resting Electroencephalographic result. (values: Normal, ST, LVH)
  • MaxHR: Maximum Heart Rate achieved.
  • ExerciseAngina: Exercise induced Angina. (N = No, Y = Yes)
  • Oldpeak: ST Depression induced by Exercise relative to rest.
  • ST_Slope: Slope of the peak exercise ST segment. (values: Up, Flat, Down)
  • HeartDisease:: Heart Disease occured. (0 = No, 1 = Yes)


Table Preview

heart <- read.csv("heart.csv")
rmarkdown::paged_table(heart)


Data Wrangling

  • Check the data types:
glimpse(heart)
## Rows: 918
## Columns: 12
## $ Age            <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex            <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType  <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP      <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol    <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG     <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR          <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope       <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease   <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…

As demonstrated above, our variables data types are still not in the right type. * Change the data types:

heart <- 
  heart %>% 
  mutate_at(vars(Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope), as.factor) %>% 
  mutate(Sex = case_when(
    Sex == "M" ~ "Male",
    Sex == "F" ~ "Female"),
    FastingBS = case_when(
    FastingBS == "0" ~ "< 120 mg/dl",
    FastingBS == "1" ~ "> 120 mg/dl"),
    ExerciseAngina = case_when(
    ExerciseAngina == "N" ~ "No",
    ExerciseAngina == "Y" ~ "Yes"))
  • Check for missing values:
colSums(is.na(heart))
##            Age            Sex  ChestPainType      RestingBP    Cholesterol 
##              0              0              0              0              0 
##      FastingBS     RestingECG          MaxHR ExerciseAngina        Oldpeak 
##              0              0              0              0              0 
##       ST_Slope   HeartDisease 
##              0              0

There are no missing values within our dataset.



Exploratory Data Analysis

Exploratory data analysis (EDA) involves using graphics and visualizations to explore and analyze a data set. The goal is to explore, investigate and learn the data variables within our dataset.

Categorical Variables Distribution

From the inspection on Data Wrangling step earlier, these are all our categorical variables and their distribution:

Chest Pain

heart_cp <- 
  heart %>% 
  select(ChestPainType) %>%
  group_by(ChestPainType) %>%
  summarize(total = n()) %>% 
  ungroup() %>% 
  arrange(desc(-total))

heart_cp %>% 
  e_charts(x = ChestPainType) %>% 
  e_pie(total,
        roseType = "radius",
        itemStyle = list(
          color = htmlwidgets::JS("
          function(params) {
                                var colorList = [new echarts.graphic.LinearGradient(
                                                 1, 0, 0, 0,
                                                [
                                                 { offset: 0, color: '#FFE5DB' },
                                                 { offset: 1, color: '#FFE5DB' }
                                                ]),
                                                new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FFE5DB' },
                                                 { offset: 1, color: '#FFE5DB' }
                                                ]),
                                                new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FFE5DB' },
                                                 { offset: 1, color: '#E08F83' }
                                                ]),
                                                new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FB6A63' },
                                                 { offset: 1, color: '#B83F3E' }
                                                ])];
                                return colorList[params.dataIndex]
                                }
                                
        ")),
        radius = c("25%", "70%"), 
        legend= FALSE, 
        label = list(
          color = "#DFC18F",
          fontFamily = "Cardo",
          fontSize = 14
        ),
        animationDuration = 4000) %>%
  e_title(text = "Chest Pain Distribution",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%") %>% 
  e_tooltip(formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:White;'>Count:</strong> ${params.value}
                                                    <br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
                                        }  ")) %>% 
  e_theme_custom("rangga3.json")
  • ASY takes the majority of the ChestPainType distribution. (54.03%)

  • ASY stands for Asymptomatic which means no symptoms or when a patient have an illness or condition (such as early stage high blood pressure or glaucoma) but do not have symptoms of it.


Gender

heart_sex <- 
  heart %>% 
  select(Sex) %>%
  group_by(Sex) %>%
  summarize(total = n()) %>% 
  ungroup() %>% 
  arrange(desc(-total))

heart_sex %>% 
  e_charts(x = Sex) %>% 
  e_pie(total, 
        roseType = "radius",
        itemStyle = list(
          color = htmlwidgets::JS("
          function(params) {
                                var colorList = [new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FFE5DB' },
                                                 { offset: 1, color: '#E08F83' }
                                                ]),
                                                new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FB6A63' },
                                                 { offset: 1, color: '#B83F3E' }
                                                ])];
                                return colorList[params.dataIndex]
                                }
                                
    ")),
        radius = c("25%", "70%"), 
        legend= FALSE, 
        label = list(
          color = "#DFC18F",
          fontFamily = "Cardo",
          fontSize = 14
        ),
        animationDuration = 4000) %>% 
  e_title(text = "Gender Distribution",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%",
          top = "1%") %>% 
  e_tooltip(formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:White;'>Count:</strong> ${params.value}
                                                    <br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
                                        }  ")) %>% 
  e_theme_custom("rangga1.json")
  • Male patient takes the majority of the Sex distribution. (78.98%)


Fasting Blood Sugar

heart_fbs <- 
  heart %>% 
  select(FastingBS) %>%
  group_by(FastingBS) %>%
  summarize(total = n()) %>% 
  ungroup() %>% 
  arrange(desc(-total))

heart_fbs %>% 
  e_charts(x = FastingBS) %>% 
  e_pie(total,
        roseType = "radius",
        itemStyle = list(
          color = htmlwidgets::JS("
          function(params) {
                                var colorList = [new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FFE5DB' },
                                                 { offset: 1, color: '#E08F83' }
                                                ]),
                                                new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FB6A63' },
                                                 { offset: 1, color: '#B83F3E' }
                                                ])];
                                return colorList[params.dataIndex]
                                }
                                
    ")),
        radius = c("25%", "70%"), 
        legend= FALSE, 
        label = list(
          color = "#DFC18F",
          fontFamily = "Cardo",
          fontSize = 14
        ),
        animationDuration = 4000) %>% 
  e_title(text = "Fasting Blood Sugar Distribution",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%",
          top = "1%") %>% 
  e_tooltip(formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:White;'>Count:</strong> ${params.value}
                                                    <br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
                                        }  ")) %>% 
  e_theme_custom("rangga1.json")
  • A fasting blood sugar level less than 100 mg/dL (5.6 mmol/L) is considered normal.

  • A fasting blood sugar level from 100 to 125 mg/dL (5.6 to 6.9 mmol/L) is considered pre-diabetes.

  • If it’s 126 mg/dL (7 mmol/L) or higher on two separate tests, means a person have diabetes.

  • Majority of patients still have normal and pre-diabetes conditions. (76.69%)


Resting ECG

heart_rtcg <- 
  heart %>% 
  select(RestingECG) %>%
  group_by(RestingECG) %>%
  summarize(total = n()) %>% 
  ungroup() %>% 
  arrange(desc(-total))

heart_rtcg %>% 
  e_charts(x = RestingECG) %>% 
  e_pie(total,
        roseType = "radius",
        itemStyle = list(
          color = htmlwidgets::JS("
          function(params) {
                                var colorList = [new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FFE5DB' },
                                                 { offset: 1, color: '#FFE5DB' }
                                                ]),
                                                new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FFE5DB' },
                                                 { offset: 1, color: '#E08F83' }
                                                ]),
                                                new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FB6A63' },
                                                 { offset: 1, color: '#B83F3E' }
                                                ])];
                                return colorList[params.dataIndex]
                                }
                                
        ")),
        radius = c("25%", "70%"), 
        legend= FALSE, 
        label = list(
          color = "#DFC18F",
          fontFamily = "Cardo",
          fontSize = 14
        ),
        animationDuration = 4000) %>% 
  e_title(text = "Resting Electrocardiogram Distribution",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%",
          top = "1%") %>% 
  e_tooltip(formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:White;'>Count:</strong> ${params.value}
                                                    <br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
                                        }  ")) %>% 
  e_theme_custom("rangga1.json")
  • Normal: your Heart rate and Rythm is normal and consistent.

  • ST: T wave inversions and/or ST elevation or depression of more than 0.05 mV.

  • LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria.

  • Most of people still have a Normal RestingECG results. (60.13%)


ST Slope

heart_slp <- 
  heart %>% 
  select(ST_Slope) %>%
  group_by(ST_Slope) %>%
  summarize(total = n()) %>% 
  ungroup() %>% 
  arrange(desc(-total))

heart_slp %>% 
  e_charts(x = ST_Slope) %>% 
  e_pie(total,
        roseType = "radius",
        itemStyle = list(
          color = htmlwidgets::JS("
          function(params) {
                                var colorList = [new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FFE5DB' },
                                                 { offset: 1, color: '#FFE5DB' }
                                                ]),
                                                new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FFE5DB' },
                                                 { offset: 1, color: '#E08F83' }
                                                ]),
                                                new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FB6A63' },
                                                 { offset: 1, color: '#B83F3E' }
                                                ])];
                                return colorList[params.dataIndex]
                                }
                                
        ")),
        radius = c("25%", "70%"), 
        legend= FALSE, 
        label = list(
          color = "#DFC18F",
          fontFamily = "Cardo",
          fontSize = 14
        ),
        animationDuration = 4000) %>%
  e_title(text = "ST Slope Distribution",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%",
          top = "1%") %>% 
  e_tooltip(formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:White;'>Count:</strong> ${params.value}
                                                    <br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
                                        }  ")) %>% 
  e_theme_custom("rangga1.json")
  • Up/Upslopping: upsloping ST‐segment depression is associated with an increased risk of coronary artery disease.

  • Down/Downsloping: downsloping ST-segment depression is a common manifestation of severe myocardial ischemia.

  • Flat: The normal ST-segment is flat and isoelectric.

  • The patient distribution for Flat and Upslopping ST-Slope are pretty even< (50% vs 43%) while Downsloping ST-Slope takes the minority role. (only 7%)


Exercise Angina

heart_exng <- 
  heart %>% 
  select(ExerciseAngina) %>%
  group_by(ExerciseAngina) %>%
  summarize(total = n()) %>% 
  ungroup() %>% 
  arrange(desc(-total))

heart_exng %>% 
  e_charts(x = ExerciseAngina) %>% 
  e_pie(total,
        roseType = "radius",
        itemStyle = list(
          color = htmlwidgets::JS("
          function(params) {
                                var colorList = [new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FFE5DB' },
                                                 { offset: 1, color: '#E08F83' }
                                                ]),
                                                new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FB6A63' },
                                                 { offset: 1, color: '#B83F3E' }
                                                ])];
                                return colorList[params.dataIndex]
                                }
                                
    ")),
        radius = c("25%", "70%"), 
        legend= FALSE, 
        label = list(
          color = "#DFC18F",
          fontFamily = "Cardo",
          fontSize = 14
        ),
        animationDuration = 4000) %>% 
    e_title(text = "Exercise Angina Distribution",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%",
          top = "1%") %>% 
  e_tooltip(formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:White;'>Count:</strong> ${params.value}
                                                    <br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
                                        }  ")) %>% 
  e_theme_custom("rangga1.json")
  • ExerciseAngina is a pain in the chest that comes on with exercise, stress, or other things that make the heart work harder.

  • The distributions are pretty even for the people who having an angina or not. (60% vs 40%)



Numerical Variables Distribution

Before we go into the data distribution, we will be analyzing the Pearson Correlation between all of our numerical variables first to see if there any initial multicollinearity.

heart_num <- heart %>% 
  select(Age, RestingBP, Cholesterol, MaxHR, Oldpeak, HeartDisease)

cormat <- round(cor(heart_num),2)

get_lower_tri<-function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
  }

get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)]<- NA
  return(cormat)
}

lower_tri <- get_lower_tri(cormat)

heart_cor_melt <- reshape2::melt(lower_tri, na.rm = TRUE)
heart_cor_melt <- heart_cor_melt %>% 
  mutate(label = glue("{Var1} ~ {Var2}"))

ggplot(heart_cor_melt,aes(Var1, Var2, text = label)) +
  geom_tile(aes(fill = value)) +
  geom_text(aes(label = round(value, 1)), alpha=0.5, size = 3, color = "White") + 
  scale_fill_gradientn(colors = c("#FFE5DB","#E08F83","#B8332E"),
                      values = rescale(c(-1,0,1)),
                      limits = c(-1,1)) +
  labs(x = NULL,
      y = NULL,
      fill = "Pearson Corr:") +
  theme(legend.background = element_rect(fill = "#2C343C", color = "#2C343C"),
        plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
        panel.background = element_rect(fill = "#2C343C"),
        panel.grid = element_line(colour = "#2C343C"),
        panel.grid.major.x = element_line(colour = "#2C343C"),
        panel.grid.minor.x = element_line(colour = "#2C343C"),
        legend.title = element_text(colour = "#DFC18F", face ="bold", family = "Times New Roman"),
        legend.text = element_text(colour = "#DFC18F", face ="bold", family = "Times New Roman"),
        legend.justification = c(1, 0),
        legend.position = c(0.6, 0.7),
        legend.direction = "horizontal",
        axis.text.x = element_text(color = "#DFC18F", family = "Times New Roman",
                                angle = 45, vjust = 1, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks = element_blank()) +
        guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                  title.position = "top", title.hjust = 0.5))

As it shows above, there are no multicollinearity detected since the highest score available is between 0.4/-0.4.

Age

## [1] "Skewness :-0.2"
## [1] "Kurtosis :2.61"
heart_age <- heart %>% 
  select(Age) %>% 
  pivot_longer(
  Age,
  names_to = "name")

linear_gradient <- htmlwidgets::JS(
  "new echarts.graphic.LinearGradient(
    0, 0, 0, 1,
    [
      { offset: 0, color: '#D76662' },
      { offset: 1, color: '#862421' }
    ])"
)

linear_gradient2 <- htmlwidgets::JS(
  "new echarts.graphic.LinearGradient(
    0, 0, 0, 1,
    [
      { offset: 0, color: '#FFE5DB' },
      { offset: 1, color: '#E08F83' }
    ])"
)


heart_age1 <- heart_age %>% 
  e_charts() %>%
  e_boxplot(value, outliers = T, name = "Age",
            boxWidth = c(50, 150),
            itemStyle = list(color = linear_gradient,
                             borderColor = linear_gradient2,
                             borderType = "dashed"),
            animationDuration = 2000) %>% 
  e_title(text = "Summary",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic",
            fontFamily = "Cardo",
            fontSize = 22
          ),
          right = "10%",
          left = "40%",
          top = "1%") %>% 
  e_x_axis(axisLabel = list(color = "#2C343C"),
           axisLine = list(
             lineStyle = list(
               color = "#333D47"
             )
           )) %>% 
  e_y_axis(axisLabel = list(fontFamily = "Cardo", color = "#DFC18F"), 
           splitLine = list(
             lineStyle = list(
               color = "#333D47"
             )
           )) %>% 
  e_color(background = "#2C343C") %>% 
  e_tooltip(trigger = "axis",
            axisPointer = list(
              type = "line",
              label = list(
                backgroundColor = "#2C343C",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              lineStyle = list(
                color = linear_gradient2
              )
            ),
            textStyle = list(
              fontFamily = "Cardo",
              color = "#DFC18F"),
            backgroundColor = "#2C343C",
            borderColor = "#2C343C",
            formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#D76662;'>Age</strong>
                                                    <br/><strong style='color:White;'>Max:</strong> 77
                                                    <br/><strong style='color:White;'>Q3:</strong> 60
                                                    <br/><strong style='color:White;'>Median:</strong> 54
                                                    <br/><strong style='color:White;'>Q1:</strong> 47
                                                    <br/><strong style='color:White;'>Min:</strong> <strong style='color:#D76662;'>28</strong>
                                                    <br/><strong style='color:White;'>Mean:</strong> 53.51`
                                        }  "))


heart_age2 <- heart_age %>% 
  e_charts()  %>% 
  e_histogram(value, 
              legend = F,
              areaStyle = list(opacity = .4,
                               color = linear_gradient2),
              itemStyle = list(color = linear_gradient2)) %>% 
  e_density(value, 
            areaStyle = list(opacity = .4,
                             color = linear_gradient),
            itemStyle = list(color = linear_gradient,
                             borderType = "dashed"),
            y_index = 1,
            legend = F) %>% 
  e_title(text = "Distribution",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%",
          top = "1%") %>% 
  e_theme_custom("rangga1.json") %>%
  e_animation(duration = 2000) %>% 
  e_tooltip(axis = "trigger",
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#2C343C",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = linear_gradient2
              )
            ),
            formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:White;'>Age:</strong> ${params.value[0]}
                                                    <br/><strong style='color:White;'>Count:</strong> ${params.value[1]}`
                                        }  "))

hw_grid(heart_age1, heart_age2)
  • The distribution of patient Age is quite normal since the curve bear resemblance to a bell curve.

  • Skewness and Kurtosis are adjacent to 0 and 3 respectively.


Resting Blood Pressure

## [1] "Skewness :0.18"
## [1] "Kurtosis :6.25"
heart_rbp <- heart %>% 
  select(RestingBP) %>% 
  pivot_longer(
  RestingBP,
  names_to = "name")

heart_rbp1 <- heart_rbp %>% 
  e_charts() %>%
  e_boxplot(value, outliers = F, name = "RBP",
            boxWidth = c(50, 150),
            itemStyle = list(color = linear_gradient,
                             borderColor = linear_gradient2,
                             borderType = "dashed"),
            animationDuration = 2000) %>% 
  e_title(text = "Summary",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic",
            fontFamily = "Cardo",
            fontSize = 22
          ),
          right = "10%",
          left = "40%",
          top = "1%") %>% 
  e_x_axis(axisLabel = list(color = "#2C343C"),
           axisLine = list(
             lineStyle = list(
               color = "#333D47"
             )
           )) %>% 
  e_y_axis(axisLabel = list(fontFamily = "Cardo",
                            color = "#DFC18F"), 
           splitLine = list(
             lineStyle = list(
               color = "#333D47"
             )
           )) %>% 
  e_color(background = "#2C343C") %>% 
  e_tooltip(trigger = "axis",
            axisPointer = list(
              type = "line",
              label = list(
                backgroundColor = "#2C343C",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              lineStyle = list(
                color = linear_gradient2
              )
            ),
            textStyle = list(
              fontFamily = "Cardo",
              color = "#DFC18F"),
            backgroundColor = "#2C343C",
            borderColor = "#2C343C",
            formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#D76662;'>Blood Pressure</strong>
                                                    <br/><strong style='color:White;'>Max:</strong> 200.0
                                                    <br/><strong style='color:White;'>Q3:</strong> 140.0
                                                    <br/><strong style='color:White;'>Median:</strong> 130.0
                                                    <br/><strong style='color:White;'>Q1:</strong> 120.0
                                                    <br/><strong style='color:White;'>Min:</strong> <strong style='color:#D76662;'>53.51</strong>
                                                    <br/><strong style='color:White;'>Mean:</strong> 132.4`
                                        }  "))


heart_rbp2 <- heart_rbp %>% 
  e_charts()  %>% 
  e_histogram(value, 
              legend = F,
              areaStyle = list(opacity = .4,
                               color = linear_gradient2),
              itemStyle = list(color = linear_gradient2)) %>% 
  e_density(value, 
            areaStyle = list(opacity = .4,
                             color = linear_gradient),
            itemStyle = list(color = linear_gradient),
            y_index = 1,
            legend = F) %>% 
  e_title(text = "Distribution",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%",
          top = "1%") %>% 
  e_theme_custom("rangga1.json") %>%
  e_animation(duration = 2000) %>% 
  e_tooltip(axis = "trigger",
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#2C343C",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = linear_gradient2
              )
            ),
            formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:White;'>Blood Pressure:</strong> ${params.value[0]}mm/hg
                                                    <br/><strong style='color:White;'>Count:</strong> ${params.value[1]}`
                                        }  "))

hw_grid(heart_rbp1, heart_rbp2)
  • The distribution of patient RestingBP shows highly right skewed charts.

  • Majority of the patient’s Blood Pressure are between 110mm/hg - 150mm/hg.

  • This might be because there’s a outliers within RestingBP values 0mm/hg - 60mm/hg.

  • Skewness are adjacent to 0 while Kurtosis are not adjacent to 3.


Cholesterol

## [1] "Skewness :-0.61"
## [1] "Kurtosis :3.11"
heart_chol <- heart %>% 
  select(Cholesterol) %>% 
  pivot_longer(
  Cholesterol,
  names_to = "name")

heart_chol1 <- heart_chol %>% 
  e_charts() %>%
  e_boxplot(value, outliers = F, name = "Cholesterol",
            boxWidth = c(50, 150),
            itemStyle = list(color = linear_gradient,
                             borderColor = linear_gradient2,
                             borderType = "dashed"),
            animationDuration = 2000) %>% 
  e_title(text = "Summary",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic",
            fontFamily = "Cardo",
            fontSize = 22
          ),
          right = "10%",
          left = "40%",
          top = "1%") %>% 
  e_x_axis(axisLabel = list(color = "#2C343C"),
           axisLine = list(
             lineStyle = list(
               color = "#333D47"
             )
           )) %>% 
  e_y_axis(axisLabel = list(fontFamily = "Cardo",
                            color = "#DFC18F"), 
           splitLine = list(
             lineStyle = list(
               color = "#333D47"
             )
           )) %>% 
  e_color(background = "#2C343C") %>% 
  e_tooltip(axis = "trigger",
            axisPointer = list(
              type = "line",
              label = list(
                backgroundColor = "#2C343C",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              lineStyle = list(
                color = linear_gradient2
              )
            ),
            textStyle = list(
              fontFamily = "Cardo",
              color = "#DFC18F"),
            backgroundColor = "#2C343C",
            borderColor = "#2C343C",
            formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#D76662;'>Cholesterol</strong>
                                                    <br/><strong style='color:White;'>Max:</strong> 603.0
                                                    <br/><strong style='color:White;'>Q3:</strong> 267.0
                                                    <br/><strong style='color:White;'>Median:</strong> 223.0
                                                    <br/><strong style='color:White;'>Q1:</strong> 173.2
                                                    <br/><strong style='color:White;'>Min:</strong> <strong style='color:#D76662;'>0.0</strong>
                                                    <br/><strong style='color:White;'>Mean:</strong> 198.8`
                                        }  "))


heart_chol2 <- heart_chol %>% 
  e_charts()  %>% 
  e_histogram(value, 
              legend = F,
              areaStyle = list(opacity = .4,
                               color = linear_gradient2),
              itemStyle = list(color = linear_gradient2)) %>% 
  e_density(value, 
            areaStyle = list(opacity = .4,
                             color = linear_gradient),
            itemStyle = list(color = linear_gradient),
            y_index = 1,
            legend = F) %>% 
  e_title(text = "Distribution",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%",
          top = "1%") %>% 
  e_theme_custom("rangga1.json") %>%
  e_animation(duration = 2000) %>% 
  e_tooltip(axis = "trigger",
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#2C343C",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = linear_gradient2
              )
            ),
            formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:White;'>Chol:</strong> ${params.value[0]}mg/dL
                                                    <br/><strong style='color:White;'>Count:</strong> ${params.value[1]}`
                                        }  "))

hw_grid(heart_chol1, heart_chol2)
  • The data distribution of patient Cholesterol shows highly left skewed charts.

  • Moreover, there are so many low values (below 100mg/dL) and also 0 values.

  • This means there are missing values and also outliers within the Cholesterol data.

For a better understanding, here are the Cholesterol thresholds:

  • Less than 100 mg/dL: Optimal.
  • 100-129 mg/dL: Near or above optimal.
  • 130-159 mg/dL: Borderline high.
  • 160-189 mg/dL: High.
  • 190 mg/dL and above: Very high.

Based on the plot above, the highest frequency transpired from the range 190 - 275 mg/dL. This implies that majority of patients are having a Cholesterol problem.


Max. Heart Rate

## [1] "Skewness :-0.14"
## [1] "Kurtosis :2.55"
heart_MaxHR <- heart %>% 
  select(MaxHR) %>% 
  pivot_longer(
  MaxHR,
  names_to = "name")

heart_MaxHR1 <- heart_MaxHR %>% 
  e_charts() %>%
  e_boxplot(value, outliers = F, name = "Max Heart Rate",
            boxWidth = c(50, 150),
            itemStyle = list(color = linear_gradient,
                             borderColor = linear_gradient2,
                             borderType = "dashed"),
            animationDuration = 2000) %>% 
  e_title(text = "Summary",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic",
            fontFamily = "Cardo",
            fontSize = 22
          ),
          right = "10%",
          left = "40%",
          top = "1%") %>% 
  e_x_axis(axisLabel = list(color = "#2C343C"),
           axisLine = list(
             lineStyle = list(
               color = "#333D47"
             )
           )) %>% 
  e_y_axis(axisLabel = list(fontFamily = "Cardo",
                            color = "#DFC18F"), 
           splitLine = list(
             lineStyle = list(
               color = "#333D47"
             )
           )) %>% 
  e_color(background = "#2C343C") %>% 
  e_tooltip(trigger = "axis",
            axisPointer = list(
              type = "line",
              label = list(
                backgroundColor = "#2C343C",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              lineStyle = list(
                color = linear_gradient2
              )
            ),
            textStyle = list(
              fontFamily = "Cardo",
              color = "#DFC18F"),
            backgroundColor = "#2C343C",
            borderColor = "#2C343C",
            formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#D76662;'>Heart Rate</strong>
                                                    <br/><strong style='color:White;'>Max:</strong> 202.0
                                                    <br/><strong style='color:White;'>Q3:</strong> 156.0
                                                    <br/><strong style='color:White;'>Median:</strong> 138.0 
                                                    <br/><strong style='color:White;'>Q1:</strong> 120.0
                                                    <br/><strong style='color:White;'>Min:</strong> <strong style='color:#D76662;'>60.0</strong>
                                                    <br/><strong style='color:White;'>Mean:</strong> 136.8`
                                        }  ")) 


heart_MaxHR2 <- heart_MaxHR %>% 
  e_charts()  %>% 
  e_histogram(value, 
              legend = F,
              areaStyle = list(opacity = .4,
                               color = linear_gradient2),
              itemStyle = list(color = linear_gradient2)) %>% 
  e_density(value, 
            areaStyle = list(opacity = .4,
                             color = linear_gradient),
            itemStyle = list(color = linear_gradient),
            y_index = 1,
            legend = F) %>% 
  e_title(text = "Distribution",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%",
          top = "1%") %>% 
  e_theme_custom("rangga1.json") %>%
  e_animation(duration = 2000) %>% 
  e_tooltip(axis = "trigger",
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#2C343C",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = linear_gradient2
              )
            ),
            formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:White;'>Heart Rate:</strong> ${params.value[0]}
                                                    <br/><strong style='color:White;'>Count:</strong> ${params.value[1]}`
                                        }  "))

hw_grid(heart_MaxHR1, heart_MaxHR2)
  • The data distribution for MaxHR variable is quite normal since the curve bear resemblance to a bell curve.

  • Skewness and Kurtosis are adjacent to 0 and 3 respectively.


Oldpeak

## [1] "Skewness :1.02"
## [1] "Kurtosis :4.19"
heart_old <- heart %>% 
  select(Oldpeak) %>% 
  pivot_longer(
  Oldpeak,
  names_to = "name")

heart_old1 <- heart_old %>% 
  e_charts() %>%
  e_boxplot(value, outliers = F, name = "Oldpeak",
            boxWidth = c(50, 150),
            itemStyle = list(color = linear_gradient,
                             borderColor = linear_gradient2,
                             borderType = "dashed"),
            animationDuration = 2000) %>% 
  e_title(text = "Summary",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic",
            fontFamily = "Cardo",
            fontSize = 22
          ),
          right = "10%",
          left = "40%",
          top = "1%") %>% 
  e_x_axis(axisLabel = list(color = "#2C343C"),
           axisLine = list(
             lineStyle = list(
               color = "#333D47"
             )
           )) %>% 
  e_y_axis(axisLabel = list(fontFamily = "Cardo", color = "#DFC18F"), 
           splitLine = list(
             lineStyle = list(
               color = "#333D47"
             )
           )) %>% 
  e_color(background = "#2C343C") %>% 
  e_tooltip(trigger = "axis",
            axisPointer = list(
              type = "line",
              label = list(
                backgroundColor = "#2C343C",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              lineStyle = list(
                color = linear_gradient2
              )
            ),
            textStyle = list(
              fontFamily = "Cardo",
              color = "#DFC18F"),
            backgroundColor = "#2C343C",
            borderColor = "#2C343C",
            formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#D76662;'>Oldpeak</strong>
                                                    <br/><strong style='color:White;'>Max:</strong> 6.2000 
                                                    <br/><strong style='color:White;'>Q3:</strong> 1.5000
                                                    <br/><strong style='color:White;'>Median:</strong> 0.6000 
                                                    <br/><strong style='color:White;'>Q1:</strong> 0.0000
                                                    <br/><strong style='color:White;'>Min:</strong> <strong style='color:#D76662;'>-2.6000</strong>
                                                    <br/><strong style='color:White;'>Mean:</strong> 0.8874`
                                        }  "))


heart_old2 <- heart_old %>% 
  e_charts()  %>% 
  e_histogram(value, 
              legend = F,
              areaStyle = list(opacity = .4,
                               color = linear_gradient2),
              itemStyle = list(color = linear_gradient2)) %>% 
  e_density(value, 
            areaStyle = list(opacity = .4,
                             color = linear_gradient),
            itemStyle = list(color = linear_gradient),
            y_index = 1,
            legend = F) %>% 
  e_title(text = "Distribution",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%",
          top = "1%") %>% 
  e_theme_custom("rangga1.json") %>%
  e_animation(duration = 2000) %>% 
  e_tooltip(axis = "trigger",
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#2C343C",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = linear_gradient2
              )
            ),
            formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:White;'>Oldpeak:</strong> ${params.value[0]}
                                                    <br/><strong style='color:White;'>Count:</strong> ${params.value[1]}`
                                        }  "))

hw_grid(heart_old1, heart_old2)
  • The data distribution of patient Oldpeak shows highly left skewed charts.

  • Skewness and Kurtosis are not adjacent to 0 and 3 respectively.



Bivariate Analysis

The purpose of bivariate analysis is to understand the relationship between two variables. The bivariate analysis involves the analysis of two variables, X: independent/explanatory/outcome variable and Y: dependent/outcome variable, to determine the relationship between them. In this part, we are going to separate our analysis again based on their types. But berfore that, we will see our target variable distribution first.

heart_heart <- heart %>% 
  select(HeartDisease) %>% 
  mutate(HeartDisease = case_when(
    HeartDisease == 0 ~ "No",
    HeartDisease == 1 ~ "Yes")) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  group_by(HeartDisease) %>% 
  summarize(count = n()) %>% 
  ungroup()

heart_heart %>% 
  e_charts(x = HeartDisease) %>% 
  e_pie(count,
        roseType = "radius",
        itemStyle = list(
          borderRadius = 1,
          color = htmlwidgets::JS("
          function(params) {
                                var colorList = [new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FFE5DB' },
                                                 { offset: 1, color: '#E08F83' }
                                                ]),
                                                new echarts.graphic.LinearGradient(
                                                 0, 0, 0, 1,
                                                [
                                                 { offset: 0, color: '#FB6A63' },
                                                 { offset: 1, color: '#B83F3E' }
                                                ])];
                                return colorList[params.dataIndex]
                                }
                                
    ")),
        radius = c("25%", "70%"), 
        legend= FALSE, 
        label = list(
          color = "#DFC18F",
          fontFamily = "Cardo",
          fontSize = 14
        ),
        animationDuration = 4000) %>% 
  e_title(text = "Heart Disease Distribution",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%",
          top = "1%") %>% 
  e_tooltip(formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:White;'>Count:</strong> ${params.value}
                                                    <br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
                                        }  ")) %>% 
  e_theme_custom("rangga1.json")

Categorical Variables based on Target Variable

Chest Pain

heart_cp1 <- heart %>% 
  select(ChestPainType, HeartDisease) %>%
  mutate(HeartDisease = case_when(
    HeartDisease == 0 ~ "No",
    HeartDisease == 1 ~ "Yes"),
         ChestPainType = as.factor(ChestPainType)) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  count(ChestPainType, HeartDisease) %>% 
  group_by(HeartDisease) 

heart_cp1 %>% 
  e_charts(ChestPainType) %>% 
  e_bar(n, stack = T) %>% 
  e_tooltip(
    trigger = "axis",
    axisPointer = list(
      type = "shadow"
    )
  ) %>% 
  e_title(text = "Heart Disease by Chest Pain Type:",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "47%",
          top = "1%") %>% 
  e_y_axis(
    splitArea = list(show = FALSE),
    splitLine = list(show = FALSE)
  ) %>% 
  e_theme_custom("rangga2.json") %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "66%",
    top = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  ) %>% 
  e_flip_coords()
  • Surprisingly, the patients who have a Heart Disease condition are mostly with No Symptoms ChestPainType.

  • This might be many of the Heart Disease condition are caused by another factors. (Which we’ll find out later)


Gender

heart_sex1 <- heart %>% 
  select(Sex, HeartDisease) %>%
  mutate(HeartDisease = case_when(
    HeartDisease == 0 ~ "No",
    HeartDisease == 1 ~ "Yes"),
         Sex = as.factor(Sex)) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  count(Sex, HeartDisease) %>% 
  group_by(HeartDisease) 

heart_sex1 %>% 
  e_charts(Sex) %>% 
  e_bar(n, stack = T) %>% 
  e_tooltip(
    trigger = "axis",
    axisPointer = list(
      type = "shadow"
    )
  ) %>% 
  e_title(text = "Heart Disease by Gender:",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "46%",
          top = "1%") %>%   
  e_y_axis(
    splitArea = list(show = FALSE),
    splitLine = list(show = FALSE)
  ) %>% 
  e_theme_custom("rangga2.json") %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "61%",
    top = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  ) %>% 
  e_flip_coords()
  • Male have a higher risk of having Heart Disease Condition rather than Female and not having Heart Disease Condition.


Fasting Blood Sugar

heart_fb1 <- heart %>% 
  select(FastingBS, HeartDisease) %>%
  mutate(HeartDisease = case_when(
    HeartDisease == 0 ~ "No",
    HeartDisease == 1 ~ "Yes"),
         FastingBS = as.factor(FastingBS)) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  count(FastingBS, HeartDisease) %>% 
  group_by(HeartDisease) 

heart_fb1 %>% 
  e_charts(FastingBS) %>% 
  e_bar(n, stack = T) %>% 
  e_tooltip(
    trigger = "axis",
    axisPointer = list(
      type = "shadow"
    )
  ) %>% 
  e_title(text = "Heart Disease by Fasting Blood Sugar:",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "46%",
          top = "1%") %>%   
  e_y_axis(
    splitArea = list(show = FALSE),
    splitLine = list(show = FALSE)
  ) %>% 
  e_theme_custom("rangga2.json") %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "67%",
    top = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  ) %>% 
  e_flip_coords()
  • Patients with Fasting Blood Sugar above 120mg/dL are 80% more affected with Heart Disease Condition rather than not.

  • While patients below 120mg/dL probability of Heart Disease Condition are pretty even.


Resting ECG

heart_ecg1 <- heart %>% 
  select(RestingECG, HeartDisease) %>%
  mutate(HeartDisease = case_when(
    HeartDisease == 0 ~ "No",
    HeartDisease == 1 ~ "Yes"),
         RestingECG = as.factor(RestingECG)) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  count(RestingECG, HeartDisease) %>% 
  group_by(HeartDisease) 

heart_ecg1 %>% 
  e_charts(RestingECG) %>% 
  e_bar(n, stack = T) %>% 
  e_tooltip(
    trigger = "axis",
    axisPointer = list(
      type = "shadow"
    )
  ) %>% 
  e_title(text = "Heart Disease by Resting ECG:",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "47%",
          top = "1%") %>%   
  e_y_axis(
    splitArea = list(show = FALSE),
    splitLine = list(show = FALSE)
  ) %>% 
  e_theme_custom("rangga2.json") %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "65%",
    top = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  ) %>% 
  e_flip_coords()
  • The Heart Disease probability of patients with Normal RestingECG result are pretty even.

  • While patients with ST & LVH results are having higher risk of Heart Disease.


Exercise Angina

heart_ang1 <- heart %>% 
  select(ExerciseAngina, HeartDisease) %>%
  mutate(HeartDisease = case_when(
    HeartDisease == 0 ~ "No",
    HeartDisease == 1 ~ "Yes"),
         ExerciseAngina = as.factor(ExerciseAngina)) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  count(ExerciseAngina, HeartDisease) %>% 
  group_by(HeartDisease) 

heart_ang1 %>% 
  e_charts(ExerciseAngina) %>% 
  e_bar(n, stack = T) %>% 
  e_tooltip(
    trigger = "axis",
    axisPointer = list(
      type = "shadow"
    )
  ) %>% 
  e_title(text = "Heart Disease by Exercise Angina:",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "46%",
          top = "1%") %>%   
  e_y_axis(
    splitArea = list(show = FALSE),
    splitLine = list(show = FALSE)
  ) %>% 
  e_theme_custom("rangga2.json") %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "65%",
    top = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  ) %>% 
  e_flip_coords()
  • Patients with ExerciseAngina condition are 85% more vulnerable of Heart Disease condition.

  • While patients without ExerciseAngina condition are relatively 62% more safer from Heart Disease condition.


ST Slope

heart_st1 <- heart %>% 
  select(ST_Slope, HeartDisease) %>%
  mutate(HeartDisease = case_when(
    HeartDisease == 0 ~ "No",
    HeartDisease == 1 ~ "Yes"),
         ST_Slope = as.factor(ST_Slope)) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  count(ST_Slope, HeartDisease) %>% 
  group_by(HeartDisease) 

heart_st1 %>% 
  e_charts(ST_Slope) %>% 
  e_bar(n, stack = T) %>% 
  e_tooltip(
    trigger = "axis",
    axisPointer = list(
      type = "shadow"
    )
  ) %>% 
  e_title(text = "Heart Disease by ST Slope:",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "47%",
          top = "1%") %>%   
  e_y_axis(
    splitArea = list(show = FALSE),
    splitLine = list(show = FALSE)
  ) %>% 
  e_theme_custom("rangga2.json") %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "63%",
    top = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  ) %>% 
  e_flip_coords()
  • It is interesting that patients with Flat ST Slope (which considered as normal results) are the ones who are highly affected with Heart Disease condition.

  • Meanwhile patients with Downslopping ST Slope (which associated with an increased risk of coronary artery disease) are the ones who relatively safer from Heart Disease condition.

  • This might be many of the Heart Disease condition are caused by another factors. (Which we’ll find out later)


Numerical Variables based on Target Variable

Age

heart_age3 <- heart %>% 
  select(Age, HeartDisease) %>% 
  mutate(HeartDisease = case_when(
    HeartDisease == 0 ~ "No",
    HeartDisease == 1 ~ "Yes")) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  group_by(HeartDisease, Age) %>% 
  summarize(count = n(), .groups = 'drop') %>% 
  pivot_wider(names_from = HeartDisease, values_from = count) %>% 
  mutate_if(is.integer, ~replace(., is.na(.), 0))

linear_gradient <- htmlwidgets::JS(
  "new echarts.graphic.LinearGradient(
    0, 0, 0, 1,
    [
      { offset: 0, color: '#B83F3E' },
      { offset: 1, color: '#FB6A63' }
    ])"
)

linear_gradient2 <- htmlwidgets::JS(
  "new echarts.graphic.LinearGradient(
    0, 0, 0, 1,
    [
      { offset: 0, color: '#E08F83' },
      { offset: 1, color: '#FFE5DB' }
    ])"
)

heart_age3 %>% 
  e_charts(Age) %>% 
    e_line(Yes,
         smooth = T,
         itemStyle = list(
           opacity = 0.5,
           color = linear_gradient),
         areaStyle = list(
           opcaity = 1,
           color = linear_gradient),
         emphasis = list(
           focus = "series")) %>% 
  e_line(No,
         smooth = T,
         itemStyle = list(
           opacity = 0.5,
           color = linear_gradient2),
         areaStyle = list(
           opcaity = 1,
           color = linear_gradient2),
         emphasis = list(
           focus = "series")) %>% 
  e_x_axis(name = "Age",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             ),
           min = 27) %>%
  e_y_axis(name = "Freq",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             )) %>% 
  e_title(text = "Heart Disease by Age",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%") %>% 
  e_tooltip(axis = "trigger",
            showDelay = 0,
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#333D47",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = "#DFC18F"
              )),
              formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
                                                    <br/><strong style='color:White;'>Age:</strong> ${params.value[0]}
                                                    <br/><strong style='color:White;'>Count:</strong> ${params.value[1]}`
                                        }  ")) %>% 
  e_theme_custom("rangga2.json") %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "45%",
    bottom = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  )
  • Patients with Heart Disease Condition tend to have higher Age.

  • Patients with Heart Disease Condition tend to grouped up within Age of 50 - 70 years old.

  • Highest count of Heart Disease Condition recorded are within 58 years old.


Resting Blood Pressure

max <- list(
  name = "Max",
  type = "max"
)

min <- list(
  name = "Min",
  type = "min"
)

avg <- list(
  type = "average",
  name = "AVG"
)

heart_rbp3 <- heart %>% 
  select(RestingBP, Age, HeartDisease) %>% 
  mutate(HeartDisease = case_when(
    HeartDisease == 0 ~ "No",
    HeartDisease == 1 ~ "Yes")) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  group_by(HeartDisease)

heart_rbp3 %>% 
  e_charts(Age) %>% 
  e_scatter(RestingBP, HeartDisease,
            emphasis = list(
              focus = "series"
            )) %>%
  e_theme_custom("rangga2.json") %>% 
  e_x_axis(name = "Age",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             ),
           min = 27) %>% 
  e_y_axis(name = "RestingBP",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             )) %>%
  e_mark_area(data = list(
                          list(xAxis = "min", yAxis = "min"), 
                          list(xAxis = "max", yAxis = "max")
                          ),
              itemStyle = list(
                color = "transparent",
                borderWidth = 1,
                borderType = "dashed")) %>% 
  e_mark_line(data = avg,
              lineStyle = list(
                type = "solid"
              )) %>% 
  e_title(text = "Heart Disease: RestingBP vs. Age",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%") %>%   
  e_tooltip(axis = "trigger",
            showDelay = 0,
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#333D47",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = "#DFC18F"
              )),
              formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
                                                    <br/><strong style='color:White;'>Age:</strong> ${params.value[0]}
                                                    <br/><strong style='color:White;'>RestingBP:</strong> ${params.value[1]}mm/hg`
                                        }  ")) %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "45%",
    bottom = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  )
  • Patients with Heart Disease Condition tend to have higher RestingBP.

  • Moreover, if you look at the highest scores of RestingBP, all of them have Heart Disease Condition

  • There is RestingBP with 0mm/hg value which indicates that this is an outlier.


Cholesterol

heart_chol3 <- heart %>% 
  select(Cholesterol, Age, HeartDisease) %>% 
  mutate(HeartDisease = case_when(
    HeartDisease == 0 ~ "No",
    HeartDisease == 1 ~ "Yes")) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  group_by(HeartDisease)

heart_chol3 %>% 
  e_charts(Age) %>% 
  e_scatter(Cholesterol, HeartDisease,
            emphasis = list(
              focus = "series"
            )) %>%
  e_theme_custom("rangga2.json") %>% 
  e_x_axis(name = "Age",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             ),
           min = 27) %>% 
  e_y_axis(name = "Cholesterol",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             )) %>%
  e_mark_area(data = list(
                          list(xAxis = "min", yAxis = "min"), 
                          list(xAxis = "max", yAxis = "max")
                          ),
              itemStyle = list(
                color = "transparent",
                borderWidth = 1,
                borderType = "dashed")) %>% 
  e_mark_line(data = avg,
              lineStyle = list(
                type = "solid"
              )) %>% 
  e_title(text = "Heart Disease: Cholesterol vs. Age",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%") %>%   
  e_tooltip(axis = "trigger",
            showDelay = 0,
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#333D47",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = "#DFC18F"
              )),
              formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
                                                    <br/><strong style='color:White;'>Age:</strong> ${params.value[0]}
                                                    <br/><strong style='color:White;'>Cholesterol:</strong> ${params.value[1]}mg/dL`
                                        }  ")) %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "45%",
    bottom = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  )
  • Patients with Heart Disease Condition tend to have lower Cholesterol rate.

  • However, this might be caused by uncollected Cholesterol rate results, since there are many outliers with values of 0mg/dL.


Max. Heart Rate

heart_HR3 <- heart %>% 
  select(MaxHR, Age, HeartDisease) %>% 
  mutate(HeartDisease = case_when(
    HeartDisease == 0 ~ "No",
    HeartDisease == 1 ~ "Yes")) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  group_by(HeartDisease)

heart_HR3 %>% 
  e_charts(Age) %>% 
  e_scatter(MaxHR, HeartDisease,
            emphasis = list(
              focus = "series"
            )) %>%
  e_theme_custom("rangga2.json") %>% 
  e_x_axis(name = "Age",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             ),
           min = 27) %>% 
  e_y_axis(name = "Max Heart Rate",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             )) %>%
  e_mark_area(data = list(
                          list(xAxis = "min", yAxis = "min"), 
                          list(xAxis = "max", yAxis = "max")
                          ),
              itemStyle = list(
                color = "transparent",
                borderWidth = 1,
                borderType = "dashed")) %>% 
  e_mark_line(data = avg,
              lineStyle = list(
                type = "solid"
              )) %>% 
  e_title(text = "Heart Disease: Max Heart Rate vs. Age",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%") %>%   
  e_tooltip(axis = "trigger",
            showDelay = 0,
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#333D47",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = "#DFC18F"
              )),
              formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
                                                    <br/><strong style='color:White;'>Age:</strong> ${params.value[0]}
                                                    <br/><strong style='color:White;'>MaxHR:</strong> ${params.value[1]}`
                                        }  ")) %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "45%",
    bottom = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  )
  • Patients with Heart Disease Condition tend to have lower Max Heart Rate.


Oldpeak

heart_old3 <- heart %>% 
  select(Oldpeak, Age, HeartDisease) %>% 
  mutate(HeartDisease = case_when(
    HeartDisease == 0 ~ "No",
    HeartDisease == 1 ~ "Yes")) %>% 
  mutate(HeartDisease = as.factor(HeartDisease)) %>% 
  group_by(HeartDisease)

heart_old3 %>% 
  e_charts(Age) %>% 
  e_scatter(Oldpeak, HeartDisease,
            emphasis = list(
              focus = "series"
            )) %>%
  e_theme_custom("rangga2.json") %>% 
  e_x_axis(name = "Age",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             ),
           min = 27) %>% 
  e_y_axis(name = "Oldpeak",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             ),
           min = -3,
           max = 7) %>%
  e_mark_area(data = list(
                          list(xAxis = "min", yAxis = "min"), 
                          list(xAxis = "max", yAxis = "max")
                          ),
              itemStyle = list(
                color = "transparent",
                borderWidth = 1,
                borderType = "dashed")) %>% 
  e_mark_line(data = avg,
              lineStyle = list(
                type = "solid"
              )) %>% 
  e_title(text = "Heart Disease: Oldpeak vs. Age",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%") %>%   
  e_tooltip(axis = "trigger",
            showDelay = 0,
            axisPointer = list(
              type = "cross",
              label = list(
                backgroundColor = "#333D47",
                fontFamily = "Cardo",
                color = "#DFC18F"
              ),
              crossStyle = list(
                color = "#DFC18F"
              )),
              formatter = htmlwidgets::JS("
                                        function(params)
                                        {
                                            return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
                                                    <br/><strong style='color:White;'>Age:</strong> ${params.value[0]}
                                                    <br/><strong style='color:White;'>Oldpeak:</strong> ${params.value[1]}`
                                        }  ")) %>% 
  e_legend(
    orient = "horizontal",
    right = "10%",
    left = "45%",
    bottom = "2%",
    textStyle = list(
      color = "#DFC18F"
    )
  )
  • Patients with Heart Disease Condition tend to have higher Oldpeak rate.

  • However, this might be caused by uncollected Oldpeak rate results, since there are many outliers with values of 0.



Modelling, Fitting, Predicting.

Logistic Regression

Logistic regression is used to describe data and the relationship between one dependent variable and one or more independent variables. The independent variables can be nominal, ordinal, or of interval type.

Logistic regression return value within range of 0,1 and not a binary class. Later, we have to classify ourselves the range to convert it to binary class.

This is the equation of a logistic regression model:

\[ log(\frac{p(X)}{1-p(X)}) = B_0 + B_1.X \]

The left-hand side is called the log-odds or logit. On the right side the b0 is the model intercept and b1 is the coefficient of feature X.


Cross Validation

Before we go into modelling, we are going to do Cross Validation for our dataset first. Why? The Cross Validation procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on unseen data. In this context, we will take 80% of our dataset to train our model and 20% of our dataset to test our prediction to see the model’s performance.

heart <- heart %>% 
  mutate(HeartDisease = as.factor(HeartDisease))

RNGkind(sample.kind = "Rounding")
set.seed(417)
index <- sample(nrow(heart), size = nrow(heart)*0.80)
heart_log_train <- heart[index,] #take 80%
heart_log_test <- heart[-index,] #take 20%
  • Re-check class balance:
proptrain <- paste(round(prop.table((table(heart_log_train$HeartDisease)))*100,2),"%")
proptraindf <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptrain[1], proptrain[2]))
rmarkdown::paged_table(proptraindf)

As shown above, the proportion of the class is still balanced.


Modelling

After we split our data, now we will train our models using the heart_train data. For the independent variables (X), we are going to use all variables first.

model_log <- glm(formula = HeartDisease ~ .,
             family = "binomial",
             data = heart_log_train)
summary(model_log)
## 
## Call:
## glm(formula = HeartDisease ~ ., family = "binomial", data = heart_log_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7439  -0.3699   0.1664   0.4168   2.5431  
## 
## Coefficients:
##                       Estimate Std. Error z value     Pr(>|z|)    
## (Intercept)          -1.254616   1.590161  -0.789     0.430120    
## Age                   0.020082   0.014630   1.373     0.169848    
## SexMale               1.292412   0.314651   4.107 0.0000400047 ***
## ChestPainTypeATA     -1.704532   0.369617  -4.612 0.0000039956 ***
## ChestPainTypeNAP     -1.720252   0.310134  -5.547 0.0000000291 ***
## ChestPainTypeTA      -1.549147   0.472862  -3.276     0.001052 ** 
## RestingBP             0.006125   0.006701   0.914     0.360709    
## Cholesterol          -0.004320   0.001242  -3.479     0.000504 ***
## FastingBS> 120 mg/dl  1.134681   0.312789   3.628     0.000286 ***
## RestingECGNormal     -0.451472   0.317714  -1.421     0.155316    
## RestingECGST         -0.315712   0.405508  -0.779     0.436239    
## MaxHR                -0.004626   0.005727  -0.808     0.419243    
## ExerciseAnginaYes     1.043952   0.284190   3.673     0.000239 ***
## Oldpeak               0.400279   0.135098   2.963     0.003048 ** 
## ST_SlopeFlat          1.500935   0.508268   2.953     0.003147 ** 
## ST_SlopeUp           -0.911941   0.528550  -1.725     0.084462 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1004.9  on 733  degrees of freedom
## Residual deviance:  457.0  on 718  degrees of freedom
## AIC: 489
## 
## Number of Fisher Scoring iterations: 6


Fitting

After our initial modelling, there are still many predictors who are not statistically significant to our target variable. Because of that, we will do Fitting using stepWise method.

model_logstep <- step(object = model_log,
                   direction = "backward",
                   trace = 0)

summary(model_logstep)
## 
## Call:
## glm(formula = HeartDisease ~ Age + Sex + ChestPainType + Cholesterol + 
##     FastingBS + ExerciseAngina + Oldpeak + ST_Slope, family = "binomial", 
##     data = heart_log_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6966  -0.3760   0.1693   0.4308   2.4996  
## 
## Coefficients:
##                       Estimate Std. Error z value     Pr(>|z|)    
## (Intercept)          -2.016719   0.976959  -2.064     0.038991 *  
## Age                   0.030469   0.013332   2.285     0.022293 *  
## SexMale               1.301603   0.312308   4.168 0.0000307708 ***
## ChestPainTypeATA     -1.732007   0.366224  -4.729 0.0000022523 ***
## ChestPainTypeNAP     -1.741057   0.305374  -5.701 0.0000000119 ***
## ChestPainTypeTA      -1.507028   0.462947  -3.255     0.001133 ** 
## Cholesterol          -0.004009   0.001171  -3.424     0.000617 ***
## FastingBS> 120 mg/dl  1.143914   0.310613   3.683     0.000231 ***
## ExerciseAnginaYes     1.097494   0.274569   3.997 0.0000641093 ***
## Oldpeak               0.394459   0.131695   2.995     0.002742 ** 
## ST_SlopeFlat          1.470728   0.509855   2.885     0.003919 ** 
## ST_SlopeUp           -0.962726   0.525608  -1.832     0.067005 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1004.95  on 733  degrees of freedom
## Residual deviance:  460.39  on 722  degrees of freedom
## AIC: 484.39
## 
## Number of Fisher Scoring iterations: 6

Interpretation of the result above:

  • Variables that increase the probability of Heart Attack are: Age, Male Gender, > 120 mg/dl Fasting Blood Sugar, Occuring Exercise Angina, Oldpeak, Flat ST Slope.

  • Variables that reduce the probability of Heart Attack are: ChestPainType ATA, ChestPainType NAP, ChestPainType TA, Cholesterol, Up ST Slope.

Note we already got rid the unsignificant variables and also the AIC scores has improved from 489 to 484.39.


Predicting

With our model_step we will predict our heart_test dataset.

heart_log_test$pred_Risk <- predict(object = model_logstep,
                                newdata = heart_log_test,
                                type = "response")

heart_log_train$pred_Risk <- predict(object = model_logstep,
                                newdata = heart_log_train,
                                type = "response")

heart_log_test %>% 
  e_charts()  %>% 
  e_density(pred_Risk, 
            smooth = T,
            itemStyle = list(
              opacity = 0.5,
              color = linear_gradient),
            areaStyle = list(
              opcaity = 1,
              color = linear_gradient),
            legend = F) %>% 
  e_theme_custom("rangga2.json") %>% 
  e_mark_line(data = list(xAxis=0.5 ), 
              title = "Threshold 0.5",
              lineStyle = list(
                color = linear_gradient2
              )) %>% 
    e_x_axis(name = "Pred Heart",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             )) %>%
  e_y_axis(name = "Density",
           nameTextStyle = list(
             color = "White",
             fontFamily = "Cardo",
             fontSize = 14,
             fontWeight = "bold"
             )) %>% 
    e_title(text = "Distribution of Probability Prediction Data",
          textStyle = list(
            color = "#DFC18F",
            fontStyle = "italic"
          ),
          right = "10%",
          left = "50%")

The interpretaion for the graph above is the predictions are augmented towards 1 which means more predictions are in favour of positive Heart Disease Condition or Yes.

  • Classify our Probability Prediction Data with threshold of 0.5:
heart_log_test$pred_Label <- ifelse(test = heart_log_test$pred_Risk > 0.5,
                                    yes = "1",
                                    no = "0")

heart_log_train$pred_Label <- ifelse(test = heart_log_train$pred_Risk > 0.5,
                                    yes = "1",
                                    no = "0")

heart_log_test <- heart_log_test %>% 
  mutate(pred_Label = as.factor(pred_Label))

rmarkdown::paged_table(heart_log_test[1:10, c("HeartDisease", "pred_Label", "pred_Risk")])


K-NN

The k-nearest neighbors is non-parametric, supervised learning classifier, which uses proximity to make classifications about the grouping of an individual data point. K-NN working off the assumption that similar points can be found near one another.

This is the equation of K-NN model:

\[ d(x,y) = \sqrt {\sum(x_i - y_i)^2} \]

We will classify using the optimum number of K, which is the square root of the number of train dataset.

Cross Validation

K-NN predictor variables have to be in numeric data types.

heart_knn <- heart %>% 
  select(HeartDisease, Age, RestingBP, Cholesterol, MaxHR)

RNGkind(sample.kind = "Rounding")
set.seed(123)

# index sampling
index1 <-  sample(nrow(heart_knn), size = 0.8*nrow(heart_knn))
  
# splitting
heartknn_train <- heart_knn[index1, ]
heartknn_test <- heart_knn[-index1, ]
  • Re-check class balance:
proptraink <- paste(round(prop.table((table(heartknn_train$HeartDisease)))*100,2),"%")
proptrainkdf <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptraink[1], proptraink[2]))
rmarkdown::paged_table(proptrainkdf)

As shown above, the proportion of the class is still balanced.


Data Pre-Processing & Scaling

For K-NN we have to separate the one dependent variable from one or more independent variables.

heartknn_train_x <- select(heartknn_train, -HeartDisease)
heartknn_test_x <- select(heartknn_test, -HeartDisease)

heartknn_train_y <- heartknn_train %>% select(HeartDisease)
heartknn_test_y <- heartknn_test %>% select(HeartDisease)

Our independent variables/predictors will be scaled using Z-score standarization since all of our data range should be in the same range. Here is the equation:

\[Z = \frac{x-mean}{sd}\]
  • Here’s how to do it in R.
heart_train_x_scale <- scale(x = heartknn_train_x)

heart_test_x_scale <- scale(x = heartknn_test_x,
                           center = attr(heart_train_x_scale, "scaled:center"),
                           scale  = attr(heart_train_x_scale, "scaled:scale"))


Predicting

In K-NN we don’t have to do any Modeling since we can just go straight Predicting. But first, we have to find the optimum K of our train data.

sqrt(nrow(heartknn_train))
## [1] 27.09243

As shown above, our optimum K is 27.

  • Predict
heartknn_pred <- knn(train = heart_train_x_scale,
                     test = heart_test_x_scale,
                     cl = heartknn_train_y$HeartDisease, 
                     k = 27,
                     prob = T)

heartknn_pred_t <- knn(train = heart_train_x_scale,
                     test = heart_train_x_scale,
                     cl = heartknn_train_y$HeartDisease, 
                     k = 27,
                     prob = T)



Decision Tree

Decision Tree belongs to the family of Supervised Learning algorithms. Unlike other supervised learning algorithms, the decision tree algorithm can be used for solving regression and classification problems too. The goal of using a Decision Tree is to create a training model that can use to predict the class or value of the target variable by learning simple decision rules inferred from prior data(training data).

Cross Validation

heart_tree <- heart %>% 
  mutate_if(is.character, as.factor) 

RNGkind(sample.kind = "Rounding")
set.seed(417)
index <- sample(nrow(heart), size = nrow(heart)*0.80)
heart_train_tree <- heart_tree[index,] #take 80%
heart_test_tree <- heart_tree[-index,] #take 20%
  • Re-check class balance:
proptrain <- paste(round(prop.table((table(heart_train_tree$HeartDisease)))*100,2),"%")
proptraindf <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptrain[1], proptrain[2]))
rmarkdown::paged_table(proptraindf)

As shown above, the proportion of the class is still balanced.


Modelling & Fitting

To understand what Decision Trees looks like, we will create tree plot for our model using fancyRPartPlot() from rattle and rpart.plot packages. As for initial predictor variables, we will use all variables first.

model_dtree <- rpart(formula = HeartDisease~., data = heart_train_tree, method = "class")

Our decision tree consists of several parts. The box on the top/Up ST_Slope is the root node. The root will split and make branches by certain rules. Each branches ended with a node. Then the node split again into other nodes that called as internal node. Nodes that do not split anymore appear at the very bottom of our plot which is called terminal node or leaf node. All in all, in summary, each node shows:

  • Predicted class (Yes or No)

  • Predicted class probabilities.

  • Percentage of observations in the node.

  • Root, internal, and terminal nodes along with the rules that partition with each observation.


Predicting

Now we already have our model, we can continue to predict with our test dataset.

dtree_pred <- predict(model_dtree, heart_test_tree, type = "class")
dtree_prob <- predict(model_dtree, heart_test_tree, type = "prob")
dtree_pred_t <- predict(model_dtree, heart_train_tree, type = "class")

dtree_table <- select(heart_test_tree, HeartDisease) %>% 
  bind_cols(heart_pred = dtree_pred) %>% 
  bind_cols(heart_eprob = round(dtree_prob[,1],4)) %>% 
  bind_cols(heart_pprob = round(dtree_prob[,2],4))

dtree_table <- dtree_table %>% 
  mutate(HeartDisease = as.factor(HeartDisease))

rmarkdown::paged_table(head(dtree_table, 10))



Naive Bayes

Naive Bayes algorithm is a supervised learning algorithm which is based on Bayes theorem and used for solving classification problem. It is one of the simple and most effective Classification algorithm which helps in building the fast machine learning models that can make quick prediction.

To understand Naive Bayes, here is the equation:

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
  • P(A|B) is posterior probability: Probability of hypthesis A on the observed event B.
  • P(B|A) is likelihood probability: Probability of the evidence given that the probability of a hypothesis is true.

Cross Validation

heart_nb <- heart %>% 
  mutate(HeartDisease = as.factor(HeartDisease))

RNGkind(sample.kind = "Rounding")
set.seed(417)
index <- sample(nrow(heart_nb), size = nrow(heart_nb)*0.80)
heart_nb_train <- heart_nb[index,] #take 80%
heart_nb_test <- heart_nb[-index,] #take 20%

head(heart_nb)
##   Age    Sex ChestPainType RestingBP Cholesterol   FastingBS RestingECG MaxHR
## 1  40   Male           ATA       140         289 < 120 mg/dl     Normal   172
## 2  49 Female           NAP       160         180 < 120 mg/dl     Normal   156
## 3  37   Male           ATA       130         283 < 120 mg/dl         ST    98
## 4  48 Female           ASY       138         214 < 120 mg/dl     Normal   108
## 5  54   Male           NAP       150         195 < 120 mg/dl     Normal   122
## 6  39   Male           NAP       120         339 < 120 mg/dl     Normal   170
##   ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1             No     0.0       Up            0
## 2             No     1.0     Flat            1
## 3             No     0.0       Up            0
## 4            Yes     1.5     Flat            1
## 5             No     0.0       Up            0
## 6             No     0.0       Up            0
  • Re-check class balance:
proptrain <- paste(round(prop.table((table(heart_nb_train$HeartDisease)))*100,2),"%")
proptraindf <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptrain[1], proptrain[2]))
rmarkdown::paged_table(proptraindf)

As shown above, the proportion of the class is still balanced.


Modelling & Fitting

library(e1071)
model_nb <- naiveBayes(formula = HeartDisease~. , data = heart_nb_train, laplace = 1)
model_nb
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.4346049 0.5653951 
## 
## Conditional probabilities:
##    Age
## Y       [,1]     [,2]
##   0 50.26959 9.389680
##   1 56.04578 9.055402
## 
##    Sex
## Y      Female      Male
##   0 0.3667712 0.6394984
##   1 0.1036145 0.9012048
## 
##    ChestPainType
## Y          ASY        ATA        NAP         TA
##   0 0.25077399 0.35603715 0.31888545 0.07430341
##   1 0.77565632 0.04534606 0.13603819 0.04295943
## 
##    RestingBP
## Y       [,1]     [,2]
##   0 130.1755 16.17807
##   1 134.5205 20.36073
## 
##    Cholesterol
## Y       [,1]      [,2]
##   0 227.3386  72.73644
##   1 177.5060 126.58521
## 
##    FastingBS
## Y   < 120 mg/dl > 120 mg/dl
##   0   0.8996865   0.1065831
##   1   0.6795181   0.3253012
## 
##    RestingECG
## Y         LVH    Normal        ST
##   0 0.1863354 0.6521739 0.1614907
##   1 0.2153110 0.5502392 0.2344498
## 
##    MaxHR
## Y       [,1]     [,2]
##   0 148.6646 22.91090
##   1 127.4361 23.34358
## 
##    ExerciseAngina
## Y          No       Yes
##   0 0.8840125 0.1222571
##   1 0.3686747 0.6361446
## 
##    Oldpeak
## Y        [,1]      [,2]
##   0 0.3971787 0.7017217
##   1 1.3159036 1.1603921
## 
##    ST_Slope
## Y        Down      Flat        Up
##   0 0.0310559 0.1863354 0.7826087
##   1 0.1004785 0.7440191 0.1555024

Interpretation of the result above:

*Male Gender has higher probability for getting a Heart Disease Condition (90%) and 63% more for not getting a Heart Disease Condition rather than Female.

  • ASY Chest Pain has the highest probability for getting a Heart Disease Condition (77%) while ATA Chest Pain has the highest probability (35%) for not getting a Heart Disease Condition rather than its counterparts.

  • Patient with Fasting Blood Sugar < 120mg has higher probability for getting a Heart Disease Condition (67%) and 89% more for not getting a Heart Disease Condition rather than patient with Fasting Blood Sugar > 120mg.

  • Patient with Normal Resting ECG has higher probability for getting a Heart Disease Condition (55%) and 65% more for not getting a Heart Disease Condition rather than its counterparts.

  • Patient with Exercise Angina has higher probability for getting a Heart Disease Condition (63%) while Patient without Exercise Angina is 88% more for not getting a Heart Disease Condition.

  • Flat ST Slope has the highest probability for getting a Heart Disease Condition (74%) while UP ST Slope has the highest probability (78%) for not getting a Heart Disease Condition rather than its counterparts.


Predicting

naive_pred <- predict(model_nb, heart_nb_test, type = "class")
naive_prob <- predict(model_nb, heart_nb_test, type = "raw")
naive_pred_t <- predict(model_nb, heart_nb_train, type = "class")

naive_table <- select(heart_nb_test, HeartDisease) %>% 
  bind_cols(heart_pred = naive_pred) %>% 
  bind_cols(heart_eprob = round(naive_prob[,1],4)) %>% 
  bind_cols(heart_pprob = round(naive_prob[,2],4)) 

naive_table <- naive_table %>% 
  mutate(HeartDisease = as.factor(HeartDisease))

rmarkdown::paged_table(head(naive_table, 10))



Evaluation

For Evaluation, we are going to use confusionMatrix and ROC/AUC function that can generate:

Accuracy: \[ \frac {TP + TN} {TP + TN + FP + FN} \]
Sensitivity: \[ \frac {TP} {TP + FN} \]
Specificity: \[ \frac {TN} {TN + FP} \]
Precision: \[ \frac {TP} {TP + FP} \]

Logistic Regression

EVALUATION SUMMARY
con_mat_log <- confusionMatrix(data = heart_log_test$pred_Label, 
                               reference = heart_log_test$HeartDisease, 
                               positive = "1")

con_mat_log_t <- confusionMatrix(data = as.factor(heart_log_train$pred_Label), 
                                 reference = heart_log_train$HeartDisease, 
                                 positive = "1")

performance <- cbind.data.frame(Accuracy = c(con_mat_log_t$overall[[1]], con_mat_log$overall[[1]]),
                                Sensitivity = c(con_mat_log_t$byClass[[1]], con_mat_log$byClass[[1]]),
                                Specificity = c(con_mat_log_t$byClass[[2]], con_mat_log$byClass[[2]]),
                                Precision = c(con_mat_log_t$byClass[[3]], con_mat_log$byClass[[3]]))

rownames(performance) <- c("On Training Data", "On Unseen Data")

rmarkdown::paged_table(performance)
  • There are no big differences between our Accuracy, Sensitivity, Specificity, and Precision score on the Unseen and Training Data.

  • This means that there are no indication of Overfitting and Underfitting for our Logistic Regression model.

Actual <- factor(c("No", "No", "Yes", "Yes"))
Predicted <- factor(c("No", "Yes", "No", "Yes"))
y      <- c(con_mat_log$table[2], con_mat_log$table[1], con_mat_log$table[4], con_mat_log$table[3])
df1_log <- data.frame(Actual, Predicted, y)

roc <-  roc(predictor = heart_log_test$pred_Risk,
            response = heart_log_test$HeartDisease,
            levels = c("0", "1"),
            percent = TRUE)
df2_log <-  data.frame(Specificity=roc$specificities, Sensitivity=roc$sensitivities)

plot_log1 <- ggplot(data =  df1_log, mapping = aes(x = Actual, y = Predicted, col = y)) +
  geom_tile(aes(fill = y)) +
  geom_text(aes(label = sprintf("%1.0f", y)), color = "White", size = 5, family = "Times New Roman") +
  scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
  scale_color_gradient(low = "#E08F83", high = "#B8332E") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
        panel.background = element_rect(fill = "#2C343C"),
        panel.grid = element_line(colour = "#2C343C"),
        panel.grid.major.x = element_line(colour = "#2C343C"),
        panel.grid.minor.x = element_line(colour = "#2C343C"),
        axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

plot_log2 <- ggplot(data = df2_log, aes(x = Specificity, y = Sensitivity))+
  with_outer_glow(geom_path(colour = '#FFE5DB', size = 1), colour = "#E08F83", sigma = 10, expand = 1)+
  scale_x_reverse() +
  geom_abline(intercept = 100, slope = 1, color="#B8332E", linetype = "longdash") +
  annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc$auc,1), '%'), size = 4, color = "#DFC18F", family = "Times New Roman")+
  ylab('Sensitivity (%)')+
  xlab('Specificity (%)')+
  scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
  scale_color_gradient(low = "#E08F83", high = "#B8332E") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
        panel.background = element_rect(fill = "#2C343C"),
        panel.grid = element_line(colour = "#2C343C"),
        panel.grid.major.x = element_line(colour = "#2C343C"),
        panel.grid.minor.x = element_line(colour = "#2C343C"),
        axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

ggarrange(plot_log1, plot_log2, nrow = 2, ncol = 1)

  • Our logistic regression model has performance accuracy of 83.15% on unseen data. Meaning that 83.15% of our data is correctly classified.

  • The value of sensitivity is 87.09% which indicates that most positive outcomes are correctly classified.

  • The value of specificity is 79.12% which indicates that most negative outcomes are correctly classified.

  • The value of precision is 81% which indicates that 81% of our positive prediction is correct.

  • Our ROC Curve also shows a very good separation with AUC score of 91.5%. Meaning that our model is proven to be near excellent at separating our target classes.


K-NN

EVALUATION SUMMARY
con_mat_knn <- confusionMatrix(data = heartknn_pred, 
                               reference = heartknn_test_y$HeartDisease, 
                               positive = "1")

con_mat_knn_t <- confusionMatrix(data = heartknn_pred_t, 
                               reference = heartknn_train_y$HeartDisease, 
                               positive = "1")

performanceknn <- cbind.data.frame(Accuracy = c(con_mat_knn_t$overall[[1]], con_mat_knn$overall[[1]]),
                                Sensitivity = c(con_mat_knn_t$byClass[[1]], con_mat_knn$byClass[[1]]),
                                Specificity = c(con_mat_knn_t$byClass[[2]], con_mat_knn$byClass[[2]]),
                                Precision = c(con_mat_knn_t$byClass[[3]], con_mat_knn$byClass[[3]]))

rownames(performanceknn) <- c("On Training Data", "On Unseen Data")

rmarkdown::paged_table(performanceknn)
  • There are no big differences between our Accuracy, Sensitivity, Specificity, and Precision score on the Unseen and Training Data.

  • This means that there are no indication of Overfitting and Underfitting for our K-NN model.

Actual1 <- factor(c("No", "No", "Yes", "Yes"))
Predicted1 <- factor(c("No", "Yes", "No", "Yes"))
y1      <- c(con_mat_knn$table[2], con_mat_knn$table[1], con_mat_knn$table[4], con_mat_knn$table[3])
df1_knn <- data.frame(Actual1, Predicted1, y1)

roc1 <- roc(predictor = attributes(heartknn_pred)$prob,
            response = heartknn_test_y$HeartDisease,
            levels = c("0", "1"),
            percent = TRUE)

df2_knn <- data.frame(Specificity=roc1$specificities, Sensitivity=roc1$sensitivities)

plot_knn1 <- ggplot(data =  df1_knn, mapping = aes(x = Actual1, y = Predicted1, col = y1)) +
  geom_tile(aes(fill = y1)) +
  geom_text(aes(label = sprintf("%1.0f", y1)), color = "White", size = 5, family = "Times New Roman") +
  scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
  scale_color_gradient(low = "#E08F83", high = "#B8332E") +
  labs(x = "Actual",
       y = "Predicted") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
        panel.background = element_rect(fill = "#2C343C"),
        panel.grid = element_line(colour = "#2C343C"),
        panel.grid.major.x = element_line(colour = "#2C343C"),
        panel.grid.minor.x = element_line(colour = "#2C343C"),
        axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

plot_knn2 <- ggplot(data = df2_knn, aes(x = Specificity, y = Sensitivity))+
  with_outer_glow(geom_path(colour = '#FFE5DB', size = 1), colour = "#E08F83", sigma = 10, expand = 1)+
  scale_x_reverse() +
  geom_abline(intercept = 100, slope = 1, color="#B8332E", linetype = "longdash") +
  annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc1$auc,1), '%'), size = 4, color = "#DFC18F", family = "Times New Roman")+
  ylab('Sensitivity (%)')+
  xlab('Specificity (%)')+
  scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
  scale_color_gradient(low = "#E08F83", high = "#B8332E") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
        panel.background = element_rect(fill = "#2C343C"),
        panel.grid = element_line(colour = "#2C343C"),
        panel.grid.major.x = element_line(colour = "#2C343C"),
        panel.grid.minor.x = element_line(colour = "#2C343C"),
        axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

ggarrange(plot_knn1, plot_knn2, nrow = 2, ncol = 1)

  • Our K-NN model has performance accuracy of 75% on unseen data. Meaning that 75% of our data is correctly classified.

  • The value of sensitivity is 82.85% which indicates that most positive outcomes are correctly classified.

  • The value of specificity is 64.55% which indicates that only moderate negative outcomes are correctly classified.

  • The value of precision is 75.65% which indicates that 75.65% of our positive prediction is correct.

  • Our ROC Curve also shows a bad performance of separation with AUC score of 55.3%. Meaning that if we want to use this model, we have to tune the model until we get a better score results than the other model.


Decision Tree

EVALUATION SUMMARY
con_mat_dt <- confusionMatrix(data = dtree_table$heart_pred, 
                              reference = dtree_table$HeartDisease, 
                              positive = "1")

con_mat_dt_t <- confusionMatrix(data = dtree_pred_t, 
                                reference = as.factor(heart_train_tree$HeartDisease), 
                                positive = "1")

performance_dt <- cbind.data.frame(Accuracy = c(con_mat_dt_t$overall[[1]], con_mat_dt$overall[[1]]),
                                   Sensitivity = c(con_mat_dt_t$byClass[[1]], con_mat_dt$byClass[[1]]),
                                   Specificity = c(con_mat_dt_t$byClass[[2]], con_mat_dt$byClass[[2]]),
                                   Precision = c(con_mat_dt_t$byClass[[3]], con_mat_dt$byClass[[3]]))

rownames(performance_dt) <- c("On Training Data", "On Unseen Data")

rmarkdown::paged_table(performance_dt)
  • There are no big differences between our Accuracy, Sensitivity, Specificity, and Precision score on the Unseen and Training Data.

  • This means that there are no indication of Overfitting and Underfitting for our Decision Tree model.

Actual2 <- factor(c("No", "No", "Yes", "Yes"))
Predicted2 <- factor(c("No", "Yes", "No", "Yes"))
y2      <- c(con_mat_dt$table[2], con_mat_dt$table[1], con_mat_dt$table[4], con_mat_dt$table[3])
df1_dt <- data.frame(Actual2, Predicted2, y2)

roc2 <- roc(predictor = dtree_table$heart_eprob,
            response = dtree_table$HeartDisease,
            levels = c("0", "1"),
            percent = TRUE)

df2_dt <- data.frame(Specificity=roc2$specificities, Sensitivity=roc2$sensitivities)

plot_dt <- ggplot(data =  df1_dt, mapping = aes(x = Actual2, y = Predicted2, col = y2)) +
  geom_tile(aes(fill = y2)) +
  geom_text(aes(label = sprintf("%1.0f", y2)), color = "White", size = 5, family = "Times New Roman") +
  scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
  scale_color_gradient(low = "#E08F83", high = "#B8332E") +
  labs(x = "Actual",
       y = "Predicted") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
        panel.background = element_rect(fill = "#2C343C"),
        panel.grid = element_line(colour = "#2C343C"),
        panel.grid.major.x = element_line(colour = "#2C343C"),
        panel.grid.minor.x = element_line(colour = "#2C343C"),
        axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

plot_dt2 <- ggplot(data = df2_dt, aes(x = Specificity, y = Sensitivity))+
  with_outer_glow(geom_path(colour = '#FFE5DB', size = 1), colour = "#E08F83", sigma = 10, expand = 1)+
  scale_x_reverse() +
  geom_abline(intercept = 100, slope = 1, color="#B8332E", linetype = "longdash") +
  annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc2$auc,1), '%'), size = 4, color = "#DFC18F", family = "Times New Roman")+
  ylab('Sensitivity (%)')+
  xlab('Specificity (%)')+
  scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
  scale_color_gradient(low = "#E08F83", high = "#B8332E") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
        panel.background = element_rect(fill = "#2C343C"),
        panel.grid = element_line(colour = "#2C343C"),
        panel.grid.major.x = element_line(colour = "#2C343C"),
        panel.grid.minor.x = element_line(colour = "#2C343C"),
        axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

ggarrange(plot_dt, plot_dt2, nrow = 2, ncol = 1)

  • Our Decision Tree model has performance accuracy of 82.06% on unseen data. Meaning that 82.06% of our data is correctly classified.

  • The value of sensitivity is 89.24% which indicates that most positive outcomes are correctly classified.

  • The value of specificity is 74.72% which indicates that moderate number of negative outcomes are correctly classified.

  • The value of precision is 78.3% which indicates that 77.9% of our positive prediction is correct.

  • Our ROC Curve also shows a very good separation with AUC score of 87.3%. Meaning that our model is already good at separating our target classes.


Naive Bayes

EVALUATION SUMMARY
con_mat_nb <- confusionMatrix(data = naive_table$heart_pred, 
                              reference = naive_table$HeartDisease, 
                              positive = "1")

con_mat_nb_t <- confusionMatrix(data = naive_pred_t, 
                                reference = as.factor(heart_nb_train$HeartDisease), 
                                positive = "1")

performance_nb <- cbind.data.frame(Accuracy = c(con_mat_nb_t$overall[[1]], con_mat_nb$overall[[1]]),
                                  Sensitivity = c(con_mat_nb_t$byClass[[1]], con_mat_nb$byClass[[1]]),
                                  Specificity = c(con_mat_nb_t$byClass[[2]], con_mat_nb$byClass[[2]]),
                                  Precision = c(con_mat_nb_t$byClass[[3]], con_mat_nb$byClass[[3]]))

rownames(performance_nb) <- c("On Training Data", "On Unseen Data")

rmarkdown::paged_table(performance_nb)
  • There are no big differences between our Accuracy, Sensitivity, Specificity, and Precision score on the Unseen and Training Data.

  • This means that there are no indication of Overfitting and Underfitting for our Naive Bayes model.

Actual3 <- factor(c("No", "No", "Yes", "Yes"))
Predicted3 <- factor(c("No", "Yes", "No", "Yes"))
y3      <- c(con_mat_nb$table[2], con_mat_nb$table[1], con_mat_nb$table[4], con_mat_nb$table[3])
df1_nb <- data.frame(Actual3, Predicted3, y3)

roc3 <- roc(predictor = naive_table$heart_eprob,
            response = naive_table$HeartDisease,
            levels = c("0","1"),
            percent = TRUE)

df2_nb <-  data.frame(Specificity=roc3$specificities, Sensitivity=roc3$sensitivities)

plot_nb <- ggplot(data =  df1_nb, mapping = aes(x = Actual3, y = Predicted3, col = y3)) +
  geom_tile(aes(fill = y3)) +
  geom_text(aes(label = sprintf("%1.0f", y3)), color = "White", size = 5, family = "Times New Roman") +
  scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
  scale_color_gradient(low = "#E08F83", high = "#B8332E") +
  labs(x = "Actual",
       y = "Predicted") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
        panel.background = element_rect(fill = "#2C343C"),
        panel.grid = element_line(colour = "#2C343C"),
        panel.grid.major.x = element_line(colour = "#2C343C"),
        panel.grid.minor.x = element_line(colour = "#2C343C"),
        axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

plot_nb2 <- ggplot(data = df2_nb, aes(x = Specificity, y = Sensitivity))+
  with_outer_glow(geom_path(colour = '#FFE5DB', size = 1), colour = "#E08F83", sigma = 10, expand = 1)+
  scale_x_reverse() +
  geom_abline(intercept = 100, slope = 1, color="#B8332E", linetype = "longdash") +
  annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc3$auc,1), '%'), size = 4, color = "#DFC18F", family = "Times New Roman")+
  ylab('Sensitivity (%)')+
  xlab('Specificity (%)')+
  scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
  scale_color_gradient(low = "#E08F83", high = "#B8332E") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
        panel.background = element_rect(fill = "#2C343C"),
        panel.grid = element_line(colour = "#2C343C"),
        panel.grid.major.x = element_line(colour = "#2C343C"),
        panel.grid.minor.x = element_line(colour = "#2C343C"),
        axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

ggarrange(plot_nb, plot_nb2, nrow = 2, ncol = 1)

  • Our Naive Bayes model has performance accuracy of 80.9% on unseen data. Meaning that 80.90% of our data is correctly classified.

  • The value of sensitivity is 87.09% which indicates that most positive outcomes are correctly classified.

  • The value of specificity is 74.72% which indicates that moderate number of negative outcomes are correctly classified.

  • The value of precision is 77.9% which indicates that 77.9% of our positive prediction is correct.

  • Our ROC Curve also shows a very good separation with AUC score of 88.9%. Meaning that our model is already good at separating our target classes.


Conclusion

Since we’re talking about Health diagnosis prediction, we want to minimize our error on positive outcomes prediction. Which in our case is about patient with Heart Disease Condition. Hereby if we look at our model performance evaluation, we want to use a model with a good score of sensitivity.

In that manner, from our 2 models that we’ve been created, the one who have higher sensitivity score is Decision Tree model with a value of 89.24%. So we can conclude that, in this case, Decision Tree proven to be better for classifying Heart Disease Condition rather than K-NN, Logistic Regression, & Naive Bayes model.

And that’s all my attempt to classify Heart Disease with Classification Models. As time goes by, I will add more models who are not covered yet in this project. However, I sincerely hope that this study could help people who will do some similar cases in the future.

 

A work by Rangga Gemilang

gemilang.rangga94@gmail.com

R Language