Did you know? someone has a heart attack every 40 seconds in the United States. And in the same region, about 647,000 people die from heart disease every year. Heart disease is the leading cause of death for men, women, and people of most racial and ethnic groups until now.
In this project, we want to analyze about the cause of heart attack and also we want to create a model that can be useful to predict the heart disease event. The dataset that we’re going to use is the Heart Disease Data Set from UCI Machine Learning Repository. In which the patients data itself was collected from Cleveland, Hungary, Switzerland, and Long Beach.
To summarize our objectives, here are what we’re going to do in this dataset:
Before we’re doing our Exploratory Data Analysis, we will examine our dataset first to see if there are any anomalies that we can fix within the dataset. It is important that insights are only as good as the data that informs them. This means it’s vital for the data to be clean and in the shape of usable form.
There are 14 attributes in this dataset and here are their descriptions:
Age: Patient’s Age in years.Sex: Patient’s Gender. (M = Male, F =
Female)ChestPainType: Chest Pain type. (4 values: ATA,
NAP, ASY, TA)RestingBP: resting Blood Pressure. ( in mm Hg
)Cholesterol: Serum Cholesterol. ( in mg/dl
)FastingBS: Fasting Blood Sugar > 120 mg/dl.
(0 = True, 1 = False)RestingECG: resting Electroencephalographic result.
(values: Normal, ST, LVH)MaxHR: Maximum Heart Rate achieved.ExerciseAngina: Exercise induced Angina. (N =
No, Y = Yes)Oldpeak: ST Depression induced by Exercise relative to
rest.ST_Slope: Slope of the peak exercise ST segment.
(values: Up, Flat, Down)HeartDisease:: Heart Disease occured. (0 =
No, 1 = Yes)heart <- read.csv("heart.csv")
rmarkdown::paged_table(heart)
glimpse(heart)
## Rows: 918
## Columns: 12
## $ Age <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…
As demonstrated above, our variables data types are still not in the right type. * Change the data types:
heart <-
heart %>%
mutate_at(vars(Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope), as.factor) %>%
mutate(Sex = case_when(
Sex == "M" ~ "Male",
Sex == "F" ~ "Female"),
FastingBS = case_when(
FastingBS == "0" ~ "< 120 mg/dl",
FastingBS == "1" ~ "> 120 mg/dl"),
ExerciseAngina = case_when(
ExerciseAngina == "N" ~ "No",
ExerciseAngina == "Y" ~ "Yes"))
colSums(is.na(heart))
## Age Sex ChestPainType RestingBP Cholesterol
## 0 0 0 0 0
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## 0 0 0 0 0
## ST_Slope HeartDisease
## 0 0
There are no missing values within our dataset.
Exploratory data analysis (EDA) involves using graphics and visualizations to explore and analyze a data set. The goal is to explore, investigate and learn the data variables within our dataset.
From the inspection on Data Wrangling step earlier, these are all our categorical variables and their distribution:
heart_cp <-
heart %>%
select(ChestPainType) %>%
group_by(ChestPainType) %>%
summarize(total = n()) %>%
ungroup() %>%
arrange(desc(-total))
heart_cp %>%
e_charts(x = ChestPainType) %>%
e_pie(total,
roseType = "radius",
itemStyle = list(
color = htmlwidgets::JS("
function(params) {
var colorList = [new echarts.graphic.LinearGradient(
1, 0, 0, 0,
[
{ offset: 0, color: '#FFE5DB' },
{ offset: 1, color: '#FFE5DB' }
]),
new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FFE5DB' },
{ offset: 1, color: '#FFE5DB' }
]),
new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FFE5DB' },
{ offset: 1, color: '#E08F83' }
]),
new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FB6A63' },
{ offset: 1, color: '#B83F3E' }
])];
return colorList[params.dataIndex]
}
")),
radius = c("25%", "70%"),
legend= FALSE,
label = list(
color = "#DFC18F",
fontFamily = "Cardo",
fontSize = 14
),
animationDuration = 4000) %>%
e_title(text = "Chest Pain Distribution",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%") %>%
e_tooltip(formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:White;'>Count:</strong> ${params.value}
<br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
} ")) %>%
e_theme_custom("rangga3.json")
ASY takes the
majority of the ChestPainType distribution. (54.03%)
ASY stands for Asymptomatic which means no symptoms or when a patient have an illness or condition (such as early stage high blood pressure or glaucoma) but do not have symptoms of it.
heart_sex <-
heart %>%
select(Sex) %>%
group_by(Sex) %>%
summarize(total = n()) %>%
ungroup() %>%
arrange(desc(-total))
heart_sex %>%
e_charts(x = Sex) %>%
e_pie(total,
roseType = "radius",
itemStyle = list(
color = htmlwidgets::JS("
function(params) {
var colorList = [new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FFE5DB' },
{ offset: 1, color: '#E08F83' }
]),
new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FB6A63' },
{ offset: 1, color: '#B83F3E' }
])];
return colorList[params.dataIndex]
}
")),
radius = c("25%", "70%"),
legend= FALSE,
label = list(
color = "#DFC18F",
fontFamily = "Cardo",
fontSize = 14
),
animationDuration = 4000) %>%
e_title(text = "Gender Distribution",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%",
top = "1%") %>%
e_tooltip(formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:White;'>Count:</strong> ${params.value}
<br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
} ")) %>%
e_theme_custom("rangga1.json")
Sex distribution. (78.98%)heart_fbs <-
heart %>%
select(FastingBS) %>%
group_by(FastingBS) %>%
summarize(total = n()) %>%
ungroup() %>%
arrange(desc(-total))
heart_fbs %>%
e_charts(x = FastingBS) %>%
e_pie(total,
roseType = "radius",
itemStyle = list(
color = htmlwidgets::JS("
function(params) {
var colorList = [new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FFE5DB' },
{ offset: 1, color: '#E08F83' }
]),
new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FB6A63' },
{ offset: 1, color: '#B83F3E' }
])];
return colorList[params.dataIndex]
}
")),
radius = c("25%", "70%"),
legend= FALSE,
label = list(
color = "#DFC18F",
fontFamily = "Cardo",
fontSize = 14
),
animationDuration = 4000) %>%
e_title(text = "Fasting Blood Sugar Distribution",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%",
top = "1%") %>%
e_tooltip(formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:White;'>Count:</strong> ${params.value}
<br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
} ")) %>%
e_theme_custom("rangga1.json")
A fasting blood sugar level less than 100 mg/dL (5.6 mmol/L) is considered normal.
A fasting blood sugar level from 100 to 125 mg/dL (5.6 to 6.9 mmol/L) is considered pre-diabetes.
If it’s 126 mg/dL (7 mmol/L) or higher on two separate tests, means a person have diabetes.
Majority of patients still have normal and pre-diabetes conditions. (76.69%)
heart_rtcg <-
heart %>%
select(RestingECG) %>%
group_by(RestingECG) %>%
summarize(total = n()) %>%
ungroup() %>%
arrange(desc(-total))
heart_rtcg %>%
e_charts(x = RestingECG) %>%
e_pie(total,
roseType = "radius",
itemStyle = list(
color = htmlwidgets::JS("
function(params) {
var colorList = [new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FFE5DB' },
{ offset: 1, color: '#FFE5DB' }
]),
new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FFE5DB' },
{ offset: 1, color: '#E08F83' }
]),
new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FB6A63' },
{ offset: 1, color: '#B83F3E' }
])];
return colorList[params.dataIndex]
}
")),
radius = c("25%", "70%"),
legend= FALSE,
label = list(
color = "#DFC18F",
fontFamily = "Cardo",
fontSize = 14
),
animationDuration = 4000) %>%
e_title(text = "Resting Electrocardiogram Distribution",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%",
top = "1%") %>%
e_tooltip(formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:White;'>Count:</strong> ${params.value}
<br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
} ")) %>%
e_theme_custom("rangga1.json")
Normal: your Heart rate and Rythm is normal and consistent.
ST: T wave inversions and/or ST elevation or depression of more than 0.05 mV.
LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria.
Most of people still have a Normal
RestingECG results. (60.13%)
heart_slp <-
heart %>%
select(ST_Slope) %>%
group_by(ST_Slope) %>%
summarize(total = n()) %>%
ungroup() %>%
arrange(desc(-total))
heart_slp %>%
e_charts(x = ST_Slope) %>%
e_pie(total,
roseType = "radius",
itemStyle = list(
color = htmlwidgets::JS("
function(params) {
var colorList = [new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FFE5DB' },
{ offset: 1, color: '#FFE5DB' }
]),
new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FFE5DB' },
{ offset: 1, color: '#E08F83' }
]),
new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FB6A63' },
{ offset: 1, color: '#B83F3E' }
])];
return colorList[params.dataIndex]
}
")),
radius = c("25%", "70%"),
legend= FALSE,
label = list(
color = "#DFC18F",
fontFamily = "Cardo",
fontSize = 14
),
animationDuration = 4000) %>%
e_title(text = "ST Slope Distribution",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%",
top = "1%") %>%
e_tooltip(formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:White;'>Count:</strong> ${params.value}
<br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
} ")) %>%
e_theme_custom("rangga1.json")
Up/Upslopping: upsloping ST‐segment depression is associated with an increased risk of coronary artery disease.
Down/Downsloping: downsloping ST-segment depression is a common manifestation of severe myocardial ischemia.
Flat: The normal ST-segment is flat and isoelectric.
The patient distribution for Flat and Upslopping
ST-Slope are pretty even< (50% vs 43%) while Downsloping
ST-Slope takes the minority role. (only 7%)
heart_exng <-
heart %>%
select(ExerciseAngina) %>%
group_by(ExerciseAngina) %>%
summarize(total = n()) %>%
ungroup() %>%
arrange(desc(-total))
heart_exng %>%
e_charts(x = ExerciseAngina) %>%
e_pie(total,
roseType = "radius",
itemStyle = list(
color = htmlwidgets::JS("
function(params) {
var colorList = [new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FFE5DB' },
{ offset: 1, color: '#E08F83' }
]),
new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FB6A63' },
{ offset: 1, color: '#B83F3E' }
])];
return colorList[params.dataIndex]
}
")),
radius = c("25%", "70%"),
legend= FALSE,
label = list(
color = "#DFC18F",
fontFamily = "Cardo",
fontSize = 14
),
animationDuration = 4000) %>%
e_title(text = "Exercise Angina Distribution",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%",
top = "1%") %>%
e_tooltip(formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:White;'>Count:</strong> ${params.value}
<br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
} ")) %>%
e_theme_custom("rangga1.json")
ExerciseAngina is a pain in the chest that comes on
with exercise, stress, or other things that make the heart work
harder.
The distributions are pretty even for the people who having an angina or not. (60% vs 40%)
Before we go into the data distribution, we will be analyzing the Pearson Correlation between all of our numerical variables first to see if there any initial multicollinearity.
heart_num <- heart %>%
select(Age, RestingBP, Cholesterol, MaxHR, Oldpeak, HeartDisease)
cormat <- round(cor(heart_num),2)
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
lower_tri <- get_lower_tri(cormat)
heart_cor_melt <- reshape2::melt(lower_tri, na.rm = TRUE)
heart_cor_melt <- heart_cor_melt %>%
mutate(label = glue("{Var1} ~ {Var2}"))
ggplot(heart_cor_melt,aes(Var1, Var2, text = label)) +
geom_tile(aes(fill = value)) +
geom_text(aes(label = round(value, 1)), alpha=0.5, size = 3, color = "White") +
scale_fill_gradientn(colors = c("#FFE5DB","#E08F83","#B8332E"),
values = rescale(c(-1,0,1)),
limits = c(-1,1)) +
labs(x = NULL,
y = NULL,
fill = "Pearson Corr:") +
theme(legend.background = element_rect(fill = "#2C343C", color = "#2C343C"),
plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
panel.background = element_rect(fill = "#2C343C"),
panel.grid = element_line(colour = "#2C343C"),
panel.grid.major.x = element_line(colour = "#2C343C"),
panel.grid.minor.x = element_line(colour = "#2C343C"),
legend.title = element_text(colour = "#DFC18F", face ="bold", family = "Times New Roman"),
legend.text = element_text(colour = "#DFC18F", face ="bold", family = "Times New Roman"),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal",
axis.text.x = element_text(color = "#DFC18F", family = "Times New Roman",
angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_blank(),
axis.ticks = element_blank()) +
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
As it shows above, there are no multicollinearity detected since
the highest score available is between 0.4/-0.4.
## [1] "Skewness :-0.2"
## [1] "Kurtosis :2.61"
heart_age <- heart %>%
select(Age) %>%
pivot_longer(
Age,
names_to = "name")
linear_gradient <- htmlwidgets::JS(
"new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#D76662' },
{ offset: 1, color: '#862421' }
])"
)
linear_gradient2 <- htmlwidgets::JS(
"new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FFE5DB' },
{ offset: 1, color: '#E08F83' }
])"
)
heart_age1 <- heart_age %>%
e_charts() %>%
e_boxplot(value, outliers = T, name = "Age",
boxWidth = c(50, 150),
itemStyle = list(color = linear_gradient,
borderColor = linear_gradient2,
borderType = "dashed"),
animationDuration = 2000) %>%
e_title(text = "Summary",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic",
fontFamily = "Cardo",
fontSize = 22
),
right = "10%",
left = "40%",
top = "1%") %>%
e_x_axis(axisLabel = list(color = "#2C343C"),
axisLine = list(
lineStyle = list(
color = "#333D47"
)
)) %>%
e_y_axis(axisLabel = list(fontFamily = "Cardo", color = "#DFC18F"),
splitLine = list(
lineStyle = list(
color = "#333D47"
)
)) %>%
e_color(background = "#2C343C") %>%
e_tooltip(trigger = "axis",
axisPointer = list(
type = "line",
label = list(
backgroundColor = "#2C343C",
fontFamily = "Cardo",
color = "#DFC18F"
),
lineStyle = list(
color = linear_gradient2
)
),
textStyle = list(
fontFamily = "Cardo",
color = "#DFC18F"),
backgroundColor = "#2C343C",
borderColor = "#2C343C",
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#D76662;'>Age</strong>
<br/><strong style='color:White;'>Max:</strong> 77
<br/><strong style='color:White;'>Q3:</strong> 60
<br/><strong style='color:White;'>Median:</strong> 54
<br/><strong style='color:White;'>Q1:</strong> 47
<br/><strong style='color:White;'>Min:</strong> <strong style='color:#D76662;'>28</strong>
<br/><strong style='color:White;'>Mean:</strong> 53.51`
} "))
heart_age2 <- heart_age %>%
e_charts() %>%
e_histogram(value,
legend = F,
areaStyle = list(opacity = .4,
color = linear_gradient2),
itemStyle = list(color = linear_gradient2)) %>%
e_density(value,
areaStyle = list(opacity = .4,
color = linear_gradient),
itemStyle = list(color = linear_gradient,
borderType = "dashed"),
y_index = 1,
legend = F) %>%
e_title(text = "Distribution",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%",
top = "1%") %>%
e_theme_custom("rangga1.json") %>%
e_animation(duration = 2000) %>%
e_tooltip(axis = "trigger",
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#2C343C",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = linear_gradient2
)
),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:White;'>Age:</strong> ${params.value[0]}
<br/><strong style='color:White;'>Count:</strong> ${params.value[1]}`
} "))
hw_grid(heart_age1, heart_age2)
The distribution of patient Age is quite
normal since the curve bear resemblance to a bell curve.
Skewness and Kurtosis are adjacent to 0 and 3 respectively.
## [1] "Skewness :0.18"
## [1] "Kurtosis :6.25"
heart_rbp <- heart %>%
select(RestingBP) %>%
pivot_longer(
RestingBP,
names_to = "name")
heart_rbp1 <- heart_rbp %>%
e_charts() %>%
e_boxplot(value, outliers = F, name = "RBP",
boxWidth = c(50, 150),
itemStyle = list(color = linear_gradient,
borderColor = linear_gradient2,
borderType = "dashed"),
animationDuration = 2000) %>%
e_title(text = "Summary",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic",
fontFamily = "Cardo",
fontSize = 22
),
right = "10%",
left = "40%",
top = "1%") %>%
e_x_axis(axisLabel = list(color = "#2C343C"),
axisLine = list(
lineStyle = list(
color = "#333D47"
)
)) %>%
e_y_axis(axisLabel = list(fontFamily = "Cardo",
color = "#DFC18F"),
splitLine = list(
lineStyle = list(
color = "#333D47"
)
)) %>%
e_color(background = "#2C343C") %>%
e_tooltip(trigger = "axis",
axisPointer = list(
type = "line",
label = list(
backgroundColor = "#2C343C",
fontFamily = "Cardo",
color = "#DFC18F"
),
lineStyle = list(
color = linear_gradient2
)
),
textStyle = list(
fontFamily = "Cardo",
color = "#DFC18F"),
backgroundColor = "#2C343C",
borderColor = "#2C343C",
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#D76662;'>Blood Pressure</strong>
<br/><strong style='color:White;'>Max:</strong> 200.0
<br/><strong style='color:White;'>Q3:</strong> 140.0
<br/><strong style='color:White;'>Median:</strong> 130.0
<br/><strong style='color:White;'>Q1:</strong> 120.0
<br/><strong style='color:White;'>Min:</strong> <strong style='color:#D76662;'>53.51</strong>
<br/><strong style='color:White;'>Mean:</strong> 132.4`
} "))
heart_rbp2 <- heart_rbp %>%
e_charts() %>%
e_histogram(value,
legend = F,
areaStyle = list(opacity = .4,
color = linear_gradient2),
itemStyle = list(color = linear_gradient2)) %>%
e_density(value,
areaStyle = list(opacity = .4,
color = linear_gradient),
itemStyle = list(color = linear_gradient),
y_index = 1,
legend = F) %>%
e_title(text = "Distribution",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%",
top = "1%") %>%
e_theme_custom("rangga1.json") %>%
e_animation(duration = 2000) %>%
e_tooltip(axis = "trigger",
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#2C343C",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = linear_gradient2
)
),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:White;'>Blood Pressure:</strong> ${params.value[0]}mm/hg
<br/><strong style='color:White;'>Count:</strong> ${params.value[1]}`
} "))
hw_grid(heart_rbp1, heart_rbp2)
The distribution of patient RestingBP shows
highly right skewed charts.
Majority of the patient’s Blood Pressure are between 110mm/hg - 150mm/hg.
This might be because there’s a outliers within
RestingBP values 0mm/hg -
60mm/hg.
Skewness are adjacent to 0 while Kurtosis are not adjacent to 3.
## [1] "Skewness :-0.61"
## [1] "Kurtosis :3.11"
heart_chol <- heart %>%
select(Cholesterol) %>%
pivot_longer(
Cholesterol,
names_to = "name")
heart_chol1 <- heart_chol %>%
e_charts() %>%
e_boxplot(value, outliers = F, name = "Cholesterol",
boxWidth = c(50, 150),
itemStyle = list(color = linear_gradient,
borderColor = linear_gradient2,
borderType = "dashed"),
animationDuration = 2000) %>%
e_title(text = "Summary",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic",
fontFamily = "Cardo",
fontSize = 22
),
right = "10%",
left = "40%",
top = "1%") %>%
e_x_axis(axisLabel = list(color = "#2C343C"),
axisLine = list(
lineStyle = list(
color = "#333D47"
)
)) %>%
e_y_axis(axisLabel = list(fontFamily = "Cardo",
color = "#DFC18F"),
splitLine = list(
lineStyle = list(
color = "#333D47"
)
)) %>%
e_color(background = "#2C343C") %>%
e_tooltip(axis = "trigger",
axisPointer = list(
type = "line",
label = list(
backgroundColor = "#2C343C",
fontFamily = "Cardo",
color = "#DFC18F"
),
lineStyle = list(
color = linear_gradient2
)
),
textStyle = list(
fontFamily = "Cardo",
color = "#DFC18F"),
backgroundColor = "#2C343C",
borderColor = "#2C343C",
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#D76662;'>Cholesterol</strong>
<br/><strong style='color:White;'>Max:</strong> 603.0
<br/><strong style='color:White;'>Q3:</strong> 267.0
<br/><strong style='color:White;'>Median:</strong> 223.0
<br/><strong style='color:White;'>Q1:</strong> 173.2
<br/><strong style='color:White;'>Min:</strong> <strong style='color:#D76662;'>0.0</strong>
<br/><strong style='color:White;'>Mean:</strong> 198.8`
} "))
heart_chol2 <- heart_chol %>%
e_charts() %>%
e_histogram(value,
legend = F,
areaStyle = list(opacity = .4,
color = linear_gradient2),
itemStyle = list(color = linear_gradient2)) %>%
e_density(value,
areaStyle = list(opacity = .4,
color = linear_gradient),
itemStyle = list(color = linear_gradient),
y_index = 1,
legend = F) %>%
e_title(text = "Distribution",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%",
top = "1%") %>%
e_theme_custom("rangga1.json") %>%
e_animation(duration = 2000) %>%
e_tooltip(axis = "trigger",
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#2C343C",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = linear_gradient2
)
),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:White;'>Chol:</strong> ${params.value[0]}mg/dL
<br/><strong style='color:White;'>Count:</strong> ${params.value[1]}`
} "))
hw_grid(heart_chol1, heart_chol2)
The data distribution of patient Cholesterol shows
highly left skewed charts.
Moreover, there are so many low values (below 100mg/dL) and also 0 values.
This means there are missing values and also outliers within the
Cholesterol data.
For a better understanding, here are the Cholesterol
thresholds:
Based on the plot above, the highest frequency transpired from the
range 190 - 275
mg/dL. This implies that majority of patients are having
a Cholesterol problem.
## [1] "Skewness :-0.14"
## [1] "Kurtosis :2.55"
heart_MaxHR <- heart %>%
select(MaxHR) %>%
pivot_longer(
MaxHR,
names_to = "name")
heart_MaxHR1 <- heart_MaxHR %>%
e_charts() %>%
e_boxplot(value, outliers = F, name = "Max Heart Rate",
boxWidth = c(50, 150),
itemStyle = list(color = linear_gradient,
borderColor = linear_gradient2,
borderType = "dashed"),
animationDuration = 2000) %>%
e_title(text = "Summary",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic",
fontFamily = "Cardo",
fontSize = 22
),
right = "10%",
left = "40%",
top = "1%") %>%
e_x_axis(axisLabel = list(color = "#2C343C"),
axisLine = list(
lineStyle = list(
color = "#333D47"
)
)) %>%
e_y_axis(axisLabel = list(fontFamily = "Cardo",
color = "#DFC18F"),
splitLine = list(
lineStyle = list(
color = "#333D47"
)
)) %>%
e_color(background = "#2C343C") %>%
e_tooltip(trigger = "axis",
axisPointer = list(
type = "line",
label = list(
backgroundColor = "#2C343C",
fontFamily = "Cardo",
color = "#DFC18F"
),
lineStyle = list(
color = linear_gradient2
)
),
textStyle = list(
fontFamily = "Cardo",
color = "#DFC18F"),
backgroundColor = "#2C343C",
borderColor = "#2C343C",
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#D76662;'>Heart Rate</strong>
<br/><strong style='color:White;'>Max:</strong> 202.0
<br/><strong style='color:White;'>Q3:</strong> 156.0
<br/><strong style='color:White;'>Median:</strong> 138.0
<br/><strong style='color:White;'>Q1:</strong> 120.0
<br/><strong style='color:White;'>Min:</strong> <strong style='color:#D76662;'>60.0</strong>
<br/><strong style='color:White;'>Mean:</strong> 136.8`
} "))
heart_MaxHR2 <- heart_MaxHR %>%
e_charts() %>%
e_histogram(value,
legend = F,
areaStyle = list(opacity = .4,
color = linear_gradient2),
itemStyle = list(color = linear_gradient2)) %>%
e_density(value,
areaStyle = list(opacity = .4,
color = linear_gradient),
itemStyle = list(color = linear_gradient),
y_index = 1,
legend = F) %>%
e_title(text = "Distribution",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%",
top = "1%") %>%
e_theme_custom("rangga1.json") %>%
e_animation(duration = 2000) %>%
e_tooltip(axis = "trigger",
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#2C343C",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = linear_gradient2
)
),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:White;'>Heart Rate:</strong> ${params.value[0]}
<br/><strong style='color:White;'>Count:</strong> ${params.value[1]}`
} "))
hw_grid(heart_MaxHR1, heart_MaxHR2)
The data distribution for MaxHR variable is quite
normal since the curve bear resemblance to a bell curve.
Skewness and Kurtosis are adjacent to 0 and 3 respectively.
## [1] "Skewness :1.02"
## [1] "Kurtosis :4.19"
heart_old <- heart %>%
select(Oldpeak) %>%
pivot_longer(
Oldpeak,
names_to = "name")
heart_old1 <- heart_old %>%
e_charts() %>%
e_boxplot(value, outliers = F, name = "Oldpeak",
boxWidth = c(50, 150),
itemStyle = list(color = linear_gradient,
borderColor = linear_gradient2,
borderType = "dashed"),
animationDuration = 2000) %>%
e_title(text = "Summary",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic",
fontFamily = "Cardo",
fontSize = 22
),
right = "10%",
left = "40%",
top = "1%") %>%
e_x_axis(axisLabel = list(color = "#2C343C"),
axisLine = list(
lineStyle = list(
color = "#333D47"
)
)) %>%
e_y_axis(axisLabel = list(fontFamily = "Cardo", color = "#DFC18F"),
splitLine = list(
lineStyle = list(
color = "#333D47"
)
)) %>%
e_color(background = "#2C343C") %>%
e_tooltip(trigger = "axis",
axisPointer = list(
type = "line",
label = list(
backgroundColor = "#2C343C",
fontFamily = "Cardo",
color = "#DFC18F"
),
lineStyle = list(
color = linear_gradient2
)
),
textStyle = list(
fontFamily = "Cardo",
color = "#DFC18F"),
backgroundColor = "#2C343C",
borderColor = "#2C343C",
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#D76662;'>Oldpeak</strong>
<br/><strong style='color:White;'>Max:</strong> 6.2000
<br/><strong style='color:White;'>Q3:</strong> 1.5000
<br/><strong style='color:White;'>Median:</strong> 0.6000
<br/><strong style='color:White;'>Q1:</strong> 0.0000
<br/><strong style='color:White;'>Min:</strong> <strong style='color:#D76662;'>-2.6000</strong>
<br/><strong style='color:White;'>Mean:</strong> 0.8874`
} "))
heart_old2 <- heart_old %>%
e_charts() %>%
e_histogram(value,
legend = F,
areaStyle = list(opacity = .4,
color = linear_gradient2),
itemStyle = list(color = linear_gradient2)) %>%
e_density(value,
areaStyle = list(opacity = .4,
color = linear_gradient),
itemStyle = list(color = linear_gradient),
y_index = 1,
legend = F) %>%
e_title(text = "Distribution",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%",
top = "1%") %>%
e_theme_custom("rangga1.json") %>%
e_animation(duration = 2000) %>%
e_tooltip(axis = "trigger",
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#2C343C",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = linear_gradient2
)
),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:White;'>Oldpeak:</strong> ${params.value[0]}
<br/><strong style='color:White;'>Count:</strong> ${params.value[1]}`
} "))
hw_grid(heart_old1, heart_old2)
The data distribution of patient Oldpeak shows
highly left skewed charts.
Skewness and Kurtosis are not adjacent to 0 and 3 respectively.
The purpose of bivariate analysis is to understand the relationship between two variables. The bivariate analysis involves the analysis of two variables, X: independent/explanatory/outcome variable and Y: dependent/outcome variable, to determine the relationship between them. In this part, we are going to separate our analysis again based on their types. But berfore that, we will see our target variable distribution first.
heart_heart <- heart %>%
select(HeartDisease) %>%
mutate(HeartDisease = case_when(
HeartDisease == 0 ~ "No",
HeartDisease == 1 ~ "Yes")) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
group_by(HeartDisease) %>%
summarize(count = n()) %>%
ungroup()
heart_heart %>%
e_charts(x = HeartDisease) %>%
e_pie(count,
roseType = "radius",
itemStyle = list(
borderRadius = 1,
color = htmlwidgets::JS("
function(params) {
var colorList = [new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FFE5DB' },
{ offset: 1, color: '#E08F83' }
]),
new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#FB6A63' },
{ offset: 1, color: '#B83F3E' }
])];
return colorList[params.dataIndex]
}
")),
radius = c("25%", "70%"),
legend= FALSE,
label = list(
color = "#DFC18F",
fontFamily = "Cardo",
fontSize = 14
),
animationDuration = 4000) %>%
e_title(text = "Heart Disease Distribution",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%",
top = "1%") %>%
e_tooltip(formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:White;'>Count:</strong> ${params.value}
<br/><strong style='color:White;'>Percent:</strong> ${params.percent}%`
} ")) %>%
e_theme_custom("rangga1.json")
The are more patients who have a Heart Disease condition (55%) are bigger than the people who are not (45%).
However, the class proportion itself is still categorized as balanced, since the differences between them are no more than 15%.
heart_cp1 <- heart %>%
select(ChestPainType, HeartDisease) %>%
mutate(HeartDisease = case_when(
HeartDisease == 0 ~ "No",
HeartDisease == 1 ~ "Yes"),
ChestPainType = as.factor(ChestPainType)) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
count(ChestPainType, HeartDisease) %>%
group_by(HeartDisease)
heart_cp1 %>%
e_charts(ChestPainType) %>%
e_bar(n, stack = T) %>%
e_tooltip(
trigger = "axis",
axisPointer = list(
type = "shadow"
)
) %>%
e_title(text = "Heart Disease by Chest Pain Type:",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "47%",
top = "1%") %>%
e_y_axis(
splitArea = list(show = FALSE),
splitLine = list(show = FALSE)
) %>%
e_theme_custom("rangga2.json") %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "66%",
top = "2%",
textStyle = list(
color = "#DFC18F"
)
) %>%
e_flip_coords()
Surprisingly, the patients who have a Heart Disease
condition are mostly with No Symptoms
ChestPainType.
This might be many of the Heart Disease condition are caused by another factors. (Which we’ll find out later)
heart_sex1 <- heart %>%
select(Sex, HeartDisease) %>%
mutate(HeartDisease = case_when(
HeartDisease == 0 ~ "No",
HeartDisease == 1 ~ "Yes"),
Sex = as.factor(Sex)) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
count(Sex, HeartDisease) %>%
group_by(HeartDisease)
heart_sex1 %>%
e_charts(Sex) %>%
e_bar(n, stack = T) %>%
e_tooltip(
trigger = "axis",
axisPointer = list(
type = "shadow"
)
) %>%
e_title(text = "Heart Disease by Gender:",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "46%",
top = "1%") %>%
e_y_axis(
splitArea = list(show = FALSE),
splitLine = list(show = FALSE)
) %>%
e_theme_custom("rangga2.json") %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "61%",
top = "2%",
textStyle = list(
color = "#DFC18F"
)
) %>%
e_flip_coords()
heart_fb1 <- heart %>%
select(FastingBS, HeartDisease) %>%
mutate(HeartDisease = case_when(
HeartDisease == 0 ~ "No",
HeartDisease == 1 ~ "Yes"),
FastingBS = as.factor(FastingBS)) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
count(FastingBS, HeartDisease) %>%
group_by(HeartDisease)
heart_fb1 %>%
e_charts(FastingBS) %>%
e_bar(n, stack = T) %>%
e_tooltip(
trigger = "axis",
axisPointer = list(
type = "shadow"
)
) %>%
e_title(text = "Heart Disease by Fasting Blood Sugar:",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "46%",
top = "1%") %>%
e_y_axis(
splitArea = list(show = FALSE),
splitLine = list(show = FALSE)
) %>%
e_theme_custom("rangga2.json") %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "67%",
top = "2%",
textStyle = list(
color = "#DFC18F"
)
) %>%
e_flip_coords()
Patients with Fasting Blood Sugar above 120mg/dL are 80% more affected with Heart Disease
Condition rather than not.
While patients below 120mg/dL probability of Heart Disease Condition are pretty even.
heart_ecg1 <- heart %>%
select(RestingECG, HeartDisease) %>%
mutate(HeartDisease = case_when(
HeartDisease == 0 ~ "No",
HeartDisease == 1 ~ "Yes"),
RestingECG = as.factor(RestingECG)) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
count(RestingECG, HeartDisease) %>%
group_by(HeartDisease)
heart_ecg1 %>%
e_charts(RestingECG) %>%
e_bar(n, stack = T) %>%
e_tooltip(
trigger = "axis",
axisPointer = list(
type = "shadow"
)
) %>%
e_title(text = "Heart Disease by Resting ECG:",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "47%",
top = "1%") %>%
e_y_axis(
splitArea = list(show = FALSE),
splitLine = list(show = FALSE)
) %>%
e_theme_custom("rangga2.json") %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "65%",
top = "2%",
textStyle = list(
color = "#DFC18F"
)
) %>%
e_flip_coords()
The Heart Disease probability of patients with
Normal
RestingECG result are pretty even.
While patients with ST & LVH results are having higher risk of Heart Disease.
heart_ang1 <- heart %>%
select(ExerciseAngina, HeartDisease) %>%
mutate(HeartDisease = case_when(
HeartDisease == 0 ~ "No",
HeartDisease == 1 ~ "Yes"),
ExerciseAngina = as.factor(ExerciseAngina)) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
count(ExerciseAngina, HeartDisease) %>%
group_by(HeartDisease)
heart_ang1 %>%
e_charts(ExerciseAngina) %>%
e_bar(n, stack = T) %>%
e_tooltip(
trigger = "axis",
axisPointer = list(
type = "shadow"
)
) %>%
e_title(text = "Heart Disease by Exercise Angina:",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "46%",
top = "1%") %>%
e_y_axis(
splitArea = list(show = FALSE),
splitLine = list(show = FALSE)
) %>%
e_theme_custom("rangga2.json") %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "65%",
top = "2%",
textStyle = list(
color = "#DFC18F"
)
) %>%
e_flip_coords()
Patients with ExerciseAngina condition are 85% more vulnerable of Heart
Disease condition.
While patients without ExerciseAngina condition are
relatively 62% more safer from
Heart Disease condition.
heart_st1 <- heart %>%
select(ST_Slope, HeartDisease) %>%
mutate(HeartDisease = case_when(
HeartDisease == 0 ~ "No",
HeartDisease == 1 ~ "Yes"),
ST_Slope = as.factor(ST_Slope)) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
count(ST_Slope, HeartDisease) %>%
group_by(HeartDisease)
heart_st1 %>%
e_charts(ST_Slope) %>%
e_bar(n, stack = T) %>%
e_tooltip(
trigger = "axis",
axisPointer = list(
type = "shadow"
)
) %>%
e_title(text = "Heart Disease by ST Slope:",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "47%",
top = "1%") %>%
e_y_axis(
splitArea = list(show = FALSE),
splitLine = list(show = FALSE)
) %>%
e_theme_custom("rangga2.json") %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "63%",
top = "2%",
textStyle = list(
color = "#DFC18F"
)
) %>%
e_flip_coords()
It is interesting that patients with Flat ST Slope
(which considered as normal results) are the ones who are highly
affected with Heart Disease condition.
Meanwhile patients with Downslopping
ST Slope (which associated with an increased risk of
coronary artery disease) are the ones who relatively safer from
Heart Disease condition.
This might be many of the Heart Disease condition are caused by another factors. (Which we’ll find out later)
heart_age3 <- heart %>%
select(Age, HeartDisease) %>%
mutate(HeartDisease = case_when(
HeartDisease == 0 ~ "No",
HeartDisease == 1 ~ "Yes")) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
group_by(HeartDisease, Age) %>%
summarize(count = n(), .groups = 'drop') %>%
pivot_wider(names_from = HeartDisease, values_from = count) %>%
mutate_if(is.integer, ~replace(., is.na(.), 0))
linear_gradient <- htmlwidgets::JS(
"new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#B83F3E' },
{ offset: 1, color: '#FB6A63' }
])"
)
linear_gradient2 <- htmlwidgets::JS(
"new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#E08F83' },
{ offset: 1, color: '#FFE5DB' }
])"
)
heart_age3 %>%
e_charts(Age) %>%
e_line(Yes,
smooth = T,
itemStyle = list(
opacity = 0.5,
color = linear_gradient),
areaStyle = list(
opcaity = 1,
color = linear_gradient),
emphasis = list(
focus = "series")) %>%
e_line(No,
smooth = T,
itemStyle = list(
opacity = 0.5,
color = linear_gradient2),
areaStyle = list(
opcaity = 1,
color = linear_gradient2),
emphasis = list(
focus = "series")) %>%
e_x_axis(name = "Age",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
),
min = 27) %>%
e_y_axis(name = "Freq",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
)) %>%
e_title(text = "Heart Disease by Age",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%") %>%
e_tooltip(axis = "trigger",
showDelay = 0,
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#333D47",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = "#DFC18F"
)),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
<br/><strong style='color:White;'>Age:</strong> ${params.value[0]}
<br/><strong style='color:White;'>Count:</strong> ${params.value[1]}`
} ")) %>%
e_theme_custom("rangga2.json") %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "45%",
bottom = "2%",
textStyle = list(
color = "#DFC18F"
)
)
Patients with Heart Disease Condition tend to
have higher Age.
Patients with Heart Disease Condition tend to
grouped up within Age of 50 - 70 years old.
Highest count of Heart Disease Condition recorded are within 58 years old.
max <- list(
name = "Max",
type = "max"
)
min <- list(
name = "Min",
type = "min"
)
avg <- list(
type = "average",
name = "AVG"
)
heart_rbp3 <- heart %>%
select(RestingBP, Age, HeartDisease) %>%
mutate(HeartDisease = case_when(
HeartDisease == 0 ~ "No",
HeartDisease == 1 ~ "Yes")) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
group_by(HeartDisease)
heart_rbp3 %>%
e_charts(Age) %>%
e_scatter(RestingBP, HeartDisease,
emphasis = list(
focus = "series"
)) %>%
e_theme_custom("rangga2.json") %>%
e_x_axis(name = "Age",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
),
min = 27) %>%
e_y_axis(name = "RestingBP",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
)) %>%
e_mark_area(data = list(
list(xAxis = "min", yAxis = "min"),
list(xAxis = "max", yAxis = "max")
),
itemStyle = list(
color = "transparent",
borderWidth = 1,
borderType = "dashed")) %>%
e_mark_line(data = avg,
lineStyle = list(
type = "solid"
)) %>%
e_title(text = "Heart Disease: RestingBP vs. Age",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%") %>%
e_tooltip(axis = "trigger",
showDelay = 0,
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#333D47",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = "#DFC18F"
)),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
<br/><strong style='color:White;'>Age:</strong> ${params.value[0]}
<br/><strong style='color:White;'>RestingBP:</strong> ${params.value[1]}mm/hg`
} ")) %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "45%",
bottom = "2%",
textStyle = list(
color = "#DFC18F"
)
)
Patients with Heart Disease Condition tend to
have higher RestingBP.
Moreover, if you look at the highest scores of
RestingBP, all of them have Heart Disease
Condition
There is RestingBP with 0mm/hg value which indicates that this is
an outlier.
heart_chol3 <- heart %>%
select(Cholesterol, Age, HeartDisease) %>%
mutate(HeartDisease = case_when(
HeartDisease == 0 ~ "No",
HeartDisease == 1 ~ "Yes")) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
group_by(HeartDisease)
heart_chol3 %>%
e_charts(Age) %>%
e_scatter(Cholesterol, HeartDisease,
emphasis = list(
focus = "series"
)) %>%
e_theme_custom("rangga2.json") %>%
e_x_axis(name = "Age",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
),
min = 27) %>%
e_y_axis(name = "Cholesterol",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
)) %>%
e_mark_area(data = list(
list(xAxis = "min", yAxis = "min"),
list(xAxis = "max", yAxis = "max")
),
itemStyle = list(
color = "transparent",
borderWidth = 1,
borderType = "dashed")) %>%
e_mark_line(data = avg,
lineStyle = list(
type = "solid"
)) %>%
e_title(text = "Heart Disease: Cholesterol vs. Age",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%") %>%
e_tooltip(axis = "trigger",
showDelay = 0,
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#333D47",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = "#DFC18F"
)),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
<br/><strong style='color:White;'>Age:</strong> ${params.value[0]}
<br/><strong style='color:White;'>Cholesterol:</strong> ${params.value[1]}mg/dL`
} ")) %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "45%",
bottom = "2%",
textStyle = list(
color = "#DFC18F"
)
)
Patients with Heart Disease Condition tend to
have lower Cholesterol rate.
However, this might be caused by uncollected
Cholesterol rate results, since there are many
outliers with values of 0mg/dL.
heart_HR3 <- heart %>%
select(MaxHR, Age, HeartDisease) %>%
mutate(HeartDisease = case_when(
HeartDisease == 0 ~ "No",
HeartDisease == 1 ~ "Yes")) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
group_by(HeartDisease)
heart_HR3 %>%
e_charts(Age) %>%
e_scatter(MaxHR, HeartDisease,
emphasis = list(
focus = "series"
)) %>%
e_theme_custom("rangga2.json") %>%
e_x_axis(name = "Age",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
),
min = 27) %>%
e_y_axis(name = "Max Heart Rate",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
)) %>%
e_mark_area(data = list(
list(xAxis = "min", yAxis = "min"),
list(xAxis = "max", yAxis = "max")
),
itemStyle = list(
color = "transparent",
borderWidth = 1,
borderType = "dashed")) %>%
e_mark_line(data = avg,
lineStyle = list(
type = "solid"
)) %>%
e_title(text = "Heart Disease: Max Heart Rate vs. Age",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%") %>%
e_tooltip(axis = "trigger",
showDelay = 0,
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#333D47",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = "#DFC18F"
)),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
<br/><strong style='color:White;'>Age:</strong> ${params.value[0]}
<br/><strong style='color:White;'>MaxHR:</strong> ${params.value[1]}`
} ")) %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "45%",
bottom = "2%",
textStyle = list(
color = "#DFC18F"
)
)
Max Heart Rate.heart_old3 <- heart %>%
select(Oldpeak, Age, HeartDisease) %>%
mutate(HeartDisease = case_when(
HeartDisease == 0 ~ "No",
HeartDisease == 1 ~ "Yes")) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
group_by(HeartDisease)
heart_old3 %>%
e_charts(Age) %>%
e_scatter(Oldpeak, HeartDisease,
emphasis = list(
focus = "series"
)) %>%
e_theme_custom("rangga2.json") %>%
e_x_axis(name = "Age",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
),
min = 27) %>%
e_y_axis(name = "Oldpeak",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
),
min = -3,
max = 7) %>%
e_mark_area(data = list(
list(xAxis = "min", yAxis = "min"),
list(xAxis = "max", yAxis = "max")
),
itemStyle = list(
color = "transparent",
borderWidth = 1,
borderType = "dashed")) %>%
e_mark_line(data = avg,
lineStyle = list(
type = "solid"
)) %>%
e_title(text = "Heart Disease: Oldpeak vs. Age",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%") %>%
e_tooltip(axis = "trigger",
showDelay = 0,
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#333D47",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = "#DFC18F"
)),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
<br/><strong style='color:White;'>Age:</strong> ${params.value[0]}
<br/><strong style='color:White;'>Oldpeak:</strong> ${params.value[1]}`
} ")) %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "45%",
bottom = "2%",
textStyle = list(
color = "#DFC18F"
)
)
Patients with Heart Disease Condition tend to
have higher Oldpeak rate.
However, this might be caused by uncollected
Oldpeak rate results, since there are many outliers
with values of 0.
Logistic regression is used to describe data and the relationship between one dependent variable and one or more independent variables. The independent variables can be nominal, ordinal, or of interval type.
Logistic regression return value within range of 0,1 and not a binary class. Later, we have to classify ourselves the range to convert it to binary class.
This is the equation of a logistic regression model:\[ log(\frac{p(X)}{1-p(X)}) = B_0 + B_1.X \]
The left-hand side is called the log-odds or logit. On the right side the b0 is the model intercept and b1 is the coefficient of feature X.
Before we go into modelling, we are going to do Cross Validation for our dataset first. Why? The Cross Validation procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on unseen data. In this context, we will take 80% of our dataset to train our model and 20% of our dataset to test our prediction to see the model’s performance.
heart <- heart %>%
mutate(HeartDisease = as.factor(HeartDisease))
RNGkind(sample.kind = "Rounding")
set.seed(417)
index <- sample(nrow(heart), size = nrow(heart)*0.80)
heart_log_train <- heart[index,] #take 80%
heart_log_test <- heart[-index,] #take 20%
proptrain <- paste(round(prop.table((table(heart_log_train$HeartDisease)))*100,2),"%")
proptraindf <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptrain[1], proptrain[2]))
rmarkdown::paged_table(proptraindf)
As shown above, the proportion of the class is still balanced.
After we split our data, now we will train our models using the heart_train data. For the independent variables (X), we are going to use all variables first.
model_log <- glm(formula = HeartDisease ~ .,
family = "binomial",
data = heart_log_train)
summary(model_log)
##
## Call:
## glm(formula = HeartDisease ~ ., family = "binomial", data = heart_log_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7439 -0.3699 0.1664 0.4168 2.5431
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.254616 1.590161 -0.789 0.430120
## Age 0.020082 0.014630 1.373 0.169848
## SexMale 1.292412 0.314651 4.107 0.0000400047 ***
## ChestPainTypeATA -1.704532 0.369617 -4.612 0.0000039956 ***
## ChestPainTypeNAP -1.720252 0.310134 -5.547 0.0000000291 ***
## ChestPainTypeTA -1.549147 0.472862 -3.276 0.001052 **
## RestingBP 0.006125 0.006701 0.914 0.360709
## Cholesterol -0.004320 0.001242 -3.479 0.000504 ***
## FastingBS> 120 mg/dl 1.134681 0.312789 3.628 0.000286 ***
## RestingECGNormal -0.451472 0.317714 -1.421 0.155316
## RestingECGST -0.315712 0.405508 -0.779 0.436239
## MaxHR -0.004626 0.005727 -0.808 0.419243
## ExerciseAnginaYes 1.043952 0.284190 3.673 0.000239 ***
## Oldpeak 0.400279 0.135098 2.963 0.003048 **
## ST_SlopeFlat 1.500935 0.508268 2.953 0.003147 **
## ST_SlopeUp -0.911941 0.528550 -1.725 0.084462 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1004.9 on 733 degrees of freedom
## Residual deviance: 457.0 on 718 degrees of freedom
## AIC: 489
##
## Number of Fisher Scoring iterations: 6
After our initial modelling, there are still many predictors who are not statistically significant to our target variable. Because of that, we will do Fitting using stepWise method.
model_logstep <- step(object = model_log,
direction = "backward",
trace = 0)
summary(model_logstep)
##
## Call:
## glm(formula = HeartDisease ~ Age + Sex + ChestPainType + Cholesterol +
## FastingBS + ExerciseAngina + Oldpeak + ST_Slope, family = "binomial",
## data = heart_log_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6966 -0.3760 0.1693 0.4308 2.4996
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.016719 0.976959 -2.064 0.038991 *
## Age 0.030469 0.013332 2.285 0.022293 *
## SexMale 1.301603 0.312308 4.168 0.0000307708 ***
## ChestPainTypeATA -1.732007 0.366224 -4.729 0.0000022523 ***
## ChestPainTypeNAP -1.741057 0.305374 -5.701 0.0000000119 ***
## ChestPainTypeTA -1.507028 0.462947 -3.255 0.001133 **
## Cholesterol -0.004009 0.001171 -3.424 0.000617 ***
## FastingBS> 120 mg/dl 1.143914 0.310613 3.683 0.000231 ***
## ExerciseAnginaYes 1.097494 0.274569 3.997 0.0000641093 ***
## Oldpeak 0.394459 0.131695 2.995 0.002742 **
## ST_SlopeFlat 1.470728 0.509855 2.885 0.003919 **
## ST_SlopeUp -0.962726 0.525608 -1.832 0.067005 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1004.95 on 733 degrees of freedom
## Residual deviance: 460.39 on 722 degrees of freedom
## AIC: 484.39
##
## Number of Fisher Scoring iterations: 6
Interpretation of the result above:
Variables that increase the
probability of Heart Attack are: Age,
Male Gender,
> 120 mg/dl Fasting Blood Sugar,
Occuring Exercise Angina, Oldpeak,
Flat ST Slope.
Variables that reduce the
probability of Heart Attack are: ChestPainType ATA,
ChestPainType NAP, ChestPainType TA,
Cholesterol, Up ST Slope.
Note we already got rid the unsignificant variables and also the AIC scores has improved from 489 to 484.39.
With our model_step we will predict our heart_test dataset.
heart_log_test$pred_Risk <- predict(object = model_logstep,
newdata = heart_log_test,
type = "response")
heart_log_train$pred_Risk <- predict(object = model_logstep,
newdata = heart_log_train,
type = "response")
heart_log_test %>%
e_charts() %>%
e_density(pred_Risk,
smooth = T,
itemStyle = list(
opacity = 0.5,
color = linear_gradient),
areaStyle = list(
opcaity = 1,
color = linear_gradient),
legend = F) %>%
e_theme_custom("rangga2.json") %>%
e_mark_line(data = list(xAxis=0.5 ),
title = "Threshold 0.5",
lineStyle = list(
color = linear_gradient2
)) %>%
e_x_axis(name = "Pred Heart",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
)) %>%
e_y_axis(name = "Density",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
)) %>%
e_title(text = "Distribution of Probability Prediction Data",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%")
The interpretaion for the graph above is the predictions are augmented towards 1 which means more predictions are in favour of positive Heart Disease Condition or Yes.
heart_log_test$pred_Label <- ifelse(test = heart_log_test$pred_Risk > 0.5,
yes = "1",
no = "0")
heart_log_train$pred_Label <- ifelse(test = heart_log_train$pred_Risk > 0.5,
yes = "1",
no = "0")
heart_log_test <- heart_log_test %>%
mutate(pred_Label = as.factor(pred_Label))
rmarkdown::paged_table(heart_log_test[1:10, c("HeartDisease", "pred_Label", "pred_Risk")])
The k-nearest neighbors is non-parametric, supervised learning classifier, which uses proximity to make classifications about the grouping of an individual data point. K-NN working off the assumption that similar points can be found near one another.
This is the equation of K-NN model:
\[ d(x,y) = \sqrt {\sum(x_i - y_i)^2} \]
We will classify using the optimum number of K, which is the square root of the number of train dataset.
K-NN predictor variables have to be in numeric data types.
heart_knn <- heart %>%
select(HeartDisease, Age, RestingBP, Cholesterol, MaxHR)
RNGkind(sample.kind = "Rounding")
set.seed(123)
# index sampling
index1 <- sample(nrow(heart_knn), size = 0.8*nrow(heart_knn))
# splitting
heartknn_train <- heart_knn[index1, ]
heartknn_test <- heart_knn[-index1, ]
proptraink <- paste(round(prop.table((table(heartknn_train$HeartDisease)))*100,2),"%")
proptrainkdf <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptraink[1], proptraink[2]))
rmarkdown::paged_table(proptrainkdf)
As shown above, the proportion of the class is still balanced.
For K-NN we have to separate the one dependent variable from one or more independent variables.
heartknn_train_x <- select(heartknn_train, -HeartDisease)
heartknn_test_x <- select(heartknn_test, -HeartDisease)
heartknn_train_y <- heartknn_train %>% select(HeartDisease)
heartknn_test_y <- heartknn_test %>% select(HeartDisease)
Our independent variables/predictors will be scaled using Z-score standarization since all of our data range should be in the same range. Here is the equation:
heart_train_x_scale <- scale(x = heartknn_train_x)
heart_test_x_scale <- scale(x = heartknn_test_x,
center = attr(heart_train_x_scale, "scaled:center"),
scale = attr(heart_train_x_scale, "scaled:scale"))
In K-NN we don’t have to do any Modeling since we can just go straight Predicting. But first, we have to find the optimum K of our train data.
sqrt(nrow(heartknn_train))
## [1] 27.09243
As shown above, our optimum K is 27.
heartknn_pred <- knn(train = heart_train_x_scale,
test = heart_test_x_scale,
cl = heartknn_train_y$HeartDisease,
k = 27,
prob = T)
heartknn_pred_t <- knn(train = heart_train_x_scale,
test = heart_train_x_scale,
cl = heartknn_train_y$HeartDisease,
k = 27,
prob = T)
Decision Tree belongs to the family of Supervised Learning algorithms. Unlike other supervised learning algorithms, the decision tree algorithm can be used for solving regression and classification problems too. The goal of using a Decision Tree is to create a training model that can use to predict the class or value of the target variable by learning simple decision rules inferred from prior data(training data).
heart_tree <- heart %>%
mutate_if(is.character, as.factor)
RNGkind(sample.kind = "Rounding")
set.seed(417)
index <- sample(nrow(heart), size = nrow(heart)*0.80)
heart_train_tree <- heart_tree[index,] #take 80%
heart_test_tree <- heart_tree[-index,] #take 20%
proptrain <- paste(round(prop.table((table(heart_train_tree$HeartDisease)))*100,2),"%")
proptraindf <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptrain[1], proptrain[2]))
rmarkdown::paged_table(proptraindf)
As shown above, the proportion of the class is still balanced.
To understand what Decision Trees looks like, we
will create tree plot for our model using
fancyRPartPlot() from rattle and
rpart.plot packages. As for initial predictor
variables, we will use all variables first.
model_dtree <- rpart(formula = HeartDisease~., data = heart_train_tree, method = "class")
Our decision tree consists of several parts. The box on the
top/Up ST_Slope is the root node. The root
will split and make branches by certain rules. Each branches ended with
a node. Then the node split again into other
nodes that called as internal node.
Nodes that do not split anymore appear at the very bottom of our plot
which is called terminal node or leaf
node. All in all, in summary, each node shows:
Predicted class (Yes or No)
Predicted class probabilities.
Percentage of observations in the node.
Root, internal, and terminal nodes along with the rules that partition with each observation.
Now we already have our model, we can continue to predict with our test dataset.
dtree_pred <- predict(model_dtree, heart_test_tree, type = "class")
dtree_prob <- predict(model_dtree, heart_test_tree, type = "prob")
dtree_pred_t <- predict(model_dtree, heart_train_tree, type = "class")
dtree_table <- select(heart_test_tree, HeartDisease) %>%
bind_cols(heart_pred = dtree_pred) %>%
bind_cols(heart_eprob = round(dtree_prob[,1],4)) %>%
bind_cols(heart_pprob = round(dtree_prob[,2],4))
dtree_table <- dtree_table %>%
mutate(HeartDisease = as.factor(HeartDisease))
rmarkdown::paged_table(head(dtree_table, 10))
Naive Bayes algorithm is a supervised learning algorithm which is based on Bayes theorem and used for solving classification problem. It is one of the simple and most effective Classification algorithm which helps in building the fast machine learning models that can make quick prediction.
To understand Naive Bayes, here is the equation:
heart_nb <- heart %>%
mutate(HeartDisease = as.factor(HeartDisease))
RNGkind(sample.kind = "Rounding")
set.seed(417)
index <- sample(nrow(heart_nb), size = nrow(heart_nb)*0.80)
heart_nb_train <- heart_nb[index,] #take 80%
heart_nb_test <- heart_nb[-index,] #take 20%
head(heart_nb)
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1 40 Male ATA 140 289 < 120 mg/dl Normal 172
## 2 49 Female NAP 160 180 < 120 mg/dl Normal 156
## 3 37 Male ATA 130 283 < 120 mg/dl ST 98
## 4 48 Female ASY 138 214 < 120 mg/dl Normal 108
## 5 54 Male NAP 150 195 < 120 mg/dl Normal 122
## 6 39 Male NAP 120 339 < 120 mg/dl Normal 170
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1 No 0.0 Up 0
## 2 No 1.0 Flat 1
## 3 No 0.0 Up 0
## 4 Yes 1.5 Flat 1
## 5 No 0.0 Up 0
## 6 No 0.0 Up 0
proptrain <- paste(round(prop.table((table(heart_nb_train$HeartDisease)))*100,2),"%")
proptraindf <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptrain[1], proptrain[2]))
rmarkdown::paged_table(proptraindf)
As shown above, the proportion of the class is still balanced.
library(e1071)
model_nb <- naiveBayes(formula = HeartDisease~. , data = heart_nb_train, laplace = 1)
model_nb
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.4346049 0.5653951
##
## Conditional probabilities:
## Age
## Y [,1] [,2]
## 0 50.26959 9.389680
## 1 56.04578 9.055402
##
## Sex
## Y Female Male
## 0 0.3667712 0.6394984
## 1 0.1036145 0.9012048
##
## ChestPainType
## Y ASY ATA NAP TA
## 0 0.25077399 0.35603715 0.31888545 0.07430341
## 1 0.77565632 0.04534606 0.13603819 0.04295943
##
## RestingBP
## Y [,1] [,2]
## 0 130.1755 16.17807
## 1 134.5205 20.36073
##
## Cholesterol
## Y [,1] [,2]
## 0 227.3386 72.73644
## 1 177.5060 126.58521
##
## FastingBS
## Y < 120 mg/dl > 120 mg/dl
## 0 0.8996865 0.1065831
## 1 0.6795181 0.3253012
##
## RestingECG
## Y LVH Normal ST
## 0 0.1863354 0.6521739 0.1614907
## 1 0.2153110 0.5502392 0.2344498
##
## MaxHR
## Y [,1] [,2]
## 0 148.6646 22.91090
## 1 127.4361 23.34358
##
## ExerciseAngina
## Y No Yes
## 0 0.8840125 0.1222571
## 1 0.3686747 0.6361446
##
## Oldpeak
## Y [,1] [,2]
## 0 0.3971787 0.7017217
## 1 1.3159036 1.1603921
##
## ST_Slope
## Y Down Flat Up
## 0 0.0310559 0.1863354 0.7826087
## 1 0.1004785 0.7440191 0.1555024
Interpretation of the result above:
*Male Gender has higher probability for getting a
Heart Disease Condition (90%) and 63% more for not getting a
Heart Disease Condition rather than
Female.
ASY Chest Pain has the highest probability for
getting a Heart Disease Condition (77%) while ATA Chest Pain has
the highest probability (35%) for
not getting a Heart Disease Condition
rather than its counterparts.
Patient with Fasting Blood Sugar < 120mg has
higher probability for getting a Heart Disease
Condition (67%) and 89% more for not getting a
Heart Disease Condition rather than patient with
Fasting Blood Sugar > 120mg.
Patient with Normal Resting ECG has higher
probability for getting a Heart Disease Condition
(55%) and 65% more for not getting a
Heart Disease Condition rather than its
counterparts.
Patient with Exercise Angina has higher probability
for getting a Heart Disease Condition (63%) while Patient without
Exercise Angina is 88%
more for not getting a Heart Disease
Condition.
Flat ST Slope has the highest probability for
getting a Heart Disease Condition (74%) while UP ST Slope has the
highest probability (78%) for
not getting a Heart Disease Condition
rather than its counterparts.
naive_pred <- predict(model_nb, heart_nb_test, type = "class")
naive_prob <- predict(model_nb, heart_nb_test, type = "raw")
naive_pred_t <- predict(model_nb, heart_nb_train, type = "class")
naive_table <- select(heart_nb_test, HeartDisease) %>%
bind_cols(heart_pred = naive_pred) %>%
bind_cols(heart_eprob = round(naive_prob[,1],4)) %>%
bind_cols(heart_pprob = round(naive_prob[,2],4))
naive_table <- naive_table %>%
mutate(HeartDisease = as.factor(HeartDisease))
rmarkdown::paged_table(head(naive_table, 10))
For Evaluation, we are going to use confusionMatrix and ROC/AUC function that can generate:
con_mat_log <- confusionMatrix(data = heart_log_test$pred_Label,
reference = heart_log_test$HeartDisease,
positive = "1")
con_mat_log_t <- confusionMatrix(data = as.factor(heart_log_train$pred_Label),
reference = heart_log_train$HeartDisease,
positive = "1")
performance <- cbind.data.frame(Accuracy = c(con_mat_log_t$overall[[1]], con_mat_log$overall[[1]]),
Sensitivity = c(con_mat_log_t$byClass[[1]], con_mat_log$byClass[[1]]),
Specificity = c(con_mat_log_t$byClass[[2]], con_mat_log$byClass[[2]]),
Precision = c(con_mat_log_t$byClass[[3]], con_mat_log$byClass[[3]]))
rownames(performance) <- c("On Training Data", "On Unseen Data")
rmarkdown::paged_table(performance)
There are no big differences between our Accuracy, Sensitivity, Specificity, and Precision score on the Unseen and Training Data.
This means that there are no indication of Overfitting and Underfitting for our Logistic Regression model.
Actual <- factor(c("No", "No", "Yes", "Yes"))
Predicted <- factor(c("No", "Yes", "No", "Yes"))
y <- c(con_mat_log$table[2], con_mat_log$table[1], con_mat_log$table[4], con_mat_log$table[3])
df1_log <- data.frame(Actual, Predicted, y)
roc <- roc(predictor = heart_log_test$pred_Risk,
response = heart_log_test$HeartDisease,
levels = c("0", "1"),
percent = TRUE)
df2_log <- data.frame(Specificity=roc$specificities, Sensitivity=roc$sensitivities)
plot_log1 <- ggplot(data = df1_log, mapping = aes(x = Actual, y = Predicted, col = y)) +
geom_tile(aes(fill = y)) +
geom_text(aes(label = sprintf("%1.0f", y)), color = "White", size = 5, family = "Times New Roman") +
scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
scale_color_gradient(low = "#E08F83", high = "#B8332E") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
panel.background = element_rect(fill = "#2C343C"),
panel.grid = element_line(colour = "#2C343C"),
panel.grid.major.x = element_line(colour = "#2C343C"),
panel.grid.minor.x = element_line(colour = "#2C343C"),
axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
plot_log2 <- ggplot(data = df2_log, aes(x = Specificity, y = Sensitivity))+
with_outer_glow(geom_path(colour = '#FFE5DB', size = 1), colour = "#E08F83", sigma = 10, expand = 1)+
scale_x_reverse() +
geom_abline(intercept = 100, slope = 1, color="#B8332E", linetype = "longdash") +
annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc$auc,1), '%'), size = 4, color = "#DFC18F", family = "Times New Roman")+
ylab('Sensitivity (%)')+
xlab('Specificity (%)')+
scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
scale_color_gradient(low = "#E08F83", high = "#B8332E") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
panel.background = element_rect(fill = "#2C343C"),
panel.grid = element_line(colour = "#2C343C"),
panel.grid.major.x = element_line(colour = "#2C343C"),
panel.grid.minor.x = element_line(colour = "#2C343C"),
axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
ggarrange(plot_log1, plot_log2, nrow = 2, ncol = 1)
Our logistic regression model has performance accuracy of 83.15% on unseen data. Meaning that 83.15% of our data is correctly classified.
The value of sensitivity is 87.09% which indicates that most positive outcomes are correctly classified.
The value of specificity is 79.12% which indicates that most negative outcomes are correctly classified.
The value of precision is 81% which indicates that 81% of our positive prediction is correct.
Our ROC Curve also shows a very good separation with AUC score of 91.5%. Meaning that our model is proven to be near excellent at separating our target classes.
con_mat_knn <- confusionMatrix(data = heartknn_pred,
reference = heartknn_test_y$HeartDisease,
positive = "1")
con_mat_knn_t <- confusionMatrix(data = heartknn_pred_t,
reference = heartknn_train_y$HeartDisease,
positive = "1")
performanceknn <- cbind.data.frame(Accuracy = c(con_mat_knn_t$overall[[1]], con_mat_knn$overall[[1]]),
Sensitivity = c(con_mat_knn_t$byClass[[1]], con_mat_knn$byClass[[1]]),
Specificity = c(con_mat_knn_t$byClass[[2]], con_mat_knn$byClass[[2]]),
Precision = c(con_mat_knn_t$byClass[[3]], con_mat_knn$byClass[[3]]))
rownames(performanceknn) <- c("On Training Data", "On Unseen Data")
rmarkdown::paged_table(performanceknn)
There are no big differences between our Accuracy, Sensitivity, Specificity, and Precision score on the Unseen and Training Data.
This means that there are no indication of Overfitting and Underfitting for our K-NN model.
Actual1 <- factor(c("No", "No", "Yes", "Yes"))
Predicted1 <- factor(c("No", "Yes", "No", "Yes"))
y1 <- c(con_mat_knn$table[2], con_mat_knn$table[1], con_mat_knn$table[4], con_mat_knn$table[3])
df1_knn <- data.frame(Actual1, Predicted1, y1)
roc1 <- roc(predictor = attributes(heartknn_pred)$prob,
response = heartknn_test_y$HeartDisease,
levels = c("0", "1"),
percent = TRUE)
df2_knn <- data.frame(Specificity=roc1$specificities, Sensitivity=roc1$sensitivities)
plot_knn1 <- ggplot(data = df1_knn, mapping = aes(x = Actual1, y = Predicted1, col = y1)) +
geom_tile(aes(fill = y1)) +
geom_text(aes(label = sprintf("%1.0f", y1)), color = "White", size = 5, family = "Times New Roman") +
scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
scale_color_gradient(low = "#E08F83", high = "#B8332E") +
labs(x = "Actual",
y = "Predicted") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
panel.background = element_rect(fill = "#2C343C"),
panel.grid = element_line(colour = "#2C343C"),
panel.grid.major.x = element_line(colour = "#2C343C"),
panel.grid.minor.x = element_line(colour = "#2C343C"),
axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
plot_knn2 <- ggplot(data = df2_knn, aes(x = Specificity, y = Sensitivity))+
with_outer_glow(geom_path(colour = '#FFE5DB', size = 1), colour = "#E08F83", sigma = 10, expand = 1)+
scale_x_reverse() +
geom_abline(intercept = 100, slope = 1, color="#B8332E", linetype = "longdash") +
annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc1$auc,1), '%'), size = 4, color = "#DFC18F", family = "Times New Roman")+
ylab('Sensitivity (%)')+
xlab('Specificity (%)')+
scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
scale_color_gradient(low = "#E08F83", high = "#B8332E") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
panel.background = element_rect(fill = "#2C343C"),
panel.grid = element_line(colour = "#2C343C"),
panel.grid.major.x = element_line(colour = "#2C343C"),
panel.grid.minor.x = element_line(colour = "#2C343C"),
axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
ggarrange(plot_knn1, plot_knn2, nrow = 2, ncol = 1)
Our K-NN model has performance accuracy of 75% on unseen data. Meaning that 75% of our data is correctly classified.
The value of sensitivity is 82.85% which indicates that most positive outcomes are correctly classified.
The value of specificity is 64.55% which indicates that only moderate negative outcomes are correctly classified.
The value of precision is 75.65% which indicates that 75.65% of our positive prediction is correct.
Our ROC Curve also shows a bad performance of separation with AUC score of 55.3%. Meaning that if we want to use this model, we have to tune the model until we get a better score results than the other model.
con_mat_dt <- confusionMatrix(data = dtree_table$heart_pred,
reference = dtree_table$HeartDisease,
positive = "1")
con_mat_dt_t <- confusionMatrix(data = dtree_pred_t,
reference = as.factor(heart_train_tree$HeartDisease),
positive = "1")
performance_dt <- cbind.data.frame(Accuracy = c(con_mat_dt_t$overall[[1]], con_mat_dt$overall[[1]]),
Sensitivity = c(con_mat_dt_t$byClass[[1]], con_mat_dt$byClass[[1]]),
Specificity = c(con_mat_dt_t$byClass[[2]], con_mat_dt$byClass[[2]]),
Precision = c(con_mat_dt_t$byClass[[3]], con_mat_dt$byClass[[3]]))
rownames(performance_dt) <- c("On Training Data", "On Unseen Data")
rmarkdown::paged_table(performance_dt)
There are no big differences between our Accuracy, Sensitivity, Specificity, and Precision score on the Unseen and Training Data.
This means that there are no indication of Overfitting and Underfitting for our Decision Tree model.
Actual2 <- factor(c("No", "No", "Yes", "Yes"))
Predicted2 <- factor(c("No", "Yes", "No", "Yes"))
y2 <- c(con_mat_dt$table[2], con_mat_dt$table[1], con_mat_dt$table[4], con_mat_dt$table[3])
df1_dt <- data.frame(Actual2, Predicted2, y2)
roc2 <- roc(predictor = dtree_table$heart_eprob,
response = dtree_table$HeartDisease,
levels = c("0", "1"),
percent = TRUE)
df2_dt <- data.frame(Specificity=roc2$specificities, Sensitivity=roc2$sensitivities)
plot_dt <- ggplot(data = df1_dt, mapping = aes(x = Actual2, y = Predicted2, col = y2)) +
geom_tile(aes(fill = y2)) +
geom_text(aes(label = sprintf("%1.0f", y2)), color = "White", size = 5, family = "Times New Roman") +
scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
scale_color_gradient(low = "#E08F83", high = "#B8332E") +
labs(x = "Actual",
y = "Predicted") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
panel.background = element_rect(fill = "#2C343C"),
panel.grid = element_line(colour = "#2C343C"),
panel.grid.major.x = element_line(colour = "#2C343C"),
panel.grid.minor.x = element_line(colour = "#2C343C"),
axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
plot_dt2 <- ggplot(data = df2_dt, aes(x = Specificity, y = Sensitivity))+
with_outer_glow(geom_path(colour = '#FFE5DB', size = 1), colour = "#E08F83", sigma = 10, expand = 1)+
scale_x_reverse() +
geom_abline(intercept = 100, slope = 1, color="#B8332E", linetype = "longdash") +
annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc2$auc,1), '%'), size = 4, color = "#DFC18F", family = "Times New Roman")+
ylab('Sensitivity (%)')+
xlab('Specificity (%)')+
scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
scale_color_gradient(low = "#E08F83", high = "#B8332E") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
panel.background = element_rect(fill = "#2C343C"),
panel.grid = element_line(colour = "#2C343C"),
panel.grid.major.x = element_line(colour = "#2C343C"),
panel.grid.minor.x = element_line(colour = "#2C343C"),
axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
ggarrange(plot_dt, plot_dt2, nrow = 2, ncol = 1)
Our Decision Tree model has performance accuracy of 82.06% on unseen data. Meaning that 82.06% of our data is correctly classified.
The value of sensitivity is 89.24% which indicates that most positive outcomes are correctly classified.
The value of specificity is 74.72% which indicates that moderate number of negative outcomes are correctly classified.
The value of precision is 78.3% which indicates that 77.9% of our positive prediction is correct.
Our ROC Curve also shows a very good separation with AUC score of 87.3%. Meaning that our model is already good at separating our target classes.
con_mat_nb <- confusionMatrix(data = naive_table$heart_pred,
reference = naive_table$HeartDisease,
positive = "1")
con_mat_nb_t <- confusionMatrix(data = naive_pred_t,
reference = as.factor(heart_nb_train$HeartDisease),
positive = "1")
performance_nb <- cbind.data.frame(Accuracy = c(con_mat_nb_t$overall[[1]], con_mat_nb$overall[[1]]),
Sensitivity = c(con_mat_nb_t$byClass[[1]], con_mat_nb$byClass[[1]]),
Specificity = c(con_mat_nb_t$byClass[[2]], con_mat_nb$byClass[[2]]),
Precision = c(con_mat_nb_t$byClass[[3]], con_mat_nb$byClass[[3]]))
rownames(performance_nb) <- c("On Training Data", "On Unseen Data")
rmarkdown::paged_table(performance_nb)
There are no big differences between our Accuracy, Sensitivity, Specificity, and Precision score on the Unseen and Training Data.
This means that there are no indication of Overfitting and Underfitting for our Naive Bayes model.
Actual3 <- factor(c("No", "No", "Yes", "Yes"))
Predicted3 <- factor(c("No", "Yes", "No", "Yes"))
y3 <- c(con_mat_nb$table[2], con_mat_nb$table[1], con_mat_nb$table[4], con_mat_nb$table[3])
df1_nb <- data.frame(Actual3, Predicted3, y3)
roc3 <- roc(predictor = naive_table$heart_eprob,
response = naive_table$HeartDisease,
levels = c("0","1"),
percent = TRUE)
df2_nb <- data.frame(Specificity=roc3$specificities, Sensitivity=roc3$sensitivities)
plot_nb <- ggplot(data = df1_nb, mapping = aes(x = Actual3, y = Predicted3, col = y3)) +
geom_tile(aes(fill = y3)) +
geom_text(aes(label = sprintf("%1.0f", y3)), color = "White", size = 5, family = "Times New Roman") +
scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
scale_color_gradient(low = "#E08F83", high = "#B8332E") +
labs(x = "Actual",
y = "Predicted") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
panel.background = element_rect(fill = "#2C343C"),
panel.grid = element_line(colour = "#2C343C"),
panel.grid.major.x = element_line(colour = "#2C343C"),
panel.grid.minor.x = element_line(colour = "#2C343C"),
axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
plot_nb2 <- ggplot(data = df2_nb, aes(x = Specificity, y = Sensitivity))+
with_outer_glow(geom_path(colour = '#FFE5DB', size = 1), colour = "#E08F83", sigma = 10, expand = 1)+
scale_x_reverse() +
geom_abline(intercept = 100, slope = 1, color="#B8332E", linetype = "longdash") +
annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc3$auc,1), '%'), size = 4, color = "#DFC18F", family = "Times New Roman")+
ylab('Sensitivity (%)')+
xlab('Specificity (%)')+
scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
scale_color_gradient(low = "#E08F83", high = "#B8332E") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
panel.background = element_rect(fill = "#2C343C"),
panel.grid = element_line(colour = "#2C343C"),
panel.grid.major.x = element_line(colour = "#2C343C"),
panel.grid.minor.x = element_line(colour = "#2C343C"),
axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
ggarrange(plot_nb, plot_nb2, nrow = 2, ncol = 1)
Our Naive Bayes model has performance accuracy of 80.9% on unseen data. Meaning that 80.90% of our data is correctly classified.
The value of sensitivity is 87.09% which indicates that most positive outcomes are correctly classified.
The value of specificity is 74.72% which indicates that moderate number of negative outcomes are correctly classified.
The value of precision is 77.9% which indicates that 77.9% of our positive prediction is correct.
Our ROC Curve also shows a very good separation with AUC score of 88.9%. Meaning that our model is already good at separating our target classes.
Since we’re talking about Health diagnosis prediction, we want to minimize our error on positive outcomes prediction. Which in our case is about patient with Heart Disease Condition. Hereby if we look at our model performance evaluation, we want to use a model with a good score of sensitivity.
In that manner, from our 2 models that we’ve been created, the one who have higher sensitivity score is Decision Tree model with a value of 89.24%. So we can conclude that, in this case, Decision Tree proven to be better for classifying Heart Disease Condition rather than K-NN, Logistic Regression, & Naive Bayes model.
And that’s all my attempt to classify Heart Disease with Classification Models. As time goes by, I will add more models who are not covered yet in this project. However, I sincerely hope that this study could help people who will do some similar cases in the future.
A work by Rangga Gemilang
gemilang.rangga94@gmail.com
R Language