In conducting anlysis & making a machine learning model to predict the patient whether patient is heart disease or not, there are several step that this research pursue as well as data preprocessing, data wrangling, exploratory data analysis, build models, assumption test, model comparison also evaluation and conclusion. In the first round analysis, The model of model_logistic_all is suitable to make a prediciton due to lowest RMSE also lowest AIC than model_logistic_fs AIC. But, using a model of model_logistic_step is appropiate to see the significance of the predictor. Next, the model of logistic regression (model_logistic_all) is better than model KNN (knn_pred), due to higher sensitivity score. Finally, the next research is recommended to use a higher sensitivity score also accuracy score (above 90%) due to the life of someone.
The top three topics of global causes of death which has an impact / a relation on the number of lives lost were cardiovascular (ischaemic heart disease, stroke), respiratory (chronic obstructive pulmonary disease, lower respiratory infections), and neonatal conditions – which include birth asphyxia and birth trauma, neonatal sepsis and illnesses, and preterm birth complications.
This research explores heart disease by analyzing and predicting the dataset. Therefore, people can improve their healthy lifestyle awareness and the ML model can predict which person potentially has heart disease so that the person can prevent the heart disease / take treatment as early as possible.
This research aim is to provide several insights which can benefit to
readers especially for society and hospital. This research will deep
dive into heart disease by analyzing the dataset. Therefore, the
research objectives are:
- Analyzing the medical history background
of those with heart disease and without heart disease.
- Analyzing
the numerical variable relationship.
- Making a model to predict
the patient who has a heart disease.
| No. | Feature | Description | Value |
|---|---|---|---|
| 1. | age | Patient’s age in years | 29-77 |
| 2. | sex | Patient’s gender | (1)Male (0)Female |
| 3. | cp | Chest pain type | (0)Typical angina - TA (1)Atypical angina - ATA (2)Non-anginal pain - NAP (3)Asymptomatic - ASY. |
| 4. | trestbps | Resting blood pressure (in mm Hg) | 94-200 |
| 5. | chol | Cholestoral in mg/dl | 126 – 564 |
| 6. | fbs | Fasting blood sugar > 120 mg/dl | (1)True (0)False |
| 7. | restecg | Resting electrocardiographic results | (0)Normal (1)Resting electrocardiographic results (2)Showing probable or definite left ventricular hypertrophy by Estes’ criteria |
| 8. | thalach | Maximum heart rate achieved | 71-202 |
| 9. | exang | Exercise induced angina | (1)Yes (0)No |
| 10. | oldpeak | ST depression induced by exercise relative to rest | 0-6.2 |
| 11. | slope | The slope of the peak exercise ST segment | (1)Upsloping(2)Flat(3)Downsloping |
| 12. | ca | Number of major vessels (0-3) colored by fluoroscopy | 0,1,2,3 |
| 13. | thal | Thalassemia | (3)Normal (6)Fixed defect (no blood flow in some part of the heart) (7)Reversable defect (a blood flow is observed but it is not normal) |
| 14. | target | Diagnosis of heart disease | (0)Heart disease not present (1)Heart disease present |
# data cleaning
library(readr)
library(tidyverse)
library(dplyr)
#data analysis
library(GGally)
#data visualizationl
library(ggplot2)
library(scales)
library(echarts4r)
# Confussion Matrix
library(caret)
# knn
library(class)
#ROC
library(pROC)
#Glow
library(ggfx)
This data set dates from 1988 and consists of four databases:
Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76
attributes, including the predicted attribute, but all published
experiments refer to using a subset of 14 of them. The “target” field
refers to the presence of heart disease in the patient. It is integer
valued 0 = no disease and 1 = disease. Then loading the data set
(“kaggle_4city.csv”) and assigned to heart_disease object. So, the data
of heart_disease is ready to do data wrangling process.
heart_disease <- read.csv("data_input/kaggle_4city.csv")
heart_disease
This step consists of two steps ranging from the top 6
observations of data set inspection and the bottom 6 observations of
data set inspection by using head also tail function so that the data
set’s background can be recognized a little bit.
# Top 6 data
head(heart_disease)# Bottom 6 data
tail(heart_disease)
Perform data type inspection to ensure the data type of each
column is appropriate by using glimpse() function in dplyr
package. Then, there are several variables whose data type are changed
into factor such as sex,cp,fbs,restecg,exang,slope,ca,thal,target.
# Change Data Type to Factor
# df for model
heart_disease <- heart_disease %>%
mutate_at(vars(sex,cp,fbs,restecg,exang,slope,ca,thal,target),as.factor)
# for visualization & ggcorr
heart <- heart_disease %>%
mutate(target=as.integer(target))
glimpse(heart_disease)#> Rows: 1,025
#> Columns: 14
#> $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
Notes: Using target as an integer for
correlation analysis ggcorr() also visualization purpose.
And, using target as factor for machine learning modelling.
In this step, checking the missing values is a must so that the
missing value treatment can be done and the data is ready to analyze.
anyNA(heart_disease)#> [1] FALSE
colSums(is.na(heart_disease))#> age sex cp trestbps chol fbs restecg thalach
#> 0 0 0 0 0 0 0 0
#> exang oldpeak slope ca thal target
#> 0 0 0 0 0 0
It shows that there are no missing values in heart_disease dataframe.
Checking the data whether there is the same values or duplicates
for each row.
heart_disease %>%
duplicated() %>%
sum()#> [1] 723
# Containing duplicated data
heart_disease[duplicated(heart_disease),] Insight: There is a possibility that each observations, patient’s data, has the same value. Then, the treatment for duplicated data is not make any changes in this dataframe.
Inspecting the 5 number summary + mean in order to get an
insight and data distribution informations.
summary(heart_disease)#> age sex cp trestbps chol fbs restecg
#> Min. :29.00 0:312 0:497 Min. : 94.0 Min. :126 0:872 0:497
#> 1st Qu.:48.00 1:713 1:167 1st Qu.:120.0 1st Qu.:211 1:153 1:513
#> Median :56.00 2:284 Median :130.0 Median :240 2: 15
#> Mean :54.43 3: 77 Mean :131.6 Mean :246
#> 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:275
#> Max. :77.00 Max. :200.0 Max. :564
#> thalach exang oldpeak slope ca thal target
#> Min. : 71.0 0:680 Min. :0.000 0: 74 0:578 0: 7 0:499
#> 1st Qu.:132.0 1:345 1st Qu.:0.000 1:482 1:226 1: 64 1:526
#> Median :152.0 Median :0.800 2:469 2:134 2:544
#> Mean :149.1 Mean :1.072 3: 69 3:410
#> 3rd Qu.:166.0 3rd Qu.:1.800 4: 18
#> Max. :202.0 Max. :6.200
Insight:
Checking the imbalanced data set on target variable.
prop.table(table(heart_disease$target))#>
#> 0 1
#> 0.4868293 0.5131707
Based on statistical best practice, if the proportion of target
is in the range 60 % dan 40% so that it can be concluded that the data
is balance.
Using ggcorr function in GGally package to get the
strength of correlation among numeric variables.
ggcorr(heart,label = T,label_round = 2, nbreaks = 4)
Insight:
- Positive correlation
Variable of thalach has a relationship strength with the variable of target that tends to be positive moderate correlation.
Variable of trestbps, chol, and age have a relationship strength that tends to be positive weak correlation toward oldpeak variable.
Negative correlation
The categoric variables are conducted analysis by using
visualization from ggplot function in ggplot2 package.
- Sex
agg2 <- heart_disease %>%
group_by(target, sex) %>%
summarise(Tot=n())
ggplot(agg2,aes(x=target,y=Tot)) + geom_col(aes(fill=sex),color="red") + facet_wrap(facets = "sex") + labs(title = "Total Patient Based on Sex & Disease", x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -10)+theme_dark()
Insight:
- Most of female patients has a
heart disease with total 226 patients.
- Most of male patients has
no heart disease with total 413 patients.
- The highest amount of
heart disease is male patient with total 300 patients.
agg1 <- heart_disease %>%
group_by(target, cp) %>%
summarise(Tot=n())
ggplot(agg1,aes(x=target,y=Tot)) + geom_col(aes(fill=cp),color="blue") + facet_wrap(facets = "cp") + labs(title = "Total Patient Based on Chest Pain & Disease", x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -12) +theme_dark()
Insight:
- Most of patients with heart
disease have level 1 typical of chest pain (Atypical angina) & level
2 typical of chest pain (Non-anginal pain - NAP). - Most of patients
without heart disease have level 0 of typical of chest pain (Typical
angina). - The typical of chest pain level 4 has a bigger total for
patient with heart disease (51 patients) than others.
agg3 <- heart_disease %>%
group_by(target, fbs) %>%
summarise(Tot=n())
ggplot(agg3,aes(x=target,y=Tot)) + geom_col(aes(fill=fbs),color="blue") + facet_wrap(facets = "fbs") + labs(title = "Total Patient Based on Fasting Blood Sugar", x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -12) +theme_dark()
Insight:
- Most of patients with/without
heart disease have fasting blood sugar < 120 mg/dl. - The amount of
patient with fasting blood sugar > 120 mg/dl and without heart
disease have a higher score than patient with fasting blood sugar >
120 mg/dl and heart disease.
agg4 <- heart_disease %>%
group_by(target, restecg) %>%
summarise(Tot=n())
ggplot(agg4,aes(x=target,y=Tot)) + geom_col(aes(fill=restecg),color="red") + facet_wrap(facets = "restecg") + labs(title = "Total Patient Based on Resting Electrocardiographic", x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -5) +theme_dark()
Insight:
- Most of patients with heart
disease have level 1 of resting electrocardiographic results. - Most of
patients without heart disease have normal result. - Only 3 patients
with heart disease also 12 patients without heart disease have probable
or definite left ventricular hypertrophy by Estes’ criteria.
agg5 <- heart_disease %>%
group_by(target, exang) %>%
summarise(Tot=n())
ggplot(agg5,aes(x=target,y=Tot)) + geom_col(aes(fill=exang),color="blue") + facet_wrap(facets = "exang") + labs(title = "Total Patient Based on Exercise Induced Angina", x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -12) +theme_dark()
Insight:
- Most of patients with heart
disease don’t have exercise-induced angina.
- Most of patients
without heart disease have exercise-induced angina and a little bit
lower do not have exercise-induced angina.
agg6 <- heart_disease %>%
group_by(target, slope) %>%
summarise(Tot=n())
ggplot(agg6,aes(x=target,y=Tot)) + geom_col(aes(fill=slope),color="red") + facet_wrap(facets = "slope") + labs(title = "Total Patient Based on Slope", x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -12) +theme_dark()
Insight:
- Most of patients with heart
disease have level 2 of slope and followed by level 1 of slope then
level 0 of slope.
- Most of patients without heart disease have
level 1 of slope and followed by level 2 of slope then level 0 of slope.
agg7 <- heart_disease %>%
group_by(target, ca) %>%
summarise(Tot=n())
ggplot(agg7,aes(x=target,y=Tot)) + geom_col(aes(fill=ca),color="blue") + facet_wrap(facets = "ca") + labs(title = "Total Patient", subtitle = "Based on Number of Major Vessels (0-3) Colored by Fluoroscopy",x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -13) +theme_dark()
Insight:
- Most of patients with heart
disease have level 0 of ca. - Most of patients without heart disease
have level 0 & level 1 of ca.
agg8 <- heart_disease %>%
group_by(target, thal) %>%
summarise(Tot=n())
ggplot(agg8,aes(x=target,y=Tot)) + geom_col(aes(fill=thal),color="red") + facet_wrap(facets = "thal") + labs(title = "Total Patient Based on Thalassemia",x = "Target - (0)No Heart Disease (1)Heart Disease", y="Total") + geom_text(aes(label=Tot),color="black",nudge_y = -16) + theme_dark()
Insight:
- Most of patients with heart
disease have level 2 of thalassemia. - Most of patients without heart
disease have level 3 of thalassemia.
The numeric variables are conducted analysis by using
visualization from ggplot function in ggplot2 package.
- Heart Disease by Age
heart_age <- heart_disease %>%
mutate(HeartDisease = case_when(
target == 0 ~ "No",
target == 1 ~ "Yes")) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
group_by(HeartDisease, age) %>%
summarize(count = n(), .groups = 'drop') %>%
pivot_wider(names_from = HeartDisease, values_from = count) %>%
mutate_if(is.integer, ~replace(., is.na(.), 0)) %>%
arrange(Yes)
linear_gradient <- htmlwidgets::JS(
"new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#B83F3E' },
{ offset: 1, color: '#FB6A63' }
])"
)
linear_gradient2 <- htmlwidgets::JS(
"new echarts.graphic.LinearGradient(
0, 0, 0, 1,
[
{ offset: 0, color: '#E08F83' },
{ offset: 1, color: '#FFE5DB' }
])"
)
heart_age %>%
e_charts(age) %>%
e_line(Yes,
smooth = T,
itemStyle = list(
opacity = 0.5,
color = linear_gradient),
areaStyle = list(
opcaity = 1,
color = linear_gradient),
emphasis = list(
focus = "series")) %>%
e_line(No,
smooth = T,
itemStyle = list(
opacity = 0.5,
color = linear_gradient2),
areaStyle = list(
opcaity = 1,
color = linear_gradient2),
emphasis = list(
focus = "series")) %>%
e_x_axis(name = "Age",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
),
min = 27) %>%
e_y_axis(name = "Freq",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
)) %>%
e_title(text = "Heart Disease by Age",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%") %>%
e_tooltip(axis = "trigger",
showDelay = 0,
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#333D47",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = "#DFC18F"
)),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
<br/><strong style='color:black;'>Age:</strong> ${params.value[0]}
<br/><strong style='color:black;'>Count:</strong> ${params.value[1]}`
} ")) %>% e_theme("dark") %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "45%",
bottom = "2%",
textStyle = list(
color = "#DFC18F"
)
)
Insight:
- Patients without Heart Disease
Condition tend to have higher Age.
- Patients with Heart Disease
Condition tend to group up between 29 and 76 years old.
- The
highest count of Heart Disease Condition recorded are within 54 years
old.
max <- list(
name = "Max",
type = "max"
)
min <- list(
name = "Min",
type = "min"
)
avg <- list(
type = "average",
name = "AVG"
)
heart_trestbps <- heart_disease %>%
mutate(HeartDisease = case_when(
target == 0 ~ "No",
target == 1 ~ "Yes")) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
select(trestbps, age, HeartDisease) %>%
group_by(HeartDisease) %>%
arrange(HeartDisease,age)
heart_trestbps %>%
e_charts(age) %>%
e_scatter(trestbps, HeartDisease,
emphasis = list(
focus = "series"
)) %>%
e_x_axis(name = "Age",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
),
min = 27) %>%
e_y_axis(name = "RBP",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
)) %>%
e_mark_area(data = list(
list(xAxis = "min", yAxis = "min"),
list(xAxis = "max", yAxis = "max")
),
itemStyle = list(
color = "transparent",
borderWidth = 1,
borderType = "dashed")) %>%
e_mark_line(data = avg,
lineStyle = list(
type = "solid"
)) %>%
e_title(text = "Heart Disease: Resting Blood Pressure vs. Age",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%") %>%
e_tooltip(axis = "trigger",
showDelay = 0,
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#333D47",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = "#DFC18F"
)),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
<br/><strong style='color:Black;'>Age:</strong> ${params.value[0]}
<br/><strong style='color:Black;'>RestingBP:</strong> ${params.value[1]}mm/hg`
} ")) %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "45%",
bottom = "2%",
textStyle = list(
color = "#DFC18F"
)
) %>% e_theme("dark")
Insight:
- Patients without Heart Disease
Condition tend to have higher RestingBP when getting older.
- Most
data of observations is distributed between 120 -140 mm/Hg for patients
with heart disease.
- Mean of resting blood pressure (trbps) is
129.25 mm/Hg for patients with heart disease.
- Mean of resting
blood pressure (trbps) is 134.11 mm/Hg for patients without heart
disease.
- There is RestingBP with 0mm/hg value which indicates
that this is an outlier.
heart_ch <- heart_disease %>%
mutate(HeartDisease = case_when(
target == 0 ~ "No",
target == 1 ~ "Yes")) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
select(chol, age, HeartDisease) %>%
group_by(HeartDisease)
heart_ch %>%
e_charts(age) %>%
e_scatter(chol, HeartDisease,
emphasis = list(
focus = "series"
)) %>%
e_x_axis(name = "Age",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
),
min = 27) %>%
e_y_axis(name = "Cholesterol (mg/dl)",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
)) %>%
e_mark_area(data = list(
list(xAxis = "min", yAxis = "min"),
list(xAxis = "max", yAxis = "max")
),
itemStyle = list(
color = "transparent",
borderWidth = 1,
borderType = "dashed")) %>%
e_mark_line(data = avg,
lineStyle = list(
type = "solid"
)) %>%
e_title(text = "Heart Disease: Cholesterol vs. Age",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%") %>%
e_tooltip(axis = "trigger",
showDelay = 0,
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#333D47",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = "#DFC18F"
)),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
<br/><strong style='color:Black;'>Age:</strong> ${params.value[0]}
<br/><strong style='color:Black;'>Cholesterol:</strong> ${params.value[1]}mg/dL`
} ")) %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "45%",
bottom = "2%",
textStyle = list(
color = "#DFC18F"
)
) %>% e_theme("dark")
Insight:
- Patients with heart disease
have a potential for having higher cholesterol when getting older.
- The 50% of cholesterol data distribution of patient without heart
disease is high.
- The mean of cholesterol is high for patient
without heart disease.
heart_thalach <- heart_disease %>%
mutate(HeartDisease = case_when(
target == 0 ~ "No",
target == 1 ~ "Yes")) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
select(thalach, age, HeartDisease) %>%
group_by(HeartDisease)
heart_thalach %>%
e_charts(age) %>%
e_scatter(thalach, HeartDisease,
emphasis = list(
focus = "series"
)) %>%
e_x_axis(name = "Age",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
),
min = 27) %>%
e_y_axis(name = "Max Heart Rate",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
)) %>%
e_mark_area(data = list(
list(xAxis = "min", yAxis = "min"),
list(xAxis = "max", yAxis = "max")
),
itemStyle = list(
color = "transparent",
borderWidth = 1,
borderType = "dashed")) %>%
e_mark_line(data = avg,
lineStyle = list(
type = "solid"
)) %>%
e_title(text = "Heart Disease: Max Heart Rate vs. Age",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%") %>%
e_tooltip(axis = "trigger",
showDelay = 0,
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#333D47",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = "#DFC18F"
)),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
<br/><strong style='color:Black;'>Age:</strong> ${params.value[0]}
<br/><strong style='color:Black;'>MaxHR:</strong> ${params.value[1]}`
} ")) %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "45%",
bottom = "2%",
textStyle = list(
color = "#DFC18F"
)
) %>% e_theme("dark")
Insight:
- Patients with heart disease
condition tend to have lower Max Heart Rate than others when getting
older.
heart_oldpeak <- heart_disease %>%
mutate(HeartDisease = case_when(
target == 0 ~ "No",
target == 1 ~ "Yes")) %>%
mutate(HeartDisease = as.factor(HeartDisease)) %>%
select(oldpeak, age, HeartDisease) %>%
group_by(HeartDisease)
heart_oldpeak %>%
e_charts(age) %>%
e_scatter(oldpeak, HeartDisease,
emphasis = list(
focus = "series"
)) %>%
e_x_axis(name = "Age",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
),
min = 27) %>%
e_y_axis(name = "Oldpeak",
nameTextStyle = list(
color = "White",
fontFamily = "Cardo",
fontSize = 14,
fontWeight = "bold"
),
min = -3,
max = 7) %>%
e_mark_area(data = list(
list(xAxis = "min", yAxis = "min"),
list(xAxis = "max", yAxis = "max")
),
itemStyle = list(
color = "transparent",
borderWidth = 1,
borderType = "dashed")) %>%
e_mark_line(data = avg,
lineStyle = list(
type = "solid"
)) %>%
e_title(text = "Heart Disease: Oldpeak vs. Age",
textStyle = list(
color = "#DFC18F",
fontStyle = "italic"
),
right = "10%",
left = "50%") %>%
e_tooltip(axis = "trigger",
showDelay = 0,
axisPointer = list(
type = "cross",
label = list(
backgroundColor = "#333D47",
fontFamily = "Cardo",
color = "#DFC18F"
),
crossStyle = list(
color = "#DFC18F"
)),
formatter = htmlwidgets::JS("
function(params)
{
return `<strong style='color:#DFC18F;'>Heart Disease:</strong> <strong style='color:#D76662;'>${params.seriesName}</strong>
<br/><strong style='color:Black;'>Age:</strong> ${params.value[0]}
<br/><strong style='color:Black;'>Oldpeak:</strong> ${params.value[1]}`
} ")) %>%
e_legend(
orient = "horizontal",
right = "10%",
left = "45%",
bottom = "2%",
textStyle = list(
color = "#DFC18F"
)
) %>% e_theme("dark")
Insight:
- Patients without heart disease
condition tend to have higher Oldpeak rate.
- Patients with heart
disease condition observation data have many outliers with values of 0.
Using logistic regression to describe data and the relationship
between one dependent variable (target) and one or more independent
variables(predictors). The independent variables can be numeric, nominal
categorical, ordinal categorical, or interval type. And the dependent
variable (target) must categorical data type.
This is the equation
of a logistic regression model:
\[
log \frac{p(X)}{1-p(X)} =\beta_0+\beta_1.X
\]
The left-hand side is called the log-odds or logit. On
the right side the b0 is the model intercept and b1 is the coefficient
of feature X.
The purpose of logistic regression is to predict
probability using a linear regression model (which can be used for
classification). The output of logistic regression is log of
odds. Then, the output of logistic regression can be converted to
probability form using inv.logit() in
library(gtools).
In cross validation step, the data is devided into train data
with proportion 80% of original data and test data with proportion 20%
of original data as unseen data.
# Set seed to lock the random
RNGkind(sample.kind = "Rounding")
set.seed(123)
# index sampling
index <- sample(x = nrow(heart_disease), size = nrow(heart_disease)*0.8)
# splitting
heart_disease_train <- heart_disease[index,]
heart_disease_test <- heart_disease[-index,]
Checking the imbalanced data set on target variable.
prop.table(table(heart_disease$target))#>
#> 0 1
#> 0.4868293 0.5131707
After, checking the imbalance of target variable and there is no imbalance over there (refers to statistical best practice the limmitation is 60% & 40%). It can be concluded that there is no needed to do the treatments such as up-sampling, down-sampling, and SMOTE. And the treatment can be conducted after cross validation and only conducted on the train data because the test data is unseen data.
- 1. All predictors. The model of
model_logistic_all involves all preddictors in this model.
model_logistic_all <- glm(formula = target~.,
data = heart_disease_train,
family = "binomial")
summary(model_logistic_all)#>
#> Call:
#> glm(formula = target ~ ., family = "binomial", data = heart_disease_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.9216 -0.2565 0.0834 0.4147 3.2579
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.459568 2.182499 -0.669 0.503648
#> age 0.033384 0.015994 2.087 0.036862 *
#> sex1 -2.027022 0.355146 -5.708 1.15e-08 ***
#> cp1 0.671228 0.352382 1.905 0.056802 .
#> cp2 2.116855 0.329346 6.427 1.30e-10 ***
#> cp3 2.003688 0.437533 4.580 4.66e-06 ***
#> trestbps -0.024883 0.007631 -3.261 0.001111 **
#> chol -0.003915 0.002650 -1.477 0.139644
#> fbs1 0.679642 0.384122 1.769 0.076837 .
#> restecg1 0.426159 0.247949 1.719 0.085663 .
#> restecg2 -0.682235 1.727780 -0.395 0.692945
#> thalach 0.027101 0.007383 3.671 0.000242 ***
#> exang1 -0.877426 0.284421 -3.085 0.002036 **
#> oldpeak -0.471277 0.156134 -3.018 0.002541 **
#> slope1 -0.530315 0.569302 -0.932 0.351585
#> slope2 0.766945 0.602481 1.273 0.203026
#> ca1 -2.492669 0.339801 -7.336 2.21e-13 ***
#> ca2 -3.950260 0.543148 -7.273 3.52e-13 ***
#> ca3 -2.741024 0.631727 -4.339 1.43e-05 ***
#> ca4 0.585698 1.227425 0.477 0.633236
#> thal1 3.279136 1.522859 2.153 0.031297 *
#> thal2 2.380597 1.449903 1.642 0.100611
#> thal3 0.985519 1.452798 0.678 0.497544
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1135.00 on 819 degrees of freedom
#> Residual deviance: 465.66 on 797 degrees of freedom
#> AIC: 511.66
#>
#> Number of Fisher Scoring iterations: 6
library(car)
# vif(model_logistic_all)
Insight:
- The AIC score is 511.66 for
model of model_logistic_all. And AIC represents the total of information
lost.
- The predictor variables that significant to the target
variable / model of model_logistic_all are age, sex, cp, trestbps,
thalach, exang, oldpeak, ca, and thal.
model_logistic_fs <- glm(formula = target~ age + sex + cp + restecg + trestbps + oldpeak + thalach + ca + exang,
data = heart_disease_train,
family = "binomial")
summary(model_logistic_fs)#>
#> Call:
#> glm(formula = target ~ age + sex + cp + restecg + trestbps +
#> oldpeak + thalach + ca + exang, family = "binomial", data = heart_disease_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.5349 -0.3613 0.1084 0.4907 2.7215
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.040850 1.511124 -0.689 0.490954
#> age 0.021410 0.014800 1.447 0.148009
#> sex1 -1.899376 0.277665 -6.841 7.89e-12 ***
#> cp1 1.098357 0.329627 3.332 0.000862 ***
#> cp2 2.114731 0.292065 7.241 4.47e-13 ***
#> cp3 1.708673 0.396645 4.308 1.65e-05 ***
#> restecg1 0.422447 0.227058 1.861 0.062811 .
#> restecg2 -0.382087 1.545124 -0.247 0.804687
#> trestbps -0.018547 0.006935 -2.674 0.007487 **
#> oldpeak -0.662163 0.126862 -5.220 1.79e-07 ***
#> thalach 0.030077 0.006466 4.652 3.29e-06 ***
#> ca1 -2.059008 0.295008 -6.979 2.96e-12 ***
#> ca2 -3.046294 0.442163 -6.890 5.60e-12 ***
#> ca3 -2.287338 0.519786 -4.401 1.08e-05 ***
#> ca4 -0.107496 1.041383 -0.103 0.917785
#> exang1 -1.123481 0.260860 -4.307 1.66e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1135.0 on 819 degrees of freedom
#> Residual deviance: 535.6 on 804 degrees of freedom
#> AIC: 567.6
#>
#> Number of Fisher Scoring iterations: 6
# vif(model_logistic_fs)
# exp(model_logistic_fs$coefficient)
Insight:
- The AIC score is 567.6 for
model of model_logistic_fs And AIC represents the total of information
lost.
- The predictor variables that significant to the target
variable / model of model_logistic_fs are sex, cp, trestbps, oldpeak,
thalach, exang, and ca.
Using step_wise regression method,bacward, to find a suitable
model with lowest AIC.
model_logistic_step <- step(object = model_logistic_all,direction = "backward",trace = F)
summary(model_logistic_step)#>
#> Call:
#> glm(formula = target ~ age + sex + cp + trestbps + chol + fbs +
#> thalach + exang + oldpeak + slope + ca + thal, family = "binomial",
#> data = heart_disease_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.8665 -0.2657 0.0883 0.4322 3.3248
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.837560 2.242458 -0.374 0.708776
#> age 0.030826 0.015822 1.948 0.051376 .
#> sex1 -2.023557 0.353388 -5.726 1.03e-08 ***
#> cp1 0.691878 0.352217 1.964 0.049489 *
#> cp2 2.131196 0.329865 6.461 1.04e-10 ***
#> cp3 1.983810 0.435012 4.560 5.11e-06 ***
#> trestbps -0.025880 0.007517 -3.443 0.000575 ***
#> chol -0.004489 0.002610 -1.720 0.085435 .
#> fbs1 0.680454 0.381298 1.785 0.074331 .
#> thalach 0.026522 0.007392 3.588 0.000333 ***
#> exang1 -0.848557 0.284039 -2.987 0.002813 **
#> oldpeak -0.468524 0.152622 -3.070 0.002142 **
#> slope1 -0.522503 0.552676 -0.945 0.344452
#> slope2 0.812816 0.586968 1.385 0.166123
#> ca1 -2.528997 0.338454 -7.472 7.89e-14 ***
#> ca2 -3.865166 0.532280 -7.262 3.83e-13 ***
#> ca3 -2.869306 0.643210 -4.461 8.16e-06 ***
#> ca4 0.499374 1.166787 0.428 0.668658
#> thal1 3.342386 1.651537 2.024 0.042990 *
#> thal2 2.414516 1.585634 1.523 0.127822
#> thal3 1.067257 1.587569 0.672 0.501419
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1135.0 on 819 degrees of freedom
#> Residual deviance: 468.9 on 799 degrees of freedom
#> AIC: 510.9
#>
#> Number of Fisher Scoring iterations: 6
Insight:
- The AIC score is 510.9 for
model of model_logistic_step And AIC represents the total of information
lost.
- The predictor variables that not significant to the target
variable / model of model_logistic_step are age, chol, fbs, and slope.
First, conducting the model performance comparison to
get best model to predict.
# Model Performance Comparison
library(performance)
comparison <- compare_performance(model_logistic_all,model_logistic_fs, model_logistic_step)
as.data.frame(comparison)
Insight:
Best model of Logistic
Regression, as follows:
- Based on RMSE with small score :
model_logistic_all
- Based on AIC with small score :
model_logistic_step
Second, the prediction is conducted by the model of
model_logistic_all.
# Test Data Prediction
heart_disease_test$PredRisk_HeartDisease <- predict(
object = model_logistic_all,
newdata = heart_disease_test,
type = "response" # for probability results
)
heart_disease_test$Pred_Label_HeartDisease <-
ifelse(heart_disease_test$PredRisk_HeartDisease < 0.5, 0, 1) %>%
as.factor()
Pred_Result <- heart_disease_test %>%
select(target,PredRisk_HeartDisease,Pred_Label_HeartDisease)
colnames(Pred_Result) <- c("Actual","Prob_Pred","Result_Pred")
Pred_Result# Train Data Prediction
hd_train <- heart_disease_train
hd_train$ PredRisk_HeartDisease <- predict(
object = model_logistic_all,
newdata = hd_train,
type = "response" # for probability results
)
hd_train$Pred_Label_HeartDisease <-
ifelse(hd_train$PredRisk_HeartDisease < 0.5, 0, 1) %>%
as.factor()
Pred_Result_tr <- hd_train %>%
select(target,PredRisk_HeartDisease,Pred_Label_HeartDisease)
colnames(Pred_Result_tr) <- c("Actual","Prob_Pred","Result_Pred")
Pred_Result_tr
In this section, it is important to test the multicolinearity
among predictors in order to met the assumption test for logistic
regression. - 1. Multicolinearity.
vif(model_logistic_all)#> GVIF Df GVIF^(1/(2*Df))
#> age 1.503591 1 1.226210
#> sex 1.694283 1 1.301646
#> cp 1.972136 3 1.119840
#> trestbps 1.347567 1 1.160848
#> chol 1.202662 1 1.096660
#> fbs 1.238444 1 1.112854
#> restecg 1.151628 2 1.035924
#> thalach 1.538287 1 1.240277
#> exang 1.183068 1 1.087689
#> oldpeak 1.578693 1 1.256461
#> slope 1.855658 2 1.167144
#> ca 2.193381 4 1.103162
#> thal 1.604339 3 1.081972
ggcorr(heart_disease_train, label = T)
Insight:
- There is no strong correlation (+1
or -1) among variabel predictors, so that the assumption test of
Independence of Observation is met.
The k-nearest neighbors is non-parametric, supervised learning
classifier, which uses proximity to make classifications about the
grouping of an individual data point. K-NN working off the assumption
that similar points can be found near one another. In other words,
k-nearest neighbors classifies new data by comparing the characteristics
of the new data (test data) with existing data (train data). The
proximity of these characteristics is measured by Euclidean
Distance until distance is obtained. Then the
nearest k neighbors will be selected from the new data,
then the class will be determined using majority voting.
This is the equation of K-NN model:
\[
d(x,y) = \sqrt{\sum(x_i-y_i)^2}
\]
In cross validation step, the data is devided into train data
with proportion 80% of original data and test data with proportion 20%
of original data as unseen data. Then, the train data and test data
devided into only predictor and only target. And, selecting only numeric
variable predictors for train data also test data.
knn_train <- heart_disease_train
knn_test <- heart_disease_test
# Numeric Variable Predictors in train data
knn_train_x <- select(knn_train,-c(sex,cp,fbs,restecg,exang,slope,ca,thal,target))
# Numeric Variable Predictors in test data
knn_test_x <- select(knn_test,-c(sex,cp,fbs,restecg,exang,slope,ca,thal,target,PredRisk_HeartDisease,Pred_Label_HeartDisease))
# Target Variable in train data
knn_train_y <- knn_train[,"target"]
# Target Variable in test data
knn_test_y <- knn_test[,"target"]
Generalizing the range of predictor variables by using z-score
standarization.
knn_train_x_scale <- scale(x = knn_train_x)
knn_test_x_scale <-
scale(x = knn_test_x,
center = attr(knn_train_x_scale, "scaled:center"),
scale = attr(knn_train_x_scale, "scaled:scale"))Defined the optimum k for k-nn model before conducting the
prediction. Then, creating knn model by using knn()
function.
# find optimum k
round(sqrt(nrow(knn_train)))#> [1] 29
knn_pred <- knn(train = knn_train_x_scale, # Train data predictors after scale
test = knn_test_x_scale, # Test data predictors after scale
cl = knn_train_y, # Only Data Train Target
k = 29)
In evaluating section consists two steps which are comparing
logistic regression model (model_logistic_all) with k-nn model
(knn_pred) and using model after comparing result in order to predict
the test data and train data fro getting the best model performance, and
then analyze also conduct visualization to see the ROC & AUC
score.
# Evaluation Test Data Prediction
eva_log <- confusionMatrix(data = Pred_Result$Result_Pred,
reference = Pred_Result$Actual,
positive = "1")
# KNN
eva_knn <- confusionMatrix(data = knn_pred,
reference = knn_test_y,
positive = "1")
performance_comparison <- cbind.data.frame(Accuracy=c(eva_log$overall[1],eva_knn$overall[1]),
Sensitivity = c(eva_log$byClass[[1]], eva_knn$byClass[[1]]),
Specificity = c(eva_log$byClass[[2]], eva_knn$byClass[[2]]),
Precision = c(eva_log$byClass[[3]], eva_knn$byClass[[3]])
)
rownames(performance_comparison) <- c("Logistic Regression (model_logistic_all)", "K-NN (knn_pred)")
performance_comparison# Evaluation Train Data Prediction
eva_log_tr <- confusionMatrix(data=Pred_Result_tr$Result_Pred,
reference = Pred_Result_tr$Actual,
positive = "1")
performance_log <-
cbind.data.frame(Accuracy=c(eva_log$overall[1],eva_log_tr$overall[1]),
Sensitivity = c(eva_log$byClass[[1]], eva_log_tr$byClass[[1]]),
Specificity = c(eva_log$byClass[[2]], eva_log_tr$byClass[[2]]),
Precision = c(eva_log$byClass[[3]], eva_log_tr$byClass[[3]]))
rownames(performance_log) <- c("On Test Data - Log.Reg.", "On Train Data - Log.Reg.")
performance_logActual <- factor(c("No", "No", "Yes", "Yes"))
Predicted <- factor(c("No", "Yes", "No", "Yes"))
y <- c(eva_log$table[1], eva_log$table[2], eva_log$table[3], eva_log$table[4])
df1_log <- data.frame(Actual, Predicted, y)
# Plot 1
plot_log1 <- ggplot(data = df1_log, mapping = aes(x = Actual, y = Predicted, col = y)) +
geom_tile(aes(fill = y)) +
geom_text(aes(label = sprintf("%1.0f", y)), color = "Black", size = 5, family = "Times New Roman") +
scale_fill_gradient(low = "#91FFBF", high = "#3FF48B") +
scale_color_gradient(low = "#91FFBF", high = "#3FF48B") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
panel.background = element_rect(fill = "#2C343C"),
panel.grid = element_line(colour = "#2C343C"),
panel.grid.major.x = element_line(colour = "#2C343C"),
panel.grid.minor.x = element_line(colour = "#2C343C"),
axis.text.x = element_text(color = "#7BFEB2", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "#7BFEB2", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
plot_log1roc <- roc(predictor = Pred_Result$Prob_Pred, #Probability Prediction Result
response = Pred_Result$Actual, # Target Actual Data Test
levels = c("0", "1"),
percent = TRUE)
df1_log <- data.frame(Specificity=roc$specificities, Sensitivity=roc$sensitivities)
plot_roc <- ggplot(data = df1_log, aes(x = Specificity, y = Sensitivity))+
with_outer_glow(geom_path(colour = '#FFE5DB', size = 1), colour = "#E08F83", sigma = 10, expand = 1)+
scale_x_reverse() +
geom_abline(intercept = 100, slope = 1, color="#B8332E", linetype = "longdash") +
annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc$auc,1), '%'), size = 4, color = "#DFC18F", family = "Times New Roman")+
ylab('Sensitivity (%)')+
xlab('Specificity (%)')+
scale_fill_gradient(low = "#E08F83", high = "#B8332E") +
scale_color_gradient(low = "#E08F83", high = "#B8332E") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#2C343C", color = "#2C343C"),
panel.background = element_rect(fill = "#2C343C"),
panel.grid = element_line(colour = "#2C343C"),
panel.grid.major.x = element_line(colour = "#2C343C"),
panel.grid.minor.x = element_line(colour = "#2C343C"),
axis.text.x = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "#2C343C", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#DFC18F", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
plot_roc
MODEL: Logistic Regression (model_logistic_all)
1. Defined Class
- Positive
Class : 1 (Observed Heart Disease)
- Negative
Class : 0 (No Heart Disease)
2. Defined
Risk
- FP/False Positif: : Predicted
heart disease, then the actual is no heart disease.
- FN /
False Negatif: : Predicted no heart disease, then the actual is
heart disease.
3. Defined Intolerable Risk & Model
Performance Metric
- The Intolerable risk is FN / False
Negative risk. Hence, the research uses the sensitivity
of model performance metric.
Insight:
- The sensivity / recall of
Logistic Regression model is higher than K-NN model.
- The Logistic
Regression (model_logistic_all) Accuracy : 0.8585366, it means that only
85.85% correctly classified data.
- The Logistic Regression
(model_logistic_all) Sensitivity : 0.8659794, it means that 86.59%
positive class is classified correctly from its actual data.
- The
Logistic Regression (model_logistic_all) Specificity : 0.8518519, it
means that the level of precision of the model predicts a negative class
of 85.18%.
- The Logistic Regression (model_logistic_all) Precision
: 0.8400, it means that the level of precision of the model predicts a
positive class of 84%.
- There are no big differences between our
Accuracy, Sensitivity, Specificity, and Precision score on the “On Test
Data - Log.Reg.” and On Train Data - Log.Reg. .
- This means that
there are no indication of Overfitting and Underfitting for Logistic
Regression model.
- ROC Curve also shows a very good separation
with AUC score of 91.9%. Meaning that the model is proven to be near
excellent at separating our target classes.
Based on the results of the step-wise regression method in logistic regression model, several predictors increase a person’s likelihood of having a heart attack, as follows: sex, cp, trestbps, thalach, exang, oldpeak, ca, and thal.
The model of model_logistic_all is suitable to make a prediciton. But, using a model of model_logistic_step is appropiate to see the significance of the predictor. Then, step-wise regression is a greedy algorithm that focuses on finding the best results with relatively short time, but not necessarily giving the most optimal results. Therefore, this research uses the results of step-wise regression, model_logistic_step, as a model recommendation and there is still room for improvement.
The model of logistic regression (model_logistic_all) is better than model KNN (knn_pred), due to higher sensitivity score and AUC score.