This analysis shows that the middle 50% of ages for patients with heart disease lies between Q1 = 44 and Q3 = 59 years, lower than for patients without heart disease (Q1 = 52, Q3 = 62). Resting blood pressure falls mostly in the elevated (120-129 mm Hg) and hypertension stage 1 (130-139 mm Hg) categories, as shown by the middle 50% of the data lying between Q1 = 120 mm Hg and Q3 = 140 mm Hg. Across all 4 chest pain types, patients with or without heart disease have cholesterol in the range of 210 - 250 mg/dl. The variable thalach tends to have a moderate positive correlation with target, while trestbps, chol, age, and oldpeak tend to have weak positive correlations with one another. The lowest recorded value of thalach (maximum heart rate achieved) is higher for patients with heart disease (96) than for patients without heart disease (71).
Finally, patients should build awareness of heart disease before the age of 29, the minimum age among heart disease patients in this data. Patients should also maintain a healthy lifestyle to keep blood pressure stable, because people with elevated blood pressure are likely to develop high blood pressure unless steps are taken to control the condition.
In 2019, the top 10 causes of death accounted for 55% of the 55.4 million deaths worldwide. The three leading groups of causes, by number of lives lost, were cardiovascular (ischaemic heart disease, stroke), respiratory (chronic obstructive pulmonary disease, lower respiratory infections), and neonatal conditions, which include birth asphyxia and birth trauma, neonatal sepsis and illnesses, and preterm birth complications.
This research aims to provide insights that benefit readers, especially the general public and hospitals, by taking a deep dive into heart disease through analysis of the dataset. The research objectives are:
- Analyzing the medical history of patients with and without heart disease.
- Analyzing the relationships between the numerical variables.
- Analyzing the probability of occurrence for each factor variable.
| No. | Feature | Description | Value |
|---|---|---|---|
| 1. | age | Patient’s age in years | 29-77 |
| 2. | sex | Patient’s gender | (1)Male (0)Female |
| 3. | cp | Chest pain type | (0)Typical angina - TA (1)Atypical angina - ATA (2)Non-anginal pain - NAP (3)Asymptomatic - ASY. |
| 4. | trestbps | Resting blood pressure (in mm Hg) | 94-200 |
| 5. | chol | Cholesterol in mg/dl | 126-564 |
| 6. | fbs | Fasting blood sugar > 120 mg/dl | (1)True (0)False |
| 7. | restecg | Resting electrocardiographic results | (0)Normal (1)Having ST-T wave abnormality (2)Showing probable or definite left ventricular hypertrophy by Estes’ criteria |
| 8. | thalach | Maximum heart rate achieved | 71-202 |
| 9. | exang | Exercise induced angina | (1)Yes (0)No |
| 10. | oldpeak | ST depression induced by exercise relative to rest | 0-6.2 |
| 11. | slope | The slope of the peak exercise ST segment | (1)Upsloping (2)Flat (3)Downsloping |
| 12. | ca | Number of major vessels (0-3) colored by fluoroscopy | 0,1,2,3 |
| 13. | thal | Thalassemia | (3)Normal (6)Fixed defect (no blood flow in some part of the heart) (7)Reversible defect (a blood flow is observed but it is not normal) |
| 14. | target | Diagnosis of heart disease | (0)Heart disease not present (1)Heart disease present |
# data cleaning
library(readr)
library(tidyverse)
library(dplyr)
# data analysis
library(GGally)
# data visualization
library(ggplot2)
library(scales)
This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The “target” field indicates the presence of heart disease in the patient. It is integer valued: 0 = no disease and 1 = disease.
heart_disease <- read.csv("data_input/kaggle_4city.csv")
heart_disease
# Top 6 data
head(heart_disease)
# Bottom 6 data
tail(heart_disease)
# Inspect Data Type
glimpse(heart_disease)
#> Rows: 1,025
#> Columns: 14
#> $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex <int> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg <int> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope <int> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca <int> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal <int> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
# Change Data Type to Factor
heart_disease <- heart_disease %>%
  mutate_at(vars(sex, cp, fbs, restecg, exang, slope, ca, thal), as.factor)
glimpse(heart_disease)
#> Rows: 1,025
#> Columns: 14
#> $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
Note: target is kept as an integer for the correlation analysis with cor(). It would be converted to a factor for machine learning modelling.
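The note above can be illustrated with a minimal sketch on toy data. The vectors `target_int` and `thalach_toy` are made-up values, not rows from the dataset; the point is only that `cor()` works on the integer coding, and that a factor should be converted back through `as.character()` so the original 0/1 labels are recovered rather than the internal level codes.

```r
# Toy 0/1 outcome and a numeric variable (hypothetical values, not from the dataset)
target_int  <- c(0L, 1L, 1L, 0L, 1L, 0L)
thalach_toy <- c(120, 160, 170, 130, 150, 110)

# cor() needs numeric input, so the integer coding is used here
r <- cor(thalach_toy, target_int)

# For modelling, the same column can be stored as a factor
target_fct <- factor(target_int, levels = c(0, 1))

# as.integer() on a factor returns the internal level codes (1, 2), not 0/1
codes <- as.integer(target_fct)

# Going through as.character() recovers the original 0/1 labels
target_back <- as.integer(as.character(target_fct))
```

Keeping one integer copy for `cor()` and creating the factor only when a model needs it avoids this conversion step altogether.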
anyNA(heart_disease)
#> [1] FALSE
colSums(is.na(heart_disease))
#> age sex cp trestbps chol fbs restecg thalach
#> 0 0 0 0 0 0 0 0
#> exang oldpeak slope ca thal target
#> 0 0 0 0 0 0
This shows that there are no missing values in the heart_disease data frame.
heart_disease %>%
  duplicated() %>%
  sum()
#> [1] 723
# Rows that duplicate an earlier row
heart_disease[duplicated(heart_disease),]
Insight: Different patients may share the same value for every variable, so duplicated rows are not necessarily errors. The duplicated rows are therefore kept, and no changes are made to this data frame.
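As a minimal sketch of what the count above means, the toy data frame below (hypothetical values, not taken from the dataset) shows that `duplicated()` flags every repeat of an earlier row but not the first occurrence. By the same logic, 723 flagged rows out of 1,025 would leave 1,025 - 723 = 302 distinct rows.

```r
# Toy data frame (hypothetical values): rows 1, 3 and 4 are identical
toy <- data.frame(
  age  = c(52, 60, 52, 52),
  chol = c(212, 240, 212, 212)
)

# duplicated() marks each repeat of an earlier row, not the first occurrence
dup_flags <- duplicated(toy)
n_dup <- sum(dup_flags)

# unique() keeps one copy of each distinct row
toy_unique <- unique(toy)
n_distinct <- nrow(toy_unique)
```

Here `n_dup + n_distinct` equals the number of rows in `toy`, which is the relationship used for the 302 figure above.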
# Inspect the data
# With Heart Disease
heart_disease1 <- heart_disease[heart_disease$target==1,]
# Without Heart Disease
heart_disease2 <- heart_disease[heart_disease$target==0,]
Inspecting the five-number summary plus the mean gives an overview of how the data are distributed.
# Original Dataset
summary(heart_disease)
#> age sex cp trestbps chol fbs restecg
#> Min. :29.00 0:312 0:497 Min. : 94.0 Min. :126 0:872 0:497
#> 1st Qu.:48.00 1:713 1:167 1st Qu.:120.0 1st Qu.:211 1:153 1:513
#> Median :56.00 2:284 Median :130.0 Median :240 2: 15
#> Mean :54.43 3: 77 Mean :131.6 Mean :246
#> 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:275
#> Max. :77.00 Max. :200.0 Max. :564
#> thalach exang oldpeak slope ca thal
#> Min. : 71.0 0:680 Min. :0.000 0: 74 0:578 0: 7
#> 1st Qu.:132.0 1:345 1st Qu.:0.000 1:482 1:226 1: 64
#> Median :152.0 Median :0.800 2:469 2:134 2:544
#> Mean :149.1 Mean :1.072 3: 69 3:410
#> 3rd Qu.:166.0 3rd Qu.:1.800 4: 18
#> Max. :202.0 Max. :6.200
#> target
#> Min. :0.0000
#> 1st Qu.:0.0000
#> Median :1.0000
#> Mean :0.5132
#> 3rd Qu.:1.0000
#> Max. :1.0000
Insight:
The five-number summary plus the mean for patients with heart disease (target == 1):
summary(heart_disease[heart_disease$target == 1,])
#> age sex cp trestbps chol fbs
#> Min. :29.00 0:226 0:122 Min. : 94.0 Min. :126.0 0:455
#> 1st Qu.:44.00 1:300 1:134 1st Qu.:120.0 1st Qu.:208.0 1: 71
#> Median :52.00 2:219 Median :130.0 Median :234.0
#> Mean :52.41 3: 51 Mean :129.2 Mean :241.0
#> 3rd Qu.:59.00 3rd Qu.:140.0 3rd Qu.:265.8
#> Max. :76.00 Max. :180.0 Max. :564.0
#> restecg thalach exang oldpeak slope ca thal
#> 0:214 Min. : 96.0 0:455 Min. :0.00 0: 28 0:415 0: 3
#> 1:309 1st Qu.:149.0 1: 71 1st Qu.:0.00 1:158 1: 66 1: 21
#> 2: 3 Median :161.5 Median :0.20 2:340 2: 21 2:412
#> Mean :158.6 Mean :0.57 3: 9 3: 90
#> 3rd Qu.:172.0 3rd Qu.:1.00 4: 15
#> Max. :202.0 Max. :4.20
#> target
#> Min. :1
#> 1st Qu.:1
#> Median :1
#> Mean :1
#> 3rd Qu.:1
#> Max. :1
# Distribution with heart disease
boxplot(x = heart_disease1$age, horizontal = TRUE, xlab = "Age Distribution")
# With horizontal = TRUE the chest pain groups are drawn along the y axis
plot(x = heart_disease1$cp, y = heart_disease1$trestbps, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "Resting Blood Pressure")
plot(x = heart_disease1$cp, y = heart_disease1$chol, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "Amount of Cholesterol")
plot(x = heart_disease1$cp, y = heart_disease1$thalach, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "Maximum Heart Rate Achieved")
plot(x = heart_disease1$cp, y = heart_disease1$oldpeak, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "ST Depression Induced")
Insight:
The five-number summary plus the mean for patients without heart disease (target == 0):
summary(heart_disease[heart_disease$target == 0,])
#> age sex cp trestbps chol fbs
#> Min. :35.00 0: 86 0:375 Min. :100.0 Min. :131.0 0:417
#> 1st Qu.:52.00 1:413 1: 33 1st Qu.:120.0 1st Qu.:217.0 1: 82
#> Median :58.00 2: 65 Median :130.0 Median :249.0
#> Mean :56.57 3: 26 Mean :134.1 Mean :251.3
#> 3rd Qu.:62.00 3rd Qu.:144.0 3rd Qu.:284.0
#> Max. :77.00 Max. :200.0 Max. :409.0
#> restecg thalach exang oldpeak slope ca thal
#> 0:283 Min. : 71.0 0:225 Min. :0.0 0: 46 0:163 0: 4
#> 1:204 1st Qu.:125.0 1:274 1st Qu.:0.6 1:324 1:160 1: 43
#> 2: 12 Median :142.0 Median :1.4 2:129 2:113 2:132
#> Mean :139.1 Mean :1.6 3: 60 3:320
#> 3rd Qu.:156.0 3rd Qu.:2.5 4: 3
#> Max. :195.0 Max. :6.2
#> target
#> Min. :0
#> 1st Qu.:0
#> Median :0
#> Mean :0
#> 3rd Qu.:0
#> Max. :0
# Distribution without heart disease
boxplot(x = heart_disease2$age, horizontal = TRUE, xlab = "Age Distribution")
# With horizontal = TRUE the chest pain groups are drawn along the y axis,
# so ylab names the groups and xlab names the measured variable
plot(x = heart_disease2$cp, y = heart_disease2$trestbps, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "Resting Blood Pressure")
plot(x = heart_disease2$cp, y = heart_disease2$chol, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "Amount of Cholesterol")
plot(x = heart_disease2$cp, y = heart_disease2$thalach, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "Maximum Heart Rate Achieved")
plot(x = heart_disease2$cp, y = heart_disease2$oldpeak, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "ST Depression Induced")
Insight:
1. Covariance
Covariance measures the linear relationship between two numeric variables: it shows how the two variables vary together. A positive value means they tend to move in the same direction, and a negative value means they tend to move in opposite directions.
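The definition above can be checked by hand. The sketch below uses two short made-up vectors (not the dataset) to compute the sample covariance from its formula, the sum of products of deviations from the means divided by n - 1, and compares it with `cov()`. It also shows that covariance depends on the units of measurement, which is why the raw values are hard to compare across variables.

```r
# Hypothetical values for illustration only
x <- c(150, 160, 170, 120, 130)  # e.g. a heart rate in beats per minute
y <- c(1, 1, 1, 0, 0)            # e.g. a 0/1 outcome

n <- length(x)

# Sample covariance from the definition
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Built-in equivalent
cov_builtin <- cov(x, y)

# Rescaling x (beats per minute to beats per second) rescales the covariance too
cov_rescaled <- cov(x / 60, y)
```

Because the magnitude changes with the units, only the sign of a covariance is directly interpretable; the strength of the relationship is better read from a correlation.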
cov(heart_disease$thalach,heart_disease$target)
#> [1] 4.865194
Insight:
cov(heart_disease$oldpeak,heart_disease$target)
#> [1] -0.2576322
cov(heart_disease$chol,heart_disease$target)
#> [1] -2.579102
cov(heart_disease$trestbps,heart_disease$target)
#> [1] -1.215584
cov(heart_disease$age,heart_disease$target)
#> [1] -1.040392
Insight:
2. Correlation
ggcorr(heart_disease, label = T, label_round = 2)
Insight:
Positive correlation
Negative correlation
No correlation
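As a minimal sketch of how the correlation coefficient relates to covariance, the example below uses made-up vectors (not the dataset). The Pearson correlation is the covariance divided by the product of the two standard deviations, which bounds it between -1 and 1 and makes it independent of the units.

```r
# Hypothetical values for illustration only
x <- c(150, 160, 170, 120, 130)
y <- c(1, 1, 1, 0, 0)

# Pearson correlation from covariance and standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in equivalent (Pearson is the default method of cor())
r_builtin <- cor(x, y)

# Unlike covariance, correlation does not change when x is rescaled
r_rescaled <- cor(x / 60, y)
```

This unit independence is what allows coefficients for variables such as chol and oldpeak to be compared on one scale.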
This section uses the probability mass function to calculate the probability of occurrence of each level of the discrete (categorical) variables.
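A minimal sketch of the calculation on a toy factor follows. The values in `cp_toy` are made up for illustration; `table()` counts each level and `prop.table()` divides the counts by their total, giving an empirical probability mass function whose values sum to 1.

```r
# Toy categorical variable with 4 levels (hypothetical values)
cp_toy <- factor(c(0, 2, 2, 1, 0, 2, 3, 2), levels = 0:3)

# Count of observations per level
counts <- table(cp_toy)

# Relative frequency of each level: the empirical probability mass function
pmf <- prop.table(counts)
```

Multiplying `pmf` by 100 gives the percentages quoted in the insights of this section.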
1. “Sex” Probability Occurrence.
Explanation: The patient’s gender, (1)Male and (0)Female.
# With Heart Disease
prop.table(table(heart_disease1$sex))
#>
#> 0 1
#> 0.4296578 0.5703422
# Without Heart Disease
prop.table(table(heart_disease2$sex))
#>
#> 0 1
#> 0.1723447 0.8276553
Insight:
- Among patients with heart disease, 57.03% are male and 42.97% are female.
- Among patients without heart disease, 82.77% are male and 17.23% are female.
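The proportions above are conditional on disease status, that is, the share of each sex within a group. A related but different question is the share of patients with heart disease within each sex. The sketch below rebuilds both from the counts printed in the summary() outputs earlier (226 and 300 with heart disease, 86 and 413 without); the matrix and its dimension names are illustrative, and the result describes this dataset only, which contains many duplicated rows.

```r
# Counts taken from the summary() outputs: rows are target, columns are sex
counts <- matrix(
  c(226, 300,   # target = 1: female, male
     86, 413),  # target = 0: female, male
  nrow = 2, byrow = TRUE,
  dimnames = list(target = c("1", "0"), sex = c("female", "male"))
)

# Share of each sex within a disease group (rows sum to 1)
p_sex_given_target <- prop.table(counts, margin = 1)

# Share of each disease status within a sex (columns sum to 1)
p_target_given_sex <- prop.table(counts, margin = 2)
```

With these counts, about 72.4% of the female records and about 42.1% of the male records have target = 1, even though males make up the majority of both groups.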
2. “cp” Probability Occurrence.
Explanation: The variable cp refers to chest pain type and has 4 levels: (0)Typical angina - TA, (1)Atypical angina - ATA, (2)Non-anginal pain - NAP, (3)Asymptomatic - ASY.
# With Heart Disease
prop.table(table(heart_disease1$cp))
#>
#> 0 1 2 3
#> 0.23193916 0.25475285 0.41634981 0.09695817
The output above uses the subset of patients with heart disease.
Insight:
# Without Heart Disease
prop.table(table(heart_disease2$cp))
#>
#> 0 1 2 3
#> 0.75150301 0.06613226 0.13026052 0.05210421
The output above uses the subset of patients without heart disease.
Insight:
3. “fbs” Probability Occurrence.
Explanation: The variable fbs refers to fasting blood sugar > 120 mg/dl and has 2 levels: (1)True and (0)False.
# With Heart Disease
prop.table(table(heart_disease1$fbs))
#>
#> 0 1
#> 0.865019 0.134981
# Without Heart Disease
prop.table(table(heart_disease2$fbs))
#>
#> 0 1
#> 0.8356713 0.1643287
Insight:
4. “restecg” Probability Occurrence.
Explanation: The variable restecg refers to the resting electrocardiographic results and has 3 levels: (0)Normal, (1)Having ST-T wave abnormality, (2)Showing probable or definite left ventricular hypertrophy by Estes’ criteria.
# With Heart Disease
prop.table(table(heart_disease1$restecg))
#>
#> 0 1 2
#> 0.406844106 0.587452471 0.005703422
# Without Heart Disease
prop.table(table(heart_disease2$restecg))
#>
#> 0 1 2
#> 0.5671343 0.4088176 0.0240481
Insight:
- Among patients with heart disease, restecg level 1 (ST-T wave abnormality) is the most common result, with a probability of 58.75%.
- Among patients without heart disease, a normal result (level 0) has a probability of 56.71% and level 1 has a probability of 40.88%.
- Probable or definite left ventricular hypertrophy by Estes’ criteria (level 2) is rare in both groups: 0.57% with heart disease and 2.40% without.
5. “exang” Probability Occurrence.
Explanation: The variable exang refers to exercise induced angina, where (1)Yes and (0)No.
# With Heart Disease
prop.table(table(heart_disease1$exang))
#>
#> 0 1
#> 0.865019 0.134981
# Without Heart Disease
prop.table(table(heart_disease2$exang))
#>
#> 0 1
#> 0.4509018 0.5490982
Insight:
- Among patients with heart disease, 13.50% have exercise induced angina (exang = 1) and 86.50% do not (exang = 0).
- Among patients without heart disease, 54.91% have exercise induced angina and 45.09% do not.
6. “slope” Probability Occurrence.
Explanation: The variable slope refers to the slope of the peak exercise ST segment and has 3 levels: (1)Upsloping (2)Flat (3)Downsloping.
# With Heart Disease
prop.table(table(heart_disease1$slope))
#>
#> 0 1 2
#> 0.05323194 0.30038023 0.64638783
# Without Heart Disease
prop.table(table(heart_disease2$slope))
#>
#> 0 1 2
#> 0.09218437 0.64929860 0.25851703
Insight:
- Among patients with heart disease, a downsloping peak exercise ST segment (level 2 in the output) is the most common, with a probability of 64.64%, followed by a flat slope (level 1) with 30.04%.
- Among patients without heart disease, a flat slope (level 1) is the most common, with a probability of 64.93%, followed by a downsloping slope (level 2) with 25.85%.
7. “ca” Probability Occurrence.
Explanation: The variable ca refers to the number of major vessels colored by fluoroscopy. It is documented with 4 levels (0, 1, 2, 3), although the output below also contains a level 4.
# With Heart Disease
prop.table(table(heart_disease1$ca))
#>
#> 0 1 2 3 4
#> 0.78897338 0.12547529 0.03992395 0.01711027 0.02851711
# Without Heart Disease
prop.table(table(heart_disease2$ca))
#>
#> 0 1 2 3 4
#> 0.326653307 0.320641283 0.226452906 0.120240481 0.006012024
Insight:
- For patients both with and without heart disease, the three most probable values of ca are 0, 1, and 2 major vessels colored by fluoroscopy.
8. “thal” Probability Occurrence.
Explanation: The variable thal refers to thalassemia and has 3 levels: (3)Normal (6)Fixed defect (7)Reversible defect.
# With Heart Disease
prop.table(table(heart_disease1$thal))
#>
#> 0 1 2 3
#> 0.005703422 0.039923954 0.783269962 0.171102662
# Without Heart Disease
prop.table(table(heart_disease2$thal))
#>
#> 0 1 2 3
#> 0.008016032 0.086172345 0.264529058 0.641282565
Insight:
- Among patients with heart disease, thal value 2 is the most common, with a probability of 78.33%.
- Among patients without heart disease, thal value 3 is the most common, with a probability of 64.13%.
# Resting Blood Pressure
ggplot(heart_disease1, aes(trestbps)) +
  geom_histogram(bins = 6, color = "red") +
  scale_x_continuous(breaks = seq(75, 200, 10)) +
  labs(title = "The range of trestbps variable's data distribution", x = "Resting Blood Pressure", y = "Total")
# Cholesterol
heart_disease_chol <- heart_disease1 %>%
  filter(age > 44 & age < 59)
ggplot(heart_disease_chol, aes(chol)) +
  geom_histogram(bins = 7, color = "red") +
  scale_x_continuous(breaks = seq(100, 600, 20)) +
  labs(title = "The range of chol variable's data distribution", x = "Cholesterol", y = "Total")
# thalach
ggplot(heart_disease1, aes(thalach)) +
  geom_histogram(bins = 6, color = "red") +
  scale_x_continuous(breaks = seq(70, 210, 10))
# Oldpeak
ggplot(heart_disease1, aes(oldpeak)) +
  geom_bar(color = "blue") +
  scale_x_continuous(breaks = seq(0, 5, 0.2)) +
  labs(title = "The range of oldpeak variable's data distribution", x = "ST Depression Induced", y = "Total")
Insight:
- For patients with heart disease, resting blood pressure is most often in the range 111 - 146 mm Hg.
- Fewer than 100 patients with heart disease have fasting blood sugar > 120 mg/dl.
- For patients with heart disease aged 45 to 58 (the subset plotted above), cholesterol is most often in the range 182 - 254 mg/dl.
- The oldpeak distribution is centered at 0, followed by 0.6, 0.4, and 0.2.
- For patients with heart disease, the maximum heart rate achieved is most often in the range 160 - 180, with a total above 200 patients.
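The ranges quoted in these insights are read off histogram bins. As a rough sketch of how such bin counts can be obtained numerically, the example below bins a made-up vector with `cut()` and explicit break points. The values and the break points are assumptions for illustration; `geom_histogram(bins = ...)` chooses its own bin edges, so its ranges will generally differ from these.

```r
# Hypothetical resting blood pressure values (not from the dataset)
bp_toy <- c(94, 110, 120, 120, 130, 130, 138, 140, 150, 180)

# Explicit bin edges: [90,120], (120,150], (150,180]
breaks <- seq(90, 180, by = 30)

# Assign each value to a bin; include.lowest keeps a value equal to 90 in the first bin
bins <- cut(bp_toy, breaks = breaks, include.lowest = TRUE)

# Number of observations per bin and the label of the fullest bin
bin_counts <- table(bins)
fullest_bin <- names(bin_counts)[which.max(bin_counts)]
```

Reporting `bin_counts` alongside a plot makes it possible to state the exact range and total of the tallest bar rather than estimating them from the axis.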