1 Summary

This analysis finds that patients with heart disease tend to be younger than patients without it: the middle 50% of their ages lies between Q1 = 44 and Q3 = 59 years, compared with 52 to 62 years for patients without heart disease. Their resting blood pressure falls mostly in the elevated (120-129 mm Hg) and hypertension stage 1 (130-139 mm Hg) categories, as the middle 50% of the data lies between Q1 = 120 mm Hg and Q3 = 140 mm Hg. Across all 4 chest pain types, patients with or without heart disease have cholesterol in the range of roughly 210 - 250 mg/dl. The thalach variable has a moderate positive correlation with the target variable, while trestbps, chol, age, and oldpeak are weakly positively correlated with one another. The lowest maximum heart rate recorded is higher for patients with heart disease (96) than for patients without heart disease (71).

Based on the minimum age of 29 among the heart disease patients in this data, awareness of heart disease should begin before that age. Patients should maintain a healthy lifestyle to keep their blood pressure stable, since people with elevated blood pressure are likely to develop high blood pressure unless steps are taken to control the condition.

2 Preface

2.1 Background

In 2019, the top 10 causes of death accounted for 55% of the 55.4 million deaths worldwide. By number of lives lost, the leading causes fall under three broad topics: cardiovascular (ischaemic heart disease, stroke), respiratory (chronic obstructive pulmonary disease, lower respiratory infections), and neonatal conditions, which include birth asphyxia and birth trauma, neonatal sepsis and infections, and preterm birth complications.

2.2 Research Objectives

This research aims to provide insights that can benefit readers, especially the general public and hospitals, by taking a deep dive into a heart disease dataset. The research objectives are:
- Analyzing the medical history of patients with and without heart disease.
- Analyzing the relationships between the numerical variables.
- Analyzing the probability of occurrence for each level of the factor variables.

2.3 Data Source

[Heart Disease Dataset](https://www.kaggle.com/johnsmith88/heart-disease-dataset/)

2.4 Data Description

No.	Feature	Description	Value
1.	age	Patient’s age in years	29-77
2.	sex	Patient’s gender	(1)Male (0)Female
3.	cp	Chest pain type	(0)Typical angina - TA (1)Atypical angina - ATA (2)Non-anginal pain - NAP (3)Asymptomatic - ASY
4.	trestbps	Resting blood pressure (in mm Hg)	94-200
5.	chol	Serum cholesterol in mg/dl	126-564
6.	fbs	Fasting blood sugar > 120 mg/dl	(1)True (0)False
7.	restecg	Resting electrocardiographic results	(0)Normal (1)ST-T wave abnormality (2)Showing probable or definite left ventricular hypertrophy by Estes’ criteria
8.	thalach	Maximum heart rate achieved	71-202
9.	exang	Exercise induced angina	(1)Yes (0)No
10.	oldpeak	ST depression induced by exercise relative to rest	0-6.2
11.	slope	The slope of the peak exercise ST segment	(0)Upsloping (1)Flat (2)Downsloping
12.	ca	Number of major vessels (0-3) colored by fluoroscopy	0,1,2,3 (a value of 4 also appears in the data)
13.	thal	Thalassemia	(3)Normal (6)Fixed defect (no blood flow in some part of the heart) (7)Reversible defect (a blood flow is observed but it is not normal); coded 0-3 in this data
14.	target	Diagnosis of heart disease	(0)Heart disease not present (1)Heart disease present

2.5 List Packages

# data cleaning
library(readr)
library(tidyverse)
library(dplyr)

# data analysis
library(GGally)

# data visualization
library(ggplot2)
library(scales)

3 Data Preprocessing

3.1 Read & Extracting Data

This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The “target” field indicates the presence of heart disease in the patient and is integer valued: 0 = no disease and 1 = disease.

heart_disease <- read.csv("data_input/kaggle_4city.csv")
heart_disease

4 Data Wrangling

4.1 Data Inspection

# First 6 rows
head(heart_disease)

# Last 6 rows
tail(heart_disease)

4.2 Change The Data Type

# Inspect Data Type
glimpse(heart_disease)

#> Rows: 1,025
#> Columns: 14
#> $ age      <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex      <int> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol     <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs      <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg  <int> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach  <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang    <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak  <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope    <int> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca       <int> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal     <int> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target   <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…

# Change Data Type to Factor

heart_disease <- heart_disease %>% 
  mutate(across(c(sex, cp, fbs, restecg, exang, slope, ca, thal), as.factor))

glimpse(heart_disease)

#> Rows: 1,025
#> Columns: 14
#> $ age      <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex      <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol     <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs      <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg  <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach  <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang    <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak  <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope    <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca       <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal     <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target   <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…

Note: target is kept as an integer so that it can be used in the correlation analysis with cor(). For machine learning modelling, it should be converted to a factor.
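The note above can be sketched as follows: a minimal example, assuming a toy data frame with made-up values, that keeps the integer target for cor() and adds a separate factor copy for modelling. The column name target_fct and the labels are hypothetical choices for illustration.

```r
# Toy data frame; the values are illustrative, not taken from the report's dataset
toy <- data.frame(
  thalach = c(168, 155, 125, 161, 106, 122),
  target  = c(1L, 1L, 0L, 1L, 0L, 0L)
)

# Integer target: usable directly in cor()
r <- cor(toy$thalach, toy$target)

# Separate factor copy for modelling; name and labels are assumed for this sketch
toy$target_fct <- factor(toy$target, levels = c(0, 1),
                         labels = c("no_disease", "disease"))

print(r)
print(levels(toy$target_fct))
```

Keeping both columns avoids converting back and forth between the two types.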

4.3 Check Missing Values

anyNA(heart_disease)

#> [1] FALSE

colSums(is.na(heart_disease))

#>      age      sex       cp trestbps     chol      fbs  restecg  thalach 
#>        0        0        0        0        0        0        0        0 
#>    exang  oldpeak    slope       ca     thal   target 
#>        0        0        0        0        0        0

The output shows that there are no missing values in the heart_disease data frame.

4.4 Check Duplicate Data

heart_disease %>% 
  duplicated() %>% 
  sum()

#> [1] 723

# Show the duplicated rows
heart_disease[duplicated(heart_disease),]

Insight: Different patients may legitimately share identical values across all 14 variables, so the duplicated rows are left in the data frame unchanged.
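Because duplicated() flags every row that repeats an earlier row, the count of duplicates and the count of distinct rows always add up to the total. For the figures above, 1,025 rows with 723 flagged duplicates leave 1,025 - 723 = 302 distinct rows. A minimal sketch on a toy data frame (made-up values) shows the relationship and how to list which rows repeat:

```r
# Toy records; values are illustrative only
toy <- data.frame(
  age  = c(52, 52, 61, 61, 61, 70),
  chol = c(212, 212, 203, 203, 203, 174)
)

n_dup    <- sum(duplicated(toy))   # rows that repeat an earlier row
n_unique <- nrow(unique(toy))      # distinct rows

# Every row is either a first occurrence or a repeat
stopifnot(n_dup + n_unique == nrow(toy))

# Count how often each distinct row occurs, then keep those seen more than once
counts <- aggregate(n ~ age + chol, data = cbind(toy, n = 1), FUN = sum)
print(counts[counts$n > 1, ])
```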

5 EDA - Exploratory Data Analysis

5.1 Practical Statistic

# Inspect the data

# With Heart Disease
heart_disease1 <- heart_disease[heart_disease$target==1,]

# Without Heart Disease
heart_disease2 <- heart_disease[heart_disease$target==0,]

5.1.1 Summary statistics of the original dataset

The five-number summary and the mean are inspected to understand how the data is distributed.

# Original Dataset
summary(heart_disease)

#>       age        sex     cp         trestbps          chol     fbs     restecg
#>  Min.   :29.00   0:312   0:497   Min.   : 94.0   Min.   :126   0:872   0:497  
#>  1st Qu.:48.00   1:713   1:167   1st Qu.:120.0   1st Qu.:211   1:153   1:513  
#>  Median :56.00           2:284   Median :130.0   Median :240           2: 15  
#>  Mean   :54.43           3: 77   Mean   :131.6   Mean   :246                  
#>  3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:275                  
#>  Max.   :77.00                   Max.   :200.0   Max.   :564                  
#>     thalach      exang      oldpeak      slope   ca      thal   
#>  Min.   : 71.0   0:680   Min.   :0.000   0: 74   0:578   0:  7  
#>  1st Qu.:132.0   1:345   1st Qu.:0.000   1:482   1:226   1: 64  
#>  Median :152.0           Median :0.800   2:469   2:134   2:544  
#>  Mean   :149.1           Mean   :1.072           3: 69   3:410  
#>  3rd Qu.:166.0           3rd Qu.:1.800           4: 18          
#>  Max.   :202.0           Max.   :6.200                          
#>      target      
#>  Min.   :0.0000  
#>  1st Qu.:0.0000  
#>  Median :1.0000  
#>  Mean   :0.5132  
#>  3rd Qu.:1.0000  
#>  Max.   :1.0000

Insight:

- The minimum age is 29 years, the maximum age is 77 years, and the mean age is about 54.43 years.
- There are 713 male observations and 312 female observations.
- The middle 50% of trestbps (resting blood pressure) lies in the range 120 - 140 mm Hg, and the mean is 131.6 mm Hg.
- The middle 50% of chol (cholesterol) lies in the range 211 - 275 mg/dl, and the mean is 246 mg/dl.
- The middle 50% of thalach (maximum heart rate achieved) lies in the range 132 - 166; the maximum is 202 and the minimum is 71.
- The middle 50% of oldpeak (ST depression induced by exercise relative to rest) lies in the range 0.0 - 1.8, and the mean is 1.072.
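The "middle 50%" ranges above are the intervals between the first and third quartiles. A minimal sketch, assuming a toy data frame with made-up ages, shows how these quartiles and the interquartile range can be computed per target group without reading them off summary():

```r
# Toy data; values are illustrative only
toy <- data.frame(
  age    = c(29, 41, 44, 52, 59, 63, 35, 52, 58, 62, 67, 77),
  target = rep(c(1L, 0L), each = 6)
)

# Q1, median, and Q3 of age within each target group
q_by_group <- tapply(toy$age, toy$target, quantile, probs = c(0.25, 0.5, 0.75))

# Interquartile range (Q3 - Q1): the width of the interval holding the middle 50%
iqr_by_group <- tapply(toy$age, toy$target, IQR)

print(q_by_group)
print(iqr_by_group)
```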

5.1.2 Summary statistics of patients with heart disease

The five-number summary and the mean are inspected to understand how the data is distributed.

summary(heart_disease[heart_disease$target == 1,])

#>       age        sex     cp         trestbps          chol       fbs    
#>  Min.   :29.00   0:226   0:122   Min.   : 94.0   Min.   :126.0   0:455  
#>  1st Qu.:44.00   1:300   1:134   1st Qu.:120.0   1st Qu.:208.0   1: 71  
#>  Median :52.00           2:219   Median :130.0   Median :234.0          
#>  Mean   :52.41           3: 51   Mean   :129.2   Mean   :241.0          
#>  3rd Qu.:59.00                   3rd Qu.:140.0   3rd Qu.:265.8          
#>  Max.   :76.00                   Max.   :180.0   Max.   :564.0          
#>  restecg    thalach      exang      oldpeak     slope   ca      thal   
#>  0:214   Min.   : 96.0   0:455   Min.   :0.00   0: 28   0:415   0:  3  
#>  1:309   1st Qu.:149.0   1: 71   1st Qu.:0.00   1:158   1: 66   1: 21  
#>  2:  3   Median :161.5           Median :0.20   2:340   2: 21   2:412  
#>          Mean   :158.6           Mean   :0.57           3:  9   3: 90  
#>          3rd Qu.:172.0           3rd Qu.:1.00           4: 15          
#>          Max.   :202.0           Max.   :4.20                          
#>      target 
#>  Min.   :1  
#>  1st Qu.:1  
#>  Median :1  
#>  Mean   :1  
#>  3rd Qu.:1  
#>  Max.   :1

# Distribution with heart disease
boxplot(x = heart_disease1$age, horizontal = TRUE, xlab = "Age Distribution")

plot(x = heart_disease1$cp, y = heart_disease1$trestbps, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "Resting Blood Pressure")

plot(x = heart_disease1$cp, y = heart_disease1$chol, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "Amount of Cholesterol")

plot(x = heart_disease1$cp, y = heart_disease1$thalach, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "Maximum Heart Rate Achieved")

plot(x = heart_disease1$cp, y = heart_disease1$oldpeak, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "ST Depression Induced")

Insight:

- Among patients with heart disease, the minimum age is 29 years, the maximum age is 76 years, and the mean age is about 52.41 years.
- There are 300 male observations and 226 female observations with heart disease.
- The middle 50% of trestbps (resting blood pressure) lies in the range 120 - 140 mm Hg, and the mean is 129.2 mm Hg.
- The middle 50% of chol (cholesterol) lies in the range 208 - 265.8 mg/dl, and the mean is 241 mg/dl.
- The middle 50% of thalach (maximum heart rate achieved) lies in the range 149 - 172; the maximum is 202 and the minimum is 96.
- The middle 50% of oldpeak (ST depression induced by exercise relative to rest) lies in the range 0.0 - 1.0, and the mean is 0.57.
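The Summary reads the resting blood pressure quartiles against two categories: elevated (120-129 mm Hg) and hypertension stage 1 (130-139 mm Hg). A minimal sketch of that binning with cut() follows, using made-up readings. The two outer bins and their neutral labels are assumptions added only so that every value falls into some bin.

```r
# Illustrative resting blood pressure readings (mm Hg), not from the dataset
trestbps <- c(94, 118, 120, 129, 130, 139, 140, 180)

# right = FALSE gives left-closed bins: [0,120), [120,130), [130,140), [140,Inf)
bp_category <- cut(
  trestbps,
  breaks = c(0, 120, 130, 140, Inf),
  labels = c("below 120", "elevated", "stage 1", "140 or above"),  # outer labels assumed
  right  = FALSE
)

print(table(bp_category))
```

Left-closed bins matter here because the quartiles Q1 = 120 and Q3 = 140 sit exactly on category boundaries.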

5.1.3 Summary statistics of patients without heart disease

The five-number summary and the mean are inspected to understand how the data is distributed.

summary(heart_disease[heart_disease$target == 0,])

#>       age        sex     cp         trestbps          chol       fbs    
#>  Min.   :35.00   0: 86   0:375   Min.   :100.0   Min.   :131.0   0:417  
#>  1st Qu.:52.00   1:413   1: 33   1st Qu.:120.0   1st Qu.:217.0   1: 82  
#>  Median :58.00           2: 65   Median :130.0   Median :249.0          
#>  Mean   :56.57           3: 26   Mean   :134.1   Mean   :251.3          
#>  3rd Qu.:62.00                   3rd Qu.:144.0   3rd Qu.:284.0          
#>  Max.   :77.00                   Max.   :200.0   Max.   :409.0          
#>  restecg    thalach      exang      oldpeak    slope   ca      thal   
#>  0:283   Min.   : 71.0   0:225   Min.   :0.0   0: 46   0:163   0:  4  
#>  1:204   1st Qu.:125.0   1:274   1st Qu.:0.6   1:324   1:160   1: 43  
#>  2: 12   Median :142.0           Median :1.4   2:129   2:113   2:132  
#>          Mean   :139.1           Mean   :1.6           3: 60   3:320  
#>          3rd Qu.:156.0           3rd Qu.:2.5           4:  3          
#>          Max.   :195.0           Max.   :6.2                          
#>      target 
#>  Min.   :0  
#>  1st Qu.:0  
#>  Median :0  
#>  Mean   :0  
#>  3rd Qu.:0  
#>  Max.   :0

# Distribution without heart disease
boxplot(x = heart_disease2$age, horizontal = TRUE, xlab = "Age Distribution")

# With horizontal = TRUE the chest pain groups lie along the y-axis, so ylab names the groups
plot(x = heart_disease2$cp, y = heart_disease2$trestbps, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "Resting Blood Pressure")

plot(x = heart_disease2$cp, y = heart_disease2$chol, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "Amount of Cholesterol")

plot(x = heart_disease2$cp, y = heart_disease2$thalach, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "Maximum Heart Rate Achieved")

plot(x = heart_disease2$cp, y = heart_disease2$oldpeak, horizontal = TRUE, ylab = "Chest Pain Type", xlab = "ST Depression Induced")

Insight:

- Among patients without heart disease, the minimum age is 35 years, the maximum age is 77 years, and the mean age is about 56.57 years.
- There are 413 male observations and 86 female observations without heart disease.
- The middle 50% of trestbps (resting blood pressure) lies in the range 120 - 144 mm Hg, and the mean is 134.1 mm Hg.
- The middle 50% of chol (cholesterol) lies in the range 217 - 284 mg/dl, and the mean is 251.3 mg/dl.
- The middle 50% of thalach (maximum heart rate achieved) lies in the range 125 - 156; the maximum is 195 and the minimum is 71.
- The middle 50% of oldpeak (ST depression induced by exercise relative to rest) lies in the range 0.6 - 2.5, and the mean is 1.6.

5.2 Variable Relations

5.2.1 Numeric Variables Relationship Towards Target

1. Covariance

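Covariance measures the direction of the linear relationship between two numeric variables.
It shows how two different variables vary together: a positive value means they tend to move in the same direction, and a negative value means they tend to move in opposite directions.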

cov(heart_disease$thalach,heart_disease$target)

#> [1] 4.865194

Insight:

The covariance between thalach and target is positive, so the two variables move in the same direction: observations with target = 1 tend to have higher thalach values.

cov(heart_disease$oldpeak,heart_disease$target)

#> [1] -0.2576322

cov(heart_disease$chol,heart_disease$target)

#> [1] -2.579102

cov(heart_disease$trestbps,heart_disease$target)

#> [1] -1.215584

cov(heart_disease$age,heart_disease$target)

#> [1] -1.040392

Insight:

The covariances of oldpeak, chol, trestbps, and age with target are all negative, so each of these variables moves in the opposite direction to target.
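Covariance depends on the units of each variable, so magnitudes such as 4.87 for thalach and -0.26 for oldpeak cannot be compared directly. Dividing the covariance by both standard deviations gives the unit-free correlation. A minimal sketch on toy vectors (made-up values) checks both formulas against cov() and cor():

```r
# Toy vectors; values are illustrative only
x <- c(168, 155, 125, 161, 106, 122)   # stands in for thalach
y <- c(1, 1, 0, 1, 0, 0)               # stands in for target

n <- length(x)

# Sample covariance from its definition
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation = covariance divided by both standard deviations
cor_manual <- cov_manual / (sd(x) * sd(y))

# Both hand-computed values agree with the built-in functions
stopifnot(isTRUE(all.equal(cov_manual, cov(x, y))))
stopifnot(isTRUE(all.equal(cor_manual, cor(x, y))))

print(c(covariance = cov_manual, correlation = cor_manual))
```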

2. Correlation

ggcorr(heart_disease, label = T, label_round = 2)

Insight:

Positive correlation
- thalach has a moderate positive correlation with target.
- trestbps, chol, age, and oldpeak have weak positive correlations with one another.
Negative correlation
- age, trestbps, and chol each have a weak negative correlation with target.
- oldpeak has a moderate negative correlation with target.
No correlation
- chol and trestbps show no correlation with thalach.
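Since the categorical columns were converted to factors earlier, a correlation matrix is only meaningful for the remaining numeric columns. A minimal base R sketch, assuming a toy data frame with made-up values, selects those columns and produces the rounded matrix of Pearson coefficients that such a plot visualises:

```r
# Toy data frame mixing numeric and factor columns; values are illustrative
toy <- data.frame(
  age     = c(52, 53, 70, 61, 62, 58),
  thalach = c(168, 155, 125, 161, 106, 122),
  sex     = factor(c(1, 1, 1, 1, 0, 0)),
  target  = c(0L, 0L, 0L, 0L, 0L, 1L)
)

# Keep only numeric (integer or double) columns
num_cols <- toy[, sapply(toy, is.numeric), drop = FALSE]

# Pearson correlation matrix, rounded to 2 decimals
corr_mat <- round(cor(num_cols), 2)

print(corr_mat)
```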

5.3 Probability

5.3.1 Mass Function

In this section, the probability mass function is used to calculate the probability of occurrence of each level of the discrete (categorical) variables.
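As a compact alternative to tabulating each group separately, a two-way table normalised by row gives the probabilities for both target groups at once. This is a minimal sketch on a toy data frame with made-up values:

```r
# Toy data; values are illustrative only
toy <- data.frame(
  sex    = factor(c(1, 1, 0, 1, 1, 1, 0, 0)),
  target = c(1, 1, 1, 1, 0, 0, 0, 0)
)

# Rows = target group, columns = sex level
counts <- table(target = toy$target, sex = toy$sex)

# margin = 1 normalises within each row, so every row sums to 1
pmf_by_group <- prop.table(counts, margin = 1)

print(pmf_by_group)
```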

1. “Sex” Probability Occurrence.
Explanation : The patient’s gender (1)Male and (0)Female.

# With Heart Disease
prop.table(table(heart_disease1$sex))

#> 
#>         0         1 
#> 0.4296578 0.5703422

# Without Heart Disease
prop.table(table(heart_disease2$sex))

#> 
#>         0         1 
#> 0.1723447 0.8276553

Insight:
- Among patients with heart disease, 57.03% are male and 42.97% are female.
- Among patients without heart disease, 82.77% are male and 17.23% are female.

2. “cp” Probability Occurrence.
Explanation : The variable of cp refers to chest pain type and this variable has 4 levels, which are (0)Typical angina - TA, (1)Atypical angina - ATA, (2)Non-anginal pain - NAP, (3)Asymptomatic - ASY.

# With Heart Disease 
prop.table(table(heart_disease1$cp))

#> 
#>          0          1          2          3 
#> 0.23193916 0.25475285 0.41634981 0.09695817

The output above uses the dataset of patients with heart disease.

Insight:

- Patients with heart disease most often have chest pain type 2, Non-anginal pain - NAP, with a probability of 41.63%.
- It is followed by Atypical angina - ATA with a probability of 25.48% and Typical angina - TA with a probability of 23.19%.
- The rarest type is Asymptomatic - ASY, with a probability of 9.70%.

# Without Heart Disease

prop.table(table(heart_disease2$cp))

#> 
#>          0          1          2          3 
#> 0.75150301 0.06613226 0.13026052 0.05210421

The output above uses the dataset of patients without heart disease.

Insight:

On the other hand, patients without heart disease most often have chest pain type 0, Typical angina - TA, with a probability of 75.15%.

3. “fbs” Probability Occurrence.
Explanation : The variable of fbs refers to fasting blood sugar > 120 mg/dl and this variable has 2 levels, which are (1)True and (0)False.

# With Heart Disease
prop.table(table(heart_disease1$fbs))

#> 
#>        0        1 
#> 0.865019 0.134981

# Without Heart Disease
prop.table(table(heart_disease2$fbs))

#> 
#>         0         1 
#> 0.8356713 0.1643287

Insight:

Fasting blood sugar > 120 mg/dl is relatively uncommon in both groups, with a probability of 13.50% for patients with heart disease and 16.43% for patients without heart disease.

4. “restecg” Probability Occurrence.
Explanation : The variable of restecg refers to the resting electrocardiographic results and this variable has 3 levels, which are (0)Normal, (1)ST-T wave abnormality, (2)Showing probable or definite left ventricular hypertrophy by Estes’ criteria.

# With Heart Disease
prop.table(table(heart_disease1$restecg))

#> 
#>           0           1           2 
#> 0.406844106 0.587452471 0.005703422

# Without Heart Disease
prop.table(table(heart_disease2$restecg))

#> 
#>         0         1         2 
#> 0.5671343 0.4088176 0.0240481

Insight:
- Patients with heart disease most often have level 1 results, with a probability of 58.75%.
- Patients without heart disease most often have normal results, with a probability of 56.71%, followed by level 1 results with a probability of 40.88%.
- Probable or definite left ventricular hypertrophy by Estes’ criteria is rare in both groups, with a probability of 0.57% for patients with heart disease and 2.40% for patients without heart disease.

5. “exang” Probability Occurrence.
Explanation : The variable of exang refers to exercise induced angina whereby (1)Yes (0)No.

# With Heart Disease
prop.table(table(heart_disease1$exang))

#> 
#>        0        1 
#> 0.865019 0.134981

# Without Heart Disease
prop.table(table(heart_disease2$exang))

#> 
#>         0         1 
#> 0.4509018 0.5490982

Insight:
- Patients with heart disease have exercise induced angina with a probability of 13.50%; the remaining 86.50% do not.
- Patients without heart disease have exercise induced angina with a probability of 54.91%; the remaining 45.09% do not.

6. “slope” Probability Occurrence.
Explanation : The variable of slope refers to slope of the peak exercise ST segment and this variable has 3 levels, coded in this data as (0)Upsloping, (1)Flat, (2)Downsloping.

# With Heart Disease
prop.table(table(heart_disease1$slope))

#> 
#>          0          1          2 
#> 0.05323194 0.30038023 0.64638783

# Without Heart Disease
prop.table(table(heart_disease2$slope))

#> 
#>          0          1          2 
#> 0.09218437 0.64929860 0.25851703

Insight:
- Patients with heart disease most often have a downsloping peak exercise ST segment, with a probability of 64.64%, followed by a flat slope with a probability of 30.04%.
- Patients without heart disease most often have a flat peak exercise ST segment, with a probability of 64.93%, followed by a downsloping slope with a probability of 25.85%.

7. “ca” Probability Occurrence.
Explanation : The variable of ca refers to the number of major vessels colored by fluoroscopy. Its documented levels are 0,1,2,3, but a fifth level, 4, also appears in this data.

# With Heart Disease
prop.table(table(heart_disease1$ca))

#> 
#>          0          1          2          3          4 
#> 0.78897338 0.12547529 0.03992395 0.01711027 0.02851711

# Without Heart Disease
prop.table(table(heart_disease2$ca))

#> 
#>           0           1           2           3           4 
#> 0.326653307 0.320641283 0.226452906 0.120240481 0.006012024

Insight:
- For patients both with and without heart disease, the three most probable values of ca are 0, 1, and 2 major vessels colored by fluoroscopy.

8. “thal” Probability Occurrence.
Explanation : The variable of thal refers to thalassemia. Its documented levels are (3)Normal (6)Fixed defect (7)Reversible defect, but in this data it is coded with 4 levels, 0 to 3.

# With Heart Disease
prop.table(table(heart_disease1$thal))

#> 
#>           0           1           2           3 
#> 0.005703422 0.039923954 0.783269962 0.171102662

# Without Heart Disease
prop.table(table(heart_disease2$thal))

#> 
#>           0           1           2           3 
#> 0.008016032 0.086172345 0.264529058 0.641282565

Insight:
- Patients with heart disease most often have thal value 2, with a probability of 78.33%.
- Patients without heart disease most often have thal value 3, with a probability of 64.13%.

5.4 Distribution Analysis

# Resting Blood Pressure
ggplot(heart_disease1, aes(trestbps)) +
  geom_histogram(bins = 6, color = "red") +
  scale_x_continuous(breaks = seq(75, 200, 10)) +
  labs(title = "The range of trestbps variable's data distribution", x = "Resting Blood Pressure", y = "Total")

# Cholesterol
heart_disease_chol <- heart_disease1 %>% 
  filter(age > 44 & age < 59)
ggplot(heart_disease_chol, aes(chol)) +
  geom_histogram(bins = 7, color = "red") +
  scale_x_continuous(breaks = seq(100, 600, 20)) +
  labs(title = "The range of chol variable's data distribution", x = "Cholesterol", y = "Total")

# thalach
ggplot(heart_disease1, aes(thalach)) +
  geom_histogram(bins = 6, color = "red") +
  scale_x_continuous(breaks = seq(70, 210, 10))

# Oldpeak
ggplot(heart_disease1, aes(oldpeak)) +
  geom_bar(color = "blue") +
  scale_x_continuous(breaks = seq(0, 5, 0.2)) +
  labs(title = "The range of oldpeak variable's data distribution", x = "ST Depression Induced", y = "Total")

Insight:
- Resting blood pressure among heart disease patients is most concentrated in the range 111 - 146 mm Hg.
- Fewer than 100 heart disease patients have fasting blood sugar > 120 mg/dl.
- Among heart disease patients aged 45 to 58, cholesterol is most concentrated in the range 182 - 254 mg/dl.
- The oldpeak distribution peaks at a value of 0, followed by 0.6, 0.4, and 0.2.
- Maximum heart rate among heart disease patients is most concentrated in the range 160 - 180, with a total above 200 patients.
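Ranges read off a histogram depend on where the bin edges fall, which the bins argument chooses automatically. To state a range exactly, the bin counts can be tabulated with explicit edges. A minimal sketch, assuming made-up cholesterol values and an assumed bin width of 50 mg/dl:

```r
# Illustrative cholesterol values (mg/dl), not from the dataset
chol <- c(149, 174, 203, 203, 212, 248, 249, 286, 289, 294, 318, 341)

# Explicit bin edges (assumed width of 50 mg/dl)
edges <- seq(100, 350, by = 50)

# Count observations per bin; intervals are closed on the right by default
bin_counts <- table(cut(chol, breaks = edges))

print(bin_counts)

# Label of the fullest bin
print(names(bin_counts)[which.max(bin_counts)])
```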

Deep Analytical - Heart Disease

Faisal Adhisthana Nugraha

2023-03-05
