Aim:
To identify which attributes are the most important predictors of heart disease.

Dataset Attributes

Age : age of the patient [years]
Sex : sex of the patient [M: Male, F: Female]
ChestPainType : chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
RestingBP : resting blood pressure [mm Hg]
Cholesterol : serum cholesterol [mg/dl]
FastingBS : fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
RestingECG : resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria]
MaxHR : maximum heart rate achieved [Numeric value between 60 and 202]
ExerciseAngina : exercise-induced angina [Y: Yes, N: No]
Oldpeak : ST depression induced by exercise relative to rest [Numeric value]
ST_Slope : the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
HeartDisease : output class [1: heart disease, 0: Normal]

Importing all the necessary libraries:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(nycflights13)
library(ggplot2)
library(dplyr)
library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(readr)

Data Info:

heart <- read_csv("C:/Users/ROSHAN D K/Desktop/R project/heart.csv", show_col_types = FALSE)
heart

## # A tibble: 918 × 12
##      Age Sex   ChestPain…¹ Resti…² Chole…³ Fasti…⁴ Resti…⁵ MaxHR Exerc…⁶ Oldpeak
##    <dbl> <chr> <chr>         <dbl>   <dbl>   <dbl> <chr>   <dbl> <chr>     <dbl>
##  1    40 M     ATA             140     289       0 Normal    172 N           0  
##  2    49 F     NAP             160     180       0 Normal    156 N           1  
##  3    37 M     ATA             130     283       0 ST         98 N           0  
##  4    48 F     ASY             138     214       0 Normal    108 Y           1.5
##  5    54 M     NAP             150     195       0 Normal    122 N           0  
##  6    39 M     NAP             120     339       0 Normal    170 N           0  
##  7    45 F     ATA             130     237       0 Normal    170 N           0  
##  8    54 M     ATA             110     208       0 Normal    142 N           0  
##  9    37 M     ASY             140     207       0 Normal    130 Y           1.5
## 10    48 F     ATA             120     284       0 Normal    120 N           0  
## # … with 908 more rows, 2 more variables: ST_Slope <chr>, HeartDisease <dbl>,
## #   and abbreviated variable names ¹ChestPainType, ²RestingBP, ³Cholesterol,
## #   ⁴FastingBS, ⁵RestingECG, ⁶ExerciseAngina

Checking the dimension of the dataset

dim(heart)

## [1] 918  12

Finding the data type of each column

str(heart)

## spc_tbl_ [918 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age           : num [1:918] 40 49 37 48 54 39 45 54 37 48 ...
##  $ Sex           : chr [1:918] "M" "F" "M" "F" ...
##  $ ChestPainType : chr [1:918] "ATA" "NAP" "ATA" "ASY" ...
##  $ RestingBP     : num [1:918] 140 160 130 138 150 120 130 110 140 120 ...
##  $ Cholesterol   : num [1:918] 289 180 283 214 195 339 237 208 207 284 ...
##  $ FastingBS     : num [1:918] 0 0 0 0 0 0 0 0 0 0 ...
##  $ RestingECG    : chr [1:918] "Normal" "Normal" "ST" "Normal" ...
##  $ MaxHR         : num [1:918] 172 156 98 108 122 170 170 142 130 120 ...
##  $ ExerciseAngina: chr [1:918] "N" "N" "N" "Y" ...
##  $ Oldpeak       : num [1:918] 0 1 0 1.5 0 0 0 0 1.5 0 ...
##  $ ST_Slope      : chr [1:918] "Up" "Flat" "Up" "Flat" ...
##  $ HeartDisease  : num [1:918] 0 1 0 1 0 0 0 0 1 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_double(),
##   ..   Sex = col_character(),
##   ..   ChestPainType = col_character(),
##   ..   RestingBP = col_double(),
##   ..   Cholesterol = col_double(),
##   ..   FastingBS = col_double(),
##   ..   RestingECG = col_character(),
##   ..   MaxHR = col_double(),
##   ..   ExerciseAngina = col_character(),
##   ..   Oldpeak = col_double(),
##   ..   ST_Slope = col_character(),
##   ..   HeartDisease = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Checking Missing values in each column

heart %>%
  summarise_all(~ sum(is.na(.)))

## # A tibble: 1 × 12
##     Age   Sex ChestPainT…¹ Resti…² Chole…³ Fasti…⁴ Resti…⁵ MaxHR Exerc…⁶ Oldpeak
##   <int> <int>        <int>   <int>   <int>   <int>   <int> <int>   <int>   <int>
## 1     0     0            0       0       0       0       0     0       0       0
## # … with 2 more variables: ST_Slope <int>, HeartDisease <int>, and abbreviated
## #   variable names ¹ChestPainType, ²RestingBP, ³Cholesterol, ⁴FastingBS,
## #   ⁵RestingECG, ⁶ExerciseAngina

No missing values were found in the dataset.

Next, we divide the features into two groups. Categorical features: Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope, HeartDisease.

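# An attribute with fewer than six distinct values is treated as categorical
is_categorical <- sapply(heart, function(x) n_distinct(x) < 6)

# Create a data frame with only numerical columns
heart_num <- heart[, !is_categorical]

# Create a data frame with only categorical columns
heart_cat <- heart[, is_categorical]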

Numerical features: Age, RestingBP, Cholesterol, MaxHR, Oldpeak. An attribute with fewer than six distinct values is treated as a categorical feature; otherwise it is treated as a numerical feature.

Exploratory Data Analysis

Exploring Overall Distribution of data

# Create a bar plot to visualize the percentages
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

heart_disease_percent <- prop.table(table(heart$HeartDisease))

# Pie chart of heart disease cases, split by sex
bp <- ggplot(heart, aes(x = "", y = HeartDisease, fill = Sex)) +
  geom_bar(width = 1, stat = "identity")
pie <- bp + coord_polar("y", start = 0)
pie + scale_fill_manual(values = c("#E69F00", "#56B4E9"))

# Create a frequency table for the output class
table(heart$HeartDisease)

## 
##   0   1 
## 410 508

# Create a frequency table for the 'Sex' variable
table(heart$Sex, heart$HeartDisease)

##    
##       0   1
##   F 143  50
##   M 267 458

This dataset contains far more observations of males (725) than of females (193), so we need to keep this in mind. There are about 9 times as many males with heart disease (458) as females with heart disease (50). Overall, the two classes are fairly balanced: 508 patients (55%) have heart disease and 410 (45%) do not.
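The proportions behind these statements can be checked from the printed frequency tables. The following is a minimal sketch that rebuilds the Sex by HeartDisease table by hand from the counts shown above; the object names (`sex_hd`, `rate_by_sex` and so on) are only illustrative and are not part of the original analysis.

```r
# Counts copied from the table(heart$Sex, heart$HeartDisease) output
sex_hd <- matrix(c(143, 267, 50, 458), nrow = 2,
                 dimnames = list(Sex = c("F", "M"), HeartDisease = c("0", "1")))

# Share of each output class in the whole dataset
class_share <- colSums(sex_hd) / sum(sex_hd)

# Heart disease rate within each sex (row-wise proportions)
rate_by_sex <- prop.table(sex_hd, margin = 1)[, "1"]

# Share of men among all heart disease cases
male_share_of_cases <- sex_hd["M", "1"] / sum(sex_hd[, "1"])

round(class_share, 3)
round(rate_by_sex, 3)
round(male_share_of_cases, 3)
```

Comparing rates within each sex, rather than raw counts, keeps the larger number of men in the sample from exaggerating the difference.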

Categorical Feature Analysis

Target variable visualisation. The observations are listed after the FastingBS plot.

ggplot(heart, aes(x=Sex, fill=factor(HeartDisease))) +
  geom_bar(position = "dodge") +
  ggtitle("Sex vs Heart Disease")

ggplot(heart, aes(x=ChestPainType, fill=factor(HeartDisease))) +
  geom_bar(position = "dodge") +
  ggtitle("ChestPainType vs Heart Disease")

ggplot(heart, aes(x=RestingECG, fill=factor(HeartDisease))) +
  geom_bar(position = "dodge") +
  ggtitle("RestingECG vs Heart Disease")

ggplot(heart, aes(x=ExerciseAngina, fill=factor(HeartDisease))) +
  geom_bar(position = "dodge") +
  ggtitle("ExerciseAngina vs Heart Disease")

ggplot(heart, aes(x=ST_Slope, fill=factor(HeartDisease))) +
  geom_bar(position = "dodge") +
  ggtitle("ST_Slope vs Heart Disease")

ggplot(heart, aes(x=FastingBS, fill=factor(HeartDisease))) +
  geom_bar(position = "dodge") +
  ggtitle("FastingBS vs Heart Disease")

Observations:
1. Sex: Men are about 2.4 times as likely as women to have heart disease in this dataset (458 of 725 men, 63%, versus 50 of 193 women, 26%).
2. ChestPainType: There are about 500 cases of asymptomatic (ASY) chest pain, which means that roughly half of the patients reported no chest pain symptoms.
3. RestingECG: Patients with ST-T wave abnormalities are more likely to have heart disease than the others.
4. ExerciseAngina: Patients with exercise-induced angina (“Yes”) are about 2.4 times more likely to have heart disease than patients without it (“No”).
5. ST_Slope: The slope of the peak exercise ST segment varies. Patients with an upsloping segment (Up) are considerably less likely to have heart disease than patients with a flat or downsloping segment.
6. FastingBS: Patients with elevated fasting blood sugar (FastingBS = 1) are more likely to have heart disease than patients without it.
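Statements such as "about 2.4 times more likely" come from comparing the heart disease rate between the levels of a feature. Here is a minimal sketch of that calculation. The helper `disease_rate_by()` and the toy data frame are hypothetical stand-ins, so the numbers do not reproduce the real dataset.

```r
# Hypothetical helper: heart disease rate within each level of a categorical column
disease_rate_by <- function(data, column, target = "HeartDisease") {
  sapply(split(data[[target]], data[[column]]), mean)
}

# Toy data standing in for the real heart data frame
toy <- data.frame(
  ExerciseAngina = c("Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", "N"),
  HeartDisease   = c(1, 1, 1, 0, 1, 0, 0, 0, 1, 0)
)

rates <- disease_rate_by(toy, "ExerciseAngina")
rates

# How many times higher the rate is for "Y" than for "N"
rate_ratio <- rates[["Y"]] / rates[["N"]]
rate_ratio
```

On the real data the call would be `disease_rate_by(heart, "ExerciseAngina")`, and likewise for the other categorical columns.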

Numerical Features Analysis:

Detecting outliers. The ranges are given after the visualisations.

# Box plot for the 'Age' variable
ggplot(heart, aes(x=factor(HeartDisease), y=Age)) +
  geom_boxplot() +
  ggtitle("Age vs Heart Disease")

# Box plot for the 'RestingBP' variable
ggplot(heart, aes(x=factor(HeartDisease), y=RestingBP)) +
  geom_boxplot() +
  ggtitle("RestingBP vs Heart Disease")

# Box plot for the 'Cholesterol' variable
ggplot(heart, aes(x=factor(HeartDisease), y=Cholesterol)) +
  geom_boxplot() +
  ggtitle("Cholesterol vs Heart Disease")

# Box plot for the 'MaxHR' variable
ggplot(heart, aes(x=factor(HeartDisease), y=MaxHR)) +
  geom_boxplot() +
  ggtitle("MaxHR vs Heart Disease")

# Box plot for the 'Oldpeak' variable
ggplot(heart, aes(x=factor(HeartDisease), y=Oldpeak)) +
  geom_boxplot() +
  ggtitle("Oldpeak vs Heart Disease")

Observations: Patients with heart disease mostly fall in the following ranges:
Age : 50+
RestingBP : 95 - 170
Cholesterol : 0 - 600 (a value of 0 is not a plausible measurement and most likely stands for a missing value)
MaxHR : 70 - 180
Oldpeak : 0 - 4
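The points drawn beyond the whiskers of a box plot can also be listed numerically. This is a minimal sketch on made-up values, not on the real column. It uses `boxplot.stats()` from base R, which applies the usual 1.5 × IQR whisker rule (its hinges can differ slightly from the quartiles that `geom_boxplot()` uses).

```r
# Toy resting blood pressure values; the 0 mimics an impossible reading
resting_bp <- c(120, 130, 125, 140, 135, 0, 128, 132)

# Values lying more than 1.5 * IQR beyond the hinges
outliers <- boxplot.stats(resting_bp)$out
outliers

# Keep only the values inside the whiskers
resting_bp_clean <- resting_bp[!resting_bp %in% outliers]
range(resting_bp_clean)
```

Whether such values should be dropped, capped or recoded as missing is a judgment call that depends on the feature.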

Checking Impact of numerical feature on output

ggplot(data = heart, aes(x= Age)) +
  stat_count(color = "darkblue", fill = "lightblue") +
  labs(title = "Age Distribution", x = "Age", y = "Count") +
  scale_x_log10()

ggplot(data = heart, aes(x= RestingBP)) +
  stat_count(color = "darkblue", fill = "lightblue") +
  labs(title = "RestingBP Histogram Plot", x = "RestingBP", y = "Count")  +
  scale_x_log10()

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 1 rows containing non-finite values (`stat_count()`).

ggplot(data = heart, aes(x= Cholesterol)) +
  stat_count(color = "darkblue", fill = "lightblue") +
  labs(title = "Cholesterol Histogram Plot", x = "Cholesterol", y = "Count")   +
  scale_x_log10()

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 172 rows containing non-finite values (`stat_count()`).

ggplot(data = heart, aes(x= MaxHR)) +
  stat_count(color = "darkblue", fill = "lightblue") +
  labs(title = "MaxHR Histogram Plot", x = "MaxHR", y = "Count")  +
  scale_x_log10()

ggplot(data = heart, aes(x= Oldpeak)) +
  stat_count(color = "darkblue", fill = "lightblue") +
  labs(title = "Oldpeak Histogram Plot", x = "Oldpeak", y = "Count")

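The maximum density is found in people aged 50 to 65, with a mean age of 53.5. The distribution of Oldpeak is right-skewed. The distribution of Cholesterol is bimodal: besides the main peak there is a second peak at 0, which corresponds to the 172 rows with a Cholesterol value of 0 that the log scale removed.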
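The warnings printed for the RestingBP and Cholesterol plots arise because `log10(0)` is `-Inf`, so rows with a value of 0 are dropped from a log-scaled axis. One common way to handle such values is to recode them as missing before summarising. This is only a sketch on toy values; treating the zeros as missing measurements is an assumption, not something the dataset documents.

```r
# Toy cholesterol values; the zeros mimic the placeholder entries
cholesterol <- c(289, 180, 0, 214, 0, 339, 237)

# log10(0) is -Inf, which is why a log-scaled plot drops these rows
log10(0)

# Count the zeros, then recode them as missing (assumption: 0 means "not measured")
n_zero <- sum(cholesterol == 0)
cholesterol_clean <- replace(cholesterol, cholesterol == 0, NA)

# The mean with and without the zeros
mean(cholesterol)
mean(cholesterol_clean, na.rm = TRUE)
```

Leaving the zeros in pulls the mean down and creates an artificial second mode in the distribution.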

Bi-variate Analysis

ggplot(heart, aes(x=Age, y=HeartDisease, color=Sex)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  labs(title="Age vs Heart disease")

## `geom_smooth()` using formula = 'y ~ x'

It shows a clear trend: the share of patients with heart disease increases with age.

ggplot(heart, aes(x=MaxHR, y=HeartDisease, color=Sex)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  labs(title="Maximum Heart Rate Vs Heart Disease")

## `geom_smooth()` using formula = 'y ~ x'

As expected, those with cardiac disease had lower maximum heart rates. This might be due to the diseases themselves, which could lead to heart degeneration.

ggplot(heart, aes(x=Age, y=MaxHR, color=Sex)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  labs(title = "Age vs Maximum Heart Rate")

## `geom_smooth()` using formula = 'y ~ x'

This is not directly associated with the heart disease but still good to know: People’s maximum heart rates normally decline with age since their hearts aren’t as strong as they were in their youth.

ggplot(heart, aes(x=Age, y=RestingBP, color=Sex)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  labs(title = "Age vs Resting blood pressure")

## `geom_smooth()` using formula = 'y ~ x'

As patients get older, their resting blood pressure also tends to increase.

ggplot(heart, aes(x=Cholesterol, y=HeartDisease, 
                                    color=Sex)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  labs(title="Cholesterol Levels Vs Heart Diseases")

## `geom_smooth()` using formula = 'y ~ x'

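Patients with heart disease did not show higher cholesterol levels than patients without it. This is unexpected, because high cholesterol is a well-known risk factor for heart disease; the 172 rows with a Cholesterol value of 0 may be distorting the picture.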

ggplot(heart, aes(x=FastingBS, y=HeartDisease, color=Sex)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  labs(title = "Sugar Blood Level Vs Heart Diseases")

## `geom_smooth()` using formula = 'y ~ x'

As expected, patients with elevated fasting blood sugar (FastingBS = 1) have heart disease more often.

ggplot(heart, aes(x=Oldpeak, y=HeartDisease, color=Sex)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  labs(title = "Oldpeak vs HeartDisease")

## `geom_smooth()` using formula = 'y ~ x'

The trend is very clear: patients with heart disease have a higher Oldpeak (a numeric measure of ST depression).

ggplot(heart, aes(x=RestingBP, y=HeartDisease, color=Sex)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  labs(title = "RestingBP vs HeartDisease")

## `geom_smooth()` using formula = 'y ~ x'

Patients with higher resting blood pressure have heart disease more often.
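Because HeartDisease only takes the values 0 and 1, a straight line fitted with `method = lm` can predict values below 0 or above 1. A logistic fit keeps the predictions between 0 and 1. The sketch below illustrates the difference on simulated data; the variable names and the coefficients used to simulate are made up.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Simulated stand-in for the real data: disease risk rises with age
age <- round(runif(300, min = 28, max = 77))
heart_disease <- rbinom(300, size = 1, prob = plogis(-6 + 0.12 * age))
sim <- data.frame(age, heart_disease)

# Straight-line fit, as drawn by geom_smooth(method = lm)
fit_lm <- lm(heart_disease ~ age, data = sim)
# Logistic fit, which models the probability of disease directly
fit_glm <- glm(heart_disease ~ age, data = sim, family = binomial())

# Predictions over a wide age grid
grid <- data.frame(age = seq(0, 110, by = 5))
p_lm <- predict(fit_lm, newdata = grid)
p_glm <- predict(fit_glm, newdata = grid, type = "response")

range(p_lm)   # a straight line is not confined to [0, 1]
range(p_glm)  # logistic predictions are probabilities
```

In ggplot2 the corresponding smoother is `geom_smooth(method = "glm", method.args = list(family = binomial), se = FALSE)`.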

Multivariate Analysis

# Scatter plot for the 'FastingBS' and 'Cholesterol' variables
ggplot(heart, aes(x=FastingBS, y=Cholesterol, color=factor(HeartDisease))) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  ggtitle("Fasting blood sugar level vs Cholesterol vs Heart Disease")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(heart, aes(x=Cholesterol, y=Oldpeak, color=factor(HeartDisease))) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  ggtitle("Cholesterol vs Oldpeak vs Heart Disease")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(heart, aes(x=RestingBP, y=MaxHR, color=factor(HeartDisease))) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  ggtitle("Resting Blood Pressure vs Maximum Heart Rate vs Heart Disease")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(heart, aes(x=Age, y=RestingBP, color=factor(HeartDisease))) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  ggtitle("Age vs Resting Blood Pressure vs Heart Disease")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(heart, aes(x=Age, y=Cholesterol, color=factor(HeartDisease))) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  ggtitle("Age vs Cholesterol level vs Heart Disease")

## `geom_smooth()` using formula = 'y ~ x'

Patients are more likely to have heart disease if their maximum heart rate is lower and their resting blood pressure is higher.
As seen earlier, resting blood pressure tends to rise with age, and older patients also have a higher risk of heart disease.

# Fit a classification tree using all of the variables
library(rpart)
library(rpart.plot)
# method = "class" makes rpart treat the 0/1 outcome as a class label
model <- rpart(HeartDisease ~ ., data = heart, method = "class")
rpart.plot(model)
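A fitted rpart tree also stores an importance score for each variable, which bears directly on the aim of finding the most important predictors. The following is a minimal sketch of how to read those scores. It uses the `kyphosis` data that ships with rpart as a stand-in, so the variables shown are not those of the heart dataset.

```r
library(rpart)

# kyphosis ships with rpart and stands in for the heart data here
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Importance scores accumulated over the splits, largest first
importance <- sort(fit$variable.importance, decreasing = TRUE)

# Expressed as a share of the total importance
round(importance / sum(importance), 3)
```

For the tree fitted above, the analogous call would be `model$variable.importance`.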

Statistical testing

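A chi-square test is used to check whether there is a significant association between HeartDisease and each of several features. Note that Cholesterol, MaxHR and Age are numerical features with many distinct values, so their contingency tables are sparse and R warns that the chi-squared approximation may be incorrect.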

# Chi-square test
library(vcd)

## Loading required package: grid

chisq.test(table(heart$Cholesterol, heart$HeartDisease))

## Warning in chisq.test(table(heart$Cholesterol, heart$HeartDisease)):
## Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  table(heart$Cholesterol, heart$HeartDisease)
## X-squared = 332.19, df = 221, p-value = 1.86e-06

chisq.test(table(heart$MaxHR, heart$HeartDisease))

## Warning in chisq.test(table(heart$MaxHR, heart$HeartDisease)): Chi-squared
## approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  table(heart$MaxHR, heart$HeartDisease)
## X-squared = 241.32, df = 118, p-value = 1.722e-10

chisq.test(table(heart$FastingBS, heart$HeartDisease))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(heart$FastingBS, heart$HeartDisease)
## X-squared = 64.321, df = 1, p-value = 1.057e-15

chisq.test(table(heart$Age, heart$HeartDisease))

## Warning in chisq.test(table(heart$Age, heart$HeartDisease)): Chi-squared
## approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  table(heart$Age, heart$HeartDisease)
## X-squared = 125.37, df = 49, p-value = 1.282e-08

Observations: All p-values are below 0.05, which indicates a significant association between heart disease and each of these features: cholesterol level, maximum heart rate, age and fasting blood sugar level.
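The "Chi-squared approximation may be incorrect" warnings appear because a numerical feature with many distinct values produces a table with very small cell counts. Two common alternatives are to bin the feature into a few bands before the chi-square test, or to compare the numeric values between the two groups with a rank-based test. Both are sketched below on simulated data; the age bands and simulation coefficients are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed for a reproducible simulation

# Simulated stand-in: 400 patients, risk rising with age
age <- round(runif(400, min = 30, max = 75))
heart_disease <- rbinom(400, size = 1, prob = plogis(-5 + 0.1 * age))

# Option 1: bin age into a few bands so every cell has enough observations
age_band <- cut(age, breaks = c(29, 45, 55, 65, 75))
band_table <- table(age_band, heart_disease)
band_test <- chisq.test(band_table)

# Option 2: compare the numeric values between the two groups directly
rank_test <- wilcox.test(age ~ heart_disease)

band_table
band_test$p.value
rank_test$p.value
```

With only four bands the table has eight cells, so the expected counts are large enough for the chi-squared approximation.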

For some reason I am not able to knit the document with this chunk, so I have commented it out. When the comment markers are removed, the model works fine.

# Split the data into training and test sets
#library(caret)  # createDataPartition() comes from the caret package
#set.seed(123)
#split_index <- createDataPartition(heart$HeartDisease, p = 0.7, list = FALSE)
#train_data <- heart[split_index, ]
#test_data <- heart[-split_index, ]

# Fit a logistic regression model
#model <- glm(HeartDisease ~ Age + Sex + ChestPainType + RestingBP + Cholesterol + FastingBS + RestingECG + MaxHR + ExerciseAngina + Oldpeak + ST_Slope, data = train_data, family = binomial())

# Summary of the model
#summary(model)

Observations: Age, RestingBP, FastingBS, ExerciseAngina (Yes), Oldpeak and ST_Slope (Flat) are positively associated with heart disease in the model, whereas the other features are negatively associated with it.
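The direction of each association can be read from the odds ratios, that is, the exponentiated coefficients: a value above 1 raises the odds of heart disease and a value below 1 lowers them. Below is a minimal sketch on simulated data. It uses only base R for the 70/30 split, and the two predictors and their coefficients are made up. It is not the model fitted above.

```r
set.seed(123)  # same seed as in the commented-out chunk

# Simulated stand-in with two predictors (coefficients are made up)
n <- 500
sim <- data.frame(
  Age     = round(runif(n, 30, 75)),
  Oldpeak = round(rexp(n, rate = 1), 1)
)
sim$HeartDisease <- rbinom(n, 1, plogis(-4 + 0.05 * sim$Age + 0.9 * sim$Oldpeak))

# 70/30 split using base R only
train_idx  <- sample(seq_len(n), size = floor(0.7 * n))
train_data <- sim[train_idx, ]
test_data  <- sim[-train_idx, ]

fit <- glm(HeartDisease ~ Age + Oldpeak, data = train_data, family = binomial())

# Odds ratios: values above 1 raise the odds of heart disease
odds_ratios <- exp(coef(fit))

# Accuracy on the held-out rows with a 0.5 cut-off
predicted <- as.integer(predict(fit, newdata = test_data, type = "response") > 0.5)
accuracy  <- mean(predicted == test_data$HeartDisease)

round(odds_ratios, 2)
accuracy
```

Evaluating on held-out rows gives a more honest picture of the model than the fit on the training data alone.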

Conclusion:

According to the data, a striking 90% of all heart disease patients are men, although men also make up about 79% of the dataset. 77% of patients with the ASY chest pain type have heart disease, which is the majority. Patients with exercise-induced angina (‘Yes’) are about 2.4 times more likely to have heart disease. A fasting blood sugar level above 120 mg/dl indicates a greater risk of heart disease. For the numerical features, patients with heart disease mostly fall in these ranges: Age: 50+, Resting Blood Pressure: 95–170, Cholesterol: 0–600, Maximum Heart Rate: 70–180, Oldpeak: 0–4. According to our analysis of the various features, Age, Maximum Heart Rate, Cholesterol and Fasting Blood Sugar level are the most significant predictors of heart disease.

The goal of researching the association between these characteristics and heart disease is to better understand the potential risk each feature poses and to build reliable models for predicting heart disease. There were a few difficulties with this analysis, including:
1. Getting a big, diverse dataset that appropriately represents the community of interest is one of the biggest obstacles.
2. The quality of the data gathered is a significant difficulty; in our case, inaccurate data in the cholesterol variable might have had a substantial impact on the findings.
3. Selecting the most relevant factors is another issue, as the number of variables can be fairly large and not all of them may be useful for the classification of heart disease.
4. I have only used a logistic regression model in this example, but choosing the best model for this study is difficult because it can have a big impact on the outcomes.
5. Finally, applying domain knowledge from the medical field to the outcomes.

One mistake made during the analysis was not dealing properly with the outliers before fitting the model, for example by applying dimensionality reduction or another technique. If I had more time to complete the research, I would have conducted a more thorough analysis of each variable’s distribution, as well as a deeper investigation into the association between each variable and heart disease using a random forest model and other sophisticated statistical techniques.

Other research questions I would like to look at are:
1. To determine the relationship between different variables and the likelihood of heart disease.
2. To discover the most important predictors of heart disease using a machine learning technique.
3. To investigate the effect of demographic parameters, such as age and gender, on the risk of heart disease.
4. To uncover possible early warning indicators for heart disease using predictive modelling tools.

The same research question(s) could have been addressed differently if the data had contained:
1. The smoking habits of the patients.
2. Variables such as whether patients were on any medication or had any other disease.
3. Occupation and more demographic data, which would give more insight.
4. Data on eating habits, which could also be useful.
5. Data on whether they drink alcohol or not.

