The heart data set

Source: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset

This data set consists of 14 variables and 1,025 observations. It dates from 1988 and combines four databases: Cleveland, Hungary, Switzerland, and Long Beach V. The original data contain 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. Because there is no readme file describing how the data were collected, we cannot tell whether the data themselves are biased. The source does specify that patients’ names and social security numbers were removed to protect their privacy. There are also no NA or missing values in this data set, which makes the cleaning process much easier.

Heart disease is the leading cause of death for men, women, and people of most racial and ethnic groups in the United States. According to CDC statistics, one person dies every 34 seconds in the United States from cardiovascular disease. Because heart disease is also part of my family history, I was very interested in researching this topic further and finding the leading factors behind it. I also find the heart a fascinating organ: it sustains life for the entire human body, yet it can just as easily take it away.

This project examines heart disease status and how it relates to a range of variables. In particular, we seek to understand the role that age, maximum heart rate, sex, and blood pressure play in the occurrence of heart disease.

Metadata

age - Patient’s age in years - Numerical

sex - Gender - Nominal

cp - Type of chest pain - Nominal - (0) typical angina, (1) atypical angina, (2) non-anginal pain, (3) asymptomatic

trestbps - Resting blood pressure - Numerical

chol - Serum cholesterol (mg/dl) - Numerical

fbs - Fasting blood sugar higher than 120 mg/dl - Nominal

restecg - Resting electrocardiographic results - Nominal

thalach - Maximum heart rate achieved - Numerical

exang - Exercise induced angina - Nominal

oldpeak - ST depression induced by exercise relative to rest - Numerical

slope - The slope of the peak exercise ST segment - Nominal

ca - Number of major vessels colored by fluoroscopy - Nominal

thal - Thalassemia - Nominal

target - Diagnosis of heart disease - Nominal

Load the libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(RColorBrewer)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggplot2)
library(ggthemes)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

Import the dataset

# heart.csv is read from the current working directory; check it with getwd()
heart_data <- read_csv("heart.csv")
## Rows: 1025 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
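
The column-specification message above goes away once the column types are declared up front. Here is a minimal sketch of that idea in base R. It uses a few rows copied from the first rows of the data as an inline stand-in for heart.csv; `csv_text` and `heart_sample` are illustrative names and are not part of the analysis.

```r
# Inline stand-in for heart.csv: three rows and four of the 14 columns,
# copied from the first rows of the data printed later in this report.
csv_text <- "age,sex,trestbps,target
52,1,125,0
53,1,140,0
58,0,100,1"

# Declare every column type explicitly instead of letting the reader guess.
heart_sample <- read.csv(
  text = csv_text,
  colClasses = c(age = "numeric", sex = "integer",
                 trestbps = "numeric", target = "integer")
)

str(heart_sample)
```

With readr, the analogous argument of `read_csv()` is `col_types`.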

Factor and clean the data

# See how many rows there are in the data frame
NROW(heart_data)
## [1] 1025
# convert sex to factor
heart_data$sex <- heart_data$sex %>% factor(labels = c("Female","Male"))

# convert target to a logical variable
heart_data$target <- heart_data$target %>% as.logical()

# check for any NA's
sum(is.na(heart_data))
## [1] 0
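
The recoding step relies on the codes being in the order 0 = female, 1 = male, as the labels used above imply. The sketch below shows a slightly more defensive variant on a made-up mini data frame (`toy` is hypothetical), naming the levels and labels together so the mapping does not depend on which values happen to be present.

```r
# Hypothetical mini data frame using the same 0/1 coding as heart.csv.
toy <- data.frame(sex = c(1, 1, 0, 1, 0), target = c(0, 0, 1, 1, 1))

# Naming levels and labels together keeps the mapping explicit:
# 0 -> "Female", 1 -> "Male".
toy$sex <- factor(toy$sex, levels = c(0, 1), labels = c("Female", "Male"))

# as.logical() maps 0 to FALSE and any non-zero number to TRUE.
toy$target <- as.logical(toy$target)

table(toy$sex, toy$target)
```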

Analyze data

summary(heart_data) # read the summary stats of the data 
##       age            sex            cp            trestbps          chol    
##  Min.   :29.00   Female:312   Min.   :0.0000   Min.   : 94.0   Min.   :126  
##  1st Qu.:48.00   Male  :713   1st Qu.:0.0000   1st Qu.:120.0   1st Qu.:211  
##  Median :56.00                Median :1.0000   Median :130.0   Median :240  
##  Mean   :54.43                Mean   :0.9424   Mean   :131.6   Mean   :246  
##  3rd Qu.:61.00                3rd Qu.:2.0000   3rd Qu.:140.0   3rd Qu.:275  
##  Max.   :77.00                Max.   :3.0000   Max.   :200.0   Max.   :564  
##       fbs            restecg          thalach          exang       
##  Min.   :0.0000   Min.   :0.0000   Min.   : 71.0   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:132.0   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :152.0   Median :0.0000  
##  Mean   :0.1493   Mean   :0.5298   Mean   :149.1   Mean   :0.3366  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :2.0000   Max.   :202.0   Max.   :1.0000  
##     oldpeak          slope             ca              thal      
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.800   Median :1.000   Median :0.0000   Median :2.000  
##  Mean   :1.072   Mean   :1.385   Mean   :0.7541   Mean   :2.324  
##  3rd Qu.:1.800   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :6.200   Max.   :2.000   Max.   :4.0000   Max.   :3.000  
##    target       
##  Mode :logical  
##  FALSE:499      
##  TRUE :526      
##                 
##                 
## 
str(heart_data) # view the structure of the data
## spc_tbl_ [1,025 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age     : num [1:1025] 52 53 70 61 62 58 58 55 46 54 ...
##  $ sex     : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 2 2 2 2 ...
##  $ cp      : num [1:1025] 0 0 0 0 0 0 0 0 0 0 ...
##  $ trestbps: num [1:1025] 125 140 145 148 138 100 114 160 120 122 ...
##  $ chol    : num [1:1025] 212 203 174 203 294 248 318 289 249 286 ...
##  $ fbs     : num [1:1025] 0 1 0 0 1 0 0 0 0 0 ...
##  $ restecg : num [1:1025] 1 0 1 1 1 0 2 0 0 0 ...
##  $ thalach : num [1:1025] 168 155 125 161 106 122 140 145 144 116 ...
##  $ exang   : num [1:1025] 0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num [1:1025] 1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
##  $ slope   : num [1:1025] 2 0 0 2 1 1 0 1 2 1 ...
##  $ ca      : num [1:1025] 2 0 0 1 3 0 3 1 0 2 ...
##  $ thal    : num [1:1025] 3 3 3 3 2 2 1 3 3 2 ...
##  $ target  : logi [1:1025] FALSE FALSE FALSE FALSE FALSE TRUE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   sex = col_double(),
##   ..   cp = col_double(),
##   ..   trestbps = col_double(),
##   ..   chol = col_double(),
##   ..   fbs = col_double(),
##   ..   restecg = col_double(),
##   ..   thalach = col_double(),
##   ..   exang = col_double(),
##   ..   oldpeak = col_double(),
##   ..   slope = col_double(),
##   ..   ca = col_double(),
##   ..   thal = col_double(),
##   ..   target = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
dim(heart_data) # view the dimensions of the data
## [1] 1025   14
head(heart_data)
## # A tibble: 6 × 14
##     age sex       cp trestbps  chol   fbs restecg thalach exang oldpeak slope
##   <dbl> <fct>  <dbl>    <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1    52 Male       0      125   212     0       1     168     0     1       2
## 2    53 Male       0      140   203     1       0     155     1     3.1     0
## 3    70 Male       0      145   174     0       1     125     1     2.6     0
## 4    61 Male       0      148   203     0       1     161     0     0       2
## 5    62 Female     0      138   294     1       1     106     0     1.9     1
## 6    58 Female     0      100   248     0       0     122     0     1       1
## # … with 3 more variables: ca <dbl>, thal <dbl>, target <lgl>
tail(heart_data)
## # A tibble: 6 × 14
##     age sex       cp trestbps  chol   fbs restecg thalach exang oldpeak slope
##   <dbl> <fct>  <dbl>    <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1    47 Male       0      112   204     0       1     143     0     0.1     2
## 2    59 Male       1      140   221     0       1     164     1     0       2
## 3    60 Male       0      125   258     0       0     141     1     2.8     1
## 4    47 Male       0      110   275     0       0     118     1     1       1
## 5    50 Female     0      110   254     0       0     159     0     0       2
## 6    54 Male       0      120   188     0       1     113     0     1.4     1
## # … with 3 more variables: ca <dbl>, thal <dbl>, target <lgl>

Let’s see how many people have heart disease according to this data set

disease <- prop.table(table(heart_data$target)) # proportion of patients with and without heart disease
disease
## 
##     FALSE      TRUE 
## 0.4868293 0.5131707

About 49% of the patients do not have heart disease, and about 51% do.

Rename variables

hearts_data <- rename(heart_data, maximum_heart_rate = thalach, resting_blood_pressure = trestbps) # rename variables so they are easier to read

Statistical analysis of heart disease according to gender, maximum heart rate, and blood pressure

# bar graph of the proportion of patients with and without heart disease by gender
gender <- ggplot(hearts_data, mapping = aes(x = sex, fill = target)) +
  geom_bar(position = "fill") + # scale each bar to 1 so the y axis shows proportions
  guides(fill = guide_legend(reverse = TRUE)) + # flip the legend so TRUE comes first
  scale_fill_manual(values = c("pink", "purple")) + # add colors
  ggtitle("Gender and Heart Disease") +
  xlab("Sex") + ylab("Proportion") +
  theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size = 10), # change font size
        axis.title.y = element_text(size = 10), axis.text.y = element_text(size = 10))

# box plots of resting blood pressure for patients with and without heart disease
blood_pressure <- ggplot(hearts_data, mapping = aes(x = resting_blood_pressure, y = target, fill = target)) +
  geom_boxplot() + # one horizontal box per heart disease status
  guides(fill = guide_legend(reverse = TRUE)) + # flip the legend so TRUE comes first
  ggtitle("Blood Pressure and Heart Disease") +
  scale_fill_manual(values = c("pink", "purple")) +
  xlab("Resting blood pressure") + ylab("Heart disease") + # axis titles
  theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size = 10), # change font size
        axis.title.y = element_text(size = 10), axis.text.y = element_text(size = 10))

# box plots of maximum heart rate for patients with and without heart disease
heart_rate <- ggplot(hearts_data, mapping = aes(x = maximum_heart_rate, y = target, fill = target)) +
  geom_boxplot() +
  guides(fill = guide_legend(reverse = TRUE)) + # flip the legend so TRUE comes first
  scale_fill_manual(values = c("pink", "purple")) + # add colors
  ggtitle("Maximum Heart Rate According to Disease") +
  xlab("Maximum heart rate achieved") + ylab("Heart disease") +
  theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size = 10)) # change font size

grid.arrange(gender, blood_pressure, heart_rate, nrow=2)

According to these plots, the bar plot at the top suggests a relationship between sex and heart disease, because the share of patients with heart disease differs between the two sexes. In absolute numbers more males than females have heart disease, but the data set also contains more than twice as many males. The box plots for resting blood pressure are similar for patients with and without heart disease, which suggests that there is probably no strong association between the two. There are only three outliers among the patients with heart disease, which could be explained by very high blood pressure pointing to heart disease. The final box plot does show a link between heart disease and the maximum heart rate achieved during exercise. There is little overlap between the two boxes, and the maximum heart rate tends to be higher in patients with heart disease than in those without it, apart from a few outliers with a very low maximum heart rate. None of the distributions has so many outliers that they would distort our analysis.
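
The comparison of the box plots can be backed up with the numbers a box plot draws. The sketch below computes the median and interquartile range per group with `tapply()`. The values in `toy` are made up for illustration and are not taken from heart.csv.

```r
# Made-up values, for illustration only (not taken from heart.csv).
toy <- data.frame(
  maximum_heart_rate = c(150, 142, 125, 161, 110, 122, 140, 172, 158, 163),
  target = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)
)

# Median and interquartile range per group: the centre line and the box width.
medians <- tapply(toy$maximum_heart_rate, toy$target, median)
iqrs <- tapply(toy$maximum_heart_rate, toy$target, IQR)

medians
iqrs
```

With the real data, the same two calls on `hearts_data$maximum_heart_rate` and `hearts_data$target` would give the values behind the third plot.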

Chest pain according to heart disease

Every year, more than 6.5 million people in the United States visit emergency rooms because of chest pain. However, little research has examined what chest pain means for a person’s future health, according to Dr. Kentaro Ejiri, principal investigator of a study on the subject and a postdoctoral fellow at the Johns Hopkins Bloomberg School of Public Health in Baltimore. Even minor chest discomfort was associated with a long-term risk of cardiac issues, although less so than moderate to severe symptoms.

# Create a new data frame with heart disease and chest pain types
heart_cp <- heart_data %>% 
  select(cp, target) %>% # select chest pain and heart disease
  mutate(target = factor(if_else(target, "Yes", "No")), # target is logical: TRUE -> "Yes", FALSE -> "No"
         cp = as.factor(cp)) # convert chest pain type to a factor
# Plot chest pain according to heart disease using a bar graph
p2 <- ggplot(heart_cp, aes(x = cp, fill = target)) +
  geom_bar(position = "stack") + # stack the bar graph
  scale_fill_manual(values = c("dark blue", "red")) + # include colors
  labs(x = "Chest Pain Type", 
       y = "Number of Individuals",
       title = "Chest Pain According to Heart Disease") +
  theme(plot.title = element_text(hjust = 0.5), # center the title
        plot.subtitle = element_text(hjust = 0.5))
p2 <- ggplotly(p2) # include interactivity through plotly
p2

Except for the first chest pain type (0, typical angina), every chest pain type shows more patients with heart disease than without, which suggests a strong relationship between chest pain type and heart disease.
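
A relationship like this between two categorical variables can be checked with a chi-squared test of independence. The sketch below runs `chisq.test()` on a contingency table of hypothetical counts (`cp_counts` is invented for illustration and does not reproduce the real data).

```r
# Hypothetical counts, for illustration only: rows are chest pain types 0-3,
# columns are heart disease status.
cp_counts <- matrix(
  c(90, 30,
    10, 40,
    20, 60,
     8, 12),
  nrow = 4, byrow = TRUE,
  dimnames = list(cp = c("0", "1", "2", "3"), target = c("No", "Yes"))
)

# Null hypothesis: chest pain type and heart disease status are independent.
result <- chisq.test(cp_counts)

result$p.value   # a small p-value is evidence against independence
result$expected  # counts expected under independence
```

With the real data, the table could come from `table(heart_cp$cp, heart_cp$target)`.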

Serum cholesterol

# histogram of serum cholesterol levels in relation to heart disease
ggplot(heart_data, aes(x = chol, fill = target)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "grey17") + # density scale so the curve overlays the bars
  geom_density(alpha = .2, fill = "yellow") + # fill the area under the density line in yellow
  facet_wrap(~target, ncol = 1, scales = "fixed") + # one panel per heart disease status
  xlab("Cholesterol") + 
  ylab("Density") +
  ggtitle("Serum Cholesterol (mg/dl)") +
  scale_fill_discrete(name = "Heart Disease", labels = c("No", "Yes")) + # levels are FALSE, TRUE in that order
  theme(plot.title = element_text(hjust = 0.5)) # center the title

The cholesterol distributions (mg/dl) of patients with and without heart disease look very similar; the plots show no obvious difference between the two groups.
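
The histograms give a visual impression only. One way to test for a difference between two groups is a two-sample test, sketched below on simulated values. The means and standard deviations are arbitrary assumptions, and `chol_no` and `chol_yes` are hypothetical names.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated cholesterol values (mg/dl); the parameters are arbitrary assumptions.
chol_no  <- rnorm(50, mean = 250, sd = 50)  # patients without heart disease
chol_yes <- rnorm(50, mean = 245, sd = 50)  # patients with heart disease

# Welch two-sample t-test: compares the two group means.
tt <- t.test(chol_yes, chol_no)

# Wilcoxon rank-sum test: a rank-based alternative that is less sensitive to outliers.
wt <- wilcox.test(chol_yes, chol_no)

c(t_test = tt$p.value, wilcoxon = wt$p.value)
```

With the real data, a formula call such as `t.test(chol ~ target, data = heart_data)` would make the same comparison.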

Resting blood pressure and heart disease

Hypertension is a significant public health issue and a documented independent risk factor for cardiovascular disease. Increasing blood pressure is consistently associated with cardiovascular disease.

# plot to show the relationship between resting blood pressure and heart disease
p1 <- ggplot(hearts_data, aes(x = resting_blood_pressure, fill = target)) +
  geom_histogram(bins = 20, col = I("black")) + # black outlines make neighbouring bars easier to tell apart
  scale_fill_manual(values = c("pink", "red")) +
  xlab("Resting Blood Pressure (mmHg)") +
  ylab("Number of Individuals") +
  ggtitle("Resting Blood Pressure in Relation to Heart Disease")
p1 <- ggplotly(p1) # include interactivity through plotly
p1

Although the distributions look similar for patients with and without heart disease, you can still see that the majority of patients with heart disease have elevated blood pressure.

Heart disease according to age and gender

First, we need to see whether males and females are equally represented in the data.

# bar plot to see the number of males and females in the data
barplot(table(heart_data$sex), xlim = c(0, 800), horiz = TRUE, # horizontal bar graph; the limit leaves room for all 713 males
        col = c("pink", "blue"), border = "white", 
        main = "Number of Males and Females in the Data", ylab = "Gender", legend.text = TRUE) # include legend

There are more than twice as many men as women in the data (713 versus 312).

prop.table(table(select(heart_data, target, sex))) # joint proportions of males and females with and without disease
##        sex
## target      Female       Male
##   FALSE 0.08390244 0.40292683
##   TRUE  0.22048780 0.29268293
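
The table above holds joint proportions, so every cell is a share of all 1,025 patients. To compare the sexes directly, the proportions within each sex are more informative. The sketch below rebuilds the counts from the joint proportions (proportion times 1,025) and normalises each column; `counts` and `within_sex` are illustrative names.

```r
# Counts recovered from the joint proportions above (proportion * 1025).
counts <- matrix(
  c( 86, 413,
    226, 300),
  nrow = 2, byrow = TRUE,
  dimnames = list(target = c("FALSE", "TRUE"), sex = c("Female", "Male"))
)

# margin = 2 divides each column by its total: proportions within each sex.
within_sex <- prop.table(counts, margin = 2)

round(within_sex, 3)
```

Within each sex, about 72% of the females and about 42% of the males in this data set have heart disease, even though the absolute number of males with heart disease is larger.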

According to BMJ Global Health, “Men generally develop heart disease at a younger age and have a higher risk of coronary heart disease (CHD) than women. Women, in contrast, are at a higher risk of stroke, which often occurs at older age.” Although cardiovascular disease is the leading cause of mortality globally, there are significant differences between men and women. The apparent modest advantage women have in delaying heart disease to older ages may partly be explained by sex differences in how heart disease presents. Women have less severe obstructive coronary artery disease and warning signs than men, and because their symptoms are often viewed as “atypical,” they frequently go undiagnosed or neglected. As a result, women are less likely than men to be referred for diagnostic and therapeutic procedures or to receive pharmacological treatment for disease risk factors. For women with heart disease, inadequate access to healthcare services causes delays in diagnosis and treatment as well as worse prognoses and outcomes.

# create data frame of age, target, and sex 
age <- heart_data %>% 
  mutate(age = as.factor(age)) %>% # convert age to a factor
  group_by(age, target, sex) %>%  # group by these three variables
  summarise(count = n(), .groups = "drop_last") %>%  # count patients per group; output stays grouped by age and target
  mutate(perc = count / sum(count)) # share of each sex within an age/target group
# bar plots of the distribution of patients according to age and gender
plot3 <- ggplot(age, aes(x = age, y = count, fill = target)) +
  geom_bar(stat = "identity") + xlab("Age") + # use the precomputed counts as bar heights
  scale_x_discrete(breaks = c("30", "35","40","45","50","55","60","65","70","75","80")) + # label only every fifth year
  ylab("Count") +
  ggtitle("Ages With Heart Disease") +
  scale_fill_manual(values = c("green", "red")) +
  theme_minimal(base_size = 10) + # theme for the graph
  theme(plot.title = element_text(hjust = 0.5)) + # center the title
  facet_grid(~sex) # one panel per gender

ggplotly(plot3) # include interactivity through plotly

Conclusion

According to this plot, there are more male than female patients with heart disease, and the highest single-age count of heart disease is also among males. Another thing to notice is that the youngest male with heart disease is 29 years old, while the youngest female with heart disease is 34. Keep in mind that this data set contains more than twice as many men as women, so the two genders are not equally represented. This affects how we interpret the data, because the counts for females might look different with a larger sample. Similarly, as the BMJ Global Health study stated, it is possible that heart disease in women is overlooked at earlier ages, which could be why the youngest female with heart disease is 34 while the youngest male is considerably younger.

Heart Disease Occurrence According to Age, Maximum Heart Rate, and Blood Pressure

According to the NIH, elevated heart rate is associated with elevated blood pressure, increased risk for hypertension, and, among hypertensives, increased risk for cardiovascular disease. Despite these important relationships, heart rate is generally not a major consideration in choosing antihypertensive medications. Heart rate also varies from person to person.

# Scatter plot of heart disease according to age, maximum heart rate and blood pressure
p <- hearts_data %>% 
  plot_ly(x = ~age, y = ~maximum_heart_rate, color = ~target, colors = c("purple", "#1B98E0"), # use plotly to plot the scatter plot
          size = ~resting_blood_pressure, # marker size encodes resting blood pressure
          text = ~paste("<br> age: ", age, # text to show in tooltip
                        "<br> max heart rate: ", maximum_heart_rate, # name the variables
                        "<br> blood pressure: ", resting_blood_pressure),
          hoverinfo = "text", type = "scatter", mode = "markers") %>% 
  layout(title = "Heart Disease Occurrence According to Age, Maximum Heart Rate, and Blood Pressure",
         xaxis = list(title = "Age"), 
         yaxis = list(title = "Maximum Heart Rate"),
         legend = list(title = list(text = "Occurrence of Heart Disease")),
         showlegend = TRUE) # show legend
p
## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.
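
The scatter plot shows three variables at once, which makes it hard to judge each one separately. A logistic regression is one common way to estimate how each variable relates to a yes/no outcome while holding the others fixed. The sketch below fits one with `glm()` on simulated data; the data frame `sim`, the sample size, and the coefficients used to generate the outcome are all arbitrary assumptions and do not describe heart.csv.

```r
set.seed(42)  # make the simulation reproducible
n <- 200

# Simulated predictors; ranges loosely follow the summary statistics above.
sim <- data.frame(
  age = round(runif(n, 29, 77)),
  maximum_heart_rate = round(rnorm(n, 150, 23)),
  resting_blood_pressure = round(rnorm(n, 131, 17))
)

# Simulated 0/1 outcome; the coefficients are arbitrary, chosen for illustration.
linear_part <- -6 + 0.04 * sim$maximum_heart_rate - 0.01 * sim$age
sim$target <- rbinom(n, 1, plogis(linear_part))

# Logistic regression: log-odds of the outcome as a linear function of the predictors.
fit <- glm(target ~ age + maximum_heart_rate + resting_blood_pressure,
           data = sim, family = binomial)

summary(fit)$coefficients
```

With the real data, the same formula could be fitted to `hearts_data`; the coefficients would then be on the log-odds scale, and `exp(coef(fit))` would give odds ratios.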

Conclusion

After looking at some exploratory graphs, I wanted to see how maximum heart rate, age, and blood pressure together relate to heart disease. There is a pattern in which most patients with heart disease also have a high maximum heart rate. Blood pressure does not seem to be a strong factor in determining heart disease, but none of the points for patients with heart disease is very small compared with the many other points, meaning none of them has a very low blood pressure. The 40 to 60 age range contains the most patients with heart disease. Something I would like to explore further is the age at death of these patients, since there seems to be barely anyone over the age of 70 with heart disease. Is this because they do not live to that age, or because few such patients were included in this research? Lastly, in most of my graphs, I wanted to reverse the stack order so that heart disease = TRUE is shown on top of the data with FALSE. There is still much to learn about coding data visualizations, but I am very grateful to Professor Saidi for how much I have been able to learn from her throughout the semester.

Bibliography

Bots, Sophie H, et al. “Sex Differences in Coronary Heart Disease and Stroke Mortality: A Global Assessment of the Effect of Ageing between 1980 and 2010.” BMJ Global Health, BMJ Specialist Journals, 1 Mar. 2017, https://gh.bmj.com/content/2/2/e000298.

Reule, Scott, and Paul E Drawz. “Heart Rate and Blood Pressure: Any Possible Implications for Management of Hypertension?” Current Hypertension Reports, U.S. National Library of Medicine, Dec. 2012, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491126/.

“Chest Pain, Shortness of Breath Linked to Long-Term Risk of Heart Trouble.” www.heart.org, 17 Nov. 2022, https://www.heart.org/en/news/2022/11/02/chest-pain-shortness-of-breath-linked-to-long-term-risk-of-heart-trouble.