Source: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
This data set contains 14 variables and 1025 observations. It dates from 1988 and combines four databases: Cleveland, Hungary, Switzerland, and Long Beach V. The original data contain 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. Because there is no readme file describing how the data were collected, we cannot tell whether the data themselves are biased. The source does state that patient names and social security numbers were removed for privacy. There are also no NAs or missing values, which makes the cleaning process much easier.
Heart disease is the leading cause of death for men, women, and people of most racial and ethnic groups in the United States. According to CDC statistics, one person dies every 34 seconds in the United States from cardiovascular disease. Because heart disease is also part of my family history, I was very interested in researching this topic further and finding the leading factors behind it. I also find the heart a fascinating organ: it sustains the entire human body, yet it can just as easily take life away.
This project examines heart disease status and how it relates to a range of variables. In particular, we seek to understand the roles that age, maximum heart rate, sex (male and female), and blood pressure play in the occurrence of heart disease.
age - Patient’s age in years - Numerical
sex - Gender - Nominal
cp - Type of chest pain - Nominal - (0) typical angina, (1) atypical angina, (2) non-anginal pain, (3) asymptomatic
trestbps - Resting blood pressure - Numerical
chol - Serum cholesterol - Numerical
fbs - Fasting blood sugar higher than 120 mg/dl - Nominal
restecg - Resting electrocardiographic results - Nominal
thalach - Maximum heart rate achieved - Numerical
exang - Exercise induced angina - Nominal
oldpeak - ST depression induced by exercise relative to rest - Numerical
slope - The slope of the peak exercise ST segment - Nominal
ca - Number of major vessels colored by fluoroscopy - Nominal
thal - Thalassemia - Nominal
target - Diagnosis of heart disease - Nominal
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(RColorBrewer)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(ggplot2)
library(ggthemes)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
# setwd("<folder containing heart.csv>") # placeholder path: set the working directory before reading the data
heart_data <- read_csv("heart.csv")
## Rows: 1025 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# See how many rows there are in the data frame
NROW(heart_data)
## [1] 1025
# convert sex to factor
heart_data$sex <- heart_data$sex %>% factor(labels = c("Female","Male"))
# convert target to a logical variable
heart_data$target <- heart_data$target %>% as.logical()
# check for any NA's
sum(is.na(heart_data))
## [1] 0
summary(heart_data) # read the summary stats of the data
## age sex cp trestbps chol
## Min. :29.00 Female:312 Min. :0.0000 Min. : 94.0 Min. :126
## 1st Qu.:48.00 Male :713 1st Qu.:0.0000 1st Qu.:120.0 1st Qu.:211
## Median :56.00 Median :1.0000 Median :130.0 Median :240
## Mean :54.43 Mean :0.9424 Mean :131.6 Mean :246
## 3rd Qu.:61.00 3rd Qu.:2.0000 3rd Qu.:140.0 3rd Qu.:275
## Max. :77.00 Max. :3.0000 Max. :200.0 Max. :564
## fbs restecg thalach exang
## Min. :0.0000 Min. :0.0000 Min. : 71.0 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:132.0 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :152.0 Median :0.0000
## Mean :0.1493 Mean :0.5298 Mean :149.1 Mean :0.3366
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0 3rd Qu.:1.0000
## Max. :1.0000 Max. :2.0000 Max. :202.0 Max. :1.0000
## oldpeak slope ca thal
## Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.800 Median :1.000 Median :0.0000 Median :2.000
## Mean :1.072 Mean :1.385 Mean :0.7541 Mean :2.324
## 3rd Qu.:1.800 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :6.200 Max. :2.000 Max. :4.0000 Max. :3.000
## target
## Mode :logical
## FALSE:499
## TRUE :526
##
##
##
str(heart_data) # view the structure of the data
## spc_tbl_ [1,025 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:1025] 52 53 70 61 62 58 58 55 46 54 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 2 2 2 2 ...
## $ cp : num [1:1025] 0 0 0 0 0 0 0 0 0 0 ...
## $ trestbps: num [1:1025] 125 140 145 148 138 100 114 160 120 122 ...
## $ chol : num [1:1025] 212 203 174 203 294 248 318 289 249 286 ...
## $ fbs : num [1:1025] 0 1 0 0 1 0 0 0 0 0 ...
## $ restecg : num [1:1025] 1 0 1 1 1 0 2 0 0 0 ...
## $ thalach : num [1:1025] 168 155 125 161 106 122 140 145 144 116 ...
## $ exang : num [1:1025] 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num [1:1025] 1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
## $ slope : num [1:1025] 2 0 0 2 1 1 0 1 2 1 ...
## $ ca : num [1:1025] 2 0 0 1 3 0 3 1 0 2 ...
## $ thal : num [1:1025] 3 3 3 3 2 2 1 3 3 2 ...
## $ target : logi [1:1025] FALSE FALSE FALSE FALSE FALSE TRUE ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. sex = col_double(),
## .. cp = col_double(),
## .. trestbps = col_double(),
## .. chol = col_double(),
## .. fbs = col_double(),
## .. restecg = col_double(),
## .. thalach = col_double(),
## .. exang = col_double(),
## .. oldpeak = col_double(),
## .. slope = col_double(),
## .. ca = col_double(),
## .. thal = col_double(),
## .. target = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
dim(heart_data) # view the dimensions of the data
## [1] 1025 14
head(heart_data)
## # A tibble: 6 × 14
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 52 Male 0 125 212 0 1 168 0 1 2
## 2 53 Male 0 140 203 1 0 155 1 3.1 0
## 3 70 Male 0 145 174 0 1 125 1 2.6 0
## 4 61 Male 0 148 203 0 1 161 0 0 2
## 5 62 Female 0 138 294 1 1 106 0 1.9 1
## 6 58 Female 0 100 248 0 0 122 0 1 1
## # … with 3 more variables: ca <dbl>, thal <dbl>, target <lgl>
tail(heart_data)
## # A tibble: 6 × 14
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 47 Male 0 112 204 0 1 143 0 0.1 2
## 2 59 Male 1 140 221 0 1 164 1 0 2
## 3 60 Male 0 125 258 0 0 141 1 2.8 1
## 4 47 Male 0 110 275 0 0 118 1 1 1
## 5 50 Female 0 110 254 0 0 159 0 0 2
## 6 54 Male 0 120 188 0 1 113 0 1.4 1
## # … with 3 more variables: ca <dbl>, thal <dbl>, target <lgl>
disease <- prop.table(table(heart_data$target)) # find the proportion of patients with and without heart disease
disease
##
## FALSE TRUE
## 0.4868293 0.5131707
About 49% of patients do not have heart disease, and about 51% do.
hearts_data <- rename(heart_data, maximum_heart_rate = thalach, resting_blood_pressure = trestbps) # rename variables so they are easier to read
# plot a bar graph of the proportion of patients with and without heart disease within each sex
gender <- ggplot(hearts_data, mapping = aes(x = sex, fill = target)) +
  geom_bar(position = "fill") + # position = "fill" scales each bar to 1, so the y axis is a proportion
  guides(fill = guide_legend(reverse = TRUE)) + # flip the legend so TRUE comes first
  scale_fill_manual(values = c("pink", "purple")) + # add colors
  ggtitle("Gender and Heart Disease") +
  xlab("Sex") + ylab("Proportion") +
  theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size = 10), # change font size
        axis.title.y = element_text(size = 10), axis.text.y = element_text(size = 10))
# box plot of resting blood pressure by heart disease status
blood_pressure <- ggplot(hearts_data, mapping = aes(x = resting_blood_pressure, y = target, fill = target)) +
  geom_boxplot() +
  guides(fill = guide_legend(reverse = TRUE)) + # flip the legend so TRUE comes first
  ggtitle("Blood Pressure and Heart Disease") +
  scale_fill_manual(values = c("pink", "purple")) +
  xlab("Resting blood pressure") + ylab("Heart disease") + # axis titles
  theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size = 10), # change font size
        axis.title.y = element_text(size = 10), axis.text.y = element_text(size = 10))
# box plot of maximum heart rate by heart disease status
heart_rate <- ggplot(hearts_data, mapping = aes(x = maximum_heart_rate, y = target, fill = target)) +
  geom_boxplot() +
  guides(fill = guide_legend(reverse = TRUE)) + # flip the legend so TRUE comes first
  scale_fill_manual(values = c("pink", "purple")) + # add colors
  ggtitle("Maximum Heart Rate According to Disease") +
  xlab("Maximum heart rate achieved") + ylab("Heart disease") +
  theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size = 10)) # change font size
grid.arrange(gender, blood_pressure, heart_rate, nrow=2)
From these plots, the bar plot at the top suggests a relationship between sex and heart disease, because the share of patients with heart disease differs between males and females. In absolute numbers there are more males than females with heart disease, but the data also contain more than twice as many males, so counts alone can mislead. The box plots for resting blood pressure are similar for patients with and without heart disease, indicating that there is probably not a strong relationship between the two. There are also only three outliers among patients with heart disease, which may reflect that very high blood pressure can accompany heart disease. The final box plot does show a link between heart disease and the maximum heart rate achieved during exercise. There is little overlap between the two boxes, and maximum heart rate tends to be higher in patients with heart disease than in those without. Apart from a few outliers with a very low maximum heart rate, the typical heart rate is considerably higher for patients with heart disease. None of the distributions has so many outliers that they would distort the analysis.
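The visual comparison of the box plots can be backed with group summaries. The sketch below uses a small made-up data frame (`toy`, not rows from `heart.csv`) with the same column names as `hearts_data`, and base R's `aggregate()`. With the real data, `hearts_data` could be passed in place of `toy`.

```r
# Hypothetical toy rows, for illustration only (not taken from heart.csv)
toy <- data.frame(
  target = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  maximum_heart_rate = c(170, 162, 158, 140, 125, 132),
  resting_blood_pressure = c(130, 120, 128, 140, 132, 126)
)

# Median of each measurement within each disease group
group_medians <- aggregate(
  cbind(maximum_heart_rate, resting_blood_pressure) ~ target,
  data = toy,
  FUN = median
)
print(group_medians)
```

Medians are used here because they correspond to the center line of each box and are not pulled around by the outliers.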
Every year, more than 6.5 million people in the United States visit emergency rooms because of chest pain. However, little research has examined what these symptoms mean for future health, according to the study’s principal investigator, Dr. Kentaro Ejiri, a postdoctoral fellow at the Johns Hopkins Bloomberg School of Public Health in Baltimore. Even minor chest discomfort was associated with a long-term risk of cardiac issues, although less so than moderate to severe symptoms.
# Create a new data frame with heart disease and chest pain types
heart_cp <- heart_data %>%
  select(cp, target) %>% # select chest pain and heart disease
  mutate(target = case_when( # recode heart disease status as "No" / "Yes"
    target == 0 ~ "No",
    target == 1 ~ "Yes"),
    cp = as.factor(cp)) %>% # convert chest pain type to a factor
  mutate(target = as.factor(target))
# Plot chest pain type by heart disease status using a bar graph
p2 <- ggplot(heart_cp, aes(x = cp, fill = target)) +
  geom_bar(position = "stack") + # stack the bars
  scale_fill_manual(values = c("darkblue", "red")) + # include colors
  labs(x = "Chest Pain Type",
       y = "Number of Individuals",
       title = "Chest Pain According to Heart Disease") +
  theme(plot.title = element_text(hjust = 0.5), # center the title
        plot.subtitle = element_text(hjust = 0.5))
p2 <- ggplotly(p2) # include interactivity through plotly
p2
Except for chest pain type 0, every chest pain type shows more patients with heart disease than without, which suggests a relationship between chest pain type and heart disease.
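Whether chest pain type and heart disease are associated can also be tested formally with a chi-squared test of independence. This is a minimal sketch, assuming a contingency table of counts. The numbers below are made up for illustration, and with the real data the table would come from `table(heart_cp$cp, heart_cp$target)`.

```r
# Hypothetical counts (rows: chest pain type 0-3, columns: heart disease No/Yes)
cp_counts <- matrix(
  c(60, 20,
    10, 30,
    15, 45,
     8, 12),
  nrow = 4, byrow = TRUE,
  dimnames = list(cp = c("0", "1", "2", "3"), target = c("No", "Yes"))
)

# Chi-squared test of independence between chest pain type and disease status
cp_test <- chisq.test(cp_counts)
print(cp_test$p.value)   # small values suggest the two variables are associated
print(cp_test$expected)  # expected counts if the two variables were independent
```

A small p-value only indicates that the two variables are not independent. It does not say how strong the relationship is or which chest pain types drive it.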
# histogram of serum cholesterol levels by heart disease status
ggplot(heart_data, aes(x = chol, fill = target)) +
  geom_histogram(aes(y = after_stat(density)), color = "grey17") + # after_stat() replaces the deprecated ..density.. notation
  geom_density(alpha = .2, fill = "yellow") + # shade the area under the density curve yellow
  facet_wrap(~target, ncol = 1, scales = "fixed") + # one panel per disease status, stacked vertically
  xlab("Cholesterol") +
  ylab("Density") +
  ggtitle("Serum Cholesterol (mg/dl)") +
  scale_fill_discrete(name = "Heart Disease", labels = c("No", "Yes")) + # fill levels are FALSE, TRUE in that order
  theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histograms show no clear difference in serum cholesterol (mg/dl) between patients with and without heart disease.
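Overlapping histograms are suggestive, but a two-sample test makes the comparison explicit. The sketch below runs Welch's t-test with base R's `t.test()` on simulated cholesterol values. The `sim` data frame and its parameters are invented for illustration, and the real call would use `heart_data` instead.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated data: both groups drawn from the same distribution (an assumption)
sim <- data.frame(
  target = rep(c(FALSE, TRUE), each = 50),
  chol = rnorm(100, mean = 245, sd = 50)
)

# Welch two-sample t-test of cholesterol by disease status
chol_test <- t.test(chol ~ target, data = sim)
print(chol_test$estimate)  # group means
print(chol_test$p.value)   # large values give no evidence of a difference in means
```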
Hypertension is a significant public health issue and a documented independent risk factor for cardiovascular disease. Increasing blood pressure is consistently associated with cardiovascular disease.
# plot to show the relationship between resting blood pressure and heart disease
p1 <- ggplot(hearts_data, aes(x = resting_blood_pressure, fill = target)) +
geom_histogram(bins = 20, col = I("black")) + # outline the bars in black to distinguish them from each other
scale_fill_manual(values=c("pink", "red")) +
xlab("Resting Blood Pressure (mmHg)") +
ylab("Number of Individuals") +
ggtitle("Resting Blood Pressure in Relation to Heart Disease")
p1 <- ggplotly(p1)
p1
Although the distributions look similar for patients with and without heart disease, most patients with heart disease still appear to have elevated blood pressure.
# bar plot to see the number of males and females in the data
barplot(table(heart_data$sex), xlim = c(0, 800), horiz = TRUE, # create horizontal bar graph; the limit must exceed the 713 males
        col = c("pink", "blue"), border = "white",
        main = "Number of Males and Females in the Data", ylab = "Gender", legend.text = TRUE) # include legend
There are more than twice as many men as women in the data (713 vs. 312).
prop.table(table(select(heart_data, target, sex))) # proportion table of males and females with and without disease
## sex
## target Female Male
## FALSE 0.08390244 0.40292683
## TRUE 0.22048780 0.29268293
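The table above gives joint proportions, with each cell divided by all 1025 patients. To compare the sexes directly, it helps to condition on sex so that each column sums to 1. The short sketch below re-enters the four proportions printed above and normalises them by column. With the data loaded, `prop.table(table(heart_data$target, heart_data$sex), margin = 2)` should give the same result.

```r
# Joint proportions copied from the output above
joint <- matrix(
  c(0.08390244, 0.40292683,
    0.22048780, 0.29268293),
  nrow = 2, byrow = TRUE,
  dimnames = list(target = c("FALSE", "TRUE"), sex = c("Female", "Male"))
)

# Share of each sex with and without heart disease (each column sums to 1)
within_sex <- prop.table(joint, margin = 2)
print(round(within_sex, 3))
```

Read this way, roughly 72% of the women and 42% of the men in this data set are labelled as having heart disease, even though men account for more of the heart disease cases in absolute terms.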
According to BMJ Global Health, “Men generally develop heart disease at a younger age and have a higher risk of coronary heart disease (CHD) than women. Women, in contrast, are at a higher risk of stroke, which often occurs at older age.” Although cardiovascular disease is the leading cause of mortality globally, there are significant differences between men and women. The apparent modest advantage of women in delaying heart disease to older ages may be partly explained by sex differences in how heart disease presents. Women have less severe obstructive coronary artery disease and warning signs than men, and because their symptoms are frequently viewed as “atypical,” they often go undiagnosed or neglected. As a result, women are less likely than men to be referred for diagnostic and therapeutic procedures or to receive pharmacological treatment for disease risk factors. For women with heart disease, inadequate access to healthcare services causes delays in diagnosis and treatment as well as worse prognoses and outcomes.
# create data frame of age, target, and sex
age <- heart_data %>%
mutate(age=as.factor(age)) %>% # mutate age to factor
group_by(age, target, sex) %>% # group by these three variables
summarise(count=n()) %>% # count the number of patients in the group as specified
mutate(perc=count/sum(count))
## `summarise()` has grouped output by 'age', 'target'. You can override using the
## `.groups` argument.
# bar plots of the number of patients at each age, split by heart disease status and sex
plot3 <- ggplot(age, aes(x = age, y = count, fill = target)) +
  geom_bar(stat = "identity") + xlab("Age") + # stat = "identity" uses the precomputed counts as bar heights
  scale_x_discrete(breaks = c("30", "35", "40", "45", "50", "55", "60", "65", "70", "75", "80")) + # label every fifth age on the x axis
  ylab("Count") +
  ggtitle("Ages With Heart Disease") +
  scale_fill_manual(values = c("green", "red")) +
  theme_minimal(base_size = 10) + # theme for the graph
  theme(plot.title = element_text(hjust = 0.5)) + # center the title
  facet_grid(~sex) # one panel per sex
ggplotly(plot3) # include interactivity through plotly
According to this plot, there are more male than female patients with heart disease, and the highest single-age count of heart disease also occurs among males. Also note that the youngest male with heart disease is 29 years old, while the youngest female with heart disease is 34. Keep in mind that there are more than twice as many men as women in this data set, so the two sexes are not equally represented. This affects how we interpret the data, since females might show a higher proportion with a larger sample. Similarly, as the BMJ Global Health study suggests, heart disease in women may be underdiagnosed at younger ages, which could explain why the youngest female with heart disease is 34 while the youngest male is considerably younger.
According to the NIH, elevated heart rate is associated with elevated blood pressure, increased risk for hypertension, and, among hypertensives, increased risk for cardiovascular disease. Despite these important relationships, heart rate is generally not a major consideration in choosing antihypertensive medications. Heart rate also varies from person to person.
# scatter plot of heart disease by age, maximum heart rate, and blood pressure
p <- hearts_data %>%
  plot_ly(x = ~age, y = ~maximum_heart_rate, color = ~target, colors = c("purple", "#1B98E0"), # use plotly for the scatter plot
          size = ~resting_blood_pressure, # marker size encodes resting blood pressure
          text = ~paste("<br> age: ", age, # text to show in the tooltip
                        "<br> max heart rate: ", maximum_heart_rate,
                        "<br> blood pressure: ", resting_blood_pressure),
          hoverinfo = "text", type = "scatter", mode = "markers") %>%
  layout(title = "Heart Disease Occurrence According to Age, Maximum Heart Rate, and Blood Pressure",
         xaxis = list(title = "Age"),
         yaxis = list(title = "Maximum Heart Rate"),
         legend = list(title = list(text = "Occurrence of Heart Disease")),
         showlegend = TRUE) # show the legend (must be passed inside layout())
p
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
After looking at the exploratory graphs, I wanted to see how maximum heart rate, age, and blood pressure jointly relate to heart disease. There is a pattern in which most patients with heart disease also have a high maximum heart rate. While blood pressure does not appear to be a strong factor in determining heart disease, none of the points for patients with heart disease is among the very smallest, compared with the many other points. The 40 to 60 age range contains the most patients with heart disease. Something I would like to explore further is the age at death for these patients, since there are barely any patients with heart disease after the age of 70. Is this because they do not live to that age, or because none were included in this research? Lastly, in most of my graphs, I wanted to reverse the stack order so that heart disease = TRUE would appear on top of the data with FALSE. There is still much to learn about coding data visualizations, but I am very grateful to Professor Saidi for how much I have been able to learn from her throughout the semester.
Bots, Sophie H, et al. “Sex Differences in Coronary Heart Disease and Stroke Mortality: A Global Assessment of the Effect of Ageing between 1980 and 2010.” BMJ Global Health, BMJ Specialist Journals, 1 Mar. 2017, https://gh.bmj.com/content/2/2/e000298.
Reule, Scott, and Paul E Drawz. “Heart Rate and Blood Pressure: Any Possible Implications for Management of Hypertension?” Current Hypertension Reports, U.S. National Library of Medicine, Dec. 2012, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491126/.
“Chest Pain, Shortness of Breath Linked to Long-Term Risk of Heart Trouble.” www.heart.org, 17 Nov. 2022, https://www.heart.org/en/news/2022/11/02/chest-pain-shortness-of-breath-linked-to-long-term-risk-of-heart-trouble.