This analysis explores heart disease status and its relationship with several variables. Specifically, we aim to better understand how sex (male and female), age, serum cholesterol level, chest pain, blood pressure (bp), and maximum heart rate (hr) play a part in the prevalence of heart disease. These variables will be analyzed individually and then used to fit a binomial generalized linear model (logistic regression) that predicts the probability of not having heart disease. The data used in this analysis can be found at archive.ics.uci.edu.
The data analysis will be broken down by the following sections:
Sex and Disease Status
Age and Disease Status
Serum Cholesterol Levels and Disease Status
Chest Pain and Disease Status
Resting Blood Pressure and Disease Status
Maximum Heart Rate Achieved per Patient and Disease Status
Logistic Regression (Binomial GLM) and Disease Status Predictions
We will begin our analysis by determining the prevalence of each respective sex within the data set.
ggplot(tidy_data, mapping = aes(x = sex)) +
geom_bar() +
labs(x = "Gender",
y = "Number of Individuals",
title = "Number of Males and Females within the Data Set",
subtitle = "Heart Disease UCI",
caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Let’s explore the proportion of these genders and see the prevalence of heart disease among them:
ggplot(tidy_data, mapping = aes(x = sex, fill = status)) +
geom_bar(position = "fill") +
labs(x = "Gender",
y = "Proportion",
title = "Proportion of Individuals With and Without Heart Disease",
subtitle = "Heart Disease UCI",
caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
According to (Mosca, Barrett-Connor, and Kass Wenger 2011), “the prevalence of CHD is higher in men with each age stratum until after 75 years of age, which may contribute to the perception that heart disease is a man’s disease.” After 75 years of age, the proportion of those with coronary heart disease is relatively similar regardless of sex.
Therefore, the above proportions are not out of the ordinary since males are more susceptible to heart disease earlier in life than females.
According to the National Institute on Aging, adults who are 65 years and older have a higher risk of suffering heart complications. Therefore, we will compare the average age of individuals with and without heart disease within our data set.
ggplot(tidy_data, mapping = aes( x = status, y = age)) +
geom_boxplot() +
labs(x = "Disease Status",
y = "Age",
title = "Older Age is Correlated with Heart Disease",
subtitle = "Heart Disease UCI",
caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
average_age_dz <- tidy_data %>%
group_by(status) %>%
summarise(
average_age = mean(age),
min = min(age),
max = max(age)
)
knitr::kable(average_age_dz, digits = 2)
| status | average_age | min | max |
|---|---|---|---|
| Dz | 56.6 | 35 | 77 |
| No Dz | 52.5 | 29 | 76 |
Within our data set, the average age of an individual with heart disease is 56.6 years old, with the youngest being 35 years old and the oldest being 77 years old.
The box plot above shows that the median age of individuals with heart disease is higher than that of individuals without heart disease, and that the ages of those with heart disease are less spread out than the ages of those without.
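A box plot draws the median and quartiles rather than the mean, so it can help to compute both kinds of summary side by side. The following is a minimal sketch using base R on a handful of made-up ages; the vectors `age` and `status` here are hypothetical and are not taken from `tidy_data`.

```r
# Hypothetical ages, invented for illustration (not the study data).
age    <- c(35, 48, 55, 58, 61, 77, 29, 41, 52, 54, 60, 76)
status <- factor(rep(c("Dz", "No Dz"), each = 6))

# Per-group summaries: the centre line of a box plot is the median,
# and the height of the box is the interquartile range (IQR).
age_means   <- tapply(age, status, mean)
age_medians <- tapply(age, status, median)
age_iqrs    <- tapply(age, status, IQR)

print(rbind(mean = age_means, median = age_medians, IQR = age_iqrs))
```

Comparing the mean and median rows shows how far a few extreme ages pull the average away from what the box plot displays.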
Now we can run a t-test to see whether there is a statistically significant difference between the average age of patients with and without heart disease.
Keep in mind that the null hypothesis is a difference in means of zero, the alternative is two-sided, we want a 95% confidence interval, we do not assume equal variances, and the two groups are independent (the default, unpaired test).
t.test(tidy_data$age ~ tidy_data$status, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: tidy_data$age by tidy_data$status
## t = 4.0797, df = 301, p-value = 5.781e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.124635 6.084324
## sample estimates:
## mean in group Dz mean in group No Dz
## 56.60145 52.49697
The null hypothesis is: there is no difference in average age between those with or without heart disease.
The p-value (5.781e-05) is close to zero, so we reject the null hypothesis at the 5% significance level: there is a statistically significant difference between the average age of those who have heart disease and those who do not.
The 95% confidence interval (2.12 to 6.08 years) does not include zero, which is consistent with rejecting the null.
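The link between the p-value and the confidence interval can be checked directly on the object returned by `t.test()`. The sketch below uses simulated ages (the data frame `demo` and its group sizes are assumptions made for illustration, not the study data) and shows that a two-sided test rejects at the 5% level exactly when the 95% interval excludes zero.

```r
# Hypothetical data: ages for two groups, simulated purely for illustration.
set.seed(42)
demo <- data.frame(
  status = factor(rep(c("Dz", "No Dz"), times = c(40, 50))),
  age    = c(rnorm(40, mean = 57, sd = 8), rnorm(50, mean = 52, sd = 10))
)

# Welch two-sample t-test (unequal variances), two-sided, 95% CI.
welch <- t.test(age ~ status, data = demo,
                alternative = "two.sided", mu = 0,
                conf.level = 0.95, var.equal = FALSE)

# Pull out the pieces that the interpretation relies on.
p_value   <- welch$p.value
ci        <- welch$conf.int            # CI for mean(Dz) - mean(No Dz)
reject_h0 <- p_value < 0.05

# The CI excludes zero exactly when the two-sided test rejects at the 5% level.
ci_excludes_zero <- ci[1] > 0 || ci[2] < 0

print(c(p_value = p_value, lower = ci[1], upper = ci[2]))
```

Working from the returned object rather than the printed output also makes it easy to report the estimate and interval in the text.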
As mentioned in the previous section, heart disease status is influenced by (but not limited to) both age and sex (male vs female). Let’s explore the average age of males vs females with and without heart disease.
According to (Maas and Appelman 2010), heart disease “develops 7-10 years later in women than in men.” Therefore, we should see that the average age of males with heart disease is less than that of females.
age_sex_dz <- tidy_data %>%
group_by(status, sex) %>%
summarise(
average_age = mean(age),
min = min(age),
max = max(age)
) %>%
filter(status == "Dz")
knitr::kable(age_sex_dz, digits = 2)
| status | sex | average_age | min | max |
|---|---|---|---|---|
| Dz | female | 59.04 | 43 | 66 |
| Dz | male | 56.09 | 35 | 77 |
The average age of males with heart disease is 56.1 years old and the average age of females with heart disease is 59 years old.
The youngest male with heart disease is 35 years old and the youngest female with heart disease is 43 years old.
The oldest male with heart disease is 77 years old and the oldest female with heart disease is 66 years old.
This analysis is consistent with the claims from (Maas and Appelman 2010). Males with heart disease were on average about 3 years younger than females with heart disease, which is in line with the claim that men tend to develop coronary artery disease earlier in life. The youngest male with heart disease in this data set is 8 years younger than the youngest female with heart disease, which is also in line with the claim that heart disease “develops 7-10 years later in women than in men,” although a single minimum is weak evidence and the recorded age is not necessarily the age of onset.
The research by (Elshourbagy, Meyers, and Abdel-Meguid 2014) explains that atherosclerosis is an inflammatory condition that can result from a multitude of risk factors. Known factors that can contribute to its development include high levels of Low-Density Lipoprotein (LDL) and low levels of High-Density Lipoprotein (HDL), because LDL is pro-inflammatory and HDL is anti-inflammatory.
Because atherosclerosis is a leading cause of cardiovascular disease due to dyslipidemia (abnormal levels of lipids in the blood (LDL/HDL)) (Elshourbagy, Meyers, and Abdel-Meguid 2014), we will analyze the cholesterol data of our patients.
Specifically, we will take a look at our data set and see what the serum cholesterol levels are compared to the normal range of serum cholesterol:
ggplot(tidy_data, mapping = aes(x = status, y = chol)) +
geom_boxplot() +
labs(x = "Disease Status",
y = "Serum Cholesterol Level (mg/dL)",
title = "Similar Cholesterol Levels Despite Heart Disease",
subtitle = "Heart Disease UCI",
caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
average_chol_status <- tidy_data %>%
group_by(status) %>%
summarise(
average_chol = mean(chol),
min = min(chol),
max = max(chol)
)
knitr::kable(average_chol_status, digits = 2)
| status | average_chol | min | max |
|---|---|---|---|
| Dz | 251.09 | 131 | 409 |
| No Dz | 242.23 | 126 | 564 |
As shown above, the average serum cholesterol level of those with heart disease was 251 mg/dL and the average serum cholesterol level of those without heart disease was 242 mg/dL.
According to WebMD:
Desirable: less than 200 mg/dL
Borderline high: 200-239 mg/dL
High: 240 mg/dL or higher (being in this range may double the risk of heart disease)
Therefore, the average serum cholesterol levels of both groups, those with and without heart disease, fall within the “High” range.
Although both averages are above the healthy range, are they significantly different from each other?
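The ranges above can be applied to individual readings with `cut()`. This is a small sketch rather than part of the original analysis: the helper name `classify_chol` is hypothetical, and it assumes that a reading of exactly 240 mg/dL counts as High.

```r
# Hypothetical helper: map serum cholesterol (mg/dL) to the ranges listed above.
# Assumption: intervals are closed on the left, so 200 is "Borderline high"
# and 240 is "High".
classify_chol <- function(chol) {
  cut(chol,
      breaks = c(-Inf, 200, 240, Inf),
      labels = c("Desirable", "Borderline high", "High"),
      right  = FALSE)
}

# Example readings, including the two group averages reported in the table.
readings <- c(180, 200, 239, 240, 242.23, 251.09)
print(data.frame(chol = readings, category = classify_chol(readings)))
```

Applied to a whole column, `table(classify_chol(chol))` would give the number of patients in each range.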
t.test(tidy_data$chol ~ tidy_data$status, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: tidy_data$chol by tidy_data$status
## t = 1.4948, df = 298.03, p-value = 0.136
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.803241 20.516548
## sample estimates:
## mean in group Dz mean in group No Dz
## 251.0870 242.2303
Our null hypothesis states: There is no difference between the average serum cholesterol levels of a patient with heart disease and of a patient without heart disease.
The p-value of 0.136 means that, if the null hypothesis were true, a difference at least as extreme as the one observed would occur about 13.6% of the time. Since this is greater than 5%, we cannot reject the null hypothesis.
The 95% confidence interval also includes zero, which is consistent with failing to reject the null hypothesis.
Therefore, we do not find a statistically significant difference in average cholesterol level between patients with and without heart disease.
It is interesting that cholesterol level does not appear to be a good indicator of heart disease within this data set, despite the known implications from the above publications.
Chest pain, also known as angina, is not a disease but a symptom of potential underlying heart problems. There are many different types of angina; for our analysis we will use the key shown below the plot code:
ggplot(tidy_data, aes(x = cp, fill = status)) +
geom_bar(position = "stack") +
labs(x = "Chest Pain Type",
y = "Number of Individuals",
title = "Large Majority of Heart Disease Patients are Asymptomatic",
subtitle = "Heart Disease UCI",
caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Value 0: Asymptomatic
Value 1: Atypical Angina
Value 2: Non-Anginal Pain
Value 3: Typical Angina
As shown in the above plot, asymptomatic patients (value 0) form the largest group within the data set, and most of them have heart disease. The plot also indicates that the majority of patients with heart disease are asymptomatic.
According to heart.org, “not all chest pain is a sign of heart disease.” Other conditions can cause chest pain, such as pulmonary embolism, aortic dissection, lung infection, and aortic stenosis.
Therefore, chest pain on its own might not be a great indicator of heart disease within this data set either.
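Two different proportions are easy to confuse when reading a stacked bar chart: the share of asymptomatic patients who have heart disease, and the share of heart disease patients who are asymptomatic. The sketch below separates them with `prop.table()`. The counts are invented for illustration only and do not come from the data set.

```r
# Hypothetical counts, invented for illustration only.
demo <- data.frame(
  status = rep(c("Dz", "No Dz"), times = c(10, 20)),
  cp     = c(rep(0, 7), 1, 2, 3,                             # 10 patients with Dz
             rep(0, 7), rep(1, 5), rep(2, 6), rep(3, 2))     # 20 patients without
)

counts <- table(demo$status, demo$cp)

# Within each disease status: what share has each chest pain type?
row_share <- prop.table(counts, margin = 1)

# Within each chest pain type: what share has each disease status?
col_share <- prop.table(counts, margin = 2)

print(round(row_share, 2))
print(round(col_share, 2))
```

In this toy example 70% of patients with heart disease are asymptomatic, yet only half of the asymptomatic patients have heart disease, so the two statements need to be checked separately.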
As stated in MedicalNewsToday, “when doctors evaluate the risk of high blood pressure, they usually pay more attention to systolic blood pressure, which they consider a major risk factor for cardiovascular disease in older adults.” They also mention that high diastolic blood pressure is indicative of cardiovascular disease; however, our data set includes only systolic blood pressure.
According to heart.org, the normal range of systolic blood pressure is less than 120 mmHg. Systolic blood pressure that is consistently greater than 130 mmHg categorizes the patient as hypertensive.
Below is a histogram of the resting systolic blood pressure of patients with and without heart disease:
ggplot(tidy_data, aes(x = trestbps, fill = status)) +
geom_histogram() +
labs(x = "Resting Blood Pressure (mmHg)",
y = "Number of Individuals",
title = "Majority of Heart Disease Patients have Elevated Blood Pressures",
subtitle = "Blood Pressure >120 mmHg",
caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
trestbps_table <- tidy_data %>%
group_by(status) %>%
summarise(
average_trestbps = mean(trestbps)
)
knitr::kable(trestbps_table, digits = 0)
| status | average_trestbps |
|---|---|
| Dz | 134 |
| No Dz | 129 |
As calculated above, the average resting blood pressure of patients with heart disease is 134 mmHg, which is outside the healthy range and would classify a patient as hypertensive if readings were consistently above 130 mmHg.
The average resting blood pressure of patients without heart disease is 129 mmHg, which is still above the healthy range but below the hypertensive threshold.
Can we claim that the average blood pressure of those with and without heart disease is significantly different?
t.test(tidy_data$trestbps ~ tidy_data$status, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: tidy_data$trestbps by tidy_data$status
## t = 2.5083, df = 272.56, p-value = 0.01271
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.096240 9.094801
## sample estimates:
## mean in group Dz mean in group No Dz
## 134.3986 129.3030
The null hypothesis is: there is no difference in average systolic blood pressure between those with or without heart disease.
The p-value is 1.3%, which is less than 5%, so we reject the null hypothesis at the 5% significance level: there is a statistically significant difference between the average systolic blood pressure of those who have heart disease and those who do not.
The 95% confidence interval (1.10 to 9.09 mmHg) does not include zero, which is consistent with rejecting the null.
ggplot(tidy_data, aes(x = thalach, fill = status)) +
geom_histogram() +
labs(x = "Maximum Heart Rate Achieved (bpm)",
y = "Number of Individuals",
title = "Maximum Heart Rate Achieved per Patient",
subtitle = "Heart Disease UCI",
caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Many different age-predicted maximal heart rate (APMHR) equations are in use. The most commonly used is HRmax = 220 - Age (Fox). This equation has limited predictive accuracy, with a reported standard deviation of 10-12 bpm; however, “it is still used in clinical settings and published in resources by well-established organizations in the field” (Shookster et al. 2020). Therefore, we will use this equation as a reference while analyzing our data.
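As a quick worked example of the Fox equation, the sketch below computes the predicted maximum for a few ages and expresses an achieved heart rate as a percentage of it. The function name `apmhr_fox` and the example patients are hypothetical and are used only for illustration.

```r
# Fox equation for age-predicted maximal heart rate: HRmax = 220 - Age.
# The function name is an illustrative choice, not part of the original analysis.
apmhr_fox <- function(age) {
  220 - age
}

# Hypothetical patients: age and maximum heart rate achieved (bpm).
age      <- c(30, 50, 70)
achieved <- c(185, 150, 120)

predicted_max <- apmhr_fox(age)
pct_of_pred   <- 100 * achieved / predicted_max

print(data.frame(age, achieved, predicted_max, pct_of_pred = round(pct_of_pred, 1)))
```

Given the reported standard deviation of 10-12 bpm, small differences between achieved and predicted values should not be over-interpreted.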
ggplot(tidy_data, aes(x = age, y = thalach, color = status)) +
geom_point() +
geom_smooth(se = FALSE, linewidth = 2) +
labs(x = "Age",
y = "Maximum Heart Rate Achieved",
title = "As Age Increases, Maximum Heart Rate Achieved Decreases",
subtitle = "Trend for Heart Disease Patients is Relatively Flat",
caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
In patients without heart disease, as age increases, the maximum heart rate achieved decreases.
Among patients with heart disease, the maximum heart rate achieved stays at around 140-150 beats per minute regardless of age.
We will now look at a linear model to confirm the relationship between heart rate, age, and disease status:
thalach_age_model <- lm(tidy_data$thalach ~ tidy_data$age + tidy_data$status)
summary(thalach_age_model)
##
## Call:
## lm(formula = tidy_data$thalach ~ tidy_data$age + tidy_data$status)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.72 -11.79 2.52 12.95 53.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 184.7374 7.3937 24.986 < 2e-16 ***
## tidy_data$age -0.8063 0.1273 -6.335 8.66e-10 ***
## tidy_data$statusNo Dz 16.0559 2.3171 6.929 2.59e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.57 on 300 degrees of freedom
## Multiple R-squared: 0.2749, Adjusted R-squared: 0.27
## F-statistic: 56.86 on 2 and 300 DF, p-value: < 2.2e-16
The summary of the linear model shows a negative slope for age (-0.8063): holding disease status constant, each additional year of age is associated with a decrease of about 0.8 bpm in the maximum heart rate achieved. This supports the expected relationship that maximum heart rate achieved decreases as age increases. The coefficient for status (16.0559) indicates that, at the same age, patients without heart disease reach a maximum heart rate about 16 bpm higher on average. Note that this model fits a single age slope for both groups.
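To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the three estimates from the summary above into a small function; the function name `predict_thalach` is hypothetical, and the result is only the model's average prediction, not a statement about any individual patient.

```r
# Coefficient estimates copied from the model summary above.
coefs <- c(intercept = 184.7374, age = -0.8063, status_no_dz = 16.0559)

# Hypothetical helper: average predicted maximum heart rate (bpm) for a given
# age and disease status, using the additive model thalach ~ age + status.
predict_thalach <- function(age, no_disease) {
  coefs[["intercept"]] +
    coefs[["age"]] * age +
    coefs[["status_no_dz"]] * as.numeric(no_disease)
}

predict_thalach(50, no_disease = FALSE)  # 184.7374 - 0.8063 * 50 = 144.4224
predict_thalach(50, no_disease = TRUE)   # 144.4224 + 16.0559     = 160.4783

# Ten additional years of age shift the prediction by 10 * (-0.8063) bpm.
predict_thalach(60, FALSE) - predict_thalach(50, FALSE)
```

With the fitted model object available, `predict(thalach_age_model, newdata = ...)` would be the usual route; the hand calculation is shown only to spell out what each coefficient contributes.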
plot(thalach_age_model)
The plot() call above produces base R’s four built-in regression diagnostic plots (residuals vs. fitted values, a normal Q-Q plot of the residuals, scale-location, and residuals vs. leverage).
We will focus on the first two.
The first plot shows a relatively flat line, and the residuals plotted against the fitted values form a cloud with no clear pattern. This is consistent with a linear relationship between maximum heart rate achieved and the predictors (age and disease status).
The second plot checks the normality of the residuals: if they are normally distributed, the points should fall along the diagonal line, as they do here.
Together with the p-values in the summary, this suggests that the linear model is a reasonable description of the relationship between the three variables, although the R-squared of 0.27 means it explains only about 27% of the variation in maximum heart rate achieved.
We will now fit a generalized linear model (logistic regression) to our data to see whether we can predict the status of a patient from the other variables within the data set.
predicted <- glm(status ~ ., family = "binomial", data = tidy_data)
summary(predicted)
##
## Call:
## glm(formula = status ~ ., family = "binomial", data = tidy_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6419 -0.5381 0.1969 0.6343 2.5166
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.254646 2.316904 0.973 0.330489
## age -0.022502 0.021359 -1.054 0.292100
## sexmale -2.002365 0.422739 -4.737 2.17e-06 ***
## cp1 1.449228 0.498659 2.906 0.003658 **
## cp2 1.959531 0.419741 4.668 3.04e-06 ***
## cp3 2.295322 0.611150 3.756 0.000173 ***
## trestbps -0.016457 0.009984 -1.648 0.099288 .
## chol -0.005131 0.003474 -1.477 0.139638
## fbs1 -0.037806 0.462840 -0.082 0.934899
## restecg1 0.406221 0.334214 1.215 0.224194
## restecg2 -0.251557 1.682235 -0.150 0.881130
## thalach 0.025217 0.009460 2.666 0.007680 **
## exang1 -0.807984 0.382162 -2.114 0.034495 *
## oldpeak -0.681673 0.177901 -3.832 0.000127 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.64 on 302 degrees of freedom
## Residual deviance: 240.17 on 289 degrees of freedom
## AIC: 268.17
##
## Number of Fisher Scoring iterations: 5
As shown by the stars next to the p-values, several variables are statistically significant predictors of disease status.
Because status is a factor whose first level is Dz, the binomial generalized linear model treats Dz as zero and No Dz as one. The fitted values are therefore the predicted probability of NOT having heart disease.
Below is a scatter plot of the fitted probabilities from our logistic regression model, ranked from lowest to highest:
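The estimates in the summary are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios. The sketch below does this for a few coefficients copied from the output above; it is an illustration of the conversion, not an additional analysis.

```r
# Selected estimates copied from the glm summary above (log-odds scale).
# The model predicts the probability of NOT having heart disease.
log_odds <- c(sexmale = -2.002365, cp2 = 1.959531,
              thalach = 0.025217, oldpeak = -0.681673)

# exp() converts a log-odds coefficient into an odds ratio.
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 3))

# An odds ratio below 1 lowers the odds of "No Dz"; above 1 raises them.
lowers_odds_of_no_dz <- odds_ratios < 1
print(lowers_odds_of_no_dz)
```

For example, the odds ratio for sexmale is roughly 0.14, meaning that, holding the other variables constant, the odds of not having heart disease for males are about one seventh of those for females in this model.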
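Which outcome the model predicts depends on the order of the factor levels: with a factor response, `glm()` treats the first level as the "failure" (zero) and models the probability of the remaining level. The sketch below demonstrates this on simulated data (the variables `x` and `status` are made up for illustration) and shows how `relevel()` switches the model to predicting the probability of having the disease instead.

```r
# Simulated data for illustration only (not the study data).
set.seed(7)
x        <- rnorm(100)
is_no_dz <- rbinom(100, size = 1, prob = plogis(0.5 + x))
status   <- factor(ifelse(is_no_dz == 1, "No Dz", "Dz"), levels = c("Dz", "No Dz"))

levels(status)  # "Dz" is the first level, so the model predicts P(No Dz)
fit_no_dz <- glm(status ~ x, family = binomial)

# Make "No Dz" the reference level so the model predicts P(Dz) instead.
status_dz <- relevel(status, ref = "No Dz")
fit_dz    <- glm(status_dz ~ x, family = binomial)

p_no_dz <- fitted(fit_no_dz)
p_dz    <- fitted(fit_dz)

# The coefficients flip sign and the two fitted probabilities sum to one.
print(rbind(predicts_no_dz = coef(fit_no_dz), predicts_dz = coef(fit_dz)))
print(range(p_no_dz + p_dz))
```

Either coding gives the same model; releveling only changes which outcome the coefficients and fitted values describe.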
probability_data <- data.frame(fitted.values = predicted$fitted.values, status = tidy_data$status) %>%
arrange(fitted.values) %>%
mutate(rank = row_number())
ggplot(probability_data, aes(x = rank, y = fitted.values, color = status)) +
geom_point(alpha = 1, shape = 1, stroke = 2) +
labs(x = "Rank",
y = "Predicted Probability of Not Having Heart Disease",
title = "Predicted Probability of Not Having Heart Disease",
subtitle = "The Closer to Zero, the Higher the Probability of Having Heart Disease",
caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
By comparing our predicted values with each individual's known disease status, we can see that our generalized linear model separates the two groups well on this data set. Because the model is evaluated on the same data it was fit to, this assessment is likely optimistic.
Therefore, given a new sample with the same variables, we can estimate the probability that an individual does not have heart disease based on their values.
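One way to put a number on how well the fitted probabilities match the known status is a confusion matrix. The sketch below is an illustration on simulated data, not a result from this analysis: the predictor `x`, the 0.5 cutoff, and the sample size are all assumptions, and the accuracy is computed in-sample, so a held-out test set would give a more honest estimate.

```r
# Simulated data for illustration only (not the study data).
set.seed(1)
n      <- 200
x      <- rnorm(n)
status <- factor(ifelse(rbinom(n, 1, plogis(1.5 * x)) == 1, "No Dz", "Dz"),
                 levels = c("Dz", "No Dz"))

# Logistic regression; fitted values are the predicted P(No Dz).
fit        <- glm(status ~ x, family = binomial)
prob_no_dz <- fitted(fit)

# Classify with an assumed cutoff of 0.5.
predicted_status <- factor(ifelse(prob_no_dz >= 0.5, "No Dz", "Dz"),
                           levels = levels(status))

# Rows are the known status, columns the predicted status.
confusion <- table(actual = status, predicted = predicted_status)
accuracy  <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(round(accuracy, 3))
```

The off-diagonal cells show the two kinds of error separately, which matters here because missing a patient with heart disease is usually more costly than a false alarm.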
Heart disease is currently the leading cause of death in the United States. Understanding the relationship between different variables and the disease itself can help reduce its prevalence. As shown in this analysis, variables such as sex, age, and resting blood pressure are associated with heart disease, in line with claims from the literature. Other variables within our data, such as serum cholesterol level and chest pain, did not show a clear association with heart disease in the individual analyses despite common knowledge, although chest pain type and maximum heart rate achieved were statistically significant in the logistic regression model.
As shown in the statistical analysis, a generalized linear model was used to predict the probability of not having heart disease. The model summary reports, for each term, whether it is a statistically significant predictor (p-value less than 0.05) or not (p-value greater than 0.05). Since we are interested in predicting the probability of having heart disease, we focus on the negative coefficients, as they correspond to a lower probability of not having heart disease. The results show that sex is a good predictor of heart disease (as indicated by the large negative coefficient and small p-value). Other factors, such as exercise-induced angina and ST depression induced by exercise relative to rest, were also significant predictors; however, their effect sizes were relatively small. Therefore, further analysis is needed to explore other potential risk factors that could help predict heart disease but are not captured in this data set.
Opportunities for further analysis could be researching the following:
Elshourbagy, Nabil A, Harold V Meyers, and Sherin S Abdel-Meguid. 2014. “Cholesterol: The Good, the Bad, and the Ugly-Therapeutic Targets for the Treatment of Dyslipidemia.” Medical Principles and Practice 23 (2): 99–111.
Maas, Angela HEM, and Yolande EA Appelman. 2010. “Gender Differences in Coronary Heart Disease.” Netherlands Heart Journal 18 (12): 598–603.
Mosca, Lori, Elizabeth Barrett-Connor, and Nanette Kass Wenger. 2011. “Sex/Gender Differences in Cardiovascular Disease Prevention: What a Difference a Decade Makes.” Circulation 124 (19): 2145–54.
Shookster, Daniel, Bryndan Lindsey, Nelson Cortes, and Joel R Martin. 2020. “Accuracy of Commonly Used Age-Predicted Maximal Heart Rate Equations.” International Journal of Exercise Science 13 (7): 1242–50.