Introduction
The rising incidence of preventable diseases in developed countries underscores the need for accurate predictive measures to counter the decline in community health. This data set presents possible indicators of heart attack risk. By examining the relationships among these variables, anomalies in this subset of the population may reveal which factors have the most significant impact on an individual’s predisposition to a heart attack. Several questions arise from this data: Does resting blood pressure vary with age or sex? How do exercise-based factors like exercise induced angina and maximum heart rate relate to the probability of developing heart disease? And do nutrition-based biomarkers like fasting blood sugar and cholesterol affect heart disease probability? Establishing how these variables relate to the risk of heart disease and heart attacks could help build a preventative framework for the risk factors of preventable heart disease and improve the health of both individuals and society as a whole.
Importing Dataset about Heart Attacks from Kaggle
heartdata<-read.csv("heartattackdata.csv",
header=TRUE)
str(heartdata)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
head(heartdata)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 3 145 233 1 0 150 0 2.3 0 0 1
## 2 37 1 2 130 250 0 1 187 0 3.5 0 0 2
## 3 41 0 1 130 204 0 0 172 0 1.4 2 0 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2 0 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2 0 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1 0 1
## target
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
tail(heartdata)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 298 59 1 0 164 176 1 0 90 0 1.0 1 2 1
## 299 57 0 0 140 241 0 1 123 1 0.2 1 0 3
## 300 45 1 3 110 264 0 1 132 0 1.2 1 0 3
## 301 68 1 0 144 193 1 1 141 0 3.4 1 2 3
## 302 57 1 0 130 131 0 1 115 1 1.2 1 1 3
## 303 57 0 1 130 236 0 0 174 0 0.0 1 1 2
## target
## 298 0
## 299 0
## 300 0
## 301 0
## 302 0
## 303 0
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Wrangling the data
# Converting sex, heart disease risk, exercise induced angina, and fasting blood sugar > 120 mg/dl to factors.
heartdata$sexfactor <- factor(heartdata$sex, levels=c(0,1),
labels=c("Female","Male"))
heartdata$riskfactor <- factor(heartdata$target, levels=c(0,1),
labels=c("Low Risk","High Risk"))
heartdata$exangfactor <- factor(heartdata$exang, levels=c(0,1),
labels=c("No Exercise Induced Angina","Exercise Induced Angina"))
heartdata$fbsfactor <- factor(heartdata$fbs, levels=c(0,1),
labels=c("No","Yes"))
head(heartdata)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 3 145 233 1 0 150 0 2.3 0 0 1
## 2 37 1 2 130 250 0 1 187 0 3.5 0 0 2
## 3 41 0 1 130 204 0 0 172 0 1.4 2 0 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2 0 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2 0 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1 0 1
## target sexfactor riskfactor exangfactor fbsfactor
## 1 1 Male High Risk No Exercise Induced Angina Yes
## 2 1 Male High Risk No Exercise Induced Angina No
## 3 1 Female High Risk No Exercise Induced Angina No
## 4 1 Male High Risk No Exercise Induced Angina No
## 5 1 Female High Risk Exercise Induced Angina No
## 6 1 Male High Risk No Exercise Induced Angina No
# write.csv(heartdata, "workableheartdataset.csv")
Variables of Interest
Resting Blood Pressure
Cholesterol
Fasting Blood Sugar
Sex
Age
Maximum Heart Rate Achieved
Exercise Induced Angina
Relationship between blood pressure and level of risk for heart disease
# Creating a plot of resting blood pressure vs. age by sex and heart disease risk.
ggplot(heartdata, aes(x=age, y=trestbps, color=riskfactor))+
geom_jitter(alpha=0.6, size=0.75)+
facet_grid(sexfactor~riskfactor)+
labs(x="Age (years)", y="Resting Blood Pressure (mmHg)", title="Blood Pressure, Age, Sex and Heart Disease Risk", color="Heart Disease Risk")+
scale_color_manual(breaks = c("Low Risk", "High Risk"),
values=c("#0099FF", "#FF0033"))+
theme_classic()

Scatter Plot: (Resting blood pressure and age, separated by sex and heart disease risk)
In this graph, there seems to be a relationship between higher resting blood pressure (>120-130 mmHg) and a lower risk of heart disease for males across all ages. This seems counterintuitive; it could be that a majority of the respondents had high blood pressure, which makes the data appear this way. There doesn’t seem to be a strong correlation between higher resting blood pressure and being at higher risk, and in fact there appear to be more people, including outliers, with higher blood pressure in the low risk category across both sexes. For both low and high risk females, there appears to be a weak-to-moderate positive correlation between resting blood pressure and age, so as women in both categories get older, they tend to have a higher resting blood pressure. It is hard to say whether men show the same trend, since male respondents have both low and high blood pressure even at older ages, but there could be a slight positive correlation between age and blood pressure for men. There also appear to be many more male respondents than female ones.
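The visual impressions above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it counts respondents, averages resting blood pressure, and computes the Pearson correlation between age and resting blood pressure within each sex and risk group. The helper name `summarise_bp` and the `sample_rows` data frame are assumptions made for illustration. `sample_rows` holds only the twelve rows printed by `head()` and `tail()` earlier, so its numbers say nothing about the full 303-row data set.

```r
# Twelve rows copied from the head() and tail() output shown earlier.
# Illustrative only: the full data set has 303 rows.
sample_rows <- data.frame(
  age      = c(63, 37, 41, 56, 57, 57, 59, 57, 45, 68, 57, 57),
  sex      = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0),
  trestbps = c(145, 130, 130, 120, 120, 140, 164, 140, 110, 144, 130, 130),
  target   = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)
sample_rows$sexfactor <- factor(sample_rows$sex, levels = c(0, 1),
                                labels = c("Female", "Male"))
sample_rows$riskfactor <- factor(sample_rows$target, levels = c(0, 1),
                                 labels = c("Low Risk", "High Risk"))

# Hypothetical helper: one summary row per sex-by-risk group.
summarise_bp <- function(df) {
  groups <- split(df, list(df$sexfactor, df$riskfactor), drop = TRUE)
  rows <- lapply(groups, function(g) {
    # A correlation needs at least three points and some spread in both variables.
    usable <- nrow(g) >= 3 && sd(g$age) > 0 && sd(g$trestbps) > 0
    data.frame(
      sex        = as.character(g$sexfactor[1]),
      risk       = as.character(g$riskfactor[1]),
      n          = nrow(g),
      mean_bp    = mean(g$trestbps),
      cor_age_bp = if (usable) cor(g$age, g$trestbps) else NA_real_
    )
  })
  do.call(rbind, rows)
}

bp_summary <- summarise_bp(sample_rows)
print(bp_summary)
```

With the full data loaded, `summarise_bp(heartdata)` would give the group sizes and correlations needed to judge whether the trends seen in the scatter plot hold up.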
Relationship between exercise biomarkers and heart disease risk
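# Creating a column graph of mean maximum heart rate achieved vs. exercise induced angina by heart disease risk factor.
# stat_summary() draws one bar per group at the group mean; geom_col() on the raw rows would overlay one bar per respondent.
ggplot(heartdata, aes(x=exangfactor, y=thalach, fill=riskfactor))+
stat_summary(fun=mean, geom="col", position="dodge")+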
labs(x = "Angina", y = "Maximum Heart Rate Achieved (bpm)", title="Exercise Biomarkers and Heart Disease Risk", fill= "Heart Disease Risk")+
scale_fill_manual(breaks = c("Low Risk", "High Risk"),
values=c("#0099FF", "#FF0033"))+
theme_classic()

# Creating a box-plot of maximum heart rate achieved vs. exercise induced angina by heart disease risk factor.
ggplot(heartdata, aes(x=exangfactor, y=thalach, fill=riskfactor))+
geom_boxplot()+
labs(x = "Angina", y = "Maximum Heart Rate Achieved (bpm)", title="Exercise Biomarkers and Heart Disease Risk", fill= "Heart Disease Risk")+
scale_fill_manual(breaks = c("Low Risk", "High Risk"),
values=c("#0099FF", "#FF0033"))+
theme_classic()

Bar Chart and Boxplot: (Maximum heart rate achieved vs exercise induced angina separated by low and high risk for heart disease)
Those with no exercise induced angina achieved a higher average maximum heart rate than those with exercise induced angina. This could indicate that angina prevents one from pushing as hard during a workout, resulting in a lower maximum heart rate. Those at lower risk for heart disease have, on average, a lower maximum heart rate achieved in both angina categories. This could indicate that achieving a very high heart rate during a workout is a sign of heart disease or puts one at risk for it. Those in the low risk category with exercise induced angina have the lowest maximum heart rate achieved on average, whereas those in the high risk category with no exercise induced angina have the highest. This again seems to indicate that having no exercise induced angina allows one to push harder and achieve a higher heart rate, and that being able to achieve such a high heart rate could indicate heart disease or contribute to heart disease risk. Although both visuals show the same data, the box plot provides a better, easier-to-read visualization of the relationship between these variables.
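The group comparisons described above can also be tabulated directly. This is a minimal sketch under stated assumptions, not the original analysis: the `hr_rows` data frame and the helper `group_stat` are hypothetical names, and `hr_rows` contains only the twelve rows printed by `head()` and `tail()` earlier. It reports both the mean and the median, since a box plot displays medians rather than means.

```r
# Twelve rows copied from the head() and tail() output shown earlier.
# Illustrative only: the full data set has 303 rows.
hr_rows <- data.frame(
  thalach = c(150, 187, 172, 178, 163, 148, 90, 123, 132, 141, 115, 174),
  exang   = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0),
  target  = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)
hr_rows$exangfactor <- factor(hr_rows$exang, levels = c(0, 1),
  labels = c("No Exercise Induced Angina", "Exercise Induced Angina"))
hr_rows$riskfactor <- factor(hr_rows$target, levels = c(0, 1),
  labels = c("Low Risk", "High Risk"))

# Hypothetical helper: apply a statistic to thalach in each angina-by-risk cell.
group_stat <- function(df, stat) {
  tapply(df$thalach, list(angina = df$exangfactor, risk = df$riskfactor), stat)
}

mean_hr   <- group_stat(hr_rows, mean)    # cell means, as in a column chart of averages
median_hr <- group_stat(hr_rows, median)  # cell medians, as in the box plot
n_hr      <- group_stat(hr_rows, length)  # respondents per cell

print(mean_hr)
print(median_hr)
print(n_hr)
```

Calling `group_stat(heartdata, mean)` on the full data would give the four averages that the interpretation above relies on, together with the cell sizes that indicate how much weight each average can bear.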
Relationship between nutrition-based biomarkers and heart disease risk
# Creating a boxplot of cholesterol vs. whether one's fasted blood glucose is above 120 mg/dl by heart disease risk.
ggplot(heartdata, aes(x=fbsfactor, y=chol, fill=riskfactor))+
geom_boxplot()+
labs(x="Fasted Blood Sugar >120 mg/dl", y="Serum Cholesterol (mg/dl)", title="Nutrition-Based Biomarkers and Heart Disease Risk", fill="Heart Disease Risk")+
scale_fill_manual(breaks = c("Low Risk", "High Risk"),
values=c("#0099FF", "#FF0033"))+
theme_classic()

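# Creating a column graph of mean cholesterol vs. whether one's fasted blood glucose is above 120 mg/dl by heart disease risk.
# stat_summary() draws one bar per group at the group mean; geom_col() on the raw rows would overlay one bar per respondent.
ggplot(heartdata, aes(x=fbsfactor, y=chol, fill=riskfactor))+
stat_summary(fun=mean, geom="col", position="dodge")+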
labs(x="Fasted Blood Sugar >120 mg/dl", y="Serum Cholesterol (mg/dl)", title="Nutrition-Based Biomarkers and Heart Disease Risk", fill="Heart Disease Risk")+
scale_fill_manual(breaks = c("Low Risk", "High Risk"),
values=c("#0099FF", "#FF0033"))+
theme_classic()

# Creating a histogram of cholesterol by heart disease risk.
ggplot(heartdata, aes(x=chol, fill=riskfactor))+
geom_histogram(position="dodge", binwidth = 100)+
labs(x="Serum Cholesterol (mg/dl)", title="Serum Cholesterol and Heart Disease Risk", fill="Heart Disease Risk")+
scale_fill_manual(breaks = c("Low Risk", "High Risk"),
values=c("#0099FF", "#FF0033"))+
theme_classic()

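# Creating a bar chart of the number of respondents with fasted blood glucose above 120 mg/dl by heart disease risk.
ggplot(heartdata, aes(x=fbsfactor, fill=riskfactor))+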
geom_bar(position="dodge")+
labs(x="Fasted Blood Sugar >120 mg/dl", title="Fasted Blood Sugar and Heart Disease Risk", fill="Heart Disease Risk")+
scale_fill_manual(breaks = c("Low Risk", "High Risk"),
values=c("#0099FF", "#FF0033"))+
theme_classic()

Boxplot (Serum cholesterol and fasted blood sugar separated by low and high risk)
Those in the low risk category seem to have a higher average cholesterol concentration. This could indicate that cholesterol does not have a large impact on heart disease risk, or perhaps that higher cholesterol reduces the risk of heart disease. This variable, however, does not indicate which type of cholesterol is being measured (HDL, LDL, or a combination). Fasted blood sugar >120 mg/dl seems to have no impact on average serum cholesterol in either the low risk or the high risk category (the averages are very similar). This seems to go against the intuition that a higher fasted blood sugar would lead to a multitude of health problems, but there are also far fewer respondents with blood sugar >120 mg/dl, so the comparison may not be statistically meaningful. There are some outliers with extremely high cholesterol in the high risk category for both fasted blood sugar groups, along with one outlier with low cholesterol for a fasted blood sugar >120 mg/dl. There are also a couple of outliers with high cholesterol in the low risk category for fasted blood sugar not above 120 mg/dl. These data could be showing that those with extremely high cholesterol (the upper outliers) are at a higher risk for heart disease.
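The concern about the small number of respondents with fasted blood sugar >120 mg/dl can be made concrete by counting each group and running a rank-based test on cholesterol. The following is a minimal sketch, not part of the original analysis. It assumes a Wilcoxon rank-sum test is an acceptable choice here, and the `bio_rows` data frame is a hypothetical name holding only the twelve rows printed by `head()` and `tail()` earlier, so the resulting p-value is illustrative and need not match the full data.

```r
# Twelve rows copied from the head() and tail() output shown earlier.
# Illustrative only: the full data set has 303 rows.
bio_rows <- data.frame(
  chol = c(233, 250, 204, 236, 354, 192, 176, 241, 264, 193, 131, 236),
  fbs  = c(1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0)
)
bio_rows$fbsfactor <- factor(bio_rows$fbs, levels = c(0, 1),
                             labels = c("No", "Yes"))

# How many respondents fall in each fasted blood sugar group?
fbs_counts <- table(bio_rows$fbsfactor)
print(fbs_counts)

# Wilcoxon rank-sum test: does cholesterol differ between the two groups?
# exact = FALSE uses the normal approximation, which tolerates tied values.
chol_test <- wilcox.test(chol ~ fbsfactor, data = bio_rows, exact = FALSE)
print(chol_test$p.value)
```

On the full data, replacing `bio_rows` with `heartdata` would show how unbalanced the two groups are and whether the similarity in cholesterol is consistent with no difference.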
Histogram (The number of people within serum cholesterol levels separated by low and high risk)
The high risk category has the most people at ~200-250 mg/dl cholesterol, which could indicate that a serum cholesterol value within this range puts one at a higher risk for heart disease. Those with serum cholesterol >400 mg/dl also appear to be at a higher risk for heart disease, since there are no low risk (blue) bars on the histogram above that value.
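A bin with the most high risk respondents may simply be the bin with the most respondents overall, so it helps to look at the share of high risk respondents in each bin as well as the raw counts. The following is a minimal sketch, not part of the original analysis. The 50 mg/dl bin width, the 100 to 600 mg/dl range, and the `chol_rows` data frame are assumptions made for illustration; `chol_rows` holds only the twelve rows printed by `head()` and `tail()` earlier.

```r
# Twelve rows copied from the head() and tail() output shown earlier.
# Illustrative only: the full data set has 303 rows.
chol_rows <- data.frame(
  chol   = c(233, 250, 204, 236, 354, 192, 176, 241, 264, 193, 131, 236),
  target = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)
chol_rows$riskfactor <- factor(chol_rows$target, levels = c(0, 1),
                               labels = c("Low Risk", "High Risk"))

# Assumed binning: 50 mg/dl wide, right-closed intervals such as (200,250].
# Widen the range if the full data contain values outside 100-600 mg/dl.
chol_rows$chol_bin <- cut(chol_rows$chol, breaks = seq(100, 600, by = 50))

# Counts per bin and risk category (what a dodged histogram displays).
bin_counts <- table(bin = chol_rows$chol_bin, risk = chol_rows$riskfactor)

# Share of high risk respondents within each non-empty bin.
non_empty  <- bin_counts[rowSums(bin_counts) > 0, , drop = FALSE]
high_share <- non_empty[, "High Risk"] / rowSums(non_empty)

print(non_empty)
print(round(high_share, 2))
```

Running the same steps on `heartdata` would show whether the ~200-250 mg/dl range really has an elevated share of high risk respondents, and how few respondents the >400 mg/dl observation rests on.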