Introduction

The rising incidence of preventable diseases in developed countries has underscored the need for accurate predictive measures to counter the decline in community health. This data set introduces possible indicators of heart attack risk. By examining the relationships among these variables, we may find patterns in this subset of the population that point to the factors with the greatest impact on an individual's predisposition to a heart attack. Several questions arise when looking at these data: Does resting blood pressure vary with age or sex? How do exercise-based factors, such as exercise-induced angina and maximum heart rate, relate to the probability of developing heart disease? And do nutrition-based biomarkers, such as fasting blood sugar and cholesterol, relate to that probability? By establishing how these variables relate to the risk of heart disease and heart attacks, this analysis could help build a preventative framework around the risk factors of preventable heart disease and so improve the health of both individuals and society as a whole.

Importing Dataset about Heart Attacks from Kaggle

heartdata<-read.csv("heartattackdata.csv",
                    header=TRUE)
str(heartdata)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...
head(heartdata)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2  37   1  2      130  250   0       1     187     0     3.5     0  0    2
## 3  41   0  1      130  204   0       0     172     0     1.4     2  0    2
## 4  56   1  1      120  236   0       1     178     0     0.8     2  0    2
## 5  57   0  0      120  354   0       1     163     1     0.6     2  0    2
## 6  57   1  0      140  192   0       1     148     0     0.4     1  0    1
##   target
## 1      1
## 2      1
## 3      1
## 4      1
## 5      1
## 6      1
tail(heartdata)
##     age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 298  59   1  0      164  176   1       0      90     0     1.0     1  2    1
## 299  57   0  0      140  241   0       1     123     1     0.2     1  0    3
## 300  45   1  3      110  264   0       1     132     0     1.2     1  0    3
## 301  68   1  0      144  193   1       1     141     0     3.4     1  2    3
## 302  57   1  0      130  131   0       1     115     1     1.2     1  1    3
## 303  57   0  1      130  236   0       0     174     0     0.0     1  1    2
##     target
## 298      0
## 299      0
## 300      0
## 301      0
## 302      0
## 303      0
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.3
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Wrangling the data
# Converting sex, heart disease risk, exercise induced angina, and fasting blood sugar > 120 mg/dl to factors.
heartdata$sexfactor <- factor(heartdata$sex, levels=c(0,1),
                                   labels=c("Female","Male"))
heartdata$riskfactor <- factor(heartdata$target, levels=c(0,1),
                                   labels=c("Low Risk","High Risk"))
heartdata$exangfactor <- factor(heartdata$exang, levels=c(0,1),
                                   labels=c("No Exercise Induced Angina","Exercise Induced Angina"))
heartdata$fbsfactor <- factor(heartdata$fbs, levels=c(0,1),
                                   labels=c("No","Yes"))
head(heartdata)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2  37   1  2      130  250   0       1     187     0     3.5     0  0    2
## 3  41   0  1      130  204   0       0     172     0     1.4     2  0    2
## 4  56   1  1      120  236   0       1     178     0     0.8     2  0    2
## 5  57   0  0      120  354   0       1     163     1     0.6     2  0    2
## 6  57   1  0      140  192   0       1     148     0     0.4     1  0    1
##   target sexfactor riskfactor                exangfactor fbsfactor
## 1      1      Male  High Risk No Exercise Induced Angina       Yes
## 2      1      Male  High Risk No Exercise Induced Angina        No
## 3      1    Female  High Risk No Exercise Induced Angina        No
## 4      1      Male  High Risk No Exercise Induced Angina        No
## 5      1    Female  High Risk    Exercise Induced Angina        No
## 6      1      Male  High Risk No Exercise Induced Angina        No
# write.csv(heartdata, "workableheartdataset.csv")
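
The recoding above can be sanity-checked by cross-tabulating each numeric code against its new label. The following is a minimal sketch in base R, assuming a small illustrative data frame named `excerpt` (a hypothetical name) built from the 12 rows printed by `head()` and `tail()` above rather than the full 303-row data set.

```r
# Illustrative excerpt: sex and target for the 12 rows printed by head() and tail().
excerpt <- data.frame(
  sex    = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0),
  target = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)

# Same recoding as in the wrangling step.
excerpt$sexfactor  <- factor(excerpt$sex, levels = c(0, 1),
                             labels = c("Female", "Male"))
excerpt$riskfactor <- factor(excerpt$target, levels = c(0, 1),
                             labels = c("Low Risk", "High Risk"))

# Cross-tabulate code against label; every count should sit on the diagonal.
sex_check  <- table(code = excerpt$sex, label = excerpt$sexfactor)
risk_check <- table(code = excerpt$target, label = excerpt$riskfactor)
print(sex_check)
print(risk_check)

# factor() turns any code outside `levels` into NA, so check for that as well.
recode_has_na <- anyNA(excerpt$sexfactor) || anyNA(excerpt$riskfactor)
print(recode_has_na)
```

Running the same two `table()` calls on `heartdata` would confirm the recoding for the full data set.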

Variables of Interest

Resting Blood Pressure

Cholesterol

Fasting Blood Sugar

Sex

Age

Maximum Heart Rate Achieved

Exercise Induced Angina

Relationship between blood pressure and level of risk for heart disease

#  Creating a plot of resting blood pressure vs. age by sex and heart disease risk.
ggplot(heartdata, aes(x=age, y=trestbps, color=riskfactor))+
  geom_jitter(alpha=0.6, size=0.75)+
  facet_grid(sexfactor~riskfactor)+
  labs(x="Age (years)", y="Resting Blood Pressure (mmHg)", title="Blood Pressure, Age, Sex and Heart Disease Risk", color="Heart Disease Risk")+
  scale_color_manual(breaks = c("Low Risk", "High Risk"), 
                       values=c("#0099FF", "#FF0033"))+
  theme_classic()

Scatter Plot: (Resting blood pressure and age, separated by sex and heart disease risk)

In this graph, higher resting blood pressure (above roughly 120-130 mmHg) appears to be associated with lower heart disease risk for males across all ages. This seems counterintuitive. One possible explanation is that most respondents had high blood pressure, which would make the data appear this way. There does not seem to be a strong relationship between higher resting blood pressure and being in the high risk category; in fact, the low risk category appears to contain more people, and more outliers, with higher blood pressure across both sexes. For both low and high risk females, there appears to be a weak-to-moderate positive correlation between resting blood pressure and age, so women in both categories tend to have higher resting blood pressure as they get older. It is harder to say whether men follow the same trend, since older male respondents include many with low blood pressure as well as many with high blood pressure, but there could be a slight positive correlation between age and blood pressure for men as well. There also appear to be many more male respondents than female ones.
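
The visual impressions above could be checked numerically. The following is a minimal base R sketch, assuming an illustrative data frame named `excerpt` (a hypothetical name) that holds only the 12 rows printed by `head()` and `tail()` earlier. With so few rows the numbers are not meaningful; the point is only the shape of the calculation, which would be run on `heartdata` in practice.

```r
# Illustrative excerpt: the 12 rows printed by head() and tail().
excerpt <- data.frame(
  age      = c(63, 37, 41, 56, 57, 57, 59, 57, 45, 68, 57, 57),
  sex      = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0),
  trestbps = c(145, 130, 130, 120, 120, 140, 164, 140, 110, 144, 130, 130),
  target   = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)
excerpt$sexfactor  <- factor(excerpt$sex, levels = c(0, 1),
                             labels = c("Female", "Male"))
excerpt$riskfactor <- factor(excerpt$target, levels = c(0, 1),
                             labels = c("Low Risk", "High Risk"))

# Median resting blood pressure in each sex-by-risk panel of the plot.
bp_summary <- aggregate(trestbps ~ sexfactor + riskfactor,
                        data = excerpt, FUN = median)
print(bp_summary)

# Number of respondents of each sex.
sex_counts <- table(excerpt$sexfactor)
print(sex_counts)

# Spearman rank correlation between age and resting blood pressure, by sex.
# Spearman is an assumption here; it is less sensitive to outliers than Pearson.
age_bp_cor <- sapply(split(excerpt, excerpt$sexfactor),
                     function(d) cor(d$age, d$trestbps, method = "spearman"))
print(age_bp_cor)
```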

Relationship between exercise biomarkers and heart disease risk

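# Creating a column graph of maximum heart rate achieved vs. exercise induced angina by heart disease risk factor.
# geom_col() draws one bar per row, so the overlapping bars in each group effectively show the group's largest value, not its mean.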
ggplot(heartdata, aes(x=exangfactor, y=thalach, fill=riskfactor))+
  geom_col(position="dodge")+
  labs(x = "Angina", y = "Maximum Heart Rate Achieved (bpm)", title="Exercise Biomarkers and Heart Disease Risk", fill= "Heart Disease Risk")+
  scale_fill_manual(breaks = c("Low Risk", "High Risk"), 
                       values=c("#0099FF", "#FF0033"))+
  theme_classic()

# Creating a box-plot of maximum heart rate achieved vs. exercise induced angina by heart disease risk factor.
ggplot(heartdata, aes(x=exangfactor, y=thalach, fill=riskfactor))+
  geom_boxplot()+
  labs(x = "Angina", y = "Maximum Heart Rate Achieved (bpm)", title="Exercise Biomarkers and Heart Disease Risk", fill= "Heart Disease Risk")+
  scale_fill_manual(breaks = c("Low Risk", "High Risk"), 
                       values=c("#0099FF", "#FF0033"))+
  theme_classic()

Bar Chart and Boxplot: (Maximum heart rate achieved vs exercise induced angina separated by low and high risk for heart disease)

Respondents without exercise-induced angina achieved a higher maximum heart rate on average than those with it. This could indicate that angina prevents one from pushing as hard during a workout, resulting in a lower maximum heart rate. Those at lower risk for heart disease have, on average, a lower maximum heart rate achieved in both angina categories. This could suggest that reaching a very high heart rate during a workout is a sign of heart disease or contributes to heart disease risk, although these plots alone cannot establish that. Those in the low risk category with exercise-induced angina have the lowest maximum heart rate achieved on average, whereas those in the high risk category without exercise-induced angina have the highest. This again seems to indicate that the absence of exercise-induced angina could allow one to push harder and reach a higher heart rate, and that being able to reach such a high heart rate could indicate heart disease or contribute to heart disease risk. Of these two visuals, which plot the same variables, the box plot is the better and easier-to-read view of the relationship.
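
Since the discussion above is about group averages, it may help to compute them directly. The following is a minimal base R sketch, assuming an illustrative data frame named `excerpt` (a hypothetical name) built from the 12 rows printed by `head()` and `tail()` earlier. It also computes the per-group maximum for comparison, because a column chart drawn from raw rows reflects the tallest bar in each group rather than the mean.

```r
# Illustrative excerpt: the 12 rows printed by head() and tail().
excerpt <- data.frame(
  thalach = c(150, 187, 172, 178, 163, 148, 90, 123, 132, 141, 115, 174),
  exang   = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0),
  target  = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)
excerpt$exangfactor <- factor(excerpt$exang, levels = c(0, 1),
                              labels = c("No Exercise Induced Angina",
                                         "Exercise Induced Angina"))
excerpt$riskfactor  <- factor(excerpt$target, levels = c(0, 1),
                              labels = c("Low Risk", "High Risk"))

# Mean maximum heart rate for each angina-by-risk group.
mean_hr <- aggregate(thalach ~ exangfactor + riskfactor,
                     data = excerpt, FUN = mean)
print(mean_hr)

# Largest maximum heart rate for each group, for comparison with the means.
max_hr <- aggregate(thalach ~ exangfactor + riskfactor,
                    data = excerpt, FUN = max)
print(max_hr)
```

On the full data set, the `mean_hr` table could be passed to `geom_col()` in place of `heartdata` if a bar chart of averages is wanted.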

Relationship between nutrition-based biomarkers and heart disease risk

#  Creating a boxplot of cholesterol vs. whether one's fasted blood glucose is above 120 mg/dl by heart disease risk.
ggplot(heartdata, aes(x=fbsfactor, y=chol, fill=riskfactor))+
  geom_boxplot()+
  labs(x="Fasted Blood Sugar >120 mg/dl", y="Serum Cholesterol (mg/dl)", title="Nutrition-Based Biomarkers and Heart Disease Risk", fill="Heart Disease Risk")+
  scale_fill_manual(breaks = c("Low Risk", "High Risk"), 
                       values=c("#0099FF", "#FF0033"))+
  theme_classic()

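# Creating a column graph of cholesterol vs. whether one's fasting blood glucose is above 120 mg/dl by heart disease risk.
# As with the earlier column graph, geom_col() draws one bar per row, so each group effectively shows its largest value.
ggplot(heartdata, aes(x=fbsfactor, y=chol, fill=riskfactor))+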
  geom_col(position="dodge")+
  labs(x="Fasted Blood Sugar >120 mg/dl", y="Serum Cholesterol (mg/dl)", title="Nutrition-Based Biomarkers and Heart Disease Risk", fill="Heart Disease Risk")+
  scale_fill_manual(breaks = c("Low Risk", "High Risk"), 
                       values=c("#0099FF", "#FF0033"))+
  theme_classic()

# Creating a histogram of serum cholesterol by heart disease risk (bins are 100 mg/dl wide).
ggplot(heartdata, aes(x=chol, fill=riskfactor))+
  geom_histogram(position="dodge", binwidth = 100)+
  labs(x="Serum Cholesterol (mg/dl)", title="Serum Cholesterol and Heart Disease Risk", fill="Heart Disease Risk")+
  scale_fill_manual(breaks = c("Low Risk", "High Risk"), 
                       values=c("#0099FF", "#FF0033"))+
  theme_classic()

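# Creating a bar chart of the number of respondents in each fasting blood sugar category by heart disease risk.
ggplot(heartdata, aes(x=fbsfactor, fill=riskfactor))+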
  geom_bar(position="dodge")+
  labs(x="Fasted Blood Sugar >120 mg/dl", title="Fasted Blood Sugar and Heart Disease Risk", fill="Heart Disease Risk")+
  scale_fill_manual(breaks = c("Low Risk", "High Risk"), 
                       values=c("#0099FF", "#FF0033"))+
  theme_classic()

Boxplot (Serum cholesterol and fasted blood sugar separated by low and high risk)

Those in the low risk category seem to have a higher typical cholesterol concentration. This could indicate that cholesterol does not have a large impact on heart disease risk or, perhaps, that higher cholesterol is associated with reduced risk. This variable, however, does not indicate which type of cholesterol is being measured (HDL, LDL, or a combination). Fasting blood sugar >120 mg/dl seems to have no effect on typical serum cholesterol in either the low risk or the high risk category (the values are very similar). This goes against the intuition that higher fasting blood sugar leads to a multitude of health problems, but there are also far fewer respondents with blood sugar >120 mg/dl, so any difference may not be statistically significant. There are some outliers with extremely high cholesterol in the high risk category for both fasting blood sugar groups, and one outlier with low cholesterol among those with fasting blood sugar >120 mg/dl. There are also a couple of outliers with high cholesterol in the low risk category for fasting blood sugar <120 mg/dl. These data could be showing that those with extremely high cholesterol (the upper outliers) are at higher risk for heart disease.
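
The group sizes and typical values discussed above can be tabulated directly, and a simple test gives a rough sense of whether a difference in cholesterol between risk groups could be due to chance. The following is a minimal base R sketch, assuming an illustrative data frame named `excerpt` (a hypothetical name) built from the 12 rows printed by `head()` and `tail()` earlier. The Welch t-test is an assumption chosen for illustration, not a method used in this report, and its result on 12 rows should not be interpreted.

```r
# Illustrative excerpt: the 12 rows printed by head() and tail().
excerpt <- data.frame(
  chol   = c(233, 250, 204, 236, 354, 192, 176, 241, 264, 193, 131, 236),
  fbs    = c(1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0),
  target = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)
excerpt$fbsfactor  <- factor(excerpt$fbs, levels = c(0, 1),
                             labels = c("No", "Yes"))
excerpt$riskfactor <- factor(excerpt$target, levels = c(0, 1),
                             labels = c("Low Risk", "High Risk"))

# How many respondents fall in each fasting blood sugar category?
fbs_counts <- table(excerpt$fbsfactor)
print(fbs_counts)

# Median serum cholesterol for each fasting-blood-sugar-by-risk group
# (the median is the centre line drawn by geom_boxplot()).
chol_median <- aggregate(chol ~ fbsfactor + riskfactor,
                         data = excerpt, FUN = median)
print(chol_median)

# Welch two-sample t-test of cholesterol between the two risk groups.
chol_test <- t.test(chol ~ riskfactor, data = excerpt)
print(chol_test$p.value)
```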

Histogram (The number of people within serum cholesterol levels separated by low and high risk)

The high risk category has its largest count at roughly 200-250 mg/dl cholesterol, which could indicate that a serum cholesterol value within this range puts one at higher risk for heart disease, although it may also simply reflect where most respondents' values fall. Above 400 mg/dl, the histogram shows no low risk (blue) bars, which suggests that serum cholesterol above that value is associated with higher risk for heart disease.
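
A raw count in a histogram bin depends on how many respondents fall in that bin overall, so the share of high risk respondents within each bin can be a more direct way to read risk from the same data. The following is a minimal base R sketch, assuming an illustrative data frame named `excerpt` (a hypothetical name) built from the 12 rows printed by `head()` and `tail()` earlier. The 50 mg/dl bins from 100 to 600 are an assumption made for illustration; the histogram above uses 100 mg/dl bins.

```r
# Illustrative excerpt: the 12 rows printed by head() and tail().
excerpt <- data.frame(
  chol   = c(233, 250, 204, 236, 354, 192, 176, 241, 264, 193, 131, 236),
  target = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)
excerpt$riskfactor <- factor(excerpt$target, levels = c(0, 1),
                             labels = c("Low Risk", "High Risk"))

# Assign each cholesterol value to a 50 mg/dl bin; intervals are right-closed,
# e.g. "(200,250]". Values outside 100-600 would become NA.
chol_bin <- cut(excerpt$chol, breaks = seq(100, 600, by = 50))

# Number of respondents per bin and risk category.
bin_counts <- table(bin = chol_bin, risk = excerpt$riskfactor)
print(bin_counts)

# Share of high risk respondents within each bin (NaN for empty bins).
share_high <- prop.table(bin_counts, margin = 1)[, "High Risk"]
print(round(share_high, 2))
```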