Na’ol Kebede
February 26, 2023
For this assignment I will be looking at the Heart Attack Analysis & Prediction Dataset. This dataset collected a number of important medical features in order to predict the risk of a heart attack. These features include:
| Description | Variable name |
|---|---|
| Age | age |
| Sex | sex |
| If the person has experienced excercise induced angina | exercise_angina |
| The number of major vessels | major_vessels |
| The chest pain an individual has and the type | chest_pain |
| Resting blood pressure | blood_pressure |
| Cholesteral | cholest |
| Fasted blood sugar test, either 1 or 0 | bsugar_test |
| Resting ECG results | restecg |
| Maximum heart rate | max_heartrate |
| The slope of the peak exercise ST segment (2 = upsloping; 1 = flat; 0 = downsloping) | slp |
| ST depression induced by exercise relative to rest | oldpeak |
| Thalaseemia rate - 2 = normal; 1 = fixed defect; 3 = reversable defect | thal_rate |
| Their heart attack risk classification, either 1 or 0 | output |
With this data, I will explore the distribution of individuals and try to find the important features for extracting out for the likelihood of a heart attack.
Lets have a look at our data to make sure that it is proper in the sense that it does not have null values, each variable is in the correct domain, and it is ready for our visualizations.
## loading necessary libraries
library(ggplot2)
library(gridExtra)
library(RColorBrewer)
library(ggthemes)
library(lubridate)
library(data.table)
library(dplyr)
library(plotly)
## setting up enviornment
rm(list=ls())
setwd("C:/Users/Mrwin/Documents/DS736")
Here, we replace the existing column names so that there is more clarity when referring to them in our visualizations.
## reading in data and setting appropriate and understandable column names
df <- fread("heart.csv")
colnames(df) <- c("age", "sex", "chest_pain", "blood_pressure", "cholest", "bsugar_test", "restecg",
"max_heartrate", "exercise_angina", "oldpeak", "slp", "major_vessels", "thal_rate", "output")
head(df)
## age sex chest_pain blood_pressure cholest bsugar_test restecg max_heartrate
## 1: 63 1 3 145 233 1 0 150
## 2: 37 1 2 130 250 0 1 187
## 3: 41 0 1 130 204 0 0 172
## 4: 56 1 1 120 236 0 1 178
## 5: 57 0 0 120 354 0 1 163
## 6: 57 1 0 140 192 0 1 148
## exercise_angina oldpeak slp major_vessels thal_rate output
## 1: 0 2.3 0 0 1 1
## 2: 0 3.5 0 0 2 1
## 3: 0 1.4 2 0 2 1
## 4: 0 0.8 2 0 2 1
## 5: 1 0.6 2 0 2 1
## 6: 0 0.4 1 0 1 1
any(is.na(df))
## [1] FALSE
str(df)
## Classes 'data.table' and 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ chest_pain : int 3 2 1 1 0 0 1 1 2 2 ...
## $ blood_pressure : int 145 130 130 120 120 140 140 120 172 150 ...
## $ cholest : int 233 250 204 236 354 192 294 263 199 168 ...
## $ bsugar_test : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ max_heartrate : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exercise_angina: int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slp : int 0 0 2 2 2 1 1 2 2 2 ...
## $ major_vessels : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal_rate : int 1 2 2 2 2 1 2 3 3 2 ...
## $ output : int 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(df)
## age sex chest_pain blood_pressure
## Min. :29.00 Min. :0.0000 Min. :0.000 Min. : 94.0
## 1st Qu.:47.50 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:120.0
## Median :55.00 Median :1.0000 Median :1.000 Median :130.0
## Mean :54.37 Mean :0.6832 Mean :0.967 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.000 Max. :200.0
## cholest bsugar_test restecg max_heartrate
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
## Median :240.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :246.3 Mean :0.1485 Mean :0.5281 Mean :149.6
## 3rd Qu.:274.5 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## exercise_angina oldpeak slp major_vessels
## Min. :0.0000 Min. :0.00 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.80 Median :1.000 Median :0.0000
## Mean :0.3267 Mean :1.04 Mean :1.399 Mean :0.7294
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.20 Max. :2.000 Max. :4.0000
## thal_rate output
## Min. :0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000
## Mean :2.314 Mean :0.5446
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000
We have 303 males and females ranging from 29 to 77 years old. Looking though medical information and the description of the dataset provided, the values are all well behaved.
ggplot(df, aes(x = as.factor(output), y = max_heartrate,fill = as.factor(output))) +
geom_boxplot(outlier.color = "red", outlier.shape = 5, outlier.size = 3) +
labs(fill = "Heart Attack Chance", title = "Max Heart rate By Heart Attack Risk Boxplot", x = "Heart Attack Chance", y = "Max Heart rate") +
scale_fill_brewer(palette = "Oranges", labels = c("Low Risk", "High Risk"))
As maximum heart rate ties into many indicators for a healthy individual, it is very interesting to see that those in the high risk group have a higher distribution than the low risk group. As people who have more cardiovascular fitness have higher max heart rates, this would be an area of study to find any conflicting or underlying connections with heart attack chance.
model_low_bp <- lm(df[df$output==0,]$age ~ df[df$output ==0,]$blood_pressure)
model_high_bp <- lm(df[df$output ==1,]$age ~ df[df$output ==1,]$blood_pressure)
model_low_ch <- lm(df[df$output ==0,]$age ~ df[df$output ==0,]$cholest)
model_high_ch <- lm(df[df$output ==1,]$age ~ df[df$output ==1,]$cholest)
pl1 <- ggplot(df[df$output ==0,], aes(age, blood_pressure)) +
geom_point(size=2, shape=4, color="blue") +
geom_smooth(method=lm) +
ggtitle("Low Risk") +
theme_bw() +
scale_color_brewer(palette="Dark2") +
labs(y ="Blood Pressure", x = "Age") +
annotate(geom="text", x=42, y=180,
label=paste("R-Squared: ",round(summary(model_low_bp)$r.squared,digits=2)), color="black")
pl2 <- ggplot(df[df$output ==1,], aes(age, blood_pressure)) +
geom_point(size=2, shape=4, color="red") +
geom_smooth(method=lm) +
ggtitle("High Risk") +
theme_bw() +
scale_color_brewer(palette="Dark2") +
labs(y ="Blood Pressure", x = "Age") +
annotate(geom="text", x=40, y=180,
label=paste("R-Squared: ",round(summary(model_high_bp)$r.squared,digits=2)), color="black")
pl3 <- ggplot(df[df$output ==0,], aes(age, cholest)) +
geom_point(size=2, shape=6, color="blue") +
geom_smooth(method=lm) +
ggtitle("Low Risk") +
theme_bw() +
scale_color_brewer(palette="Spectral") +
labs(y ="Cholesterol", x = "Age") +
annotate(geom="text", x=45, y= 400,
label=paste("R-Squared: ",round(summary(model_low_ch)$r.squared,digits=2)), color="black")
pl4 <- ggplot(df[df$output ==1,], aes(age, cholest)) +
geom_point(size=2, shape=6, color="red") +
geom_smooth(method=lm) +
ggtitle("High Risk") +
theme_bw() +
scale_color_brewer(palette="Spectral") +
labs(y ="Cholesterol", x = "Age") +
annotate(geom="text", x=40, y=350,
label=paste("R-Squared: ",round(summary(model_high_ch)$r.squared,digits=2)), color="black")
grid.arrange(pl1, pl2, pl3, pl4, nrow=2,ncol=2, top = "Blood Pressure and Cholesterol VS Age \n by Heart Attack Risk")
The relationship we see here between blood pressure, cholesterol, age, and heart attack risk shows a consistent trend between all four groups. It seems that regardless of heart attack risk, age does not correlate with blood pressure or cholesterol. This could mean early testing for younger individuals could be useful for understanding their risk based on other factors.
by_gen <- df %>%
mutate(gen = paste(as.factor(round(df$age-5,-1)),"s",sep="")) %>%
group_by(output,gen) %>%
summarise(hrate = mean(max_heartrate),
n = n(),
.groups="keep") %>%
data.frame()
ggplot(by_gen, aes(x = gen, y = hrate, group=as.factor(output))) +
geom_line(aes(color = as.factor(output)), linewidth=3) +
geom_text(aes(label = as.factor(n)), fontface = "bold", size = 4, nudge_x=.2) +
geom_point(shape=22,size=5,color="black", fill="white") +
labs(color = "Heart Attack Risk", title = "Maximum Heart Rate Vs Age Group by Heart Attack Risk", x = "Age Group", y = "Max Heart Rate") +
scale_color_brewer(palette = "Paired", labels = c("Low Risk", "High Risk"))
Looking at max heart rate again, we see that while we do expect to see the maximum heart rate decrease as individuals age, the high risk group is higher overall. With the exception of the outlier, more data is needed to understand this trend.
cpain <- df %>%
group_by(chest_pain, output) %>%
summarise(n = n(),
ch = round(mean(cholest), digits =1),
.groups="keep") %>%
data.frame()
ggplot(cpain, aes(x = as.factor(chest_pain), y = n)) +
geom_bar(aes(fill=as.factor(output)), stat= "identity", position = position_dodge()) +
geom_text(aes(label=ch, group = output), position = position_dodge(width = .9)) +
labs(color = "Heart Attack Risk", fill = "Heart Attack Risk", x = "Chest Pain Type", y = "Count", title = "Chest Pain type by Heart Attack Risk \n With Mean Cholesterol") +
scale_x_discrete(labels = c("1" = "Typical angina", "2" = "Atypical angina", "3" = "Non-anginal pain", "0" = "Asymptomatic")) +
scale_fill_hue(labels = c("Low Risk", "High Risk"))
We can see that the low risk group has less chest pain reported compared to the higher risk group, but, we do not see any major differences in average cholesterol. It seems for all pain types and between risk groups, cholesterol is consistent and does not change significantly.
recg <- df %>%
group_by(restecg, output) %>%
summarise(n = n(),
.groups="keep") %>%
data.frame()
plot_ly(hole = 0.7) %>%
layout(title = "Resting ECG Result by Heart Attack Risk", legend=list(title=list(text='<b>Resting ECG Test</b>'))) %>%
add_trace(data = recg[recg$output == 1,],
labels = ~restecg,
values = ~recg[recg$output == 1, "n"],
type = "pie",
textposition = "inside",
hovertemplate = "Risk: High risk<br>ECG test: %{label}<br>Percent: %{percent}<br>Num of Participants: %{value}<extra></extra>") %>%
add_trace(data = recg[recg$output == 0,],
labels = ~restecg,
values = ~recg[recg$output == 0, "n"],
type = "pie",
textposition = "inside",
hovertemplate = "Risk: Low risk<br>ECG test: %{label}<br>Percent: %{percent}<br>Num of Participants: %{value}<extra></extra>",
domain = list(x = c(0.16,0.84) ,y = c(0.16,0.84)))
It seems the resting ECG tests show an opposite distribution between the risk groups. There is a clear distinction between wave abnormality and a normal test. This is a great determinant of heart attack risk.
These plots show many features that have different trends correlating with heart attack risk. It is interesting to see that certain features like max heart rate don’t correlate the way I hypothesized, and how it may take different combinations of variables to give a good picture of risk.