R Assignment

Na’ol Kebede

February 26, 2023


Table Of Contents


1. Introduction

For this assignment I will be looking at the Heart Attack Analysis & Prediction Dataset. This dataset collected a number of important medical features in order to predict the risk of a heart attack. These features include:

Description Variable name
Age age
Sex sex
If the person has experienced excercise induced angina exercise_angina
The number of major vessels major_vessels
The chest pain an individual has and the type chest_pain
Resting blood pressure blood_pressure
Cholesteral cholest
Fasted blood sugar test, either 1 or 0 bsugar_test
Resting ECG results restecg
Maximum heart rate max_heartrate
The slope of the peak exercise ST segment (2 = upsloping; 1 = flat; 0 = downsloping) slp
ST depression induced by exercise relative to rest oldpeak
Thalaseemia rate - 2 = normal; 1 = fixed defect; 3 = reversable defect thal_rate
Their heart attack risk classification, either 1 or 0 output

With this data, I will explore the distribution of individuals and try to find the important features for extracting out for the likelihood of a heart attack.

2. Data processing

Lets have a look at our data to make sure that it is proper in the sense that it does not have null values, each variable is in the correct domain, and it is ready for our visualizations.

## loading necessary libraries
library(ggplot2)
library(gridExtra)
library(RColorBrewer)
library(ggthemes)
library(lubridate)
library(data.table)
library(dplyr)
library(plotly)
## setting up enviornment
rm(list=ls())

setwd("C:/Users/Mrwin/Documents/DS736")

Here, we replace the existing column names so that there is more clarity when referring to them in our visualizations.

## reading in data and setting appropriate and understandable column names
df <- fread("heart.csv")


colnames(df) <- c("age", "sex", "chest_pain", "blood_pressure", "cholest", "bsugar_test", "restecg",
                  "max_heartrate", "exercise_angina", "oldpeak", "slp", "major_vessels", "thal_rate", "output")
head(df)
##    age sex chest_pain blood_pressure cholest bsugar_test restecg max_heartrate
## 1:  63   1          3            145     233           1       0           150
## 2:  37   1          2            130     250           0       1           187
## 3:  41   0          1            130     204           0       0           172
## 4:  56   1          1            120     236           0       1           178
## 5:  57   0          0            120     354           0       1           163
## 6:  57   1          0            140     192           0       1           148
##    exercise_angina oldpeak slp major_vessels thal_rate output
## 1:               0     2.3   0             0         1      1
## 2:               0     3.5   0             0         2      1
## 3:               0     1.4   2             0         2      1
## 4:               0     0.8   2             0         2      1
## 5:               1     0.6   2             0         2      1
## 6:               0     0.4   1             0         1      1
any(is.na(df))
## [1] FALSE
str(df)
## Classes 'data.table' and 'data.frame':   303 obs. of  14 variables:
##  $ age            : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex            : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ chest_pain     : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ blood_pressure : int  145 130 130 120 120 140 140 120 172 150 ...
##  $ cholest        : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ bsugar_test    : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg        : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ max_heartrate  : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exercise_angina: int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak        : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slp            : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ major_vessels  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal_rate      : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ output         : int  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(df)
##       age             sex           chest_pain    blood_pressure 
##  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
##  1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:120.0  
##  Median :55.00   Median :1.0000   Median :1.000   Median :130.0  
##  Mean   :54.37   Mean   :0.6832   Mean   :0.967   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
##     cholest       bsugar_test        restecg       max_heartrate  
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :240.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.3   Mean   :0.1485   Mean   :0.5281   Mean   :149.6  
##  3rd Qu.:274.5   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##  exercise_angina     oldpeak          slp        major_vessels   
##  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.80   Median :1.000   Median :0.0000  
##  Mean   :0.3267   Mean   :1.04   Mean   :1.399   Mean   :0.7294  
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.20   Max.   :2.000   Max.   :4.0000  
##    thal_rate         output      
##  Min.   :0.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000  
##  Median :2.000   Median :1.0000  
##  Mean   :2.314   Mean   :0.5446  
##  3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000

We have 303 males and females ranging from 29 to 77 years old. Looking though medical information and the description of the dataset provided, the values are all well behaved.

3. Visualization

3.1 Heart Rate Boxplot

ggplot(df, aes(x = as.factor(output), y = max_heartrate,fill = as.factor(output))) +
  geom_boxplot(outlier.color = "red", outlier.shape = 5, outlier.size = 3) +
  labs(fill = "Heart Attack Chance", title = "Max Heart rate By Heart Attack Risk Boxplot", x = "Heart Attack Chance", y = "Max Heart rate") +
  scale_fill_brewer(palette = "Oranges", labels = c("Low Risk", "High Risk"))

As maximum heart rate ties into many indicators for a healthy individual, it is very interesting to see that those in the high risk group have a higher distribution than the low risk group. As people who have more cardiovascular fitness have higher max heart rates, this would be an area of study to find any conflicting or underlying connections with heart attack chance.

3.2 Blood Pressure & Cholesterol Scatter Plot

model_low_bp <- lm(df[df$output==0,]$age ~ df[df$output ==0,]$blood_pressure)
model_high_bp <- lm(df[df$output ==1,]$age ~ df[df$output ==1,]$blood_pressure)
model_low_ch <- lm(df[df$output ==0,]$age ~ df[df$output ==0,]$cholest)
model_high_ch <- lm(df[df$output ==1,]$age ~ df[df$output ==1,]$cholest)

pl1 <- ggplot(df[df$output ==0,], aes(age, blood_pressure)) +
  geom_point(size=2, shape=4, color="blue") +
  geom_smooth(method=lm) +
  ggtitle("Low Risk") +
  theme_bw() +
  scale_color_brewer(palette="Dark2") +
  labs(y ="Blood Pressure", x = "Age") +
  annotate(geom="text", x=42, y=180,
           label=paste("R-Squared: ",round(summary(model_low_bp)$r.squared,digits=2)), color="black")

pl2 <- ggplot(df[df$output ==1,], aes(age, blood_pressure)) +
  geom_point(size=2, shape=4, color="red") +
  geom_smooth(method=lm) +
  ggtitle("High Risk") +
  theme_bw() +
  scale_color_brewer(palette="Dark2") +
  labs(y ="Blood Pressure", x = "Age") +
  annotate(geom="text", x=40, y=180,
           label=paste("R-Squared: ",round(summary(model_high_bp)$r.squared,digits=2)), color="black")

pl3 <- ggplot(df[df$output ==0,], aes(age, cholest)) +
  geom_point(size=2, shape=6, color="blue") +
  geom_smooth(method=lm) +
  ggtitle("Low Risk") +
  theme_bw() +
  scale_color_brewer(palette="Spectral") +
  labs(y ="Cholesterol", x = "Age") +
  annotate(geom="text", x=45, y= 400,
           label=paste("R-Squared: ",round(summary(model_low_ch)$r.squared,digits=2)), color="black")

pl4 <- ggplot(df[df$output ==1,], aes(age, cholest)) +
  geom_point(size=2, shape=6, color="red") +
  geom_smooth(method=lm) +
  ggtitle("High Risk") +
  theme_bw() +
  scale_color_brewer(palette="Spectral") +
  labs(y ="Cholesterol", x = "Age") +
  annotate(geom="text", x=40, y=350,
           label=paste("R-Squared: ",round(summary(model_high_ch)$r.squared,digits=2)), color="black")

grid.arrange(pl1, pl2, pl3, pl4, nrow=2,ncol=2, top = "Blood Pressure and Cholesterol VS Age \n by Heart Attack Risk")

The relationship we see here between blood pressure, cholesterol, age, and heart attack risk shows a consistent trend between all four groups. It seems that regardless of heart attack risk, age does not correlate with blood pressure or cholesterol. This could mean early testing for younger individuals could be useful for understanding their risk based on other factors.

3.3 Age Grouped Line Plot

by_gen <- df %>%
  mutate(gen = paste(as.factor(round(df$age-5,-1)),"s",sep="")) %>%
  group_by(output,gen) %>%
  summarise(hrate = mean(max_heartrate),
            n = n(),
            .groups="keep") %>%
  data.frame()

ggplot(by_gen, aes(x = gen, y = hrate, group=as.factor(output))) + 
  geom_line(aes(color = as.factor(output)), linewidth=3) +
  geom_text(aes(label = as.factor(n)), fontface = "bold", size = 4, nudge_x=.2) +
  geom_point(shape=22,size=5,color="black", fill="white") +
  labs(color = "Heart Attack Risk", title = "Maximum Heart Rate Vs Age Group by Heart Attack Risk", x = "Age Group", y = "Max Heart Rate") +
  scale_color_brewer(palette = "Paired", labels = c("Low Risk", "High Risk"))

Looking at max heart rate again, we see that while we do expect to see the maximum heart rate decrease as individuals age, the high risk group is higher overall. With the exception of the outlier, more data is needed to understand this trend.

3.4 Chest Pain Bar Graph

cpain <- df %>%
  group_by(chest_pain, output) %>%
  summarise(n = n(),
            ch = round(mean(cholest), digits =1),
            .groups="keep") %>%
  data.frame()

ggplot(cpain, aes(x = as.factor(chest_pain), y = n)) +
  geom_bar(aes(fill=as.factor(output)), stat= "identity", position = position_dodge()) +
  geom_text(aes(label=ch, group = output), position = position_dodge(width = .9)) +
  labs(color = "Heart Attack Risk", fill = "Heart Attack Risk", x = "Chest Pain Type", y = "Count", title = "Chest Pain type by Heart Attack Risk \n With Mean Cholesterol") +
  scale_x_discrete(labels = c("1" = "Typical angina", "2" = "Atypical angina", "3" = "Non-anginal pain", "0" = "Asymptomatic")) +
  scale_fill_hue(labels = c("Low Risk", "High Risk"))

We can see that the low risk group has less chest pain reported compared to the higher risk group, but, we do not see any major differences in average cholesterol. It seems for all pain types and between risk groups, cholesterol is consistent and does not change significantly.

3.5 ECG Pie Graph

recg <- df %>%
  group_by(restecg, output) %>%
  summarise(n = n(),
            .groups="keep") %>%
  data.frame()

plot_ly(hole = 0.7) %>%
  layout(title = "Resting ECG Result by Heart Attack Risk", legend=list(title=list(text='<b>Resting ECG Test</b>'))) %>%
  add_trace(data = recg[recg$output == 1,],
            labels = ~restecg,
            values = ~recg[recg$output == 1, "n"],
            type = "pie",
            textposition = "inside",
            hovertemplate = "Risk: High risk<br>ECG test: %{label}<br>Percent: %{percent}<br>Num of Participants: %{value}<extra></extra>") %>%
  add_trace(data = recg[recg$output == 0,],
            labels = ~restecg,
            values = ~recg[recg$output == 0, "n"],
            type = "pie",
            textposition = "inside",
            hovertemplate = "Risk: Low risk<br>ECG test: %{label}<br>Percent: %{percent}<br>Num of Participants: %{value}<extra></extra>",
            domain = list(x = c(0.16,0.84) ,y = c(0.16,0.84))) 

It seems the resting ECG tests show an opposite distribution between the risk groups. There is a clear distinction between wave abnormality and a normal test. This is a great determinant of heart attack risk.

4. Conclusion

These plots show many features that have different trends correlating with heart attack risk. It is interesting to see that certain features like max heart rate don’t correlate the way I hypothesized, and how it may take different combinations of variables to give a good picture of risk.