We aim to predict the presence of heart disease based on several risk factors, including age, cholesterol levels, resting blood pressure, and maximum heart rate. A logistic regression model will be used for prediction, and the relationships between these predictors and heart disease will be explored through a series of multivariate visualizations.

1. Prepare the Data

First, we read the CSV-formatted data file into a data frame and assign it to the variable “heart_data”. The file has no header row, so the columns initially receive the default names V1 to V14.

setwd("/Users/farhadabasahl/Documents/R/heart+disease")
heart_data <- read.csv("processed.cleveland.data", header = FALSE)
head(heart_data)
##   V1 V2 V3  V4  V5 V6 V7  V8 V9 V10 V11 V12 V13 V14
## 1 63  1  1 145 233  1  2 150  0 2.3   3 0.0 6.0   0
## 2 67  1  4 160 286  0  2 108  1 1.5   2 3.0 3.0   2
## 3 67  1  4 120 229  0  2 129  1 2.6   2 2.0 7.0   1
## 4 37  1  3 130 250  0  0 187  0 3.5   3 0.0 3.0   0
## 5 41  0  2 130 204  0  2 172  0 1.4   1 0.0 3.0   0
## 6 56  1  2 120 236  0  0 178  0 0.8   1 0.0 3.0   0

Before any analysis, the data must be cleaned, transformed, and labeled. We first assign meaningful names to the columns. The file uses the symbol “?” as a placeholder for missing data, so we locate the affected cells, count them, and replace every “?” with NA. Because a “?” forces its column to be read as text, we then convert all columns back to numeric. Finally, we remove the rows that contain NA values, inspect the cleaned data with summary(), and save it to a new CSV file for further analysis:

colnames(heart_data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", 
                          "restecg", "thalach", "exang", "oldpeak", "slope", 
                          "ca", "thal", "num")
which(heart_data == "?", arr.ind = TRUE)
##      row col
## [1,] 167  12
## [2,] 193  12
## [3,] 288  12
## [4,] 303  12
## [5,]  88  13
## [6,] 267  13
sum(heart_data == "?")
## [1] 6
heart_data[heart_data == "?"] <- NA
heart_data <- as.data.frame(lapply(heart_data, 
                                   function(x) as.numeric(as.character(x))))
heart_data_clean <- na.omit(heart_data)
summary(heart_data_clean)
##       age             sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :1.000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:120.0  
##  Median :56.00   Median :1.0000   Median :3.000   Median :130.0  
##  Mean   :54.54   Mean   :0.6768   Mean   :3.158   Mean   :131.7  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :4.000   Max.   :200.0  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.0  
##  Median :243.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :247.4   Mean   :0.1448   Mean   :0.9966   Mean   :149.6  
##  3rd Qu.:276.0   3rd Qu.:0.0000   3rd Qu.:2.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak          slope             ca        
##  Min.   :0.0000   Min.   :0.000   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.800   Median :2.000   Median :0.0000  
##  Mean   :0.3266   Mean   :1.056   Mean   :1.603   Mean   :0.6768  
##  3rd Qu.:1.0000   3rd Qu.:1.600   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.200   Max.   :3.000   Max.   :3.0000  
##       thal            num        
##  Min.   :3.000   Min.   :0.0000  
##  1st Qu.:3.000   1st Qu.:0.0000  
##  Median :3.000   Median :0.0000  
##  Mean   :4.731   Mean   :0.9461  
##  3rd Qu.:7.000   3rd Qu.:2.0000  
##  Max.   :7.000   Max.   :4.0000
write.csv(heart_data_clean, "heart_data_clean.csv", row.names = FALSE)
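
The cleaning steps above could also be condensed by letting read.csv() treat “?” as missing while reading. The following is a minimal sketch, not the code used for this report; the helper name read_heart_data is hypothetical, and it assumes the file layout shown above (14 comma-separated columns, no header).

```r
# Hypothetical helper (not part of the original analysis): read the Cleveland
# file and clean it in one pass.
read_heart_data <- function(path) {
  cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs",
            "restecg", "thalach", "exang", "oldpeak", "slope",
            "ca", "thal", "num")
  # na.strings = "?" turns the placeholder into NA while reading, so every
  # column is parsed as numeric and no later as.numeric() step is needed.
  df <- read.csv(path, header = FALSE, col.names = cols, na.strings = "?")
  # Drop rows with any missing value, as na.omit() does in the steps above.
  na.omit(df)
}
```

Under these assumptions, heart_data_clean <- read_heart_data("processed.cleveland.data") should give the same cleaned rows as the step-by-step version.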

Now let’s predict the presence of heart disease (hdisease) from several features. The target is derived from the variable num: a value of 0 indicates no heart disease, and any value above 0 indicates its presence.

2. Visualization 1: Heart Disease by Age (age)

To assess the presence of heart disease, we created a binary variable indicating heart disease status (1 for presence and 0 for absence). A linear model was then fitted with lm() to predict this variable from age, sex, resting blood pressure, cholesterol, and maximum heart rate. Because the outcome is binary, this is a linear probability model rather than a logistic regression, which in R would be fitted with glm() and family = binomial; its fitted values are read as probabilities but are not constrained to lie between 0 and 1. The model summary shows the significance of each predictor. Predicted probabilities of heart disease were then calculated and plotted against age.

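# Binary outcome: 1 if num > 0 (disease present), 0 otherwise
heart_data$hdisease <- ifelse(heart_data$num > 0, 1, 0)
# Linear model on the 0/1 outcome, fitted on heart_data (all rows)
model <- lm(hdisease ~ age + sex + trestbps + chol + thalach, data = heart_data)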
summary(model)
## 
## Call:
## lm(formula = hdisease ~ age + sex + trestbps + chol + thalach, 
##     data = heart_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95685 -0.35390 -0.06046  0.36846  0.96938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.6136691  0.3284495   1.868   0.0627 .  
## age          0.0023043  0.0031674   0.727   0.4675    
## sex          0.3139838  0.0540150   5.813 1.58e-08 ***
## trestbps     0.0035544  0.0014665   2.424   0.0160 *  
## chol         0.0011336  0.0004967   2.282   0.0232 *  
## thalach     -0.0082988  0.0011813  -7.025 1.46e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4276 on 297 degrees of freedom
## Multiple R-squared:  0.2783, Adjusted R-squared:  0.2661 
## F-statistic:  22.9 on 5 and 297 DF,  p-value: < 2.2e-16
heart_data$predicted_prob <- predict(model, type = "response")
heart_data$hdisease <- as.factor(heart_data$hdisease)


library(ggplot2)  # required for all plots below
ggplot(heart_data, aes(x = age, y = predicted_prob)) +
  geom_point(aes(color = hdisease), alpha = 0.9) +  
  geom_smooth(method = "lm", color = "firebrick4", se = FALSE) + 
  labs(title = "Predicted Probability of Heart Disease by Age",
       x = "Age",
       y = "Predicted Probability") +
  scale_color_manual(values = c("black", "grey70"), labels = c(
    "No Heart Disease", "Heart Disease")) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The plot shows the model’s predicted probability of heart disease against age. Each point is an individual, colored by observed heart disease status (black for no heart disease, grey for heart disease). The firebrick4 line is a linear trend fitted by geom_smooth(method = "lm"), not a logistic curve; it shows predicted probability increasing with age overall.
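
The model summary above comes from lm(), which fits a linear model to the 0/1 outcome. For comparison, a logistic regression on the same predictors could be sketched as follows. The helper names fit_logistic and predicted_probability are hypothetical, and the sketch assumes a data frame with the same column names as heart_data. It was not run for this report, so no coefficients are shown.

```r
# Hypothetical helper: fit a logistic regression on the same predictors as the
# lm() call above. `df` is assumed to contain a column named hdisease that is
# either 0/1 or a two-level factor with "0" as the first level.
fit_logistic <- function(df) {
  glm(hdisease ~ age + sex + trestbps + chol + thalach,
      data = df, family = binomial)
}

# type = "response" returns fitted probabilities, which for a binomial GLM
# lie strictly between 0 and 1.
predicted_probability <- function(model, newdata) {
  predict(model, newdata = newdata, type = "response")
}
```

For example, logit_model <- fit_logistic(heart_data) followed by predicted_probability(logit_model, heart_data) would give probabilities that could replace predicted_prob in the plots.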

3. Visualization 2: Heart Disease by Cholesterol Levels (chol)

ggplot(heart_data, aes(x = factor(hdisease), y = chol , fill = factor(hdisease))) +
  geom_boxplot() +
  labs(title = "Cholesterol Levels by Heart Disease Status",
       x = "Heart Disease (0 = No, 1 = Yes)",
       y = "Cholesterol Level") +
  scale_fill_manual(values = c("grey30", "grey75"), labels = c(
    "No Heart Disease", "Heart Disease")) +
  theme_minimal()

The boxplot shows the distribution of cholesterol levels for individuals with and without heart disease, highlighting differences in cholesterol levels between the two groups, with a darker fill representing those without heart disease and a lighter fill for those with heart disease.
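
The comparison in the boxplot can be backed by a small numeric summary. The sketch below is illustrative only: the helper name chol_by_status is hypothetical, and it assumes a data frame with chol and hdisease columns, as in heart_data.

```r
# Hypothetical helper: group size, median and interquartile range of
# cholesterol for each heart disease status.
chol_by_status <- function(df) {
  groups <- split(df$chol, df$hdisease)
  data.frame(
    hdisease = names(groups),
    n        = vapply(groups, length, integer(1)),
    median   = vapply(groups, function(x) as.numeric(median(x)), numeric(1)),
    iqr      = vapply(groups, function(x) as.numeric(IQR(x)), numeric(1)),
    row.names = NULL
  )
}
```

Calling chol_by_status(heart_data) returns one row per group, giving the medians and spreads that the boxes display.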

4. Visualization 3: Histogram and Faceting

ggplot(heart_data, aes(x = chol, fill = factor(hdisease))) +
  geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
  facet_wrap(~sex) +
  labs(title = "Cholesterol Levels Distribution by Heart Disease Status and Sex",
       x = "Cholesterol Level",
       y = "Count",
       fill = "Heart Disease Status") +
  theme_minimal() +
  scale_fill_manual(values = c("skyblue", "tomato3"), labels = c("No Heart Disease", "Heart Disease"))

The faceted histogram shows the distribution of cholesterol levels for individuals with and without heart disease, with one panel per sex (0 = female, 1 = male). This lets us examine how cholesterol levels vary by sex as well as by heart disease status. The plot suggests that, in both panels, heart disease is more common at higher cholesterol levels, though the distributions differ slightly between the sexes.

5. Visualization 4: Predicted Probability vs Maximum Heart Rate (thalach)

ggplot(heart_data, aes(x = thalach, y = predicted_prob)) +
  geom_point(aes(color = hdisease), alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Predicted Probability of Heart Disease by Maximum Heart Rate",
       x = "Maximum Heart Rate",
       y = "Predicted Probability") +
  scale_color_manual(values = c("olivedrab2", "black"), labels = c("No Heart Disease", 
                                                              "Heart Disease")) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

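The plot shows the predicted probability of heart disease against maximum heart rate. Points are color-coded by heart disease status (olivedrab2, a yellow-green, for no heart disease and black for heart disease), and a LOESS trend line shows how the predicted probability changes as maximum heart rate increases. The negative thalach coefficient in the model summary implies that the predicted probability falls as maximum heart rate rises, holding the other predictors fixed.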

6. Visualization 5: Bar Plot of Sex Distribution

ggplot(heart_data, aes(x = factor(sex), fill = factor(hdisease))) +
  geom_bar(position = "dodge") +
  labs(title = "Heart Disease by Gender",
       x = "Sex (0 = Female, 1 = Male)",
       y = "Count",
       fill = "Heart Disease") +
  scale_fill_manual(values = c("bisque1", "lightgoldenrod4"), labels = c(
    "No Heart Disease - bisque1", "Heart Disease - lightgoldenrod4")) +
  theme_minimal()

The bar chart compares the number of individuals with and without heart disease for females and males. In this sample, heart disease is more prevalent among males than among females, which is consistent with the positive sex coefficient in the model and points to sex-based differences in heart disease risk.
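
Because the bar chart shows counts and the two sexes are not equally represented, prevalence is easier to compare as a within-sex proportion. A minimal sketch, assuming sex and hdisease columns as in heart_data; the helper name prevalence_by_sex is hypothetical.

```r
# Hypothetical helper: share of individuals with and without heart disease
# within each sex.
prevalence_by_sex <- function(df) {
  counts <- table(sex = df$sex, hdisease = df$hdisease)
  # margin = 1 normalises within each row (each sex), so every row sums to 1.
  prop.table(counts, margin = 1)
}
```

In the result of prevalence_by_sex(heart_data), the column labelled 1 holds the proportion with heart disease for each sex.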

Discussion of Uncertainty

When predicting heart disease with a regression model, uncertainty arises from several sources. One key source is variability in the data: individual differences and unmeasured factors produce variation in the outcome that the predictors cannot explain. In the fitted model, age is not statistically significant once the other predictors are included (p = 0.47), and the adjusted R-squared of 0.27 indicates that most of the variation remains unexplained. The model also assumes a linear relationship between each predictor (such as age, cholesterol, and heart rate) and the outcome (in a logistic regression, the log-odds of the outcome), but real data may not follow this pattern.

Additionally, the estimated coefficients are themselves uncertain. The model summary reports a standard error for each coefficient, from which confidence intervals can be computed; these indicate the range within which the true effect is likely to fall. An interval that is wide relative to the estimate, as for age (estimate 0.0023, standard error 0.0032) and to a lesser extent resting blood pressure (estimate 0.0036, standard error 0.0015), means the model is less certain about that predictor’s exact effect on heart disease risk.
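
Confidence intervals for the coefficients are not printed by summary(), but they can be obtained with confint(). The sketch below pairs each estimate with its interval; the helper name coef_intervals is hypothetical, and it is assumed to be applied to the lm() fit stored in model.

```r
# Hypothetical helper: coefficient estimates with confidence intervals.
coef_intervals <- function(model, level = 0.95) {
  ci <- confint(model, level = level)  # matrix with one row per coefficient
  data.frame(term = rownames(ci),
             estimate = unname(coef(model)),
             lower = ci[, 1],
             upper = ci[, 2],
             row.names = NULL)
}
```

Calling coef_intervals(model) lists the 95% interval for each predictor; an interval that spans zero corresponds to a coefficient that is not significant at the 5% level.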

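Lastly, missing data contributes to uncertainty. Six values are missing, all in the ca and thal columns. Removing the affected rows, as was done to create heart_data_clean, could introduce bias if the values are not missing at random. Note that the model and plots above use heart_data rather than heart_data_clean; because ca and thal are not among the predictors, all 303 rows entered the model.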
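
A quick way to see where values are missing, and how many rows na.omit() would drop, is sketched below. The helper name missing_summary is hypothetical; it works on any data frame in which missing values are already coded as NA.

```r
# Hypothetical helper: missing-value counts per column and the number of rows
# with at least one missing value.
missing_summary <- function(df) {
  list(
    per_column      = colSums(is.na(df)),
    incomplete_rows = sum(!complete.cases(df))
  )
}
```

Applied to heart_data after the “?” placeholders have been replaced with NA, this shows which columns drive the row removal.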

Overall, while the model provides useful predictions, these sources of uncertainty should be considered when interpreting the results.

Recap of Visualizations:

The analysis of heart disease risk highlights several important factors. In the fitted model, sex and maximum heart rate are the strongest predictors (p < 0.001), resting blood pressure and cholesterol are significant at the 5% level, and age is not significant once the other predictors are accounted for. The visualizations nonetheless show the predicted probability of heart disease rising with age. Cholesterol levels also differ between individuals with and without heart disease, especially when segmented by sex. These insights suggest that both demographic (e.g., sex) and physiological (e.g., cholesterol, heart rate) factors are key to understanding heart disease risk.