Week 10: Data-Dive Notebook

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

HAA <- read.csv("/Users/rupeshswarnakar/Downloads/Cardiovascular_Disease_Dataset.csv")

Binary Column:

Let’s consider the ‘target’ column to be the column of interest for this data dive. The target column records the presence or absence of heart disease in each patient: 0 means the patient does not have heart disease and 1 means the patient has heart disease. Treating this column as the response lets us assess whether the other attributes in the dataset are key contributing factors or less influential in predicting the risk of heart disease.

Logistic Regression Model:

Let’s build a logistic regression model using target as a response variable, and resting BP and fasting blood sugar as explanatory variables.

model <- glm(target ~ restingBP + fastingbloodsugar, 
             data = HAA, 
             family = binomial(link = 'logit'))

summary(model)

## 
## Call:
## glm(formula = target ~ restingBP + fastingbloodsugar, family = binomial(link = "logit"), 
##     data = HAA)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -5.722286   0.440710 -12.984   <2e-16 ***
## restingBP          0.038216   0.002936  13.017   <2e-16 ***
## fastingbloodsugar  1.388759   0.184384   7.532    5e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1360.6  on 999  degrees of freedom
## Residual deviance: 1042.6  on 997  degrees of freedom
## AIC: 1048.6
## 
## Number of Fisher Scoring iterations: 4

# Confidence interval for restingBP
confint(model, 'restingBP', level = 0.95)

## Waiting for profiling to be done...

##      2.5 %     97.5 % 
## 0.03258984 0.04411006

exp(model$coefficients)

##       (Intercept)         restingBP fastingbloodsugar 
##       0.003272221       1.038956107       4.009868863

Interpretation:

The model output above is interpreted below.

a. Intercept:

The value of -5.722286 is the log-odds of heart disease when both resting BP and the fasting blood sugar indicator are zero. A fasting blood sugar value of 0 is meaningful here (it marks the lower blood sugar group), but a resting blood pressure of zero never occurs in a living person, so the intercept is an extrapolation far outside the observed data. It is best read as the anchor point of the fitted line on the log-odds scale rather than as a realistic baseline risk.
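To make the log-odds scale more concrete, the intercept can be converted to a probability with the inverse logit. The sketch below copies the coefficients from the summary output above by hand. The patient with a resting BP of 120 is a hypothetical example chosen for illustration, not a row from the dataset.

```r
# Coefficients copied from the summary(model) output above
b0    <- -5.722286   # intercept (log-odds)
b_bp  <-  0.038216   # restingBP
b_fbs <-  1.388759   # fastingbloodsugar

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
p_at_zero <- plogis(b0)
p_at_zero            # about 0.0033, i.e. the extrapolated probability at restingBP = 0

# Hypothetical patient (assumption): restingBP = 120, fastingbloodsugar = 0
p_example <- plogis(b0 + b_bp * 120 + b_fbs * 0)
p_example
```

Evaluating the model at a plausible blood pressure gives a far more interpretable number than the intercept alone.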

b. restingBP:

The value of 0.038216 means that for each one-unit (1 mm Hg) increase in resting BP, the log-odds of heart disease increase by about 0.038, holding fasting blood sugar constant. Equivalently, e^{0.038216}, which is approximately 1.039, is the odds ratio: each additional unit of resting BP multiplies the odds of heart disease by about 1.039, an increase of roughly 3.9%. The association is statistically significant, with a p-value below 2e-16.
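A one-unit change in blood pressure is small, so it can help to rescale the odds ratio to a larger step. The sketch below uses the reported coefficient. The 10-unit step is an assumption made only for illustration.

```r
b_bp <- 0.038216              # restingBP coefficient from the summary above

or_per_1  <- exp(b_bp)        # odds ratio for a 1-unit increase
or_per_10 <- exp(10 * b_bp)   # odds ratio for a 10-unit increase (illustrative step)

round(c(per_1 = or_per_1, per_10 = or_per_10), 3)
```

Because the model is linear on the log-odds scale, the odds ratio for a k-unit increase is the one-unit odds ratio raised to the power k, not k times the one-unit effect.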

c. fastingbloodsugar:

The value of 1.388759 is the difference in the log-odds of heart disease between patients whose fasting blood sugar is coded 1 and those coded 0, holding resting BP constant. Because this variable is binary, a “one-unit increase” simply means moving from 0 to 1. Equivalently, e^{1.388759}, which is approximately 4.010, is the odds ratio: the odds of heart disease are about four times higher for patients coded 1. This association is also statistically significant, with a p-value of 5e-14.
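An odds ratio of about 4 does not mean the probability of heart disease is four times higher. The sketch below compares the two fasting blood sugar groups at a single hypothetical resting BP of 140 (an assumed value, chosen only for illustration), using the coefficients reported above.

```r
b0 <- -5.722286; b_bp <- 0.038216; b_fbs <- 1.388759   # from the summary above

bp <- 140                                # hypothetical resting BP (assumption)
p0 <- plogis(b0 + b_bp * bp)             # fastingbloodsugar = 0
p1 <- plogis(b0 + b_bp * bp + b_fbs)     # fastingbloodsugar = 1

odds <- function(p) p / (1 - p)          # convert a probability to odds

odds_ratio <- odds(p1) / odds(p0)        # equals exp(b_fbs) at any resting BP
prob_ratio <- p1 / p0                    # depends on the chosen resting BP

round(c(p0 = p0, p1 = p1, odds_ratio = odds_ratio, prob_ratio = prob_ratio), 3)
```

The odds ratio is constant across blood pressure values, while the ratio of probabilities changes with the baseline risk and is always smaller than the odds ratio when the odds ratio exceeds 1.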

d. Deviance:

The null deviance of 1360.6 on 999 degrees of freedom measures the fit of a model with only an intercept and no explanatory variables.
The residual deviance of 1042.6 on 997 degrees of freedom measures the fit after adding the two predictors. The deviance drops by 318.0 for 2 degrees of freedom, a substantial reduction indicating that the model fits better once resting BP and fasting blood sugar are included.
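The drop in deviance can be checked formally with a likelihood ratio test, which compares the difference in deviance to a chi-squared distribution. The sketch below uses only the numbers printed in the summary. The pseudo R-squared line assumes ungrouped 0/1 data, where deviance equals -2 times the log-likelihood, so the ratio of deviances gives McFadden's measure.

```r
# Values copied from the summary(model) output above
null_dev  <- 1360.6; null_df  <- 999
resid_dev <- 1042.6; resid_df <- 997

lr_stat <- null_dev - resid_dev          # likelihood ratio statistic
lr_df   <- null_df - resid_df            # number of added predictors

# Upper-tail chi-squared probability: evidence against the intercept-only model
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared (valid here under the ungrouped-data assumption)
mcfadden_r2 <- 1 - resid_dev / null_dev

c(lr_stat = lr_stat, df = lr_df, p_value = p_value, mcfadden_r2 = mcfadden_r2)
```

With the fitted objects available, `anova(model, test = "Chisq")` reports the same kind of comparison term by term.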

e. Confidence Interval:

The 95% confidence interval for the resting BP coefficient runs from 0.03258984 to 0.04411006.
Since the interval lies entirely above zero, it indicates a positive relationship between resting BP and the heart disease outcome (target).
The corresponding odds ratios, e^{0.0326} (approximately 1.033) to e^{0.0441} (approximately 1.045), suggest that each unit increase in resting BP increases the odds of heart disease by approximately 3.3% to 4.5%.
In other words, at the 95% confidence level the data are consistent with a true resting BP coefficient anywhere between 0.03258984 and 0.04411006.
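The interval printed by `confint()` is a profile-likelihood interval (hence the "Waiting for profiling to be done..." message). As a rough cross-check, a Wald interval can be built by hand from the estimate and standard error in the summary table. The sketch below assumes the normal approximation is adequate and is not the method `confint()` used.

```r
est <- 0.038216    # restingBP estimate from the summary above
se  <- 0.002936    # its standard error

# Wald interval: estimate +/- z * SE on the log-odds scale
wald_ci    <- est + c(-1, 1) * qnorm(0.975) * se
profile_ci <- c(0.03258984, 0.04411006)   # copied from the confint() output

rbind(wald = wald_ci, profile = profile_ci)

# Exponentiate the endpoints to express the interval as odds ratios
exp(profile_ci)
```

With 1000 observations the two intervals nearly coincide. In smaller samples they can differ more, and the profile-likelihood interval is generally preferred.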

Visualization:

Target vs. Resting BP:

Let’s visualize the heart disease outcome using resting blood pressure as the explanatory variable.

ggplot(HAA, aes(x = restingBP, y = target)) +
  geom_jitter(width = 0, height = 0.05, shape = 21, color = 'blue', fill = 'lightblue', size = 3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = 'red', se = FALSE) +
  labs(title = "Scatter Plot of Target vs. RestingBP",
       x = "Resting Blood Pressure",
       y = "Target (Binary Response)") +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The visualization above helps us understand the logistic relationship between the heart disease outcome (target) and resting blood pressure. The plot shows a positive relationship between them. Patients without heart disease (target = 0) mostly have a resting blood pressure of around 90 to 140 mm Hg, whereas for patients with heart disease (target = 1) the bulk of the readings lie higher, at around 120 to 200 mm Hg.

The fitted logistic curve shows how the estimated probability of heart disease rises as resting blood pressure increases, which makes blood pressure a useful indicator of heart disease risk in this dataset.
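One way to summarize a logistic curve numerically is to find the blood pressure at which the predicted probability crosses 0.5. The sketch below uses the coefficients of the two-predictor model reported earlier. The curve drawn by `geom_smooth()` comes from a model with resting BP only, so its crossing point will generally differ. The helper name `bp_at_half` is hypothetical.

```r
b0 <- -5.722286; b_bp <- 0.038216; b_fbs <- 1.388759   # two-predictor model

# The probability is 0.5 where the linear predictor is 0:
#   b0 + b_bp * bp + b_fbs * fbs = 0  =>  bp = -(b0 + b_fbs * fbs) / b_bp
bp_at_half <- function(fbs) -(b0 + b_fbs * fbs) / b_bp

bp_at_half(0)   # about 149.7 for fastingbloodsugar = 0
bp_at_half(1)   # about 113.4 for fastingbloodsugar = 1

# Predicted probabilities over an illustrative grid of resting BP values
bp_grid <- seq(90, 200, by = 10)
round(data.frame(restingBP = bp_grid,
                 p_fbs0 = plogis(b0 + b_bp * bp_grid),
                 p_fbs1 = plogis(b0 + b_bp * bp_grid + b_fbs)), 3)
```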

Target vs. Fasting Blood Sugar:

Let’s visualize the relationship between the heart disease outcome and fasting blood sugar. Both are binary variables, so the plot shows the predicted probability of heart disease for each fasting blood sugar group.

# Fit a separate single-predictor model; a new name avoids overwriting the two-predictor `model` above
model_fbs <- glm(target ~ fastingbloodsugar, data = HAA, family = binomial(link = "logit"))

# Creating a data frame for predictions for fastingbloodsugar
pred_data <- data.frame(fastingbloodsugar = c(0, 1))

# Obtaining predicted probabilities
pred_data$predicted_prob <- predict(model_fbs, newdata = pred_data, type = "response")


ggplot(pred_data, aes(x = as.factor(fastingbloodsugar), y = predicted_prob, fill = as.factor(fastingbloodsugar))) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
  labs(title = "Predicted Probabilities of Target by Fasting Blood Sugar",
       x = "Fasting Blood Sugar (Binary)",
       y = "Predicted Probability of Target",
       fill = "Fasting Blood Sugar") +
  scale_fill_manual(values = c("lightblue", "salmon")) +
  theme_minimal() +
  geom_text(aes(label = round(predicted_prob, 2)), vjust = -0.5)

From the above plot, we can clearly see that patients with fasting blood sugar higher than 120 mg/dL (coded 1) have a higher predicted probability of heart disease than patients with fasting blood sugar lower than 120 mg/dL (coded 0).

There is a difference of about 33 percentage points in the predicted probability of heart disease between patients with lower and higher fasting blood sugar. This large difference suggests that fasting blood sugar is a strong indicator of heart disease in this dataset.

The approximately 50% predicted probability of heart disease among patients with fasting blood sugar below 120 mg/dL shows that blood sugar alone does not account for the risk. Other factors that are not in this model, such as diet, stress, exercise, and sleep, may also affect heart health.
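When a logistic regression has a single binary predictor, its predicted probabilities equal the observed proportion of positive outcomes in each group, so the bar heights can be verified with a simple group mean. The sketch below demonstrates this on simulated stand-in data. The data frame `sim`, its column names, and the probabilities used to generate it are assumptions, not values from the cardiovascular dataset.

```r
set.seed(1)

# Simulated stand-in data (NOT the cardiovascular dataset)
sim <- data.frame(fbs = rbinom(500, 1, 0.3))
sim$target <- rbinom(500, 1, ifelse(sim$fbs == 1, 0.8, 0.5))

# Logistic regression with one binary predictor
fit <- glm(target ~ fbs, data = sim, family = binomial(link = "logit"))

# Model-based predicted probabilities for each group
pred <- predict(fit, newdata = data.frame(fbs = c(0, 1)), type = "response")

# Observed proportion of target == 1 in each group
obs <- tapply(sim$target, sim$fbs, mean)

rbind(predicted = unname(pred), observed = as.numeric(obs))
```

On the real data, the analogous check would compare the plotted probabilities with `tapply(HAA$target, HAA$fastingbloodsugar, mean)`.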

Summary:

In summary, the logistic regression model provides useful insight into the relationship between target, resting BP, and fasting blood sugar. Both predictors are positively and significantly associated with heart disease: the odds rise by about 3.9% per unit of resting BP and are about four times higher for patients with fasting blood sugar above 120 mg/dL. By interpreting the coefficients, predicted probabilities, and statistical significance, we can draw conclusions that contribute to understanding the data and may have implications for clinical practice and patient care. Sharing this information with the community can encourage lifestyle changes that manage blood sugar and blood pressure and so help lower the risk of heart disease.