Week 11: Data Dive Notebook

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(ggthemes)
library(ggrepel)
library(lindia)

# remove scientific notation
options(scipen = 6)

# default theme, unless otherwise noted
theme_set(theme_minimal())

HAA <- read.csv("/Users/rupeshswarnakar/Downloads/STAT/Cardiovascular_Disease_Dataset.csv")

Linear Model:

Let’s fit a linear model with resting BP as the response variable and serum cholesterol, fasting blood sugar, and chest pain as explanatory variables. Here, fasting blood sugar is a binary variable: 0 indicates blood sugar below 120 mg/dL and 1 indicates blood sugar above 120 mg/dL. Chest pain is coded in four categories: 0 for typical angina, 1 for atypical angina, 2 for non-anginal pain, and 3 for asymptomatic.

Looking into the dataset, we can see some implausible values in the serum cholesterol column. A value of 0 is impractical, as no human being can survive with zero cholesterol, so these entries are most likely missing or mis-recorded measurements. Hence, rows with a serum cholesterol of 0 are filtered out so that they do not distort the linear regression model.

HAA_filtered <- subset(HAA, serumcholestrol != 0)
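Before filtering, it can help to count how many rows will be dropped. The following is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical, and only the column names follow the dataset used here.

```r
# Toy stand-in for HAA; the values are made up for illustration only.
toy <- data.frame(
  restingBP       = c(120, 135, 150, 128, 160),
  serumcholestrol = c(210, 0, 305, 0, 260)
)

# Count the rows whose recorded serum cholesterol is 0.
n_zero <- sum(toy$serumcholestrol == 0)

# Same filter as in this notebook, applied to the toy data.
toy_filtered <- subset(toy, serumcholestrol != 0)

n_zero              # rows that will be dropped
nrow(toy_filtered)  # rows that remain
```

Running the same two checks on `HAA` would show how much of the data the filter removes.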

model <- lm(restingBP ~ serumcholestrol + fastingbloodsugar + chestpain, data = HAA_filtered)
summary(model)

## 
## Call:
## lm(formula = restingBP ~ serumcholestrol + fastingbloodsugar + 
##     chestpain, data = HAA_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.467 -19.620  -3.245  24.120  62.417 
## 
## Coefficients:
##                     Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)       130.146064   2.871610  45.322     < 2e-16 ***
## serumcholestrol     0.039682   0.008602   4.613 0.000004510 ***
## fastingbloodsugar   8.967851   2.101268   4.268 0.000021732 ***
## chestpain           5.216343   1.017265   5.128 0.000000356 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.54 on 943 degrees of freedom
## Multiple R-squared:  0.1018, Adjusted R-squared:  0.09897 
## F-statistic: 35.64 on 3 and 943 DF,  p-value: < 2.2e-16

Interpretation:

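At a glance, all three predictors of resting BP are statistically significant. However, the multiple R-squared of 0.1018 shows that the model explains only about 10% of the variation in resting BP, so significance here should not be read as strong predictive power. Let’s interpret the results from the above model.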

a. Residuals:

The residuals range from a minimum of -72.467 to a maximum of 62.417, with a median of -3.245. The median is close to zero relative to this range, and the first and third quartiles (-19.620 and 24.120) are of similar size, which indicates that the residuals are spread fairly evenly around zero.

b. Intercept:

The value 130.14 is the predicted resting BP when every predictor equals zero, that is, a serum cholesterol of 0, fasting blood sugar category 0, and chest pain category 0. Practically, this value is not useful on its own because a serum cholesterol of zero is impossible for a human body. Hence, for this situation, the intercept can be ignored.

c. Chest Pain:

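From the above result, 5.216 means that for every 1 unit increase in the chest pain code, the predicted resting BP increases by 5.216 units, holding the other predictors constant. This positive relationship between chest pain and resting BP is statistically supported by a very small p-value of 0.000000356. Note that the model treats the chest pain codes 0 to 3 as a numeric scale, so it assumes the same change in resting BP between each pair of adjacent categories.

d. F-statistic: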
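To make the coefficients concrete, here is a worked prediction using the estimates from the summary output above. The patient values (cholesterol 200, fasting blood sugar 1, chest pain 2) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model) output above.
b0     <- 130.146064  # intercept
b_chol <- 0.039682    # serumcholestrol
b_fbs  <- 8.967851    # fastingbloodsugar
b_cp   <- 5.216343    # chestpain

# Hypothetical patient, not taken from the dataset.
chol <- 200
fbs  <- 1
cp   <- 2

predicted_bp <- b0 + b_chol * chol + b_fbs * fbs + b_cp * cp
predicted_bp  # about 157.48

# Raising the chest pain code by one, with everything else fixed,
# changes the prediction by exactly the chestpain coefficient.
predicted_bp_next <- b0 + b_chol * chol + b_fbs * fbs + b_cp * (cp + 1)
predicted_bp_next - predicted_bp  # about 5.216
```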

The F-statistic (35.64) with a p-value < 2.2e-16 suggests that the model as a whole is statistically significant, meaning at least one of the predictors contributes to explaining the variance in resting BP.
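The F-statistic can be approximately reproduced from the printed R-squared and degrees of freedom using F = (R²/k) / ((1 − R²)/df), where k is the number of predictors. This is a rough check, assuming the rounded values shown in the output, so a small difference from 35.64 is expected.

```r
r_squared <- 0.1018  # Multiple R-squared, as rounded in the printout
k         <- 3       # number of predictors
df_resid  <- 943     # residual degrees of freedom

# Overall F-statistic computed from R-squared.
f_stat <- (r_squared / k) / ((1 - r_squared) / df_resid)
f_stat  # close to the reported 35.64; the gap comes from rounding R-squared

# Upper-tail probability of the F distribution with (k, df_resid) df.
p_value <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)
p_value
```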

Multi-collinearity:

Let’s check whether the explanatory variables exhibit multi-collinearity with each other using the variance inflation factors (VIF) computed below.

model <- lm(restingBP ~ serumcholestrol + fastingbloodsugar + chestpain, data = HAA_filtered)
vif_values <- vif(model)
print(vif_values)

##   serumcholestrol fastingbloodsugar         chestpain 
##          1.098520          1.102840          1.104225

Interpretation:

The above output suggests there is little to no multi-collinearity. A VIF of around 5 or higher indicates a high level of correlation among the explanatory variables, and a high VIF inflates the standard errors of the coefficients, which weakens the robustness of the model. Here, however, all the variables have a VIF of around 1.1, showing little to no multi-collinearity.
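For intuition, the VIF of a predictor equals 1 / (1 − R²), where R² comes from regressing that predictor on the remaining predictors. The sketch below computes this by hand in base R on simulated data; the variables `x1`, `x2`, and `x3` are hypothetical stand-ins, not columns from the dataset.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.3 * x1 + rnorm(n)  # mildly correlated with x1
x3 <- rnorm(n)             # independent of the others
predictors <- data.frame(x1, x2, x3)

# For each predictor, regress it on the other predictors and
# convert the resulting R-squared into a VIF.
manual_vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

manual_vif
```

Read the other way, a VIF of about 1.10 corresponds to an R² of roughly 0.09, meaning only about 9% of a predictor's variation is explained by the other two.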

Diagnostic Plot:

Let’s diagnose the model using the five diagnostic plots given below.

a. Residuals vs Fitted values:

gg_resfitted(model) + 
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In general, the above plot shows the residuals spread widely on both sides of the horizontal red line. The spread does not fan out or narrow as the fitted values change, which suggests that the variance of the residuals is roughly constant. However, a small bump in the middle and a declining smoothed line at the end suggest that some predictors might have a non-linear relationship with the response variable. A better model might be explored to do justice to the relationship between the explanatory variables and the response variable.

b. Residual vs x-values:

plots <- gg_resX(model, plot.all = FALSE)
plots$serumcholestrol

plots$fastingbloodsugar

plots$chestpain

From the above plots, we can see that the residuals vs serum cholesterol and the residuals vs fasting blood sugar are spread fairly evenly on either side of the horizontal line. However, in the residuals vs chest pain plot, each chest pain category shows a different distribution of residuals. For example, category 0 has roughly normally distributed residuals, whereas categories 1 and 2 are less well distributed and show skewness. Also, category 3 has no residuals on the horizontal red line. This indicates that chest pain might have a quadratic or cubic relationship with resting BP, or that its effect differs by category, which can be explored further.
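One way to explore this further is to let each chest pain category have its own effect by entering it as a factor, and then compare that fit with the numeric coding. The sketch below is only an illustration on simulated data: `toy`, `level_effect`, and all values are made up, and the result says nothing about the real dataset.

```r
set.seed(42)
n <- 300
toy <- data.frame(
  serumcholestrol   = round(runif(n, 150, 400)),
  fastingbloodsugar = rbinom(n, 1, 0.3),
  chestpain         = sample(0:3, n, replace = TRUE)
)

# Made-up response with unequal jumps between chest pain categories.
level_effect <- c(0, 8, 3, 15)
toy$restingBP <- 120 + 0.04 * toy$serumcholestrol +
  level_effect[toy$chestpain + 1] + rnorm(n, sd = 10)

# Numeric coding: one slope shared across all category steps.
fit_numeric <- lm(restingBP ~ serumcholestrol + fastingbloodsugar + chestpain,
                  data = toy)

# Factor coding: a separate offset for categories 1, 2 and 3.
fit_factor <- lm(restingBP ~ serumcholestrol + fastingbloodsugar + factor(chestpain),
                 data = toy)

# The numeric model is nested in the factor model, so an F-test
# can compare the two fits.
anova(fit_numeric, fit_factor)
```

A small p-value in the comparison would suggest that the equal-step assumption of the numeric coding does not hold.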

c. Residual Histogram:

gg_reshist(model)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the above histogram, we can see that the residuals roughly follow the shape of a normal distribution. Some bins in the middle depart from what a normal distribution would produce, indicating that there may be ways to improve the robustness of the model. Also, the left and right tails are not symmetric, which indicates some skewness in the residuals and can be explored further.

d. QQ-plots:

gg_qqplot(model)

The above QQ plot indicates that the residuals are close to normal, but not entirely so. In the center, the points deviate slightly from the red line, which signals some skewness in the residuals.

Furthermore, the residuals in the left and right tails lie far from the red line, which is further evidence of outliers or skewness. This calls for a more detailed analysis of the dataset so that the residuals follow the normal distribution as closely as possible.

e. Cook’s D:

gg_cooksd(model, threshold = 'matlab')

From the above plot, we can see that numerous observations have a high Cook’s distance. These high values could be due to extreme values in serum cholesterol or outliers in the other predictors. Specifically, there are patients with a serum cholesterol close to 600 mg/dL. Such high values can have a strong influence on the fitted model and hence on the Cook’s distance.

This plot suggests further investigation of the observations with high Cook’s distance.
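To list the flagged observations rather than only plot them, Cook's distance can be computed directly with base R. The sketch below assumes that the 'matlab' threshold in `gg_cooksd()` corresponds to three times the mean Cook's distance, as described in the lindia documentation; the data are simulated, and the extreme point is added deliberately for illustration.

```r
set.seed(7)
n <- 100
toy <- data.frame(serumcholestrol = round(runif(n, 150, 400)))
toy$restingBP <- 120 + 0.04 * toy$serumcholestrol + rnorm(n, sd = 15)

# Append one made-up extreme observation (row 101).
toy <- rbind(toy, data.frame(serumcholestrol = 600, restingBP = 90))

fit <- lm(restingBP ~ serumcholestrol, data = toy)

# Cook's distance for every observation.
cd <- cooks.distance(fit)

# Assumed 'matlab'-style cutoff: three times the mean Cook's distance.
threshold <- 3 * mean(cd)
flagged   <- which(cd > threshold)

# Inspect the flagged rows.
toy[flagged, ]
```

Applying the same steps to `model` would give the row indices to inspect in `HAA_filtered`.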

Visualization:

ggplot(HAA_filtered, aes(x = serumcholestrol, y = restingBP, color = as.factor(fastingbloodsugar))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  facet_wrap(~ chestpain) +
  labs(title = "Resting Blood Pressure vs. Serum Cholesterol by Chest Pain Levels",
       x = "Serum Cholesterol",
       y = "Resting Blood Pressure",
       color = "Fasting Blood Sugar") +
  scale_color_manual(values = c("0" = "skyblue", "1" = "blue"))

## `geom_smooth()` using formula = 'y ~ x'

Interpretation:

From the above plots we can gain a lot of information. In the first panel, chest pain category 0 (typical angina, which is directly related to heart attack risk), we see a positive relationship between resting BP and serum cholesterol. The data in this panel are well spread across the range of cholesterol values, which supports the positive relationship.

The second panel, chest pain category 1 (atypical angina, less predictable), shows a steeper positive relationship than the first panel. However, the data in the second panel are not as well spread, so the fitted line there is less reliable.

The third panel, chest pain category 2 (non-anginal pain, not related to a heart issue), also shows a positive relationship between resting BP and serum cholesterol.

The fourth panel, chest pain category 3 (asymptomatic), shows a negative relationship between resting BP and serum cholesterol. Because asymptomatic patients show no chest pain symptoms, this category has no clear relation to heart failure risk, so the fourth panel offers little critical information about the relationship between serum cholesterol and resting BP.

In the end, the one thing that all panels except the last indicate is a positive relationship between resting BP and serum cholesterol. This also opens the door to a deeper analysis of the numeric chest pain codes to establish a better relation with resting BP. And, since the points for both fasting blood sugar groups are spread widely across the panels, the relationship between fasting blood sugar and resting BP calls for more careful modeling.
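The slope of the red line in each panel can also be read off numerically by fitting a separate simple regression per chest pain category. This is a minimal sketch on simulated data; `toy` and its values are hypothetical, so the slopes printed here are not those of the real panels.

```r
set.seed(3)
n <- 200
toy <- data.frame(
  serumcholestrol = round(runif(n, 150, 400)),
  chestpain       = sample(0:3, n, replace = TRUE)
)
toy$restingBP <- 125 + 0.05 * toy$serumcholestrol + rnorm(n, sd = 15)

# Split the data by chest pain category and fit
# restingBP ~ serumcholestrol within each group, keeping only the slope.
slopes <- sapply(split(toy, toy$chestpain), function(d) {
  coef(lm(restingBP ~ serumcholestrol, data = d))[["serumcholestrol"]]
})

slopes  # one slope per chest pain category, named "0" to "3"
```

A positive slope corresponds to an upward red line in the matching panel, and a negative slope to a downward one.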

Summary:

In general, the linear regression model for resting BP with serum cholesterol, fasting blood sugar, and chest pain as predictors is statistically significant, and each of the three predictors has a very small p-value. However, with a multiple R-squared of about 0.10, the model explains only around 10% of the variation in resting BP, so its predictive strength is modest rather than strong. The model describes how resting BP relates to these variables; it does not by itself predict the risk of heart attack.

Also, the diagnostic plots show that the residuals are widely spread and roughly resemble a normal distribution. There is room to make the model more robust by investigating the outliers and high-influence points in the dataset. The relationships of fasting blood sugar and chest pain with resting BP can also be analyzed further using a better regression model. For the most part, though, the model serves as a reasonable first look at how these variables relate to resting BP.