Week 9: Data Dive - Notebook

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(ggthemes)
library(ggrepel)
library(lindia)

# remove scientific notation
options(scipen = 6)

# default theme, unless otherwise noted
theme_set(theme_minimal())

HAA <- read.csv("/Users/rupeshswarnakar/Downloads/Cardiovascular_Disease_Dataset.csv")

Linear Regression Model:

Let’s add the gender and fasting blood sugar variables to the previous linear regression model. Both are binary: gender is coded 0 for female and 1 for male, and fasting blood sugar is coded 0 for a level below 120 mg/dL and 1 for a level above 120 mg/dL.

Looking at the dataset, we can see that the serum cholesterol column contains some values of 0. A serum cholesterol of 0 is not physiologically plausible, so these entries most likely represent missing or miscoded measurements rather than real readings. We therefore filter out the rows with 0 serum cholesterol before fitting, so that they do not distort the linear regression model.
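
Before dropping rows, it can help to count how many are affected. The sketch below uses a small made-up data frame (`toy`, not the real HAA data) purely to illustrate the check; the column names mirror the dataset, but the values are hypothetical.

```r
# Toy stand-in for HAA; the values are made up for illustration only
toy <- data.frame(
  serumcholestrol = c(0, 229, 0, 142, 295),
  restingBP       = c(171, 94, 133, 138, 153)
)

# How many rows carry the implausible 0 reading, and what share of the data?
n_zero     <- sum(toy$serumcholestrol == 0)
share_zero <- mean(toy$serumcholestrol == 0)

# Same kind of filter as the one applied to HAA
toy_filtered <- subset(toy, serumcholestrol != 0)

cat("rows dropped:", n_zero, "of", nrow(toy), "\n")
```

If the share of zeros turned out to be large, simply deleting those rows could bias the sample, so the count is worth reporting alongside the model.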

# Filter out rows where serumcholestrol is 0
HAA_filtered <- subset(HAA, serumcholestrol != 0)

# Model using the filtered data
model <- lm(serumcholestrol ~ restingBP + gender + fastingbloodsugar, data = HAA_filtered)

summary(model)

## 
## Call:
## lm(formula = serumcholestrol ~ restingBP + gender + fastingbloodsugar, 
##     data = HAA_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -241.89  -85.78   -3.68   74.91  288.74 
## 
## Coefficients:
##                   Estimate Std. Error t value       Pr(>|t|)    
## (Intercept)       246.7089    19.5165  12.641        < 2e-16 ***
## restingBP           0.6273     0.1191   5.268 0.000000170685 ***
## gender            -36.6996     8.2411  -4.453 0.000009471275 ***
## fastingbloodsugar  49.4414     7.6981   6.423 0.000000000212 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 107.2 on 943 degrees of freedom
## Multiple R-squared:  0.1047, Adjusted R-squared:  0.1018 
## F-statistic: 36.75 on 3 and 943 DF,  p-value: < 2.2e-16

Interpretation:

At a glance, every predictor in the model is statistically significant, although the model as a whole explains only about 10% of the variance in serum cholesterol. Let’s interpret the results from the above model.

a. Intercept:

The value 246.71 is the predicted serum cholesterol (in mg/dL) when every predictor is zero, that is, for a female patient (gender = 0) with fasting blood sugar below 120 mg/dL (fastingbloodsugar = 0) and a resting BP of 0. Practically, this value is not useful on its own, since a living person cannot have a resting blood pressure of 0; the intercept mainly serves to anchor the regression line.
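
One common way to get a more interpretable intercept is to center the continuous predictor at its mean. The sketch below uses simulated data (not the HAA dataset), and the variable names `bp` and `y` are hypothetical; it only illustrates that centering changes the intercept but not the slope.

```r
set.seed(1)  # simulated data, not the HAA dataset
n  <- 200
bp <- rnorm(n, mean = 130, sd = 15)        # hypothetical resting BP values
y  <- 250 + 0.6 * bp + rnorm(n, sd = 20)   # hypothetical response

fit_raw      <- lm(y ~ bp)
bp_centered  <- bp - mean(bp)
fit_centered <- lm(y ~ bp_centered)

# The slope is unchanged; the intercept becomes the predicted y at average BP
coef(fit_raw)
coef(fit_centered)
```

With a centered predictor, the intercept reads as the predicted response for a patient with average resting BP, which is a value that actually occurs in the data.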

b. Resting BP:

The value 0.6273 indicates that for each one-unit increase in resting blood pressure, predicted serum cholesterol increases by about 0.63 mg/dL, holding gender and fasting blood sugar fixed. This association is statistically significant, as shown by the very small p-value of 0.000000170685.

c. Gender:

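The coefficient of -36.6996 means that, with gender coded 0 for female and 1 for male, predicted serum cholesterol is about 36.7 mg/dL lower for male patients than for female patients with the same resting BP and fasting blood sugar. This difference is statistically significant, with a very small p-value of 0.000009471275.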

d. Fasting blood sugar:

The value of 49.4414 suggests that patients with fasting blood sugar above 120 mg/dL (coded 1) have a predicted serum cholesterol about 49.4 mg/dL higher than patients below that level (coded 0), holding the other predictors fixed, which is a fair amount of increment. This difference is statistically significant, with a very small p-value of 0.000000000212.
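
To make the coefficient readings concrete, the fitted equation can be evaluated by hand. The sketch below copies the estimates from the summary output above; the helper `predict_chol` and the example patient (resting BP of 130) are hypothetical and only for illustration.

```r
# Coefficient estimates copied from the summary(model) output above
b <- c(intercept = 246.7089, restingBP = 0.6273,
       gender = -36.6996, fastingbloodsugar = 49.4414)

# Hypothetical helper: predicted serum cholesterol for one patient
predict_chol <- function(restingBP, gender, fastingbloodsugar) {
  unname(b["intercept"] + b["restingBP"] * restingBP +
           b["gender"] * gender + b["fastingbloodsugar"] * fastingbloodsugar)
}

# Male patient (gender = 1), resting BP 130, low vs. high fasting blood sugar
male_low_sugar  <- predict_chol(130, gender = 1, fastingbloodsugar = 0)
male_high_sugar <- predict_chol(130, gender = 1, fastingbloodsugar = 1)

male_low_sugar                    # 291.5583
male_high_sugar - male_low_sugar  # 49.4414, the fastingbloodsugar coefficient
```

The difference between the two predictions equals the fastingbloodsugar coefficient, which is exactly what "holding the other predictors fixed" means for a binary variable.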

e. Residual standard error:

The value 107.2 is roughly the typical amount (in mg/dL) by which the observed values differ from the predicted values, and it gives an idea of the typical prediction error.

f. Multiple R-squared:

The value of 0.1047 means the model explains about 10.47% of the variance in serumcholestrol. This is relatively low, suggesting that other unmodeled factors could be influencing serumcholestrol.

g. Adjusted R-squared:

The value of 0.1018 adjusts the R-squared value slightly downward to account for the number of predictors. This adjustment also indicates a modest fit.
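
The adjusted value can be reproduced from the numbers in the summary output. The sketch below uses the standard adjusted R-squared formula; the inputs are the rounded figures printed above, so the result matches only up to rounding.

```r
r2 <- 0.1047      # Multiple R-squared from the output (rounded)
p  <- 3           # number of predictors in the model
df <- 943         # residual degrees of freedom from the output
n  <- df + p + 1  # implied number of observations: 947

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)  # 0.102
```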

h. F-statistic:

The F-statistic (36.75) with a p-value < 2.2e-16 suggests that the model as a whole is statistically significant, meaning at least one of the predictors contributes to explaining the variance in serum cholesterol.
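
The overall F-statistic is tied directly to R-squared, so it can be checked from the printed output. The sketch below uses the rounded values shown above, which is why the result agrees with 36.75 only approximately.

```r
r2  <- 0.1047  # Multiple R-squared from the output (rounded)
df1 <- 3       # numerator df: number of predictors
df2 <- 943     # denominator df: residual degrees of freedom

# F = (R^2 / df1) / ((1 - R^2) / df2)
f_stat <- (r2 / df1) / ((1 - r2) / df2)

# Upper-tail probability of the F distribution gives the overall p-value
p_val <- pf(f_stat, df1, df2, lower.tail = FALSE)

f_stat
p_val
```

This also shows why a model can be highly significant with a low R-squared: with 943 residual degrees of freedom, even a 10% share of explained variance is far more than chance would produce.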

Multicollinearity:

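Let’s check whether the predictors exhibit multicollinearity by computing the variance inflation factor (VIF) for each of them, as below. Note that this fit also adds exerciseangia as a predictor, and it is this refitted model that the diagnostic plots further down are drawn from.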

model <- lm(serumcholestrol ~ restingBP + gender + exerciseangia + fastingbloodsugar, data = HAA_filtered)
vif_values <- vif(model)
print(vif_values)

##         restingBP            gender     exerciseangia fastingbloodsugar 
##          1.058753          1.009824          1.003740          1.050367

Interpretation:

The VIF values above suggest there is little to no multicollinearity. VIF stands for variance inflation factor, and it is calculated for each variable in the model. A VIF of around 5 or higher indicates a high level of correlation between that variable and the other predictors, which inflates the standard errors of the coefficients and makes the model less robust. Here, however, all the variables have a VIF around 1, showing little to no multicollinearity.
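
To see where a VIF comes from, it can be computed by hand: regress each predictor on the others and take 1 / (1 - R²). The sketch below uses the built-in mtcars data as a stand-in (not the HAA dataset), and `vif_by_hand` is a hypothetical helper written for illustration, not part of the car package.

```r
# mtcars ships with R and is used only as a stand-in dataset
X <- mtcars[, c("wt", "hp", "disp")]

# Hypothetical helper: VIF of one predictor = 1 / (1 - R^2), where R^2 comes
# from regressing that predictor on all the other predictors
vif_by_hand <- function(X) {
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

vif_by_hand(X)
```

A VIF near 1 therefore means the other predictors explain almost none of that variable's variation, which is the situation reported for the model above.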

Diagnostic Plot:

Let’s diagnose the model using the five diagnostic plots given below.

a. Residuals vs Fitted values

gg_resfitted(model) +
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In general, the above plot shows the residuals spread widely across the axis. The spread neither fans out nor narrows as the fitted values increase, which suggests the residual variance is roughly constant. However, a small bump in the middle shows that there are more residuals above the red line there. This could be due to an interaction between the continuous variable and the binary variables, or to the presence of outliers.

The model might need a better way to combine the binary variables with the continuous variable, for example through interaction terms.
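
One possible way to let a binary variable change the effect of a continuous one is an interaction term. The sketch below uses simulated data (not the HAA dataset); the names `bp`, `grp` and `y` are hypothetical, and it is meant only to show the syntax and a comparison of the two fits.

```r
set.seed(42)  # simulated data, not the HAA dataset
n   <- 300
bp  <- rnorm(n, mean = 130, sd = 15)    # hypothetical continuous predictor
grp <- rbinom(n, size = 1, prob = 0.5)  # hypothetical binary predictor (0/1)
y   <- 200 + 0.3 * bp + 10 * grp + 0.5 * bp * grp + rnorm(n, sd = 20)

fit_main <- lm(y ~ bp + grp)  # one common slope for both groups
fit_int  <- lm(y ~ bp * grp)  # bp:grp lets the slope differ by group

# Partial F-test: does the interaction term improve the fit?
anova(fit_main, fit_int)
```

In the formula, `bp * grp` expands to `bp + grp + bp:grp`, so the main effects stay in the model alongside the interaction.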

b. Residuals vs x-values:

plots <- gg_resX(model, plot.all = FALSE)

# for each variable of interest ...
plots$restingBP

plots$gender

plots$fastingbloodsugar

These plots show the residuals against each individual explanatory variable. For resting blood pressure, the data points are spread widely across the axis with no clear pattern, which is evidence that the linearity assumption is met.

For gender and fasting blood sugar, the data points fall only at 0 and 1, since both are binary variables. In each group the residuals are centered around the red line, which suggests that neither group is systematically over- or under-predicted.

c. Residual Histogram:

gg_reshist(model)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the above histogram, we can see that the residuals somewhat mimic a normal distribution. There are some points in the middle that violate the expectation of normality, indicating that there might be ways to improve the model. However, the left and right tails of the histogram taper off smoothly, which is consistent with the assumption.

d. QQ-plots:

gg_qqplot(model)

From the above QQ plot, we can reasonably assume that the residuals are normal enough, which is what is desired in practice. Most of the residuals in the center lie close to the red line.

However, the residuals in the left and right tails fall far from the red line, which indicates the presence of outliers or skewness. This calls for a more detailed analysis of the dataset to bring the residuals as close as possible to a normal distribution.
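
The visual reading of the QQ plot can be complemented with a numeric check. The sketch below fits a stand-in model on the built-in mtcars data (not the HAA model) and applies the Shapiro-Wilk test plus a simple sample skewness; the choice of these two checks is an assumption, not something the plots above require.

```r
# Stand-in model on built-in data; swap in the real model to apply this
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(res)
sw$p.value

# Sample skewness: values far from 0 point to an asymmetric distribution
skew <- mean((res - mean(res))^3) / sd(res)^3
skew
```

With several hundred observations, as in this dataset, the test tends to flag even small departures from normality, so it is best read together with the QQ plot rather than on its own.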

e. Cook’s D:

gg_cooksd(model, threshold = 'matlab')

From the above plot, we can see that numerous observations exhibit a high Cook’s distance. These high values could be due to the extreme values present in cholesterol: there are patients with serum cholesterol close to 600 mg/dL. Such high values might have some impact on the residuals and hence on the Cook’s distance.

This plot suggests further investigation of the observations with high Cook’s distance.
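
To follow up on the flagged points, the Cook’s distances can be extracted and listed directly. The sketch below uses a stand-in model on mtcars (not the HAA model) and assumes that the 'matlab' threshold corresponds to three times the mean Cook’s distance, as described in the lindia documentation.

```r
# Stand-in model on built-in data; swap in the real model to apply this
fit <- lm(mpg ~ wt + hp, data = mtcars)

cd     <- cooks.distance(fit)  # one value per observation
cutoff <- 3 * mean(cd)         # assumed 'matlab'-style threshold

# Observations above the cutoff, most influential first
flagged <- sort(cd[cd > cutoff], decreasing = TRUE)
flagged
```

Looking at the rows behind the flagged names is usually more informative than the plot alone, since it shows whether the influence comes from an extreme response, an extreme predictor, or both.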

Visualization:

# Faceted scatter plot
ggplot(HAA_filtered, aes(x = restingBP, y = serumcholestrol, color = fastingbloodsugar)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  facet_wrap(~ gender) +
  labs(title = "Serum Cholesterol vs. Resting BP by Gender",
       x = "Resting Blood Pressure",
       y = "Serum Cholesterol") +
  scale_color_gradient(low = "lightgreen", high = "darkgreen")

## `geom_smooth()` using formula = 'y ~ x'

Interpretation:

From the above visualization, we can see the relationship between resting blood pressure and serum cholesterol for each gender, with fasting blood sugar shown by color. The trend between resting BP and serum cholesterol is positive, and it appears stronger among males than among females. In general, males tend to have higher cholesterol (LDL: bad cholesterol) than females; note, though, that the fitted gender coefficient in the model above is negative, so this dataset does not show that pattern once resting BP and fasting blood sugar are held fixed. We can also see that the data for females and males are not similar, as the male data are widely spread while the female data are clustered. This result opens the door to further investigation.
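
Because gender and fastingbloodsugar are 0/1 codes, the facet strips and the color legend are easier to read when the codes are converted to labeled factors first. The sketch below uses a small made-up data frame (not the HAA data); the new column names `gender_f` and `fbs_f` and the label wording are hypothetical, based on the coding described at the start of this notebook.

```r
# Toy stand-in; the values are made up for illustration only
toy <- data.frame(
  gender            = c(0, 1, 1, 0),
  fastingbloodsugar = c(1, 0, 1, 0)
)

# Convert the 0/1 codes into labeled factors
toy$gender_f <- factor(toy$gender, levels = c(0, 1),
                       labels = c("Female", "Male"))
toy$fbs_f    <- factor(toy$fastingbloodsugar, levels = c(0, 1),
                       labels = c("Below 120 mg/dL", "Above 120 mg/dL"))

table(toy$gender_f, toy$fbs_f)
```

With factor columns, the plot would facet on the labeled gender column and map color to the labeled blood sugar column. A discrete color scale would then be needed in place of scale_color_gradient, which expects a continuous variable.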

Summary:

In general, the linear regression model for serum cholesterol with the predictors resting BP, gender and fasting blood sugar is statistically significant but modest in explanatory power. The model predicts serum cholesterol, not the risk of heart attack directly, and it explains only about 10% of the variance. The associations between the response and the explanatory variables are statistically significant, as evidenced by the very small p-values.

Also, from the diagnostic plots, we can see that the residuals are mostly widespread and roughly resemble a normal distribution. There is some room to make the model more robust by analyzing the outliers in the dataset. However, for the most part the model assumptions appear to be reasonably met.