# Load the dataset and preprocess
d1 <- read.csv("http://hbiostat.org/data/repo/2.20.Framingham.csv")
d1 <- d1 %>% mutate(sex = factor(sex, 1:2, c("male", "female")))
# Note for the sex variable: 1 = Male, 2 = Female

Introduction

The Framingham Heart Study has been instrumental in identifying key risk factors for cardiovascular disease (CVD). This analysis focuses on predicting log serum cholesterol (scl) using body mass index (BMI), age, and sex. The objective is to fit regression models, explore the relationships among these variables, and evaluate model performance using various diagnostic tools.

We will begin by modeling the log serum cholesterol (scl) with the predictors bmi, age, and sex using both Maximum Likelihood Estimation (MLE) and Bayesian methods. We then proceed with diagnostic measures to evaluate the performance of our models.

Part 1: Regression Modeling

A. Model Specification

First, we specify a model for log serum cholesterol (scl) with the predictors bmi, age, and sex. The model captures non-linear relationships through restricted cubic splines and includes interactions between age and sex, bmi and sex, and age and bmi. With \(k = 5\) knots, each spline contributes \(k - 1 = 4\) basis functions, and the model is given by

\[\begin{aligned}
E[\log(\text{scl})\mid\text{bmi},\text{age},\text{sex}] ={}& \beta_0 + \beta_1\,\text{sex} + \sum_{i=1}^{4} \gamma_i\,\text{age}^{(i)} + \sum_{j=1}^{4} \delta_j\,\text{bmi}^{(j)} \\
&+ \sum_{i=1}^{4} \theta_i\,\text{age}^{(i)}\,\text{sex} + \sum_{j=1}^{4} \lambda_j\,\text{bmi}^{(j)}\,\text{sex} + \sum_{i=1}^{4}\sum_{j=1}^{4} \psi_{ij}\,\text{age}^{(i)}\,\text{bmi}^{(j)}, \\
V[\log(\text{scl})\mid\text{bmi},\text{age},\text{sex}] ={}& \sigma^2,
\end{aligned}\]

where \(\text{age}^{(i)}\) and \(\text{bmi}^{(j)}\) denote the restricted cubic spline basis functions of age and bmi. In R, we can write it as follows:

f1 <- log(scl) ~ sex + rcs(age,5) + rcs(bmi,5) + rcs(age,5)*sex + rcs(bmi,5)*sex + rcs(age,5)*rcs(bmi,5)

Next, we fit this model using both MLE and Bayesian methods. In this example, the default priors are used for the Bayesian fit.

B. Model Fitting with MLE and Bayesian Methods

I. Maximum Likelihood Estimation

dd <- datadist(d1)        # store predictor distributions for rms
options(datadist = "dd")

m1  <- ols(f1, data = d1)                      # rms fit, used for Predict() and summary()
m11 <- ols(f1, data = d1, x = TRUE, y = TRUE)  # keeps design matrix and response for validate()/calibrate()
m2  <- lm(f1, data = d1)                       # base-R fit, used for the diagnostic plots

summary(m1)
##              Effects              Response : log(scl) 
## 
##  Factor            Low  High Diff. Effect    S.E.     Lower 0.95 Upper 0.95
##  age               39.0 53   14.0  0.1678800 0.017123  0.1343100 0.201450  
##  bmi               22.8 28    5.2  0.0251340 0.016198 -0.0066221 0.056889  
##  sex - male:female  2.0  1     NA  0.0097406 0.014532 -0.0187500 0.038231  
## 
## Adjusted to: sex=female age=45 bmi=25.2

For example, increasing age across its interquartile range (39 to 53) raises predicted log cholesterol by 0.168, i.e. roughly an \(\exp(0.168) \approx 1.18\)-fold (18%) increase in serum cholesterol, holding bmi and sex at the adjusted values shown.

II. Bayesian Estimation
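
As a minimal sketch (assuming the rstanarm package is available), the same model can be fit in a Bayesian framework with stan_glm() and its default weakly informative priors. Because rcs() is an rms-specific construct, the sketch substitutes natural cubic splines (splines::ns() with 4 degrees of freedom, roughly the flexibility of a 5-knot restricted cubic spline), so the spline basis is not identical to the one used by ols() above.

library(rstanarm)  # assumed available
library(splines)

# Same structure as f1, with ns() standing in for rcs()
f1b <- log(scl) ~ (ns(age, 4) + ns(bmi, 4)) * sex + ns(age, 4):ns(bmi, 4)

b1 <- stan_glm(f1b, data = d1, family = gaussian(),  # normal likelihood, matching OLS
               chains = 4, iter = 2000, seed = 1)    # default priors, fixed seed
summary(b1)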

C. Partial Effect Plots

We generate partial effect plots from the MLE model. Specifically, we label the predicted median serum cholesterol for a 48-year-old female with a BMI of 28.

p1 <- Predict(m1, age=48, sex = "female", bmi=28)
p1
##   age    sex bmi     yhat    lower    upper
## 1  48 female  28 5.458447 5.437951 5.478942
## 
## Response variable (y): log(scl) 
## 
## Limits are 0.95 confidence limits
Because the model is linear in \(\log(\text{scl})\), the prediction above is on the log scale; exponentiating gives \(\exp(5.458) \approx 235\), the estimated median serum cholesterol for this patient on the original scale.
g1 <- Predict(m1, bmi, age = 48, sex = "female") %>%
    as.data.frame() %>%
    ggplot(aes(x = bmi, y = yhat)) +
    geom_line(linewidth = 0.75, colour = "lightcoral") +
    labs(x="bmi",y="E[log(scl)|bmi,age,sex = female]",title="Female Regression")+
    theme(plot.title = element_text(hjust=0.5))+ 
    annotate(geom = "point", x = 28, y = 5.458447, size = 2) + 
    annotate(geom = "text", x = 28, y = 5.47, label = "Female Patient")
g1

g2 <- Predict(m1, bmi, age = 48, sex = "male") %>%
    as.data.frame() %>%
    ggplot(aes(x = bmi, y = yhat)) +
    geom_line(linewidth = 0.75, colour = "black") +
    labs(x="bmi",y="E[log(scl)|bmi,age,sex = male]",title="Male Regression")+
    theme(plot.title = element_text(hjust=0.5))
g2

g3 <- Predict(m1, bmi, sex, age = 48) %>%
    as.data.frame() %>%
    ggplot(aes(x = bmi, y = yhat, colour = sex)) +
    geom_line(linewidth = 0.75) +
    labs(x = "bmi", y = "E[log(scl)|bmi,age,sex]", title = "Female vs Male Comparison") +
    theme(plot.title = element_text(hjust = 0.5)) + 
    scale_color_manual(values = c("female" = "lightcoral", "male" = "black")) +  # Specify colors for sex
    annotate(geom = "point", x = 28, y = 5.458447, size = 2) + 
    annotate(geom = "text", x = 28, y = 5.485, label = "Female Patient")
g3

plot_grid(g1, g2)

Part 2: Diagnostic Measures and Model Performance

In this section, we explore diagnostic measures and evaluate the performance of our regression models.

A. Diagnostic Measures

Diagnostic measures are essential for evaluating the accuracy of a model. For a well-fitting model, the residuals (the differences between observed and fitted values) should be approximately normally distributed, \(N(0, \sigma^2)\).
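
A quick numerical companion to the graphical checks below (a sketch; the plots remain the primary tool): the Shapiro-Wilk test, which accepts at most 5000 observations, can be applied to the residuals of the lm fit m2.

# Shapiro-Wilk normality test on the residuals (n is below the 5000 limit)
shapiro.test(resid(m2))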

I. Key Diagnostic Measures for Model Performance

  • Central Tendency of Prediction Errors: These measures capture how far the predicted values deviate from the observed values. Common metrics include the mean absolute error, the mean squared error, and the logarithmic loss (the last applies to probabilistic classifiers); a short sketch computing these for our model follows this list.

  • Discrimination Measures: Discrimination measures assess how well a model distinguishes between different outcomes. In this analysis, we focus on the coefficient of determination (\(R^2\)), which quantifies the proportion of variance in the dependent variable explained by the model.

  • Calibration Measures: Calibration reflects how well predictions align with actual outcomes. A probabilistic model is considered well-calibrated if a prediction made with confidence \(p\) is correct \(p\%\) of the time. Ideally, a calibration plot should show a perfect diagonal line, indicating that predictions match observations.
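
A minimal sketch computing the central-tendency and discrimination metrics for the lm fit m2, using only the rows that entered the fit:

obs  <- model.frame(m2)[, 1]      # observed log(scl) for the fitted rows
pred <- fitted(m2)                # model predictions for the same rows
c(MAE = mean(abs(obs - pred)),    # mean absolute error
  MSE = mean((obs - pred)^2),     # mean squared error
  R2  = cor(obs, pred)^2)         # coefficient of determination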

II. Vocabulary

Residual: The difference between an observed value and a predicted value in our regression model. That is,

\[ residual = y - \hat{y}. \]

Standardized residual: A residual divided by its estimated standard deviation. In practice, a standardized residual with an absolute value greater than 3 is considered to be an outlier.

Leverage: A measure of how far an observation’s predictor values lie from the bulk of the data. High-leverage observations have the potential to strongly change the fitted coefficients if they were removed from the dataset.

Cook’s Distance: A combined measure of leverage and residual size: the higher the leverage and the residual, the higher the Cook’s distance. A Cook’s distance greater than 0.5 is considered high and greater than 1 extreme; the sketch below extracts all three quantities from our lm fit.
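
All three quantities can be extracted from the lm fit m2 with base-R helpers; a minimal sketch:

z  <- rstandard(m2)        # standardized residuals
h  <- hatvalues(m2)        # leverage (diagonal of the hat matrix)
cd <- cooks.distance(m2)   # Cook's distance

which(abs(z) > 3)   # candidate outliers by the |z| > 3 rule
which(cd > 0.5)     # high-influence observations by the 0.5 cutoff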

III. Diagnostic Plots

Histogram of Residuals: The histogram of the residuals can be used to check whether the residuals are normally distributed. A symmetric, bell-shaped histogram centered evenly around zero indicates normality.

Boxplot: A boxplot of the residuals on the y-axis against a categorical predictor (here, sex) on the x-axis. It shows whether the center and spread of the residuals differ across groups.

Residuals vs Fitted: The residuals vs fitted values plot is used to check the linear model assumptions. A horizontal band without distinct patterns indicates a linear relationship, which is good.

Residuals vs Predictor: This plot works just like the Residuals vs Fitted plot. A random pattern around the horizontal line indicates that a linear model provides a decent fit with respect to that predictor.

Normal Q-Q: This visualization is used to examine whether the residuals are normally distributed. The closer the residual points follow the straight dashed line, the better; systematic departures at the tails indicate heavy-tailed or skewed residuals.

Scale-Location: This is used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity.

Residuals vs Leverage: This is used to identify the presence of extreme values that might influence the regression results when included or excluded from the analysis.

yhat   <- fitted(m2)     # fitted values
resids <- resid(m2)      # raw residuals
ehat   <- rstudent(m2)   # studentized residuals

# Merge back into d1 by row name; rows dropped for missingness remain NA
d1[names(yhat),"yhat"]     <- yhat
d1[names(ehat),"ehat"]     <- ehat
d1[names(resids),"resids"] <- resids

Histogram of Residuals & Boxplot

p1<-ggplot(d1,aes(x=resids)) +
  geom_histogram(binwidth=0.01,color="Black",fill="lightcoral")+
  labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')+
  theme(plot.title = element_text(hjust=0.5))

p2<-ggplot(data=d1,mapping=aes(x=sex,y=resids)) +
  geom_boxplot(color="Black",fill="lightcoral")+
  labs(title = 'Boxplot of Residuals', x = 'Sex', y = 'Residuals')+
  theme(plot.title = element_text(hjust=0.5))

plot_grid(p1,p2)
## Warning: Removed 41 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 41 rows containing non-finite values (`stat_boxplot()`).

Residuals vs Fitted

#cols <- c("black", "gray20")
p1<-ggplot(d1,aes(x=yhat,y=ehat,shape=sex))+geom_point()+
  geom_hline(yintercept=0,linetype="dashed")+
  geom_smooth(method = "loess", linewidth = 0.5,colour = "lightcoral")+
  labs(x="Fitted values",y="Residuals",title="Residuals vs Fitted")+
  theme(plot.title = element_text(hjust=0.5))
#+scale_color_manual(values = cols)

p2<-ggplot(d1,aes(x=yhat,y=ehat))+geom_point()+
  geom_hline(yintercept=0,linetype="dashed")+
  geom_smooth(method = "loess", linewidth = 0.5,colour = "lightcoral")+
  labs(x="Fitted values",y="Residuals",title="Residuals vs Fitted")+
  theme(plot.title = element_text(hjust=0.5)) +
  facet_wrap(~sex) #Separate a plot based on groups

p1
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 41 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 41 rows containing missing values (`geom_point()`).

p2
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 41 rows containing non-finite values (`stat_smooth()`).
## Removed 41 rows containing missing values (`geom_point()`).

#Base R:
# ehat <- m1$residuals
# yhat <- m1$fitted.values

# plot(yhat,ehat)
# abline(h=0)
# points(yhat, predict(loess(ehat~yhat)), col = "pink")

Residuals vs Predictor

ggplot(d1,aes(x=age,y=ehat))+geom_point()+
  geom_hline(yintercept=0,linetype="dashed")+
  labs(x="age",y="Residuals",title="Residuals vs age")+
  theme(plot.title = element_text(hjust=0.5))
## Warning: Removed 41 rows containing missing values (`geom_point()`).

ggplot(d1,aes(x=bmi,y=ehat))+geom_point()+
  geom_hline(yintercept=0,linetype="dashed")+
  labs(x="bmi",y="Residuals",title="Residuals vs bmi")+
  theme(plot.title = element_text(hjust=0.5))
## Warning: Removed 41 rows containing missing values (`geom_point()`).

# Base R
# plot(m1$x[,"age"], ehat)
# plot(m1$x[,"bmi"], ehat)

Normal Q-Q Plot

ggplot(d1,aes(sample = resids))+
  geom_qq()+
  geom_qq_line(linewidth = 0.5, colour = "lightcoral")+
  labs(x="Theoretical Quantiles",y="Sample Quantiles of Residuals",title="Normal Q-Q Plot")+
  theme(plot.title = element_text(hjust=0.5))
## Warning: Removed 41 rows containing non-finite values (`stat_qq()`).
## Warning: Removed 41 rows containing non-finite values (`stat_qq_line()`).

#Base R:
#m1 %>% 
# `class<-`("lm") %>% 
# rstudent %>% # vector residuals 
# qqnorm
#abline(0,1)

Scale-Location

This plot is a variation on the fitted vs residuals plot, but the y-axis uses the square root of the absolute value of the standardized residuals.

ggplot(d1,aes(x=yhat,y=sqrt(abs(ehat))))+geom_point()+
  geom_smooth(method = "loess", linewidth = 0.5,colour = "lightcoral")+
  labs(x="Fitted values",y="sqrt(|Standardized Residuals|)",title="Scale-Location")+
  theme(plot.title = element_text(hjust=0.5))
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 41 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 41 rows containing missing values (`geom_point()`).

Cook’s distance & Residuals vs Leverage

If any point in this plot falls outside the Cook’s distance contours (the red dashed lines), it is considered an influential observation.

# Cook's distance
plot(m2, 4)

# Residuals vs Leverage
plot(m2, 5)

#plot(lm(f1,data=d1))
#par(mfrow = c(2, 2))
#plot(m2)

SPLOM - Scatterplot Matrix

SPLOM highlights points that may not be extreme in any single plot but exert influence across multiple dimensions, giving a visual of overly influential observations. We leave the code commented out for now because it outputs the graphs one by one and takes up too much space.

#i1 <- residuals(m1, "influence")
#for(n in colnames(i1)) plot(i1[,n], ylab = n)

#idx <- (i1[,"cook.d"] > 0.04) %in% TRUE  # Assess if we have 
#d1[idx,] %>% 
#  kable %>% 
#  kable_styling("bordered")

#pairs(d1[,all.vars(f1)], col = idx + 1, pch = c(1,16)[idx+1]) # potentially influential observations
# Identify potential miscoding too

B. Measures of Model Performance

I. Optimism-Corrected \(R^2\)

The following code defines a function that estimates the bootstrap optimism of \(R^2\): the model is refit on a bootstrap resample, and the optimism is the difference between that fit’s apparent \(R^2\) and the \(R^2\) it achieves on the original data. Averaging this difference over replicates and subtracting it from the apparent \(R^2\) yields the optimism-corrected \(R^2\). We apply it to the MLE model.

optim_r2 <- function(m1, d1){
    idx <- sample(1:nrow(d1), nrow(d1), TRUE)   # bootstrap resample of row indices
    m2 <- update(m1, data = d1[idx,])           # refit the model on the resample
    r2_b <- m2$stats["R2"]                      # apparent R2 on the bootstrap sample
    yhat <- predict(m2, newdata = d1)           # predict back on the original data
    r2_o <- cor(yhat, model.frame(m1$terms, data = d1)[,1], use="complete")^2
    r2_b - r2_o                                 # optimism for this replicate
}

d2 <- d1 %>% 
  dplyr::select(all.vars(f1)) %>% 
  complete.cases()          # rows complete in all model variables
d3 <- d1[d2,]

o1 <- replicate(100, optim_r2(m1,d3)) %>% 
  mean

data.frame(
    `R-square` = m1$stats["R2"]
  , Optimism = o1
  , `Adjusted R-square` = m1$stats["R2"] - o1
  , check.names = FALSE) %>% 
  round(3) %>% 
  kable %>% 
  kable_styling(c("bordered"))
     R-square  Optimism  Adjusted R-square
R2      0.150     0.013              0.137

II. Calibration Curve

The plot below shows a calibration curve with an optimism-corrected loess line. Calibration curves are regression diagnostics that assess how well the model’s predictions align with the observed outcomes; for a continuous response such as ours, predicted and observed values of log(scl) are compared. If the model is well-calibrated, the curve should closely follow the diagonal, indicating that the predictions accurately reflect the actual data.

# Optimism-corrected calibration via resampling (B = 100 repetitions)
c1 <- calibrate(m11, B=100)
plot(c1)

## 
## n=4658   Mean absolute error=0.001   Mean squared error=0
## 0.9 Quantile of absolute error=0.004

III. Table of Optimism-Corrected Model Performance Metrics for the MLE Model

v1 <- validate(m11, B = 100)
v1 %>% 
  unclass() %>% 
  round(3) %>% 
  kable() %>% 
  kable_styling(c("bordered", "striped")) %>% 
  row_spec(0, bold = TRUE, color = "white", background = "lightcoral")
            index.orig  training   test  optimism  index.corrected    n
R-square         0.150     0.156  0.143     0.013            0.137  100
MSE              0.031     0.031  0.031     0.000            0.032  100
g                0.084     0.085  0.082     0.003            0.080  100
Intercept        0.000     0.000  0.209    -0.209            0.209  100
Slope            1.000     1.000  0.961     0.039            0.961  100