# Load the dataset and preprocess
d1 <- read.csv("http://hbiostat.org/data/repo/2.20.Framingham.csv")
d1 <- d1 %>% mutate(sex = factor(sex, 1:2, c("male", "female")))
# Note for the sex variable: 1 = Male, 2 = Female
The Framingham Heart Study has been instrumental in identifying key risk factors for cardiovascular disease (CVD). This analysis focuses on predicting log serum cholesterol (scl) using body mass index (BMI), age, and sex. The objective is to fit regression models, explore the relationships among these variables, and evaluate model performance using various diagnostic tools.
We will begin by modeling the log serum cholesterol (scl) with the predictors bmi, age, and sex using both Maximum Likelihood Estimation (MLE) and Bayesian methods. We then proceed with diagnostic measures to evaluate the performance of our models.
First, we specify a model for log serum cholesterol (scl) with the
predictors bmi, age, and sex. The
model captures non-linear relationships through restricted cubic splines
and includes interactions between age, BMI, and sex. If the number of
knots \(k=5\), the model is given
by
$$ In R, we can write it as follows:
f1 <- log(scl) ~ sex + rcs(age,5) + rcs(bmi,5) + rcs(age,5)*sex + rcs(bmi,5)*sex + rcs(age,5)*rcs(bmi,5)
Next, we fit this model using both MLE and Bayesian methods.In this example, the default priors are used.
dd <- datadist(d1)
options(datadist = "dd")
m1 <- ols(f1, data = d1)
m11 <- ols(f1, data = d1, x = TRUE, y = TRUE)
m2 <- lm(f1, data = d1)
summary(m1)
## Effects Response : log(scl)
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## age 39.0 53 14.0 0.1678800 0.017123 0.1343100 0.201450
## bmi 22.8 28 5.2 0.0251340 0.016198 -0.0066221 0.056889
## sex - male:female 2.0 1 NA 0.0097406 0.014532 -0.0187500 0.038231
##
## Adjusted to: sex=female age=45 bmi=25.2
We generate partial effect plots from the MLE model. Specifically, we label the predicted median serum cholesterol for a 48-year-old female with a BMI of 28.
p1 <- Predict(m1, age=48, sex = "female", bmi=28)
p1
## age sex bmi yhat lower upper
## 1 48 female 28 5.458447 5.437951 5.478942
##
## Response variable (y): log(scl)
##
## Limits are 0.95 confidence limits
g1 <- m1 %>% Predict(age=48, sex = "female", bmi)%>%
ggplot(d1, aes(x = bmi, y = yhat, colour=sex))+
stat_smooth(method = 'nls', formula = 'f2', se=FALSE)+
geom_line(size = 0.75, colour = "lightcoral")+
labs(x="bmi",y="E[scl|bmi,age,sex = female]",title="Female Regression")+
theme(plot.title = element_text(hjust=0.5))+
annotate(geom = "point", x = 28, y = 5.458447, size = 2) +
annotate(geom = "text", x = 28, y = 5.47, label = "Female Patient")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
g1
## Warning: Computation failed in `stat_smooth()`
## Caused by error in `formula.character()`:
## ! invalid formula "f2": not a call
g2 <- m1 %>% Predict(age=48, sex = "male", bmi)%>%
ggplot(d1, aes(x = bmi, y = yhat, colour=sex))+
stat_smooth(method = 'nls', formula = 'f2', se=FALSE)+
geom_line(size = 0.75, colour = "black")+
labs(x="bmi",y="E[scl|bmi,age,sex = male]",title="Male Regression")+
theme(plot.title = element_text(hjust=0.5))
g2
## Warning: Computation failed in `stat_smooth()`
## Caused by error in `formula.character()`:
## ! invalid formula "f2": not a call
g3 <- m1 %>% Predict(age = 48, bmi, sex) %>%
ggplot(d1, aes(x = bmi, y = yhat, colour = sex)) +
stat_smooth(method = 'nls', formula = 'f2', se = FALSE) +
geom_line(size = 0.75) +
labs(x = "bmi", y = "E[scl|bmi,age,sex]", title = "Female vs Male Comparison") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_manual(values = c("female" = "lightcoral", "male" = "black")) + # Specify colors for sex
annotate(geom = "point", x = 28, y = 5.458447, size = 2) +
annotate(geom = "text", x = 28, y = 5.485, label = "Female Patient")
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
g3
## Warning: Computation failed in `stat_smooth()`
## Caused by error in `formula.character()`:
## ! invalid formula "f2": not a call
plot_grid(g1, g2)
## Warning: Computation failed in `stat_smooth()`
## Computation failed in `stat_smooth()`
## Caused by error in `formula.character()`:
## ! invalid formula "f2": not a call
In this section, we explore diagnostic measures and evaluate the performance of our regression models.
Diagnostic measures are essential for evaluating the accuracy of a model. For a well-fitting model, the residuals (the differences between observed and predicted values) should resemble a normal distribution, specifically \(N(0,σ^2)\).
Central Tendency of Prediction Errors: These measures capture how far the predicted values deviate from the actual values. Common metrics include the mean absolute error, mean squared error, and logarithmic loss.
Discrimination Measures: Discrimination measures assess how well a model distinguishes between different outcomes. In this analysis, we focus on the coefficient of determination (\(R^2\)), which quantifies the proportion of variance in the dependent variable explained by the model.
Calibration Measures: Calibration reflects how well predicted probabilities align with actual outcomes. A model is considered well-calibrated if a prediction with confidence \(p\) is correct \(p%\) of the time. Ideally, a calibration plot should show a perfect diagonal line, indicating that predicted probabilities match observed probabilities.
Residual: The difference between an observed value and a predicted value in our regression model. That is,
\[ residual = y - \hat{y}. \]
Standardized residual: A residual divided by its estimated standard deviation. In practice, a standardized residual with an absolute value greater than 3 is considered to be an outlier.
Leverage: refers to the extent to which the coefficients in the regression model would change if a particular observation was removed from the dataset.
Cook’s Distance: The higher the leverage and residuals, the higher the Cook’s distance.The Cook’s distance is considered high if it is greater than 0.5 and extreme if it is greater than 1
Histogram of Residuals: The Histogram of the Residuals can be used to check whether the variance is normally distributed. A symmetric bell-shaped histogram which is evenly distributed around zero indicates normality.
Boxplot: This a scatter plot of residuals on the y-axis and the predictor values (age,bmi) on the x-axis. The boxplot shows the difference between the actual and fitted values.
Residuals vs Fitted: The Residuals vs Fitted values is used to check the linear model assumptions. A horizontal line, without distinct patterns is an indication for a linear relationship, what is good.
Residuals vs Predictor: This plot works just like the The Residuals vs Fitted plot. A random pattern around the horizontal line indicates that a linear model provides a decent fit to the data.
Normal Q-Q: This visualization is used to examine whether the residuals are normally distributed. If linearity is not met, we would expect very large/small residuals. To assess the assumption of linearity we must ensure that the residuals are not too far away from 0. The closer the residual points follow the straight dashed line the better.
Scale-Location: This is used to check the homogeneity of variance of the residuals (homoscedasticity). Horizontal line with equally spread points is a good indication of homoscedasticity.
Residuals vs Leverage: This is used to identify the presence of extreme values that might influence the regression results when included or excluded from the analysis.
yhat=fitted(m2)
resids=resid(m2)
ehat=rstudent(m2)
d1[names(yhat),"yhat"]<-yhat
d1[names(ehat),"ehat"]<-ehat
d1[names(resids),"resids"]<-resids
p1<-ggplot(d1,aes(x=resids)) +
geom_histogram(color="Black",fill="lightcoral")+
labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')+
theme(plot.title = element_text(hjust=0.5))
# Add the argument binwidth=0.01 inside geom_histogram
p2<-ggplot(data=d1,mapping=aes(x=sex,y=resids,color=sex)) +
geom_boxplot(color="Black",fill="lightcoral")+
labs(title = 'Boxplot of Residuals', x = 'Sex', y = 'Residuals')+
theme(plot.title = element_text(hjust=0.5))
plot_grid(p1,p2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 41 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 41 rows containing non-finite values (`stat_boxplot()`).
#cols <- c("black", "gray20")
p1<-ggplot(d1,aes(x=yhat,y=ehat,shape=sex))+geom_point()+
geom_hline(yintercept=0,linetype="dashed")+
geom_smooth(method = "loess", size = 0.5,colour = "lightcoral")+
labs(x="Fitted values",y="Residuals",title="Residuals vs Fitted")+
theme(plot.title = element_text(hjust=0.5))
#+scale_color_manual(values = cols)
p2<-ggplot(d1,aes(x=yhat,y=ehat))+geom_point()+
geom_hline(yintercept=0,linetype="dashed")+
geom_smooth(method = "loess", size = 0.5,colour = "lightcoral")+
labs(x="Fitted values",y="Residuals",title="Residuals vs Fitted")+
theme(plot.title = element_text(hjust=0.5)) +
facet_wrap(~sex) #Separate a plot based on groups
p1
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 41 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 41 rows containing missing values (`geom_point()`).
p2
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 41 rows containing non-finite values (`stat_smooth()`).
## Removed 41 rows containing missing values (`geom_point()`).
#Base R:
# ehat <- m1$residuals
# yhat <- m1$fitted.values
# plot(yhat,ehat)
# abline(h=0)
# points(yhat, predict(loess(ehat~yhat)), col = "pink")
ggplot(d1,aes(x=age,y=ehat))+geom_point()+
geom_hline(yintercept=0,linetype="dashed")+
labs(x="age",y="Residuals",title="Residuals vs age")+
theme(plot.title = element_text(hjust=0.5))
## Warning: Removed 41 rows containing missing values (`geom_point()`).
ggplot(d1,aes(x=bmi,y=ehat))+geom_point()+
geom_hline(yintercept=0,linetype="dashed")+
labs(x="bmi",y="Residuals",title="Residuals vs bmi")+
theme(plot.title = element_text(hjust=0.5))
## Warning: Removed 41 rows containing missing values (`geom_point()`).
# Base R
# plot(m1$x[,"age"], ehat)
# plot(m1$x[,"bmi"], ehat)
ggplot(d1,aes(sample = resids))+
geom_qq()+
geom_qq_line(size = 0.5, colour = "lightcoral")+
labs(x="Theoretical Quantiles",y="Sample Quantiles/Standarized Residuals",title="Normal Q-Q Plot")+
theme(plot.title = element_text(hjust=0.5))
## Warning: Removed 41 rows containing non-finite values (`stat_qq()`).
## Warning: Removed 41 rows containing non-finite values (`stat_qq_line()`).
#Base R:
#m1 %>%
# `class<-`("lm") %>%
# rstudent %>% # vector residuals
# qqnorm
#abline(0,1)
This plot is a variation on the fitted vs residuals plot, but the y-axis uses the square root of the absolute value of the standardized residuals.
ggplot(d1,aes(x=yhat,y=sqrt(resids)))+geom_point()+
geom_smooth(method = "loess", size = 0.5,colour = "lightcoral")+
labs(x="Fitted values",y="sqrt(Standarized Residuals)",title="Scale-Location")+
theme(plot.title = element_text(hjust=0.5))
## Warning in sqrt(resids): NaNs produced
## Warning in sqrt(resids): NaNs produced
## Warning in sqrt(resids): NaNs produced
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2417 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2417 rows containing missing values (`geom_point()`).
If any point in this plot falls outside of Cook’s distance (the red dashed lines) then it is considered to be an influential observation.
# Cook's distance
plot(m2, 4)
# Residuals vs Leverage
plot(m2, 5)
#plot(lm(f1,data=d1))
#par(mfrow = c(2, 2))
#plot(m2)
SPLOM highlights points that may not be extreme in single plots but show influence across multiple dimensions. This gives us a visual of overly influential observations. We will comment it for now as it outputs the graphs one by one and takes too much space.
#i1 <- residuals(m1, "influence")
#for(n in colnames(i1)) plot(i1[,n], ylab = n)
#idx <- (i1[,"cook.d"] > 0.04) %in% TRUE # Assess if we have
#d1[idx,] %>%
# kable %>%
# kable_styling("bordered")
#pairs(d1[,all.vars(f1)], col = idx + 1, pch = c(1,16)[idx+1]) # potentially infuential observatins
# Identify potential miscoding too
The following code includes a function that calculates the optimism corrected \(R^2\). We apply it to the MLE model.
optim_r2 <- function(m1, d1){
idx <- sample(1:nrow(d1), nrow(d1), TRUE)
m2 <- update(m1, data = d1[idx,])
r2_b <- m2$stats["R2"]
yhat <- predict(m2, newdata = d1)
r2_o <- cor(yhat, model.frame(m1$terms, data = d1)[,1], use="complete")^2
r2_b - r2_o
}
d2 <- d1 %>%
dplyr::select(all.vars(f1)) %>%
complete.cases()
d3 <- d1[d2,]
o1 <- replicate(100, optim_r2(m1,d3)) %>%
mean
data.frame(
`R-square` = m1$stats["R2"]
, Optimism = o1
, `Adjusted R-square` = m1$stats["R2"] - o1
, check.names = FALSE) %>%
round(3) %>%
kable %>%
kable_styling(c("bordered"))
| R-square | Optimism | Adjusted R-square | |
|---|---|---|---|
| R2 | 0.15 | 0.013 | 0.137 |
The plot below shows a calibration curve with an optimism-corrected loess line. Calibration curves are regression diagnostics that assess how well the model’s predicted probabilities align with the observed outcomes. If the model is well-calibrated, the loess curve should closely follow the diagonal, indicating that the predicted probabilities accurately reflect the actual data.
# Locally weighted regression
c1 <- calibrate(m11, B=100)
plot(c1)
##
## n=4658 Mean absolute error=0.001 Mean squared error=0
## 0.9 Quantile of absolute error=0.004
v1 <- validate(m11, B = 100)
v1 %>%
unclass() %>%
round(3) %>%
kable() %>%
kable_styling(c("bordered", "striped")) %>%
row_spec(0, bold = TRUE, color = "white", background = "lightcoral")
| index.orig | training | test | optimism | index.corrected | n | |
|---|---|---|---|---|---|---|
| R-square | 0.150 | 0.156 | 0.143 | 0.013 | 0.137 | 100 |
| MSE | 0.031 | 0.031 | 0.031 | 0.000 | 0.032 | 100 |
| g | 0.084 | 0.085 | 0.082 | 0.003 | 0.080 | 100 |
| Intercept | 0.000 | 0.000 | 0.209 | -0.209 | 0.209 | 100 |
| Slope | 1.000 | 1.000 | 0.961 | 0.039 | 0.961 | 100 |