MCMC Sampling for a Logistic Regression Model
Packages
The following packages are used in this project:
here: enables easy file referencing and builds file paths in a OS-independent waystats: loads this before loadingtidyverseto avoid masking sometidyversefunctionscumstats: efficiently computing cumulative standard deviationtidyverse: includes collections of useful packages likedplyr(data manipulation),tidyr(tidying data),ggplots(creating graphs), etc.skimr: provide summary statistics about variables in data frames, tibbles, data tables and vectorsglue: embedding and evaluating R expressions into string to be printed as messagescales: formats and labels scales nicely for better visualizationbroom: tidy and unify the fit objects returned by most of the modeling functionsTeachingDemos: empirical HPD intervals
In addition, the pacman package provides handy tools to
manage R packages (install, update, load and unload). We use its
p_laod() instead of libarary() to load the
packages listed above.
## Loading required package: pacman
pacman::p_load(
here,
stats, cumstats,
tidyverse,
skimr, glue, scales, gt,
broom,
TeachingDemos
)
# helper function: string concatenation (ex: "act" %&% "5")
'%&%' <- function(x, y) paste0(x, y)
# not %in%
'%notin%' <- Negate('%in%')In this work, we reuse one of the helper function we developed before
dist_summary. It conveniently gives graphical (histogram)
and numeric summaries (mean, median, standard deviation and confidence
intervals) of a distribution from a vector. In this updated version, we
add the credible intervals by using
TeachingDemos::CI_hpd().
Arguments:
x: A numeric vectorvar_expression: expression used in plot axis labelpos_x,pos_y,adj: position of text annotation on the plotconf.level: confidence leveldigits: number of decimal placesplot: include plot or notReturned Value: A data frame with numeric summaries
# summarize a distribution (graphically and numerically)
dist_summary <- function(
x, var_expression,
pos_x = 0, pos_y = 1, adj = c(0, 1),
conf.level = 0.95, digits = 4, plot = T, ...){
# significance level
alpha <- (1 - conf.level)/2
# mean, median, and intervals
mean <- mean(x)
median <- median(x)
sd <- sd(x)
CI <- quantile(x, probs = c(alpha, 1 - alpha)) %>% `names<-`(vec_fmt_percent(c(alpha, 1 - alpha), decimals = 1))
CI_hpd <- TeachingDemos::emp.hpd(x) %>% `names<-`(vec_fmt_percent(c(alpha, 1 - alpha), decimals = 1))
if(plot){
# plot distributions with annotations of mean, median and CIs
hist_obj <- hist(
x,
probability = T,
col = NULL,
yaxt = "n",
xlab = var_expression,
main = "Histogram")
# add lines
abline(v = mean, col = "red", lwd = 2)
abline(v = median, col = "blue", lwd = 2)
abline(v = CI)
abline(v = CI_hpd, lty = 2, col = "green")
# annotation
text(
quantile(hist_obj$breaks, probs = pos_x),
quantile(hist_obj$density, probs = pos_y),
glue(
"Mean = {signif(mean, digits)}\n",
"Median = {signif(median, digits)}\n",
"Std = {signif(sd, digits)}\n",
"{conf.level*100}% credible interval:\n",
" Equal-tail = ({toString(signif(CI, digits))})\n",
" HPD = ({toString(signif(CI_hpd, digits))})\n"
),
adj = adj, ...)
}
# return summaries
return(tibble(
mean = mean,
median = median,
sd = sd,
CI = list(CI),
CI_hpd = list(CI_hpd)
))
}Data
Load the data and have a quick look at it to check its summary, data type and completeness:
| Name | diabetes |
| Number of rows | 2000 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 0 | 1 | 3.70 | 3.31 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 | ▇▃▂▁▁ |
| Glucose | 0 | 1 | 121.18 | 32.07 | 0.00 | 99.00 | 117.00 | 141.00 | 199.00 | ▁▁▇▆▂ |
| BloodPressure | 0 | 1 | 69.15 | 19.19 | 0.00 | 63.50 | 72.00 | 80.00 | 122.00 | ▁▁▇▇▁ |
| SkinThickness | 0 | 1 | 20.93 | 16.10 | 0.00 | 0.00 | 23.00 | 32.00 | 110.00 | ▇▇▁▁▁ |
| Insulin | 0 | 1 | 80.25 | 111.18 | 0.00 | 0.00 | 40.00 | 130.00 | 744.00 | ▇▂▁▁▁ |
| BMI | 0 | 1 | 32.19 | 8.15 | 0.00 | 27.37 | 32.30 | 36.80 | 80.60 | ▁▇▇▁▁ |
| DiabetesPedigreeFunction | 0 | 1 | 0.47 | 0.32 | 0.08 | 0.24 | 0.38 | 0.62 | 2.42 | ▇▃▁▁▁ |
| Age | 0 | 1 | 33.09 | 11.79 | 21.00 | 24.00 | 29.00 | 40.00 | 81.00 | ▇▃▁▁▁ |
| Outcome | 0 | 1 | 0.34 | 0.47 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
There is no missing values in any column. The response variable
Outcome is numeric and has two possible values.
##
## 0 1
## 1316 684
MCMC MH Algorithm and Implementation
In this work, we are interested to model the response
Outcome by Glucose with a logistic linear
regression model:
\[Y=\log\frac{P(success)}{1-P(success)}=\beta_0+\beta_1 X\]
The likelihood (data) is a product of binomial pdf and the prior distributions for the parameters \(\beta_0\) and \(\beta_1\) are chosen to be two independent normal distributions \(\sim N(0, 15^2)\):
\[ \begin{aligned} L &= f_{\underset{\sim}{\beta_0}, \underset{\sim}{\beta_1}|\mathbf{Y}}(\beta_0,\beta_1|\mathbf{y}) \\ &\propto f_{\mathbf{Y}|\underset{\sim}{\beta_0}, \underset{\sim}{\beta_1}}(\mathbf{y}|\beta_0,\beta_1) \cdot f_\underset{\sim}{\beta_0}(\beta_0) f_\underset{\sim}{\beta_1}(\beta_1) \\ &\propto \prod_{i=1}^n \left[ p(x_i)^{y_i} \cdot \left(1-p(x_i) \right)^{1-y_i} \right] \times \left[ \sigma_0^2 \cdot e^{ (\beta_0-\beta_{00})^2/\sigma_0^2 } \right]^{-1/2} \times \left[ \sigma_1^2 \cdot e^{ (\beta_1-\beta_{10})^2/\sigma_1^2 } \right]^{-1/2} \\ l &= \ln L \\ &= \sum_{i=1}^n \left[ y_i\ln p(x_i) + (1-y_i)\ln \left(1 - p(x_i)\right) \right] -\frac{1}{2} \left(\ln \sigma_0^2 + \frac{(\beta_0-\beta_{00})^2}{\sigma_0^2} \right) -\frac{1}{2} \left(\ln \sigma_1^2 + \frac{(\beta_1-\beta_{10})^2}{\sigma_1^2} \right) + \text{constant} \end{aligned} \]
Based on the equations above, we therefore define the following helper functions to calculate the terms needed in MCMC MH algorithm:
expit(): calculate the expit \(p(x_i|\beta_0,\beta_1)\)logprior(): log prior for \(\beta_0 \sim N(0, 15^2)\) or \(\beta_1 \sim N(0, 15^2)\)logpost(): log posterior (the last equation showed above)
# calculate p
expit <- function(X, beta0, beta1) {
return( 1/(1 + exp(-beta0 - beta1*X)) )
}
# log prior for beta
logprior <- function(beta, mean = 0, sd = 15) {
return( -0.5*(log(sd^2) + (beta - mean)^2/sd^2) )
}
# log posterior
logpost <- function(Y, X, beta0, beta1){
P <- expit(X, beta0, beta1)
return( sum(Y*log(P) + (1-Y)*log(1-P)) + logprior(beta0) + logprior(beta1) )
}Next we define a sampler function to carry out Markov Chain Monte
Carlo (MCMC) Metropolis-Hastings algorithm in a
univariate manner (search one parameter at a time). To
compare the performance, we also keep track of the process time
proc_time and if a proposed candidate is accepted
accept. In each iteration:
- if both candidates are accepted,
accept = 1(100% acceptance) - if only one is accepted,
accept = 0.5(50% acceptance) - if none is accepted,
accept = 0
Arguments:
Y,X: response and prediction variablesN: number of MCMC iterationbeta0_initandbeta0_init: initial values of the parameterspos_x,pos_y,adj: position of text annotation on the plotproposed_sd0andproposed_sd1: standard deviation of jumping distribution for parameter candidatesseed: seed for random number generatorReturned Value: A data frame with all sampled values of the parameters, process time and acceptance indicator
# MCMC MH sampler - univariate manner
mcmc_mh <- function(
Y, X, N = 1000 * 10,
beta0_init = 0, proposed_sd0 = 0.01,
beta1_init = 0, proposed_sd1 = 0.01,
seed = 2022){
# make the simulation reproducible
if(!is.null(seed)) set.seed(seed)
# start timer
proc_timer <- proc.time()
# set up a data frame to store beta0 and beta1
df <- tibble(
beta0 = beta0_init,
beta1 = beta1_init,
proc_time = (proc.time() - proc_timer)["elapsed"],
accept = 0
)
# current step - initialize
beta0 <- beta0_init
beta1 <- beta1_init
# log posterior for the current step
l <- logpost(Y, X, beta0, beta1)
# iteration
for (i in 2:N) {
# reset accept score
accept <- 0
# beta0: pick a candidate
beta0_new <- rnorm(1, beta0, proposed_sd0)
# new log posterior
l_new <- logpost(Y, X, beta0_new, beta1)
# decide whether to accept or reject
lnR <- l_new - l
lnU <- log(runif(1))
if(lnR > lnU){
beta0 <- beta0_new
l <- l_new
accept <- accept + 0.5
}
# beta1: pick a candidate
beta1_new <- rnorm(1, beta1, proposed_sd1)
# new log posterior
l_new <- logpost(Y, X, beta0, beta1_new)
lnR <- l_new - l
lnU <- log(runif(1))
if(lnR > lnU){
beta1 <- beta1_new
l <- l_new
accept <- accept + 0.5
}
# save betas and elapsed time
df <- df %>% add_row(
beta0 = beta0,
beta1 = beta1,
proc_time = (proc.time() - proc_timer)["elapsed"],
accept = accept
)
}
print(glue(
"Last values: (beta0, beta1) = ({beta0}, {beta1})\n",
"Process time [s]: {tail(df$proc_time, 1)}\n",
"Acceptance rate: {mean(df$accept)}"
))
# return all betas
return(df)
}The following helper function is to summarized sampled values returned by MCMC sampler function. It creates several plots to show how algorithm is executed (steps, process time and acceptance rate) as well as a summary report on obtained beta values (posterior mean, median and credible intervals):
Arguments:
df: a data frame with all sampled valuesburn_in: number of burn_in iteration to removeReturned Value: A data frame with posterior mean, median and credible intervals
mcmc_summary <- function(df, burn_in, ...){
# add cumulative mean of acceptance (acceptance rate)
df <- mutate(df, cum_accept_rate = cummean(accept))
N <- nrow(df)
proc_time <- tail(df$proc_time, 1)
acceptance_rate <- tail(df$cum_accept_rate, 1)
# plot algorithm performance
par(mfrow = c(1,3))
# plot 1: trace
plot(df$beta0, df$beta1, type = "l", xlab = expression(beta[0]), ylab = expression(beta[1]))
text(df$beta0[1], df$beta1[1], "Start", col = "blue")
text(tail(df$beta0, 1), tail(df$beta1, 1), "End", col = "red")
# plot 2: process time
plot(df$proc_time, type = "l", xlab = "Iteration", ylab = "Process Time [s]")
text(N/2, proc_time, round(proc_time, 2) %&% " [s]", col = "red")
# plot 3: acceptance rate
plot(df$cum_accept_rate, type = "l", xlab = "Iteration", ylab = "Acceptance Rate")
text(N*0.9, acceptance_rate, vec_fmt_percent(acceptance_rate, 2), col = "red")
# remove burn-in rows
df2 <- tail(df, - burn_in)
# plot distributions of betas with annotations of mean, median and CIs
par(mfrow = c(1,2))
beta0 <- dist_summary(df2$beta0, var_expression = expression(beta[0]), ...)
beta1 <- dist_summary(df2$beta1, var_expression = expression(beta[1]), ...)
# reset par options
par(mfrow = c(1,1))
# combine betas and report intervals, process time and acceptance rate
bind_rows(list(beta0 = beta0, beta1 = beta1), .id = "term") %>%
# unpack list columns for intervals
unnest_wider(c(CI, CI_hpd), names_sep = "_") %>%
mutate(
proc_time = proc_time,
acceptance_rate = acceptance_rate
)
}Simulation and Results
We use the sampler function mcmc_mh() to perform MCMC MH
with \((0,0)\) as starting values and
(0.01, 0.001) as proposed standard deviation of the jumping distribution
for \((\beta_0, \beta_1)\)
respectively. After plotting the algorithm steps, process time and
accumulative acceptance rate, we call mcmc_summary() to
remove the burn-in rows and plot histograms and summarize the posterior
mean, median and credible intervals:
# MCMC MH - univariate manner
df_univar <- mcmc_mh(
Y = diabetes$Outcome, X = diabetes$Glucose, N = N,
beta0_init = 0, proposed_sd0 = 0.01,
beta1_init = 0, proposed_sd1 = 0.001) %>%
mcmc_summary(burn_in)## Last values: (beta0, beta1) = (-5.52522334113172, 0.0390173084051903)
## Process time [s]: 169.22
## Acceptance rate: 0.68779
MCMC takes 169.22 seconds to complete 10^{5} iteration in univariate manner. On average, the acceptance rate of candidates drawn from the jumping distributions is 0.68779.
GLM - MLE fit
To get an idea about how good our obtained results, we use
glm() to fit the logistic regression model (maximum
likelihood method):
# fit a GLM logistic regression model
proc_timer <- proc.time()
fit <- glm(Outcome ~ Glucose, family = binomial(link = 'logit'), data = diabetes)
proc_time <- proc.time() - proc_timer
# summary
summary(fit)##
## Call:
## glm(formula = Outcome ~ Glucose, family = binomial(link = "logit"),
## data = diabetes)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.254362 0.257281 -20.42 <2e-16 ***
## Glucose 0.036696 0.001973 18.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2569.4 on 1999 degrees of freedom
## Residual deviance: 2108.1 on 1998 degrees of freedom
## AIC: 2112.1
##
## Number of Fisher Scoring iterations: 4
# estimate, SE and 95% CI
df_glm <- tidy(fit, conf.int = T) %>%
mutate(proc_time = proc_time["elapsed"])
df_glmThe estimates are close to what we obtained using MCMC-MH. Both methods have similar standard deviations and CIs on the parameters. The major difference is the computation time: MCMC takes much longer to execute.
Additional Study - MCMC MH Bivariate Manner
With the assumption of the independence of the two parameters \(\beta_0\) and \(\beta_1\), we can perform the sampling process in bivariate manner:
# MCMC MH sampler - bivariate manner
mcmc_mh2 <- function(
Y, X, N = 1000 * 10,
beta0_init = 0, proposed_sd0 = 0.01,
beta1_init = 0, proposed_sd1 = 0.01,
seed = 2022){
# make the simulation reproducible
if(!is.null(seed)) set.seed(seed)
# start timer
proc_timer <- proc.time()
# set up a data frame to store beta0 and beta1
df <- tibble(
beta0 = beta0_init,
beta1 = beta1_init,
proc_time = (proc.time() - proc_timer)["elapsed"],
accept = 0
)
# current step - initialize
beta0 <- beta0_init
beta1 <- beta1_init
# log posterior for the current step
l <- logpost(Y, X, beta0, beta1)
# iteration
for (i in 2:N) {
# reset accept score
accept <- 0
# pick candidates
beta0_new <- rnorm(1, beta0, proposed_sd0)
beta1_new <- rnorm(1, beta1, proposed_sd1)
# candidate's log posterior
l_new <- logpost(Y, X, beta0_new, beta1_new)
# decide whether to accept or reject
lnR <- l_new - l
lnU <- log(runif(1))
if(lnR > lnU){
beta0 <- beta0_new
beta1 <- beta1_new
l <- l_new
accept <- 1
}
# save betas and elapsed time
df <- df %>% add_row(
beta0 = beta0,
beta1 = beta1,
proc_time = (proc.time() - proc_timer)["elapsed"],
accept = accept
)
}
print(glue(
"Last values: (beta0, beta1) = ({beta0}, {beta1})\n",
"Process time [s]: {tail(df$proc_time, 1)}\n",
"Acceptance rate: {mean(df$accept)}"
))
# return all betas
return(df)
}For a fair comparison, we use the same setup to execute the sampler
function mcmc_mh2():
# MCMC MH - bivariate manner
df_bivar <- mcmc_mh2(
Y = diabetes$Outcome, X = diabetes$Glucose, N = N,
beta0_init = 0, proposed_sd0 = 0.01,
beta1_init = 0, proposed_sd1 = 0.001) %>%
mcmc_summary(burn_in)## Last values: (beta0, beta1) = (-4.95027052778182, 0.0346536657845713)
## Process time [s]: 146.84
## Acceptance rate: 0.43319
MCMC takes 146.84 seconds to complete 10^{5} iteration in bivariate manner. On average, the acceptance rate of candidates drawn from the jumping distributions is 0.43319, which is lower than the univariate counterpart. This is likely due to the fact that we propose two new candidates for both parameters simultaneously and this has higher chance to enter rejection zone.
# A naive way to compare efficiency of process time usage
glue(
"Process Time Wasted [s] := Process Time * (1 - Acceptance Rate)\n",
"MCMC-MH univariate: {round(df_univar$proc_time[1] * (1 - df_univar$acceptance_rate[1]), 2)} [s]\n",
"MCMC-MH bivariate: {round(df_bivar$proc_time[1] * (1 - df_bivar$acceptance_rate[1]), 2)} [s]"
)## Process Time Wasted [s] := Process Time * (1 - Acceptance Rate)
## MCMC-MH univariate: 52.83 [s]
## MCMC-MH bivariate: 83.23 [s]
Beside this downside, the results are slightly better in term of smaller standard deviation and narrower CI.
Comparison and Conclusion
To compare all three methods:
# combine all results for comparison
bind_rows(
# from glm
df_glm %>%
rename(
"sd" = "std.error",
`CI_2.5%` = conf.low,
`CI_97.5%` = conf.high) %>%
mutate(
term = if_else(term == "(Intercept)", "beta0", "beta1"),
method = "MLE"
),
# from MCMC: univariate and bivariate
df_univar %>% mutate(method = "MCMC_MH_univar"),
df_bivar %>% mutate(method = "MCMC_MH_bivar")
) %>%
# calculate width of intervals
mutate(CI_width = `CI_97.5%` - `CI_2.5%`) %>%
# select columns of interest
select(term, method, estimate, median, sd, `CI_2.5%`, `CI_97.5%`, CI_width, proc_time, acceptance_rate) %>%
# make it easy to read
mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
arrange(term)All three methods have similar values for parameters. Although MCMC-MH bivariate method is faster and slightly better (in term of sd and CI) than MCMC-MH univariate method, it is still much slower than MLE fit (maximum likelihood method).