library(rstanarm)
library(see)
library(bayestestR)
library(performance)
library(scatterplot3d)
library(brms)
library(tidyverse)
Bayes Theorem
\(P(H \mid D) = \frac{P(D \mid H)P(H)}{P(D)}\)
Bayesian Logistic Regression
\(P(H \mid D) = \frac{1}{1 + e^{-(\beta_D + \beta_0)}}\)
Logistic Regression
#fit regression model
#model_reg <- lm(Label ~ ., data = my_data)
model_reg <- glm(Label ~.,family=binomial(link='logit'),data=my_data)
summary(model_reg)
Call:
glm(formula = Label ~ ., family = binomial(link = "logit"), data = my_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1774 -0.5357 -0.5357 -0.5357 2.0061
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.8687 0.1519 -12.299 <2e-16 ***
chr17_150259_C_A 1.8687 1.4224 1.314 0.189
chr17_150380_C_T -13.6973 1455.3975 -0.009 0.992
chr17_151226_C_T -13.6973 1455.3975 -0.009 0.992
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 299.10 on 377 degrees of freedom
Residual deviance: 296.99 on 374 degrees of freedom
AIC: 304.99
Number of Fisher Scoring iterations: 14
Bayesian Model 1
#specify priors
my_prior <- normal(location = c(0,0,0), scale = c(0.1, 0.1,0.1))
#run bayesian regression model
model_1 <- stan_glm(Label~.,
data=my_data,
prior = my_prior,
family = binomial,
prior_intercept = normal(-1.4, 0.7),
chains = 4, iter = 5000*2, seed = 84735,
prior_PD = FALSE)
summary(model_1)
Model Info:
function: stan_glm
family: binomial [logit]
formula: Label ~ .
algorithm: sampling
sample: 20000 (posterior sample size)
priors: see help('prior_summary')
observations: 378
predictors: 4
Estimates:
mean sd 10% 50% 90%
(Intercept) -1.8 0.1 -2.0 -1.8 -1.7
chr17_150259_C_A 0.0 0.1 -0.1 0.0 0.1
chr17_150380_C_T 0.0 0.1 -0.1 0.0 0.1
chr17_151226_C_T 0.0 0.1 -0.1 0.0 0.1
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 0.1 0.0 0.1 0.1 0.2
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 25127
chr17_150259_C_A 0.0 1.0 26864
chr17_150380_C_T 0.0 1.0 26411
chr17_151226_C_T 0.0 1.0 25985
mean_PPD 0.0 1.0 22099
log-posterior 0.0 1.0 9574
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
performance(model_1)
mcmc_dens_overlay(model_1)
mcmc_acf(model_1)
#MCMC Trace
x <- as.array(model_1, pars = c("(Intercept)", "chr17_150259_C_A", "chr17_150380_C_T", "chr17_151226_C_T"))
bayesplot::mcmc_trace(x, facet_args = list(nrow = 2))
posterior_vs_prior(model_1)
pplot<-plot(model_1, "areas", prob = 0.95, prob_outer = 1)
pplot +
geom_vline(xintercept = 0) +
theme_bw() +
theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))
#confidence ellipse
bayesplot::color_scheme_set("green")
plot(model_2, "scatter", pars = c("chr17_150380_C_T", "chr17_151226_C_T"),
size = 3, alpha = 0.5) +
ggplot2::stat_ellipse(level = 0.9)
set.seed(1234)
classification_summary(model =model_1, data = my_data, cutoff = 0.5)
$confusion_matrix
y 0 1
0 327 0
1 51 0
$accuracy_rates
NA
mcmc_acf(model_1)
round(posterior_interval(model_1, prob = 0.9), 2)
5% 95%
(Intercept) -2.09 -1.61
chr17_150259_C_A -0.16 0.17
chr17_150380_C_T -0.17 0.17
chr17_151226_C_T -0.17 0.17
Bayesian Model 2
gw
[1] 0.04147642 0.32856259 1.51119123
#specify priors
my_prior <- lasso(location = gw, scale = c(1,1,1))
Error in lasso(location = gw, scale = c(1, 1, 1)) :
unused argument (location = gw)
summary(model_2)
Model Info:
function: stan_glm
family: binomial [logit]
formula: Label ~ .
algorithm: sampling
sample: 20000 (posterior sample size)
priors: see help('prior_summary')
observations: 378
predictors: 4
Estimates:
mean sd 10% 50% 90%
(Intercept) -1.9 0.1 -2.0 -1.8 -1.7
chr17_150259_C_A 0.0 0.1 -0.1 0.0 0.2
chr17_150380_C_T 0.3 0.1 0.2 0.3 0.5
chr17_151226_C_T 1.5 0.1 1.4 1.5 1.6
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 0.1 0.0 0.1 0.1 0.2
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 27356
chr17_150259_C_A 0.0 1.0 24423
chr17_150380_C_T 0.0 1.0 24838
chr17_151226_C_T 0.0 1.0 26738
mean_PPD 0.0 1.0 22615
log-posterior 0.0 1.0 8883
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
performance(model_2)
[34m# Indices of model performance[39m
ELPD | ELPD_SE | LOOIC | LOOIC_SE | WAIC | R2 | RMSE | Sigma | Log_loss | Score_log | Score_spherical
------------------------------------------------------------------------------------------------------------------
-150.918 | 12.361 | 301.836 | 24.722 | 301.836 | 0.002 | 0.342 | 1.000 | 0.397 | -7.238 | 0.032
mcmc_dens_overlay(model_2)
mcmc_acf(model_2)
#MCMC Trace
x <- as.array(model_2, pars = c("(Intercept)", "chr17_150259_C_A", "chr17_150380_C_T", "chr17_151226_C_T"))
bayesplot::mcmc_trace(x, facet_args = list(nrow = 2))
mcmc_acf(model_2)
gw
[1] 0.04147642 0.32856259 1.51119123
pplot<-plot(model_2, "areas", prob = 0.95, prob_outer = 1)
pplot +
geom_vline(xintercept = 0) +
theme_bw() +
theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))
pp_check(model_2)
#confidence ellipse
bayesplot::color_scheme_set("green")
plot(model_2, "scatter", pars = c("chr17_150380_C_T", "chr17_151226_C_T"),
size = 3, alpha = 0.5) +
ggplot2::stat_ellipse(level = 0.9)
set.seed(84735)
label_pred_1 <- posterior_predict(model_2, newdata = my_data)
dim(label_pred_1)
[1] 20000 378
label_classifications <- my_data %>%
mutate(label_prob = colMeans(label_pred_1),
label_class_1 = as.numeric(label_prob >= 0.5)) %>%
select(chr17_150259_C_A, chr17_150380_C_T, chr17_151226_C_T, Label,label_class_1,label_prob)
head(label_classifications, 3)
# Confusion matrix
label_classifications %>%
tabyl(Label, label_class_1) %>%
adorn_totals(c("row", "col"))
Label 0 Total
0 327 327
1 51 51
Total 378 378
set.seed(1234)
classification_summary(model =model_2, data = my_data, cutoff = 0.5)
$confusion_matrix
y 0 1
0 327 0
1 51 0
$accuracy_rates
NA
round(posterior_interval(model_1, prob = 0.9), 2)
5% 95%
(Intercept) -2.09 -1.61
chr17_150259_C_A -0.16 0.17
chr17_150380_C_T -0.17 0.17
chr17_151226_C_T -0.17 0.17
Bayesian Model 3 Model bounded within upper and lower prior
summary(bound_priors_example_fit)
Family: bernoulli
Links: mu = logit
Formula: Label ~ chr17_150259_C_A + chr17_150380_C_T + chr17_151226_C_T
Data: my_data (Number of observations: 378)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -1.88 0.16 -2.18 -1.58 1.00 4262 3140
chr17_150259_C_A 1.60 1.56 -1.59 4.62 1.00 3864 2676
chr17_150380_C_T -1.78 2.88 -8.08 3.03 1.00 3601 2900
chr17_151226_C_T -1.84 2.93 -8.32 3.07 1.00 4327 2673
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).