In this article, we are going to use the R package rjags to implement Bayesian Logistic Regression and understand how diabetes depends on a given set of diagnostic measurements. This appeared as a data analysis project in the Coursera course Bayesian Statistics: Techniques and Models, by the University of California, Santa Cruz.
More specifically, our objectives will be:
* Predict whether a patient is diabetic from her diagnostic measurements, using R and rjags library functions for the implementation.
* Under a Bayesian setting, impose priors on the coefficients corresponding to the predictors (diagnostic measurements) and use MCMC sampling to compute the posterior distributions, verifying with diagnostic tests that the Markov chains converge to the target stationary distribution.
* Fit a few different models and use model checking / selection (with DIC) to find the best model.
* Use the posterior predictive distribution to answer a few questions (test hypotheses) related to prediction of the outcomes.
* Construct Bayesian credible intervals to quantify the uncertainty in the coefficient estimates and in our predictions.
As we shall see, a few of the predictor variables turn out to be statistically significant.
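For reference, all of the Bayesian models fitted below share the same Bernoulli likelihood (as encoded in the JAGS model strings later in the article); they differ only in which predictors are included and in the priors placed on the coefficients:

Outcome_i ~ Bernoulli(p_i),   logit(p_i) = int + b_1*x_i1 + ... + b_k*x_ik

with a normal prior (mean 0, variance 25) on the intercept, and either double exponential (Laplace, variance 1) or normal (mean 0, variance 25) priors on the coefficients b_j.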
We shall use the Diabetes dataset from Kaggle: https://www.kaggle.com/datasets/mathchi/diabetes-data-set. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Several constraints were placed on the selection of these instances from a larger database; in particular, all patients here are females of Pima Indian heritage who are at least 21 years old. Each row contains the measurements for one patient (there are 768 rows in the dataset): 8 predictor variables (shown in the output below), which can be used to predict the single response variable named Outcome (1 if the patient is diabetic, 0 otherwise).
The following table shows the first few rows of the dataset.
# navigate to the right path
dat = read.csv(file="diabetes.csv", header=TRUE)
dim(dat)
## [1] 768 9
head(dat)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
dat = na.omit(dat)
print('percent zero values for each predictor variable columns')
## [1] "percent zero values for each predictor variable columns"
100*colSums(dat == 0) / nrow(dat) # percentage zeros
## Pregnancies Glucose BloodPressure
## 14.4531250 0.6510417 4.5572917
## SkinThickness Insulin BMI
## 29.5572917 48.6979167 1.4322917
## DiabetesPedigreeFunction Age Outcome
## 0.0000000 0.0000000 65.1041667
vars <- colnames(dat)
col.means <- sapply(dat, function(x) mean(x[x!=0]))
for (var in vars[-length(vars)]) {
dat[dat[,var] == 0, var] <- col.means[var]
}
print('percent zero values for each predictor variable columns after mean imputation')
## [1] "percent zero values for each predictor variable columns after mean imputation"
100*colSums(dat == 0) / nrow(dat) # percentage zeros after mean imputation
## Pregnancies Glucose BloodPressure
## 0.00000 0.00000 0.00000
## SkinThickness Insulin BMI
## 0.00000 0.00000 0.00000
## DiabetesPedigreeFunction Age Outcome
## 0.00000 0.00000 65.10417
summary(dat)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 1.000 Min. : 44.00 Min. : 24.00 Min. : 7.00
## 1st Qu.: 2.000 1st Qu.: 99.75 1st Qu.: 64.00 1st Qu.:25.00
## Median : 4.495 Median :117.00 Median : 72.20 Median :29.15
## Mean : 4.495 Mean :121.69 Mean : 72.41 Mean :29.15
## 3rd Qu.: 6.000 3rd Qu.:140.25 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.00 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.0 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.:121.5 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :155.5 Median :32.40 Median :0.3725 Median :29.00
## Mean :155.5 Mean :32.46 Mean :0.4719 Mean :33.24
## 3rd Qu.:155.5 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
library(GGally)
ggpairs(dat,
columns = 1:9,
aes(color = as.factor(Outcome),
alpha = 0.5))
library("corrplot")
Cor = cor(dat)
corrplot(Cor, type="upper", method="ellipse", tl.pos="d")
corrplot(Cor, type="lower", method="number", col="black",
add=TRUE, diag=FALSE, tl.pos="n", cl.pos="n")
par(mfrow = c(3, 3))
for (var in vars[-length(vars)]) {
boxplot(dat[,var]~dat$Outcome, ylab=var, xlab='outcome (diabetic?)')
}
par(mfrow = c(1, 1))
par(mfrow = c(3, 3))
for (var in vars) {
hist(dat[,var], xlab=var, main='', probability=TRUE)
}
par(mfrow = c(1, 1))
We shall first fit a glm (generalized linear model) in R with family=binomial, with Outcome as the response and the rest of the variables as predictors. We then plot the (deviance) residuals as a diagnostic to see if there is any pattern and whether the model assumptions are met, note the values of the estimated coefficients and which of them are statistically significant, and finally predict with the fitted model and compute the prediction accuracy along with the confusion matrix.
Then, we shall use rjags to fit Bayesian models. Let’s first z-score normalize (scale) the predictors and verify that each of them has mean zero and unit standard deviation.
X = scale(dat[,-length(vars)], center=TRUE, scale=TRUE)
print('means (rounded to 3 decimals)')
## [1] "means (rounded to 3 decimals)"
round(colMeans(X), 3)
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age
## 0 0
print('standard deviations')
## [1] "standard deviations"
apply(X, 2, sd)
## Pregnancies Glucose BloodPressure
## 1 1 1
## SkinThickness Insulin BMI
## 1 1 1
## DiabetesPedigreeFunction Age
## 1 1
X <- as.data.frame(X)
X$Outcome <- dat$Outcome
* We shall run 3 independent Markov chains, discard the first 1000 samples as burn-in, simulate for another 5000 iterations, and use trace plots (along with autocorrelation) as a convergence diagnostic.
* To further examine convergence of the Markov chains to the stationary (target posterior) distribution, we shall use the Gelman-Rubin diagnostic to verify that the chains do not differ much / end up exploring the same region of the state space, and also check the effective sample size.
* We shall use the deviance information criterion (DIC) for model selection (the lower the value, the better the model fit) and to estimate the effective number of parameters (see the note on DIC right after this list).
* We shall see that the posterior distributions of a few coefficients include zero, so they are likely to be less important as predictors; we shall drop them and refit the model, this time with a Gaussian prior on each of the remaining coefficients.
* We shall then fit a third model using all eight predictors, placing flat normal priors on them, and compare the models using DIC to select the best one.
* We shall compute the posterior mean of the (combined) MCMC samples as the estimate of each coefficient, and also compute 95% Highest Posterior Density (HPD) credible intervals for the coefficients.
* We shall use the posterior estimates of the coefficients to predict the observations and compute the prediction accuracy.
* Finally, we shall use the posterior predictive distribution to predict the probability of diabetes for a couple of new patients, construct 95% Bayesian credible intervals (CI), and compute the probability that one patient has a higher chance of being diabetic than the other.
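For reference, the quantities reported by dic.samples below are related as

DIC = Dbar + pD

where Dbar is the posterior mean of the deviance ("Mean deviance" in the output), pD is the effective number of parameters ("penalty", classically estimated as Dbar minus the deviance evaluated at the posterior mean of the parameters), and their sum is reported as the "Penalized deviance", i.e. the DIC itself.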
As can be seen from the following results,
* Pregnancies, Glucose, BMI and DiabetesPedigreeFunction are statistically significant (p-value < 0.05, so the null hypothesis of a zero coefficient can be rejected at the 5% significance level);
* an increase in Glucose by 1 unit increases the log odds of diabetes by 0.0372466, whereas a decrease in BloodPressure by 1 unit increases the log odds of diabetes by 0.0104619 (although the BloodPressure coefficient is not statistically significant).
library(caret)
m <- glm(Outcome ~ ., data=dat, family='binomial')
summary(m)
##
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6353 -0.7129 -0.3951 0.6932 2.4061
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.0705975 0.8148288 -11.132 < 2e-16 ***
## Pregnancies 0.1406610 0.0359271 3.915 9.03e-05 ***
## Glucose 0.0372466 0.0038779 9.605 < 2e-16 ***
## BloodPressure -0.0104619 0.0086383 -1.211 0.22585
## SkinThickness 0.0038213 0.0131661 0.290 0.77164
## Insulin -0.0007215 0.0011808 -0.611 0.54115
## BMI 0.0890451 0.0177613 5.013 5.35e-07 ***
## DiabetesPedigreeFunction 0.8541331 0.2981939 2.864 0.00418 **
## Age 0.0152995 0.0091727 1.668 0.09533 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 712.94 on 759 degrees of freedom
## AIC: 730.94
##
## Number of Fisher Scoring iterations: 5
plot(residuals(m, type='deviance'))
pred <- predict(m, type='response')
pred <- as.integer(round(pred)) # rounding thresholds the predicted probabilities at 0.5
confusionMatrix(as.factor(dat$Outcome), as.factor(pred))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 443 57
## 1 114 154
##
## Accuracy : 0.7773
## 95% CI : (0.7462, 0.8063)
## No Information Rate : 0.7253
## P-Value [Acc > NIR] : 0.0005657
##
## Kappa : 0.4845
##
## Mcnemar's Test P-Value : 1.849e-05
##
## Sensitivity : 0.7953
## Specificity : 0.7299
## Pos Pred Value : 0.8860
## Neg Pred Value : 0.5746
## Prevalence : 0.7253
## Detection Rate : 0.5768
## Detection Prevalence : 0.6510
## Balanced Accuracy : 0.7626
##
## 'Positive' Class : 0
##
For the model with a double exponential (Laplace) prior on the coefficients, the posterior distributions of the coefficients of SkinThickness, Insulin and Age include zero, so we can likely drop these predictors.
library(rjags)
mod1_string = " model {
for (i in 1:length(Outcome)) {
Outcome[i] ~ dbern(p[i])
logit(p[i]) = int + b[1]*Pregnancies[i] + b[2]*Glucose[i] + b[3]*BloodPressure[i] + b[4]*SkinThickness[i] +
b[5]*Insulin[i] + b[6]*BMI[i] + b[7]*DiabetesPedigreeFunction[i] + b[8]*Age[i]
}
int ~ dnorm(0.0, 1.0/25.0)
for (j in 1:8) {
b[j] ~ ddexp(0.0, sqrt(2.0)) # has variance 1.0
}
} "
set.seed(92)
data_jags = as.list(X)
params = c("int", "b")
mod1 = jags.model(textConnection(mod1_string), data=data_jags, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 768
## Unobserved stochastic nodes: 9
## Total graph size: 9732
##
## Initializing model
update(mod1, 1e3)
mod1_sim = coda.samples(model=mod1,
variable.names=params,
n.iter=5e3)
mod1_csim = as.mcmc(do.call(rbind, mod1_sim))
par(mfrow=c(4,2))
densplot(mod1_csim[,1:8]) #, xlim=c(-3.0, 3.0))
par(mfrow=c(1,1))
## convergence diagnostics
plot(mod1_sim) #, ask=TRUE)
summary(mod1_sim)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 0.40932 0.10764 0.0008789 0.001414
## b[2] 1.12534 0.11816 0.0009648 0.001499
## b[3] -0.10407 0.10139 0.0008278 0.001240
## b[4] 0.04265 0.11196 0.0009142 0.001531
## b[5] -0.04350 0.09667 0.0007893 0.001178
## b[6] 0.59322 0.12087 0.0009869 0.001711
## b[7] 0.27200 0.09743 0.0007955 0.001003
## b[8] 0.16881 0.10799 0.0008817 0.001530
## int -0.86179 0.09705 0.0007924 0.001115
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 0.20083 0.33607 0.40936 0.48127 0.6224
## b[2] 0.90105 1.04423 1.12181 1.20542 1.3587
## b[3] -0.30440 -0.17319 -0.10270 -0.03378 0.0933
## b[4] -0.17506 -0.03094 0.04063 0.11646 0.2672
## b[5] -0.23480 -0.10761 -0.04324 0.02050 0.1492
## b[6] 0.35715 0.51151 0.59276 0.67313 0.8329
## b[7] 0.08135 0.20496 0.27093 0.33953 0.4640
## b[8] -0.03807 0.09479 0.16720 0.24145 0.3835
## int -1.05126 -0.92682 -0.86122 -0.79672 -0.6711
gelman.diag(mod1_sim)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1 1
## b[2] 1 1
## b[3] 1 1
## b[4] 1 1
## b[5] 1 1
## b[6] 1 1
## b[7] 1 1
## b[8] 1 1
## int 1 1
##
## Multivariate psrf
##
## 1
## calculate DIC
dic1 = dic.samples(mod1, n.iter=1e3)
dic1
## Mean deviance: 721.5
## penalty 8.415
## Penalized deviance: 730
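The plan above also calls for checking autocorrelation and the effective sample size; the same coda functions used for the later models apply here as well (a minimal sketch, output omitted):
autocorr.diag(mod1_sim) # lag-k autocorrelations of the chains for each coefficient
effectiveSize(mod1_sim) # effective number of (approximately) independent samples per parameter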
Dropping SkinThickness and Insulin and refitting the model with a flat normal prior (zero mean, s.d. 5) on each of the remaining predictor coefficients results in a DIC of about 727.5, a slight improvement over the previous model.
mod2_string = " model {
for (i in 1:length(Outcome)) {
Outcome[i] ~ dbern(p[i])
logit(p[i]) = int + b[1]*Pregnancies[i] + b[2]*Glucose[i] + b[3]*BloodPressure[i] + b[4]*BMI[i]
+ b[5]*DiabetesPedigreeFunction[i] + b[6]*Age[i]
}
int ~ dnorm(0.0, 1.0/25.0)
for (j in 1:6) {
b[j] ~ dnorm(0.0, 1/25.0)
}
} "
set.seed(92)
mod2 = jags.model(textConnection(mod2_string), data=data_jags, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 768
## Unobserved stochastic nodes: 7
## Total graph size: 7958
##
## Initializing model
update(mod2, 1e3)
mod2_sim = coda.samples(model=mod2,
variable.names=params,
n.iter=5e3)
mod2_csim = as.mcmc(do.call(rbind, mod2_sim))
plot(mod2_sim) #, ask=TRUE)
summary(mod2_sim)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 0.4278 0.10759 0.0008784 0.001406
## b[2] 1.1230 0.10999 0.0008981 0.001266
## b[3] -0.1230 0.10672 0.0008714 0.001316
## b[4] 0.6279 0.10842 0.0008852 0.001299
## b[5] 0.2879 0.09940 0.0008116 0.001055
## b[6] 0.1787 0.10790 0.0008810 0.001531
## int -0.8693 0.09816 0.0008015 0.001112
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 0.21964 0.3564 0.4276 0.49927 0.64270
## b[2] 0.91099 1.0486 1.1223 1.19696 1.34420
## b[3] -0.33319 -0.1943 -0.1221 -0.05156 0.08416
## b[4] 0.41921 0.5545 0.6264 0.69907 0.84443
## b[5] 0.09182 0.2211 0.2878 0.35534 0.48449
## b[6] -0.03000 0.1053 0.1798 0.25159 0.39127
## int -1.06491 -0.9351 -0.8678 -0.80272 -0.68156
gelman.diag(mod2_sim)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1 1
## b[2] 1 1
## b[3] 1 1
## b[4] 1 1
## b[5] 1 1
## b[6] 1 1
## int 1 1
##
## Multivariate psrf
##
## 1
autocorr.diag(mod2_sim)
## b[1] b[2] b[3] b[4] b[5]
## Lag 0 1.000000000 1.000000000 1.000000000 1.000000000 1.0000000000
## Lag 1 0.416926247 0.312344429 0.389106968 0.354901791 0.2545199402
## Lag 5 0.032729903 0.005064261 0.029058845 0.006816196 -0.0028027285
## Lag 10 -0.001793951 0.016886666 0.011805779 0.004113391 -0.0004910797
## Lag 50 0.011530066 -0.001100237 -0.003472414 -0.010511972 0.0023777716
## b[6] int
## Lag 0 1.00000000 1.000000000
## Lag 1 0.48050382 0.297651811
## Lag 5 0.04479959 0.012691650
## Lag 10 0.01761801 -0.016319320
## Lag 50 0.00830341 0.001664174
autocorr.plot(mod2_sim)
effectiveSize(mod2_sim)
## b[1] b[2] b[3] b[4] b[5] b[6] int
## 5879.475 7582.354 6600.773 6971.613 8921.548 4985.693 7836.817
dic2 = dic.samples(mod2, n.iter=1e3)
dic2
## Mean deviance: 720.4
## penalty 7.056
## Penalized deviance: 727.5
(pm_coef = colMeans(mod2_csim))
## b[1] b[2] b[3] b[4] b[5] b[6] int
## 0.4278389 1.1230138 -0.1230491 0.6278820 0.2878744 0.1787103 -0.8693098
pm_Xb = pm_coef["int"] + as.matrix(X[,c(1,2,3,6,7,8)]) %*% pm_coef[1:6]
phat = 1.0 / (1.0 + exp(-pm_Xb))
head(phat)
## [,1]
## 1 0.69337781
## 2 0.03441203
## 3 0.78604455
## 4 0.03351797
## 5 0.92900934
## 6 0.12635652
plot(phat, jitter(dat$Outcome), pch=19, col=dat$Outcome+1, xlab='predicted', ylab='actual', main='prediction with logistic regression')
(tab0.5 = table(phat > 0.5, data_jags$Outcome))
##
## 0 1
## FALSE 442 114
## TRUE 58 154
sum(diag(tab0.5)) / sum(tab0.5)
## [1] 0.7760417
mod3_string = " model {
for (i in 1:length(Outcome)) {
Outcome[i] ~ dbern(p[i])
logit(p[i]) = int + b[1]*Pregnancies[i] + b[2]*Glucose[i] + b[3]*BloodPressure[i] + b[4]*SkinThickness[i] +
b[5]*Insulin[i] + b[6]*BMI[i] + b[7]*DiabetesPedigreeFunction[i] + b[8]*Age[i]
}
int ~ dnorm(0.0, 1.0/25.0)
for (j in 1:8) {
b[j] ~ dnorm(0.0, 1/25.0)
}
} "
set.seed(92)
mod3 = jags.model(textConnection(mod3_string), data=as.list(X), n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 768
## Unobserved stochastic nodes: 9
## Total graph size: 9739
##
## Initializing model
update(mod3, 1e3)
mod3_sim = coda.samples(model=mod3,
variable.names=params,
n.iter=5e3)
mod3_csim = as.mcmc(do.call(rbind, mod3_sim))
plot(mod3_sim) #, ask=TRUE)
summary(mod3_sim)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 0.42253 0.10854 0.0008863 0.001442
## b[2] 1.15290 0.11877 0.0009698 0.001498
## b[3] -0.12995 0.10424 0.0008511 0.001306
## b[4] 0.03638 0.11798 0.0009633 0.001650
## b[5] -0.06039 0.10250 0.0008369 0.001252
## b[6] 0.62261 0.12269 0.0010017 0.001881
## b[7] 0.28709 0.09958 0.0008130 0.001027
## b[8] 0.18483 0.10964 0.0008952 0.001519
## int -0.87444 0.09776 0.0007982 0.001112
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 0.21217 0.3465 0.42185 0.496595 0.63678
## b[2] 0.93169 1.0702 1.15083 1.232786 1.38882
## b[3] -0.33652 -0.1991 -0.12949 -0.058179 0.06954
## b[4] -0.19160 -0.0436 0.03481 0.114996 0.27153
## b[5] -0.25802 -0.1297 -0.06136 0.007583 0.14181
## b[6] 0.38653 0.5385 0.62165 0.705236 0.86618
## b[7] 0.09138 0.2195 0.28613 0.354221 0.47996
## b[8] -0.02720 0.1112 0.18464 0.256533 0.40388
## int -1.07294 -0.9392 -0.87203 -0.808829 -0.68616
gelman.diag(mod3_sim)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1 1.00
## b[2] 1 1.00
## b[3] 1 1.00
## b[4] 1 1.00
## b[5] 1 1.01
## b[6] 1 1.00
## b[7] 1 1.00
## b[8] 1 1.00
## int 1 1.00
##
## Multivariate psrf
##
## 1
autocorr.diag(mod3_sim)
## b[1] b[2] b[3] b[4] b[5]
## Lag 0 1.0000000000 1.000000000 1.000000000 1.000000000 1.000000000
## Lag 1 0.4334543085 0.398960765 0.386561138 0.471580744 0.361767827
## Lag 5 0.0320956149 0.011812204 -0.005117396 0.041795701 0.012959840
## Lag 10 -0.0139371475 -0.011444892 0.003135870 0.015409249 0.003230473
## Lag 50 -0.0005609705 -0.003421055 0.005630901 0.002842327 0.008084170
## b[6] b[7] b[8] int
## Lag 0 1.000000000 1.0000000000 1.00000000 1.000000000
## Lag 1 0.506836365 0.2498184910 0.47557863 0.294150138
## Lag 5 0.057902671 -0.0030093917 0.04406218 0.003341419
## Lag 10 0.009600484 0.0033883613 -0.01098449 -0.009434692
## Lag 50 0.009168463 -0.0007639789 -0.00244981 -0.012994380
autocorr.plot(mod3_sim)
effectiveSize(mod3_sim)
## b[1] b[2] b[3] b[4] b[5] b[6] b[7] b[8]
## 5676.627 6296.441 6392.781 5126.782 6729.127 4292.068 9463.179 5278.964
## int
## 7756.402
dic3 = dic.samples(mod3, n.iter=1e3)
dic3
## Mean deviance: 722.1
## penalty 9.079
## Penalized deviance: 731.2
(pm_coef = colMeans(mod3_csim))
## b[1] b[2] b[3] b[4] b[5] b[6]
## 0.42252694 1.15289695 -0.12995391 0.03638308 -0.06038751 0.62261242
## b[7] b[8] int
## 0.28709429 0.18483462 -0.87443948
pm_Xb = pm_coef["int"] + as.matrix(X[,1:8]) %*% pm_coef[1:8]
phat = 1.0 / (1.0 + exp(-pm_Xb))
head(phat)
## [,1]
## 1 0.70389019
## 2 0.03348453
## 3 0.79588270
## 4 0.03315076
## 5 0.93103951
## 6 0.12544097
plot(phat, jitter(dat$Outcome), pch=19, col=dat$Outcome+1, xlab='predicted', ylab='actual', main='prediction with logistic regression')
(tab0.5 = table(phat > 0.5, data_jags$Outcome))
##
## 0 1
## FALSE 444 114
## TRUE 56 154
sum(diag(tab0.5)) / sum(tab0.5)
## [1] 0.7786458
The next plot shows how the flat prior compares with the sharp, narrow posterior distribution of a predictor's coefficient (with samples drawn using MCMC simulation). It also shows that SkinThickness is not an important predictor, since its posterior density includes zero.
plot.densities <- function(col) {
  # posterior samples (from mod3) of the coefficient for the col-th predictor
  coeff <- paste0('β_', names(X[col]))
  post.df <- data.frame(var=mod3_csim[,col], density=rep('posterior', nrow(mod3_csim)))
  names(post.df)[1] <- coeff
  # samples from the flat normal prior N(0, 5^2) placed on the coefficients
  prior.df <- data.frame(var=rnorm(1000, 0, 5), density=rep('prior', 1000))
  names(prior.df)[1] <- coeff
  df <- rbind(post.df, prior.df)
  # overlay the prior and posterior densities, with a dashed reference line at the
  # mean of the (standardized) predictor, which is ~0
  return (ggplot(df, aes_string(coeff, fill='density')) + geom_density(alpha=0.5) +
            geom_vline(xintercept = mean(X[,col]), lty=2, col='red', alpha=0.5))
}
(p <- plot.densities(4))
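The same helper can be applied to any of the coefficients; for example (a sketch, plot not shown), the prior / posterior comparison for the Glucose coefficient:
(p <- plot.densities(2))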
The Highest Posterior Density intervals for the coefficients are shown below:
For example, the coefficient of Glucose lies in the corresponding interval below with 95% posterior probability.
HPDinterval(mod3_csim)
## lower upper
## b[1] 0.21834012 0.64093736
## b[2] 0.93339740 1.38956930
## b[3] -0.32965513 0.07475399
## b[4] -0.20150163 0.26085259
## b[5] -0.25571422 0.14405698
## b[6] 0.37568675 0.85358616
## b[7] 0.09602151 0.48381158
## b[8] -0.01865905 0.41127996
## int -1.06922526 -0.68279383
## attr(,"Probability")
## [1] 0.95
For example, let’s try to answer the question: what’s the probability that the effect size of DiabetesPedigreeFunction is greater than that of Age? We can answer this by conditioning on the samples obtained from the simulation and taking the mean of the corresponding indicator, as shown below.
mean(mod3_csim[,7] > mod3_csim[,8])
## [1] 0.7545333
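The same approach works for any event defined in terms of the coefficients. For example (a sketch, output omitted), the posterior probability that the standardized effect of Glucose is larger than that of BMI:
mean(mod3_csim[,2] > mod3_csim[,6])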
p1 <- matrix(c(0,300,80,30,100,33,0.3,40), ncol=1)
p1df <- as.data.frame(t(p1))
colnames(p1df) <- names(dat)[-ncol(dat)]
p2 <- matrix(c(0,250,82,29,90,35,0.25,50), ncol=1)
p2df <- as.data.frame(t(p2))
colnames(p2df) <- names(dat)[-ncol(dat)]
pdf <- rbind(p1df, p2df)
rownames(pdf) <- c('patient 1', 'patient 2')
head(pdf)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## patient 1 0 300 80 30 100 33
## patient 2 0 250 82 29 90 35
## DiabetesPedigreeFunction Age
## patient 1 0.30 40
## patient 2 0.25 50
mu <- colMeans(dat[,-ncol(dat)])
sd <- apply(dat[,-ncol(dat)], 2, sd)
p1 <- (p1 - mu) / sd # z-score normalize
p2 <- (p2 - mu) / sd
p1_Xb = mod3_csim[,"int"] + mod3_csim[,1:8] %*% p1
phat1 = 1.0 / (1.0 + exp(-p1_Xb))
head(phat1)
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1
## End = 7
## Thinning interval = 1
## [,1]
## [1,] 0.9901673
## [2,] 0.9979442
## [3,] 0.9985275
## [4,] 0.9920615
## [5,] 0.9856507
## [6,] 0.9967362
## [7,] 0.9963923
p2_Xb = mod3_csim[,"int"] + mod3_csim[,1:8] %*% p2
phat2 = 1.0 / (1.0 + exp(-p2_Xb))
ci_95 <- quantile(phat1, probs = c(0.025, 0.975))
ci_95
## 2.5% 97.5%
## 0.9792472 0.9986913
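Alternatively, a 95% HPD interval for this predicted probability can be computed with the same HPDinterval function used earlier for the coefficients (a sketch, output omitted):
HPDinterval(as.mcmc(as.vector(phat1)))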
post_pred_df <- data.frame(pred=phat1)
ggplot(post_pred_df, aes(x = pred)) +
xlab('probability of being diabetic (Outcome=1)') +
geom_density(color='darkblue', fill = 'lightblue') +
geom_vline(xintercept = ci_95, color = "red") +
theme_bw() +
ggtitle('Posterior Predictive distribution 95% Credible Interval for patient 1')
post_pred_df$patient <- 1
post_pred_df <- rbind(post_pred_df, data.frame(pred=phat2, patient=rep(2, length(phat2))))
post_pred_df$patient <- as.factor(post_pred_df$patient)
ggplot(post_pred_df, aes(pred, fill=patient)) +
geom_density(alpha=0.5) +
xlab('probability of being diabetic (Outcome=1)') +
theme_bw() +
ggtitle('Prediction with MCMC samples from posterior distribution of coefficients')
#mean(phat2 > phat1)
set.seed(92)
y1 <- rbinom(nrow(mod3_csim), 1, 1/(1+exp(-mod3_csim[,"int"] - mod3_csim[,1:8] %*% p1))) # draw samples
y2 <- rbinom(nrow(mod3_csim), 1, 1/(1+exp(-mod3_csim[,"int"] - mod3_csim[,1:8] %*% p2)))
mean(y2 > y1)
## [1] 0.007466667
post_pred_df <- data.frame(pred=c(y1, y2), patient=c(rep('1', length(y1)), rep('2', length(y2))))
ggplot(post_pred_df, aes(pred, fill=patient)) +
geom_density(alpha=0.5) +
xlab('probability of being diabetic (Outcome=1)') +
theme_bw() +
ggtitle('Posterior Predictive distribution for patient 1 and patient 2')
Using Bayesian Logistic Regression, we could not only compute point estimates of the coefficients and use them to predict the outcome; the main strength of the approach is that it quantifies uncertainty, i.e., how confident we are in the estimates and in the predictions (via the posterior predictive distribution). We can also see that, for all predictor variables other than Insulin and BloodPressure, an increase in value increases the log odds of diabetes. The model even allows us to make probabilistic statements comparing two patients' probabilities of being diabetic, given the values of their diagnostic measurements.
The following can be tried to improve the model: