Executive summary

In this article, we shall use the R package rjags to implement Bayesian logistic regression and understand how diabetes depends on a set of diagnostic measurements. This appeared as a data analysis project in the Coursera course Bayesian Statistics: Techniques and Models, by the University of California, Santa Cruz.

More specifically, our objectives will be to:

  1. Use the diagnostic measurements as predictor variables and estimate the coefficients for the predictors in a Bayesian setting.
  2. Predict the response variable, i.e., whether a patient is diabetic or not, given those measurements and the estimated coefficients.
  3. Construct Bayesian Credible Intervals (CI) for the coefficients to quantify uncertainty and make probabilistic claims about the coefficients.
  4. Compute the posterior predictive distribution and use it to construct CIs for the predictions and make probabilistic claims about the predictions for different patients.
  5. Use base R and rjags library functions throughout for the implementation.

Introduction

The goal of this project is to predict whether a patient is diabetic from her medical diagnostics. In the Bayesian setting, we shall impose priors on the coefficients corresponding to the predictors (the diagnostic measurements) and use MCMC sampling to compute their posterior distributions, verifying with diagnostic tests that the Markov chains converge to the target stationary distribution. We shall fit a few different models and use model checking / selection (with DIC) to find the best one. Along the way we shall use the posterior predictive distribution to answer a few questions (test hypotheses) about the predicted outcomes, and construct Bayesian credible intervals to quantify the uncertainty in both the coefficient estimates and the predictions. As we shall see, a few of the predictor variables turn out to be statistically significant.

Data

We shall use the Diabetes dataset from Kaggle: https://www.kaggle.com/datasets/mathchi/diabetes-data-set. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Several constraints were placed on the selection of these instances from a larger database; in particular, all patients here are females of Pima Indian heritage, at least 21 years old. Each of the 768 rows holds the measurements for one patient: 8 predictor variables, listed and described below, and a single response variable named Outcome (1 if the patient is diabetic, 0 otherwise).

  • Pregnancies: number of times pregnant
  • Glucose: plasma glucose concentration (2-hour oral glucose tolerance test)
  • BloodPressure: diastolic blood pressure (mm Hg)
  • SkinThickness: triceps skin fold thickness (mm)
  • Insulin: 2-hour serum insulin (mu U/ml)
  • BMI: body mass index (weight in kg / (height in m)²)
  • DiabetesPedigreeFunction: a score of diabetes likelihood based on family history
  • Age: age in years

The following table shows the first few rows of the dataset.

# set the working directory to the folder containing diabetes.csv before reading it
dat = read.csv(file="diabetes.csv", header=TRUE)
dim(dat)
## [1] 768   9
head(dat)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

Exploratory Analysis & Cleaning

The dataset contains no explicit NAs (na.omit drops nothing here), but several predictor columns contain zeros that are physiologically implausible (e.g., zero Glucose, BloodPressure, SkinThickness, Insulin or BMI) and effectively encode missing values. We first quantify the zeros and then impute them with the mean of the nonzero values in each column.

dat = na.omit(dat)
print('percent zero values in each column')
## [1] "percent zero values in each column"
100*colSums(dat == 0) / nrow(dat) # percentage zeros
##              Pregnancies                  Glucose            BloodPressure 
##               14.4531250                0.6510417                4.5572917 
##            SkinThickness                  Insulin                      BMI 
##               29.5572917               48.6979167                1.4322917 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                0.0000000                0.0000000               65.1041667
vars <- colnames(dat)
col.means <- sapply(dat, function(x) mean(x[x!=0])) # per-column mean over the nonzero entries

for (var in vars[-length(vars)]) { # all columns except the response Outcome
  dat[dat[,var] == 0, var] <- col.means[var] # mean imputation of the zeros
}

print('percent zero values in each column after mean imputation')
## [1] "percent zero values in each column after mean imputation"
100*colSums(dat == 0) / nrow(dat) # zeros in the response Outcome are legitimate and left untouched
##              Pregnancies                  Glucose            BloodPressure 
##                  0.00000                  0.00000                  0.00000 
##            SkinThickness                  Insulin                      BMI 
##                  0.00000                  0.00000                  0.00000 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                  0.00000                  0.00000                 65.10417
summary(dat)
##   Pregnancies        Glucose       BloodPressure    SkinThickness  
##  Min.   : 1.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 2.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:25.00  
##  Median : 4.495   Median :117.00   Median : 72.20   Median :29.15  
##  Mean   : 4.495   Mean   :121.69   Mean   : 72.41   Mean   :29.15  
##  3rd Qu.: 6.000   3rd Qu.:140.25   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.00   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.0   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:121.5   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
##  Median :155.5   Median :32.40   Median :0.3725           Median :29.00  
##  Mean   :155.5   Mean   :32.46   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:155.5   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
library(GGally)
ggpairs(dat,                 
        columns = 1:9,        
        aes(color = as.factor(Outcome),  
            alpha = 0.5))

library("corrplot")
Cor = cor(dat)
corrplot(Cor, type="upper", method="ellipse", tl.pos="d")
corrplot(Cor, type="lower", method="number", col="black", 
         add=TRUE, diag=FALSE, tl.pos="n", cl.pos="n")

par(mfrow = c(3, 3))
for (var in vars[-length(vars)]) {
  boxplot(dat[,var]~dat$Outcome, ylab=var, xlab='outcome (diabetic?)')
}
par(mfrow = c(1, 1))

par(mfrow = c(3, 3))
for (var in vars) {
  hist(dat[,var], xlab=var, main='', probability=TRUE)
}

par(mfrow = c(1, 1))

Model

X = scale(dat[,-length(vars)], center=TRUE, scale=TRUE)
print('means (rounded to 3 decimals)')
## [1] "means (rounded to 3 decimals)"
round(colMeans(X), 3)
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age 
##                        0                        0
print('standard deviations')
## [1] "standard deviations"
apply(X, 2, sd)
##              Pregnancies                  Glucose            BloodPressure 
##                        1                        1                        1 
##            SkinThickness                  Insulin                      BMI 
##                        1                        1                        1 
## DiabetesPedigreeFunction                      Age 
##                        1                        1
X <- as.data.frame(X)
X$Outcome <- dat$Outcome
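
In symbols, every model we fit below has the same likelihood: for patient i,

Outcome_i ~ Bernoulli(p_i),   logit(p_i) = β0 + β1·x_i1 + … + βk·x_ik,

where the x_ij are the standardized predictors. The models differ only in which predictors are included and in the priors placed on the intercept β0 and the coefficients βj.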

  • We shall run 3 independent Markov chains for mixing, discard the first 1000 iterations as burn-in, and then simulate for another 5000 iterations, using trace plots (along with the Gelman-Rubin and autocorrelation diagnostics) to check convergence.

Results

Logistic Regression with GLM

As can be seen from the following results,

  • The predictors Pregnancies, Glucose, BMI and DiabetesPedigreeFunction are statistically significant (p-value < 0.05, so the null hypothesis of a zero coefficient is rejected at the 5% significance level), while BloodPressure, SkinThickness, Insulin and Age are not.
  • The residual plot looks okay and does not show any obvious pattern.
  • Interpretation of the coefficients: e.g., an increase in Glucose by 1 unit increases the log odds of diabetes by 0.0372466, whereas a decrease in BloodPressure by 1 unit increases the log odds of diabetes by 0.0104619, holding the other predictors fixed (see the odds-ratio sketch after the model summary below).
  • The prediction has an in-sample accuracy of 77.73%.
library(caret)
m <- glm(Outcome ~ ., data=dat, family='binomial')
summary(m)
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6353  -0.7129  -0.3951   0.6932   2.4061  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.0705975  0.8148288 -11.132  < 2e-16 ***
## Pregnancies               0.1406610  0.0359271   3.915 9.03e-05 ***
## Glucose                   0.0372466  0.0038779   9.605  < 2e-16 ***
## BloodPressure            -0.0104619  0.0086383  -1.211  0.22585    
## SkinThickness             0.0038213  0.0131661   0.290  0.77164    
## Insulin                  -0.0007215  0.0011808  -0.611  0.54115    
## BMI                       0.0890451  0.0177613   5.013 5.35e-07 ***
## DiabetesPedigreeFunction  0.8541331  0.2981939   2.864  0.00418 ** 
## Age                       0.0152995  0.0091727   1.668  0.09533 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 712.94  on 759  degrees of freedom
## AIC: 730.94
## 
## Number of Fisher Scoring iterations: 5
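
Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios; e.g., exp(0.0372466) ≈ 1.038, so each additional unit of Glucose multiplies the odds of diabetes by about 1.04, holding the other predictors fixed. A quick sketch, using base R's confint.default for approximate (Wald) intervals:

# odds ratios with approximate 95% Wald confidence intervals
round(exp(cbind(OR = coef(m), confint.default(m))), 3)
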
plot(residuals(m, type='deviance'))

pred <- predict(m, type='response')
pred <- as.integer(round(pred)) # threshold the fitted probabilities at 0.5
# note: caret's convention is confusionMatrix(predicted, actual); passing (actual, predicted)
# as below transposes the table and swaps sensitivity/PPV, but leaves accuracy unchanged
confusionMatrix(as.factor(dat$Outcome), as.factor(pred))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 443  57
##          1 114 154
##                                           
##                Accuracy : 0.7773          
##                  95% CI : (0.7462, 0.8063)
##     No Information Rate : 0.7253          
##     P-Value [Acc > NIR] : 0.0005657       
##                                           
##                   Kappa : 0.4845          
##                                           
##  Mcnemar's Test P-Value : 1.849e-05       
##                                           
##             Sensitivity : 0.7953          
##             Specificity : 0.7299          
##          Pos Pred Value : 0.8860          
##          Neg Pred Value : 0.5746          
##              Prevalence : 0.7253          
##          Detection Rate : 0.5768          
##    Detection Prevalence : 0.6510          
##       Balanced Accuracy : 0.7626          
##                                           
##        'Positive' Class : 0               
## 

Bayesian Logistic Regression with rjags

Model 1

For the model with double exponential (Laplace) prior on the coefficients,

  • The trace plots for the MCMC look okay, indicating convergence.
  • The Gelman-Rubin diagnostics look okay too; the values are close to 1, as expected.
  • The 95% posterior intervals for the BloodPressure, SkinThickness, Insulin and Age coefficients all include zero (see the sketch after the DIC computation below); the posteriors for SkinThickness and Insulin are the most concentrated around zero, so we drop those two predictors in the next model.
  • The DIC for the model is about 730.
library(rjags)
mod1_string = " model {
    for (i in 1:length(Outcome)) {
        Outcome[i] ~ dbern(p[i])
        logit(p[i]) = int + b[1]*Pregnancies[i] + b[2]*Glucose[i] + b[3]*BloodPressure[i] + b[4]*SkinThickness[i] + 
                            b[5]*Insulin[i] + b[6]*BMI[i] + b[7]*DiabetesPedigreeFunction[i] + b[8]*Age[i]
    }
    int ~ dnorm(0.0, 1.0/25.0)
    for (j in 1:8) {
        b[j] ~ ddexp(0.0, sqrt(2.0)) # has variance 1.0
    }
} "


set.seed(92)

data_jags = as.list(X)

params = c("int", "b")

mod1 = jags.model(textConnection(mod1_string), data=data_jags, n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 768
##    Unobserved stochastic nodes: 9
##    Total graph size: 9732
## 
## Initializing model
update(mod1, 1e3)

mod1_sim = coda.samples(model=mod1,
                        variable.names=params,
                        n.iter=5e3)
mod1_csim = as.mcmc(do.call(rbind, mod1_sim))

par(mfrow=c(4,2))
densplot(mod1_csim[,1:8]) #, xlim=c(-3.0, 3.0))

par(mfrow=c(1,1))

## convergence diagnostics
plot(mod1_sim) #, ask=TRUE)

summary(mod1_sim)
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##          Mean      SD  Naive SE Time-series SE
## b[1]  0.40932 0.10764 0.0008789       0.001414
## b[2]  1.12534 0.11816 0.0009648       0.001499
## b[3] -0.10407 0.10139 0.0008278       0.001240
## b[4]  0.04265 0.11196 0.0009142       0.001531
## b[5] -0.04350 0.09667 0.0007893       0.001178
## b[6]  0.59322 0.12087 0.0009869       0.001711
## b[7]  0.27200 0.09743 0.0007955       0.001003
## b[8]  0.16881 0.10799 0.0008817       0.001530
## int  -0.86179 0.09705 0.0007924       0.001115
## 
## 2. Quantiles for each variable:
## 
##          2.5%      25%      50%      75%   97.5%
## b[1]  0.20083  0.33607  0.40936  0.48127  0.6224
## b[2]  0.90105  1.04423  1.12181  1.20542  1.3587
## b[3] -0.30440 -0.17319 -0.10270 -0.03378  0.0933
## b[4] -0.17506 -0.03094  0.04063  0.11646  0.2672
## b[5] -0.23480 -0.10761 -0.04324  0.02050  0.1492
## b[6]  0.35715  0.51151  0.59276  0.67313  0.8329
## b[7]  0.08135  0.20496  0.27093  0.33953  0.4640
## b[8] -0.03807  0.09479  0.16720  0.24145  0.3835
## int  -1.05126 -0.92682 -0.86122 -0.79672 -0.6711
gelman.diag(mod1_sim)
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## b[1]          1          1
## b[2]          1          1
## b[3]          1          1
## b[4]          1          1
## b[5]          1          1
## b[6]          1          1
## b[7]          1          1
## b[8]          1          1
## int           1          1
## 
## Multivariate psrf
## 
## 1
## calculate DIC
dic1 = dic.samples(mod1, n.iter=1e3)
dic1
## Mean deviance:  721.5 
## penalty 8.415 
## Penalized deviance: 730
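
To make the "interval includes zero" observation precise, we can check directly which coefficients' 95% posterior intervals straddle zero, using the combined chains (a small sketch):

# 2.5% and 97.5% posterior quantiles for each coefficient
ci <- apply(mod1_csim[, 1:8], 2, quantile, probs = c(0.025, 0.975))
colnames(ci)[ci[1, ] < 0 & ci[2, ] > 0] # coefficients whose 95% interval contains zero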

Model 2

  • Dropping SkinThickness and Insulin and refitting the model with a vague normal prior (zero mean, s.d. 5) on each of the coefficients results in a DIC of about 727.5, a slight improvement over the previous model.
  • The trace plots look okay and the Markov chains seem to have converged, with the autocorrelation plots supporting this; the effective sample sizes for the coefficients are quite large too.
  • Using the model for prediction gives 77.6% in-sample accuracy.
mod2_string = " model {
    for (i in 1:length(Outcome)) {
        Outcome[i] ~ dbern(p[i])
        logit(p[i]) = int + b[1]*Pregnancies[i] + b[2]*Glucose[i] + b[3]*BloodPressure[i] + b[4]*BMI[i]
                          + b[5]*DiabetesPedigreeFunction[i] + b[6]*Age[i]
    }
    int ~ dnorm(0.0, 1.0/25.0)
    for (j in 1:6) {
        b[j] ~ dnorm(0.0, 1/25.0)
    }
} "

set.seed(92)

mod2 = jags.model(textConnection(mod2_string), data=data_jags, n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 768
##    Unobserved stochastic nodes: 7
##    Total graph size: 7958
## 
## Initializing model
update(mod2, 1e3)

mod2_sim = coda.samples(model=mod2,
                        variable.names=params,
                        n.iter=5e3)
mod2_csim = as.mcmc(do.call(rbind, mod2_sim))

plot(mod2_sim) #, ask=TRUE)

summary(mod2_sim)
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##         Mean      SD  Naive SE Time-series SE
## b[1]  0.4278 0.10759 0.0008784       0.001406
## b[2]  1.1230 0.10999 0.0008981       0.001266
## b[3] -0.1230 0.10672 0.0008714       0.001316
## b[4]  0.6279 0.10842 0.0008852       0.001299
## b[5]  0.2879 0.09940 0.0008116       0.001055
## b[6]  0.1787 0.10790 0.0008810       0.001531
## int  -0.8693 0.09816 0.0008015       0.001112
## 
## 2. Quantiles for each variable:
## 
##          2.5%     25%     50%      75%    97.5%
## b[1]  0.21964  0.3564  0.4276  0.49927  0.64270
## b[2]  0.91099  1.0486  1.1223  1.19696  1.34420
## b[3] -0.33319 -0.1943 -0.1221 -0.05156  0.08416
## b[4]  0.41921  0.5545  0.6264  0.69907  0.84443
## b[5]  0.09182  0.2211  0.2878  0.35534  0.48449
## b[6] -0.03000  0.1053  0.1798  0.25159  0.39127
## int  -1.06491 -0.9351 -0.8678 -0.80272 -0.68156
gelman.diag(mod2_sim)
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## b[1]          1          1
## b[2]          1          1
## b[3]          1          1
## b[4]          1          1
## b[5]          1          1
## b[6]          1          1
## int           1          1
## 
## Multivariate psrf
## 
## 1
autocorr.diag(mod2_sim)
##                b[1]         b[2]         b[3]         b[4]          b[5]
## Lag 0   1.000000000  1.000000000  1.000000000  1.000000000  1.0000000000
## Lag 1   0.416926247  0.312344429  0.389106968  0.354901791  0.2545199402
## Lag 5   0.032729903  0.005064261  0.029058845  0.006816196 -0.0028027285
## Lag 10 -0.001793951  0.016886666  0.011805779  0.004113391 -0.0004910797
## Lag 50  0.011530066 -0.001100237 -0.003472414 -0.010511972  0.0023777716
##              b[6]          int
## Lag 0  1.00000000  1.000000000
## Lag 1  0.48050382  0.297651811
## Lag 5  0.04479959  0.012691650
## Lag 10 0.01761801 -0.016319320
## Lag 50 0.00830341  0.001664174
autocorr.plot(mod2_sim)

effectiveSize(mod2_sim)
##     b[1]     b[2]     b[3]     b[4]     b[5]     b[6]      int 
## 5879.475 7582.354 6600.773 6971.613 8921.548 4985.693 7836.817
dic2 = dic.samples(mod2, n.iter=1e3)
dic2
## Mean deviance:  720.4 
## penalty 7.056 
## Penalized deviance: 727.5
(pm_coef = colMeans(mod2_csim))
##       b[1]       b[2]       b[3]       b[4]       b[5]       b[6]        int 
##  0.4278389  1.1230138 -0.1230491  0.6278820  0.2878744  0.1787103 -0.8693098
pm_Xb = pm_coef["int"] + as.matrix(X[,c(1,2,3,6,7,8)]) %*% pm_coef[1:6] # the 6 predictors used in mod2
phat = 1.0 / (1.0 + exp(-pm_Xb))
head(phat)
##         [,1]
## 1 0.69337781
## 2 0.03441203
## 3 0.78604455
## 4 0.03351797
## 5 0.92900934
## 6 0.12635652
plot(phat, jitter(dat$Outcome), pch=19, col=dat$Outcome+1, xlab='predicted', ylab='actual', main='prediction with logistic regression')

(tab0.5 = table(phat > 0.5, data_jags$Outcome))
##        
##           0   1
##   FALSE 442 114
##   TRUE   58 154
sum(diag(tab0.5)) / sum(tab0.5)
## [1] 0.7760417

Model 3

  • Using all eight predictors and refitting the model with a vague normal prior (zero mean, s.d. 5) on each of the coefficients results in a DIC of about 731, a slight deterioration from the previous model.
  • The trace plots look okay and the Markov chains seem to have converged, with the autocorrelation plots supporting this; the effective sample sizes for the coefficients are quite large too.
  • Using the model for prediction gives 77.86% accuracy (on the same dataset it was trained on).
mod3_string = " model {
    for (i in 1:length(Outcome)) {
        Outcome[i] ~ dbern(p[i])
        logit(p[i]) = int + b[1]*Pregnancies[i] + b[2]*Glucose[i] + b[3]*BloodPressure[i] + b[4]*SkinThickness[i] + 
                            b[5]*Insulin[i] + b[6]*BMI[i] + b[7]*DiabetesPedigreeFunction[i] + b[8]*Age[i]
    }
    int ~ dnorm(0.0, 1.0/25.0)
    for (j in 1:8) {
        b[j] ~ dnorm(0.0, 1/25.0)
    }
} "

set.seed(92)


mod3 = jags.model(textConnection(mod3_string), data=as.list(X), n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 768
##    Unobserved stochastic nodes: 9
##    Total graph size: 9739
## 
## Initializing model
update(mod3, 1e3)

mod3_sim = coda.samples(model=mod3,
                        variable.names=params,
                        n.iter=5e3)
mod3_csim = as.mcmc(do.call(rbind, mod3_sim))

plot(mod3_sim) #, ask=TRUE)

summary(mod3_sim)
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##          Mean      SD  Naive SE Time-series SE
## b[1]  0.42253 0.10854 0.0008863       0.001442
## b[2]  1.15290 0.11877 0.0009698       0.001498
## b[3] -0.12995 0.10424 0.0008511       0.001306
## b[4]  0.03638 0.11798 0.0009633       0.001650
## b[5] -0.06039 0.10250 0.0008369       0.001252
## b[6]  0.62261 0.12269 0.0010017       0.001881
## b[7]  0.28709 0.09958 0.0008130       0.001027
## b[8]  0.18483 0.10964 0.0008952       0.001519
## int  -0.87444 0.09776 0.0007982       0.001112
## 
## 2. Quantiles for each variable:
## 
##          2.5%     25%      50%       75%    97.5%
## b[1]  0.21217  0.3465  0.42185  0.496595  0.63678
## b[2]  0.93169  1.0702  1.15083  1.232786  1.38882
## b[3] -0.33652 -0.1991 -0.12949 -0.058179  0.06954
## b[4] -0.19160 -0.0436  0.03481  0.114996  0.27153
## b[5] -0.25802 -0.1297 -0.06136  0.007583  0.14181
## b[6]  0.38653  0.5385  0.62165  0.705236  0.86618
## b[7]  0.09138  0.2195  0.28613  0.354221  0.47996
## b[8] -0.02720  0.1112  0.18464  0.256533  0.40388
## int  -1.07294 -0.9392 -0.87203 -0.808829 -0.68616
gelman.diag(mod3_sim)
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## b[1]          1       1.00
## b[2]          1       1.00
## b[3]          1       1.00
## b[4]          1       1.00
## b[5]          1       1.01
## b[6]          1       1.00
## b[7]          1       1.00
## b[8]          1       1.00
## int           1       1.00
## 
## Multivariate psrf
## 
## 1
autocorr.diag(mod3_sim)
##                 b[1]         b[2]         b[3]        b[4]        b[5]
## Lag 0   1.0000000000  1.000000000  1.000000000 1.000000000 1.000000000
## Lag 1   0.4334543085  0.398960765  0.386561138 0.471580744 0.361767827
## Lag 5   0.0320956149  0.011812204 -0.005117396 0.041795701 0.012959840
## Lag 10 -0.0139371475 -0.011444892  0.003135870 0.015409249 0.003230473
## Lag 50 -0.0005609705 -0.003421055  0.005630901 0.002842327 0.008084170
##               b[6]          b[7]        b[8]          int
## Lag 0  1.000000000  1.0000000000  1.00000000  1.000000000
## Lag 1  0.506836365  0.2498184910  0.47557863  0.294150138
## Lag 5  0.057902671 -0.0030093917  0.04406218  0.003341419
## Lag 10 0.009600484  0.0033883613 -0.01098449 -0.009434692
## Lag 50 0.009168463 -0.0007639789 -0.00244981 -0.012994380
autocorr.plot(mod3_sim)

effectiveSize(mod3_sim)
##     b[1]     b[2]     b[3]     b[4]     b[5]     b[6]     b[7]     b[8] 
## 5676.627 6296.441 6392.781 5126.782 6729.127 4292.068 9463.179 5278.964 
##      int 
## 7756.402
dic3 = dic.samples(mod3, n.iter=1e3)
dic3
## Mean deviance:  722.1 
## penalty 9.079 
## Penalized deviance: 731.2
(pm_coef = colMeans(mod3_csim))
##        b[1]        b[2]        b[3]        b[4]        b[5]        b[6] 
##  0.42252694  1.15289695 -0.12995391  0.03638308 -0.06038751  0.62261242 
##        b[7]        b[8]         int 
##  0.28709429  0.18483462 -0.87443948
pm_Xb = pm_coef["int"] + as.matrix(X[,1:8]) %*% pm_coef[1:8]
phat = 1.0 / (1.0 + exp(-pm_Xb))
head(phat)
##         [,1]
## 1 0.70389019
## 2 0.03348453
## 3 0.79588270
## 4 0.03315076
## 5 0.93103951
## 6 0.12544097
plot(phat, jitter(dat$Outcome), pch=19, col=dat$Outcome+1, xlab='predicted', ylab='actual', main='prediction with logistic regression')

(tab0.5 = table(phat > 0.5, data_jags$Outcome))
##        
##           0   1
##   FALSE 444 114
##   TRUE   56 154
sum(diag(tab0.5)) / sum(tab0.5)
## [1] 0.7786458
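
The three DIC estimates are close, and they are themselves Monte Carlo estimates; rjags provides diffdic to estimate the difference in penalized deviance between two fitted models together with its sampling error. A sketch (here a negative difference would favor Model 2):

# estimated penalized-deviance difference (Model 2 minus Model 3) with its sampling error
diffdic(dic2, dic3)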

Prior vs. Posterior on Coefficients

The next plot shows how the vague prior compares with the sharp, concentrated posterior distribution of a predictor's coefficient (posterior samples drawn via MCMC simulation). It also shows that SkinThickness is not an important predictor, since its posterior density places substantial mass around zero.

plot.densities <- function(col) {
  coeff <- paste0('β_', names(X[col]))
  # posterior samples of the coefficient, drawn with MCMC
  post.df <- data.frame(var=mod3_csim[,col], density=rep('posterior', nrow(mod3_csim)))
  names(post.df)[1] <- coeff
  # samples from the N(0, 5^2) prior, for comparison
  prior.df <- data.frame(var=rnorm(1000,0,5), density=rep('prior', 1000))
  names(prior.df)[1] <- coeff
  df <- rbind(post.df, prior.df)
  return (ggplot(df, aes_string(coeff, fill='density')) + geom_density(alpha=0.5) +
          geom_vline(xintercept = mean(X[,col]), lty=2, col='red', alpha=0.5)) # vline at ~0 (standardized)
}

(p <- plot.densities(4))

Bayesian Credible Intervals (CI) for the Coefficients

The Highest Posterior Density intervals for the coefficients are shown below:

  • e.g., the coefficient for the predictor Glucose (b[2]) lies in the interval (0.933, 1.390) shown below with 95% posterior probability.
HPDinterval(mod3_csim)
##            lower       upper
## b[1]  0.21834012  0.64093736
## b[2]  0.93339740  1.38956930
## b[3] -0.32965513  0.07475399
## b[4] -0.20150163  0.26085259
## b[5] -0.25571422  0.14405698
## b[6]  0.37568675  0.85358616
## b[7]  0.09602151  0.48381158
## b[8] -0.01865905  0.41127996
## int  -1.06922526 -0.68279383
## attr(,"Probability")
## [1] 0.95

Make Probabilistic Statements

For example, let's try to answer the question: what's the probability that the effect size of DiabetesPedigreeFunction is greater than that of Age? We can answer this by computing the proportion of MCMC samples in which the DiabetesPedigreeFunction coefficient (b[7]) exceeds the Age coefficient (b[8]), as shown below.

mean(mod3_csim[,7] > mod3_csim[,8])
## [1] 0.7545333
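
The same recipe works for any event defined in terms of the coefficients. For instance, the posterior probability that the (standardized) Glucose coefficient b[2] exceeds 1 is just the corresponding sample proportion:

mean(mod3_csim[, "b[2]"] > 1) # P(b_Glucose > 1 | data)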

Posterior Predictive Distribution

  • Now, let's consider two new patients with the diagnostic measurements listed below.
  • First we need to scale the measurements, subtracting the mean and dividing by the s.d. of the respective predictor (computed from the training data).
  • Next, we use the MCMC samples to construct the posterior distribution of the diabetes probability for patient 1 and build a 95% credible interval with quantile: with 95% posterior probability, the probability that the first patient is diabetic lies in the interval computed and visualized in the next code snippet.
  • Finally, we compare the two patients by drawing outcomes from their posterior predictive distributions and computing how often the second patient's sampled outcome is diabetic while the first's is not, as shown below.
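
Formally, each MCMC draw (β0^(s), β^(s)) of the coefficients yields one plausible value of a new patient's diabetes probability,

p^(s) = 1 / (1 + exp(-(β0^(s) + x*' β^(s)))),

where x* holds the patient's standardized measurements. The collection {p^(s)} approximates the posterior distribution of that probability, and additionally drawing y^(s) ~ Bernoulli(p^(s)) gives the posterior predictive distribution of the outcome itself; both are computed below.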
p1 <- matrix(c(0,300,80,30,100,33,0.3,40), ncol=1)
p1df <- as.data.frame(t(p1))
colnames(p1df) <- names(dat)[-ncol(dat)]
p2 <- matrix(c(0,250,82,29,90,35,0.25,50), ncol=1)
p2df <- as.data.frame(t(p2))
colnames(p2df) <- names(dat)[-ncol(dat)]
pdf <- rbind(p1df, p2df)
rownames(pdf) <- c('patient 1', 'patient 2')
head(pdf) 
##           Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## patient 1           0     300            80            30     100  33
## patient 2           0     250            82            29      90  35
##           DiabetesPedigreeFunction Age
## patient 1                     0.30  40
## patient 2                     0.25  50
mu <- colMeans(dat[,-ncol(dat)])
sd <- apply(dat[,-ncol(dat)], 2, sd)
p1 <- (p1 - mu) / sd # z-score normalize
p2 <- (p2 - mu) / sd
p1_Xb = mod3_csim[,"int"] + mod3_csim[,1:8] %*% p1
phat1 = 1.0 / (1.0 + exp(-p1_Xb))
head(phat1)
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1 
## End = 7 
## Thinning interval = 1 
##           [,1]
## [1,] 0.9901673
## [2,] 0.9979442
## [3,] 0.9985275
## [4,] 0.9920615
## [5,] 0.9856507
## [6,] 0.9967362
## [7,] 0.9963923
p2_Xb = mod3_csim[,"int"] + mod3_csim[,1:8] %*% p2
phat2 = 1.0 / (1.0 + exp(-p2_Xb))
ci_95 <- quantile(phat1, probs = c(0.025, 0.975))
ci_95
##      2.5%     97.5% 
## 0.9792472 0.9986913
post_pred_df <- data.frame(pred=phat1)
ggplot(post_pred_df, aes(x = pred)) + 
    xlab('probability of being diabetic (Outcome=1)') + 
    geom_density(color='darkblue', fill = 'lightblue') + 
    geom_vline(xintercept = ci_95, color = "red") +
    theme_bw() + 
    ggtitle('Posterior Predictive distribution 95% Credible Interval for patient 1')

post_pred_df$patient <- 1
post_pred_df <- rbind(post_pred_df, data.frame(pred=phat2, patient=rep(2, length(phat2))))
post_pred_df$patient <- as.factor(post_pred_df$patient)
ggplot(post_pred_df, aes(pred, fill=patient)) + 
    geom_density(alpha=0.5) + 
    xlab('probability of being diabetic (Outcome=1)') + 
    theme_bw() + 
    ggtitle('Prediction with MCMC samples from posterior distribution of coefficients')

# alternative: mean(phat2 > phat1) would compare the two probability distributions directly

set.seed(92)

y1 <- rbinom(nrow(mod3_csim), 1, 1/(1+exp(-mod3_csim[,"int"] - mod3_csim[,1:8] %*% p1))) # one Bernoulli outcome draw per posterior sample
y2 <- rbinom(nrow(mod3_csim), 1, 1/(1+exp(-mod3_csim[,"int"] - mod3_csim[,1:8] %*% p2)))
mean(y2 > y1)
## [1] 0.007466667
post_pred_df <- data.frame(pred=c(y1, y2), patient=c(rep('1', length(y1)), rep('2', length(y2))))
ggplot(post_pred_df, aes(pred, fill=patient)) + 
    geom_density(alpha=0.5) + 
    xlab('sampled Outcome (1 = diabetic)') + 
    theme_bw() + 
    ggtitle('Posterior Predictive distribution for patient 1 and patient 2')

Conclusions

Using Bayesian logistic regression, we could not only compute point estimates for the coefficients and use them to predict the outcome; the main strength of the approach is quantifying uncertainty, i.e., how confident we are in the estimates and in the predictions (via the posterior predictive distribution). We saw that for all predictor variables other than Insulin and BloodPressure, an increase in value increases the log odds of diabetes. The model even allows us to make probabilistic statements comparing two patients' probabilities of being diabetic, given the values of their diagnostic measurements.

Improvements

The following can be tried to improve the model: