Executive summary

In this article, we shall use the R package rjags to implement Bayesian logistic regression and understand how diabetes depends on a set of diagnostic measurements. This appeared as a data analysis project in the Coursera course Bayesian Statistics: Techniques and Models, by the University of California, Santa Cruz.

More specifically, our objectives will be to:

  1. Use the diagnostic measurements as predictor variables and estimate the coefficients for the predictors in a Bayesian setting.
  2. Predict the response variable, i.e., whether a patient is diabetic or not, given those measurements and the estimated coefficients.
  3. Construct Bayesian Credible Intervals (CI) for the coefficients to quantify uncertainty and make probabilistic claims about the coefficients.
  4. Compute the posterior predictive distribution and use it to construct CIs for the predictions and make probabilistic claims about the predictions for different patients.
  5. Use base R and rjags library functions throughout for the implementation.

Introduction

The goal of this project is to predict whether a patient is diabetic from her medical diagnostics. In the Bayesian setting, we shall impose priors on the coefficients corresponding to the predictors (the diagnostic measurements) and use MCMC sampling to compute their posterior distributions, verifying with diagnostic tests that the Markov chains converge to the target stationary distribution. We shall fit a few different models and use model checking / selection (with DIC) to find the best one. Along the way we shall use the posterior predictive distribution to answer a few questions (test hypotheses) about the predicted outcomes, and construct Bayesian credible intervals to quantify the uncertainty in both the coefficient estimates and the predictions. As we shall see, a few of the predictor variables turn out to be statistically significant.

Data

We shall use the Diabetes dataset from Kaggle: https://www.kaggle.com/datasets/mathchi/diabetes-data-set. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Several constraints were placed on the selection of these instances from a larger database; in particular, all patients here are females of Pima Indian heritage, at least 21 years old. Each of the 768 rows holds the measurements for one patient: 8 predictor variables, listed and described below, and a single response variable named Outcome (1 if the patient is diabetic, 0 otherwise).

  • Pregnancies: number of times pregnant
  • Glucose: plasma glucose concentration (2-hour oral glucose tolerance test)
  • BloodPressure: diastolic blood pressure (mm Hg)
  • SkinThickness: triceps skin fold thickness (mm)
  • Insulin: 2-hour serum insulin (mu U/ml)
  • BMI: body mass index (weight in kg / (height in m)²)
  • DiabetesPedigreeFunction: a score of diabetes likelihood based on family history
  • Age: age in years

The following table shows the first few rows of the dataset.

# set the working directory to the folder containing diabetes.csv before reading it
dat = read.csv(file="diabetes.csv", header=TRUE)
dim(dat)
## [1] 768   9
head(dat)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

Exploratory Analysis & Cleaning

The dataset contains no explicit NAs (na.omit drops nothing here), but several predictor columns contain zeros that are physiologically implausible (e.g., zero Glucose, BloodPressure, SkinThickness, Insulin or BMI) and effectively encode missing values. We first quantify the zeros and then impute them with the mean of the nonzero values in each column.

dat = na.omit(dat)
print('percent zero values in each column')
## [1] "percent zero values in each column"
100*colSums(dat == 0) / nrow(dat) # percentage zeros
##              Pregnancies                  Glucose            BloodPressure 
##               14.4531250                0.6510417                4.5572917 
##            SkinThickness                  Insulin                      BMI 
##               29.5572917               48.6979167                1.4322917 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                0.0000000                0.0000000               65.1041667
vars <- colnames(dat)
col.means <- sapply(dat, function(x) mean(x[x!=0])) # per-column mean over the nonzero entries

for (var in vars[-length(vars)]) { # all columns except the response Outcome
  dat[dat[,var] == 0, var] <- col.means[var] # mean imputation of the zeros
}

print('percent zero values in each column after mean imputation')
## [1] "percent zero values in each column after mean imputation"
100*colSums(dat == 0) / nrow(dat) # zeros in the response Outcome are legitimate and left untouched
##              Pregnancies                  Glucose            BloodPressure 
##                  0.00000                  0.00000                  0.00000 
##            SkinThickness                  Insulin                      BMI 
##                  0.00000                  0.00000                  0.00000 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                  0.00000                  0.00000                 65.10417
summary(dat)
##   Pregnancies        Glucose       BloodPressure    SkinThickness  
##  Min.   : 1.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 2.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:25.00  
##  Median : 4.495   Median :117.00   Median : 72.20   Median :29.15  
##  Mean   : 4.495   Mean   :121.69   Mean   : 72.41   Mean   :29.15  
##  3rd Qu.: 6.000   3rd Qu.:140.25   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.00   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.0   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:121.5   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
##  Median :155.5   Median :32.40   Median :0.3725           Median :29.00  
##  Mean   :155.5   Mean   :32.46   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:155.5   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
library(GGally)
ggpairs(dat,                 
        columns = 1:9,        
        aes(color = as.factor(Outcome),  
            alpha = 0.5))

library("corrplot")
Cor = cor(dat)
corrplot(Cor, type="upper", method="ellipse", tl.pos="d")
corrplot(Cor, type="lower", method="number", col="black", 
         add=TRUE, diag=FALSE, tl.pos="n", cl.pos="n")

par(mfrow = c(3, 3))
for (var in vars[-length(vars)]) {
  boxplot(dat[,var]~dat$Outcome, ylab=var, xlab='outcome (diabetic?)')
}
par(mfrow = c(1, 1))

par(mfrow = c(3, 3))
for (var in vars) {
  hist(dat[,var], xlab=var, main='', probability=TRUE)
}

par(mfrow = c(1, 1))

Model

X = scale(dat[,-length(vars)], center=TRUE, scale=TRUE)
print('means (rounded to 3 decimals)')
## [1] "means (rounded to 3 decimals)"
round(colMeans(X), 3)
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age 
##                        0                        0
print('standard deviations')
## [1] "standard deviations"
apply(X, 2, sd)
##              Pregnancies                  Glucose            BloodPressure 
##                        1                        1                        1 
##            SkinThickness                  Insulin                      BMI 
##                        1                        1                        1 
## DiabetesPedigreeFunction                      Age 
##                        1                        1
X <- as.data.frame(X)
X$Outcome <- dat$Outcome
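
In symbols, every model we fit below has the same likelihood: for patient i,

Outcome_i ~ Bernoulli(p_i),   logit(p_i) = β0 + β1·x_i1 + … + βk·x_ik,

where the x_ij are the standardized predictors. The models differ only in which predictors are included and in the priors placed on the intercept β0 and the coefficients βj.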

  • We shall run 3 independent Markov chains for mixing, discard the first 1000 iterations as burn-in, and then simulate for another 5000 iterations, using trace plots (along with the Gelman-Rubin and autocorrelation diagnostics) to check convergence.

Results

Logistic Regression with GLM

As can be seen from the following results,

  • The predictors Pregnancies, Glucose, BMI and DiabetesPedigreeFunction are statistically significant (p-value < 0.05, so the null hypothesis of a zero coefficient is rejected at the 5% significance level), while BloodPressure, SkinThickness, Insulin and Age are not.
  • The residual plot looks okay and does not show any obvious pattern.
  • Interpretation of the coefficients: e.g., an increase in Glucose by 1 unit increases the log odds of diabetes by 0.0372466, whereas a decrease in BloodPressure by 1 unit increases the log odds of diabetes by 0.0104619, holding the other predictors fixed (see the odds-ratio sketch after the model summary below).
  • The prediction has an in-sample accuracy of 77.73%.
library(caret)
m <- glm(Outcome ~ ., data=dat, family='binomial')
summary(m)
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6353  -0.7129  -0.3951   0.6932   2.4061  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.0705975  0.8148288 -11.132  < 2e-16 ***
## Pregnancies               0.1406610  0.0359271   3.915 9.03e-05 ***
## Glucose                   0.0372466  0.0038779   9.605  < 2e-16 ***
## BloodPressure            -0.0104619  0.0086383  -1.211  0.22585    
## SkinThickness             0.0038213  0.0131661   0.290  0.77164    
## Insulin                  -0.0007215  0.0011808  -0.611  0.54115    
## BMI                       0.0890451  0.0177613   5.013 5.35e-07 ***
## DiabetesPedigreeFunction  0.8541331  0.2981939   2.864  0.00418 ** 
## Age                       0.0152995  0.0091727   1.668  0.09533 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 712.94  on 759  degrees of freedom
## AIC: 730.94
## 
## Number of Fisher Scoring iterations: 5
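
Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios; e.g., exp(0.0372466) ≈ 1.038, so each additional unit of Glucose multiplies the odds of diabetes by about 1.04, holding the other predictors fixed. A quick sketch, using base R's confint.default for approximate (Wald) intervals:

# odds ratios with approximate 95% Wald confidence intervals
round(exp(cbind(OR = coef(m), confint.default(m))), 3)
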
plot(residuals(m, type='deviance'))

pred <- predict(m, type='response')
pred <- as.integer(round(pred)) # threshold the fitted probabilities at 0.5
# note: caret's convention is confusionMatrix(predicted, actual); passing (actual, predicted)
# as below transposes the table and swaps sensitivity/PPV, but leaves accuracy unchanged
confusionMatrix(as.factor(dat$Outcome), as.factor(pred))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 443  57
##          1 114 154
##                                           
##                Accuracy : 0.7773          
##                  95% CI : (0.7462, 0.8063)
##     No Information Rate : 0.7253          
##     P-Value [Acc > NIR] : 0.0005657       
##                                           
##                   Kappa : 0.4845          
##                                           
##  Mcnemar's Test P-Value : 1.849e-05       
##                                           
##             Sensitivity : 0.7953          
##             Specificity : 0.7299          
##          Pos Pred Value : 0.8860          
##          Neg Pred Value : 0.5746          
##              Prevalence : 0.7253          
##          Detection Rate : 0.5768          
##    Detection Prevalence : 0.6510          
##       Balanced Accuracy : 0.7626          
##                                           
##        'Positive' Class : 0               
## 

Bayesian Logistic Regression with rjags

Model 1

For the model with double exponential (Laplace) prior on the coefficients,

  • The trace plots for the MCMC look okay, indicating convergence.
  • The Gelman-Rubin diagnostics look okay too; the values are close to 1, as expected.
  • The 95% posterior intervals for the BloodPressure, SkinThickness, Insulin and Age coefficients all include zero (see the sketch after the DIC computation below); the posteriors for SkinThickness and Insulin are the most concentrated around zero, so we drop those two predictors in the next model.
  • The DIC for the model is about 730.
library(rjags)
mod1_string = " model {
    for (i in 1:length(Outcome)) {
        Outcome[i] ~ dbern(p[i])
        logit(p[i]) = int + b[1]*Pregnancies[i] + b[2]*Glucose[i] + b[3]*BloodPressure[i] + b[4]*SkinThickness[i] + 
                            b[5]*Insulin[i] + b[6]*BMI[i] + b[7]*DiabetesPedigreeFunction[i] + b[8]*Age[i]
    }
    int ~ dnorm(0.0, 1.0/25.0)
    for (j in 1:8) {
        b[j] ~ ddexp(0.0, sqrt(2.0)) # has variance 1.0
    }
} "


set.seed(92)

data_jags = as.list(X)

params = c("int", "b")

mod1 = jags.model(textConnection(mod1_string), data=data_jags, n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 768
##    Unobserved stochastic nodes: 9
##    Total graph size: 9732
## 
## Initializing model
update(mod1, 1e3)

mod1_sim = coda.samples(model=mod1,
                        variable.names=params,
                        n.iter=5e3)
mod1_csim = as.mcmc(do.call(rbind, mod1_sim))

par(mfrow=c(4,2))
densplot(mod1_csim[,1:8]) #, xlim=c(-3.0, 3.0))

par(mfrow=c(1,1))

## convergence diagnostics
plot(mod1_sim) #, ask=TRUE)

summary(mod1_sim)
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##          Mean      SD  Naive SE Time-series SE
## b[1]  0.40932 0.10764 0.0008789       0.001414
## b[2]  1.12534 0.11816 0.0009648       0.001499
## b[3] -0.10407 0.10139 0.0008278       0.001240
## b[4]  0.04265 0.11196 0.0009142       0.001531
## b[5] -0.04350 0.09667 0.0007893       0.001178
## b[6]  0.59322 0.12087 0.0009869       0.001711
## b[7]  0.27200 0.09743 0.0007955       0.001003
## b[8]  0.16881 0.10799 0.0008817       0.001530
## int  -0.86179 0.09705 0.0007924       0.001115
## 
## 2. Quantiles for each variable:
## 
##          2.5%      25%      50%      75%   97.5%
## b[1]  0.20083  0.33607  0.40936  0.48127  0.6224
## b[2]  0.90105  1.04423  1.12181  1.20542  1.3587
## b[3] -0.30440 -0.17319 -0.10270 -0.03378  0.0933
## b[4] -0.17506 -0.03094  0.04063  0.11646  0.2672
## b[5] -0.23480 -0.10761 -0.04324  0.02050  0.1492
## b[6]  0.35715  0.51151  0.59276  0.67313  0.8329
## b[7]  0.08135  0.20496  0.27093  0.33953  0.4640
## b[8] -0.03807  0.09479  0.16720  0.24145  0.3835
## int  -1.05126 -0.92682 -0.86122 -0.79672 -0.6711
gelman.diag(mod1_sim)
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## b[1]          1          1
## b[2]          1          1
## b[3]          1          1
## b[4]          1          1
## b[5]          1          1
## b[6]          1          1
## b[7]          1          1
## b[8]          1          1
## int           1          1
## 
## Multivariate psrf
## 
## 1
## calculate DIC
dic1 = dic.samples(mod1, n.iter=1e3)
dic1
## Mean deviance:  721.5 
## penalty 8.415 
## Penalized deviance: 730
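
To make the "interval includes zero" observation precise, we can check directly which coefficients' 95% posterior intervals straddle zero, using the combined chains (a small sketch):

# 2.5% and 97.5% posterior quantiles for each coefficient
ci <- apply(mod1_csim[, 1:8], 2, quantile, probs = c(0.025, 0.975))
colnames(ci)[ci[1, ] < 0 & ci[2, ] > 0] # coefficients whose 95% interval contains zero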

Model 2

  • Dropping SkinThickness and Insulin and refitting the model with a vague normal prior (zero mean, s.d. 5) on each of the coefficients results in a DIC of about 727.5, a slight improvement over the previous model.
  • The trace plots look okay and the Markov chains seem to have converged, with the autocorrelation plots supporting this; the effective sample sizes for the coefficients are quite large too.
  • Using the model for prediction gives 77.6% in-sample accuracy.
mod2_string = " model {
    for (i in 1:length(Outcome)) {
        Outcome[i] ~ dbern(p[i])
        logit(p[i]) = int + b[1]*Pregnancies[i] + b[2]*Glucose[i] + b[3]*BloodPressure[i] + b[4]*BMI[i]
                          + b[5]*DiabetesPedigreeFunction[i] + b[6]*Age[i]
    }
    int ~ dnorm(0.0, 1.0/25.0)
    for (j in 1:6) {
        b[j] ~ dnorm(0.0, 1/25.0)
    }
} "

set.seed(92)

mod2 = jags.model(textConnection(mod2_string), data=data_jags, n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 768
##    Unobserved stochastic nodes: 7
##    Total graph size: 7958
## 
## Initializing model
update(mod2, 1e3)

mod2_sim = coda.samples(model=mod2,
                        variable.names=params,
                        n.iter=5e3)
mod2_csim = as.mcmc(do.call(rbind, mod2_sim))

plot(mod2_sim) #, ask=TRUE)

summary(mod2_sim)
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##         Mean      SD  Naive SE Time-series SE
## b[1]  0.4278 0.10759 0.0008784       0.001406
## b[2]  1.1230 0.10999 0.0008981       0.001266
## b[3] -0.1230 0.10672 0.0008714       0.001316
## b[4]  0.6279 0.10842 0.0008852       0.001299
## b[5]  0.2879 0.09940 0.0008116       0.001055
## b[6]  0.1787 0.10790 0.0008810       0.001531
## int  -0.8693 0.09816 0.0008015       0.001112
## 
## 2. Quantiles for each variable:
## 
##          2.5%     25%     50%      75%    97.5%
## b[1]  0.21964  0.3564  0.4276  0.49927  0.64270
## b[2]  0.91099  1.0486  1.1223  1.19696  1.34420
## b[3] -0.33319 -0.1943 -0.1221 -0.05156  0.08416
## b[4]  0.41921  0.5545  0.6264  0.69907  0.84443
## b[5]  0.09182  0.2211  0.2878  0.35534  0.48449
## b[6] -0.03000  0.1053  0.1798  0.25159  0.39127
## int  -1.06491 -0.9351 -0.8678 -0.80272 -0.68156
gelman.diag(mod2_sim)
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## b[1]          1          1
## b[2]          1          1
## b[3]          1          1
## b[4]          1          1
## b[5]          1          1
## b[6]          1          1
## int           1          1
## 
## Multivariate psrf
## 
## 1
autocorr.diag(mod2_sim)
##                b[1]         b[2]         b[3]         b[4]          b[5]
## Lag 0   1.000000000  1.000000000  1.000000000  1.000000000  1.0000000000
## Lag 1   0.416926247  0.312344429  0.389106968  0.354901791  0.2545199402
## Lag 5   0.032729903  0.005064261  0.029058845  0.006816196 -0.0028027285
## Lag 10 -0.001793951  0.016886666  0.011805779  0.004113391 -0.0004910797
## Lag 50  0.011530066 -0.001100237 -0.003472414 -0.010511972  0.0023777716
##              b[6]          int
## Lag 0  1.00000000  1.000000000
## Lag 1  0.48050382  0.297651811
## Lag 5  0.04479959  0.012691650
## Lag 10 0.01761801 -0.016319320
## Lag 50 0.00830341  0.001664174
autocorr.plot(mod2_sim)

effectiveSize(mod2_sim)
##     b[1]     b[2]     b[3]     b[4]     b[5]     b[6]      int 
## 5879.475 7582.354 6600.773 6971.613 8921.548 4985.693 7836.817
dic2 = dic.samples(mod2, n.iter=1e3)
dic2
## Mean deviance:  720.4 
## penalty 7.056 
## Penalized deviance: 727.5
(pm_coef = colMeans(mod2_csim))
##       b[1]       b[2]       b[3]       b[4]       b[5]       b[6]        int 
##  0.4278389  1.1230138 -0.1230491  0.6278820  0.2878744  0.1787103 -0.8693098
pm_Xb = pm_coef["int"] + as.matrix(X[,c(1,2,3,6,7,8)]) %*% pm_coef[1:6] # the 6 predictors used in mod2
phat = 1.0 / (1.0 + exp(-pm_Xb))
head(phat)
##         [,1]
## 1 0.69337781
## 2 0.03441203
## 3 0.78604455
## 4 0.03351797
## 5 0.92900934
## 6 0.12635652
plot(phat, jitter(dat$Outcome), pch=19, col=dat$Outcome+1, xlab='predicted', ylab='actual', main='prediction with logistic regression')

(tab0.5 = table(phat > 0.5, data_jags$Outcome))
##        
##           0   1
##   FALSE 442 114
##   TRUE   58 154
sum(diag(tab0.5)) / sum(tab0.5)
## [1] 0.7760417

Model 3

  • Using all eight predictors and refitting the model with a vague normal prior (zero mean, s.d. 5) on each of the coefficients results in a DIC of about 731, a slight deterioration from the previous model.
  • The trace plots look okay and the Markov chains seem to have converged, with the autocorrelation plots supporting this; the effective sample sizes for the coefficients are quite large too.
  • Using the model for prediction gives 77.86% accuracy (on the same dataset it was trained on).
mod3_string = " model {
    for (i in 1:length(Outcome)) {
        Outcome[i] ~ dbern(p[i])
        logit(p[i]) = int + b[1]*Pregnancies[i] + b[2]*Glucose[i] + b[3]*BloodPressure[i] + b[4]*SkinThickness[i] + 
                            b[5]*Insulin[i] + b[6]*BMI[i] + b[7]*DiabetesPedigreeFunction[i] + b[8]*Age[i]
    }
    int ~ dnorm(0.0, 1.0/25.0)
    for (j in 1:8) {
        b[j] ~ dnorm(0.0, 1/25.0)
    }
} "

set.seed(92)


mod3 = jags.model(textConnection(mod3_string), data=as.list(X), n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 768
##    Unobserved stochastic nodes: 9
##    Total graph size: 9739
## 
## Initializing model
update(mod3, 1e3)

mod3_sim = coda.samples(model=mod3,
                        variable.names=params,
                        n.iter=5e3)
mod3_csim = as.mcmc(do.call(rbind, mod3_sim))

plot(mod3_sim) #, ask=TRUE)

summary(mod3_sim)
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##          Mean      SD  Naive SE Time-series SE
## b[1]  0.42253 0.10854 0.0008863       0.001442
## b[2]  1.15290 0.11877 0.0009698       0.001498
## b[3] -0.12995 0.10424 0.0008511       0.001306
## b[4]  0.03638 0.11798 0.0009633       0.001650
## b[5] -0.06039 0.10250 0.0008369       0.001252
## b[6]  0.62261 0.12269 0.0010017       0.001881
## b[7]  0.28709 0.09958 0.0008130       0.001027
## b[8]  0.18483 0.10964 0.0008952       0.001519
## int  -0.87444 0.09776 0.0007982       0.001112
## 
## 2. Quantiles for each variable:
## 
##          2.5%     25%      50%       75%    97.5%
## b[1]  0.21217  0.3465  0.42185  0.496595  0.63678
## b[2]  0.93169  1.0702  1.15083  1.232786  1.38882
## b[3] -0.33652 -0.1991 -0.12949 -0.058179  0.06954
## b[4] -0.19160 -0.0436  0.03481  0.114996  0.27153
## b[5] -0.25802 -0.1297 -0.06136  0.007583  0.14181
## b[6]  0.38653  0.5385  0.62165  0.705236  0.86618
## b[7]  0.09138  0.2195  0.28613  0.354221  0.47996
## b[8] -0.02720  0.1112  0.18464  0.256533  0.40388
## int  -1.07294 -0.9392 -0.87203 -0.808829 -0.68616
gelman.diag(mod3_sim)
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## b[1]          1       1.00
## b[2]          1       1.00
## b[3]          1       1.00
## b[4]          1       1.00
## b[5]          1       1.01
## b[6]          1       1.00
## b[7]          1       1.00
## b[8]          1       1.00
## int           1       1.00
## 
## Multivariate psrf
## 
## 1
autocorr.diag(mod3_sim)
##                 b[1]         b[2]         b[3]        b[4]        b[5]
## Lag 0   1.0000000000  1.000000000  1.000000000 1.000000000 1.000000000
## Lag 1   0.4334543085  0.398960765  0.386561138 0.471580744 0.361767827
## Lag 5   0.0320956149  0.011812204 -0.005117396 0.041795701 0.012959840
## Lag 10 -0.0139371475 -0.011444892  0.003135870 0.015409249 0.003230473
## Lag 50 -0.0005609705 -0.003421055  0.005630901 0.002842327 0.008084170
##               b[6]          b[7]        b[8]          int
## Lag 0  1.000000000  1.0000000000  1.00000000  1.000000000
## Lag 1  0.506836365  0.2498184910  0.47557863  0.294150138
## Lag 5  0.057902671 -0.0030093917  0.04406218  0.003341419
## Lag 10 0.009600484  0.0033883613 -0.01098449 -0.009434692
## Lag 50 0.009168463 -0.0007639789 -0.00244981 -0.012994380
autocorr.plot(mod3_sim)

effectiveSize(mod3_sim)
##     b[1]     b[2]     b[3]     b[4]     b[5]     b[6]     b[7]     b[8] 
## 5676.627 6296.441 6392.781 5126.782 6729.127 4292.068 9463.179 5278.964 
##      int 
## 7756.402
dic3 = dic.samples(mod3, n.iter=1e3)
dic3
## Mean deviance:  722.1 
## penalty 9.079 
## Penalized deviance: 731.2
(pm_coef = colMeans(mod3_csim))
##        b[1]        b[2]        b[3]        b[4]        b[5]        b[6] 
##  0.42252694  1.15289695 -0.12995391  0.03638308 -0.06038751  0.62261242 
##        b[7]        b[8]         int 
##  0.28709429  0.18483462 -0.87443948
pm_Xb = pm_coef["int"] + as.matrix(X[,1:8]) %*% pm_coef[1:8]
phat = 1.0 / (1.0 + exp(-pm_Xb))
head(phat)
##         [,1]
## 1 0.70389019
## 2 0.03348453
## 3 0.79588270
## 4 0.03315076
## 5 0.93103951
## 6 0.12544097
plot(phat, jitter(dat$Outcome), pch=19, col=dat$Outcome+1, xlab='predicted', ylab='actual', main='prediction with logistic regression')

(tab0.5 = table(phat > 0.5, data_jags$Outcome))
##        
##           0   1
##   FALSE 444 114
##   TRUE   56 154
sum(diag(tab0.5)) / sum(tab0.5)
## [1] 0.7786458
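
The three DIC estimates are close, and they are themselves Monte Carlo estimates; rjags provides diffdic to estimate the difference in penalized deviance between two fitted models together with its sampling error. A sketch (here a negative difference would favor Model 2):

# estimated penalized-deviance difference (Model 2 minus Model 3) with its sampling error
diffdic(dic2, dic3)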

Prior vs. Posterior on Coefficients

The next plot shows how the vague prior compares with the sharp, concentrated posterior distribution of a predictor's coefficient (posterior samples drawn via MCMC simulation). It also shows that SkinThickness is not an important predictor, since its posterior density places substantial mass around zero.

plot.densities <- function(col) {
  coeff <- paste0('β_', names(X[col]))
  # posterior samples of the coefficient, drawn with MCMC
  post.df <- data.frame(var=mod3_csim[,col], density=rep('posterior', nrow(mod3_csim)))
  names(post.df)[1] <- coeff
  # samples from the N(0, 5^2) prior, for comparison
  prior.df <- data.frame(var=rnorm(1000,0,5), density=rep('prior', 1000))
  names(prior.df)[1] <- coeff
  df <- rbind(post.df, prior.df)
  return (ggplot(df, aes_string(coeff, fill='density')) + geom_density(alpha=0.5) +
          geom_vline(xintercept = mean(X[,col]), lty=2, col='red', alpha=0.5)) # vline at ~0 (standardized)
}

(p <- plot.densities(4))

Bayesian Credible Intervals (CI) for the Coefficients

The Highest Posterior Density intervals for the coefficients are shown below:

  • e.g., the coefficient for the predictor Glucose (b[2]) lies in the interval (0.933, 1.390) shown below with 95% posterior probability.
HPDinterval(mod3_csim)
##            lower       upper
## b[1]  0.21834012  0.64093736
## b[2]  0.93339740  1.38956930
## b[3] -0.32965513  0.07475399
## b[4] -0.20150163  0.26085259
## b[5] -0.25571422  0.14405698
## b[6]  0.37568675  0.85358616
## b[7]  0.09602151  0.48381158
## b[8] -0.01865905  0.41127996
## int  -1.06922526 -0.68279383
## attr(,"Probability")
## [1] 0.95

Make Probabilistic Statements

For example, let's try to answer the question: what's the probability that the effect size of DiabetesPedigreeFunction is greater than that of Age? We can answer this by computing the proportion of MCMC samples in which the DiabetesPedigreeFunction coefficient (b[7]) exceeds the Age coefficient (b[8]), as shown below.

mean(mod3_csim[,7] > mod3_csim[,8])
## [1] 0.7545333
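
The same recipe works for any event defined in terms of the coefficients. For instance, the posterior probability that the (standardized) Glucose coefficient b[2] exceeds 1 is just the corresponding sample proportion:

mean(mod3_csim[, "b[2]"] > 1) # P(b_Glucose > 1 | data)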

Posterior Predictive Distribution

  • Now, let's consider two new patients with the diagnostic measurements listed below.
  • First we need to scale the measurements, subtracting the mean and dividing by the s.d. of the respective predictor (computed from the training data).
  • Next, we use the MCMC samples to construct the posterior distribution of the diabetes probability for patient 1 and build a 95% credible interval with quantile: with 95% posterior probability, the probability that the first patient is diabetic lies in the interval computed and visualized in the next code snippet.
  • Finally, we compare the two patients by drawing outcomes from their posterior predictive distributions and computing how often the second patient's sampled outcome is diabetic while the first's is not, as shown below.
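
Formally, each MCMC draw (β0^(s), β^(s)) of the coefficients yields one plausible value of a new patient's diabetes probability,

p^(s) = 1 / (1 + exp(-(β0^(s) + x*' β^(s)))),

where x* holds the patient's standardized measurements. The collection {p^(s)} approximates the posterior distribution of that probability, and additionally drawing y^(s) ~ Bernoulli(p^(s)) gives the posterior predictive distribution of the outcome itself; both are computed below.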
p1 <- matrix(c(0,300,80,30,100,33,0.3,40), ncol=1)
p1df <- as.data.frame(t(p1))
colnames(p1df) <- names(dat)[-ncol(dat)]
p2 <- matrix(c(0,250,82,29,90,35,0.25,50), ncol=1)
p2df <- as.data.frame(t(p2))
colnames(p2df) <- names(dat)[-ncol(dat)]
pdf <- rbind(p1df, p2df)
rownames(pdf) <- c('patient 1', 'patient 2')
head(pdf) 
##           Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## patient 1           0     300            80            30     100  33
## patient 2           0     250            82            29      90  35
##           DiabetesPedigreeFunction Age
## patient 1                     0.30  40
## patient 2                     0.25  50
mu <- colMeans(dat[,-ncol(dat)])
sd <- apply(dat[,-ncol(dat)], 2, sd)
p1 <- (p1 - mu) / sd # z-score normalize
p2 <- (p2 - mu) / sd
p1_Xb = mod3_csim[,"int"] + mod3_csim[,1:8] %*% p1
phat1 = 1.0 / (1.0 + exp(-p1_Xb))
head(phat1)
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1 
## End = 7 
## Thinning interval = 1 
##           [,1]
## [1,] 0.9901673
## [2,] 0.9979442
## [3,] 0.9985275
## [4,] 0.9920615
## [5,] 0.9856507
## [6,] 0.9967362
## [7,] 0.9963923
p2_Xb = mod3_csim[,"int"] + mod3_csim[,1:8] %*% p2
phat2 = 1.0 / (1.0 + exp(-p2_Xb))
ci_95 <- quantile(phat1, probs = c(0.025, 0.975))
ci_95
##      2.5%     97.5% 
## 0.9792472 0.9986913
post_pred_df <- data.frame(pred=phat1)
ggplot(post_pred_df, aes(x = pred)) + 
    xlab('probability of being diabetic (Outcome=1)') + 
    geom_density(color='darkblue', fill = 'lightblue') + 
    geom_vline(xintercept = ci_95, color = "red") +
    theme_bw() + 
    ggtitle('Posterior Predictive distribution 95% Credible Interval for patient 1')

post_pred_df$patient <- 1
post_pred_df <- rbind(post_pred_df, data.frame(pred=phat2, patient=rep(2, length(phat2))))
post_pred_df$patient <- as.factor(post_pred_df$patient)
ggplot(post_pred_df, aes(pred, fill=patient)) + 
    geom_density(alpha=0.5) + 
    xlab('probability of being diabetic (Outcome=1)') + 
    theme_bw() + 
    ggtitle('Prediction with MCMC samples from posterior distribution of coefficients')

# alternative: mean(phat2 > phat1) would compare the two probability distributions directly

set.seed(92)

y1 <- rbinom(nrow(mod3_csim), 1, 1/(1+exp(-mod3_csim[,"int"] - mod3_csim[,1:8] %*% p1))) # one Bernoulli outcome draw per posterior sample
y2 <- rbinom(nrow(mod3_csim), 1, 1/(1+exp(-mod3_csim[,"int"] - mod3_csim[,1:8] %*% p2)))
mean(y2 > y1)
## [1] 0.007466667
post_pred_df <- data.frame(pred=c(y1, y2), patient=c(rep('1', length(y1)), rep('2', length(y2))))
ggplot(post_pred_df, aes(pred, fill=patient)) + 
    geom_density(alpha=0.5) + 
    xlab('sampled Outcome (1 = diabetic)') + 
    theme_bw() + 
    ggtitle('Posterior Predictive distribution for patient 1 and patient 2')

Conclusions

Using Bayesian logistic regression, we could not only compute point estimates for the coefficients and use them to predict the outcome; the main strength of the approach is quantifying uncertainty, i.e., how confident we are in the estimates and in the predictions (via the posterior predictive distribution). We saw that for all predictor variables other than Insulin and BloodPressure, an increase in value increases the log odds of diabetes. The model even allows us to make probabilistic statements comparing two patients' probabilities of being diabetic, given the values of their diagnostic measurements.

Improvements

The following can be tried to improve the model: