Bayesian methods have been widely used for classification problems in machine learning. This study employs a Bayesian approach to predict diabetes in a well-known dataset.
Data for this study are taken from Kaggle: https://www.kaggle.com/uciml/pima-indians-diabetes-database
The data were collected by the National Institute of Diabetes and Digestive and Kidney Diseases. The study participants are all females of Pima Indian heritage, at least 21 years old. The dataset consists of certain diagnostic measurements on 768 females. The variable “Outcome” is the response variable, coded as “0” for the absence of diabetes and “1” for its presence. Missing values for predictors were coded as “0” in the dataset. For this study, the missing values for all variables except “Pregnancies” were imputed with the median of each variable.
Binary logistic regression models are developed to classify the outcome into its two categories. In a logistic regression model with p predictors, the probability of occurrence of the outcome is given as \[ p_i = P(Y_i = 1) = E(Y_i) = \frac{\exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)} \] where \(i = 1, 2, \dots, n\) indexes the subjects in the study.
Equivalently, \[ p_i = E(Y_i) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p))} \] Since the response variable \(Y_i\) takes binary values, we model it as following a Bernoulli distribution, which gives the likelihood function for the model:
\(Y_{i}|p_i \sim {\sf Bernoulli}(p_i)\)
where \(p_i\) is as defined in the equation above.
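To make the link function concrete, here is a minimal sketch (an illustration only, not part of the analysis below) of the inverse-logit mapping from a linear predictor to a probability; the same expression reappears later when computing predicted probabilities.
inv_logit = function(eta) 1 / (1 + exp(-eta)) # inverse of the logit link
inv_logit(c(-2, 0, 2)) # linear predictors of -2, 0 and 2 map to probabilities of roughly 0.12, 0.50 and 0.88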
Prior Distribution on Coefficients: We take two different priors and compare their performance. First, a non-informative normal prior is used; this serves as a default vague prior. Next, we take the Cauchy prior suggested by Gelman et al. (2008). More on these priors is presented later.
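To get a visual sense of how the two priors differ, the sketch below (illustration only) overlays the density of a vague normal prior with variance 100 against a Cauchy prior with scale 2.5, the scale used later for the coefficients; the Cauchy is more concentrated near zero but has much heavier tails.
x = seq(-15, 15, length.out = 500)
plot(x, dnorm(x, mean = 0, sd = 10), type = "l", ylim = c(0, 0.13), xlab = "coefficient value", ylab = "density") # vague Normal(0, 100) prior
lines(x, dcauchy(x, location = 0, scale = 2.5), lty = 2) # weakly informative Cauchy(0, 2.5) prior
legend("topright", c("Normal(0, 100)", "Cauchy(0, 2.5)"), lty = 1:2)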
Inferences from Bayesian analysis are summarized in the form of the posterior distribution, which updates our prior belief and describes what is known after the data have been observed.
setwd("U:/PRJ/Bayesian")
diab = read.csv("diabetes.csv")
head(diab)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
# A value of 0 in the dataset indicates a missing value. Impute missing values with the median for all predictors except "Pregnancies".
myfun = function(x){
  ifelse(x == 0, median(x), x) # replace zeros with the column median (computed over all values, including the zeros)
}
imp.median <- data.frame(
sapply(diab[,-c(1,9)],myfun))
finaldata = cbind(diab[,c(9,1)],imp.median)
head(finaldata)
## Outcome Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 1 6 148 72 35 30.5 33.6
## 2 0 1 85 66 29 30.5 26.6
## 3 1 8 183 64 23 30.5 23.3
## 4 0 1 89 66 23 94.0 28.1
## 5 1 0 137 40 35 168.0 43.1
## 6 0 5 116 74 23 30.5 25.6
## DiabetesPedigreeFunction Age
## 1 0.627 50
## 2 0.351 31
## 3 0.672 32
## 4 0.167 21
## 5 2.288 33
## 6 0.201 30
Next, we split the data into training and testing sets. Models will be built on training data and their performances will be evaluated on testing data.
set.seed(111)
train = sample(1:nrow(finaldata),nrow(finaldata)*0.6) #60% data on training set
finaldata.train = finaldata[train,]
finaldata.test = finaldata[-train,]
Let’s see how correlated the different predictors are. High correlation between predictors is problematic if our goal is inference; if the goal is prediction, it is less of a concern.
library(corrplot)
Cor = cor(finaldata.train[,-1])
corrplot(Cor, type="upper", method="number", tl.pos="d")
Pregnancies and Age seem to have a moderate positive correlation, and the same is true of SkinThickness and BMI.
The predictors are measured in different units, so we scale them to make the posterior results comparable later on. The Cauchy prior on the regression coefficients requires the predictors to be scaled to have mean 0 and standard deviation 0.5. For consistency, we use the same scaling with the default normal prior.
X1 = scale(finaldata.train[,-1]) * 0.5 # center each predictor to mean 0 and rescale to standard deviation 0.5
colMeans(X1)
## Pregnancies Glucose BloodPressure
## -5.053325e-18 4.132602e-17 2.406891e-16
## SkinThickness Insulin BMI
## 9.184229e-17 3.397041e-17 -2.307060e-16
## DiabetesPedigreeFunction Age
## 6.050791e-18 -9.270211e-17
apply(X1,2,sd)
## Pregnancies Glucose BloodPressure
## 0.5 0.5 0.5
## SkinThickness Insulin BMI
## 0.5 0.5 0.5
## DiabetesPedigreeFunction Age
## 0.5 0.5
The models are developed using JAGS, which stands for Just Another Gibbs Sampler. JAGS’s functionality can be accessed from R through the “rjags” package. The first step is to specify the model. As discussed previously, the first prior on the intercept and the other regression coefficients is a normal prior with mean 0 and variance 100. These values make the prior non-informative, so the inferences are driven mainly by the data. JAGS requires the model to be written as a string.
library(rjags)
mod1_string = " model {
for (i in 1:length(y)) {
y[i] ~ dbern(p[i]) # Likelihood portion
logit(p[i]) = int + b[1]*Pregnancies[i] + b[2]*Glucose[i] + b[3]*BloodPressure[i] + b[4]*SkinThickness[i] + b[5]*Insulin[i] + b[6]*BMI[i] + b[7]*DiabetesPedigreeFunction[i] + b[8]*Age[i]
}
int ~ dnorm(0.0, 1.0/100.0) # Normal prior with mean 0 and variance 100 (equivalently, precision of 1/100).
for (j in 1:8) {
b[j] ~ dnorm(0.0, 1.0/100.0)
}
} "
The second step is to set up the model and pass the data as a named list.
set.seed(33)
data_jags = list(y=finaldata.train$Outcome, Pregnancies=X1[,"Pregnancies"], Glucose=X1[,"Glucose"], BloodPressure=X1[,"BloodPressure"], SkinThickness=X1[,"SkinThickness"], Insulin=X1[,"Insulin"], BMI=X1[,"BMI"], DiabetesPedigreeFunction=X1[,"DiabetesPedigreeFunction"], Age=X1[,"Age"])
# Parameters we may want to monitor are intercept and the other coefficients.
params = c("int", "b")
# Specify the model itself.
mod1 = jags.model(textConnection(mod1_string), data=data_jags, n.chains=3) # Run three different chains with different starting values
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 460
## Unobserved stochastic nodes: 9
## Total graph size: 6043
##
## Initializing model
# Give a burn-in period of 1000 iterations. Samples are not kept for first 1000 iterations.
update(mod1, 1e3)
# Actual posterior simulations we will keep. Run 5000 iterations.
mod1_sim = coda.samples(model=mod1, variable.names=params, n.iter=5e3) # Simulations are stored as matrices.
# Combine results from 3 different chains into one by stacking matrices that contain simulations.
mod1_csim = as.mcmc(do.call(rbind, mod1_sim))
Let’s perform some convergence diagnostics for the Markov Chains.
# Convergence diagnostics. Start with trace plots.
# plot(mod1_sim, ask=TRUE) # Trace plots (not displayed here) use different colors for the different chains; they look random with no trend, which is desirable.
gelman.diag(mod1_sim) # Potential scale reduction factors of 1 indicate models have probably converged.
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1 1
## b[2] 1 1
## b[3] 1 1
## b[4] 1 1
## b[5] 1 1
## b[6] 1 1
## b[7] 1 1
## b[8] 1 1
## int 1 1
##
## Multivariate psrf
##
## 1
autocorr.plot(mod1_sim) # Autocorrelation drops to near zero within the first 5 lags; no autocorrelation issue in the estimated coefficients.
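As an additional check (a supplementary diagnostic, not in the original write-up), the effective sample size from the coda package should be reasonably large relative to the 15,000 retained draws.
effectiveSize(mod1_sim) # effective number of independent draws per parameter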
Let’s also see the model summaries.
summary(mod1_sim)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 0.66637 0.2905 0.002372 0.004107
## b[2] 2.37000 0.3138 0.002562 0.003958
## b[3] -0.28632 0.2688 0.002195 0.003402
## b[4] 0.02945 0.2946 0.002405 0.003984
## b[5] 0.02182 0.2641 0.002157 0.003261
## b[6] 1.54258 0.3254 0.002657 0.004849
## b[7] 0.42220 0.2577 0.002104 0.002718
## b[8] 0.32579 0.3099 0.002531 0.004777
## int -0.96787 0.1323 0.001080 0.001572
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 0.1025 0.4650 0.66235 0.8619 1.2445
## b[2] 1.7762 2.1548 2.36596 2.5775 2.9960
## b[3] -0.8167 -0.4679 -0.28676 -0.1011 0.2427
## b[4] -0.5462 -0.1693 0.03121 0.2299 0.6036
## b[5] -0.4902 -0.1564 0.02000 0.2002 0.5426
## b[6] 0.9078 1.3215 1.53945 1.7619 2.1897
## b[7] -0.0813 0.2481 0.42273 0.5949 0.9258
## b[8] -0.2959 0.1181 0.32680 0.5367 0.9225
## int -1.2349 -1.0544 -0.96722 -0.8795 -0.7103
If we have the regression coefficients and the predictor values, we can plug them into the second equation for \(p_i\) above to get an estimate of the probability that the Outcome = 1.
# Extract posterior mean of coefficients
pm_coef = colMeans(mod1_csim)
# The matrix multiplication below gives the linear predictor (the term inside the exponent in the equation above), which is then used to find the estimated probabilities.
pm_Xb = pm_coef["int"] + X1[,c(1,2,3,4,5,6,7,8)] %*% pm_coef[1:8] # Intercept + Design Matrix * Coefficients
phat = 1.0 / (1.0 + exp(-pm_Xb)) # Predicted probability that Outcome = 1 for each observation
Plotting the predicted probabilities against the actual outcome values gives a rough idea of how successful the model is on the training dataset.
plot(phat, jitter(finaldata.train$Outcome))
Looks okay. Observations assigned low predicted probabilities by the model were usually actual 0s (no diabetes) in the dataset. It will be more interesting to see this result on the test dataset.
Let’s select 0.5 as the cut-off: probabilities greater than 0.5 will be labeled ‘1’ (presence of diabetes) and those below 0.5 will be labeled ‘0’ (absence of diabetes).
(tab0.5 = table(phat > 0.5, finaldata.train$Outcome))
##
## 0 1
## FALSE 268 69
## TRUE 36 87
sum(diag(tab0.5)) / sum(tab0.5) # Correct classification rate in the training dataset
## [1] 0.7717391
Now, let’s see the model’s performance in the test dataset. Again, we start by standardizing the data.
X2 = scale(finaldata.test[,-1]) * 0.5 # scale the test predictors to mean 0 and standard deviation 0.5
Now, using the coefficients obtained, let’s find the predicted probabilities of individual observations in the test dataset.
pm_coef = colMeans(mod1_csim)
pm_Xb = pm_coef["int"] + X2[,c(1,2,3,4,5,6,7,8)] %*% pm_coef[1:8] # Intercept + Design Matrix*Coefficients
phat = 1.0 / (1.0 + exp(-pm_Xb))
Model performance in the test dataset.
plot(phat, jitter(finaldata.test$Outcome))
(tab0.5 = table(phat > 0.5, finaldata.test$Outcome))
##
## 0 1
## FALSE 174 48
## TRUE 22 64
sum(diag(tab0.5)) / sum(tab0.5) # Correct classification rate in the testing dataset
## [1] 0.7727273
Similar accuracy to the training data is found. Next, we proceed with a different prior and compare the performance of the two models. This time we use the weakly informative Cauchy priors for the coefficients. As recommended by Gelman et al. (2008), the data are first standardized so that all continuous variables have mean 0 and standard deviation 0.5. The intercept is then given a Cauchy prior distribution centered at 0 with scale 10, and the other coefficients get a scale of 2.5. Again, let’s set up, specify and run the model on the training data.
mod2_string = " model {
for (i in 1:length(y)) {
y[i] ~ dbern(p[i])
logit(p[i]) = int + b[1]*Pregnancies[i] + b[2]*Glucose[i] + b[3]*BloodPressure[i] + b[4]*SkinThickness[i] + b[5]*Insulin[i] + b[6]*BMI[i] + b[7]*DiabetesPedigreeFunction[i] + b[8]*Age[i]
}
int ~ dt(0, 1/10^2, 1) # t prior with mean 0 and scale 10.This is weakly informative chaucy prior.
for (j in 1:8) {
b[j] ~ dt(0, 1/2.5^2, 1) # t prior with mean 0 and scale 2.5
}
}"
Specify the parameters and run MCMC.
set.seed(44)
data_jags = list(y=finaldata.train$Outcome, Pregnancies=X1[,"Pregnancies"], Glucose=X1[,"Glucose"], BloodPressure=X1[,"BloodPressure"], SkinThickness=X1[,"SkinThickness"], Insulin=X1[,"Insulin"], BMI=X1[,"BMI"], DiabetesPedigreeFunction=X1[,"DiabetesPedigreeFunction"], Age=X1[,"Age"])
# Parameters we may want to monitor are intercept and the other coefficients.
params = c("int", "b")
# Specify the model itself.
mod2 = jags.model(textConnection(mod2_string), data=data_jags, n.chains=3) # Run three different chains with different starting values
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 460
## Unobserved stochastic nodes: 9
## Total graph size: 6048
##
## Initializing model
# Give a burn-in period of 1000 iterations. Samples are not kept for first 1000 iterations.
update(mod2, 1e3)
# Actual posterior simulations we will keep. Run 5000 iterations.
mod2_sim = coda.samples(model=mod2, variable.names=params, n.iter=5e3) # Simulations are stored as matrices.
# Combine results from 3 different chains into one by stacking matrices that contain simulations.
mod2_csim = as.mcmc(do.call(rbind, mod2_sim))
Again, let’s perform some convergence diagnostics.
# Convergence diagnostics. Start with trace plots.
# plot(mod2_sim, ask=TRUE) # Trace plots (not displayed here) use different colors for the different chains; they look random with no trend, which is desirable.
gelman.diag(mod2_sim) # Potential scale reduction factors of 1 indicate models have probably converged.
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1 1.00
## b[2] 1 1.00
## b[3] 1 1.00
## b[4] 1 1.00
## b[5] 1 1.00
## b[6] 1 1.01
## b[7] 1 1.00
## b[8] 1 1.00
## int 1 1.00
##
## Multivariate psrf
##
## 1
autocorr.plot(mod2_sim) # Autocorrelation drops to near zero within the first 5 lags; no autocorrelation issue in the estimated coefficients.
Model summary again.
summary(mod2_sim)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 0.64794 0.2818 0.002301 0.004131
## b[2] 2.32435 0.3101 0.002532 0.004035
## b[3] -0.26794 0.2688 0.002194 0.003510
## b[4] 0.04187 0.2861 0.002336 0.003859
## b[5] 0.03258 0.2538 0.002073 0.003206
## b[6] 1.49709 0.3195 0.002608 0.004706
## b[7] 0.41780 0.2527 0.002063 0.002779
## b[8] 0.32645 0.2994 0.002444 0.004907
## int -0.95866 0.1302 0.001063 0.001492
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 0.09067 0.4580 0.65019 0.83520 1.1969
## b[2] 1.72869 2.1146 2.31823 2.52965 2.9435
## b[3] -0.79936 -0.4477 -0.26807 -0.08698 0.2546
## b[4] -0.51566 -0.1497 0.04230 0.23258 0.6043
## b[5] -0.46334 -0.1410 0.03412 0.20744 0.5268
## b[6] 0.87799 1.2813 1.49680 1.70650 2.1381
## b[7] -0.07735 0.2475 0.41539 0.58605 0.9168
## b[8] -0.25786 0.1247 0.32847 0.52802 0.9107
## int -1.21672 -1.0472 -0.95731 -0.86839 -0.7062
# Extract posterior mean of coefficients
pm_coef2 = colMeans(mod2_csim)
# The matrix multiplication below gives the linear predictor, which is then used to find the estimated probabilities.
pm_Xb2 = pm_coef2["int"] + X1[,c(1,2,3,4,5,6,7,8)] %*% pm_coef2[1:8] # Intercept + Design Matrix * Coefficients
phat2 = 1.0 / (1.0 + exp(-pm_Xb2)) # Predicted probability that Outcome = 1 for each observation
plot(phat2, jitter(finaldata.train$Outcome))
(tab0.5 = table(phat2 > 0.5, finaldata.train$Outcome))
##
## 0 1
## FALSE 268 69
## TRUE 36 87
sum(diag(tab0.5)) / sum(tab0.5) # Correct classification rate in the training dataset
## [1] 0.7717391
Now, let’s see the performance on the test dataset.
pm_coef2 = colMeans(mod2_csim)
pm_Xb2 = pm_coef2["int"] + X2[,c(1,2,3,4,5,6,7,8)] %*% pm_coef2[1:8] # Intercept + Design Matrix*Coefficients
phat2 = 1.0 / (1.0 + exp(-pm_Xb2))
plot(phat2, jitter(finaldata.test$Outcome))
(tab0.5 = table(phat2 > 0.5, finaldata.test$Outcome))
##
## 0 1
## FALSE 174 48
## TRUE 22 64
sum(diag(tab0.5)) / sum(tab0.5) # Correct classification rate in the testing dataset
## [1] 0.7727273
Both priors lead to almost identical results, so in this case the posterior analysis is driven largely by the likelihood rather than the prior specification. In general, when n >> p, we do not expect much difference between results from different reasonable priors. With both priors, we achieved slightly higher than 77% classification accuracy on the test dataset.
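As a quick numeric check of this claim, the posterior means under the two priors can be placed side by side (an illustrative snippet); the two columns are nearly identical.
round(cbind(normal = colMeans(mod1_csim), cauchy = colMeans(mod2_csim)), 3) # posterior means under the normal and Cauchy priors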
[1] Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. (2008). “A weakly informative default prior distribution for logistic and other regression models.” The Annals of Applied Statistics, 2(4), 1360-1383.
[2] Heiner, M. Bayesian Statistics: Techniques and Models. Coursera. https://www.coursera.org/learn/mcmc-bayesian-statistics