Bayesian methods have been widely used for classification problems in machine learning. This study employs a Bayesian approach to predict diabetes in a well-known dataset.
Data for this study are taken from Kaggle: https://www.kaggle.com/uciml/pima-indians-diabetes-database
The data were collected by the National Institute of Diabetes and Digestive and Kidney Diseases. The study participants are all females of Pima Indian heritage, at least 21 years old. The dataset consists of certain diagnostic measurements on 768 females. The variable “Outcome” is the response variable, coded as “0” for the absence of diabetes and “1” for its presence. Missing values for predictors were coded as “0” in the dataset. For this study, the missing values for all variables except “Pregnancies” were imputed with the median of each variable.
Binary logistic regression models are developed to classify the outcome into its two categories. In a logistic regression model with p predictors, the probability of occurrence of the outcome is given as \[ p_i = P(Y_i = 1) = E(Y_i) = \frac{\exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)} \] where \(i = 1, 2, \dots, n\) indexes the subjects in the study.
Equivalently, \[ p_i = E(Y_i) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p))} \] Since the response variable \(Y_i\) takes binary values, we model it as following a Bernoulli distribution, which gives the likelihood function for the model:
\(Y_{i}|p_i \sim {\sf Bernoulli}(p_i)\)
where \(p_i\) is as defined in the equation above.
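To make the link function concrete, here is a minimal sketch (an illustration only, not part of the analysis below) of the inverse-logit mapping from a linear predictor to a probability; the same expression reappears later when computing predicted probabilities.
inv_logit = function(eta) 1 / (1 + exp(-eta)) # inverse of the logit link
inv_logit(c(-2, 0, 2)) # linear predictors of -2, 0 and 2 map to probabilities of roughly 0.12, 0.50 and 0.88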
Prior Distribution on Coefficients: We take two different priors and compare their performance. First, a non-informative normal prior is used; this serves as a default vague prior. Next, we take the Cauchy prior suggested by Gelman et al. (2008). More on these priors is presented later.
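To get a visual sense of how the two priors differ, the sketch below (illustration only) overlays the density of a vague normal prior with variance 100 against a Cauchy prior with scale 2.5, the scale used later for the coefficients; the Cauchy is more concentrated near zero but has much heavier tails.
x = seq(-15, 15, length.out = 500)
plot(x, dnorm(x, mean = 0, sd = 10), type = "l", ylim = c(0, 0.13), xlab = "coefficient value", ylab = "density") # vague Normal(0, 100) prior
lines(x, dcauchy(x, location = 0, scale = 2.5), lty = 2) # weakly informative Cauchy(0, 2.5) prior
legend("topright", c("Normal(0, 100)", "Cauchy(0, 2.5)"), lty = 1:2)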
Inferences from Bayesian analysis are summarized in the form of the posterior distribution, which updates our prior belief and describes what is known after the data have been observed.
setwd("U:/PRJ/Bayesian")
diab = read.csv("diabetes.csv")
head(diab)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
# A value of 0 in the dataset indicates a missing value. Impute missing values with the median for all predictors except "Pregnancies".
myfun = function(x){
  ifelse(x == 0, median(x), x) # replace zeros with the column median (computed over all values, including the zeros)
}
imp.median <- data.frame(
sapply(diab[,-c(1,9)],myfun))
finaldata = cbind(diab[,c(9,1)],imp.median)
head(finaldata)
## Outcome Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 1 6 148 72 35 30.5 33.6
## 2 0 1 85 66 29 30.5 26.6
## 3 1 8 183 64 23 30.5 23.3
## 4 0 1 89 66 23 94.0 28.1
## 5 1 0 137 40 35 168.0 43.1
## 6 0 5 116 74 23 30.5 25.6
## DiabetesPedigreeFunction Age
## 1 0.627 50
## 2 0.351 31
## 3 0.672 32
## 4 0.167 21
## 5 2.288 33
## 6 0.201 30
Next, we split the data into training and testing sets. Models will be built on training data and their performances will be evaluated on testing data.
set.seed(111)
train = sample(1:nrow(finaldata),nrow(finaldata)*0.6) #60% data on training set
finaldata.train = finaldata[train,]
finaldata.test = finaldata[-train,]
Let’s see how correlated the different predictors are. High correlation between predictors is problematic if our goal is inference; if the goal is prediction, it is less of a concern.
library(corrplot)
Cor = cor(finaldata.train[,-1])
corrplot(Cor, type="upper", method="number", tl.pos="d")
Pregnancies and Age seem to have a moderate positive correlation, and the same is true of SkinThickness and BMI.
The predictors are measured in different units, so we scale them to make the posterior results comparable later on. The Cauchy prior on the regression coefficients requires the predictors to be scaled to have mean 0 and standard deviation 0.5. For consistency, we use the same scaling with the default normal prior.
X1 = scale(finaldata.train[,-1]) * 0.5 # center each predictor to mean 0 and rescale to standard deviation 0.5
colMeans(X1)
## Pregnancies Glucose BloodPressure
## -5.053325e-18 4.132602e-17 2.406891e-16
## SkinThickness Insulin BMI
## 9.184229e-17 3.397041e-17 -2.307060e-16
## DiabetesPedigreeFunction Age
## 6.050791e-18 -9.270211e-17
apply(X1,2,sd)
## Pregnancies Glucose BloodPressure
## 0.5 0.5 0.5
## SkinThickness Insulin BMI
## 0.5 0.5 0.5
## DiabetesPedigreeFunction Age
## 0.5 0.5
The models are developed using JAGS, which stands for Just Another Gibbs Sampler. JAGS’s functionality can be accessed from R through the “rjags” package. The first step is to specify the model. As discussed previously, the first prior on the intercept and the other regression coefficients is a normal prior with mean 0 and variance 100. These values make the prior non-informative, so the inferences are driven mainly by the data. JAGS requires the model to be written as a string.
library(rjags)
mod1_string = " model {
for (i in 1:length(y)) {
y[i] ~ dbern(p[i]) # Likelihood portion
logit(p[i]) = int + b[1]*Pregnancies[i] + b[2]*Glucose[i] + b[3]*BloodPressure[i] + b[4]*SkinThickness[i] + b[5]*Insulin[i] + b[6]*BMI[i] + b[7]*DiabetesPedigreeFunction[i] + b[8]*Age[i]
}
int ~ dnorm(0.0, 1.0/100.0) # Normal prior with mean 0 and variance 100 (equivalently, precision of 1/100).
for (j in 1:8) {
b[j] ~ dnorm(0.0, 1.0/100.0)
}
} "
The second step is to set up the model and pass the data as a named list.
set.seed(33)
data_jags = list(y=finaldata.train$Outcome, Pregnancies=X1[,"Pregnancies"], Glucose=X1[,"Glucose"], BloodPressure=X1[,"BloodPressure"], SkinThickness=X1[,"SkinThickness"], Insulin=X1[,"Insulin"], BMI=X1[,"BMI"], DiabetesPedigreeFunction=X1[,"DiabetesPedigreeFunction"], Age=X1[,"Age"])
# Parameters we may want to monitor are intercept and the other coefficients.
params = c("int", "b")
# Specify the model itself.
mod1 = jags.model(textConnection(mod1_string), data=data_jags, n.chains=3) # Run three different chains with different starting values
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 460
## Unobserved stochastic nodes: 9
## Total graph size: 6043
##
## Initializing model
# Give a burn-in period of 1000 iterations. Samples are not kept for first 1000 iterations.
update(mod1, 1e3)
# Actual posterior simulations we will keep. Run 5000 iterations.
mod1_sim = coda.samples(model=mod1, variable.names=params, n.iter=5e3) # Simulations are stored as matrices.
# Combine results from 3 different chains into one by stacking matrices that contain simulations.
mod1_csim = as.mcmc(do.call(rbind, mod1_sim))
Let’s perform some convergence diagnostics for the Markov Chains.
# Convergence diagnostics. Start with trace plots.
# plot(mod1_sim, ask=TRUE) # Trace plots (not displayed here) use different colors for the different chains; they look random with no trend, which is desirable.
gelman.diag(mod1_sim) # Potential scale reduction factors of 1 indicate models have probably converged.
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1 1
## b[2] 1 1
## b[3] 1 1
## b[4] 1 1
## b[5] 1 1
## b[6] 1 1
## b[7] 1 1
## b[8] 1 1
## int 1 1
##
## Multivariate psrf
##
## 1
autocorr.plot(mod1_sim) # Autocorrelation drops to near zero within the first 5 lags; no autocorrelation issue in the estimated coefficients.
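As an additional check (a supplementary diagnostic, not in the original write-up), the effective sample size from the coda package should be reasonably large relative to the 15,000 retained draws.
effectiveSize(mod1_sim) # effective number of independent draws per parameter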
Let’s also see the model summaries.
summary(mod1_sim)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 0.66637 0.2905 0.002372 0.004107
## b[2] 2.37000 0.3138 0.002562 0.003958
## b[3] -0.28632 0.2688 0.002195 0.003402
## b[4] 0.02945 0.2946 0.002405 0.003984
## b[5] 0.02182 0.2641 0.002157 0.003261
## b[6] 1.54258 0.3254 0.002657 0.004849
## b[7] 0.42220 0.2577 0.002104 0.002718
## b[8] 0.32579 0.3099 0.002531 0.004777
## int -0.96787 0.1323 0.001080 0.001572
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 0.1025 0.4650 0.66235 0.8619 1.2445
## b[2] 1.7762 2.1548 2.36596 2.5775 2.9960
## b[3] -0.8167 -0.4679 -0.28676 -0.1011 0.2427
## b[4] -0.5462 -0.1693 0.03121 0.2299 0.6036
## b[5] -0.4902 -0.1564 0.02000 0.2002 0.5426
## b[6] 0.9078 1.3215 1.53945 1.7619 2.1897
## b[7] -0.0813 0.2481 0.42273 0.5949 0.9258
## b[8] -0.2959 0.1181 0.32680 0.5367 0.9225
## int -1.2349 -1.0544 -0.96722 -0.8795 -0.7103
If we have the regression coefficients and the predictor values, we can plug them into the second equation for \(p_i\) above to get an estimate of the probability that the Outcome = 1.
# Extract posterior mean of coefficients
pm_coef = colMeans(mod1_csim)
# The matrix multiplication below gives the linear predictor (the term inside the exponent in the equation above), which is then used to find the estimated probabilities.
pm_Xb = pm_coef["int"] + X1[,c(1,2,3,4,5,6,7,8)] %*% pm_coef[1:8] # Intercept + Design Matrix * Coefficients
phat = 1.0 / (1.0 + exp(-pm_Xb)) # Predicted probability that Outcome = 1 for each observation
Plotting the predicted probabilities against the actual outcome values gives a rough idea of how successful the model is on the training dataset.
plot(phat, jitter(finaldata.train$Outcome))
Looks okay. Observations assigned low predicted probabilities by the model were usually actual 0s (no diabetes) in the dataset. It will be more interesting to see this result on the test dataset.
Let’s select 0.5 as the cut-off: probabilities greater than 0.5 will be labeled ‘1’ (presence of diabetes) and those below 0.5 will be labeled ‘0’ (absence of diabetes).
(tab0.5 = table(phat > 0.5, finaldata.train$Outcome))
##
## 0 1
## FALSE 268 69
## TRUE 36 87
sum(diag(tab0.5)) / sum(tab0.5) # Correct classification rate in the training dataset
## [1] 0.7717391
Now, let’s see the model’s performance in the test dataset. Again, we start by standardizing the data.
X2 = scale(finaldata.test[,-1]) * 0.5 # scale the test predictors to mean 0 and standard deviation 0.5
Now, using the coefficients obtained, let’s find the predicted probabilities of individual observations in the test dataset.
pm_coef = colMeans(mod1_csim)
pm_Xb = pm_coef["int"] + X2[,c(1,2,3,4,5,6,7,8)] %*% pm_coef[1:8] # Intercept + Design Matrix*Coefficients
phat = 1.0 / (1.0 + exp(-pm_Xb))
Model performance in the test dataset.
plot(phat, jitter(finaldata.test$Outcome))
(tab0.5 = table(phat > 0.5, finaldata.test$Outcome))
##
## 0 1
## FALSE 174 48
## TRUE 22 64
sum(diag(tab0.5)) / sum(tab0.5) # Correct classification rate in the testing dataset
## [1] 0.7727273
Similar accuracy to the training data is found. Next, we proceed with a different prior and compare the performance of the two models. This time we use the weakly informative Cauchy priors for the coefficients. As recommended by Gelman et al. (2008), the data are first standardized so that all continuous variables have mean 0 and standard deviation 0.5. The intercept is then given a Cauchy prior distribution centered at 0 with scale 10, and the other coefficients get a scale of 2.5. Again, let’s set up, specify and run the model on the training data.
mod2_string = " model {
for (i in 1:length(y)) {
y[i] ~ dbern(p[i])
logit(p[i]) = int + b[1]*Pregnancies[i] + b[2]*Glucose[i] + b[3]*BloodPressure[i] + b[4]*SkinThickness[i] + b[5]*Insulin[i] + b[6]*BMI[i] + b[7]*DiabetesPedigreeFunction[i] + b[8]*Age[i]
}
int ~ dt(0, 1/10^2, 1) # t prior with mean 0 and scale 10.This is weakly informative chaucy prior.
for (j in 1:8) {
b[j] ~ dt(0, 1/2.5^2, 1) # t prior with mean 0 and scale 2.5
}
}"
Specify the parameters and run MCMC.
set.seed(44)
data_jags = list(y=finaldata.train$Outcome, Pregnancies=X1[,"Pregnancies"], Glucose=X1[,"Glucose"], BloodPressure=X1[,"BloodPressure"], SkinThickness=X1[,"SkinThickness"], Insulin=X1[,"Insulin"], BMI=X1[,"BMI"], DiabetesPedigreeFunction=X1[,"DiabetesPedigreeFunction"], Age=X1[,"Age"])
# Parameters we may want to monitor are intercept and the other coefficients.
params = c("int", "b")
# Specify the model itself.
mod2 = jags.model(textConnection(mod2_string), data=data_jags, n.chains=3) # Run three different chains with different starting values
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 460
## Unobserved stochastic nodes: 9
## Total graph size: 6048
##
## Initializing model
# Give a burn-in period of 1000 iterations. Samples are not kept for first 1000 iterations.
update(mod2, 1e3)
# Actual posterior simulations we will keep. Run 5000 iterations.
mod2_sim = coda.samples(model=mod2, variable.names=params, n.iter=5e3) # Simulations are stored as matrices.
# Combine results from 3 different chains into one by stacking matrices that contain simulations.
mod2_csim = as.mcmc(do.call(rbind, mod2_sim))
Again, let’s perform some convergence diagnostics.
# Convergence diagnostics. Start with trace plots.
# plot(mod2_sim, ask=TRUE) # Trace plots (not displayed here) use different colors for the different chains; they look random with no trend, which is desirable.
gelman.diag(mod2_sim) # Potential scale reduction factors of 1 indicate models have probably converged.
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1 1.00
## b[2] 1 1.00
## b[3] 1 1.00
## b[4] 1 1.00
## b[5] 1 1.00
## b[6] 1 1.01
## b[7] 1 1.00
## b[8] 1 1.00
## int 1 1.00
##
## Multivariate psrf
##
## 1
autocorr.plot(mod2_sim) # Autocorrelation drops to near zero within the first 5 lags; no autocorrelation issue in the estimated coefficients.
Model summary again.
summary(mod2_sim)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 0.64794 0.2818 0.002301 0.004131
## b[2] 2.32435 0.3101 0.002532 0.004035
## b[3] -0.26794 0.2688 0.002194 0.003510
## b[4] 0.04187 0.2861 0.002336 0.003859
## b[5] 0.03258 0.2538 0.002073 0.003206
## b[6] 1.49709 0.3195 0.002608 0.004706
## b[7] 0.41780 0.2527 0.002063 0.002779
## b[8] 0.32645 0.2994 0.002444 0.004907
## int -0.95866 0.1302 0.001063 0.001492
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 0.09067 0.4580 0.65019 0.83520 1.1969
## b[2] 1.72869 2.1146 2.31823 2.52965 2.9435
## b[3] -0.79936 -0.4477 -0.26807 -0.08698 0.2546
## b[4] -0.51566 -0.1497 0.04230 0.23258 0.6043
## b[5] -0.46334 -0.1410 0.03412 0.20744 0.5268
## b[6] 0.87799 1.2813 1.49680 1.70650 2.1381
## b[7] -0.07735 0.2475 0.41539 0.58605 0.9168
## b[8] -0.25786 0.1247 0.32847 0.52802 0.9107
## int -1.21672 -1.0472 -0.95731 -0.86839 -0.7062
# Extract posterior mean of coefficients
pm_coef2 = colMeans(mod2_csim)
# The matrix multiplication below gives the linear predictor, which is then used to find the estimated probabilities.
pm_Xb2 = pm_coef2["int"] + X1[,c(1,2,3,4,5,6,7,8)] %*% pm_coef2[1:8] # Intercept + Design Matrix * Coefficients
phat2 = 1.0 / (1.0 + exp(-pm_Xb2)) # Predicted probability that Outcome = 1 for each observation
plot(phat2, jitter(finaldata.train$Outcome))
(tab0.5 = table(phat2 > 0.5, finaldata.train$Outcome))
##
## 0 1
## FALSE 268 69
## TRUE 36 87
sum(diag(tab0.5)) / sum(tab0.5) # Correct classification rate in the training dataset
## [1] 0.7717391
Now, let’s see the performance on the test dataset.
pm_coef2 = colMeans(mod2_csim)
pm_Xb2 = pm_coef2["int"] + X2[,c(1,2,3,4,5,6,7,8)] %*% pm_coef2[1:8] # Intercept + Design Matrix*Coefficients
phat2 = 1.0 / (1.0 + exp(-pm_Xb2))
plot(phat2, jitter(finaldata.test$Outcome))
(tab0.5 = table(phat2 > 0.5, finaldata.test$Outcome))
##
## 0 1
## FALSE 174 48
## TRUE 22 64
sum(diag(tab0.5)) / sum(tab0.5) # Correct classification rate in the testing dataset
## [1] 0.7727273
Both priors lead to almost identical results, so in this case the posterior analysis is driven largely by the likelihood rather than the prior specification. In general, when n >> p, we do not expect much difference between results from different reasonable priors. With both priors, we achieved slightly higher than 77% classification accuracy on the test dataset.
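As a quick numeric check of this claim, the posterior means under the two priors can be placed side by side (an illustrative snippet); the two columns are nearly identical.
round(cbind(normal = colMeans(mod1_csim), cauchy = colMeans(mod2_csim)), 3) # posterior means under the normal and Cauchy priors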
[1] Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. (2008). “A weakly informative default prior distribution for logistic and other regression models.” The Annals of Applied Statistics, 2(4), 1360-1383.
[2] Heiner, M. Bayesian Statistics: Techniques and Models. Coursera. https://www.coursera.org/learn/mcmc-bayesian-statistics