Introduction

The dataset is the Pima Indians Diabetes dataset, which contains several medical markers for a number of female subjects of Pima Indian heritage. The dataset has been removed from the UCI Machine Learning Repository due to permissions issues, but it can still be found here: https://www.kaggle.com/uciml/pima-indians-diabetes-database

The purpose of this analysis is to determine which variables are related to the presence or absence of diabetes.

Note that some code and outputs have been removed for brevity.

Exploratory Data Analysis

First, let’s look at the data:
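
The loading code is omitted here; a minimal sketch, assuming the Kaggle CSV has been saved locally as diabetes.csv (the file name is an assumption), would be:

data = read.csv("diabetes.csv")  # file name/path is an assumption, not shown in the original
str(data)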

## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

All of the explanatory variables are numeric, although most are recorded as integers. The outcome is a binary variable indicating whether the subject has diabetes. It should be noted that the outcome is imbalanced, with roughly twice as many subjects without diabetes as with it.
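
One quick way to check the class balance (not shown above) is:

table(data$Outcome)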

Let’s look at the correlations between variables:
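
The correlation plot itself is omitted for brevity; a sketch of how it could be produced, assuming the corrplot package, would be:

library(corrplot)                              # package choice is an assumption; cor(data[, 1:8]) alone also works
corrplot(cor(data[, 1:8]), method = "circle")  # pairwise correlations of the eight predictors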

There do not appear to be many strongly correlated variables in the data. Age is correlated with the number of Pregnancies, which seems rather obvious, and Insulin is slightly correlated with SkinThickness, which is less obvious.

Now we will fit a generalized linear model to the data to see if anything interesting shows up:

l_mod <- glm(Outcome ~ ., data = data, family = "binomial")
summary(l_mod)$coef[, c("Estimate", "Pr(>|z|)")]
##                               Estimate     Pr(>|z|)
## (Intercept)              -8.4046963669 9.161142e-32
## Pregnancies               0.1231822984 1.229640e-04
## Glucose                   0.0351637146 2.509097e-21
## BloodPressure            -0.0132955469 1.107207e-02
## SkinThickness             0.0006189644 9.285152e-01
## Insulin                  -0.0011916990 1.860652e-01
## BMI                       0.0897009700 2.758937e-09
## DiabetesPedigreeFunction  0.9451797406 1.579978e-03
## Age                       0.0148690047 1.111920e-01

These p-values indicate that several variables are not statistically significant, including SkinThickness, Insulin and Age. We will consider removing these variables.

It appears that three of the explanatory variables - Glucose, BloodPressure and BMI - are roughly normally distributed. Pregnancies seems to follow an exponential distribution, and DiabetesPedigreeFunction seems to follow a beta distribution.
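
The histograms behind these observations are omitted; they could be reproduced roughly as follows:

par(mfrow = c(3, 3))
for (v in names(data)[1:8]) hist(data[[v]], main = v, xlab = v)  # one histogram per predictor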

Methods

Variable Selection

Rather than setting a separate prior for each beta coefficient, we will scale and center the data, which we can do since the explanatory variables are all numeric.

X = scale(data[,1:8], center = TRUE, scale = TRUE)

Modeling

Since the outcome is binary, we will fit a logistic regression using all of the explanatory variables. To encourage the model to favor coefficients near zero, we will use a double exponential (Laplace) prior for the betas; the ddexp(0.0, sqrt(2.0)) prior used below has rate sqrt(2), which corresponds to unit variance.

mod0_string = "model {
    for(i in 1:length(Outcome)){
        Outcome[i] ~ dbern(p[i])
        logit(p[i]) = b0 + b[1] * Pregnancies[i] + b[2] * Glucose[i] + 
          b[3] * BloodPressure[i] + b[4] * SkinThickness[i] + b[5] * Insulin[i] + 
          b[6] * BMI[i] + b[7] * DiabetesPedigreeFunction[i] + b[8] * Age[i]
    }
    
    b0 ~ dnorm(0.0, 1.0)             # second argument to dnorm is a precision in JAGS
    for(j in 1:8){
        b[j] ~ ddexp(0.0, sqrt(2.0)) # Laplace prior with rate sqrt(2), i.e. unit variance
    }
}"

params = c("b0", "b")

data_jags = list(Outcome = data$Outcome, Pregnancies = X[, "Pregnancies"],
                 Glucose = X[, "Glucose"], BloodPressure = X[, "BloodPressure"],
                 SkinThickness = X[, "SkinThickness"], Insulin = X[, "Insulin"],
                 BMI = X[, "BMI"], DiabetesPedigreeFunction = X[, "DiabetesPedigreeFunction"],
                 Age = X[, "Age"])

library(rjags)  # also attaches coda, used for the diagnostics below
mod0 = jags.model(textConnection(mod0_string), data = data_jags, n.chains=3)
update(mod0, 1e3)
mod_sim0 = coda.samples(model = mod0, variable.names = params, n.iter = 5e3)
mod_csim0 = as.mcmc(do.call(rbind, mod_sim0))
gelman.diag(mod_sim0)
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## b[1]          1          1
## b[2]          1          1
## b[3]          1          1
## b[4]          1          1
## b[5]          1          1
## b[6]          1          1
## b[7]          1          1
## b[8]          1          1
## b0            1          1
## 
## Multivariate psrf
## 
## 1
autocorr.diag(mod_sim0)
##               b[1]         b[2]        b[3]          b[4]         b[5]
## Lag 0   1.00000000  1.000000000 1.000000000  1.0000000000  1.000000000
## Lag 1   0.45480848  0.402189489 0.342386909  0.4765570180  0.476035644
## Lag 5   0.04523439  0.038394220 0.019485405  0.0278758016  0.012144447
## Lag 10 -0.01081058  0.007879641 0.003297238 -0.0055877925 -0.013616305
## Lag 50 -0.01292045 -0.008649784 0.003265828  0.0003337635 -0.002513289
##               b[6]         b[7]        b[8]           b0
## Lag 0  1.000000000 1.0000000000  1.00000000  1.000000000
## Lag 1  0.367800022 0.2613412416  0.48889598  0.297014975
## Lag 5  0.012852215 0.0088312230  0.03186626 -0.007289176
## Lag 10 0.001982413 0.0025285175 -0.01128342 -0.005643497
## Lag 50 0.014809921 0.0001843585 -0.01179415 -0.011446262

The convergence diagnostics look acceptable and autocorrelation does not seem to be problematic. Now we will look at the posterior density plots to see which variables are significant.
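
The density plots themselves are omitted for brevity; they could be reproduced roughly as follows:

par(mfrow = c(3, 3))
densplot(mod_csim0[, 1:9])  # posterior densities for b[1..8] and b0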

This confirms our previous assessment that Insulin, SkinThickness and Age are not significant, so we will fit another model using only the variables that were significant in the previous model. For this model we use a vague normal prior for the betas rather than a double exponential.

mod1_string = "model {
    for(i in 1:length(Outcome)){
        Outcome[i] ~ dbern(p[i])
        logit(p[i]) = b0 + b[1] * Pregnancies[i] + b[2] * Glucose[i] + 
        b[3] * BloodPressure[i] + b[4] * BMI[i] + b[5] * DiabetesPedigreeFunction[i]
    }
    
    b0 ~ dnorm(0, 5.0)
    for(j in 1:5){
        b[j] ~ dnorm(0.0, 1.0/25.0)
    }
}"

These results indicate that DiabetesPedigreeFunction, which is an indicator of family history, is among the strongest predictors.

Now, since the Outcome is unbalanced, we will attempt to account for the fact that most people do not have diabetes by setting a different prior for the intercept. We will initialize it at a negative value and treat its mean and precision as random variables.

mod3_string = "model {
    for(i in 1:length(Outcome)){
        Outcome[i] ~ dbern(p[i])
        logit(p[i]) = b0 + b[1] * Pregnancies[i] + b[2] * Glucose[i] + 
        b[3] * BloodPressure[i] + b[4] * BMI[i] + b[5] * DiabetesPedigreeFunction[i]
    }
    
    b0 ~ dnorm(mu, prec)       # dnorm takes a precision in JAGS, so pass prec rather than sig
    for(j in 1:5){
        b[j] ~ dnorm(0.0, 1.0/25.0)
    }
    mu ~ dnorm(0, 1.0/25.0)
    prec ~ dgamma(0.5, 0.5)
    sig = sqrt(1.0 / prec)     # implied standard deviation, kept for reference
}"

Finally, we will fit the same model using the original, unscaled data:

mod4 = jags.model(textConnection(mod3_string), inits = inits, data = as.list(data), n.chains=3)
update(mod4, 1e3)
mod_sim4 = coda.samples(model=mod4, variable.names = params3, n.iter = 5e3)

mod_csim4 = as.mcmc(do.call(rbind, mod_sim4))

dic4 = dic.samples(mod4, n.iter=1e3)

Results

We have fit four models; the first was used for variable selection, so we will examine the DICs of the remaining three.

dic1
## Mean deviance:  734.5 
## penalty 5.709 
## Penalized deviance: 740.2
dic3
## Mean deviance:  734.4 
## penalty 5.764 
## Penalized deviance: 740.2
dic4
## Mean deviance:  734.5 
## penalty 6.23 
## Penalized deviance: 740.7

The three models perform roughly equivalently; however, since they use different priors, the DICs are not strictly comparable. We will prefer the third model, which uses the scaled data.

(pm_coefs3 = colMeans(mod_csim3))
##       b[1]       b[2]       b[3]       b[4]       b[5]         b0 
##  0.5212246  1.1205382 -0.2326566  0.6770530  0.3053982 -0.8661636 
##         mu       prec 
## -0.8347109  1.0335807
par(mfrow=c(3,2))
densplot(mod_csim3[,1:6])

HPDinterval(mod_csim3)
##              lower       upper
## b[1]  3.305979e-01  0.69756727
## b[2]  9.106851e-01  1.33327126
## b[3] -4.252538e-01 -0.04231971
## b[4]  4.553531e-01  0.89473046
## b[5]  1.134678e-01  0.49727553
## b0   -1.055567e+00 -0.67518946
## mu   -2.702089e+00  0.97891948
## prec  2.882944e-07  3.88261278
## attr(,"Probability")
## [1] 0.95
pm_Xb = pm_coefs3["b0"] + X[, c(1,2,3,6,7)] %*% pm_coefs3[1:5]
phat = 1.0 / (1.0 + exp(-pm_Xb))

plot(phat, jitter(data$Outcome), main="Predicted versus Ground Truth", xlab="Predicted", ylab="Truth")
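
To quantify the predictive performance informally, predictions could be tabulated at a 0.5 cutoff; this is only a sketch, and the resulting counts depend on the MCMC run.

tab0.5 = table(predicted = phat > 0.5, actual = data$Outcome)
tab0.5
sum(diag(tab0.5)) / sum(tab0.5)   # rough in-sample classification accuracy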

Conclusion

Our preferred model indicates that five of the variables in the data are related to diabetes. On the standardized scale, Glucose has the largest posterior mean coefficient, with family history (DiabetesPedigreeFunction) also positively associated with the outcome. All of the selected features are positively associated with diabetes except BloodPressure, which has a negative coefficient. This seems rather odd, as high blood pressure is typically associated with diabetes, but the magnitude of its posterior mean is smaller than that of the other betas, indicating that this is not a very strong relationship.

As we see in the Predicted vs Ground Truth plot above, the model does not seem to have a great deal of predictive power, which could be due to problems inherent in the data or could indicate that there are non-linear relationships involved.

Future work could include setting a separate prior for each beta coefficient, modeling interactions between variables, and trying different priors for the betas.