Exploratory Data Analysis: Correlation Plots

To estimate the correlations among the predictor variables and check whether any of them are strongly correlated with one another, you can use a correlation plot. For example, gravity and osmolarity, as well as urea and osmolarity, appear to have a very close linear relationship (a tight ellipse, i.e. a strong correlation).
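
A sketch of how such a plot can be produced, assuming the data come from the urine dataset in the boot package (which matches the variables discussed here) and using the corrplot package:

library("corrplot")
data("urine", package = "boot")
dat <- na.omit(urine)              # keep the 77 complete cases
Cor <- cor(dat[, -1])              # correlations among the predictors (drop the response r)
corrplot(Cor, method = "ellipse")  # narrow, elongated ellipses indicate strong correlation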

Variable Selection

Final Model

As a result of the variable selection procedure, the final model keeps the coefficients for gravity, conductivity, and calcium concentration only, dropping the variables whose coefficients are not significantly different from 0. In this final model, you also change the prior distribution for the individual beta terms: you switch from the double exponential to a fairly noninformative normal distribution for the coefficients.

model_string2 <- "model{
  # Likelihood of the data: Bernoulli distribution
  for (i in 1:length(y)) {
    y[i] ~ dbern(p[i])  
    logit(p[i]) = int + b[1]*gravity[i] + b[2]*cond[i] + b[3]*calc[i]
  }
  # Noninformative normal prior for the intercept (precision 1/25, i.e. variance 25)
  int ~ dnorm(0.0, 1.0/25.0)
  # Prior for the individual beta terms:
  # switch from the double exponential (Laplace) to a fairly noninformative normal distribution.
  # JAGS parameterizes dnorm by precision, so 1.0/25.0 corresponds to variance 25
  for (j in 1:3) {
    b[j] ~ dnorm(0.0, 1.0/25.0)
  }
}"

library("rjags")  # assumed loaded earlier; provides jags.model(), coda.samples(), dic.samples()

set.seed(92)
mod2 <- jags.model(textConnection(model_string2), data = data_jags,
                   n.chains = 3)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 77
   Unobserved stochastic nodes: 4
   Total graph size: 637

Initializing model
params <- c("int", "b")  # parameters to monitor (assumed defined earlier in the analysis)
mod2_sim <- coda.samples(model = mod2, variable.names = params,
                         n.iter = 5e3)

mod2_csim <- as.mcmc(do.call(rbind, mod2_sim))  # stack the 3 chains into one combined sample
dic2 <- dic.samples(mod2, n.iter = 1e3)         # DIC, for comparison with the earlier model

Final Model Results

Both models yield essentially the same conclusions: higher values of gravity (b1) and calc (b3, calcium concentration) are associated with higher probabilities of calcium oxalate crystals, while higher values of cond (b2, conductivity) are associated with lower probabilities of calcium oxalate crystals. Each of these estimates is several posterior standard deviations away from 0, indicating that all three are significant predictors of the formation of calcium oxalate crystals in urine.


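These posterior summaries can be reproduced by calling summary() on the simulation output:

summary(mod2_sim)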
Iterations = 1001:6000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

        Mean     SD Naive SE Time-series SE
b[1]  0.9888 0.3833 0.003130       0.005586
b[2] -0.8223 0.3454 0.002820       0.005271
b[3]  1.3377 0.3807 0.003108       0.004614
int  -0.2252 0.2917 0.002382       0.003155

2. Quantiles for each variable:

        2.5%     25%     50%      75%   97.5%
b[1]  0.2735  0.7234  0.9821  1.24396  1.7751
b[2] -1.5106 -1.0525 -0.8146 -0.58993 -0.1628
b[3]  0.6264  1.0734  1.3237  1.58713  2.1119
int  -0.7987 -0.4223 -0.2253 -0.02749  0.3409

Predictions

Suppose you choose a cutoff of 0.5 for the predicted probabilities: if an observation's predicted probability is higher than this value, it is classified as a 1, and if it is less than 0.5, as a 0. Tabulating these classifications against the truth tells you how well the model predicts the original data (its accuracy). In this case, the model classifies the observations with an overall accuracy of roughly 78%.
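
A sketch of how these predictions can be computed, assuming X is the matrix of predictors on the same scale as in data_jags, with its gravity, cond, and calc columns matching b[1] through b[3]:

pm_coef <- colMeans(mod2_csim)  # posterior means; columns are b[1], b[2], b[3], int
pm_Xb <- pm_coef["int"] + X[, c("gravity", "cond", "calc")] %*% pm_coef[1:3]
phat <- 1.0 / (1.0 + exp(-pm_Xb))           # inverse logit: predicted probabilities
(tab0.5 <- table(phat > 0.5, data_jags$y))  # rows: classification, columns: truth
sum(diag(tab0.5)) / sum(tab0.5)             # overall accuracy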

       
         0  1
  FALSE 39 12
  TRUE   5 21
[1] 0.7792208

Now suppose it is considered really bad to predict no formation of calcium oxalate crystals when in fact there is one (i.e., a false negative). You might then lower the threshold for classifying data points as 1s, say to 0.3: if the model says the probability of a crystal is greater than 0.3, the observation is classified as having a calcium oxalate crystal. With this cutoff, the overall accuracy drops slightly to about 76.6%, but the chance of detecting a true positive increases (from 21 of the 33 actual positives at the 0.5 cutoff to 29 of 33 here). Keep in mind that these are in-sample accuracy values; for a less biased assessment of how well the model performs, it would be better to fit the model on training data, make predictions on a held-out test set, and evaluate the accuracy there.
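
Only the thresholding step changes; reusing phat from the sketch above:

(tab0.3 <- table(phat > 0.3, data_jags$y))  # more observations cross the lower cutoff
sum(diag(tab0.3)) / sum(tab0.3)             # overall accuracy at the 0.3 cutoff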

       
         0  1
  FALSE 30  4
  TRUE  14 29
[1] 0.7662338