For an example of logistic regression, we’ll use the urine data set
from the boot package in R. The response variable is r,
which takes on values of \(0\) or \(1\), indicating the absence or presence of calcium oxalate crystals.
We will remove the rows of the data set that contain missing values.
library("boot")
data("urine")
?urine
head(urine)
## r gravity ph osmo cond urea calc
## 1 0 1.021 4.91 725 NA 443 2.45
## 2 0 1.017 5.74 577 20.0 296 4.49
## 3 0 1.008 7.20 321 14.9 101 2.36
## 4 0 1.011 5.51 408 12.6 224 2.15
## 5 0 1.005 6.52 187 7.5 91 1.16
## 6 0 1.020 5.27 668 25.3 252 3.34
dat = na.omit(urine)
Let’s look at pairwise scatter plots of the seven variables.
pairs(dat)
One thing that stands out is that several of these variables are strongly correlated with one another. For example, gravity and osmo appear to have a very close linear relationship. Collinearity between \(x\) variables in linear regression models can cause trouble for statistical inference. Two correlated variables will compete for the ability to predict the response variable, leading to unstable estimates. This is not a problem for prediction of the response, if prediction is the end goal of the model. But if our objective is to discover how the variables relate to the response, we should avoid collinearity.
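As a quick numeric check of the close relationship between gravity and osmo noted above, we can compute their pairwise correlation (the full correlation matrix is computed next):
cor(dat$gravity, dat$osmo) # single pairwise sample correlation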
We can more formally estimate the correlation among these variables using the corrplot package.
library("corrplot")
## corrplot 0.92 loaded
Cor = cor(dat)
corrplot(Cor, type="upper", method="ellipse", tl.pos="d")
corrplot(Cor, type="lower", method="number", col="black",
add=TRUE, diag=FALSE, tl.pos="n", cl.pos="n")
One primary goal of this analysis is to find out which variables are related to the presence of calcium oxalate crystals. This objective is often called “variable selection.” We have already seen one way to do this: fit several models that include different sets of variables and see which one has the best DIC. Another way to do this is to use a linear model where the priors for the \(\beta\) coefficients favor values near \(0\) (indicating a weak relationship). This way, the burden of establishing association lies with the data. If there is not a strong signal, we assume it doesn’t exist.
Rather than tailoring a prior for each individual \(\beta\) based on the scale on which its covariate is measured, it is customary to standardize each variable: subtract its mean and divide by its standard deviation.
X = scale(dat[,-1], center=TRUE, scale=TRUE)
head(X)
## gravity ph osmo cond urea calc
## 2 -0.1403037 -0.4163725 -0.1528785 -0.1130908 0.25747827 0.09997564
## 3 -1.3710690 1.6055972 -1.2218894 -0.7502609 -1.23693077 -0.54608444
## 4 -0.9608139 -0.7349020 -0.8585927 -1.0376121 -0.29430353 -0.60978050
## 5 -1.7813240 0.6638579 -1.7814497 -1.6747822 -1.31356713 -0.91006194
## 6 0.2699514 -1.0672806 0.2271214 0.5490664 -0.07972172 -0.24883614
## 7 -0.8240622 -0.5825618 -0.6372741 -0.4379226 -0.51654898 -0.83726644
head(X[,"gravity"])
## 2 3 4 5 6 7
## -0.1403037 -1.3710690 -0.9608139 -1.7813240 0.2699514 -0.8240622
colMeans(X)
## gravity ph osmo cond urea
## -1.011341e-13 3.763606e-15 9.948752e-17 6.719012e-16 6.055762e-17
## calc
## 8.074349e-17
Recall that the general form of the apply function is apply(X, MARGIN, FUN). Here we apply the sd function to each column (MARGIN=2) to confirm that each standardized variable has standard deviation \(1\):
apply(X, 2, sd)
## gravity ph osmo cond urea calc
## 1 1 1 1 1 1
Our prior for the \(\beta\) coefficients (which we’ll call \(b\) in the model)
will be the double exponential (or Laplace) distribution,
which, as the name implies, is an exponential distribution with tails
extending in both the positive and negative directions
and a sharp peak at \(0\). We can read
more about it in the JAGS manual. The distribution looks
like this:
ddexp = function(x, mu, tau) {
0.5*tau*exp(-tau*abs(x-mu))
}
curve(ddexp(x, mu=0.0, tau=1.0), from=-5.0, to=5.0, ylab="density", main="Double exponential\ndistribution") # double exponential distribution
curve(dnorm(x, mean=0.0, sd=1.0), from=-5.0, to=5.0, lty=2, add=TRUE) # normal distribution
legend("topright", legend=c("double exponential", "normal"), lty=c(1,2), bty="n")
library("rjags")
## Loading required package: coda
## Linked to JAGS 4.3.2
## Loaded modules: basemod,bugs
mod1_string = " model {
for (i in 1:length(y)) {
y[i] ~ dbern(p[i])
logit(p[i]) = int + b[1]*gravity[i] + b[2]*ph[i] + b[3]*osmo[i] + b[4]*cond[i] + b[5]*urea[i] + b[6]*calc[i]
}
int ~ dnorm(0.0, 1.0/25.0)
for (j in 1:6) {
b[j] ~ ddexp(0.0, sqrt(2.0)) # has variance 1.0
}
} "
set.seed(92)
data_jags = list(y=dat$r, gravity=X[,"gravity"], ph=X[,"ph"], osmo=X[,"osmo"], cond=X[,"cond"], urea=X[,"urea"], calc=X[,"calc"])
params = c("int", "b")
mod1 = jags.model(textConnection(mod1_string), data=data_jags, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 77
## Unobserved stochastic nodes: 7
## Total graph size: 1085
##
## Initializing model
update(mod1, 1e3)
mod1_sim = coda.samples(model=mod1,
variable.names=params,
n.iter=5e3)
mod1_csim = as.mcmc(do.call(rbind, mod1_sim)) # combine the three chains into a single mcmc object
## convergence diagnostics
plot(mod1_sim, ask=TRUE)
gelman.diag(mod1_sim)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1.00 1.01
## b[2] 1.00 1.00
## b[3] 1.01 1.02
## b[4] 1.00 1.00
## b[5] 1.00 1.01
## b[6] 1.00 1.00
## int 1.00 1.00
##
## Multivariate psrf
##
## 1
autocorr.diag(mod1_sim)
## b[1] b[2] b[3] b[4] b[5] b[6]
## Lag 0 1.00000000 1.000000000 1.00000000 1.000000000 1.000000000 1.00000000
## Lag 1 0.83151972 0.284902500 0.89983366 0.761719900 0.802351486 0.49405647
## Lag 5 0.42540473 0.005463314 0.58835014 0.354938947 0.372299887 0.04851994
## Lag 10 0.18327392 0.003639475 0.35477089 0.174409848 0.149151019 -0.01283015
## Lag 50 0.01789098 -0.008856912 -0.02958471 0.003188407 0.008918838 0.01760961
## int
## Lag 0 1.000000000
## Lag 1 0.296337930
## Lag 5 0.014164441
## Lag 10 0.011910560
## Lag 50 -0.007123416
autocorr.plot(mod1_sim)
effectiveSize(mod1_sim)
## b[1] b[2] b[3] b[4] b[5] b[6] int
## 1285.8719 7999.5820 761.1337 1515.6793 1522.8840 4941.3145 7884.8587
## calculate DIC
dic1 = dic.samples(mod1, n.iter=1e3)
Let’s look at the results.
summary(mod1_sim)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 1.6671 0.7492 0.006117 0.020919
## b[2] -0.1412 0.2896 0.002365 0.003244
## b[3] -0.2876 0.8292 0.006771 0.030444
## b[4] -0.7648 0.5123 0.004183 0.013245
## b[5] -0.6200 0.6065 0.004952 0.015604
## b[6] 1.6144 0.4985 0.004070 0.007101
## int -0.1764 0.3034 0.002477 0.003425
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 0.3538 1.1407 1.6123 2.14410 3.3055
## b[2] -0.7455 -0.3226 -0.1227 0.04452 0.4144
## b[3] -2.1203 -0.7277 -0.1988 0.18774 1.2995
## b[4] -1.8211 -1.1039 -0.7444 -0.40117 0.1530
## b[5] -1.9492 -1.0050 -0.5531 -0.17788 0.3904
## b[6] 0.7155 1.2717 1.5805 1.92125 2.6931
## int -0.7703 -0.3755 -0.1773 0.02527 0.4221
par(mfrow=c(3,2))
densplot(mod1_csim[,1:6], xlim=c(-3.0, 3.0))
colnames(X) # variable names
## [1] "gravity" "ph" "osmo" "cond" "urea" "calc"
It is clear that the coefficients for gravity,
cond (conductivity), and calc (calcium
concentration) are not \(0\). The
posterior distribution for the coefficient of osmo
(osmolarity) looks much like the prior and is still roughly centered on \(0\), so we’ll conclude that
osmo is not a strong predictor of calcium oxalate crystals.
The same goes for ph.
The coefficient for urea (urea concentration) appears to be a borderline
case. However, if we refer back to the correlations among the variables,
we see that urea is highly correlated with
gravity, so we opt to remove it.
Our second model looks like this:
mod2_string = " model {
for (i in 1:length(y)) {
y[i] ~ dbern(p[i])
logit(p[i]) = int + b[1]*gravity[i] + b[2]*cond[i] + b[3]*calc[i]
}
int ~ dnorm(0.0, 1.0/25.0)
for (j in 1:3) {
b[j] ~ dnorm(0.0, 1.0/25.0) # noninformative for logistic regression
}
} "
mod2 = jags.model(textConnection(mod2_string), data=data_jags, n.chains=3)
## Warning in jags.model(textConnection(mod2_string), data = data_jags, n.chains =
## 3): Unused variable "ph" in data
## Warning in jags.model(textConnection(mod2_string), data = data_jags, n.chains =
## 3): Unused variable "osmo" in data
## Warning in jags.model(textConnection(mod2_string), data = data_jags, n.chains =
## 3): Unused variable "urea" in data
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 77
## Unobserved stochastic nodes: 4
## Total graph size: 635
##
## Initializing model
update(mod2, 1e3)
mod2_sim = coda.samples(model=mod2,
variable.names=params,
n.iter=5e3)
mod2_csim = as.mcmc(do.call(rbind, mod2_sim))
plot(mod2_sim, ask=TRUE)
gelman.diag(mod2_sim)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1 1
## b[2] 1 1
## b[3] 1 1
## int 1 1
##
## Multivariate psrf
##
## 1
autocorr.diag(mod2_sim)
## b[1] b[2] b[3] int
## Lag 0 1.00000000 1.0000000000 1.000000000 1.000000000
## Lag 1 0.58512339 0.6509919607 0.495719935 0.285089258
## Lag 5 0.11799171 0.1428339563 0.034859106 0.000127407
## Lag 10 0.02027062 0.0008815241 -0.014370521 0.002494546
## Lag 50 0.00595493 0.0091683636 -0.005960314 -0.005702258
autocorr.plot(mod2_sim)
effectiveSize(mod2_sim)
## b[1] b[2] b[3] int
## 3443.848 2853.365 4974.486 8482.137
dic2 = dic.samples(mod2, n.iter=1e3)
dic1
## Mean deviance: 68.41
## penalty 5.543
## Penalized deviance: 73.95
dic2
## Mean deviance: 71.19
## penalty 4.082
## Penalized deviance: 75.28
summary(mod2_sim)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 1.422 0.5028 0.004105 0.008611
## b[2] -1.355 0.4558 0.003722 0.008584
## b[3] 1.881 0.5424 0.004429 0.007697
## int -0.148 0.3222 0.002631 0.003506
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 0.4865 1.0727 1.4005 1.7450 2.4662
## b[2] -2.3121 -1.6500 -1.3327 -1.0389 -0.5088
## b[3] 0.9008 1.4988 1.8599 2.2330 3.0152
## int -0.7781 -0.3616 -0.1542 0.0671 0.5024
HPDinterval(mod2_csim)
## lower upper
## b[1] 0.4710869 2.4477407
## b[2] -2.2603100 -0.4653454
## b[3] 0.8356818 2.9339304
## int -0.7452106 0.5273541
## attr(,"Probability")
## [1] 0.95
par(mfrow=c(3,1))
densplot(mod2_csim[,1:3], xlim=c(-3.0, 3.0))
colnames(X)[c(1,4,6)] # variable names
## [1] "gravity" "cond" "calc"
The DIC is actually better for the first model. Note that we did
change the prior between models, and generally we should not use the DIC
to choose between priors. Hence comparing DIC between these two models
may not be a fair comparison. Nevertheless, they both yield essentially
the same conclusions. Higher values of gravity and
calc (calcium concentration) are associated with higher
probabilities of calcium oxalate crystals, while higher values of
cond (conductivity) are associated with lower probabilities
of calcium oxalate crystals.
There are more modeling options in this scenario, perhaps including transformations of variables, different priors, and interactions between the predictors, but we’ll leave it to you to see if you can improve the model.
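As one sketch of such an option (purely illustrative and not fit here; the interaction term and its prior are not part of the analysis above), an interaction between gravity and calc could be added to the second model’s linear predictor like this:
# hypothetical variant of mod2_string with a gravity-by-calc interaction (not run)
mod2_int_string = " model {
    for (i in 1:length(y)) {
        y[i] ~ dbern(p[i])
        logit(p[i]) = int + b[1]*gravity[i] + b[2]*cond[i] + b[3]*calc[i] + b[4]*gravity[i]*calc[i]
    }
    int ~ dnorm(0.0, 1.0/25.0)
    for (j in 1:4) {
        b[j] ~ dnorm(0.0, 1.0/25.0)
    }
} "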
How do we turn model parameter estimates into model predictions? The key is the form of the model. Remember that the likelihood is Bernoulli: \(y\) equals \(1\) with probability \(p\). We modeled the logit of \(p\) with a linear model, which, as we showed in the first segment of this lesson, leads to an exponential form for \(E(y)=p\).
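Concretely, writing \(x_{1i},\dots,x_{ki}\) for the standardized covariates of observation \(i\), the model implies
\[ E(y_i) = p_i = \frac{1}{1 + e^{-(\text{int} + b_1 x_{1i} + \dots + b_k x_{ki})}}, \]
which always lies between \(0\) and \(1\).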
Take the output from our model in the last segment. We will use the posterior means as point estimates of the parameters.
(pm_coef = colMeans(mod2_csim))
## b[1] b[2] b[3] int
## 1.4221558 -1.3547290 1.8806026 -0.1480463
The posterior mean of the intercept was about \(-0.15\). Since we centered and scaled all
of the covariates, values of \(0\) for
each \(x\) correspond to the average
values. Therefore, if we use our last model, our point estimate for
the probability of calcium oxalate crystals when gravity,
cond, and calc are at their average values is
\(1/(1+e^{-(-0.15)}) = 1/(1+e^{0.15}) = 0.4625702\).
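As a quick check, we can compute this directly from the posterior means stored in pm_coef above (the result will match the hand calculation up to rounding):
1.0 / (1.0 + exp(-pm_coef["int"])) # estimated probability at average covariate values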
Now suppose we want to make a prediction for a new specimen whose value of gravity is average, whose value of cond is one standard deviation below the mean, and whose value of calc is one standard deviation above the mean. Our point estimate for the probability of calcium oxalate crystals is \(1/(1+e^{-(-0.15 + 1.4 \cdot 0.0 - 1.3 \cdot (-1.0) + 1.9 \cdot (1.0))}) = 0.9547825\).
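The same calculation can be done with the unrounded posterior means (so the result will differ slightly from the rounded hand calculation above):
xb_new = pm_coef["int"] + pm_coef["b[1]"]*0.0 + pm_coef["b[2]"]*(-1.0) + pm_coef["b[3]"]*1.0
1.0 / (1.0 + exp(-xb_new)) # gravity at its mean, cond one SD below, calc one SD above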
If we want to make predictions in terms of the original \(x\) variable values, we have two options:
1. For each \(x\) variable, subtract the mean and divide by the standard deviation for that variable in the original data set used to fit the model (see the sketch below).
2. Re-fit the model without centering and scaling the covariates.
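A minimal sketch of option 1, using the centering and scaling attributes that scale() stored in X (the new specimen's raw values here are hypothetical, chosen only for illustration):
x_raw = c(gravity=1.020, cond=20.0, calc=3.0) # hypothetical new specimen on the original scale
ctr   = attr(X, "scaled:center")[c("gravity", "cond", "calc")]
scl   = attr(X, "scaled:scale")[c("gravity", "cond", "calc")]
x_std = (x_raw - ctr) / scl # standardize using the means and SDs from the training data
1.0 / (1.0 + exp(-(pm_coef["int"] + sum(pm_coef[1:3] * x_std))))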
We can use the same ideas to make predictions for each of the original data points. This is similar to what we did to calculate residuals with earlier models.
First we take the \(X\) matrix and matrix multiply it with the posterior means of the coefficients. Then we need to pass these linear values through the inverse of the link function as we did above.
pm_Xb = pm_coef["int"] + X[,c(1,4,6)] %*% pm_coef[1:3]
phat = 1.0 / (1.0 + exp(-pm_Xb))
head(phat)
## [,1]
## 2 0.49841046
## 3 0.10827461
## 4 0.22174178
## 5 0.10679020
## 6 0.27369983
## 7 0.09101617
These phat values are the model’s predicted probability
of calcium oxalate crystals for each data point. We can get a rough idea
of how successful the model is by plotting these predicted values
against the actual outcome.
plot(phat, jitter(dat$r))
Suppose we choose a cutoff for these predicted probabilities. If the model tells us the probability is \(> 0.5\), we will classify the observation as a \(1\) and, if it is \(< 0.5\), we will classify it as a \(0\). That way the model classifies each data point. Now we can tabulate these classifications against the truth to see how well the model predicts the original data.
(tab0.5 = table(phat > 0.5, data_jags$y))
##
## 0 1
## FALSE 38 12
## TRUE 6 21
sum(diag(tab0.5)) / sum(tab0.5)
## [1] 0.7662338
The correct classification rate is about 77%, which is not too bad, but not great.
Now suppose that it is considered really bad to predict no calcium oxalate crystal when there in fact is one. We might then choose to lower our threshold for classifying data points as \(1\)s. Say we change it to \(0.3\). That is, if the model says the probability is \(> 0.3\), we will classify it as having a calcium oxalate crystal.
(tab0.3 = table(phat > 0.3, data_jags$y))
##
## 0 1
## FALSE 32 7
## TRUE 12 26
sum(diag(tab0.3)) / sum(tab0.3)
## [1] 0.7532468
It looks like we gave up a little classification accuracy, but we did indeed increase our chances of detecting a true positive.
We could repeat this exercise for many thresholds between \(0\) and \(1\), and each time calculate our error rates. This is equivalent to calculating what is called the ROC (receiver-operating characteristic) curve, which is often used to evaluate classification techniques.
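A minimal sketch of that idea in base R (the grid of cutoffs is an arbitrary choice; a dedicated package such as pROC would trace the full curve):
# sweep a grid of cutoffs and record the true and false positive rates
cutoffs = seq(0.1, 0.9, by=0.1)
rates = t(sapply(cutoffs, function(cut) {
    yhat = as.numeric(phat > cut)
    c(cutoff = cut,
      true_pos  = mean(yhat[data_jags$y == 1] == 1),
      false_pos = mean(yhat[data_jags$y == 0] == 1))
}))
rates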
These classification tables we have calculated were all in-sample. They were predicting for the same data used to fit the model. We could get a less biased assessment of how well our model performs if we calculated these tables for data that were not used to fit the model. For example, before fitting the model, you could withhold a set of randomly selected “test” data points, and use the model fit to the rest of the “training” data to make predictions on your “test” set.
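A sketch of such a split (the seed and test-set size are arbitrary choices, and the model above was fit to all of the data, so this is illustration only):
set.seed(61) # arbitrary seed for reproducibility
test_idx  = sample(nrow(dat), size=15) # withhold 15 randomly chosen rows as a test set
dat_test  = dat[test_idx, ]
dat_train = dat[-test_idx, ]
# refit the JAGS model using dat_train only, then compute phat for dat_test and tabulate as above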
What is the advantage of using a link function such as the logit transform for logistic regression? Answer: It ensures the success probability (\(E(y)\) if \(y\) is Bernoulli) is between \(0\) and \(1\) without requiring any constraints on the \(x\) variables or the \(\beta\) coefficients.
Logistic regression works with binomial likelihoods in addition to Bernoulli likelihoods. If the response, \(y_i\), is a number of successes in \(n_i\) independent trials each with \(\phi_i\) success probability, we can still model \(\phi_i\) with a linear model using the logit transformation.
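In symbols, the model for observation \(i\) is
\[ y_i \mid \phi_i \sim \text{Binomial}(n_i, \phi_i), \qquad \text{logit}(\phi_i) = \log\frac{\phi_i}{1-\phi_i} = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}, \]
which is exactly the structure we will give JAGS below.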
As an example, consider the OME data in the MASS package
in R. The data consist of experimental results from tests
of auditory perception in children. Under varying conditions and for
multiple trials under each condition, children either correctly or
incorrectly identified the source of changing signals.
Although the independence of the trials and results is questionable, we’ll try fitting a logistic regression to these data. First, we’ll briefly explore the relationships with the following code:
library("MASS")
data("OME")
?OME # background on the data
head(OME)
## ID Age OME Loud Noise Correct Trials
## 1 1 30 low 35 coherent 1 4
## 2 1 30 low 35 incoherent 4 5
## 3 1 30 low 40 coherent 0 3
## 4 1 30 low 40 incoherent 1 1
## 5 1 30 low 45 coherent 2 4
## 6 1 30 low 45 incoherent 2 2
any(is.na(OME)) # check for missing values
## [1] FALSE
dat = subset(OME, OME != "N/A") # manually remove OME missing values identified with "N/A"
dat$OME = factor(dat$OME)
str(dat)
## 'data.frame': 712 obs. of 7 variables:
## $ ID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 30 30 30 30 30 30 30 30 30 30 ...
## $ OME : Factor w/ 2 levels "high","low": 2 2 2 2 2 2 2 2 2 2 ...
## $ Loud : int 35 35 40 40 45 45 50 50 55 55 ...
## $ Noise : Factor w/ 2 levels "coherent","incoherent": 1 2 1 2 1 2 1 2 1 2 ...
## $ Correct: int 1 4 0 1 2 2 3 4 3 2 ...
## $ Trials : int 4 5 3 1 4 2 3 4 3 2 ...
plot(dat$Age, dat$Correct / dat$Trials )
plot(dat$OME, dat$Correct / dat$Trials )
plot(dat$Loud, dat$Correct / dat$Trials )
plot(dat$Noise, dat$Correct / dat$Trials )
We are interested in how these variables relate to the probability of successfully identifying the source of changes in sound. Of these four variables, which appears to have the weakest association with the probability of success?
Answer: OME because the box plots for the two levels of OME are nearly indistinguishable.
Next, we’ll fit a reference logistic regression model with a noninformative prior
in R. We can do this with the glm function,
providing the model formula as with the usual lm, except
now the response is the observed proportion of correct responses. We
must also indicate how many trials were run for each experiment using
the weights argument.
mod_glm = glm(Correct/Trials ~ Age + OME + Loud + Noise, data=dat, weights=Trials, family="binomial")
summary(mod_glm)
##
## Call:
## glm(formula = Correct/Trials ~ Age + OME + Loud + Noise, family = "binomial",
## data = dat, weights = Trials)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.294441 0.434928 -16.772 < 2e-16 ***
## Age 0.018896 0.003767 5.016 5.28e-07 ***
## OMElow -0.237150 0.123257 -1.924 0.0544 .
## Loud 0.171682 0.008880 19.333 < 2e-16 ***
## Noiseincoherent 1.576304 0.115236 13.679 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1431.12 on 711 degrees of freedom
## Residual deviance: 732.38 on 707 degrees of freedom
## AIC: 1262.6
##
## Number of Fisher Scoring iterations: 5
To get an idea of how the model fits, we can create residual plots (using deviance residuals, a special type of residual for non-normal likelihoods) and in-sample prediction plots.
plot(residuals(mod_glm, type="deviance"))
plot(fitted(mod_glm), dat$Correct/dat$Trials)
It appears from the second plot that the model is not very precise (some model predictions are far from the observed proportion of correct responses). Nevertheless, it can be informative about the relationships among the variables.
Report the posterior mode estimate of the coefficient for
OMElow. With this noninformative prior, the posterior mode corresponds to the
maximum likelihood estimate, so the coefficient can be read from the model summary above.
Next, we will fit a similar model in JAGS. To make
results comparable to those of the reference model, we will use the same
configuration of covariates. We can extract this information from the
reference model using model.matrix.
X = model.matrix(mod_glm)[,-1] # -1 removes the column of 1s for the intercept
head(X)
## Age OMElow Loud Noiseincoherent
## 1 30 1 35 0
## 2 30 1 35 1
## 3 30 1 40 0
## 4 30 1 40 1
## 5 30 1 45 0
## 6 30 1 45 1
The data include categorical covariates, which R codes as
dummy (indicator) variables (as with ANOVA). Hence we have an indicator variable for
whether OME is at its low level and another indicating whether the Noise
is incoherent. The intercept is then associated with the baseline
group. Ignoring the continuous variables Age and
Loud, what are the characteristics of this baseline
group?
Answer: The intercept is associated with values of
\(0\) for all covariates. In this case,
OMElow is \(0\) for high OME and \(1\) for low OME, and
Noiseincoherent is \(0\) for coherent noise and \(1\) for incoherent noise.
Therefore, the baseline group has high OME and coherent sound.
We now fit the JAGS model with the fairly noninformative
priors given below. Use three chains with at least 5,000 iterations in
each.
mod3_string = " model {
for (i in 1:length(y)) {
y[i] ~ dbin(phi[i], n[i])
logit(phi[i]) = b0 + b[1]*Age[i] + b[2]*OMElow[i] + b[3]*Loud[i] + b[4]*Noiseincoherent[i]
}
b0 ~ dnorm(0.0, 1.0/5.0^2)
for (j in 1:4) {
b[j] ~ dnorm(0.0, 1.0/4.0^2)
}
} "
data_jags = as.list(as.data.frame(X))
data_jags$y = dat$Correct # this will not work if there are missing values in dat (because they would be ignored by model.matrix). Always make sure that the data are accurately pre-processed for JAGS.
data_jags$n = dat$Trials
str(data_jags) # make sure that all variables have the same number of observations (712).
## List of 6
## $ Age : num [1:712] 30 30 30 30 30 30 30 30 30 30 ...
## $ OMElow : num [1:712] 1 1 1 1 1 1 1 1 1 1 ...
## $ Loud : num [1:712] 35 35 40 40 45 45 50 50 55 55 ...
## $ Noiseincoherent: num [1:712] 0 1 0 1 0 1 0 1 0 1 ...
## $ y : int [1:712] 1 4 0 1 2 2 3 4 3 2 ...
## $ n : int [1:712] 4 5 3 1 4 2 3 4 3 2 ...
params = c("b0","b")
mod3 = jags.model(textConnection(mod3_string), data=data_jags, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 712
## Unobserved stochastic nodes: 5
## Total graph size: 4377
##
## Initializing model
update(mod3, 1e3)
mod3_sim = coda.samples(model=mod3,
variable.names=params,
n.iter=5e3)
mod3_csim = as.mcmc(do.call(rbind, mod3_sim))
plot(mod3_sim, ask=TRUE)
gelman.diag(mod3_sim)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1.01 1.02
## b[2] 1.00 1.01
## b[3] 1.16 1.47
## b[4] 1.01 1.04
## b0 1.16 1.46
##
## Multivariate psrf
##
## 1.12
autocorr.diag(mod3_sim)
## b[1] b[2] b[3] b[4] b0
## Lag 0 1.00000000 1.00000000 1.0000000 1.00000000 1.0000000
## Lag 1 0.91578997 0.81500744 0.9870460 0.50709404 0.9901308
## Lag 5 0.65883199 0.39818264 0.9389104 0.10942812 0.9499293
## Lag 10 0.44934026 0.18448879 0.8862044 0.08470507 0.9008281
## Lag 50 0.03532912 0.03146967 0.6016906 0.05340868 0.6221219
autocorr.plot(mod3_sim)
effectiveSize(mod3_sim)
## b[1] b[2] b[3] b[4] b0
## 603.54411 1291.65630 93.02386 2168.06303 78.00710
dic3 = dic.samples(mod3, n.iter=1e3) # use a new name so we don't overwrite dic2 from the earlier model
raftery.diag(mod3_csim)
##
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95
##
## Burn-in Total Lower bound Dependence
## (M) (N) (Nmin) factor (I)
## b[1] 32 34200 3746 9.13
## b[2] 14 16868 3746 4.50
## b[3] 96 104804 3746 28.00
## b[4] 6 7087 3746 1.89
## b0 77 80675 3746 21.50
We perform some MCMC diagnostic checks. What does the Raftery and
Lewis diagnostic (raftery.diag()) suggest about these
chains?
Answer: The dependence factor for many of the variables is large (\(> 5.0\)), indicating strong autocorrelation in the chains. We would require a large number of iterations to reliably produce 95% probability intervals for the parameters.
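If we needed reliable interval estimates, one option (illustrative settings only, not run here) is simply a much longer, possibly thinned, run:
# hypothetical longer run to reduce the Monte Carlo error flagged by raftery.diag
mod3_sim_long = coda.samples(model=mod3, variable.names=params,
                             n.iter=1e5, thin=10)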
Although OMElow is the predictor with the weakest
statistical association to the probability of correct responses, the
posterior probability that its coefficient, \(\beta_2\), is negative is still \(> 0.9\). How do we interpret this (most
likely) negative coefficient in the context of our model?
Answer: While holding all other predictors constant, low OME is associated with a lower probability of correct responses than high OME.
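The posterior probability that \(\beta_2 < 0\) mentioned above can be computed directly from the combined chains (the exact value will vary slightly from run to run):
mean(mod3_csim[, "b[2]"] < 0.0) # posterior probability that the OMElow coefficient is negative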
Using the posterior means of the model coefficients, create a point estimate of the probability of correct responses for a child of age 60 months, high OME, using a coherent stimulus of 50 decibels.
xb = -7.29+(0.02)*(60)+(-0.24)*(0)+(0.17)*(50)+(1.58)*(0)
prob = 1/(1+exp(-xb))
prob
## [1] 0.9175867
Use the posterior mean estimates of the model coefficients to create
point estimates of the probability of correct responses for each
observation in the original data. To do this, follow the steps outlined
in the lesson to create a vector of these probabilities called
phat (using our notation from this quiz, it would be \(\phi\)).
Once we have phat, calculate the proportion of in-sample
observations that are correctly classified according to the following
criterion: the model prediction and the observed correct response rate are
either both higher than \(0.7\) or both
lower than \(0.7\).
pm_coef = colMeans(mod3_csim)
pm_coef
## b[1] b[2] b[3] b[4] b0
## 0.01879344 -0.24310415 0.17161151 1.57743555 -7.27966069
pm_Xb = pm_coef["b0"] + X %*% pm_coef[1:4] # linear predictor for every observation
phat = 1.0 / (1.0 + exp(-pm_Xb))
head(phat)
## [,1]
## 1 0.2783705
## 2 0.6513269
## 3 0.4763922
## 4 0.8150154
## 5 0.6821253
## 6 0.9122155
(tab0.7 = table(phat > 0.7, (dat$Correct / dat$Trials) > 0.7))
##
## FALSE TRUE
## FALSE 182 48
## TRUE 63 419
sum(diag(tab0.7)) / sum(tab0.7)
## [1] 0.8441011