For an example of logistic regression, we’ll use the urine data set
from the boot package in R. The response variable is r,
which takes on values of \(0\) or \(1\), indicating the absence or presence of calcium oxalate crystals.
We will remove the rows of the data set that contain missing values.
library("boot")
data("urine")
?urine
head(urine)
## r gravity ph osmo cond urea calc
## 1 0 1.021 4.91 725 NA 443 2.45
## 2 0 1.017 5.74 577 20.0 296 4.49
## 3 0 1.008 7.20 321 14.9 101 2.36
## 4 0 1.011 5.51 408 12.6 224 2.15
## 5 0 1.005 6.52 187 7.5 91 1.16
## 6 0 1.020 5.27 668 25.3 252 3.34
dat = na.omit(urine)
Let’s look at pairwise scatter plots of the seven variables.
pairs(dat)
One thing that stands out is that several of these variables are strongly correlated with one another. For example, gravity and osmo appear to have a very close linear relationship. Collinearity between \(x\) variables in linear regression models can cause trouble for statistical inference. Two correlated variables will compete for the ability to predict the response variable, leading to unstable estimates. This is not a problem for prediction of the response, if prediction is the end goal of the model. But if our objective is to discover how the variables relate to the response, we should avoid collinearity.
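As a quick numeric check of the close relationship between gravity and osmo noted above, we can compute their pairwise correlation (the full correlation matrix is computed next):
cor(dat$gravity, dat$osmo) # single pairwise sample correlation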
We can more formally estimate the correlation among these variables using the corrplot package.
library("corrplot")
## corrplot 0.92 loaded
Cor = cor(dat)
corrplot(Cor, type="upper", method="ellipse", tl.pos="d")
corrplot(Cor, type="lower", method="number", col="black",
add=TRUE, diag=FALSE, tl.pos="n", cl.pos="n")
One primary goal of this analysis is to find out which variables are related to the presence of calcium oxalate crystals. This objective is often called “variable selection.” We have already seen one way to do this: fit several models that include different sets of variables and see which one has the best DIC. Another way to do this is to use a linear model where the priors for the \(\beta\) coefficients favor values near \(0\) (indicating a weak relationship). This way, the burden of establishing association lies with the data. If there is not a strong signal, we assume it doesn’t exist.
Rather than tailoring a prior for each individual \(\beta\) based on the scale on which its covariate is measured, it is customary to standardize each variable: subtract its mean and divide by its standard deviation.
X = scale(dat[,-1], center=TRUE, scale=TRUE)
head(X)
## gravity ph osmo cond urea calc
## 2 -0.1403037 -0.4163725 -0.1528785 -0.1130908 0.25747827 0.09997564
## 3 -1.3710690 1.6055972 -1.2218894 -0.7502609 -1.23693077 -0.54608444
## 4 -0.9608139 -0.7349020 -0.8585927 -1.0376121 -0.29430353 -0.60978050
## 5 -1.7813240 0.6638579 -1.7814497 -1.6747822 -1.31356713 -0.91006194
## 6 0.2699514 -1.0672806 0.2271214 0.5490664 -0.07972172 -0.24883614
## 7 -0.8240622 -0.5825618 -0.6372741 -0.4379226 -0.51654898 -0.83726644
head(X[,"gravity"])
## 2 3 4 5 6 7
## -0.1403037 -1.3710690 -0.9608139 -1.7813240 0.2699514 -0.8240622
colMeans(X)
## gravity ph osmo cond urea
## -1.011341e-13 3.763606e-15 9.948752e-17 6.719012e-16 6.055762e-17
## calc
## 8.074349e-17
Recall that the general form of the apply function is apply(X, MARGIN, FUN). Here we apply the sd function to each column (MARGIN=2) to confirm that each standardized variable has standard deviation \(1\):
apply(X, 2, sd)
## gravity ph osmo cond urea calc
## 1 1 1 1 1 1
Our prior for the \(\beta\) coefficients (which we’ll call \(b\) in the model)
will be the double exponential (or Laplace) distribution,
which, as the name implies, is an exponential distribution with tails
extending in both the positive and negative directions
and a sharp peak at \(0\). We can read
more about it in the JAGS manual. The distribution looks
like this:
ddexp = function(x, mu, tau) {
0.5*tau*exp(-tau*abs(x-mu))
}
curve(ddexp(x, mu=0.0, tau=1.0), from=-5.0, to=5.0, ylab="density", main="Double exponential\ndistribution") # double exponential distribution
curve(dnorm(x, mean=0.0, sd=1.0), from=-5.0, to=5.0, lty=2, add=TRUE) # normal distribution
legend("topright", legend=c("double exponential", "normal"), lty=c(1,2), bty="n")
library("rjags")
## Loading required package: coda
## Linked to JAGS 4.3.2
## Loaded modules: basemod,bugs
mod1_string = " model {
for (i in 1:length(y)) {
y[i] ~ dbern(p[i])
logit(p[i]) = int + b[1]*gravity[i] + b[2]*ph[i] + b[3]*osmo[i] + b[4]*cond[i] + b[5]*urea[i] + b[6]*calc[i]
}
int ~ dnorm(0.0, 1.0/25.0)
for (j in 1:6) {
b[j] ~ ddexp(0.0, sqrt(2.0)) # has variance 1.0
}
} "
set.seed(92)
data_jags = list(y=dat$r, gravity=X[,"gravity"], ph=X[,"ph"], osmo=X[,"osmo"], cond=X[,"cond"], urea=X[,"urea"], calc=X[,"calc"])
params = c("int", "b")
mod1 = jags.model(textConnection(mod1_string), data=data_jags, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 77
## Unobserved stochastic nodes: 7
## Total graph size: 1085
##
## Initializing model
update(mod1, 1e3)
mod1_sim = coda.samples(model=mod1,
variable.names=params,
n.iter=5e3)
mod1_csim = as.mcmc(do.call(rbind, mod1_sim)) # combine the three chains into a single mcmc object
## convergence diagnostics
plot(mod1_sim, ask=TRUE)
gelman.diag(mod1_sim)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1.00 1.01
## b[2] 1.00 1.00
## b[3] 1.01 1.02
## b[4] 1.00 1.00
## b[5] 1.00 1.01
## b[6] 1.00 1.00
## int 1.00 1.00
##
## Multivariate psrf
##
## 1
autocorr.diag(mod1_sim)
## b[1] b[2] b[3] b[4] b[5] b[6]
## Lag 0 1.00000000 1.000000000 1.00000000 1.000000000 1.000000000 1.00000000
## Lag 1 0.83151972 0.284902500 0.89983366 0.761719900 0.802351486 0.49405647
## Lag 5 0.42540473 0.005463314 0.58835014 0.354938947 0.372299887 0.04851994
## Lag 10 0.18327392 0.003639475 0.35477089 0.174409848 0.149151019 -0.01283015
## Lag 50 0.01789098 -0.008856912 -0.02958471 0.003188407 0.008918838 0.01760961
## int
## Lag 0 1.000000000
## Lag 1 0.296337930
## Lag 5 0.014164441
## Lag 10 0.011910560
## Lag 50 -0.007123416
autocorr.plot(mod1_sim)
effectiveSize(mod1_sim)
## b[1] b[2] b[3] b[4] b[5] b[6] int
## 1285.8719 7999.5820 761.1337 1515.6793 1522.8840 4941.3145 7884.8587
## calculate DIC
dic1 = dic.samples(mod1, n.iter=1e3)
Let’s look at the results.
summary(mod1_sim)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 1.6671 0.7492 0.006117 0.020919
## b[2] -0.1412 0.2896 0.002365 0.003244
## b[3] -0.2876 0.8292 0.006771 0.030444
## b[4] -0.7648 0.5123 0.004183 0.013245
## b[5] -0.6200 0.6065 0.004952 0.015604
## b[6] 1.6144 0.4985 0.004070 0.007101
## int -0.1764 0.3034 0.002477 0.003425
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 0.3538 1.1407 1.6123 2.14410 3.3055
## b[2] -0.7455 -0.3226 -0.1227 0.04452 0.4144
## b[3] -2.1203 -0.7277 -0.1988 0.18774 1.2995
## b[4] -1.8211 -1.1039 -0.7444 -0.40117 0.1530
## b[5] -1.9492 -1.0050 -0.5531 -0.17788 0.3904
## b[6] 0.7155 1.2717 1.5805 1.92125 2.6931
## int -0.7703 -0.3755 -0.1773 0.02527 0.4221
par(mfrow=c(3,2))
densplot(mod1_csim[,1:6], xlim=c(-3.0, 3.0))
colnames(X) # variable names
## [1] "gravity" "ph" "osmo" "cond" "urea" "calc"
It is clear that the coefficients for gravity,
cond (conductivity), and calc (calcium
concentration) are not \(0\). The
posterior distribution for the coefficient of osmo
(osmolarity) looks much like the prior and is still roughly centered on \(0\), so we’ll conclude that
osmo is not a strong predictor of calcium oxalate crystals.
The same goes for ph.
The coefficient for urea (urea concentration) appears to be a borderline
case. However, if we refer back to the correlations among the variables,
we see that urea is highly correlated with
gravity, so we opt to remove it.
Our second model looks like this:
mod2_string = " model {
for (i in 1:length(y)) {
y[i] ~ dbern(p[i])
logit(p[i]) = int + b[1]*gravity[i] + b[2]*cond[i] + b[3]*calc[i]
}
int ~ dnorm(0.0, 1.0/25.0)
for (j in 1:3) {
b[j] ~ dnorm(0.0, 1.0/25.0) # noninformative for logistic regression
}
} "
mod2 = jags.model(textConnection(mod2_string), data=data_jags, n.chains=3)
## Warning in jags.model(textConnection(mod2_string), data = data_jags, n.chains =
## 3): Unused variable "ph" in data
## Warning in jags.model(textConnection(mod2_string), data = data_jags, n.chains =
## 3): Unused variable "osmo" in data
## Warning in jags.model(textConnection(mod2_string), data = data_jags, n.chains =
## 3): Unused variable "urea" in data
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 77
## Unobserved stochastic nodes: 4
## Total graph size: 635
##
## Initializing model
update(mod2, 1e3)
mod2_sim = coda.samples(model=mod2,
variable.names=params,
n.iter=5e3)
mod2_csim = as.mcmc(do.call(rbind, mod2_sim))
plot(mod2_sim, ask=TRUE)
gelman.diag(mod2_sim)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1 1
## b[2] 1 1
## b[3] 1 1
## int 1 1
##
## Multivariate psrf
##
## 1
autocorr.diag(mod2_sim)
## b[1] b[2] b[3] int
## Lag 0 1.00000000 1.0000000000 1.000000000 1.000000000
## Lag 1 0.58512339 0.6509919607 0.495719935 0.285089258
## Lag 5 0.11799171 0.1428339563 0.034859106 0.000127407
## Lag 10 0.02027062 0.0008815241 -0.014370521 0.002494546
## Lag 50 0.00595493 0.0091683636 -0.005960314 -0.005702258
autocorr.plot(mod2_sim)
effectiveSize(mod2_sim)
## b[1] b[2] b[3] int
## 3443.848 2853.365 4974.486 8482.137
dic2 = dic.samples(mod2, n.iter=1e3)
dic1
## Mean deviance: 68.41
## penalty 5.543
## Penalized deviance: 73.95
dic2
## Mean deviance: 71.19
## penalty 4.082
## Penalized deviance: 75.28
summary(mod2_sim)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 1.422 0.5028 0.004105 0.008611
## b[2] -1.355 0.4558 0.003722 0.008584
## b[3] 1.881 0.5424 0.004429 0.007697
## int -0.148 0.3222 0.002631 0.003506
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 0.4865 1.0727 1.4005 1.7450 2.4662
## b[2] -2.3121 -1.6500 -1.3327 -1.0389 -0.5088
## b[3] 0.9008 1.4988 1.8599 2.2330 3.0152
## int -0.7781 -0.3616 -0.1542 0.0671 0.5024
HPDinterval(mod2_csim)
## lower upper
## b[1] 0.4710869 2.4477407
## b[2] -2.2603100 -0.4653454
## b[3] 0.8356818 2.9339304
## int -0.7452106 0.5273541
## attr(,"Probability")
## [1] 0.95
par(mfrow=c(3,1))
densplot(mod2_csim[,1:3], xlim=c(-3.0, 3.0))
colnames(X)[c(1,4,6)] # variable names
## [1] "gravity" "cond" "calc"
The DIC is actually better for the first model. Note that we did
change the prior between models, and generally we should not use the DIC
to choose between priors. Hence comparing DIC between these two models
may not be a fair comparison. Nevertheless, they both yield essentially
the same conclusions. Higher values of gravity and
calc (calcium concentration) are associated with higher
probabilities of calcium oxalate crystals, while higher values of
cond (conductivity) are associated with lower probabilities
of calcium oxalate crystals.
There are more modeling options in this scenario, perhaps including transformations of variables, different priors, and interactions between the predictors, but we’ll leave it to you to see if you can improve the model.
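As one sketch of such an option (purely illustrative and not fit here; the interaction term and its prior are not part of the analysis above), an interaction between gravity and calc could be added to the second model’s linear predictor like this:
# hypothetical variant of mod2_string with a gravity-by-calc interaction (not run)
mod2_int_string = " model {
    for (i in 1:length(y)) {
        y[i] ~ dbern(p[i])
        logit(p[i]) = int + b[1]*gravity[i] + b[2]*cond[i] + b[3]*calc[i] + b[4]*gravity[i]*calc[i]
    }
    int ~ dnorm(0.0, 1.0/25.0)
    for (j in 1:4) {
        b[j] ~ dnorm(0.0, 1.0/25.0)
    }
} "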
How do we turn model parameter estimates into model predictions? The key is the form of the model. Remember that the likelihood is Bernoulli: \(y\) equals \(1\) with probability \(p\). We modeled the logit of \(p\) with a linear model, which, as we showed in the first segment of this lesson, leads to an exponential form for \(E(y)=p\).
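Concretely, writing \(x_{1i},\dots,x_{ki}\) for the standardized covariates of observation \(i\), the model implies
\[ E(y_i) = p_i = \frac{1}{1 + e^{-(\text{int} + b_1 x_{1i} + \dots + b_k x_{ki})}}, \]
which always lies between \(0\) and \(1\).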
Take the output from our model in the last segment. We will use the posterior means as point estimates of the parameters.
(pm_coef = colMeans(mod2_csim))
## b[1] b[2] b[3] int
## 1.4221558 -1.3547290 1.8806026 -0.1480463
The posterior mean of the intercept was about \(-0.15\). Since we centered and scaled all
of the covariates, values of \(0\) for
each \(x\) correspond to the average
values. Therefore, if we use our last model, our point estimate for
the probability of calcium oxalate crystals when gravity,
cond, and calc are at their average values is
\(1/(1+e^{-(-0.15)}) = 1/(1+e^{0.15}) = 0.4625702\).
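As a quick check, we can compute this directly from the posterior means stored in pm_coef above (the result will match the hand calculation up to rounding):
1.0 / (1.0 + exp(-pm_coef["int"])) # estimated probability at average covariate values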
Now suppose we want to make a prediction for a new specimen whose value of gravity is average, whose value of cond is one standard deviation below the mean, and whose value of calc is one standard deviation above the mean. Our point estimate for the probability of calcium oxalate crystals is \(1/(1+e^{-(-0.15 + 1.4 \cdot 0.0 - 1.3 \cdot (-1.0) + 1.9 \cdot (1.0))}) = 0.9547825\).
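The same calculation can be done with the unrounded posterior means (so the result will differ slightly from the rounded hand calculation above):
xb_new = pm_coef["int"] + pm_coef["b[1]"]*0.0 + pm_coef["b[2]"]*(-1.0) + pm_coef["b[3]"]*1.0
1.0 / (1.0 + exp(-xb_new)) # gravity at its mean, cond one SD below, calc one SD above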
If we want to make predictions in terms of the original \(x\) variable values, we have two options:
1. For each \(x\) variable, subtract the mean and divide by the standard deviation for that variable in the original data set used to fit the model (see the sketch below).
2. Re-fit the model without centering and scaling the covariates.
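A minimal sketch of option 1, using the centering and scaling attributes that scale() stored in X (the new specimen's raw values here are hypothetical, chosen only for illustration):
x_raw = c(gravity=1.020, cond=20.0, calc=3.0) # hypothetical new specimen on the original scale
ctr   = attr(X, "scaled:center")[c("gravity", "cond", "calc")]
scl   = attr(X, "scaled:scale")[c("gravity", "cond", "calc")]
x_std = (x_raw - ctr) / scl # standardize using the means and SDs from the training data
1.0 / (1.0 + exp(-(pm_coef["int"] + sum(pm_coef[1:3] * x_std))))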
We can use the same ideas to make predictions for each of the original data points. This is similar to what we did to calculate residuals with earlier models.
First we take the \(X\) matrix and matrix multiply it with the posterior means of the coefficients. Then we need to pass these linear values through the inverse of the link function as we did above.
pm_Xb = pm_coef["int"] + X[,c(1,4,6)] %*% pm_coef[1:3]
phat = 1.0 / (1.0 + exp(-pm_Xb))
head(phat)
## [,1]
## 2 0.49841046
## 3 0.10827461
## 4 0.22174178
## 5 0.10679020
## 6 0.27369983
## 7 0.09101617
These phat values are the model’s predicted probability
of calcium oxalate crystals for each data point. We can get a rough idea
of how successful the model is by plotting these predicted values
against the actual outcome.
plot(phat, jitter(dat$r))
Suppose we choose a cutoff for these predicted probabilities. If the model tells us the probability is \(> 0.5\), we will classify the observation as a \(1\) and, if it is \(< 0.5\), we will classify it as a \(0\). That way the model classifies each data point. Now we can tabulate these classifications against the truth to see how well the model predicts the original data.
(tab0.5 = table(phat > 0.5, data_jags$y))
##
## 0 1
## FALSE 38 12
## TRUE 6 21
sum(diag(tab0.5)) / sum(tab0.5)
## [1] 0.7662338
The correct classification rate is about 77%, which is not too bad, but not great.
Now suppose that it is considered really bad to predict no calcium oxalate crystal when there in fact is one. We might then choose to lower our threshold for classifying data points as \(1\)s. Say we change it to \(0.3\). That is, if the model says the probability is \(> 0.3\), we will classify it as having a calcium oxalate crystal.
(tab0.3 = table(phat > 0.3, data_jags$y))
##
## 0 1
## FALSE 32 7
## TRUE 12 26
sum(diag(tab0.3)) / sum(tab0.3)
## [1] 0.7532468
It looks like we gave up a little classification accuracy, but we did indeed increase our chances of detecting a true positive.
We could repeat this exercise for many thresholds between \(0\) and \(1\), and each time calculate our error rates. This is equivalent to calculating what is called the ROC (receiver-operating characteristic) curve, which is often used to evaluate classification techniques.
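A minimal sketch of that idea in base R (the grid of cutoffs is an arbitrary choice; a dedicated package such as pROC would trace the full curve):
# sweep a grid of cutoffs and record the true and false positive rates
cutoffs = seq(0.1, 0.9, by=0.1)
rates = t(sapply(cutoffs, function(cut) {
    yhat = as.numeric(phat > cut)
    c(cutoff = cut,
      true_pos  = mean(yhat[data_jags$y == 1] == 1),
      false_pos = mean(yhat[data_jags$y == 0] == 1))
}))
rates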
These classification tables we have calculated were all in-sample. They were predicting for the same data used to fit the model. We could get a less biased assessment of how well our model performs if we calculated these tables for data that were not used to fit the model. For example, before fitting the model, you could withhold a set of randomly selected “test” data points, and use the model fit to the rest of the “training” data to make predictions on your “test” set.
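A sketch of such a split (the seed and test-set size are arbitrary choices, and the model above was fit to all of the data, so this is illustration only):
set.seed(61) # arbitrary seed for reproducibility
test_idx  = sample(nrow(dat), size=15) # withhold 15 randomly chosen rows as a test set
dat_test  = dat[test_idx, ]
dat_train = dat[-test_idx, ]
# refit the JAGS model using dat_train only, then compute phat for dat_test and tabulate as above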
What is the advantage of using a link function such as the logit transform for logistic regression? Answer: It ensures the success probability (\(E(y)\) if \(y\) is Bernoulli) is between \(0\) and \(1\) without requiring any constraints on the \(x\) variables or the \(\beta\) coefficients.
Logistic regression works with binomial likelihoods in addition to Bernoulli likelihoods. If the response, \(y_i\), is a number of successes in \(n_i\) independent trials each with \(\phi_i\) success probability, we can still model \(\phi_i\) with a linear model using the logit transformation.
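In symbols, the model for observation \(i\) is
\[ y_i \mid \phi_i \sim \text{Binomial}(n_i, \phi_i), \qquad \text{logit}(\phi_i) = \log\frac{\phi_i}{1-\phi_i} = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}, \]
which is exactly the structure we will give JAGS below.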
As an example, consider the OME data in the MASS package
in R. The data consist of experimental results from tests
of auditory perception in children. Under varying conditions and for
multiple trials under each condition, children either correctly or
incorrectly identified the source of changing signals.
Although the independence of the trials and results is questionable, we’ll try fitting a logistic regression to these data. First, we’ll briefly explore the relationships with the following code:
library("MASS")
data("OME")
?OME # background on the data
head(OME)
## ID Age OME Loud Noise Correct Trials
## 1 1 30 low 35 coherent 1 4
## 2 1 30 low 35 incoherent 4 5
## 3 1 30 low 40 coherent 0 3
## 4 1 30 low 40 incoherent 1 1
## 5 1 30 low 45 coherent 2 4
## 6 1 30 low 45 incoherent 2 2
any(is.na(OME)) # check for missing values
## [1] FALSE
dat = subset(OME, OME != "N/A") # manually remove OME missing values identified with "N/A"
dat$OME = factor(dat$OME)
str(dat)
## 'data.frame': 712 obs. of 7 variables:
## $ ID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 30 30 30 30 30 30 30 30 30 30 ...
## $ OME : Factor w/ 2 levels "high","low": 2 2 2 2 2 2 2 2 2 2 ...
## $ Loud : int 35 35 40 40 45 45 50 50 55 55 ...
## $ Noise : Factor w/ 2 levels "coherent","incoherent": 1 2 1 2 1 2 1 2 1 2 ...
## $ Correct: int 1 4 0 1 2 2 3 4 3 2 ...
## $ Trials : int 4 5 3 1 4 2 3 4 3 2 ...
plot(dat$Age, dat$Correct / dat$Trials )
plot(dat$OME, dat$Correct / dat$Trials )
plot(dat$Loud, dat$Correct / dat$Trials )
plot(dat$Noise, dat$Correct / dat$Trials )
We are interested in how these variables relate to the probability of successfully identifying the source of changes in sound. Of these four variables, which appears to have the weakest association with the probability of success?
Answer: OME because the box plots for the two levels of OME are nearly indistinguishable.
Next, we’ll fit a reference logistic regression model with a noninformative prior
in R. We can do this with the glm function,
providing the model formula as with the usual lm, except
now the response is the observed proportion of correct responses. We
must also indicate how many trials were run for each experiment using
the weights argument.
mod_glm = glm(Correct/Trials ~ Age + OME + Loud + Noise, data=dat, weights=Trials, family="binomial")
summary(mod_glm)
##
## Call:
## glm(formula = Correct/Trials ~ Age + OME + Loud + Noise, family = "binomial",
## data = dat, weights = Trials)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.294441 0.434928 -16.772 < 2e-16 ***
## Age 0.018896 0.003767 5.016 5.28e-07 ***
## OMElow -0.237150 0.123257 -1.924 0.0544 .
## Loud 0.171682 0.008880 19.333 < 2e-16 ***
## Noiseincoherent 1.576304 0.115236 13.679 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1431.12 on 711 degrees of freedom
## Residual deviance: 732.38 on 707 degrees of freedom
## AIC: 1262.6
##
## Number of Fisher Scoring iterations: 5
To get an idea of how the model fits, we can create residual plots (using deviance residuals, a special type of residual for non-normal likelihoods) and in-sample prediction plots.
plot(residuals(mod_glm, type="deviance"))
plot(fitted(mod_glm), dat$Correct/dat$Trials)
It appears from the second plot that the model is not very precise (some model predictions are far from the observed proportion of correct responses). Nevertheless, it can be informative about the relationships among the variables.
Report the posterior mode estimate of the coefficient for
OMElow. With this noninformative prior, the posterior mode corresponds to the
maximum likelihood estimate, so the coefficient can be read from the model summary above.
Next, we will fit a similar model in JAGS. To make
results comparable to those of the reference model, we will use the same
configuration of covariates. We can extract this information from the
reference model using model.matrix.
X = model.matrix(mod_glm)[,-1] # -1 removes the column of 1s for the intercept
head(X)
## Age OMElow Loud Noiseincoherent
## 1 30 1 35 0
## 2 30 1 35 1
## 3 30 1 40 0
## 4 30 1 40 1
## 5 30 1 45 0
## 6 30 1 45 1
The data include categorical covariates, which R codes as
dummy (indicator) variables (as with ANOVA). Hence we have an indicator variable for
whether OME is at its low level and another indicating whether the Noise
is incoherent. The intercept is then associated with the baseline
group. Ignoring the continuous variables Age and
Loud, what are the characteristics of this baseline
group?
Answer: The intercept is associated with values of
\(0\) for all covariates. In this case,
OMElow is \(0\) for high OME and \(1\) for low OME, and
Noiseincoherent is \(0\) for coherent noise and \(1\) for incoherent noise.
Therefore, the baseline group has high OME and coherent sound.
We now fit the JAGS model with the fairly noninformative
priors given below. Use three chains with at least 5,000 iterations in
each.
mod3_string = " model {
for (i in 1:length(y)) {
y[i] ~ dbin(phi[i], n[i])
logit(phi[i]) = b0 + b[1]*Age[i] + b[2]*OMElow[i] + b[3]*Loud[i] + b[4]*Noiseincoherent[i]
}
b0 ~ dnorm(0.0, 1.0/5.0^2)
for (j in 1:4) {
b[j] ~ dnorm(0.0, 1.0/4.0^2)
}
} "
data_jags = as.list(as.data.frame(X))
data_jags$y = dat$Correct # this will not work if there are missing values in dat (because they would be ignored by model.matrix). Always make sure that the data are accurately pre-processed for JAGS.
data_jags$n = dat$Trials
str(data_jags) # make sure that all variables have the same number of observations (712).
## List of 6
## $ Age : num [1:712] 30 30 30 30 30 30 30 30 30 30 ...
## $ OMElow : num [1:712] 1 1 1 1 1 1 1 1 1 1 ...
## $ Loud : num [1:712] 35 35 40 40 45 45 50 50 55 55 ...
## $ Noiseincoherent: num [1:712] 0 1 0 1 0 1 0 1 0 1 ...
## $ y : int [1:712] 1 4 0 1 2 2 3 4 3 2 ...
## $ n : int [1:712] 4 5 3 1 4 2 3 4 3 2 ...
params = c("b0","b")
mod3 = jags.model(textConnection(mod3_string), data=data_jags, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 712
## Unobserved stochastic nodes: 5
## Total graph size: 4377
##
## Initializing model
update(mod3, 1e3)
mod3_sim = coda.samples(model=mod3,
variable.names=params,
n.iter=5e3)
mod3_csim = as.mcmc(do.call(rbind, mod3_sim))
plot(mod3_sim, ask=TRUE)
gelman.diag(mod3_sim)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1.01 1.02
## b[2] 1.00 1.01
## b[3] 1.16 1.47
## b[4] 1.01 1.04
## b0 1.16 1.46
##
## Multivariate psrf
##
## 1.12
autocorr.diag(mod3_sim)
## b[1] b[2] b[3] b[4] b0
## Lag 0 1.00000000 1.00000000 1.0000000 1.00000000 1.0000000
## Lag 1 0.91578997 0.81500744 0.9870460 0.50709404 0.9901308
## Lag 5 0.65883199 0.39818264 0.9389104 0.10942812 0.9499293
## Lag 10 0.44934026 0.18448879 0.8862044 0.08470507 0.9008281
## Lag 50 0.03532912 0.03146967 0.6016906 0.05340868 0.6221219
autocorr.plot(mod3_sim)
effectiveSize(mod3_sim)
## b[1] b[2] b[3] b[4] b0
## 603.54411 1291.65630 93.02386 2168.06303 78.00710
dic3 = dic.samples(mod3, n.iter=1e3) # use a new name so we don't overwrite dic2 from the earlier model
raftery.diag(mod3_csim)
##
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95
##
## Burn-in Total Lower bound Dependence
## (M) (N) (Nmin) factor (I)
## b[1] 32 34200 3746 9.13
## b[2] 14 16868 3746 4.50
## b[3] 96 104804 3746 28.00
## b[4] 6 7087 3746 1.89
## b0 77 80675 3746 21.50
We perform some MCMC diagnostic checks. What does the Raftery and
Lewis diagnostic (raftery.diag()) suggest about these
chains?
Answer: The dependence factor for many of the variables is large (\(> 5.0\)), indicating strong autocorrelation in the chains. We would require a large number of iterations to reliably produce 95% probability intervals for the parameters.
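If we needed reliable interval estimates, one option (illustrative settings only, not run here) is simply a much longer, possibly thinned, run:
# hypothetical longer run to reduce the Monte Carlo error flagged by raftery.diag
mod3_sim_long = coda.samples(model=mod3, variable.names=params,
                             n.iter=1e5, thin=10)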
Although OMElow is the predictor with the weakest
statistical association to the probability of correct responses, the
posterior probability that its coefficient, \(\beta_2\), is negative is still \(> 0.9\). How do we interpret this (most
likely) negative coefficient in the context of our model?
Answer: While holding all other predictors constant, low OME is associated with a lower probability of correct responses than high OME.
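The posterior probability that \(\beta_2 < 0\) mentioned above can be computed directly from the combined chains (the exact value will vary slightly from run to run):
mean(mod3_csim[, "b[2]"] < 0.0) # posterior probability that the OMElow coefficient is negative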
Using the posterior means of the model coefficients, create a point estimate of the probability of correct responses for a child of age 60 months, high OME, using a coherent stimulus of 50 decibels.
xb = -7.29+(0.02)*(60)+(-0.24)*(0)+(0.17)*(50)+(1.58)*(0)
prob = 1/(1+exp(-xb))
prob
## [1] 0.9175867
Use the posterior mean estimates of the model coefficients to create
point estimates of the probability of correct responses for each
observation in the original data. To do this, follow the steps outlined
in the lesson to create a vector of these probabilities called
phat (using our notation from this quiz, it would be \(\phi\)).
Once we have phat, calculate the proportion of in-sample
observations that are correctly classified according to the following
criterion: the model prediction and the observed correct response rate are
either both higher than \(0.7\) or both
lower than \(0.7\).
pm_coef = colMeans(mod3_csim)
pm_coef
## b[1] b[2] b[3] b[4] b0
## 0.01879344 -0.24310415 0.17161151 1.57743555 -7.27966069
pm_Xb = pm_coef["b0"] + X %*% pm_coef[1:4] # linear predictor for every observation
phat = 1.0 / (1.0 + exp(-pm_Xb))
head(phat)
## [,1]
## 1 0.2783705
## 2 0.6513269
## 3 0.4763922
## 4 0.8150154
## 5 0.6821253
## 6 0.9122155
(tab0.7 = table(phat > 0.7, (dat$Correct / dat$Trials) > 0.7))
##
## FALSE TRUE
## FALSE 182 48
## TRUE 63 419
sum(diag(tab0.7)) / sum(tab0.7)
## [1] 0.8441011