\(p(\lambda, \theta) \propto \frac{1}{\lambda}\)
\(p(N|\lambda,\theta)=\frac{(\frac{\lambda}{\theta})^Ne^{-\frac{\lambda}{\theta}}}{N!}\)
\(\begin{aligned} p(N,\theta) &= \int_0^\infty p(N,\theta,\lambda)d\lambda \\ &= \int_0^\infty p(N|\lambda,\theta) p(\lambda,\theta) d\lambda \\ &\propto \frac{1}{\theta^N N!} \int_0^\infty \lambda^{N-1} \exp(-\frac{\lambda}{\theta}) d\lambda \\ &\propto \frac{\Gamma(N) \theta^N}{N! \theta^N} \\ &\propto \frac{1}{N} \\ \end{aligned}\)
The rationale to specify the prior distribution in terms of \((\lambda, \theta)\) rather than \((\mu, \theta)\) is that it would be easier to formulate the prior information about \(\lambda\), the unconditional expectation of the observations, than about \(\mu\), the mean of the unobserved quantity N. The prior information about \(\lambda\) would be more precise than that about \(\mu\) or \(\theta\), so that it may be more reasonable to assume \(\lambda\) and \(\theta\) independent a priori than \(\mu\) and \(\theta\).
The suggested prior is not proper, since \(\int_0^1\int_0^\infty \frac{1}{\lambda}d\lambda d\theta = \ln(\lambda)\big|_0^\infty = \infty\).
\(\begin{aligned} p(\lambda,\theta|y_1,...,y_n) &\propto p(y_1,...,y_n|\lambda,\theta) p(\lambda,\theta) \\ &\propto p(\lambda,\theta) \sum_{N=Y}^\infty p(y_1,...,y_n|N,\theta) p(N|\lambda,\theta) \ \#\ denote\ max\{y_i\}\ as\ Y \\ &\propto \frac{1}{\lambda} \sum_{N=Y}^\infty \{\prod_{i=1}^n (C_N^{y_i} \theta^{y_i} (1-\theta)^{N-y_i}) \frac{(\frac{\lambda}{\theta})^Ne^{-\frac{\lambda}{\theta}}}{N!}\} \ \#\ denote\ sum\{y_i\}\ as\ S \\ &\propto \frac{1}{\lambda} \theta^S (1-\theta)^{nN-S} e^{-\frac{\lambda}{\theta}} \sum_{N=Y}^\infty \frac{(\frac{\lambda}{\theta})^N}{N!} \ \#\ taylor\ expansion \\ &\propto \frac{1}{\lambda} \theta^S (1-\theta)^{nN-S} e^{-\frac{\lambda}{\theta}} (e^{\frac{\lambda}{\theta}} - C) \ \#\ C>1\ if\ not\ all\ y_i\ are\ 0 \\ &\propto \frac{1}{\lambda} \theta^S (1-\theta)^{nN-S} (1 - C e^{-\frac{\lambda}{\theta}}) \\ \end{aligned}\)
\(\begin{aligned} \int_0^\infty \int_0^1 p(\lambda,\theta|y_1,...,y_n) d\lambda d\theta &= \int_0^1 \theta^S (1-\theta)^{nN-S} \int_0^\infty \frac{1}{\lambda} (1 - C e^{-\frac{\lambda}{\theta}}) d\lambda d\theta \ \#\ choose\ b\ s.t.\ C\ e^{-\frac{\lambda}{\theta}}=1 \\ &< \int_0^1 \theta^S (1-\theta)^{nN-S} (\int_0^b \frac{1}{\lambda} (1 - C e^{-\frac{\lambda}{\theta}}) + \int_b^\infty \frac{1}{\lambda} (1 - C e^{-\frac{\lambda}{\theta}})) d\lambda d\theta \\ &< \int_0^1 \theta^S (1-\theta)^{nN-S} (0 + \int_b^\infty \frac{1}{\lambda}) d\lambda d\theta \\ &< \int_0^1 \theta^S (1-\theta)^{nN-S} C d\theta \ \#\ C\ is\ a\ constant \\ &< \infty \\ \end{aligned}\)
Therefore, the posterior density function \(p(\lambda,\theta|y_1,...,y_n)\) is proper.
set.seed(666)
setwd("C:/Users/Wei/Documents/Purdue STAT 695 Bayesian Data Analysis/HW3")
Load data first
Y1 = read.csv(file="impala.txt", header=FALSE)
Y2 = read.csv(file="waterbuck.txt", header=FALSE)
From the previous section, we can derive our posterior density:
\(\begin{aligned} p(N, \theta|y_1,...,y_n) &\propto p(y_1,...,y_n|N,\theta) p(N,\theta) \\ &\propto \frac{1}{N} \prod_{i=1}^n (C_N^{y_i} \theta^{y_i} (1-\theta)^{N-y_i}) \\ \end{aligned}\)
Build our log posterior density function
posterior = function(pars, Y) { # returns the log posterior up to an additive constant; pars = c(N, theta)
sum(dbinom(t(Y), pars[1], pars[2], log=TRUE)) - log(pars[1])
}
Set up the proposal distribution to generate random samples and compute the transition probability
library(KScorrect)
proposalFunc = function(par) {
return(c(rpois(1, par[1]), # poisson distribution
rbeta(1, 1, 1 / par[2] - 1))) # beta distribution with mean par[2]
}
log_proposal = function(new, par) { # p(new|par)
dpois(new[1], par[1], log=TRUE) + dbeta(new[2], 1, 1 / par[2] - 1, log=TRUE)
}
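As a quick sanity check (an illustrative snippet, not part of the original analysis), we can verify empirically that the Beta(1, 1/theta - 1) proposal indeed has mean theta:
theta_test = 0.3 # hypothetical value
mean(rbeta(1e5, 1, 1 / theta_test - 1)) # should be close to 0.3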
Build our Metropolis-Hastings algorithm
library(coda)
## Warning: package 'coda' was built under R version 3.4.2
MCMC = function(initialValue, Y, burnIn, iterations) {
chains = array(dim=c(iterations + 1, 2))
chains[1, ] = initialValue
for (i in 1: iterations) {
proposal = proposalFunc(chains[i, ])
accept_p = exp(posterior(proposal, Y) + log_proposal(chains[i, ], proposal)
- posterior(chains[i, ], Y) - log_proposal(proposal, chains[i, ]))
# take care that accept_p = p1 / p2, thus p2 should not equal 0
if (runif(1) < accept_p)
chains[i+1, ] = proposal
else
chains[i+1, ] = chains[i, ]
}
#acceptance = 1 - mean(duplicated(chains)) # hide this information
#print(paste0("acceptance rate: ", round(acceptance * 100, 4), " %"))
chains = chains[-(1: burnIn), ]
chains = data.frame(N=chains[, 1], theta=chains[, 2])
return(mcmc(chains))
}
Generate our first Markov chain for the dataset \(impala.txt\).
initialValue = c(50, 0.5)
iterations = 100000
burnIn = 50000
chains1 = MCMC(initialValue, Y1, burnIn, iterations)
summary(chains1)
##
## Iterations = 1:50001
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 50001
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## N 72.4938 51.2822 0.2293385 8.678
## theta 0.3837 0.1716 0.0007674 0.016
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## N 29.00000 41.0000 57.0000 83.0000 209.0000
## theta 0.09808 0.2508 0.3702 0.5095 0.7306
Construct the trace plots
plot(chains1)
ACF plot: a well-mixing MCMC algorithm should generate parameter values with low autocorrelation, so that it traverses the whole sample space quickly. Here the ACF shows a slow geometric decay, indicating strongly autocorrelated draws and slow mixing.
par(mfrow=c(1,2))
acf(chains1[, 1])
acf(chains1[, 2])
For the Gelman-Rubin diagnostic, the potential scale reduction factor (PSRF) should be close to 1. We see from the diagnostic below that the result is acceptable.
LIST1 = list(chains1)
for (i in 2: 10) {
initialValue = c(sample(30: 60, 1), runif(1, 0.1, 0.9))
chains = MCMC(initialValue, Y1, burnIn, iterations)
LIST1[[i]] = chains
}
MCMC_list = mcmc.list(LIST1)
plot(MCMC_list)
gelman.diag(MCMC_list)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## N 1.06 1.11
## theta 1.02 1.05
##
## Multivariate psrf
##
## 1.05
#gelman.plot(MCMC_list)
A large effective sample size doesn't guarantee sufficient or representative MCMC samples, but a low effective sample size does imply insufficient samples. In our case, the effective sample size is not too small, so it is reasonably safe to use these draws.
effectiveSize(MCMC_list)
## N theta
## 247.1433 954.1380
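For intuition (an illustrative sketch, not part of the original analysis), the effective sample size is roughly the number of draws divided by \(1 + 2\sum_k \rho_k\), where \(\rho_k\) is the lag-k autocorrelation; coda's effectiveSize uses a spectral estimate of the same quantity.
rho = acf(chains1[, 1], plot=FALSE, lag.max=200)$acf[-1] # autocorrelations of N at lags 1..200
nrow(chains1) / (1 + 2 * sum(rho)) # rough ESS for a single chain of N draws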
Repeat the procedure for \(waterbuck.txt\).
initialValue = c(150, 0.5)
iterations = 100000
burnIn = 50000
chains2 = MCMC(initialValue, Y2, burnIn, iterations)
summary(chains2)
##
## Iterations = 1:50001
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 50001
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## N 165.5799 63.6801 0.2847834 11.38122
## theta 0.4367 0.1549 0.0006926 0.02416
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## N 86.0000 114.0000 149.0000 207.0000 316.0000
## theta 0.1973 0.3057 0.4213 0.5514 0.7361
Construct the trace plots
plot(chains2)
The ACF plot again shows a slow geometric decay, i.e. strongly autocorrelated draws, as in the first dataset.
par(mfrow=c(1,2))
acf(chains2[, 1])
acf(chains2[, 2])
For this dataset, the Gelman-Rubin diagnostic is no longer close to 1 (the PSRF for N is about 4.2): we can no longer achieve a low value with this number of samples, so the chains cannot be considered converged.
LIST2 = list(chains2)
for (i in 2: 10) {
initialValue = c(sample(70: 300, 1), runif(1, 0.1, 0.9))
chains = MCMC(initialValue, Y2, burnIn, iterations)
LIST2[[i]] = chains
}
MCMC_list = mcmc.list(LIST2)
plot(MCMC_list)
gelman.diag(MCMC_list)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## N 4.17 10.21
## theta 1.54 2.02
##
## Multivariate psrf
##
## 4.48
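A natural follow-up (a sketch, not run here; iterations_long and burnIn_long are hypothetical settings) is to run much longer chains from the same overdispersed starting points and recompute the diagnostic:
iterations_long = 500000; burnIn_long = 250000 # hypothetical heavier run
LIST2_long = lapply(1: 10, function(i) MCMC(c(sample(70: 300, 1), runif(1, 0.1, 0.9)), Y2, burnIn_long, iterations_long))
gelman.diag(mcmc.list(LIST2_long))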
For this dataset, the effective sample size is small relative to the total number of retained draws, consistent with the strong autocorrelation and the high PSRF above.
effectiveSize(MCMC_list)
## N theta
## 212.0328 437.2159
Prepare the basic settings.
library(MASS)
library(RColorBrewer)
k = 11
my.cols = rev(brewer.pal(k, "RdYlBu"))
Posterior draws of \((N,\theta)\) for \(impala.txt\) (left) and \(waterbuck.txt\) (right).
par(mfrow=c(1,2))
z = kde2d(chains1[, 1], chains1[, 2], n=50)
# p_posterior = sum(dbinom(t(Y1), chains1[, 1], chains1[, 2], log=TRUE)) - log(chains1[, 1]) # unused leftover; not needed for the contour plots
plot(as.matrix(chains1), col="grey", cex = .1, main="impala.txt")
contour(z, drawlabels=FALSE, nlevels=k, col=my.cols, add=TRUE)
z = kde2d(chains2[, 1], chains2[, 2], n=50)
plot(as.matrix(chains2), col="grey", cex = .1, main="waterbuck.txt")
contour(z, drawlabels=FALSE, nlevels=k, col=my.cols, add=TRUE)
We already know from #1 that \(p(N, \theta|y_1,...,y_n) \propto \frac{1}{N} \prod_{i=1}^n (C_N^{y_i}) \theta^{\sum_{i=1}^n y_i} (1-\theta)^{nN-\sum_{i=1}^n y_i}\). Since \(\int_0^1 \theta^{\sum_{i=1}^n y_i} (1-\theta)^{nN-\sum_{i=1}^n y_i} d\theta=\frac{\Gamma(\sum_{i=1}^n y_i + 1)\Gamma(nN-\sum_{i=1}^n y_i + 1)}{\Gamma(nN+2)}\), the marginal posterior distribution of N is
\(\begin{aligned} p(N|y_1,...,y_n) &\propto \frac{1}{N} \prod_{i=1}^n (C_N^{y_i}) \frac{\Gamma(\sum_{i=1}^n y_i + 1)\Gamma(nN-\sum_{i=1}^n y_i + 1)}{\Gamma(nN+2)} \\ &\propto \frac{1}{N} \prod_{i=1}^n (C_N^{y_i}) \frac{(nN-\sum_{i=1}^n y_i)!}{(nN+1)!} \\ &\propto \prod_{i=1}^n (C_N^{y_i}) \frac{1}{N \prod_{j=1+nN-\sum_{i=1}^n y_i}^{nN+1} j} \\ \end{aligned}\)
Build the marginal posterior function for every candidate N and numerically compute the normalizing constant. Here we let \(N\) range from 1 to 1000, since the marginal probability decays extremely fast as N increases.
# work with numbers that overflow in R, e.g. as.brob: log(as.brob(3)^100000) -> 109861.2
library("Brobdingnag")
##
## Attaching package: 'Brobdingnag'
## The following objects are masked from 'package:base':
##
## max, min, prod, range, sum
log_marginal_posterior = function(N, Y) { # N: candidate population size; Y: data with n = 5 observed counts
if (N <= max(Y)) return(-Inf) # N must exceed the largest observed count
# denominator N * prod_{j = nN - S + 1}^{nN + 1} j, computed with Brobdingnag to avoid overflow
denominator_list = mapply(as.brob, seq(5 * N - sum(Y) + 1, 5 * N + 1))
numerator_list = mapply(as.brob, choose(N, t(Y))) # binomial coefficients C(N, y_i)
denominator = N
numerator = 1
for (i in 1: length(denominator_list)) denominator = denominator * denominator_list[[i]]
for (i in 1: 5) numerator = numerator * numerator_list[[i]]
return(log(numerator / denominator))
}
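Equivalently (an alternative sketch, not the approach used above), the same log marginal posterior can be computed entirely on the log scale with lchoose and lgamma, avoiding big-number arithmetic; it should agree with log_marginal_posterior up to floating-point error.
log_marginal_posterior_alt = function(N, Y) {
y = as.numeric(t(Y)); n = length(y); S = sum(y)
if (N <= max(y)) return(-Inf)
sum(lchoose(N, y)) - log(N) + lgamma(n * N - S + 1) - lgamma(n * N + 2)
}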
Show the density functions of \(N\) for the two datasets.
par(mfrow=c(1,2))
Normal_constant1 = exp(mapply(log_marginal_posterior, seq(1, 1000), Y1)) # these values do not overflow, so exponentiate directly
Normal_constant2 = mapply(log_marginal_posterior, seq(1, 1000), Y2)
Normal_constant2 = Normal_constant2 - max(Normal_constant2)
Normal_constant2 = exp(Normal_constant2) / sum(exp(Normal_constant2))
plot(Normal_constant1, main="impala.txt", xlab="", ylab="", col="blue", cex = .3)
plot(Normal_constant2, main="waterbuck.txt", xlab="", ylab="", col="blue", cex = .3)
The posterior probability of \(N > 100\) obtained analytically for the two datasets.
prob_impala = sum(Normal_constant1[101: 1000]) / sum(Normal_constant1) # impala.txt
prob_waterbuck = sum(Normal_constant2[101: 1000]) / sum(Normal_constant2) # waterbuck.txt
prob_impala
## [1] 0.3107036
prob_waterbuck
## [1] 0.9518959
The posterior probability of \(N > 100\) obtained from 10 chains of MCMC samples for the two datasets, respectively.
prob_mcmc = array(dim=c(10, 2))
for (i in 1: 10) {
prob_mcmc[i, 1] = length(LIST1[[i]][LIST1[[i]][, 1] > 100, 1]) / length(LIST1[[i]][, 1])
prob_mcmc[i, 2] = length(LIST2[[i]][LIST2[[i]][, 1] > 100, 1]) / length(LIST2[[i]][, 1])
}
mean(prob_mcmc[, 1]); sd(prob_mcmc[, 1])
## [1] 0.2630747
## [1] 0.1024621
mean(prob_mcmc[, 2]); sd(prob_mcmc[, 2])
## [1] 0.8622988
## [1] 0.06439491
From the results, we see that the probability computed analytically falls within the range of the MCMC estimates, which suggests a good match. The waterbuck posterior has a very long right tail, which explains the larger spread of the MCMC estimates for that dataset.
par(mfrow = c(1, 2))
hist(prob_mcmc[, 1], nclass=3, main="impala: P(N > 100) estimates", xlab="Red line is computed analytically" )
abline(v = prob_impala, col="red" )
hist(prob_mcmc[, 2], nclass=3, main="waterbuck: P(N > 100) estimates", xlab="Red line is computed analytically" )
abline(v = prob_waterbuck, col="red" )
To identify significant main effects and interactions of the five interventions for the representatives from the New York DoE, I will first use probit regression for variable selection and then use ridge regression to quantify the effects of these interventions.
Load data first
set.seed(666)
setwd("C:/Users/Wei/Documents/Purdue STAT 695 Bayesian Data Analysis/HW3")
data_covariates = read.csv("ny_doe_covariates.csv", header=TRUE)
data_assignments = read.csv("ny_doe_treatment_assignments.csv", header=TRUE)
data_grades = read.csv("ny_doe_cumulative_grades.csv", header=TRUE)
Normalize the input
library(standardize)
## Warning: package 'standardize' was built under R version 3.4.2
colNames = colnames(data_assignments)
data_assignments = scale(data_assignments)
From the viewpoint of experimental design, the principles of effect hierarchy, effect heredity and effect sparsity are closely related to model selection: first-order effects typically have larger influence than second-order effects (interactions). Before we deal with interactions, let's first consider the input with only the first-order effects and an intercept.
X = cbind(rep(1, nrow(data_assignments)), data_assignments)
colnames(X)[1] = "Intercept"
Compare last year's grades with this year's to determine whether the first-order effects have made a difference: 1 represents an improvement in grade, 0 otherwise.
past_grades = data_covariates[11: 13]
past_grades = rowSums(past_grades)
Y = as.vector(data_grades > past_grades)
Y[Y==TRUE] = 1
Y[Y==FALSE] = 0
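A quick look at the resulting class balance (illustrative; this output is not part of the original):
table(Y) # number of schools without (0) and with (1) a grade improvement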
Build the log posterior density function. Since we have little prior information, a weakly informative Cauchy prior with scale 2.5 is placed on each coefficient.
log_posterior = function(corr) {
return(sum(dcauchy(corr, scale=2.5, log=TRUE)) +
sum(Y * pnorm(X %*% corr, log.p=TRUE) +
(1 - Y) * pnorm(X %*% corr, lower.tail=FALSE, log.p=TRUE)))
}
Before sampling coefficients with MCMC, we first find the posterior mode, which gives a robust starting point and (via the Hessian) a proposal covariance.
optim_pars = optim(rep(0, ncol(X)), log_posterior,
method="BFGS", control=list(fnscale=-1), hessian=TRUE)
initialValue = optim_pars$par
Hessian = optim_pars$hessian / (2.4 / sqrt(ncol(X))) # BDA pg. 290
burnIn = 10000
iterations = 20000
Set up the MCMC sampler with a symmetric proposal distribution
MCMC_MH = function(X, initialValue, Hessian) {
chains = array(dim=c(iterations + 1, ncol(X)))
chains[1, ] = initialValue
Rchol = chol(-Hessian)
BarProgress = txtProgressBar(min = 1, max = nrow(chains), style = 3)
for (i in 1: iterations) {
# better avoid saving the inverse of a matrix, compute them instead
proposal = chains[i, ] + backsolve(Rchol, rnorm(ncol(X)))
# write exp(num) as num to avoid overflow; symmetric proposal
log_acceptance_prob = log_posterior(proposal) - log_posterior(chains[i, ])
chains[i + 1, ] = chains[i, ]
if (log(runif(1)) < log_acceptance_prob)
chains[i + 1, ] = proposal
#setTxtProgressBar(BarProgress, i) # not suitable in rmd, useful in other cases
}
close(BarProgress)
return(chains[-(1: burnIn), ])
}
Compute and visualize the result. From the 2.5%-97.5% quantiles of the parameters, we see that QR, PA and SPBP are significantly different from 0, while the remaining coefficients (intercept, IT and ARIS) are not clearly so.
pars_draws = MCMC_MH(X, initialValue, Hessian)
par(mfrow=c(1, 3))
for (i in 1: ncol(X)) {
if (i == 1) print("Significant effects")
interval = quantile(pars_draws[ , i], probs=c(0.025, 0.975))
if (interval[1] * interval[2] < 0) { # the 95% interval contains 0: the coefficient does not differ significantly from 0
hist(pars_draws[ , i], nclass=30, main=colnames(X)[i], xlab="" )
abline(v = 0, col="red")
}
else {
print(colnames(X)[i])
}
}
## [1] "Significant effects"
## [1] "QR"
## [1] "PA"
## [1] "SPBP"
Next, let's consider two-factor interaction terms. We will evaluate the first-order effects of QR, PA and SPBP together with all second-order effects that contain at least one of these main factors. Moreover, we exclude the intercept term since it is not significant.
QR = X[, 2]; PA = X[, 3]; IT = X[, 4]; SPBP = X[, 5]; ARIS = X[, 6]
XX = cbind(QR, PA, SPBP, QR * PA, QR * SPBP, PA * SPBP,
QR * IT, QR * ARIS, PA * IT, PA * ARIS, SPBP * IT, SPBP * ARIS)
colnames(XX) = c("QR", "PA", "SPBP", "QR*PA", "QR*SPBP", "PA*SPBP",
"QR*IT", "QR*ARIS", "PA*IT", "PA*ARIS", "SPBP*IT", "SPBP*ARIS")
Rewrite the posterior density function for the new design matrix, and find the posterior mode again so that sampling starts from a robust point.
log_posterior = function(corr) {
return(sum(dcauchy(corr, scale=2.5, log=TRUE)) +
sum(Y * pnorm(XX %*% corr, log.p=TRUE) +
(1 - Y) * pnorm(XX %*% corr, lower.tail=FALSE, log.p=TRUE)))
}
optim_pars = optim(rep(0, ncol(XX)), log_posterior,
method="BFGS", control=list(fnscale=-1), hessian=TRUE)
initialValue = optim_pars$par
Hessian = optim_pars$hessian / (2.4 / sqrt(ncol(XX))) # BDA pg. 290
Compute and visualize the result. We see that all second-order effects other than \(PA \times IT\) can be ignored.
pars_draws = MCMC_MH(XX, initialValue, Hessian)
par(mfrow=c(2, 4))
for (i in 1: ncol(XX)) {
if (i == 1) print("Significant effects")
interval = quantile(pars_draws[ , i], probs=c(0.025, 0.975))
if (interval[1] * interval[2] < 0) { # the 95% interval contains 0: the coefficient does not differ significantly from 0
hist(pars_draws[ , i], nclass=30, main=colnames(XX)[i], xlab="" )
abline(v = 0, col="red")
}
else {
print(colnames(XX)[i])
}
}
## [1] "Significant effects"
## [1] "QR"
## [1] "PA"
## [1] "SPBP"
## [1] "PA*IT"
Since a second-order effect already makes a difference, let's check whether third-order effects should also be included. Same procedure again.
XX = cbind(QR, PA, SPBP, PA * IT,
QR*PA*IT, QR*PA*SPBP, QR*PA*ARIS, QR*IT*SPBP, QR*IT*ARIS, QR*SPBP*ARIS,
PA*IT*SPBP, PA*IT*ARIS, PA*SPBP*ARIS,
IT*SPBP*ARIS)
colnames(XX) = c("QR", "PA", "SPBP", "PA*IT", "QR*PA*IT",
"QR*PA*SPBP", "QR*PA*ARIS", "QR*IT*SPBP",
"QR*IT*ARIS", "QR*SPBP*ARIS", "PA*IT*SPBP",
"PA*IT*ARIS", "PA*SPBP*ARIS", "IT*SPBP*ARIS")
log_posterior = function(corr) {
return(sum(dcauchy(corr, scale=2.5, log=TRUE)) +
sum(Y * pnorm(XX %*% corr, log.p=TRUE) +
(1 - Y) * pnorm(XX %*% corr, lower.tail=FALSE, log.p=TRUE)))
}
optim_pars = optim(rep(0, ncol(XX)), log_posterior,
method="BFGS", control=list(fnscale=-1), hessian=TRUE)
initialValue = optim_pars$par
Hessian = optim_pars$hessian / (2.4 / sqrt(ncol(XX))) # BDA pg. 290
Compute and visualize the result. None of the third-order effects has a notable impact on our model, so we will focus only on the interventions QR, PA, SPBP and \(PA \times IT\).
pars_draws = MCMC_MH(XX, initialValue, Hessian)
colnames(pars_draws) = colnames(XX)
par(mfrow=c(3, 4))
for (i in 1: ncol(XX)) {
if (i == 1) print("Significant effects")
interval = quantile(pars_draws[ , i], probs=c(0.025, 0.975))
if (interval[1] * interval[2] < 0) { # the 95% interval contains 0: the coefficient does not differ significantly from 0
hist(pars_draws[ , i], nclass=30, main=colnames(XX)[i], xlab="" )
abline(v = 0, col="red")
}
else {
print(colnames(XX)[i])
}
}
## [1] "Significant effects"
## [1] "QR"
## [1] "PA"
## [1] "SPBP"
Evaluate the effects of QR, PA, SPBP and PA*IT. Generate the input and adapt the log posterior to the new design matrix.
XX = cbind(QR, PA, SPBP, PA * IT)
colnames(XX) = c("QR", "PA", "SPBP", "PA*IT")
log_posterior = function(corr) {
return(sum(dcauchy(corr, scale=2.5, log=TRUE)) +
sum(Y * pnorm(XX %*% corr, log.p=TRUE) +
(1 - Y) * pnorm(XX %*% corr, lower.tail=FALSE, log.p=TRUE)))
}
Same procedure again. We see from the quantile statistics that all retained effects are significant in our model.
optim_pars = optim(rep(0, ncol(XX)), log_posterior,
method="BFGS", control=list(fnscale=-1), hessian=TRUE)
initialValue = optim_pars$par
Hessian = optim_pars$hessian / (2.4 / sqrt(ncol(XX))) # BDA pg. 290
pars_draws = MCMC_MH(XX, initialValue, Hessian)
colnames(pars_draws) = colnames(XX)
pars_draws = data.frame(pars_draws)
We can see that QR (Quality Review) has a negative impact on performance
quantile(pars_draws$QR, probs=c(0.025, 0.975))
## 2.5% 97.5%
## -0.37940459 -0.03391224
PA (Periodic Assessments), SPBP (School-wide Performance Bonus Program) and PA*IT (Periodic Assessments combined with an Inquiry Team) significantly improve performance.
quantile(pars_draws$PA, probs=c(0.025, 0.975))
## 2.5% 97.5%
## 0.3267412 0.7079752
quantile(pars_draws$SPBP, probs=c(0.025, 0.975))
## 2.5% 97.5%
## 0.3241203 0.6926350
quantile(pars_draws$PA.IT, probs=c(0.025, 0.975))
## 2.5% 97.5%
## 0.2029513 0.5753790
Let's move to the second part and check whether the model makes sense, using posterior checks. Set up the variables first
XX = data.frame(XX)
group = list(QR=XX$QR>0, PA=XX$PA>0, SPBP=XX$SPBP>0, PA.IT=XX$PA.IT>0)
mat = matrix(NA, nrow=nrow(pars_draws), ncol=2)
posterior_check = list(QR=mat, PA=mat, SPBP=mat, PA.IT=mat)
We use the posterior draws of the parameters to compute predicted success probabilities and compare them with the observed binary outcomes (since the features are standardized to mean 0, a positive feature value indicates the treatment group).
#BarProgress = txtProgressBar(min=1, max=nrow(pars_draws), style=3)
for (i in 1: nrow(pars_draws)) {
diff = rep(NA, nrow(XX))
for (j in 1: nrow(XX)) {
prediction = pnorm(sum(XX[j, ] * pars_draws[i, ]))
diff[j] = Y[j] - prediction
}
# posterior_check$element = [treatment group, control group]
posterior_check$QR[i, ] = c(mean(diff[group$QR]), mean(diff[!group$QR]))
posterior_check$PA[i, ] = c(mean(diff[group$PA]), mean(diff[!group$PA]))
posterior_check$SPBP[i, ] = c(mean(diff[group$SPBP]), mean(diff[!group$SPBP]))
posterior_check$PA.IT[i, ] = c(mean(diff[group$PA.IT]), mean(diff[!group$PA.IT]))
#setTxtProgressBar(BarProgress, i) # not suitable in rmd
}
#close(BarProgress)
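The double loop above can also be written in vectorized form (an equivalent sketch using the same XX, pars_draws and Y; rows index posterior draws, columns index schools):
pred_mat = pnorm(as.matrix(pars_draws) %*% t(as.matrix(XX))) # predicted success probabilities
diff_mat = matrix(Y, nrow=nrow(pred_mat), ncol=length(Y), byrow=TRUE) - pred_mat # residuals Y_j - prediction
# e.g. rowMeans(diff_mat[, group$QR]) reproduces posterior_check$QR[, 1]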
Let’s try to evaluate the result using quantile statistics.
mt = cbind(posterior_check$QR, posterior_check$PA, posterior_check$SPBP, posterior_check$PA.IT)
up_quantile = function(vec) quantile(vec, prob=0.975)
down_quantile = function(vec) quantile(vec, prob=0.025)
evals = data.frame(mean=apply(mt, 2, mean),
up=apply(mt, 2, up_quantile),
down=apply(mt, 2, down_quantile))
From the figure below, we see that zero lies inside all the credible intervals, which means the model does a fair job. Further questions, such as why the differences appear slightly biased, are not discussed here due to time constraints.
plot(evals$mean, ylim=c(min(evals$down), max(evals$up)), xaxt="n", ylab="Diff", xlab="")
polygon(c(1: 8, rev(1: 8)),c(evals$down, rev(evals$up)),col=gray(0.8))
lines(evals$mean, pch=15, panel.first = grid())
points(evals$up)
points(evals$down)
abline(h=0, col="blue")
axis(1, at=1:8, labels=rownames(evals))
We have finished variable selection; let's proceed by analyzing the magnitudes of the intervention effects.
XX = cbind(rep(1, nrow(XX)), QR, PA, SPBP, PA * IT)
colnames(XX) = c("Intercept", "QR", "PA", "SPBP", "PA*IT")
Y = (data_grades - past_grades)[1: 232, ]
Before we move forward, let's examine the correlation matrix of the selected features. The correlations are small, so ridge regression can be applied safely.
round(cor(XX[, 2:5]), 4)
## QR PA SPBP PA*IT
## QR 1.0000 0.0000 0.0082 -0.0003
## PA 0.0000 1.0000 -0.0259 0.0000
## SPBP 0.0082 -0.0259 1.0000 -0.0436
## PA*IT -0.0003 0.0000 -0.0436 1.0000
coef = function(k) {
R = t(XX) %*% XX
inv = solve(R + k * diag(5)) # the ridge penalty k improves the conditioning of the matrix being inverted
mu = inv %*% t(XX) %*% Y # posterior mean / ridge estimate
sd = sqrt(diag(inv %*% R %*% inv)) # posterior sd, assuming unit error variance (sigma^2 = 1)
df = data.frame(mean=mu, sd=sd, '0.025'=mu-2*sd, '0.975'=mu+2*sd)
round(df, 4)
}
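For reference (a standard identity, added here as a note): under the model \(Y \sim N(X\beta, \sigma^2 I)\) with prior \(\beta \sim N(0, \tau^2 I)\), the posterior mean of \(\beta\) is exactly the ridge estimate computed by coef above, with penalty \(k = \sigma^2 / \tau^2\):
\(\hat{\beta}_{ridge} = (X^TX + kI)^{-1}X^TY, \ k = \frac{\sigma^2}{\tau^2}\)
With \(\sigma^2 = 1\), this gives the correspondence between k and the prior variances used below.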
We test three different choices of k: 0.01, 1 and 100, corresponding to coefficient priors \(N(0, 10^2)\), \(N(0, 1^2)\) and \(N(0, 0.1^2)\) respectively. Searching over all possible priors is beyond our interest here, since the score only ranges from 0 to 100.
coef(0.01)
## mean sd X0.025 X0.975
## Intercept -1.3130 0.0657 -1.4443 -1.1817
## QR -0.4107 0.0658 -0.5423 -0.2791
## PA 5.3510 0.0658 5.2194 5.4826
## SPBP 5.9791 0.0659 5.8474 6.1109
## PA*IT 3.4220 0.0660 3.2900 3.5540
coef(1)
## mean sd X0.025 X0.975
## Intercept -1.3071 0.0654 -1.4379 -1.1764
## QR -0.4087 0.0655 -0.5398 -0.2777
## PA 5.3275 0.0655 5.1964 5.4586
## SPBP 5.9523 0.0656 5.8211 6.0835
## PA*IT 3.4061 0.0657 3.2747 3.5375
coef(100)
## mean sd X0.025 X0.975
## Intercept -0.9044 0.0459 -0.9961 -0.8126
## QR -0.2762 0.0459 -0.3680 -0.1844
## PA 3.7008 0.0459 3.6090 3.7926
## SPBP 4.1090 0.0459 4.0172 4.2009
## PA*IT 2.3234 0.0460 2.2314 2.4153
From the three sets of results above, we can see clearly that QR (Quality Review) has a slight negative impact on performance, reducing the overall score by roughly 0.2 to 0.4 points. In contrast, PA (Periodic Assessments), SPBP (School-wide Performance Bonus Program) and PA*IT (Periodic Assessments combined with an Inquiry Team) significantly improve the annual performance, by no less than 3.6, 4.0 and 2.2 points respectively.
Raftery, A. E. (1988). Inference for the Binomial N parameter: A hierarchical Bayes approach. Biometrika 75 (2), 223-228.
A simple Metropolis-Hastings MCMC in R
MCMC chain analysis and convergence diagnostics with coda in R