First, some notation: any vector is a column vector, the parameter with index 0 corresponds to the bias term, and a dotted dimension such as \(\dot K\) or \(\dot H\) indicates that the bias term is included (so \(\dot K = K+1\) and \(\dot H = H+1\)). We define \(X\) as the \(n\times \dot K\) feature matrix, \(Z\) as the \(n\times \dot H\) hidden-node matrix, and \(Y\) as the \(n\times 1\) label vector.
Given fixed \(\alpha\) and \(\beta\), and the sigmoid function \(\psi(x) = \frac{1}{1+e^{-x}}\), the predicted probability from one forward pass is
\(\begin{aligned} f(x_i;\alpha,\beta) &= \psi(z_i^T \beta) \\ &= \psi(\sum_{h=0}^H \psi(x_i^T \alpha_{\cdot h}) \beta_h) \\ &= \psi(\sum_{h=0}^H \psi(\sum_{k=0}^K x_{ik} \alpha_{kh}) \beta_h) \\ \end{aligned}\)
For a batch of inputs, the predictions can be written in matrix form:
\(\begin{aligned} F &= F_{n\times 1}(X;\alpha,\beta) \\ &= \psi(Z_{n\times \dot H} \beta_{\dot H\times 1}) \\ &= \psi(\psi(X_{n\times \dot K} \alpha_{\dot K \times H})_{n\times \dot H} \beta_{\dot H\times 1}) \\ \end{aligned}\)
where a column of ones is appended to \(\psi(X\alpha)\) to form the \(n\times \dot H\) matrix \(Z\).
Model the probability that y is 0 or 1 with \(P(y_i|x_i;\alpha,\beta)=f(x_i;\alpha,\beta)^{y_i} (1-f(x_i;\alpha,\beta))^{1-y_i}\)
Thus, the log likelihood function follows
\(\begin{aligned} \log L(\alpha, \beta|X) &= \log P(Y|X;\alpha,\beta) \\ &= \log \prod_{i=1}^n P(y_i|x_i;\alpha,\beta) \\ &= \sum_{i=1}^n \{y_i \log f(x_i;\alpha,\beta) + (1-y_i)\log(1-f(x_i;\alpha,\beta))\} \\ &= Y_{n\times 1}^T \log F_{n\times 1} + (1-Y)_{n\times 1}^T\log(1-F)_{n\times 1} \\ \end{aligned}\)
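As a quick sanity check of the matrix form on the last line, here is a minimal sketch with made-up toy values (Y_toy and Fs_toy are hypothetical placeholders, not the data used later in this report); the vectorized expression agrees with the elementwise sum.
set.seed(1)
Y_toy = rbinom(5, 1, 0.5) # toy binary labels
Fs_toy = runif(5, 0.1, 0.9) # toy fitted probabilities F
sum_form = sum(Y_toy * log(Fs_toy) + (1 - Y_toy) * log(1 - Fs_toy))
matrix_form = t(Y_toy) %*% log(Fs_toy) + t(1 - Y_toy) %*% log(1 - Fs_toy)
all.equal(sum_form, as.numeric(matrix_form)) # TRUE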
Suppose the priors are \(\alpha \sim N(0, I_{\dot K\times H})\) and \(\beta \sim N(0, I_{\dot H})\), i.e. independent standard normal priors on every weight.
The logarithm of the posterior density of \((\alpha,\beta)\) given the data is
\(\begin{aligned} \log P(\alpha,\beta|X, Y) &\propto \log L(\alpha, \beta|X) + \log P(\alpha) + \log P(\beta) \\ &\propto \log L(\alpha, \beta|X) - \frac{1}{2} \sum_{k,h} \alpha_{kh}^2 - \frac{1}{2} \sum_{h} \beta_h^2 \\ \end{aligned}\)
Denote by \(\cdot\) and \(./\) pointwise multiplication and division, where elements in the same row are operated on together (row-wise broadcasting). For a batch of inputs, the derivative of \(F\) with respect to \(\beta\) follows
\(\begin{aligned} F_{\beta}'(X;\alpha,\beta) &=\psi_{\beta}'(Z_{n\times \dot H} \beta_{\dot H\times 1}) \\ &=Z_{n\times \dot H} \cdot \psi_{n\times 1} \cdot (1-\psi)_{n\times 1} \\ &=Z_{n\times \dot H} \cdot F_{n\times 1} \cdot (1-F)_{n\times 1} \\ \end{aligned}\)
Therefore,
\(\frac{\partial \log P(\alpha, \beta|X, Y)}{\partial \beta}= (F_{\beta}'./F)^TY-(F_{\beta}'./(1-F))^T(1-Y) - \beta\)
\(\begin{aligned} F_{\alpha}'(X;\alpha,\beta) &=\psi_{\alpha}'(\psi(X_{n\times \dot K} \alpha_{\dot K\times H})_{n\times \dot H} \beta_{\dot H\times 1}) \\ &=[(F_{n\times 1} \cdot (1-F)_{n\times 1})\beta_{H\times 1}^T] \cdot \psi_{\alpha}'(X_{n\times \dot K} \alpha_{\dot K \times H}) \quad \text{(the hidden bias column is dropped, since its gradient w.r.t. } \alpha \text{ vanishes)} \\ &=\{[(F_{n\times 1} \cdot (1-F)_{n\times 1})\beta_{H\times 1}^T] \cdot Z_{n\times H} \cdot (1-Z)_{n\times H}\} \otimes X_{n\times \dot K} \\ \end{aligned}\)
where \(\otimes\) denotes the row-wise outer product: row \(i\) of the result is the flattened outer product of row \(i\) of the left operand with row \(i\) of \(X\).
Therefore, \(\frac{\partial \log P(\alpha, \beta|X, Y)}{\partial \alpha}=(F_{\alpha}'./F)^TY-(F_{\alpha}'./(1-F))^T(1-Y) - \alpha\)
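To make the \(\otimes\) notation concrete, here is a minimal sketch with hypothetical toy matrices (A_toy plays the role of the \(n\times H\) term in braces, B_toy the role of \(X_{n\times \dot K}\)); the column-replication trick shown here is the same indexing pattern used in the gradient code later on.
set.seed(1)
n_toy = 3; H_toy = 2; Kdot_toy = 4 # toy dimensions; Kdot_toy plays the role of K+1
A_toy = matrix(rnorm(n_toy * H_toy), ncol=H_toy)
B_toy = matrix(rnorm(n_toy * Kdot_toy), ncol=Kdot_toy)
rowwise_outer = A_toy[, rep(1:H_toy, each=Kdot_toy)] * B_toy[, rep(1:Kdot_toy, H_toy)]
# row 1 of the result equals the flattened outer product of the first rows
all.equal(rowwise_outer[1, ], as.vector(t(outer(A_toy[1, ], B_toy[1, ])))) # TRUE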
set.seed(666)
setwd("C:/Users/Wei/Documents/Purdue STAT 695 Bayesian Data Analysis/HW4")
Load data first
data = read.csv(file="moon_shape.csv", header=TRUE)
Extract the feature matrix, stored as features, and the label vector \(Y\).
features = as.matrix(data[, 1:2]) # features dim: 1000X2
Y = data[, 3]
Set the number of nodes in the hidden layer. I picked \(H=2\) because, over several trials, it gave the most sensible results: few parameters are indistinguishable from 0, and the chains mix more easily.
K = 2
H = 2 # can be set to an arbitrary number
Normalize the feature matrix.
features = apply(features, MARGIN=2, FUN = function(X) (X - mean(X)) / sd(X))
Add bias to the feature matrix.
features = cbind(features, rep(1, nrow(features)))
Write the forward network to make predictions given weights.
sigmoid = function(x) 1 / (1 + exp(-x))
Hidden_nodes = function(weights) {
  alpha = matrix(head(weights, H * (K + 1)), ncol=H) # first H*(K+1) weights are alpha, stored column by column
  sigmoid(features %*% alpha)
}
Forward = function(weights, Z_hidden=FALSE) {
  beta = tail(weights, H + 1) # last H+1 weights are beta, including the hidden-layer bias weight
  if (length(Z_hidden) == 1) Z_hidden = Hidden_nodes(weights)
  as.vector(sigmoid(cbind(Z_hidden, rep(1, nrow(features))) %*% beta))
}
Build our log posterior function.
log_Posterior = function(weights) { # make it negative to suit Neal's code
  Fs = Forward(weights)
  logLike = sum(Y * log(Fs) + (1 - Y) * log(1 - Fs))
  return(-(logLike - 0.5 * sum(weights^2)))
}
Derive the gradient function from the log posterior.
gradient = function(weights) {
  beta = tail(weights, H + 1)
  Zs = Hidden_nodes(weights)
  Fs = Forward(weights, Zs)
  # beta first
  f_beta = Fs * (1 - Fs) * cbind(Zs, rep(1, nrow(Zs))) # pointwise multiplication, bias column included
  Pos_beta = t(f_beta / Fs) %*% Y + t(-f_beta / (1 - Fs)) %*% (1 - Y)
  # compute alpha
  fz_prime = f_beta[, 1: H] * (1 - Zs) %*% diag(beta[1: H]) # drop the bias column, since its gradient w.r.t. alpha vanishes
  f_alpha = fz_prime[, rep(1: H, each=K + 1)] * features[, rep(1: (K + 1), H)] # rowwise outer product
  Pos_alpha = t(f_alpha / Fs) %*% Y + t(-f_alpha / (1 - Fs)) %*% (1 - Y)
  return(-(c(Pos_alpha, Pos_beta) - weights))
}
Before we proceed, let’s run a gradient check to verify that our gradient function is correct.
weights = rnorm(H * (K + 1) + H + 1)
round(gradient(weights), 6) == round(numDeriv::grad(log_Posterior, weights), 6)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The code below is Radford M. Neal’s simple HMC implementation (2010); see his review "MCMC using Hamiltonian dynamics" for more variants of HMC.
# SIMPLE IMPLEMENTATION OF HAMILTONIAN MONTE CARLO. By Radford M. Neal, 2010.
HMC = function (U, grad_U, current_q, epsilon = .1, L = 10) {
  q = current_q
  p = rnorm(length(q), 0, 1) # independent standard normal variates
  current_p = p
  # Make a half step for momentum at the beginning
  p = p - epsilon * grad_U(q) / 2
  # Alternate full steps for position and momentum
  for (i in 1: L) {
    # Make a full step for the position
    q = q + epsilon * p
    # Make a full step for the momentum, except at end of trajectory
    if (i != L) p = p - epsilon * grad_U(q)
  }
  # Make a half step for momentum at the end.
  p = p - epsilon * grad_U(q) / 2
  # Negate momentum at end of trajectory to make the proposal symmetric
  p = -p
  # Evaluate potential and kinetic energies at start and end of trajectory
  current_U = U(current_q)
  current_K = sum(current_p^2) / 2
  proposed_U = U(q)
  proposed_K = sum(p^2) / 2
  # Accept or reject the state at end of trajectory, returning either
  # the position at the end of the trajectory or the initial position
  if (log(runif(1)) < current_U - proposed_U + current_K - proposed_K) {
    # acc <<- acc + 1
    current_q = q # accept
  }
  return(current_q)
}
One reason a neural network posterior may not mix well is that the hidden nodes within a layer have no inherent ordering (label switching). The workaround I use is to reorder the columns of each chain by their posterior column means. It is not a perfect fix, but it works here.
library(coda)
## Warning: package 'coda' was built under R version 3.4.2
iters = 2000
burnIn = 0.5
HMC_chain = function() {
  chains = matrix(rnorm(iters * (H*(K+1) + H + 1)), nrow=iters)
  for(i in 2: iters) chains[i, ] = HMC(log_Posterior, gradient, chains[i-1, ], 0.1, 20)
  new_order = order(colMeans(chains[(iters*burnIn): iters, ])) # reorder columns by their means to mitigate label switching
  return(mcmc(chains[(iters*burnIn): iters, new_order]))
  #return(mcmc(chains[(iters*burnIn): iters, ]))
}
Let’s generate two chains to see if they converge.
chains = list(c1=HMC_chain(), c2=HMC_chain())
From the Gelman-Rubin diagnostic, the result looks good: the potential scale reduction factors are very close to 1.
gelman.diag(chains)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## [1,] 1.011 1.025
## [2,] 1.002 1.002
## [3,] 0.999 0.999
## [4,] 1.001 1.004
## [5,] 1.012 1.016
## [6,] 1.002 1.006
## [7,] 1.018 1.025
## [8,] 1.013 1.029
## [9,] 1.000 1.000
##
## Multivariate psrf
##
## 1.02
mn = colMeans(chains$c1)
qtl = apply(chains$c1, 2, quantile, c(.1,.9))
rslt = data.frame(variable=1:9, posterior=mn, q10=qtl[1,], q90=qtl[2,])
library(ggplot2)
ggplot(data=rslt, aes(x=variable,y=posterior)) + geom_point(size=2) +
  geom_errorbar(aes(ymax=q90,ymin=q10),width=0.2) +
  geom_hline(yintercept = 0)
Comparing the chains in the traceplots, we are satisfied with the result: in general, the chains mix well and appear to have converged.
library(coda)
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 1:3])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 4:6])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 7:9])))
I ran more trials with different values of \(H\) offline and do not show those results here; the principle is the same. To select the best model with respect to \(H\), we can either use Bayes factors to compare the candidate models, or use cross-validation and choose the model that predicts best on a held-out test set (one concrete criterion is sketched below).
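For concreteness, one way to score a candidate \(H\) is the held-out log predictive density, averaging over \(S\) posterior draws \((\alpha^{(s)},\beta^{(s)})\) obtained by fitting on the training folds; this is only a sketch of the criterion, not something computed in this report:
\(\mathrm{CV}(H) = \sum_{i \in \text{test}} \log\left( \frac{1}{S}\sum_{s=1}^{S} f(x_i;\alpha^{(s)},\beta^{(s)})^{y_i}\big(1-f(x_i;\alpha^{(s)},\beta^{(s)})\big)^{1-y_i} \right)\)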
The basic idea for learning about the effect of a high- versus low-volume treating hospital on patient survival is to analyze the coefficients from a regression analysis. Before we apply a neural network, let’s first identify some irrelevant variables using probit regression.
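Concretely, writing \(\gamma\) for the coefficient vector (called corr in the code below), the probit model with weakly informative Cauchy priors that we fit is
\(\begin{aligned} P(y_i = 1|x_i;\gamma) &= \Phi(x_i^T\gamma), \qquad \gamma_j \sim \mathrm{Cauchy}(0, 2.5), \\ \log P(\gamma|X, Y) &\propto \sum_j \log p(\gamma_j) + \sum_{i=1}^n \{ y_i \log \Phi(x_i^T\gamma) + (1-y_i)\log(1-\Phi(x_i^T\gamma)) \}, \\ \end{aligned}\)
where \(\Phi\) is the standard normal CDF; this is what the log_posterior function below evaluates.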
Load the data first, then normalize it and add a bias term. Here we ignore the year-month feature, because the month information is not meaningful for this analysis.
data = read.table(file="karolinska.txt", header=TRUE)
X = as.matrix(data[, c(1,2,3,4,5,8)])
## normalize features
X = apply(X, MARGIN=2, FUN=function(X) (X - mean(X)) / sd(X))
# add bias to features
X = cbind(X, rep(1, nrow(X)))
colnames(X)[7] = 'bias'
Since only roughly 11% of patients survive more than 5 years and 25% survive 2-4 years, we group these two categories together as label 0 to avoid label imbalance; label 1 represents patients who survive only 1 year.
data = read.table(file="karolinska.txt", header=TRUE)
Y = as.numeric(data[, 7])
Y[Y==3 | Y==2] = 0
As before, create the log posterior function and find the optimal initial parameters for sampling.
log_posterior = function(corr) {
  return(sum(dcauchy(corr, scale=2.5, log=TRUE)) +
           sum(Y * pnorm(X %*% corr, log.p=TRUE) +
                 (1 - Y) * pnorm(X %*% corr, lower.tail=FALSE, log.p=TRUE)))
}
optim_pars = optim(rep(0, ncol(X)), log_posterior,
                   method="BFGS", control=list(fnscale=-1), hessian=TRUE)
initialValue = optim_pars$par
Hessian = optim_pars$hessian / (2.4 / sqrt(ncol(X))) # BDA pg. 290
burnIn = 10000
iterations = 20000
Run our MCMC sampling
MCMC_MH = function(X, initialValue, Hessian) {
  chains = array(dim=c(iterations + 1, ncol(X)))
  chains[1, ] = initialValue
  Rchol = chol(-Hessian)
  BarProgress = txtProgressBar(min = 1, max = nrow(chains), style = 3)
  for (i in 1: iterations) {
    # better to avoid storing the inverse of a matrix; solve the triangular system instead
    proposal = chains[i, ] + backsolve(Rchol, rnorm(ncol(X)))
    # work on the log scale to avoid overflow; the proposal is symmetric
    log_acceptance_prob = log_posterior(proposal) - log_posterior(chains[i, ])
    chains[i + 1, ] = chains[i, ]
    if (log(runif(1)) < log_acceptance_prob)
      chains[i + 1, ] = proposal
    #setTxtProgressBar(BarProgress, i) # not suitable in rmd, useful in other cases
  }
  close(BarProgress)
  return(chains[-(1: burnIn), ])
}
Compute and visualize the result. From the 10%-90% posterior quantiles of the parameters, we see that HighVolDiagHosp, HighVolTreatHosp, FromRuralArea and Male are significantly different from 0; the remaining coefficients are less clear-cut.
pars_draws = MCMC_MH(X, initialValue, Hessian)
par(mfrow=c(1, 2))
for (i in 1: ncol(X)) {
  if (i == 1) print("Significant effects")
  interval = quantile(pars_draws[ , i], probs=c(0.1, 0.9))
  if (interval[1] * interval[2] < 0) { # coefficient does not significantly differ from 0
    hist(pars_draws[ , i], nclass=30, main=colnames(X)[i], xlab="" )
    abline(v = 0, col="red")
  }
  else {
    print(colnames(X)[i])
  }
}
## [1] "Significant effects"
## [1] "HighVolDiagHosp"
## [1] "HighVolTreatHosp"
## [1] "FromRuralArea"
## [1] "Male"
## [1] "bias"
In the following analysis, we only consider the features identified above.
features = as.matrix(data[, c(1,2,4,5)])
featureNames = c(colnames(features), 'bias')
Set the number of hidden nodes to 6. Again, normalize the data and add a bias term.
K = ncol(features)
H = 6 # can be set to an arbitrary number
## normalize features
features = apply(features, MARGIN=2, FUN=function(X) (X - mean(X)) / sd(X))
# add bias to features
features = cbind(features, rep(1, nrow(features)))
We use a Bayesian neural network to analyze the effect of a high- versus low-volume treating hospital on patient survival. The prior we chose is a Laplace prior, because it is more likely to give a sparse result. Here we set the scale of the Laplace prior to 2.
b = 2
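For reference, the Laplace\((0, b)\) prior has log density and gradient
\(\begin{aligned} \log P(w) &= \sum_j \Big( \log\frac{1}{2b} - \frac{|w_j|}{b} \Big), & \frac{\partial \log P(w)}{\partial w_j} &= -\frac{\mathrm{sign}(w_j)}{b}, \\ \end{aligned}\)
so the constant \(\log\frac{1}{2b}\) does not affect the sampling, and the prior contributes \(-\mathrm{sign}(w)/b\) to the gradient; this is why the gradient function below replaces each weight by its sign before dividing by \(b\).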
Build the Bayesian neural network with a Laplace prior.
sigmoid = function(x) 1 / (1 + exp(-x))
Hidden_nodes = function(weights) {
  alpha = matrix(head(weights, H * (K + 1)), ncol=H)
  sigmoid(features %*% alpha)
}
Forward = function(weights, Z_hidden=FALSE) {
  beta = tail(weights, H + 1)
  if (length(Z_hidden) == 1) Z_hidden = Hidden_nodes(weights)
  as.vector(sigmoid(cbind(Z_hidden, rep(1, nrow(features))) %*% beta))
}
Build the log posterior function and derive the gradient function.
log_Posterior = function(weights) { # make it negative to suit Neal's code
  Fs = Forward(weights)
  logLike = sum(Y * log(Fs) + (1 - Y) * log(1 - Fs))
  return(-(logLike + log(0.5 / b) - sum(abs(weights) / b)))
}
gradient = function(weights) {
  beta = tail(weights, H + 1)
  Zs = Hidden_nodes(weights)
  Fs = Forward(weights, Zs)
  # beta first
  f_beta = Fs * (1 - Fs) * cbind(Zs, rep(1, nrow(Zs))) # pointwise multiplication, bias column included
  Pos_beta = t(f_beta / Fs) %*% Y + t(-f_beta / (1 - Fs)) %*% (1 - Y)
  # compute alpha
  fz_prime = f_beta[, 1: H] * (1 - Zs) %*% diag(beta[1: H]) # drop the bias column, since its gradient w.r.t. alpha vanishes
  f_alpha = fz_prime[, rep(1: H, each=K + 1)] * features[, rep(1: (K + 1), H)] # rowwise outer product
  Pos_alpha = t(f_alpha / Fs) %*% Y + t(-f_alpha / (1 - Fs)) %*% (1 - Y)
  weights[weights<0] = -1 # replace each weight by its sign: the Laplace prior's gradient is -sign(w)/b
  weights[weights>=0] = 1
  return(-(c(Pos_alpha, Pos_beta) - weights / b))
}
Run the gradient check again; the HMC sampling method is the same as before.
library(LaplacesDemon)
weights = rlaplace(H * (K + 1) + H + 1)
round(gradient(weights), 6) == round(numDeriv::grad(log_Posterior, weights), 6)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [34] TRUE TRUE TRUE TRUE
# SIMPLE IMPLEMENTATION OF HAMILTONIAN MONTE CARLO. By Radford M. Neal, 2010.
HMC = function (U, grad_U, current_q, epsilon = .1, L = 10) {
  q = current_q
  p = rnorm(length(q), 0, 1) # independent standard normal variates
  current_p = p
  # Make a half step for momentum at the beginning
  p = p - epsilon * grad_U(q) / 2
  # Alternate full steps for position and momentum
  for (i in 1: L) {
    # Make a full step for the position
    q = q + epsilon * p
    # Make a full step for the momentum, except at end of trajectory
    if (i != L) p = p - epsilon * grad_U(q)
  }
  # Make a half step for momentum at the end.
  p = p - epsilon * grad_U(q) / 2
  # Negate momentum at end of trajectory to make the proposal symmetric
  p = -p
  # Evaluate potential and kinetic energies at start and end of trajectory
  current_U = U(current_q)
  current_K = sum(current_p^2) / 2
  proposed_U = U(q)
  proposed_K = sum(p^2) / 2
  # Accept or reject the state at end of trajectory, returning either
  # the position at the end of the trajectory or the initial position
  if (log(runif(1)) < current_U - proposed_U + current_K - proposed_K) {
    acc <<- acc + 1
    current_q = q # accept
  }
  return(current_q)
}
Let’s generate two Hamiltonian chains now.
library(coda)
iters = 2000
burnIn = 0.5
acc = 0
HMC_chain = function() {
  chains = matrix(rlaplace(iters * (H*(K+1) + H + 1), scale=b), nrow=iters)
  for(i in 2: iters) chains[i, ] = HMC(log_Posterior, gradient, chains[i-1, ], 0.13, 20)
  return(mcmc(chains[(iters*burnIn): iters, ]))
}
The acceptance rate is acceptable.
chains = list(c1=HMC_chain(), c2=HMC_chain())
print(acc / iters / 2)
## [1] 0.82925
From the Gelman-Rubin diagnostic, the potential scale reduction factors are close to 1, which is acceptable.
gelman.diag(chains)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## [1,] 1.01 1.02
## [2,] 1.00 1.01
## [3,] 1.02 1.02
## [4,] 1.01 1.03
## [5,] 1.00 1.01
## [6,] 1.09 1.34
## [7,] 1.04 1.11
## [8,] 1.02 1.03
## [9,] 1.03 1.12
## [10,] 1.05 1.21
## [11,] 1.01 1.04
## [12,] 1.03 1.12
## [13,] 1.01 1.02
## [14,] 1.00 1.02
## [15,] 1.05 1.05
## [16,] 1.03 1.12
## [17,] 1.01 1.01
## [18,] 1.05 1.05
## [19,] 1.12 1.43
## [20,] 1.04 1.07
## [21,] 1.01 1.03
## [22,] 1.12 1.45
## [23,] 1.08 1.16
## [24,] 1.01 1.06
## [25,] 1.01 1.03
## [26,] 1.05 1.10
## [27,] 1.00 1.01
## [28,] 1.03 1.03
## [29,] 1.02 1.04
## [30,] 1.02 1.07
## [31,] 1.00 1.03
## [32,] 1.04 1.16
## [33,] 1.02 1.04
## [34,] 1.02 1.09
## [35,] 1.07 1.27
## [36,] 1.01 1.02
## [37,] 1.02 1.08
##
## Multivariate psrf
##
## 1.35
From the traceplots, we see that the two chains mix well and appear to have converged.
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 1:4])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 5:8])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 9:12])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 13:16])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 17:20])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 21:24])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 25:28])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 29:32])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 33:36])))
Let’s analyze the result a bit. A neural network does not lend itself to easy interpretation, but we can still extract some information from the weights. Overall, the HighVolTreatHosp feature makes the biggest contribution to increasing survival duration.
mat = matrix(as.vector(colMeans(tail(chains$c1, 200))[1:(H * (K + 1))]), ncol=H)
rownames(mat) = featureNames
colnames(mat) = c("Hidden1", "Hidden2", "Hidden3", "Hidden4", "Hidden5", "Hidden6")
mat = round(mat, 2)
mat[abs(mat) < 0.3] = NA
mat
## Hidden1 Hidden2 Hidden3 Hidden4 Hidden5 Hidden6
## HighVolDiagHosp -0.46 0.56 0.32 -0.46 -0.38 NA
## HighVolTreatHosp NA NA -0.48 -0.81 0.88 -0.43
## FromRuralArea NA 0.54 -0.47 0.68 -0.64 NA
## Male NA 0.45 NA 0.73 -0.38 NA
## bias 0.30 0.40 -0.32 NA NA NA
From the last layer of \(\beta\) weights, we see that hidden node 3 carries the most predictive power. Recalling the \(\alpha\) weights above, \(Hidden3 = \psi(0.32\times HighVolDiagHosp-0.48\times HighVolTreatHosp - 0.47\times FromRuralArea - 0.32)\), which suggests that being female, living in a rural area, and being treated at a high-volume hospital help people survive longer. In contrast, a high-volume diagnosing hospital alone does not guarantee that.
hds = c(colMeans(Hidden_nodes(colMeans(tail(chains$c1, 200)))), 1)
beta_wts = tail(colMeans(tail(chains$c1, 200)), H + 1)
nn = round(rbind(hds, beta_wts), 2)
nn[abs(nn) < 0.3] = NA
colnames(nn) = c(colnames(mat), 'bias')
nn
## Hidden1 Hidden2 Hidden3 Hidden4 Hidden5 Hidden6 bias
## hds 0.57 0.59 0.43 0.52 0.51 0.49 1.0
## beta_wts NA NA 0.38 NA NA NA 0.5
Let’s run a posterior predictive check to see whether our model makes sense. From the figure below, the model does not perform very well; due to time limits, we stop here.
wts = as.matrix(tail(chains$c1, 200)) # last 200 posterior draws
pst_ck = matrix(NA, nrow=nrow(wts), ncol=158)
for (i in 1: nrow(wts)) pst_ck[i, ] = Forward(wts[i, ])
qt = matrix(NA, nrow=2, ncol=158)
for (i in 1: 158) qt[, i] = quantile(pst_ck[ , i], probs=c(0.025, 0.975))
evals = data.frame(up=qt[2, ], down=qt[1, ])
plot(evals$up, ylim=c(0, max(evals$up)), xaxt="n", ylab="Predicted probability", xlab="")
polygon(c(1: 158, rev(1: 158)), c(evals$down, rev(evals$up)), col=gray(0.8))
points(evals$down)
axis(1, at=1:158, labels=rownames(evals))
points(Y, col="red")
In the future, we could try other priors, such as a spike-and-slab prior, to see whether we can obtain an even sparser network.