First, some notation: any vector is a column vector, the parameter with index 0 corresponds to the bias term, and a dotted dimension such as \(\dot K\) or \(\dot H\) indicates that the bias term is included (so \(\dot K = K+1\) and \(\dot H = H+1\)). We define \(X\) as the \(n\times \dot K\) feature matrix, \(Z\) as the \(n\times \dot H\) hidden-node matrix, and \(Y\) as the \(n\times 1\) label vector.
Given fixed \(\alpha\) and \(\beta\), and the sigmoid function \(\psi(x) = \frac{1}{1+e^{-x}}\), the predicted probability from one forward pass is
\(\begin{aligned} f(x_i;\alpha,\beta) &= \psi(z_i^T \beta) \\ &= \psi(\sum_{h=0}^H \psi(x_i^T \alpha_{\cdot h}) \beta_h) \\ &= \psi(\sum_{h=0}^H \psi(\sum_{k=0}^K x_{ik} \alpha_{kh}) \beta_h) \\ \end{aligned}\)
For a batch of inputs, the predictions can be written in matrix form:
\(\begin{aligned} F &= F_{n\times 1}(X;\alpha,\beta) \\ &= \psi(Z_{n\times \dot H} \beta_{\dot H\times 1}) \\ &= \psi(\psi(X_{n\times \dot K} \alpha_{\dot K \times H})_{n\times \dot H} \beta_{\dot H\times 1}) \\ \end{aligned}\)
where a column of ones is appended to \(\psi(X\alpha)\) to form the \(n\times \dot H\) matrix \(Z\).
Model the probability that y is 0 or 1 with \(P(y_i|x_i;\alpha,\beta)=f(x_i;\alpha,\beta)^{y_i} (1-f(x_i;\alpha,\beta))^{1-y_i}\)
Thus, the log likelihood function follows
\(\begin{aligned} \log L(\alpha, \beta|X) &= \log P(Y|X;\alpha,\beta) \\ &= \log \prod_{i=1}^n P(y_i|x_i;\alpha,\beta) \\ &= \sum_{i=1}^n \{y_i \log f(x_i;\alpha,\beta) + (1-y_i)\log(1-f(x_i;\alpha,\beta))\} \\ &= Y_{n\times 1}^T \log F_{n\times 1} + (1-Y)_{n\times 1}^T\log(1-F)_{n\times 1} \\ \end{aligned}\)
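As a quick sanity check of the matrix form on the last line, here is a minimal sketch with made-up toy values (Y_toy and Fs_toy are hypothetical placeholders, not the data used later in this report); the vectorized expression agrees with the elementwise sum.
set.seed(1)
Y_toy = rbinom(5, 1, 0.5) # toy binary labels
Fs_toy = runif(5, 0.1, 0.9) # toy fitted probabilities F
sum_form = sum(Y_toy * log(Fs_toy) + (1 - Y_toy) * log(1 - Fs_toy))
matrix_form = t(Y_toy) %*% log(Fs_toy) + t(1 - Y_toy) %*% log(1 - Fs_toy)
all.equal(sum_form, as.numeric(matrix_form)) # TRUE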
Suppose the priors are \(\alpha \sim N(0, I_{\dot K\times H})\) and \(\beta \sim N(0, I_{\dot H})\), i.e. independent standard normal priors on every weight.
The logarithm of the posterior density of \((\alpha,\beta)\) given the data is
\(\begin{aligned} \log P(\alpha,\beta|X, Y) &\propto \log L(\alpha, \beta|X) + \log P(\alpha) + \log P(\beta) \\ &\propto \log L(\alpha, \beta|X) - \frac{1}{2} \sum_{k,h} \alpha_{kh}^2 - \frac{1}{2} \sum_{h} \beta_h^2 \\ \end{aligned}\)
Denote by \(\cdot\) and \(./\) pointwise multiplication and division, where elements in the same row are operated on together (row-wise broadcasting). For a batch of inputs, the derivative of \(F\) with respect to \(\beta\) follows
\(\begin{aligned} F_{\beta}'(X;\alpha,\beta) &=\psi_{\beta}'(Z_{n\times \dot H} \beta_{\dot H\times 1}) \\ &=Z_{n\times \dot H} \cdot \psi_{n\times 1} \cdot (1-\psi)_{n\times 1} \\ &=Z_{n\times \dot H} \cdot F_{n\times 1} \cdot (1-F)_{n\times 1} \\ \end{aligned}\)
Therefore,
\(\frac{\partial \log P(\alpha, \beta|X, Y)}{\partial \beta}= (F_{\beta}'./F)^TY-(F_{\beta}'./(1-F))^T(1-Y) - \beta\)
\(\begin{aligned} F_{\alpha}'(X;\alpha,\beta) &=\psi_{\alpha}'(\psi(X_{n\times \dot K} \alpha_{\dot K\times H})_{n\times \dot H} \beta_{\dot H\times 1}) \\ &=[(F_{n\times 1} \cdot (1-F)_{n\times 1})\beta_{H\times 1}^T] \cdot \psi_{\alpha}'(X_{n\times \dot K} \alpha_{\dot K \times H}) \quad \text{(the hidden bias column is dropped, since its gradient w.r.t. } \alpha \text{ vanishes)} \\ &=\{[(F_{n\times 1} \cdot (1-F)_{n\times 1})\beta_{H\times 1}^T] \cdot Z_{n\times H} \cdot (1-Z)_{n\times H}\} \otimes X_{n\times \dot K} \\ \end{aligned}\)
where \(\otimes\) denotes the row-wise outer product: row \(i\) of the result is the flattened outer product of row \(i\) of the left operand with row \(i\) of \(X\).
Therefore, \(\frac{\partial \log P(\alpha, \beta|X, Y)}{\partial \alpha}=(F_{\alpha}'./F)^TY-(F_{\alpha}'./(1-F))^T(1-Y) - \alpha\)
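To make the \(\otimes\) notation concrete, here is a minimal sketch with hypothetical toy matrices (A_toy plays the role of the \(n\times H\) term in braces, B_toy the role of \(X_{n\times \dot K}\)); the column-replication trick shown here is the same indexing pattern used in the gradient code later on.
set.seed(1)
n_toy = 3; H_toy = 2; Kdot_toy = 4 # toy dimensions; Kdot_toy plays the role of K+1
A_toy = matrix(rnorm(n_toy * H_toy), ncol=H_toy)
B_toy = matrix(rnorm(n_toy * Kdot_toy), ncol=Kdot_toy)
rowwise_outer = A_toy[, rep(1:H_toy, each=Kdot_toy)] * B_toy[, rep(1:Kdot_toy, H_toy)]
# row 1 of the result equals the flattened outer product of the first rows
all.equal(rowwise_outer[1, ], as.vector(t(outer(A_toy[1, ], B_toy[1, ])))) # TRUE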
set.seed(666)
setwd("C:/Users/Wei/Documents/Purdue STAT 695 Bayesian Data Analysis/HW4")
Load data first
data = read.csv(file="moon_shape.csv", header=TRUE)
Extract the feature matrix, stored as features, and the label vector \(Y\).
features = as.matrix(data[, 1:2]) # features dim: 1000X2
Y = data[, 3]
Set the number of nodes in the hidden layer. I picked \(H=2\) because, over several trials, it gave the most sensible results: few parameters are indistinguishable from 0, and the chains mix more easily.
K = 2
H = 2 # can be set to an arbitrary number
Normalize the feature matrix.
features = apply(features, MARGIN=2, FUN = function(X) (X - mean(X)) / sd(X))
Add bias to the feature matrix.
features = cbind(features, rep(1, nrow(features)))
Write the forward network to make predictions given weights.
sigmoid = function(x) 1 / (1 + exp(-x))
Hidden_nodes = function(weights) {
  alpha = matrix(head(weights, H * (K + 1)), ncol=H) # first H*(K+1) weights are alpha, stored column by column
  sigmoid(features %*% alpha)
}
Forward = function(weights, Z_hidden=FALSE) {
  beta = tail(weights, H + 1) # last H+1 weights are beta, including the hidden-layer bias weight
  if (length(Z_hidden) == 1) Z_hidden = Hidden_nodes(weights)
  as.vector(sigmoid(cbind(Z_hidden, rep(1, nrow(features))) %*% beta))
}
Build our log posterior function.
log_Posterior = function(weights) { # make it negative to suit Neal's code
  Fs = Forward(weights)
  logLike = sum(Y * log(Fs) + (1 - Y) * log(1 - Fs))
  return(-(logLike - 0.5 * sum(weights^2)))
}
Derive the gradient function from the log posterior.
gradient = function(weights) {
  beta = tail(weights, H + 1)
  Zs = Hidden_nodes(weights)
  Fs = Forward(weights, Zs)
  # beta first
  f_beta = Fs * (1 - Fs) * cbind(Zs, rep(1, nrow(Zs))) # pointwise multiplication, bias column included
  Pos_beta = t(f_beta / Fs) %*% Y + t(-f_beta / (1 - Fs)) %*% (1 - Y)
  # compute alpha
  fz_prime = f_beta[, 1: H] * (1 - Zs) %*% diag(beta[1: H]) # drop the bias column, since its gradient w.r.t. alpha vanishes
  f_alpha = fz_prime[, rep(1: H, each=K + 1)] * features[, rep(1: (K + 1), H)] # rowwise outer product
  Pos_alpha = t(f_alpha / Fs) %*% Y + t(-f_alpha / (1 - Fs)) %*% (1 - Y)
  return(-(c(Pos_alpha, Pos_beta) - weights))
}
Before we proceed, let’s run a gradient check to verify that our gradient function is correct.
weights = rnorm(H * (K + 1) + H + 1)
round(gradient(weights), 6) == round(numDeriv::grad(log_Posterior, weights), 6)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The code below is Radford M. Neal’s simple HMC implementation (2010); see his review "MCMC using Hamiltonian dynamics" for more variants of HMC.
# SIMPLE IMPLEMENTATION OF HAMILTONIAN MONTE CARLO. By Radford M. Neal, 2010.
HMC = function (U, grad_U, current_q, epsilon = .1, L = 10) {
  q = current_q
  p = rnorm(length(q), 0, 1) # independent standard normal variates
  current_p = p
  # Make a half step for momentum at the beginning
  p = p - epsilon * grad_U(q) / 2
  # Alternate full steps for position and momentum
  for (i in 1: L) {
    # Make a full step for the position
    q = q + epsilon * p
    # Make a full step for the momentum, except at end of trajectory
    if (i != L) p = p - epsilon * grad_U(q)
  }
  # Make a half step for momentum at the end.
  p = p - epsilon * grad_U(q) / 2
  # Negate momentum at end of trajectory to make the proposal symmetric
  p = -p
  # Evaluate potential and kinetic energies at start and end of trajectory
  current_U = U(current_q)
  current_K = sum(current_p^2) / 2
  proposed_U = U(q)
  proposed_K = sum(p^2) / 2
  # Accept or reject the state at end of trajectory, returning either
  # the position at the end of the trajectory or the initial position
  if (log(runif(1)) < current_U - proposed_U + current_K - proposed_K) {
    # acc <<- acc + 1
    current_q = q # accept
  }
  return(current_q)
}
One reason a neural network posterior may not mix well is that the hidden nodes within a layer have no inherent ordering (label switching). The workaround I use is to reorder the columns of each chain by their posterior column means. It is not a perfect fix, but it works here.
library(coda)
## Warning: package 'coda' was built under R version 3.4.2
iters = 2000
burnIn = 0.5
HMC_chain = function() {
  chains = matrix(rnorm(iters * (H*(K+1) + H + 1)), nrow=iters)
  for(i in 2: iters) chains[i, ] = HMC(log_Posterior, gradient, chains[i-1, ], 0.1, 20)
  new_order = order(colMeans(chains[(iters*burnIn): iters, ])) # reorder columns by their means to mitigate label switching
  return(mcmc(chains[(iters*burnIn): iters, new_order]))
  #return(mcmc(chains[(iters*burnIn): iters, ]))
}
Let’s generate two chains to see if they converge.
chains = list(c1=HMC_chain(), c2=HMC_chain())
From the Gelman-Rubin diagnostic, the result looks good: the potential scale reduction factors are very close to 1.
gelman.diag(chains)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## [1,] 1.011 1.025
## [2,] 1.002 1.002
## [3,] 0.999 0.999
## [4,] 1.001 1.004
## [5,] 1.012 1.016
## [6,] 1.002 1.006
## [7,] 1.018 1.025
## [8,] 1.013 1.029
## [9,] 1.000 1.000
##
## Multivariate psrf
##
## 1.02
mn = colMeans(chains$c1)
qtl = apply(chains$c1, 2, quantile, c(.1,.9))
rslt = data.frame(variable=1:9, posterior=mn, q10=qtl[1,], q90=qtl[2,])
library(ggplot2)
ggplot(data=rslt, aes(x=variable,y=posterior)) + geom_point(size=2) +
  geom_errorbar(aes(ymax=q90,ymin=q10),width=0.2) +
  geom_hline(yintercept = 0)
Comparing the chains in the traceplots, we are satisfied with the result: in general, the chains mix well and appear to have converged.
library(coda)
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 1:3])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 4:6])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 7:9])))
I ran more trials with different values of \(H\) offline and do not show those results here; the principle is the same. To select the best model with respect to \(H\), we can either use Bayes factors to compare the candidate models, or use cross-validation and choose the model that predicts best on a held-out test set (one concrete criterion is sketched below).
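For concreteness, one way to score a candidate \(H\) is the held-out log predictive density, averaging over \(S\) posterior draws \((\alpha^{(s)},\beta^{(s)})\) obtained by fitting on the training folds; this is only a sketch of the criterion, not something computed in this report:
\(\mathrm{CV}(H) = \sum_{i \in \text{test}} \log\left( \frac{1}{S}\sum_{s=1}^{S} f(x_i;\alpha^{(s)},\beta^{(s)})^{y_i}\big(1-f(x_i;\alpha^{(s)},\beta^{(s)})\big)^{1-y_i} \right)\)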
The basic idea for learning about the effect of a high- versus low-volume treating hospital on patient survival is to analyze the coefficients from a regression analysis. Before we apply a neural network, let’s first identify some irrelevant variables using probit regression.
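Concretely, writing \(\gamma\) for the coefficient vector (called corr in the code below), the probit model with weakly informative Cauchy priors that we fit is
\(\begin{aligned} P(y_i = 1|x_i;\gamma) &= \Phi(x_i^T\gamma), \qquad \gamma_j \sim \mathrm{Cauchy}(0, 2.5), \\ \log P(\gamma|X, Y) &\propto \sum_j \log p(\gamma_j) + \sum_{i=1}^n \{ y_i \log \Phi(x_i^T\gamma) + (1-y_i)\log(1-\Phi(x_i^T\gamma)) \}, \\ \end{aligned}\)
where \(\Phi\) is the standard normal CDF; this is what the log_posterior function below evaluates.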
Load the data first, then normalize it and add a bias term. Here we ignore the year-month feature, because the month information is not meaningful for this analysis.
data = read.table(file="karolinska.txt", header=TRUE)
X = as.matrix(data[, c(1,2,3,4,5,8)])
## normalize features
X = apply(X, MARGIN=2, FUN=function(X) (X - mean(X)) / sd(X))
# add bias to features
X = cbind(X, rep(1, nrow(X)))
colnames(X)[7] = 'bias'
Since only roughly 11% of patients survive more than 5 years and 25% survive 2-4 years, we group these two categories together as label 0 to avoid label imbalance; label 1 represents patients who survive only 1 year.
data = read.table(file="karolinska.txt", header=TRUE)
Y = as.numeric(data[, 7])
Y[Y==3 | Y==2] = 0
As before, create the log posterior function and find the optimal initial parameters for sampling.
log_posterior = function(corr) {
  return(sum(dcauchy(corr, scale=2.5, log=TRUE)) +
           sum(Y * pnorm(X %*% corr, log.p=TRUE) +
                 (1 - Y) * pnorm(X %*% corr, lower.tail=FALSE, log.p=TRUE)))
}
optim_pars = optim(rep(0, ncol(X)), log_posterior,
                   method="BFGS", control=list(fnscale=-1), hessian=TRUE)
initialValue = optim_pars$par
Hessian = optim_pars$hessian / (2.4 / sqrt(ncol(X))) # BDA pg. 290
burnIn = 10000
iterations = 20000
Run our MCMC sampling
MCMC_MH = function(X, initialValue, Hessian) {
  chains = array(dim=c(iterations + 1, ncol(X)))
  chains[1, ] = initialValue
  Rchol = chol(-Hessian)
  BarProgress = txtProgressBar(min = 1, max = nrow(chains), style = 3)
  for (i in 1: iterations) {
    # better to avoid storing the inverse of a matrix; solve the triangular system instead
    proposal = chains[i, ] + backsolve(Rchol, rnorm(ncol(X)))
    # work on the log scale to avoid overflow; the proposal is symmetric
    log_acceptance_prob = log_posterior(proposal) - log_posterior(chains[i, ])
    chains[i + 1, ] = chains[i, ]
    if (log(runif(1)) < log_acceptance_prob)
      chains[i + 1, ] = proposal
    #setTxtProgressBar(BarProgress, i) # not suitable in rmd, useful in other cases
  }
  close(BarProgress)
  return(chains[-(1: burnIn), ])
}
Compute and visualize the result. From the 10%-90% posterior quantiles of the parameters, we see that HighVolDiagHosp, HighVolTreatHosp, FromRuralArea and Male are significantly different from 0; the remaining coefficients are less clear-cut.
pars_draws = MCMC_MH(X, initialValue, Hessian)
par(mfrow=c(1, 2))
for (i in 1: ncol(X)) {
  if (i == 1) print("Significant effects")
  interval = quantile(pars_draws[ , i], probs=c(0.1, 0.9))
  if (interval[1] * interval[2] < 0) { # coefficient does not significantly differ from 0
    hist(pars_draws[ , i], nclass=30, main=colnames(X)[i], xlab="" )
    abline(v = 0, col="red")
  }
  else {
    print(colnames(X)[i])
  }
}
## [1] "Significant effects"
## [1] "HighVolDiagHosp"
## [1] "HighVolTreatHosp"
## [1] "FromRuralArea"
## [1] "Male"
## [1] "bias"
In the following analysis, we only consider the features identified above.
features = as.matrix(data[, c(1,2,4,5)])
featureNames = c(colnames(features), 'bias')
Set the number of hidden nodes to 6. Again, normalize the data and add a bias term.
K = ncol(features)
H = 6 # can be set to an arbitrary number
## normalize features
features = apply(features, MARGIN=2, FUN=function(X) (X - mean(X)) / sd(X))
# add bias to features
features = cbind(features, rep(1, nrow(features)))
We use a Bayesian neural network to analyze the effect of a high- versus low-volume treating hospital on patient survival. The prior we chose is a Laplace prior, because it is more likely to give a sparse result. Here we set the scale of the Laplace prior to 2.
b = 2
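For reference, the Laplace\((0, b)\) prior has log density and gradient
\(\begin{aligned} \log P(w) &= \sum_j \Big( \log\frac{1}{2b} - \frac{|w_j|}{b} \Big), & \frac{\partial \log P(w)}{\partial w_j} &= -\frac{\mathrm{sign}(w_j)}{b}, \\ \end{aligned}\)
so the constant \(\log\frac{1}{2b}\) does not affect the sampling, and the prior contributes \(-\mathrm{sign}(w)/b\) to the gradient; this is why the gradient function below replaces each weight by its sign before dividing by \(b\).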
Build the Bayesian neural network with a Laplace prior.
sigmoid = function(x) 1 / (1 + exp(-x))
Hidden_nodes = function(weights) {
  alpha = matrix(head(weights, H * (K + 1)), ncol=H)
  sigmoid(features %*% alpha)
}
Forward = function(weights, Z_hidden=FALSE) {
  beta = tail(weights, H + 1)
  if (length(Z_hidden) == 1) Z_hidden = Hidden_nodes(weights)
  as.vector(sigmoid(cbind(Z_hidden, rep(1, nrow(features))) %*% beta))
}
Build the log posterior function and derive the gradient function.
log_Posterior = function(weights) { # make it negative to suit Neal's code
  Fs = Forward(weights)
  logLike = sum(Y * log(Fs) + (1 - Y) * log(1 - Fs))
  return(-(logLike + log(0.5 / b) - sum(abs(weights) / b)))
}
gradient = function(weights) {
  beta = tail(weights, H + 1)
  Zs = Hidden_nodes(weights)
  Fs = Forward(weights, Zs)
  # beta first
  f_beta = Fs * (1 - Fs) * cbind(Zs, rep(1, nrow(Zs))) # pointwise multiplication, bias column included
  Pos_beta = t(f_beta / Fs) %*% Y + t(-f_beta / (1 - Fs)) %*% (1 - Y)
  # compute alpha
  fz_prime = f_beta[, 1: H] * (1 - Zs) %*% diag(beta[1: H]) # drop the bias column, since its gradient w.r.t. alpha vanishes
  f_alpha = fz_prime[, rep(1: H, each=K + 1)] * features[, rep(1: (K + 1), H)] # rowwise outer product
  Pos_alpha = t(f_alpha / Fs) %*% Y + t(-f_alpha / (1 - Fs)) %*% (1 - Y)
  weights[weights<0] = -1 # replace each weight by its sign: the Laplace prior's gradient is -sign(w)/b
  weights[weights>=0] = 1
  return(-(c(Pos_alpha, Pos_beta) - weights / b))
}
Run the gradient check again; the HMC sampling method is the same as before.
library(LaplacesDemon)
weights = rlaplace(H * (K + 1) + H + 1)
round(gradient(weights), 6) == round(numDeriv::grad(log_Posterior, weights), 6)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [34] TRUE TRUE TRUE TRUE
# SIMPLE IMPLEMENTATION OF HAMILTONIAN MONTE CARLO. By Radford M. Neal, 2010.
HMC = function (U, grad_U, current_q, epsilon = .1, L = 10) {
  q = current_q
  p = rnorm(length(q), 0, 1) # independent standard normal variates
  current_p = p
  # Make a half step for momentum at the beginning
  p = p - epsilon * grad_U(q) / 2
  # Alternate full steps for position and momentum
  for (i in 1: L) {
    # Make a full step for the position
    q = q + epsilon * p
    # Make a full step for the momentum, except at end of trajectory
    if (i != L) p = p - epsilon * grad_U(q)
  }
  # Make a half step for momentum at the end.
  p = p - epsilon * grad_U(q) / 2
  # Negate momentum at end of trajectory to make the proposal symmetric
  p = -p
  # Evaluate potential and kinetic energies at start and end of trajectory
  current_U = U(current_q)
  current_K = sum(current_p^2) / 2
  proposed_U = U(q)
  proposed_K = sum(p^2) / 2
  # Accept or reject the state at end of trajectory, returning either
  # the position at the end of the trajectory or the initial position
  if (log(runif(1)) < current_U - proposed_U + current_K - proposed_K) {
    acc <<- acc + 1
    current_q = q # accept
  }
  return(current_q)
}
Let’s generate two Hamiltonian chains now.
library(coda)
iters = 2000
burnIn = 0.5
acc = 0
HMC_chain = function() {
  chains = matrix(rlaplace(iters * (H*(K+1) + H + 1), scale=b), nrow=iters)
  for(i in 2: iters) chains[i, ] = HMC(log_Posterior, gradient, chains[i-1, ], 0.13, 20)
  return(mcmc(chains[(iters*burnIn): iters, ]))
}
The acceptance rate is acceptable.
chains = list(c1=HMC_chain(), c2=HMC_chain())
print(acc / iters / 2)
## [1] 0.82925
From the Gelman-Rubin diagnostic, the potential scale reduction factors are close to 1, which is acceptable.
gelman.diag(chains)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## [1,] 1.01 1.02
## [2,] 1.00 1.01
## [3,] 1.02 1.02
## [4,] 1.01 1.03
## [5,] 1.00 1.01
## [6,] 1.09 1.34
## [7,] 1.04 1.11
## [8,] 1.02 1.03
## [9,] 1.03 1.12
## [10,] 1.05 1.21
## [11,] 1.01 1.04
## [12,] 1.03 1.12
## [13,] 1.01 1.02
## [14,] 1.00 1.02
## [15,] 1.05 1.05
## [16,] 1.03 1.12
## [17,] 1.01 1.01
## [18,] 1.05 1.05
## [19,] 1.12 1.43
## [20,] 1.04 1.07
## [21,] 1.01 1.03
## [22,] 1.12 1.45
## [23,] 1.08 1.16
## [24,] 1.01 1.06
## [25,] 1.01 1.03
## [26,] 1.05 1.10
## [27,] 1.00 1.01
## [28,] 1.03 1.03
## [29,] 1.02 1.04
## [30,] 1.02 1.07
## [31,] 1.00 1.03
## [32,] 1.04 1.16
## [33,] 1.02 1.04
## [34,] 1.02 1.09
## [35,] 1.07 1.27
## [36,] 1.01 1.02
## [37,] 1.02 1.08
##
## Multivariate psrf
##
## 1.35
From the traceplots, we see that the two chains mix well and appear to have converged.
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 1:4])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 5:8])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 9:12])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 13:16])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 17:20])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 21:24])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 25:28])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 29:32])))
plot(mcmc.list(lapply(chains, FUN=function(x) x[, 33:36])))
Let’s analyze the result a bit. A neural network does not lend itself to easy interpretation, but we can still extract some information from the weights. Overall, the HighVolTreatHosp feature makes the biggest contribution to increasing survival duration.
mat = matrix(as.vector(colMeans(tail(chains$c1, 200))[1:(H * (K + 1))]), ncol=H)
rownames(mat) = featureNames
colnames(mat) = c("Hidden1", "Hidden2", "Hidden3", "Hidden4", "Hidden5", "Hidden6")
mat = round(mat, 2)
mat[abs(mat) < 0.3] = NA
mat
## Hidden1 Hidden2 Hidden3 Hidden4 Hidden5 Hidden6
## HighVolDiagHosp -0.46 0.56 0.32 -0.46 -0.38 NA
## HighVolTreatHosp NA NA -0.48 -0.81 0.88 -0.43
## FromRuralArea NA 0.54 -0.47 0.68 -0.64 NA
## Male NA 0.45 NA 0.73 -0.38 NA
## bias 0.30 0.40 -0.32 NA NA NA
From the last layer of \(\beta\) weights, we see that hidden node 3 carries the most predictive power. Recalling the \(\alpha\) weights above, \(Hidden3 = \psi(0.32\times HighVolDiagHosp-0.48\times HighVolTreatHosp - 0.47\times FromRuralArea - 0.32)\), which suggests that being female, living in a rural area, and being treated at a high-volume hospital help people survive longer. In contrast, a high-volume diagnosing hospital alone does not guarantee that.
hds = c(colMeans(Hidden_nodes(colMeans(tail(chains$c1, 200)))), 1)
beta_wts = tail(colMeans(tail(chains$c1, 200)), H + 1)
nn = round(rbind(hds, beta_wts), 2)
nn[abs(nn) < 0.3] = NA
colnames(nn) = c(colnames(mat), 'bias')
nn
## Hidden1 Hidden2 Hidden3 Hidden4 Hidden5 Hidden6 bias
## hds 0.57 0.59 0.43 0.52 0.51 0.49 1.0
## beta_wts NA NA 0.38 NA NA NA 0.5
Let’s run a posterior predictive check to see whether our model makes sense. From the figure below, the model does not perform very well; due to time limits, we stop here.
wts = as.matrix(tail(chains$c1, 200)) # last 200 posterior draws
pst_ck = matrix(NA, nrow=nrow(wts), ncol=158)
for (i in 1: nrow(wts)) pst_ck[i, ] = Forward(wts[i, ])
qt = matrix(NA, nrow=2, ncol=158)
for (i in 1: 158) qt[, i] = quantile(pst_ck[ , i], probs=c(0.025, 0.975))
evals = data.frame(up=qt[2, ], down=qt[1, ])
plot(evals$up, ylim=c(0, max(evals$up)), xaxt="n", ylab="Predicted probability", xlab="")
polygon(c(1: 158, rev(1: 158)), c(evals$down, rev(evals$up)), col=gray(0.8))
points(evals$down)
axis(1, at=1:158, labels=rownames(evals))
points(Y, col="red")
In the future, we could try other priors, such as a spike-and-slab prior, to see whether we can obtain an even sparser network.