This chapter has been an informal introduction to Markov chain Monte Carlo (MCMC) estimation. The goal has been to introduce the purpose and approach of MCMC algorithms. The major algorithms introduced were the Metropolis, Gibbs sampling, and Hamiltonian Monte Carlo algorithms. Each has its advantages and disadvantages. The ulam function in the rethinking package was introduced. It uses the Stan (mc-stan.org) Hamiltonian Monte Carlo engine to fit models as they are defined in this book. General advice about diagnosing poor MCMC fits was illustrated with a couple of pathological examples.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.
9E1. Which of the following is a requirement of the simple Metropolis algorithm?
# Option 3 is the requirement: the proposal distribution must be symmetric. The simple Metropolis algorithm handles continuous as well as discrete parameters, and the proposal can be any symmetric distribution; but if the proposal is not symmetric, the simple acceptance rule no longer targets the correct posterior.
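# To make the symmetry requirement concrete, here is a minimal sketch of a
# Metropolis sampler (my own illustration, not code from the chapter) targeting
# a standard normal with a symmetric random-walk proposal:
set.seed(9)
n_samples <- 5000
samples <- numeric(n_samples)
current <- 0
for (i in 1:n_samples) {
  proposal <- rnorm(1, mean = current, sd = 0.5)  # symmetric: q(a|b) = q(b|a)
  # because the proposal is symmetric, the acceptance probability reduces
  # to a simple ratio of target densities
  if (runif(1) < dnorm(proposal) / dnorm(current)) current <- proposal
  samples[i] <- current
}
mean(samples)  # should be near 0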
9E2. Gibbs sampling is more efficient than the Metropolis algorithm. How does it achieve this extra efficiency? Are there any limitations to the Gibbs sampling strategy?
# When choosing where in the posterior to sample next, Gibbs sampling uses adaptive proposals: each parameter is drawn from its conditional distribution given the current values of the others (typically computed from conjugate prior-likelihood pairs), so proposals are never rejected. That is the source of the extra efficiency.
# As for limitations, Gibbs sampling requires that those conditional distributions can be derived, which usually means relying on conjugate priors. And as the correlation between parameters increases, successive Gibbs draws become more correlated with one another, so performance degrades in highly correlated or high-dimensional posteriors.
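# A minimal Gibbs sketch (my own illustration, not from the chapter) for a
# standard bivariate normal with correlation rho. Each draw comes from an exact
# conditional, so nothing is rejected; but as rho grows, successive draws become
# highly correlated and mixing degrades.
set.seed(9)
rho <- 0.9
n <- 5000
x <- y <- numeric(n)
for (i in 2:n) {
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))  # x | y
  y[i] <- rnorm(1, mean = rho * x[i], sd = sqrt(1 - rho^2))      # y | x
}
acf(x, lag.max = 1, plot = FALSE)$acf[2]  # lag-1 autocorrelation is high when rho is large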
9E3. Which sort of parameters can Hamiltonian Monte Carlo not handle? Can you explain why?
# Hamiltonian Monte Carlo cannot handle discrete parameters. It relies on a physics simulation that follows the gradient of the log-posterior surface, and gradients are not defined for discrete parameters, so there is no smooth surface for the simulated particle to glide across.
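# To make the gradient dependence concrete, here is a minimal sketch (my own
# illustration, not code from the chapter) of one leapfrog step of HMC for a
# one-dimensional standard normal target. Every update touches grad_U(q),
# which only exists because q is continuous.
U <- function(q) 0.5 * q^2        # negative log density of N(0, 1), up to a constant
grad_U <- function(q) q           # its gradient; undefined if q were discrete
leapfrog <- function(q, p, eps) {
  p <- p - eps / 2 * grad_U(q)    # half step for the momentum
  q <- q + eps * p                # full step for the position
  p <- p - eps / 2 * grad_U(q)    # half step for the momentum
  list(q = q, p = p)
}
leapfrog(q = 1, p = 0.5, eps = 0.1)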
9E4. Explain the difference between the effective number of samples, n_eff as calculated by Stan, and the actual number of samples.
# n_eff estimates the number of effectively independent samples: the number of uncorrelated draws that would carry the same information about the posterior distribution as the chain actually produced.
# The actual number of samples is simply the number of draws stored from the chains. Because consecutive MCMC draws are autocorrelated, the actual number of samples is usually larger than n_eff.
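# A rough sketch of the idea behind n_eff (my own illustration, not Stan's
# actual estimator): discount the raw number of draws by the chain's
# autocorrelation, n_eff ~= N / (1 + 2 * sum(rho_k)).
n_eff_sketch <- function(draws, max_lag = 50) {
  N <- length(draws)
  rho <- acf(draws, lag.max = max_lag, plot = FALSE)$acf[-1]  # lags 1..max_lag
  cutoff <- which(rho <= 0)[1]                 # first non-positive autocorrelation
  if (!is.na(cutoff)) rho <- rho[seq_len(cutoff - 1)]
  N / (1 + 2 * sum(rho))
}
set.seed(9)
n_eff_sketch(rnorm(1000))                                   # independent draws: near 1000
n_eff_sketch(as.numeric(arima.sim(list(ar = 0.9), 1000)))   # autocorrelated: much smaller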
9E5. Which value should Rhat approach, when a chain is sampling the posterior distribution correctly?
# Rhat should approach 1.0. It compares the variance within each chain to the variance across chains; Rhat = 1.0 means the inference would not depend on which chain the samples came from. Values above 1.0 indicate that the chains have not yet converged and the model may have problems.
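# A rough sketch of the Gelman-Rubin idea behind Rhat for two chains (my own
# illustration; Stan actually uses a split, rank-normalized variant):
rhat_sketch <- function(chain1, chain2) {
  n <- length(chain1)
  W <- mean(c(var(chain1), var(chain2)))        # within-chain variance
  B <- n * var(c(mean(chain1), mean(chain2)))   # between-chain variance
  var_hat <- (n - 1) / n * W + B / n            # pooled variance estimate
  sqrt(var_hat / W)                             # approaches 1 when chains agree
}
set.seed(9)
rhat_sketch(rnorm(1000), rnorm(1000))     # chains sampling the same target: ~1
rhat_sketch(rnorm(1000), rnorm(1000, 3))  # chains stuck in different places: > 1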
9E6. Sketch a good trace plot for a Markov chain, one that is effectively sampling from the posterior distribution. What is good about its shape? Then sketch a trace plot for a malfunctioning Markov chain. What about its shape indicates malfunction?
# First, let's sketch a good trace plot:
library(rethinking)  # for ulam(), traceplot(), trankplot(), precis(); may already be loaded in a setup chunk
y <- rnorm(10000, 1, 2)
good_trace <- ulam(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a,
    a ~ dnorm(0, 10),
    sigma ~ dcauchy(0, 1)
  ),
  data = list(y = y),
  cores = 2,
  chains = 2,
  start = list(
    a = 0,
    sigma = 1
  )
)
traceplot(good_trace)
# As we can see from the plot, the chains quickly find the region of highest posterior density and then stay within it, mixing well and looking stationary.
# Now let's sketch a bad trace plot:
bad_trace <- ulam(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a1 + a2,
    a1 ~ dnorm(0, 10),
    a2 ~ dnorm(0, 10),
    sigma ~ dcauchy(0, 1)
  ),
  data = list(y = y),
  chains = 2,
  cores = 2,
  start = list(
    a1 = 0,
    a2 = 0,
    sigma = 1
  )
)
## Warning: There were 540 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
traceplot(bad_trace)
# As we can see from the plot, the chains for a1 and a2 fluctuate wildly and never stabilize: only their sum is identified by the data, so an increase in one can always be canceled by a decrease in the other. In contrast, sigma behaves well, finding the high-posterior region and staying there.
9E7. Repeat the problem above, but now for a trace rank plot.
# Let's first sketch the good trace rank plot
trankplot(good_trace)
# As we can see from the plot, the rank histograms for the two chains overlap heavily and stay roughly uniform across ranks, which is what a healthy sampler should produce.
# Let's then sketch the bad trace rank plot
trankplot(bad_trace)
# As we can see from the plot, the rank histograms do not overlap much and show large spikes, especially at the low and high ranks. This indicates that the chains are exploring the posterior inconsistently.
9M1. Re-estimate the terrain ruggedness model from the chapter, but now using a uniform prior for the standard deviation, sigma. The uniform prior should be dunif(0,1). Visualize the priors. Use ulam to estimate the posterior. Visualize the posteriors for both models. Does the different prior have any detectible influence on the posterior distribution of sigma? Why or why not?
# Let's first load and prepare the data
data(rugged)
d <- rugged
d$log_gdp <- log(d$rgdppc_2000)
dd <- d[complete.cases(d$rgdppc_2000), ]
dd$log_gdp_std <- dd$log_gdp / mean(dd$log_gdp)
dd$rugged_std <- dd$rugged / max(dd$rugged)
dd$cid <- ifelse(dd$cont_africa == 1, 1, 2)
dat_slim <- list(
  log_gdp_std = dd$log_gdp_std,
  rugged_std = dd$rugged_std,
  cid = as.integer(dd$cid)
)
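# The question also asks us to visualize the priors. A quick sketch comparing
# the chapter's dexp(1) prior with the dunif(0, 1) prior for sigma:
curve(dexp(x, 1), from = 0, to = 2, lwd = 2,
      xlab = "sigma", ylab = "prior density")
curve(dunif(x, 0, 1), from = 0, to = 2, add = TRUE, lwd = 2, lty = 2)
legend("topright", legend = c("dexp(1)", "dunif(0, 1)"), lwd = 2, lty = c(1, 2))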
# Then, let's fit both models with ulam: the original from the chapter and the uniform-prior version
# First, the model with the chapter's dexp(1) prior on sigma
m_9M1_orig <- ulam(
  alist(
    log_gdp_std ~ dnorm(mu, sigma),
    mu <- a[cid] + b[cid] * (rugged_std - 0.215),
    a[cid] ~ dnorm(1, 0.1),
    b[cid] ~ dnorm(0, 0.3),
    sigma ~ dexp(1)
  ),
  data = dat_slim,
  chains = 2,
  cores = 2
)
# Second, the model with the dunif(0, 1) prior on sigma
m_9M1_uni <- ulam(
  alist(
    log_gdp_std ~ dnorm(mu, sigma),
    mu <- a[cid] + b[cid] * (rugged_std - 0.215),
    a[cid] ~ dnorm(1, 0.1),
    b[cid] ~ dnorm(0, 0.3),
    sigma ~ dunif(0, 1)
  ),
  data = dat_slim,
  chains = 2,
  cores = 2
)
# Let's compare these two models
pairs(m_9M1_orig)
pairs(m_9M1_uni)
precis(m_9M1_orig, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8869169 0.016340400 0.86022841 0.91141555 1192.014 1.0024168
## a[2] 1.0505809 0.009975026 1.03508499 1.06598384 1521.424 1.0000378
## b[1] 0.1310527 0.076440988 0.01053792 0.25404646 1036.578 0.9993423
## b[2] -0.1420161 0.057219458 -0.22990955 -0.04615411 1295.708 0.9997150
## sigma 0.1115488 0.006285802 0.10169826 0.12190974 1180.419 1.0003212
precis(m_9M1_uni, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8870811 0.016871050 0.86119951 0.91407544 1156.8083 0.9991797
## a[2] 1.0506031 0.010038648 1.03450640 1.06731113 973.9089 1.0012681
## b[1] 0.1314396 0.073823912 0.01039421 0.25164018 860.9613 0.9986163
## b[2] -0.1439426 0.056795261 -0.24102328 -0.05581452 953.4224 0.9991132
## sigma 0.1115098 0.005958955 0.10261915 0.12156540 1191.7311 0.9995119
# As we can see from the plots and tables, the two models are nearly identical in every parameter, with only small differences in n_eff, and the posterior for sigma is essentially unchanged.
# Therefore, the different prior has no detectable influence on the posterior distribution of sigma: with 170 countries the likelihood dominates, and both priors are effectively flat over the narrow region (around 0.11) where the posterior concentrates.
9M2. Modify the terrain ruggedness model again. This time, change the prior for b[cid] to dexp(0.3). What does this do to the posterior distribution? Can you explain it?
# Let's fit the model with the new prior for b[cid], keeping the rest of the chapter model unchanged
m_9M1_b_cid <- ulam(
  alist(
    log_gdp_std ~ dnorm(mu, sigma),
    mu <- a[cid] + b[cid] * (rugged_std - 0.215),
    a[cid] ~ dnorm(1, 0.1),
    b[cid] ~ dexp(0.3),
    sigma ~ dexp(1)
  ),
  data = dat_slim,
  chains = 2,
  cores = 2
)
# Let's compare it with the original one
pairs(m_9M1_orig)
pairs(m_9M1_b_cid)
precis(m_9M1_orig, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.8869169 0.016340400 0.86022841 0.91141555 1192.014 1.0024168
## a[2] 1.0505809 0.009975026 1.03508499 1.06598384 1521.424 1.0000378
## b[1] 0.1310527 0.076440988 0.01053792 0.25404646 1036.578 0.9993423
## b[2] -0.1420161 0.057219458 -0.22990955 -0.04615411 1295.708 0.9997150
## sigma 0.1115488 0.006285802 0.10169826 0.12190974 1180.419 1.0003212
precis(m_9M1_b_cid, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] 0.88671367 0.01539428 0.861243080 0.91058237 790.6983 0.9982596
## a[2] 1.04812255 0.01064591 1.031466913 1.06528737 519.6330 1.0007106
## b[1] 0.14631392 0.07436966 0.030257756 0.26891748 754.9524 1.0010669
## b[2] 0.01807038 0.01679443 0.001312943 0.05032094 1038.0073 1.0006347
## sigma 0.11418410 0.00652012 0.103995588 0.12530825 807.9304 1.0006187
# This time the posterior does change in an important way. a[1], a[2], b[1], and sigma are nearly unchanged, but b[2] moves from about -0.14 in the original model to about 0.02, piled up just above zero.
# The reason is that dexp(0.3) is an exponential prior, which puts zero probability on negative values. The data favor a negative ruggedness slope for non-African countries, but the prior makes negative values impossible, so the posterior for b[2] gets squeezed against the zero boundary instead of going negative.
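# To see where this comes from, compare the two priors for b directly (my own
# sketch): dnorm(0, 0.3) allows negative slopes, while dexp(0.3) puts zero
# density below 0.
curve(dnorm(x, 0, 0.3), from = -1, to = 1, lwd = 2,
      xlab = "b", ylab = "prior density")
curve(dexp(x, 0.3), from = 0, to = 1, add = TRUE, lwd = 2, lty = 2)
legend("topright", legend = c("dnorm(0, 0.3)", "dexp(0.3)"), lwd = 2, lty = c(1, 2))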
9M3. Re-estimate one of the Stan models from the chapter, but at different numbers of warmup iterations. Be sure to use the same number of sampling iterations in each case. Compare the n_eff values. How much warmup is enough?
# We can stick with the terrain ruggedness model.
# Let's first create a starting point.
start_point <- list(a = c(1, 1), b = c(0, 0), sigma = 1)
# Now we set the model
m_9M3 <- ulam(
  alist(
    log_gdp_std ~ dnorm(mu, sigma),
    mu <- a[cid] + b[cid] * (rugged_std - 0.215),
    a[cid] ~ dnorm(1, 0.1),
    b[cid] ~ dnorm(0, 0.3),
    sigma ~ dexp(1)
  ),
  data = dat_slim,
  start = start_point,
  chains = 2,
  cores = 2,
  iter = 100
)
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
# Now we set up the list of warmup lengths to run against the model
warmup_list <- c(5, 10, 25, 50, 100, 300, 500, 800, 1000)
# Then we run each warmup length and store the n_eff values in a matrix
matrix_n_eff <- matrix(NA, nrow = length(warmup_list), ncol = 5)
for (i in 1:length(warmup_list)) {
  w <- warmup_list[i]
  m_warmup <- ulam(
    m_9M3,
    chains = 2,
    cores = 2,
    iter = 1000 + w,  # keeps the number of sampling iterations fixed at 1000
    warmup = w,
    start = start_point
  )
  matrix_n_eff[i, ] <- precis(m_warmup, 2)$n_eff
}
## Warning: There were 1442 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 3.19, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
# Let's make the column and row names more meaningful:
# columns are the parameters, rows are the warmup lengths,
# and each cell holds the corresponding n_eff
colnames(matrix_n_eff) <- rownames(precis(m_warmup, 2))
rownames(matrix_n_eff) <- warmup_list
matrix_n_eff
## a[1] a[2] b[1] b[2] sigma
## 5 1.622171 1.528533 1.104641 1.022921 1.185714
## 10 2400.976475 2373.476140 1030.991812 1369.147574 1490.101820
## 25 2262.688074 2047.315270 732.599196 1323.090607 1167.923019
## 50 2524.031248 2210.357579 872.466946 1071.418517 1187.219546
## 100 2258.612095 2439.490892 838.215115 1292.754053 1290.664647
## 300 2509.355736 2994.691051 1945.929485 2583.050705 2163.646761
## 500 3073.648058 3039.138977 2131.771725 2545.780281 2445.235815
## 800 2547.978854 2899.202311 2535.075460 2827.632955 2220.012952
## 1000 2380.332958 3336.294993 2418.349714 2694.021933 2453.422506
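# To make the pattern easier to see, here is a quick sketch plotting n_eff for
# each parameter against warmup length (log scale on the x-axis):
matplot(warmup_list, matrix_n_eff, type = "b", log = "x", pch = 1:5, lty = 1,
        xlab = "warmup iterations", ylab = "n_eff")
legend("bottomright", legend = colnames(matrix_n_eff), pch = 1:5, col = 1:5, lty = 1)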
# As we can see from the table (and the plot above), with only 5 warmup iterations the sampler fails completely (n_eff near 1). From 10 warmup iterations onward n_eff is already large, and it improves only modestly beyond a few hundred. Therefore, for this simple model a very short warmup (on the order of tens of iterations) is already enough, though a few hundred gives the most stable results.
9H1. Run the model below and then inspect the posterior distribution and explain what it is accomplishing.
mp <- ulam(
  alist(
    a ~ dnorm(0, 1),
    b ~ dcauchy(0, 1)
  ),
  data = list(y = 1),
  chains = 1
)
##
## SAMPLING FOR MODEL 'bcf56ee89f6cf2a4224a4139ff01c7d4' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.026 seconds (Warm-up)
## Chain 1: 0.018 seconds (Sampling)
## Chain 1: 0.044 seconds (Total)
## Chain 1:
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
Compare the samples for the parameters a and b. Can you explain the different trace plots? If you are unfamiliar with the Cauchy distribution, you should look it up. The key feature to attend to is that it has no expected value. Can you connect this fact to the trace plot?
# Let's first check the posterior
precis(mp, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a -0.05340716 0.9924407 -1.717610 1.494891 288.01416 0.9993281
## b 0.15894266 8.0190101 -8.168638 5.615980 52.88219 1.0088268
pairs(mp)
# We can see that the samples for a follow a Normal(0, 1) distribution and the samples for b follow a Cauchy(0, 1) distribution. Note that the model has no likelihood: y is passed in but never used, so the "posterior" being sampled is just the joint prior. The model is accomplishing nothing more than drawing samples from the two priors.
# Now let's see the trace plot
traceplot(mp)
# As we can see from the plot, the trace for a is stationary around 0 and stays mostly within -2 to 2, exactly what Normal(0, 1) samples should look like. The trace for b mostly hovers near 0 but occasionally jumps to extreme values, producing large spikes. This is the Cauchy's heavy tails at work: because the Cauchy distribution has no expected value, there is no mean for the samples to settle around, and rare but enormous draws keep occurring no matter how long the chain runs.
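# The lack of an expected value is easy to demonstrate directly (my own sketch,
# not from the chapter): the running mean of Normal(0, 1) draws settles down,
# while the running mean of Cauchy(0, 1) draws keeps getting knocked around by
# extreme values and never converges.
set.seed(9)
n <- 10000
rm_norm <- cumsum(rnorm(n)) / seq_len(n)
rm_cauchy <- cumsum(rcauchy(n)) / seq_len(n)
plot(rm_cauchy, type = "l", col = "red",
     xlab = "number of samples", ylab = "running mean")
lines(rm_norm, col = "blue")
legend("topright", legend = c("Cauchy(0, 1)", "Normal(0, 1)"),
       col = c("red", "blue"), lty = 1)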