Chapter 11 - God Spiked the Integers

This chapter described some of the most common generalized linear models, those used to model counts. It is important to never convert counts to proportions before analysis, because doing so destroys information about sample size. A fundamental difficulty with these models is that parameters are on a different scale, typically log-odds (for binomial) or log-rate (for Poisson), than the outcome variable they describe. Therefore computing implied predictions is even more important than before.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard(H).

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

Questions

11E1. If an event has probability 0.35, what are the log-odds of this event?

log(0.35/(1-0.35))
## [1] -0.6190392

11E2. If an event has log-odds 3.2, what is the probability of this event?

# log(x/(1-x)) = 3.2
# x/(1-x) = exp(3.2)
# x = exp(3.2) - exp(3.2)*x
# (1+exp(3.2)) * x = exp(3.2)
# x = exp(3.2)/(1+exp(3.2))
exp(3.2)/(1+exp(3.2))
## [1] 0.9608343

11E3. Suppose that a coefficient in a logistic regression has value 1.7. What does this imply about the proportional change in odds of the outcome?

exp(1.7)
## [1] 5.473947
# The odds increase 5.473947.

11E4. Why do Poisson regressions sometimes require the use of an offset? Provide an example.

# The purpose of the offset to adjust the count into a time unit. For example, 10 counts in a year should not be the same to the 10 counts in 10 years. 

11M1. As explained in the chapter, binomial data can be organized in aggregated and disaggregated forms, without any impact on inference. But the likelihood of the data does change when the data are converted between the two formats. Can you explain why?

# Even though the likelihood change when converting between the two formats, it is just the changes on the magnitude which won't influence the inference since it is not in the function of p. 

11M2. If a coefficient in a Poisson regression has value 1.7, what does this imply about the change in the outcome?

exp(1.7)
## [1] 5.473947
# The outcome increase 5.473947

11M3. Explain why the logit link is appropriate for a binomial generalized linear model.

# The reason is that logit link can map from [0,1] to [-Inf, Inf] then we can use linear regression model directly on the transformed data. 

11M4. Explain why the log link is appropriate for a Poisson generalized linear model.

# the reason is that log link can map from [0, Inf] to [-Inf, Inf] then we can use linear regression model directly on the transformed data. 

11M5. What would it imply to use a logit link for the mean of a Poisson generalized linear model? Can you think of a real research problem for which this would make sense?

# If we want to use logit link for the Poisson generalized linear model, that means we can only have maximum 1 event per time unit.

11M6. State the constraints for which the binomial and Poisson distributions have maximum entropy. Are the constraints different at all for binomial and Poisson? Why or why not?

# When 1.outcomes are discrete binary and 2. expected value is constant, binomial distribution will have maximum entropy.

# The constraints for Poisson distribution is the same since Poisson distribution is a special case of Binomial distribution. When n is really large and p is really small, the poisson distribution can with lambda = np can approximate binomial distribution really well.

11M7. Use quap to construct a quadratic approximate posterior distribution for the chimpanzee model that includes a unique intercept for each actor, m11.4 (page 330). Plot and compare the quadratic approximation to the posterior distribution produced instead from MCMC. Can you explain both the differences and the similarities between the approximate and the MCMC distributions? Relax the prior on the actor intercepts to Normal(0,10). Re-estimate the posterior using both ulam and quap. Plot and compare the posterior distributions. Do the differences increase or decrease? Why?

data("chimpanzees")
d<- chimpanzees
d$recipient <- NULL

q2 <- map(alist(
  pulled_left ~ dbinom( 1,p) ,
  logit(p) <- a[actor] + (bp + bpC*condition)*prosoc_left ,
  a[actor] ~ dnorm(0,10),
  bp ~ dnorm(0,10),
  bpC ~ dnorm(0,10)
) ,
data=d)
precis(q2, depth=2)
##            mean        sd       5.5%      94.5%
## a[1] -0.7261803 0.2684852 -1.1552714 -0.2970891
## a[2]  6.6716135 3.6066904  0.9074257 12.4358013
## a[3] -1.0309246 0.2784264 -1.4759038 -0.5859454
## a[4] -1.0309229 0.2784263 -1.4759019 -0.5859438
## a[5] -0.7261796 0.2684852 -1.1552708 -0.2970885
## a[6]  0.2127752 0.2670014 -0.2139446  0.6394950
## a[7]  1.7545477 0.3845064  1.1400322  2.3690633
## bp    0.8221325 0.2610076  0.4049920  1.2392730
## bpC  -0.1318329 0.2969348 -0.6063920  0.3427262
pairs(q2)

stan <- map2stan(
  alist(
    pulled_left ~ dbinom(1, p),
    logit(p) <- a[actor] + (bp + bpC*condition)*prosoc_left,
    a[actor] ~ dnorm(0, 10),
    bp ~ dnorm(0, 10),
    bpC ~ dnorm(0, 10)
  ),
  data=d, chains=2, iter=2500, warmup=500
)
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DBOOST_NO_AUTO_PTR  -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1   -I/usr/local/include   -fPIC  -Wall -g -O2  -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:16: error: expected ';' after top level declarator
## namespace Eigen {
##                ^
##                ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
##          ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
## 
## SAMPLING FOR MODEL '3b5b5c48fdb3951b2b35063c0971da92' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 9.9e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.99 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2500 [  0%]  (Warmup)
## Chain 1: Iteration:  250 / 2500 [ 10%]  (Warmup)
## Chain 1: Iteration:  500 / 2500 [ 20%]  (Warmup)
## Chain 1: Iteration:  501 / 2500 [ 20%]  (Sampling)
## Chain 1: Iteration:  750 / 2500 [ 30%]  (Sampling)
## Chain 1: Iteration: 1000 / 2500 [ 40%]  (Sampling)
## Chain 1: Iteration: 1250 / 2500 [ 50%]  (Sampling)
## Chain 1: Iteration: 1500 / 2500 [ 60%]  (Sampling)
## Chain 1: Iteration: 1750 / 2500 [ 70%]  (Sampling)
## Chain 1: Iteration: 2000 / 2500 [ 80%]  (Sampling)
## Chain 1: Iteration: 2250 / 2500 [ 90%]  (Sampling)
## Chain 1: Iteration: 2500 / 2500 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.800598 seconds (Warm-up)
## Chain 1:                3.79251 seconds (Sampling)
## Chain 1:                4.59311 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL '3b5b5c48fdb3951b2b35063c0971da92' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 7.1e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.71 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2500 [  0%]  (Warmup)
## Chain 2: Iteration:  250 / 2500 [ 10%]  (Warmup)
## Chain 2: Iteration:  500 / 2500 [ 20%]  (Warmup)
## Chain 2: Iteration:  501 / 2500 [ 20%]  (Sampling)
## Chain 2: Iteration:  750 / 2500 [ 30%]  (Sampling)
## Chain 2: Iteration: 1000 / 2500 [ 40%]  (Sampling)
## Chain 2: Iteration: 1250 / 2500 [ 50%]  (Sampling)
## Chain 2: Iteration: 1500 / 2500 [ 60%]  (Sampling)
## Chain 2: Iteration: 1750 / 2500 [ 70%]  (Sampling)
## Chain 2: Iteration: 2000 / 2500 [ 80%]  (Sampling)
## Chain 2: Iteration: 2250 / 2500 [ 90%]  (Sampling)
## Chain 2: Iteration: 2500 / 2500 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.669276 seconds (Warm-up)
## Chain 2:                2.15642 seconds (Sampling)
## Chain 2:                2.8257 seconds (Total)
## Chain 2:
precis(stan, depth=2)
##            mean        sd       5.5%      94.5%    n_eff     Rhat4
## a[1] -0.7388692 0.2740286 -1.1756318 -0.2932046 2846.607 1.0006255
## a[2] 11.1868489 5.5811916  4.4887534 21.5483434 1955.414 1.0001673
## a[3] -1.0479730 0.2801323 -1.5112660 -0.6154369 3151.018 0.9999610
## a[4] -1.0454123 0.2786197 -1.4915911 -0.6242526 2771.109 1.0010378
## a[5] -0.7366332 0.2762258 -1.1761415 -0.2981318 3381.982 0.9998491
## a[6]  0.2226815 0.2655904 -0.2000137  0.6455141 4397.702 1.0006333
## a[7]  1.8113921 0.3943392  1.2151596  2.4690684 3514.980 1.0000344
## bp    0.8312887 0.2648651  0.4117392  1.2529725 1890.847 1.0037960
## bpC  -0.1315532 0.3042695 -0.6185461  0.3447696 3357.928 1.0009007
pairs(stan)

# The means are pretty close except for a[2].
# MCMC has overall higher std.

11M8. Revisit the data(Kline) islands example. This time drop Hawaii from the sample and refit the models. What changes do you observe?

library(dplyr)

data(Kline)
d <- Kline
d <- d %>% 
  mutate(cid=ifelse(contact == "high", 2, 1),
                  stdPop=standardize(log(population))) 
d2 = d %>% 
  filter(culture != "Hawaii")

datList <- list(
  totTools = d$total_tools,
  stdPop = d$stdPop,
  cid = as.integer(d$cid)
)

m <- ulam(
  alist(
    totTools ~ dpois(lambda),
    log(lambda) <- a[cid] + b[cid]*stdPop,
    a[cid] ~ dnorm(3, 0.5),
    b[cid] ~ dnorm(0, 0.2)
  ),data = datList, chains = 4, cores = 4
)
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DBOOST_NO_AUTO_PTR  -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1   -I/usr/local/include   -fPIC  -Wall -g -O2  -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:16: error: expected ';' after top level declarator
## namespace Eigen {
##                ^
##                ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
##          ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
precis(m, 2)
##           mean         sd        5.5%     94.5%    n_eff     Rhat4
## a[1] 3.3186962 0.08321845  3.18499884 3.4486382 1894.381 0.9987359
## a[2] 3.6100011 0.07016987  3.50016670 3.7227295 2155.562 0.9995233
## b[1] 0.3788884 0.05012660  0.30029417 0.4620582 1794.909 0.9982366
## b[2] 0.1970149 0.16246009 -0.06648495 0.4515786 2259.609 0.9984206
datList2 <- list(
  totTools = d2$total_tools,
  stdPop = d2$stdPop,
  cid = as.integer(d2$cid)
)

m2 <- ulam(
  alist(
    totTools ~ dpois(lambda),
    log(lambda) <- a[cid] + b[cid]*stdPop,
    a[cid] ~ dnorm(3, 0.5),
    b[cid] ~ dnorm(0, 0.2)
  ),data = datList2, chains = 4, cores = 4
)
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DBOOST_NO_AUTO_PTR  -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1   -I/usr/local/include   -fPIC  -Wall -g -O2  -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:16: error: expected ';' after top level declarator
## namespace Eigen {
##                ^
##                ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
##          ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
precis(m2, 2)
##           mean         sd         5.5%     94.5%    n_eff     Rhat4
## a[1] 3.1785469 0.12254222  2.982054695 3.3691697 1612.621 0.9999678
## a[2] 3.6125894 0.07259429  3.493791511 3.7289580 1725.666 1.0003176
## b[1] 0.1926749 0.12446823 -0.008969917 0.3854473 1643.884 0.9996492
## b[2] 0.1866655 0.15479695 -0.064410023 0.4340975 2107.294 0.9989670
# After removing Hawaii which is the outlier in our dataset, the slopes are the same now. 

11H1. Use WAIC or PSIS to compare the chimpanzee model that includes a unique intercept for each actor, m11.4 (page 330), to the simpler models fit in the same section. Interpret the results.

data("chimpanzees")

d2 <- chimpanzees

m1 <- rethinking::map(
  alist(
    pulled_left ~ dbinom(1, p),
    logit(p) <- a,
    a ~ dnorm(0,10)
  ),
  data = d2 )

m2 <- rethinking::map(
  alist(
    pulled_left ~ dbinom(1, p) ,
    logit(p) <- a + bp*prosoc_left,
    a ~ dnorm(0,10) ,
    bp ~ dnorm(0,10)
  ),
  data = d2 )

m3 <- rethinking::map(
  alist(
    pulled_left ~ dbinom(1, p),
    logit(p) <- a + (bp + bpC*condition)*prosoc_left,
    a ~ dnorm(0,10),
    bp ~ dnorm(0,10),
    bpC ~ dnorm(0,10)
  ), 
  data = d2 )

m4 <- rethinking::map(
  alist(
    pulled_left ~ dbinom(1, p),
    logit(p) <- a[actor] + (bp + bpC*condition)*prosoc_left,
    a[actor] ~ dnorm(0, 10),
    bp ~ dnorm(0, 10),
    bpC ~ dnorm(0, 10)
  ),
  data = d2)


compare(m1,m2,m3,m4)
##        WAIC        SE    dWAIC      dSE      pWAIC       weight
## m4 544.9496 18.863817   0.0000       NA 13.3527865 1.000000e+00
## m2 680.5486  9.345572 135.5990 18.26942  2.0261231 3.589666e-30
## m3 682.2926  9.394348 137.3430 18.20716  2.9756967 1.500908e-30
## m1 687.8282  7.075198 142.8785 19.05541  0.9437476 9.426003e-32
plot(compare(m1,m2,m3,m4))

# From the plot, we can see that m4 performs better than others even though it has more parameters than others.