11E1. If an event has probability 0.35, what are the log-odds of this event?
p <- 0.35
# log-odds are the log of the odds: log( p / (1-p) )
log( p/(1-p) )
## [1] -0.6190392
11E2. If an event has log-odds 3.2, what is the probability of this event?
lo <- 3.2
# probability is the inverse logit (logistic) of the log-odds
exp(lo)/(1+exp(lo))
## [1] 0.9608343
11E3. Suppose that a coefficient in a logistic regression has value 1.7. What does this imply about the proportional change in odds of the outcome?
# Each one-unit increase in the predictor multiplies the odds of the outcome by exp(1.7) ~= 5.47 (see the sketch below).
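# A minimal sketch with assumed values (intercept 0, predictor moving from 0 to 1) of the odds ratio implied by a coefficient of 1.7:
odds_before <- exp( 0 + 1.7*0 )
odds_after  <- exp( 0 + 1.7*1 )
odds_after / odds_before   # equals exp(1.7), about 5.47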
11E4. Why do Poisson regressions sometimes require the use of an offset? Provide an example.
# The Poisson distribution models the number of events per unit of time (or space).
# The outcome can be measured on a different scale for each observation (e.g. daily vs weekly counts).
# An offset puts all observations on the same scale of exposure.
# * In the chapter's example, the number of manuscripts produced by a monastery is recorded daily in one case and weekly in another; the offset (the log of the exposure) converts all measurements to a common daily rate. A sketch follows below.
# * Counting animals across regions of different size is another example where an offset helps; the log of each region's area serves as the offset.
# * When predicting conversions per ad campaign, the number of impressions can serve as the exposure. I think this is a Poisson rather than binomial regression because the conversion rate is usually tiny while the number of impressions is huge (1 vs 1e+6).
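# A minimal sketch along the lines of the chapter's monastery example (simulated data; the object
# names d_mon and m_offset are mine). The log of the exposure enters the linear model as an offset,
# so daily and weekly counts are modeled on a common daily rate.
library(rethinking)
set.seed(11)
y_daily  <- rpois( 30 , 1.5 )      # 30 days of daily counts, true rate 1.5 per day
y_weekly <- rpois( 4 , 0.5*7 )     # 4 weeks of weekly counts, true rate 0.5 per day
d_mon <- data.frame(
  y = c( y_daily , y_weekly ) ,
  exposure = c( rep(1,30) , rep(7,4) ) ,
  monastery = c( rep(0,30) , rep(1,4) ) )
d_mon$log_days <- log( d_mon$exposure )
m_offset <- quap(
  alist(
    y ~ dpois( lambda ) ,
    log(lambda) <- log_days + a + b*monastery ,   # log_days is the offset
    a ~ dnorm( 0 , 1 ) ,
    b ~ dnorm( 0 , 1 )
  ) , data=d_mon )
precis( m_offset )   # a and b are interpretable on the daily-rate scale for both monasteries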
11M1. As explained in the chapter, binomial data can be organized in aggregated and disaggregated forms, without any impact on inference. But the likelihood of the data does change when the data are converted between the two formats. Can you explain why?
# The aggregated binomial likelihood contains the multiplicative term C(n,m), the number of orderings, which becomes an additive constant on the log scale. In the disaggregated format each trial is modeled independently as a single Bernoulli outcome (one coin flip), so this term disappears.
# likelihood in the aggregated format: C(n,m) * p^m * (1-p)^(n-m)
# likelihood in the disaggregated format: p^m * (1-p)^(n-m)
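# A quick numerical check with assumed values (n = 9 trials, m = 6 successes, p = 0.6): the two
# log-likelihoods differ by exactly log C(n,m), a constant that does not depend on p.
p <- 0.6 ; n <- 9 ; m <- 6
ll_agg <- dbinom( m , size=n , prob=p , log=TRUE )                                 # aggregated
ll_dis <- sum( dbinom( c(rep(1,m), rep(0,n-m)) , size=1 , prob=p , log=TRUE ) )    # disaggregated
ll_agg - ll_dis      # equals log( choose(n,m) )
log( choose(n,m) )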
11M2. If a coefficient in a Poisson regression has value 1.7, what does this imply about the change in the outcome?
# It implies that a one-unit change in the predictor multiplies the lambda parameter of the Poisson distribution by exp(1.7) ~= 5.47. Thus the average number of events per interval increases about 5.47-fold.
exp(1.7)
## [1] 5.473947
11M3. Explain why the logit link is appropriate for a binomial generalized linear model.
# The binomial likelihood is parameterized by p, the probability of an event.
# We want to model p with a linear combination of the predictors,
# but p must fall in the [0, 1] range.
# So instead of p we model f(p) as the linear relation a + b*x, where f is chosen such that
# p = f^-1(a + b*x) lies in [0, 1].
# f = logit = log(p/(1-p)); f^-1 = logistic = exp(a + b*x) / (1 + exp(a + b*x))
# The logit link therefore enforces the required constraint (see the sketch below).
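# A minimal sketch: the inverse logit (logistic) maps any real value of the linear predictor
# into (0, 1), so the implied probability is always valid.
x <- seq( -6 , 6 , length.out=5 )
p <- exp(x) / ( 1 + exp(x) )    # same as plogis(x)
rbind( x=x , p=round(p, 4) )
range(p)                        # always strictly inside (0, 1)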
11M4. Explain why the log link is appropriate for a Poisson generalized linear model.
# As in the previous case, we want to model lambda, the rate parameter of the Poisson distribution, with a linear model.
# Lambda must lie in (0, +inf). The exponential function is an appropriate inverse link: it maps the (-inf, +inf) range of the linear combination of parameters into (0, +inf). The link function is then the logarithm, the inverse of exp (see the sketch below).
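# A minimal sketch: exp maps any real value of the linear predictor into (0, +inf),
# so the implied Poisson rate can never be negative.
x <- seq( -4 , 4 , length.out=5 )
lambda <- exp(x)
rbind( x=x , lambda=round(lambda, 3) )
all( lambda > 0 )   # TRUE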
11M5. What would it imply to use a logit link for the mean of a Poisson generalized linear model? Can you think of a real research problem for which this would make sense?
# Using a logit link implies that the lambda parameter of the Poisson likelihood always falls in the (0, 1) range.
# There is nothing wrong with this technically, though it is not obvious which real processes call for such a constraint.
# Here is a simulation of such a process:
prop.table(table(rpois(1e+5, .1)))
##
## 0 1 2 3 4
## 0.90499 0.09030 0.00456 0.00014 0.00001
# One possible case worth modelling this way is a process with a very large share of zero counts and no exponential growth in the predicted rate; a sketch of such a model follows below.
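# A minimal sketch of fitting such a model, assuming simulated data (the names lambda_true and
# m_logit_pois are mine); with the logit link, the estimated rate is bounded above by 1.
library(rethinking)
set.seed(11)
x <- rnorm( 200 )
lambda_true <- plogis( -2 + 0.5*x )     # true rate, bounded in (0, 1)
y <- rpois( 200 , lambda_true )
m_logit_pois <- quap(
  alist(
    y ~ dpois( lambda ) ,
    logit(lambda) <- a + b*x ,
    a ~ dnorm( 0 , 1.5 ) ,
    b ~ dnorm( 0 , 1 )
  ) , data=list( x=x , y=y ) )
precis( m_logit_pois )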
11M6. State the constraints for which the binomial and Poisson distributions have maximum entropy. Are the constraints different at all for binomial and Poisson? Why or why not?
# The binomial distribution has maximum entropy when each trial results in one of two possible events and the expected value is constant. The Poisson distribution is a limiting case of the binomial (many trials, each with a very small probability of success; see the sketch below). Because there is no other maximum-entropy distribution for the binomial constraints, I assume the Poisson adds some new constraint.
# From the form of the Poisson distribution we know that its variance equals its expected value and both are constant, so I take this equality to be the additional constraint for the Poisson.
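# A minimal numeric sketch of the limiting-case claim, with assumed values (n = 1000, p = 0.002,
# so n*p = 2): the binomial probabilities are already very close to the Poisson probabilities.
n <- 1000 ; p <- 0.002
k <- 0:8
round( rbind( binomial=dbinom(k, n, p) , poisson=dpois(k, n*p) ) , 4 )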
11M7. Use quap to construct a quadratic approximate posterior distribution for the chimpanzee model that includes a unique intercept for each actor, m11.4 (page 330). Compare the quadratic approximation to the posterior distribution produced instead from MCMC. Can you explain both the differences and the similarities between the approximate and the MCMC distributions? Relax the prior on the actor intercepts to Normal(0,10). Re-estimate the posterior using both ulam and quap. Do the differences increase or decrease? Why?
library(rethinking)
data(chimpanzees)
d <- chimpanzees
d$treatment <- 1 + d$prosoc_left + 2*d$condition
m11.1 <- quap(
  alist(
    pulled_left ~ dbinom( 1 , p ) ,
    logit(p) <- a ,
    a ~ dnorm( 0 , 10 )
  ) , data=d )
m11.2 <- quap(
  alist(
    pulled_left ~ dbinom( 1 , p ) ,
    logit(p) <- a + b[treatment] ,
    a ~ dnorm( 0 , 1.5 ) ,
    b[treatment] ~ dnorm( 0 , 10 )
  ) , data=d )
set.seed(1999)
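# A sketch of how the rest of this exercise might proceed (the object names f11.4, dat_list,
# m11.4_quap, m11.4_ulam and the relaxed-prior variants are mine; priors follow the chapter's m11.4).
dat_list <- list(
  pulled_left = d$pulled_left ,
  actor = d$actor ,
  treatment = as.integer(d$treatment) )
f11.4 <- alist(
  pulled_left ~ dbinom( 1 , p ) ,
  logit(p) <- a[actor] + b[treatment] ,
  a[actor] ~ dnorm( 0 , 1.5 ) ,
  b[treatment] ~ dnorm( 0 , 0.5 ) )
m11.4_quap <- quap( f11.4 , data=dat_list )
m11.4_ulam <- ulam( f11.4 , data=dat_list , chains=4 , log_lik=TRUE )
# Same model with the relaxed Normal(0, 10) prior on the actor intercepts:
f11.4_wide <- alist(
  pulled_left ~ dbinom( 1 , p ) ,
  logit(p) <- a[actor] + b[treatment] ,
  a[actor] ~ dnorm( 0 , 10 ) ,
  b[treatment] ~ dnorm( 0 , 0.5 ) )
m11.4w_quap <- quap( f11.4_wide , data=dat_list )
m11.4w_ulam <- ulam( f11.4_wide , data=dat_list , chains=4 , log_lik=TRUE )
# Comparing precis( ... , depth=2 ) for each pair: most parameters agree closely, because their
# posteriors are roughly Gaussian. The intercept for actor 2 (who always pulled left) has a skewed
# posterior that the quadratic approximation cannot capture, and relaxing the prior to Normal(0, 10)
# lets that skew grow, so the quap/ulam differences should increase.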
11M8. Revisit the data(Kline) islands example. This time drop Hawaii from the sample and refit the models. What changes do you observe?
data(Kline)
d <- Kline
d$P <- scale( log(d$population) )
d$contact_id <- ifelse( d$contact=="high" , 2 , 1 )
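# A sketch of one way to finish this exercise (object names d2, dat2, m_tools_noHI are mine;
# the model structure and priors follow the chapter's interaction model m11.10).
d2 <- d[ d$culture != "Hawaii" , ]
dat2 <- list(
  T = d2$total_tools ,
  P = as.numeric( scale( log(d2$population) ) ) ,
  cid = ifelse( d2$contact=="high" , 2 , 1 ) )
m_tools_noHI <- ulam(
  alist(
    T ~ dpois( lambda ) ,
    log(lambda) <- a[cid] + b[cid]*P ,
    a[cid] ~ dnorm( 3 , 0.5 ) ,
    b[cid] ~ dnorm( 0 , 0.2 )
  ) , data=dat2 , chains=4 , log_lik=TRUE )
precis( m_tools_noHI , depth=2 )
# Hawaii is a low-contact island with by far the largest population and tool count, so it is the
# most influential point. Without it, the low-contact slope should flatten and the implausible
# crossing of the high- and low-contact trends at large populations should disappear.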
11H1. Use WAIC or PSIS to compare the chimpanzee model that includes a unique intercept for each actor, m11.4 (page 330), to the simpler models fit in the same section. Interpret the results.
data("chimpanzees")
d <- chimpanzees
# full model: interaction of prosocial option and condition (single intercept shared by all actors)
m11.3 <- quap(
  alist(
    pulled_left ~ dbinom( 1 , p ) ,
    logit(p) <- a + (bp + bpC*condition)*prosoc_left ,
    a ~ dnorm( 0 , 10 ) ,
    bp ~ dnorm( 0 , 10 ) ,
    bpC ~ dnorm( 0 , 10 )
  ) , data=d )
# model with the prosocial predictor only (no condition effect; single shared intercept)
m11.2 <- quap(
  alist(
    pulled_left ~ dbinom( 1 , p ) ,
    logit(p) <- a + bp*prosoc_left ,
    a ~ dnorm( 0 , 10 ) ,
    bp ~ dnorm( 0 , 10 )
  ) , data=d )
# intercept-only model (no predictors)
m11.1 <- quap(
  alist(
    pulled_left ~ dbinom( 1 , p ) ,
    logit(p) <- a ,
    a ~ dnorm( 0 , 10 )
  ) , data=d )
# compare
compare(m11.1,m11.2,m11.3)
## WAIC SE dWAIC dSE pWAIC weight
## m11.2 680.4441 9.344763 0.000000 NA 1.9741680 0.70795632
## m11.3 682.3403 9.354362 1.896185 0.752352 2.9991969 0.27431847
## m11.1 687.8189 7.206162 7.374788 6.179243 0.9385454 0.01772521
# The intercept-only model (m11.1) is the worst: it has the largest WAIC and receives almost no model weight.
# The full model (m11.3, with the condition interaction) is close by WAIC, but its difference from the best model is still larger than the standard error of that difference (dSE), and it receives much less of the weight.
# So there is only a tiny WAIC difference between the models with and without the `condition` variable, and the model without it (m11.2) looks slightly better. A sketch of adding the per-actor-intercept model that the question asks about follows below.
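# The question also asks to compare these to the model with a unique intercept for each actor
# (m11.4 in the text). A sketch of adding it (the object name dat_actor is mine; treatment coding
# and priors follow the chapter):
dat_actor <- list(
  pulled_left = d$pulled_left ,
  actor = d$actor ,
  treatment = as.integer( 1 + d$prosoc_left + 2*d$condition ) )
m11.4 <- quap(
  alist(
    pulled_left ~ dbinom( 1 , p ) ,
    logit(p) <- a[actor] + b[treatment] ,
    a[actor] ~ dnorm( 0 , 1.5 ) ,
    b[treatment] ~ dnorm( 0 , 0.5 )
  ) , data=dat_actor )
compare( m11.1 , m11.2 , m11.3 , m11.4 )
# The per-actor model should come out far ahead by WAIC: most of the variation in pulled_left is
# between actors (e.g. handedness), which the single-intercept models above cannot capture.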