Chapter 12 - Monsters and Mixtures

This chapter introduced several new types of regression, all of which are generalizations of generalized linear models (GLMs). Ordered logistic models are useful for categorical outcomes with a strict ordering. They are built by attaching a cumulative link function to a categorical outcome distribution. Zero-inflated models mix together two different outcome distributions, allowing us to model outcomes with an excess of zeros. Models for overdispersion, such as beta-binomial and gamma-Poisson, draw the expected value of each observation from a distribution that changes shape as a function of a linear model.

Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

12E1. What is the difference between an ordered categorical variable and an unordered one? Define and then give an example of each.

# An ordered categorical variable has an inherent ranking among its categories, while an unordered (nominal) categorical variable does not.
# For example, education level (high school, college, graduate degree) is an ordered categorical variable, whereas gender is an unordered categorical variable.
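# A minimal R illustration of the difference (the category labels here are just illustrative):
education <- factor( c("college","high school","graduate"),
    levels=c("high school","college","graduate"), ordered=TRUE )
education[1] > education[2]   # TRUE: the ranking is meaningful
gender <- factor( c("female","male","female") )   # unordered: levels are only labels
# gender[1] > gender[2] would return NA with a warning, because no ordering exists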

12E2. What kind of link function does an ordered logistic regression employ? How does it differ from an ordinary logit link?

# Ordered logistic regression employs a cumulative logit link function. The ordered logit model is a regression model for an ordinal response variable: the link is applied to the cumulative probability Pr(y <= k) of the response, rather than to the probability of a single event as with an ordinary logit link.
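# A small numeric sketch of the difference, using made-up probabilities for a 3-category ordered outcome:
probs <- c(0.2, 0.5, 0.3)              # hypothetical probabilities of categories 1..3
cum_probs <- cumsum(probs)             # cumulative probabilities Pr(y <= k)
log( cum_probs / (1 - cum_probs) )     # cumulative log-odds used by the ordered logit (last value is Inf)
log( probs / (1 - probs) )             # ordinary logit applied to each category's probability separately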

12E3. When count data are zero-inflated, using a model that ignores zero-inflation will tend to induce which kind of inferential error?

# Using a model that ignores zero-inflation will tend to underestimate the rate of events, because a count distribution with extra zeros has a lower mean than the process that generates the non-zero counts. A quick simulation below illustrates this.
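# A simulation sketch (parameter values are assumed for illustration) showing the downward bias:
set.seed(11)
p_zero <- 0.3                          # probability of a structural zero
lambda_true <- 5                       # true rate of the Poisson process
zeros <- rbinom(1e4, 1, p_zero)
y <- (1 - zeros) * rpois(1e4, lambda_true)
mean(y)   # the plain-Poisson estimate of lambda is roughly (1 - p_zero)*lambda_true = 3.5, well below 5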

12E4. Over-dispersion is common in count data. Give an example of a natural process that might produce over-dispersed counts. Can you also give an example of a process that might produce underdispersed counts?

# Overdispersion usually occurs because the mean and variance of a GLM's outcome distribution are tied to the same parameter predicted by the linear model, while the real data vary more than that distribution allows. One example is the number of seedlings in a forest plot: there may be hundreds or none depending on the distance to the source tree, so such counts will be overdispersed relative to a Poisson distribution. Underdispersion, on the contrary, occurs when there is less variation in the data than predicted. For example, the number of defects may increase steadily as a tool wears out, and that systematic increase in defect counts across subgroups can make counts within subgroups more similar than they would be by chance.
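# A short simulation sketch (made-up parameters) contrasting Poisson and gamma-Poisson (negative binomial) counts:
set.seed(7)
mu <- 10
y_pois  <- rpois(1e4, mu)                      # Poisson: variance approximately equals the mean
y_gpois <- rnbinom(1e4, mu=mu, size=2)         # gamma-Poisson: variance ~ mu + mu^2/size = 60, overdispersed
c( mean(y_pois),  var(y_pois)  )
c( mean(y_gpois), var(y_gpois) )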

12M1. At a certain university, employees are annually rated from 1 to 4 on their productivity, with 1 being least productive and 4 most productive. In a certain department at this certain university in a certain year, the numbers of employees receiving each rating were (from 1 to 4): 12, 36, 7, 41. Compute the log cumulative odds of each rating.

n <- c( 12, 36, 7, 41 )   # number of employees receiving each rating 1..4
q <- n / sum(n)           # proportion receiving each rating
q
## [1] 0.12500000 0.37500000 0.07291667 0.42708333
sum(q)
## [1] 1
p <- cumsum(q)   # cumulative proportions
p
## [1] 0.1250000 0.5000000 0.5729167 1.0000000
log(p/(1-p))   # log cumulative odds of each rating
## [1] -1.9459101  0.0000000  0.2937611        Inf

12M2. Make a version of Figure 12.5 for the employee ratings data given just above.

plot( 1:4 , p , xlab="rating" , ylab="cumulative proportion" ,
xlim=c(0.7,4.3) , ylim=c(0,1) , xaxt="n" )
axis( 1 , at=1:4 , labels=1:4 )
# plot gray cumulative probability lines
for ( x in 1:4 ) lines( c(x,x) , c(0,p[x]) , col="gray" , lwd=2 )
# plot blue discrete probability segments
for ( x in 1:4 )
lines( c(x,x)+0.1 , c(p[x]-q[x],p[x]) , col="slateblue" , lwd=2 )
# add number labels
text( 1:4+0.2 , p-q/2 , labels=1:4 , col="slateblue" )

12M3. Can you modify the derivation of the zero-inflated Poisson distribution (ZIPoisson) from the chapter to construct a zero-inflated binomial distribution?

# The binomial probability of y successes in N trials is Pr(y | N, p) = choose(N, y) * p^y * (1-p)^(N-y).
# To zero-inflate it, let p0 be the probability that the binomial process never runs (a structural zero). Then:
# Pr(0 | p0, N, p)              = p0 + (1 - p0) * (1 - p)^N
# Pr(y | p0, N, p), for y > 0,  = (1 - p0) * choose(N, y) * p^y * (1 - p)^(N - y)
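# A minimal R sketch of this mixture as a probability function (my own helper, not necessarily
# matching the rethinking package's built-in; the parameter names are illustrative):
dzibinom2 <- function( y , p0 , size , prob ) {
    # p0 = probability the binomial process never runs (structural zero)
    ifelse( y==0 ,
        p0 + (1-p0)*dbinom(0,size,prob) ,
        (1-p0)*dbinom(y,size,prob) )
}
dzibinom2( 0 , p0=0.2 , size=10 , prob=0.5 )   # example: probability of observing a zero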

12H1. In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.” As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier. Load the data with:

data(Hurricanes)

Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name. Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use quap or ulam. Compare the model to an intercept-only Poisson model of deaths. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly?

data(Hurricanes)
d <- Hurricanes
# standardize the femininity predictor
d$fmnnty_std <- ( d$femininity - mean(d$femininity) )/sd(d$femininity)
# Poisson model of deaths with standardized femininity as a predictor
m1 <- map2stan(
  alist(
    deaths ~ dpois(lambda),
    log(lambda) <- a + bf*fmnnty,
    a ~ dnorm(0,10),
    bf ~ dnorm(0,1)
  ),
  data=list(
    deaths=d$deaths,
    fmnnty=d$fmnnty_std
  ) , chains=4 )
precis(m1)
##         mean         sd      5.5%     94.5%    n_eff     Rhat4
## a  2.9999750 0.02319814 2.9627097 3.0366200 2400.636 0.9999967
## bf 0.2393683 0.02631571 0.1978983 0.2818947 2300.328 1.0021359
# intercept-only Poisson model for comparison
m0 <- map2stan(
  alist(
    deaths ~ dpois(lambda),
    log(lambda) <- a,
    a ~ dnorm(0,10)
  ),
  data=list(deaths=d$deaths) , chains=4 )
compare(m0,m1)
##        WAIC        SE    dWAIC      dSE     pWAIC       weight
## m1 4413.727  998.3765  0.00000       NA 129.70893 1.000000e+00
## m0 4452.288 1077.2488 38.56072 145.3998  79.82236 4.232977e-09
plot( d$fmnnty_std , d$deaths , pch=16 ,
col=rangi2 , xlab="femininity" , ylab="deaths" )
#compute model-based trend
pred_dat <- list( fmnnty=seq(from=-2,to=1.5,length.out=30) )
lambda <- link(m1,data=pred_dat)
lambda.mu <- apply(lambda,2,mean)
lambda.PI <- apply(lambda,2,PI)
#superimpose trend
lines( pred_dat$fmnnty , lambda.mu )
shade( lambda.PI , pred_dat$fmnnty )
#compute sampling distribution
deaths_sim <- sim(m1,data=pred_dat)
deaths_sim.PI <- apply(deaths_sim,2,PI)
#superimpose sampling interval as dashed lines
lines( pred_dat$fmnnty , deaths_sim.PI[1,] , lty=2 )
lines( pred_dat$fmnnty , deaths_sim.PI[2,] , lty=2 )

# Here we can see that femininity is not strongly related to the number of deaths, especially at the high end: although bf is reliably positive in m1, the fitted trend is nearly flat relative to the spread of the data, and the WAIC difference between m0 and m1 is small compared to its standard error. The Poisson model exhibits high over-dispersion and therefore fits poorly: it retrodicts the many storms with few deaths reasonably well, but badly under-predicts the deadliest storms, whose death counts fall far outside the simulated intervals. A sketch for flagging those storms follows.
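# A hedged sketch for identifying which storms the model retrodicts poorly: simulate deaths at the
# observed femininity values and flag storms whose observed deaths fall outside the 89% interval
# (this uses the Hurricanes data frame's name column, as documented in ?Hurricanes):
sim_obs <- sim( m1 , data=list(fmnnty=d$fmnnty_std) )
obs_PI <- apply( sim_obs , 2 , PI )
poor_fit <- d$deaths < obs_PI[1,] | d$deaths > obs_PI[2,]
d$name[ poor_fit ]   # typically the deadliest storms; the many low-death storms are fit reasonably well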
