Chapter 12 - Monsters and Mixtures

This chapter introduced several new types of regression, all of which are generalizations of generalized linear models (GLMs). Ordered logistic models are useful for categorical outcomes with a strict ordering. They are built by attaching a cumulative link function to a categorical outcome distribution. Zero-inflated models mix together two different outcome distributions, allowing us to model outcomes with an excess of zeros. Models for overdispersion, such as beta-binomial and gamma-Poisson, draw the expected value of each observation from a distribution that changes shape as a function of a linear model.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

Questions

12-1. At a certain university, employees are annually rated from 1 to 4 on their productivity, with 1 being least productive and 4 most productive. In a certain department at this certain university in a certain year, the numbers of employees receiving each rating were (from 1 to 4): 12, 36, 7, 41. Compute the log cumulative odds of each rating.

data =c(12, 36, 7, 41)
avg <- data/sum(data)
avg
## [1] 0.12500000 0.37500000 0.07291667 0.42708333
cumavg=cumsum(avg)
cumavg
## [1] 0.1250000 0.5000000 0.5729167 1.0000000
log_odds =log(avg/(1-avg))
log_odds
## [1] -1.9459101 -0.5108256 -2.5427262 -0.2937611

12-2. Make a version of Figure 12.5 for the employee ratings data given just above.

c_prop=avg
c_prop=sapply(1:4,function(i){sum(c_prop[1:i])})
plot(y=c_prop,x=1:4,type="b", ylim=c(0,1))
segments(1:4,0,1:4,c_prop)
for(i in 1:4){segments(i+0.05,c(0,c_prop)[i],i+0.05,c_prop[i], col = "blue")}

12-3. In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.”191 As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier.

Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name. Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use quap or ulam. Compare the model to an intercept-only Poisson model of deaths. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly?

library(rethinking)
data(Hurricanes)
data=Hurricanes
str(data)
## 'data.frame':    92 obs. of  8 variables:
##  $ name        : Factor w/ 83 levels "Able","Agnes",..: 38 77 1 9 47 20 40 60 27 33 ...
##  $ year        : int  1950 1950 1952 1953 1953 1954 1954 1954 1955 1955 ...
##  $ deaths      : int  2 4 3 1 0 60 20 20 0 200 ...
##  $ category    : int  3 3 1 1 1 3 3 4 3 1 ...
##  $ min_pressure: int  960 955 985 987 985 960 954 938 962 987 ...
##  $ damage_norm : int  1590 5350 150 58 15 19321 3230 24260 2030 14730 ...
##  $ female      : int  1 0 0 1 1 1 1 1 1 1 ...
##  $ femininity  : num  6.78 1.39 3.83 9.83 8.33 ...
data$fem_std = (data$femininity - mean(data$femininity)) / sd(data$femininity) 
dat = list(D = data$deaths, F = data$fem_std)



# Now get the model
f = alist(
  D ~ dpois(lambda), 
  log(lambda) <- a + bF * F, 
  a ~ dnorm(1, 1),
  bF ~ dnorm(0, 1)
)

mh1 = ulam(f, data = dat, chains = 4, cores = 4, log_lik = TRUE)
precis(mh1)
##        mean         sd      5.5%    94.5%     n_eff    Rhat4
## a  2.999002 0.02317929 2.9625388 3.035367 1198.6848 1.004510
## bF 0.238642 0.02572225 0.1977627 0.278589  896.1661 1.002731
# We can see from the parameter that there is a positive relationship between feminity and death.

# Let's plot the raw data
plot(dat$F, dat$D,xlab = "femininity", ylab = "deaths",pch = 16, lwd = 2, col = "purple")

# Now compute trend
p_dat = list(F = seq(from = -2, to = 2, length.out = 1e2))
lambda = link(mh1, data = p_dat)
lambda.mean = apply(lambda, 2, mean)
lambda.PI = apply(lambda, 2, PI)
lines(p_dat$F, lambda.mean)
shade(lambda.PI, p_dat$F)

# compute sampling distribution
y = sim(mh1, data = p_dat)
y.PI = apply(y, 2, PI)

# draw sampling interval
lines(p_dat$F, y.PI[1, ], lty = 2)
lines(p_dat$F, y.PI[2, ], lty = 2)

#The gray shaded area can hardly be seen which means that the prediction fits with the trend. However, there are so many outliers on the right hand side of the plot, which indicates the miss of many of the hurricane's death tolls. Therefore, it illustrates an over-dispersion in our model and the failure to retrodict many of the hurricanes.

12-4. Counts are nearly always over-dispersed relative to Poisson. So fit a gamma-Poisson (aka negative-binomial) model to predict deaths using femininity. Show that the over-dispersed model no longer shows as precise a positive association between femininity and deaths, with an 89% interval that overlaps zero. Can you explain why the association diminished in strength?

library(rethinking)
data(Hurricanes)
data=Hurricanes
data$fem_std = (data$femininity - mean(data$femininity)) / sd(data$femininity)
dat = list(D = data$deaths, F = data$fem_std)
f = alist(
  D ~ dpois(lambda), 
  log(lambda) <- a + bF * F, 
  a ~ dnorm(1, 1),
  bF ~ dnorm(0, 1)
)


mH2 = ulam(
  alist(
    D ~ dgampois(lambda, scale),
    log(lambda) <- a + bF * F,
    a ~ dnorm(1, 1),
    bF ~ dnorm(0, 1),
    scale ~ dexp(1) # strictly positive hence why exponential prior
  ),
  data = dat, chains = 4, cores = 4, log_lik = TRUE
)
precis(mH2)
##            mean         sd        5.5%     94.5%    n_eff     Rhat4
## a     2.9721705 0.15460967  2.73114297 3.2198908 1698.239 0.9990116
## bF    0.2108079 0.15207765 -0.03976311 0.4451646 1950.515 1.0001211
## scale 0.4524025 0.06215607  0.35857202 0.5497814 1835.365 1.0017102
plot(coeftab(mh1, mH2))

plot(dat$F, dat$D,
  pch = 16, lwd = 2,
  col = rangi2, xlab = "femininity (std)", ylab = "deaths"
)
# compute model-based trend
pred_dat = list(F = seq(from = -2, to = 2, length.out = 1e2))
lambda = link(mH2, data = pred_dat)
lambda.mu = apply(lambda, 2, mean)
lambda.PI = apply(lambda, 2, PI)
# superimpose trend
lines(pred_dat$F, lambda.mu)
shade(lambda.PI, pred_dat$F)
# compute sampling distribution
deaths_sim = sim(mH2, data = pred_dat)
deaths_sim.PI = apply(deaths_sim, 2, PI)
# superimpose sampling interval as dashed lines
lines(pred_dat$F, deaths_sim.PI[1, ], lty = 2)
lines(pred_dat$F, deaths_sim.PI[2, ], lty = 2)

post = extract.samples(mH2)
par(mfrow = c(1, 3))

    
 for (fem in -1:1) {
  for (i in 1:1e2) {
    curve(dgamma2(
      x, # where to calculate density
      exp(post$a[i] + post$bF[i] * fem), # linear model with inverse link applied
      post$scale[i] # scale for gamma
    ),
    from = 0, to = 70, xlab = "mean deaths", ylab = "Density",
    ylim = c(0, 0.19), col = col.alpha("black", 0.2),
    add = ifelse(i == 1, FALSE, TRUE)
    )
  }
  mtext(concat("femininity  =   ", fem))
}

ggplot(as.data.frame(PSISk(mH2)), aes(x = PSISk(mH2)))+theme_bw() + labs(title = "Paraeto-K values", subtitle = "Values > 1 indicate highly influential data")

#The interaction dimimished since the gamma distribution allows for a death rate to be calculated for each outcome individually rather than one overall death rate for all hurricanes.These individual rates are sampled from a common distribution which is a function of the femininity of hurricane names.

12-5. In the data, there are two measures of a hurricane’s potential to cause death: damage_norm and min_pressure. Consult ?Hurricanes for their meanings. It makes some sense to imagine that femininity of a name matters more when the hurricane is itself deadly. This implies an interaction between femininity and either or both of damage_norm and min_pressure. Fit a series of models evaluating these interactions. Interpret and compare the models. In interpreting the estimates, it may help to generate counterfactual predictions contrasting hurricanes with masculine and feminine names. Are the effect sizes plausible?

library(rethinking)
data(Hurricanes)
data=Hurricanes
data$fem_std = (data$femininity - mean(data$femininity)) / sd(data$femininity)
dat = list(D = data$deaths, F = data$fem_std)
dat$P = standardize(data$min_pressure)
dat$S = standardize(data$damage_norm)

mH3a = ulam(
  alist(
    D ~ dgampois(lambda, scale),
    log(lambda) <- a + bF * F + bP * P + bFP * F * P,
    a ~ dnorm(1, 1),
    c(bF, bP, bFP) ~ dnorm(0, 1),
    scale ~ dexp(1)
  ),
  data = dat, cores = 4, chains = 4, log_lik = TRUE
)
precis(mH3a)
##             mean         sd        5.5%      94.5%    n_eff     Rhat4
## a      2.7556742 0.14522292  2.52859885  2.9955102 2324.527 0.9984129
## bFP    0.2951350 0.14132564  0.07456602  0.5187696 1752.791 0.9986744
## bP    -0.6744020 0.14163010 -0.90703915 -0.4571220 2351.970 0.9998226
## bF     0.3050453 0.14351862  0.08280674  0.5293682 2883.769 0.9989487
## scale  0.5512012 0.07664478  0.43257002  0.6778165 2248.348 0.9990252
P_seq = seq(from = -3, to = 2, length.out = 1e2)
d_pred = data.frame(F = -1, P = P_seq)
lambda_m = link(mH3a, data = d_pred)
lambda_m.mu = apply(lambda_m, 2, mean)
lambda_m.PI = apply(lambda_m, 2, PI)

d_pred = data.frame(F = 1, P = P_seq)
lambda_f = link(mH3a, data = d_pred)
lambda_f.mu = apply(lambda_f, 2, mean)
lambda_f.PI = apply(lambda_f, 2, PI)

# Plotting, sqrt() to make differences easier to spot

plot(dat$P, sqrt(dat$D),
  pch = 1, lwd = 2, col = ifelse(dat$F > 0, "red", "dark gray"),
  xlab = "minimum pressure (std)", ylab = "sqrt(deaths)"
)
lines(P_seq, sqrt(lambda_m.mu), lty = 2)
shade(sqrt(lambda_m.PI), P_seq)
lines(P_seq, sqrt(lambda_f.mu), lty = 1, col = "blue")
shade(sqrt(lambda_f.PI), P_seq, col = col.alpha("blue", 0.2))

ggplot(as.data.frame(PSISk(mH3a)), aes(x = PSISk(mH3a)))+
  theme_bw() +labs(title = "Paraeto-K values", subtitle = "Values > 1 indicate highly influential data")

mH3b = ulam(
  alist(
    D ~ dgampois(lambda, scale),
    log(lambda) <- a + bF * F + bS * S + bFS * F * S,
    a ~ dnorm(1, 1),
    c(bF, bS, bFS) ~ dnorm(0, 1),
    scale ~ dexp(1)
  ),
  data = dat, chains = 4, cores = 4, log_lik = TRUE
)
precis(mH3b)
##             mean        sd        5.5%     94.5%    n_eff     Rhat4
## a     2.57368403 0.1270177  2.37952127 2.7813437 1905.390 0.9995841
## bFS   0.30356622 0.2016373 -0.03554816 0.6120307 2021.217 0.9994699
## bS    1.24795344 0.2139224  0.90988713 1.5982982 2623.418 0.9985154
## bF    0.08822835 0.1238046 -0.10852889 0.2942694 2290.858 1.0002853
## scale 0.68376412 0.1024650  0.53446759 0.8592009 2639.353 0.9989409
S_seq = seq(from = -1, to = 5.5, length.out = 1e2) # 
d_pred = data.frame(F = -1, S = S_seq)
lambda_m = link(mH3b, data = d_pred)
lambda_m.mu = apply(lambda_m, 2, mean)
lambda_m.PI = apply(lambda_m, 2, PI)


d_pred = data.frame(F = 1, S = S_seq)
lambda_f = link(mH3b, data = d_pred)
lambda_f.mu = apply(lambda_f, 2, mean)
lambda_f.PI = apply(lambda_f, 2, PI)
# plot
plot(dat$S, sqrt(dat$D),
  pch = 1, lwd = 2, col = ifelse(dat$F > 0, "red", "dark gray"),
  xlab = "normalized damage (std)", ylab = "sqrt(deaths)"
)
lines(S_seq, sqrt(lambda_m.mu), lty = 2)
shade(sqrt(lambda_m.PI), S_seq)
lines(S_seq, sqrt(lambda_f.mu), lty = 1, col = "pink")
shade(sqrt(lambda_f.PI), S_seq, col = col.alpha("pink", 0.2))

#The model makes less of a distinction between masculine and feminine hurricanes overall at this point. Damage norm scales trough multiplaction. The distances grow fast as we approach the rightward side of the plot. This is difficult for the model to account for.

#The interaction appears to be so strong due to the influential feminine storms at the upper-left hand corner of the plot.