Chapter 12 - Monsters and Mixtures

This chapter introduced several new types of regression, all of which are generalizations of generalized linear models (GLMs). Ordered logistic models are useful for categorical outcomes with a strict ordering. They are built by attaching a cumulative link function to a categorical outcome distribution. Zero-inflated models mix together two different outcome distributions, allowing us to model outcomes with an excess of zeros. Models for overdispersion, such as beta-binomial and gamma-Poisson, draw the expected value of each observation from a distribution that changes shape as a function of a linear model.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard(H).

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

Questions

12E1. What is the difference between an ordered categorical variable and an unordered one? Define and then give an example of each.

# ordered categorical variable can be expressed on ordinal scales, and can be categorized in order from one extreme to another, without implying equal distance between each values.
# Unordered categorical variable cannot be expressed on ordinal scales, and cannot be categorized in order.
# For example, the weight of the dog can be categorized into light, medium and heavy. These are ordered categorical variable, since they are in order from one extreme to another. On the other hand, the color of dog can be set as black, white, brown, etc. These are unordered categorical variable.

12E2. What kind of link function does an ordered logistic regression employ? How does it differ from an ordinary logit link?

# It employs a cumulative logit link function, which is based on the cumulative probabilities of the response variable. It's different from an ordinary logit link since it's cumulative logit link, and it's a regression model for an ordinal response variable.

12E3. When count data are zero-inflated, using a model that ignores zero-inflation will tend to induce which kind of inferential error?

# Zero-inflated data would be counted through more than one process and at least one of them would not be accounted for. Therefore, the estimate of the true rate would be closer to 0 than where it really is.

12E4. Over-dispersion is common in count data. Give an example of a natural process that might produce over-dispersed counts. Can you also give an example of a process that might produce underdispersed counts?

# Over-dispersion could be a result of heterogeneity in rates across different sample. For example, the count of the elephant in different parts of an area for a given time frame. The result would be over-dispersed, since the rate the elephants are observed could vary across different parts.
# Under-dispersion could be of less variation in the rates. This is often occur when autocorrelation play a part, where subgroups are correlated with each other. For example, if we take our elephant count example, but change the time period into half-day intervals, then each part would have highly autocorrelation, and result in under-dispersed counts.

12M1. At a certain university, employees are annually rated from 1 to 4 on their productivity, with 1 being least productive and 4 most productive. In a certain department at this certain university in a certain year, the numbers of employees receiving each rating were (from 1 to 4): 12, 36, 7, 41. Compute the log cumulative odds of each rating.

n <- c(12, 36, 7, 41)
prop <- n / sum(n)
cum_prop <- cumsum(prop)
cum_odd <- cum_prop / (1 - cum_prop)
log(cum_odd)
## [1] -1.9459101  0.0000000  0.2937611        Inf

12M2. Make a version of Figure 12.5 for the employee ratings data given just above.

# Let's first draw the raw proportion
plot(1:4, cum_prop,
  xlab = "rating", ylab = "cumulative proportion",
  xlim = c(0.7, 4.3), ylim = c(0, 1), xaxt = "n", type="b"
)
axis(1, at = 1:4, labels = 1:4)
# plot gray cumulative probability lines
for (x in 1:4) lines(c(x, x), c(0, cum_prop[x]), col = "gray", lwd = 2)
# plot purple discrete probability segments
for (x in 1:4) lines(c(x, x) + 0.1, c(cum_prop[x] - prop[x], cum_prop[x]), col = "purple", lwd = 2)
# add labels
text(1:4 + 0.2, cum_prop - prop / 2, labels = 1:4, col = "purple")

12M3. Can you modify the derivation of the zero-inflated Poisson distribution (ZIPoisson) from the chapter to construct a zero-inflated binomial distribution?

# We can change the Poisson likelihood (exp(-lambda)) to a Binomial likelihood ((1 - q)^n).
# Therefore, the Zero-inflated binomial distribution would be:
# Pr(0|p0, q, n) = p0 + (1 − p0)(1 − q)^n

12H1. In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.”191 As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier. Load the data with:

data(Hurricanes)

Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name. Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use quap or ulam. Plot your results of prior predictive simulation. Compare the model to an intercept-only Poisson model of deaths. Plot the results over the raw data. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly?

# Let's load the data and standardize the 
d <- Hurricanes
d$fem_s <- (d$femininity - mean(d$femininity)) / sd(d$femininity)
dat <- list(deaths = d$deaths, fem_s = d$fem_s)

# Now get the model
f <- alist(
  deaths ~ dpois(lambda),
  log(lambda) <- a + bF * fem_s,
  a ~ dnorm(1, 1),
  bF ~ dnorm(0, 1)
)
m_12H1 <- ulam(f, data = dat, chains = 4, cores = 4, log_lik = TRUE)
precis(m_12H1)
##         mean         sd      5.5%     94.5%    n_eff     Rhat4
## a  2.9990895 0.02292976 2.9630014 3.0348147 1199.293 1.0029901
## bF 0.2381551 0.02539482 0.1983931 0.2779744 1043.500 0.9995208
# We can see from the parameter that there is a positive relationship between feminity and death.

# Let's plot the raw data
plot(dat$fem_s, dat$deaths,
  xlab = "femininity", ylab = "deaths",
  pch = 16, lwd = 2, col = "purple"
)

# Now compute trend
p_dat <- list(fem_s = seq(from = -2, to = 2, length.out = 1e2))
lambda <- link(m_12H1, data = p_dat)
lambda.mean <- apply(lambda, 2, mean)
lambda.PI <- apply(lambda, 2, PI)
lines(p_dat$fem_s, lambda.mean)
shade(lambda.PI, p_dat$fem_s)

# compute sampling distribution
y <- sim(m_12H1, data = p_dat)
y.PI <- apply(y, 2, PI)

# draw sampling interval
lines(p_dat$fem_s, y.PI[1, ], lty = 2)
lines(p_dat$fem_s, y.PI[2, ], lty = 2)

# Looking at the plot, the gray shaded area can hardly be seen. That means the prediction fits with the trend. However, there are so many outliers on the right hand side of the plot, which indicates the miss of many of the hurricane's death tolls. Therefore, it illustrates an over-dispersion in our model and the failure to retrodict many of the hurricanes.