Chapter 12 - Monsters and Mixtures

This chapter introduced several new types of regression, all of which are generalizations of generalized linear models (GLMs). Ordered logistic models are useful for categorical outcomes with a strict ordering. They are built by attaching a cumulative link function to a categorical outcome distribution. Zero-inflated models mix together two different outcome distributions, allowing us to model outcomes with an excess of zeros. Models for overdispersion, such as beta-binomial and gamma-Poisson, draw the expected value of each observation from a distribution that changes shape as a function of a linear model.

Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

12E1. What is the difference between an ordered categorical variable and an unordered one? Define and then give an example of each.

# An ordered categorical variable is a categorical variable whose levels have a natural ordering, while an unordered categorical variable has levels with no inherent ranking. For example, education level is ordered categorical, since high school < undergraduate < graduate, whereas occupation is unordered categorical, since no occupation can be said to be "higher" than another.

12E2. What kind of link function does an ordered logistic regression employ? How does it differ from an ordinary logit link?

# An ordered logistic regression uses a cumulative logit (cumulative log-odds) link. An ordinary logit link maps the probability of a single event to its log-odds, whereas the cumulative logit maps the probability that the outcome takes a given value or any smaller value. Because the link is applied to cumulative probabilities, it respects the ordering of the categories.
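# A minimal numeric sketch (the probabilities below are made up for
# illustration) contrasting the two links: the ordinary logit acts on each
# category's own probability, while the cumulative logit acts on the running sum.
pr <- c(0.2, 0.3, 0.5)        # assumed probabilities for 3 ordered categories
log(pr / (1 - pr))            # ordinary log-odds of each category
cum_pr <- cumsum(pr)          # cumulative probabilities Pr(y <= k)
log(cum_pr / (1 - cum_pr))    # cumulative log-odds (last value is Inf)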

12E3. When count data are zero-inflated, using a model that ignores zero-inflation will tend to induce which kind of inferential error?

# If zero-inflation is ignored when it is present, the excess zeros pull the estimate down, so we will tend to underestimate the true rate of the event.

12E4. Over-dispersion is common in count data. Give an example of a natural process that might produce over-dispersed counts. Can you also give an example of a process that might produce underdispersed counts?

# Over-dispersion occurs when the variability in the counts is larger than the chosen count distribution expects. For example, counts aggregated over heterogeneous conditions, such as the number of people showing up at a post office counted across periods with different underlying rates (early morning versus noon), will be over-dispersed relative to a single Poisson rate.

# Under-dispersion occurs when the variability is smaller than expected, which often happens when observations are autocorrelated rather than independent. For example, counts in a time series where a high count in one period tends to be followed by a lower count in the next will show less variation than a Poisson model predicts.
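# A small simulation sketch of the over-dispersion example above (the rates 2
# and 10 are assumed for illustration): pooling counts generated under two
# different Poisson rates yields a variance well above the mean, unlike a
# single Poisson.
set.seed(12)
y_mix <- c(rpois(500, lambda = 2), rpois(500, lambda = 10)) # two time periods with different rates
c(mean = mean(y_mix), var = var(y_mix))                     # variance clearly exceeds the mean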

12M1. At a certain university, employees are annually rated from 1 to 4 on their productivity, with 1 being least productive and 4 most productive. In a certain department at this certain university in a certain year, the numbers of employees receiving each rating were (from 1 to 4): 12, 36, 7, 41. Compute the log cumulative odds of each rating.

n <- c(12, 36, 7, 41)   # number of employees at each rating, 1 to 4
q <- n / sum(n)         # proportion at each rating
p <- cumsum(q)          # cumulative proportion up to each rating
o <- p / (1 - p)        # cumulative odds
log(o)                  # log cumulative odds of each rating
## [1] -1.9459101  0.0000000  0.2937611        Inf
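# As a quick check (base R only), the inverse logit applied to these log
# cumulative odds recovers the cumulative proportions; the final Inf maps back
# to a cumulative proportion of 1 for the top rating.
plogis(log(o))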

12M2. Make a version of Figure 12.5 for the employee ratings data given just above.

plot(1:4, p,
  xlab = "rating", ylab = "cumulative proportion",
  xlim = c(0.7, 4.3), ylim = c(0, 1), xaxt = "n", cex = 3
)
axis(1, at = 1:4, labels = 1:4)
# gray bars: cumulative proportion up to each rating
for (x in 1:4) lines(c(x, x), c(0, p[x]), col = "gray", lwd = 4)
# offset blue bars: the discrete probability of each individual rating
for (x in 1:4) lines(c(x, x) + 0.1, c(p[x] - q[x], p[x]), col = "slateblue", lwd = 4)
text(1:4 + 0.2, p - q / 2, labels = 1:4, col = "slateblue", cex = 3)

12M3. Can you modify the derivation of the zero-inflated Poisson distribution (ZIPoisson) from the chapter to construct a zero-inflated binomial distribution?

# Yes. Let p0 be the probability that the binomial process never happens (a structural zero), q the probability of success on each of the n trials.

# When the observation is zero (either the process never happened, or it happened but produced zero successes):
# Pr(0 | p0, q, n) = p0 + (1 - p0) * (1 - q)^n

# When the observation y is non-zero (the process must have happened):
# Pr(y | p0, q, n) = (1 - p0) * (n! / (y! * (n - y)!)) * q^y * (1 - q)^(n - y)
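# A minimal R sketch of this mixture density; the function name dzibinom_custom
# and the example parameter values are mine, not from the rethinking package.
# p0 is the probability of a structural zero, q the per-trial success
# probability, and n the number of trials.
dzibinom_custom <- function(y, p0, q, n) {
  ifelse(y == 0,
         p0 + (1 - p0) * dbinom(0, size = n, prob = q),  # two ways to observe a zero
         (1 - p0) * dbinom(y, size = n, prob = q))        # non-zero counts come only from the binomial
}
dzibinom_custom(0:3, p0 = 0.2, q = 0.5, n = 3)  # probabilities sum to 1 over 0:3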

12H1. In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.” As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier. Load the data with:

library(rethinking)
library(ggplot2)
data(Hurricanes)

Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name. Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use quap or ulam. Plot your results of prior predictive simulation. Compare the model to an intercept-only Poisson model of deaths. Plot the results over the raw data. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly?

d <- Hurricanes # store the data in an object called d
d$fem_std <- (d$femininity - mean(d$femininity)) / sd(d$femininity) # standardized femininity
dat <- list(D = d$deaths, F = d$fem_std) # list form of the variables (useful for ulam; the quap models below use the raw femininity scale in d)

# Poisson model of deaths with femininity as a predictor
# (quap is the current name for what older rethinking code calls map)
M1 <- quap(
  alist(
    deaths ~ dpois(lambda),
    log(lambda) <- a + bF * femininity,
    a ~ dnorm(0, 10),
    bF ~ dnorm(0, 5)
  ),
  data = d
)
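# The question also asks for a prior predictive simulation. A minimal sketch,
# drawing directly from the priors above (a ~ Normal(0, 10), bF ~ Normal(0, 5))
# rather than from any fitted object: on the outcome scale these priors allow
# absurdly large death counts, so tighter priors would also be defensible.
set.seed(11)
a_prior <- rnorm(100, 0, 10)
bF_prior <- rnorm(100, 0, 5)
fem_seq <- seq(from = min(d$femininity), to = max(d$femininity), length.out = 50)
plot(NULL, xlim = range(fem_seq), ylim = c(0, 100),
     xlab = "femininity", ylab = "prior predicted mean deaths")
for (i in 1:100) lines(fem_seq, exp(a_prior[i] + bF_prior[i] * fem_seq),
                       col = col.alpha("black", 0.2))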

# intercept-only Poisson model for comparison
M2 <- quap(
  alist(
    deaths ~ dpois(lambda),
    log(lambda) <- a,
    a ~ dnorm(0, 10)
  ),
  data = d
)

compare(M1,M2)
##        WAIC        SE    dWAIC      dSE     pWAIC       weight
## M1 4403.572  994.7522  0.00000       NA 131.18078 1.000000e+00
## M2 4441.971 1070.5648 38.39872 140.6533  81.38447 4.590126e-09
y <- sim(M1)             # posterior predictive simulations of deaths

y.mean <- colMeans(y)    # mean predicted deaths per storm
y.PI <- apply(y, 2, PI)  # 89% prediction interval per storm

# raw data (blue), posterior mean predictions (open circles), and 89% PI (segments)
plot(y = Hurricanes$deaths, x = Hurricanes$femininity, col = rangi2,
     ylab = "deaths", xlab = "femininity", pch = 16)
points(y = y.mean, x = Hurricanes$femininity, pch = 1)
segments(x0 = Hurricanes$femininity, x1 = Hurricanes$femininity,
         y0 = y.PI[1, ], y1 = y.PI[2, ])

ord <- order(Hurricanes$femininity)
lines(x = Hurricanes$femininity[ord], y = y.mean[ord])
lines(x = Hurricanes$femininity[ord], y = y.PI[1, ord], lty = 2)
lines(x = Hurricanes$femininity[ord], y = y.PI[2, ord], lty = 2)

# The association between femininity and deaths is weak. Although M1 has a slightly lower WAIC than the intercept-only model M2, the difference (about 38) is small relative to its standard error (about 141), so femininity adds little predictive information. The model retrodicts storms with low to moderate death counts reasonably well, but it fits the deadliest storms very poorly: their death counts sit far above the prediction intervals, and some of these extreme storms happen to have highly feminine names, which is what drives the apparent association.
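# A short follow-up sketch naming the poorly retrodicted storms: the
# observations whose actual death counts fall above the upper bound of the 89%
# prediction interval computed above.
poor_fit <- which(Hurricanes$deaths > y.PI[2, ])
Hurricanes$name[poor_fit]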