This chapter introduced several new types of regression, all of which are generalizations of generalized linear models (GLMs). Ordered logistic models are useful for categorical outcomes with a strict ordering. They are built by attaching a cumulative link function to a categorical outcome distribution. Zero-inflated models mix together two different outcome distributions, allowing us to model outcomes with an excess of zeros. Models for overdispersion, such as beta-binomial and gamma-Poisson, draw the expected value of each observation from a distribution that changes shape as a function of a linear model.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard(H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
12E1. What is the difference between an ordered categorical variable and an unordered one? Define and then give an example of each.
#Ordered categorical variables have order or a rank in which the categories are classified, whereas in unordered ones, the order or rank doesn't matter. Examples- Ordered categorical variables could be grades, unordered categorical variables could be genders.
12E2. What kind of link function does an ordered logistic regression employ? How does it differ from an ordinary logit link?
#Cumulative log odds function is employed in an ordered logistic regression. It differs from an ordinary logit link in terms of applying cumulative probability rather than discrete probability of a single event.
12E3. When count data are zero-inflated, using a model that ignores zero-inflation will tend to induce which kind of inferential error?
#When count data are zero-inflated, it will result in erroneous accounts of excessive zeros which cannot be accounted for, as it will result in inferencing incorrect or less mean due to more zeros.
12E4. Over-dispersion is common in count data. Give an example of a natural process that might produce over-dispersed counts. Can you also give an example of a process that might produce underdispersed counts?
#For overdispersion an example would be counting number of different types of trees that grew in a particular season. This could result in overdispersion because the same kind of variation in the average number of trees maybe different in other locations.
#For underdispersion, an example would be the number of trees that grew in a particular season in a certain area. The variation here should not be much since the conditions will remain likely same as it is the same location and same area.
12M1. At a certain university, employees are annually rated from 1 to 4 on their productivity, with 1 being least productive and 4 most productive. In a certain department at this certain university in a certain year, the numbers of employees receiving each rating were (from 1 to 4): 12, 36, 7, 41. Compute the log cumulative odds of each rating.
rate <- c(12, 36, 7, 41)
s_rate <- cumsum(rate) / sum(rate)
cum_log_odds_rate <- log(s_rate/(1-s_rate))
cum_log_odds_rate
## [1] -1.9459101 0.0000000 0.2937611 Inf
12M2. Make a version of Figure 12.5 for the employee ratings data given just above.
plot(1:4, s_rate)
lines(1:4, s_rate)
p <- 0
p2 <- 0
for (r in 1:4) {
lines(c(r, r), c(0, s_rate[r]), lwd = 4)
lines(c(r+0.03,r+0.03), c(p, s_rate[r]), lwd=4, col='red')
p2 <- p
p <- s_rate[r]
}
12M3. Can you modify the derivation of the zero-inflated Poisson distribution (ZIPoisson) from the chapter to construct a zero-inflated binomial distribution?
#Zero-inflated binomial distribution:
#Proability of zero distribution: Pr(0|p0, q, n) = p0 + (1-p0) * (1-q) ^ n
#Probability of non-zero distribution Pr(y|p0, q, n) = (1-p0) (n!/(y!(n-y)!))(q^y)((1-q)^(n-y))
12H1. In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.”191 As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier. Load the data with:
data(Hurricanes)
df <- Hurricanes
df_fstd <- (df$femininity - mean(df$femininity)) / sd(df$femininity)
m <- map2stan(
alist(
d ~ dpois(lambda),
log(lambda) <- x1 + x2*f,
x1 ~ dnorm(0, 10),
x2 ~ dnorm(0, 1)
),
data = list(d = df$deaths, f = df_fstd), chains = 4
)
##
## SAMPLING FOR MODEL 'fefa372da15012086716d97278224495' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.057 seconds (Warm-up)
## Chain 1: 0.053 seconds (Sampling)
## Chain 1: 0.11 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'fefa372da15012086716d97278224495' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.058 seconds (Warm-up)
## Chain 2: 0.051 seconds (Sampling)
## Chain 2: 0.109 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'fefa372da15012086716d97278224495' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.055 seconds (Warm-up)
## Chain 3: 0.054 seconds (Sampling)
## Chain 3: 0.109 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'fefa372da15012086716d97278224495' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 0 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.057 seconds (Warm-up)
## Chain 4: 0.062 seconds (Sampling)
## Chain 4: 0.119 seconds (Total)
## Chain 4:
## Computing WAIC
precis(m)
## mean sd 5.5% 94.5% n_eff Rhat4
## x1 3.0000423 0.02294471 2.9628762 3.0368194 2397.050 0.9998511
## x2 0.2388316 0.02565480 0.1979128 0.2799888 2384.367 0.9996553
m2 <- map2stan(
alist(
d ~ dpois(lambda),
log(lambda) <- x1,
x1 ~ dnorm(0, 10)
),
data = list(d = df$deaths), chains = 4
)
##
## SAMPLING FOR MODEL '1548c8e970f82f8fa553b78b53734e64' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.044 seconds (Warm-up)
## Chain 1: 0.051 seconds (Sampling)
## Chain 1: 0.095 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '1548c8e970f82f8fa553b78b53734e64' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.037 seconds (Warm-up)
## Chain 2: 0.043 seconds (Sampling)
## Chain 2: 0.08 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '1548c8e970f82f8fa553b78b53734e64' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.038 seconds (Warm-up)
## Chain 3: 0.045 seconds (Sampling)
## Chain 3: 0.083 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL '1548c8e970f82f8fa553b78b53734e64' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 0 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.043 seconds (Warm-up)
## Chain 4: 0.039 seconds (Sampling)
## Chain 4: 0.082 seconds (Total)
## Chain 4:
## Computing WAIC
compare(m, m2)
## WAIC SE dWAIC dSE pWAIC weight
## m 4412.612 999.0194 0.0000 NA 129.44124 1.000000e+00
## m2 4447.214 1074.4917 34.6018 142.1106 78.59488 3.064182e-08
#On comparing these two models, the model which has femininity of names, model m seems to be better
#plotting femininity of deaths data
plot(df_fstd, df$deaths, pch = 12, col= rangi2)
#trend
df_pred <- list(f = seq(from = -2, to = 1.5, length.out = 30))
l <- link(m, data = df_pred)
## [ 100 / 1000 ]
[ 200 / 1000 ]
[ 300 / 1000 ]
[ 400 / 1000 ]
[ 500 / 1000 ]
[ 600 / 1000 ]
[ 700 / 1000 ]
[ 800 / 1000 ]
[ 900 / 1000 ]
[ 1000 / 1000 ]
l_mu <- apply(l, 2, mean)
l_pi <- apply(l, 2, PI)
lines(df_pred$f, l_mu)
shade(l_pi, df_pred$f)
d_sim <- sim(m, data = df_pred)
## [ 100 / 1000 ]
[ 200 / 1000 ]
[ 300 / 1000 ]
[ 400 / 1000 ]
[ 500 / 1000 ]
[ 600 / 1000 ]
[ 700 / 1000 ]
[ 800 / 1000 ]
[ 900 / 1000 ]
[ 1000 / 1000 ]
d_sim_pi <- apply(d_sim, 2, PI)
lines(df_pred$f, d_sim_pi[1, ], lty = 2)
lines(df_pred$f, d_sim_pi[2, ], lty = 2)
#plotting femininity of names data
plot(df_fstd, df$names, pch = 12, col= rangi2)
#trend
df_pred2 <- list(f = seq(from = -2, to = 1.5, length.out = 30))
l <- link(m, data = df_pred2)
## [ 100 / 1000 ]
[ 200 / 1000 ]
[ 300 / 1000 ]
[ 400 / 1000 ]
[ 500 / 1000 ]
[ 600 / 1000 ]
[ 700 / 1000 ]
[ 800 / 1000 ]
[ 900 / 1000 ]
[ 1000 / 1000 ]
l_mu2 <- apply(l, 2, mean)
l_pi2 <- apply(l, 2, PI)
lines(df_pred2$f, l_mu2)
shade(l_pi2, df_pred2$f)
d_sim2 <- sim(m, data = df_pred2)
## [ 100 / 1000 ]
[ 200 / 1000 ]
[ 300 / 1000 ]
[ 400 / 1000 ]
[ 500 / 1000 ]
[ 600 / 1000 ]
[ 700 / 1000 ]
[ 800 / 1000 ]
[ 900 / 1000 ]
[ 1000 / 1000 ]
d_sim_pi2 <- apply(d_sim2, 2, PI)
lines(df_pred2$f, d_sim_pi2[1, ], lty = 2)
lines(df_pred2$f, d_sim_pi2[2, ], lty = 2)
#Based on the below plot we can observe some overdispersion, which was kind of expected. Femininity vs deaths, if we compare, we do see that femninity of names is much higher when compared to femininity of deaths. Femininity of names retrodict well and Poisson model fits poorly.
Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name. Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use quap or ulam. Compare the model to an intercept-only Poisson model of deaths. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly?