This chapter introduced several new types of regression, all of which are generalizations of generalized linear models (GLMs). Ordered logistic models are useful for categorical outcomes with a strict ordering. They are built by attaching a cumulative link function to a categorical outcome distribution. Zero-inflated models mix together two different outcome distributions, allowing us to model outcomes with an excess of zeros. Models for overdispersion, such as beta-binomial and gamma-Poisson, draw the expected value of each observation from a distribution that changes shape as a function of a linear model.
Place each answer inside a code chunk (grey box). The code chunks should contain a text response or code that completes or answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file as YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.
12E1. What is the difference between an ordered categorical variable and an unordered one? Define and then give an example of each.
# An ordered categorical variable has a clear, intrinsic ordering among its categories; an unordered categorical variable does not.
# For example, economic status is an ordered categorical variable with a clear ordering of its three categories (low, medium, high), while gender is an unordered categorical variable with two categories (male, female) and no intrinsic ordering between them.
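# A minimal R illustration of the difference (the example data are made up): ordered = TRUE
# gives a factor an ordering that routines like ordered logit can exploit; a plain factor has none.
status <- factor(c("low", "high", "medium", "low"),
                 levels = c("low", "medium", "high"), ordered = TRUE)
gender <- factor(c("male", "female", "female", "male"))
status                 # Levels: low < medium < high  (ordered)
gender                 # Levels: female male          (no ordering)
status[1] < status[2]  # TRUE: comparisons are meaningful for ordered factors,
                       # but would return NA (with a warning) for gender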
12E2. What kind of link function does an ordered logistic regression employ? How does it differ from an ordinary logit link?
# Ordered logistic regression employs a cumulative link function: the log-cumulative-odds (ordered logit) link.
# It differs from an ordinary logit link in that it models the log odds of the cumulative probability of the response falling at or below each category, rather than the log odds of a single discrete event. The ordered logit model is a regression model for an ordinal response variable: the logit of each cumulative probability is assumed to be a linear function of the covariates, with regression coefficients constant across response categories.
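# A minimal numeric sketch of the log-cumulative-odds link (the cutpoint values are illustrative):
cutpoints <- c(-1.5, 0.5, 1.8)              # log-cumulative-odds for categories 1..3 of 4
inv_logit <- function(x) 1 / (1 + exp(-x))
cum_p <- c(inv_logit(cutpoints), 1)         # cumulative probabilities; the last is always 1
p <- diff(c(0, cum_p))                      # discrete category probabilities, which sum to 1
p
# Subtracting a common linear predictor from every cutpoint shifts all the cumulative
# probabilities together, which is why the coefficients are constant across categories.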
12E3. When count data are zero-inflated, using a model that ignores zero-inflation will tend to induce which kind of inferential error?
# A model that ignores zero-inflation cannot account for the excess zeros, which tends to cause underestimation of the true rate of events, because a count distribution with extra zeros mixed into it has a lower mean.
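# A quick simulation sketch of this error (all parameter values here are made up):
set.seed(505)
true_rate <- 2                          # true Poisson rate of the count process
p_zero <- 0.3                           # probability a zero comes from the extra process
z <- rbinom(1e4, 1, p_zero)             # 1 = structural zero
y <- (1 - z) * rpois(1e4, true_rate)    # zero-inflated counts
mean(y)                                 # about (1 - 0.3) * 2 = 1.4, well below 2: the plain
                                        # Poisson estimate underestimates the true rate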
12E4. Over-dispersion is common in count data. Give an example of a natural process that might produce over-dispersed counts. Can you also give an example of a process that might produce underdispersed counts?
# Over-dispersion
# Over-dispersion is the occurrence of greater variability than the statistical model accounts for. It is typically caused by heterogeneity in the population. That heterogeneity can arise from simple aggregation, as in the monastery example from the text, where records generated by the same Poisson process are accumulated daily in some cases and weekly in others.
# Under-dispersion
# Under-dispersion is the occurrence of less variability than the statistical model accounts for. A clear example comes from the draws of an MCMC sampler: because sequential draws are autocorrelated, the effective number of samples is typically lower than the nominal number of samples. For a count model, if a hidden rate-limiting variable exists and has not been accounted for, the variation in counts is reduced and shows up as under-dispersion.
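# A sketch of how heterogeneity in rates produces over-dispersion (parameter values are illustrative):
set.seed(505)
lambda_i <- rgamma(1e4, shape = 2, rate = 1)   # each unit gets its own Poisson rate
y_over <- rpois(1e4, lambda_i)                 # gamma-Poisson (negative binomial) counts
c(mean = mean(y_over), var = var(y_over))      # variance ~ 4 > mean ~ 2: over-dispersed
y_plain <- rpois(1e4, 2)
c(mean = mean(y_plain), var = var(y_plain))    # plain Poisson: variance ~ mean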
12M1. At a certain university, employees are annually rated from 1 to 4 on their productivity, with 1 being least productive and 4 most productive. In a certain department at this certain university in a certain year, the numbers of employees receiving each rating were (from 1 to 4): 12, 36, 7, 41. Compute the log cumulative odds of each rating.
rate <- c(12, 36, 7, 41)
q <- rate / sum(rate)
q
## [1] 0.12500000 0.37500000 0.07291667 0.42708333
p <- cumsum(q)
p
## [1] 0.1250000 0.5000000 0.5729167 1.0000000
cum_log_odds_rate <- log(p/(1-p))
cum_log_odds_rate
## [1] -1.9459101 0.0000000 0.2937611 Inf
# The log cumulative odds of ratings 1 through 3 are -1.946, 0.000, and 0.294. The value for rating 4 is Inf, because its cumulative probability is exactly 1.
12M2. Make a version of Figure 12.5 for the employee ratings data given just above.
plot(1:4, p , xlab = "Rating" , ylab = "Cumulative Proportion",
xlim = c(0.7, 4.3), ylim = c(0, 1), xaxt = "n" )
axis( 1 , at = 1:4 , labels = 1:4 )
# plot gray cumulative probability lines
for ( x in 1:4 ) lines( c(x,x) , c(0,p[x]) , col="dark gray" , lwd = 3 )
# plot blue discrete probability segments
for ( x in 1:4 )
lines( c(x, x)+0.05 , c(p[x]-q[x],p[x]) , col="slateblue" , lwd = 3 )
# add number labels
text( 1:4+0.15 , p-q/2 , labels=1:4 , col="slateblue" )
12M3. Can you modify the derivation of the zero-inflated Poisson distribution (ZIPoisson) from the chapter to construct a zero-inflated binomial distribution?
# Yes. Let p0 be the probability of the zero-producing process and q the binomial success probability across n trials. The probability of a zero, mixing together both processes, is:
# Pr(0 | p0, q, n) = p0 + (1 - p0)(1 - q)^n
# The probability of any particular non-zero observation y is:
# Pr(y | p0, q, n) = (1 - p0) * (n! / (y!(n - y)!)) * q^y * (1 - q)^(n - y)
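# A minimal R sketch of this density, following the derivation above (the function name
# dzibinom2 is my own for illustration, not a function from the rethinking package):
dzibinom2 <- function(y, p0, q, n, log = FALSE) {
  ll <- ifelse(y == 0,
               log(p0 + (1 - p0) * (1 - q)^n),            # mixture probability of a zero
               log(1 - p0) + dbinom(y, n, q, log = TRUE)) # non-zero: binomial scaled by 1 - p0
  if (log) ll else exp(ll)
}
dzibinom2(0:3, p0 = 0.2, q = 0.5, n = 3)  # these probabilities sum to 1 over y = 0..n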
12H1. In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.” As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier. Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name. Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use quap or ulam. Plot your results of prior predictive simulation. Compare the model to an intercept-only Poisson model of deaths. Plot the results over the raw data. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly? Load the data with:
data(Hurricanes)
d <- Hurricanes
d$fmnnty_std <- ( d$femininity - mean(d$femininity) )/sd(d$femininity)
m1 <- map2stan(
alist(
deaths ~ dpois(lambda),
log(lambda) <- a + bf*fmnnty,
a ~ dnorm(0,10),
bf ~ dnorm(0,1)),
data=list(
deaths=d$deaths,
fmnnty=d$fmnnty_std),
chains=4 )
## (Stan sampling progress output for chains 1-4 omitted)
precis(m1)
## mean sd 5.5% 94.5% n_eff Rhat4
## a 3.0014918 0.02323724 2.9635231 3.0384119 2448.493 1.000759
## bf 0.2382548 0.02513105 0.1989885 0.2787844 2200.483 1.001108
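# The question also asks for a prior predictive simulation. A minimal sketch, drawing
# directly from the priors specified in m1 (the number of draws and the y-axis limit
# are arbitrary choices of mine):
N <- 100
a_prior <- rnorm(N, 0, 10)
bf_prior <- rnorm(N, 0, 1)
fem_seq <- seq(from = -2, to = 1.5, length.out = 30)
plot(NULL, xlim = range(fem_seq), ylim = c(0, 1000),
     xlab = "femininity (std)", ylab = "prior predicted deaths")
for (i in 1:N)
  lines(fem_seq, exp(a_prior[i] + bf_prior[i] * fem_seq), col = col.alpha("black", 0.2))
# The dnorm(0, 10) prior on the intercept is extremely flat on the log scale, so many
# prior trends imply absurdly large death counts; a tighter prior would be more sensible.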
# Intercept-only model:
m0 <- map2stan(
alist(
deaths ~ dpois(lambda),
log(lambda) <- a,
a ~ dnorm(0,10)),
data=list(deaths=d$deaths),
chains=4 )
## (Stan sampling progress output for chains 1-4 omitted)
# Compare the two models:
compare(m0, m1)
## WAIC SE dWAIC dSE pWAIC weight
## m1 4407.786 995.9796 0.00000 NA 130.87503 1.000000e+00
## m0 4452.778 1077.1904 44.99285 145.2431 82.54795 1.697953e-10
# The model that includes femininity of names has the lower WAIC and carries essentially all of the model weight, so it is preferred.
# To see which hurricanes the model retrodicts well, I'll compute and plot the expected death count, the 89% interval of that expectation, and the 89% interval of the predicted distribution of deaths under Poisson sampling.
# plot raw data
plot( d$fmnnty_std , d$deaths , pch=16 ,
col=rangi2 , xlab="femininity" , ylab="deaths" )
# compute model-based trend
pred_dat <- list( fmnnty = seq(from = -2, to = 1.5, length.out = 30) )
lambda <- link(m1,data=pred_dat)
lambda.mu <- apply(lambda,2,mean)
lambda.PI <- apply(lambda,2,PI)
# superimpose trend
lines( pred_dat$fmnnty , lambda.mu )
shade( lambda.PI , pred_dat$fmnnty )
# compute sampling distribution
deaths_sim <- sim(m1,data=pred_dat)
deaths_sim.PI <- apply(deaths_sim,2,PI)
# superimpose sampling interval as dashed lines
lines( pred_dat$fmnnty , deaths_sim.PI[1,] , lty=2 )
lines( pred_dat$fmnnty , deaths_sim.PI[2,] , lty=2 )
# The 89% interval of the expected value is so narrow that it is barely visible, and the sampling interval is not much wider. Femininity accounts for very little of the variation in deaths, especially at the high end. There is a lot of over-dispersion, which is very common with Poisson models. As a result, this homogeneous Poisson model does a poor job for most of the hurricanes in the sample, since most of them lie outside the dashed prediction boundaries.
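# A quick numeric check of that over-dispersion:
c(mean = mean(d$deaths), var = var(d$deaths))  # the variance vastly exceeds the mean,
                                               # which a plain Poisson cannot produce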
# To identify which storms the model retrodicts poorly, replot the raw data with each
# hurricane labeled by name, reusing the trend and intervals computed above.
plot( d$fmnnty_std , d$deaths , pch=16 ,
col=rangi2 , xlab="femininity" , ylab="deaths" )
text( d$fmnnty_std , d$deaths , labels=d$name , cex=0.5 , pos=3 )
# superimpose trend
lines( pred_dat$fmnnty , lambda.mu )
shade( lambda.PI , pred_dat$fmnnty )
# superimpose sampling interval as dashed lines
lines( pred_dat$fmnnty , deaths_sim.PI[1,] , lty=2 )
lines( pred_dat$fmnnty , deaths_sim.PI[2,] , lty=2 )
# The association between femininity of name and deaths is reliably positive but weak: bf is about 0.24 on the log scale. The model retrodicts well only the many storms with low death counts that sit near the trend line; it fits poorly the handful of very deadly storms far above the dashed prediction boundary, most of which happen to have feminine names and which drive the estimated association.