This chapter introduced several new types of regression, all of which are generalizations of generalized linear models (GLMs). Ordered logistic models are useful for categorical outcomes with a strict ordering. They are built by attaching a cumulative link function to a categorical outcome distribution. Zero-inflated models mix together two different outcome distributions, allowing us to model outcomes with an excess of zeros. Models for overdispersion, such as beta-binomial and gamma-Poisson, draw the expected value of each observation from a distribution that changes shape as a function of a linear model.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard(H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
12E1. What is the difference between an ordered categorical variable and an unordered one? Define and then give an example of each.
# Ordered categorical variables are those which are expressed on ordinal scales. These are variables which establish a pre-defined number of distinct outcomes. These values/categories of these variables can be ordered meaningfully from one extreme to the another without implying equal distances between the values. For example, think of animal weight. We may have measured these in grams, but now want to model simply whether that animal are light, medium or heavy-weights. This is an ordered categorical variable because we know how they can be ordered from one extreme to another, but we don’t assume that an animal has to become heavier by the same margin to classify as medium-weight instead of light-weight than to classify as heavy-weight instead of medium-weight.
#Unordered categorical variables are much like their ordered counterpart, but come with an important distinction, we cannot order these in any meaningful way. For example think of a bird and their colouration patterns. We may want to record them as overall black, brown, or grey. These are categories, but we certainly cannot order these from one extreme to another.
12E2. What kind of link function does an ordered logistic regression employ? How does it differ from an ordinary logit link?
# Ordered logistic regression employs cumulative log-adds, which provides the odds of the value or smaller values. The ordinary logit link provides the odds of a particular value.
12E3. When count data are zero-inflated, using a model that ignores zero-inflation will tend to induce which kind of inferential error?
# Ignoring zero-inflation is excluding the zero-inflation data resulting in underestimation of rate of events, as zero-inflation data still effects the overall mean.
12E4. Over-dispersion is common in count data. Give an example of a natural process that might produce over-dispersed counts. Can you also give an example of a process that might produce underdispersed counts?
# Dispersion occurs when theoritical variance and observed variance are different, Over-dispersion occurs when observed vairance is higher than theoritical one, example when a person who wants to loose weight eats more calories than required and underdispersed is the vice versa if a person who wants to gain muscle eats less protein than needed.
12M1. At a certain university, employees are annually rated from 1 to 4 on their productivity, with 1 being least productive and 4 most productive. In a certain department at this certain university in a certain year, the numbers of employees receiving each rating were (from 1 to 4): 12, 36, 7, 41. Compute the log cumulative odds of each rating.
n <- c(12, 36, 7, 41)
q <- n / sum(n)
p <- cumsum(q) # cumulative proportions
o <- p / (1 - p) # cumulative odds
log(o) # log-cumulative odds
## [1] -1.9459101 0.0000000 0.2937611 Inf
12M2. Make a version of Figure 12.5 for the employee ratings data given just above.
plot(1:4, p,
xlab = "rating", ylab = "cumulative proportion",
xlim = c(0.7, 4.3), ylim = c(0, 1), xaxt = "n", cex = 3
)
axis(1, at = 1:4, labels = 1:4)
# plot gray cumulative probability lines
for (x in 1:4) lines(c(x, x), c(0, p[x]), col = "gray", lwd = 4)
# plot green discrete probability segments
for (x in 1:4) lines(c(x, x) + 0.1, c(p[x] - q[x], p[x]), col = "green", lwd = 4)
# add number labels
text(1:4 + 0.2, p - q / 2, labels = 1:4, col = "darkgreen", cex = 3)
12M3. Can you modify the derivation of the zero-inflated Poisson distribution (ZIPoisson) from the chapter to construct a zero-inflated binomial distribution?
# The probability of a zero, mixing together both processes, is:
# Pr(0|p0, q, n) = p0 + (1 − p0)(1 − q)^n
# The probability of any particular non-zero observation y is:
# Pr(y|p0, q, n) = (1 − p0)(n!/(y!(n − y)!)(q^y)((1 − q)^(n−y))
12H1. In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.”191 As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier. Load the data with:
Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name. Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use quap or ulam. Plot your results of prior predictive simulation. Compare the model to an intercept-only Poisson model of deaths. Plot the results over the raw data. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly?
data(Hurricanes)
d <- Hurricanes
d$fem_std <- (d$femininity - mean(d$femininity)) / sd(d$femininity) # standardised femininity
dat <- list(D = d$deaths, F = d$fem_std)
f <- alist(
D ~ dpois(lambda), # poisson outcome distribution
log(lambda) <- a + bF * F, # log-link for lambda with linear model
# priors in log-space, 0 corresponds to outcome of 1
a ~ dnorm(1, 1),
bF ~ dnorm(0, 1)
)
N <- 1e3
a <- rnorm(N, 1, 1)
bF <- rnorm(N, 0, 1)
F_seq <- seq(from = -2, to = 2, length.out = 30) # sequence from -2 to 2 because femininity data is standardised
plot(NULL,
xlim = c(-2, 2), ylim = c(0, 500),
xlab = "name femininity (std)", ylab = "deaths"
)
for (i in 1:N) {
lines(F_seq,
exp(a[i] + bF[i] * F_seq), # inverse link to get outcome scale
col = grau(), lwd = 1.5
)
}
mH1 <- ulam(f, data = dat, chains = 4, cores = 4, log_lik = TRUE)
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/include/" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppEigen/include/" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppEigen/include/unsupported" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/StanHeaders/include/src/" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/StanHeaders/include/" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppParallel/include/" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DBOOST_NO_AUTO_PTR -include '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp' -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -fPIC -Wall -g -O2 -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:16: error: expected ';' after top level declarator
## namespace Eigen {
## ^
## ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
## ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
precis(mH1)
## mean sd 5.5% 94.5% n_eff Rhat4
## a 2.9990713 0.02352266 2.9614839 3.0365189 1128.862 1.002540
## bF 0.2401221 0.02485031 0.2005168 0.2806348 1292.315 1.004057
There is a positive relationship between hurricane name femininity and death toll.
plot(dat$F, dat$D,
pch = 16, lwd = 2,
col = rangi2, xlab = "femininity (std)", ylab = "deaths"
)
pred_dat <- list(F = seq(from = -2, to = 2, length.out = 1e2))
lambda <- link(mH1, data = pred_dat) # predict deaths
lambda.mu <- apply(lambda, 2, mean) # get mean prediction
lambda.PI <- apply(lambda, 2, PI) # get prediction interval
lines(pred_dat$F, lambda.mu)
shade(lambda.PI, pred_dat$F)
deaths_sim <- sim(mH1, data = pred_dat) # simulate posterior observations
deaths_sim.PI <- apply(deaths_sim, 2, PI) # get simulation interval
lines(pred_dat$F, deaths_sim.PI[1, ], lty = 2)
lines(pred_dat$F, deaths_sim.PI[2, ], lty = 2)
Our model does not retrodict many of the hurricanes well even though it is quite certain of its predictions (grey shaded area which is hardly visible). This model misses many of the hurricane death tolls to the right hand side of the above plot. This is a clear sign of over-dispersion which our model failed to account for. The weak, positive trend we are seeing here seems to be informed largely by these highly influential data points.