This chapter introduced several new types of regression, all of which are generalizations of generalized linear models (GLMs). Ordered logistic models are useful for categorical outcomes with a strict ordering. They are built by attaching a cumulative link function to a categorical outcome distribution. Zero-inflated models mix together two different outcome distributions, allowing us to model outcomes with an excess of zeros. Models for overdispersion, such as beta-binomial and gamma-Poisson, draw the expected value of each observation from a distribution that changes shape as a function of a linear model.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard(H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
12E1. What is the difference between an ordered categorical variable and an unordered one? Define and then give an example of each.
#the ordered categorical variable is discrete values with order. {2,5,6,9}
#the unordered categorical variable is values without order {male, female} {single, married, separated}
12E2. What kind of link function does an ordered logistic regression employ? How does it differ from an ordinary logit link?
# ordered logistic regression uses a cumulative logit link function. it is similar to an ordinary logit link. but it's cumulative probability, not discrete probability.
#cumulative logit link: the linear model is the log-odds of the specified event or any event of lower ordered alue.
12E3. When count data are zero-inflated, using a model that ignores zero-inflation will tend to induce which kind of inferential error?
#ignore zero-inflate will lead to underestimate the events' rate.
# more zeros added, the count distribution will have a lower mean. it's single process count data, lead to lower estimate for the mean rate
12E4. Over-dispersion is common in count data. Give an example of a natural process that might produce over-dispersed counts. Can you also give an example of a process that might produce underdispersed counts?
#over -dispersion can show from variation in underlying rates across the units. one example, we can list the number of breads sold by the breads stores every day for one month. the sum counts will be over-dispersed.
# every stores don't have same average sales numbers of every day
# undersperion happens when the observed variance in data is less than variance of theorotical mode. one example, when the bike wears out, the number of defects rent bikes may increase.
# the increase of the defect counts of rent bikes group may be more similar than they are by chance.
12M1. At a certain university, employees are annually rated from 1 to 4 on their productivity, with 1 being least productive and 4 most productive. In a certain department at this certain university in a certain year, the numbers of employees receiving each rating were (from 1 to 4): 12, 36, 7, 41. Compute the log cumulative odds of each rating.
# we can define log cumulative odds: log(pk/(1-pk))
n <- c( 12, 36 , 7 , 41 )
q <- n / sum(n)
q
## [1] 0.12500000 0.37500000 0.07291667 0.42708333
sum(q)
## [1] 1
p <- cumsum(q)
p
## [1] 0.1250000 0.5000000 0.5729167 1.0000000
log(p/(1-p))
## [1] -1.9459101 0.0000000 0.2937611 Inf
we can know the log cumulative odds of the highest value is infinity.
12M2. Make a version of Figure 12.5 for the employee ratings data given just above.
plot( 1:4 , p , xlab="rating" , ylab="cumulative proportion" ,
xlim=c(0.7,4.3) , ylim=c(0,1) , xaxt="n" )
axis( 1 , at=1:4 , labels=1:4 )
# plot gray cumulative probability lines
for ( x in 1:4 ) lines( c(x,x) , c(0,p[x]) , col="gray" , lwd=2 )
# plot blue discrete probability segments
for ( x in 1:4 )
lines( c(x,x)+0.1 , c(p[x]-q[x],p[x]) , col="slateblue" , lwd=2 )
# add number labels
text( 1:4+0.2 , p-q/2 , labels=1:4 , col="slateblue" )
12M3. Can you modify the derivation of the zero-inflated Poisson distribution (ZIPoisson) from the chapter to construct a zero-inflated binomial distribution?
# let the zero-inflated binomial mixes some zeros to another distribution. then this structure is mostly same to ZI-poisson.
#p0 single probability determins if or not one zero is observed.
#when it's not observed, the binomial distribution takes over. it creates a zero.
# q : the probbility of a success from the binomial process q,
# it has n trials.
#the probability of a zero, mixing together both processes
\[\begin{align} Pr(0|p_0,q,n)=p_0+(1-p_0)(1-q)^{n}\\ \end{align}\]
# the logic is that a zero from p0 or zero from the binomial (1-q)^n
# the probability of any particular non-zero observation y
\[\begin{align} Pr(y|p_0,q,n)=(1-p_0)\frac{n!}{y!(n-y)!}\qquad q^{y}(1-q)^{n}\\ \end{align}\]
12H1. In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.”191 As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier. Load the data with:
# we can start with a simple poisson model and predict deaths using femininity as the predictor. we can simulate from the priors and see the next
#data(Hurricanes)
library(rethinking)
data(Hurricanes)
d <- Hurricanes
d$fmnnty_std <- ( d$femininity - mean(d$femininity) )/sd(d$femininity)
dat <- list( D=d$deaths , F=d$fmnnty_std )
# model formula - no fitting yet
f <- alist(
D ~ dpois(lambda),
log(lambda) <- a + bF*F,
a ~ dnorm(1,1),
bF ~ dnorm(0,1) )
N <- 100
a <- rnorm(N,1,1)
bF <- rnorm(N,0,1)
F_seq <- seq( from=-2 , to=2 , length.out=30 )
plot( NULL , xlim=c(-2,2) , ylim=c(0,500) ,
xlab="name femininity (std)" , ylab="deaths" )
for ( i in 1:N ) lines( F_seq , exp( a[i] + bF[i]*F_seq ) , col=grau() , lwd=1.5 )
m1 <- ulam( f , data=dat , chains=4 , log_lik=TRUE )
##
## SAMPLING FOR MODEL '08aa84d83c89f7279da5db58b11db816' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.053 seconds (Warm-up)
## Chain 1: 0.045 seconds (Sampling)
## Chain 1: 0.098 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '08aa84d83c89f7279da5db58b11db816' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.053 seconds (Warm-up)
## Chain 2: 0.048 seconds (Sampling)
## Chain 2: 0.101 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '08aa84d83c89f7279da5db58b11db816' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.043 seconds (Warm-up)
## Chain 3: 0.045 seconds (Sampling)
## Chain 3: 0.088 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL '08aa84d83c89f7279da5db58b11db816' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 0 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.051 seconds (Warm-up)
## Chain 4: 0.047 seconds (Sampling)
## Chain 4: 0.098 seconds (Total)
## Chain 4:
precis( m1 )
## mean sd 5.5% 94.5% n_eff Rhat4
## a 2.9988262 0.02311666 2.9630236 3.0357317 1326.195 0.999476
## bF 0.2387923 0.02379598 0.2010903 0.2766953 1325.244 1.000368
# plot raw data
plot( dat$F , dat$D , pch=16 , lwd=2 ,
col=rangi2 , xlab="femininity (std)" , ylab="deaths" )
# compute model-based trend
pred_dat <- list( F=seq(from=-2,to=1.5,length.out=30) )
lambda <- link( m1 , data=pred_dat )
lambda.mu <- apply(lambda,2,mean)
lambda.PI <- apply(lambda,2,PI)
# superimpose trend
lines( pred_dat$F , lambda.mu )
shade( lambda.PI , pred_dat$F )
## Warning in if (class(object) == "formula") {: the condition has length > 1 and
## only the first element will be used
## Warning in if (class(object) == "density") {: the condition has length > 1 and
## only the first element will be used
## Warning in if (class(object) == "matrix" & length(dim(object)) == 2) {: the
## condition has length > 1 and only the first element will be used
## Warning in if (class(object) == "matrix") {: the condition has length > 1 and
## only the first element will be used
# compute sampling distribution
deaths_sim <- sim(m1,data=pred_dat)
deaths_sim.PI <- apply(deaths_sim,2,PI)
# superimpose sampling interval as dashed lines
lines( pred_dat$F , deaths_sim.PI[1,] , lty=2 )
lines( pred_dat$F , deaths_sim.PI[2,] , lty=2 )
stem( PSISk( m1 ) )
## Some Pareto k values are very high (>1). Set pointwise=TRUE to inspect individual points.
##
## The decimal point is 1 digit(s) to the left of the |
##
## -2 | 0
## -0 | 54110976664322221111111
## 0 | 00122334455555566778999999000023334557777789
## 2 | 0111223345562255
## 4 |
## 6 | 615
## 8 | 1
## 10 | 4
## 12 | 1
## 14 |
## 16 |
## 18 |
## 20 | 95
Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name. Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use quap or ulam. Compare the model to an intercept-only Poisson model of deaths. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly?