Chapter 12 - Monsters and Mixtures

This chapter introduced several new types of regression, all of which are generalizations of generalized linear models (GLMs). Ordered logistic models are useful for categorical outcomes with a strict ordering. They are built by attaching a cumulative link function to a categorical outcome distribution. Zero-inflated models mix together two different outcome distributions, allowing us to model outcomes with an excess of zeros. Models for overdispersion, such as beta-binomial and gamma-Poisson, draw the expected value of each observation from a distribution that changes shape as a function of a linear model.

Place each answer inside a code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file YourName_ANLY505-Year-Semester.html, publish the assignment to your R Pubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

12E1. What is the difference between an ordered categorical variable and an unordered one? Define and then give an example of each.

# The difference is that an ordered categorical variable has a clear ordering among its levels, while an unordered one does not.

# Examples: sex and geographic region are unordered categorical variables; developmental status (infant, juvenile, adult) is an ordered categorical variable.
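
# A minimal R sketch (the labels below are illustrative):
region <- factor(c("north", "south", "east", "west"))   # unordered: no ranking among levels
stage <- factor(c("infant", "juvenile", "adult"),
                levels = c("infant", "juvenile", "adult"),
                ordered = TRUE)                          # ordered: infant < juvenile < adult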

12E2. What kind of link function does an ordered logistic regression employ? How does it differ from an ordinary logit link?

# Ordered logistic regression employs a cumulative link function: the ordered (cumulative) logit, i.e. the log-cumulative-odds link.

# Difference: an ordinary logit link models the log-odds of a single event probability, whereas the ordered logit models the log-odds that the outcome takes a given value or any lower value, Pr(y <= k). Attaching this cumulative link to the categorical outcome distribution is what lets the model respect the ordering of the categories.
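
# A minimal sketch of the difference, using made-up cutpoints: the cumulative link turns
# cutpoints into Pr(y <= k), and differencing recovers each category's probability.
kappa <- c(-1, 0.5, 2)              # hypothetical cutpoints for a 4-category outcome
cum_p <- plogis(kappa)              # inverse logit: cumulative probabilities Pr(y <= k)
cat_p <- diff(c(0, cum_p, 1))       # per-category probabilities, sum to 1
cat_p
plogis(0.3)                         # by contrast, an ordinary logit maps one linear predictor to one event probability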

12E3. When count data are zero-inflated, using a model that ignores zero-inflation will tend to induce which kind of inferential error?

# Ignoring zero-inflation tends to lead to underestimating the true rate of the process: the excess zeros pull the estimated mean (lambda in a Poisson model) downward, so the model infers that events occur less often than they actually do.
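
# A quick simulation sketch (parameter values are made up): fitting a plain Poisson to
# zero-inflated counts underestimates the true rate.
set.seed(1)
p_zero <- 0.3                                        # probability of a structural zero
lambda_true <- 5                                     # true rate of the Poisson process
y <- ifelse(rbinom(1e4, 1, p_zero) == 1, 0, rpois(1e4, lambda_true))
mean(y)                                              # Poisson MLE of lambda; well below the true value of 5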

12E4. Over-dispersion is common in count data. Give an example of a natural process that might produce over-dispersed counts. Can you also give an example of a process that might produce underdispersed counts?

# Counts are over-dispersed when they are more variable than a pure count process (e.g. a single-rate Poisson) predicts. For example, the number of ice creams sold over a fixed period across many shops will tend to be over-dispersed, because shops differ (some carry more flavors than others), so each shop has its own underlying rate and the pooled counts vary more than one shared rate would allow.

# Under-dispersed counts are less variable than a pure process predicts. This can happen when observations are correlated rather than independent, for example the number of tasks completed per day when working through a fixed queue: how many can be completed depends on how many tasks remain in the first place, which constrains the day-to-day variation.
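
# A simulation sketch of over-dispersion (values are illustrative): letting each shop have
# its own rate (a gamma-Poisson mixture) inflates the variance relative to a pure Poisson
# with the same mean.
set.seed(1)
rate <- rgamma(1e4, shape = 2, rate = 0.5)    # shop-specific rates, mean 4
y_over <- rpois(1e4, rate)                    # gamma-Poisson (negative binomial) counts
y_pure <- rpois(1e4, 4)                       # pure Poisson counts with the same mean
c(var(y_over), var(y_pure))                   # mixture variance is well above the Poisson's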

12M1. At a certain university, employees are annually rated from 1 to 4 on their productivity, with 1 being least productive and 4 most productive. In a certain department at this certain university in a certain year, the numbers of employees receiving each rating were (from 1 to 4): 12, 36, 7, 41. Compute the log cumulative odds of each rating.

df <- c(12, 36, 7, 41)          # number of employees at each rating, 1 through 4
q <- df/sum(df)                 # proportion at each rating
q
## [1] 0.12500000 0.37500000 0.07291667 0.42708333
p <- cumsum(q)                  # cumulative proportion, Pr(rating <= k)
p
## [1] 0.1250000 0.5000000 0.5729167 1.0000000
log_odds <- log(p/(1-p))        # log cumulative odds of each rating
log_odds
## [1] -1.9459101  0.0000000  0.2937611        Inf
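
# Cross-check: R's built-in logit quantile function gives the same values; the last value
# is Inf because the cumulative probability of the top rating is exactly 1.
qlogis(p)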

12M2. Make a version of Figure 12.5 for the employee ratings data given just above.

# cumulative proportion of each rating (same counts as 12M1)
c_prop <- cumsum(df/sum(df))
plot(x = 1:4, y = c_prop, type = "b", ylim = c(0, 1),
     xlab = "rating", ylab = "cumulative proportion")
segments(1:4, 0, 1:4, c_prop)        # cumulative probability of each rating
for(i in 1:4){segments(i + 0.05, c(0, c_prop)[i], i + 0.05, c_prop[i], col = "blue")}  # offset blue segments: discrete probability of each rating

12M3. Can you modify the derivation of the zero-inflated Poisson distribution (ZIPoisson) from the chapter to construct a zero-inflated binomial distribution?

# Probability of a zero, mixing together both processes: 
# Pr(0|p_0, q, n) = p_0 + (1 − p_0)(1 − q)^n

# Probability of a non-zero observation (y): 
# Pr(y|p_0, q, n) = (1 − p_0)(n!/(y!(n − y)!))(q^y)((1 − q)^(n−y))
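
# A minimal R sketch of this mixture density (dzibinom here is an illustrative name, not
# the rethinking package's version):
dzibinom <- function(y, p0, q, n, log = FALSE) {
  # mix a point mass at zero (probability p0) with a Binomial(n, q) (probability 1 - p0)
  pr <- ifelse(y == 0,
               p0 + (1 - p0) * dbinom(0, n, q),
               (1 - p0) * dbinom(y, n, q))
  if (log) log(pr) else pr
}
dzibinom(0:3, p0 = 0.2, q = 0.5, n = 3)   # sums to 1 over y = 0:3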

12H1. In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.” As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier. Load the data with:

library(rethinking)
data(Hurricanes)
?Hurricanes
df <- Hurricanes
str(df)
## 'data.frame':    92 obs. of  8 variables:
##  $ name        : Factor w/ 83 levels "Able","Agnes",..: 38 77 1 9 47 20 40 60 27 33 ...
##  $ year        : int  1950 1950 1952 1953 1953 1954 1954 1954 1955 1955 ...
##  $ deaths      : int  2 4 3 1 0 60 20 20 0 200 ...
##  $ category    : int  3 3 1 1 1 3 3 4 3 1 ...
##  $ min_pressure: int  960 955 985 987 985 960 954 938 962 987 ...
##  $ damage_norm : int  1590 5350 150 58 15 19321 3230 24260 2030 14730 ...
##  $ female      : int  1 0 0 1 1 1 1 1 1 1 ...
##  $ femininity  : num  6.78 1.39 3.83 9.83 8.33 ...
df$fmnnty_std <- ( df$femininity - mean(df$femininity))/sd(df$femininity)
df1 <- list(deaths = df$deaths, fmn = df$fmnnty_std)

# Poisson model
m1 <- quap(
  alist(
    deaths ~ dpois(lambda),
    log(lambda) <- a + bF * fmn,
    a ~ dnorm(0, 10),
    bF ~ dnorm(0, 1)
  ), data = df1)

# intercept-only
df2 <- list(deaths = df$deaths)
m2 <- quap(
  alist(
    deaths ~ dpois(lambda),
    log(lambda) <- a,
    a ~ dnorm(0, 10)
  ), data = df2
)

compare(m1,m2)
##        WAIC       SE    dWAIC      dSE    pWAIC       weight
## m1 4418.373 1003.934  0.00000       NA 127.6535 9.999999e-01
## m2 4449.865 1075.321 31.49276 138.5414  80.9302 1.450220e-07
# Plotting

plot( df$fmnnty_std , df$deaths , pch=16 ,xlab="femininity" , ylab="deaths")
pred_dat <- list( fmn = seq(from = -2, to = 1.5, length.out = 30) )
lambda <- link(m1,data=pred_dat)
lambda.mu <- apply(lambda,2,mean)
lambda.PI <- apply(lambda,2,PI)

lines( pred_dat$fmn , lambda.mu )   # lines(), not line(): draw the posterior mean trend
shade( lambda.PI , pred_dat$fmn )

Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name.

Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use quap or ulam.

Compare the model to an intercept-only Poisson model of deaths. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly?

Answer: the association is positive but not very strong: hurricanes with more feminine names are associated with somewhat more deaths. The femininity model (m1) takes essentially all of the WAIC weight, but the difference from the intercept-only model is small relative to its standard error. The model retrodicts storms with low to moderate death counts reasonably well, but it fits the handful of extremely deadly storms (the large outliers) poorly.
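
One way to see which storms are retrodicted poorly is a posterior predictive check; the sketch below (using the m1 fit above) simulates deaths for each storm and flags those falling outside the 89% predictive interval.

# posterior predictive check: simulate deaths for each observed storm from m1
sim_deaths <- sim(m1, data = list(fmn = df$fmnnty_std))
deaths_PI <- apply(sim_deaths, 2, PI)

# storms whose observed death counts fall outside the 89% predictive interval
poor_fit <- which(df$deaths < deaths_PI[1, ] | df$deaths > deaths_PI[2, ])
df$name[poor_fit]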