Assignment #10

Questions

12E1. What is the difference between an ordered categorical variable and an unordered one? Define and then give an example of each.

# An ordered categorical variable has levels with a meaningful rank order, although the distance between adjacent levels need not be equal. An unordered categorical variable has levels that are labels only, with no inherent ranking.

# Example: in an analysis of school-aged students, grade level (1-12) is ordered because a higher number means a later grade, whereas school district is unordered because no district ranks above another.

12E2. What kind of link function does an ordered logistic regression employ? How does it differ from an ordinary logit link?

# An ordered logistic regression employs a log-cumulative-odds link (a cumulative logit). An ordinary logit link models the log-odds of a single discrete event, whereas the cumulative link models the log-odds of the cumulative probability that the response falls at or below each category. The ordered logit model is therefore a regression model for an ordinal response: the logit of each cumulative probability is assumed to be a linear function of the covariates, with regression coefficients held constant across response categories.
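
# As a small illustration of the log-cumulative-odds link, the sketch below converts a set of cutpoints on the log-odds scale into cumulative probabilities and then into per-category probabilities. The cutpoint values, the shift phi, and all variable names are made up for illustration; they are not estimates from any model in this assignment.

```r
# Hypothetical cutpoints (log cumulative odds) for a 4-level ordinal outcome.
cutpoints <- c(-1.5, 0.2, 1.0)

# The inverse logit turns each cutpoint into Pr(y <= k).
# The last category always has cumulative probability 1.
cum_prob <- c(plogis(cutpoints), 1)

# Differences of cumulative probabilities give Pr(y == k).
cat_prob <- diff(c(0, cum_prob))

# A linear predictor phi shifts every cutpoint by the same amount,
# which is what "coefficients constant across categories" means.
phi <- 0.5  # assumed value, for illustration only
cat_prob_shifted <- diff(c(0, plogis(cutpoints - phi), 1))
```

# A positive phi lowers every cumulative probability, which moves probability mass toward the higher categories.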

12E3. When count data are zero-inflated, using a model that ignores zero-inflation will tend to induce which kind of inferential error?

# A model that ignores zero-inflation treats the extra zeros as ordinary counts, so its estimate lands between zero and the mean of the non-zero process. It will therefore tend to underestimate the true rate of events, and it will probably have a relatively large MSE and fit the data poorly overall.
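
# A quick simulation can make this concrete. The sketch below mixes structural zeros with Poisson counts and compares the true rate with the rate a plain Poisson model would estimate (its maximum-likelihood estimate is the sample mean). The sample size, zero probability, and rate are assumed values chosen only for illustration.

```r
set.seed(12)

# Assumed simulation settings (illustrative only).
n_obs  <- 5000  # number of observations
p_zero <- 0.3   # probability of a structural zero
lambda <- 4     # true rate of the count process

# Mixture: a structural zero with probability p_zero, otherwise a Poisson count.
is_zero <- rbinom(n_obs, size = 1, prob = p_zero)
y <- (1 - is_zero) * rpois(n_obs, lambda)

# A plain Poisson model's maximum-likelihood rate is the sample mean,
# which lands near (1 - p_zero) * lambda rather than lambda.
lambda_naive <- mean(y)

# Observed share of zeros versus the share the plain Poisson fit expects.
zeros_observed <- mean(y == 0)
zeros_expected <- dpois(0, lambda_naive)
```

# The naive rate falls below the true lambda, and the plain Poisson fit still expects far fewer zeros than were observed.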

12E4. Over-dispersion is common in count data. Give an example of a natural process that might produce over-dispersed counts. Can you also give an example of a process that might produce underdispersed counts?

# Over-dispersion happens when the observed counts show higher variance than the theoretical model predicts. Take the number of clicks on a website within a day: we would assume a Poisson distribution, yet the counts can easily be over-dispersed because the click-through rate depends on so many things that vary from day to day; it is far from flipping a coin.

# Under-dispersion could arise in the same example: if clicks within a minute are highly sequential, so that each click depends on the one before it, then counts within a given time frame will be more similar to one another than the theory predicts, and the variance falls below the Poisson expectation.
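
# The variance-to-mean ratio is one simple way to see both cases. The sketch below simulates a pure Poisson process, an over-dispersed process in which the daily rate itself varies (a gamma-Poisson mixture), and an under-dispersed process. The under-dispersed case uses a binomial with a high success probability as a stand-in, not the sequential-click mechanism described above, and every parameter value is assumed.

```r
set.seed(21)

n_days <- 10000  # assumed number of simulated days

# Pure Poisson: every day shares the same expected number of clicks.
clicks_pois <- rpois(n_days, lambda = 20)

# Over-dispersed: the daily rate varies (gamma-distributed with mean 20).
daily_rate  <- rgamma(n_days, shape = 4, rate = 0.2)
clicks_over <- rpois(n_days, lambda = daily_rate)

# Under-dispersed stand-in: a binomial has variance n*p*(1-p),
# which is below its mean n*p.
clicks_under <- rbinom(n_days, size = 25, prob = 0.8)

# Variance-to-mean ratio: about 1 for Poisson, above 1 when over-dispersed,
# below 1 when under-dispersed.
dispersion <- c(
  poisson = var(clicks_pois) / mean(clicks_pois),
  over    = var(clicks_over) / mean(clicks_over),
  under   = var(clicks_under) / mean(clicks_under)
)
```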

12M1. At a certain university, employees are annually rated from 1 to 4 on their productivity, with 1 being least productive and 4 most productive. In a certain department at this certain university in a certain year, the numbers of employees receiving each rating were (from 1 to 4): 12, 36, 7, 41. Compute the log cumulative odds of each rating.

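n <- c(12, 36, 7, 41)          # employees per rating, 1 to 4
p <- n / sum(n)                # proportion in each rating
cum_p <- cumsum(p)             # cumulative proportion Pr(rating <= k)
log(cum_p / (1 - cum_p))       # log cumulative odds; the last is Inf because cum_p = 1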

## [1] -1.9459101  0.0000000  0.2937611        Inf

12M2. Make a version of Figure 12.5 for the employee ratings data given just above.

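n <- c(12, 36, 7, 41)
p <- n / sum(n)
cum_p <- cumsum(p)
# Cumulative proportion at each rating
plot(
  y = cum_p,
  x = 1:4,
  type = "b",
  ylim = c(0, 1),
  xlab = "rating",
  ylab = "cumulative proportion"
)
# Full-height segments: cumulative probability up to each rating
segments(1:4, 0, 1:4, cum_p)
# Offset green segments: the discrete probability of each rating
segments(1:4 + 0.05, c(0, cum_p)[1:4], 1:4 + 0.05, cum_p, col = "green")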

12M3. Can you modify the derivation of the zero-inflated Poisson distribution (ZIPoisson) from the chapter to construct a zero-inflated binomial distribution?

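# Let p0 be the probability that the zero-generating process occurs, q the probability of success on each trial, and n the number of trials.
# A zero arises either from the zero-generating process or from a binomial process with no successes, so the probability of a zero, mixing together both processes, is:
# Pr(0|p0, q, n) = p0 + (1 − p0)(1 − q)^n

# A non-zero observation y requires that the binomial process occurs, so the probability of any particular non-zero observation y is:
# Pr(y|p0, q, n) = (1 − p0) (n!/(y!(n − y)!)) q^y (1 − q)^(n−y)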
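
# The two expressions can be checked numerically. The sketch below codes them as a small helper, confirms that the probabilities sum to one, and compares the probability of a zero with a simulation of the two-process mixture. The helper name dzibinom_sketch and the parameter values are hypothetical, introduced only for this check.

```r
# Hypothetical helper written for this check: zero-inflated binomial pmf.
# p0 = probability of a structural zero, q = success probability, n = trials.
dzibinom_sketch <- function(y, p0, q, n) {
  ifelse(
    y == 0,
    p0 + (1 - p0) * (1 - q)^n,                # zero from either process
    (1 - p0) * dbinom(y, size = n, prob = q)  # non-zero needs the binomial process
  )
}

# Illustrative parameter values (assumed).
p0 <- 0.2
q  <- 0.3
n  <- 10

# Probabilities over the whole support 0..n should sum to one.
probs <- dzibinom_sketch(0:n, p0, q, n)
total <- sum(probs)

# Simulate the mixture directly and record the share of zeros.
set.seed(3)
n_sim <- 1e5
y_sim <- ifelse(rbinom(n_sim, 1, p0) == 1, 0, rbinom(n_sim, n, q))
zero_share <- mean(y_sim == 0)
```

# The simulated share of zeros should be close to probs[1], the analytic Pr(0|p0, q, n).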

12H1. In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.” As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier. Load the data with:

library(rethinking)
data(Hurricanes)

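data1 <- Hurricanes

# Poisson model of deaths with femininity of the name as a predictor
m1 <- map(
  alist(
    deaths ~ dpois(lambda),
    log(lambda) <- a + bF * femininity,
    a ~ dnorm(0, 10),
    bF ~ dnorm(0, 5)
  ),
  data = data1
)

# Intercept-only Poisson model for comparison
m2 <- map(
  alist(
    deaths ~ dpois(lambda),
    log(lambda) <- a,
    a ~ dnorm(0, 10)
  ),
  data = data1
)

# Compare the two models by WAIC
compare(m1, m2)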

##        WAIC        SE    dWAIC      dSE    pWAIC       weight
## m1 4418.034  999.4403  0.00000       NA 139.8159 1.000000e+00
## m2 4455.756 1078.7835 37.72173 144.8608  83.4494 6.439182e-09

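# Simulate deaths from the posterior predictive distribution of m1
y <- sim(m1)

y.mean <- colMeans(y)        # mean simulated deaths per storm
y.PI <- apply(y, 2, PI)      # percentile interval per storm

# Observed deaths (filled) against predictions (open circles with intervals)
plot(y = data1$deaths, x = data1$femininity, col = rangi2, ylab = "deaths", xlab = "femininity", pch = 16)
points(y = y.mean, x = data1$femininity, pch = 1)
segments(x0 = data1$femininity, x1 = data1$femininity, y0 = y.PI[1, ], y1 = y.PI[2, ])

# Connect the predictions in order of femininity
ord <- order(data1$femininity)
lines(x = data1$femininity[ord], y = y.mean[ord])
lines(x = data1$femininity[ord], y = y.PI[1, ord], lty = 2)
lines(x = data1$femininity[ord], y = y.PI[2, ord], lty = 2)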

Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name. Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use quap or ulam. Compare the model to an intercept-only Poisson model of deaths. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly?

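# The model that includes the femininity rating has a lower WAIC than the intercept-only model and takes nearly all of the model weight, but the difference (dWAIC of about 37.7) is much smaller than its standard error (dSE of about 144.9), so the evidence for an association is weak. The plot shows that the apparent improvement is due to a few influential outliers on the right side of the plot.
# In total, four storms were particularly deadly, and all four had names that received high femininity ratings, which leads the model to infer that female hurricanes are deadlier than male hurricanes. The model fits these few very deadly storms poorly.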

your name

2020-12-05

Chapter 12 - Monsters and Mixtures
