1 Horseshoe Crabs and Satellites

https://www.youtube.com/watch?v=Chk4AIydD0E

Brockmann (1996), Agresti (1996) and Agresti (2002) stydied if horseshoe crab had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males:

  • the female crab’s color (),
  • spine condition (),
  • weight () in kilograms,
  • carapace width () in centimeters.
  • response outcome for each female crab is her number of satellites ().

There are 173 females in this study.

url <- 'https://newonlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/crab/index.txt'
crab <- read.table(file=url)
colnames(crab) <- c("Obs", "C", "S", "W", "Wt", "Sa")
crab <- crab[, -1] # To eliminate obs
crab$C <- as.factor(crab$C) # To ensure color as factor
head(crab)
##   C S    W   Wt Sa
## 1 2 3 28.3 3.05  8
## 2 3 3 26.0 2.60  4
## 3 3 3 25.6 2.15  0
## 4 4 2 21.0 1.85  0
## 5 2 3 29.0 3.00  1
## 6 1 2 25.0 2.30  3
barplot(prop.table(table(crab$Sa)), las=1, xlab='Numero de satelites',
        ylab='Proporcion', col='white')

with(crab, plot(x=W, y=Sa, xlab='Ancho caparazon', las=1,
                ylab='Numero de satelites', pch=20))

with(crab, plot(x=Wt, y=Sa, xlab='Peso de la hembra', las=1,
                ylab='Numero de satelites', pch=20))

tabla <- with(crab, table(C, Sa))
model::plot_prof(tabla, Row=T, cex=1, xlab='Color')

Para este ejemplo vamos a considerar tres modelos:

  1. Poisson
  2. Binomial negativa
  3. ZIP (zero inflated Poisson)

Se va a utilizar el \(GAIC=-2 \times LogLik + k \times npar\) para seleccionar los modelos. Entre menor \(GAIC\) mejor es el modelo.

Nota:
- Si \(k=2\) el \(GAIC\) es el \(AIC\) usual.
- Si \(k=\log(n)\) el \(GAIC\) es el \(BAIC\).

2 Ajustando los modelos

2.1 Poisson

Primero cargamos el paquete gamlss.

library(gamlss)

Ajustamos el modelo inicial y hacemos seleccion de variables.

mod.p <- gamlss(Sa ~ 1, data=crab, family=PO)
## GAMLSS-RS iteration 1: Global Deviance = 988.0893 
## GAMLSS-RS iteration 2: Global Deviance = 988.0893
sup <- formula(~ W + Wt + C)
n <- nrow(crab) # numero de observaciones
mod.p.final <- stepGAICAll.A(mod.p, trace=F, k=log(n),
                             scope=list(lower=~1, upper=sup))
## --------------------------------------------------- 
## Start:  AIC= 993.24 
##  Sa ~ 1 
## 
## ---------------------------------------------------

La tabla de resumen es:

summary(mod.p.final)
## ******************************************************************
## Family:  c("PO", "Poisson") 
## 
## Call:  gamlss(formula = Sa ~ Wt, family = PO, data = crab,      trace = FALSE) 
## 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  log
## Mu Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.4282     0.1789  -2.393   0.0178 *  
## Wt            0.5892     0.0650   9.065 2.84e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## No. of observations in the fit:  173 
## Degrees of Freedom for the fit:  2
##       Residual Deg. of Freedom:  171 
##                       at cycle:  2 
##  
## Global Deviance:     916.1405 
##             AIC:     920.1405 
##             SBC:     926.4471 
## ******************************************************************

2.2 Negative Binomial

Ajustamos el modelo inicial y hacemos seleccion de variables.

mod.nb <- gamlss(Sa ~ 1, data=crab, family=NBI)
## GAMLSS-RS iteration 1: Global Deviance = 767.4092 
## GAMLSS-RS iteration 2: Global Deviance = 767.4092
mod.nb.final <- stepGAICAll.A(mod.nb, trace=F, k=log(n),
                              scope=list(lower=~1, upper=sup),
                              sigma.scope=list(lower=~1, upper=sup))
## --------------------------------------------------- 
## Start:  AIC= 777.72 
##  Sa ~ 1 
## 
## --------------------------------------------------- 
## Start:  AIC= 764.1 
##  ~1 
## 
## --------------------------------------------------- 
## Start:  AIC= 747.74 
##  Sa ~ Wt 
## 
## ---------------------------------------------------

La tabla de resumen es:

summary(mod.nb.final)
## ******************************************************************
## Family:  c("NBI", "Negative Binomial type I") 
## 
## Call:  gamlss(formula = Sa ~ Wt, sigma.formula = ~W, family = NBI,  
##     data = crab, trace = FALSE) 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  log
## Mu Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.05839    0.38429  -0.152 0.879417    
## Wt           0.46117    0.12896   3.576 0.000455 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## Sigma link function:  log
## Sigma Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.52763    2.54265   4.534 1.09e-05 ***
## W           -0.43140    0.09664  -4.464 1.47e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## No. of observations in the fit:  173 
## Degrees of Freedom for the fit:  4
##       Residual Deg. of Freedom:  169 
##                       at cycle:  5 
##  
## Global Deviance:     727.1283 
##             AIC:     735.1283 
##             SBC:     747.7414 
## ******************************************************************

2.3 ZIP

Ajustamos el modelo inicial y hacemos seleccion de variables.

mod.zip <- gamlss(Sa ~ 1, data=crab, family=ZIP)
## GAMLSS-RS iteration 1: Global Deviance = 764.3072 
## GAMLSS-RS iteration 2: Global Deviance = 763.2293 
## GAMLSS-RS iteration 3: Global Deviance = 763.2292
mod.zip.final <- stepGAICAll.A(mod.zip, trace=F, k=log(n),
                               scope=list(lower=~1, upper=sup),
                               sigma.scope=list(lower=~1, upper=sup))
## --------------------------------------------------- 
## Start:  AIC= 773.54 
##  Sa ~ 1 
## 
## --------------------------------------------------- 
## Start:  AIC= 771.46 
##  ~1 
## 
## --------------------------------------------------- 
## Start:  AIC= 746.39 
##  Sa ~ Wt 
## 
## ---------------------------------------------------

La tabla de resumen es:

summary(mod.zip.final)
## ******************************************************************
## Family:  c("ZIP", "Poisson Zero Inflated") 
## 
## Call:  gamlss(formula = Sa ~ Wt, sigma.formula = ~W, family = ZIP,  
##     data = crab, trace = FALSE) 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  log
## Mu Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.98927    0.20919   4.729 4.73e-06 ***
## Wt           0.19462    0.07604   2.559   0.0114 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## Sigma link function:  logit
## Sigma Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.3898     2.6928   4.601 8.21e-06 ***
## W            -0.5005     0.1044  -4.795 3.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## No. of observations in the fit:  173 
## Degrees of Freedom for the fit:  4
##       Residual Deg. of Freedom:  169 
##                       at cycle:  3 
##  
## Global Deviance:     725.7801 
##             AIC:     733.7801 
##             SBC:     746.3933 
## ******************************************************************

2.4 Grafico de residuales

2.4.1 Modelo Poisson

plot(mod.p.final)

## ******************************************************************
##   Summary of the Randomised Quantile Residuals
##                            mean   =  -0.1601136 
##                        variance   =  2.740651 
##                coef. of skewness  =  0.5877063 
##                coef. of kurtosis  =  2.692174 
## Filliben correlation coefficient  =  0.9776663 
## ******************************************************************

2.4.2 Modelo binomial negativo

plot(mod.nb.final)

## ******************************************************************
##   Summary of the Randomised Quantile Residuals
##                            mean   =  -0.06028562 
##                        variance   =  1.155294 
##                coef. of skewness  =  -0.2930333 
##                coef. of kurtosis  =  2.383353 
## Filliben correlation coefficient  =  0.9909457 
## ******************************************************************

2.4.3 Modelo ZIP

plot(mod.zip.final)

## ******************************************************************
##   Summary of the Randomised Quantile Residuals
##                            mean   =  -0.03668195 
##                        variance   =  1.357443 
##                coef. of skewness  =  0.4299052 
##                coef. of kurtosis  =  3.711852 
## Filliben correlation coefficient  =  0.9912651 
## ******************************************************************

2.4.4 Worm plot

par(mfrow=c(2, 2))
wp(mod.p.final)  ; title('Poisson')
wp(mod.nb.final) ; title('Binomial negativo')
wp(mod.nb.final) ; title('ZIP')

GAIC(mod.p.final, mod.nb.final, mod.zip.final)
##               df      AIC
## mod.zip.final  4 733.7801
## mod.nb.final   4 735.1283
## mod.p.final    2 920.1405

El mejor modelo es porque tiene el valor de \(GAIC\) mas pequeno.

3 ZIP distribution

Taken from: http://www.gamlss.org/wp-content/uploads/2013/01/gamlss-manual.pdf

Graficos de distribucion de probabilidad para \(ZIP(\mu=5, \sigma)\).

3.1 Logit function

De la tabla de resumen para el modelo ZIP se tiene que:

\[\begin{equation} \begin{array}{ll} log(\hat{\mu}) &= 0.98927 + 0.19462 W_t \\ logit(\hat{\sigma}) &= 12.3898 - 0.5005 W \\ \end{array} \end{equation}\]

Aplicando \(logit^{-1}\) se tiene que:

\[\begin{equation} \begin{array}{ll} \hat{\mu} &= \exp \left(0.98927 + 0.19462 W_t \right) \\ \hat{\sigma} &= 1 / \left( 1+\exp(-12.3898 + 0.5005 W) \right) \\ \end{array} \end{equation}\]

4 Estimaciones de \(\mu\) y \(\sigma\)

par(mfrow=c(1, 2))
mu.est <- function(x) exp(0.98927 + 0.19462 * x)
curve(mu.est, from=1.2, to=5.2, lwd=3, las=1, col='blue',
      xlab=expression(w[t]), ylab=expression(hat(mu)))

sigma.est <- function(x) 1/(1 + exp(-12.3898 + 0.5005  * x))
curve(sigma.est, from=21, to=34, lwd=3, las=1, col='red',
      xlab=expression(w), ylab=expression(hat(sigma)))

A partir de las expresiones anteriores de \(\mu\) y \(\sigma\) estimadas es posible escribir la media y varianza estimada recordando que cuando \(y \sim ZIP(\mu, \sigma)\) la media y varianza estan dadas por:

\[\begin{equation} \begin{array}{ll} E(y) &= (1-\sigma) \mu \\ Var(y) &= \mu (1-\sigma)(1+\sigma\mu) \\ \end{array} \end{equation}\]

5 Estimaciones de \(E(Y)\) y \(Var(Y)\)

La media estimada \(\hat{E}(y)\) se puede representar graficamente asi:

media <- function(w, wt) (1-sigma.est(w)) * mu.est(wt)
media <- Vectorize(media)

par(mfrow=c(1, 2))

k <- 100
W <- seq(from=21, to=33.5, length.out = k)
Wt <- seq(from=1.2, to=5.2, length.out = k)
medias <- outer(W, Wt, media)

contour(x=W, y=Wt, z=medias, lwd=3, las=1,
        xlab='W', ylab=expression(W[t]))

image(x=W, y=Wt, z=medias, lwd=2, col=terrain.colors(50), las=1,
        xlab='W', ylab=expression(W[t]))

6 Tarea

  • Replicar el ejemplo aquĆ­ mostrado en su computador.
  • Para cada uno de los tres modelos calcular la medida de desempeƱo \(Cor(Y, \hat{E}(Y))\).
  • Revisar el enlace de abajo y ajustar esos modelos.

https://data.princeton.edu/wws509/r/overdispersion