1 Horseshoe Crabs and Satellites

https://www.youtube.com/watch?v=Chk4AIydD0E

Brockmann (1996), Agresti (1996) and Agresti (2002) stydied if horseshoe crab had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males:

the female crab’s color (),
spine condition (),
weight () in kilograms,
carapace width () in centimeters.
response outcome for each female crab is her number of satellites ().

There are 173 females in this study.

url <- 'https://newonlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/crab/index.txt'
crab <- read.table(file=url)
colnames(crab) <- c("Obs", "C", "S", "W", "Wt", "Sa")
crab <- crab[, -1] # To eliminate obs
crab$C <- as.factor(crab$C) # To ensure color as factor
head(crab)

##   C S    W   Wt Sa
## 1 2 3 28.3 3.05  8
## 2 3 3 26.0 2.60  4
## 3 3 3 25.6 2.15  0
## 4 4 2 21.0 1.85  0
## 5 2 3 29.0 3.00  1
## 6 1 2 25.0 2.30  3

barplot(prop.table(table(crab$Sa)), las=1, xlab='Numero de satelites',
        ylab='Proporcion', col='white')

with(crab, plot(x=W, y=Sa, xlab='Ancho caparazon', las=1,
                ylab='Numero de satelites', pch=20))

with(crab, plot(x=Wt, y=Sa, xlab='Peso de la hembra', las=1,
                ylab='Numero de satelites', pch=20))

tabla <- with(crab, table(C, Sa))
model::plot_prof(tabla, Row=T, cex=1, xlab='Color')

Para este ejemplo vamos a considerar tres modelos:

Poisson
Binomial negativa
ZIP (zero inflated Poisson)

Se va a utilizar el \(GAIC=-2 \times LogLik + k \times npar\) para seleccionar los modelos. Entre menor \(GAIC\) mejor es el modelo.

Nota:
- Si \(k=2\) el \(GAIC\) es el \(AIC\) usual.
- Si \(k=\log(n)\) el \(GAIC\) es el \(BAIC\).

2 Ajustando los modelos

2.1 Poisson

Primero cargamos el paquete gamlss.

library(gamlss)

Ajustamos el modelo inicial y hacemos seleccion de variables.

mod.p <- gamlss(Sa ~ 1, data=crab, family=PO)

## GAMLSS-RS iteration 1: Global Deviance = 988.0893 
## GAMLSS-RS iteration 2: Global Deviance = 988.0893

sup <- formula(~ W + Wt + C)
n <- nrow(crab) # numero de observaciones
mod.p.final <- stepGAICAll.A(mod.p, trace=F, k=log(n),
                             scope=list(lower=~1, upper=sup))

## --------------------------------------------------- 
## Start:  AIC= 993.24 
##  Sa ~ 1 
## 
## ---------------------------------------------------

La tabla de resumen es:

summary(mod.p.final)

## ******************************************************************
## Family:  c("PO", "Poisson") 
## 
## Call:  gamlss(formula = Sa ~ Wt, family = PO, data = crab,      trace = FALSE) 
## 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  log
## Mu Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.4282     0.1789  -2.393   0.0178 *  
## Wt            0.5892     0.0650   9.065 2.84e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## No. of observations in the fit:  173 
## Degrees of Freedom for the fit:  2
##       Residual Deg. of Freedom:  171 
##                       at cycle:  2 
##  
## Global Deviance:     916.1405 
##             AIC:     920.1405 
##             SBC:     926.4471 
## ******************************************************************

2.2 Negative Binomial

Ajustamos el modelo inicial y hacemos seleccion de variables.

mod.nb <- gamlss(Sa ~ 1, data=crab, family=NBI)

## GAMLSS-RS iteration 1: Global Deviance = 767.4092 
## GAMLSS-RS iteration 2: Global Deviance = 767.4092

mod.nb.final <- stepGAICAll.A(mod.nb, trace=F, k=log(n),
                              scope=list(lower=~1, upper=sup),
                              sigma.scope=list(lower=~1, upper=sup))

## --------------------------------------------------- 
## Start:  AIC= 777.72 
##  Sa ~ 1 
## 
## --------------------------------------------------- 
## Start:  AIC= 764.1 
##  ~1 
## 
## --------------------------------------------------- 
## Start:  AIC= 747.74 
##  Sa ~ Wt 
## 
## ---------------------------------------------------

La tabla de resumen es:

summary(mod.nb.final)

## ******************************************************************
## Family:  c("NBI", "Negative Binomial type I") 
## 
## Call:  gamlss(formula = Sa ~ Wt, sigma.formula = ~W, family = NBI,  
##     data = crab, trace = FALSE) 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  log
## Mu Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.05839    0.38429  -0.152 0.879417    
## Wt           0.46117    0.12896   3.576 0.000455 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## Sigma link function:  log
## Sigma Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.52763    2.54265   4.534 1.09e-05 ***
## W           -0.43140    0.09664  -4.464 1.47e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## No. of observations in the fit:  173 
## Degrees of Freedom for the fit:  4
##       Residual Deg. of Freedom:  169 
##                       at cycle:  5 
##  
## Global Deviance:     727.1283 
##             AIC:     735.1283 
##             SBC:     747.7414 
## ******************************************************************

2.3 ZIP

Ajustamos el modelo inicial y hacemos seleccion de variables.

mod.zip <- gamlss(Sa ~ 1, data=crab, family=ZIP)

## GAMLSS-RS iteration 1: Global Deviance = 764.3072 
## GAMLSS-RS iteration 2: Global Deviance = 763.2293 
## GAMLSS-RS iteration 3: Global Deviance = 763.2292

mod.zip.final <- stepGAICAll.A(mod.zip, trace=F, k=log(n),
                               scope=list(lower=~1, upper=sup),
                               sigma.scope=list(lower=~1, upper=sup))

## --------------------------------------------------- 
## Start:  AIC= 773.54 
##  Sa ~ 1 
## 
## --------------------------------------------------- 
## Start:  AIC= 771.46 
##  ~1 
## 
## --------------------------------------------------- 
## Start:  AIC= 746.39 
##  Sa ~ Wt 
## 
## ---------------------------------------------------

La tabla de resumen es:

summary(mod.zip.final)

## ******************************************************************
## Family:  c("ZIP", "Poisson Zero Inflated") 
## 
## Call:  gamlss(formula = Sa ~ Wt, sigma.formula = ~W, family = ZIP,  
##     data = crab, trace = FALSE) 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  log
## Mu Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.98927    0.20919   4.729 4.73e-06 ***
## Wt           0.19462    0.07604   2.559   0.0114 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## Sigma link function:  logit
## Sigma Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.3898     2.6928   4.601 8.21e-06 ***
## W            -0.5005     0.1044  -4.795 3.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## No. of observations in the fit:  173 
## Degrees of Freedom for the fit:  4
##       Residual Deg. of Freedom:  169 
##                       at cycle:  3 
##  
## Global Deviance:     725.7801 
##             AIC:     733.7801 
##             SBC:     746.3933 
## ******************************************************************

2.4 Grafico de residuales

2.4.1 Modelo Poisson

plot(mod.p.final)

## ******************************************************************
##   Summary of the Randomised Quantile Residuals
##                            mean   =  -0.1601136 
##                        variance   =  2.740651 
##                coef. of skewness  =  0.5877063 
##                coef. of kurtosis  =  2.692174 
## Filliben correlation coefficient  =  0.9776663 
## ******************************************************************

2.4.2 Modelo binomial negativo

plot(mod.nb.final)

## ******************************************************************
##   Summary of the Randomised Quantile Residuals
##                            mean   =  -0.06028562 
##                        variance   =  1.155294 
##                coef. of skewness  =  -0.2930333 
##                coef. of kurtosis  =  2.383353 
## Filliben correlation coefficient  =  0.9909457 
## ******************************************************************

2.4.3 Modelo ZIP

plot(mod.zip.final)

## ******************************************************************
##   Summary of the Randomised Quantile Residuals
##                            mean   =  -0.03668195 
##                        variance   =  1.357443 
##                coef. of skewness  =  0.4299052 
##                coef. of kurtosis  =  3.711852 
## Filliben correlation coefficient  =  0.9912651 
## ******************************************************************

2.4.4 Worm plot

par(mfrow=c(2, 2))
wp(mod.p.final)  ; title('Poisson')
wp(mod.nb.final) ; title('Binomial negativo')
wp(mod.nb.final) ; title('ZIP')

GAIC(mod.p.final, mod.nb.final, mod.zip.final)

##               df      AIC
## mod.zip.final  4 733.7801
## mod.nb.final   4 735.1283
## mod.p.final    2 920.1405

El mejor modelo es porque tiene el valor de \(GAIC\) mas pequeno.

3 ZIP distribution

Taken from: http://www.gamlss.org/wp-content/uploads/2013/01/gamlss-manual.pdf

Graficos de distribucion de probabilidad para \(ZIP(\mu=5, \sigma)\).

3.1 Logit function

De la tabla de resumen para el modelo ZIP se tiene que:

\[\begin{equation} \begin{array}{ll} log(\hat{\mu}) &= 0.98927 + 0.19462 W_t \\ logit(\hat{\sigma}) &= 12.3898 - 0.5005 W \\ \end{array} \end{equation}\]

Aplicando \(logit^{-1}\) se tiene que:

\[\begin{equation} \begin{array}{ll} \hat{\mu} &= \exp \left(0.98927 + 0.19462 W_t \right) \\ \hat{\sigma} &= 1 / \left( 1+\exp(-12.3898 + 0.5005 W) \right) \\ \end{array} \end{equation}\]

4 Estimaciones de \(\mu\) y \(\sigma\)

par(mfrow=c(1, 2))
mu.est <- function(x) exp(0.98927 + 0.19462 * x)
curve(mu.est, from=1.2, to=5.2, lwd=3, las=1, col='blue',
      xlab=expression(w[t]), ylab=expression(hat(mu)))

sigma.est <- function(x) 1/(1 + exp(-12.3898 + 0.5005  * x))
curve(sigma.est, from=21, to=34, lwd=3, las=1, col='red',
      xlab=expression(w), ylab=expression(hat(sigma)))

A partir de las expresiones anteriores de \(\mu\) y \(\sigma\) estimadas es posible escribir la media y varianza estimada recordando que cuando \(y \sim ZIP(\mu, \sigma)\) la media y varianza estan dadas por:

\[\begin{equation} \begin{array}{ll} E(y) &= (1-\sigma) \mu \\ Var(y) &= \mu (1-\sigma)(1+\sigma\mu) \\ \end{array} \end{equation}\]

5 Estimaciones de \(E(Y)\) y \(Var(Y)\)

La media estimada \(\hat{E}(y)\) se puede representar graficamente asi:

media <- function(w, wt) (1-sigma.est(w)) * mu.est(wt)
media <- Vectorize(media)

par(mfrow=c(1, 2))

k <- 100
W <- seq(from=21, to=33.5, length.out = k)
Wt <- seq(from=1.2, to=5.2, length.out = k)
medias <- outer(W, Wt, media)

contour(x=W, y=Wt, z=medias, lwd=3, las=1,
        xlab='W', ylab=expression(W[t]))

image(x=W, y=Wt, z=medias, lwd=2, col=terrain.colors(50), las=1,
        xlab='W', ylab=expression(W[t]))

6 Tarea

Replicar el ejemplo aquí mostrado en su computador.
Para cada uno de los tres modelos calcular la medida de desempeño \(Cor(Y, \hat{E}(Y))\).
Revisar el enlace de abajo y ajustar esos modelos.

https://data.princeton.edu/wws509/r/overdispersion

Nota: Para conocer otras publicaciones relacionadas con glm visite este enlace.

Modelos para datos de conteo

Freddy Hernandez Barajas

07 julio, 2020

1 Horseshoe Crabs and Satellites

2 Ajustando los modelos

2.1 Poisson

2.2 Negative Binomial

2.3 ZIP

2.4 Grafico de residuales

2.4.1 Modelo Poisson

2.4.2 Modelo binomial negativo

2.4.3 Modelo ZIP

2.4.4 Worm plot

3 ZIP distribution

3.1 Logit function

4 Estimaciones de \(\mu\) y \(\sigma\)

5 Estimaciones de \(E(Y)\) y \(Var(Y)\)

6 Tarea