https://www.youtube.com/watch?v=Chk4AIydD0E
Brockmann (1996), Agresti (1996) and Agresti (2002) stydied if horseshoe crab had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males:
There are 173 females in this study.
url <- 'https://newonlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/crab/index.txt'
crab <- read.table(file=url)
colnames(crab) <- c("Obs", "C", "S", "W", "Wt", "Sa")
crab <- crab[, -1] # To eliminate obs
crab$C <- as.factor(crab$C) # To ensure color as factor
head(crab)
## C S W Wt Sa
## 1 2 3 28.3 3.05 8
## 2 3 3 26.0 2.60 4
## 3 3 3 25.6 2.15 0
## 4 4 2 21.0 1.85 0
## 5 2 3 29.0 3.00 1
## 6 1 2 25.0 2.30 3
barplot(prop.table(table(crab$Sa)), las=1, xlab='Numero de satelites',
ylab='Proporcion', col='white')
with(crab, plot(x=W, y=Sa, xlab='Ancho caparazon', las=1,
ylab='Numero de satelites', pch=20))
with(crab, plot(x=Wt, y=Sa, xlab='Peso de la hembra', las=1,
ylab='Numero de satelites', pch=20))
tabla <- with(crab, table(C, Sa))
model::plot_prof(tabla, Row=T, cex=1, xlab='Color')
Para este ejemplo vamos a considerar tres modelos:
Se va a utilizar el \(GAIC=-2 \times LogLik + k \times npar\) para seleccionar los modelos. Entre menor \(GAIC\) mejor es el modelo.
Nota:
- Si \(k=2\) el \(GAIC\) es el \(AIC\) usual.
- Si \(k=\log(n)\) el \(GAIC\) es el \(BAIC\).
Primero cargamos el paquete gamlss.
library(gamlss)
Ajustamos el modelo inicial y hacemos seleccion de variables.
mod.p <- gamlss(Sa ~ 1, data=crab, family=PO)
## GAMLSS-RS iteration 1: Global Deviance = 988.0893
## GAMLSS-RS iteration 2: Global Deviance = 988.0893
sup <- formula(~ W + Wt + C)
n <- nrow(crab) # numero de observaciones
mod.p.final <- stepGAICAll.A(mod.p, trace=F, k=log(n),
scope=list(lower=~1, upper=sup))
## ---------------------------------------------------
## Start: AIC= 993.24
## Sa ~ 1
##
## ---------------------------------------------------
La tabla de resumen es:
summary(mod.p.final)
## ******************************************************************
## Family: c("PO", "Poisson")
##
## Call: gamlss(formula = Sa ~ Wt, family = PO, data = crab, trace = FALSE)
##
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function: log
## Mu Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4282 0.1789 -2.393 0.0178 *
## Wt 0.5892 0.0650 9.065 2.84e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## No. of observations in the fit: 173
## Degrees of Freedom for the fit: 2
## Residual Deg. of Freedom: 171
## at cycle: 2
##
## Global Deviance: 916.1405
## AIC: 920.1405
## SBC: 926.4471
## ******************************************************************
Ajustamos el modelo inicial y hacemos seleccion de variables.
mod.nb <- gamlss(Sa ~ 1, data=crab, family=NBI)
## GAMLSS-RS iteration 1: Global Deviance = 767.4092
## GAMLSS-RS iteration 2: Global Deviance = 767.4092
mod.nb.final <- stepGAICAll.A(mod.nb, trace=F, k=log(n),
scope=list(lower=~1, upper=sup),
sigma.scope=list(lower=~1, upper=sup))
## ---------------------------------------------------
## Start: AIC= 777.72
## Sa ~ 1
##
## ---------------------------------------------------
## Start: AIC= 764.1
## ~1
##
## ---------------------------------------------------
## Start: AIC= 747.74
## Sa ~ Wt
##
## ---------------------------------------------------
La tabla de resumen es:
summary(mod.nb.final)
## ******************************************************************
## Family: c("NBI", "Negative Binomial type I")
##
## Call: gamlss(formula = Sa ~ Wt, sigma.formula = ~W, family = NBI,
## data = crab, trace = FALSE)
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function: log
## Mu Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05839 0.38429 -0.152 0.879417
## Wt 0.46117 0.12896 3.576 0.000455 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## Sigma link function: log
## Sigma Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.52763 2.54265 4.534 1.09e-05 ***
## W -0.43140 0.09664 -4.464 1.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## No. of observations in the fit: 173
## Degrees of Freedom for the fit: 4
## Residual Deg. of Freedom: 169
## at cycle: 5
##
## Global Deviance: 727.1283
## AIC: 735.1283
## SBC: 747.7414
## ******************************************************************
Ajustamos el modelo inicial y hacemos seleccion de variables.
mod.zip <- gamlss(Sa ~ 1, data=crab, family=ZIP)
## GAMLSS-RS iteration 1: Global Deviance = 764.3072
## GAMLSS-RS iteration 2: Global Deviance = 763.2293
## GAMLSS-RS iteration 3: Global Deviance = 763.2292
mod.zip.final <- stepGAICAll.A(mod.zip, trace=F, k=log(n),
scope=list(lower=~1, upper=sup),
sigma.scope=list(lower=~1, upper=sup))
## ---------------------------------------------------
## Start: AIC= 773.54
## Sa ~ 1
##
## ---------------------------------------------------
## Start: AIC= 771.46
## ~1
##
## ---------------------------------------------------
## Start: AIC= 746.39
## Sa ~ Wt
##
## ---------------------------------------------------
La tabla de resumen es:
summary(mod.zip.final)
## ******************************************************************
## Family: c("ZIP", "Poisson Zero Inflated")
##
## Call: gamlss(formula = Sa ~ Wt, sigma.formula = ~W, family = ZIP,
## data = crab, trace = FALSE)
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function: log
## Mu Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.98927 0.20919 4.729 4.73e-06 ***
## Wt 0.19462 0.07604 2.559 0.0114 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## Sigma link function: logit
## Sigma Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.3898 2.6928 4.601 8.21e-06 ***
## W -0.5005 0.1044 -4.795 3.55e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## No. of observations in the fit: 173
## Degrees of Freedom for the fit: 4
## Residual Deg. of Freedom: 169
## at cycle: 3
##
## Global Deviance: 725.7801
## AIC: 733.7801
## SBC: 746.3933
## ******************************************************************
plot(mod.p.final)
## ******************************************************************
## Summary of the Randomised Quantile Residuals
## mean = -0.1601136
## variance = 2.740651
## coef. of skewness = 0.5877063
## coef. of kurtosis = 2.692174
## Filliben correlation coefficient = 0.9776663
## ******************************************************************
plot(mod.nb.final)
## ******************************************************************
## Summary of the Randomised Quantile Residuals
## mean = -0.06028562
## variance = 1.155294
## coef. of skewness = -0.2930333
## coef. of kurtosis = 2.383353
## Filliben correlation coefficient = 0.9909457
## ******************************************************************
plot(mod.zip.final)
## ******************************************************************
## Summary of the Randomised Quantile Residuals
## mean = -0.03668195
## variance = 1.357443
## coef. of skewness = 0.4299052
## coef. of kurtosis = 3.711852
## Filliben correlation coefficient = 0.9912651
## ******************************************************************
par(mfrow=c(2, 2))
wp(mod.p.final) ; title('Poisson')
wp(mod.nb.final) ; title('Binomial negativo')
wp(mod.nb.final) ; title('ZIP')
GAIC(mod.p.final, mod.nb.final, mod.zip.final)
## df AIC
## mod.zip.final 4 733.7801
## mod.nb.final 4 735.1283
## mod.p.final 2 920.1405
El mejor modelo es porque tiene el valor de \(GAIC\) mas pequeno.
Taken from: http://www.gamlss.org/wp-content/uploads/2013/01/gamlss-manual.pdf
Graficos de distribucion de probabilidad para \(ZIP(\mu=5, \sigma)\).
De la tabla de resumen para el modelo ZIP se tiene que:
\[\begin{equation} \begin{array}{ll} log(\hat{\mu}) &= 0.98927 + 0.19462 W_t \\ logit(\hat{\sigma}) &= 12.3898 - 0.5005 W \\ \end{array} \end{equation}\]
Aplicando \(logit^{-1}\) se tiene que:
\[\begin{equation} \begin{array}{ll} \hat{\mu} &= \exp \left(0.98927 + 0.19462 W_t \right) \\ \hat{\sigma} &= 1 / \left( 1+\exp(-12.3898 + 0.5005 W) \right) \\ \end{array} \end{equation}\]
par(mfrow=c(1, 2))
mu.est <- function(x) exp(0.98927 + 0.19462 * x)
curve(mu.est, from=1.2, to=5.2, lwd=3, las=1, col='blue',
xlab=expression(w[t]), ylab=expression(hat(mu)))
sigma.est <- function(x) 1/(1 + exp(-12.3898 + 0.5005 * x))
curve(sigma.est, from=21, to=34, lwd=3, las=1, col='red',
xlab=expression(w), ylab=expression(hat(sigma)))
A partir de las expresiones anteriores de \(\mu\) y \(\sigma\) estimadas es posible escribir la media y varianza estimada recordando que cuando \(y \sim ZIP(\mu, \sigma)\) la media y varianza estan dadas por:
\[\begin{equation} \begin{array}{ll} E(y) &= (1-\sigma) \mu \\ Var(y) &= \mu (1-\sigma)(1+\sigma\mu) \\ \end{array} \end{equation}\]
La media estimada \(\hat{E}(y)\) se puede representar graficamente asi:
media <- function(w, wt) (1-sigma.est(w)) * mu.est(wt)
media <- Vectorize(media)
par(mfrow=c(1, 2))
k <- 100
W <- seq(from=21, to=33.5, length.out = k)
Wt <- seq(from=1.2, to=5.2, length.out = k)
medias <- outer(W, Wt, media)
contour(x=W, y=Wt, z=medias, lwd=3, las=1,
xlab='W', ylab=expression(W[t]))
image(x=W, y=Wt, z=medias, lwd=2, col=terrain.colors(50), las=1,
xlab='W', ylab=expression(W[t]))
https://data.princeton.edu/wws509/r/overdispersion
Nota: Para conocer otras publicaciones relacionadas con glm visite este enlace.