1 Modelo

El modelo Normal multivariado para secuencia de observaciones \(\boldsymbol{y}_1,\ldots,\boldsymbol{y}_n\), donde \(\boldsymbol{y}_i = (y_{i,1},\ldots,y_{i,p})\in\mathbb{R}^p\) para \(i=1,\ldots,n\), está dado por: \[ \begin{align*} \boldsymbol{y}_i\mid\boldsymbol{\theta},\mathbf{\Sigma}&\stackrel{\text{iid}}{\sim} \textsf{N}(\boldsymbol{\theta},\mathbf{\Sigma})\\ (\boldsymbol{\theta},\mathbf{\Sigma}) &\sim p(\boldsymbol{\theta},\mathbf{\Sigma}) \,. \end{align*} \] El modelo tiene \(k=p+p(p+1)/2\) parámetros desconocidos, a saber, el vector de medias \(\boldsymbol{\theta}\) y la matriz de covarianzas \(\mathbf{\Sigma}\).

El vector aleatorio \(\boldsymbol{y}=(y_1,\ldots,y_p)\) tiene distribución Normal multivariada si la función de densidad de probabilidad \(\boldsymbol{y}\) está dada por \[ p(\boldsymbol{y}\mid\boldsymbol{\theta},\mathbf{\Sigma}) = (2\pi)^{-p/2}\,|\mathbf{\Sigma}|^{-1/2}\,\exp\left\{ -\frac12(\boldsymbol{y}-\boldsymbol{\theta})^{\textsf{T}} \mathbf{\Sigma}^{-1} (\boldsymbol{y}-\boldsymbol{\theta}) \right\}\,, \] donde \[ \boldsymbol{\theta}= (\theta_1,\ldots,\theta_p) \qquad\text{y}\qquad \mathbf{\Sigma}= \begin{bmatrix} \sigma_1^2 & \sigma_{1,2} & \cdots & \sigma_{1,p} \\ \sigma_{2,1}^2 & \sigma_2^2 & \cdots & \sigma_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p,1} & \sigma_{p,2} & \cdots & \sigma_p^2 \\ \end{bmatrix}\,, \] con \(\mathbf{\Sigma}\) simétrica, \(\mathbf{\Sigma}^{\textsf{T}}=\mathbf{\Sigma}\), y definida positiva, \(\boldsymbol{x}^{\textsf{T}}\mathbf{\Sigma}\boldsymbol{x}> 0\) para todo \(\boldsymbol{x}\in\mathbb{R}^p\).

El núcleo de esta distribución es: \[ p(\boldsymbol{y}\mid\boldsymbol{\theta},\mathbf{\Sigma}) \propto \exp\left\{ -\frac12\left[ \boldsymbol{y}^{\textsf{T}}\mathbf{\Sigma}^{-1}\boldsymbol{y}- 2\boldsymbol{y}^{\textsf{T}}\mathbf{\Sigma}^{-1}\boldsymbol{\theta}\right] \right\}\,. \]

2 Estadísticos Suficientes

Si \(\boldsymbol{y}_i\mid\boldsymbol{\theta},\mathbf{\Sigma}\stackrel{\text{iid}}{\sim} \textsf{N}(\boldsymbol{\theta},\mathbf{\Sigma})\), con \(i=1,\ldots,n\), entonces la distribución muestral conjunta de las observaciones es: \[ \begin{align*} p\left(\mathbf{Y} \mid \boldsymbol{\theta},\mathbf{\Sigma}\right) &= \prod_{i=1}^n \left(2 \pi\right)^{-p / 2} |\mathbf{\Sigma}|^{-1/2} \exp\left\{ -\frac12(\boldsymbol{y}_i-\boldsymbol{\theta})^{\textsf{T}} \mathbf{\Sigma}^{-1} (\boldsymbol{y}_i-\boldsymbol{\theta}) \right\} \\ &= \left(2 \pi\right)^{-np / 2} |\mathbf{\Sigma}|^{-n/2} \exp\left\{ -\frac12\sum_{i=1}^n(\boldsymbol{y}_i-\boldsymbol{\theta})^{\textsf{T}} \mathbf{\Sigma}^{-1} (\boldsymbol{y}_i-\boldsymbol{\theta}) \right\}\,, \end{align*} \] donde \(\mathbf{Y}=[\boldsymbol{y}_1^\textsf{T},\ldots,\boldsymbol{y}_n^\textsf{T}]^\textsf{T}\).

Como \[ \begin{align*} \sum_{i=1}^n(\boldsymbol{y}_i-\boldsymbol{\theta})^{\textsf{T}} \mathbf{\Sigma}^{-1} (\boldsymbol{y}_i-\boldsymbol{\theta}) &= \sum_{i=1}^n\boldsymbol{y}_i^{\textsf{T}}\mathbf{\Sigma}^{-1}\boldsymbol{y}_i - 2\sum_{i=1}^n\boldsymbol{y}_i^{\textsf{T}}\mathbf{\Sigma}^{-1}\boldsymbol{\theta}+ n\boldsymbol{\theta}^{\textsf{T}}\mathbf{\Sigma}^{-1}\boldsymbol{\theta}\\ &= \textsf{traza}\left( \mathbf{\Sigma}^{-1} \sum_{i=1}^n\boldsymbol{y}_i\boldsymbol{y}_i^{\textsf{T}} \right) - 2\left(\sum_{i=1}^n\boldsymbol{y}_i\right)^{\textsf{T}}\mathbf{\Sigma}^{-1}\boldsymbol{\theta}+ n\boldsymbol{\theta}^{\textsf{T}}\mathbf{\Sigma}^{-1}\boldsymbol{\theta}\\ \end{align*} \] entonces \[ \left( \sum_{i=1}^n\boldsymbol{y}_i, \sum_{i=1}^n\boldsymbol{y}_i\boldsymbol{y}_i^{\textsf{T}} \right) \] es un estadístico suficiente para \((\boldsymbol{\theta},\mathbf{\Sigma})\).

La media muestral y la matriz de covarianza muestral \((\bar{\boldsymbol y},\mathbf{S})\) también constituyen un estadístico suficiente para \((\boldsymbol{\theta},\mathbf{\Sigma})\) dado que \[ \sum_{i=1}^n(\boldsymbol{y}_i-\boldsymbol{\theta})^{\textsf{T}} \mathbf{\Sigma}^{-1} (\boldsymbol{y}_i-\boldsymbol{\theta}) = \textsf{traza}\left( \mathbf{\Sigma}^{-1}\left[ (n-1)\mathbf{S}+ n(\bar{\boldsymbol y} - \boldsymbol{\theta})(\bar{\boldsymbol y} - \boldsymbol{\theta})^{\textsf{T}} \right] \right) \] donde \[ \bar{\boldsymbol y} = \frac{1}{n}\sum_{i=1}^n \boldsymbol{y}_i \qquad\text{y}\qquad \mathbf{S}= \frac{1}{n-1}\sum_{i=1}^n (\boldsymbol{y}_i - \bar{\boldsymbol y})(\boldsymbol{y}_i - \bar{\boldsymbol y})^{\textsf{T}}\,. \]

3 Modelo Normal Multivariado Semi-Conjugado

Distribución muestral: \[ \boldsymbol{y}_i\mid\boldsymbol{\theta},\mathbf{\Sigma}\stackrel{\text{iid}}{\sim} \textsf{N}(\boldsymbol{\theta},\mathbf{\Sigma})\,,\qquad i=1,\ldots,n\,. \]
Distribución previa: \[ p(\boldsymbol{\theta},\mathbf{\Sigma}) = p(\boldsymbol{\theta})\,p(\mathbf{\Sigma}) \] donde \[ \begin{align*} \boldsymbol{\theta}&\sim \textsf{N} (\boldsymbol{\mu}_0, \mathbf{\Lambda}_0) \\ \mathbf{\Sigma}&\sim \textsf{WI}(\nu_0,\mathbf{S}_0^{-1}) \end{align*} \]
Parámetros: \(\boldsymbol{\theta}\), \(\mathbf{\Sigma}\).
Hiperparámetros: \(\boldsymbol{\mu}_0, \mathbf{\Lambda}_0, \nu_0, \mathbf{S}_0\).

Distribución Wishart

La matriz aleatoria \(\mathbf{W}> 0\) de \(p \times p\) tiene distribución Wishart con parámetros \(\nu > p+1\) (grados de libertad) y \(\mathbf{S}> 0\) (matriz de escala de \(p\times p\)), i.e., \(\mathbf{W}\sim\textsf{W}(\nu,\mathbf{S})\), si su función de densidad de probabilidad es \[ p(\mathbf{W}\mid\nu,\mathbf{S}) \propto |\mathbf{W}|^{(\nu-p-1)/2}\,\exp\left\{ -\frac12\textsf{traza}(\mathbf{S}^{-1}\mathbf{W}) \right\}\,. \]

Si \(\mathbf{W}\sim\textsf{W}(\nu,\mathbf{S})\), entonces \(\textsf{E}(\mathbf{W}) = \nu\mathbf{S}\).

Distribución Wishart Inversa

La matriz aleatoria \(\mathbf{W}> 0\) de \(p \times p\) tiene distribución Wishart Inversa con parámetros \(\nu > p+1\) (grados de libertad) y \(\mathbf{S}> 0\) (matriz de escala de \(p\times p\)), i.e., \(\mathbf{W}\sim\textsf{WI}(\nu,\mathbf{S}^{-1})\), si su función de densidad de probabilidad es \[ p(\mathbf{W}\mid\nu,\mathbf{S}^{-1}) \propto |\mathbf{W}|^{-(\nu+p+1)/2}\,\exp\left\{ -\frac12\textsf{traza}(\mathbf{S}\mathbf{W}^{-1}) \right\}\,. \]

Si \(\mathbf{W}\sim\textsf{WI}(\nu,\mathbf{S}^{-1})\), entonces \(\textsf{E}(\mathbf{W}) = \frac{1}{\nu-p-1}\mathbf{S}\) para \(\nu > p+1\).
Si \(\mathbf{W}^{-1}\sim\textsf{W}(\nu,\mathbf{S})\), entonces \(\mathbf{W}\sim\textsf{WI}(\nu,\mathbf{S}^{-1})\).

4 Distribuciones condicionales completas

El muestreador de Gibbs (Gibbs sampler) es un algoritmo iterativo que permite obtener muestras dependientes de la distribución posterior por medio de las distribuciones condicionales completas.

La distribución condicional completa de \(\boldsymbol{\theta}\) es \(\boldsymbol{\theta}\mid\text{resto}\sim\textsf{N}(\boldsymbol{\mu}_n,\mathbf{\Lambda}_n)\), donde \[ \boldsymbol{\mu}_n = \left( \mathbf{\Lambda}_0^{-1} + n\mathbf{\Sigma}^{-1} \right)^{-1} \left( \mathbf{\Lambda}_0^{-1}\boldsymbol{\mu}_0 + n\mathbf{\Sigma}^{-1}\bar{\boldsymbol{y}} \right) \qquad\text{y}\qquad \mathbf{\Lambda}_n = \left( \mathbf{\Lambda}_0^{-1} + n\mathbf{\Sigma}^{-1} \right)^{-1}\,. \]
La distribución condicional completa de \(\mathbf{\Sigma}\) es \(\mathbf{\Sigma}\mid\text{resto}\sim\textsf{WI}(\nu_n,\mathbf{S}_n^{-1})\), donde \[ \nu_n = \nu_0 + n \qquad\text{y}\qquad \mathbf{S}_n = \mathbf{S}_0 + \sum_{i=1}^n (\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^{\textsf{T}} = \mathbf{S}_0 + (n-1)\mathbf{S}+ n(\bar{\boldsymbol y} - \boldsymbol{\theta})(\bar{\boldsymbol y} - \boldsymbol{\theta})^{\textsf{T}} \,. \]

5 Muetreador de Gibbs

Dado un estado actual de los parámetros del modelo \((\boldsymbol{\theta}^{(b)}, \mathbf{\Sigma}^{(b)})\), se genera un nuevo estado \((\boldsymbol{\theta}^{(b+1)}, \mathbf{\Sigma}^{(b+1)})\) como sigue:

Muestrar \(\boldsymbol{\theta}^{(b+1)}\sim p(\boldsymbol{\theta}\mid\mathbf{\Sigma}^{(b)}, \boldsymbol{y})\).
Muestrar \(\mathbf{\Sigma}^{(b+1)}\sim p(\mathbf{\Sigma}\mid\boldsymbol{\theta}^{(b+1)},\boldsymbol{y})\).
Almacenar \((\boldsymbol{\theta}^{(b+1)}, \mathbf{\Sigma}^{(b+1)})\)
Repetir los pasos 1. a 3. hasta convergencia.

Este algoritmo genera una secuencia dependiente de parámetros \((\boldsymbol{\theta}^{(1)}, \mathbf{\Sigma}^{(1)}),\ldots,(\boldsymbol{\theta}^{(B)}, \mathbf{\Sigma}^{(B)})\) de la distribución posterior \(p(\boldsymbol{\theta},\mathbf{\Sigma}\mid \boldsymbol{y})\).

6 Ejemplo: Comprensión de Lectura

A una muestra de 22 niños se les dan pruebas de comprensión de lectura antes y después de recibir un método de instrucción en particular.

Cada estudiante \(i\) tiene asociados dos variables \(y_{i,1}\) y \(y_{i,2}\) que denotan los puntajes antes y después de la instrucción, respectivamente.

El objetivo es evaluar la efectividad del método de enseñanza junto con la consistencia de la prueba de comprensión de lectura.

Hoff, P. D. (2009). A first course in Bayesian statistical methods (Vol. 580). New York: Springer.

# datos
Y <- structure(c(59, 43, 34, 32, 42, 38, 55, 67, 64, 45, 49, 72, 34, 
                 70, 34, 50, 41, 52, 60, 34, 28, 35, 77, 39, 46, 26, 
                 38, 43, 68, 86, 77, 60, 50, 59, 38, 48, 55, 58, 54, 
                 60, 75, 47, 48, 33), 
               .Dim = c(22L, 2L), .Dimnames = list(NULL, c("pretest", "posttest")))
(n <- nrow(Y))

## [1] 22

(p <- ncol(Y))

## [1] 2

# inspeccionar datos
summary(Y)

##     pretest         posttest    
##  Min.   :28.00   Min.   :26.00  
##  1st Qu.:34.25   1st Qu.:43.75  
##  Median :44.00   Median :52.00  
##  Mean   :47.18   Mean   :53.86  
##  3rd Qu.:58.00   3rd Qu.:60.00  
##  Max.   :72.00   Max.   :86.00

# estadísticos suficientes
yb <- c(colMeans(Y))
round(yb, 1)

##  pretest posttest 
##     47.2     53.9

SS <- cov(Y)
round(SS, 1)

##          pretest posttest
## pretest    182.2    148.4
## posttest   148.4    243.6

6.1 Elicitación de lo hiperparámetros

El examen fue diseñado para dar puntajes promedio de 50 de 100: \(\boldsymbol{\mu}_0 = (50, 50)\).

Usar una varianza previa tal que \(\textsf{Pr}( 0 < \theta_j < 100 ) \approx 0.99\): \(\sigma^2_{0,1}=\sigma^2_{0,2}=(50/3)^2 \approx 278\).

Usar correlación previa tal que \(\rho_0 = 0.5\): \(\sigma_{0,12} = (0.5)(50/3)^2 \approx 139\).

# previa
mu0 <- c(50,50)
L0  <- matrix(data = c(278,139,139,278), nrow = 2, ncol = 2)
nu0 <- 4
S0  <- matrix(data = c(278,139,139,278), nrow = 2, ncol = 2)

6.2 Ajuste del Modelo Normal MUltivariado Semi-Conjugado

# inicializar
theta <- yb
Sigma <- SS
# número de muestras
B <- 10000
ncat <- floor(B/10)
# almacenamiento
THETA <- SIGMA <- YS <- LP <- NULL
# cadena
iL0 <- solve(L0)
Lm0 <- solve(L0)%*%mu0
nun <- nu0 + n
SSn <- S0 + (n-1)*SS
set.seed(1)
for (b in 1:B) {
  # actualizar theta
  iSigma <- solve(Sigma)
  Ln     <- solve(iL0 + n*iSigma)
  theta  <- c(mvtnorm::rmvnorm(n = 1, mean = Ln%*%(Lm0 + n*(iSigma%*%yb)), sigma = Ln))
  # actualizar Sigma
  Sigma <- solve(rWishart::rWishart(n = 1, df = nun, Sigma = solve(SSn + n*((yb - theta)%*%t(yb - theta))))[,,1])
  # predictiva posterior
  YS <- rbind(YS, mvtnorm::rmvnorm(n = 1, mean = theta, sigma = Sigma)) 
  # log-verosimilitud
  LP[b] <- sum(apply(X = Y, MARGIN = 1, FUN = function(y) mvtnorm::dmvnorm(x = y, mean = theta, sigma = Sigma, log = T)))
  # almacenar
  THETA <- rbind(THETA, theta)
  SIGMA <- rbind(SIGMA, c(Sigma))
  # progreso
  if (b%%ncat == 0) 
       cat(100*round(b/B, 1), "% completado ... \n", sep = "")
}

## 10% completado ... 
## 20% completado ... 
## 30% completado ... 
## 40% completado ... 
## 50% completado ... 
## 60% completado ... 
## 70% completado ... 
## 80% completado ... 
## 90% completado ... 
## 100% completado ...

colnames(THETA) <- c("theta1", "theta2")
colnames(SIGMA) <- c("sigma1^2", "sigma21", "sigma12", "sigma2^2")

6.3 Convergencia

Cadena de la log-verosimilitud:

Errores estándar de Monte Carlo:

# errores estándar de Monte Carlo
round(apply(X = THETA, MARGIN = 2, FUN = sd)/sqrt(coda::effectiveSize(THETA)), 3)

## theta1 theta2 
##  0.029  0.033

round(apply(X = SIGMA, MARGIN = 2, FUN = sd)/sqrt(coda::effectiveSize(SIGMA)), 3)

## sigma1^2  sigma21  sigma12 sigma2^2 
##    0.597    0.595    0.595    0.812

6.4 Inferencia

Distribución posterior de \(\boldsymbol{\theta}\) y distribución predictiva posterior:

Inferencia sobre \(\theta_2 - \theta_1\):

¿Cuál es la probabilidad posterior de que la calificación promedio del segundo examen sea mayor que la del primero?

round(mean(THETA[,2] - THETA[,1] > 0), 3)

## [1] 0.996

round(quantile(THETA[,2] - THETA[,1], probs = c(0.025, 0.5, 0.975)), 3)

##   2.5%    50%  97.5% 
##  1.640  6.545 11.304

Inferencia sobre \(\tilde{y}_2 - \tilde{y}_1\):

¿Cuál es la probabilidad posterior de que un niño seleccionado al azar obtenga una puntuación más alta en el segundo examen que en el primero?

round(mean(YS[,2] - YS[,1] > 0), 3)

## [1] 0.721

round(quantile(YS[,2] - YS[,1], probs = c(0.025, 0.5, 0.975)), 3)

##    2.5%     50%   97.5% 
## -16.721   6.625  29.643

Inferencia sobre \(\rho=\frac{\sigma_{1,2}}{\sigma_{1}\,\sigma_{2}}\):

¿Las pruebas son consistentes? ¿Cuál es la probabilidad posterior de que la correlación entre las calificaciones sea superior a 0.6?

# muestras de rho
RHO <- SIGMA[,2]/sqrt(SIGMA[,1]*SIGMA[,4])

round(mean(RHO > 0.6), 3)

## [1] 0.804

round(quantile(RHO, prob = c(0.025,0.5,0.975)), 3)

##  2.5%   50% 97.5% 
## 0.430 0.700 0.856

Modelo Normal Multivariado

Juan Sosa PhD

Email jcsosam@unal.edu.co

GitHub https://github.com/jstats1702