Regresiones y matrices

Podemos formular los modelos de regresión usando álgebra matricial. Veamos un ejemplo donde tenemos tres variables predictoras.

set.seed(123)
n <- 30

x1 <-  runif(n, -3, 3)
x2 <-  runif(n, -3, 3)
x3 <-  runif(n, -3, 3)

Como la regresión es en definitiva una función lineal, podemos hacer las cuentas para obtener los valores esperados de la variable de respuesta $\mathbf{\mu}$ con un producto entre una matriz donde ponemos las variables predictoras $\mathbf{X}$ y un vector de coeficientes $: \[ \mathbf{\mu} = \mathbf{X} \mathbf{\beta} \] A esta matriz $\mathbf{X}$ se la llama “matriz de diseño”. Para tener una “ordenada al orígen” agregamos a la matriz de variables predictoras una columna de unos. Veamos cómo hacer esto en R:

X <- cbind(numeric(n) + 1, x1, x2, x3) # esta sería la llamada "matriz de diseño"
betas <- c(-1, 0.2, -0.1, 0.8)
sigma <- 0.5

m <- X %*% betas

# ahora simulamos los datos
y <- rnorm(n, m, sd = sigma)

Formular las regresiones de esta manera nos permite resolver las estimaciones de parámetros usando mínimos cuadrados. Podemos estimar los coeficientes haciendo:

\[ \hat{\mathbf{\beta}} = \left( \mathbf{X}^t \mathbf{X} \right)^{-1} \mathbf{X}^t \mathbf{y} \]

Para hacer esas cuentas en R hacemos

solve( t(X) %*% X ) %*% t(X) %*% y

##          [,1]
##    -1.0591239
## x1  0.2868337
## x2 -0.1138111
## x3  0.7865845

Podemos comprobar que lm nos da esos resultados:

lm(y ~ x1 + x2 + x3)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3)
## 
## Coefficients:
## (Intercept)           x1           x2           x3  
##     -1.0591       0.2868      -0.1138       0.7866

En análisis Bayesianos, podemos aprovechar un poco la idea de una matriz de diseño para escribir los modelos de manera más compacta. Por ejemplo en lenguaje BUGS podemos escribir:

cat(file = "reg.bug",
"
model{
  for(i in 1 : n){
    y[i] ~ dnorm(m[i], tau)
    m[i] <- b0 + inprod(b, X[i, ])
  }
  
  b0 ~ dnorm(0, 0.001)
  for (j in 1:k) {
    b[j] ~ dnorm(0, 1)
  }
  sigma ~ dnorm(0, 1)T(0, )
  tau <- pow(sigma, -2)
}
")

En este caso, la matriz de diseño no incluye una columna de unos porque quiero poner una previa diferente para la ordenada al origen. Para hacer el ajuste en JAGS:

library(jagsUI)

## Loading required package: lattice

## 
## Attaching package: 'jagsUI'

## The following object is masked from 'package:utils':
## 
##     View

X <- cbind(x1, x2, x3)
k <- ncol(X)
y <- as.vector(y)
data   <- list("y", "X", "n", "k")
params <- c("b0", "b", "sigma")

reg.sim <- jags(data, 
               inits = NULL, 
               params, 
               model.file = "reg.bug",
               n.chains = 3, 
               n.iter = 1000, 
               n.burnin = 500)

## 
## Processing function input....... 
## 
## Done. 
##  
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 30
##    Unobserved stochastic nodes: 5
##    Total graph size: 233
## 
## Initializing model
## 
## Adaptive phase..... 
## Adaptive phase complete 
##  
## 
##  Burn-in phase, 500 iterations x 3 chains 
##  
## 
## Sampling from joint posterior, 500 iterations x 3 chains 
##  
## 
## Calculating statistics....... 
## 
## Done.

print(reg.sim)

## JAGS output for model 'reg.bug', generated by jagsUI.
## Estimates based on 3 chains of 1000 iterations,
## adaptation = 100 iterations (sufficient),
## burn-in = 500 iterations and thin rate = 1,
## yielding 1500 total samples from the joint posterior. 
## MCMC ran for 0.002 minutes at time 2019-05-14 22:26:39.
## 
##            mean    sd   2.5%    50%  97.5% overlap0     f  Rhat n.eff
## b0       -1.063 0.091 -1.245 -1.059 -0.889    FALSE 1.000 1.001  1500
## b[1]      0.289 0.052  0.189  0.290  0.391    FALSE 1.000 1.000  1500
## b[2]     -0.116 0.052 -0.218 -0.115 -0.017    FALSE 0.989 1.001  1469
## b[3]      0.783 0.055  0.669  0.784  0.890    FALSE 1.000 1.001  1500
## sigma     0.483 0.072  0.369  0.475  0.644    FALSE 1.000 1.004  1500
## deviance 39.857 3.550 35.213 39.079 48.717    FALSE 1.000 1.004   854
## 
## Successful convergence based on Rhat values (all < 1.1). 
## Rhat is the potential scale reduction factor (at convergence, Rhat=1). 
## For each parameter, n.eff is a crude measure of effective sample size. 
## 
## overlap0 checks if 0 falls in the parameter's 95% credible interval.
## f is the proportion of the posterior with the same sign as the mean;
## i.e., our confidence that the parameter is positive or negative.
## 
## DIC info: (pD = var(deviance)/2) 
## pD = 6.3 and DIC = 46.15 
## DIC is an estimate of expected predictive error (lower is better).

Si hacemos el análisis con Stan, podemos escribir el modelo como:

cat(file = "reg.stan", 
"
data { 
  int<lower=1> n;
  int<lower=1> k;
  vector[n] y;
  matrix[n,k] X;
}

parameters {
  real b0;
  vector[k] b;
  real<lower=0> s;
}

model{
  b0 ~ normal(0,5);
  b ~ normal(0,1);
  s ~ normal(0,1);
  y ~ normal(b0 + X*b, s);
}

"
)

veamos qué da…

library(rstan)

## Loading required package: ggplot2

## Loading required package: StanHeaders

## rstan (Version 2.18.2, GitRev: 2e1f913d3ca3)

## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)

## 
## Attaching package: 'rstan'

## The following object is masked from 'package:jagsUI':
## 
##     traceplot

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

reg_dat <- list(n = n,
               y = y,
               X = X,
               k = k)

stan.sim <- stan(file = 'reg.stan', 
            data = reg_dat,
            iter = 1000,
            chains = 3)

print(stan.sim)

## Inference for Stan model: reg.
## 3 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=1500.
## 
##       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## b0   -1.06    0.00 0.09 -1.24 -1.12 -1.06 -0.99 -0.88  1429 1.00
## b[1]  0.29    0.00 0.05  0.18  0.25  0.28  0.32  0.39  1486 1.00
## b[2] -0.11    0.00 0.05 -0.22 -0.15 -0.11 -0.08 -0.01  1689 1.00
## b[3]  0.79    0.00 0.06  0.68  0.75  0.78  0.82  0.90  1901 1.00
## s     0.48    0.00 0.07  0.37  0.43  0.47  0.52  0.64  1154 1.00
## lp__  6.40    0.07 1.76  1.90  5.43  6.83  7.70  8.70   578 1.02
## 
## Samples were drawn using NUTS(diag_e) at Tue May 14 22:26:42 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Otra forma de considerar a los modelos de regresión de este tipo es pensar que la variable $y$ proviene de una multivariada normal donde la matriz de varianza-covarianza tiene un solo valor en la diagonal (correspodiente a sigma la cuadrado) y es cero en el resto. Veamos un ejemplo con pocos datos como para ver de qué se trata:

library(mvtnorm)
n <- 6
sigma <- 0.5
S <- diag(numeric(n) + sigma^2)
S

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.25 0.00 0.00 0.00 0.00 0.00
## [2,] 0.00 0.25 0.00 0.00 0.00 0.00
## [3,] 0.00 0.00 0.25 0.00 0.00 0.00
## [4,] 0.00 0.00 0.00 0.25 0.00 0.00
## [5,] 0.00 0.00 0.00 0.00 0.25 0.00
## [6,] 0.00 0.00 0.00 0.00 0.00 0.25

Lo que vemos en la matriz de varianza-covarianza S es que la varianza es la misma para todas las observaciones (homogéneidad de varianza) y que los datos son independientes porque las covarianzas son siempre ceros.

Ahora calculamos la media y generamos los datos

X <- cbind(numeric(n) + 1, x1, x2, x3)
m <- X %*% betas
y <- rmvnorm(1, m[1:n], S)

Podemos modelar falta de independencia en los datos si consideramos valores distintos de cero en las covarianzas. Veamos por ejemplo un caso de autocorrelation espacial:

library(mvtnorm)

# función para calcular matriz de distancia en un cuadrado de lado = side
dist.matrix <- function(side)
{
  cords <- expand.grid(1:side, 1:side)
  D <- dist(cords, method="euclidean", diag=TRUE, upper=TRUE)
  D <- as.matrix(D)
  return(D)
}

# función para simular datos auto-correlacionados en 2D,
# con decaimiento exponencial de la autocorrelación 
# (la media es constante e igual a global.mu)
cor.surface <- function(side, global.mu, s, rho)
{
  D <- dist.matrix(side)
  S <- s^2 / (1 - rho) * rho^D
  mu <- rep(global.mu, times=side*side)
  M <- matrix(nrow=side, ncol=side)
  M[] <- rmvnorm(1, mu, S)
  return(M)
}


set.seed (123)
side = 10
s = 2
global.mu = 0
rho = 0.7  

M <- cor.surface(side = side, rho = rho, global.mu = global.mu, s = s)
image(M)

Ahora escribimos el modelo en BUGS

cat(file="mvnormal.bugs",
"
model
{
  y[1:N] ~ dmnorm(mu[], D.tau[,])

  for(i in 1:N){
    mu[i] <- global.mu
  }

  rho ~ dbeta(2,2)
  s ~ dnorm(0,1)T(0,)
  global.mu ~ dnorm(0, 0.1)

  for(i in 1:N){
    for(j in 1:N){
      # turning the distance matrix to covariance matrix
      D.covar[i,j] <- s*s / (1-rho) * pow(rho, D[i,j])
    }
  }
  # turning covariances into precisions:
  D.tau[1:N,1:N] <- inverse(D.covar[1:N,1:N])
}
")

y lo ajustamos con JAGS

y <- as.vector(as.matrix(M))
datos <- list(N = side * side, D = dist.matrix(side), y = y)

fit <-  jags(data=datos, 
              parameters.to.save=c("rho","s", "global.mu"),
              model.file="mvnormal.bugs",
              n.iter = 1000,
              n.chains = 3)

## 
## Processing function input....... 
## 
## Done. 
##  
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1
##    Unobserved stochastic nodes: 3
##    Total graph size: 30120
## 
## Initializing model
## 
## Adaptive phase..... 
## Adaptive phase complete 
##  
## No burn-in specified 
##  
## Sampling from joint posterior, 1000 iterations x 3 chains 
##  
## 
## Calculating statistics....... 
## 
## Done.

print(fit)

## JAGS output for model 'mvnormal.bugs', generated by jagsUI.
## Estimates based on 3 chains of 1000 iterations,
## adaptation = 100 iterations (sufficient),
## burn-in = 0 iterations and thin rate = 1,
## yielding 3000 total samples from the joint posterior. 
## MCMC ran for 1.15 minutes at time 2019-05-14 22:26:42.
## 
##              mean    sd    2.5%     50%   97.5% overlap0     f  Rhat n.eff
## rho         0.596 0.109   0.396   0.592   0.826    FALSE 1.000 1.002  2100
## s           1.814 0.129   1.583   1.808   2.081    FALSE 1.000 1.004   840
## global.mu   1.404 1.052  -0.839   1.462   3.326     TRUE 0.917 1.000  3000
## deviance  430.478 2.294 427.806 429.897 436.415    FALSE 1.000 1.008   479
## 
## Successful convergence based on Rhat values (all < 1.1). 
## Rhat is the potential scale reduction factor (at convergence, Rhat=1). 
## For each parameter, n.eff is a crude measure of effective sample size. 
## 
## overlap0 checks if 0 falls in the parameter's 95% credible interval.
## f is the proportion of the posterior with the same sign as the mean;
## i.e., our confidence that the parameter is positive or negative.
## 
## DIC info: (pD = var(deviance)/2) 
## pD = 2.6 and DIC = 433.101 
## DIC is an estimate of expected predictive error (lower is better).

#pairs(as.matrix(as.mcmc(fit)))