Ajuste de modelos de regressão censurados em diferentes pacotes estatísticas

Iniciação Científica 2018-2019

Programa de Pesquisa para Alunos da Graduação

Aluno: Rafael Cabral Fernandez

Orientadores: Gustavo H. M. A. Rocha e Renata S. Bueno

Introdução

O documento a seguir apresenta a utilização do JAGS.

Para utilização do programa, é necessário baixar o programa em: https://sourceforge.net/projects/mcmc-jags/files/

Para utilização interna, basta utilizar o pacote “R2Jags” e “rjags”

O pacote foi rodado em 4 exemplos diferentes:

Exemplo didático para treino
Simulação de um modelo de regressão linear simples
Modelo de regressão linear simples utilizando dados do ipea, modelando a taxa de homicídos através da taxa de desemprego
Modelo de regressão linear simples utilizando dados do enem, modelando a nota de matemática através da nota de redação

Treino

Gerando dados normais

N <- 1000
x <- rnorm(N, 0, 5)

Código para efetuar o MCMC, ressaltando mais uma vez que também no Jags há incompatibilidade.

write.table(x,
            file = 'example1.data',
            row.names = FALSE,
            col.names = FALSE)

example1.string <-"model {
  for (i in 1:N) {
    x[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0, .0001)
  tau <- pow(sigma, -2)
  sigma ~ dunif(0, 100)
} "


example1.bug <- textConnection(example1.string)

Contudo, é possível salvar os resultados e obter algumas informações resumo de forma simples. (Como não há compatilidade, não há como imprimir os resultados pelo r markdown diretamente)

jags <- jags.model('example1.bug',
                   data = list('x' = x,'N' = N),
                   n.chains = 4,
                   n.adapt = 100)

update(jags, 1000)

samps.coda <- coda.samples(jags,
                           c('x'),
                           n.iter=10000,
                           thin=10
)

O método de obtenção das amostras da distribuição a posterior não é trivial pelo pacote “rjags”. Daqui em diante foi utilizado o pacote “R2Jags” para facilitar a amostragem.

Simulação

Gerando os dados

n = 100
alpha = -2
beta = 2
sigma = 1

set.seed(123)
x = sort(runif(n, 0, 10)) 
y = rnorm(n, mean = alpha + beta * x, sd = sigma)

plot(x, y)
lines(x, alpha + beta * x)

Código para o modelo R2Jags. Sendo mais direto que o anterior, não precisando de outras funções para fazer a ligação. AQUI NÃO HÁ CONFLITO COM O RMARKDOWN

model_code = '
model
{
  # Vero
  for (i in 1:n) {
  y[i] ~ dnorm(alpha + beta * x[i], sigma^-2)
  }
  # Prioris
  alpha ~ dnorm(0, 100^-2)
  beta ~ dnorm(0, 100^-2)
  sigma ~ dgamma(3,10)
}
'

model_data = list(n = n, y = y, x = x)

model_parameters =  c("alpha", "beta", "sigma")

model_run = jags(data = model_data,
                 parameters.to.save = model_parameters,
                 model.file=textConnection(model_code),
                 n.chains=4, # Number of different starting positions
                 n.iter=1000, # Number of iterations
                 n.burnin=200, # Number of iterations to remove at start
                 n.thin=2,# Amount of thinning
                 DIC=FALSE)

## module glm loaded

## module dic loaded

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 100
##    Unobserved stochastic nodes: 3
##    Total graph size: 412
## 
## Initializing model

Resultados

kable(head(model_run[["BUGSoutput"]][["sims.matrix"]]))

alpha	beta	sigma
-1.904152	1.957540	0.9814547
-1.970599	1.971670	0.8747304
-1.693059	1.940811	0.9279075
-1.693775	1.945377	0.9883357
-2.121193	2.017274	1.0706101
-1.846197	1.932990	0.9966065

plot(model_run)

print(model_run)

## Inference for Bugs model at "6", fit using jags,
##  4 chains, each with 1000 iterations (first 200 discarded), n.thin = 2
##  n.sims = 1600 iterations saved
##       mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
## alpha  -1.827   0.187 -2.176 -1.951 -1.831 -1.705 -1.463 1.001  1600
## beta    1.954   0.033  1.886  1.932  1.954  1.977  2.017 1.001  1500
## sigma   0.936   0.063  0.826  0.892  0.934  0.974  1.064 1.000  1600
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

traceplot(model_run)

Homícidio

Ajustando os dados

load("C:/Users/Rafael/Desktop/Pesquisa/Bases/homicidio.RData")
y = homicidio$X__2
x = homicidio$X__1
lm1 <- lm(y ~ x)
summary(lm1)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2456 -0.7098  0.2115  0.5496  1.8874 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.7631     2.1774   7.239 4.29e-06 ***
## x             1.1262     0.2375   4.742 0.000315 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.173 on 14 degrees of freedom
## Multiple R-squared:  0.6163, Adjusted R-squared:  0.5889 
## F-statistic: 22.49 on 1 and 14 DF,  p-value: 0.0003151

plot(x, y)
lines(x, 15.7 + 1.13 * x)

Código para o modelo R2Jags

model_code = '
model
{
  # Vero
  for (i in 1:n) {
  y[i] ~ dnorm(alpha + beta * x[i], sigma^-2)
  }
  # Prioris
  alpha ~ dnorm(0, 100^-2)
  beta ~ dnorm(0, 100^-2)
  sigma ~ dgamma(3,10)
}
'
n = length(y)

model_data = list(n = n, y = y, x = x)

model_parameters =  c("alpha", "beta", "sigma")

model_run = jags(data = model_data,
                 parameters.to.save = model_parameters,
                 model.file=textConnection(model_code),
                 n.chains=4, # Number of different starting positions
                 n.iter=1000, # Number of iterations
                 n.burnin=200, # Number of iterations to remove at start
                 n.thin=2,# Amount of thinning
                 DIC=FALSE)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 16
##    Unobserved stochastic nodes: 3
##    Total graph size: 76
## 
## Initializing model

Resultados

kable(head(model_run[["BUGSoutput"]][["sims.matrix"]]))

alpha	beta	sigma
11.77850	1.5466896	1.1398907
15.93822	1.1088168	0.7976276
16.65617	0.9807454	1.0432371
18.21035	0.8479324	0.9499642
16.22962	1.0718739	0.9501246
15.83882	1.1104588	0.9556654

plot(model_run)

print(model_run)

## Inference for Bugs model at "7", fit using jags,
##  4 chains, each with 1000 iterations (first 200 discarded), n.thin = 2
##  n.sims = 1600 iterations saved
##       mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
## alpha  15.766   1.793 12.258 14.558 15.802 16.915 19.224 1.000  1600
## beta    1.126   0.196  0.748  0.997  1.123  1.257  1.510 1.000  1600
## sigma   0.982   0.139  0.754  0.883  0.967  1.066  1.290 1.002   990
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

traceplot(model_run)

Enem

Carregando os dados

load("C:/Users/Rafael/Desktop/Pesquisa/Bases/enemAMOSTRA.Rda")

y = enem$NU_NOTA_REDACAO
x = enem$NU_NOTA_MT
lm1 <- lm(y ~ x)
summary(lm1)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -598.34  -84.79   -3.69   88.03  321.00 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 353.68468   32.02075  11.045   <2e-16 ***
## x             0.52649    0.05585   9.426   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 127.3 on 348 degrees of freedom
## Multiple R-squared:  0.2034, Adjusted R-squared:  0.2011 
## F-statistic: 88.86 on 1 and 348 DF,  p-value: < 2.2e-16

plot(x, y)
lines(x, 354 + 0.52* x)

Criando o modelo em R2Jags

model_code = '
model
{
  # Vero
  for (i in 1:n) {
  y[i] ~ dnorm(alpha + beta * x[i], sigma^-2)
  }
  # Prioris
  alpha ~ dnorm(0, 100^-2)
  beta ~ dnorm(0, 100^-2)
  sigma ~ dgamma(0.1,0.001)
}
'
n = length(y)

model_data = list(n = n, y = y, x = x)

model_parameters =  c("alpha", "beta", "sigma")

model_run = jags(data = model_data,
                 parameters.to.save = model_parameters,
                 model.file=textConnection(model_code),
                 n.chains=4, # Number of different starting positions
                 n.iter=1000, # Number of iterations
                 n.burnin=200, # Number of iterations to remove at start
                 n.thin=2,# Amount of thinning
                 DIC=FALSE)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 350
##    Unobserved stochastic nodes: 3
##    Total graph size: 1382
## 
## Initializing model

Resultados

kable(head(model_run[["BUGSoutput"]][["sims.matrix"]]))

alpha	beta	sigma
340.1643	0.5544911	126.1423
317.2050	0.6067734	131.9827
321.0834	0.5930153	131.6446
332.4139	0.5766795	130.9353
250.5939	0.6919149	129.0140
297.2232	0.6358489	128.5419

plot(model_run)

print(model_run)

## Inference for Bugs model at "8", fit using jags,
##  4 chains, each with 1000 iterations (first 200 discarded), n.thin = 2
##  n.sims = 1600 iterations saved
##       mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## alpha 320.423  30.645 260.449 299.334 320.430 341.221 381.223 1.000  1600
## beta    0.584   0.053   0.478   0.549   0.584   0.620   0.691 1.000  1600
## sigma 127.613   4.735 119.133 124.243 127.635 130.641 137.511 1.006   410
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

traceplot(model_run)

Escola Nacional de Ciências Estatísticas

Rafael Cabral Fernandez

25 de janeiro de 2019

Ajuste de modelos de regressão censurados em diferentes pacotes estatísticas

Iniciação Científica 2018-2019

Programa de Pesquisa para Alunos da Graduação

Aluno: Rafael Cabral Fernandez

Orientadores: Gustavo H. M. A. Rocha e Renata S. Bueno

Introdução

Treino

Simulação

Homícidio

Enem