Regresión logística bayesiana

Un investigador está interesado en saber qué efecto tienen variables como PAU (nota en Selectividad), bach (nota media bachiller) y el prestigio del instituto al que asistieron en la admisión de los estudiantes en una determinada universidad. La variable respuesta, admitir/no admitirlo, es una variable binaria (8p).

*Y: Admitido (1: Admitido, 0: No Admitido) X1: PAU (Nota en selectividad) X2: Bach (Nota media bachiller) X3: Prestigio (1 mayor reconocimiento hasta 4 reconocimiento bajo)

ASUMIENDO NORMALIDAD

CONSIDERANDO LOS SIGUIENTES DATOS

datos<-read.table("trabajo.txt",header = T,sep = "\t")
attach(datos)
head(datos,5)

##   Y Pau bach Prestigio
## 1 0 380 3.61         3
## 2 1 660 3.67         3
## 3 1 800 4.00         1
## 4 1 640 3.19         4
## 5 0 520 2.93         4

NOS PIDEN

A. Estime los parámetros con la ecuación de regresión logística binaria clásica (2p).

# Ajustar el modelo de regresión logística binaria
modelo_logistico1 <- glm(Y ~ ., data = datos, family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Resumen del modelo
summary(modelo_logistico1)

## 
## Call:
## glm(formula = Y ~ ., family = binomial, data = datos)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.203e+02  1.287e+06   0.000        1
## Pau          3.678e-01  7.229e+02   0.001        1
## bach         5.520e+01  2.644e+05   0.000        1
## Prestigio   -1.420e+01  1.117e+05   0.000        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.4421e+01  on 10  degrees of freedom
## Residual deviance: 3.9958e-10  on  7  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25

Nota. Se observa que, las variables no son significativas (p-value > 5%), lo que da a entender que no influyen o repercuten en que admitan en la universidad, sin embargo por motivos de practica se sigue con el análisis.

B. Determinar la bondad de ajuste del modelo (1p).

#Test de Hosmer-Lemeshow
library(ResourceSelection)

## ResourceSelection 0.3-6   2023-06-27

hoslem.test(Y, fitted(modelo_logistico1), g=10)

## Warning in hoslem.test(Y, fitted(modelo_logistico1), g = 10): The data did not
## allow for the requested number of bins.

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  Y, fitted(modelo_logistico1)
## X-squared = 1.9979e-10, df = 3, p-value = 1

Nota. Según la prueba de Hosmer-Lemeshow, el modelo es adecuado, para un nivel de signficancia del 5%.

#Pseudo-R2
library(DescTools)
PseudoR2(modelo_logistico1,which = c("McFadden","CoxSnell","Nagelkerke"))

##   McFadden   CoxSnell Nagelkerke 
##  1.0000000  0.7304398  1.0000000

Nota. observamos que el valor de Rcuadrado de Cox y Snell es igual a 0.73 y el valor de Rcuadrado de Nagelkerke es igual a 1, el cual nos indica que el 100% de las proporciones de la varianza de la variable dependiente (Admitir) es explicada por la variable predictora Pau, Bach y prestigio; dando a entender que el modelo de regresión logística es bueno.

C. Hallar el coeficiente OR e interpretar.

exp(cbind(OR = coef(modelo_logistico1), confint(modelo_logistico1, level=0.95)))

## Waiting for profiling to be done...

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

##                        OR       2.5 %       97.5 %
## (Intercept) 8.144029e-140 0.00000e+00          Inf
## Pau          1.444521e+00 1.99955e-17 3.125409e+17
## bach         9.424154e+23          NA          Inf
## Prestigio    6.833188e-07 0.00000e+00           NA

Nota. Se observa el OR y sus intervalos

B(Pau): Nos indica la posibilidad de que se admita, por el PAU, es decir, es aproximadamente 1.44 veces más probable que se admita por tener mas nota, respecto a los que tienen menos nota.
B(bach): Nos indica la posibilidad de que se admita, por el bach, es decir, es aproximadamente 9.4 E23 veces más probable que se admita por tener mas nota media de bachiller, respecto a los que tienen menos nota.
B(Prestigo): Nos indica la posibilidad de que se admita, por el prestigio, es decir, es aproximadamente 6.83E7 veces menos probable que se admita por tener un prestigo sercano a una categoria 4, respecto al prestigio sercano a la categoria 1.

#Precisión del modelo
y.pred<-ifelse(modelo_logistico1$fitted.values > 0.43, yes = 1, no = 0) 
tab_clasif <-table(Y, y.pred,dnn = c("observ", "predic"))
tab_clasif

##       predic
## observ 0 1
##      0 4 0
##      1 0 7

Nota. Considerando la precisión (4+7)/7 * 100 = 100%; Entonces el modelo presenta una precisión del 100%, lo que indica un modelo muy bueno.

D. Realizar predicciones en base a los siguientes valores: X (Pau= 1000, Bach=4.5 y Prestigio=1) e interpretar.

#Prediccion
nuevo<-data.frame(Pau= 1000, bach=4.5, Prestigio=1)
pred<-predict(modelo_logistico1,newdata=nuevo,type="response")
pred

## 1 
## 1

Nota. De acuerdo a estos resultados podemos decir que, la probabilidad de que se admita es del 100% cuando Pau sea 1000, bach sea 4.5 y prestigio sea 1.

*E. Encontrar los parámetros posteriores para una regresión logística bayesiana asumiendo que siguen una distribución normal: Intercepto -> N (0,9); B1 -> N (0,10).

Para ello consideramos el siguiente código:

#REGRESION LOGISTICA BAYESIANA ASUMIENDO NORMALIDAD

# Instala y carga el paquete "rstan" si aún no lo has hecho
library(rstan)

## Loading required package: StanHeaders

## 
## rstan version 2.26.23 (Stan version 2.26.1)

## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)

## Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file

library(StanHeaders)

# Definir el modelo en formato Stan
modelo_stan <- "
data {
  int<lower=0> N;        // Número de observaciones
  int<lower=0, upper=3> y[N]; // Variable de respuesta binaria
  vector[N] x1;          // Variable predictora x1
  vector[N] x2;          // Variable predictora x2
  vector[N] x3;          // Variable predictora x3
}

parameters {
  real alpha;            // Intercepto
  real beta1;            // Coeficiente para x1
  real beta2;            // Coeficiente para x2
  real beta3;            // Coeficiente para x3
}

model {
  // Prior para los parámetros
  alpha ~ normal(0, 9);
  beta1 ~ normal(0, 10);
  beta2 ~ normal(0, 10);
  beta3 ~ normal(0, 10);
  // Modelo logístico
  for (i in 1:N) {
    y[i] ~ bernoulli_logit(alpha + beta1 * x1[i] + beta2 * x2[i] + beta3 * x3[i]);
  }
}
"

# Convertir los datos en formato necesario para Stan
datos_stan <- list(
  N = nrow(datos),
  y = Y,
  x1 = Pau,
  x2 = bach,
  x3 = Prestigio
)

# Compilar el modelo
modelo_compilado <- stan_model(model_code = modelo_stan)

# Ajustar el modelo utilizando MCMC
ajuste <- sampling(modelo_compilado, data = datos_stan, chains = 4, iter = 2000)

## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0.000252 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 2.52 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 37.163 seconds (Warm-up)
## Chain 1:                16.811 seconds (Sampling)
## Chain 1:                53.974 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 0.00015 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 1.5 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 30.174 seconds (Warm-up)
## Chain 2:                15.402 seconds (Sampling)
## Chain 2:                45.576 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 0.000169 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 1.69 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 30.005 seconds (Warm-up)
## Chain 3:                18.587 seconds (Sampling)
## Chain 3:                48.592 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 0.000167 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 1.67 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 30.117 seconds (Warm-up)
## Chain 4:                18.26 seconds (Sampling)
## Chain 4:                48.377 seconds (Total)
## Chain 4:

# Resumen de los resultados
summary(ajuste)

## $summary
##              mean     se_mean         sd         2.5%          25%         50%
## alpha  -7.5850825 0.238645231 8.25963304 -23.79392742 -12.87167314  -7.3227675
## beta1   0.1191196 0.002029352 0.05530996   0.03283342   0.07616924   0.1139842
## beta2 -11.3009207 0.234998681 6.63873371 -25.46294887 -15.66372867 -10.6314233
## beta3  -5.8660000 0.117553059 3.36303271 -13.15321002  -7.94050932  -5.5947049
## lp__   -3.5618976 0.052208909 1.56583457  -7.47509353  -4.32642058  -3.2141120
##             75%       97.5%     n_eff     Rhat
## alpha -1.989730  7.98430898 1197.8874 1.002824
## beta1  0.155305  0.23605404  742.8340 1.005345
## beta2 -6.489268  0.06378046  798.0675 1.005878
## beta3 -3.344317 -0.38311873  818.4541 1.005582
## lp__  -2.426563 -1.61383403  899.5028 1.003230
## 
## $c_summary
## , , chains = chain:1
## 
##          stats
## parameter        mean         sd         2.5%          25%         50%
##     alpha  -7.1374827 8.58192258 -23.16435569 -12.41705274  -6.8554579
##     beta1   0.1223059 0.05360008   0.03348664   0.08040231   0.1187725
##     beta2 -11.7560875 6.49422125 -25.29817263 -15.91017703 -11.1969360
##     beta3  -6.0736655 3.22884746 -13.20091917  -8.07963454  -5.8723206
##     lp__   -3.6492690 1.63843614  -7.40346386  -4.39417786  -3.3129407
##          stats
## parameter        75%      97.5%
##     alpha -1.6349249  8.8100678
##     beta1  0.1569497  0.2339462
##     beta2 -7.3129103  0.1168572
##     beta3 -3.6292457 -0.7860101
##     lp__  -2.5105662 -1.6427724
## 
## , , chains = chain:2
## 
##          stats
## parameter        mean         sd         2.5%          25%         50%
##     alpha  -8.5760149 8.38065678 -24.61192617 -14.73782328  -8.4128073
##     beta1   0.1165139 0.05121547   0.03703342   0.07729044   0.1111404
##     beta2 -10.7509531 6.35714055 -23.47335522 -14.75605091 -10.3987625
##     beta3  -5.6872043 3.17535711 -12.49332460  -7.63828962  -5.6244599
##     lp__   -3.6290916 1.57712954  -7.94781147  -4.34133395  -3.2545052
##          stats
## parameter        75%      97.5%
##     alpha -2.5874394  7.0135145
##     beta1  0.1505257  0.2214264
##     beta2 -6.1829809  0.5193729
##     beta3 -3.2582870 -0.3351308
##     lp__  -2.4968617 -1.6489169
## 
## , , chains = chain:3
## 
##          stats
## parameter        mean         sd         2.5%          25%         50%
##     alpha  -7.2217676 7.94872246 -23.64136645 -12.13882902  -6.8448166
##     beta1   0.1248628 0.05819657   0.03693653   0.07801188   0.1167892
##     beta2 -11.9919909 6.88636722 -26.96220345 -17.00010939 -10.6608585
##     beta3  -6.2372425 3.58467712 -13.79777285  -8.27190284  -5.8446802
##     lp__   -3.5516426 1.55427291  -7.27129086  -4.31626390  -3.1798308
##          stats
## parameter        75%      97.5%
##     alpha -1.5794022  7.2231405
##     beta1  0.1650802  0.2537860
##     beta2 -6.8025944 -0.9492990
##     beta3 -3.6602651 -0.3712388
##     lp__  -2.4203952 -1.6036944
## 
## , , chains = chain:4
## 
##          stats
## parameter        mean         sd         2.5%          25%         50%
##     alpha  -7.4050649 8.04213806 -23.89486520 -12.36092109  -7.1686868
##     beta1   0.1127955 0.05721108   0.02764516   0.06721111   0.1068595
##     beta2 -10.7046514 6.71335923 -24.63197951 -15.03573990 -10.3055611
##     beta3  -5.4658877 3.39739373 -12.70546025  -7.75257169  -5.0175899
##     lp__   -3.4175870 1.48122483  -6.88644305  -4.20428495  -3.1023001
##          stats
## parameter        75%      97.5%
##     alpha -2.2770356  7.8380828
##     beta1  0.1497252  0.2358256
##     beta2 -5.7478293  0.1374114
##     beta3 -2.8913688 -0.2214687
##     lp__  -2.2923531 -1.5808450

Nota. Observamos que, el modelo bayesiano, quedaría con los parametros siguientes:

Ln(y) = -7.4367 + 0.1208(x1) - 11.5586(x2) - 5.9160(x3)

# Visualización de resultados (por ejemplo, trazar distribuciones posteriores)
plot(ajuste)

## ci_level: 0.8 (80% intervals)

## outer_level: 0.95 (95% intervals)

Nota. Se observa que el modelo es adecuado.

A MODO PRÁCTICA

Regresión logística bayesiana

Rojas Rojas Henry

2023-09-23

ASUMIENDO NORMALIDAD

CONSIDERANDO LOS SIGUIENTES DATOS

NOS PIDEN