Paquetes Requeridos
library(ggplot2)
library(tidyr)
library(gridExtra)
library(rstanarm)
library(rstan)
library(bayesplot)
theme_set(theme_minimal()) #ggplot theme
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
setwd("C:/Users/Cristhian/Desktop/Rstan")
El presente Código muestra un análisis de series de tiempo del número de muertes ocurridas por accidentes de tránsito al año en Finlandia.
Importamos el registro de muertes por accidente de tránsito en Finlandia desde el año 1993 al 2016
deaths <- read.csv("./trafficdeaths.csv")
head(deaths)
## year deaths
## 1 1993 434
## 2 1994 423
## 3 1995 411
## 4 1996 355
## 5 1997 391
## 6 1998 367
Realizando un gráfico para observar la distribución de los datos
ggplot() +
geom_point(aes(year, deaths), data = deaths, size = 1) +
labs(y = 'Muertes por accidentes de Tránsito', x= "Año") +
guides(linetype = F)
El número de muertes son count data (datos en el que las observaciones solo son valores enteros no negativos), por lo que usamos el modelo Poisson. Primero Fijamos el modelo log-lineal para la intensidad de Poisson, que corresponde a suponer un cambio proporcional constante.
fit_lin <- stan_glm(deaths ~ year, data=deaths, family="poisson",
refresh=100, iter=1000, chains=4, seed=1234, refresh=0)
Veamos la distribución predictiva posterior (mediana e intervalos de 5% y 95%).
x_predict <- seq(1993,2025)
N_predict <- length(x_predict)
y_predict <- posterior_predict(fit_lin, newdata=data.frame(year=x_predict))
mu <- apply(t(y_predict), 1, quantile, c(0.05, 0.5, 0.95)) %>%
t() %>% data.frame(x = x_predict, .) %>% gather(pct, y, -x)
pfit <- ggplot() +
geom_point(aes(year, deaths), data = deaths, size = 1) +
geom_line(aes(x, y, linetype = pct), data = mu, color = 'blue') +
scale_linetype_manual(values = c(2,1,2)) +
labs(x = 'Año', y = 'Muertes por accidentes de Tránsito') +
guides(linetype = F)
(pfit)
A continuación, ajustamos un modelo “spline no lineal” con stan_gamm4 (permite que los predictores designados tengan un efecto no lineal)
fit_gam <- stan_gamm4(deaths ~ year + s(year), data=deaths,
family="poisson", adapt_delta=0.999,
refresh=100, iter=1000, chain=4, seed=1234, refresh=0)
monitor(fit_gam$stanfit)
## Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):
##
## Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
## (Intercept) -71.0 41.7 145.3 41.2 73.1 1 526 520
## year -0.1 0.0 0.0 0.0 0.0 1 526 520
## s(year).1 -0.7 0.1 1.1 0.1 0.6 1 1444 1259
## s(year).2 -1.6 -0.4 0.2 -0.5 0.6 1 1005 1266
## s(year).3 -1.0 -0.1 0.6 -0.2 0.5 1 1383 1444
## s(year).4 -0.5 0.0 0.7 0.1 0.4 1 1653 1543
## s(year).5 -0.8 -0.1 0.4 -0.2 0.3 1 1546 1421
## s(year).6 0.1 0.5 0.9 0.5 0.2 1 707 678
## s(year).7 -0.1 0.2 0.5 0.2 0.2 1 1139 1755
## s(year).8 -0.9 -0.3 0.1 -0.4 0.3 1 855 1387
## s(year).9 -2.0 0.0 2.0 0.1 1.3 1 588 574
## smooth_sd[s(year)1] 0.1 0.5 1.1 0.5 0.3 1 417 605
## smooth_sd[s(year)2] 0.1 0.7 2.8 1.0 0.9 1 828 792
## mean_PPD 319.9 329.2 338.3 329.2 5.4 1 2238 1711
## log-posterior -134.5 -127.9 -123.5 -128.3 3.4 1 407 806
##
## For each parameter, Bulk_ESS and Tail_ESS are crude measures of
## effective sample size for bulk and tail quantities respectively (an ESS > 100
## per chain is considered good), and Rhat is the potential scale reduction
## factor on rank normalized split chains (at convergence, Rhat <= 1.05).
Construyendo la distribución predictiva posterior.
x_predict=seq(1993,2025)
N_predict=length(x_predict)
y_predict <- posterior_predict(fit_gam, newdata=data.frame(year=x_predict))
mu <- apply(t(y_predict), 1, quantile, c(0.05, 0.5, 0.95)) %>%
t() %>% data.frame(x = x_predict, .) %>% gather(pct, y, -x)
pfit <- ggplot() +
geom_point(aes(year, deaths), data = deaths, size = 1) +
geom_line(aes(x, y, linetype = pct), data = mu, color = 'blue') +
scale_linetype_manual(values = c(2,1,2)) +
labs(x = 'Año', y = 'Muertes por accidentes de Tránsito') +
guides(linetype = F)
(pfit)
La mediana predictiva es claramente no lineal. La media de predicción para los años futuros se mantiene al mismo nivel que las observaciones más recientes, pero la incertidumbre aumenta rápidamente. Finalmente ajustamos el proceso gaussiano centrado en el modelo lineal usando el lenguaje Stan:
N<-nrow(deaths)
Ey<-mean(deaths$deaths)
d_data <- list(N=N, x=deaths$year, y=deaths$deaths, Ey=Ey, N_predict=N_predict,
x_predict=x_predict, alpha0=2, beta0=4)
fit_gp <- stan("./poisson_gp.stan", data=d_data, refresh=100, iter=1000,
chains=4, seed=1234, init=0, control=list(adapt_delta=0.999))
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
monitor(fit_gp)
## Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):
##
## Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
## lscale 0.9 2.2 5.6 2.6 1.7 1.01 299 524
## alpha 0.0 0.1 0.2 0.1 0.1 1.00 656 560
## a -0.1 0.0 0.1 0.0 0.0 1.00 1257 1033
## b 0.0 0.0 0.0 0.0 0.0 1.00 1425 1062
## z[1] -1.4 0.0 1.3 0.0 0.8 1.00 1774 1519
## z[2] -1.6 -0.3 0.9 -0.4 0.8 1.00 1838 1558
## z[3] -1.8 -0.4 1.1 -0.4 0.9 1.00 1445 1576
## z[4] -1.5 -0.1 1.4 -0.1 0.9 1.00 995 1628
## z[5] -1.3 0.2 1.7 0.2 0.9 1.00 2473 1489
## z[6] -1.3 0.1 1.6 0.2 0.9 1.00 2206 1403
## z[7] -1.3 0.3 1.7 0.3 0.9 1.00 2141 1137
## z[8] -1.3 0.2 1.6 0.2 0.9 1.00 2450 1642
## z[9] -1.3 0.2 1.8 0.2 0.9 1.00 2193 1628
## z[10] -1.4 0.1 1.5 0.1 0.9 1.00 2251 1129
## z[11] -1.4 0.1 1.7 0.1 0.9 1.00 2142 1419
## z[12] -1.3 0.3 1.9 0.3 1.0 1.00 1733 1377
## z[13] -1.3 0.3 1.9 0.3 1.0 1.01 885 1340
## z[14] -1.7 0.1 1.7 0.0 1.0 1.00 959 1496
## z[15] -2.0 -0.3 1.6 -0.3 1.1 1.00 773 1042
## z[16] -2.3 -0.7 0.9 -0.7 1.0 1.00 1683 1728
## z[17] -2.3 -0.6 1.0 -0.6 1.0 1.00 875 1490
## z[18] -1.7 -0.2 1.4 -0.2 1.0 1.00 1799 1545
## z[19] -1.7 -0.1 1.5 -0.1 0.9 1.00 1539 1452
## z[20] -1.7 -0.2 1.5 -0.2 0.9 1.00 1296 1336
## z[21] -1.3 0.1 1.7 0.2 0.9 1.00 1827 1506
## z[22] -1.3 0.2 1.8 0.2 1.0 1.01 2420 1524
## z[23] -1.4 0.3 1.8 0.3 1.0 1.00 2396 1535
## z[24] -1.6 0.1 1.7 0.1 1.0 1.00 2787 1517
## mu_predict[1] 409.7 433.5 460.5 434.0 15.7 1.00 2508 1501
## mu_predict[2] 398.8 418.8 440.0 419.0 12.5 1.00 1885 1761
## mu_predict[3] 383.9 401.3 419.1 401.7 11.3 1.00 2115 1822
## mu_predict[4] 367.4 387.0 405.0 386.7 11.6 1.00 1404 1659
## mu_predict[5] 361.1 380.3 399.2 380.4 11.6 1.00 1999 1832
## mu_predict[6] 359.2 377.0 395.5 377.1 10.9 1.00 2540 1913
## mu_predict[7] 358.0 374.3 392.6 374.8 10.6 1.00 2229 1583
## mu_predict[8] 353.9 370.7 389.4 371.0 10.8 1.00 2253 1728
## mu_predict[9] 350.6 367.1 385.8 367.4 10.8 1.00 2021 1627
## mu_predict[10] 342.5 359.2 378.4 359.5 10.9 1.00 1784 1750
## mu_predict[11] 330.1 347.8 364.9 347.4 10.8 1.00 1464 1499
## mu_predict[12] 319.3 338.2 355.5 338.0 11.2 1.00 811 1614
## mu_predict[13] 316.8 333.3 350.3 333.3 10.3 1.00 1196 1434
## mu_predict[14] 312.3 328.7 347.9 329.3 10.9 1.00 1330 1453
## mu_predict[15] 302.7 320.0 343.2 321.1 12.5 1.00 729 1653
## mu_predict[16] 287.4 301.8 320.4 302.4 10.1 1.00 979 1590
## mu_predict[17] 262.2 280.2 293.5 279.4 9.7 1.00 1277 1551
## mu_predict[18] 247.0 264.9 280.5 264.5 10.1 1.00 990 1492
## mu_predict[19] 242.0 257.2 273.7 257.5 9.8 1.00 1394 1721
## mu_predict[20] 237.9 252.4 267.4 252.5 9.1 1.00 1623 1739
## mu_predict[21] 236.5 250.5 264.3 250.2 8.5 1.00 1787 1678
## mu_predict[22] 236.8 249.7 263.5 250.0 8.5 1.00 2232 1720
## mu_predict[23] 237.2 251.4 267.8 251.6 9.4 1.00 1587 1828
## mu_predict[24] 232.8 249.3 269.3 249.8 11.3 1.00 2084 1877
## mu_predict[25] 220.1 243.6 271.5 244.8 16.1 1.00 1992 1625
## mu_predict[26] 208.6 236.9 274.9 239.1 21.1 1.00 1766 1746
## mu_predict[27] 198.2 230.1 276.6 233.0 24.4 1.00 1603 1517
## mu_predict[28] 189.0 223.6 274.5 226.9 26.8 1.00 1526 1661
## mu_predict[29] 182.6 218.0 273.4 221.4 28.9 1.00 1542 1325
## mu_predict[30] 176.9 212.0 268.7 215.8 30.6 1.00 1472 1207
## mu_predict[31] 170.4 205.9 264.8 210.4 32.0 1.00 1516 1305
## mu_predict[32] 164.9 200.1 259.7 205.0 33.2 1.00 1544 1249
## mu_predict[33] 159.3 194.6 256.5 199.9 34.2 1.00 1567 1230
## y_predict[1] 391.0 432.0 477.0 432.8 26.2 1.00 2186 1863
## y_predict[2] 379.0 418.0 459.0 418.3 24.1 1.00 1922 1854
## y_predict[3] 364.0 401.0 440.0 401.6 23.3 1.00 1915 1868
## y_predict[4] 347.0 386.0 422.0 386.0 22.7 1.00 1898 2002
## y_predict[5] 343.0 381.0 417.0 380.0 22.4 1.00 2098 1880
## y_predict[6] 341.0 376.5 417.0 377.3 22.8 1.00 2229 1809
## y_predict[7] 339.0 375.0 414.0 375.2 23.0 1.00 1932 1946
## y_predict[8] 336.0 371.0 407.0 371.7 21.5 1.00 1982 2073
## y_predict[9] 331.0 367.0 403.0 367.1 21.9 1.00 1892 2030
## y_predict[10] 324.0 359.0 395.0 359.4 21.3 1.00 1837 1641
## y_predict[11] 312.0 347.0 383.0 346.9 21.7 1.00 1973 1868
## y_predict[12] 302.0 339.0 373.0 338.1 21.3 1.00 1377 2016
## y_predict[13] 300.0 333.0 370.0 333.8 21.4 1.00 1786 1754
## y_predict[14] 296.0 329.0 365.0 329.1 21.0 1.00 1944 1870
## y_predict[15] 285.0 321.0 359.0 321.1 22.1 1.00 1230 1903
## y_predict[16] 271.0 303.0 335.0 302.9 19.7 1.00 1624 1762
## y_predict[17] 245.0 279.0 313.0 279.5 19.8 1.00 1929 1968
## y_predict[18] 233.0 265.0 297.0 264.9 19.2 1.00 1763 1842
## y_predict[19] 226.0 257.0 290.0 257.2 19.1 1.00 1826 1880
## y_predict[20] 222.0 252.0 283.0 252.4 18.5 1.00 1979 1924
## y_predict[21] 219.0 250.0 279.0 249.7 17.6 1.00 1874 1978
## y_predict[22] 220.0 249.0 279.0 249.7 17.8 1.00 1923 1873
## y_predict[23] 222.0 252.0 283.0 251.8 18.2 1.00 1651 1812
## y_predict[24] 218.0 249.0 282.0 249.6 19.9 1.00 2145 1908
## y_predict[25] 211.0 244.0 285.0 244.7 22.3 1.00 2077 1785
## y_predict[26] 200.0 239.0 284.0 239.5 26.0 1.00 1779 1818
## y_predict[27] 191.0 231.0 281.0 232.9 28.6 1.00 1481 1620
## y_predict[28] 182.0 224.0 281.0 227.3 31.0 1.00 1486 1719
## y_predict[29] 175.0 219.0 276.0 221.0 31.9 1.00 1666 1754
## y_predict[30] 168.0 212.0 276.0 215.7 34.0 1.00 1570 1392
## y_predict[31] 163.0 206.0 269.0 210.5 34.7 1.00 1506 1335
## y_predict[32] 159.0 201.0 264.1 205.4 35.9 1.00 1567 1418
## y_predict[33] 153.9 195.0 262.0 199.6 37.2 1.00 1628 1295
## log_lik[1] -5.1 -4.1 -4.0 -4.2 0.4 1.00 1656 1689
## log_lik[2] -4.8 -4.0 -3.9 -4.1 0.3 1.00 1823 1680
## log_lik[3] -4.9 -4.1 -3.9 -4.2 0.4 1.00 2013 1895
## log_lik[4] -7.1 -5.2 -4.1 -5.4 0.9 1.00 1388 1602
## log_lik[5] -5.1 -4.1 -3.9 -4.2 0.4 1.00 1932 2001
## log_lik[6] -4.9 -4.0 -3.9 -4.2 0.4 1.00 2259 1912
## log_lik[7] -5.4 -4.3 -3.9 -4.4 0.5 1.00 2492 2013
## log_lik[8] -5.5 -4.2 -3.9 -4.4 0.6 1.00 2138 1943
## log_lik[9] -5.6 -4.4 -3.9 -4.5 0.6 1.00 2015 1837
## log_lik[10] -4.7 -4.0 -3.9 -4.1 0.3 1.00 1938 1927
## log_lik[11] -4.5 -3.9 -3.8 -4.0 0.3 1.00 1883 1713
## log_lik[12] -5.4 -4.2 -3.8 -4.3 0.5 1.00 817 1821
## log_lik[13] -4.9 -4.0 -3.8 -4.1 0.4 1.00 956 1375
## log_lik[14] -4.7 -3.9 -3.8 -4.0 0.3 1.00 1920 1929
## log_lik[15] -7.2 -5.1 -3.9 -5.3 1.0 1.00 763 1645
## log_lik[16] -5.7 -4.4 -3.8 -4.5 0.6 1.00 1059 1717
## log_lik[17] -6.3 -4.9 -3.8 -4.9 0.8 1.00 1293 1532
## log_lik[18] -6.6 -4.8 -3.8 -4.9 0.9 1.00 997 1421
## log_lik[19] -5.6 -4.2 -3.7 -4.4 0.7 1.00 1295 1552
## log_lik[20] -5.7 -4.3 -3.7 -4.4 0.7 1.00 1563 1673
## log_lik[21] -4.6 -3.8 -3.7 -4.0 0.4 1.00 1811 1780
## log_lik[22] -6.0 -4.5 -3.8 -4.7 0.7 1.00 2220 1703
## log_lik[23] -5.9 -4.4 -3.7 -4.6 0.7 1.00 1683 1953
## log_lik[24] -4.7 -3.8 -3.7 -4.0 0.4 1.00 1823 1677
## lp__ 37989.1 37997.9 38004.2 37997.4 4.6 1.00 547 1011
##
## For each parameter, Bulk_ESS and Tail_ESS are crude measures of
## effective sample size for bulk and tail quantities respectively (an ESS > 100
## per chain is considered good), and Rhat is the potential scale reduction
## factor on rank normalized split chains (at convergence, Rhat <= 1.05).
Calculemos la distribución predictiva posterior.
gp_params <- extract(fit_gp)
mu <- apply(t(gp_params$y_predict), 1, quantile, c(0.05, 0.5, 0.95)) %>%
t() %>% data.frame(x = x_predict, .) %>% gather(pct, y, -x)
pfit <- ggplot() +
geom_point(aes(year, deaths), data = deaths, size = 1) +
geom_line(aes(x, y, linetype = pct), data = mu, color = 'blue') +
scale_linetype_manual(values = c(2,1,2)) +
labs(x = 'Anio', y = 'Muertes por Accidentes de Transito') +
guides(linetype = F)
(pfit)
No existen diferencias prácticas en el rendimiento predictivo, lo que se debe en parte al pequeño número de observaciones. Sobre la base de las distribuciones predictivas posteriores, existen claras diferencias en las predicciones futuras.
Código disponible repositorio