[1] "Computing LOOs..."
|
| | 0%
|
|= | 1%
|
|= | 2%
|
|== | 2%
|
|== | 3%
|
|== | 4%
|
|=== | 4%
|
|==== | 5%
|
|==== | 6%
|
|===== | 6%
|
|===== | 7%
|
|===== | 8%
|
|====== | 8%
|
|====== | 9%
|
|======= | 10%
|
|======== | 11%
|
|======== | 12%
|
|========= | 12%
|
|========= | 13%
|
|========= | 14%
|
|========== | 14%
|
|========== | 15%
|
|=========== | 16%
|
|============ | 16%
|
|============ | 17%
|
|============ | 18%
|
|============= | 18%
|
|============= | 19%
|
|============== | 20%
|
|=============== | 21%
|
|=============== | 22%
|
|================ | 22%
|
|================ | 23%
|
|================ | 24%
|
|================= | 24%
|
|================== | 25%
|
|================== | 26%
|
|=================== | 26%
|
|=================== | 27%
|
|=================== | 28%
|
|==================== | 28%
|
|==================== | 29%
|
|===================== | 30%
|
|====================== | 31%
|
|====================== | 32%
|
|======================= | 32%
|
|======================= | 33%
|
|======================= | 34%
|
|======================== | 34%
|
|======================== | 35%
|
|========================= | 36%
|
|========================== | 36%
|
|========================== | 37%
|
|========================== | 38%
|
|=========================== | 38%
|
|=========================== | 39%
|
|============================ | 40%
|
|============================= | 41%
|
|============================= | 42%
|
|============================== | 42%
|
|============================== | 43%
|
|============================== | 44%
|
|=============================== | 44%
|
|================================ | 45%
|
|================================ | 46%
|
|================================= | 46%
|
|================================= | 47%
|
|================================= | 48%
|
|================================== | 48%
|
|================================== | 49%
|
|=================================== | 50%
|
|==================================== | 51%
|
|==================================== | 52%
|
|===================================== | 52%
|
|===================================== | 53%
|
|===================================== | 54%
|
|====================================== | 54%
|
|====================================== | 55%
|
|======================================= | 56%
|
|======================================== | 56%
|
|======================================== | 57%
|
|======================================== | 58%
|
|========================================= | 58%
|
|========================================= | 59%
|
|========================================== | 60%
|
|=========================================== | 61%
|
|=========================================== | 62%
|
|============================================ | 62%
|
|============================================ | 63%
|
|============================================ | 64%
|
|============================================= | 64%
|
|============================================== | 65%
|
|============================================== | 66%
|
|=============================================== | 66%
|
|=============================================== | 67%
|
|=============================================== | 68%
|
|================================================ | 68%
|
|================================================ | 69%
|
|================================================= | 70%
|
|================================================== | 71%
|
|================================================== | 72%
|
|=================================================== | 72%
|
|=================================================== | 73%
|
|=================================================== | 74%
|
|==================================================== | 74%
|
|==================================================== | 75%
|
|===================================================== | 76%
|
|====================================================== | 76%
|
|====================================================== | 77%
|
|====================================================== | 78%
|
|======================================================= | 78%
|
|======================================================= | 79%
|
|======================================================== | 80%
|
|========================================================= | 81%
|
|========================================================= | 82%
|
|========================================================== | 82%
|
|========================================================== | 83%
|
|========================================================== | 84%
|
|=========================================================== | 84%
|
|============================================================ | 85%
|
|============================================================ | 86%
|
|============================================================= | 86%
|
|============================================================= | 87%
|
|============================================================= | 88%
|
|============================================================== | 88%
|
|============================================================== | 89%
|
|=============================================================== | 90%
|
|================================================================ | 91%
|
|================================================================ | 92%
|
|================================================================= | 92%
|
|================================================================= | 93%
|
|================================================================= | 94%
|
|================================================================== | 94%
|
|================================================================== | 95%
|
|=================================================================== | 96%
|
|==================================================================== | 96%
|
|==================================================================== | 97%
|
|==================================================================== | 98%
|
|===================================================================== | 98%
|
|===================================================================== | 99%
|
|======================================================================| 100%
[1] "Performing the selection using all the data.."
[1] "20% of terms selected."
[1] "50% of terms selected."
[1] "70% of terms selected."
[1] "100% of terms selected."
[1] "Done."
---
title: "Chart Stack (Scrolling)"
Author: "Hyunsu Ju"
output:
flexdashboard::flex_dashboard:
theme: sandstone
vertical_layout: scroll
social: ["facebook","twitter","menu"]
source_code: embed
navbar:
- {title: "About", href: "https://stat.columbia.edu/~gelman/research/unpublished/loo_stan.pdf", align: left}
---
```{r setup, include=FALSE}
library(flexdashboard)
library(rstanarm)
options(mc.cores = parallel::detectCores())
library(loo)
library(tidyverse)
library(GGally)
library(bayesplot)
theme_set(bayesplot::theme_default())
library(projpred)
SEED=87
```
Page 1
===========
### Chart A
```{r, fig.width=15, fig.height=7}
set.seed(SEED)
df <- tibble(
pos.tot = runif(200,min=0.8,max=1.0),
urban.tot = pmin(runif(200,min=0.0,max=0.02),1.0 - pos.tot),
neg.tot = (1.0 - pmin(pos.tot + urban.tot,1)),
x1= pmax(pos.tot - runif(200,min=0.05,max=0.30),0),
x3= pmax(neg.tot - runif(200,min=0.0,max=0.10),0),
x2= pmax(pos.tot - x1 - x3/2,0),
x4= pmax(1 - x1 - x2 - x3 - urban.tot,0))
# true model and 200 Poisson observations
mean.y <- exp(-5.8 + 6.3*df$x1 + 15.2*df$x2)
df$y <- rpois(200,mean.y)
ggpairs(df,diag=list(continuous="barDiag"))
```
### Bayesian inference
From this point onwards we switch to a Bayesian approach. The rstanarm package provides the stan_glm function, which accepts the same arguments as glm but performs full Bayesian inference using Stan (mc-stan.org). By default, a weakly informative Gaussian prior is used for the regression weights.
```{r}
fitg <- stan_glm(y ~ x1 + x2 + x3 + x4, data = df, na.action = na.fail, family=poisson(), QR=TRUE, seed=SEED)
```
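To see exactly which priors were used (rstanarm rescales its default priors based on the data), `prior_summary()` prints them; this is an optional check added here:
```{r}
# Report the priors used in the fit, including any automatic rescaling
prior_summary(fitg)
```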
### Summary
```{r}
summary(fitg)
```
### Data Table
```{r}
DT::datatable(df,options=list(bPaginate=FALSE))
```
### Collinearity problem as with MLEs
```{r}
mcmc_areas(as.matrix(fitg),prob_outer = .99)
```
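The overlapping, wide marginals reflect strong posterior correlations between the coefficients. As an optional numerical check (added here as a sketch), the correlation matrix of the posterior draws makes this explicit:
```{r}
# Correlations of the posterior draws; entries near -1 or 1 signal the
# collinearity that inflates the marginal uncertainties
round(cor(as.matrix(fitg)), 2)
```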
### Improvement on predictions
With even more variables, some relevant and some irrelevant, it becomes difficult to analyse the joint posterior to see which variables are jointly relevant. We can easily test whether any of the covariates are useful by using cross-validation to compare against a null model.
```{r}
fitg0 <- stan_glm(y ~ 1, data = df, na.action = na.fail, family=poisson(), seed=SEED)
(loog <- loo(fitg))
(loog0 <- loo(fitg0))
loo_compare(loog0, loog)
```
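As an optional complement (a sketch using the loo package, not part of the original comparison), stacking-based model weights should concentrate almost entirely on the covariate model:
```{r}
# Stacking weights over the null and full models; the full model should
# receive essentially all of the weight
loo_model_weights(list(fitg0 = loog0, fitg = loog))
```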
Based on cross-validation, the covariates together contain significant information that improves the predictions.
### Variable selection
Piironen and Vehtari (2017) also show that a projection predictive approach can be used for model reduction, that is, choosing a smaller model with some coefficients set to 0. The projection predictive approach solves the problem of how to do inference after the selection: the full model posterior is projected onto the restricted subspace. See Piironen et al. (2018) for more details.
We perform the projection predictive variable selection using the previous full model. A fast leave-one-out cross-validation approach (Vehtari, Gelman and Gabry, 2017) is used to choose the model size.
```{r}
library(projpred)
fitg_cv <- cv_varsel(fitg, method='forward', cv_method='LOO',cores=16)
```
### Recommendation for the model size to choose based on a LOO
```{r}
plot(fitg_cv)
```
### Selection model results
```{r}
(nv <- suggest_size(fitg_cv, alpha=0.1))
projg <- project(fitg_cv, nv = nv, ns = 4000)
round(colMeans(as.matrix(projg)),1)
mcmc_areas(as.matrix(projg) ,prob_outer = .99)
```
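Because the data were simulated in the Chart A chunk with known coefficients (intercept -5.8, x1 6.3, x2 15.2 on the log scale), we can sanity-check the projected submodel against the truth; this comparison is an addition, not part of the original selection:
```{r}
# True data-generating coefficients from the simulation chunk on Page 1
truth <- c(Intercept = -5.8, x1 = 6.3, x2 = 15.2)
# Posterior means of the projected submodel should land close to these
round(colMeans(as.matrix(projg)), 1)
truth
```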
### Crude map of the world with capital cities
```{r}
library(ggplot2)
library(plotly)
library(maps)
library(rbokeh)
data(world.cities)
caps <- subset(world.cities, capital == 1)
caps$population <- prettyNum(caps$pop, big.mark = ",")
figure(width = 800, height = 450, padding_factor = 0) %>%
ly_map("world", col = "gray") %>%
ly_points(long, lat, data = caps, size = 5,
hover = c(name, country.etc, population))
```
### HighChart
A dynamic `Highcharts` chart.
```{r}
library(highcharter)
data(penguins, package = "palmerpenguins")
hchart(penguins, "scatter", hcaes(x=flipper_length_mm, y=bill_length_mm, group=species))
```
Page 2
===========
Row1
-----------------
### Articles per Day
```{r}
articles<-100
valueBox(articles, icon="fa-pencil")
```
### Comments per Day
```{r}
comments<-200
valueBox(comments, icon="fa-comments")
```
### Spam per Day
```{r}
spam<-300
valueBox(spam, icon="fa-trash",
color=ifelse(spam>10,"warning","primary"))
```
Row2
--------
### Contact Rate
```{r}
gauge(70, min = 0, max = 100, symbol = '%', gaugeSectors(
success = c(80, 100), warning = c(40, 79), danger = c(0, 39)
))
```
### Average Rating
```{r}
gauge(23, min = 0, max = 50, gaugeSectors(
success = c(41, 50), warning = c(21, 40), danger = c(0, 20)
))
```
### Cancellations
```{r}
gauge(7, min = 0, max = 10, gaugeSectors(
success = c(0, 2), warning = c(3, 6), danger = c(7, 10)
))
```
Page 3
===========
```{r}
library(plotly)
fig <- plot_ly(
domain = list(x = c(0, 1), y = c(0, 1)),
value = 270,
title = list(text = "Speed"),
type = "indicator",
mode = "gauge+number")
fig <- fig %>%
layout(margin = list(l=20,r=30))
fig
```
Page 4
=============
Row1
------------------
```{r}
library(tidyverse)
library(data.table)
library(dplyr)
library(ggplot2)
library(lubridate)
library(stringr)
library(plotly)
## Reading in files
library(igraph)
library(threejs)
library(ggraph)
library(tidygraph)
start.watch.date <- as.Date('2020-03-01')
watched.countries.count <- 12
covid <- data.table(read.csv('covid_19_data.csv'))
krcovid <- data.table(read.csv('PatientInfo.csv'))
# Fix mistyped years (2002 -> 2020) in confirmed_date
krcovid[,confirmed_date:=as.Date(gsub('2002','2020',as.character(confirmed_date)))]
krcovid[,symptom_onset_date:=as.Date(as.character(symptom_onset_date))]
krcovid[,released_date:=as.Date(as.character(released_date))]
krcovid[,deceased_date:=as.Date(as.character(deceased_date),format='%Y-%m-%d')]
#krcovid[,table(symptom_onset_date==confirmed_date)]
#summary(krcovid)
#head(covid)
krcovid[,closed_date:=coalesce(released_date,deceased_date)]
krcovid[,closed.days:=as.integer(difftime(closed_date,confirmed_date,units='days'))]
ggplot(krcovid[,.(n=.N,median_age=2020-median(birth_year,na.rm=T)),.(closed.days,confirmed_date,state)][,state:=ifelse(state=='deceased','death','recovered')])+
geom_point(aes(y=closed.days,x=confirmed_date,size=n,col=state))+
#geom_text(aes(y=closed.days,x=confirmed_date,label=median_age,col=state,hjust='left'))+
scale_x_date(date_labels = '%m/%d',date_breaks = '3 days')+
theme(axis.text.x=element_text(angle = 60, vjust = 0.5))+
scale_color_brewer(palette = 'Set1')+
labs(title='Days to close the cases in Korea')
covid[,Last.Update:=as.Date(Last.Update,format='%m/%d/%Y')]
covid[str_length(ObservationDate)==8,ObservationDate1:=as.Date(ObservationDate,format='%m/%d/%y')]
covid[str_length(ObservationDate)==10,ObservationDate1:=as.Date(ObservationDate,format='%m/%d/%Y')]
covid[,ObservationDate:=ObservationDate1]
covid$ObservationDate1=NULL
covid[grep('Korea',Country.Region),Country.Region:='Republic of Korea']
covid[grep('Iran',Country.Region),Country.Region:='Iran']
covid[grep('Taiwan',Province.State),Country.Region:='Taiwan']
covid[ grepl('Diamon',Province.State),Country.Region:='Diamond Princess']
str(covid)
# The JHU modified the country name for Taiwan, which is unfortunate both for the analysis and for human rights
## quick summary the cases by country
library(conflicted)
conflict_prefer("lag", "dplyr")
conflict_prefer("setdiff", "dplyr")
covid.by.country <- covid[,.(Confirmed=sum(Confirmed),Recovered=sum(Recovered),Deaths=sum(Deaths)),.(ObservationDate,Country.Region)]
setorder(covid.by.country,Country.Region,ObservationDate)
covid.by.country[,Daily.Confirmed:=Confirmed-lag(Confirmed,1),.(Country.Region)]
covid.by.country[,Daily.Recovered:=Recovered-lag(Recovered,1),.(Country.Region)]
covid.by.country[,Daily.Deaths:=Deaths-lag(Deaths,1),.(Country.Region)]
lookup.countries <- covid.by.country[order(-ObservationDate,Country.Region)][,head(.SD,1),.(Country.Region)][order(-Deaths),Country.Region]
lookup.countries <- setdiff(lookup.countries,grep('Diamon',lookup.countries,value=T))
```
Row2
------------------
```{r}
# Cap one extreme daily spike so the free-scale facets stay readable
ggplot(covid.by.country[grep('Italy|Korea',Country.Region)][,Daily.Confirmed:=ifelse(Daily.Confirmed>5000,2000,Daily.Confirmed)])+
geom_line(aes(x=ObservationDate,y=Daily.Recovered,col='Recovered'))+
geom_line(aes(x=ObservationDate,y=Daily.Deaths,col='Deaths'))+
geom_line(aes(x=ObservationDate,y=Daily.Confirmed,col='Confirmed'))+
facet_wrap(~Country.Region,scales = 'free',ncol=1)+
scale_x_date(date_labels = '%m/%d',date_breaks = '7 days')
```
Row3
------------------
```{r}
ggplot(covid.by.country[grep('Italy|Korea',Country.Region)][ObservationDate>'2020-03-01'])+
geom_bar(aes(x=ObservationDate,y=Daily.Deaths,fill='Deaths'),stat = 'identity')+
geom_line(aes(x=ObservationDate,y=Daily.Confirmed),col='blue')+
facet_wrap(~Country.Region,scales = 'free',ncol=1)+
scale_x_date(date_labels = '%m/%d',date_breaks = '2 days')
```
Row4
------------------
```{r}
ggplot(covid.by.country[Country.Region %in% lookup.countries[1:watched.countries.count]],aes(group=1))+
geom_line(aes(x=ObservationDate,y=Daily.Recovered,col='Recovered'))+
geom_line(aes(x=ObservationDate,y=Daily.Deaths,col='Deaths'))+
geom_line(aes(x=ObservationDate,y=Daily.Confirmed,col='Confirmed'))+
facet_wrap(~Country.Region,ncol=3,scales='free_y')+
scale_color_brewer(palette = 'Set1',name='Value Type')+
scale_x_date(date_breaks = '7 days',date_labels = '%m/%d')+
theme(axis.text.x=element_text(angle = 90, vjust = 0.5))+
labs(title='COVID-19 daily cases')
```