[1] "Computing LOOs..."
|
| | 0%
|
|= | 1%
|
|= | 2%
|
|== | 2%
|
|== | 3%
|
|== | 4%
|
|=== | 4%
|
|==== | 5%
|
|==== | 6%
|
|===== | 6%
|
|===== | 7%
|
|===== | 8%
|
|====== | 8%
|
|====== | 9%
|
|======= | 10%
|
|======== | 11%
|
|======== | 12%
|
|========= | 12%
|
|========= | 13%
|
|========= | 14%
|
|========== | 14%
|
|========== | 15%
|
|=========== | 16%
|
|============ | 16%
|
|============ | 17%
|
|============ | 18%
|
|============= | 18%
|
|============= | 19%
|
|============== | 20%
|
|=============== | 21%
|
|=============== | 22%
|
|================ | 22%
|
|================ | 23%
|
|================ | 24%
|
|================= | 24%
|
|================== | 25%
|
|================== | 26%
|
|=================== | 26%
|
|=================== | 27%
|
|=================== | 28%
|
|==================== | 28%
|
|==================== | 29%
|
|===================== | 30%
|
|====================== | 31%
|
|====================== | 32%
|
|======================= | 32%
|
|======================= | 33%
|
|======================= | 34%
|
|======================== | 34%
|
|======================== | 35%
|
|========================= | 36%
|
|========================== | 36%
|
|========================== | 37%
|
|========================== | 38%
|
|=========================== | 38%
|
|=========================== | 39%
|
|============================ | 40%
|
|============================= | 41%
|
|============================= | 42%
|
|============================== | 42%
|
|============================== | 43%
|
|============================== | 44%
|
|=============================== | 44%
|
|================================ | 45%
|
|================================ | 46%
|
|================================= | 46%
|
|================================= | 47%
|
|================================= | 48%
|
|================================== | 48%
|
|================================== | 49%
|
|=================================== | 50%
|
|==================================== | 51%
|
|==================================== | 52%
|
|===================================== | 52%
|
|===================================== | 53%
|
|===================================== | 54%
|
|====================================== | 54%
|
|====================================== | 55%
|
|======================================= | 56%
|
|======================================== | 56%
|
|======================================== | 57%
|
|======================================== | 58%
|
|========================================= | 58%
|
|========================================= | 59%
|
|========================================== | 60%
|
|=========================================== | 61%
|
|=========================================== | 62%
|
|============================================ | 62%
|
|============================================ | 63%
|
|============================================ | 64%
|
|============================================= | 64%
|
|============================================== | 65%
|
|============================================== | 66%
|
|=============================================== | 66%
|
|=============================================== | 67%
|
|=============================================== | 68%
|
|================================================ | 68%
|
|================================================ | 69%
|
|================================================= | 70%
|
|================================================== | 71%
|
|================================================== | 72%
|
|=================================================== | 72%
|
|=================================================== | 73%
|
|=================================================== | 74%
|
|==================================================== | 74%
|
|==================================================== | 75%
|
|===================================================== | 76%
|
|====================================================== | 76%
|
|====================================================== | 77%
|
|====================================================== | 78%
|
|======================================================= | 78%
|
|======================================================= | 79%
|
|======================================================== | 80%
|
|========================================================= | 81%
|
|========================================================= | 82%
|
|========================================================== | 82%
|
|========================================================== | 83%
|
|========================================================== | 84%
|
|=========================================================== | 84%
|
|============================================================ | 85%
|
|============================================================ | 86%
|
|============================================================= | 86%
|
|============================================================= | 87%
|
|============================================================= | 88%
|
|============================================================== | 88%
|
|============================================================== | 89%
|
|=============================================================== | 90%
|
|================================================================ | 91%
|
|================================================================ | 92%
|
|================================================================= | 92%
|
|================================================================= | 93%
|
|================================================================= | 94%
|
|================================================================== | 94%
|
|================================================================== | 95%
|
|=================================================================== | 96%
|
|==================================================================== | 96%
|
|==================================================================== | 97%
|
|==================================================================== | 98%
|
|===================================================================== | 98%
|
|===================================================================== | 99%
|
|======================================================================| 100%
[1] "Performing the selection using all the data.."
[1] "20% of terms selected."
[1] "50% of terms selected."
[1] "70% of terms selected."
[1] "100% of terms selected."
[1] "Done."
---
title: "Chart Stack (Scrolling)"
author: "Hyunsu Ju"
output:
  flexdashboard::flex_dashboard:
    theme: sandstone
    vertical_layout: scroll
    social: ["facebook", "twitter", "menu"]
    source_code: embed
    navbar:
      - {title: "About", href: "https://stat.columbia.edu/~gelman/research/unpublished/loo_stan.pdf", align: left}
---
```{r setup, include=FALSE}
library(flexdashboard)
library(rstanarm)
options(mc.cores = parallel::detectCores())
library(loo)
library(tidyverse)
library(GGally)
library(bayesplot)
theme_set(bayesplot::theme_default())
library(projpred)
SEED=87
```
Page 1
===========
### Chart A
```{r, fig.width=15, fig.height=7}
set.seed(SEED)
df <- tibble(
  pos.tot = runif(200, min = 0.8, max = 1.0),
  urban.tot = pmin(runif(200, min = 0.0, max = 0.02), 1.0 - pos.tot),
  neg.tot = (1.0 - pmin(pos.tot + urban.tot, 1)),
  x1 = pmax(pos.tot - runif(200, min = 0.05, max = 0.30), 0),
  x3 = pmax(neg.tot - runif(200, min = 0.0, max = 0.10), 0),
  x2 = pmax(pos.tot - x1 - x3/2, 0),
  x4 = pmax(1 - x1 - x2 - x3 - urban.tot, 0))
# true model and 200 Poisson observations
mean.y <- exp(-5.8 + 6.3*df$x1 + 15.2*df$x2)
df$y <- rpois(200,mean.y)
ggpairs(df,diag=list(continuous="barDiag"))
```
### Bayesian inference
From this point onwards we switch to a Bayesian approach. The rstanarm package provides the stan_glm function, which accepts the same arguments as glm but performs full Bayesian inference using Stan (mc-stan.org). By default, a weakly informative Gaussian prior is placed on the regression weights.
```{r}
fitg <- stan_glm(y ~ x1 + x2 + x3 + x4, data = df, na.action = na.fail, family=poisson(), QR=TRUE, seed=SEED)
```
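To see exactly which weakly informative priors rstanarm used by default for this fit, `prior_summary()` can be called on the fitted object (a quick check added for illustration):

```{r}
# Inspect the default priors actually used for fitg
prior_summary(fitg)
```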
### Summary
```{r}
summary(fitg)
```
### Data Table
```{r}
DT::datatable(df,options=list(bPaginate=FALSE))
```
### Collinearity problem as with MLEs

Because the simulated shares x1, ..., x4 sum to approximately one, the covariates are highly collinear, so the marginal posteriors of the coefficients are very wide, just as the corresponding MLEs would have large standard errors.
```{r}
mcmc_areas(as.matrix(fitg),prob_outer = .99)
```
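A quick numerical check of this collinearity (a small sketch added here, not part of the original analysis) is the posterior correlation matrix of the coefficients:

```{r}
# Posterior correlations of the regression coefficients;
# large off-diagonal values reflect the collinearity of x1..x4
round(cor(as.matrix(fitg)[, c("x1", "x2", "x3", "x4")]), 2)
```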
### Improvement on predictions
With even more variables, some relevant and some irrelevant, it becomes difficult to analyse the joint posterior to see which variables are jointly relevant. We can, however, easily test whether any of the covariates carry useful information by using cross-validation to compare the full model to a null model.
```{r}
fitg0 <- stan_glm(y ~ 1, data = df, na.action = na.fail, family=poisson(), seed=SEED)
(loog <- loo(fitg))
(loog0 <- loo(fitg0))
loo_compare(loog0, loog)
```
Based on cross-validation, the covariates together contain significant information that improves predictions.
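A rough way to quantify this (a sketch using the normal approximation to the elpd difference, with the `loog0` and `loog` objects computed above):

```{r}
# elpd difference of the null model relative to the full model,
# expressed in units of its standard error (a large negative value
# indicates the covariates clearly improve predictions)
cmp <- loo_compare(loog0, loog)
cmp[2, "elpd_diff"] / cmp[2, "se_diff"]
```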
### Variable selection
Piironen and Vehtari (2017) also show that a projection predictive approach can be used for model reduction, that is, choosing a smaller model with some coefficients set to 0. The projection predictive approach solves the problem of how to do inference after selection: the full-model posterior is projected onto the restricted subspace. See Piironen et al. (2018) for details.
We perform the projection predictive variable selection using the previous full model. A fast leave-one-out cross-validation approach (Vehtari, Gelman and Gabry, 2017) is used to choose the model size.
```{r}
library(projpred)
fitg_cv <- cv_varsel(fitg, method='forward', cv_method='LOO',cores=16)
```
### Recommendation for the model size based on LOO
```{r}
plot(fitg_cv)
```
### Selection model results
```{r}
(nv <- suggest_size(fitg_cv, alpha=0.1))
projg <- project(fitg_cv, nv = nv, ns = 4000)
round(colMeans(as.matrix(projg)),1)
mcmc_areas(as.matrix(projg) ,prob_outer = .99)
```
### Crude map of the world with capital cities:
```{r}
library(ggplot2)
library(plotly)
library(maps)
library(rbokeh)
data(world.cities)
caps <- subset(world.cities, capital == 1)
caps$population <- prettyNum(caps$pop, big.mark = ",")
figure(width = 800, height = 450, padding_factor = 0) %>%
  ly_map("world", col = "gray") %>%
  ly_points(long, lat, data = caps, size = 5,
            hover = c(name, country.etc, population))
```
### HighChart
A dynamic `highcharter` (Highcharts) chart of the Palmer penguins data.
```{r}
library(highcharter)
data(penguins, package = "palmerpenguins")
hchart(penguins, "scatter", hcaes(x=flipper_length_mm, y=bill_length_mm, group=species))
```
Page 2
===========
Row1
-----------------
### Dose-response analysis (LD50)
```{r}
exp.dat = matrix(c(49.1, 53, 56.9, 60.8, 64.8, 68.7, 72.6, 76.5,
                   59, 60, 62, 56, 63, 59, 62, 60,
                   6, 13, 18, 28, 52, 53, 61, 60,
                   .102, .217, .29, .5, .825, .898, .984, 1), ncol = 4)
colnames(exp.dat) = c('dose', 'nexp', 'ndied', 'prop')
exp.dat = as.data.frame(exp.dat)
# compute how many lived
exp.dat$nalive = exp.dat$nexp - exp.dat$ndied
#View(exp.dat)
exp.glm <- glm(cbind(ndied, nalive) ~ dose,
               family = binomial,
               data = exp.dat)
summary(exp.glm)
# Invert the logit link: the dose at which the predicted probability equals `prob`
dose4prob = function(b0, b1, prob){
  d = (-b0 + log(-prob/(prob - 1)))/b1   # equivalent to (qlogis(prob) - b0)/b1
  return(d)
}
dose4prob(coef(exp.glm)[[1]], coef(exp.glm)[[2]], .5)  # estimated LD50
range(exp.dat$dose)
drange = seq(30,90,length=100)
exp.pred = predict(exp.glm, newdata=data.frame(dose=drange))
plot(drange,
     exp(exp.pred)/(1 + exp(exp.pred)),
     type = 'l',
     xlab = 'dose',
     ylab = 'probability')
# add in the observed points
points(prop ~ dose, data = exp.dat, col='red', pch=19)
# compute our LD values
ld50 = dose4prob(coef(exp.glm)[[1]],coef(exp.glm)[[2]],.50)
ld25 = dose4prob(coef(exp.glm)[[1]],coef(exp.glm)[[2]],.25)
# let's add these to our graph for visualization purposes
abline(h=.5,lty='dashed',col='red')
abline(h=.25,lty='dotted',col='blue')
legend('topleft',
       c(expression(LD[50]),
         expression(LD[25])),
       col = c('red','blue'), lty = c('dashed','dotted'),
       inset = .01)
text(ld50,.52,sprintf("%.3f",ld50),pos=c(2),col='red')
text(ld25,.27,sprintf("%.3f",ld25),pos=c(2),col="blue")
```
```{r}
DT::datatable(exp.dat,options=list(bPaginate=FALSE))
```
```{r}
set.seed(SEED)
fitg <- stan_glm(cbind(ndied, nalive) ~ dose,
                 data = exp.dat, na.action = na.fail, family = binomial(), seed = SEED)
summary(glm(cbind(ndied, nalive) ~ dose,
            family = binomial,
            data = exp.dat))
round(coef(fitg), 2)
round(posterior_interval(fitg, prob = 0.9), 2)
fitg0 <- stan_glm(cbind(ndied,nalive) ~ 1, data = exp.dat, na.action = na.fail, family=binomial(), seed=SEED)
(loog <- loo(fitg))
(loog0 <- loo(fitg0))
loo_compare(loog0, loog)
```
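As an aside (a sketch added for illustration, not part of the original analysis), the posterior of the LD50 can be read off directly from the draws of the Bayesian logit fit, reusing the `dose4prob()` helper defined above:

```{r}
# Posterior draws of the LD50 (dose with 50% predicted mortality) under fitg
draws <- as.matrix(fitg)   # columns "(Intercept)" and "dose"
ld50_post <- dose4prob(draws[, "(Intercept)"], draws[, "dose"], 0.5)
quantile(ld50_post, c(0.05, 0.5, 0.95))
```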
```{r}
#Compare the probit
fitg1 <- stan_glm(cbind(ndied,nalive) ~ dose, data = exp.dat, na.action = na.fail, family=binomial(link=probit), seed=SEED)
loog1<-loo(fitg1)
loo_compare(loog1, loog)
round(posterior_interval(fitg1, prob = 0.9), 2)
summary(fitg1)
posterior <- as.array(fitg1)
np <- nuts_params(fitg1)
mcmc_pairs(posterior, pars = c("dose", "(Intercept)"), np = np)
#launch_shinystan(fitg1)
```
Page 3
===========
```{r}
# launch_shinystan(fitg1)
# launch_shinystan(fitg)
url<-"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
Titanic <- read.csv(url, header = TRUE)
#glimpse(Titanic,width = 50)
Titanic$class <- str_extract(Titanic$Pclass, "[0-9]")
TitanicLinear <- stan_glm(Survived ~ Age +
                            Sex + as.factor(class),
                          data = Titanic, family = gaussian)
summary(TitanicLinear)
plot(TitanicLinear)
Titanic_posterior <- TitanicLinear %>% as_tibble() %>%
  rename(sec.class = "as.factor(class)2",
         third.class = "as.factor(class)3")
ggplot(Titanic_posterior, aes(x = third.class)) +
  geom_histogram()
posterior_vs_prior(TitanicLinear)
pp_check(TitanicLinear)
waic(TitanicLinear)
loo(TitanicLinear)
Titanic_probit <- stan_glm(Survived ~ Age +
                             Sex + as.factor(class),
                           data = Titanic, family = binomial(link = probit))
Loo_probit <- loo(Titanic_probit)
Titanic_logit <- stan_glm(Survived ~ Age +
                            Sex + as.factor(class),
                          data = Titanic, family = binomial(link = logit))
Loo_logit <- loo(Titanic_logit)
# loo_compare sorts models best first; a negative elpd_diff means less
# support than the top model
loo_compare(Loo_probit, Loo_logit)
```
Page 4
=============
Row1
------------------
```{r}
library(tidyverse)
library(data.table)
library(dplyr)
library(ggplot2)
library(lubridate)
library(stringr)
library(plotly)
## Reading in files
library(igraph)
library(threejs)
library(ggraph)
library(tidygraph)
start.watch.date <- as.Date('2020-03-01')
watched.countries.count <- 12
covid <- data.table(read.csv('covid_19_data.csv'))
krcovid <- data.table(read.csv('PatientInfo.csv'))
krcovid[,confirmed_date:=as.Date(gsub('2002','2020',as.character(confirmed_date)))]
krcovid[,symptom_onset_date:=as.Date(as.character(symptom_onset_date))]
krcovid[,released_date:=as.Date(as.character(released_date))]
krcovid[,deceased_date:=as.Date(as.character(deceased_date),format='%Y-%m-%d')]
#krcovid[,table(symptom_onset_date==confirmed_date)]
#summary(krcovid)
#head(covid)
krcovid[,closed_date:=coalesce(released_date,deceased_date)]
krcovid[,closed.days:=as.integer(difftime(closed_date,confirmed_date,unit='days'))]
ggplot(krcovid[,.(n=.N,median_age=2020-median(birth_year,na.rm=T)),.(closed.days,confirmed_date,state)][,state:=ifelse(state=='deceased','death','recovered')])+
  geom_point(aes(y=closed.days,x=confirmed_date,size=n,col=state))+
  #geom_text(aes(y=closed.days,x=confirmed_date,label=median_age,col=state,hjust='left'))+
  scale_x_date(date_labels = '%m/%d',date_breaks = '3 days')+
  theme(axis.text.x=element_text(angle = 60, vjust = 0.5))+
  scale_color_brewer(palette = 'Set1')+
  labs(title='Days to close the cases in Korea')
covid[,Last.Update:=as.Date(Last.Update,format='%m/%d/%Y')]
covid[str_length(ObservationDate)==8,ObservationDate1:=as.Date(ObservationDate,format='%m/%d/%y')]
covid[str_length(ObservationDate)==10,ObservationDate1:=as.Date(ObservationDate,format='%m/%d/%Y')]
covid[,ObservationDate:=ObservationDate1]
covid$ObservationDate1=NULL
covid[grep('Korea',Country.Region),Country.Region:='Republic of Korea']
covid[grep('Iran',Country.Region),Country.Region:='Iran']
covid[grep('Taiwan',Province.State),Country.Region:='Taiwan']
covid[ grepl('Diamon',Province.State),Country.Region:='Diamond Princess']
str(covid)
# The JHU data changed the country name for Taiwan, which is unfortunate both for the analysis and for human rights
## quick summary the cases by country
library(conflicted)
conflict_prefer("lag", "dplyr")
conflict_prefer("setdiff", "dplyr")
covid.by.country <- covid[,.(Confirmed=sum(Confirmed),Recovered=sum(Recovered),Deaths=sum(Deaths)),.(ObservationDate,Country.Region)]
setorder(covid.by.country,Country.Region,ObservationDate)
covid.by.country[,Daily.Confirmed:=Confirmed-lag(Confirmed,1),.(Country.Region)]
covid.by.country[,Daily.Recovered:=Recovered-lag(Recovered,1),.(Country.Region)]
covid.by.country[,Daily.Deaths:=Deaths-lag(Deaths,1),.(Country.Region)]
lookup.countries <- covid.by.country[order(-ObservationDate,Country.Region)][,head(.SD,1),.(Country.Region)][order(-Deaths),Country.Region]
lookup.countries <- setdiff(lookup.countries,grep('Diamon',lookup.countries,value=T))
```
Row2
------------------
```{r}
# Cap extreme Daily.Confirmed values (>5000 shown as 2000) to keep the panels readable
ggplot(covid.by.country[grep('Italy|Korea',Country.Region)][,Daily.Confirmed:=ifelse(Daily.Confirmed>5000,2000,Daily.Confirmed)])+
  geom_line(aes(x=ObservationDate,y=Daily.Recovered,col='Recovered'))+
  geom_line(aes(x=ObservationDate,y=Daily.Deaths,col='Deaths'))+
  geom_line(aes(x=ObservationDate,y=Daily.Confirmed,col='Confirmed'))+
  facet_wrap(~Country.Region,scales = 'free',ncol=1)+
  scale_x_date(date_labels = '%m/%d',date_breaks = '7 days')
```
Row3
------------------
```{r}
ggplot(covid.by.country[grep('Italy|Korea',Country.Region)][ObservationDate>'2020-03-01'])+
  geom_bar(aes(x=ObservationDate,y=Daily.Deaths,fill='Deaths'),stat = 'identity')+
  geom_line(aes(x=ObservationDate,y=Daily.Confirmed),col='blue')+
  facet_wrap(~Country.Region,scales = 'free',ncol=1)+
  scale_x_date(date_labels = '%m/%d',date_breaks = '2 days')
```
Row4
------------------
```{r}
ggplot(covid.by.country[Country.Region %in% lookup.countries[1:watched.countries.count]],aes(group=1))+
  geom_line(aes(x=ObservationDate,y=Daily.Recovered,col='Recovered'))+
  geom_line(aes(x=ObservationDate,y=Daily.Deaths,col='Deaths'))+
  geom_line(aes(x=ObservationDate,y=Daily.Confirmed,col='Confirmed'))+
  facet_wrap(~Country.Region,ncol=3,scales='free_y')+
  scale_color_brewer(palette = 'Set1',name='Value Type')+
  scale_x_date(date_breaks = '7 days',date_labels = '%m/%d')+
  theme(axis.text.x=element_text(angle = 90, vjust = 0.5))+
  labs(title='COVID-19 daily cases')
```