library(rstan)
## Warning: package 'rstan' was built under R version 4.2.3
## Loading required package: StanHeaders
##
## rstan version 2.26.22 (Stan version 2.26.1)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(loo)
## This is loo version 2.5.1
## - Online documentation and vignettes at mc-stan.org/loo
## - As of v2.0.0 loo defaults to 1 core but we recommend using as many as possible. Use the 'cores' argument or set options(mc.cores = NUM_CORES) for an entire session.
##
## Attaching package: 'loo'
## The following object is masked from 'package:rstan':
##
## loo
library(foreign)
library(Synth)
## ##
## ## Synth Package: Implements Synthetic Control Methods.
## ## See https://web.stanford.edu/~jhain/synthpage.html for additional information.
library(xtable)
library(ggplot2)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:rstan':
##
## extract
require(MASS)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
require(Matrix)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
library(bayesplot)
## This is bayesplot version 1.10.0
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
## * Does _not_ affect other ggplot2 plots
## * See ?bayesplot_theme_set for details on theme setting
library(tigerstats)
## Loading required package: abd
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## Loading required package: lattice
## Loading required package: grid
## Loading required package: mosaic
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
##
## mean
## The following object is masked from 'package:ggplot2':
##
## stat
## The following object is masked from 'package:loo':
##
## compare
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
## Welcome to tigerstats!
## To learn more about this package, consult its website:
## http://homerhanumat.github.io/tigerstats
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
Importing the dataset [1]
d <- read.dta("/Users/apple/Desktop/repgermany.dta")
d
df_avg <- d |> group_by(index, country) |>
summarize_at(c("gdp", "infrate", "trade", "industry",
"schooling", "invest70", "invest80"),
mean, na.rm = TRUE)
df_avg <- mutate(df_avg, invest80 = (1/100)*invest80)
df_avg <- mutate(df_avg,
invest75 = mean(c_across(c("invest70", "invest80")), na.rm = TRUE))
df_J_P <- df_avg[, c(3:7, 10)]
rownames(df_J_P) <- 1:17
## Warning: Setting row names on a tibble is deprecated.
colnames(df_J_P) <-
c("avg_gdp", "avg_infrate", "avg_trade", "avg_industry", "avg_schooling", "avg_invest75")
df_J_P
We have reshaped the dataset into 17 countries (rows) by
6 predictors (columns). This data frame will later be converted into a
matrix for data loading. Notice that we have created a column
called avg_invest75. Pinkney’s paper mentions a
6th predictor, the average investment rate from 1975 - 1980,
but this quantity is not directly available in the raw dataset. Hence, we
rescale invest80 (dividing it by 100) and average it with
invest70; this per-country average gives the 6th column
- avg_invest75. All other predictors are closely aligned
with Pinkney’s paper [2].
german_unification <-
read.csv("/Users/apple/Desktop/Numpyro BGSC/german_unification.csv")
german_unification
df_J_T <- german_unification[, 2:45]
colnames(df_J_T) <- 1960:2003
rownames(df_J_T) <- 1:17
df_J_T
We load another dataset from the coding folder provided. In fact, the
dataset german_unification is just a wider-format version of
d, and the same data frame can be obtained by reshaping
d directly (see the sketch below). Notice that the row names are the index
positions of the OECD countries and the column names are the years from
1960 to 2003.
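A minimal sketch of that reshaping, assuming the wide table holds the gdp series of d (dplyr::select is written out explicitly because MASS masks select above):
df_J_T_alt <- d |>
  dplyr::select(index, year, gdp) |>
  tidyr::pivot_wider(names_from = year, values_from = gdp) |>
  arrange(index)
df_J_T_alt <- as.data.frame(df_J_T_alt[, -1])  # drop the index column
colnames(df_J_T_alt) <- 1960:2003
rownames(df_J_T_alt) <- 1:17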
Storing the STAN file [2]
stan_code <- "
functions {
matrix make_F (int T,
vector diagonal_loadings,
vector lower_tri_loadings) {
int L = num_elements(diagonal_loadings);
int M = num_elements(lower_tri_loadings);
matrix[T, L] F;
int idx = 0; // Index for the lower diagonal
for (j in 1:L) {
F[j, j] = diagonal_loadings[j];
for(i in (j + 1):T) {
idx += 1;
F[i, j] = lower_tri_loadings[idx];
}
}
for (j in 1:(L - 1)) {
for (i in (j + 1):L) F[j, i] = 0;
}
return F;
}
matrix make_beta (int J, matrix off,
vector lambda,
real eta,
vector tau) {
int L = cols(off);
vector[L] cache = ( tan(0.5 * pi() * lambda) *
tan(0.5 * pi() * eta) );
vector[J] tau_ = tan(0.5 * pi() * tau);
matrix[J, L] out;
for (j in 1:J)
out[j] = off[j] * tau_[j];
return diag_pre_multiply(cache, out');
}
}
data {
int T; // times
int J; // countries
int L; // number of factors
int P;
matrix[P, J] X; // predictors
row_vector[T] Y[J]; // data matrix of order [J,T]
int trt_times;
}
transformed data {
int<lower=1> M = L * (T - L) + L * (L - 1) / 2;
row_vector[J] j_ones = rep_row_vector(1, J);
vector[T] t_ones = rep_vector(1.0, T);
matrix[J, P] X_std;
vector[J] y_mu;
vector[J] y_sd;
row_vector[T] Y_scaled[J];
row_vector[T - trt_times] Y_pre_target;
vector[P] x_mu;
vector[P] x_sd;
for (j in 1:J) {
y_mu[j] = Y[j, T - trt_times];
y_sd[j] = sd(Y[j, 1:T - trt_times]);
Y_scaled[j] = ( Y[j] - y_mu[j] ) / y_sd[j];
}
for (p in 1:P) {
x_mu[p] = mean(X[p]);
x_sd[p] = sd(X[p]);
X_std[, p] = ( X[p]' - mean(X[p]) ) / sd(X[p]);
}
Y_pre_target = Y_scaled[1, 1:T - trt_times];
}
parameters{
vector[P] chi; // non-time varying predictors
vector[T] delta; // year fixed effects
row_vector[J] kappa; // country fixed effects
matrix[J, L] beta_off;
vector<lower=0, upper=1>[L] lambda;
real<lower=0, upper=1> eta;
vector<lower=0, upper=1>[J] tau;
row_vector[trt_times] Y_post_target;
real<lower=0> sigma;
vector<lower=0>[L] F_diag;
vector[M] F_lower;
}
transformed parameters {
matrix[L, J] beta = make_beta(J,
beta_off,
lambda,
eta,
tau);
}
model {
chi ~ std_normal();
to_vector(beta_off) ~ std_normal();
F_diag ~ std_normal();
F_lower ~ normal(0, 2);
delta ~ normal(0, 2);
kappa ~ std_normal();
sigma ~ std_normal();
{
vector[J] predictors = X_std * chi;
matrix[T, L] F = make_F(T, F_diag, F_lower);
row_vector[T] Y_target[1];
row_vector[T] Y_temp[J];
Y_target[1] = append_col(Y_pre_target, Y_post_target);
Y_temp = append_array(Y_target, Y_scaled[2:J]);
for (j in 1:J)
Y_temp[j]' ~ normal_id_glm(F, delta + kappa[j] + predictors[j],
beta[ , j], sigma);
}
}
generated quantities {
vector[T] synth_out[J];
matrix[T, L] F_ = make_F(T, F_diag, F_lower);
matrix[T, J] Synth_ = F_ * beta + delta * j_ones +
t_ones * (kappa + (X_std * chi)');
for (j in 1:J)
synth_out[j] = Synth_[, j] * y_sd[j] + y_mu[j];
}"
This STAN program is included without modification from the Appendix of Pinkney’s paper [2].
Loading the data list
T <- length(unique(d$year))
J <- length(unique(d$country))
L <- 8
P <- 6
X <- data.matrix(t(df_J_P))
Y <- data.matrix(df_J_T)
trt_times <- max(d$year) - 1990
data_list <- list(T = T, J = J, L = L, P = P, X = X, Y = Y, trt_times = trt_times)
The data are loaded into a list whose entries match the STAN data block.
To be specific, T = 44: T covers the entire period
from 1960 to 2003. J = 17: J is the number of unique
countries in the data, the 17 countries from the OECD dataset.
Pinkney’s paper states that 8 latent factors are used
[2], so we set L = 8 accordingly, where L is
defined as the number of factors. P is the number of
predictors; Pinkney’s paper mentions 6. X is a
matrix of dimension [P, J], hence 6 by 17. The 6
predictors are:
average GDP,
average inflation rate,
average trade openness,
average industry share of value added from 1981 - 1990,
average percentage of secondary school attained in the total population aged 25 and older from 1980 - 1985,
average investment rate from 1975 - 1980.
Notice that these 6 predictors have already been prepared in the
previously transformed dataset, df_J_P; we simply transpose
the data frame and convert it to a matrix. Similarly,
Y is a matrix of dimension [J, T], hence 17 by 44.
Y is obtained from the other transformed dataset,
df_J_T - with rows indicating the 17 countries and columns
indicating the 44 years - after converting it to a matrix. Lastly,
trt_times = 13: trt_times is the post-treatment
period length, i.e. the number of years observed after German
reunification. All 7 variables are stored in
data_list; a quick dimension check is sketched below.
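As a hypothetical sanity check (not part of the original analysis), the stated dimensions can be verified directly against this list:
stopifnot(
  T == 44, J == 17, P == 6, trt_times == 13,
  all(dim(X) == c(P, J)),  # predictor matrix: 6 x 17
  all(dim(Y) == c(J, T))   # outcome matrix: 17 x 44
)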
Compiling the model with initial values, max_treedepth, and adapt_delta settings
control_list <- list(max_treedepth = 14, adapt_delta = 0.95)
fit <- stan(model_code = stan_code, data = data_list, init = "0.1", control = control_list, chains = 4, warmup = 250, iter = 500)
## Info: Found int division at 'string', line 52, column 33 to column 44:
## L * (L - 1) / 2
## Values will be rounded towards zero. If rounding is not desired you can write
## the division as
## L * (L - 1) / 2.0
## Warning: There were 409 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 14. See
## https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is NA, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
After compiling the STAN code together with the data list, we
set the initial value to 0.1, max_treedepth to 14, and adapt_delta
to 0.95, following Pinkney’s suggestion [2]. The
fit uses 4 parallel chains, each with
250 warm-up iterations and 250 post-warmup iterations; this is half of
Pinkney’s choice, chosen to shorten the very long
fitting time. Notice that init can also be supplied as a
list of lists, in which case an initial value of the correct dimension
must be provided for every parameter (see the sketch below). The printed
fit summary is omitted here to save space, since it runs to
more than 2,500 lines; instead, we analyze the fit with three diagnostics
in the next section.
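A minimal sketch of that list-of-lists form, assuming the dimensions below mirror the parameters block of the STAN program above (one sub-list per chain):
init_one_chain <- function() list(
  chi = rep(0.1, P),
  delta = rep(0.1, T),
  kappa = rep(0.1, J),
  beta_off = matrix(0.1, J, L),
  lambda = rep(0.5, L),                               # constrained to (0, 1)
  eta = 0.5,
  tau = rep(0.5, J),
  Y_post_target = rep(0.1, trt_times),
  sigma = 0.1,
  F_diag = rep(0.1, L),
  F_lower = rep(0.1, L * (T - L) + L * (L - 1) / 2)   # M lower-triangular loadings
)
init_list <- lapply(1:4, function(chain) init_one_chain())  # one entry per chain
# fit <- stan(model_code = stan_code, data = data_list, init = init_list, ...)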
Visual MCMC diagnostics - Gelman-Rubin diagnostic, ESS, Autocorrelation
rhats <- rhat(fit)
length(rhats)
## [1] 2552
color_scheme_set("brightblue")
mcmc_rhat(rhats) + legend_move("top")
## Warning: Dropped 28 NAs from 'new_rhat(rhat)'.
mcmc_rhat(rhats) + xlim(0.997, 1.5) + legend_move("top")
## Warning: Dropped 28 NAs from 'new_rhat(rhat)'.
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
The bayesplot package provides a generic
rhat (\(\hat R\))
extractor function. We visualize the distribution of these 2,552 \(\hat R\) values with the
mcmc_rhat function. By comparing the within-chain variance
to the between-chain variance, the Gelman-Rubin diagnostic shown above
indicates that convergence is not satisfied, since some \(\hat R\) values
are much larger than 1.1 (the most extreme is \(\hat R = 4.44\)). Several
issues can lie behind this, such as poor mixing, insufficient tuning, or model
misspecification. Diagnosing the exact cause is not the focus of this report, but
the issue should be kept in mind in future work on this
dataset so that it can be prevented. Other
diagnostics can also assess convergence, so we look at the fit from other
perspectives below. The second plot restricts the displayed
\(\hat R\) values to the range 0.997 to
1.5.
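As a small follow-up (a sketch, using the rhats vector extracted above), the offending parameters can be listed directly:
bad_rhat <- sort(rhats[!is.na(rhats) & rhats > 1.1], decreasing = TRUE)
head(bad_rhat)    # parameters with the largest split R-hat values
length(bad_rhat)  # how many parameters exceed the 1.1 rule of thumb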
ratios <- neff_ratio(fit)
mcmc_neff(ratios) + legend_move("right")
## Warning: Dropped 28 NAs from 'new_neff_ratio(ratio)'.
The bayesplot package provides a generic
neff_ratio extractor function. The effective sample size
(ESS) is an estimate of the number of independent draws
from the posterior distribution of the estimand of interest. The
neff metric used in STAN is based on the ability of the
draws to estimate the true mean value of the parameter, which is related
to (but not necessarily equivalent to) estimating other functions of the
draws [3]. Because the draws within a Markov chain are not independent
if there is autocorrelation, the effective sample size, \(\text{N}_{\text{eff}}\), is usually smaller
than the total sample size, \(\text{N}\), although it may be larger in
some cases [4]. The larger the ratio \(\frac{\text{N}_{\text{eff}}}{\text{N}}\),
the better [5]. The visualization shows that the ratio is above 0.1
for most parameters; ratios below 0.1 are the usual cause for concern,
and only a small fraction of parameters fall below that threshold here,
so this does not appear to be an issue (quantified in the sketch below).
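To quantify that small fraction (a sketch, using the ratios vector extracted above):
sum(ratios < 0.1, na.rm = TRUE)   # number of parameters with N_eff / N below 0.1
mean(ratios < 0.1, na.rm = TRUE)  # share of all parameters below the 0.1 threshold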
posterior_sample <- as.array(fit)
color_scheme_set("pink")
mcmc_acf(posterior_sample,
pars = c("chi[1]", "delta[1]", "kappa[1]", "beta_off[1,1]"), lags = 20)
mcmc_acf(posterior_sample,
pars = c("lambda[1]", "eta", "tau[1]"), lags = 20)
mcmc_acf(posterior_sample,
pars = c("Y_post_target[1]", "sigma", "F_diag[1]", "F_lower[1]"), lags = 20)
The mcmc_acf function computes the autocorrelation for
each parameter. Since we have over 2,500 parameters, we only present the
ACF of the first few parameters in the model. Strong positive autocorrelation
is undesirable because it means the chain tends to stay in the same region between
iterations; we want to see the ACF drop quickly to zero with increasing
lag. Negative autocorrelation is possible and indicates fast
convergence of the sample mean towards the true mean. In our case, the
autocorrelation in the Markov chains decreases
as the lag increases, which suggests reasonable mixing and sampling
efficiency for these parameters, although it does not override the
convergence concerns raised by the \(\hat R\) values above. The trace
plot shown below gives a similar impression of the mixing in the chains.
traceplot(fit,
pars = c("chi[1]", "delta[1]", "kappa[1]", "beta_off[1,1]",
"lambda[1]", "eta", "tau[1]", "Y_post_target[1]",
"sigma", "F_diag[1]", "F_lower[1]"))
Replicated code from EDA V - GSCM [6]
dataprep_init <-
dataprep(foo = d, predictors = c("gdp","trade","infrate"), dependent = "gdp",
unit.variable = 1, time.variable = 3,
special.predictors =
list(list("industry", 1971:1980, c("mean")),
list("schooling",c(1970,1975), c("mean")),
list("invest70" ,1980, c("mean"))),
treatment.identifier = 7, controls.identifier = unique(d$index)[-7],
time.predictors.prior = 1971:1980, time.optimize.ssr = 1981:1990,
unit.names.variable = 2, time.plot = 1960:2003)
synth_init <-
synth(data.prep.obj = dataprep_init, Margin.ipop = .005,
Sigf.ipop = 7, Bound.ipop = 6)
##
## X1, X0, Z1, Z0 all come directly from dataprep object.
##
##
## ****************
## searching for synthetic control unit
##
##
## ****************
## ****************
## ****************
##
## MSPE (LOSS V): 4580.3
##
## solution.v:
## 0.4416298 0.1341838 0.07154154 0.001468671 0.1066738 0.2445024
##
## solution.w:
## 0.1350214 6.08e-08 0.5072787 5.2e-08 4.151e-07 6.59e-08 4.59e-08 2.401e-07 6.34e-08 0.165922 0.1464777 3.5e-08 2.76e-08 3.25e-08 0.0452973 1.8933e-06
dataprep_main <-
dataprep(foo = d, predictors = c("gdp","trade","infrate"), dependent = "gdp",
unit.variable = 1, time.variable = 3,
special.predictors = list(
list("industry" ,1981:1990, c("mean")),
list("schooling",c(1980,1985), c("mean")),
list("invest80" ,1980, c("mean"))),
treatment.identifier = 7, controls.identifier = unique(d$index)[-7],
time.predictors.prior = 1981:1990, time.optimize.ssr = 1960:1989,
unit.names.variable = 2, time.plot = 1960:2003)
##
## Missing data: treated unit; special predictor: special.industry.1981.1990 ; for period: 1990
## We ignore (na.rm = TRUE) all missing values for predictors.op.
synth_main <- synth(data.prep.obj = dataprep_main,
custom.v = as.numeric(synth_init$solution.v))
##
## X1, X0, Z1, Z0 all come directly from dataprep object.
##
##
## ****************
## optimization over w weights: computing synthtic control unit
##
##
##
## ****************
## v weights supplied manually: computing synthtic control unit
##
##
##
## ****************
## ****************
## ****************
##
## MSPE (LOSS V): 14313.62
##
## solution.v:
## 0.4416298 0.1341838 0.07154154 0.001468671 0.1066738 0.2445024
##
## solution.w:
## 0.2186942 0.0009650793 0.4175261 0.001169395 0.001050788 0.0008147026 0.0005029538 0.09028553 0.0005151888 0.1113089 0.1550799 0.0002918951 0.000302974 0.0005901991 0.0004689807 0.0004332223
synth_df <- synth.tab(dataprep.res = dataprep_main, synth.res = synth_main)
synth_weight <- synth_df$tab.w[, 1]
dataprep_gdp <- dataprep_main$Y0
synth_gdp_WG_GSCM <- dataprep_gdp %*% synth_weight
The above code snippets are taken directly from my EDA V on
Reproducing German Reunification with GSCM [6]. The important objects
reused in EDA VI are synth_weight, which holds
the weight for each of the 16 control OECD countries, and dataprep_gdp, the
matrix of GDP Per Capita for those control countries. The GSCM result will
also be included in the last two figures to allow a thorough comparison with West
Germany under BSCM. A quick check of the reused objects is sketched below.
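A brief sanity check on those reused objects (a sketch; names as defined in the Synth code above):
length(synth_weight)  # 16: one weight per control country (West Germany excluded)
sum(synth_weight)     # should be approximately 1
dim(dataprep_gdp)     # 44 x 16: years by control countries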
Accessing the synthetic GDP
dim(posterior_sample)
## [1] 250 4 2552
dimnames(posterior_sample)$iterations <- 1:250
The posterior_sample array has dimensions 250 x 4 x 2552:
250 post-warmup iterations, 4 parallel chains, and 2,552 parameters. We
rename the iteration dimension to the integers 1 to 250
(i.e. [1, ] is converted to 1). These are not
the draws we ultimately need, so below we extract the samples for
the parameter of interest - synth_out.
samples <- as.array(fit, "synth_out")
dim(samples)
## [1] 250 4 748
synth_out_gdp <- array(NA, dim = c(250, 4, 17, 44))
for (j in 1:17) {
for (t in 1:44) {
synth_out_gdp[, , j, t] <- as.array(fit, paste0("synth_out[", j, ",", t, "]"))
}
}
The extracted samples contain 748
parameters, which is expected since 748 = 17 * 44. We then create an
empty array with dimensions 250 x 4 x 17 x 44. With a nested loop, we
go through each country and then each year, and copy the corresponding
250 x 4 matrix of draws into the array. In short, the
array synth_out_gdp is composed of nested matrices:
think of it as a 17 * 44 grid in which each cell is itself a
250 * 4 matrix holding the GDP Per Capita draws across all iterations and
chains for that specific country and year (illustrated below).
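For example (a small illustration; t = 31 corresponds to 1990 and j = 7 to West Germany):
dim(synth_out_gdp[, , 7, 31])   # 250 x 4: iterations by chains for one country-year cell
mean(synth_out_gdp[, , 7, 31])  # posterior mean synthetic GDP Per Capita for that cell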
synth_gdp <- matrix(data = 0, nrow = 44, ncol = 17)
synth_gdp_cred_int_2.5 <- matrix(data = 0, nrow = 44, ncol = 17)
synth_gdp_cred_int_97.5 <- matrix(data = 0, nrow = 44, ncol = 17)
synth_gdp_cred_int_25 <- matrix(data = 0, nrow = 44, ncol = 17)
synth_gdp_cred_int_75 <- matrix(data = 0, nrow = 44, ncol = 17)
for (j in 1:17) {
for (t in 1:44) {
synth_gdp[t, j] <- mean(synth_out_gdp[, , j, t])
synth_gdp_cred_int_2.5[t, j] <- quantile(synth_out_gdp[, , j, t], 0.025)
synth_gdp_cred_int_97.5[t, j] <- quantile(synth_out_gdp[, , j, t], 0.975)
synth_gdp_cred_int_25[t, j] <- quantile(synth_out_gdp[, , j, t], 0.25)
synth_gdp_cred_int_75[t,j ] <- quantile(synth_out_gdp[, , j, t], 0.75)
}
}
With this array set up, we loop over it again to extract the summaries
needed for the Bayesian inference below.
synth_gdp stores the posterior mean GDP Per Capita for each country in
each year, and synth_gdp_cred_int_2.5 stores the corresponding 2.5% quantile.
Similarly, we extract the 97.5%, 25%, and 75% quantiles, which give the
95% and central 50% credible intervals. (An equivalent loop-free
formulation is sketched below.)
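The same summaries can also be computed without explicit loops by applying over the country and year margins (a sketch, equivalent to the nested loop above):
synth_gdp_alt <- t(apply(synth_out_gdp, c(3, 4), mean))                          # 44 x 17
synth_gdp_q025_alt <- t(apply(synth_out_gdp, c(3, 4), quantile, probs = 0.025))  # 44 x 17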
rownames(synth_gdp) <- 1:44
colnames(synth_gdp) <- 1:17
synth_gdp_T_J <- synth_gdp[, -7]
rownames(synth_gdp_cred_int_2.5) <- 1:44
colnames(synth_gdp_cred_int_2.5) <- 1:17
synth_gdp_cred_int_2.5_T_J <- synth_gdp_cred_int_2.5[, -7]
rownames(synth_gdp_cred_int_97.5) <- 1:44
colnames(synth_gdp_cred_int_97.5) <- 1:17
synth_gdp_cred_int_97.5_T_J <- synth_gdp_cred_int_97.5[, -7]
rownames(synth_gdp_cred_int_25) <- 1:44
colnames(synth_gdp_cred_int_25) <- 1:17
synth_gdp_cred_int_25_T_J <- synth_gdp_cred_int_25[, -7]
rownames(synth_gdp_cred_int_75) <- 1:44
colnames(synth_gdp_cred_int_75) <- 1:17
synth_gdp_cred_int_75_T_J <- synth_gdp_cred_int_75[, -7]
These 5 matrices - with the 44 years as rows and the 17 countries as columns
- are then trimmed to drop the 7th country, West Germany, so
their dimensions become 44 * 16. Recall the weight vector
synth_weight, which has dimension 16 * 1. It is
therefore natural to multiply each of these 5 matrices by the weight
vector, just as we did before in EDA V on GSCM.
Generating the tabular data for GDP Per Capita under BSCM, GSCM, and Real West Germany
synth_gdp_WG_BSCM <- synth_gdp_T_J %*% synth_weight
synth_gdp_WG_cred_int_2.5 <- synth_gdp_cred_int_2.5_T_J %*% synth_weight
synth_gdp_WG_cred_int_97.5 <- synth_gdp_cred_int_97.5_T_J %*% synth_weight
synth_gdp_WG_cred_int_25 <- synth_gdp_cred_int_25_T_J %*% synth_weight
synth_gdp_WG_cred_int_75 <- synth_gdp_cred_int_75_T_J %*% synth_weight
real_gdp_WG <- filter(d, index == 7)[, 4]
year <- filter(d, index == 7)[, 3]
df_final <- data.frame(year, real_gdp_WG, synth_gdp_WG_GSCM, synth_gdp_WG_BSCM,
synth_gdp_WG_cred_int_2.5, synth_gdp_WG_cred_int_97.5,
synth_gdp_WG_cred_int_25, synth_gdp_WG_cred_int_75)
rownames(df_final) <- NULL
df_final
In the final dataset, West Germany’s real GDP Per Capita is in column 2, its generalized synthetic control (GSCM) GDP Per Capita is in column 3, and its Bayesian synthetic control (BSCM) GDP Per Capita is in column 4. Columns 5-8 hold the corresponding credible interval bounds for the BSCM estimate. Two visualizations are displayed below to illustrate this table.
Visualization I
ggplot(df_final) +
geom_ribbon(aes(x = year, y = synth_gdp_WG_BSCM,
ymin = synth_gdp_WG_cred_int_25,
ymax = synth_gdp_WG_cred_int_75),
alpha = 0.9, fill = "pink3") +
geom_ribbon(aes(x = year, y = synth_gdp_WG_BSCM,
ymin = synth_gdp_WG_cred_int_2.5,
ymax = synth_gdp_WG_cred_int_97.5),
alpha = 0.4, fill = "pink") +
geom_line(aes(x = year, y = real_gdp_WG, color = "red2"), size = 0.7) +
geom_line(aes(x = year, y = synth_gdp_WG_GSCM, color = "darkblue"),
linetype = "longdash", size = 0.7) +
geom_line(aes(x = year, y = synth_gdp_WG_BSCM, color = "green3"),
linetype = "dotdash", size = 0.7) +
geom_vline(xintercept = 1990, linetype = "dotted", color = "darkgrey", size = 2) +
xlab("year") + ylab("Per Capita GDP (PPP, 2002 USD)") +
scale_color_identity(name = "Per Capita GDP",
breaks = c("red2", "darkblue", "green3"),
labels = c("Real West Germany",
"Generalized Synthetic West Germany",
"Bayesian Synthetic West Germany"),
guide = "legend") +
theme(legend.position = c(0.8, 0.2)) +
annotate("text", x = 1984, y = 30000, label = "German Reunification") +
annotate("text", x = 1996, y = 33000,
label = "BSCM - 95% CI", color = "pink") +
annotate("text", x = 1996, y = 31500,
label = "BSCM - 50% CI", color = "pink3") +
ggtitle("Trends in Per Capita GDP: West Germany under GSCM and BSCM vs. Real")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Incorporating Difference GDP Per Capita between Real and (Generalized / Bayesian) Synthetic West Germany into Tabular Data
diff_GSCM <- df_final$real_gdp_WG - df_final$synth_gdp_WG_GSCM
diff_BSCM <- df_final$real_gdp_WG - df_final$synth_gdp_WG_BSCM
diff_BSCM_lwr <- df_final$real_gdp_WG - df_final$synth_gdp_WG_cred_int_2.5
diff_BSCM_upr <- df_final$real_gdp_WG - df_final$synth_gdp_WG_cred_int_97.5
diff_BSCM_mdlwr <- df_final$real_gdp_WG - df_final$synth_gdp_WG_cred_int_25
diff_BSCM_mdupr <- df_final$real_gdp_WG - df_final$synth_gdp_WG_cred_int_75
df_final_diff <- data.frame(year, diff_GSCM, diff_BSCM,
diff_BSCM_lwr, diff_BSCM_upr,
diff_BSCM_mdlwr, diff_BSCM_mdupr)
df_final_diff
Visualization II
ggplot(df_final_diff) +
geom_ribbon(aes(x = year, y = diff_BSCM,
ymin = diff_BSCM_mdlwr,
ymax = diff_BSCM_mdupr),
alpha = 0.9, fill = "pink3") +
geom_ribbon(aes(x = year, y = diff_BSCM,
ymin = diff_BSCM_lwr,
ymax = diff_BSCM_upr),
alpha = 0.4, fill = "pink") +
geom_line(aes(x = year, y = diff_BSCM, color = "red"),
linetype = "solid", size = 1.1) +
geom_line(aes(x = year, y = diff_GSCM, color = "darkblue"),
linetype = "longdash", size = 1.1) +
geom_vline(xintercept = 1990, linetype = "dashed", size = 0.5, col = "darkgrey") +
geom_hline(yintercept = 0, linetype = "dashed", size = 0.5, col = "darkgrey") +
scale_color_identity(name = "Per Capita GDP",
breaks = c("darkblue", "red"),
labels = c("Generalized Synthetic West Germany",
"Bayesian Synthetic West Germany"),
guide = "legend") +
theme(legend.position = c(0.3, 0.15)) +
annotate("text", x = 1984, y = -700, label = "German Reunification") +
annotate("text", x = 1991, y = -2000,
label = "BSCM - 95% CI", color = "pink") +
annotate("text", x = 1991, y = -2500,
label = "BSCM - 50% CI", color = "pink3") +
ylab("difference in per-capita GDP (PPP, 2002 USD)") +
ggtitle("GDP Per Capita Diff between Real West Germany and under GSCM and BSCM")
References
[1] A. Abadie, A. Diamond, and J. Hainmueller, “Comparative Politics and the Synthetic Control Method,” American Journal of Political Science, vol. 59, no. 2, pp. 495–510, Apr. 2014, doi: https://doi.org/10.1111/ajps.12116 [accessed Jun. 9, 2023].
[2] S. Pinkney, “An Improved and Extended Bayesian Synthetic Control,” arXiv.org, Mar. 30, 2021. https://arxiv.org/abs/2103.16244 [accessed Jun. 12, 2023].
[3] “Introduction,” R-Packages. https://cran.r-project.org/web/packages/bayesplot/vignettes/visual-mcmc-diagnostics.html#fn1 [accessed Jun. 13, 2023].
[4] “Abstract for ‘Suppressing Random Walks in Markov Chain Monte Carlo Using Ordered Overrelaxation’,” Glizen.com, 2023. https://glizen.com/radfordneal/over.abstract.html [accessed Jun. 13, 2023].
[5] “Home page for the book ‘Bayesian Data Analysis’,” Columbia.edu, 2019. http://www.stat.columbia.edu/~gelman/book/ [accessed Jun. 14, 2023].
[6] Z. Jiang, “RPubs - Exploratory Data Analysis V - Reproducing German Reunification with Generalized Synthetic Control Method (GSCM),” rpubs.com. https://rpubs.com/jiangzm/1053384 [accessed Jun. 15, 2023].