---
title: "Notes on Approximate Bayesian Computation"
author: "Christopher Schwarz (NYU Politics)"
date: "8/11/2020"
pages:
  extra: true
output: 
  html_document:
    toc: true
    toc_depth: 2
    toc_float: true
    code_download: true
---

# Overview

The basic idea underlying Approximate Bayesian Computation (ABC) is deceptively simple.  In abstract terms the process is roughly the following (a minimal toy sketch in R follows the list):

1) Specify the model (e.g., a regression equation)
2) Specify prior distribution over parameters
3) Sample candidates from prior
4) Assess model fit, reject candidate?
5) Take non-rejected candidates as posterior
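
To make the recipe concrete before any regression machinery appears, here is a minimal, self-contained toy: rejection ABC for the mean of a normal distribution, using the sample mean as the summary statistic.  The prior range, tolerance, and number of draws are arbitrary choices for illustration, and the examples below will skip the data-simulation step entirely, scoring candidates by RMSE on the observed design matrix instead.

```{r}
# Toy rejection ABC: infer the mean of a normal distribution (known sd = 1).
set.seed(42)
y_obs <- rnorm(50, mean = 1.5, sd = 1)           # "observed" data
s_obs <- mean(y_obs)                             # observed summary statistic
eps   <- 0.05                                    # tolerance -- an arbitrary choice

kept <- replicate(20000, {
  theta <- runif(1, -5, 5)                       # 1) draw a candidate from the prior
  y_sim <- rnorm(50, mean = theta, sd = 1)       # 2) simulate data under the candidate
  if (abs(mean(y_sim) - s_obs) < eps) theta else NA_real_  # 3) reject if summaries disagree
})
kept <- kept[!is.na(kept)]                       # 4) survivors approximate the posterior
c(n_kept = length(kept), posterior_mean = mean(kept))
```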

Of critical importance is that this approach grew out of a need to abandon direct evaluation of the likelihood function, the centerpiece of classical statistics whether Frequentist or Bayesian.  Because the methodology is recent and grew out of technical fields, its guidance is largely impenetrable without a great deal of effort.  This is, in part, because particular concerns -- such as the selection of sufficient summary statistics -- are somewhat unintuitive, or even unattainable, although guidance exists (cf. [the recent handbook](https://books.google.com/books?id=A82GDwAAQBAJ&dq=handbook+of+approximate+bayesian+computation&lr=) or [this 2013 review](https://arxiv.org/pdf/1202.3819.pdf)).

This gap will likely close with time, and the methodology is well positioned to spread, but until these things become clear it can be helpful to view the approach as not doing anything fundamentally different from more orthodox approaches.  This is especially relevant, perhaps unsurprisingly, for Bayesian estimation, for which

> A number of authors have argued that the Bayesian perspective needs to be incorporated into basic statistical training. This training is easily accomplished, insofar as Bayesian analysis can be carried out using conventional (frequentist) formulas for stratified analysis: one may use either information (inverse-variance) weighted averaging of current-data results with a prior distribution, or representation of the prior as a prior-data stratum (data augmentation). Thus, Bayesian analysis does not require posterior sampling or other special software, and does not even require explicit use of Bayes theorem. [(Greenland 2007)](https://academic.oup.com/ije/article/36/1/195/669913)

The same is true of ABC -- all one needs to do the thing, at a base level, is know how to sample from a distribution.  To facilitate this understanding of ABC, we will consider a very simple measure of fit -- the root mean squared error of a linear model -- which, by definition, is minimized when the residual sum of squares is minimized.  Rather than explicitly minimizing this objective function to arrive at the "best" parameter values we will instead retain parameter vectors for which the fit is "not too bad."  And so rather than the uncertainty in the model being driven by sampling variability it will be uncertainty -- on the part of the researcher -- regarding how well the systematic component of the model should fit.  This is relevant, of course, because as shown by [Shmueli (2010, Appendix)](https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf) the "true" model is not necessarily the best predictive model.  And so, in some sense, ABC can be seen as a reflection of the fact that we don't entirely believe the implicit likelihood function and so are more interested in parameters "around the maximum" rather than the maximum itself.  In other terms, we might worry that our model is overfitting the data.
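
Stated a touch more formally, the scheme used throughout these notes amounts to the following rule, with $\varepsilon$ standing in for whatever "not too bad" threshold we settle on:

$$
\theta \sim p(\theta), \qquad \text{keep } \theta \text{ if } \mathrm{RMSE}(\theta) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - x_i^\top\theta\right)^2} \le \varepsilon .
$$

The retained draws then approximate the prior restricted to the "good fit" region, $p(\theta \mid \mathrm{RMSE}(\theta) \le \varepsilon)$, which concentrates around the least-squares solution as $\varepsilon$ shrinks.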

So, in short, ABC is a somewhat new and interesting way to think about Bayesian statistics.  I don't believe that it is the right approach for every problem -- for simple models it is worse in almost every way -- but it provides yet another alternative to classical approaches and is endowed with a great deal of intuition.  In particular, it draws attention to the central role of the design matrix in a way which conventional approaches obscure, and it clarifies why we might want to just maximize the likelihood when we are able to in the first place.  By virtue of being a likelihood-free method, it also enables a unique approach to the problem of trying to identify more parameters than there are observations -- a problem on which the most basic techniques naturally fail.

# Easy as ABC

Let's begin, as always, by simulating some data such that we know the truth.  Here we will consider a simple non-linear relationship, but later we will treat the non-linear term as if it were just another covariate.  We will also later turn to more complicated examples involving multilevel models and models with more parameters than observations.

```{r}
set.seed(8675309)

x <- rnorm(100,0,1)
y <- rnorm(100,2*x + 3*x^2,1)

data <- data.frame(y=y,x=x)

mod1 <- lm(y~x + I(x^2))

summary(mod1)
```

Nothing surprising here.  Now suppose, for whatever reason, we weren't able to simply run a linear model -- perhaps we are dealing with a situation in which the likelihood function is costly, or even impossible, to evaluate.  What can we do?

First, we will define a function.  The specification is somewhat "hackish" in that we take as input a fitted model object, but that is just because it makes it easy to extract the design matrix -- the only thing that really matters here.  We will also assume that the prior beliefs over parameters are independent and identically distributed, such that a realization from the prior is a coordinate drawn uniformly from a hypercube.  Of course priors matter for the resulting posterior distribution, but even within orthodox Bayesianism priors are frequently chosen for their computational rather than theoretical properties.  Naturally, if the "truth" lies outside of this hypercube then we will fail to find it -- that said, it is relatively straightforward to imagine an adaptive "warm-up" phase, similar to MCMC/HMC/etc., where we "move the cube" to cover the desired posterior.  Alternatively, one can scale the variables so that the coefficients can reasonably be expected to fall within a known range.  We will revisit this momentarily.
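
On the scaling point, here is a quick, admittedly toy illustration -- the covariate in our simulated data is already roughly on unit scale, so little changes, but the idiom is what makes a fixed hypercube such as $[-10, 10]$ per coordinate more defensible:

```{r}
# Illustration only: standardize the covariate so the coefficients live on a
# predictable scale, giving a fixed prior hypercube a better chance of covering them.
d_std   <- data.frame(y = y, x = as.numeric(scale(x)))
mod_std <- lm(y ~ x + I(x^2), data = d_std)
coef(mod_std)
```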

Anyway, here is the function we will be working with:

```{r}
flat_abc_draw <- function(model,lims=c(-10,10)){
    
    y <- model$model[,1]
    X <- as.matrix(model.matrix(model))
    betas <- matrix(runif(ncol(model.matrix(model)),lims[1],lims[2]),ncol=1)
    rmse <- sqrt(mean((y - X %*% betas)^2))
    betas <- data.frame(t(betas))
    names(betas) <- colnames(model.matrix(model))
    
    data.frame(rmse,
               sdy = sd(y),
               betas)
}
```

After extracting the outcome and design matrix we simulate a vector of coefficients, make a prediction from that model, calculate the RMSE, and save the results.  Note that all coefficients, including the intercept, are randomly sampled from the prior space.  Here is the output from one draw:

```{r}
flat_abc_draw(mod1)
```

Here we can see that, also unsurprisingly, a randomly drawn vector of parameter values isn't great at predicting the outcome.

We will also define a function for viewing the posterior distribution of a term under a particular rejection rule.  It won't be the prettiest thing in the world, but it gets the job done.

```{r, warning=F, message=F}
library(ggplot2)

abc_plot <- function(output,var,tol=0.05,type="rank"){
  
  if(type == "rank"){
    output$type <- ifelse(output$rank < round(length(output$rank)*tol),"Accept","Reject")
  }
  if(type == "prop_max"){
    output$type <- ifelse(output$prop_max > 1-tol,"Accept","Reject")
  }
  
  output$type <- relevel(as.factor(output$type),"Reject")
  bw <- (max(output[,var]) - min(output[,var]))/100
  
  ggplot(output,aes(x=output[,var],fill=type)) + 
    geom_histogram(position="identity",binwidth=bw) 
  
}
```

Great.  Now let's define one last utility function for taking the draws and processing the results.

```{r, message=F, warning=F}
library(pbapply)
library(dplyr)

flat_abc <- function(model,ndraws=10000,lims=c(-10,10),cl=NULL){
      out <- pblapply(1:ndraws,function(x)flat_abc_draw(model,lims),cl=cl)
      out <- do.call("rbind",out)
      tibble(out[order(out$rmse),]) %>% 
        mutate(rank = 1:nrow(out), prop_max = min(rmse)/rmse) -> out
      out
}
```

Great, now let's do the thing and take a look.

```{r}
abc1 <- flat_abc(mod1)

ggplot(abc1,aes(y=prop_max,x=x)) + 
  geom_point() +
  geom_vline(xintercept = 2, color="red", lwd=1.5)
```

In the above plot we have a measure of fit on the y-axis -- larger RMSE leads to lower values, with the model achieving the smallest simulated RMSE normalized to 1.  As one might expect, the truth here -- indicated in red -- is "pointed towards" by the better-fitting models.  Again, this is because minimizing the sum of squared errors (or maximizing the likelihood) explicitly minimizes the "loss" and maximizes the "fit."

So how do we get a posterior distribution?  Well, we first remember that if we keep all of the above draws our posterior will just return our sampled prior -- here, if we retain every draw, the distribution of the x parameter is uniform.  What ABC says is "keep only those parameters that generate data similar to that which is observed."  So let's only keep those draws which sit above 0.25 on the y-axis in this example (in `abc_plot` terms, `tol = 0.75`, since acceptance requires `prop_max > 1 - tol`).

```{r}
data.frame(abc1) %>% abc_plot("x",0.75,type="prop_max")
```

I told you it wouldn't be the prettiest thing in the world.  What it does show is how we go from a uniform prior across the pre-selected range down to a non-uniform posterior via what I will call ex-post rejection sampling.  We can also see, as intuited above, that accepting or rejecting based upon fit yields a distribution centered around the likelihood estimate yet spread according to the uncertainty we are directly expressing -- here, that every model above the threshold is equally likely.  I'll leave it to the reader to imagine how subsampling and resampling, for example, could be used to generate more customized posterior distributions.  The same goes for the acceptance or rejection rule; it encodes your beliefs!
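
One such variation -- purely as a sketch, with a Gaussian kernel and a bandwidth picked out of thin air -- is to replace the hard cutoff with smooth weights on the loss and resample accordingly:

```{r}
# Soft-rejection sketch: weight each draw by a Gaussian kernel of its RMSE
# (relative to the best draw) and resample with those weights.
h   <- 0.5                                           # kernel bandwidth -- arbitrary
w   <- exp(-(abc1$rmse - min(abc1$rmse))^2 / (2 * h^2))
idx <- sample(seq_len(nrow(abc1)), 2000, replace = TRUE, prob = w)
quantile(abc1$x[idx], c(0.025, 0.5, 0.975))          # kernel-weighted posterior for x
```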

For example, suppose that we wanted to look only at the top 10% of models in terms of fit.  We can do that like this:

```{r}
data.frame(abc1) %>% abc_plot("x",0.1,type="rank")
```

Readers will note that the distribution, while still naturally centered on the same thing, is somewhat wider than the above distribution.  That is because the restriction is simply less informative.

Of course, if we wanted to we could also just plot the accepted draws as well:

```{r}
subset(data.frame(abc1),prop_max > 0.25) %>% 
  ggplot(aes(x=x)) + 
  geom_histogram(binwidth = 0.25,fill="blue",col="black")
```

Or take a summary of some sort, like the mean and quantiles:

```{r}
abc1 %>% filter(rank <= 100) %>%
  summarise(n=max(rank),
            mean=mean(x),
            lower95=quantile(x,0.025),
            upper95=quantile(x,0.975),
            pct_pos = sum(x>0))
```

And so we can see that the posterior mean of `x` across the top 100 models is about 2.3 -- fairly close to the truth of 2.  We can also see that, despite the 95% credible interval crossing 0, 95 of the top 100 draws are positively signed.

# Effects Plots

One nice thing is that it is relatively straightforward to construct effects plots.  The strategy we will take is, for each accepted parameter draw, to re-calculate the predictions of the model and take the conditional mean of those predictions as the expected effect.  To get partial plots we will leverage the Frisch-Waugh-Lovell (FWL) theorem.

First, we will keep only those parameters within the top 1% of fitted models and take a look at the posterior means for good measure:

```{r}
sub <- abc1[which(abc1$rank < nrow(abc1)*0.01),]
apply(sub[,3:5],2,mean)
```

Not bad!  Now let's get predicted values and plot.

```{r}
design <- as.matrix(model.matrix(mod1))
betas <- sub[,3:5]
prop_max <- sub$prop_max

v <- list()
for(i in 1:nrow(betas)){
  pred_y <- design %*% t(as.matrix(betas[i,]))
  pred_y_noint <- pred_y - design[,c(1)] %*% t(as.matrix(betas[i,c(1)]))
  pred_y_fwl <- pred_y - design[,c(1,3)] %*% t(as.matrix(betas[i,c(1,3)]))
  v[[i]] <- data.frame(pred_y = pred_y,
                       pred_y_fwl = pred_y_fwl,
                       pred_y_noint = pred_y_noint,
                       design,
                       prop_max = prop_max[i],
                       pars=i)
}

preds <- do.call("rbind",v)

ggplot(preds,aes(x=x,y=pred_y,color=1-prop_max)) + 
  geom_point() + 
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"),col="orange",lwd=1.5) +
  geom_point(aes(x=x,y=2*x + 3*x^2),color="red") +
  labs(y="ABC Predictions",color="Fit Loss") +
  ggtitle("Full Predictions and CEF")
```

Here the blue dots are predictions from the posterior distribution, shaded by their fit (with darker being better).  The red dots are "true predictions" while the orange line is the conditional mean of the predictions.  Not bad at all!

Again we might imagine that rather than keeping only these values we subsample or resample in some way.  Here is effectively an unweighted bootstrap version of the above:

```{r}
pred_sub <- preds[sample(1:nrow(preds),1000,replace=T),]

ggplot(pred_sub,aes(x=x,y=pred_y,color=1-prop_max)) + 
  geom_point() + 
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"),col="orange",lwd=1.5) +
  geom_point(aes(x=x,y=2*x + 3*x^2),color="red") +
  labs(y="ABC Predictions",color="Fit Loss") +
  ggtitle("Re-Sampled Predictions and CEF")

```

Here there is very little difference, but we might imagine contexts in which we would prefer to give more weight to better-fitting models than to worse-fitting ones.  A crude sketch of that idea follows; beyond that, I leave it to the reader to explore further.
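
A crude version of that weighting -- the sharpening exponent here is entirely arbitrary -- is to hand the fit measure to `sample()` as resampling probabilities:

```{r}
# Sketch: resample prediction rows with probability increasing in fit, so
# better-fitting draws contribute more points to the plotted cloud.
pred_w <- preds[sample(1:nrow(preds), 1000, replace = TRUE,
                       prob = preds$prop_max^20), ]

ggplot(pred_w, aes(x = x, y = pred_y, color = 1 - prop_max)) + 
  geom_point() + 
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), col = "orange", lwd = 1.5) +
  labs(y = "ABC Predictions", color = "Fit Loss") +
  ggtitle("Fit-Weighted Re-Sample")
```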

Instead, we can take a look at effects plots for specific terms by tacitly applying the FWL theorem and removing the other terms' contributions to the prediction.  So, forget that the true relationship is quadratic for a moment and pretend that the terms are actually just different variables.
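
For the record, writing a parameter draw as $(\beta_0, \beta_1, \beta_2)$ for the intercept, $x$, and $x^2$ columns, `pred_y` is $\beta_0 + \beta_1 x + \beta_2 x^2$, `pred_y_noint` is $\beta_1 x + \beta_2 x^2$, and `pred_y_fwl` is $\beta_1 x$, so the quantity plotted below reduces to

$$
(\beta_0 + \beta_1 x + \beta_2 x^2) - (\beta_1 x + \beta_2 x^2) + \beta_1 x = \beta_0 + \beta_1 x ,
$$

i.e. the $x$ term's contribution plus the intercept, which is what gets compared against the true $2x$ in red.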

```{r}
ggplot(preds,aes(x=x,y=pred_y - pred_y_noint + pred_y_fwl,color=1-prop_max)) + 
  geom_point() + 
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"),color="orange",lwd=1.5) +
  geom_point(aes(x=x,y=2*x),color="red") +
  ggtitle("FWLish Partial Effect")
```

Not bad!  Also note that the plots naturally depict the density of the x-variable of interest.  Such is a natural benefit of simulation-based approaches.

# A little bit curvier now

One might wonder whether the same technique applies to functions with a greater degree of non-linearity.  Let's simulate a more subtle relationship (shamelessly taken from example code from ICPSR's Regression III):

```{r}
n <- 1000
x <- 0:(n-1)/(n-1)
f <- 0.2*x^11*(10*(1-x))^6+10*(10*x)^3*(1-x)^10
y <- f + rnorm(n, 0, sd = 2)

ggplot(mapping=aes(x=x)) + 
  geom_point(aes(y=y), pch=1) + 
  geom_line(aes(y=f), size=1) + 
  theme_bw() + 
  labs(x="x", y="y") -> g1
g1
```

We will model this, somewhat crudely, with splines.  One might do something like the following when confronted with this data:

```{r, message=F,warning=F}
dave_dat <- data.frame(y=y,x=x)
library(splines)
lin_mod <- lm(y ~ bs(x,df=10),data = dave_dat)
g1 + geom_point(aes(y=lin_mod$fitted.values), col="orange")
```

Not awful, but not great.  Splines are one example in which we might not "believe" the model we are estimating; rather, we are trying to capture a relationship we are uncertain about, using statistical machinery as a tool toward that end.

Let's see how ABC does.  Again, we fit the model object only as a convenience for extracting the appropriate terms; all you really need is the design matrix.  For splines, this just means additional basis functions of the x variable of interest.  In this case, the columns of the model matrix look like this:

```{r}
colnames(model.matrix(lin_mod))
```

This time we will crank up the number of draws to 50,000 and retain only the top 1% of models.  Of note, one weakness of naive ABC -- like that discussed here -- is that the dimensionality of the prior space grows with the dimensionality of the model.  More parameters mean that more samples need to be drawn to find well-fitting models and "explore the space."  This motivates more advanced approaches, not discussed here.
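
To get a rough sense of why, suppose each coordinate of a candidate draw has to land in a fraction $f$ of its prior range before the overall fit is acceptable.  With independent uniform draws over a $p$-dimensional hypercube, the chance of an acceptable draw is on the order of

$$
\Pr(\text{acceptable draw}) \approx f^{\,p},
$$

so even a forgiving $f = 0.5$ gives roughly one acceptable draw per $2^{11} \approx 2{,}000$ attempts for the 11-column design here.  This is a back-of-the-envelope caricature rather than a precise statement, but it conveys why the number of draws has to grow so quickly with model size.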

Anyway, first let's take a look at only the top 3 models to see what is going on:

```{r}
out <- pblapply(1:50000,function(x)flat_abc_draw(lin_mod,lims=c(-5,15)))
out <- do.call("rbind",out)
tibble(out[order(out$rmse),]) %>% 
  mutate(rank = 1:nrow(out), prop_max = min(rmse)/rmse) -> out

# Subset down to close to truth
sub <- out[which(out$rank <= 3),]

design <- as.matrix(model.matrix(lin_mod))
betas <- sub[,3:(2+ncol(design))]
prop_max <- sub$prop_max

v <- list()
for(i in 1:nrow(betas)){
  pred_y <- design %*% t(as.matrix(betas[i,]))
  v[[i]] <- data.frame(pred_y = pred_y,
                       design,
                       prop_max = prop_max[i],
                       pars=i)
}

preds <- do.call("rbind",v)
preds$x <- dave_dat$x
preds$f <- with(preds,0.2*x^11*(10*(1-x))^6+10*(10*x)^3*(1-x)^10)

ggplot(preds,aes(x=x,y=pred_y,color=1-prop_max)) + 
  geom_point() + 
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), color="orange", lwd=2) +
  labs(y="ABC Predictions",color="Fit Loss") +
  ggtitle("Getting Curvy") +
  geom_point(aes(y=f),color="red")
```

Again, shades of blue are model predictions, with darker being better fit.  Red is the truth and orange is the conditional mean of the plotted predictions.  Much like the fitted spline model above, none of these models does a perfect job of capturing the relationship, but none of them is completely worthless either.  What this sort of ABC plot says is "we will average the expectation of the models we believe," a process which leads to somewhat better fit.  And before you say it: if this is starting to sound like machine learning, then we are on the same page.

Let's bring more models into consideration; the top 1%.  We will also resample the predictions, here only to speed up the plot and to emphasize again that you can do such a thing.

```{r}
sub <- out[which(out$rank <= 0.01*nrow(out)),]

design <- as.matrix(model.matrix(lin_mod))
betas <- sub[,3:(2+ncol(design))]
prop_max <- sub$prop_max

v <- list()
for(i in 1:nrow(betas)){
  pred_y <- design %*% t(as.matrix(betas[i,]))
  v[[i]] <- data.frame(pred_y = pred_y,
                       design,
                       prop_max = prop_max[i],
                       pars=i)
}

preds <- do.call("rbind",v)
preds$x <- dave_dat$x
preds$f <- with(preds,0.2*x^11*(10*(1-x))^6+10*(10*x)^3*(1-x)^10)

pred_sub <- preds[sample(1:nrow(preds),10*nrow(dave_dat),replace=T),]

ggplot(pred_sub,aes(x=x,y=pred_y,color=1-prop_max)) + 
  geom_point() + 
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), color="orange", lwd=2) +
  labs(y="ABC Predictions",color="Fit Loss") +
  ggtitle("Re-Sampled Predictions and CEF") +
  geom_point(aes(y=f),color="red")
```

Not awful, but not perfect either.  But then again, we have taken an incredibly brute-force approach to the problem -- simulate parameter draws, evaluate fit, reject if the fit is too poor, predict from the accepted draws, then average.  For regions towards the center of the data, where the function doesn't change a great deal, even this quite frankly slapped-together model approximates the truth quite well.  It also adequately represents our naive uncertainty in the task.

# Multilevel Models

And of course you can do the same thing for multilevel models as well.  Let's do a moderately complex setting with both group-varying intercepts and slopes.

```{r}
set.seed(1234)
nobs <- 10000
n_groups <- 10

grp <- rep_len(1:n_groups,nobs)
labels <- letters[grp]
grp_means <- rnorm(n_groups,0,2)
grp_ints <- rnorm(n_groups,0,2)
grp_slps <- rnorm(n_groups,0,2)

x <- rnorm(nobs,grp_means[grp])
err <- rnorm(nobs)

y <- grp_slps[grp]*x + grp_ints[grp] + err

data <- data.frame(y,x,grp,grp_means[grp],labels)

ggplot(data,aes(x=x,y=y,color=labels)) + 
  geom_point() +
  geom_smooth(method="lm", formula=y~x)
```

We will once more fit a simple linear model to simplify constructing the design matrix.

```{r}
multilevel <- lm(y~x*labels-1,data=data)
truth <- data.frame(labels = labels,
                    true_int = grp_ints,
                    true_slp = grp_slps)[1:n_groups,]
summary(multilevel)
```

And take a quick look at the true slopes and intercepts:
```{r}
truth
```

Baller.  For ABC we will skip graphical matters and simply extract the posterior means of the top 0.01% (i.e., the best 100) of 1,000,000 models.  Other decision rules will lead to different results.

```{r, warning=F,message=F}
library(parallel)
cl <- makeCluster(detectCores())
limz <- c(-10,10)
clusterExport(cl=cl,varlist=c("flat_abc_draw","multilevel","limz"))

abc_multi <- flat_abc(multilevel,1000000,limz,cl=cl)
stopCluster(cl)
best <- abc_multi[abc_multi$rank <= 0.0001*nrow(abc_multi),]

post_means <- apply(best[,3:(ncol(best)-2)],2,mean)
# Group slopes: the x main effect is group a's slope; groups b-j add their interaction contrast
data.frame(truth,
           post_int = post_means[2:(n_groups+1)],
           post_slp = c(post_means[1],
                        post_means[1] + post_means[(n_groups+2):length(post_means)])) -> results

ggplot(results) +
  geom_point(aes(true_int,post_int),color="red") +
  geom_point(aes(true_slp,post_slp),color="blue") +
  labs(x="Truth",y="Posterior Mean") +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_abline(slope=1,intercept=0)
```

We can notice a few things.  First, compared to the simpler examples we are doing a bit worse.  Why is this?  Well, for one, we are starting to hit the boundaries of what we can reasonably expect from the particularly simple implementation of ABC above -- hence cranking up the number of draws.  Remember, all we are doing is sampling from a hypercube of dimension equal to the number of terms on the right-hand side of the regression we are conducting.  Here that is 20 terms -- it's a large space to explore and we aren't taking a ton of draws.

That said, there are few misleading inferences -- nearly everything takes on the same sign as the truth, and the sign mismatches that do occur are for parameters whose true values sit very close to zero.  On the opposite end, the one "big coefficient" of around 5 is underestimated.  Overall, however, it ain't bad!

# Unidentified Models

The last thing of note is that it is feasible to estimate models with more parameters than observations in an ABC framework.  The reason for this is that we aren't estimating parameters so much as simulating them; we don't run into the issue of more unknowns than equations.
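
In least-squares terms, the problem is that

$$
\hat{\beta} = (X^\top X)^{-1} X^\top y
$$

requires inverting a $p \times p$ matrix whose rank is at most $n$, so the inverse simply does not exist when $p > n$.  The loss used here, $\mathrm{RMSE}(\beta) = \sqrt{\tfrac{1}{n}\lVert y - X\beta \rVert^2}$, only requires a forward multiplication and so can be evaluated for any candidate $\beta$ we care to propose.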

Let's illustrate this in a relatively simple case.  We will have 10 covariates that matter and 10 which don't, plus an intercept to estimate, but only 20 observations.  The case is obviously lame, but you might imagine a setting such as dynamic networks or a fully saturated model where one may want to estimate the coefficients rather than conduct variable selection.

```{r, warning=F, message=F, results="asis"}
library(MASS)
library(texreg)

obs <- 20
X <- mvrnorm(obs,rep(0,obs),Sigma = diag(1,obs))
beta <- c(rep(1,10),rep(0,obs-10))
y <- X %*% beta + rnorm(obs)
d <- data.frame(y=y,X)

unidentified <- lm(y~.,data=d)
identified <- lm(y~.,data=d[,1:15])

htmlreg(list(identified,unidentified),doctype = F)
```

Zero degrees of freedom sucks!  No summary of uncertainty, and not all parameters get a value.  There are existing techniques to deal with this sort of problem (ridge and lasso are both still popular, etc).  

But we can still use the exact same machinery introduced above.  The basic idea is simple.  Since we generated the data, we know the true parameter vector that generated it.  Here is the RMSE of the true model:

```{r}
sqrt(mean((y-X%*%beta)^2))
```

In practice, of course, we don't know the elements of the parameter vector.  So, just like before: sample, calculate the loss, reject those draws that fit too poorly, and summarize the rest.  Insofar as the entries of the true parameter vector lie within the prior space, then, with enough samples, we can point in their direction.

```{r}
cl <- makeCluster(detectCores())
limz <- c(-1,3)
clusterExport(cl=cl,varlist=c("flat_abc_draw","unidentified","limz"))
abc_unid <- flat_abc(unidentified,500000,limz,cl=cl)
stopCluster(cl)

sub <- abc_unid[abc_unid$rank <= 250,]
means <- apply(sub[,3:(ncol(sub)-2)],2,mean)
ggplot(data.frame(truth=c(0,beta),post=means),aes(x=truth,y=post)) + 
  geom_point() +
  labs(x="Truth",y="Posterior Means")
```

As we can see above, the approach doesn't do too terribly, but there is room for improvement.  We are able to more or less separate the positive coefficients from the null coefficients; more samples would mean more accuracy.  Importantly, we also get a posterior distribution for summarizing our uncertainty.

```{r}
ggplot(sub,aes(x=X1)) + 
  geom_histogram(binwidth = 0.1, col="black",fill="blue") +
  ggtitle("Posterior of X1 (Truth = 1)")
```

```{r}
ggplot(sub,aes(x=X12)) + 
  geom_histogram(binwidth = 0.1, col="black",fill="blue") +
  ggtitle("Posterior of X12 (Truth = 0)")
```

# What about more complicated sampling?

Look at the last posterior distribution plot -- do you notice anything?  It is truncated, and hence asymmetric, because the sampling algorithm never looks below -1 in this case.

This points to two areas in which the above -- perhaps the simplest and dumbest way of going about ABC -- can be improved, and to how the method is handled in practice.  There are two primary considerations:

1) Adaptive Search -- even if we believe that the truth resides within the hypercube used above, we can be wrong.  This motivates more advanced sampling algorithms, frequently MCMC-based, which seek out regions with a higher probability of acceptance.  This also improves the efficiency of the sampling -- fewer draws are needed -- which is incredibly important for fighting the curse of dimensionality.  (A toy sketch in this spirit follows the list.)

2) Rejection Sampling -- in the above we used what I called ex-post rejection sampling.  Production algorithms embed the rejection rule in the search itself, usually based around sufficient statistics, both to guide the algorithm (you can't adapt to increase the acceptance probability in a region without knowing how you will reject) and to improve the quality of the fit (RMSE, used above, is an incredibly crude measure of whether the outcome is actually being reproduced).
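
To give a flavor of the first point, here is a toy chain in the spirit of ABC-MCMC (Marjoram et al. 2003): start from the best draw found by the crude search above, propose small local moves, and keep a move only if it stays inside the flat prior and its RMSE stays below a tolerance (with a flat prior and a symmetric proposal, the usual Metropolis-Hastings ratio reduces to exactly this check).  The proposal scale, tolerance, chain length, and burn-in are all arbitrary choices; this is a sketch, not a production algorithm.

```{r}
# Toy ABC-MCMC-style sampler for the unidentified model (illustrative sketch only).
X_u <- as.matrix(model.matrix(unidentified))
y_u <- unidentified$model[,1]
rmse_fn <- function(b) sqrt(mean((y_u - X_u %*% b)^2))

start <- as.numeric(as.matrix(sub[1, 3:(2 + ncol(X_u))]))  # best draw from the search above
eps   <- max(sub$rmse)        # stay within the "acceptable fit" region used above
lims  <- c(-1, 3)             # same flat prior bounds as before

theta <- start
draws <- matrix(NA, 2000, ncol(X_u), dimnames = list(NULL, colnames(X_u)))
for(s in 1:nrow(draws)){
  prop <- theta + rnorm(length(theta), 0, 0.1)       # local random-walk proposal
  if(all(prop > lims[1] & prop < lims[2]) && rmse_fn(prop) < eps){
    theta <- prop                                    # accept; otherwise stay put
  }
  draws[s,] <- theta
}
round(colMeans(draws[-(1:500), ])[1:6], 2)           # a peek at some posterior-ish means
```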

These concerns aside, even the really dumb implementation examined above works surprisingly well in simple applied problems.

ABC is neat!