Gelman and Hill (2006) detail a procedure for validating a regression model: use the fitted coefficients to generate a simulated distribution and compare it to the original \(y\). If the two distributions coincide, that is evidence the hypothesized model captures the process that generates \(y\); if not, it suggests the model fits poorly.
Below, I generate a simulated dataset with a Poisson-distributed dependent variable and three independent variables (one each drawn from a normal, a binomial, and a negative binomial distribution). I fit two models, one that accurately describes the simulated data and one that does not, then simulate from both regressions and compare the results to the original \(y\).
Simulating the Data
First, let’s simulate 1000 values of our independent variables:
library(ggplot2)
library(gridExtra)

n <- 1000
set.seed(1804)
x1 <- rnorm(n, mean=0, sd=.5) # normally distributed IV
x2 <- sample(0:1, size=n, replace=TRUE, prob=c(0.5, 0.5)) # dichotomous IV (balanced)
x3 <- rnbinom(n, size=1, prob=0.57) # negative binomially distributed IV
grid.arrange(
  ncol=3,
  qplot(x1, geom='density', fill=1),
  qplot(x2, geom='histogram', fill=1),
  qplot(x3, geom='histogram', fill=1)
)
Now, let's assign some true coefficients, chosen more or less at random, to represent base reality:
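The exact values below are illustrative, chosen near the estimates the correctly specified model recovers later:
b0 <- 0    # true intercept (illustrative value)
b1 <- 0.5  # true effect of x1 (illustrative value)
b2 <- 1.3  # true effect of x2 (illustrative value)
b3 <- -0.3 # true effect of x3 (illustrative value)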
Next, let’s specify the linear part of the model \(\eta\):
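eta <- b0 + b1*x1 + b2*x2 + b3*x3 # the linear predictor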
Link to \(y\) (Poisson distributed) and combine into a single data frame:
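y <- rpois(n, lambda=exp(eta)) # exp() inverts the log link
df <- data.frame(x1, x2, x3, y)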
Quick visualization:
grid.arrange(
  ncol=2, nrow=2,
  qplot(y, geom='density', fill=1),
  qplot(x1, y, data=df, geom=c('point', 'smooth')),
  qplot(x2, y, data=df, geom=c('point', 'smooth')),
  qplot(x3, y, data=df, geom=c('point', 'smooth'))
)
Modeling
Next we'll fit two regressions. First \(M_0\), which represents the base truth:
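m0 <- glm(y ~ x1 + as.factor(x2) + x3, df, family=poisson()) # correctly specified model
summary(m0)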
##
## Call:
## glm(formula = y ~ x1 + as.factor(x2) + x3, family = poisson(),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9350 -1.0544 -0.1787 0.5571 3.3792
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.07573 0.05253 -1.442 0.149
## x1 0.53251 0.04466 11.924 <2e-16 ***
## as.factor(x2)1 1.32146 0.05632 23.463 <2e-16 ***
## x3 -0.28172 0.02780 -10.133 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2092.9 on 999 degrees of freedom
## Residual deviance: 1105.1 on 996 degrees of freedom
## AIC: 3092.8
##
## Number of Fisher Scoring iterations: 5
All predictors are highly significant and their estimated coefficients are close to the true values, as expected.
Now let's create a second model, \(M_1\), that doesn't fit the data. Since I can't easily edit the coefficients of a glm object to make it more wrong, I will instead perturb the original predictors and train an erroneous model on the modified data.
library(dplyr)

df1 <- df %>%
  mutate(x1 = x1 + sample(seq(-0.04466, 0.04466, .00001), length(x1)),
         x2 = sample(0:1, size=length(x2), replace=TRUE, prob=c(0.5, 0.5)),
         x3 = x3 + sample(seq(-0.02780, 0.02780, .00001), length(x3)))
m1 <- glm(y ~ x1 + as.factor(x2) + x3, df1, family=poisson())
We can see that some of the parameters are approximately the same, while others are clearly different:
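rbind(coef(m0), coef(m1)) # compare the two sets of fitted coefficients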
## (Intercept) x1 as.factor(x2)1 x3
## [1,] -0.07572692 0.5325117 1.32146134 -0.2817232
## [2,] 0.76772167 0.5730307 0.07778385 -0.2914895
Simulating from glm regression output
We'll write a function to simplify this. It returns n simulated distributions from the Poisson regression model \(m\), given a matrix X of \(m\)'s predictors:
# Outputs a list of n numeric vectors, each representing a simulated
# distribution drawn from the regression model m's results and the predictor
# variables stored in X
# n: number of simulations
# X: matrix of predictors (including an intercept column)
# m: Poisson regression glm model object
regression_sim <- function(n, X, m) {
  n_X <- nrow(X)
  sim_obj <- arm::sim(m, n) # draw n plausible coefficient vectors
  y_rep <- list()
  for (s in 1:n) {
    y_hat <- exp(X %*% sim_obj@coef[s, ]) # inverse of the log link
    y_rep[[s]] <- rpois(n_X, y_hat)
  }
  return(y_rep)
}
Simulating from \(M_0\)
Use this function to evaluate the correct model:
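X <- as.matrix(cbind(rep(1, n), df[, 1:3])) # prepend a column of 1s for the intercept
m0_sims <- regression_sim(1000, X, m0)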
A simple way to compare these simulations with \(y\) is to plot the first several of them. The actual distribution is in black, and the simulated ones in red:
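par(mfrow=c(4, 4))
for (i in 1:16){
  plot(density(df$y))
  lines(density(m0_sims[[i]]), col='red')
}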
It's potentially difficult to see these results on a small screen, but suffice it to say the red and black distributions match up very well in most plots.
Simulating from \(M_1\)
Now let’s evaluate the incorrect model, \(M_1\):
X_1 <- as.matrix(cbind(rep(1, n), df1[, 1:3]))
m1_sims <- regression_sim(1000, X_1, m1)
par(mfrow=c(4, 4))
for (i in 1:16){
  plot(density(df$y))
  lines(density(m1_sims[[i]]), col='red')
}
We can see immediately that the simulations from the second model come nowhere close to the original distribution.
Conclusion
Simulating from regression model results is a neat way to help validate models. It provides a good sanity check when all the residual plots, standard errors, and the rest prove more confusing than helpful.
We relied on eyeballing the difference between the simulated distributions and the real one. A more formal method would entail calculating test statistics, as Gelman and Hill do. Alternatively, we could use the Kullback-Leibler divergence to measure the 'distance' between the two probability distributions.
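As a rough sketch of the latter approach (the binning scheme and helper function below are my own illustration, not from Gelman and Hill): discretize both samples onto a common grid and compare the resulting probabilities:
# Approximate KL divergence between two samples: bin both on a common
# grid and compare the resulting discrete distributions
kl_divergence <- function(p_sample, q_sample, n_bins=30) {
  breaks <- seq(min(p_sample, q_sample), max(p_sample, q_sample),
                length.out=n_bins + 1)
  p <- hist(p_sample, breaks=breaks, plot=FALSE)$counts + 1e-10 # smooth to avoid log(0)
  q <- hist(q_sample, breaks=breaks, plot=FALSE)$counts + 1e-10
  p <- p / sum(p)
  q <- q / sum(q)
  sum(p * log(p / q)) # KL(P || Q)
}
kl_divergence(df$y, m0_sims[[1]]) # should be near zero for the correct model
kl_divergence(df$y, m1_sims[[1]]) # should be noticeably larger for M1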
References
- Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.