Gelman and Hill (2006) detail a procedure for validating a regression model: use the fitted coefficients to generate a simulated distribution and compare it to the original \(y\). If the two distributions coincide, that is evidence the hypothesized model captures the process that generates \(y\); if not, it suggests the model fits poorly.
Below, I generate a simulated dataset with a Poisson-distributed dependent variable and three independent variables (one each drawn from a normal, a binomial, and a negative binomial distribution). I fit two models, one that accurately describes the simulated data and one that does not, then simulate from both regressions and compare the results to the original \(y\).
Simulating the Data
First, let’s simulate 1000 values of our independent variables:
library(ggplot2)
library(gridExtra)

n <- 1000
set.seed(1804)
x1 <- rnorm(n, mean=0, sd=.5) # normally distributed IV
x2 <- sample(0:1, size=n, replace=TRUE, prob=c(0.5, 0.5)) # dichotomous IV (balanced)
x3 <- rnbinom(n, size=1, prob=0.57) # negative binomially distributed IV
grid.arrange(
  ncol=3,
  qplot(x1, geom='density', fill=1),
  qplot(x2, geom='histogram', fill=1),
  qplot(x3, geom='histogram', fill=1)
)
Now, let's assign some true coefficients, chosen more or less at random, to represent base reality:
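The exact values below are illustrative, chosen near the estimates the correctly specified model recovers later:
b0 <- 0    # true intercept (illustrative value)
b1 <- 0.5  # true effect of x1 (illustrative value)
b2 <- 1.3  # true effect of x2 (illustrative value)
b3 <- -0.3 # true effect of x3 (illustrative value)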
Next, let’s specify the linear part of the model \(\eta\):
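eta <- b0 + b1*x1 + b2*x2 + b3*x3 # the linear predictor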
Link to \(y\) (Poisson distributed) and combine into a single data frame:
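y <- rpois(n, lambda=exp(eta)) # exp() inverts the log link
df <- data.frame(x1, x2, x3, y)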
Quick visualization:
grid.arrange(
  ncol=2, nrow=2,
  qplot(y, geom='density', fill=1),
  qplot(x1, y, data=df, geom=c('point', 'smooth')),
  qplot(x2, y, data=df, geom=c('point', 'smooth')),
  qplot(x3, y, data=df, geom=c('point', 'smooth'))
)
Modeling
Next we'll fit two regressions. First \(M_0\), which represents the base truth:
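m0 <- glm(y ~ x1 + as.factor(x2) + x3, df, family=poisson()) # correctly specified model
summary(m0)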
##
## Call:
## glm(formula = y ~ x1 + as.factor(x2) + x3, family = poisson(),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9350 -1.0544 -0.1787 0.5571 3.3792
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.07573 0.05253 -1.442 0.149
## x1 0.53251 0.04466 11.924 <2e-16 ***
## as.factor(x2)1 1.32146 0.05632 23.463 <2e-16 ***
## x3 -0.28172 0.02780 -10.133 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2092.9 on 999 degrees of freedom
## Residual deviance: 1105.1 on 996 degrees of freedom
## AIC: 3092.8
##
## Number of Fisher Scoring iterations: 5
All predictors are highly significant and their estimated coefficients are close to the true values, as expected.
Now let's create a second model, \(M_1\), that doesn't fit the data. Since I can't easily edit the coefficients of a glm object to make it more wrong, I will instead perturb the original predictors and train an erroneous model on the modified data.
library(dplyr)

df1 <- df %>%
  mutate(x1 = x1 + sample(seq(-0.04466, 0.04466, .00001), length(x1)),
         x2 = sample(0:1, size=length(x2), replace=TRUE, prob=c(0.5, 0.5)),
         x3 = x3 + sample(seq(-0.02780, 0.02780, .00001), length(x3)))
m1 <- glm(y ~ x1 + as.factor(x2) + x3, df1, family=poisson())
We can see that some of the parameters are approximately the same, while others are clearly different:
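rbind(coef(m0), coef(m1)) # compare the two sets of fitted coefficients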
## (Intercept) x1 as.factor(x2)1 x3
## [1,] -0.07572692 0.5325117 1.32146134 -0.2817232
## [2,] 0.76772167 0.5730307 0.07778385 -0.2914895
Simulating from glm regression output
We'll write a function to simplify this. It returns n simulated distributions from the Poisson regression model \(m\), given a matrix X of \(m\)'s predictors:
# Outputs a list of n numeric vectors, each representing a simulated
# distribution drawn from the regression model m's results and the predictor
# variables stored in X
# n: number of simulations
# X: matrix of predictors (including an intercept column)
# m: Poisson regression glm model object
regression_sim <- function(n, X, m) {
  n_X <- nrow(X)
  sim_obj <- arm::sim(m, n) # draw n plausible coefficient vectors
  y_rep <- list()
  for (s in 1:n) {
    y_hat <- exp(X %*% sim_obj@coef[s, ]) # inverse of the log link
    y_rep[[s]] <- rpois(n_X, y_hat)
  }
  return(y_rep)
}
Simulating from \(M_0\)
Use this function to evaluate the correct model:
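X <- as.matrix(cbind(rep(1, n), df[, 1:3])) # prepend a column of 1s for the intercept
m0_sims <- regression_sim(1000, X, m0)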
A simple way to compare these simulations with \(y\) is to plot the first several of them. The actual distribution is in black, and the simulated ones in red:
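par(mfrow=c(4, 4))
for (i in 1:16){
  plot(density(df$y))
  lines(density(m0_sims[[i]]), col='red')
}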
It's potentially difficult to see these results on a small screen, but suffice it to say the red and black distributions match up very well in most plots.
Simulating from \(M_1\)
Now let’s evaluate the incorrect model, \(M_1\):
X_1 <- as.matrix(cbind(rep(1, n), df1[, 1:3]))
m1_sims <- regression_sim(1000, X_1, m1)
par(mfrow=c(4, 4))
for (i in 1:16){
  plot(density(df$y))
  lines(density(m1_sims[[i]]), col='red')
}
We can see immediately that the simulations from the second model come nowhere close to the original distribution.
Conclusion
Simulating from regression model results is a neat way to help validate models. It provides a good sanity check when all the residual plots, standard errors, and the rest prove more confusing than helpful.
We relied on eyeballing the difference between the simulated distributions and the real one. A more formal method would entail calculating test statistics, as Gelman and Hill do. Alternatively, we could use the Kullback-Leibler divergence to measure the 'distance' between the two probability distributions.
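As a rough sketch of the latter approach (the binning scheme and helper function below are my own illustration, not from Gelman and Hill): discretize both samples onto a common grid and compare the resulting probabilities:
# Approximate KL divergence between two samples: bin both on a common
# grid and compare the resulting discrete distributions
kl_divergence <- function(p_sample, q_sample, n_bins=30) {
  breaks <- seq(min(p_sample, q_sample), max(p_sample, q_sample),
                length.out=n_bins + 1)
  p <- hist(p_sample, breaks=breaks, plot=FALSE)$counts + 1e-10 # smooth to avoid log(0)
  q <- hist(q_sample, breaks=breaks, plot=FALSE)$counts + 1e-10
  p <- p / sum(p)
  q <- q / sum(q)
  sum(p * log(p / q)) # KL(P || Q)
}
kl_divergence(df$y, m0_sims[[1]]) # should be near zero for the correct model
kl_divergence(df$y, m1_sims[[1]]) # should be noticeably larger for M1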
References
- Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.