Simulation studies are used in statistics and computer science to evaluate the performance of estimators against a set of criteria. Typically, one first generates a synthetic dataset using fully specified distributions. The estimation method(s) of interest is (are) then applied to the synthetic data and the results recorded. These two phases are independently repeated \(M\) times, and the results over the \(M\) runs are then summarized.
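As a toy illustration of this structure (a minimal sketch, not the study conducted in this tutorial), suppose we evaluate the sample mean as an estimator of the mean of a standard normal distribution:
M = 200                     # number of independent runs
est = numeric(M)
for (m in 1:M) {
  dat = rnorm(100)          # phase 1: generate a synthetic dataset of size 100
  est[m] = mean(dat)        # phase 2: apply the estimator and record the result
}
mean(est) - 0               # summary over the M runs: estimated bias (the true mean is 0)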
It is important to identify the specific goal of your simulation study. If the goal is parameter estimation, then you must first derive or estimate the true value of the parameter of interest, which depends on the distribution used to generate the data. Another potential goal in a simulation study is to evaluate the performance of a prediction method. This tutorial will focus on simulation studies for the evaluation of an estimator of a parameter from a frequentist perspective.
As a simple example, suppose that we are interested in comparing the performance of a linear regression with a spline regression to estimate the expected value of an outcome, \(Y\), for a certain value of a covariate, \(X\). Our parameter of interest will be \(E(Y\mid X=2)\).
To conduct the simulation study, we will first choose the distributions of the variables \(X\) and \(Y\). We will generate 1000 independent realizations (or draws) of these two variables. We will apply the linear regression estimator and then the spline estimator, using predict to obtain an estimate of the expectation of \(Y\) at the value \(X=2\), along with a standard error of the prediction. We will then record the estimate and standard error by writing them to a file. Finally, we will use a loop to repeat this procedure \(M=200\) times.
We will demonstrate how to obtain the true value of \(E(Y\mid X=2)\) using an analytical approach and a numerical method.
Let’s set both distributions to be Gaussian and take 1000 independent draws of \((X,Y)\):
n = 1000
X = rnorm(n)
Y = rnorm(n,mean=(0.5*X+X^2))
In order to estimate the parameter of interest on successive runs, we can define functions that output the information that we need as a list:
linreg<-function(X,Y){ #linear regression, main term X
predobj<-predict(lm(Y~X),newdata=data.frame(X=2),se.fit=T)
return(list(est=predobj$fit,se=predobj$se.fit))
}
splinereg<-function(X,Y){ #spline regression, degree 3
predobj<-predict(lm(Y~poly(X,3)),newdata=data.frame(X=2),se.fit=T)
return(list(est=predobj$fit,se=predobj$se.fit))
}
We then run the functions on the synthetic data and save the results. To write them to a file in the working directory, we can use
reslm<-linreg(X,Y)
ressp<-splinereg(X,Y)
write(c(reslm$est,reslm$se,ressp$est,ressp$se),"results_single.txt",ncol=10,append=T)
We want to compare our results with the true value of the parameter of interest. In this example the analytical calculation is trivial: since \(Y\mid X=x\) is normal with mean \(0.5x+x^2\) under our data-generating distribution, \(E(Y\mid X=2)=0.5\times 2+2^2=5\). When the true value is less easily derived (through algebra or numerical integration), one can estimate it numerically.
n.s = 100000 #large number
X.s = 2
Y.s = rnorm(n.s,mean=(0.5*X.s+X.s^2))
mean(Y.s)
## [1] 4.998067
If the resulting value of mean(Y.s) is unstable, the sample size used for this calculation can be increased and/or the procedure can be looped and the results averaged.
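For example, one possible sketch of such a loop (using 50 repetitions here, an arbitrary choice) is:
truth.reps = numeric(50)
for (k in 1:50) {
  Y.s = rnorm(n.s, mean = (0.5*X.s + X.s^2))  # re-draw a large sample of Y at X = 2
  truth.reps[k] = mean(Y.s)
}
mean(truth.reps)  # averaged numerical approximation of E(Y | X = 2)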
Now we set the true value to
truth=5
The errors from one simulated dataset can be calculated as:
reslm$est-truth #error for the linear regression
## 1
## -2.778685
ressp$est-truth #error for the spline regression
## 1
## -0.07897688
We see that the linear regression gives a larger error than the spline regression. This makes sense because the true model for \(E(Y\mid X)\), which contains a squared term in \(X\), is not contained in the linear regression model but is contained in the spline model of degree 3.
l1<-predict(lm(Y~X))
l2<-predict(lm(Y~poly(X,3)))
j <- order(X)
plot(Y ~ X, pch=19,cex=0.1)
lines(X[j],l1[j],col="red",lwd=4)
lines(X[j],l2[j],col="blue",lwd=2)
abline(v=2,lty="dashed")
From the above plot, we’d expect that the linear regression (red, thicker curve) will not perform well as we move away from the centre of mass of the data. The spline regression of degree 3 (blue, thinner curve) does a better job, in particular near the value of interest, \(X=2\) (vertical dashed line).
In order to evaluate the performance of the estimators over multiple draws, we will loop the above procedure. But we will make sure that this procedure is reproducible by setting seeds for each iteration.
We generate 200 random seeds and then store them to a file:
set.seed(111)
write(sample(1:10000000,size=200),file="seeds.txt",ncolumns=1)
This is only done once and the file is stored permanently. Any time we want to run the simulation study, we then read in the seeds stored in this file:
seeds<-read.table("seeds.txt",header=F)$V1
and set each seed when we generate the data in each loop. This way, the same datasets will be generated each time we run the loop.
The complete simulation study can be conducted as follows (using only 200 draws for the purposes of this tutorial, though normally we would use 1000 or more). Here, we use a sample size of \(n=1000\), which emulates a study with 1000 units or participants.
n = 1000
for (i in 1:200){
#generate data
set.seed(seeds[i]) #sets the seed to the ith element in the vector seeds
X = rnorm(n)
Y = rnorm(n,mean=(0.5*X+X^2))
#get estimates
reslm<-linreg(X,Y)
ressp<-splinereg(X,Y)
#write to file
write(c(i,reslm$est,reslm$se,ressp$est,ressp$se),"results.txt",ncol=10,append=T)
}
Now we should have a file containing a \(200\times 5\) matrix of results. The first column is the iteration number. The second and fourth columns are the estimates for the linear and spline regressions, respectively, and the third and fifth columns are the standard errors for the respective estimators. The results are imported using:
dat1<-read.table("results.txt",header=F) #load results
dim(dat1) #check dimension
## [1] 200 5
We can summarize the estimated bias (or average error) using
colMeans(dat1[,c(2,4)])-truth
## V2 V4
## -3.010774195 0.002809875
and the average standard error using
sqrt(colMeans(dat1[,c(3,5)]^2))
## V3 V5
## 0.1221631 0.1038508
Note that this takes the square root of the mean of the estimation variances produced by the lm function in each run.
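In other words, writing \(\widehat{SE}_m\) for the standard error reported in run \(m\) of \(M\), the quantity computed above is
\[
\text{avg SE} = \sqrt{\frac{1}{M}\sum_{m=1}^{M} \widehat{SE}_m^{\,2}},
\]
i.e. the averaging is done on the variance scale rather than directly on the standard errors.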
To automatically generate a LaTeX table, we could use the package xtable.
bias<-colMeans(dat1[,c(2,4)])-truth
avg.se<-sqrt(colMeans(dat1[,c(3,5)]^2))
datfr<-data.frame(cbind(bias,avg.se))
names(datfr)<-c("Bias","Avg SE")
rownames(datfr)<-c("Linear regression","Spline regression deg 3")
datfr
## Bias Avg SE
## Linear regression -3.010774195 0.1221631
## Spline regression deg 3 0.002809875 0.1038508
library(xtable)
xtable(datfr)
## % latex table generated in R 4.1.0 by xtable 1.8-4 package
## % Tue Jun 08 17:23:05 2021
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrr}
## \hline
## & Bias & Avg SE \\
## \hline
## Linear regression & -3.01 & 0.12 \\
## Spline regression deg 3 & 0.00 & 0.10 \\
## \hline
## \end{tabular}
## \end{table}
From this, we see that the linear regression was biased while the spline regression of degree 3 was not. The linear regression also had a larger average standard error than the spline regression.
Now suppose that we want to add a third comparator. For example, we can investigate the performance of a spline of degree 2. Below we add predictions from a spline of degree 2 (thicker black curve) to the previous plot. It overlaps the spline of degree 3 in the region of interest, so we’d expect the performance to be similar.
l1<-predict(lm(Y~X))
l2<-predict(lm(Y~poly(X,3)))
j <- order(X)
plot(Y ~ X, pch=19,cex=0.1)
lines(X[j],l1[j],col="red",lwd=4)
lines(X[j],l2[j],col="blue",lwd=2)
abline(v=2,lty="dashed")
l3<-predict(lm(Y~poly(X,2)))
lines(X[j],l3[j],col="black",lwd=4)
Because we can reproduce the same simulated datasets, we can run a separate loop and don’t need to rerun the other estimators in order to make a fair comparison.
splinereg2<-function(X,Y){ #spline regression, degree 2
predobj<-predict(lm(Y~poly(X,2)),newdata=data.frame(X=2),se.fit=T)
return(list(est=predobj$fit,se=predobj$se.fit))
}
for (i in 1:200){
#generate data
set.seed(seeds[i]) #sets the seed to the ith element in the vector seeds
X = rnorm(n)
Y = rnorm(n,mean=(0.5*X+X^2))
#get estimates
ressp2<-splinereg2(X,Y)
#write to file
write(c(i,ressp2$est,ressp2$se),"results2.txt",ncol=10,append=T)
}
Now we can add these results to the table.
dat2<-read.table("results2.txt",header=F) #load results
dim(dat2) #check dimension
## [1] 200 3
datfr<-rbind(datfr,c(mean(dat2[,2])-truth,sqrt(mean(dat2[,3]^2))))
rownames(datfr)[3]<-"Spline regression deg 2"
datfr
## Bias Avg SE
## Linear regression -3.010774195 0.12216310
## Spline regression deg 3 0.002809875 0.10385084
## Spline regression deg 2 0.001414360 0.09848925
xtable(datfr)
## % latex table generated in R 4.1.0 by xtable 1.8-4 package
## % Tue Jun 08 17:23:14 2021
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrr}
## \hline
## & Bias & Avg SE \\
## \hline
## Linear regression & -3.01 & 0.12 \\
## Spline regression deg 3 & 0.00 & 0.10 \\
## Spline regression deg 2 & 0.00 & 0.10 \\
## \hline
## \end{tabular}
## \end{table}
Thus, the spline of degree 2 is similarly unbiased and has a lower average standard error than either of the other estimators.
We may also be interested in evaluating how the estimated standard error compares to the Monte Carlo standard error (MCSE), i.e. the standard deviation of the point estimates across the \(M\) runs, and in the coverage of the 95% confidence intervals derived from the standard errors (SE) as \(est \pm 1.96\times SE\), i.e. the proportion of runs in which the interval contains the true value. We combine the results dat1 and dat2 and calculate these quantities for all three estimators:
dat_res<-cbind(dat1,dat2)
MCSE<-apply(dat_res[,c(2,4,7)],2,sd)
cover<-colMeans(dat_res[,c(2,4,7)]-1.96*dat_res[,c(3,5,8)]<=truth & dat_res[,c(2,4,7)]+1.96*dat_res[,c(3,5,8)]>=truth)
datfr$"MCSE"<-MCSE
datfr$"Coverage"<-cover
datfr
## Bias Avg SE MCSE Coverage
## Linear regression -3.010774195 0.12216310 0.20362617 0.000
## Spline regression deg 3 0.002809875 0.10385084 0.09816659 0.955
## Spline regression deg 2 0.001414360 0.09848925 0.09196982 0.945
xtable(datfr)
## % latex table generated in R 4.1.0 by xtable 1.8-4 package
## % Tue Jun 08 17:23:14 2021
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrr}
## \hline
## & Bias & Avg SE & MCSE & Coverage \\
## \hline
## Linear regression & -3.01 & 0.12 & 0.20 & 0.00 \\
## Spline regression deg 3 & 0.00 & 0.10 & 0.10 & 0.95 \\
## Spline regression deg 2 & 0.00 & 0.10 & 0.09 & 0.94 \\
## \hline
## \end{tabular}
## \end{table}
Now we can see that the linear regression method underestimated the standard error of its estimates (the average SE of 0.12 is well below the MCSE of 0.20), resulting in zero coverage. The splines performed better, leading to close to nominal coverage.
A simple way to achieve version control is to save copies of the simulation study R script(s), seeds, and results in a separate folder labelled with the date and version number. Note which version you used in a presentation or paper; then you can always refer back to the exact code and data used to obtain the results if needed.
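For instance, something along these lines could be run at the end of a study (a hypothetical sketch; the script and folder names are illustrative and should be adapted to your own project):
version.dir = paste0("simstudy_v1_", Sys.Date())  # e.g. "simstudy_v1_2021-06-08"
dir.create(version.dir)
file.copy(c("simulation_study.R", "seeds.txt", "results.txt", "results2.txt"), version.dir)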
In this tutorial, we presented an approach to conducting simulation studies in R. In particular, one can retrieve any given synthetic dataset by setting the \(i\)th seed and re-generating the data. Because the saved results are indexed by \(i\), this allows you to further investigate anomalous individual results in order to better understand the behaviour of your estimator.
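For example, if run 17 (a hypothetical run flagged when inspecting results.txt) produced a suspicious estimate, the corresponding dataset can be regenerated exactly and examined:
i = 17  # hypothetical run of interest
set.seed(seeds[i])
X = rnorm(n)
Y = rnorm(n,mean=(0.5*X+X^2))
plot(Y ~ X, pch=19, cex=0.1)  # inspect the regenerated dataset
linreg(X,Y)  # reproduce the estimate recorded for this run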
There are many considerations in simulation studies that were not addressed here, including the choice of interesting or relevant data-generating distributions and varying the dataset sample size \(n\). Such considerations should be guided by scientific interest, which can stem from a specific application or from the general performance of the methods.