Introduction

The central limit theorem (CLT) is an important computational shortcut for generating, and making inferences from, the sampling distribution of the mean. You'll recall that the CLT shortcut relies on a number of conditions, specifically:

Independent observations
Identically distributed observations
Mean and variance exist
Sample size large enough for convergence

In this simulation study, you are going to compare the sampling distribution of the mean generated by simulation to the sampling distribution implied by the central limit theorem. You will compare the distributions graphically in QQ-plots.

This will be a 4 × 4 factorial experiment. The first factor will be the sample size, with N = 5, 10, 20, and 40. The second factor will be the degree of skewness in the underlying distribution. The underlying distribution will be the Skew-Normal distribution.

The Skew-Normal distribution has three parameters:

Location
Scale
Slant.

When the slant parameter is 0, the distribution reverts to the normal distribution. As the slant parameter increases, the distribution becomes increasingly skewed. In this simulation, slant will be set to 0, 2, 10, 100. Set location and scale to 0 and 1, respectively, for all simulation settings. Use the rsn function in the sn package.

The output of each combination of factors will be a QQ-plot. Generate a sampling distribution of 5000 draws from both the CLT approximation and the simulation approximation.

When analyzing data, the parameters for the mean and variance (in the case of the CLT shortcut) or the parameters for the distribution (in the case of MLE, MM, etc.) are replaced with sample estimates. (This is often called the plug-in approach.) For the purposes of this simulation, treat the mean and variance as known values: use the actual population mean and variance instead of sample estimates.
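The known population values have closed forms for the Skew-Normal distribution. The sketch below computes them for the four slant settings used in this study. `sn_moments` is a hypothetical helper name introduced here for illustration (it is not part of the sn package), and it assumes the standard moment formulas with \(\delta = \text{slant}/\sqrt{1+\text{slant}^2}\).

```r
# Hypothetical helper (not part of the sn package): population mean and
# standard deviation of a Skew-Normal(location, scale, slant) distribution.
sn_moments <- function(slant, location = 0, scale = 1) {
  delta <- slant / sqrt(1 + slant^2)
  c(mean = location + scale * delta * sqrt(2 / pi),
    sd   = scale * sqrt(1 - 2 * delta^2 / pi))
}

# One column per slant setting used in this study.
slants  <- c(0, 2, 10, 100)
moments <- sapply(slants, sn_moments)
colnames(moments) <- paste0("slant=", slants)
round(moments, 3)
```

With slant 0 this returns mean 0 and sd 1, matching the standard normal. As slant grows, the mean rises and the sd shrinks.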

You will write a blog post to explain your simulation study and results. The audience of your blog post is a Senior Data Scientist who you hope to work with in the future. Think about how you will communicate the results. Comment on any patterns you observe.

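By the CLT, we know that the sampling distribution of the mean is approximately normal, with mean \(\mu\) and variance \(\frac{\sigma^2}{N}\), if the conditions listed above are met. Here \(\mu\) and \(\sigma^2\) are the population mean and variance.

Thus, to avoid for loops we can use the fact that if \(Z \sim N(0,1)\), then \(Z\frac{\sigma}{\sqrt{N}}+\mu \sim N\left(\mu, \frac{\sigma^2}{N}\right)\). Draws from the CLT approximation are therefore just rescaled and shifted standard normal draws.

If \(Z_\alpha\) denotes the \(\alpha\) quantile of the standard normal distribution and \(s\) is the sample standard deviation, then:

Approximate Confidence Interval for \(\mu\): \(\left( Z_{\alpha/2}\frac{s}{\sqrt{N}}+\bar{X}_N, Z_{1-\alpha/2}\frac{s}{\sqrt{N}}+\bar{X}_N\right)\)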
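As a minimal sketch of the interval formula above, the following computes it for a small toy sample. The function name `clt_ci` and the data are made up for illustration; only `qnorm`, `sd`, and `mean` from base R are used.

```r
# Approximate (1 - alpha) confidence interval based on the CLT,
# built from the sample mean and the sample standard deviation.
clt_ci <- function(x, alpha = 0.05) {
  N <- length(x)
  z <- qnorm(c(alpha / 2, 1 - alpha / 2))  # Z_{alpha/2} and Z_{1 - alpha/2}
  z * sd(x) / sqrt(N) + mean(x)
}

x <- c(1, 2, 3, 4, 5)  # toy data, for illustration only
clt_ci(x)
```

Because \(Z_{\alpha/2}\) is negative, adding it to the sample mean gives the lower bound, so the interval is symmetric around \(\bar{X}_N\).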

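one_plot <- function(slant,N,R=5000,location=0,scale=1){
  # Population mean and sd of the Skew-Normal, treated as known values
  delta <- slant/sqrt(1+slant^2)
  pop_mean <- location + scale*delta*sqrt(2/pi)
  pop_sd <- sqrt(scale^2*(1-2*delta^2/pi))
  
  # CLT approximation: rescaled standard normal draws (no loop needed)
  sample_dist_clt <- rnorm(R)*pop_sd/sqrt(N)+pop_mean
  # Simulation: R sample means, each computed from N Skew-Normal draws
  sample_dist_sim <- replicate(R, mean(sn::rsn(N,location, scale, slant)))
  
  # QQ-plot of simulated vs. CLT quantiles, with the y = x reference line
  qqplot(sample_dist_sim, sample_dist_clt, asp = 1, axes=FALSE, xlab="", ylab="")
  box()
  abline(0,1)
}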

# 4 rows (one per slant) by 5 columns (density plus four sample sizes)
par(mar=.25*c(1,1,1,1), oma = c(0,2,2,0), mfrow=c(4,5), pch=16)

for(slant in c(0,2,10,100)){
  # N = 0 is a placeholder for the first column, which shows the density
  for(N in c(0,5,10,20,40)){
    if(N==0){
      curve(sn::dsn(x,0,1,slant),-4,4, axes=FALSE,xlab="",ylab="")
      box()
      title(ylab = paste0("Slant = ", slant), xpd=NA, line=1)
    }else{
      one_plot(slant,N)
    }
    # Column headers go on the top row only
    if(slant == 0){
      header <- if(N==0) "Distribution" else paste0("N = ", N)
      title(main = header, xpd=NA, line = 1, font.main=1, cex.main=1)
    }
  }
}

Conclusion

As the sample size N gets bigger, the points in the QQ-plot fall closer to the y = x line, which implies that the CLT approximation works better for larger sample sizes.
As the slant S gets bigger, the Skew-Normal distribution becomes more skewed to the right, and the QQ-plot does not fit as nicely along the y = x line.
Therefore, the CLT approximation works for any value of N when the underlying distribution is Normal, that is, S = 0. As S increases, the CLT approximation degrades, and to compensate we need to make sure that N is large enough (a common rule of thumb is N > 30).
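One way to put rough numbers on these patterns, offered as a sketch rather than as part of the simulation itself: the skewness of the mean of N independent draws equals the population skewness divided by \(\sqrt{N}\). The helper `sn_skewness` below is a hypothetical name, and it assumes the standard closed-form skewness of the Skew-Normal distribution.

```r
# Hypothetical helper: skewness of a Skew-Normal distribution with the given
# slant (location and scale do not affect skewness).
sn_skewness <- function(slant) {
  delta <- slant / sqrt(1 + slant^2)
  m <- delta * sqrt(2 / pi)
  (4 - pi) / 2 * m^3 / (1 - m^2)^(3 / 2)
}

slants <- c(0, 2, 10, 100)
Ns     <- c(5, 10, 20, 40)

# Skewness of the mean of N iid draws = population skewness / sqrt(N).
skew_of_mean <- outer(sn_skewness(slants), 1 / sqrt(Ns))
dimnames(skew_of_mean) <- list(paste0("slant=", slants), paste0("N=", Ns))
round(skew_of_mean, 3)
```

The slant = 0 row is exactly zero for every N. The other rows are positive and shrink toward zero as N grows, which matches what the QQ-plots show.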

Advanced (optional)

Overlay in each figure the QQ-plot that results when the mean and variance parameters are estimated from data, instead of using the population parameters.
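A possible starting point for this overlay is sketched below, under two loudly stated assumptions. First, "estimated from data" is read here as estimating the mean and standard deviation from a single observed sample of size N. Second, `rsn_sketch` is a hypothetical base-R stand-in for `sn::rsn`, included only so that the sketch runs without the sn package; `plugin_clt` is likewise an illustrative name.

```r
# Hypothetical stand-in for sn::rsn, using the representation
# X = location + scale * (delta * |Z0| + sqrt(1 - delta^2) * Z1).
rsn_sketch <- function(n, location = 0, scale = 1, slant = 0) {
  delta <- slant / sqrt(1 + slant^2)
  location + scale * (delta * abs(rnorm(n)) + sqrt(1 - delta^2) * rnorm(n))
}

# CLT draws whose mean and sd are estimated from one observed sample of
# size N (an assumption about how the plug-in version is meant).
plugin_clt <- function(slant, N, R = 5000) {
  x <- rsn_sketch(N, slant = slant)
  rnorm(R) * sd(x) / sqrt(N) + mean(x)
}

set.seed(1)
sample_dist_plugin <- plugin_clt(slant = 10, N = 20)
# Inside one_plot, these draws could be overlaid on the existing panel,
# e.g. points(sort(sample_dist_sim), sort(sample_dist_plugin), col = "red")
```

Because the estimates come from one sample, the overlaid points will shift and tilt from run to run. Repeating the overlay for several samples would show that extra variability.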

Simulation Study: Central Limit Theorem

Luz Melo

2022-12-02
