In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution. That is, suppose that a sample is obtained containing a large number of observations, each observation being randomly generated in a way that does not depend on the values of the other observations, and that the arithmetic average of the observed values is computed. If this procedure is performed many times, the central limit theorem says that the computed values of the average will be distributed according to the normal distribution (commonly known as a “bell curve”).
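Formally: if X_1, ..., X_n are independent and identically distributed with mean mu and variance sigma^2, the CLT says that for large n the sample mean (X_1 + ... + X_n)/n is approximately normal with mean mu and variance sigma^2/n. That shrinking variance, sigma^2/n, is exactly what we are going to measure below.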
We will work in the opposite direction: we will simulate many samples, varying only the sample size, compute the variance of the sample means at each size, and check whether the theorem's prediction actually holds.
library(ggplot2)  # charts
library(reshape2) # melt()
t <- proc.time()
# The variable:
# How many samples we're going to draw on each step
n <- c(1:32, 2^(6:9)) # Sample size
p <- .5 # binomial probability
sd <- 2 # standard deviation for the normal case
pmf <- c("normal", "uniform", "binomial")[3] # change the index to pick a different distribution
# Parameters
nosim <- 1000 # number of simulations at each sample size
m <- sapply(n, function(n) { # sapply() loops over each sample size, like a for()
x <- if(pmf=="normal") {
rnorm(nosim * n, sd=sd)
} else if(pmf=="uniform") {
runif(nosim * n)
} else if(pmf=="binomial") {
sample(0:1, nosim * n, replace=TRUE, prob=c(1-p, p)) # draw 1 with probability p
}
apply(matrix(x, nosim), 1, mean) # one row per simulation: the mean of a sample of size n
})
colnames(m) <- n # attach the sample size, the main variable, as column labels
# melt() comes from the reshape2 package. It takes the matrix (or data.frame) column names and repeats them in a column next to their corresponding values.
# That way the parameters and variables sit on the same row as the simulated values, one row per simulation and parameter combination, which makes subsetting much easier.
# This layout is called tidy data, and it is a very convenient standard for handling large datasets (see the quick inspection after the timing output below).
x <- melt(m)
names(x) <- c("simulation", "sampleSize", "mean")
print(proc.time() - t)
## user system elapsed
## 0.644 0.021 0.689
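To see what melt() did, it helps to look at both shapes (a quick inspection of the objects created above, nothing more):
head(m[, 1:3]) # wide format: one column per sample size, one row per simulation
head(x) # long (tidy) format: the sample size sits next to each simulated mean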
The first chart shows the distributions of the sample means. With small samples, more mass falls in the long tails. As we increase the sample size, the values concentrate around the mean and the tails become thinner.
g <- ggplot(x, aes(mean, fill=as.factor(sampleSize))) + # data and aesthetic mapping
geom_histogram(aes(y = ..density..)) + # Histogram bars
geom_density() + # Density lines
facet_wrap(~sampleSize) + # break into a matrix of charts, one panel per sample size
ggtitle("Central Limit Theorem of a binomial distribution\nTested with different sample sizes") +
theme(legend.title=element_blank())
suppressMessages(print(g))
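As an extra visual check, you could overlay the normal density predicted by the CLT on the means from the largest sample size. This is only a sketch for the binomial case selected above; largest, y and grid are names introduced here for illustration:
largest <- max(n)
y <- x[x$sampleSize == largest, ]
hist(y$mean, freq=FALSE, xlab="mean", main=paste("Sample means for n =", largest))
grid <- seq(min(y$mean), max(y$mean), length.out=200)
lines(grid, dnorm(grid, mean=p, sd=sqrt(p * (1 - p) / largest)), col="red")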
In the next chart we plot the variance of the simulated sample means as a line. It shows that the larger the sample, the smaller the variance of the mean.
Then we overlay the variance values calculated with the theoretical formula (the one that uses only the sample size and the distribution's parameters, not the data itself), and the two curves match remarkably well.
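Concretely, the variance of a mean of n independent observations is the variance of a single observation divided by n. For the three cases simulated here that gives sd^2/n = 4/n for the normal with standard deviation 2, 1/(12n) for the uniform on [0, 1] (the uniform branch below plots its square root, the standard deviation), and p(1-p)/n for the binomial with success probability p. These are exactly the red points drawn over the simulated line below.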
if(pmf=="normal") {
var <- unlist(lapply(split(x, x$sampleSize), function(y) { var(y$mean) }))
plot(n, var, type="l", main="The sample mean variance matches the formula",
xlab="sample size", ylab="variance") # line and aesthetics
points(n, sd^2/n, col="red") # theoretical variance of the mean of a normal
} else if(pmf=="uniform") {
sd <- unlist(lapply(split(x, x$sampleSize), function(y) { sd(y$mean) }))
plot(n, sd, type="l", main="The sample mean standard deviation matches the formula",
xlab="sample size", ylab="standard deviation") # line and aesthetics
points(n, 1/sqrt(12 * n), col="red") # theoretical standard deviation of the mean of a uniform(0, 1)
} else if(pmf=="binomial") {
var <- unlist(lapply(split(x, x$sampleSize), function(y) { var(y$mean) }))
plot(n, var, type="l", main="The sample mean variance matches the formula",
xlab="sample size", ylab="variance") # line and aesthetics
points(n, p*(1-p)/n, col="red") # theoretical variance of the mean of a binomial
}
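For a numeric rather than visual comparison of the binomial case selected above, here is a small sketch (empirical and theoretical are names introduced here, not part of the original code):
empirical <- unlist(lapply(split(x, x$sampleSize), function(y) { var(y$mean) }))
theoretical <- p * (1 - p) / n
round(head(data.frame(n = n, empirical = empirical, theoretical = theoretical)), 5)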