Let us say we have some data from a population. It could be the heights of all the people in a city, or their incomes. We can plot a histogram to check the shape of the population distribution, which can take any shape.
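Now, let us assume we take many different samples from the same population. From these samples, we will demonstrate two important results:
1. The distribution of the sample means is approximately normal when the sample size is reasonably large, whatever the shape of the population.
2. The standard deviation of the sample means (the standard error) can be approximated by the formula \[\sigma / \sqrt{n}\] where \(\sigma\) is the population standard deviation and \(n\) is the sample size.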
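For reference, the standard error formula follows from a short variance calculation. This is a sketch that assumes the n observations X_1, ..., X_n in a sample are independent draws from a population with finite standard deviation sigma (symbols introduced here for illustration). Sampling without replacement from a population of 100,000 is not exactly independent, but with samples of a few hundred the difference is negligible.

```latex
\[
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{\sigma^2}{n},
\qquad
\operatorname{SE}(\bar{X}) = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}
\]
```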
We will test our proposition with multiple population distributions. We will assume that the size of our population is 100,000.
# uniformly distributed population with values between 40 and 160
pop1 <- runif(100000, 40, 160)
# normal distribution with mean 100 and SD 20
pop2 <- rnorm(100000, mean = 100, sd = 20)
# exponential distribution with rate = 0.01 (= 1/100), so the mean is 100
pop3 <- rexp(100000, 0.01)
We can confirm the shapes of our populations:
par(mfrow = c(1,3))
plot(density(pop1), xlab = "uniform population", main = "")
plot(density(pop2), xlab = "normal population", main = "")
plot(density(pop3), xlab = "exponential population", main = "")
title("Population distributions", line = -2, outer = TRUE)
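Besides looking at the plots, we can check the populations numerically. The sketch below regenerates the three populations (with an arbitrary seed, added here only for reproducibility) and compares their means and standard deviations with the theoretical values: a uniform distribution on [40, 160] has SD (160 - 40)/sqrt(12), and an exponential distribution with rate 0.01 has mean and SD both equal to 100. The name `pop_summary` is illustrative.

```r
set.seed(42)  # arbitrary seed, only so the numbers are reproducible

pop1 <- runif(100000, 40, 160)
pop2 <- rnorm(100000, mean = 100, sd = 20)
pop3 <- rexp(100000, rate = 0.01)

# Observed mean and SD of each population next to the theoretical SD
pop_summary <- data.frame(
  population     = c("uniform", "normal", "exponential"),
  mean           = c(mean(pop1), mean(pop2), mean(pop3)),
  sd             = c(sd(pop1), sd(pop2), sd(pop3)),
  theoretical_sd = c((160 - 40) / sqrt(12), 20, 1 / 0.01)
)
print(pop_summary)
```

All three populations have a mean close to 100, but very different spreads, which matters later when we compare standard errors.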
We will take 500 samples, each of size 500, from each of the three populations. Both numbers are arbitrary, so feel free to change them.
# We need a matrix of 500 rows and 3 columns to store all the sample means
meanmatrix <- matrix(0,500,3)
# we will use a loop to draw 500 samples (each of size 500) and find their sample means
for (i in 1:500) {
  samp1 <- sample(pop1, 500, replace = FALSE)
  samp2 <- sample(pop2, 500, replace = FALSE)
  samp3 <- sample(pop3, 500, replace = FALSE)
  meanmatrix[i, 1] <- mean(samp1)
  meanmatrix[i, 2] <- mean(samp2)
  meanmatrix[i, 3] <- mean(samp3)
}
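The same idea can be written without an explicit loop. This is a minimal sketch using `replicate()`; the helper `draw_sample_means()` and its argument names are hypothetical, and the seed is arbitrary. It is shown for one population only.

```r
set.seed(1)  # arbitrary seed for reproducibility

pop <- rexp(100000, rate = 0.01)

# Hypothetical helper: draw n_samples samples of size sample_size
# (without replacement) and return the mean of each one
draw_sample_means <- function(pop, n_samples, sample_size) {
  replicate(n_samples, mean(sample(pop, sample_size, replace = FALSE)))
}

sample_means <- draw_sample_means(pop, n_samples = 500, sample_size = 500)
length(sample_means)
```

Calling the helper once per population and combining the results with `cbind()` would give a matrix with the same layout as `meanmatrix`.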
We can plot the distribution of the sample means with the data that we have generated
par(mfrow = c(1,3))
plot(density(meanmatrix[,1]), xlab = "sample means (uniform)", main = "")
plot(density(meanmatrix[,2]), xlab = "sample means (normal)", main = "")
plot(density(meanmatrix[,3]), xlab = "sample means (exponential)", main = "")
title("Sample means distributions", line = -2, outer = TRUE)
Whoa… all three curves of sample means are very close to normal!
Moral of the story: whatever the distribution of the parent population (as long as its variance is finite), the distribution of the sample means is approximately normal when the sample size is large enough.
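How close to normal the sample means are depends on the sample size. As a rough check, the sketch below measures the skewness of the sample means drawn from the exponential population for several sample sizes; a normal distribution has skewness 0. The `skewness()` helper, the seed, the sample sizes, and the 2000 repetitions are all illustrative choices.

```r
set.seed(7)  # arbitrary seed for reproducibility

pop3 <- rexp(100000, rate = 0.01)

# Simple moment-based skewness: mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

sizes <- c(2, 10, 60, 500)

# For each sample size, draw 2000 samples and compute the skewness of their means
skew_by_size <- sapply(sizes, function(n) {
  means <- replicate(2000, mean(sample(pop3, n)))
  skewness(means)
})
names(skew_by_size) <- sizes
print(round(skew_by_size, 2))
```

The skewness should shrink towards 0 as the sample size grows: with very small samples the means still look like the skewed parent population, while with a few hundred observations they look close to normal.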
We know that the standard error is the standard deviation of the sample means. We will calculate it in two ways:
1. Using the generated sample data
2. Using the formula mentioned above
sematrix <- matrix(0, 3, 2)
rownames(sematrix) <- c("Uniform population", "Normal population", "Exponential population")
colnames(sematrix) <- c("Simulated", "Formula")
# simulated standard error: SD of the 500 sample means of each population
for (i in 1:3) {
  sematrix[i, 1] <- sd(meanmatrix[, i])
}
# formula: population SD divided by the square root of the sample size
# (samp1, samp2 and samp3 still hold the last samples drawn in the loop)
sematrix[1, 2] <- sd(pop1) / sqrt(length(samp1))
sematrix[2, 2] <- sd(pop2) / sqrt(length(samp2))
sematrix[3, 2] <- sd(pop3) / sqrt(length(samp3))
sematrix
##                        Simulated   Formula
## Uniform population     1.5086514 1.5513549
## Normal population      0.8578719 0.8952371
## Exponential population 4.5888569 4.4528435
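The formula also says how the standard error changes with the sample size: it shrinks in proportion to 1/sqrt(n). The sketch below repeats the comparison for the exponential population at two sample sizes. The helper `se_compare()`, the seed, and the chosen sizes (60 and 500) are illustrative assumptions.

```r
set.seed(123)  # arbitrary seed for reproducibility

pop3 <- rexp(100000, rate = 0.01)

# Hypothetical helper: simulated vs formula-based standard error
# for a given sample size, based on n_samples repeated samples
se_compare <- function(pop, sample_size, n_samples = 500) {
  means <- replicate(n_samples, mean(sample(pop, sample_size, replace = FALSE)))
  c(simulated = sd(means), formula = sd(pop) / sqrt(sample_size))
}

se_table <- rbind(
  "n = 60"  = se_compare(pop3, 60),
  "n = 500" = se_compare(pop3, 500)
)
print(se_table)
```

In both rows the simulated and formula values should be close to each other, and the standard error for n = 500 should be noticeably smaller than for n = 60.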