I wanted to examine the bias and variability in estimates of the standard error of the mean (SEM) that are calculated from a single sample standard deviation, since it is often said that small sample sizes can give large errors.
The first simulation draws 100,000 samples of size 9 from the standard normal distribution and calculates the standard error of the mean for each one. I use 9 rather than 10 so that the standard error of the mean is the standard deviation divided by 3, which is easy to do manually.
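y <- numeric(100000)
for (i in 1:100000){
  sample <- rnorm(9, mean=0, sd=1)
  # SEM estimate for this sample: sd / sqrt(9)
  y[i] <- sd(sample)/3
}
hist(y, main="Histogram of the Standard Errors of the Mean", xlab="Standard Error of the Mean", ylab="Density", freq=FALSE)
# Overlay a normal curve with the same mean and sd; curve() generates its own x values over the plotted range
curve(dnorm(x, mean=mean(y), sd=sd(y)), add=TRUE)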
The mode of the distribution is close to 0.33, which is the true standard error of the mean given that the true standard deviation is 1 and the square root of the sample size is 3. The distribution is slightly asymmetrical because the standard deviation is always positive, which limits how far the left-hand tail can extend, so the right-hand tail is longer than the left-hand tail. The mean and median are both about 0.32, slightly below the true value, which reflects the tendency of the sample standard deviation to underestimate the true standard deviation in small samples.
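The simulated mean of about 0.32 can be checked against a closed-form value. For normally distributed data the expected value of the sample standard deviation is the true standard deviation multiplied by a correction factor usually written c4(n), so the expected SEM estimate is c4(n) divided by the square root of n when the true standard deviation is 1. The following is only a sketch and not part of the original simulation; the function names `c4` and `expected_sem` are labels chosen here for illustration.

```r
# Ratio of the expected sample SD to the true SD for normal data:
#   c4(n) = sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
# lgamma() is used so the ratio stays numerically stable for larger n.
c4 <- function(n) {
  sqrt(2 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))
}

# Expected SEM estimate when the true standard deviation is sigma
# (sigma = 1 matches the standard normal used in the simulation).
expected_sem <- function(n, sigma = 1) {
  c4(n) * sigma / sqrt(n)
}

expected_sem(9)   # about 0.323, slightly below the true value of 1/3
```

This agrees with the simulated mean to two decimal places, which suggests the small shortfall from 0.33 is a property of the estimator rather than simulation noise.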
I am going to repeat the process but this time the sample size will be 100.
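y <- numeric(100000)
for (i in 1:100000){
  sample <- rnorm(100, mean=0, sd=1)
  # SEM estimate for this sample: sd / sqrt(100)
  y[i] <- sd(sample)/10
}
hist(y, main="Histogram of the SEM for sample size 100", xlab="Standard Error of the Mean", ylab="Density", freq=FALSE)
# Overlay a normal curve with the same mean and sd; curve() generates its own x values over the plotted range
curve(dnorm(x, mean=mean(y), sd=sd(y)), add=TRUE)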
This time the mean and the median are both much closer to the expected value of 0.1 and there is less of a long tail to the right.
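To put the two sample sizes side by side, the comparison can be sketched with a small helper that avoids the explicit loop. This is an alternative formulation rather than the code used above: the helper name `simulate_sem`, the seed, and the reduced number of replicates (20,000 instead of 100,000) are all assumptions made to keep the example quick to run.

```r
set.seed(1)  # arbitrary seed, only so that the numbers are reproducible

# Simulate `reps` SEM estimates for samples of size n from N(0, 1).
# Each column of the matrix holds one sample.
simulate_sem <- function(n, reps = 20000) {
  draws <- matrix(rnorm(n * reps), nrow = n)
  apply(draws, 2, sd) / sqrt(n)
}

# Print the mean and median SEM next to the ideal value 1 / sqrt(n)
for (n in c(9, 100)) {
  sem <- simulate_sem(n)
  cat("n =", n,
      " mean =", round(mean(sem), 4),
      " median =", round(median(sem), 4),
      " ideal =", round(1 / sqrt(n), 4), "\n")
}
```

The gap between the mean and the ideal value should be visibly smaller for a sample size of 100 than for a sample size of 9.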
I now want to create a simulation that looks at how the mean of the standard error of the mean changes with sample size. This should show convergence towards the ideal value as the sample size increases. The simulation requires an additional outer loop over the sample sizes and a comparison with the ideal value of 1/sqrt(n).
y <- numeric(10000)
z <- numeric(99)
x <- 2:100
for (j in 1:99){
  for (i in 1:10000){
    sample <- rnorm(x[j],0,1)
    y[i] <- sd(sample)/sqrt(x[j])
  }
  z[j] <- mean(y)-1/sqrt(x[j])
}
plot(x, z, ylab="Difference Between the Mean SEM and the Ideal", xlab="Sample Size", main="Plot of the difference between the Mean SEM\n and the Ideal Value against Sample Size")
From the graph it is clear that the bias in the mean of the SEM across samples becomes negligible once the sample size reaches about 20. However, the variance of the SEM is also important, as it indicates how far an estimate from a single sample can be from the true value.
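The simulated bias can be compared with the exact bias for normal data, which is (c4(n) - 1)/sqrt(n) when the true standard deviation is 1, where c4(n) is the standard correction factor for the sample standard deviation. The sketch below is an addition for comparison and not part of the original simulation; the names `c4` and `exact_bias` are illustrative.

```r
# Ratio of the expected sample SD to the true SD for normal data
c4 <- function(n) {
  sqrt(2 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))
}

x <- 2:100

# Exact bias of the SEM estimate, assuming a true standard deviation of 1
exact_bias <- (c4(x) - 1) / sqrt(x)

# Bias at a few sample sizes: roughly -0.143 at n = 2, roughly -0.003 at n = 20
round(exact_bias[x %in% c(2, 5, 10, 20, 50)], 4)
```

The bias is always negative and shrinks steadily towards zero, so it never quite disappears, but by a sample size of 20 it is only a few thousandths. Calling `lines(x, exact_bias)` after the plot above would overlay this curve on the simulated points.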
y <- numeric(10000)
z <- numeric(99)
x <- 2:100
for (j in 1:99){
  for (i in 1:10000){
    sample <- rnorm(x[j],0,1)
    y[i] <- sd(sample)/sqrt(x[j])
  }
  z[j] <- sd(y)
}
plot(x, z, ylab="Standard Deviation of the SEM", xlab="Sample Size", main="Plot of the Standard Deviation of the SEM\n against Sample Size")
Looking at the new plot, it is clear that once the sample size gets above 40 the SEM varies so little from sample to sample that any single sample will give a reasonable estimate of the standard error of the mean, without a large difference from the true value. For smaller sample sizes there is still a chance that the sample does not give a representative standard error of the mean, and for sample sizes below 20 this effect is compounded by the bias in the estimate of the SEM.
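The spread of the SEM estimate also has a closed form for normal data. Because the expected value of the sample variance equals the true variance, the variance of the sample standard deviation is sigma^2 * (1 - c4(n)^2), where c4(n) is the standard correction factor, and dividing the standard deviation by sqrt(n) gives the standard deviation of the SEM. The sketch below assumes a true standard deviation of 1, as in the simulations; the names `c4` and `sd_sem` are illustrative and not from the original code.

```r
# Ratio of the expected sample SD to the true SD for normal data
c4 <- function(n) {
  sqrt(2 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))
}

# Standard deviation of the SEM estimate for samples of size n
sd_sem <- function(n, sigma = 1) {
  sigma * sqrt(1 - c4(n)^2) / sqrt(n)
}

n <- c(9, 20, 40, 100)
data.frame(
  n        = n,
  sd_sem   = round(sd_sem(n), 4),
  # spread expressed as a fraction of the true SEM, 1 / sqrt(n)
  relative = round(sd_sem(n) * sqrt(n), 3)
)
```

The `relative` column expresses the spread as a fraction of the true SEM. It falls from roughly a quarter at a sample size of 9 to a little over a tenth at a sample size of 40, which is consistent with the flattening seen in the plot.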