The Sample Distribution

The easiest way to understand what is happening with the sample distribution (more formally, the sampling distribution of the mean) is to run a simulation. I am going to simulate normally distributed data with a mean of 10 and a standard deviation of 1. From that distribution I will repeatedly take samples of size 5 (10,000 samples in all), calculate the mean of each sample and plot those means as a histogram. I will then calculate the mean of all the sample means and their standard deviation, which is the standard error.

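n <- 5
reps <- 10000
y <- numeric(reps)  # preallocate storage for the sample means
set.seed(1234)
for (i in seq_len(reps)) {
  smp <- rnorm(n, mean = 10, sd = 1)  # one sample of size n
  y[i] <- mean(smp)
}
hist(y, main = "Histogram of the Sample Mean", xlab = "Mean", ylab = "Density", freq = FALSE)
# Overlay a normal curve with the same mean and standard deviation as the sample means
curve(dnorm(x, mean = mean(y), sd = sd(y)), add = TRUE)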

mean(y)

[1] 10.00214

sd(y)

[1] 0.445194

The first thing you should notice is that the mean of the sample means is very close to the true mean of the distribution. Averaged over many samples, the sample mean gives an accurate value for the true mean. If that value is not what you were expecting, then there is something systematically wrong with the experiment.

The distribution of means is centred on the true mean, but an individual sample mean can still vary considerably. A mean of 9.5 or 10.5 is not unusual.
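To put a number on "not unusual", one can count how often a simulated sample mean lands at least 0.5 away from the true mean of 10. The following is a minimal sketch that rebuilds the n = 5 simulation with `replicate()` instead of the loop above; the names `sample_means` and `prop_far` are introduced here purely for illustration.

```r
# Rebuild the n = 5 simulation: 10000 sample means from a normal(10, 1) distribution.
set.seed(1234)
n <- 5
sample_means <- replicate(10000, mean(rnorm(n, mean = 10, sd = 1)))

# Proportion of sample means at or below 9.5, or at or above 10.5.
prop_far <- mean(sample_means <= 9.5 | sample_means >= 10.5)
prop_far
```

With these settings the proportion should come out at roughly a quarter, so a sample mean that far from 10 is a routine outcome rather than a warning sign.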

If you calculate the standard deviation of this distribution you get a standard error of 0.445, which is slightly less than half the standard deviation of the underlying data (1).

I am going to run the simulation again but this time with a sample size of 25.

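n <- 25
reps <- 10000
y <- numeric(reps)  # preallocate storage for the sample means
set.seed(1234)
for (i in seq_len(reps)) {
  smp <- rnorm(n, mean = 10, sd = 1)  # one sample of size n
  y[i] <- mean(smp)
}
hist(y, main = "Histogram of the Sample Mean", xlab = "Mean", ylab = "Density", freq = FALSE)
# Overlay a normal curve with the same mean and standard deviation as the sample means
curve(dnorm(x, mean = mean(y), sd = sd(y)), add = TRUE)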

mean(y)

[1] 10.00554

sd(y)

[1] 0.1988358

The mean is essentially unchanged and still very close to the true value of 10.

But the standard deviation of this new sample distribution, the standard error, has more than halved again: it has dropped from 0.445 to 0.199.

I will do one final simulation with a sample size of 1.

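n <- 1
reps <- 10000
y <- numeric(reps)  # preallocate storage for the sample means
set.seed(1234)
for (i in seq_len(reps)) {
  smp <- rnorm(n, mean = 10, sd = 1)  # one sample of size n
  y[i] <- mean(smp)  # with n = 1 the mean is just the single observation
}
hist(y, main = "Histogram of the Sample Mean", xlab = "Mean", ylab = "Density", freq = FALSE)
# Overlay a normal curve with the same mean and standard deviation as the sample means
curve(dnorm(x, mean = mean(y), sd = sd(y)), add = TRUE)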

mean(y)

[1] 10.00612

sd(y)

[1] 0.9875294

Now the standard error is very close to the standard deviation of the underlying data. This is because with a sample size of 1 each "sample mean" is just a single observation, so the standard error is the same as the standard deviation.

The formula for the standard error is:

\[ \mathrm{SE} = \dfrac{s}{\sqrt{n}} \]

where s is the standard deviation of the data and n is the sample size.

That is why, when I changed the sample size from 1 to 25, the standard error went from about 1 to about 0.2: the square root of 25 is 5, so the standard error is one fifth of the standard deviation.
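The relationship can be checked directly against the simulations. The sketch below wraps the simulation in a helper function (the name `simulate_se` and its arguments are hypothetical, not part of the original code) and compares the simulated standard error with the standard deviation divided by the square root of n for the three sample sizes used above. It assumes the true standard deviation of 1 in place of a sample estimate s.

```r
# Hypothetical helper: standard deviation of `reps` sample means,
# each computed from n draws of a normal(mu, sigma) distribution.
simulate_se <- function(n, reps = 10000, mu = 10, sigma = 1) {
  sample_means <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
  sd(sample_means)
}

set.seed(1234)
sizes <- c(1, 5, 25)
results <- data.frame(
  n = sizes,
  simulated = sapply(sizes, simulate_se),
  theoretical = 1 / sqrt(sizes)  # sigma / sqrt(n), assuming sigma = 1
)
results
```

Because the seed is set only once here, the simulated values will not match the earlier output digit for digit, but each should sit close to its theoretical counterpart.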

We need to use this information about standard errors and means of sample distributions in the analysis of the class data.