Suppose you collect data from a sample of \(n\) people. You use random sampling, so your data are independent and identically distributed (i.i.d.). You collect data on one variable, \(y\), so \(y_i\) is the value of \(y\) for person \(i\) in your sample, and \(\bar{y}\) is the sample mean. The true mean of \(y\) in the population is \(\mu_y\) and the standard deviation is \(\sigma_y\). With i.i.d. data, the mean and standard deviation are the same for everyone; \(\mu_{yi}=\mu_y\) and \(\sigma_{yi}=\sigma_y\), so there is no need to write the subscript \(i\). \(\bar{y}\) describes the entire sample, so it has no subscript \(i\).
[Answer]
\[E(\bar{y}) =
E(\frac{1}{n}\sum_{i=1}^{n}y_i)=\frac{1}{n}\sum_{i=1}^{n}E(y_i)=\mu_y.\]
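\[Var(\bar{y}) =
Var(\frac{1}{n}\sum_{i=1}^{n}y_i)= \frac{1}{n^2}\sum_{i=1}^{n}Var(y_i) =
\frac{\sigma_y^2}{n} \quad \Rightarrow \quad sd(\bar{y}) =
\frac{\sigma_y}{\sqrt{n}}.\]
The second equality uses independence: the covariance between any two different observations is zero, so the variance of the sum is the sum of the variances.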
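As a worked instance of these formulas, take \(y\sim U(0,2)\) and \(n=100\), the case simulated later in this document. The variance of a uniform distribution on \([a,b]\), \((b-a)^2/12\), is a standard result used here as given rather than derived.

```latex
\[\mu_y = \frac{0+2}{2} = 1, \qquad \sigma_y^2 = \frac{(2-0)^2}{12} = \frac{1}{3}\]
\[Var(\bar{y}) = \frac{\sigma_y^2}{n} = \frac{1/3}{100} = \frac{1}{300} \approx 0.00333,
\qquad sd(\bar{y}) = \frac{1}{\sqrt{300}} \approx 0.0577\]
```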
[Answer]
The Law of Large Numbers
: If \(y_i\), \(i=1,...,n\), are independent and
identically distributed with finite variance, then \[\bar{y}\rightarrow_p \mu_y\] or \[\lim_{n\rightarrow\infty} \Pr(|\bar{y} -
\mu_y|>\epsilon) = 0 \text{ for every }\epsilon>0.\]
Intuitively, the sample mean converges to the true (theoretical) mean of
\(y\), as the sample becomes very
large.
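A minimal sketch of what the limit statement means in practice, assuming \(y\sim U(0,2)\) so that \(\mu_y=1\). The tolerance `eps`, the sample sizes, the number of simulations, and the seed are all illustrative choices, not values from the text. The code estimates \(\Pr(|\bar{y}-\mu_y|>\epsilon)\) by simulation for a few values of \(n\).

```r
set.seed(1)                      # illustrative seed
mu_y     <- 1                    # true mean of U(0,2)
eps      <- 0.05                 # tolerance (assumed for illustration)
n_values <- c(10, 100, 1000)     # sample sizes to compare
n_sim    <- 2000                 # simulated samples per sample size

# For each n, estimate the share of sample means that miss mu_y by more than eps
prob_far <- sapply(n_values, function(n) {
  y_bar_sim <- replicate(n_sim, mean(runif(n, min = 0, max = 2)))
  mean(abs(y_bar_sim - mu_y) > eps)
})
names(prob_far) <- paste0("n=", n_values)
prob_far
```

The estimated probabilities should fall as \(n\) grows, which is the pattern the Law of Large Numbers predicts.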
[Answer]
The Central Limit Theorem
: If
\(y_i\), \(i=1,...,n\), are independent and
identically distributed with finite variance, then \[\sqrt{n}(\bar{y} - \mu_y) \rightarrow_d N(0,
\sigma_y^2).\] Intuitively, although the sample mean converges to the
true mean of \(y\), our sample mean
\(\bar{y}\) is not perfect while \(n<\infty\). So there is some mistake or
“error” in estimating the true mean, and the distribution of that error converges to
a normal shape. In particular, the distribution of the error
multiplied by \(\sqrt{n}\) approaches a
normal distribution with mean zero and variance \(\sigma_y^2\).
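The statement can be checked roughly by simulation. The sketch below assumes \(y\sim U(0,2)\), so \(\mu_y=1\) and \(\sigma_y^2=1/3\); the sample size, number of simulations, and seed are illustrative. It compares quantiles of the simulated \(\sqrt{n}(\bar{y}-\mu_y)\) with those of \(N(0,\sigma_y^2)\).

```r
set.seed(1)                        # illustrative seed
n       <- 100                     # sample size (assumed)
n_sim   <- 5000                    # number of simulated samples (assumed)
mu_y    <- 1                       # true mean of U(0,2)
sigma_y <- sqrt((2 - 0)^2 / 12)    # true sd of U(0,2)

# Scaled estimation error sqrt(n) * (y_bar - mu_y) for each simulated sample
z <- replicate(n_sim, sqrt(n) * (mean(runif(n, min = 0, max = 2)) - mu_y))

# Compare simulated quantiles with the quantiles of N(0, sigma_y^2)
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
rbind(simulated = quantile(z, probs),
      normal    = qnorm(probs, mean = 0, sd = sigma_y))
```

If the theorem is a good approximation at this \(n\), the two rows should be close to each other.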
Draw a random sample from \(U(0,2)\). The sample size is 100.
What is the sample mean for your numbers? (Three significant digits are
plenty.) Also, what is the sample standard deviation?
set.seed(123) # fix the random seed so the results are reproducible
N <- 100
y <- runif(N, 0, 2)
mean(y)
## [1] 0.997118
sd(y)
## [1] 0.5699905
Nsim <- 4 # the number of simulations
y_bar <- rep(NA, Nsim) # create a vector for sample mean
for (i in 1:Nsim){ # write a loop
  y.data <- runif(N, min=0, max=2) # generate a sample with size N
  y_bar[i] <- mean(y.data) # record the sample mean for sample i
}
y_bar
## [1] 1.0284498 0.9721034 0.9843849 0.9707808
Nsim <- 1000 # the number of simulations
y_bar <- rep(NA, Nsim) # create a vector for sample mean
for (i in 1:Nsim){ # write a loop
  y.data <- runif(N, min=0, max=2) # generate a sample with size N
  y_bar[i] <- mean(y.data) # record the sample mean for sample i
}
hist(y_bar,
     col="lightblue",
     main="The sampling distribution of the sample average",
     ylim=c(0, 8),
     freq = FALSE)
lines(density(y_bar), col="blue")
mean(y_bar) # sample average of the sample means of y
## [1] 0.998566
var(y_bar) # sample variance of the sample means of y
## [1] 0.003260772
sd(y_bar) # sample standard deviation of the sample means of y
## [1] 0.05710317
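The same simulation can be sketched without an explicit loop. This is only an alternative formulation, not the method used above: it draws all the numbers at once, arranges them so that each column is one sample, and takes column means. The names `n_obs`, `n_sim`, `samples`, and `y_bar_vec` are made up for this illustration, and the seed is arbitrary, so the numbers will not reproduce the output above exactly.

```r
set.seed(1)       # illustrative seed
n_obs <- 100      # observations per sample
n_sim <- 1000     # number of simulated samples

# Each column of the matrix is one simulated sample of size n_obs
samples   <- matrix(runif(n_obs * n_sim, min = 0, max = 2),
                    nrow = n_obs, ncol = n_sim)
y_bar_vec <- colMeans(samples)   # one sample mean per column

mean(y_bar_vec)   # should be close to 1
sd(y_bar_vec)     # should be close to sigma_y / sqrt(n_obs)
```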
[Answer]
The true mean for a uniform pdf between 0 and 2
is exactly 1; \(\mu_y=1\)
[Answer]
The sample averages \(\bar{y}\) were usually close to 1,
typically within two standard deviations of it, \(1\pm 2\times
sd(\bar{y}) \approx 1\pm 0.11\), though occasionally farther away
from 1.
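A rough way to put a number on "typically" is to count how many simulated sample means fall inside that band. The sketch below reruns the simulation with an arbitrary seed and hypothetical variable names (`y_bar_chk`, `share_within`), so its result will differ slightly from run to run and from the numbers reported above.

```r
set.seed(1)       # illustrative seed
n_obs <- 100      # observations per sample
n_sim <- 1000     # number of simulated samples

y_bar_chk <- replicate(n_sim, mean(runif(n_obs, min = 0, max = 2)))

# Share of sample means within two standard deviations of the true mean 1
share_within <- mean(abs(y_bar_chk - 1) <= 2 * sd(y_bar_chk))
share_within
```

If the sample means are approximately normal, this share should be in the neighborhood of 95 percent.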
[Answer]
The (sample) standard deviation of \(\bar{y}\) was, in my case, \(\hat{\sigma}_{\bar{y}}=0.057\). (I used the
hat symbol over \(\sigma_{\bar{y}}\)
since this is a sample standard deviation, not exactly the true standard
deviation.)
[Answer]
The symbol \({\sigma}_{\bar{y}}\) represents the
standard deviation of different possible sample averages \(\bar{y}\), while \({\sigma}_y\) represents the standard
deviation of different possible \(y_i\)
values; despite looking like similar symbols, they are completely
different things!
[Answer]
From the answer to question 1, the true
variance and standard deviation of \(\bar{y}\) are \({\sigma}_{\bar{y}}^2= \sigma_y^2/n\) and
\({\sigma}_{\bar{y}}=
\sigma_y/\sqrt{n}\). So, with a sample of size 10 the variance
should be \(\sigma_y^2/10\), while with
a sample of size 100 the variance should be \(\sigma_y^2/100\). Thus, a sample of size
10, compared to a sample of size 100, should have 10 times as much
variance in its sample means \(\bar{y}\). Similarly, for samples of size 2
or 1,000 the variances should be \(\sigma_y^2/2\) or \(\sigma_y^2/1000\), so the variance of the
sample means should be 50 times larger or 10 times smaller than if the
sample size were 100.
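The ratios in this argument can be verified with a few lines of arithmetic. The sketch assumes \(y\sim U(0,2)\), so \(\sigma_y^2=(2-0)^2/12\); the ratios themselves do not depend on that choice. The variable names are made up for the illustration.

```r
sigma2_y <- (2 - 0)^2 / 12            # true variance of U(0,2)
n_values <- c(2, 10, 100, 1000)       # sample sizes discussed above

var_ybar     <- sigma2_y / n_values   # theoretical Var(y_bar) for each n
ratio_to_100 <- var_ybar / var_ybar[n_values == 100]

ratio_to_100   # 50, 10, 1, 0.1
```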
Nsim <- 1000 # the number of simulations
N.list <- c(2, 10, 100, 1000) # a list of sample sizes
y_bar <- matrix(NA, nrow=Nsim, ncol=length(N.list)) # create a matrix: one column of sample means per sample size
colnames(y_bar) <- c("n=2", "n=10", "n=100", "n=1000")
for (j in 1:length(N.list)){ # loop over sample sizes
  N <- N.list[j]
  for (i in 1:Nsim){ # loop over simulations
    y.data <- runif(N, min=0, max=2) # generate a sample with size N
    y_bar[i, j] <- mean(y.data) # record the sample mean for sample i with sample size N
  }
}
apply(y_bar, 2, mean) # compute the sample average for different n
##       n=2      n=10     n=100    n=1000
## 0.9993824 1.0037622 1.0002126 0.9993071
apply(y_bar, 2, var) # compute the sample variance for different n
##          n=2         n=10        n=100       n=1000
## 0.1628883226 0.0337147483 0.0033002447 0.0003201849
apply(y_bar, 2, sd) # compute the sample standard deviation for different n
##        n=2       n=10      n=100     n=1000
## 0.40359425 0.18361576 0.05744776 0.01789371
[Answer]
According to the Law of Large Numbers, as the
sample size \(n\) increases toward
infinity, the probability of \(\bar{y}\) missing \(\mu_y\) by more than any fixed amount \(\epsilon>0\) should dwindle to zero.
That agrees with the results in the table: for every sample size the sample means average
close to the true mean of 1, and their variance keeps getting
smaller as the sample size gets larger.
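One way to read the variance row of the table is to multiply each simulated variance by its sample size. If \(Var(\bar{y})=\sigma_y^2/n\) holds, the products should all be near \(\sigma_y^2=1/3\). The sketch below copies the simulated variances reported above into a vector by hand; the variable names are made up for the illustration.

```r
n_values <- c(2, 10, 100, 1000)
# Simulated variances of y_bar, copied from the output above
sim_var  <- c(0.1628883226, 0.0337147483, 0.0033002447, 0.0003201849)

scaled <- sim_var * n_values   # should be roughly sigma_y^2 = 1/3 for every n
round(scaled, 4)
# approximately 0.3258, 0.3371, 0.3300, 0.3202
```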
par(mfrow = c(2, 2)) # Create a 2 x 2 plotting matrix
# The next 4 plots created will be plotted next to each other
for (j in 1:ncol(y_bar)){ # one plot per sample size
  hist(y_bar[,j],
       col = "lightblue",
       xlim = c(0, 2),
       ylim = c(0, 20),
       main = "Histogram of the sample mean",
       xlab = colnames(y_bar)[j], # "n=2", "n=10", "n=100", "n=1000"
       freq = FALSE)
  lines(density(y_bar[,j]), col="blue") # kernel density estimate
}
## Standardization ##
mu <- (0+2)/2 # theoretical mean of y: (a+b)/2 for U(a,b)
sig <- sqrt((2-0)^2/12) # theoretical sd of y
y_bar_std <- (y_bar-mu)
y_bar_std[,1] <- y_bar_std[,1]/(sig/sqrt(2))
y_bar_std[,2] <- y_bar_std[,2]/(sig/sqrt(10))
y_bar_std[,3] <- y_bar_std[,3]/(sig/sqrt(100))
y_bar_std[,4] <- y_bar_std[,4]/(sig/sqrt(1000))
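A quick sanity check on the standardization: after subtracting \(\mu_y\) and dividing by \(\sigma_y/\sqrt{n}\), every column should have mean near 0 and standard deviation near 1. The sketch below is self-contained, so it reruns a small simulation with an arbitrary seed and hypothetical names (`mu_y`, `sigma_y`, `z_check`) rather than reusing the objects above.

```r
set.seed(1)                         # illustrative seed
mu_y     <- (0 + 2) / 2             # true mean of U(0,2)
sigma_y  <- sqrt((2 - 0)^2 / 12)    # true sd of U(0,2)
n_values <- c(2, 10, 100, 1000)

# One column of standardized sample means per sample size
z_check <- sapply(n_values, function(n) {
  y_bar_sim <- replicate(1000, mean(runif(n, min = 0, max = 2)))
  (y_bar_sim - mu_y) / (sigma_y / sqrt(n))
})
colnames(z_check) <- paste0("n=", n_values)

round(rbind(mean = colMeans(z_check), sd = apply(z_check, 2, sd)), 2)
```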
par(mfrow = c(2, 2)) # Create a 2 x 2 plotting matrix
# The next 4 plots created will be plotted next to each other
for (j in 1:ncol(y_bar_std)){ # one plot per sample size
  hist(y_bar_std[,j],
       col = "lightblue",
       xlim = c(-3, 3),
       ylim = c(0, 0.4),
       main = "Histogram of the standardized sample mean",
       xlab = colnames(y_bar_std)[j], # "n=2", "n=10", "n=100", "n=1000"
       freq = FALSE)
  lines(density(y_bar_std[,j]), col="blue") # kernel density estimate
}
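The blue curves drawn by `lines(density(...))` are kernel density estimates of the simulated data, not the normal density itself. To compare against the limiting distribution directly, one option is to overlay the standard normal pdf with `curve(dnorm(x), add = TRUE)`. The sketch below does this for a single sample size; the seed, the choice \(n=10\), the colors, and the variable names are illustrative.

```r
set.seed(1)                         # illustrative seed
n       <- 10                       # sample size (assumed)
mu_y    <- 1                        # true mean of U(0,2)
sigma_y <- sqrt((2 - 0)^2 / 12)     # true sd of U(0,2)

# Standardized sample means from 1000 simulated samples
z <- replicate(1000, (mean(runif(n, min = 0, max = 2)) - mu_y) / (sigma_y / sqrt(n)))

hist(z,
     col = "lightblue",
     xlim = c(-3, 3),
     ylim = c(0, 0.5),
     main = "Standardized sample mean vs. N(0,1)",
     xlab = "n=10",
     freq = FALSE)
lines(density(z), col = "blue")                                # kernel density estimate
curve(dnorm(x), from = -3, to = 3, add = TRUE, col = "red", lwd = 2)  # N(0,1) pdf
```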
[Answer]
The Central Limit Theorem does seem to be
working. Even with a sample size of 2 observations, the standardized sample mean was
fairly close to normally distributed, since the histogram and the kernel density estimate
drawn as a curve over it come close to the bell shape of a normal distribution.
The fit to the normal distribution
seems even closer with the larger sample sizes. Also, in the unstandardized histograms the mean of each
distribution looks about right (about equal to \(\mu_y\)) and the standard deviation of each
distribution seems to have gotten smaller in proportion to \(1/\sqrt{n}\).
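Visual fit can be supplemented with a simple numerical comparison. The sketch below estimates the share of standardized sample means farther than 2 from zero and compares it with the normal value \(2\Phi(-2)\approx 0.0455\). The cutoff of 2, the sample sizes, the number of simulations, and the seed are illustrative choices, not part of the exercise.

```r
set.seed(1)                         # illustrative seed
mu_y     <- 1                       # true mean of U(0,2)
sigma_y  <- sqrt((2 - 0)^2 / 12)    # true sd of U(0,2)
n_values <- c(2, 10, 100)
n_sim    <- 20000

# Share of standardized sample means with |z| > 2, for each sample size
tail_share <- sapply(n_values, function(n) {
  y_bar_sim <- replicate(n_sim, mean(runif(n, min = 0, max = 2)))
  z <- (y_bar_sim - mu_y) / (sigma_y / sqrt(n))
  mean(abs(z) > 2)
})
names(tail_share) <- paste0("n=", n_values)

tail_share
2 * pnorm(-2)   # tail probability under N(0,1), about 0.0455
```

With \(n=2\) the sample mean of two uniforms has lighter tails than the normal, so its share should sit somewhat below the normal value, and the gap should shrink as \(n\) grows.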