Lab 2 Assignment: Probability

Suppose you collect data from a sample of \(n\) people. You use random sampling, so your data are independent and identically distributed (i.i.d.). You collect data on one variable, \(y\), so \(y_i\) is the value of \(y\) for person \(i\) in your sample, and \(\bar{y}\) is the sample mean. The true mean of \(y\) in the population is \(\mu_y\) and the standard deviation is \(\sigma_y\). With i.i.d. data, the mean and standard deviation are the same for everyone; \(\mu_{yi}=\mu_y\) and \(\sigma_{yi}=\sigma_y\), so there is no need to write the subscript \(i\). \(\bar{y}\) describes the entire sample, so it has no subscript \(i\).

1. The sample average is a random variable. Only by luck is your sample average the same as the population mean. Write down the formulas for the mean and standard deviation of the sample average.

[Answer] \[E(\bar{y}) = E(\frac{1}{n}\sum_{i=1}^{n}y_i)=\frac{1}{n}\sum_{i=1}^{n}E(y_i)=\mu_y.\] \[Var(\bar{y}) = Var(\frac{1}{n}\sum_{i=1}^{n}y_i)= \frac{1}{n^2}\sum_{i=1}^{n}Var(y_i) = \frac{\sigma_y^2}{n} \quad \Rightarrow \quad sd(\bar{y}) = \frac{\sigma_y}{\sqrt{n}}.\]
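The variance line above uses independence when it moves \(Var\) inside the sum. As a brief sketch of that step (standard variance algebra, written with the same symbols as the answer): \[Var\left(\sum_{i=1}^{n}y_i\right)=\sum_{i=1}^{n}Var(y_i)+2\sum_{i<j}Cov(y_i,y_j)=n\sigma_y^2,\] because \(Cov(y_i,y_j)=0\) for \(i\neq j\) when the observations are independent. Multiplying by \(1/n^2\) gives \(\sigma_y^2/n\).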

2. Write the formal definition of the Law of Large Numbers, and explain intuitively what it means.

[Answer] The Law of Large Numbers: If \(y_i\), \(i=1,...,n\), are independent and identically distributed with finite variance, then \[\bar{y}\rightarrow_p \mu_y\] or \[\lim_{n\rightarrow\infty} \Pr(|\bar{y} - \mu_y|>\epsilon) = 0 \text{ for every }\epsilon>0.\] Intuitively, the sample mean converges to the true (theoretical) mean of \(y\), as the sample becomes very large.
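One way to see why the limit holds under the finite-variance assumption is to combine Chebyshev's inequality with \(Var(\bar{y})=\sigma_y^2/n\) from question 1. This is a sketch of the standard argument, not part of the required answer: \[\Pr(|\bar{y}-\mu_y|>\epsilon)\le\frac{Var(\bar{y})}{\epsilon^2}=\frac{\sigma_y^2}{n\epsilon^2}\rightarrow 0 \text{ as } n\rightarrow\infty.\] For any fixed \(\epsilon>0\), the bound shrinks at rate \(1/n\).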

3. Write the formal definition of the Central Limit Theorem, and explain intuitively what it means.

[Answer] The Central Limit Theorem: If \(y_i\), \(i=1,...,n\), are independent and identically distributed with finite variance, then \[\sqrt{n}(\bar{y} - \mu_y) \rightarrow_d N(0, \sigma_y^2).\] Intuitively, although the sample mean converges to the true mean of \(y\), the sample mean \(\bar{y}\) is not exact for any finite \(n\). So there is some “error” in estimating the true mean, and the distribution of that error converges to a normal shape. In particular, the distribution of the error multiplied by \(\sqrt{n}\) approaches a normal distribution with mean zero and variance \(\sigma_y^2\).
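The same statement can be rewritten in standardized form, which is the version used for the histograms in question 9. Assuming \(\sigma_y>0\), dividing by \(\sigma_y\) gives \[\frac{\bar{y}-\mu_y}{\sigma_y/\sqrt{n}}\rightarrow_d N(0,1),\] so for large \(n\) the sample mean \(\bar{y}\) is approximately distributed as \(N(\mu_y, \sigma_y^2/n)\). This is an informal restatement of the theorem, not an additional result.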

4. Using R, draw an i.i.d. random sample from a uniform distribution, `U(0,2)`. The sample size is `100`. What is the sample mean for your numbers? (Three significant digits are plenty.) Also, what is the sample standard deviation?

set.seed(123)
N <- 100
y <- runif(N, 0, 2)
mean(y)

## [1] 0.997118

sd(y)

## [1] 0.5699905
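For comparison, the theoretical mean and standard deviation of a uniform distribution on \([a, b]\) are \((a+b)/2\) and \(\sqrt{(b-a)^2/12}\). The sketch below evaluates them for `U(0,2)`; the names `a`, `b`, `mu_y`, and `sigma_y` are introduced here for illustration only.

```r
# Theoretical moments of y ~ U(a, b); a and b match the U(0,2) draw above
a <- 0
b <- 2
mu_y    <- (a + b) / 2            # theoretical mean
sigma_y <- sqrt((b - a)^2 / 12)   # theoretical standard deviation
round(c(mean = mu_y, sd = sigma_y), 3)
```

The theoretical values are 1 and about 0.577, so the sample mean and sample standard deviation above are both close to their population counterparts.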

5. Repeat the process in question 4 four more times. What is the sample mean of y each time?

Nsim <- 4                  # the number of simulations
y_bar <- rep(NA, Nsim)     # create a vector for sample mean

for (i in 1:Nsim){                      # write a loop
  y.data <- runif(N, min=0, max=2)      # generate a sample with size N
  y_bar[i] <- mean(y.data)              # record the sample mean for sample i
}    

y_bar

## [1] 1.0284498 0.9721034 0.9843849 0.9707808

6. In questions 4 through 5, you computed the sample mean and standard deviation 5 times. However, that is not enough. We want to do it so many times that we can safely check the Law of Large Numbers and the Central Limit Theorem. So, let’s repeat the process in question 4 one thousand times. Each time, your program generates 100 random numbers (i.e., sample size is 100), and saves the sample mean. What is the average of the 1,000 sample means of \(y\)? Hopefully it is about equal to the true mean of \(y\), \(\mu_y\). What are the variance and standard deviation of the sample means of \(y\)? Hopefully they are small.

Nsim <- 1000                  # the number of simulations
y_bar <- rep(NA, Nsim)        # create a vector for sample mean

for (i in 1:Nsim){                      # write a loop
  y.data <- runif(N, min=0, max=2)      # generate a sample with size N
  y_bar[i] <- mean(y.data)              # record the sample mean for sample i
}    

hist(y_bar, 
     col="lightblue",
     main="The sampling distribution of the sample average",
     ylim=c(0, 8),
     freq = FALSE)
lines(density(y_bar), col="blue")

mean(y_bar)  # sample average of the sample means of y

## [1] 0.998566

var(y_bar)   # sample variance of the sample means of y

## [1] 0.003260772

sd(y_bar)    # sample standard deviation of the sample means of y

## [1] 0.05710317
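From question 1, the theoretical values here are \(\sigma_y^2/n = (1/3)/100 \approx 0.00333\) for the variance and \(\sigma_y/\sqrt{n} \approx 0.0577\) for the standard deviation, so the simulated values are close. The sketch below repeats the simulation with `replicate()` instead of an explicit loop and prints the simulated and theoretical standard deviations side by side. It is one possible alternative, not the required approach, and the seed is an arbitrary choice.

```r
set.seed(123)   # arbitrary seed, for reproducibility only
N    <- 100     # sample size
Nsim <- 1000    # number of simulated samples

# replicate() evaluates the expression Nsim times and collects the results
y_bar <- replicate(Nsim, mean(runif(N, min = 0, max = 2)))

sigma_y <- sqrt((2 - 0)^2 / 12)   # theoretical sd of y ~ U(0,2)
c(simulated = sd(y_bar), theoretical = sigma_y / sqrt(N))
```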

7. When you generated random variable \(y\), you used a uniform pdf between 0 and 2, so \(y\) was equally likely to be each number between 0 and 2.

7.(1) From this uniform pdf, what is the true mean of random variable y?

[Answer] The true mean for a uniform pdf between 0 and 2 is exactly 1; \(\mu_y=1\)

7.(2) Looking at your results for question 6, was a sample of \(100\) values of \(y\) large enough that the sample mean \(\bar{y}\) is usually close to the true mean? About how close?

[Answer] The sample averages \(\bar{y}\) were usually close to 1, typically within about \(2\times sd(\bar{y}) \approx 0.11\) of 1 (roughly between 0.89 and 1.11), though occasionally farther away.
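The "usually" in this answer can be quantified. Under a normal approximation, roughly 95% of the sample means should fall within two standard deviations of the true mean. The sketch below counts the share of simulated sample means inside that band; it regenerates the sample means with an arbitrary seed, so the exact share will differ from run to run.

```r
set.seed(123)   # arbitrary seed, for reproducibility only
N    <- 100
Nsim <- 1000
y_bar <- replicate(Nsim, mean(runif(N, min = 0, max = 2)))

band   <- 2 * sd(y_bar)                 # two standard deviations of the sample mean
within <- mean(abs(y_bar - 1) <= band)  # share of sample means within the band around 1
within
```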

7.(3) There were still variations, though, in the sample mean. What was the (sample) standard deviation \(\hat{\sigma}_{\bar{y}}\) of \(\bar{y}\)? This is, approximately, \(\sigma_{\bar{y}}\) not \(\sigma_{y}\).

[Answer] The (sample) standard deviation of \(\bar{y}\) was, in my case, \(\hat{\sigma}_{\bar{y}}=0.057\). (I used the hat symbol over \(\sigma_{\bar{y}}\) since this is a sample standard deviation, not exactly the true standard deviation.)

7.(4) What is \({\sigma}_{\bar{y}}\) and how is it different from \({\sigma}_y\)?

[Answer] The symbol \({\sigma}_{\bar{y}}\) represents the standard deviation of different possible sample averages \(\bar{y}\), while \({\sigma}_y\) represents the standard deviation of different possible \(y_i\) values; despite looking like similar symbols, they are completely different things!

8.(1) If the sample size is 10 instead of 100, how much more or less should the variance and standard deviation be, when you create the 1,000 different sample means? What if the sample size is 2? What if the sample size is 1,000? Answer question 8(1) without R (use your answer to question 1).

[Answer] From the answer to question 1, the true variance and standard deviation of \(\bar{y}\) are \({\sigma}_{\bar{y}}^2= \sigma_y^2/n\) and \({\sigma}_{\bar{y}}= \sigma_y/\sqrt{n}\). So, with a sample of size 10 the variance should be \(\sigma_y^2/10\), while with a sample of size 100 the variance should be \(\sigma_y^2/100\). Thus, a sample of size 10, compared to a sample of size 100, should have 10 times as much variance in its sample means \(\bar{y}\), and a standard deviation that is \(\sqrt{10}\approx 3.2\) times as large. Similarly, for samples of size 2 or 1,000 the variances should be \(\sigma_y^2/2\) or \(\sigma_y^2/1000\), so the variance of the sample means should be 50 times larger or 10 times smaller than if the sample size were 100, and the standard deviation should be \(\sqrt{50}\approx 7.1\) times larger or \(\sqrt{10}\approx 3.2\) times smaller.
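Although this part is meant to be answered by hand, the arithmetic can be double-checked afterwards. The sketch below tabulates the theoretical variance and standard deviation of \(\bar{y}\) for \(y \sim U(0,2)\), where \(\sigma_y^2 = (2-0)^2/12 = 1/3\). The data frame `theory` and its column names are made up for this illustration.

```r
sigma2_y <- (2 - 0)^2 / 12     # theoretical variance of y ~ U(0,2), equal to 1/3
n <- c(2, 10, 100, 1000)       # sample sizes from question 8

theory <- data.frame(
  n                = n,
  var_ybar         = sigma2_y / n,         # sigma_y^2 / n
  sd_ybar          = sqrt(sigma2_y / n),   # sigma_y / sqrt(n)
  var_ratio_vs_100 = 100 / n               # variance relative to the n = 100 case
)
theory
```

These theoretical values give a benchmark for the simulated variances and standard deviations in question 8.(2).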

8.(2) Change your program to use a sample size of 10 and create the 1,000 sample means and check the standard deviation of the sample means. Do the same with sample sizes of 2 and 1,000. What are the variances and standard deviations of the sample means with sample sizes of 2, 10, 100, and 1,000?

Nsim <- 1000                      # the number of simulations
N.list <- c(2, 10, 100, 1000)     # a list of sample sizes
# create a matrix for the sample means: one row per simulation, one column per sample size
y_bar <- matrix(NA, nrow=Nsim, ncol=length(N.list))
colnames(y_bar) <- paste0("n=", N.list)
for (j in seq_along(N.list)){
  N <- N.list[j]

  for (i in 1:Nsim){                        # loop over the simulations
    y.data <- runif(N, min=0, max=2)        # generate a sample with size N
    y_bar[i, j] <- mean(y.data)             # record the sample mean for sample i
  }
}

apply(y_bar, 2, mean)   # compute the sample average for different n

##       n=2      n=10     n=100    n=1000 
## 0.9993824 1.0037622 1.0002126 0.9993071

apply(y_bar, 2, var)    # compute the sample variance for different n

##          n=2         n=10        n=100       n=1000 
## 0.1628883226 0.0337147483 0.0033002447 0.0003201849

apply(y_bar, 2, sd)     # compute the sample standard deviation for different n

##        n=2       n=10      n=100     n=1000 
## 0.40359425 0.18361576 0.05744776 0.01789371

8.(3) Does your result fit what is expected from the Law of Large Numbers? Explain.

[Answer] According to the Law of Large Numbers, as the sample size \(n\) increases toward infinity, the probability of \(\bar{y}\) differing from \(\mu_y\) by more than any amount \(\epsilon>0\) should dwindle to zero. That agrees with the results above: the sample means are on average close to the true mean for every \(n\), and their variance keeps shrinking as the sample size grows (from about 0.163 at \(n=2\) to about 0.0003 at \(n=1{,}000\)).
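The probability in the Law of Large Numbers can also be estimated directly. The sketch below picks an illustrative tolerance `eps = 0.1` (an assumption, any positive value would do) and computes, for each sample size, the share of simulated sample means that miss \(\mu_y = 1\) by more than `eps`.

```r
set.seed(123)   # arbitrary seed, for reproducibility only
Nsim   <- 1000
N.list <- c(2, 10, 100, 1000)
eps    <- 0.1   # illustrative tolerance
mu_y   <- 1     # true mean of U(0,2)

# For each sample size, estimate Pr(|y_bar - mu_y| > eps) from Nsim simulated samples
share_far <- sapply(N.list, function(n) {
  y_bar <- replicate(Nsim, mean(runif(n, min = 0, max = 2)))
  mean(abs(y_bar - mu_y) > eps)
})
names(share_far) <- paste0("n=", N.list)
share_far
```

The estimated share should fall steadily as \(n\) grows and be essentially zero at \(n=1{,}000\), which is the pattern the Law of Large Numbers predicts.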

9. Show four histograms, one for each set of 1,000 sample means generated using the four sample sizes, n=2, n=10, n=100, and n=1000. According to the Central Limit Theorem, as the sample size increases toward infinity, the distribution of the different sample means should approach a normal distribution with an increasingly small variance. Explain whether the Central Limit Theorem seems to be working as the sample size increases from n=10 to n=100 to n=1000. Note: You could standardize the sample means by (i) subtracting the true mean of the sample means, and then (ii) dividing by the true standard deviation of the sample means. This keeps the mean and standard deviation of the graphed distributions similar as \(n\) gets larger.

par(mfrow = c(2, 2)) # Create a 2 x 2 plotting matrix
# The next 4 plots created will be plotted next to each other

# Plots 1-4: one histogram per sample size (the columns of y_bar)
for (j in seq_along(N.list)) {
  hist(y_bar[,j], 
       col = "lightblue",
       xlim = c(0, 2),
       ylim = c(0, 25),      # the density for n=1000 peaks near 22, so 20 would clip it
       main = "Histogram of the sample mean",
       xlab = colnames(y_bar)[j],
       freq = FALSE)
  lines(density(y_bar[,j]), col="blue")   # kernel density estimate
}

## Standardization ##
mu <- (0+2)/2            # theoretical mean of y ~ U(0,2): (a+b)/2
sig <- sqrt((2-0)^2/12)  # theoretical sd of y ~ U(0,2): sqrt((b-a)^2/12)
y_bar_std <- (y_bar-mu)  # (i) subtract the true mean of the sample means
for (j in seq_along(N.list)) {
  # (ii) divide by the true sd of the sample mean, sig/sqrt(n)
  y_bar_std[,j] <- y_bar_std[,j]/(sig/sqrt(N.list[j]))
}

par(mfrow = c(2, 2)) # Create a 2 x 2 plotting matrix
# The next 4 plots created will be plotted next to each other

# Plots 1-4: one histogram per sample size (the columns of y_bar_std)
for (j in seq_along(N.list)) {
  hist(y_bar_std[,j], 
       col = "lightblue",
       xlim = c(-3, 3),
       ylim = c(0, 0.5),     # leave headroom above the N(0,1) peak of about 0.4
       main = "Histogram of the standardized sample mean",
       xlab = colnames(y_bar_std)[j],
       freq = FALSE)
  lines(density(y_bar_std[,j]), col="blue")   # kernel density estimate
}

[Answer] The Central Limit Theorem does seem to be working. Even with a sample size of 2 observations, the sample mean was fairly close to normally distributed (its exact distribution is triangular), since the histogram and its kernel density estimate, drawn as a blue curve over the histogram, come close to the bell shape of a normal distribution. The fit to the normal distribution seems even closer with the larger sample sizes. Also, in the unstandardized histograms the mean of each distribution looks about right (about equal to \(\mu_y\)) and the standard deviation of each distribution seems to have gotten smaller in proportion to \(1/\sqrt{n}\).
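To make the comparison with the normal distribution more explicit, one option is to overlay the \(N(0,1)\) density on each standardized histogram and to compute a numerical distance from normality. The sketch below does both. It regenerates the sample means with an arbitrary seed, and the Kolmogorov-Smirnov statistic (the largest gap between the empirical CDF and the \(N(0,1)\) CDF) is one possible distance measure, chosen here for illustration rather than taken from the assignment.

```r
set.seed(123)   # arbitrary seed, for reproducibility only
Nsim   <- 1000
N.list <- c(2, 10, 100, 1000)
mu     <- (0 + 2) / 2            # theoretical mean of y ~ U(0,2)
sig    <- sqrt((2 - 0)^2 / 12)   # theoretical sd of y ~ U(0,2)

# Standardized sample means: one column per sample size
z <- sapply(N.list, function(n) {
  y_bar <- replicate(Nsim, mean(runif(n, min = 0, max = 2)))
  (y_bar - mu) / (sig / sqrt(n))
})
colnames(z) <- paste0("n=", N.list)

par(mfrow = c(2, 2))
for (j in seq_along(N.list)) {
  hist(z[, j], freq = FALSE, col = "lightblue",
       xlim = c(-3, 3), ylim = c(0, 0.5),
       main = "Standardized sample mean", xlab = colnames(z)[j])
  curve(dnorm(x), from = -3, to = 3, add = TRUE, col = "red")  # N(0,1) density
}

# Kolmogorov-Smirnov statistic against N(0,1) for each sample size
ks_stat <- apply(z, 2, function(col) unname(ks.test(col, "pnorm")$statistic))
ks_stat
```

Small values of `ks_stat` for every column indicate that the standardized sample means are close to standard normal. With only 1,000 simulated means per column, differences between the columns are partly simulation noise, so they should be read with caution.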