In this project, we will work with the sampling distributions of statistics, the Central Limit Theorem (CLT), and properties of estimators.

The Central Limit Theorem and the properties of estimators are one of the most powerful concepts/tools that we learn in this course. It will send you back to the beginning of this class, require you to remind yourself and use the distributions and probability principles that we studied then, and yet at the end of this class now, it will set the stage for new beginnings for many things to come. Make sure to have fun, and good luck!

1 Random Samples

1.1 Binomial Distibution

Consider the Binomial distribution \(\text{Binom}(n=10, p = 0.8)\). We will get random samples from this distribution and check out the histograms for each of these samples. We expect that as the sample size increases, the sample distribution will provide an increasingly improved approximation of the the true distribution.

1.1.1 Probability Mass Function of \(\text{Binom}(n=10, p = 0.8)\)..

Plot the probability mass function of \(\text{Binom}(n=10, p = 0.8)\).

x <- 0:10
plot(x, dbinom(x, 10, p=.8),type='h', lwd = 5, main = paste("PMF of Binom(n = 10, p = 0.8)"))

1.1.2 Sample distributions

For the sample sizes \(4, 7, 10, 15, 20, 30, 40, 80, 1000\) get random samples of respective size, and plot the histograms for each of these samples.

sample_sizes <- c(4, 7, 10, 15, 20, 30, 40, 80, 1000)
par(mfrow = c(2, 2))
for(sizes in sample_sizes){
  vals <- rbinom(n = sizes, 10, p = 0.8)
  hist(vals, breaks = 20, main = paste("Histogram for sample of size", sizes))
}

WHAT DO YOU NOTICE? As the sample size increases, the distribution of the random sample approaches a normal distribution (but slightly skewed to the right)

1.2 Gamma Distribution

Consider the Gamma distribution \(\text{Gamma}(\alpha = 2, \beta = 1.5)\). We will get random samples from this distribution and check out the histograms for each of these samples. We expect that as the sample size increases, the sample distribution will provide an increasingly improved approximation of the the true distribution.

1.2.1 Density of \(\text{Gamma}(\alpha = 4, \beta = 2)\).

Plot the density curve of \(\text{Gamma}(\alpha = 4, \beta = 2)\),

x <- seq(0, 30, length.out = 10000)
y <- dgamma(x, shape = 4, scale = 2)
plot(x, y,type = 'l',main = "Density of the Gamma Distribution")
abline(v=0, h=0)

1.2.2 Sample distributions

For the sample sizes \(4, 7, 10, 15, 20, 30, 40, 80, 1000\) get random samples of respective size, and plot the histograms for each of these samples.

sample_sizes <- c(4, 7, 10, 15, 20, 30, 40, 80, 1000)
par(mfrow = c(2, 2))
for(sizes in sample_sizes){
  vals <- rgamma(n = sizes, shape = 4, scale = 2)
  hist(vals, breaks = 20, main = paste("Histogram for sample of size", sizes))
}

What do you notice? As the sample size increases, the distribution starts to have the shape of a normal distribution (but skewed to the left)

2 Sampling Distribution of Sample Max

2.1 Discrete Uniform distribution

Suppose we are working with the discrete uniform random variable taking values \(\{1, 2, 3, 4, 5, 6\}\).

Plot the pmf of the given discrete distribution

x <- 1:6
plot(x, dunif(x, min = 1, max = 6),type='h', lwd = 5, main = paste("PMF of Given Discrete Distribution" ))

Define a function “disc_samp” that takes input “n” and returns a random sample of size “n” from this distribution.

#takes values 1, 2, 3, 4, 5, 6
disc_samp <- function(n){
  round(runif(n,min=1, max=6))
}

Use the “disc_samp” function and the replicate function to to get the histograms for the sampling distribution of the sample max when working with sample sizes \(n = 1, 2, 3, 4,5, 15\). Be sure to have appropriate titles for your histograms.

sample_sizes <- c(1, 2, 3, 4, 5, 15)
par(mfrow = c(2, 2))
for(sizes in sample_sizes){
  vals <- disc_samp(sizes)
  hist(vals, breaks = 20, main = paste("Histogram for sample of size", sizes))
}

WHAT DO YOU NOTICE As the sample size increases, the distribution becomes less uniform

2.2 A Skewed Discrete Distribution

Suppose we are working with the discrete random variable taking values \(\{1, 2, 3, 4, 5, 6\}\) and having the following probability mass function: \[p(1) = 0.5, \quad p(2) = p(3) = \cdots = p(6) = 0.1\]

Plot the pmf of the given discrete distribution

x <- 1:6
Probabilities <- c(0.5, 0.1, 0.1, 0.1, 0.1, 0.1)
plot(x, Probabilities,type='h', lwd = 5, main = paste("PMF of Given Discrete Distribution"))

Define a function “disc_samp_1” that takes input “n” and returns a random sample of size “n” from this distribution.

disc_samp1 <- function(n){
rv_vals <- 1:6
probs <- c(0.5, 0.1, 0.1, 0.1, 0.1, 0.1)
samp <- sample(x = rv_vals, size = n, replace = TRUE, prob = probs )
return(samp)
}

Use the “disc_samp_1” function and the replicate function to to get the histograms for the sampling distribution of the sample max when working with sample sizes \(n = 1, 2, 3, 4,5 15\). Be sure to have appropriate titles for your histograms.

sample_sizes <- c(1, 2, 3, 4, 5, 15)
par(mfrow = c(2, 2))
for(n in sample_sizes){
reps <- replicate(10000, {max(disc_samp1(n))})
hist(reps,
main = paste("Sampling Dist of Sample Max with n = ", n))
}

WHAT DO YOU NOTICE The sample max becomes 6 more and more frequently as the sample size increases

3 Central Limit Theorem

3.1 Exponential Distribution

Suppose we are working with a population that has the exponential distibution with \(\lambda = 2\).

Use the replicate function to get the histograms for the sampling distribution of the sample mean when working with sample sizes \(n = 1, 2, 3, 4, 15, 500\). Be sure to have appropriate titles for your histograms.

samp_sizes  <- c(1, 2, 3, 4, 15, 500)

par(mfrow=c(2,2))
for(size in samp_sizes){
  replicates <- replicate(10000, {
    mean(rexp(size, rate = 2))
  })
  hist(replicates, breaks = 100,
       main = paste("Samp Dist for Exp, samp size =", size ))
}

What do you notice? As the sample size increases, the distribution approaches a normal distribution

3.2 Discrete Uniform distibution

Suppose we are working with the discrete uniform random variable taking values \(\{1, 2, 3, 4, 5, 6\}\).

Define a function “disc_samp” that takes input “n” and returns a random sample of size “n” from this distribution.

disc_samp <- function(n){
  round(runif(n,min=1, max=6))
}

Use the “disc_samp” function and the replicate function to to get the histograms for the sampling distribution of the sample mean when working with sample sizes \(n = 1, 2, 3, 4, 15, 500\). Be sure to have appropriate titles for your histograms.

sample_sizes <- c(1, 2, 3, 4, 15, 500)
par(mfrow = c(2, 2))
for(sizes in sample_sizes){
  vals <- disc_samp(sizes)
  hist(vals, breaks = 20, main = paste("Histogram for sample of size", sizes))
}

What do you notice? As the sample size increases, the distribution becomes more uniform

3.3 Continuous Uniform distibution

Suppose we are working with the Continuous uniform random variable taking values on \((0,1)\).

Define a function “cont_uni_samp” that takes input “n” and returns a random sample of size “n” from this distribution.

cont_uni_samp <- function(n){
  runif(n,min=0, max=1)
}
#cont_uni_samp(20)

Use the “cont_uni_samp” function and the replicate function to to get the histograms for the sampling distribution of the sample mean when working with sample sizes \(n = 1, 2, 3, 4, 15, 500\). Be sure to have appropriate titles for your histograms.

sample_sizes <- c(1, 2, 3, 4, 15, 500)
par(mfrow = c(2, 2))
for(sizes in sample_sizes){
  vals <- cont_uni_samp(sizes)
  hist(vals, breaks = 20, main = paste("Histogram for sample of size", sizes))
}

What do you notice? The distribution becomes more uniform as the sample size increases.

4 Estimators for the Gamma parameters

Recall from class that we calculated the estimators for \(\alpha\) and \(\beta\) using the method of moments for a sample of data values of size \(n\) as: \[\hat{\alpha} = \frac{\overline{X}^2}{(1/n)\sum X_i^2 - \overline{X}^2}\quad \hat{\beta} = \frac{(1/n)\sum X_i^2 - \overline{X}^2}{\overline{X}}\]

Write functions “alpha_hat” and “beta_hat” that take input a sample of data values “samp” and output the given estimators.

alpha_hat <- function(samp) {
  n <- length(samp)
  return(mean(samp)^2 / (1/n) * sum(samp^2) - mean(samp)^2)
}
  
beta_hat <- function(samp) {
  n <- length(samp)
  return((1/n) * sum(samp^2) - mean(samp)^2 / mean(samp))
}

Now let’s work with a Gamma distribution with \(\alpha = 2\) and \(\beta = 4\).

4.1 Bias, Variance, and MSE of Gamma Estimators.

Recall that we calculated the following identity in class: \[ \rm{MSE}(\hat{\theta}) = V(\hat{\theta}) + (\rm{Bias}(\hat{theta}))^2.\] This problem will empirically verify this identity. \

Use the replicate function to calculate the empirical variance and bias of the estimators \(\hat{\alpha}\) and \(\hat{\beta}\) when sampling a sample of size 30 from \(\rm{Gamma}(\alpha = 2, \beta = 4)\). Store these numbers in variables “var_emp” and “bias_emp”.

# Replicate data
data <- replicate(1000, rgamma(30, 2, rate = 4))

#Apply alpha hat to replicated data
alpha_hats <- lapply(data, alpha_hat)
beta_hats <- lapply(data, beta_hat)

# Find the bias and variance of beta and alpha
bias_emp_alpha <- mean(unlist(alpha_hats)) - 2
bias_emp_beta <- mean(unlist(beta_hats)) - 4
var_emp_alpha <- var(unlist(alpha_hats))
var_emp_beta <- var(unlist(beta_hats))

# Put values into a list
var_emp <- list(alpha = var_emp_alpha, beta = var_emp_beta)
bias_emp <- list(alpha = bias_emp_alpha, beta = bias_emp_beta)

Use the replicate function to calculate the Mean Squared Error for \(\hat{\alpha}\) and \(\hat{\beta}\) when sampling a sample of size 30 from \(\rm{Gamma}(\alpha = 2, \beta = 4)\). Store this number in “MSE_emp”.

# Functions to find MSE's
mse_alpha <- function(samp) {
  alpha_hat <- alpha_hat(samp)
  (alpha_hat - 2)^2
}
mse_beta <- function(samp) {
  beta_hat <- beta_hat(samp)
  (beta_hat - 4)^2
}

# Replicate function
data_mse <- replicate(1000, rgamma(30, 2, rate = 4))

# Find MSE's with replicated data
mse_alpha_values <- lapply(data_mse, mse_alpha)
mse_beta_values <- lapply(data_mse, mse_beta)

# Find empirical MSE
mse_emp_alpha <- mean(unlist(mse_alpha_values))
mse_emp_beta <- mean(unlist(mse_beta_values))
mse_emp <- list(alpha = mse_emp_alpha, beta = mse_emp_beta)

If there is justice in this world, you must get \[\rm{mse}_{emp}\approx\rm{var}_{emp} + \rm{bias}_{emp}.\] Check if the above approximation holds.

mse_emp$alpha
## [1] 6.214721
var_emp$alpha + bias_emp$alpha
## [1] 0.9026041
mse_emp$beta
## [1] 17.10149
var_emp$beta + bias_emp$beta
## [1] -4.052045

5 Bias, Variance, and MSE for estimator for \(\mu^2\)

Suppose we have a random sample of size n coming from the normal distribution \(N(\mu= 5, \sigma = 1.5)\). Recall from class that \(\hat{\theta} = \overline{X}^2\) is a biased estimator for \(\mu^2\), and we calculated the bias to be

\[ \text{Bias}(\hat{\theta}) = \frac{\sigma^2}{n}.\]

Write function that takes input a random sample and outputs the square of the sample mean (this is the estimator \(\hat{\theta}\).

calculate_theta_hat <- function(sample) {
  return(mean(sample)^2)
}

Use the replicate function to calculate the empirical variance and bias of the estimator \(\hat{\theta}\) when sampling a sample of size 15 from \(N(\mu = 5, \sigma = 1.5)\). Store these numbers in variables “var_emp” and “bias_emp”.

# Replicate to find empirical variance and bias
var_emp <- replicate(1000, var(rnorm(15, mean = 5, sd = 1.5)))
bias_emp <- replicate(1000, 1.5^2 / 15)

Use the replicate function to calculate the Mean Squared Error for \(\hat{\theta}\) when sampling a sample of size 15 from \(N(\mu = 5, \sigma = 1.5)\). Store this number in “MSE_emp”.

theta_hat <- calculate_theta_hat(rnorm(15, mean = 5, sd = 1.5))
mse <- mean((theta_hat - 5^2)^2)
MSE_emp <- replicate(1000, mse)

You must hopefully get \[\rm{mse}_{emp}\approx\rm{var}_{emp} + \rm{bias}_{emp}^2.\]

Check if this is true for the estimator \(\hat{\theta}\).

MSE_emp[1]
## [1] 27.66718
mean(var_emp) + mean(bias_emp)^2
## [1] 2.318013