Each lab in this course will have multiple components. First, there will be a document like this one, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in on the Friday that you complete the lab.
If you are comfortable doing so, I strongly suggest using RMarkdown to type your lab write-up. However, if you are new to R, you may handwrite your write-up (I’m also happy to work with you to learn RMarkdown!). All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.
In this lab, we want to visualize how estimators behave as the sample size increases, particularly in relation to bias and consistency.
Suppose, for example, we want to understand how the sample mean changes with sample size. We want a vector of values of the sample mean from \(n=1\) to some large value of \(n\). We can accomplish this through the use of the cumulative family of functions in R. Consider the example below.
data <- 6:10
data
## [1] 6 7 8 9 10
The first thing we want is a vector whose entries are the cumulative sums of the elements in our data vector. We can do this with the cumsum function.
cumsum(data)
## [1] 6 13 21 30 40
Notice that the first entry in this vector is 6, the second is \(6+7\), the third is \(6+7+8\), and so on. To get the running sample means, we need to divide these entries by 1, 2, 3, etc., respectively. We can do this by dividing the vector by a vector of sample sizes.
cumsum(data) / 1:length(data)
## [1] 6.0 6.5 7.0 7.5 8.0
Let’s extend this idea to a much larger example. We will consider \(Y\sim Exp(\theta)\), parameterized so that \(E(Y) = \theta\), and use the sample mean as our estimator of \(\theta\).
set.seed(1860)  # for reproducibility
n <- 10000
theta <- 5
sample <- rexp(n, rate = 1 / theta)   # Exp with rate 1/theta has mean theta
sample.means <- cumsum(sample) / 1:n  # running sample means for n = 1, ..., 10000
Once we have the vector of sample means from \(n=1\) through \(n=10000\), we can compute the bias of each and then plot the results.
bias <- sample.means - theta
library(ggplot2)

ggplot(data.frame(n = 1:n, bias = bias), aes(x = n, y = bias)) +
  geom_hline(yintercept = 0, color = "red") +
  geom_line() +
  labs(x = "Number of Sample Points (n)",
       y = "Empirical Bias")
You will now run the same type of bias analysis with different estimators. Consider a random variable \(Y\sim Unif(0,\theta)\). We know that the estimator \(\frac{n+1}{n}\cdot Y_{(n)}\) is an unbiased estimator of \(\theta\), as is \((n+1)\cdot Y_{(1)}\). The relevant cumulative functions to use here are cummax and cummin.
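As a quick illustration of these two functions, consider the running maxima and minima of the small data vector from earlier; the \(k\)th entry of cummax is the maximum of the first \(k\) observations, and cummin tracks the running minimum in the same way.

cummax(data)
## [1] 6 7 8 9 10
cummin(data)
## [1] 6 6 6 6 6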
(\(\star\)) Using the values \(n = 300\) and \(\theta = 5\), make a bias plot for each of the unbiased estimators. Include your plots in your write-up.
(\(\star\)) Explain how the plots illustrate that these estimators are unbiased.
We would also like to investigate the consistency of estimators through simulations. The example below contains the code that plots the empirical probability of \(\vert\widehat{\theta}_n-\theta\vert < \epsilon\) for many values of \(n\). We set \(\epsilon = 0.2\) and simulate \(N = 500\) samples of size \(n = 10,000\) from an exponential distribution, for which the sample mean is used as an estimator of \(\theta\). Once we have the samples, we need to format them as a matrix with \(N\) rows and \(n\) columns (i.e., one row for each simulation run, and one column for each sample observation).
N <- 500        # number of simulation runs
n <- 10000      # sample size within each run
epsilon <- 0.2
samples <- rexp(N * n, rate = 1 / theta)
samples.matrix <- matrix(samples, N, n)  # one row per run, one column per observation
To find the running sample means within each row (as we did in the previous section), we use the apply function across the rows of the matrix. Note that apply returns its results as columns, so we must use the transpose function t() to get the dimensions we want. The result is a new matrix where each row corresponds to a simulation run and each column corresponds to the sample mean computed from that number of observations.
means.matrix <- t(apply(samples.matrix, 1, function(x){cumsum(x) / 1:n}))
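As an optional sanity check, we can confirm that the dimensions of the resulting matrix are what we intended: one row per simulation run and one column per sample size.

dim(means.matrix)
## [1] 500 10000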
We can then determine whether each entry of the matrix satisfies the inequality \(\vert\widehat{\theta}_n-\theta\vert < \epsilon\), returning a matrix of logical values (i.e., TRUE if the inequality is satisfied, and FALSE if it is not).
means.matrix.logical <- abs(means.matrix - theta) < epsilon
We can approximate the probability \(P(\vert\widehat{\theta}_n-\theta\vert < \epsilon)\) for each value of \(n\) by taking the proportion of TRUE values in each column of the logical matrix. This yields a vector whose entries approximate the probability for each value of \(n\).
consistency <- apply(means.matrix.logical, 2, mean)
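As a side note, colMeans(means.matrix.logical) computes the same vector; R coerces the logical entries to 0s and 1s, so each column mean is exactly the proportion of TRUE values in that column.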
Finally, we can plot the empirical probabilities for each value of \(n\) between 1 and 10,000.

ggplot(data.frame(n = 1:n, consistency = consistency),
       aes(x = n, y = consistency)) +
  geom_hline(yintercept = 1, color = "red") +
  geom_line() +
  labs(x = "Number of Sample Points (n)",
       y = "Empirical Probability",
       title = "Empirical Consistency")
(\(\star\)) Using the values \(n = 1000\), \(N=500\), \(\epsilon = 0.2\), and \(\theta = 5\), make a consistency plot for each of the unbiased estimators, \(\frac{n+1}{n}\cdot Y_{(n)}\) and \((n+1)\cdot Y_{(1)}\) for \(Y\sim Unif(0,\theta)\). Include your plots in your write-up.
(\(\star\)) Based on your simulations, are the two estimators consistent? Explain how you came to your conclusions.