Each lab in this course will have multiple components. First, there will be a piece like the document below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in on the Friday that you do your lab.
Your lab write-up should be typed in RMarkdown. All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.
In this lab, we want to visualize how estimators behave as the sample size increases, particularly in relation to bias and consistency.
Suppose, for example, we want to understand how the sample mean changes with sample size. We want a vector of values of the sample mean from \(n=1\) to some large value of \(n\). We can accomplish this through the use of the cumulative family of functions in R. Consider the example below.
data <- 6:10
data
## [1] 6 7 8 9 10
The first thing we want is a vector whose entries are the cumulative sums of the elements in our data vector. We can do this with the cumsum function.
cumsum(data)
## [1] 6 13 21 30 40
Notice that the first entry in this vector is 6, the second is \(6+7\), the third is \(6+7+8\), and so on. To get the mean, we need to divide the entries by 1, 2, 3, etc., respectively. We can do this by dividing the vector by a vector of sample sizes.
cumsum(data) / 1:length(data)
## [1] 6.0 6.5 7.0 7.5 8.0
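As a side note, the dplyr package also provides a cummean function that computes these running means directly; assuming dplyr is installed, the one-liner below reproduces the result above.

dplyr::cummean(data)

## [1] 6.0 6.5 7.0 7.5 8.0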
Let’s extend this idea to a much larger example. We will consider \(Y\sim Exp(\theta)\), parameterized so that \(\theta\) is the mean (which is why the code below uses rate = 1 / theta), and use the sample mean as our estimator of \(\theta\).
set.seed(1860)
n <- 10000
theta <- 5
sample <- rexp(n, rate = 1 / theta)
sample.means <- cumsum(sample) / 1:n
Once we have the vector of sample means from \(n=1\) through \(n=10000\), we can compute the empirical bias of each (the difference between the sample mean and the true value \(\theta\)) and then plot the results.
bias <- sample.means - theta
library(ggplot2)

ggplot(data.frame(n = 1:n, bias = bias), aes(x = n, y = bias)) +
  geom_hline(yintercept = 0, color = "red") +
  geom_line() +
  labs(x = "Number of Sample Points (n)",
       y = "Empirical Bias")
You will now run the same type of bias analysis with different estimators. Consider a random variable \(Y\sim Unif(0,\theta)\) and the estimators \(\frac{n+1}{n}\cdot Y_{(n)}\) and \((n+1)\cdot Y_{(1)}\) of \(\theta\).
(\(\star\)) Show that \(\frac{n+1}{n}\cdot Y_{(n)}\) and \((n+1)\cdot Y_{(1)}\) are both unbiased.
(\(\star\)) Using the values \(n = 300\) and \(\theta = 5\), make a bias plot for each of the unbiased estimators. Include your plots in your write-up. [Note: the relevant cumulative functions to use here are cummax and cummin; see the short illustration after these problems.]
(\(\star\)) Explain how the plots illustrate that these estimators are unbiased, supporting your theoretical justification from problem 1.
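If you have not used cummax and cummin before, here is a quick illustration of how they behave on a small vector (this demonstrates only the functions themselves, not the full solution to the problems above).

x <- c(2, 5, 3, 7, 4)
cummax(x)

## [1] 2 5 5 7 7

cummin(x)

## [1] 2 2 2 2 2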
Another important property of an estimator is consistency. Here we will use the notation \(\widehat{\theta}_n\) for an estimator, with the subscript serving as a reminder that the value of an estimator (typically) depends on the sample size \(n\).
Definition. An estimator \(\widehat{\theta}_n\) is a consistent estimator of \(\theta\) if, for any positive number \(\epsilon\), \[\lim_{n\to \infty}P(\vert\widehat{\theta}_n-\theta\vert\le \epsilon)=1.\] Let’s break down exactly what this expression is telling us. The event \(\vert\widehat{\theta}_n-\theta\vert\le \epsilon\) occurs when the estimate lands within \(\epsilon\) of the true parameter value, so the definition says that, no matter how small a tolerance \(\epsilon\) we choose, the probability of landing within that tolerance approaches 1 as the sample size grows.
Overall, we can think of a consistent estimator as an estimator that gets better and better (closer and closer to the true value of the population parameter) as the number of sample points increases.
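For the sample mean, this behavior is guaranteed in general by the weak law of large numbers: if \(Y_1, Y_2, \ldots\) are independent and identically distributed with finite mean \(\mu\), then for any \(\epsilon > 0\), \[\lim_{n\to\infty}P(\vert \bar{Y}_n-\mu\vert\le \epsilon)=1,\] which is exactly the definition above with \(\widehat{\theta}_n = \bar{Y}_n\) and \(\theta = \mu\). The simulation below lets us see this convergence empirically.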
We would also like to investigate the consistency of estimators through simulations. The example below contains the code that plots the empirical probability of \(\vert\widehat{\theta}_n-\theta\vert < \epsilon\) for many values of \(n\). We set \(\epsilon = 0.2\) and simulate \(N = 500\) samples of size \(n = 10000\) (i.e., a data frame with one column for each simulation run and one row for each sample observation) from the same exponential distribution as before, with the sample mean as our estimator of \(\theta\).
N <- 500
n <- 10000
epsilon <- 0.2
sims <- data.frame(replicate(N, rexp(n, rate = 1 / theta)))
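As a quick sanity check, the dimensions of sims should match this description: one row per sample point and one column per simulation run.

dim(sims)

## [1] 10000   500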
To find the cumulative sample means of each column, we use the summarize function from the dplyr package, applying our cumulative-mean computation across the columns of the data frame. This results in a new data frame where each column corresponds to a simulation run and each row corresponds to a sample mean for some number of samples (i.e., the sixth row is the sample mean taken over the first six sample points).
library(dplyr)

sims_means <- summarize(sims, across(everything(), function(x){cumsum(x) / 1:n}))
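One caveat: recent versions of dplyr (1.1.0 and later) warn that returning more than one row from summarize is deprecated and recommend the reframe verb instead. If your dplyr version is new enough, the following equivalent sketch avoids the warning.

sims_means <- reframe(sims, across(everything(), function(x){cumsum(x) / 1:n}))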
We can then determine whether each entry satisfies the inequality \(\vert\widehat{\theta}_n-\theta\vert < \epsilon\). Comparing the data frame to \(\epsilon\) returns a matrix of logical values (i.e., TRUE if the inequality is satisfied, and FALSE if it is not).
sims_means_logical <- abs(sims_means - theta) < epsilon
We can approximate the probability \(P(\vert\widehat{\theta}_n-\theta\vert < \epsilon)\) for each value of \(n\) by taking the proportion of TRUE values in each row of the logical matrix. Since R treats TRUE as the value 1 and FALSE as the value 0, we can find this proportion by taking the mean of each row. This yields a vector whose entries approximate the probability for each value of \(n\).
consistency <- apply(sims_means_logical, 1, mean)
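Equivalently, base R’s rowMeans function computes the same vector of row averages, usually a bit faster; either version works here.

consistency <- rowMeans(sims_means_logical)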
Finally, we can plot the empirical probabilities for each value of \(n\) between 1 and 10,000.
ggplot(data.frame(n = 1:n, consistency = consistency), aes(x = n, y = consistency)) +
  geom_hline(yintercept = 1, color = "red") +
  geom_line() +
  labs(x = "Number of Sample Points (n)",
       y = "Empirical Probability",
       title = "Empirical Consistency")
(\(\star\)) Using the values \(n = 1000\), \(N=500\), \(\epsilon = 0.2\), and \(\theta = 5\), make a consistency plot for each of the unbiased estimators, \(\frac{n+1}{n}\cdot Y_{(n)}\) and \((n+1)\cdot Y_{(1)}\) for \(Y\sim Unif(0,\theta)\). Include your plots in your write-up.
(\(\star\)) Based on your simulations, are the two estimators consistent? Explain how you came to your conclusions.