Each lab in this course will have multiple components. First, there will be a piece like the document below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in on the Friday that you do your lab.
Your lab write-up should be typed in RMarkdown. All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.
In this lab, we want to visualize how estimators behave as the sample size increases, particularly in relation to bias and consistency.
Suppose, for example, we want to understand how the sample mean changes with sample size. We want a vector of values of the sample mean from \(n=1\) to some large value of \(n\). We can accomplish this through the use of the cumulative family of functions in R. Consider the example below.
data <- 6:10
data
## [1] 6 7 8 9 10
The first thing we want is a vector whose entries are the cumulative sums of the elements in our data vector. We can do this with the cumsum function.
cumsum(data)
## [1] 6 13 21 30 40
Notice that the first entry in this vector is 6, the second is \(6+7\), the third is \(6+7+8\), and so on. To get the mean, we need to divide the entries by 1, 2, 3, etc., respectively. We can do this by dividing the vector by a vector of sample sizes.
cumsum(data) / 1:length(data)
## [1] 6.0 6.5 7.0 7.5 8.0
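As a side note, the dplyr package also provides a cummean function that computes these running means directly; assuming dplyr is installed, the one-liner below reproduces the result above.

dplyr::cummean(data)

## [1] 6.0 6.5 7.0 7.5 8.0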
Let’s extend this idea to a much larger example. We will consider \(Y\sim Exp(\theta)\), parameterized so that \(\theta\) is the mean (which is why the code below uses rate = 1 / theta), and use the sample mean as our estimator of \(\theta\).
set.seed(1860)
n <- 10000
theta <- 5
sample <- rexp(n, rate = 1 / theta)
sample.means <- cumsum(sample) / 1:n
Once we have the vector of sample means from \(n=1\) through \(n=10000\), we can compute the empirical bias of each (the difference between the sample mean and the true value \(\theta\)) and then plot the results.
bias <- sample.means - theta
library(ggplot2)

ggplot(data.frame(n = 1:n, bias = bias), aes(x = n, y = bias)) +
  geom_hline(yintercept = 0, color = "red") +
  geom_line() +
  labs(x = "Number of Sample Points (n)",
       y = "Empirical Bias")
You will now run the same type of bias analysis with different estimators. Consider a random variable \(Y\sim Unif(0,\theta)\) and the estimators \(\frac{n+1}{n}\cdot Y_{(n)}\) and \((n+1)\cdot Y_{(1)}\) of \(\theta\).
(\(\star\)) Show that \(\frac{n+1}{n}\cdot Y_{(n)}\) and \((n+1)\cdot Y_{(1)}\) are both unbiased.
(\(\star\)) Using the values \(n = 300\) and \(\theta = 5\), make a bias plot for each of the unbiased estimators. Include your plots in your write-up. [Note: the relevant cumulative functions to use here are cummax and cummin; see the short illustration after these problems.]
(\(\star\)) Explain how the plots illustrate that these estimators are unbiased, supporting your theoretical justification from problem 1.
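If you have not used cummax and cummin before, here is a quick illustration of how they behave on a small vector (this demonstrates only the functions themselves, not the full solution to the problems above).

x <- c(2, 5, 3, 7, 4)
cummax(x)

## [1] 2 5 5 7 7

cummin(x)

## [1] 2 2 2 2 2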
Another important property of an estimator is consistency. Here we will use the notation \(\widehat{\theta}_n\) for an estimator, with the subscript serving as a reminder that the value of an estimator (typically) depends on the sample size \(n\).
Definition. An estimator \(\widehat{\theta}_n\) is a consistent estimator of \(\theta\) if, for any positive number \(\epsilon\), \[\lim_{n\to \infty}P(\vert\widehat{\theta}_n-\theta\vert\le \epsilon)=1.\] Let’s break down exactly what this expression is telling us. The event \(\vert\widehat{\theta}_n-\theta\vert\le \epsilon\) occurs when the estimate lands within \(\epsilon\) of the true parameter value, so the definition says that, no matter how small a tolerance \(\epsilon\) we choose, the probability of landing within that tolerance approaches 1 as the sample size grows.
Overall, we can think of a consistent estimator as an estimator that gets better and better (closer and closer to the true value of the population parameter) as the number of sample points increases.
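For the sample mean, this behavior is guaranteed in general by the weak law of large numbers: if \(Y_1, Y_2, \ldots\) are independent and identically distributed with finite mean \(\mu\), then for any \(\epsilon > 0\), \[\lim_{n\to\infty}P(\vert \bar{Y}_n-\mu\vert\le \epsilon)=1,\] which is exactly the definition above with \(\widehat{\theta}_n = \bar{Y}_n\) and \(\theta = \mu\). The simulation below lets us see this convergence empirically.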
We would also like to investigate the consistency of estimators through simulations. The example below contains the code that plots the empirical probability of \(\vert\widehat{\theta}_n-\theta\vert < \epsilon\) for many values of \(n\). We set \(\epsilon = 0.2\) and simulate \(N = 500\) samples of size \(n = 10000\) (i.e., a data frame with one column for each simulation run and one row for each sample observation) from the same exponential distribution as before, with the sample mean as our estimator of \(\theta\).
N <- 500
n <- 10000
epsilon <- 0.2
sims <- data.frame(replicate(N, rexp(n, rate = 1 / theta)))
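As a quick sanity check, the dimensions of sims should match this description: one row per sample point and one column per simulation run.

dim(sims)

## [1] 10000   500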
To find the cumulative sample means of each column, we use the summarize function from the dplyr package, applying our cumulative-mean computation across the columns of the data frame. This results in a new data frame where each column corresponds to a simulation run and each row corresponds to a sample mean for some number of samples (i.e., the sixth row is the sample mean taken over the first six sample points).
library(dplyr)

sims_means <- summarize(sims, across(everything(), function(x){cumsum(x) / 1:n}))
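One caveat: recent versions of dplyr (1.1.0 and later) warn that returning more than one row from summarize is deprecated and recommend the reframe verb instead. If your dplyr version is new enough, the following equivalent sketch avoids the warning.

sims_means <- reframe(sims, across(everything(), function(x){cumsum(x) / 1:n}))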
We can then determine whether each entry satisfies the inequality \(\vert\widehat{\theta}_n-\theta\vert < \epsilon\). Comparing the data frame to \(\epsilon\) returns a matrix of logical values (i.e., TRUE if the inequality is satisfied, and FALSE if it is not).
sims_means_logical <- abs(sims_means - theta) < epsilon
We can approximate the probability \(P(\vert\widehat{\theta}_n-\theta\vert < \epsilon)\) for each value of \(n\) by taking the proportion of TRUE values in each row of the logical matrix. Since R treats TRUE as the value 1 and FALSE as the value 0, we can find this proportion by taking the mean of each row. This yields a vector whose entries approximate the probability for each value of \(n\).
consistency <- apply(sims_means_logical, 1, mean)
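Equivalently, base R’s rowMeans function computes the same vector of row averages, usually a bit faster; either version works here.

consistency <- rowMeans(sims_means_logical)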
Finally, we can plot the empirical probabilities for each value of \(n\) between 1 and 10,000.
ggplot(data.frame(n = 1:n, consistency = consistency), aes(x = n, y = consistency)) +
  geom_hline(yintercept = 1, color = "red") +
  geom_line() +
  labs(x = "Number of Sample Points (n)",
       y = "Empirical Probability",
       title = "Empirical Consistency")
(\(\star\)) Using the values \(n = 1000\), \(N=500\), \(\epsilon = 0.2\), and \(\theta = 5\), make a consistency plot for each of the unbiased estimators, \(\frac{n+1}{n}\cdot Y_{(n)}\) and \((n+1)\cdot Y_{(1)}\) for \(Y\sim Unif(0,\theta)\). Include your plots in your write-up.
(\(\star\)) Based on your simulations, are the two estimators consistent? Explain how you came to your conclusions.