Each lab in this course will have multiple components. First, there will a piece like the document below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in the Friday that you did your lab.
If you are comfortable doing so, I strongly suggest using RMarkdown to type your lab write-up. However, if you are new to R, you may handwrite your write-up (Iām also happy to work with you to learn RMarkdown!). All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.
In this lab, we will use simulated samples from distributions to visualize the mean square error of different estimators for the population variance.
One of the huge benefits of R (and other statistical programs) is that it allows us to very quickly generate large samples from distributions. This can help simulate the sampling process from a population in reality and let us gain an understanding for how statistical objects behave with regards to the underlying distribution.
Suppose we want \(N\) simulations of samples of size \(n\) from a given distribution,say \(N(1,4)\). The analogy is that there are \(N\) people each conducting the same statistical study. Each person gathers a sample of \(n\) observations and runs the necessary statistical test. This is a fairly straightforward process in R, done by picking \(N\cdot n\) random samples in total, then formatting them into a matrix with \(N\) rows and \(n\) columns. The code below does this, and then prints the result. Note that here there are 10 simulations run, each sampling 5 points from the normal distribution.
set.seed(1860)
N <- 10
n <- 5
mu <- 1
stdv <- 2
samples <- rnorm(N * n, mean = mu, sd = stdv)
samples.matrix <- matrix(samples, nrow = N, ncol = n)
samples.matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.2156817 0.3719849 -0.4145262 -0.1042893 0.25485013
## [2,] -1.4097470 -1.5214981 -1.5109621 0.8704472 1.65903068
## [3,] 0.8814017 1.1901621 1.1054369 2.2076951 2.84100598
## [4,] 3.2711474 0.4730338 3.9060770 2.5369348 -0.66644656
## [5,] -0.6468976 -1.6795051 -1.8953270 -0.8574463 -2.13756238
## [6,] -0.8639747 1.9398331 -0.2203557 2.5999117 6.63406583
## [7,] -2.7448941 2.8615876 -2.4633129 2.0296678 1.68040691
## [8,] 2.1784200 2.4743523 0.6553033 -1.0104024 3.47829442
## [9,] 1.0697078 1.9147968 3.8710094 3.8324485 0.09658462
## [10,] 1.3153301 0.1766505 0.3272036 1.5033818 -1.65435878
If we want to compute a sample statistic, like the sample mean, for each sample, we will typically use the apply function. Check the documentation for the apply function to determine the role of each argument in the code below. Note that the output here is a vector of length 10, where each entry is the mean of one row of the matrix above.
sample.means <- apply(samples.matrix, 1, mean)
sample.means
## [1] 0.06474025 -0.38254589 1.64514036 1.90414928 -1.44334767 2.01789604
## [7] 0.27269104 1.55519352 2.15690944 0.33364145
The two sample statistics we will investigate in this lab are both estimators for the population variance, \(\sigma^2\).
\[ S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i-\overline{Y})^2 \quad\textrm{and}\quad V = \frac{1}{n}\sum_{i=1}^{n}(Y_i-\overline{Y})^2\] We know from class that \(S^2\) is an unbiased estimator of \(\sigma^2\), while \(V\) is a biased estimator.
In your R code and your write-up to the questions below, you will work with the exponential random variable \(Y\sim Exp(\theta)\) where \(\theta=5\). Note that the variance of an exponetial random variable is \(\sigma^2=\theta^2\).
(\(\star\)) Compute estimates of \(MSE(S^2)\) and \(MSE(V)\) based on your simulated samples and print the values in your write-up. Do your values match your understanding of the theoretical behavior of MSE for biased and unbiased estimators? Why or why not?
(\(\star\)) Compute estimates of \(B(S^2)\), \(Var(S^2)\), \(B(V)\), and \(Var(V)\) based on your simulated samples and print the values in your write-up. Do your values match your theoretical understanding of bias and variance for these estimators?
(\(\star\)) Plot histograms of your estimate vectors and include a vertical line at the true value of \(\theta\). Explain how your histograms relate to your answers in questions 1 and 2 above.