Each lab in this course will have multiple components. First, there will be a piece like the document below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in the Friday that you did your lab.
If you are comfortable doing so, I strongly suggest using RMarkdown to type your lab write-up. However, if you are new to R, you may handwrite your write-up (I’m also happy to work with you to learn RMarkdown!). All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.
In this lab, you will work with different families of distributions to visualize how the distribution of the sample mean changes as the number of samples chosen increases.
We start with a short tutorial on sampling from a known distribution. If you are comfortable with the ggplot2 package, dplyr package, and the apply family of functions in R, feel free to skip this section.
Suppose that we start with a normal random variable, \(Y\), with mean \(\mu=2\) and variance \(\sigma^2=9\). To build a sample in a replicable way, we’ll start by defining a variable that specifies the number of samples taken. This way, to see what happens for larger samples, we can easily update this value without changing the rest of our code.
samples <- 20
We can use the function rnorm, which draws a specified number of random values from a normal distribution with a given mean and standard deviation. Note that rnorm takes the standard deviation, \(\sigma=3\), rather than the variance, and that the output is a vector.
rnorm(samples, mean = 2, sd = 3)
## [1] -0.395727515 2.539204733 3.290876779 3.009302167 3.056159943
## [6] 0.077180702 3.454571209 2.676285574 3.524090989 -0.421941015
## [11] -1.831135412 -1.259634694 4.220305838 -2.206344578 6.287925559
## [16] 0.292101246 0.009867545 8.888350188 -0.330509284 8.029697085
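Because rnorm draws new values on every call, your output will differ from the values shown above. If you want a run that can be repeated exactly, one option is base R's set.seed function. The following is a minimal sketch; the seed value 42 and the names y and y_again are arbitrary choices made for illustration, not part of the lab script.

```r
# Fix the random seed so the same "random" sample is drawn on every run.
# The seed value 42 is arbitrary.
set.seed(42)
samples <- 20
y <- rnorm(samples, mean = 2, sd = 3)

# Re-seeding and drawing again reproduces the sample exactly.
set.seed(42)
y_again <- rnorm(samples, mean = 2, sd = 3)
identical(y, y_again)

# One realization of the sample mean; it will be near 2, but not equal to it.
mean(y)
```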
Just doing this process once doesn’t give us a lot of information about the behavior of the sample mean random variable, because we’ve only viewed one instance of it. Simulation in R becomes powerful when we do this sampling over and over again, recording the sample mean for each of our replicates.
In the code below, we start by defining a value, \(N\), that represents the number of times we want to do the sampling. We can then draw all of the random values we need at once: \(N\) times the number of sample points in each sample. The issue here is that the output is a single vector, so it’s difficult to see the individual samples. We fix this by reformatting the vector as a matrix with 1000 rows and 20 columns. Thus, each row represents one draw of 20 samples from the normal distribution.
Finally, we need to compute the mean of each sample (i.e., the mean of each row of the matrix). We can do this using the apply function. The apply function takes as an input a matrix (or data frame), a value of 1 (rows) or 2 (columns) that tells the function which vectors to apply over, and a function that can be computed on a vector. In our case, we apply the function mean over rows of our matrix. This results in a vector of length \(N\) where each entry is the mean of a sample.
N <- 1000
sims <- rnorm(N * samples, mean = 2, sd = 3)
sims.mat <- matrix(sims, N, samples)
means <- apply(sims.mat, 1, mean)
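As a quick sanity check on the steps above, the sketch below confirms that the matrix has one row per replicate and that apply returns one mean per row. It also compares the result with rowMeans, a base R shortcut for the same row-wise computation. The seed is an arbitrary value added only so the check is repeatable.

```r
set.seed(1)  # arbitrary seed, only so the check is repeatable
samples <- 20
N <- 1000
sims <- rnorm(N * samples, mean = 2, sd = 3)
sims.mat <- matrix(sims, N, samples)

dim(sims.mat)      # rows (replicates) and columns (points per sample)

means <- apply(sims.mat, 1, mean)
length(means)      # one sample mean per row

# rowMeans() computes the same row-wise means in a single call
all.equal(means, rowMeans(sims.mat))
```

Since every entry of the matrix is an independent draw from the same distribution, it does not matter that matrix fills the values column by column; each row is still a valid sample of size 20.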
We can then plot a histogram of these sample means using the ggplot2 package. Note that this matches our discussion in class that the sample mean \(\overline{Y}\) is also normally distributed. In particular, since each sample has \(n=20\) points, \(\overline{Y}\) in this example is normally distributed with a mean of \(2\) and a variance of \(\frac{\sigma^2}{n}=\frac{9}{20}\).
library(ggplot2)
# after_stat(density) rescales the bars so the histogram has total area 1
ggplot(data.frame(means = means), aes(x = means)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", fill = "blue") +
  xlab("Sample Means") +
  ylab("Density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
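The histogram can also be checked numerically. For a sample of \(n\) independent draws, \(E[\overline{Y}]=\mu\) and \(Var(\overline{Y})=\frac{\sigma^2}{n}\), so with \(n=20\) we expect a mean of \(2\) and a variance of \(\frac{9}{20}=0.45\). The sketch below compares the simulated sample means with these values; the seed and the names theory.mean and theory.var are introduced here for illustration only.

```r
set.seed(2)  # arbitrary seed
samples <- 20
N <- 1000
sims.mat <- matrix(rnorm(N * samples, mean = 2, sd = 3), N, samples)
means <- apply(sims.mat, 1, mean)

theory.mean <- 2
theory.var <- 3^2 / samples   # sigma^2 / n = 9 / 20

# Simulated values should be close to, but not exactly equal to, theory
c(simulated = mean(means), theory = theory.mean)
c(simulated = var(means), theory = theory.var)
```

The agreement should tighten as \(N\) grows, since a larger number of replicates gives a better estimate of the distribution of \(\overline{Y}\).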
Suppose that \(X\sim Exp(0.2)\). As we’ve discussed, we’d like to know how \(\overline{X}\) is distributed. Since \(X\) in this example is not normally distributed, we can’t say if \(\overline{X}\) is normally distributed or not. In this part of the lab, you will look at histograms for simulated values of \(\overline{X}\) and make observations about general behavior.
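One practical detail when simulating: R's rexp function is parameterized by the rate, so the mean of the draws is 1/rate. The short sketch below assumes that the \(0.2\) in \(Exp(0.2)\) is a rate, which gives a mean of \(5\). That is an assumption; if your notes define the exponential distribution by its mean instead, you would use rate = 1/0.2.

```r
set.seed(3)  # arbitrary seed
# Assumption: the 0.2 in Exp(0.2) is a rate, so the mean is 1 / 0.2 = 5.
x <- rexp(10000, rate = 0.2)
mean(x)   # should be close to 1 / rate
```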
Now consider a random variable \(W\sim Pois(3.4)\).