General Instructions

Each lab in this course will have multiple components. First, there will a piece like the document below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in the Friday that you did your lab.

If you are comfortable doing so, I strongly suggest using RMarkdown to type your lab write-up. However, if you are new to R, you may handwrite your write-up (I’m also happy to work with you to learn RMarkdown!). All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.

Lab Overview

In this lab, we want to simulate three different types of point estimators: maximum likelihood estimators, method of moments estimators, and percentile matching estimators. The overall goal is to compare these three types of estimators in a specific example.

Data Frames

Many things you want to do in R (particularly if you work in the tidyverse) become much easier if you format your data as a data frame, an R object with column names and rows (you can think of this as a matrix with fancy format). Below is the code to make a data frame, with 10 rows and 3 columns. The variable names that we want in the data frame are listed first followed by the vector of length 10 we want in that column.

test.df <- data.frame(first = 1:10,
                      second = rep(c(1,2), times = 5),
                      third = rnorm(10)
                      )

test.df
##    first second       third
## 1      1      1  0.41260741
## 2      2      2 -0.25148918
## 3      3      1 -0.07912936
## 4      4      2  0.50569366
## 5      5      1  0.87419505
## 6      6      2  0.31953914
## 7      7      1  1.27244019
## 8      8      2  1.22612513
## 9      9      1  0.19233012
## 10    10      2 -0.67416442

You can check if an object is a data frame (or any other object type) with the str function.

str(test.df)
## 'data.frame':    10 obs. of  3 variables:
##  $ first : int  1 2 3 4 5 6 7 8 9 10
##  $ second: num  1 2 1 2 1 2 1 2 1 2
##  $ third : num  0.4126 -0.2515 -0.0791 0.5057 0.8742 ...

To get a quick statistical snapshot of the columns of a dataframe, you can use the summary function.

summary(test.df)
##      first           second        third         
##  Min.   : 1.00   Min.   :1.0   Min.   :-0.67416  
##  1st Qu.: 3.25   1st Qu.:1.0   1st Qu.:-0.01126  
##  Median : 5.50   Median :1.5   Median : 0.36607  
##  Mean   : 5.50   Mean   :1.5   Mean   : 0.37981  
##  3rd Qu.: 7.75   3rd Qu.:2.0   3rd Qu.: 0.78207  
##  Max.   :10.00   Max.   :2.0   Max.   : 1.27244

When plotting with dataframes, it’s often useful to have a plot for each column of the dataframe all on the same graph. While you can do this by adding a new layer to your ggplot function, reshaping your dataframe can sometimes make this process much faster and more flexible. The package reshape2 allows you to reshape a dataframe using the melt function.

test.df.long <- melt(test.df)
test.df.long
##    variable       value
## 1     first  1.00000000
## 2     first  2.00000000
## 3     first  3.00000000
## 4     first  4.00000000
## 5     first  5.00000000
## 6     first  6.00000000
## 7     first  7.00000000
## 8     first  8.00000000
## 9     first  9.00000000
## 10    first 10.00000000
## 11   second  1.00000000
## 12   second  2.00000000
## 13   second  1.00000000
## 14   second  2.00000000
## 15   second  1.00000000
## 16   second  2.00000000
## 17   second  1.00000000
## 18   second  2.00000000
## 19   second  1.00000000
## 20   second  2.00000000
## 21    third  0.41260741
## 22    third -0.25148918
## 23    third -0.07912936
## 24    third  0.50569366
## 25    third  0.87419505
## 26    third  0.31953914
## 27    third  1.27244019
## 28    third  1.22612513
## 29    third  0.19233012
## 30    third -0.67416442

We can then use the long dataframe to make plots, both on the same graph and faceted by variable.

ggplot(test.df.long, aes(x = value, color = variable)) +
        geom_boxplot()

ggplot(test.df.long, aes(x = value)) +
        geom_histogram(color = "black", fill = "blue") +
        facet_wrap(~variable)

Simulating Estimators

In this part of the lab, we will work with \(Y\sim Unif(0,\theta)\), choosing \(\theta=5\), and construct simulations of the three types of estimators described above.

  1. In your R script, run the provided code to produce \(N=500\) simulations of samples of size \(n=50\) from \(Y\).

Recall that for the uniform random variable \(Y\), we have the following estimators for \(\theta\):

Note that we can find the sample quantile of a data set using the quantile function. As an example, suppose we want \(\widetilde{\phi}_{0.25}\) and \(\widetilde{\phi}_{0.75}\) for the third column of the test dataframe. Notice that these match the values displayed in the summary you saw above.

quantile(test.df$third, probs = 0.25)
##         25% 
## -0.01126449
quantile(test.df$third, probs = 0.75)
##       75% 
## 0.7820697
  1. In your R script create a dataframe named estimator.df for each estimator above. Use the apply function across the rows of samples.matrix for each value of the estimators.

  2. (\(\star\)) Print a summary of your estimator.df dataframe and include it in your write-up. Looking just at the summary statistics, what do you notice about the different estimators?

  3. (\(\star\)) In your R script, plot boxplots and histograms of the estimators, and include the plots in your writeup.

  4. (\(\star\)) What do your histograms indicate that was not obvious from the boxplots or summary statistics? How do your observations compare to your theoretical knowledge of the behavior of these estimators?

R Packages for Estimation

There is an R package, fitdistrplus, that computes estimates of parameters using a specified estimator. The code below computes a maximum likelihood estimate, a method of moments estimate, and a percentile matching estimate of the parameters (both the min and max of the support) of a uniform distribution from a single sample. Note a uniform distribution in general is of the form \(Unif(a,b)\), so it has two paramaters to match, \(a\) is the minimum of the support and \(b\) is the maximum of the support. Printing the estimate of each fitdist object gives estimates of both parameters. In our case, we’re interested in the maximum parameter as an estimator for \(\theta\).

theta <- 5
n <- 50
sam <- runif(n, 0, theta)
fit1 <- fitdist(data = sam, distr = "unif", method = "mle")
fit2 <- fitdist(data = sam, distr = "unif", method = "mme")
fit3 <- fitdist(data = sam, distr = "unif", method = "qme", probs = c(0.25, 0.75))

fit1
## Fitting of the distribution ' unif ' by maximum likelihood 
## Parameters:
##       estimate Std. Error
## min 0.03909785         NA
## max 4.95806702         NA
fit2
## Fitting of the distribution ' unif ' by matching moments 
## Parameters:
##      estimate
## 1 -0.08372304
## 2  4.92578411
fit3
## Fitting of the distribution ' unif ' by matching quantiles 
## Parameters:
##        estimate
## min -0.08661822
## max  4.47470699

Using the qqcomp function, you can see the qq plot of the fitted distribution to the data. With the denscomp function, you can visualize the histogram of the data and a red curve that is the pdf of the fitted distribution.

qqcomp(fit1)

denscomp(fit1)

In the example above, we knew the sample was from a uniform distribution, so we didn’t need to do any guesswork to figure out which distribution to fit to the data. However, this may not always be the case in practice. We’ll do this using the eruptions data in the faithful data set. The goal here is to use different estimation techniques to fit a distribution to the data and evaluate the fit.

  1. (\(\star\)) For the eruptions variable of the faithful data set, fit at least three different families of distributions to using at least two different estimators in each case. Print the values, qq-plots, and histograms in your write-up.

  2. (\(\star\)) By visual inspection, do any of the fitted densities describe the data well? Explain why or why not.