Each lab in this course will have multiple components. First, there will be a document like the one below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in on the Friday that you do your lab.
Your lab write-up should be typed in RMarkdown. All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.
In this lab, we want to simulate three different types of point estimators: maximum likelihood estimators, method of moments estimators, and percentile matching estimators. The overall goal is to compare these three types of estimators in a specific example.
Many things you want to do in R (particularly if you work in the tidyverse) become much easier if you format your data as a data frame, an R object with column names and rows. Below is the code to make a data frame, with 10 rows and 3 columns. The variable names that we want in the data frame are listed first followed by the vector of length 10 we want in that column.
test.df <- data.frame(first = 1:10,
                      second = rep(c(1, 2), times = 5),
                      third = rnorm(10))
test.df
## first second third
## 1 1 1 0.5754486
## 2 2 2 -0.1816651
## 3 3 1 0.6637186
## 4 4 2 -0.3794772
## 5 5 1 -0.8880927
## 6 6 2 0.6542612
## 7 7 1 -0.4970470
## 8 8 2 1.3184297
## 9 9 1 -0.6283775
## 10 10 2 0.1268895
You can check the structure of an object, including whether it is a dataframe (or any other object type), with the str function.
str(test.df)
## 'data.frame': 10 obs. of 3 variables:
## $ first : int 1 2 3 4 5 6 7 8 9 10
## $ second: num 1 2 1 2 1 2 1 2 1 2
## $ third : num 0.575 -0.182 0.664 -0.379 -0.888 ...
To get a quick statistical snapshot of the columns of a dataframe,
you can use the summary function.
summary(test.df)
## first second third
## Min. : 1.00 Min. :1.0 Min. :-0.88809
## 1st Qu.: 3.25 1st Qu.:1.0 1st Qu.:-0.46765
## Median : 5.50 Median :1.5 Median :-0.02739
## Mean : 5.50 Mean :1.5 Mean : 0.07641
## 3rd Qu.: 7.75 3rd Qu.:2.0 3rd Qu.: 0.63456
## Max. :10.00 Max. :2.0 Max. : 1.31843
When plotting with dataframes, it’s often useful to have a plot for
each column of the dataframe all on the same graph. While you can do
this by adding a new layer to your ggplot function, reshaping your
dataframe can sometimes make this process much faster and more flexible.
The package reshape2 allows you to reshape a dataframe
using the melt function.
library(reshape2)
test.df.long <- melt(test.df)
test.df.long
## variable value
## 1 first 1.0000000
## 2 first 2.0000000
## 3 first 3.0000000
## 4 first 4.0000000
## 5 first 5.0000000
## 6 first 6.0000000
## 7 first 7.0000000
## 8 first 8.0000000
## 9 first 9.0000000
## 10 first 10.0000000
## 11 second 1.0000000
## 12 second 2.0000000
## 13 second 1.0000000
## 14 second 2.0000000
## 15 second 1.0000000
## 16 second 2.0000000
## 17 second 1.0000000
## 18 second 2.0000000
## 19 second 1.0000000
## 20 second 2.0000000
## 21 third 0.5754486
## 22 third -0.1816651
## 23 third 0.6637186
## 24 third -0.3794772
## 25 third -0.8880927
## 26 third 0.6542612
## 27 third -0.4970470
## 28 third 1.3184297
## 29 third -0.6283775
## 30 third 0.1268895
We can then use the long dataframe to make plots, both on the same graph and faceted by variable.
ggplot(test.df.long, aes(x = value, y = variable)) +
  geom_boxplot()
ggplot(test.df.long, aes(x = value)) +
  geom_histogram(color = "black", fill = "blue") +
  facet_wrap(~variable)
In this part of the lab, we will work with \(Y\sim Unif(0,\theta)\), choosing \(\theta=5\), and construct simulations of the three types of estimators described above.
Recall that for the uniform random variable \(Y\), we have the following estimators for \(\theta\):
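As a sketch (check your course notes for the exact forms, in particular which percentile \(p\) your class uses for percentile matching), the standard estimators are
\[
\widehat{\theta}_{MLE} = Y_{(n)} = \max_{1\le i\le n} Y_i, \qquad \widehat{\theta}_{MOM} = 2\overline{Y}, \qquad \widehat{\theta}_{PM} = \frac{\widetilde{\phi}_{p}}{p},
\]
where the percentile matching estimator follows because the \(p\)th quantile of \(Unif(0,\theta)\) is \(p\theta\); for example, matching the 75th percentile gives \(\widehat{\theta}_{PM} = \widetilde{\phi}_{0.75}/0.75\).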
Note that we can find the sample quantile of a data set using the quantile function. As an example, suppose we want \(\widetilde{\phi}_{0.25}\) and \(\widetilde{\phi}_{0.75}\) for the third column of the test dataframe. Notice that these match the values displayed in the summary you saw above.
quantile(test.df$third, probs = 0.25)
## 25%
## -0.4676546
quantile(test.df$third, probs = 0.75)
## 75%
## 0.6345581
In your R script, create a dataframe named estimator_df containing a column for each estimator above. For each estimator, use the apply function across the rows of sims, writing an appropriate function that computes the value of that estimator from a single simulated sample.
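As a sketch (assuming, as set up earlier in the lab, that sims is a matrix with one simulated sample per row, and using \(p = 0.75\) for percentile matching, which you should adjust to your own derivation), the construction might look like this:
mle_est <- apply(sims, 1, max)                       # MLE: sample maximum of each row
mom_est <- apply(sims, 1, function(y) 2 * mean(y))   # MOM: twice the sample mean
pm_est <- apply(sims, 1, function(y) quantile(y, probs = 0.75) / 0.75)  # percentile matching (assumed p = 0.75)
estimator_df <- data.frame(mle = mle_est, mom = mom_est, pm = pm_est)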
(\(\star\)) Print a summary of
your estimator_df dataframe and include it in your
write-up. Looking just at the summary statistics, what do you notice
about the different estimators?
(\(\star\)) In your R script, plot boxplots and histograms of the estimators, along with any other types of plots you think may present your simulations well for analysis and discussion (other options include violin plots, density plots, raincloud plots using the ggdist package, etc.), and include your chosen plots in your write-up. Also include a discussion of why you chose the visualizations you used. One possible approach to the plotting is sketched below.
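As a sketch (assuming estimator_df has a column for each estimator as above), you can melt estimator_df to long format and reuse the plotting code from the tutorial:
estimator_long <- melt(estimator_df)   # one row per (estimator, simulated value) pair
ggplot(estimator_long, aes(x = value, y = variable)) +
  geom_boxplot()
ggplot(estimator_long, aes(x = value)) +
  geom_histogram(color = "black", fill = "blue") +
  facet_wrap(~variable)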
(\(\star\)) What do your plots indicate that was not obvious from the summary statistics? How do your observations compare to your theoretical knowledge of the behavior of these estimators?
There is an R package, fitdistrplus, that computes estimates of parameters using a specified estimation method. The code below computes a maximum likelihood estimate, a method of moments estimate, and a percentile matching estimate of the parameters (both the min and max of the support) of a uniform distribution from a single sample. Note that a uniform distribution in general is of the form \(Unif(a,b)\), so it has two parameters to match: \(a\), the minimum of the support, and \(b\), the maximum of the support. Printing each fitdist object gives estimates of both parameters. In our case, we’re interested in the maximum parameter as an estimator for \(\theta\).
library(fitdistrplus)
theta <- 5
n <- 50
sam <- runif(n, 0, theta)
fit1 <- fitdist(data = sam, distr = "unif", method = "mle")
fit2 <- fitdist(data = sam, distr = "unif", method = "mme")
fit3 <- fitdist(data = sam, distr = "unif", method = "qme", probs = c(0.25, 0.75))
fit1
## Fitting of the distribution ' unif ' by maximum likelihood
## Parameters:
## estimate Std. Error
## min 0.02429071 NA
## max 4.81568139 NA
fit2
## Fitting of the distribution ' unif ' by matching moments
## Parameters:
## estimate
## 1 0.159447
## 2 5.313122
fit3
## Fitting of the distribution ' unif ' by matching quantiles
## Parameters:
## estimate
## min -0.09625472
## max 5.64766748
Using the qqcomp function, you can see a qq plot comparing the fitted distribution to the data. With the denscomp function, you can visualize the histogram of the data along with a red curve showing the pdf of the fitted distribution.
qqcomp(fit1)
denscomp(fit1)
In the example above, we knew the sample was from a uniform distribution, so we didn’t need to do any guesswork to figure out which distribution to fit to the data. However, this may not always be the case in practice. We’ll explore this using a data set of federal subsidized student loan disbursements in the United States during the 2020-2021 academic year. This data is sourced from Kaggle, and you can view the documentation (and download the raw data set) there. The goal here is to use different estimation techniques to fit a distribution to the data and evaluate the fit.
(\(\star\)) For the Mean.Disbursements variable, view its histogram, then fit at least three different families of distributions to it, using at least two different estimation methods in each case. Based on the shape of the histogram, pick at least two plausible families of distributions and at least one implausible one. Include the estimated parameter values, qq-plots, and histograms in your write-up. A sketch of one possible starting point is given below.
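As a starting point, here is a sketch of one possible fit; the file name loans.csv is a placeholder for however you saved the Kaggle download, and the choice of the gamma family is only an illustration, not a claim that it fits well:
loans <- read.csv("loans.csv")          # placeholder file name for the downloaded Kaggle data
disb <- loans$Mean.Disbursements
disb <- disb[!is.na(disb) & disb > 0]   # drop missing values; gamma fits require positive data
ggplot(data.frame(disb = disb), aes(x = disb)) +
  geom_histogram(color = "black", fill = "blue")
fit_gamma_mle <- fitdist(data = disb, distr = "gamma", method = "mle")  # mle fits can be sensitive to scale; rescaling the data may help convergence
fit_gamma_mme <- fitdist(data = disb, distr = "gamma", method = "mme")
fit_gamma_mle
fit_gamma_mme
qqcomp(fit_gamma_mle)
denscomp(fit_gamma_mle)
You would repeat this pattern for your other chosen families (for example, "lnorm" in place of "gamma") and for your implausible family.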
(\(\star\)) By visual inspection, do any of the fitted densities describe the data well? Explain why or why not.