Each lab in this course will have multiple components. First, there will be a document like the one below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in on the Friday that you do your lab.
Your lab write-up should be typed in RMarkdown. All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.
In this lab, we want to simulate three different types of point estimators: maximum likelihood estimators, method of moments estimators, and percentile matching estimators. The overall goal is to compare these three types of estimators in a specific example.
Many things you want to do in R (particularly if you work in the tidyverse) become much easier if you format your data as a data frame, an R object with column names and rows. Below is the code to make a data frame, with 10 rows and 3 columns. The variable names that we want in the data frame are listed first followed by the vector of length 10 we want in that column.
test.df <- data.frame(first = 1:10,
                      second = rep(c(1, 2), times = 5),
                      third = rnorm(10))
test.df
## first second third
## 1 1 1 0.5754486
## 2 2 2 -0.1816651
## 3 3 1 0.6637186
## 4 4 2 -0.3794772
## 5 5 1 -0.8880927
## 6 6 2 0.6542612
## 7 7 1 -0.4970470
## 8 8 2 1.3184297
## 9 9 1 -0.6283775
## 10 10 2 0.1268895
You can check the structure of an object, including whether it is a dataframe (or any other object type), with the str function.
str(test.df)
## 'data.frame': 10 obs. of 3 variables:
## $ first : int 1 2 3 4 5 6 7 8 9 10
## $ second: num 1 2 1 2 1 2 1 2 1 2
## $ third : num 0.575 -0.182 0.664 -0.379 -0.888 ...
To get a quick statistical snapshot of the columns of a dataframe,
you can use the summary function.
summary(test.df)
## first second third
## Min. : 1.00 Min. :1.0 Min. :-0.88809
## 1st Qu.: 3.25 1st Qu.:1.0 1st Qu.:-0.46765
## Median : 5.50 Median :1.5 Median :-0.02739
## Mean : 5.50 Mean :1.5 Mean : 0.07641
## 3rd Qu.: 7.75 3rd Qu.:2.0 3rd Qu.: 0.63456
## Max. :10.00 Max. :2.0 Max. : 1.31843
When plotting with dataframes, it’s often useful to have a plot for
each column of the dataframe all on the same graph. While you can do
this by adding a new layer to your ggplot function, reshaping your
dataframe can sometimes make this process much faster and more flexible.
The package reshape2 allows you to reshape a dataframe
using the melt function.
library(reshape2)
test.df.long <- melt(test.df)
test.df.long
## variable value
## 1 first 1.0000000
## 2 first 2.0000000
## 3 first 3.0000000
## 4 first 4.0000000
## 5 first 5.0000000
## 6 first 6.0000000
## 7 first 7.0000000
## 8 first 8.0000000
## 9 first 9.0000000
## 10 first 10.0000000
## 11 second 1.0000000
## 12 second 2.0000000
## 13 second 1.0000000
## 14 second 2.0000000
## 15 second 1.0000000
## 16 second 2.0000000
## 17 second 1.0000000
## 18 second 2.0000000
## 19 second 1.0000000
## 20 second 2.0000000
## 21 third 0.5754486
## 22 third -0.1816651
## 23 third 0.6637186
## 24 third -0.3794772
## 25 third -0.8880927
## 26 third 0.6542612
## 27 third -0.4970470
## 28 third 1.3184297
## 29 third -0.6283775
## 30 third 0.1268895
We can then use the long dataframe to make plots, both on the same graph and faceted by variable.
ggplot(test.df.long, aes(x = value, y = variable)) +
  geom_boxplot()
ggplot(test.df.long, aes(x = value)) +
  geom_histogram(color = "black", fill = "blue") +
  facet_wrap(~variable)
In this part of the lab, we will work with \(Y\sim Unif(0,\theta)\), choosing \(\theta=5\), and construct simulations of the three types of estimators described above.
Recall that for the uniform random variable \(Y\), we have the following estimators for \(\theta\):
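As a sketch (check your course notes for the exact forms, in particular which percentile \(p\) your class uses for percentile matching), the standard estimators are
\[
\widehat{\theta}_{MLE} = Y_{(n)} = \max_{1\le i\le n} Y_i, \qquad \widehat{\theta}_{MOM} = 2\overline{Y}, \qquad \widehat{\theta}_{PM} = \frac{\widetilde{\phi}_{p}}{p},
\]
where the percentile matching estimator follows because the \(p\)th quantile of \(Unif(0,\theta)\) is \(p\theta\); for example, matching the 75th percentile gives \(\widehat{\theta}_{PM} = \widetilde{\phi}_{0.75}/0.75\).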
Note that we can find the sample quantile of a data set using the quantile function. As an example, suppose we want \(\widetilde{\phi}_{0.25}\) and \(\widetilde{\phi}_{0.75}\) for the third column of the test dataframe. Notice that these match the values displayed in the summary you saw above.
quantile(test.df$third, probs = 0.25)
## 25%
## -0.4676546
quantile(test.df$third, probs = 0.75)
## 75%
## 0.6345581
In your R script, create a dataframe named estimator_df containing a column for each estimator above. For each estimator, use the apply function across the rows of sims, writing an appropriate function that computes the value of that estimator from a single simulated sample.
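As a sketch (assuming, as set up earlier in the lab, that sims is a matrix with one simulated sample per row, and using \(p = 0.75\) for percentile matching, which you should adjust to your own derivation), the construction might look like this:
mle_est <- apply(sims, 1, max)                       # MLE: sample maximum of each row
mom_est <- apply(sims, 1, function(y) 2 * mean(y))   # MOM: twice the sample mean
pm_est <- apply(sims, 1, function(y) quantile(y, probs = 0.75) / 0.75)  # percentile matching (assumed p = 0.75)
estimator_df <- data.frame(mle = mle_est, mom = mom_est, pm = pm_est)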
(\(\star\)) Print a summary of
your estimator_df dataframe and include it in your
write-up. Looking just at the summary statistics, what do you notice
about the different estimators?
(\(\star\)) In your R script, plot boxplots and histograms of the estimators, along with any other types of plots you think may present your simulations well for analysis and discussion (other options include violin plots, density plots, raincloud plots using the ggdist package, etc.), and include your chosen plots in your write-up. Also include a discussion of why you chose the visualizations you used. One possible approach to the plotting is sketched below.
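As a sketch (assuming estimator_df has a column for each estimator as above), you can melt estimator_df to long format and reuse the plotting code from the tutorial:
estimator_long <- melt(estimator_df)   # one row per (estimator, simulated value) pair
ggplot(estimator_long, aes(x = value, y = variable)) +
  geom_boxplot()
ggplot(estimator_long, aes(x = value)) +
  geom_histogram(color = "black", fill = "blue") +
  facet_wrap(~variable)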
(\(\star\)) What do your plots indicate that was not obvious from the summary statistics? How do your observations compare to your theoretical knowledge of the behavior of these estimators?
There is an R package, fitdistrplus, that computes estimates of parameters using a specified estimation method. The code below computes a maximum likelihood estimate, a method of moments estimate, and a percentile matching estimate of the parameters (both the min and max of the support) of a uniform distribution from a single sample. Note that a uniform distribution in general is of the form \(Unif(a,b)\), so it has two parameters to match: \(a\), the minimum of the support, and \(b\), the maximum of the support. Printing each fitdist object gives estimates of both parameters. In our case, we’re interested in the maximum parameter as an estimator for \(\theta\).
library(fitdistrplus)
theta <- 5
n <- 50
sam <- runif(n, 0, theta)
fit1 <- fitdist(data = sam, distr = "unif", method = "mle")
fit2 <- fitdist(data = sam, distr = "unif", method = "mme")
fit3 <- fitdist(data = sam, distr = "unif", method = "qme", probs = c(0.25, 0.75))
fit1
## Fitting of the distribution ' unif ' by maximum likelihood
## Parameters:
## estimate Std. Error
## min 0.02429071 NA
## max 4.81568139 NA
fit2
## Fitting of the distribution ' unif ' by matching moments
## Parameters:
## estimate
## 1 0.159447
## 2 5.313122
fit3
## Fitting of the distribution ' unif ' by matching quantiles
## Parameters:
## estimate
## min -0.09625472
## max 5.64766748
Using the qqcomp function, you can see a qq plot comparing the fitted distribution to the data. With the denscomp function, you can visualize the histogram of the data along with a red curve showing the pdf of the fitted distribution.
qqcomp(fit1)
denscomp(fit1)
In the example above, we knew the sample was from a uniform distribution, so we didn’t need to do any guesswork to figure out which distribution to fit to the data. However, this may not always be the case in practice. We’ll explore this using a data set of federal subsidized student loan disbursements in the United States during the 2020-2021 academic year. This data is sourced from Kaggle, and you can view the documentation (and download the raw data set) there. The goal here is to use different estimation techniques to fit a distribution to the data and evaluate the fit.
(\(\star\)) For the Mean.Disbursements variable, view its histogram, then fit at least three different families of distributions to it, using at least two different estimation methods in each case. Based on the shape of the histogram, pick at least two plausible families of distributions and at least one implausible one. Include the estimated parameter values, qq-plots, and histograms in your write-up. A sketch of one possible starting point is given below.
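As a starting point, here is a sketch of one possible fit; the file name loans.csv is a placeholder for however you saved the Kaggle download, and the choice of the gamma family is only an illustration, not a claim that it fits well:
loans <- read.csv("loans.csv")          # placeholder file name for the downloaded Kaggle data
disb <- loans$Mean.Disbursements
disb <- disb[!is.na(disb) & disb > 0]   # drop missing values; gamma fits require positive data
ggplot(data.frame(disb = disb), aes(x = disb)) +
  geom_histogram(color = "black", fill = "blue")
fit_gamma_mle <- fitdist(data = disb, distr = "gamma", method = "mle")  # mle fits can be sensitive to scale; rescaling the data may help convergence
fit_gamma_mme <- fitdist(data = disb, distr = "gamma", method = "mme")
fit_gamma_mle
fit_gamma_mme
qqcomp(fit_gamma_mle)
denscomp(fit_gamma_mle)
You would repeat this pattern for your other chosen families (for example, "lnorm" in place of "gamma") and for your implausible family.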
(\(\star\)) By visual inspection, do any of the fitted densities describe the data well? Explain why or why not.