Each lab in this course will have multiple components. First, there will a piece like the document below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in the Friday that you did your lab.
You will use RMarkdown to type your lab write-up. If you’re new to R and/or LaTeX, I’m also happy to work with you to learn RMarkdown! All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.
In this lab, you will learn how to plot continuous distributions using ggplot2, particularly in the context of gamma distributions.
As part of this lab, you will also see an application of gamma distributions to the study of prostate cancer epidemiology.
We need a little tutorial in plotting the distribution functions of continuous random variables using the R package ggplot2. If you are comfortable with the ggplot2 package in R, feel free to skip this section.
Suppose that we want to plot a normal distribution with mean \(\mu=2\) and variance \(\sigma^2=16\). The main idea behind the ggplot2 syntax is that you tell R what data you want it to plot, then add layers for aesthetic settings and graphical elements. The code below does just this. The first line tells R the data, which we define as a data frame with the \(x\) values we want to plot, and then specifies this variable to plot in aes (aesthetics). The next layer, telling R that you want to plot a normal distribution, is added with a \(+\) symbol. This stat_function layer uses the function (fun) dnorm, which is the R syntax for the pdf of a normal distribution. The args term in stat_function tells R what arguments need to be plugged into the dnorm function (see the documentation of this function by running ?dnorm which will show you all of the arguments of the function).
ggplot(data = data.frame(x = c(-10, 12)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 2, sd = sqrt(16)))
We can add other distributions to this plot as well, simply by adding new layers. In this case, suppose we want to add a plot of the standard normal distribution. To make these distinguishable, we’ll use different colors for each curve.
ggplot(data = data.frame(x = c(-10, 12)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 2, sd = sqrt(16)), color = "red") +
stat_function(fun = dnorm, args = list(mean = 0, sd = sqrt(1)), color = "blue")
The difficulty now is that we can’t tell from the picture alone what the distributions are. We can remedy this by adding a legend. This is done by changing color to an aesthetic in each stat_function call and specifying the label for each color. We also add a title and a more descriptive axis label for greater clarity.
ggplot(data = data.frame(x = c(-10, 12)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 2, sd = sqrt(16)), aes(color = "mean 2, var 16")) +
stat_function(fun = dnorm, args = list(mean = 0, sd = sqrt(1)), aes(color = "mean 0, var 1")) +
ggtitle("Normal Distributions") +
ylab("density")
We know that gamma distributions have two parameters, which we called \(\alpha\) and \(\beta\), the shape and scale parameters, respectively. As a reminder, different sources use different variables for these parameters (although the names shape and scale are consistent)! In particular, you might also see gamma distributions written in terms of the shape and rate, \(\lambda\), parameters. The relationship between the rate and the scale is \(\beta=\frac{1}{\lambda}\).
The Wikipedia page on gamma distributions has a handy plot of distributions for different values of the shape and scale parameters. However, Wikipedia (annoyingly) uses different notation than we do, choosing \(k\) for the shape parameter (\(\alpha\)) and \(\theta\) for the scale (\(\beta\)).
We will use a data set obtained from a paper by Aleksey V. Belikov, The number of key carcinogenic events can be predicted from cancer incidence. This data set, which is further cleaned from the source, contains the following three variables:
that were obtained from a large Centers for Disease Control survey in the United States over the years 1999-2012. The data set is displayed below.
| age | freq | prop |
|---|---|---|
| 17 | 0.001 | 0.0000002 |
| 22 | 0.001 | 0.0000002 |
| 27 | 0.001 | 0.0000002 |
| 32 | 0.100 | 0.0000233 |
| 37 | 1.000 | 0.0002325 |
| 42 | 9.600 | 0.0022325 |
| 47 | 42.200 | 0.0098135 |
| 52 | 144.100 | 0.0335100 |
| 57 | 316.500 | 0.0736012 |
| 62 | 541.800 | 0.1259941 |
| 67 | 821.200 | 0.1909677 |
| 72 | 906.000 | 0.2106877 |
| 77 | 845.900 | 0.1967116 |
| 82 | 671.800 | 0.1562252 |
(\(\star\)) One way inferring parameters from data is based on matching the mean and variance of the data to the known formulas for mean and variance of the distribution. In the case of the gamma distribution, solve for \(\alpha\) and \(\beta\) in terms of \(\mu\) and \(\sigma^2\). This is a theory-based exercise, not a programming one, but your work should be shown and interpreted in your write-up.
(\(\star\)) In your write-up, use R to compute the values of \(\mu\) and \(\sigma^2\) from the data, and then use your answer from problem 3 to compute \(\alpha\) and \(\beta\). Explain how we can interpret \(\alpha\) and \(\beta\) in the context of this data (in particular, round \(\alpha\) to the nearest whole number for ease of interpretation).
(\(\star\)) In your write-up,
make a scatter plot of the age and prop variables, then add a
stat_function layer to plot the gamma distribution with your calculated
values of \(\alpha\) and \(\beta\) (Note: since the \(y\)-axis of your scatter plot is a
proportion, and the output of the gamma pdf is a density, we need to
rescale the output of the function in the stat_function layer. We can do
this by adding inside stat_function() the following argument:
aes(y = after_stat(y * binwidth)).) Describe your
observations of the plot, particularly the reasonableness of the fit of
the model, based on visual inspection.