General Instructions

Each lab in this course will have multiple components. First, there will a piece like the document below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in the Friday that you did your lab.

If you are comfortable doing so, I strongly suggest using RMarkdown to type your lab write-up. However, if you are new to R, you may handwrite your write-up (I’m also happy to work with you to learn RMarkdown!). All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.

Lab Overview

In this lab, you will learn how to plot continuous distributions using ggplot2, particularly in the context of gamma distributions.

As part of this lab, you will also see an application of gamma distributions to the study of prostate cancer epidemiology.

R Tutorial

We need a little tutorial in plotting the distribution functions of continuous random variables using the R package ggplot2. If you are comfortable with the ggplot2 package in R, feel free to skip this section.

Suppose that we want to plot a normal distribution with mean \(\mu=2\) and variance \(\sigma^2=16\). The main idea behind the ggplot2 syntax is that you tell R what data you want it to plot, then add layers for aesthetic settings and graphical elements. The code below does just this. The first line tells R the data, which we define as a data frame with the \(x\) values we want to plot, and then specifies this variable to plot in aes (aesthetics). The next layer, telling R that you want to plot a normal distribution, is added with a \(+\) symbol. This stat_function layer uses the function (fun) dnorm, which is the R syntax for the pdf of a normal distribution. The args term in stat_function tells R what arguments need to be plugged into the dnorm function (see the documentation of this function by running ?dnorm which will show you all of the arguments of the function).

ggplot(data = data.frame(x = c(-10, 12)), aes(x = x)) +
        stat_function(fun = dnorm, args = list(mean = 2, sd = sqrt(16)))

We can add other distributions to this plot as well, simply by adding new layers. In this case, suppose we want to add a plot of the standard normal distribution. To make these distinguishable, we’ll use different colors for each curve.

ggplot(data = data.frame(x = c(-10, 12)), aes(x = x)) +
        stat_function(fun = dnorm, args = list(mean = 2, sd = sqrt(16)), color = "red") +
        stat_function(fun = dnorm, args = list(mean = 0, sd = sqrt(1)), color = "blue")

The difficulty now is that we can’t tell from the picture alone what the distributions are. We can remedy this by adding a legend. This is done by changing color to an aesthetic in each stat_function call and specifying the label for each color. We also add a title and a more descriptive axis label for greater clarity.

ggplot(data = data.frame(x = c(-10, 12)), aes(x = x)) +
        stat_function(fun = dnorm, args = list(mean = 2, sd = sqrt(16)), aes(color = "mean 2, var 16")) +
        stat_function(fun = dnorm, args = list(mean = 0, sd = sqrt(1)), aes(color = "mean 0, var 1")) +
        ggtitle("Normal Distributions") +
        ylab("density")

Gamma Distributions

We know that gamma distributions have two parameters, which we called \(\alpha\) and \(\beta\), the shape and scale parameters, respectively. As a reminder, different sources use different variables for these parameters (although the names shape and scale are consistent)! In particular, you might also see gamma distributions written in terms of the shape and rate, \(\lambda\), parameters. The relationship between the rate and the scale is \(\beta=\frac{1}{\lambda}\).

  1. (\(\star\)) In your R script, choose values for the shape and scale parameters, and plot your chosen gamma distribution. Note that you may have to do some back and forth testing to find an appropriate plotting domain (the values of \(x\) in the data frame in the ggplot call).

The Wikipedia page on gamma distributions has a handy plot of distributions for different values of the shape and scale parameters. However, Wikipedia (annoyingly) uses different notation than we do, choosing \(k\) for the shape parameter and \(\theta\) for the scale.

  1. (\(\star\)) In your R script, recreate the plot from Wikipedia, updating the notation to be consistent with our text (don’t worry about matching the colors exactly).

Epidemiology

We will use a data set obtained from a paper by Aleksey V. Belikov, The number of key carcinogenic events can be predicted from cancer incidence. This data set, which is further cleaned from the source, shows the midpoint of age ranges (e.g., an age of 12 represents the age range 10-14), while the freq variable gives the proportion of total cases observed in that age range. In the age ranges with no cases observed, we include a small fudge factor to get a nonzero value.

age freq
1.0 0.0000204
2.5 0.0000204
7.0 0.0000002
12.0 0.0000002
17.0 0.0000002
22.0 0.0000002
27.0 0.0000002
32.0 0.0000204
37.0 0.0002039
42.0 0.0019577
47.0 0.0086056
52.0 0.0293853
57.0 0.0645417
62.0 0.1104856
67.0 0.1674618
72.0 0.1847545
77.0 0.1724987
82.0 0.1369957
87.0 0.1230473
  1. (\(\star\)) One way inferring parameters from data is based on matching the mean and variance of the data to the known formulas for mean and variance of the distribution. In the case of the gamma distribution, solve for \(\alpha\) and \(\beta\) in terms of \(\mu\) and \(\sigma^2\).

  2. (\(\star\)) In your R script, calculate the values of \(\mu\) and \(\sigma^2\) from the data, and then use your answer from problem 3 to find \(\alpha\) and \(\beta\). Explain how we can interpret \(\alpha\) and \(\beta\) in the context of this data (in particular, round \(\alpha\) to the nearest whole number for ease of interpretation).

  3. (\(\star\)) In your R script, make a scatter plot of the age and freq variables, then add a stat_function layer to plot the gamma distribution with your calculated values of \(\alpha\) and \(\beta\) (note: your y value here will need to be freq/4.473705 to account for differences in the frequency and density computations.)