Math 301 Lab 4 - Gamma Distributions

General Instructions

Each lab in this course will have multiple components. First, there will a piece like the document below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in the Friday that you did your lab.

You will use RMarkdown to type your lab write-up. If you’re new to R and/or LaTeX, I’m also happy to work with you to learn RMarkdown! All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.

Lab Overview

In this lab, you will learn how to plot continuous distributions using ggplot2, particularly in the context of gamma distributions.

As part of this lab, you will also see an application of gamma distributions to the study of prostate cancer epidemiology.

R Tutorial

We need a little tutorial in plotting the distribution functions of continuous random variables using the R package ggplot2. If you are comfortable with the ggplot2 package in R, feel free to skip this section.

Suppose that we want to plot a normal distribution with mean \(\mu=2\) and variance \(\sigma^2=16\). The main idea behind the ggplot2 syntax is that you tell R what data you want it to plot, then add layers for aesthetic settings and graphical elements. The code below does just this. The first line tells R the data, which we define as a data frame with the \(x\) values we want to plot, and then specifies this variable to plot in aes (aesthetics). The next layer, telling R that you want to plot a normal distribution, is added with a \(+\) symbol. This stat_function layer uses the function (fun) dnorm, which is the R syntax for the pdf of a normal distribution. The args term in stat_function tells R what arguments need to be plugged into the dnorm function (see the documentation of this function by running ?dnorm which will show you all of the arguments of the function).

ggplot(data = data.frame(x = c(-10, 12)), aes(x = x)) +
        stat_function(fun = dnorm, args = list(mean = 2, sd = sqrt(16)))

We can add other distributions to this plot as well, simply by adding new layers. In this case, suppose we want to add a plot of the standard normal distribution. To make these distinguishable, we’ll use different colors for each curve.

ggplot(data = data.frame(x = c(-10, 12)), aes(x = x)) +
        stat_function(fun = dnorm, args = list(mean = 2, sd = sqrt(16)), color = "red") +
        stat_function(fun = dnorm, args = list(mean = 0, sd = sqrt(1)), color = "blue")

The difficulty now is that we can’t tell from the picture alone what the distributions are. We can remedy this by adding a legend. This is done by changing color to an aesthetic in each stat_function call and specifying the label for each color. We also add a title and a more descriptive axis label for greater clarity.

ggplot(data = data.frame(x = c(-10, 12)), aes(x = x)) +
        stat_function(fun = dnorm, args = list(mean = 2, sd = sqrt(16)), aes(color = "mean 2, var 16")) +
        stat_function(fun = dnorm, args = list(mean = 0, sd = sqrt(1)), aes(color = "mean 0, var 1")) +
        ggtitle("Normal Distributions") +
        ylab("density")

Gamma Distributions

We know that gamma distributions have two parameters, which we called \(\alpha\) and \(\beta\), the shape and scale parameters, respectively. As a reminder, different sources use different variables for these parameters (although the names shape and scale are consistent)! In particular, you might also see gamma distributions written in terms of the shape and rate, \(\lambda\), parameters. The relationship between the rate and the scale is \(\beta=\frac{1}{\lambda}\).

(\(\star\)) In your write-up, choose values for the shape and scale parameters, and plot your chosen gamma distribution. Note that you may have to do some back and forth testing to find an appropriate plotting domain (the values of \(x\) in the data frame in the ggplot call).

The Wikipedia page on gamma distributions has a handy plot of distributions for different values of the shape and scale parameters. However, Wikipedia (annoyingly) uses different notation than we do, choosing \(k\) for the shape parameter (\(\alpha\)) and \(\theta\) for the scale (\(\beta\)).

Image credit: https://en.wikipedia.org/wiki/Gamma_distribution#/media/File:Gamma_distribution_pdf.svg

(\(\star\)) In your write-up, recreate the plot from Wikipedia, updating the notation to be consistent with our text (don’t worry about matching the colors exactly).

Epidemiology

We will use a data set obtained from a paper by Aleksey V. Belikov, The number of key carcinogenic events can be predicted from cancer incidence. This data set, which is further cleaned from the source, contains the following three variables:

age: the midpoint of age ranges (e.g., an age of 22 represents the age range 20-24)
freq: the number of cases (per 100,000) observed in that age range
prop: the proportion of cases in that age range out of all cases

that were obtained from a large Centers for Disease Control survey in the United States over the years 1999-2012. The data set is displayed below.

age	freq	prop
17	0.001	0.0000002
22	0.001	0.0000002
27	0.001	0.0000002
32	0.100	0.0000233
37	1.000	0.0002325
42	9.600	0.0022325
47	42.200	0.0098135
52	144.100	0.0335100
57	316.500	0.0736012
62	541.800	0.1259941
67	821.200	0.1909677
72	906.000	0.2106877
77	845.900	0.1967116
82	671.800	0.1562252