Math 302 Lab 1

General Instructions

Each lab in this course has three components. First, there is a document like this one, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document marked with a star, \(\star\), is a problem for your write-up. Second, there is an R script where you will do all of the computations required for the lab. Third, you will complete a write-up to be turned in on the Friday of the week in which you do the lab.

If you are comfortable doing so, I strongly suggest using RMarkdown to type your lab write-up. However, if you are new to R, you may handwrite your write-up (I’m also happy to work with you to learn RMarkdown!). All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.

Lab Overview

In this lab, you will work with different families of distributions and visualize the distributions of their order statistics. Then, you will use a dataset to make qualitative comparisons between plots of distributions and qq-plots.

Order Statistics

R has no convenient built-in way to plot the pdfs of order statistics. Instead, we will write our own function that computes them and then plot that function.

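os.pdf <- function(y, n, i, cdf, pdf, ...) {
        # Constant n! / ((i - 1)! (n - i)!) from the order statistic density
        c <- factorial(n)/(factorial(i - 1) * factorial(n - i))
        # Evaluate the underlying cdf and pdf at each point of y
        F <- sapply(y, cdf, ...)
        f <- sapply(y, pdf, ...)
        # Density of the i-th order statistic at each point of y
        c * (F^(i - 1)) * ((1 - F)^(n - i)) * f
}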
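
For reference, os.pdf evaluates the standard formula for the density of the \(i^\textrm{th}\) order statistic of a sample of size \(n\) drawn from a continuous distribution with cdf \(F\) and pdf \(f\). The symbol \(Y_{(i)}\) for that order statistic is introduced here for illustration and may differ from the notation used in lecture.

```latex
\[
f_{Y_{(i)}}(y) = \frac{n!}{(i-1)!\,(n-i)!}\, F(y)^{i-1}\, \bigl(1 - F(y)\bigr)^{n-i}\, f(y)
\]
```

The leading fraction is the constant c computed on the first line of the function, and the remaining three factors match the last line of the function term by term.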

The function os.pdf takes the following arguments:

y: the points at which the density is evaluated (stat_function generates these from the x range given in the data frame passed to ggplot)
n: the number of sample points taken from the underlying continuous random variable
i: the specific order statistic you want (an integer from 1 to n)
cdf: the cdf of the underlying continuous random variable (e.g., pexp, punif, pnorm)
pdf: the pdf of the underlying continuous random variable (e.g., dexp, dunif, dnorm)
...: optional arguments needed for the specific continuous random variable chosen (e.g., rate if using pexp and dexp)
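
One way to sanity-check os.pdf is to compare it against a case with a known answer. The minimum of n independent exponential random variables with a common rate is itself exponential with n times that rate, so the first order statistic for n = 5 and rate = 0.1 should match dexp with rate = 0.5. The sketch below repeats the definition of os.pdf so that it runs on its own; the grid y and the names min.density, max.abs.diff, and total are arbitrary choices made for this illustration.

```r
# os.pdf as defined above, repeated so this check is self-contained
os.pdf <- function(y, n, i, cdf, pdf, ...) {
        c <- factorial(n)/(factorial(i - 1) * factorial(n - i))
        F <- sapply(y, cdf, ...)
        f <- sapply(y, pdf, ...)
        c * (F^(i - 1)) * ((1 - F)^(n - i)) * f
}

# Arbitrary grid of evaluation points (an illustrative choice)
y <- seq(0, 25, by = 0.5)

# Density of the minimum of 5 exponentials with rate 0.1
min.density <- os.pdf(y, n = 5, i = 1, cdf = pexp, pdf = dexp, rate = 0.1)

# Known answer: the minimum is exponential with rate 5 * 0.1 = 0.5
max.abs.diff <- max(abs(min.density - dexp(y, rate = 0.5)))
print(max.abs.diff)

# Any density should integrate to 1; check the third order statistic
total <- integrate(os.pdf, lower = 0, upper = Inf,
                   n = 5, i = 3, cdf = pexp, pdf = dexp, rate = 0.1)$value
print(total)
```

The first printed value should be zero up to floating-point rounding, and the second should be numerically close to 1.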

To actually use this function to plot the pdfs of order statistics, we use ggplot along with the stat_function layer.

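library(ggplot2)  # provides ggplot, stat_function, and the other plotting functions used here

ggplot(data.frame(x = c(0, 25)), aes(x = x)) +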
        stat_function(fun = os.pdf, geom = "line", args = list(n = 5, i = 1, cdf = pexp, pdf = dexp, rate = 0.1), aes(col = "y1")) +
        stat_function(fun = os.pdf, geom = "line", args = list(n = 5, i = 2, cdf = pexp, pdf = dexp, rate = 0.1), aes(col = "y2")) +
        stat_function(fun = os.pdf, geom = "line", args = list(n = 5, i = 3, cdf = pexp, pdf = dexp, rate = 0.1), aes(col = "y3")) +
        stat_function(fun = os.pdf, geom = "line", args = list(n = 5, i = 4, cdf = pexp, pdf = dexp, rate = 0.1), aes(col = "y4")) +
        stat_function(fun = os.pdf, geom = "line", args = list(n = 5, i = 5, cdf = pexp, pdf = dexp, rate = 0.1), aes(col = "y5")) +
        scale_colour_manual("Order Statistic", values = c("red", "purple", "blue", "green", "yellow")) +
        xlab('Y') +
        ylab('Density') +
        ggtitle('Order Statistic Densities for an Exponential Random Variable')
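
The five stat_function layers above differ only in the value of i. As an alternative sketch (not the approach the lab's R script is assumed to use), the densities can be computed on a grid first and then plotted with a single geom_line layer. The grid y.grid, the data frame dens, and its column names are hypothetical names chosen for this illustration, and os.pdf is repeated so the block runs on its own.

```r
library(ggplot2)

# os.pdf as defined earlier in this lab, repeated so the sketch is self-contained
os.pdf <- function(y, n, i, cdf, pdf, ...) {
        c <- factorial(n)/(factorial(i - 1) * factorial(n - i))
        F <- sapply(y, cdf, ...)
        f <- sapply(y, pdf, ...)
        c * (F^(i - 1)) * ((1 - F)^(n - i)) * f
}

n <- 5
# Grid of points on which to evaluate each density (200 points is an arbitrary choice)
y.grid <- seq(0, 25, length.out = 200)

# Stack one block of rows per order statistic into a long data frame
dens <- do.call(rbind, lapply(seq_len(n), function(i) {
        data.frame(y = y.grid,
                   order.stat = paste0("y", i),
                   density = os.pdf(y.grid, n = n, i = i, cdf = pexp, pdf = dexp, rate = 0.1))
}))

# One geom_line layer draws all n curves, coloured by order statistic
p <- ggplot(dens, aes(x = y, y = density, col = order.stat)) +
        geom_line() +
        scale_colour_manual("Order Statistic", values = c("red", "purple", "blue", "green", "yellow")) +
        xlab('Y') +
        ylab('Density') +
        ggtitle('Order Statistic Densities for an Exponential Random Variable')
print(p)
```

With this layout, changing n means editing one line, although the vector of colours passed to scale_colour_manual must still contain at least n entries.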

(\(\star\)) Plot the pdfs of the order statistics for samples of size \(n = 5\) from two common families of continuous distributions other than the exponential distribution (choose the parameter values however you like). Include the plots in your write-up.
(\(\star\)) What observations do you make about the shapes of the pdfs of the order statistics and the relationships between them?

QQ-Plots

In R, qq-plots can be made in a number of ways. Although most of the plotting we will do uses the ggplot2 package, qq-plots in ggplot2 are a bit less flexible than qq-plots with the package EnvStats (at least in some cases). In this lab, we’ll use EnvStats to make qq-plots, and you can look on your own for how to produce qq-plots in ggplot2 (or in base R graphics).

As an example, here’s how you can make qq-plots in two different contexts. Below, we’ve loaded the dataset “father.son”, which has 1078 observations of pairs of heights of fathers and sons (the collection and study of this data set led to several developments in modern statistics). The dataset consists of two variables:

fheight: father’s height in inches
sheight: son’s height in inches

The code below gives the qq-plot of the sheight variable compared to a normal distribution.

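library(EnvStats)  # provides qqPlot
library(UsingR)    # the father.son dataset is available in the UsingR package

data(father.son)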

qqPlot(father.son$sheight, add.line = TRUE, qq.line.type = 'robust', line.col = 'red')

In the example above, a straight line was added to the plot to aid the visualization. In the qqPlot function, the argument qq.line.type specifies the type of line you include in the plot. There are three options for line type:

0-1: gives a line with slope 1 and intercept 0, which will help visualize if the two distributions are exactly identical.
least squares: gives the least squares regression line for the qq-plot viewed as a scatter plot.
robust: gives a line drawn through the point corresponding to the \(25^\textrm{th}\) percentile and the point corresponding to the \(75^\textrm{th}\) percentile. It is called robust because it is much less sensitive to outliers than the least squares regression line.
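
To make the robust option concrete, the sketch below computes a line through the two quartile points by hand for a qq-plot against the standard normal distribution, using only base R. It illustrates the idea described above and is not taken from the EnvStats source; the simulated sample heights, its mean and standard deviation, and the seed are made up for the example.

```r
set.seed(302)
# Simulated sample standing in for real data (mean 69 and sd 3 are made-up values)
heights <- rnorm(1000, mean = 69, sd = 3)

# Sample and theoretical quantiles at the 25th and 75th percentiles
probs <- c(0.25, 0.75)
sample.q <- quantile(heights, probs, names = FALSE)
normal.q <- qnorm(probs)

# Line through the two quartile points: slope first, then intercept
slope <- diff(sample.q) / diff(normal.q)
intercept <- sample.q[1] - slope * normal.q[1]

# Base R normal qq-plot with the hand-computed line added
qqnorm(heights)
abline(a = intercept, b = slope, col = 'red')
```

Because the line depends only on the two quartiles, a handful of extreme values in the tails cannot move it. For a roughly normal sample, the slope approximates the standard deviation and the intercept approximates the center of the data.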

Instead of comparing a sample against a theoretical distribution, we can make a qq-plot of two samples against each other. One axis shows the quantiles of the first sample and the other axis shows the quantiles of the second. Comparing the samples in this way lets us compare their underlying distributions. If the points lie close to a straight line, the two distributions plausibly have the same shape and differ at most in location and scale; if the points lie close to the 0-1 line, the two distributions themselves are similar.

qqPlot(father.son$fheight, father.son$sheight, add.line = TRUE, qq.line.type = 'robust', line.col = 'red')
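
When the two samples have the same size, as the 1078 father and son heights do, a two-sample qq-plot amounts to plotting the sorted values of one sample against the sorted values of the other. The sketch below shows this with simulated data in base R. The vectors s1 and s2, their parameters, and the seed are made up, and the code illustrates the idea rather than describing how EnvStats computes the plot.

```r
set.seed(302)
# Two simulated samples of equal size (made-up means and standard deviations)
s1 <- rnorm(500, mean = 68, sd = 3)
s2 <- rnorm(500, mean = 69, sd = 3)

# With equal sample sizes, matching quantiles are the matching sorted values
qq <- data.frame(q1 = sort(s1), q2 = sort(s2))

plot(qq$q1, qq$q2, xlab = 'Quantiles of s1', ylab = 'Quantiles of s2')
# 0-1 reference line: points on it would indicate identical distributions
abline(a = 0, b = 1, col = 'red')

# Average vertical distance from the 0-1 line, equal to the difference in sample means
shift <- mean(qq$q2 - qq$q1)
print(shift)
```

Because both simulated samples have the same spread, the points run roughly parallel to the 0-1 line and sit about one unit above it, which reflects the difference in means.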

We will apply the idea above in the context of examining wealth gaps by race in the United States. The dataset for this analysis is the Survey of Consumer Finances, conducted by the Federal Reserve. The most recent data is from 2019, and this survey was conducted on a sample of over 58,000 families in the U.S. In particular, the cleaned dataset you work with below is from the Summary Extract Public Data.

Rather than work with the full dataset, you have a subset consisting of two variables:

NETWORTH: a measure of the family’s net worth, from a formula developed by the Federal Reserve
RACE: a factor variable with four levels (White non-Hispanic, Black, Hispanic, and Other)

In the problems below, you will do an exploratory analysis of this dataset, make observations, and ask statistical questions.

(\(\star\)) Plot histograms of net worth by race (use facet_wrap to do this in one step). Include the plots in your write-up and discuss the following question. Is it possible, from these histograms, to reasonably compare the distributions of net worth across race?
(\(\star\)) Plot qq-plots of net worth of a racial group (for White non-Hispanic, Black, and Hispanic only) against the net worth of the whole group, and include the plots in your write-up.
(\(\star\)) Explain what the shapes of the qq-plots allow us to conclude about the differences in wealth distributions across racial groups.
(\(\star\)) What statistical questions do you have about this dataset after your exploratory analysis?

Ross Sweet
