Introduction

The median is an important quantity in data analysis. It represents the middle value of the data distribution. Estimates of the median, however, have a degree of uncertainty because

  1. the estimates are calculated from a finite sample and

  2. the data distribution of the underlying data is generally unknown.

One important roles of a data scientist is to quantify and to communicate the degree of uncertainty in his or her data analysis. In this article, we will explore the variation of the median (and a range of other quantiles). To do this, we will use analytical methods. Finally, we will discuss how we can write flexible functions for the distribution of order statistics.

Given any random variables \(X_1, X_2..., X_n\), the order statistics \(X_{(1)}, X_{(2)}, ..., X_{(n)}\) are also random variables, defined by sorting the values of \(X_1, ..., X_n\) in increasing order.

The first order statistic (or smallest order statistic) is the minimum of the sample, that is,

\[ X_{(1)} = \min\{X_1, ..., X_n\} \]

The nth order statistic (or largest order statistic) is the maximum of the sample, that is,

\[ X_{(n)}=\max\{X_1,...,X_n\} \]

The sample range is the difference between the maximum and minimum.

\[ \text{Range}\{X_1,...,X_n\}=X_n - X_1 \]

The sample median is given by the middle value of our order statistics random variables.

If n is even: The sample median is given by the average of the values in positions \(\frac{n}{2}\) and \(\frac{n}{2} + 1\).

If n is odd: The sample median is in position \(\frac{(n + 1)}{2}\).

Questions:

Q: Begin with the median from a sample of \(N=200\) from the standard normal distribution. Write an R function that is the density function for the median in this sample. Note that the 100th order statistic is approximately the median, and use the order statistic formula discussed in class. Generate a plot of the function.

Since we have a sample of \(N=200\), by definition of sample median we know that the median is given by the \(100^{th}\) order statistics, or \(X_{(100)}\). For a general distribution \(1\leq k \leq n\), the density function for the \(k^{th}\) order statistics is given by

\[ f_{(k)}(x)= k {n\choose k} \left[ F(x)\right]^{k-1}\left[ 1 - F(x)\right]^{n-k}f(x) \]

Thus, the density function for the sample median is given by

\[ f_{(100)}(x)= 100 {200 \choose 100} \left[ F(x)\right]^{99}\left[ 1 - F(x)\right]^{100}f(x). \]

x <- seq(-0.5,0.5,0.01)

dorder <- function(x){
  100*choose(200,100)*(pnorm(x,0,1))^(99)*(1-pnorm(x,0,1))^(100)*(dnorm(x,0,1))
}

curve(dorder(x), -0.5, 0.5, lwd = 3, col = "blue", xlab = parse(text="X[(100)]"),ylab = "Density")

Q: Write an R function that is the probability function for the median in this sample. Use the order statistic formula discussed in class. Generate a plot of the function.

For a general distribution \(1\leq k \leq n\), the probability function for the \(k^{th}\) order statistics is given by

\[ F_{(k)}(x)= \sum_{j=k}^{n} {n \choose j} \left[ F(x)\right]^{j}\left[ 1 - F(x)\right]^{n-j} \]

Thus, the probability function for the sample median is given by

\[ F_{(100)}(x)= \sum_{j=100}^{200} {200 \choose j} \left[ F(x)\right]^{j}\left[ 1 - F(x)\right]^{200-j} \]

x <- seq(-0.5,0.5,0.01)

porder <- function(x){
  pbinom(99, 200, pnorm(x,0,1), lower.tail = FALSE)
}

curve(porder(x), -0.5, 0.5, lwd = 3, col = "skyblue", xlab = parse(text="X[(100)]"),ylab = "Probability")

Q: Write an R function that is the quantile function for the median in this sample. (You have several options for how to write this function). Generate a plot of the function.

x <- seq(-0.5,0.5,0.01)
p <- porder(x)

qorder <- approxfun(p,x)

curve(qorder(x), 0, 1, lwd = 3, col = "blue", ylab = parse(text="X[(100)]"),xlab = "Probability")

Q: Simulate the sampling distribution for the median. Create a plot of the empirical CDF (ECDF). Overlay the plot of the ECDF with a plot of the CDF.

In order to simulate the sampling distribution for the median, we begin by generating 200 samples from the normal distribution using the built in function in R: rnom. This function takes 3 inputs, which are the size, mean, and standard deviation. Since our sample of 200 is from the standard normal distribution, the mean is zero and the standard deviation is one. Now, we wish to simulate the sampling distribution for the median. To do this, we use the built-in functions replicate and median to find the median for our sample size 100000 times and store it in a vector named sample_median.

sample_median <- replicate(100000,median(rnorm(200,0,1)))

Next, we use the ecdf function in R to calculate the empirical cumulative density function of our simulated distribution.

sample_ecdf <- ecdf(sample_median)

plot(sample_ecdf, xlim=c(-0.5,0.5), do.points = FALSE, col = "blue", lwd = 3)
curve(porder(x), -0.5, 0.5, add = TRUE, lwd = 3, col = "skyblue")
legend(
    "bottomright"
  , c("ECDF","CDF")
  , lwd = 3
  , col = c("blue","skyblue")
  , bty = "n"
)

Q: Using the simulated sampling distribution from the previous question, create a histogram (on the density scale). Overlay the histogram with a plot of the density function.

hist(sample_median, breaks = 200, xlab = parse(text="X[(100)]"), main = "Samping distribution for the Median", freq = FALSE, col = "skyblue")

curve(dorder(x), -0.5, 0.5,add = TRUE, lwd = 3, col = "blue", xlab = parse(text="X[(100)]"),ylab = "Density")

Q: One very common way to compare a random sample to a theoretical candidate distribution is the QQ plot. It is created by ploting quantiles of the theoretical distribution on the x-axis and empirical quantiles from the sample on the y-axis.

If sample and theoretical quantiles come from the same distribution, then the plotted points will fall along the line y=x, approximately. Here are two examples when the sample and theoretical quantiles came from the same distribution.

random_sample <- rexp(200)
q_candidate <- qexp
x <- q_candidate((1:200)/200)
y <- quantile(random_sample, probs = (1:200)/200)
plot(x,y, asp = 1)
abline(0,1)

random_sample <- rnorm(200)
q_candidate <- qnorm

x <- q_candidate((1:200)/200)
y <- quantile(random_sample, probs = (1:200)/200)

plot(x,y, asp = 1, xlab = "Theoretical quantile", ylab = "Sample quantile")
abline(0,1)

Here is an example when the sample distribution does not match with the theoretical distribution. The sample distribution is t_3 where as the theoretical distribution is N(0,1). Notice the deviation from y=x.

random_sample <- rt(200, df = 3)
q_candidate <- qnorm

x <- q_candidate((1:200)/200)
y <- quantile(random_sample, probs = (1:200)/200)

plot(x,y, asp = 1, xlab = "Theoretical quantile", ylab = "Sample quantile")
abline(0,1)

The deviation occurs despite the fact that the two distributions are similar.

For the assigment, generate a QQ plot for the simulated data of the median relative to the known sampling distribution of the median.

Does the simulated data agree with the theoretical sampling distribution?

x <- ppoints(100)
empirical_Q <- quantile(sample_median, x)
theoretical_Q <- qorder(x)

# Using Base R
# plot(empirical_Q, theoretical_Q)

df <- data.frame(empirical_Q, theoretical_Q)
ggplot(df,aes(sample = theoretical_Q))+
  geom_qq(size = 1.5)+
  geom_qq_line(size = 1, colour = "skyblue")+
  labs(x="Theoretical Quantiles",y="Empirical Quantiles",title="Normal Q-Q Plot")+
  theme(plot.title = element_text(hjust=0.5))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Based on the Q-Q Plot, we see that the simulated data matches the theoretical sampling distribution. We know this since the plotted points fall almost perfectly along the line y=x.

Q: Modify the dorder, porder, and qorder functions so that the functions take a new parameter k (for the k^{th} order statistic) so that the functions will work for any order statistic and not just the median.

k <- 100

dorder1 <- function(x, k){
  F <- get(paste0("p","norm"))
  f <- get(paste0("d","norm"))
  k*choose(200,k)*F(x)^(k-1)*(1-F(x))^(200-k)*f(x)
}

porder1 <- function(x,k){
  F <- get(paste0("p","norm"))
  pbinom(k-1, 200, F(x), lower.tail = FALSE)
}

p <- porder1(x,k)
qorder1 <- approxfun(p,x)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

Q: Generate the QQ plot for simulated data from the sampling distribution of the sample max and the theoretical largest order statistic distribution.

sample_max <- replicate(100000,max(rnorm(200,0,1)))
x <- ppoints(200)
p <- porder1(x,200)

empirical_Q1 <- quantile(sample_max, x)
theoretical_Q1 <- qorder1(x)

# Using Base R
# plot(empirical_Q1, theoretical_Q1)

df1 <- data.frame(empirical_Q1, theoretical_Q1)

ggplot(df1,aes(sample = theoretical_Q1))+
  geom_qq(size = 1.5)+
  geom_qq_line(size = 1, colour = "blue")+
  labs(x="Theoretical Quantiles",y="Empirical Quantiles",title="Normal Q-Q Plot")+
  theme(plot.title = element_text(hjust=0.5))
## Warning: Removed 110 rows containing non-finite values (`stat_qq()`).
## Warning: Removed 110 rows containing non-finite values (`stat_qq_line()`).

Q: Modify the dorder, porder, and qorder functions so that the functions take new parameters dist and ... so that the functions will work for any continuous distribution that has d and p functions defined in R.

n <- 200

dorder2 <- function(x, k, n, dist, ...){
  F <- get(paste0("p",dist))
  f <- get(paste0("d",dist))
  k*choose(n,k)*F(x, ...)^(k-1)*(1-F(x, ...))^(n-k)*f(x, ...)
}

porder2 <- function(x, k, n, dist, ...){
  F <- get(paste0("p",dist))
  pbinom(k-1, n, F(x, ...), lower.tail = FALSE)
}

p <- porder2(x,k,n,dist = "norm")
qorder2 <- approxfun(p,x)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

Q: Use the newly modified functions to plot the probability and density functions for the sample max (N=200).

x <- seq(-5,5,0.01)

curve(dorder2(x, k = 200, n= 200, dist = "norm"), 1, 5, lwd = 3, col = "blue", xlab = parse(text="X[(200)]"),ylab = "Density")

curve(porder2(x, k = 200, n= 200, dist = "norm"), 1, 5, lwd = 3, col = "skyblue", xlab = parse(text="X[(200)]"),ylab = "Probability")