Dealing with distributions in the world of statistics often provide researchers with a better idea of obtaining the possibility of obtaining a certain random variable under the characteristics of that distribution. The probability density function gives a researcher a great idea about the chances of obtaining a certain variable, and the cumulative distribution function tells us where that variable ranks among all the variables (i.e. what proportion of variables it supersedes). Using these two functions, charts can be created to understand more about the distribution of any phenomena. There is also another distribution function called quantile distribution function. This is the reverse of cumulative distribution function. The objective of the quantile distribution function is to take in a certain percentage and return the random variable that happens to be greater than that percentage of variables. There are two ways to develop the distributions above: via analytics and simulation. Hence, the agenda of this deliverable is to use the order statistics approach (analytical) and simulation approach to generate distributions of certain percentiles of a random variable, and compare the performance of both approaches.
The functions below were used to generate the distributions in this blog post. dorder was used to build the probability density function curve. porder was used to build the cumulative distribution function. And qorder was used to build the quantile distribution function. Over the course of the deliverable, the functions were transformed to include parameters such as kth value, distribution name and distribution parameters.
dorder <- function(x) {
100*
choose(200,100)
(pnorm(x, mean = 0, sd = 1))^(100-1)*
(1-pnorm(x, mean = 0, sd = 1))^(200-100)*
dnorm(x, mean = 0, sd = 1)
}
porder <- function(x) {
pbinom(100-1, 200, pnorm(x, mean = 0, sd = 1), lower.tail = FALSE)
}
qorder <- function(p) {
r <- curve(porder(x),-5,5)
d <- data.frame(x = r$x, y = r$y)
d <- d %>%
filter(y >= 0.005 & y <= 0.995)
n <- curve(porder(x),min(d$x), max(d$x))
m <- mean(n$x)
sd <- sd(n$x)
q <- c()
for (i in seq_along(p)) {
q[i] <- qnorm(p[i], mean = m, sd = sd)
}
df <- data.frame(p = p, q = q)
return(df)
}
Q: Begin with the median from a sample of N=200 from the standard normal distribution. Write an R function that is the density function for the median in this sample. Note that the 100th order statistic is approximately the median, and use the order statistic formula discussed in class. Generate a plot of the function.
r <- curve(dorder(x), -0.4, 0.4)
n <- data.frame(x = r$x, y = r$y*10^60)
ggplot(n) +
geom_line(aes(x = x, y = y), color = "blue") +
labs(title = "Probability density function of the median (100th order statistic)",
x = "value",
y = "density") +
theme_light() +
theme(plot.title = element_text(hjust=0.5))
The probability density function of the range of possible medians via analytical approach shows that there is a better chance of getting 0 as the median which is quite accurate given that this is a normal distribution with a mean of 0 and a standard deviation of 1.
Q: Write an R function that is the probability function for the median in this sample. Use the order statistic formula discussed in class. Generate a plot of the function.
curve(porder(x), -0.4,0.4, col = "blue",
main = "Cumulative distribution function of the median (100th order statistic)",
xlab = "value",
ylab = "probability")
Upon evaluating the cumulative distribution function, it seems that it correctly showcases the cumulative probabilities of each potential outcome as the value corresponding to having an
F(X) of 0 is -0.2 and the one corresponding to having an F(X) of 1 is 0.2. Both values are on the very end of the probability density functions.
Q: Write an R function that is the quantile function for the median in this sample. (You have several options for how to write this function.) Generate a plot of the function.
p <- seq(0.01, 0.99, by = 0.01)
d <- qorder(p)
ggplot(d) +
geom_line(aes(x = p, y = q), color = "red") +
labs(title = "Quantile distribution function of median (100th order statistic)",
x = "probability",
y = "value") +
theme_bw() +
theme(plot.title= element_text(hjust=0.5))
The quantile distribution function plot showcases an inverse output of the cumulative distribution plot. This is expected as the objective of using the quantile function is to figure out the values that correspond to certain percentiles on the cumulative distribution plot. Reviewing the y-axis, a range of -0.2 to 0.2 can be observed as the range of possibilities. This is expected according to the probability density function.
Q: Simulate the sampling distribution for the median. Create a plot of the empirical CDF (ECDF). Overlay the plot of the ECDF with a plot of the CDF.
b <- c()
set.seed(0)
for (i in 1:100) {
n <- rnorm(200, mean = 0, sd= 1)
b[i] <- quantile(n, 0.5)
}
p <- ecdf(b)
plot(p, main = "Cumulative distribution function of the median (100th order statistic) for both approaches",
xlab = "value",
ylab = "density",xlim = c(-0.2,0.2))
curve(porder(x), col = "blue", add = TRUE)
In this chart, there is a slight deviation between the cdf generated via simulation and that built via analytics. However, the follow similar trends as they towards moving from zero to one. Generally, it seems like the simulation values are greater than those generated via analytics.
Q: Using the simulated sampling distribution from the previous question, create a histogram (on the density scale). Overlay the histogram with a plot of the density function.
b <- c()
set.seed(0)
for (i in 1:101) {
n <- rnorm(200, mean = 0, sd= 1)
b[i] <- quantile(n, 0.5)
}
r <- curve(dorder(x),-0.225,0.25)
n <- data.frame(b = b, x = r$x, y = r$y*10^60)
ggplot(n) +
geom_histogram(aes(x = b,y=..density../10),bins = 10, color = "black", fill= "white") +
geom_line(aes(x = x, y = y), color = "blue") +
labs(title = "Probability density function of the median (100th order statistic) for both approaches",
x = "values",
y = "probability",
subtitle = "Blue is for analytics, black is for simulation") +
theme_light() +
theme(plot.title = element_text(hjust=0.5),
plot.subtitle = element_text(hjust=0.5))
It turns out that pdf created via the analytical approach exhibits the same behavior as the histogram built through the simulated approach. However, there are likely to be discrepancies between values as they do not fit on each other perfectly well.
Q: One very common way to compare a random sample to a theoretical candidate distribution is the QQ plot. It is created by plotting quantiles of the theoretical distribution on the x-axis and empirical quantiles from the sample on the y-axis.
b <- c()
set.seed(42)
for (i in 1:1000) {
n <- rnorm(200, mean = 0, sd= 1)
b[i] <- quantile(n, 0.5)
}
b <- sort(b)
p <- ppoints(1000)
d <- qorder(p)
g <- cbind(d, b)
x <- seq(-0.25,0.25, by = ((0.5)/(1000 - 1)))
y <- x
ggplot(g) +
geom_point(aes(x = q, y = b), col = "red") +
geom_line(aes(x = x, y = y)) +
labs(title = "QQplot of the median (100th order statistic)",
x = "Theoretical quantiles",
y = "Sample quantiles") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
The two distributions are not in agreement with one another as the points of the plot are not inline with the
y = x line on the chart. This can be evidenced below when reviewing the probability density function distributions of the curves.
ggplot(g) +
geom_density(aes(x = q), color = "blue") +
geom_density(aes(x = b), color = "red") +
labs(title = "Probability density function for theoretical and sample quantiles",
subtitle = "Blue stands for theoretical, Red stands for sample") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Q: Modify the dorder, porder, and qorder functions so that the functions take a new parameter k (for the kth order statistic) so that the functions will work for any order statistic and not just the median.
dorder <- function(x,k) {
k*
choose(200,k)
(pnorm(x, mean = 0, sd = 1))^(k-1)*
(1-pnorm(x, mean = 0, sd = 1))^(200-k)*
dnorm(x, mean = 0, sd = 1)
}
porder <- function(x,k) {
pbinom(k-1, 200, pnorm(x, mean = 0, sd = 1), lower.tail = FALSE)
}
qorder <- function(p,k) {
r <- curve(porder(x,k),-5,5)
d <- data.frame(x = r$x, y = r$y)
d <- d %>%
filter(y >= 0.005 & y <= 0.995)
n <- curve(porder(x,k),min(d$x), max(d$x))
m <- mean(n$x)
sd <- sd(n$x)
q <- c()
for (i in seq_along(p)) {
q[i] <- qnorm(p[i], mean = m, sd = sd)
}
df <- data.frame(p = p, q = q)
return(df)
}
Q: Generate the QQ plot for simulated data from the sampling distribution of the sample max and the theoretical largest order statistic distribution.
b <- c()
set.seed(0)
for (i in 1:1000) {
n <- rnorm(1000, mean = 0, sd= 1)
b[i] <- quantile(n, 0.99)
}
b <- sort(b)
p <- ppoints(1000)
d <- qorder(p, 200)
g <- cbind(d, b)
x <- seq(1.5,3, by = ((1.5)/(1000 - 1)))
y <- x
ggplot(g) +
geom_point(aes(x = q, y = b), col = "red") +
geom_line(aes(x = x, y = y)) +
labs(title = "QQplot of maximum values",
x = "Theoretical quantiles",
y = "Sample quantiles") +
theme_bw() +
theme(plot.title = element_text(hjust=0.5))
It can also be observed here that the two distributions are not in agreement with one another as the points of the plot are not inline with the
y = x line on the chart. The deviation is worse than the one experienced for the qqplot for the median.
Q: Modify the dorder, porder, and qorder functions so that the functions take new parameters dist and … so that the functions will work for any continuous distribution that has d and p functions defined in R.
dorder <- function(x,k, dist, ...) {
f <- eval(parse(text = str_c("d", dist, sep = "")))
Fm <- eval(parse(text = str_c("p", dist, sep = "")))
k*
choose(200,k)
(Fm(x, ...))^(k-1)*
(1-Fm(x, ...))^(200-k)*
f(x, ...)
}
porder <- function(x,k, dist, ...) {
Fm <- eval(parse(text = str_c("p", dist, sep = "")))
pbinom(k-1, 200, Fm(x, ...), lower.tail = FALSE)
}
qorder <- function(p, k, dist, ...) {
qm <- eval(parse(text = str_c("q", dist, sep = "")))
r <- curve(porder(x,k,dist,...),-5,5)
d <- data.frame(x = r$x, y = r$y)
d <- d %>%
filter(y >= 0.01 & y <= 0.999)
n <- curve(porder(x,k,dist,...),min(d$x), max(d$x))
m <- mean(n$x)
sd <- sd(n$x)
q <- c()
for (i in seq_along(p)) {
q[i] <- qm(p[i], mean = m, sd = sd)
}
df <- data.frame(p = p, q = q)
return(df)
}
Q: Use the newly modified functions to plot the probability and density functions for the sample min (N=200).
curve(dorder(x, 1, dist = "norm", mean = 0, sd = 1), xlim = c(-5,-1), col = "blue",
main = "Probability density function of the minimum value",
xlab = "value",
ylab = "density")
curve(porder(x, 1, dist = "norm", mean = 0, sd = 1), xlim = c(-5,-1), col = "blue",
main = "Cumulative distribution function of the minimum value",
xlab = "value",
ylab = "probability")
The values generated for the minimum value (i.e. first order statistic) do not follow that of a normally distributed curve like the one built for the median. There are more values in the left half of the density curve than in the right half. The cumulative distribution also highlights this phenomenon as the range of values lie in the second region (i.e corresponding to a percentile > 0.5) are smaller than the ones in the first.
Comparing theoretical outcomes to empirical ones is often a great way to find out whether to know how both approaches differ (or agree) with one another. This could provide researchers with a better idea on how to conduct research in order to ensure they build better models of how the world works.