Probability distributions help us understand the world: by applying them in different areas of life, we can estimate how likely random events are and so measure the chances of success in different domains. There are many statistical distributions, but this deliverable focuses on only three: gamma, lognormal and uniform.
Several helper functions were used to generate the charts in this deliverable; they are shown in the code chunk below. The pdf function plots the probability density function of a distribution. The cdf function plots its cumulative distribution function. The arith_vs_geom function shows the relationship between the arithmetic and geometric means of 1000 samples drawn from each distribution. The final function, hist_plot, displays a histogram of the difference between the arithmetic and geometric means for each distribution.
library(ggplot2)  # plotting
library(dplyr)    # %>% and rename(), used in the later chunks

# Density plot of one column of df; `type` is an unquoted column name.
# Note: this masks grDevices::pdf() for the rest of the session.
pdf <- function(df, type, var) {
  ggplot(df) +
    geom_density(aes(x = {{ type }}), fill = "grey") +
    geom_vline(aes(xintercept = mean({{ type }})), color = "blue") +
    geom_vline(aes(xintercept = median({{ type }})), color = "red") +
    labs(title = sprintf("Probability density function of %s", var),
         subtitle = "Blue represents mean, red represents median",
         x = "value") +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))
}

# Empirical CDF of one column of df, with mean and median marked.
cdf <- function(df, type, var) {
  ggplot(df, aes({{ type }})) +
    stat_ecdf(geom = "step", color = "black") +
    geom_vline(aes(xintercept = mean({{ type }})), color = "blue") +
    geom_vline(aes(xintercept = median({{ type }})), color = "red") +
    labs(title = sprintf("Cumulative distribution function of %s", var),
         subtitle = "Blue represents mean, red represents median",
         x = "value",
         y = "percentile") +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))
}

# Scatter plot of arithmetic vs geometric sample means.
# The red line of identity (y = x) is drawn from n1 to n2.
arith_vs_geom <- function(d, var, n1, n2) {
  ref <- data.frame(x = c(n1, n2), y = c(n1, n2))
  ggplot(d) +
    geom_point(aes(y = arithmetic, x = geometric), color = "blue") +
    geom_line(data = ref, aes(x = x, y = y), color = "red") +
    labs(title = sprintf("Arithmetic vs Geometric mean for %s", var),
         y = "Arithmetic mean",
         x = "Geometric mean") +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))
}

# Histogram of |arithmetic - geometric|; n is the bin width.
hist_plot <- function(d, var, n) {
  ggplot(d) +
    geom_histogram(aes(x = abs(arithmetic - geometric)),
                   binwidth = n, fill = "blue", alpha = 0.4) +
    labs(x = "Difference between mean values",
         title = sprintf("Distribution of the difference between Arithmetic and Geometric mean values of %s", var)) +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))
}
For each distribution below, generate a figure of the PDF and CDF. Mark the mean and median in the figure.
For each distribution below, generate a figure of the PDF and CDF of the transformation Y = log(X) random variable. Mark the mean and median in the figure. You may use simulation or analytic methods in order to find the PDF and CDF of the transformation.
For each of the distributions below, generate 1000 samples of size 100. For each sample, calculate the geometric and arithmetic mean. Generate a scatter plot of the geometric and arithmetic sample means. Add the line of identity as a reference line.
Generate a histogram of the difference between the arithmetic mean and the geometric mean.
In all the graphs, a blue line represents the mean and a red line represents the median.
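The sampling step described above (1000 samples of size 100, with both means computed per sample) can be sketched without an explicit loop. This is only an illustration: the helper name `sim_means` and its arguments are hypothetical and not part of the original code, which uses a for loop for each distribution.

```r
# Hypothetical helper (not part of the original code): draw `n_samples`
# samples of size `size` from the generator `rdist` and return the
# arithmetic and geometric mean of each sample.
sim_means <- function(rdist, n_samples = 1000, size = 100, ...) {
  # One column per sample: `size` rows, `n_samples` columns
  draws <- matrix(rdist(size * n_samples, ...), nrow = size)
  data.frame(
    arithmetic = colMeans(draws),
    geometric  = exp(colMeans(log(draws)))  # exp of the mean of the logs
  )
}

set.seed(42)
d_example <- sim_means(rgamma, shape = 3, scale = 1)
head(d_example)
```

The resulting data frame already has the `arithmetic` and `geometric` columns that arith_vs_geom and hist_plot expect, so no renaming step would be needed.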
Distribution 1
X∼GAMMA(shape=3,scale=1)
mean <- mean(rgamma(1000, shape=3,scale=1))
curve(dgamma(x,shape=3,scale=1), xlim = c(0,10), main = "Probability density function of gamma distribution", xlab = "value", ylab = "density")
abline(v = mean, lwd = 2, col = "blue")
abline(v = qgamma(0.5,shape=3,scale=1), lwd = 2, col = "red")
curve(pgamma(x,shape=3,scale=1), xlim = c(0,10), main = "Cumulative distribution function of gamma distribution", xlab = "value", ylab = "probability")
abline(v = mean, lwd = 2, col = "blue")
abline(v = qgamma(0.5,shape=3,scale=1), lwd = 2, col = "red")
gamma <- rgamma(10000, shape=3, scale=1)
x <- 1:10000
loggamma <- log(gamma)
df_gamma <- data.frame(x = x, gamma = gamma,loggamma = loggamma)
pdf(df_gamma, loggamma, "gamma distribution under log transformation")
cdf(df_gamma, loggamma, "gamma distribution under log transformation")
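The plots above are based on simulation. As a cross-check, the density of Y = log(X) can also be written analytically by a change of variables, f_Y(y) = f_X(e^y) e^y. The sketch below assumes the same Gamma(shape = 3, scale = 1) parameters; the function name `dloggamma` is hypothetical and introduced only for illustration.

```r
# Hypothetical helper (not in the original code): analytic density of
# Y = log(X) for X ~ Gamma(shape, scale), by change of variables:
#   f_Y(y) = f_X(exp(y)) * exp(y)
dloggamma <- function(y, shape = 3, scale = 1) {
  dgamma(exp(y), shape = shape, scale = scale) * exp(y)
}

# Sanity check: the density should integrate to approximately 1.
# Finite limits are used because exp(y) overflows for very large y.
total_mass <- integrate(dloggamma, lower = -10, upper = 5)$value

# Exact mean and median of Y, for comparison with the simulated plots
mean_y   <- digamma(3) + log(1)                     # digamma(shape) + log(scale)
median_y <- log(qgamma(0.5, shape = 3, scale = 1))  # log is monotone

curve(dloggamma(x), from = -3, to = 3, xlab = "value", ylab = "density",
      main = "Analytic density of log(X), X ~ Gamma(3, 1)")
abline(v = mean_y, lwd = 2, col = "blue")
abline(v = median_y, lwd = 2, col = "red")
```

If the simulated density and this curve disagree noticeably, the likely cause is sampling noise or the kernel bandwidth chosen by geom_density rather than the transformation itself.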
d <- data.frame()
for (i in 1:1000) {
n <- rgamma(100, shape=3, scale=1)
d[i,1] <- mean(n)
d[i,2] <- exp(mean(log(n)))
}
d <- d %>%
rename(arithmetic = V1, geometric = V2)
arith_vs_geom(d, "gamma distribution",1.8,3)
hist_plot(d, "gamma distribution", 0.05)
Distribution 2
X∼LOG NORMAL(μ=−1,σ=1)
mean <- mean(rlnorm(1000, meanlog = -1, sdlog = 1))
curve(dlnorm(x, meanlog = -1, sdlog = 1), xlim = c(0,5), main = "Probability density function of lognormal distribution", xlab = "value", ylab = "density")
abline(v = mean, lwd = 2, col = "blue")
abline(v = qlnorm(0.5,meanlog = -1, sdlog = 1), lwd = 2, col = "red")
curve(plnorm(x,meanlog = -1, sdlog = 1), xlim = c(0,5), main = "Cumulative distribution function of lognormal distribution", xlab = "value", ylab = "probability")
abline(v = mean, lwd = 2, col = "blue")
abline(v = qlnorm(0.5,meanlog = -1, sdlog = 1), lwd = 2, col = "red")
lnorm <- rlnorm(1000, meanlog = -1, sdlog = 1)
loglnorm <- log(lnorm)
x <- 1:1000
df_lnorm <- data.frame(x = x, lnorm = lnorm, loglnorm = loglnorm)
pdf(df_lnorm, loglnorm, "lognormal distribution under log transformation")
cdf(df_lnorm, loglnorm, "lognormal distribution under log transformation")
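For the lognormal case the transformation has a known closed form: if X is lognormal with meanlog = -1 and sdlog = 1, then log(X) is normal with mean -1 and standard deviation 1, so its mean and median coincide. A minimal simulation check, assuming those parameters (the variable names are illustrative only):

```r
set.seed(7)

# log of lognormal draws should behave like Normal(mean = -1, sd = 1)
y <- log(rlnorm(1e5, meanlog = -1, sdlog = 1))

y_summary <- c(mean = mean(y), median = median(y), sd = sd(y))
print(round(y_summary, 3))

# Overlay the exact normal density on a histogram of the simulated values
hist(y, breaks = 60, freq = FALSE, main = "log(X) for lognormal X",
     xlab = "value")
curve(dnorm(x, mean = -1, sd = 1), add = TRUE, lwd = 2)
```

This is why the blue and red lines nearly overlap in the log-transformed lognormal plots.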
d <- data.frame()
for (i in 1:1000) {
n <- rlnorm(100, meanlog=-1, sdlog= 1)
d[i,1] <- mean(n)
d[i,2] <- exp(mean(log(n)))
}
d <- d %>%
rename(arithmetic = V1, geometric = V2)
arith_vs_geom(d, "lognormal distribution",0.4,1.0)
hist_plot(d, "lognormal distribution",0.01)
Distribution 3
X∼UNIFORM(0,12)
mean <- mean(runif(1000, min = 0, max = 1))
curve(dunif(x, min = 0, max = 1), xlim = c(0,1), main = "Probability density function of uniform distribution", xlab = "value", ylab = "density")
abline(v = mean, lwd = 2, col = "blue")
abline(v = qunif(0.5,min = 0, max = 1), lwd = 2, col = "red")
curve(punif(x,min = 0, max = 1), xlim = c(0,1), main = "Cumulative distribution function of uniform distribution", xlab = "value", ylab = "probability")
abline(v = mean, lwd = 2, col = "blue")
abline(v = qunif(0.5,min = 0, max = 1), lwd = 2, col = "red")
unif <- runif(1000, min = 0, max = 1)
logunif <- log(unif)
x <- 1:1000
df_unif <- data.frame(x = x, unif = unif, logunif = logunif)
pdf(df_unif, logunif, "uniform distribution under log transformation")
cdf(df_unif, logunif, "uniform distribution under log transformation")
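d <- data.frame()
for (i in 1:1000) {
  # uniform draws, with the same min = 0, max = 1 used in the uniform code above
  n <- runif(100, min = 0, max = 1)
  d[i,1] <- mean(n)
  d[i,2] <- exp(mean(log(n)))
}
d <- d %>%
  rename(arithmetic = V1, geometric = V2)
# identity line range chosen to cover the uniform sample means
arith_vs_geom(d, "uniform distribution",0.25,0.6)
hist_plot(d, "uniform distribution",0.01)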
Show that if Xi > 0 for all i, then the arithmetic mean is greater than or equal to the geometric mean.
Hint: Start with the sample mean of the transformation Yi=log(Xi).
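One way to carry out the argument suggested by the hint is sketched below. It assumes that the concavity of the logarithm (Jensen's inequality for a finite average) may be used without proof, and it writes the sample mean of the transformed values as \bar{Y}.

```latex
\begin{align*}
\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} \log(X_i)
  &= \log\!\left(\Big(\prod_{i=1}^{n} X_i\Big)^{1/n}\right)
  && \text{(log of the geometric mean)} \\
\frac{1}{n}\sum_{i=1}^{n} \log(X_i)
  &\leq \log\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  && \text{(log is concave)} \\
\Big(\prod_{i=1}^{n} X_i\Big)^{1/n}
  &\leq \frac{1}{n}\sum_{i=1}^{n} X_i
  && \text{(exponentiate both sides; exp is increasing)}
\end{align*}
```

Equality holds only when all the X_i are equal, because the logarithm is strictly concave.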
What is the correct relationship between E[log(X)] and log(E[X])? Is one always larger? Equal? Explain your answer.
E[log(X)] is the mean of the log of X, and log(E[X]) is the log of the mean of X. For a positive random variable X, log(E[X]) is always greater than or equal to E[log(X)], with equality only when X is constant. This follows from the relationship between the arithmetic and geometric means shown below, since the arithmetic mean is always greater than or equal to the geometric mean. The same relationship can be seen in the scatter plots in Part I, where every point lies on or above the red line of identity.
\[\text{Arithmetic} \geq \text{Geometric}\] The next step is to insert the expressions that describe the arithmetic mean and the geometric mean: \[E[X] \geq e^{E[\log(X)]}\] Taking the logarithm of both sides, which preserves the inequality because log is an increasing function, leads to: \[\log(E[X]) \geq E[\log(X)]\]
According to the inequality above, log(E[X]) is at least as large as E[log(X)]. This was derived from the fact that the arithmetic mean is never smaller than the geometric mean.
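The relationship can also be checked numerically. The sketch below is illustrative only: it draws one large sample from each distribution, using the same parameters as the simulation code in this deliverable, and compares the mean of the logs with the log of the mean. The names `samples` and `jensen` are assumptions introduced for this example.

```r
set.seed(123)
n <- 1e5

# One large sample per distribution
samples <- list(
  gamma     = rgamma(n, shape = 3, scale = 1),
  lognormal = rlnorm(n, meanlog = -1, sdlog = 1),
  uniform   = runif(n, min = 0, max = 1)
)

jensen <- data.frame(
  distribution = names(samples),
  mean_of_log  = sapply(samples, function(x) mean(log(x))),  # estimates E[log(X)]
  log_of_mean  = sapply(samples, function(x) log(mean(x)))   # estimates log(E[X])
)

# The gap is non-negative for any sample of positive values
jensen$gap <- jensen$log_of_mean - jensen$mean_of_log
print(jensen, row.names = FALSE)
```

For these three distributions the gap is strictly positive, since none of them is concentrated on a single value.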
Applying the log transformation was a useful way to see how the distribution of a random variable changes under a logarithm. Once again, this deliverable highlights the importance of probability distributions for understanding random phenomena.