1 Introduction

Probability distributions help us understand the world: applied across many areas of life, they let us quantify how likely random events are and, in turn, measure the chances of success in different domains. There are many statistical distributions, but this deliverable focuses on only three: gamma, lognormal and uniform.

2 Method

Several helper functions were used to generate the charts in this deliverable; they are shown in the code chunk below. The pdf function plots an estimate of the probability density function of a distribution from simulated draws. The cdf function plots the corresponding empirical cumulative distribution function. The arith_vs_geom function shows the relationship between the arithmetic and geometric means of 1000 samples drawn from each distribution. The final function, hist_plot, overlays the histograms of the arithmetic and geometric means for each distribution.

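library(ggplot2)  # plotting
library(dplyr)    # %>% and rename(), used in the simulation chunks

# Note: defining pdf() here masks grDevices::pdf() for the rest of the session.
# `type` is the vector of values to plot; `var` is the label used in the title.
pdf <- function(df, type, var) {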
  ggplot(df) +
    geom_density(aes(x = type), fill = "grey") +
    geom_vline(aes(xintercept = mean(type)), color = "blue") +
    geom_vline(aes(xintercept = median(type)), color = "red") +
    labs(title = sprintf("Probability density function of %s",var),
         subtitle = "Blue represents mean, red represents median",
         x = "value") +
    theme_bw() +
    theme(plot.title = element_text(hjust=0.5),
          plot.subtitle = element_text(hjust=0.5))
}

cdf <- function(df, type, var) {
  ggplot(df, aes(type)) +
    stat_ecdf(geom = "step", color = "black") +
    geom_vline(aes(xintercept = mean(type)), color = "blue") +
    geom_vline(aes(xintercept = median(type)), color = "red") +
    labs(title = sprintf("Cumulative distribution function of %s", var),
         subtitle = "Blue represents mean, red represents median",
         x = "value",
         y = "percentile") +
    theme_bw() +
    theme(plot.title = element_text(hjust=0.5),
          plot.subtitle = element_text(hjust=0.5))
}

arith_vs_geom <- function(d, var) {
  ggplot(d) +
    geom_point(aes(y = arithmetic, x = geometric), color = "blue") +
    geom_smooth(aes(y = arithmetic, x = geometric), method = "lm",
                color = "black", se = FALSE) +
    labs(title=sprintf("Geometric vs Arithmetic mean for %s", var),
         y = "Arithmetic mean",
         x = "Geometric mean") +
    theme_bw() +
    theme(plot.title = element_text(hjust=0.5),
          plot.subtitle = element_text(hjust=0.5))
}

hist_plot <- function(d, var, n) {
  ggplot(d) +
    geom_histogram(aes(x = arithmetic), binwidth = n, fill = "blue", alpha = 0.4) +
    geom_histogram(aes(x = geometric), binwidth = n, fill = "red", alpha = 0.4) +
    labs(x = "Mean Values",
         title = sprintf("Distribution of Arithmetic and Geometric mean values of %s", var),
         subtitle = "Blue represents arithmetic, red represents geometric") +
    theme_bw() +
    theme(plot.title = element_text(hjust=0.5),
          plot.subtitle = element_text(hjust=0.5))
}
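
Part 1 asks for the line of identity as the reference line and for a histogram of the difference between the two means, whereas arith_vs_geom draws a fitted regression line and hist_plot overlays the two histograms. The following is a minimal sketch of variants that follow the wording of the questions more literally. The names arith_vs_geom_identity and diff_hist are hypothetical and are not used elsewhere in this report; both assume a data frame with columns arithmetic and geometric.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of the sample means with the
# line of identity (y = x) as the reference line.
arith_vs_geom_identity <- function(d, var) {
  ggplot(d, aes(x = geometric, y = arithmetic)) +
    geom_point(color = "blue") +
    geom_abline(intercept = 0, slope = 1, color = "black") +
    labs(title = sprintf("Geometric vs Arithmetic mean for %s", var),
         x = "Geometric mean",
         y = "Arithmetic mean") +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5))
}

# Hypothetical helper: histogram of (arithmetic mean - geometric mean).
diff_hist <- function(d, var, binwidth) {
  ggplot(d, aes(x = arithmetic - geometric)) +
    geom_histogram(binwidth = binwidth, fill = "grey", color = "black") +
    labs(title = sprintf("Arithmetic minus geometric mean for %s", var),
         x = "Difference of means") +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5))
}
```

Points lying above the identity line are samples whose arithmetic mean exceeds the geometric mean, so the two plots convey the same information in different forms.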

3 Questions and Answers

3.1 Part 1

For each distribution below, generate a figure of the PDF and CDF. Mark the mean and median in the figure.

For each distribution below, generate a figure of the PDF and CDF of the transformation Y = log(X) random variable. Mark the mean and median in the figure. You may use simulation or analytic methods in order to find the PDF and CDF of the transformation.

For each of the distributions below, generate 1000 samples of size 100. For each sample, calculate the geometric and arithmetic mean. Generate a scatter plot of the geometric and arithmetic sample means. Add the line of identity as a reference line.

Generate a histogram of the difference between the arithmetic mean and the geometric mean.

Distribution 1

X∼GAMMA(shape=3,scale=1)

gamma <- rgamma(10000, shape=3, scale=1)
x <- 1:10000
loggamma <- log(gamma)
df_gamma <- data.frame(x = x, gamma = gamma,loggamma = loggamma)
pdf(df_gamma, gamma, "gamma distribution")

cdf(df_gamma, gamma, "gamma distribution")

pdf(df_gamma, loggamma, "gamma distribution under log transformation")

cdf(df_gamma, loggamma, "gamma distribution under log transformation")
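
The simulated curves for log(X) can be cross-checked analytically. The sketch below uses the change-of-variables formula f_Y(y) = f_X(exp(y)) * exp(y) for Y = log(X). The function names dloggamma and ploggamma are hypothetical, and the finite integration limits (-20, 5) are an assumption chosen so that the omitted tails are negligible.

```r
# Density of Y = log(X), X ~ Gamma(shape = 3, scale = 1), by change of variables.
dloggamma <- function(y, shape = 3, scale = 1) {
  dgamma(exp(y), shape = shape, scale = scale) * exp(y)
}

# CDF of Y: P(log(X) <= y) = P(X <= exp(y)).
ploggamma <- function(y, shape = 3, scale = 1) {
  pgamma(exp(y), shape = shape, scale = scale)
}

# The density should integrate to (approximately) 1.
total_mass <- integrate(dloggamma, lower = -20, upper = 5)$value

# Mean of Y by numerical integration; for a gamma variable with scale 1
# the exact value is digamma(shape).
mean_y <- integrate(function(y) y * dloggamma(y), lower = -20, upper = 5)$value

# The median is preserved by the monotone log transform.
median_y <- log(qgamma(0.5, shape = 3, scale = 1))

c(total_mass = total_mass, mean_y = mean_y, median_y = median_y)
```

Because log is increasing, the median of log(X) is the log of the median of X, while the mean of log(X) is not the log of the mean of X.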

d <- data.frame()
  for (i in 1:1000) {
      n <- rgamma(100, shape=3, scale=1)
      d[i,1] <- mean(n)
      d[i,2] <- exp(mean(log(n)))
  }
d <- d %>% 
  rename(arithmetic = V1, geometric = V2)
arith_vs_geom(d, "gamma distribution")

hist_plot(d, "gamma distribution", 0.05)
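
The same simulation loop is repeated for each distribution, which makes copy-paste slips easy. As a sketch under stated assumptions, the loop could be wrapped in one helper that takes the sampler as an argument. The name simulate_means and the seed are hypothetical choices, not part of the original analysis; only base R is used.

```r
# Hypothetical helper: draw n_rep samples of size n with the sampler rdist
# and return the arithmetic and geometric mean of each sample.
simulate_means <- function(rdist, n_rep = 1000, n = 100) {
  means <- replicate(n_rep, {
    s <- rdist(n)
    c(arithmetic = mean(s), geometric = exp(mean(log(s))))
  })
  # replicate() returns a 2 x n_rep matrix; transpose to one row per sample.
  as.data.frame(t(means))
}

set.seed(42)  # arbitrary seed, only for reproducibility
d_gamma <- simulate_means(function(n) rgamma(n, shape = 3, scale = 1))
head(d_gamma)
```

The resulting data frame already has the columns arithmetic and geometric, so it can be passed to arith_vs_geom and hist_plot without the rename step.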

Distribution 2

X∼LOG NORMAL(μ=−1,σ=1)

lnorm <- rlnorm(1000, meanlog = -1, sdlog = 1)
loglnorm <- log(lnorm)
x <- 1:1000
df_lnorm <- data.frame(x = x, lnorm = lnorm, loglnorm = loglnorm)
pdf(df_lnorm, lnorm, "lognormal distribution")

cdf(df_lnorm, lnorm, "lognormal distribution")

pdf(df_lnorm, loglnorm, "lognormal distribution under log transformation")

cdf(df_lnorm, loglnorm, "lognormal distribution under log transformation")
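
For the lognormal case the transformation has a closed form: if X is lognormal with meanlog = -1 and sdlog = 1, then log(X) is normal with mean -1 and standard deviation 1, so its mean and median coincide. The short check below is a sketch using only base R; the grid of y values is an arbitrary choice.

```r
meanlog <- -1
sdlog <- 1

# Mean and median of X itself (standard lognormal formulas).
mean_x <- exp(meanlog + sdlog^2 / 2)
median_x <- qlnorm(0.5, meanlog = meanlog, sdlog = sdlog)  # equals exp(meanlog)

# Density of Y = log(X) by change of variables, compared with the normal density.
y <- seq(-4, 2, by = 0.5)
dens_y <- dlnorm(exp(y), meanlog = meanlog, sdlog = sdlog) * exp(y)
max_gap <- max(abs(dens_y - dnorm(y, mean = meanlog, sd = sdlog)))

c(mean_x = mean_x, median_x = median_x, max_gap = max_gap)
```

The mean of X lies to the right of its median because the lognormal distribution is right-skewed, while the log transform removes that skew.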

d <- data.frame()
  for (i in 1:1000) {
      n <- rlnorm(100, meanlog=-1, sdlog= 1)
      d[i,1] <- mean(n)
      d[i,2] <- exp(mean(log(n)))
  }
d <- d %>% 
  rename(arithmetic = V1, geometric = V2)
arith_vs_geom(d, "lognormal distribution")

hist_plot(d, "lognormal distribution",0.01)

Distribution 3

X∼UNIFORM(0,12)

unif <- runif(1000, min = 0, max = 12)
logunif <- log(unif)
x <- 1:1000
df_unif <- data.frame(x = x, unif = unif, logunif = logunif)
pdf(df_unif, unif, "uniform distribution")

cdf(df_unif, unif, "uniform distribution")

pdf(df_unif, logunif, "uniform distribution under log transformation")

cdf(df_unif, logunif, "uniform distribution under log transformation")
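
The log of a uniform variable also has a simple analytic form. Taking the stated distribution UNIFORM(0, 12) as an assumption, Y = log(X) has CDF exp(y)/12 and density exp(y)/12 for y up to log(12), which gives a mean of log(12) - 1 and a median of log(6). The function names below are hypothetical, and the lower integration limit of -30 is an assumption chosen so that the omitted tail is negligible.

```r
b <- 12  # upper bound taken from the stated distribution UNIFORM(0, 12)

# Density and CDF of Y = log(X) for X ~ Uniform(0, b).
dlogunif <- function(y) ifelse(y <= log(b), exp(y) / b, 0)
plogunif <- function(y) pmin(exp(y) / b, 1)

# Closed-form mean and median of Y.
mean_y <- log(b) - 1
median_y <- log(b / 2)

# Numerical cross-check of the mean.
num_mean <- integrate(function(y) y * dlogunif(y), lower = -30, upper = log(b))$value

c(mean_y = mean_y, num_mean = num_mean, median_y = median_y)
```

The density of log(X) grows exponentially up to log(12) and then stops, so the transformed variable is left-skewed and its mean falls below its median.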

d <- data.frame()
  for (i in 1:1000) {
      n <- runif(100, min = 0, max = 12)
      d[i,1] <- mean(n)
      d[i,2] <- exp(mean(log(n)))
  }
d <- d %>% 
  rename(arithmetic = V1, geometric = V2)
arith_vs_geom(d, "uniform distribution")

hist_plot(d, "uniform distribution",0.01)

3.2 Part 2 (Optional)

Show that if Xi > 0 for all i, then the arithmetic mean is greater than or equal to the geometric mean.

Hint: Start with the sample mean of the transformation Yi=log(Xi).
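
One possible line of argument following the hint is sketched below. It relies on Jensen's inequality for the convex function exp, which is an assumption about the intended route rather than something stated in the assignment; the notation \bar{Y} for the sample mean of the Yi is introduced here for illustration.

```latex
\[ \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} \log(X_i) = \log\left(\left(\prod_{i=1}^{n} X_i\right)^{1/n}\right) \]
\[ e^{\bar{Y}} = \left(\prod_{i=1}^{n} X_i\right)^{1/n} \quad \text{(the geometric mean)} \]
\[ e^{\bar{Y}} = e^{\frac{1}{n}\sum_{i=1}^{n} Y_i} \le \frac{1}{n}\sum_{i=1}^{n} e^{Y_i} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{(Jensen, since } e^{y} \text{ is convex)} \]
```

Equality holds only when all the Xi are equal, which is why every simulated sample in Part 1 shows a strictly larger arithmetic mean.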

3.3 Part 3

What is the correct relationship between E[log(X)] and log(E[X])? Is one always larger? Equal? Explain your answer.

E[log(X)] is the mean of the log of X, and log(E[X]) is the log of the mean of X. The two are not equal in general: for any positive random variable X, E[log(X)] is less than or equal to log(E[X]), with equality only when X is constant. This mirrors the relationship between the geometric and arithmetic means, since the arithmetic mean is always greater than or equal to the geometric mean. This connection can be seen in the scatter plots in Part 1, where each sample's arithmetic mean is at least as large as its geometric mean.

\[\text{Arithmetic} \ge \text{Geometric}\] The next step is to insert the expressions that describe the arithmetic and geometric means: \[E[X] \ge e^{E[\log(X)]}\] Taking the log of both sides, which preserves the inequality because log is an increasing function, leads to: \[\log(E[X]) \ge E[\log(X)]\]

According to the inequality above, log(E[X]) is never smaller than E[log(X)], and it is strictly larger whenever X is not constant. This follows from the arithmetic mean always being at least as large as the geometric mean, or equivalently from Jensen's inequality applied to the concave log function.
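
The inequality can also be checked numerically. The sketch below draws one large sample from each of the three distributions and compares the mean of the logs with the log of the mean. The sample size, the seed, and the uniform upper bound of 12 (taken from the stated distribution) are assumptions made for illustration.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

samples <- list(
  gamma     = rgamma(1e5, shape = 3, scale = 1),
  lognormal = rlnorm(1e5, meanlog = -1, sdlog = 1),
  uniform   = runif(1e5, min = 0, max = 12)
)

# One column per distribution: sample estimates of E[log(X)] and log(E[X]).
jensen <- sapply(samples, function(x) {
  c(mean_of_log = mean(log(x)), log_of_mean = log(mean(x)))
})
print(round(jensen, 3))

# The gap log(E[X]) - E[log(X)] should be positive in every column.
gap <- jensen["log_of_mean", ] - jensen["mean_of_log", ]
```

The size of the gap depends on how spread out the distribution is relative to its mean, but its sign is the same in all three cases.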

4 Conclusion

Applying the log transformation to random variables was a useful way to see how their distributions change under a logarithm. Once again, this deliverable highlights the importance of probability distributions for understanding random phenomena.