Problem 1

library(ggplot2)  # needed for qplot() below

set.seed(1234)
n <- 5
lambda <- 8

# 10,000 draws from a Gamma(shape = n, scale = 1 / lambda) distribution
X <- rgamma(10000, shape = n, scale = 1 / lambda)

Calculating our sum of exponentials, \(Y\), and a single exponential, \(Z\):

# Y: element-wise sum of n independent Exp(lambda) samples of size 10,000
Y <- 0
for (i in 1:n){
  Y <- Y + rexp(10000, rate = lambda)
}

# Z: a single Exp(lambda) sample of size 10,000
Z <- rexp(10000, rate = lambda)

Sample means and variances of our simulated distributions:

print(mean(X))
## [1] 0.6228281
print(mean(Y))
## [1] 0.6242317
print(mean(Z))
## [1] 0.1261022
print(var(X))
## [1] 0.07520684
print(var(Y))
## [1] 0.07647876
print(var(Z))
## [1] 0.01464671
qplot(Z)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
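As a quick sanity check, we can line these simulated moments up against the standard theoretical values for the Gamma and exponential distributions (derived formally in Question 1b below):

# Theoretical moments: X ~ Gamma(shape = n, rate = lambda), Y ~ sum of n Exp(lambda), Z ~ Exp(lambda)
c(theory = n / lambda,   simulated = mean(X))   # E(X) = n / lambda
c(theory = n / lambda^2, simulated = var(X))    # Var(X) = n / lambda^2
c(theory = n / lambda,   simulated = mean(Y))   # E(Y) = n / lambda
c(theory = n / lambda^2, simulated = var(Y))    # Var(Y) = n / lambda^2
c(theory = 1 / lambda,   simulated = mean(Z))   # E(Z) = 1 / lambda
c(theory = 1 / lambda^2, simulated = var(Z))    # Var(Z) = 1 / lambda^2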

Question 1b

Using calculus, calculate the expected value and variance of the Gamma pdf (\(X\)). Using the moment generating function for exponentials, calculate the expected value of the single exponential (\(Z\)) and the sum of exponentials (\(Y\)).

Gamma Distribution

For the gamma distribution, the pdf is

\[\begin{aligned} f(x; \alpha, \beta) = \frac{x^{\alpha - 1}e^{-\beta x}\beta^{\alpha}}{\Gamma(\alpha)} \end{aligned}\]

where \(\Gamma(\alpha) = (\alpha - 1)!\) for integer \(\alpha\). We can find the expected value:

\[\begin{aligned} E(X) = \int_0^{\infty} x \cdot \frac{x^{\alpha - 1}e^{-\beta x}\beta^{\alpha}}{\Gamma(\alpha)} \,dx\newline = \int_0^{\infty} \frac{\alpha}{\beta}\frac{\beta^{\alpha + 1}}{\Gamma(\alpha + 1)}x^{(\alpha + 1) - 1}e^{-\beta x} \,dx \end{aligned}\]

We got to the step above using the fact that \(\Gamma(\alpha + 1) = \alpha\Gamma(\alpha)\). We can pull the constants out of the integral and are left with the pdf of a Gamma(\(\alpha + 1\), \(\beta\)) distribution, which integrates to 1 over \((0, \infty)\):

\[\begin{aligned} E(X) = \frac{\alpha}{\beta}\int_0^{\infty} \frac{\beta^{\alpha + 1}}{\Gamma(\alpha + 1)}x^{(\alpha + 1) - 1}e^{-\beta x} \,dx = \frac{\alpha}{\beta} \end{aligned}\]
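We can spot-check this closed form numerically with R’s integrate(), plugging in the shape and rate used in the simulation (\(\alpha = n\), \(\beta = \lambda\)); this is just an illustrative sketch, not part of the derivation:

# Numerical check of E(X) = alpha / beta, with alpha = n and beta = lambda
integrate(function(x) x * dgamma(x, shape = n, rate = lambda), lower = 0, upper = Inf)$value
n / lambda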

For the variance of the gamma distribution, we can use our calculation above, plugging into the formula \(var(X) = E(X^2) - (E(X))^2\)

\[\begin{aligned} var(X) = \int_0^{\infty} x^2 f_X(x) \,dx - (E(X))^2\newline = \int_0^{\infty} x^2 \frac{x^{\alpha - 1}e^{-\beta x}\beta^{\alpha}}{\Gamma(\alpha)} \,dx - \frac{\alpha^2}{\beta^2}\newline = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\int_0^{\infty}\left(\frac{t}{\beta}\right)^{\alpha + 1}e^{-t}\,\frac{dt}{\beta} - \frac{\alpha^2}{\beta^2} \end{aligned}\]

where we substituted \(t = \beta x\). Now we can simplify:

\[\begin{aligned} var(X) = \frac{\beta^{\alpha}}{\beta^{\alpha + 2}\Gamma(\alpha)}\int_0^{\infty} t^{\alpha + 1}e^{-t} \,dt - \frac{\alpha^2}{\beta^2} \newline = \frac{\Gamma(\alpha + 2)}{\beta^2\Gamma(\alpha)} - \frac{\alpha^2}{\beta^2}\newline = \frac{\alpha(\alpha + 1)\Gamma(\alpha) - \alpha^2\Gamma(\alpha)}{\beta^2\Gamma(\alpha)}\newline = \frac{\alpha(\alpha + 1) - \alpha^2}{\beta^2} = \frac{\alpha}{\beta^2} \end{aligned}\]
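The same numerical spot-check works for the variance, computing \(E(X^2) - (E(X))^2\) directly (EX and EX2 are just illustrative names):

# Numerical check of var(X) = alpha / beta^2 via E(X^2) - (E(X))^2
EX  <- integrate(function(x) x   * dgamma(x, shape = n, rate = lambda), 0, Inf)$value
EX2 <- integrate(function(x) x^2 * dgamma(x, shape = n, rate = lambda), 0, Inf)$value
EX2 - EX^2
n / lambda^2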

Moment-generating function for exponentials

Let’s define our moment-generating function for the exponential distribution:

\[\begin{aligned} M_Z(t) = \frac{1}{1 - \frac{t}{\lambda}}, \quad t < \lambda \end{aligned}\]

We know the moment generating function of the exponential distribution, so we just need to take its first derivative and evaluate it at \(t=0\):

\[\begin{aligned} \frac{dM_Z}{dt} = \frac{d}{dt}\left(1 - \frac{t}{\lambda}\right)^{-1} = \frac{1}{\lambda\left(1 - \frac{t}{\lambda}\right)^2} \end{aligned}\]

When we evaluate the above at \(t=0\), we get \(1/\lambda = 1/8 = 0.125\), which is roughly in line with our simulated data:

mean(Z)
## [1] 0.1261022
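We can also check the derivative itself numerically with a small finite-difference sketch (M_Z and h below are just illustrative names, not objects used elsewhere):

# Central finite-difference approximation of M_Z'(0) for M_Z(t) = 1 / (1 - t / lambda)
M_Z <- function(t) 1 / (1 - t / lambda)
h <- 1e-6
(M_Z(h) - M_Z(-h)) / (2 * h)   # should be close to 1 / lambda = 0.125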

For the sum of exponentials (an Erlang distribution), we first need the moment generating function of a sum of exponentially distributed random variables. We can use the property that the MGF of a sum of independent random variables is the product of their individual MGFs, \(M_{X + Y}(t) = M_X(t)M_Y(t)\):

\[\begin{aligned} M_Y(t) = (\frac{1}{1 - \frac{t}{\lambda}})^n = (1 - \frac{t}{\lambda})^{-n} \end{aligned}\]

We can take the first derivative of this expression with respect to \(t\) and evaluate it at \(t=0\) to find the expected value of our random variable \(Y\):

\[\begin{aligned} M_Y'(t)\rvert_{t=0} = \frac{d}{dt}\left(1 - \frac{t}{\lambda}\right)^{-n}\rvert_{t=0} \newline = -n\left(1 - \frac{t}{\lambda}\right)^{-n - 1}\left(\frac{-1}{\lambda}\right)\rvert_{t=0} = \frac{n}{\lambda\left(1 - \frac{t}{\lambda}\right)^{n + 1}}\rvert_{t=0}\newline = \frac{n}{\lambda} \end{aligned}\]

This predicted value is pretty close to our simulated data mean

print(n / lambda)
## [1] 0.625
print(mean(Y))
## [1] 0.6242317
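The same finite-difference sketch applies to the Erlang MGF (again, M_Y and h are illustrative names):

# Central finite-difference approximation of M_Y'(0) for M_Y(t) = (1 - t / lambda)^(-n)
M_Y <- function(t) (1 - t / lambda)^(-n)
h <- 1e-6
(M_Y(h) - M_Y(-h)) / (2 * h)   # should be close to n / lambda = 0.625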

Question 1c

\(P(Z > \lambda | Z > \lambda /2)\)

We can rewrite this as \(P(Z > \lambda/2 \cap Z > \lambda) / P(Z > \lambda/2)\). Since \(Z > \lambda\) implies \(Z > \lambda/2\), the numerator simplifies to \(P(Z > \lambda/2 \cap Z > \lambda) = P(Z > \lambda)\). We can then calculate the numerator and denominator separately: \(P(Z > \lambda)\) and \(P(Z > \lambda/2)\).

First, let’s find \(P(Z > \lambda)\):

\[\begin{aligned} P(Z > \lambda) = 1 - P(Z \le \lambda) = 1 - F(\lambda)\newline = 1 - (1 - e^{-\lambda}) = e^{-\lambda} \end{aligned}\]

Here \(F(x) = 1 - e^{-\lambda x}\) is the cumulative distribution function of the exponential distribution.

Similarly, we can find \(P(Z > \lambda/2) = 1 - F(\lambda/2) = 1 - (1 - e^{-\lambda/2}) = e^{-\lambda/2}\). Dividing these, we get \(e^{-\lambda}/e^{-\lambda/2} = e^{-\lambda / 2}\), and we can plug in our value of \(\lambda = 8\) to find the probability:

(prob1c <- exp(-lambda / 2))
## [1] 0.01831564
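To make the arithmetic explicit, we can evaluate the numerator and denominator from the derivation above separately and take their ratio (num and den are just illustrative names):

# Numerator and denominator of the conditional probability, at lambda = 8
num <- exp(-lambda)       # P(Z > lambda) from the derivation above
den <- exp(-lambda / 2)   # P(Z > lambda / 2)
num / den                 # matches prob1c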

Question 1d

Similar to above, we can rewrite the expression as \(P(Z > \lambda \cap Z > 2\lambda) / P(Z > 2\lambda)\). Any outcome with \(Z > 2\lambda\) also satisfies \(Z > \lambda\), so the intersection in the numerator is simply the event \(Z > 2\lambda\).

\[\begin{aligned} P(Z > \lambda | Z > 2\lambda) = \frac{P(Z > 2\lambda)}{P(Z > 2\lambda)} = 1 \end{aligned}\]

The numerator and denominator are identical, so the conditional probability is 1. This makes sense given that \(Z\) must already be greater than \(\lambda\) in order to be greater than \(2\lambda\).

Question 1e

\(P(Z>3\lambda | Z> \lambda)\)

Similar to above, we can rewrite this expression:

\[\begin{aligned} P(Z>3\lambda | Z> \lambda) = \frac{P(Z > \lambda \cap Z > 3\lambda)}{P(Z > \lambda)}\newline = \frac{P(Z > 3\lambda)}{P(Z > \lambda)} \end{aligned}\]

We already know from above that \(P(Z > \lambda) = e^{-\lambda}\). We can then evaluate the numerator \(P(Z > 3\lambda)\)

\[\begin{aligned} P(Z > 3\lambda) = 1 - (1 - e^{-3\lambda}) = e^{-3\lambda} \end{aligned}\]

Plugging into our conditional probability, we get

\[\begin{aligned} P(Z>3\lambda | Z> \lambda) = \frac{e^{-3\lambda}}{e^{-\lambda}} = e^{-2\lambda} \end{aligned}\]
(prob1e <- exp(-2 * lambda))
## [1] 1.125352e-07

Question 2

Loosely investigate whether \(P(Y, Z) = P(Y)P(Z)\) by building a table over quartiles and evaluating the marginal and joint probabilities. First, let’s get the quartiles of our simulated values for \(Y\) and \(Z\):

quartiles_to_keep <- c("25%", "50%", "75%", "100%")
(qY <- quantile(Y)[quartiles_to_keep])
##       25%       50%       75%      100% 
## 0.4262256 0.5848759 0.7794643 2.2111061
(qZ <- quantile(Z)[quartiles_to_keep])
##        25%        50%        75%       100% 
## 0.03798518 0.08890607 0.17665073 1.05238332

Now let’s take the outer product of the two quartile vectors to get the pairwise products of the \(Y\) and \(Z\) quartiles. This acts as our “sample space” over the simulated data; in other words, it gives us the bins into which our outcomes could fall:

# Outer product of the quartile vectors: rows index Y quartiles, columns index Z quartiles
counts <- as.data.frame(qY %*% t(qZ))
rownames(counts) <- c("YQ1", "YQ2", "YQ3", "YQ4")
colnames(counts) <- c("ZQ1", "ZQ2", "ZQ3", "ZQ4")

N <- sum(counts)
(joint_probs <- counts / N)
##             ZQ1         ZQ2        ZQ3        ZQ4
## YQ1 0.002983846 0.006983830 0.01387643 0.08266777
## YQ2 0.004094498 0.009583363 0.01904154 0.11343850
## YQ3 0.005456739 0.012771750 0.02537666 0.15117951
## YQ4 0.015479130 0.036229620 0.07198596 0.42885087

Now we can calculate the marginal probabilities (an extra row and column of the table) by summing across the rows and down the columns of joint_probs; the division by N above already converted the cell values into probabilities.
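As a sketch, base R’s addmargins() can append those marginal totals directly to the table as an extra row and column:

# Append the row and column marginal sums (labelled "Sum") to the joint table
addmargins(as.matrix(joint_probs))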

We can verify that this is a valid probability table by checking that the cell probabilities total 1, whether we accumulate them row by row or column by column:

xx <- 0
yy <- 0

# accumulate total probability row by row
for (i in 1:nrow(joint_probs)){xx <- xx + sum(joint_probs[i, ])}
# accumulate total probability column by column
for (i in 1:ncol(joint_probs)){yy <- yy + sum(joint_probs[, i])}

# Check both totals are equal to 1
(xx == 1 & yy == 1)
## [1] TRUE

Independence

First, let’s run a \(\chi^2\) test for independence. We can use our counts table, treating its cells as the frequencies of observations in each pair of quartile bins.

chisq.test(counts)
## Warning in chisq.test(counts): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  counts
## X-squared = 7.7341e-32, df = 9, p-value = 1

Our p-value of 1 means we cannot reject the null hypothesis of independence; that is, we find no evidence of a relationship between \(Y\) and \(Z\). This makes sense, as \(Y\) and \(Z\) were simulated from separate random draws in our code and should be independent random variables.
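In fact, because each cell of joint_probs was built as the product of a Y quartile value and a Z quartile value divided by a common total, every cell already equals the product of its row and column marginals, which is why the \(\chi^2\) statistic is essentially zero. A quick check (expected is just an illustrative name):

# Under independence, each cell equals the product of its row and column marginals
expected <- outer(rowSums(joint_probs), colSums(joint_probs))
all.equal(as.matrix(joint_probs), expected, check.attributes = FALSE)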

Now we can run Fisher’s exact test for independence, which is typically used with smaller samples:

ftest <- fisher.test(counts)
## Warning in fisher.test(counts): 'x' has been rounded to integer: Mean relative
## difference: 0.470764
ftest
## 
##  Fisher's Exact Test for Count Data
## 
## data:  counts
## p-value = 1
## alternative hypothesis: two.sided

Similar to above, a p-value of 1 indicates we cannot reject the null hypothesis that our random variables are independent.
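As one last informal check on the simulated draws themselves, the sample correlation between Y and Z should be close to zero:

# Sample correlation between the simulated Y and Z vectors
cor(Y, Z)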