Generate X, Y & Z


X~Gamma

Generate a random variable \(X\) that has 10,000 random Gamma pdf values, with \(n\) greater than 3 and \(\lambda\) between 2 and 10.

I interpret this question differently. The Gamma distribution models the amount of time before a fixed number of random events occurs, and \(n\) here is the number of random Gamma pdf values generated, i.e., the sample size. The \(n\) in the question is not a size parameter: it's usually called \(\alpha\) and is the shape parameter, the fixed number of events that must occur. Nor is \(\lambda\) a size parameter: \(\lambda\) is the rate parameter, the mean number of events per interval of time. \(\lambda\) is related to the scale parameter, \(\theta\), the average time between events, by the formula \(\lambda=\frac{1}{\theta}\). Only one of \(\lambda\) or \(\theta\) can be specified when calling a Gamma function, not both.
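
As a quick illustration of that rate/scale relationship (my own addition, not part of the assignment): drawing with the rate parameter and drawing with the equivalent scale parameter from the same seed should produce the same values.

# Check lambda = 1/theta: drawing with rate = 5 and with scale = 1/5
# from the same seed gives the same values
set.seed(1)
a <- rgamma(5, shape = 7, rate = 5)
set.seed(1)
b <- rgamma(5, shape = 7, scale = 1/5)
all.equal(a, b)  # should be TRUE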

# We set a seed so this is reproducible
set.seed(6050)

# Generate random variable X
n <- 10000
alpha <- 7
lambda <- 5
X <- rgamma(n, shape = alpha, rate = lambda)


Y~Sum of Exponentials

Generate 10,000 observations from the sum of \(n\) exponential pdfs with rate/shape parameter (\(\theta\)). The \(n\) and \(\theta\) must be the same as in the previous case.

The sum of independent exponential random variables is a Gamma random variable because it has the same moment generating function as a Gamma distribution.

Here we run into a problem with reproducibility. If we re-used the same seed before each draw, the seven sets of exponential values would be identical, and the result wouldn't be a sum of exponentials so much as one set of exponentials times seven. So we set a unique seed before each of the seven draws.

# Generate the sum of seven exponential draws 10,000 times, seeding
# each draw separately (seeds 6051-6057)
Y <- 0
for (i in 1:7) {
  set.seed(6050 + i)
  Y <- Y + rexp(n, lambda)
}
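
As a quick sanity check (my own addition), the sum of seven independent Exp(5) draws should follow a Gamma(7, 5) distribution, so a Kolmogorov-Smirnov test against that distribution should show no significant discrepancy.

# Compare Y against the Gamma(shape = 7, rate = 5) CDF
ks.test(Y, "pgamma", shape = 7, rate = 5)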


Z~Exponential

Generate 10,000 observations from a single exponential pdf with rate/shape parameter (\(\lambda\)).

An exponential distribution models the amount of time until a single random event happens. It's a Gamma distribution where the shape parameter, \(\alpha\), is one. Seen this way, it makes sense that a sum of exponential distributions equals a Gamma distribution when the rate of events per interval of time is the same and the number of exponential distributions summed equals the number of events that must occur in the Gamma distribution.

set.seed(6058)
Z <- rexp(n,lambda)
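
As a check of the claim above that an exponential is a Gamma with shape one (my own addition), the two densities agree on a grid of points.

# Exp(lambda) is Gamma(shape = 1, rate = lambda): the densities match
xs <- seq(0.1, 2, by = 0.1)
all.equal(dexp(xs, rate = 5), dgamma(xs, shape = 1, rate = 5))  # should be TRUE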



1a. Empirical Means and Variances

Calculate the empirical expected value (means) and variances of all three pdfs.

Here we see that the mean of the Gamma distribution with \(\alpha=7\) is approximately equal to the mean of the sum of seven exponential distributions, and the mean of the single exponential distribution is about one-seventh of the others.

# Calculate the means of all three
mean(X)
## [1] 1.394451
mean(Y)
## [1] 1.405457
mean(Z)
## [1] 0.1971788

Here we see that the variance of the Gamma distribution with \(\alpha=7\) is approximately equal to the variance of the sum of seven exponential distributions, and the variance of the single exponential distribution is about one-seventh of the others.

# Calculate the variances of all three
var(X)
## [1] 0.2759522
var(Y)
## [1] 0.2795335
var(Z)
## [1] 0.04063684
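
For reference (my own addition), the analytic values these estimates should approach follow from the standard formulas: a Gamma has mean \(\frac{\alpha}{\lambda}\) and variance \(\frac{\alpha}{\lambda^2}\), and an exponential has mean \(\frac{1}{\lambda}\) and variance \(\frac{1}{\lambda^2}\).

# Theoretical means and variances for comparison
c(mean = alpha / lambda, var = alpha / lambda^2)  # Gamma: 1.40, 0.28
c(mean = 1 / lambda, var = 1 / lambda^2)          # Exponential: 0.20, 0.04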



1b. Analytical Means and Variances

Using calculus, calculate the expected value and variance of the Gamma pdf (X). Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y).


Gamma Mean

Using calculus, calculate the expected value of the Gamma pdf (X).

For this problem we adapted the proof for the expected value of a Gamma distribution from https://statproofbook.github.io/P/gam-mean.html. In retrospect we could also have simply evaluated the integral numerically in R, as shown at the end of this section.

The expected value is the probability-weighted average over all possible values (note that the Gamma distribution is only defined for \(x>0\)):

\(E(X)=\displaystyle\int_0^{\infty}x\frac{\lambda^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\lambda x}dx\)

where the first \(x\) after the integration symbol is the value and the remaining portion is the probability.

Substituting our values of \(\alpha=7\) and \(\lambda=5\) we get:

\(E(X)=\displaystyle\int_0^{\infty}x\frac{5^7}{\Gamma(7)}x^{7-1}e^{-5 x}dx\)

We can combine the two exponents of \(x\) into one term:

\(E(X)=\displaystyle\int_0^{\infty}\frac{5^7}{\Gamma(7)}x^{(7+1)-1}e^{-5 x}dx\)

We can pull out a \(\frac{1}{\lambda}\):

\(E(X)=\displaystyle\int_0^{\infty}\frac{1}{5}\frac{5^{7+1}}{\Gamma(7)}x^{(7+1)-1}e^{-5 x}dx\)

We can use \(\Gamma(\alpha+1)=\Gamma(\alpha)\alpha\) to convert \(\frac{1}{\Gamma(\alpha)}\) into \(\alpha\frac{1}{\Gamma(\alpha+1)}\):

\(E(X)=\displaystyle\int_0^{\infty}\frac{7}{5}\frac{5^{7+1}}{\Gamma(7+1)}x^{(7+1)-1}e^{-5 x}dx\)

Now when we pull the term \(\frac{7}{5}\) out of the integral, what's left is the Gamma probability density function with \(\alpha\) replaced by \(\alpha+1\), and since a probability density function integrates to one over all possible values, we get:

\(E(X)=\frac{7}{5}\), or 1.4, which closely matches the empirical mean of the Gamma distribution found above (1.394).
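
Here is the numerical evaluation in R mentioned above (my own addition): integrating \(x\) times the Gamma(7, 5) density over \((0,\infty)\).

# Numerical check of E(X) with integrate()
integrand_mean <- function(x) x * dgamma(x, shape = 7, rate = 5)
integrate(integrand_mean, lower = 0, upper = Inf)$value  # ~1.4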


Gamma Variance

Using calculus, calculate the variance of the Gamma pdf (X).

This time I tried to set up and simplify the variance of \(X\) with calculus and then evaluate it in R, but my integral diverged, so I switched to using a proof as a guide.

Variance is the expected value of the square minus the square of the expected value, so we have to solve this equation:

\(var(X)=\displaystyle\int_0^{\infty}x^2f_X(x)dx-(E(X))^2\)

We know the expected value is \(\frac{\alpha}{\lambda}\), and when we substitute the Gamma probability density function and our values we get:

\(var(X)=\displaystyle\int_0^{\infty}x^2\frac{5^7}{\Gamma(7)}x^6e^{-5 x}dx-(\frac{7}{5})^2\)

We can combine the two \(x\) terms and move the constants outside the integral, using \(\Gamma(n)=(n-1)!\):

\(var(X)=\frac{5^7}{6!}\displaystyle\int_0^{\infty}x^8e^{-5 x}dx-(\frac{7}{5})^2\)

Now that it’s simplified I’ll evaluate it with R:

# Here we integrate the middle term numerically
integrand <- function(x) {x^8 * exp(-5 * x)}
middle <- integrate(integrand, lower = 0, upper = Inf)$value

# Here we put it all together
5^7/factorial(6)*middle - 1.4^2
## [1] 0.28

We got 0.28 for the variance, which closely matches the empirical variance of the Gamma distribution found above (0.276) and agrees with the known analytic value \(\frac{\alpha}{\lambda^2}=\frac{7}{25}=0.28\).


Exponential Mean

Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z).

To calculate the expected value we evaluate the first derivative of the moment generating function at \(t=0\).

The moment generating function of an exponential distribution is:

\(M_Z(t)=\frac{\lambda}{\lambda-t}\)

The first derivative of which, with \(\lambda=5\), is:

\(M'_Z(t)=\frac{5}{(5-t)^2}\)

When we evaluate at \(t=0\) we get:

\(M'_Z(0)=\frac{5}{25}=0.2\), matching the empirical mean of \(Z\).


Sum of Exponential Mean

Using the moment generating function for exponentials, calculate the expected value of the sum of exponentials (Y).

The moment generating function of a sum of \(\alpha\) independent exponential distributions is:

\(M_Y(t)=\left(1-\frac{t}{\lambda}\right)^{-\alpha}\)

The first derivative of which, with \(\alpha=7\) and \(\lambda=5\), is:

\(M'_Y(t)=\frac{7}{5}\left(1-\frac{t}{5}\right)^{-8}\)

When we evaluate at \(t=0\) we get:

\(M'_Y(0)=\frac{7}{5}\), or 1.4, matching the empirical mean of \(Y\).
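
As a numerical cross-check of both derivatives (my own addition), a central difference at \(t=0\) with an arbitrary small step \(h\) should recover the two means.

# Approximate M'(0) for both MGFs via a central difference
h <- 1e-6
mgf_Z <- function(t) 5 / (5 - t)        # MGF of Exp(5)
mgf_Y <- function(t) (1 - t / 5)^(-7)   # MGF of the sum of seven Exp(5)
(mgf_Z(h) - mgf_Z(-h)) / (2 * h)  # ~0.2
(mgf_Y(h) - mgf_Y(-h)) / (2 * h)  # ~1.4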




1c-e. Probability


\(P(Z>\lambda\mid Z>\lambda/2)\)

For pdf Z (the exponential), calculate the probability empirically. Then evaluate through calculus whether the memoryless property holds.

I interpret this question differently. Instead I'm providing the empirical probability \(P(Z>\theta\mid Z>\theta/2)\). Note that the exponential output is in units of elapsed time, and \(\theta\) is also in units of time (the mean interval between events), whereas \(\lambda\) is in units of events (the mean number of events per interval of time). So we should be using \(\theta=0.2\) instead of \(\lambda\) in problems 1c.-1e. Also, \(\lambda\) is 5, and the highest value of Z is only 2.1359207, so \(P(Z>\lambda)\) would be empirically zero.

# Z is already a numeric vector, so no reshaping is needed; we can
# compare against the thresholds directly
sum(Z > .2) / sum(Z > .1)
## [1] 0.5957304


Demonstrate Memorylessness with Rationale

For showing the memoryless property I was aided by the linked resource: http://www.math.wm.edu/~leemis/chart/UDR/PDFs/ExponentialForgetfulness.pdf

I'm not able to think of how to do it with calculus, unless it's the calculus equivalent of the following; the only calculus there is going from the PDF \(\lambda e^{-\lambda x}\) to the CDF \(1-e^{-\lambda x}\):

# Since both of these values are equal to each other we've demonstrated the memoryless property
(1-pexp(.1,5))*(1-pexp(.1,5))
## [1] 0.3678794
1-pexp(.2,5)
## [1] 0.3678794

Here's the proof behind my assertion above, adapted from the math.wm.edu link I mentioned earlier.

We can show \(Z\) is memoryless if we can show, for any threshold \(a>0\):

\(P(Z>2a\mid Z>a)=P(Z>a)\)

Using conditional probability, this is equivalent to:

\(P(Z>2a)=P(Z>a)P(Z>a)\)

Since \(P(Z\geq a)=e^{-a/\theta}\), we can rewrite the above as:

\(e^{-2a/\theta}=e^{-a/\theta}e^{-a/\theta}\)

Multiplying the exponentials adds their exponents, so the two sides are equal and the memoryless property is satisfied:

\(e^{-2a/\theta}=e^{-2a/\theta}\)


\(P(Z>2\lambda\mid Z>\lambda)\)

For pdf Z (the exponential), calculate the probability empirically. Then evaluate through calculus whether the memoryless property holds.

I’m providing instead the empirical probability, \(P(Z>2\theta\mid Z>\theta)\).

sum(Z > .4) / sum(Z > .2)
## [1] 0.373307


Demonstrate Memorylessness

# Since both of these values are equal to each other we've demonstrated the memoryless property
(1-pexp(.2,5))*(1-pexp(.2,5))
## [1] 0.1353353
1-pexp(.4,5)
## [1] 0.1353353


\(P(Z>3\lambda\mid Z>\lambda)\)

For pdf Z (the exponential), calculate the probability empirically. Then evaluate through calculus whether the memoryless property holds.

I’m providing instead the empirical probability, \(P(Z>3\theta\mid Z>\theta)\).

sum(Z > .6) / sum(Z > .2)
## [1] 0.1408014


Demonstrate Memorylessness

# Since both of these values are equal to each other we've demonstrated the memoryless property
(1-pexp(.2,5))*(1-pexp(.2,5))*(1-pexp(.2,5))
## [1] 0.04978707
1-pexp(.6,5)
## [1] 0.04978707



2a. Joint and Marginal Table

Loosely investigate whether P(YZ) = P(Y)P(Z) by building a table with quartiles and evaluating the marginal and joint probabilities.

When random variables are independent, \(p(x,y)=p(x)p(y)\). In general, by the definition of conditional probability, \(p(x,y)=p(x\mid y)p(y)\).

When four quartiles are discussed, we are referring to the four quarters of the data when the values are lined up from lowest to highest. When three quartiles are discussed, we are talking about cut points: the 25th percentile value, meaning 25% of the data falls below it, is the first quartile; the 50th percentile is the second quartile; and the 75th percentile is the third. If there were four of these cut points, they would be quintiles (20-40-60-80%).

A joint and marginal probability table is a grid where you count each instance based on its two characteristics. You then convert those counts into probabilities, which you sum along the right and bottom sides for each row and column. The probabilities in the middle are the joint probabilities; each one is the probability that both events happen. The marginal probabilities (written in the margins, get it?) are the probabilities of each single event happening.

We'll attempt this problem by treating every row in our 10,000-row dataset as one instance and assigning each one to the 4x4 grid depending on which quarter its Y-value falls in and which quarter its Z-value falls in.


Code

# Make my data frame, with a Quartile column to be filled in below
df <- data.frame(cbind(Y, Z))
df$Quartile <- df$Y
head(df)
##          Y          Z Quartile
## 1 1.351016 0.24084474 1.351016
## 2 1.770144 0.01141032 1.770144
## 3 1.132028 0.01827127 1.132028
## 4 1.452724 0.22196369 1.452724
## 5 1.411754 0.16504039 1.411754
## 6 2.008086 0.35970756 2.008086
# Find my quantiles
quantile(Y)
##       0%      25%      50%      75%     100% 
## 0.221622 1.027327 1.340688 1.717489 5.282895
quantile(Z)
##           0%          25%          50%          75%         100% 
## 2.267957e-06 5.555302e-02 1.342315e-01 2.735498e-01 2.135921e+00
# Store the quantiles (hard-coded from the output above) for sorting my data into the table
YQ1 <- 1.027327
YQ2 <- 1.340688
YQ3 <- 1.717489
ZQ1 <- 5.555302e-02
ZQ2 <- 1.342315e-01
ZQ3 <- 2.735498e-01
# Populate Quartile codes as the basis for the Joint and Marginal Probability Table
for(i in 1:10000) {
  
  # Initialize values
  Yvalue <- df$Y[i]
  Zvalue <- df$Z[i]
  Quartile <- 0
  
  # Populate Quartile value for Y
  if (Yvalue <= YQ1) {
    Quartile = Quartile + 10
  } else if (Yvalue <= YQ2) {
    Quartile = Quartile + 20
  } else if (Yvalue <= YQ3) {
    Quartile = Quartile + 30
  } else {
    Quartile = Quartile + 40
  }
  
  # Populate Quartile value for Z
  if (Zvalue <= ZQ1) {
    Quartile = Quartile + 1
  } else if (Zvalue <= ZQ2) {
    Quartile = Quartile + 2
  } else if (Zvalue <= ZQ3) {
    Quartile = Quartile + 3
  } else {
    Quartile = Quartile + 4
  }
  
  # Assign Quartile to df
  df$Quartile[i] = Quartile
  
}
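# My own aside: the loop above can be written more compactly with cut(),
# which bins values into the same (a, b] intervals; this should
# reproduce the Quartile codes exactly
yq <- cut(df$Y, breaks = c(-Inf, YQ1, YQ2, YQ3, Inf), labels = FALSE)
zq <- cut(df$Z, breaks = c(-Inf, ZQ1, ZQ2, ZQ3, Inf), labels = FALSE)
all.equal(10 * yq + zq, df$Quartile)  # should be TRUE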
# Count the Quartiles for the 16 possibilities and convert to probabilities
JMP11 <- sum(df$Quartile==11)/10000
JMP12 <- sum(df$Quartile==12)/10000
JMP13 <- sum(df$Quartile==13)/10000
JMP14 <- sum(df$Quartile==14)/10000
JMP21 <- sum(df$Quartile==21)/10000
JMP22 <- sum(df$Quartile==22)/10000
JMP23 <- sum(df$Quartile==23)/10000
JMP24 <- sum(df$Quartile==24)/10000
JMP31 <- sum(df$Quartile==31)/10000
JMP32 <- sum(df$Quartile==32)/10000
JMP33 <- sum(df$Quartile==33)/10000
JMP34 <- sum(df$Quartile==34)/10000
JMP41 <- sum(df$Quartile==41)/10000
JMP42 <- sum(df$Quartile==42)/10000
JMP43 <- sum(df$Quartile==43)/10000
JMP44 <- sum(df$Quartile==44)/10000
# Produce the marginal probabilities
JMP15 = JMP11+JMP12+JMP13+JMP14
JMP25 = JMP21+JMP22+JMP23+JMP24
JMP35 = JMP31+JMP32+JMP33+JMP34
JMP45 = JMP41+JMP42+JMP43+JMP44

JMP51 = JMP11+JMP21+JMP31+JMP41
JMP52 = JMP12+JMP22+JMP32+JMP42
JMP53 = JMP13+JMP23+JMP33+JMP43
JMP54 = JMP14+JMP24+JMP34+JMP44
JMP55 = JMP15+JMP25+JMP35+JMP45
# Create the joint and marginal probabilities table

cnames <- c("1st Quartile Y", "2nd Quartile Y", "3rd Quartile Y", "4th Quartile Y", "Sum")
rnames <- c("1st Quartile Z", "2nd Quartile Z", "3rd Quartile Z", "4th Quartile Z", "Sum")
c1 <- c(JMP11, JMP12, JMP13, JMP14, JMP15)
c2 <- c(JMP21, JMP22, JMP23, JMP24, JMP25)
c3 <- c(JMP31, JMP32, JMP33, JMP34, JMP35)
c4 <- c(JMP41, JMP42, JMP43, JMP44, JMP45)
c5 <- c(JMP51, JMP52, JMP53, JMP54, JMP55)

JMPT = data.frame(c1,c2,c3,c4,c5)
names(JMPT) <- cnames
row.names(JMPT) <- rnames
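
The same table can also be built in a few lines (my own sketch): cross-tabulate the quartile codes and convert the counts to proportions, with marginal sums added.

# Cross-tabulate quartile codes, convert to proportions, add margins
yq2 <- df$Quartile %/% 10
zq2 <- df$Quartile %% 10
round(addmargins(prop.table(table(zq2, yq2))), 4)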


Answer

It makes sense that the marginal probabilities are all 0.25, because that's how we defined the buckets: by quartile.

To evaluate \(p(y,z)=p(y)p(z)\), note that \((25\%)(25\%)=6.25\%\). All of the joint probabilities cluster around 6.25%, with only one pair of values (the 2nd Y quartile's 1st and 4th Z quartiles) falling a little outside the others' range. This supports independence to me!

JMPT
##                1st Quartile Y 2nd Quartile Y 3rd Quartile Y 4th Quartile Y  Sum
## 1st Quartile Z         0.0629         0.0588         0.0654         0.0629 0.25
## 2nd Quartile Z         0.0651         0.0617         0.0601         0.0631 0.25
## 3rd Quartile Z         0.0625         0.0604         0.0642         0.0629 0.25
## 4th Quartile Z         0.0595         0.0691         0.0603         0.0611 0.25
## Sum                    0.2500         0.2500         0.2500         0.2500 1.00



2b. Does Independence Hold?

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Both tests assess independence; however, Fisher's Exact Test is preferred when at least one category, or joint probability table cell (we have 16), has a very small expected count (commonly fewer than 5). Since each of our categories has many records, the Chi-square Test is more appropriate.

Neither test is designed for numerical data; they are designed for two or more sets of categorical data, so we'll split the quartile codes into two columns.

# Set up the data for the two tests: split the two-digit code into
# separate Y and Z quartile columns
df$QuartileY <- df$Quartile %/% 10
df$QuartileZ <- df$Quartile %% 10


Chi-squared Test

In the chi-squared output below, the p-value is 1, so the test indicates no significant association between the two variables; the distributions appear to be independent.

The X-squared statistic summarizes the magnitude of the differences between the observed and expected frequencies, but on its own it's hard to tell whether it's relatively high or low; however, since our sample size is large, I think we can trust the p-value.

chisq.test(df[4:5])
## Warning in chisq.test(df[4:5]): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  df[4:5]
## X-squared = 5193.5, df = 9999, p-value = 1
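
One caution worth noting (my own aside): chisq.test() applied to a two-column data frame treats it as a 10,000 x 2 contingency table of counts, which is what produces df = 9999 and the accuracy warning above. The conventional test of independence cross-tabulates the two quartile variables first.

# Conventional form: test independence on the 4 x 4 cross-tabulation
# (output not shown)
chisq.test(table(df$QuartileY, df$QuartileZ))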


Fisher’s Exact Test

An earlier version of our Fisher's Exact Test failed due to lack of workspace, so we're going to simulate p-values. I tried manually increasing the workspace, but it wouldn't run until workspace = 1e+09, and then it looked like it might take forever. Setting simulate.p.value = TRUE estimates the p-value by Monte Carlo simulation: the "2000 replicates" in the output refers to the number of simulated tables, not a condensed version of the 10,000 records.

Either way, the p-value is 1, indicating the two distributions are independent; however, since Fisher's Exact Test is designed for smaller data sets where some events have very few occurrences, we would defer to the Chi-squared test.

fisher.test(df[4:5], simulate.p.value=TRUE)
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  df[4:5]
## p-value = 1
## alternative hypothesis: two.sided
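
The same caution applies here (my own aside): the conventional form of the test operates on the 4 x 4 cross-tabulation of the two quartile variables.

# Conventional form on the 4 x 4 table (output not shown)
fisher.test(table(df$QuartileY, df$QuartileZ), simulate.p.value = TRUE)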