Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely described by n (a shape parameter) and \(\lambda\) (lambda, a rate parameter). Choose any n greater than 3 and a rate \(\lambda\) between 2 and 10 (you choose).
set.seed(51023)
n <- 10000   # number of observations to draw
shape <- 5   # Gamma shape parameter (the "n > 3" in the prompt)
lambda <- 3  # Gamma rate parameter
x_gamma <- rgamma(n = n, shape = shape, rate = lambda)
head(x_gamma)
## [1] 2.0502233 0.4307308 1.1192294 1.5410472 1.1542016 0.6988969
Then generate 10,000 observations from the sum of n exponential pdfs, each with rate parameter \(\lambda\). The n and \(\lambda\) must be the same as in the previous case (here, five exponentials with \(\lambda = 3\)).
# Draw five independent Exp(lambda) samples and sum them elementwise;
# the sum of five iid exponentials follows a Gamma(shape = 5, rate = lambda).
set.seed(51323)
y1_exponential <- rexp(n, lambda)
set.seed(51324)
y2_exponential <- rexp(n, lambda)
set.seed(51325)
y3_exponential <- rexp(n, lambda)
set.seed(51326)
y4_exponential <- rexp(n, lambda)
set.seed(51327)
y5_exponential <- rexp(n, lambda)
y_exponential <- y1_exponential + y2_exponential + y3_exponential + y4_exponential + y5_exponential
head(y_exponential)
## [1] 1.5936518 1.8291070 1.2697119 2.1948992 0.8308876 1.4988129
Then generate 10,000 observations from a single exponential pdf with rate parameter \(\lambda\).
NOTE: The Gamma distribution is quite common in data science. For example, it is used to model the time until several failures occur when each underlying process fails at the same rate. The exponential is used for constant failure rates, service times, etc.
set.seed(5152023)
z_exponential <- rexp(n, lambda)
head(z_exponential)
## [1] 0.31903999 0.02936884 0.33178521 0.33271694 0.06073430 0.03785072
Calculate the empirical expected values (means) and variances of all three pdfs.
mean(x_gamma)
## [1] 1.668325
var(x_gamma)
## [1] 0.5495519
mean(y_exponential)
## [1] 1.679196
var(y_exponential)
## [1] 0.5478189
mean(z_exponential)
## [1] 0.3318277
var(z_exponential)
## [1] 0.112242
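These empirical values line up with theory: a Gamma with shape 5 and rate \(\lambda = 3\) has mean \(5/\lambda = 5/3 \approx 1.667\) and variance \(5/\lambda^2 = 5/9 \approx 0.556\), and Y, as a sum of five independent \(Exp(\lambda)\) draws, is exactly such a Gamma; Z has mean \(1/\lambda \approx 0.333\) and variance \(1/\lambda^2 \approx 0.111\).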
This is an area of the course that I struggled with. I still don't fully understand the calculus portion as it pertains to probabilities, expected values, variance, and moment generating functions.
Using calculus, calculate the expected value and variance of the Gamma pdf (X).
\(E(X) = \int_{0}^{\infty} x\frac{\lambda^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\lambda x}dx\)
Substituting \(\alpha = 5\) and \(\lambda = 3\), we then get:
\(E(X) = \int_{0}^{\infty} x\frac{3^5}{\Gamma(5)}x^{5-1}e^{-3 x}dx\)
This further simplifies to:
\(E(X) = \int_{0}^{\infty} \frac{3^5}{\Gamma(5)}x^{(5+1)-1}e^{-3 x}dx\)
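The integrand is now the kernel of a Gamma pdf with shape \(5 + 1 = 6\). Using the identity \(\int_{0}^{\infty} x^{k-1}e^{-\lambda x}dx = \frac{\Gamma(k)}{\lambda^{k}}\), the integral evaluates to:
\(E(X) = \frac{3^5}{\Gamma(5)} \cdot \frac{\Gamma(6)}{3^6} = \frac{5!}{3 \cdot 4!} = \frac{5}{3} \approx 1.667\)
Applying the same identity to the second moment gives \(E(X^2) = \frac{\alpha(\alpha+1)}{\lambda^2} = \frac{30}{9}\), so:
\(Var(X) = E(X^2) - E(X)^2 = \frac{30}{9} - \frac{25}{9} = \frac{5}{9} \approx 0.556\)
Both agree with the empirical mean (1.668) and variance (0.550) computed above.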
Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y).
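The MGF of an exponential with rate \(\lambda\) is \(M_Z(t) = \frac{\lambda}{\lambda - t}\) for \(t < \lambda\), so \(E(Z) = M_Z'(0) = \frac{\lambda}{(\lambda - t)^2}\bigg|_{t=0} = \frac{1}{\lambda} = \frac{1}{3} \approx 0.333\), matching the empirical mean of Z. Because Y is the sum of five independent exponentials, its MGF is the product of the individual MGFs, \(M_Y(t) = \left(\frac{\lambda}{\lambda - t}\right)^5\), and differentiating at \(t = 0\) gives \(E(Y) = \frac{5}{\lambda} = \frac{5}{3} \approx 1.667\), matching the empirical mean of Y.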
For pdf Z (the exponential), calculate empirically probabilities a through c: (a) \(P(Z > \lambda \mid Z > \lambda/2)\), (b) \(P(Z > 2\lambda \mid Z > \lambda)\), and (c) \(P(Z > 3\lambda \mid Z > \lambda)\). Then evaluate through calculus whether the memoryless property holds.
For each of these conditional probabilities, we focus on solving \(P(A|B)\) using the identity \(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\). Additionally, we evaluate the exponential probability density at each threshold using \(f(x) = \lambda e^{-\lambda x}\).
# (a) thresholds: x_a = lambda, x_b = lambda/2
x_a <- lambda
x_b <- lambda/2
# pdf values f(x_a) and f(x_b), used below as the empirical cutoffs
f_xa <- lambda*exp((-lambda)*(x_a))
f_xb <- lambda*exp((-lambda)*(x_b))
# proportion of the sample above each cutoff
prob_a <- length(z_exponential[z_exponential > f_xa])/length(z_exponential)
prob_b <- length(z_exponential[z_exponential > f_xb])/length(z_exponential)
# proportion of the A observations that fall below f(x_b)
list_a <- z_exponential[z_exponential > f_xa]
list_ba <- list_a[list_a < f_xb]
prob_ba <- length(list_ba)/length(list_a)
# combine via the Bayes identity above
prob_ab <- ((prob_ba)*(prob_a))/prob_b
prob_ab
## [1] 0.1009255
# (b) thresholds: x_a = 2*lambda, x_b = lambda
x_a <- 2*lambda
x_b <- lambda
f_xa <- lambda*exp((-lambda)*(x_a))
f_xb <- lambda*exp((-lambda)*(x_b))
prob_a <- length(z_exponential[z_exponential > f_xa])/length(z_exponential)
prob_b <- length(z_exponential[z_exponential > f_xb])/length(z_exponential)
list_a <- z_exponential[z_exponential > f_xa]
list_ba <- list_a[list_a < f_xb]
prob_ba <- length(list_ba)/length(list_a)
prob_ab <- ((prob_ba)*(prob_a))/prob_b
prob_ab
## [1] 0.0008006405
# (c) thresholds: x_a = 3*lambda, x_b = lambda
x_a <- 3*lambda
x_b <- lambda
f_xa <- lambda*exp((-lambda)*(x_a))
f_xb <- lambda*exp((-lambda)*(x_b))
prob_a <- length(z_exponential[z_exponential > f_xa])/length(z_exponential)
prob_b <- length(z_exponential[z_exponential > f_xb])/length(z_exponential)
list_a <- z_exponential[z_exponential > f_xa]
list_ba <- list_a[list_a < f_xb]
prob_ba <- length(list_ba)/length(list_a)
prob_ab <- ((prob_ba)*(prob_a))/prob_b
prob_ab
## [1] 0.0008006405
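Through calculus, the memoryless property does hold: \(P(Z > s + t \mid Z > s) = \frac{P(Z > s + t)}{P(Z > s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(Z > t)\), so the probability of surviving an additional time \(t\) does not depend on the time \(s\) already waited. A minimal empirical sketch of this direct check on the z_exponential sample (the values of s and t_extra below are illustrative choices, not part of the original problem):
s <- 0.5         # illustrative: time already waited
t_extra <- 0.25  # illustrative: additional waiting time
# empirical P(Z > s + t | Z > s)
cond_prob <- mean(z_exponential[z_exponential > s] > s + t_extra)
# empirical P(Z > t)
uncond_prob <- mean(z_exponential > t_extra)
# theoretical e^(-lambda * t) from the calculus above
theoretical <- exp(-lambda * t_extra)
c(conditional = cond_prob, unconditional = uncond_prob, theoretical = theoretical)
All three values should agree up to sampling noise, which is exactly what memorylessness predicts.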
Loosely investigate whether P(YZ) = P(Y) P(Z) by building a table with quartiles and evaluating the marginal and joint probabilities.
(quantiles_y <- quantile(y_exponential))
## 0% 25% 50% 75% 100%
## 0.0949479 1.1384496 1.5752271 2.1157600 5.9639041
(quantiles_z <- quantile(z_exponential))
## 0% 25% 50% 75% 100%
## 4.199778e-05 9.511661e-02 2.316999e-01 4.567086e-01 3.110112e+00
prod_quantiles <- c()
for (y_quantile in 1:4) {
  for (z_quantile in 1:4) {
    # product of the y and z quartile values (skipping the 0% quantile)
    yz <- quantiles_y[[y_quantile + 1]] * quantiles_z[[z_quantile + 1]]
    prod_quantiles <- c(prod_quantiles, yz)
  }
}
(prod_quantiles_matrix <- matrix(prod_quantiles, nrow = 4, ncol = 4, byrow = FALSE))
## [,1] [,2] [,3] [,4]
## [1,] 0.1082855 0.1498303 0.2012439 0.5672663
## [2,] 0.2637787 0.3649800 0.4902214 1.3818361
## [3,] 0.5199397 0.7194197 0.9662857 2.7237661
## [4,] 3.5407062 4.8991334 6.5802513 18.5484121
col_sums <- colSums(prod_quantiles_matrix)
row_sums <- rowSums(prod_quantiles_matrix)
# the grand total should match whether we sum by rows or by columns
sum_of_cols <- sum(col_sums)
sum_of_rows <- sum(row_sums)
(round(sum_of_cols) == round(sum_of_rows))
## [1] TRUE
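The matrix above multiplies quantile values; a contingency table of joint counts is another way to set up the same independence check. A minimal sketch, pairing the i-th observations of Y and Z (the names y_bins, z_bins, and joint_counts are introduced here for illustration):
# Bin each sample by its own quartiles and cross-tabulate the joint counts.
# Under independence, each of the 16 cells should hold roughly 10000/16 = 625 observations.
y_bins <- cut(y_exponential, breaks = quantiles_y, include.lowest = TRUE)
z_bins <- cut(z_exponential, breaks = quantiles_z, include.lowest = TRUE)
(joint_counts <- table(y_bins, z_bins))
A chisq.test() on joint_counts would then operate on actual observation counts rather than on products of quantile values.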
Check to see if independence holds by using Fisher's Exact Test and the Chi-Squared Test.
chisq.test(prod_quantiles_matrix)
## Warning in chisq.test(prod_quantiles_matrix): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: prod_quantiles_matrix
## X-squared = 3.0839e-31, df = 9, p-value = 1
fisher.test(prod_quantiles_matrix)
## Warning in fisher.test(prod_quantiles_matrix): 'x' has been rounded to integer:
## Mean relative difference: 0.1164772
##
## Fisher's Exact Test for Count Data
##
## data: prod_quantiles_matrix
## p-value = 1
## alternative hypothesis: two.sided
Using both Fisher's Exact Test and the Chi-Squared Test, we fail to reject the null hypothesis of independence (p-value = 1 in both cases), consistent with Y and Z being independent.
The difference between the two tests is that Chi-Squared is a non-parametric test that can be used without specific assumptions about the distribution the data is drawn from, but it relies on a large-sample approximation and is prone to error with smaller sample sizes. Fisher's Exact Test, by contrast, does not use an approximation, so its validity holds even for small sample sizes.
In this case, given the small values in the table (note the warning that the matrix was rounded to integers), I believe Fisher's Exact Test is the more appropriate test for our data.