Problem 1

Probability Density 1: X~Gamma. Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely described by n (a size parameter) and lambda (λ, a rate/shape parameter). Choose any n greater than 3 and an expected value (n/λ) between 2 and 10 (you choose).

seed_num = 42
set.seed(seed_num)

samples = 10000
n = 5       # size parameter: the number of exponentials being summed
lambda = 2  # rate/shape parameter of each exponential
# With shape = n and rate = lambda, X has the same distribution as the sum of
# n independent Exponential(lambda) variables
X <- rgamma(samples, shape = n, rate = lambda)

Probability Density 2: Y~Sum of Exponentials. Then generate 10,000 observations from the sum of n exponential pdfs with rate/shape parameter (λ). The n and λ must be the same as in the previous case. (e.g., mysum = rexp(...) + rexp(...) + ...)

# Sum of n independent Exponential(lambda) draws (same n and lambda as for X)
Y <- rowSums(replicate(n, rexp(samples, rate = lambda)))

Probability Density 3: Z~Exponential. Then generate 10,000 observations from a single exponential pdf with rate/shape parameter (λ). NOTE: The Gamma distribution is quite common in data science. For example, it is used to model failures for multiple processes when each of those processes has the same failure rate. The exponential is used for constant failure rates, service times, etc.

Z <- rexp(samples, rate=lambda)

1a. Calculate the empirical expected values (means) and variances of all three pdfs.

pdf <- cbind(X, Y, Z)
means <- colMeans(pdf)
vars <- apply(pdf, 2, var)
result <- data.frame(Means = means, Vars = vars)
result

1b. Using calculus, calculate the expected value and variance of the Gamma pdf (X). Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y).

For the Gamma probability distribution gX(x, n, λ), which here arises as the distribution of the sum of n independent exponential waiting times with rate λ, two quantities are of interest. The expected value E(X) is the average value we expect to observe, and the variance (σ^2) measures the spread, or variability, of the distribution.

To calculate E(X), we evaluate the integral of x * gX(x, n, λ) from 0 to infinity. Because gX(x, n, λ) is a valid probability distribution, the integral (area under the curve) of gX(x, n, λ) from 0 to infinity is 1; rewriting x * gX(x, n, λ) as a constant multiple of a Gamma density with size parameter n + 1 and using that fact gives E(X) = n/λ. The average value is therefore the number of summed exponentials, n, divided by the rate λ.

To determine the variance (σ^2), we calculate E(X^2) − (E(X))^2. Evaluating the integral of x^2 * gX(x, n, λ) from 0 to infinity gives E(X^2) = n(n+1)/λ^2. Plugging E(X) = n/λ and E(X^2) into the variance formula yields σ^2 = n(n+1)/λ^2 − n^2/λ^2 = n/λ^2, which measures the distribution’s spread or variability.
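Written out explicitly, using the rate parameterisation that matches the rgamma call above, the density and the two moment integrals are:

$$
g_X(x; n, \lambda) = \frac{\lambda^{n}}{\Gamma(n)}\, x^{\,n-1} e^{-\lambda x}, \qquad x > 0
$$

$$
E(X) = \int_0^{\infty} x \, g_X(x; n, \lambda)\, dx = \frac{\lambda^{n}}{\Gamma(n)} \cdot \frac{\Gamma(n+1)}{\lambda^{n+1}} = \frac{n}{\lambda}
$$

$$
E(X^2) = \int_0^{\infty} x^2 \, g_X(x; n, \lambda)\, dx = \frac{\lambda^{n}}{\Gamma(n)} \cdot \frac{\Gamma(n+2)}{\lambda^{n+2}} = \frac{n(n+1)}{\lambda^2}, \qquad
\mathrm{Var}(X) = \frac{n(n+1)}{\lambda^2} - \frac{n^2}{\lambda^2} = \frac{n}{\lambda^2}
$$

With n = 5 and λ = 2 this gives E(X) = 2.5 and Var(X) = 1.25, which should match the empirical column for X in 1a up to sampling noise.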

In the case of the exponential probability distribution fX(x) = λ * e^(-λx), which is often used to model events with constant failure rates, the expectation value E(X) represents the average value we expect to observe. For this distribution, E(X) is simply 1/λ, meaning the average value is the reciprocal of the rate parameter λ.

By using mathematical techniques like moment generating functions, we can derive the expectation value from the properties of the distribution. The moment generating function MX(t) for fX(x) is λ / (λ - t). By taking the first derivative of MX(t) with respect to t and evaluating it at t = 0, we find that E(X) is indeed 1/λ.
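Sketched out, that MGF calculation for the single exponential Z with rate λ is:

$$
M_Z(t) = E\!\left(e^{tZ}\right) = \int_0^{\infty} e^{tz}\, \lambda e^{-\lambda z}\, dz = \frac{\lambda}{\lambda - t}, \qquad t < \lambda
$$

$$
M_Z'(t) = \frac{\lambda}{(\lambda - t)^2}, \qquad E(Z) = M_Z'(0) = \frac{1}{\lambda}
$$

With λ = 2 this gives E(Z) = 0.5.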

In the case of the distribution Y, which represents the sum of n exponential random variables, we can determine its expectation value (E(Y)) based on the properties of the exponential distribution. Since the expected value of the sum of random variables is equal to the sum of their expectations, E(Y) is equal to n/λ. This means that on average, the sum of n exponential random variables will be n/λ.
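The MGF route the problem asks for gives the same answer: because the n exponentials are independent, the MGF of their sum is the product of the individual MGFs, so

$$
M_Y(t) = \left(\frac{\lambda}{\lambda - t}\right)^{n}, \qquad
M_Y'(t) = \frac{n\,\lambda^{n}}{(\lambda - t)^{n+1}}, \qquad
E(Y) = M_Y'(0) = \frac{n}{\lambda}.
$$

With n = 5 and λ = 2, E(Y) = 2.5, the same as the Gamma variable X, as expected since the two constructions describe the same distribution.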

In summary, for the Gamma distribution gX(x, n, λ) the expected value is n/λ and the variance is n/λ^2; for the exponential distribution fX(x) the expected value is 1/λ; and for the sum of n exponential random variables the expected value is n/λ. These values describe the average and the variability of the respective distributions.
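As a quick sanity check, a minimal sketch that reuses the n, lambda, and result objects defined above lays the closed-form moments alongside the empirical ones from 1a. The variances of Y and Z (n/λ^2 and 1/λ^2, standard exponential facts) are included for completeness even though 1b only asks for their expected values.

# Theoretical moments implied by the derivations above
theoretical <- data.frame(
  Means = c(n/lambda, n/lambda, 1/lambda),
  Vars  = c(n/lambda^2, n/lambda^2, 1/lambda^2),
  row.names = c("X", "Y", "Z")
)
# Side-by-side comparison with the empirical moments from 1a
round(cbind(Empirical = result, Theoretical = theoretical), 4)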

1c. Probability. For pdf Z (the exponential), calculate empirically probabilities a through c. Then evaluate through calculus whether the memoryless property holds.

  a. P(Z > λ | Z > λ/2)   b. P(Z > 2λ | Z > λ)   c. P(Z > 3λ | Z > λ)
z <- rexp(samples , rate = lambda)

# Probability a: P(Z > λ | Z > λ/2)
prob_a <- mean(z > lambda) / mean(z > lambda/2)
cat("Probability a:", prob_a, "\n")
## Probability a: 0.130566
# Probability b: P(Z > 2λ | Z > λ)
prob_b <- mean(z > 2*lambda) / mean(z > lambda)
cat("Probability b:", prob_b, "\n")
## Probability b: 0.02890173
# Probability c: P(Z > 3λ | Z > λ)
prob_c <- mean(z > 3*lambda) / mean(z > lambda)
cat("Probability c:", prob_c, "\n")
## Probability c: 0
# Memoryless property: P(Z > s + t | Z > s) should equal P(Z > t).
# Compare each conditional probability above with the matching unconditional
# empirical probability and with the exact value exp(-lambda * t).
memoryless_a <- mean(z > lambda/2)   # t = lambda/2 for case a
memoryless_b <- mean(z > lambda)     # t = lambda for case b
memoryless_c <- mean(z > 2*lambda)   # t = 2*lambda for case c
cat("a:", prob_a, "vs", memoryless_a, "vs exact", exp(-lambda * lambda/2), "\n")
cat("b:", prob_b, "vs", memoryless_b, "vs exact", exp(-lambda * lambda), "\n")
cat("c:", prob_c, "vs", memoryless_c, "vs exact", exp(-lambda * 2*lambda), "\n")
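The calculus version of this check holds for any s, t > 0: conditioning an exponential on having survived past s simply rescales the tail, so

$$
P(Z > s + t \mid Z > s) = \frac{\int_{s+t}^{\infty} \lambda e^{-\lambda z}\,dz}{\int_{s}^{\infty} \lambda e^{-\lambda z}\,dz}
= \frac{e^{-\lambda (s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(Z > t).
$$

With λ = 2 the exact values for a, b, and c are e^(−2) ≈ 0.135, e^(−4) ≈ 0.018, and e^(−8) ≈ 0.0003, so the empirical ratios above are consistent with the memoryless property up to sampling noise, which is large this far out in the tail; the empirical estimate for c is 0 simply because a sample of 10,000 is unlikely to contain any observation beyond 3λ = 6.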

1d

5 points. Loosely investigate whether P(YZ) = P(Y)P(Z) by building a table with quartiles and evaluating the marginal and joint probabilities. The table has the four quartiles of Y (1st through 4th) as columns and the four quartiles of Z as rows, with a Sum column and Sum row for the marginal probabilities.

# Calculate quantiles of Y and Z
quantY <- as.matrix(t(quantile(Y, probs = c(0.25, 0.5, 0.75, 1))))
quantZ <- as.matrix(quantile(Z, probs = c(0.25, 0.5, 0.75, 1)))

# Outer product of the quartile cut points, normalised so the entries sum to 1
prodYZ <- quantZ %*% quantY
prodSum <- sum(prodYZ)
probYZ <- prodYZ / prodSum

# Add row and column names
colnames(probYZ) <- c("25% Y", "50% Y", "75% Y", "100% Y")
rownames(probYZ) <- c("25% Z", "50% Z", "75% Z", "100% Z")

# Calculate row sums and column sums
Rsum <- rowSums(probYZ)
Csum <- colSums(probYZ)

# Results
FinalTable <- cbind(probYZ, Rsum)
FinalTable <- rbind(FinalTable, c(Csum, sum(probYZ)))

FinalTable
##              25% Y       50% Y       75% Y     100% Y       Rsum
## 25% Z  0.001928555 0.002803137 0.003904446 0.01325407 0.02189021
## 50% Z  0.004816448 0.007000662 0.009751112 0.03310122 0.05466944
## 75% Z  0.009918991 0.014417161 0.020081437 0.06816864 0.11258623
## 100% Z 0.071437290 0.103833430 0.144627952 0.49095545 0.81085412
##        0.088101285 0.128054390 0.178364947 0.60547938 1.00000000
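The table above is built from the quartile cut points themselves rather than from counts of observations, so it only loosely probes independence. An alternative sketch (reusing the Y and Z vectors generated earlier) classifies each of the 10,000 (Y, Z) pairs by quartile membership and tabulates the joint proportions; under independence every joint cell should sit near the product of its marginals, 0.25 * 0.25 = 0.0625.

# Assign each observation to its quartile of Y and to its quartile of Z
quartY <- cut(Y, breaks = quantile(Y, probs = seq(0, 1, 0.25)),
              include.lowest = TRUE, labels = paste0("Q", 1:4, " Y"))
quartZ <- cut(Z, breaks = quantile(Z, probs = seq(0, 1, 0.25)),
              include.lowest = TRUE, labels = paste0("Q", 1:4, " Z"))

# Joint counts, then joint and marginal proportions
jointCounts <- table(quartZ, quartY)
addmargins(prop.table(jointCounts))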

1e

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

countsTable = probYZ*10000
fisher.test(countsTable, simulate.p.value=TRUE)
## Warning in fisher.test(countsTable, simulate.p.value = TRUE): 'x' has been
## rounded to integer: Mean relative difference: 0.0003785851
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  countsTable
## p-value = 1
## alternative hypothesis: two.sided
chisq.test(countsTable)
## 
##  Pearson's Chi-squared test
## 
## data:  countsTable
## X-squared = 1.2926e-28, df = 9, p-value = 1

Fisher’s Exact Test and the Chi Square Test are used to assess the relationship between two variables based on observed data. Fisher’s Exact Test determines the probability of observing the data or a more extreme outcome under the assumption of independence. It considers all possible tables with the same marginal totals and sums the probabilities of tables as or more extreme than the observed table. If this probability is small (typically < 0.05), the null hypothesis of independence is rejected.

In contrast, the Chi Square Test compares observed counts in a contingency table to the counts expected under independence. It calculates a test statistic by summing the squared differences between observed and expected counts, each divided by the expected count. Under the null hypothesis, this statistic approximately follows a Chi Square distribution with (rows − 1) × (columns − 1) degrees of freedom. If the calculated statistic exceeds the critical value (equivalently, if the p-value falls below the chosen significance level), the null hypothesis is rejected.
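To make the mechanics concrete, here is a small illustrative sketch on a hypothetical 2×2 table (the counts are made up and are not taken from the Y/Z table above): the Chi Square statistic is the sum of (observed − expected)^2 / expected, with expected counts formed from the products of the marginals, and fisher.test gives the exact p-value for the same table.

# Hypothetical 2x2 table of counts (illustrative values only)
toy <- matrix(c(30, 20,
                10, 40), nrow = 2, byrow = TRUE)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(toy), colSums(toy)) / sum(toy)
chi_stat <- sum((toy - expected)^2 / expected)

chi_stat                                    # hand-computed statistic
chisq.test(toy, correct = FALSE)$statistic  # matches chisq.test without continuity correction
fisher.test(toy)$p.value                    # exact p-value for the same table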

Generally, Fisher’s Exact Test suits small samples, where expected cell counts are low (a common rule of thumb is expected counts below 5) and the Chi Square approximation is unreliable; the Chi Square Test suits large samples, where the approximation is accurate and the exact test becomes computationally expensive without adding much. With 10,000 observations per variable, the Chi Square Test is the more appropriate choice here, and both tests return a p-value of 1, so the null hypothesis of independence is not rejected.