Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will demonstrate your understanding of the material covered in the class.
Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely described by n (a size parameter) and lambda (λ, a rate parameter). Choose any n greater than 3 and a value of λ between 2 and 10 (you choose).
\[E(x) = \frac{n}{\lambda}\] \[Var(x) = \frac{n}{\lambda^2}\]
#make this example reproducible
set.seed(123)
# set parameters
lambda<- 4 #between 2 and 10
n<- 5 # Choose any n greater 3
#generate 10,000 random values that follow gamma distribution
x <- rgamma(10000, shape=n, rate=lambda)
#create histogram to view distribution of values
hist(x, freq = FALSE, ylab = "Density", main = "Gamma PDF with n = 5 and λ = 4")
# Create a density plot using the density() function
plot(density(x), main = "Density Plot using density()", col= "Green", lwd = 2)
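As a quick visual check (a sketch, not part of the original prompt, reusing the x, n, and lambda defined above), the theoretical Gamma density can be drawn over the empirical density:
# Empirical density of x with the theoretical Gamma(shape = n, rate = lambda) density overlaid
plot(density(x), main = "Empirical vs. theoretical Gamma density", col = "green", lwd = 2)
curve(dgamma(x, shape = n, rate = lambda), add = TRUE, col = "red", lwd = 2, lty = 2)
legend("topright", legend = c("empirical density of x", "theoretical dgamma"),
       col = c("green", "red"), lwd = 2, lty = c(1, 2))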
Then generate 10,000 observations from the sum of n exponential pdfs with rate parameter (λ). The n and λ must be the same as in the previous case (e.g., mysum = rexp(10000, lambda) + rexp(10000, lambda) + …).
\[E(y) = \frac{n}{\lambda}\]
\[Var(y) = \frac{n}{\lambda^2}\]
#make this example reproducible
set.seed(123)
# set parameters
lambda<- 4 #between 2 and 10
n<- 5 # Choose any n greater 3
# Generate 10,000 observations, each the sum of n independent Exponential(lambda) draws
Y <- rowSums(matrix(rexp(10000 * n, rate = lambda), ncol = n))
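A quick check (a sketch using the Y just generated): the sum of n Exponential(λ) variables follows a Gamma(shape = n, rate = λ) distribution, so the theoretical Gamma density should track the histogram of Y.
# Histogram of the simulated sums with the theoretical Gamma density overlaid
hist(Y, freq = FALSE, xlab = "Y", main = "Sum of n exponentials vs. Gamma(n, lambda) density")
curve(dgamma(x, shape = n, rate = lambda), add = TRUE, col = "blue", lwd = 2)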
Z ~ Exponential. Then generate 10,000 observations from a single exponential pdf with rate parameter (λ). NOTE: The Gamma distribution is quite common in data science. For example, it is used to model failures for multiple processes when each of those processes has the same failure rate. The exponential is used for constant failure rates, service times, etc.
5 points.
\[E(z) = \frac{1}{\lambda}\]
\[Var(z) = \frac{1}{\lambda^2}\]
#make this example reproducible
set.seed(123)
# set parameters
lambda <- 4 #between 2 and 10
n <- 5 # Choose any n greater than 3
Z <- rexp(10000, lambda)
hist(Z, freq = FALSE, ylab = "Density", main = "Exponential PDF with λ = 4")
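As an optional visual comparison (a sketch assuming the x, Y, and Z vectors generated above), all three simulated densities can be plotted together; X and Y should look nearly identical, since the sum of n exponentials is Gamma-distributed, while Z is concentrated near zero.
# Compare the empirical densities of X (Gamma), Y (sum of exponentials), and Z (single exponential)
plot(density(x), col = "green", lwd = 2, xlim = c(0, 4), ylim = c(0, 4),
     main = "Simulated densities of X, Y and Z")
lines(density(Y), col = "blue", lwd = 2)
lines(density(Z), col = "red", lwd = 2)
legend("topright", legend = c("X ~ Gamma(n, lambda)", "Y = sum of n Exp(lambda)", "Z ~ Exp(lambda)"),
       col = c("green", "blue", "red"), lwd = 2)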
#(empirical_mean_x <- mean(x))
#(empirical_variance_x <- var(x))
(E_mean_x<-n/lambda)
## [1] 1.25
(E_Var_x<-n/lambda^2)
## [1] 0.3125
#(empirical_mean_Y <- mean(Y))
#(empirical_variance_Y <- var(Y))
(E_mean_y<-n/lambda)
## [1] 1.25
(E_Var_y<-n/lambda^2)
## [1] 0.3125
#(empirical_mean_Z <- mean(Z))
#(empirical_variance_Z <- var(Z))
(E_mean_z<-1/lambda)
## [1] 0.25
(E_Var_z<-1/lambda^2)
## [1] 0.0625
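To put the empirical and theoretical moments side by side (a small summary sketch; it assumes the x, Y, and Z vectors simulated above and simply carries out the empirical calculations that are commented out in the chunks):
# Empirical (simulated) vs. theoretical means and variances for X, Y and Z
comparison <- data.frame(
  variable  = c("X ~ Gamma(n, lambda)", "Y = sum of n Exp(lambda)", "Z ~ Exp(lambda)"),
  emp_mean  = c(mean(x), mean(Y), mean(Z)),
  theo_mean = c(n / lambda, n / lambda, 1 / lambda),
  emp_var   = c(var(x), var(Y), var(Z)),
  theo_var  = c(n / lambda^2, n / lambda^2, 1 / lambda^2)
)
comparison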
Mean: https://statproofbook.github.io/P/gam-mean.html
\[E(x) = \frac{n}{\lambda}\]
integrand <- function(x) x * dgamma(x, n, lambda)
expected_value_X <- integrate(integrand, lower = 0, upper = Inf)$value
expected_value_X
## [1] 1.25
\[Var(x) = \frac{n}{\lambda^2}\]
integrand <- function(x) x^2 * dgamma(x, n, lambda)
variance_X <- integrate(integrand, lower = 0, upper = Inf)$value - expected_value_X^2
variance_X
## [1] 0.3125
Variance: https://statproofbook.github.io/P/gam-var
Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y).
The MGF of an exponential distribution with rate parameter λ is given by:
\[M_Z(t) = \frac{\lambda}{\lambda-t}, \quad t < \lambda\]
The expected value is the first derivative of the MGF evaluated at t = 0. For the single exponential Z with rate λ = 4:
\[E(Z) = M_Z'(0) = \left.\frac{\lambda}{(\lambda-t)^2}\right|_{t=0} = \frac{1}{\lambda} = \frac{1}{4} = 0.25\]
Because Y is the sum of n independent exponentials, its MGF is the product of the n individual MGFs:
\[M_Y(t) = \left(\frac{\lambda}{\lambda-t}\right)^n\]
so with n = 5 and λ = 4,
\[E(Y) = M_Y'(0) = \left.\frac{n\lambda^{n}}{(\lambda-t)^{n+1}}\right|_{t=0} = \frac{n}{\lambda} = \frac{5}{4} = 1.25\]
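As a numeric sanity check (a small sketch, not part of the original solution; mgf_Z, mgf_Y, and the step size h are illustrative names), the MGF derivatives at t = 0 can be approximated by finite differences in R:
# Finite-difference approximation of the MGF derivatives at t = 0
mgf_Z <- function(t) lambda / (lambda - t)        # MGF of a single Exponential(lambda)
mgf_Y <- function(t) (lambda / (lambda - t))^n    # MGF of the sum of n exponentials
h <- 1e-6
(mgf_Z(h) - mgf_Z(-h)) / (2 * h)   # should be close to 1/lambda = 0.25
(mgf_Y(h) - mgf_Y(-h)) / (2 * h)   # should be close to n/lambda = 1.25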
### 1c-e. Probability

For pdf Z (the exponential), calculate empirically probabilities a through c. Then evaluate through calculus whether the memoryless property holds.

a. \[P(Z > \lambda \mid Z > \lambda/2)\]
We can use the pexp() function to calculate these probabilities.
(Emp_prob_a <- 1-pexp(E_mean_z, lambda))
## [1] 0.3678794
(Emp_prob_b <- 1-pexp(E_mean_z, 2*lambda))
## [1] 0.1353353
(Emp_prob_c <- 1-pexp(E_mean_z, 3*lambda))
## [1] 0.04978707
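The memoryless property can also be checked directly from the simulated sample and then confirmed through calculus. The sketch below is illustrative (the thresholds s0 and t0 are arbitrary choices based on the mean 1/λ, not values taken from the prompt):
# Sample-based check of memorylessness: P(Z > s0 + t0 | Z > s0) should be close to P(Z > t0)
s0 <- E_mean_z / 2
t0 <- E_mean_z / 2
mean(Z > s0 + t0) / mean(Z > s0)   # conditional probability estimated from the simulated Z
mean(Z > t0)                       # unconditional probability estimated from the simulated Z
Through calculus, for any s, t > 0,
\[P(Z > s + t \mid Z > s) = \frac{P(Z > s + t)}{P(Z > s)} = \frac{\int_{s+t}^{\infty} \lambda e^{-\lambda z}\,dz}{\int_{s}^{\infty} \lambda e^{-\lambda z}\,dz} = \frac{e^{-\lambda (s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(Z > t),\]
so the memoryless property holds for the exponential distribution.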
Investigate whether Y and Z are independent, i.e., whether P(Y, Z) = P(Y) P(Z), by building a table with quartiles and evaluating the marginal and joint probabilities.
# first we need to get the quantiles of each
y_quant <- quantile(Y, c(0.25,0.50,0.75,1))
z_quant <- quantile(Z, c(0.25,0.50,0.75,1))
# convert to df (creates row/col names)
y_quant <- as.data.frame(y_quant)
z_quant <- as.data.frame(z_quant)
# convert to matrix
y_quant <- as.matrix(t(y_quant))
z_quant <- as.matrix(z_quant)
# matrix multiplication
prod_y_z <- z_quant %*% y_quant
# sum
prod_sum_y_z <- sum(prod_y_z)
# create blank matrix and set col and row names
prob_y_z <- matrix(nrow=4,ncol=4, 0,
dimnames = dimnames(prod_y_z))
# loop through blank matrix and calc prob
for (i in 1:dim(prob_y_z)[1]) {
for (j in 1:dim(prob_y_z)[2]) {
prob_y_z[i,j] = prod_y_z[i,j]/prod_sum_y_z
}
}
# view results
prob_y_z
## 25% 50% 75% 100%
## 25% 0.006317426 0.006317426 0.006317426 0.006317426
## 50% 0.015246730 0.015246730 0.015246730 0.015246730
## 75% 0.030210659 0.030210659 0.030210659 0.030210659
## 100% 0.198225184 0.198225184 0.198225184 0.198225184
# sums of cols and append as row
col_sums <- colSums(prob_y_z)
prob_y_z <- rbind(prob_y_z, col_sums)
# sums of rows and append as col
row_sums <- rowSums(prob_y_z)
prob_y_z <- cbind(prob_y_z, row_sums)
prob_y_z
## 25% 50% 75% 100% row_sums
## 25% 0.006317426 0.006317426 0.006317426 0.006317426 0.02526971
## 50% 0.015246730 0.015246730 0.015246730 0.015246730 0.06098692
## 75% 0.030210659 0.030210659 0.030210659 0.030210659 0.12084264
## 100% 0.198225184 0.198225184 0.198225184 0.198225184 0.79290074
## col_sums 0.250000000 0.250000000 0.250000000 0.250000000 1.00000000
Essentially, the joint probabilities are in the first four rows and columns, and the marginal probabilities are in the fifth row and column, respectively.
Looking at the table, the joint probabilities appear to approach the marginal products only at the higher quartiles: at the lower quartiles there is very little joint mass, while at the highest quartile the values spike. On its own this table is hard to interpret, so more formal methods are needed to evaluate independence; see the cross-check and the tests below.
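As a cross-check on the hand-built table above (a sketch, not part of the graded output; it assumes the Y and Z vectors simulated earlier), the joint distribution can also be tabulated by cutting each sample at its own quartiles and cross-tabulating:
# Bin Y and Z at their quartiles and form an empirical joint probability table
y_bins <- cut(Y, breaks = quantile(Y, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
z_bins <- cut(Z, breaks = quantile(Z, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
joint_tab <- table(y_bins, z_bins) / length(Y)   # joint probabilities
addmargins(joint_tab)                            # marginals appended as row/column sums
Because Y and Z were simulated independently, each joint cell should be close to 0.25 × 0.25 = 0.0625, the product of the corresponding marginal probabilities.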
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Both Fisher's Exact Test and the Chi-Square Test are used to analyze categorical data and determine whether there is a significant association between two categorical variables. Fisher's Exact Test computes exact probabilities and is typically used for smaller sample sizes, whereas the Chi-Square Test relies on a large-sample approximation and is generally the more appropriate choice here, where the table is built from 10,000 observations. In both tests the null hypothesis is that there is no association between Y and Z.
counts <- prob_y_z*10000
#fisher.test(base_grid, workspace = <large number>)
fisher.test(counts , simulate.p.value = TRUE)
## Warning in fisher.test(counts, simulate.p.value = TRUE): 'x' has been rounded
## to integer: Mean relative difference: 0.000243373
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: counts
## p-value = 1
## alternative hypothesis: two.sided
Run Chi-square Test
chisq.test(counts, simulate.p.value = T)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: counts
## X-squared = 2.1504e-28, df = NA, p-value = 1
A p-value of 1 means the observed test statistic is entirely consistent with the null hypothesis. The data provide no evidence against the null hypothesis of independence, so we fail to reject it and conclude that any observed differences between the quartile cells are attributable to random chance.