Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will demonstrate your understanding of the material covered in the class.
Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely described by n (a size parameter) and lambda (λ, a rate parameter). Choose any n greater than 3 and a value of λ between 2 and 10 (you choose).
\[E(x) = \frac{n}{\lambda}\] \[Var(x) = \frac{n}{\lambda^2}\]
#make this example reproducible
set.seed(123)
# set parameters
lambda<- 4 #between 2 and 10
n<- 5 # Choose any n greater 3
#generate 10,000 random values that follow gamma distribution
x <- rgamma(10000, shape=n, rate=lambda)
#create histogram to view distribution of values
hist(x, freq = FALSE, ylab = "Density", main = "Gamma PDF with n = 5 and λ = 4")
# Create a density plot using the density() function
plot(density(x), main = "Density Plot using density()", col= "Green", lwd = 2)
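As a quick visual check (a sketch, not part of the original prompt, reusing the x, n, and lambda defined above), the theoretical Gamma density can be drawn over the empirical density:
# Empirical density of x with the theoretical Gamma(shape = n, rate = lambda) density overlaid
plot(density(x), main = "Empirical vs. theoretical Gamma density", col = "green", lwd = 2)
curve(dgamma(x, shape = n, rate = lambda), add = TRUE, col = "red", lwd = 2, lty = 2)
legend("topright", legend = c("empirical density of x", "theoretical dgamma"),
       col = c("green", "red"), lwd = 2, lty = c(1, 2))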
Then generate 10,000 observations from the sum of n exponential pdfs with rate parameter (λ). The n and λ must be the same as in the previous case (e.g., mysum = rexp(10000, lambda) + rexp(10000, lambda) + …).
\[E(y) = \frac{n}{\lambda}\]
\[Var(y) = \frac{n}{\lambda^2}\]
#make this example reproducible
set.seed(123)
# set parameters
lambda<- 4 #between 2 and 10
n<- 5 # Choose any n greater 3
# Generate 10,000 observations, each the sum of n independent Exponential(lambda) draws
Y <- rowSums(matrix(rexp(10000 * n, rate = lambda), ncol = n))
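A quick check (a sketch using the Y just generated): the sum of n Exponential(λ) variables follows a Gamma(shape = n, rate = λ) distribution, so the theoretical Gamma density should track the histogram of Y.
# Histogram of the simulated sums with the theoretical Gamma density overlaid
hist(Y, freq = FALSE, xlab = "Y", main = "Sum of n exponentials vs. Gamma(n, lambda) density")
curve(dgamma(x, shape = n, rate = lambda), add = TRUE, col = "blue", lwd = 2)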
Z ~ Exponential. Then generate 10,000 observations from a single exponential pdf with rate parameter (λ). NOTE: The Gamma distribution is quite common in data science. For example, it is used to model failures for multiple processes when each of those processes has the same failure rate. The exponential is used for constant failure rates, service times, etc.
5 points.
\[E(z) = \frac{1}{\lambda}\]
\[Var(z) = \frac{1}{\lambda^2}\]
#make this example reproducible
set.seed(123)
# set parameters
lambda <- 4 #between 2 and 10
n <- 5 # Choose any n greater than 3
Z <- rexp(10000, lambda)
hist(Z, freq = FALSE, ylab = "Density", main = "Exponential PDF with λ = 4")
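As an optional visual comparison (a sketch assuming the x, Y, and Z vectors generated above), all three simulated densities can be plotted together; X and Y should look nearly identical, since the sum of n exponentials is Gamma-distributed, while Z is concentrated near zero.
# Compare the empirical densities of X (Gamma), Y (sum of exponentials), and Z (single exponential)
plot(density(x), col = "green", lwd = 2, xlim = c(0, 4), ylim = c(0, 4),
     main = "Simulated densities of X, Y and Z")
lines(density(Y), col = "blue", lwd = 2)
lines(density(Z), col = "red", lwd = 2)
legend("topright", legend = c("X ~ Gamma(n, lambda)", "Y = sum of n Exp(lambda)", "Z ~ Exp(lambda)"),
       col = c("green", "blue", "red"), lwd = 2)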
#(empirical_mean_x <- mean(x))
#(empirical_variance_x <- var(x))
(E_mean_x<-n/lambda)
## [1] 1.25
(E_Var_x<-n/lambda^2)
## [1] 0.3125
#(empirical_mean_Y <- mean(Y))
#(empirical_variance_Y <- var(Y))
(E_mean_y<-n/lambda)
## [1] 1.25
(E_Var_y<-n/lambda^2)
## [1] 0.3125
#(empirical_mean_Z <- mean(Z))
#(empirical_variance_Z <- var(Z))
(E_mean_z<-1/lambda)
## [1] 0.25
(E_Var_z<-1/lambda^2)
## [1] 0.0625
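To put the empirical and theoretical moments side by side (a small summary sketch; it assumes the x, Y, and Z vectors simulated above and simply carries out the empirical calculations that are commented out in the chunks):
# Empirical (simulated) vs. theoretical means and variances for X, Y and Z
comparison <- data.frame(
  variable  = c("X ~ Gamma(n, lambda)", "Y = sum of n Exp(lambda)", "Z ~ Exp(lambda)"),
  emp_mean  = c(mean(x), mean(Y), mean(Z)),
  theo_mean = c(n / lambda, n / lambda, 1 / lambda),
  emp_var   = c(var(x), var(Y), var(Z)),
  theo_var  = c(n / lambda^2, n / lambda^2, 1 / lambda^2)
)
comparison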
Mean: https://statproofbook.github.io/P/gam-mean.html
\[E(x) = \frac{n}{\lambda}\]
integrand <- function(x) x * dgamma(x, n, lambda)
expected_value_X <- integrate(integrand, lower = 0, upper = Inf)$value
expected_value_X
## [1] 1.25
\[Var(x) = \frac{n}{\lambda^2}\]
integrand <- function(x) x^2 * dgamma(x, n, lambda)
variance_X <- integrate(integrand, lower = 0, upper = Inf)$value - expected_value_X^2
variance_X
## [1] 0.3125
Variance: https://statproofbook.github.io/P/gam-var
Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y).
The MGF of an exponential distribution with rate parameter λ is given by:
\[M_Z(t) = \frac{\lambda}{\lambda-t}, \quad t < \lambda\]
The expected value is the first derivative of the MGF evaluated at t = 0. For the single exponential Z with rate λ = 4:
\[E(Z) = M_Z'(0) = \left.\frac{\lambda}{(\lambda-t)^2}\right|_{t=0} = \frac{1}{\lambda} = \frac{1}{4} = 0.25\]
Because Y is the sum of n independent exponentials, its MGF is the product of the n individual MGFs:
\[M_Y(t) = \left(\frac{\lambda}{\lambda-t}\right)^n\]
so with n = 5 and λ = 4,
\[E(Y) = M_Y'(0) = \left.\frac{n\lambda^{n}}{(\lambda-t)^{n+1}}\right|_{t=0} = \frac{n}{\lambda} = \frac{5}{4} = 1.25\]
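As a numeric sanity check (a small sketch, not part of the original solution; mgf_Z, mgf_Y, and the step size h are illustrative names), the MGF derivatives at t = 0 can be approximated by finite differences in R:
# Finite-difference approximation of the MGF derivatives at t = 0
mgf_Z <- function(t) lambda / (lambda - t)        # MGF of a single Exponential(lambda)
mgf_Y <- function(t) (lambda / (lambda - t))^n    # MGF of the sum of n exponentials
h <- 1e-6
(mgf_Z(h) - mgf_Z(-h)) / (2 * h)   # should be close to 1/lambda = 0.25
(mgf_Y(h) - mgf_Y(-h)) / (2 * h)   # should be close to n/lambda = 1.25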
### 1c-e. Probability

For pdf Z (the exponential), calculate empirically probabilities a through c. Then evaluate through calculus whether the memoryless property holds.

a. \[P(Z > \lambda \mid Z > \lambda/2)\]
We can use the pexp() function to calculate these probabilities.
(Emp_prob_a <- 1-pexp(E_mean_z, lambda))
## [1] 0.3678794
(Emp_prob_b <- 1-pexp(E_mean_z, 2*lambda))
## [1] 0.1353353
(Emp_prob_c <- 1-pexp(E_mean_z, 3*lambda))
## [1] 0.04978707
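The memoryless property can also be checked directly from the simulated sample and then confirmed through calculus. The sketch below is illustrative (the thresholds s0 and t0 are arbitrary choices based on the mean 1/λ, not values taken from the prompt):
# Sample-based check of memorylessness: P(Z > s0 + t0 | Z > s0) should be close to P(Z > t0)
s0 <- E_mean_z / 2
t0 <- E_mean_z / 2
mean(Z > s0 + t0) / mean(Z > s0)   # conditional probability estimated from the simulated Z
mean(Z > t0)                       # unconditional probability estimated from the simulated Z
Through calculus, for any s, t > 0,
\[P(Z > s + t \mid Z > s) = \frac{P(Z > s + t)}{P(Z > s)} = \frac{\int_{s+t}^{\infty} \lambda e^{-\lambda z}\,dz}{\int_{s}^{\infty} \lambda e^{-\lambda z}\,dz} = \frac{e^{-\lambda (s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(Z > t),\]
so the memoryless property holds for the exponential distribution.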
Investigate whether Y and Z are independent, i.e., whether P(Y, Z) = P(Y) P(Z), by building a table with quartiles and evaluating the marginal and joint probabilities.
# first we need to get the quantiles of each
y_quant <- quantile(Y, c(0.25,0.50,0.75,1))
z_quant <- quantile(Z, c(0.25,0.50,0.75,1))
# convert to df (creates row/col names)
y_quant <- as.data.frame(y_quant)
z_quant <- as.data.frame(z_quant)
# convert to matrix
y_quant <- as.matrix(t(y_quant))
z_quant <- as.matrix(z_quant)
# matrix multiplication
prod_y_z <- z_quant %*% y_quant
# sum
prod_sum_y_z <- sum(prod_y_z)
# create blank matrix and set col and row names
prob_y_z <- matrix(nrow=4,ncol=4, 0,
dimnames = dimnames(prod_y_z))
# loop through blank matrix and calc prob
for (i in 1:dim(prob_y_z)[1]) {
for (j in 1:dim(prob_y_z)[2]) {
prob_y_z[i,j] = prod_y_z[i,j]/prod_sum_y_z
}
}
# view results
prob_y_z
## 25% 50% 75% 100%
## 25% 0.006317426 0.006317426 0.006317426 0.006317426
## 50% 0.015246730 0.015246730 0.015246730 0.015246730
## 75% 0.030210659 0.030210659 0.030210659 0.030210659
## 100% 0.198225184 0.198225184 0.198225184 0.198225184
# sums of cols and append as row
col_sums <- colSums(prob_y_z)
prob_y_z <- rbind(prob_y_z, col_sums)
# sums of rows and append as col
row_sums <- rowSums(prob_y_z)
prob_y_z <- cbind(prob_y_z, row_sums)
prob_y_z
## 25% 50% 75% 100% row_sums
## 25% 0.006317426 0.006317426 0.006317426 0.006317426 0.02526971
## 50% 0.015246730 0.015246730 0.015246730 0.015246730 0.06098692
## 75% 0.030210659 0.030210659 0.030210659 0.030210659 0.12084264
## 100% 0.198225184 0.198225184 0.198225184 0.198225184 0.79290074
## col_sums 0.250000000 0.250000000 0.250000000 0.250000000 1.00000000
Essentially, the joint probabilities are in the first four rows and columns, and the marginal probabilities are in the fifth row and column, respectively.
Looking at the table, the joint probabilities appear to approach the marginal products only at the higher quartiles: at the lower quartiles there is very little joint mass, while at the highest quartile the values spike. On its own this table is hard to interpret, so more formal methods are needed to evaluate independence; see the cross-check and the tests below.
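As a cross-check on the hand-built table above (a sketch, not part of the graded output; it assumes the Y and Z vectors simulated earlier), the joint distribution can also be tabulated by cutting each sample at its own quartiles and cross-tabulating:
# Bin Y and Z at their quartiles and form an empirical joint probability table
y_bins <- cut(Y, breaks = quantile(Y, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
z_bins <- cut(Z, breaks = quantile(Z, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
joint_tab <- table(y_bins, z_bins) / length(Y)   # joint probabilities
addmargins(joint_tab)                            # marginals appended as row/column sums
Because Y and Z were simulated independently, each joint cell should be close to 0.25 × 0.25 = 0.0625, the product of the corresponding marginal probabilities.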
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Both Fisher's Exact Test and the Chi-Square Test are used to analyze categorical data and determine whether there is a significant association between two categorical variables. Fisher's Exact Test computes exact probabilities and is typically used for smaller sample sizes, whereas the Chi-Square Test relies on a large-sample approximation and is generally the more appropriate choice here, where the table is built from 10,000 observations. In both tests the null hypothesis is that there is no association between Y and Z.
counts <- prob_y_z*10000
#fisher.test(base_grid, workspace = <large number>)
fisher.test(counts , simulate.p.value = TRUE)
## Warning in fisher.test(counts, simulate.p.value = TRUE): 'x' has been rounded
## to integer: Mean relative difference: 0.000243373
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: counts
## p-value = 1
## alternative hypothesis: two.sided
Run Chi-square Test
chisq.test(counts, simulate.p.value = T)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: counts
## X-squared = 2.1504e-28, df = NA, p-value = 1
A p-value of 1 means the observed test statistic is entirely consistent with the null hypothesis. The data provide no evidence against the null hypothesis of independence, so we fail to reject it and conclude that any observed differences between the quartile cells are attributable to random chance.