library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.1.0 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(stats)
library(skimr)
#Part A
Suppose the following: θ ~ Gamma(a, b) where a is the shape parameter, b is the rate parameter. a = 2 b = 1/3
This Gamma Distribution could represent the posterior of θ if my data comes from a Poisson Distribution with mean θ & I had used a conjugate Gamma prior.
To start the simulation, my number of simulation is m = 100. m is independent samples from the Gamma distribution.
Change m = 100 to m = 10000 to see how its increase impact accuracy of simulation
set.seed(32)
m = 10000
a = 2
b = 1/3
theta = rgamma(n = m, shape = a, rate = b)
df <- data.frame(theta)
set.seed() function: use to initialize the random number generator. Its primary purpose is to ensure that the results of random number generation are reproducible.
ggplot(df, aes(x = theta)) +
geom_histogram(aes(y = after_stat(density))) + #this gives me probability density and not count
labs(y = "Density")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
1). Now I want to draw a curve of this Gamma Distribution because I want
to compare the simulation from the histogram to the theoretical
probability density function for this Gamma.
ggplot(df, aes(x = theta)) +
geom_histogram(aes(y = after_stat(density)), bins = 20, color = "white") + #this gives me probability density and not count
labs(y = "Density") +
stat_function(fun = dgamma, args = list(shape = a, rate = b), color = "blue")
What does stat_function() do in ggplot()? * It adds a
theoretical curve to a ggplot by evaluating a function on the x-range of
the plot and drawing the result. * What it does * Creates an internal
grid of x values (length n, default ~101) over the current x limits. *
Computes y = fun(x, …) using your function and args. * Plots the
resulting (x, y) as a line, inheriting ggplot scales, themes, facets,
etc.
2).
Compare simulation E(θ) Vs. theoretical E(θ) as well as their variances.
Returning to the simulation: Simulation E(θ) = sum of rgamma / m Vs.
True theoretical E(θ) = a / b = 2 / 1/3 = 6
simulation_Eθ = sum(theta) / m
simulation_Eθ
## [1] 6.022368
True_Theoretical_Eθ = 2 / (1/3)
True_Theoretical_Eθ
## [1] 6
Simulation Var(θ) = var(θ) Vs. True theoretical Var(θ) = a / b^2
sim_Var_θ = var(theta)
sim_Var_θ
## [1] 18.01033
True_theoretical_Var_θ = a / b^2
True_theoretical_Var_θ
## [1] 18
Notice that I am in the ball park with the expected value of the simulataion, but off with the variance for my 100 iteration simulation.
To increase the accuracy of simulation, increasae number of iteration to m = 10000
After changing m = 10000 the simulations E(θ) and Var(θ) are very similar to the true theoretical ones.
3). Now, I want to perform summary statistics or inferences of theta so I can be more informed about if I want my indicator function, I < 5
summary(theta)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04149 2.88774 5.02640 6.02237 8.07676 37.43035
indicator_func = theta < 5
mean(indicator_func)
## [1] 0.4974
The probability of the simulation is < 5 ≈ 0.4974.
I want to compare this to the True theoretical probability of the Gamma Distribution with the given parameters.
I can do so with pgamma() function the CDF at q <= 5
pgamma(q = 5, shape = a, rate = b)
## [1] 0.4963317
I can see that with simulation m = 10,000 has probability < 5 is about the same as the True theoretical distribution.
4). Now, I am interested in the 90th percentile or the 0.9 quantile of the Gamma Distribution.
Simulation 90th percentile is possible through quantile() function Vs. True theoretical 90th percentile of the Gamma with qgamma() function
quantile(theta, probs = 0.90)
## 90%
## 11.74426
qgamma(p = 0.9, shape = a, rate = b)
## [1] 11.66916
Result of 90th percentile / 0.9 quantile of this Gamma: A). Simulation ≈ 11.74426 B). True theoretical ≈ 11.6692
This further reinforces one of the advantages of Monte Carlo
simulation in which as iteration increases
the more accurate it gets.
#Part B
Recall the Gamma Distribution:
set.seed(32) m = 10000 a = 2 b = 1/3 theta = rgamma(n = m, shape = a, rate = b) df <- data.frame(theta)
If I seek the expected value of θ of this Gamma Distribution, then I can use the sample mean from the simulated values or simulated θ. How? The sample means of these thetas approx. follow the Normal Distribution in which the mean of that Normal Distribution is the True Expected Value of θ as well as the True Variance(θ) / m
** As m increases, the variance of the Monte Carlo estimate decreases.
#Then, Standard Error (SE) of this Monte Carlo approx. Normal = sqrt(variance(θ)) / sqrt(m)
SE = sqrt(sim_Var_θ) / sqrt(m)
SE
## [1] 0.04243858
Confidence interval: 2 * SE
confid_int = 2 * SE
confid_int
## [1] 0.08487716
Multiply 2 * SE means that: I am 95% reasonably confident that the Monte Carlo estimate for the Expected Value (θ) is no more than 0.084877 from the True Value of the Expected Value (θ).
Lower and Upper interval of confidence: * Lower bound = simulated 𝝁 - 2 * Standard Error. * Upper bound = simulated 𝝁 + 2 * Standard Error.
lowCon_Int = simulation_Eθ - 2 * SE
lowCon_Int
## [1] 5.937491
upperCon_in = simulation_Eθ + 2 * SE
upperCon_in
## [1] 6.107245
This means: I am reasonably confident after running the Monte Carlo simulation that the True Mean of the Gamma Distribution is within (5.937, 6.107).
What about the probability that in which theta < 5 or Pr(theta < 5)?
Indicator function (theta < 5):
E(θ <5) = mean(indicator_fun)
E_Indicator = mean(indicator_func)
E_Indicator
## [1] 0.4974
Monte Carlo Standard Error for the Indicator Function = STDEV(Indicator) / sqrt(m)
Simulated_SE_θ = sd(indicator_func) / sqrt(m)
Simulated_SE_θ
## [1] 0.005000182
Confidence interval: theta < 5 2 * Simulated_SE_θ
2 * Simulated_SE_θ
## [1] 0.01000036
I am 95% reasonably confident that the Monte Carlo estimate for the Expected Value (θ <5) is no more than 0.01000036 from the True Value of the Expected Value (θ <5).
Lower and Upper interval of confidence: * Lower bound = simulated 𝝁 - 2 * Standard Error. * Upper bound = simulated 𝝁 + 2 * Standard Error.
lowerB = E_Indicator - 2 * Simulated_SE_θ
lowerB
## [1] 0.4873996
upperB = E_Indicator + 2 * Simulated_SE_θ
upperB
## [1] 0.5074004
This means: I am reasonably confident after running the Monte Carlo simulation that the True Mean in which (θ<5) of the Gamma Distribution is within (0.4873996, 0.5074004).
I can check with the _True Theoretical (θ < 5)
pgamma(q = 5, shape = a, rate = b)
## [1] 0.4963317
Comparing the Monte Carlo (θ<5) against the True Theoretical value (θ<5) shows me that the simulation confidence intervals are bounded correctly.