library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidyr)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(stats)
library(skimr)

#Part A

Suppose the following: θ ~ Gamma(a, b) where a is the shape parameter, b is the rate parameter. a = 2 b = 1/3

This Gamma Distribution could represent the posterior of θ if my data comes from a Poisson Distribution with mean θ & I had used a conjugate Gamma prior.

To start the simulation, my number of simulation is m = 100. m is independent samples from the Gamma distribution.

Change m = 100 to m = 10000 to see how its increase impact accuracy of simulation

set.seed(32)
m = 10000
a = 2
b = 1/3
theta = rgamma(n = m, shape = a, rate = b)
df <- data.frame(theta)

set.seed() function: use to initialize the random number generator. Its primary purpose is to ensure that the results of random number generation are reproducible.

ggplot(df, aes(x = theta)) + 
  geom_histogram(aes(y = after_stat(density))) + #this gives me probability density and not count 
  labs(y = "Density")

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

1). Now I want to draw a curve of this Gamma Distribution because I want to compare the simulation from the histogram to the theoretical probability density function for this Gamma.

ggplot(df, aes(x = theta)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 20, color = "white") + #this gives me probability density and not count 
  labs(y = "Density") + 
  stat_function(fun = dgamma, args = list(shape = a, rate = b), color = "blue")

What does stat_function() do in ggplot()? * It adds a theoretical curve to a ggplot by evaluating a function on the x-range of the plot and drawing the result. * What it does * Creates an internal grid of x values (length n, default ~101) over the current x limits. * Computes y = fun(x, …) using your function and args. * Plots the resulting (x, y) as a line, inheriting ggplot scales, themes, facets, etc.

fun = the function to evaluate (e.g., dnorm, dgamma, or your own).
args = a named list of extra parameters passed to fun.

2).
Compare simulation E(θ) Vs. theoretical E(θ) as well as their variances. Returning to the simulation: Simulation E(θ) = sum of rgamma / m Vs. True theoretical E(θ) = a / b = 2 / 1/3 = 6

simulation_Eθ = sum(theta) / m
simulation_Eθ

## [1] 6.022368

True_Theoretical_Eθ = 2 / (1/3)
True_Theoretical_Eθ

## [1] 6

Simulation Var(θ) = var(θ) Vs. True theoretical Var(θ) = a / b^2

sim_Var_θ = var(theta)
sim_Var_θ

## [1] 18.01033

True_theoretical_Var_θ = a / b^2
True_theoretical_Var_θ

## [1] 18

Notice that I am in the ball park with the expected value of the simulataion, but off with the variance for my 100 iteration simulation.

To increase the accuracy of simulation, increasae number of iteration to m = 10000

After changing m = 10000 the simulations E(θ) and Var(θ) are very similar to the true theoretical ones.

3). Now, I want to perform summary statistics or inferences of theta so I can be more informed about if I want my indicator function, I < 5

summary(theta)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.04149  2.88774  5.02640  6.02237  8.07676 37.43035

indicator_func = theta < 5
mean(indicator_func)

## [1] 0.4974

The probability of the simulation is < 5 ≈ 0.4974.

I want to compare this to the True theoretical probability of the Gamma Distribution with the given parameters.

I can do so with pgamma() function the CDF at q <= 5

pgamma(q = 5, shape = a, rate = b)

## [1] 0.4963317

I can see that with simulation m = 10,000 has probability < 5 is about the same as the True theoretical distribution.

4). Now, I am interested in the 90th percentile or the 0.9 quantile of the Gamma Distribution.

Simulation 90th percentile is possible through quantile() function Vs. True theoretical 90th percentile of the Gamma with qgamma() function

quantile(theta, probs = 0.90)

##      90% 
## 11.74426

qgamma(p = 0.9, shape = a, rate = b)

## [1] 11.66916

Result of 90th percentile / 0.9 quantile of this Gamma: A). Simulation ≈ 11.74426 B). True theoretical ≈ 11.6692

This further reinforces one of the advantages of Monte Carlo simulation in which as iteration increases
the more accurate it gets.

#Part B

Using CLT to approx. how accurate the Monte Carlo estimates are

Recall the Gamma Distribution:

set.seed(32) m = 10000 a = 2 b = 1/3 theta = rgamma(n = m, shape = a, rate = b) df <- data.frame(theta)

If I seek the expected value of θ of this Gamma Distribution, then I can use the sample mean from the simulated values or simulated θ. How? The sample means of these thetas approx. follow the Normal Distribution in which the mean of that Normal Distribution is the True Expected Value of θ as well as the True Variance(θ) / m

** As m increases, the variance of the Monte Carlo estimate decreases.

#Then, Standard Error (SE) of this Monte Carlo approx. Normal = sqrt(variance(θ)) / sqrt(m)

SE = sqrt(sim_Var_θ) / sqrt(m)
SE

## [1] 0.04243858

Confidence interval: 2 * SE

confid_int = 2 * SE
confid_int

## [1] 0.08487716

Multiply 2 * SE means that: I am 95% reasonably confident that the Monte Carlo estimate for the Expected Value (θ) is no more than 0.084877 from the True Value of the Expected Value (θ).

Lower and Upper interval of confidence: * Lower bound = simulated 𝝁 - 2 * Standard Error. * Upper bound = simulated 𝝁 + 2 * Standard Error.

lowCon_Int = simulation_Eθ - 2 * SE
lowCon_Int

## [1] 5.937491

upperCon_in = simulation_Eθ + 2 * SE
upperCon_in

## [1] 6.107245

This means: I am reasonably confident after running the Monte Carlo simulation that the True Mean of the Gamma Distribution is within (5.937, 6.107).

What about the probability that in which theta < 5 or Pr(theta < 5)?

Indicator function (theta < 5):

E(θ <5) = mean(indicator_fun)

E_Indicator = mean(indicator_func)
E_Indicator

## [1] 0.4974

Monte Carlo Standard Error for the Indicator Function = STDEV(Indicator) / sqrt(m)

Simulated_SE_θ = sd(indicator_func) / sqrt(m)
Simulated_SE_θ

## [1] 0.005000182

Confidence interval: theta < 5 2 * Simulated_SE_θ

2 * Simulated_SE_θ

## [1] 0.01000036

I am 95% reasonably confident that the Monte Carlo estimate for the Expected Value (θ <5) is no more than 0.01000036 from the True Value of the Expected Value (θ <5).

Lower and Upper interval of confidence: * Lower bound = simulated 𝝁 - 2 * Standard Error. * Upper bound = simulated 𝝁 + 2 * Standard Error.

lowerB = E_Indicator - 2 * Simulated_SE_θ
lowerB

## [1] 0.4873996

upperB = E_Indicator + 2 * Simulated_SE_θ
upperB

## [1] 0.5074004

This means: I am reasonably confident after running the Monte Carlo simulation that the True Mean in which (θ<5) of the Gamma Distribution is within (0.4873996, 0.5074004).

I can check with the _True Theoretical (θ < 5)

pgamma(q = 5, shape = a, rate = b)

## [1] 0.4963317

Comparing the Monte Carlo (θ<5) against the True Theoretical value (θ<5) shows me that the simulation confidence intervals are bounded correctly.

Monte Carlo Simulation

Kacey

2025-09-30

Using CLT to approx. how accurate the Monte Carlo estimates are