Question 1

  1. Starting with \(X_0\) = 1, write down the entire cycle for \(X_i\) = \(11X_{i-1}\) mod (16).

Answer 1

library(knitr)

y <- function(x0){
  # generate successive values of X_i = 11 * X_{i-1} mod 16,
  # stopping once the sequence returns to its starting value
  x <- (11 * x0) %% 16
  while (TRUE) {
    x <- c(x, (11 * tail(x, 1)) %% 16)
    if (tail(x, 1) == x[1]) break
  }
  return(x)
}
cycle <- data.frame('X' = y(1))

kable(cycle, caption = 'Random Value')
Random Value
X
11
9
3
1
11

Starting from \(X_0 = 1\), the generator therefore cycles through 1, 11, 9, 3 and back to 1, so the cycle has length 4.


Question 2

Using the LCG provided below: \(X_i\) = (\(X_{i-1}\) + 12) mod (13), plot the pairs \((U_1, U_2), (U_2, U_3)\),…, and observe the lattice structure obtained. Discuss what you observed.

Answer 2

library(ggplot2)

y2 <- function(x0){
  # generate 100 successive values of X_i = (X_{i-1} + 12) mod 13
  x <- (x0 + 12) %% 13
  for (i in 1:100) {
    x <- c(x, (tail(x, 1) + 12) %% 13)
  }
  return(x)
}
func2<- y2(0)

data <- data.frame(x=head(func2,-1), y=tail(func2, -1))
ggplot(data, aes(x = x, y = y)) + geom_point()

Apart from the wrap-around pair (0, 12), every point falls on the single line \(y = x - 1\): adding 12 modulo 13 is the same as subtracting 1, so the successive pairs \((X_i, X_{i+1})\) collapse onto one lattice line, and the generator has very poor two-dimensional structure.

#########################
y2 <- function(x0){
  # same generator, but stop once the sequence returns to its starting value
  x <- (x0 + 12) %% 13
  for (i in 1:100) {
    x <- c(x, (tail(x, 1) + 12) %% 13)
    if (tail(x, 1) == x[1]) break
  }
  return(x)
}
func22 <- data.frame('X' = y2(0))

kable(func22, caption = 'Random Value')
Random Value
X
12
11
10
9
8
7
6
5
4
3
2
1
0
12

Question 3

Implement the pseudo-random number generator:

\[X_i = 16807X_{i-1} \text{ mod}(2^{31} - 1)\]

Using the seed \(X_0\) = 1234567, run the generator for 100,000 observations. Perform a chi-square goodness-of-fit test on the resulting PRN’s. Use 20 equal-probability intervals and level \(\alpha\)= 0.05. Now perform a runs up-and-down test with \(\alpha\) = 0.05 on the observations to see if they are independent.

Answer 3

Using the seed \(X_0\) = 1234567, run the generator (the call below generates 100 values; increasing the second argument to 100000 gives the full run):

y3 <- function(x0, n){
  # X_i = 16807 * X_{i-1} mod (2^31 - 1), together with U_i = X_i / (2^31 - 1);
  # the seed itself is kept as the first row of the output
  m <- 2^31 - 1
  x <- x0
  r <- x0 / m
  for (i in 1:n) {
    x <- c(x, (16807 * tail(x, 1)) %% m)
    r <- c(r, tail(x, 1) / m)
  }
  df <- data.frame(x, r)
  return(df)
}
func3 <- y3(1234567, 100)

kable(head(func3, 10), caption = 'First 10 generated observations')
First 10 generated observations
x r
1234567 0.0005749
1422014746 0.6621772
456328559 0.2124945
849987676 0.3958064
681650688 0.3174183
1825340118 0.8499902
1687465831 0.7857875
1569179335 0.7307061
2097898185 0.9769100
1988278849 0.9258645
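As a quick sanity check, the second row can be reproduced by hand from the recurrence:

\[X_1 = 16807 \times 1234567 \bmod (2^{31}-1) = 20749367569 \bmod 2147483647 = 1422014746,\]

which matches the table, with \(U_1 = 1422014746/2147483647 \approx 0.6622\).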

Perform a chi-square goodness-of-fit test on the resulting PRN’s. Use 20 equal-probability intervals and level \(\alpha\)= 0.05:

For this question, our chi-square approach consists of four steps: 1) state the hypotheses, 2) formulate an analysis plan, 3) analyze the sample data, 4) interpret the results.

  1. State the hypotheses: \(H_0\): The data are consistent with a specified distribution (uniform distribution). \(H_a\): The data are not consistent with a specified distribution (uniform distribution).

  2. Formulate an analysis plan: For this analysis, the significance level is 0.05. Using sample data, we will conduct a chi-square goodness of fit test of the null hypothesis.

  3. Analyze sample data:

In this case we apply the chi-square goodness-of-fit test to the sample data: we compute the degrees of freedom, the expected frequency counts, and the chi-square test statistic.
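With \(k = 20\) equal-probability intervals and \(n\) generated values, the expected count in each interval is \(E_i = n/20\), and the statistic, with \(k - 1 = 19\) degrees of freedom, is

\[\chi^2 = \sum_{i=1}^{20} \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = \frac{n}{20}.\]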

intervals <- seq(0,1,by=1/20)
df <- data.frame(lower=head(intervals,-1),upper=tail(intervals,-1))

library(plyr)
df2 <- ddply(df, .variables=c("lower","upper"), function(df){c(count = with(df, 
                                                            length(func3$r[func3$r>=lower&func3$r<upper])))})
df2$chistat <- ((df2$count - length(func3$r)/20)^2)/(length(func3$r)/20)
chisquare <- sum(df2$chistat)
chisquare
## [1] 18.40594
chisquare2 <- chisq.test(df2$count)
chisquare2
## 
##  Chi-squared test for given probabilities
## 
## data:  df2$count
## X-squared = 18.406, df = 19, p-value = 0.4955
  4. Interpret results:

The p-value = 0.4954934 is not less than 0.05, so we do not reject the null hypothesis that the distribution is uniform.

Now perform a runs up-and-down test with \(\alpha\) = 0.05 on the observations to see if they are independent.

library(tseries)
n <- 100
s <- rep(NA, n - 1)
for (i in 1:(n - 1))
{
  # TRUE when the sequence steps "up" from observation i to i + 1
  s[i] <- func3$x[i] < func3$x[i + 1]
}
runs2 <- runs.test(as.factor(s))
runs2
## 
##  Runs Test
## 
## data:  as.factor(s)
## Standard Normal = 2.3348, p-value = 0.01955
## alternative hypothesis: two.sided

The p-value 0.0195527 is less than 0.05, so we reject the null hypothesis of randomness; this run of the test therefore provides evidence against independence of the pseudo-random numbers.
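For reference, the runs up-and-down statistic can also be computed directly from its asymptotic distribution, with \(E[R] = (2n-1)/3\) and \(\mathrm{Var}[R] = (16n-29)/90\) (tseries::runs.test applied to the up/down indicators is a Wald–Wolfowitz-style variant of this). A minimal sketch; runs_up_down is just an illustrative helper, not a function from any package:

runs_up_down <- function(u, alpha = 0.05) {
  n <- length(u)
  s <- sign(diff(u))             # +1 for an "up" step, -1 for a "down" step
  R <- 1 + sum(diff(s) != 0)     # number of runs up and down
  z <- (R - (2 * n - 1) / 3) / sqrt((16 * n - 29) / 90)
  p <- 2 * (1 - pnorm(abs(z)))
  list(runs = R, z = z, p.value = p, reject = p < alpha)
}
runs_up_down(func3$r)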

Question 4

Give inverse-transforms, composition, and acceptance-rejection algorithms for generating from the following density:

\[f(x)= \begin{cases} \frac{3x^2}{2}, & -1 \leq x \leq 1 \\ 0 , & otherwise \\ \end{cases} \]

Answer 4

  1. Inverse Transform Method

The inverse transform method is applicable when the distribution function F of a random variable is continuous and strictly increasing on the domain of the random variable. In that case the inverse function \(F^{-1}\) exists, and it is easily seen that if U is uniformly distributed on (0, 1), then \(X = F^{-1}(U)\) has distribution function F: \(P(X \leq x) = P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x)\).

\[ F(x) = \int_{-1}^{x} \frac{3t^2}{2}\,dt = \frac{x^3 + 1}{2}, \qquad -1 \leq x \leq 1 \]

Next we set F(x) equal to y and solve for x:

\[ \frac{x^3 + 1}{2} = y \] \[ x^3 = 2y - 1 \] \[ x = \sqrt[3]{2y - 1} \]

Hence the inverse function is \(F^{-1}(u) = \sqrt[3]{2u - 1}\).

Now let's feed uniform \(U(0,1)\) draws through the inverse CDF and check that the resulting sample follows \(f(x)\) on \([-1, 1]\):

invFunc <- function(u)
{
  # F^{-1}(u) = (2u - 1)^(1/3); sign()/abs() handle cube roots of negative
  # arguments, which "^(1/3)" alone would return as NaN in R
  y <- sign(2 * u - 1) * abs(2 * u - 1)^(1/3)
  return(y)
}

# Generate the uniform variables between 0 and 1
n <- 100000
uniVals <- runif(n, 0, 1)

# Use the inverse CDF to generate the values
df4 <- invFunc(uniVals)

hist(df4, main = "Inverse Function ")

  2. Acceptance-Rejection Method

Suppose that we want to generate a random variable X with distribution function F and probability density function f. The acceptance-rejection method requires a candidate density g that we can sample from and a constant c such that \(c\,g(x) \geq f(x)\) for all x (that is, \(c\,g\) majorizes f). We repeatedly draw a candidate X from g and an independent \(U \sim U(0,1)\), and accept X when \(U \leq f(X)/(c\,g(X))\); otherwise we reject and try again. Here g is the uniform density on \([-1, 1]\), so \(g(x) = 1/2\), and since \(\max f = f(\pm 1) = 3/2\) the smallest valid constant is \(c = 3\).
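With this choice the probability of accepting a candidate x takes a simple form:

\[\frac{f(x)}{c\,g(x)} = \frac{3x^2/2}{3 \cdot \tfrac{1}{2}} = x^2,\]

so a uniformly drawn candidate is accepted with probability \(x^2\), and the overall acceptance rate is \(1/c = 1/3\).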

# g(x) function for the Accept/Reject approach
g_x <- function(f, min, max)
{
  # c must satisfy c * g(x) >= f(x); here g is Unif(-1, 1) with density 1/2
  # and max f(x) = 3/2, so c = 3 is the smallest valid constant
  c <- 3
  repeat
  {
    # Sample a candidate x from g(x) and u from U(0, 1)
    r <- runif(1, min, max)
    u <- runif(1, 0, 1)
    gx <- dunif(r, min, max)
    # Accept when u < f(x) / (c * g(x))
    if(u < f(r) / (c * gx)) break
  }
  return(r)
}
############

pdf<- function(x)
{
  if(-1 <=x && x <= 1)
  {
    pdf <- (3 * x^2) / 2
  }
  else
  {
    pdf = 0
  }
  return (pdf)
}
#generate values
vals <- rep(NA, n)
for(i in 1:n)
{
  vals[i] <- g_x(pdf, -1, 1)
}

hist(vals, main = "Accept-Reject Method ")

  3. Composition Method

The composition method is an important principle for facilitating and speeding up random variate generation. The basic idea is simple: to generate random variates with a given density, we first split the domain of the density into subintervals. We then select one of these at random, with probabilities given by the area below the density in the respective subintervals. Finally, we generate a random variate from the density of the selected part by inversion and return it as a random variate of the full distribution.

We can rewrite our \(f(x) = \frac{3x^2}{2}, -1 \leq x \leq 1\) as follows:

\[f(x) = \tfrac{1}{2}\left(3x^2\, I_{[-1,0]}(x)\right) + \tfrac{1}{2}\left(3x^2\, I_{[0,1]}(x)\right)\]

where \(I_A\) denotes the indicator function of the set A:

\[ I_A(x) = \begin{cases} 1, & x \in A \\ 0, & \text{otherwise} \end{cases} \]

Each component \(3x^2\) is a proper density on its subinterval, and each subinterval carries probability 1/2. On \([0,1]\) the component CDF is \(x^3\), with inverse \(u^{1/3}\); on \([-1,0]\) the component CDF is \(x^3 + 1\), with inverse \(-(1-u)^{1/3}\), which is distributed as \(-u^{1/3}\). The full CDF is therefore

\[ F(x) = \begin{cases} 0, & x < -1 \\ \frac{x^3 + 1}{2}, & -1 \leq x \leq 1 \\ 1, & x > 1 \end{cases} \]

Now let's generate our values:

# pick a branch with probability 1/2 each, then invert the corresponding component CDF
u <- runif(10000)
branch <- runif(10000) < 0.5
vals <- ifelse(branch, u^(1/3), -u^(1/3))
hist(vals)

Question 5

Implement, test, and compare different methods to generate from a N(0,1) distribution

  a. The simplest generator is the inverse transform method. Create a function normrandit that:
  1. Takes no input variables.
  2. Generates a uniform random number U \(\sim\) U(0,1)
  3. Returns one output variable: X= \(F^{-1} (U)\) where \(F^{-1}\) is the inverse normal CDF.

Also create a function itstats that:

  1. Takes one input variable N.
  2. Generates N samples from a N(0,1) distribution using normrandit.
  3. Returns two output variables: the mean and the standard deviation of the samples.

  b. Next up, we want to generate samples using the Box-Müller algorithm. Create a function normrandbm that:
  1. Takes no input variables.
  2. Generates two uniform random numbers \(U_1,U_2 \sim U(0,1)\)
  3. Returns two output variables: \(X = (-2ln(U_1))^{\frac{1}{2}} cos(2 \pi U_2)\) and \(Y = (-2ln(U_1))^{\frac{1}{2}} sin(2 \pi U_2)\)

As in a), create a function bmstats that can produce N samples using normrandbm and return their mean and the standard deviation.

  c. Lastly, we want to generate samples using the accept-reject approach. Create a function normrandar that:
  1. Takes no input variable
  2. Generates two uniform random numbers: \(U_1, U_2 \sim U(0,1)\)
  3. Convert the samples to \(Exp(1)\) by calculating \(X,Y = -ln(U_i)\)
  4. Accept the sample if \(Y \geq \frac{(X-1)^2}{2}\) and reject otherwise
  5. If a sample is accepted, randomly choose sign, and return.
  6. If sample is rejected, return to 2 and try again.

As in a), create a function arstats that produces N samples using normrandar and returns their mean and standard deviation.

  d. We now compare and evaluate the approaches implemented in parts a, b, and c. Run 10 iterations of itstats, bmstats, and arstats for N = 100, 1000, 10000, and 100000, and calculate the average means and standard deviations produced by each method at each value of N.

In addition, measure the exact CPU time required for each iteration, and calculate the average time required for each method at each value of N.

For each method, plot the average means and standard deviations against the sample size. Which of the three methods appears to be the most accurate? Also, plot the average CPU time vs. N. Which of the three methods appears to take the least time?

Answer 5

Since \(\mu = 0\) and \(\sigma = 1\), we have the pdf:

\[ f(x) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{x^2}{2}} \]

Inverse Transform Method

\[F(X)=\int \frac{1}{\sqrt{2\pi}} e^{\frac{-x^2}{2}} dx\]

This integral has no closed form; it is related to the Gauss error function erf(). In R, the inverse of the standard normal CDF is available as qnorm(), so normrandit becomes:

normrandit <- function()
{
U <- runif(1)
return (qnorm(U))
}

function itstats:

itstats <- function(N){
  x <- numeric()
  for(i in 1:N){
    x <- c(x,normrandit())
  }
  return(list(x= x,mean=mean(x), sd=sd(x)))
}


stats <- (itstats(10000))
statsdf<- data.frame(stats$mean, stats$sd)

kable(statsdf, caption = 'Inverse Transform Method mean & sd')
Inverse Transform Method mean & sd
stats.mean stats.sd
-0.0045068 1.00662

Box-Müller Method

b) Next up, we want to generate samples using the Box-Müller algorithm. Create a function normrandbm that: 1. takes no input variables; 2. generates two uniform random numbers \(U_1, U_2 \sim U(0,1)\); 3. returns two output variables: \(X = (-2\ln(U_1))^{\frac{1}{2}} \cos(2 \pi U_2)\) and \(Y = (-2\ln(U_1))^{\frac{1}{2}} \sin(2 \pi U_2)\).

library(nortest)
normrandbm <- function(){
  # one pair (U1, U2) of uniforms yields one pair (X, Y) of independent N(0,1) draws
  u1 <- runif(1)
  u2 <- runif(1)

  x <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)
  y <- sqrt(-2 * log(u1)) * sin(2 * pi * u2)

  return(c(x, y))
}
z <- normrandbm()
res <- data.frame(x = z[1], y = z[2])
#a test for normality
#lillie.test(c(res$x,res$y))

#plot the estimation of the density
#plot(density(c(res$x,res$y)))

As in a), create a function bmstats that can produce N samples using normrandbm and return their mean and the standard deviation.

bmstats <- function(N){
  # normrandbm returns a pair of values, so fill the sample two at a time
  x <- numeric(0)
  while (length(x) < N) {
    x <- c(x, normrandbm())
  }
  x <- x[1:N]
  return(list(values = x, mean = mean(x), sd = sd(x)))
}
res<- bmstats(100)
res2<- data.frame('mean' = res$mean, 'sd'= res$sd)

kable(res2, caption = 'mean & sd Box-Muller Method')
mean & sd Box-Muller Method
mean sd
0 0.3870603
  c. Lastly, we want to generate samples using the accept-reject approach. Create a function normrandar that:
  1. Takes no input variable
  2. Generates two uniform random numbers: \(U_1, U_2 \sim U(0,1)\)
  3. Convert the samples to \(Exp(1)\) by calculating \(X,Y = -ln(U_i)\)
  4. Accept the sample if \(Y \geq \frac{(X-1)^2}{2}\) and reject otherwise
  5. If a sample is accepted, randomly choose sign, and return.
  6. If sample is rejected, return to 2 and try again.
normrandar <- function(){
  repeat{
    U <- runif(2)
    X <- -log(U[1])
    Y <- -log(U[2])
    if(Y >= ((X - 1)^2) / 2){
      break
    }
  }
  # the accepted X is half-normal; attach a random sign to obtain N(0,1)
  if(runif(1) < 0.5) X <- -X
  return(X)
}
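A brief note on why the acceptance condition in step 4 works: the target half-normal density is \(f(x) = \sqrt{2/\pi}\, e^{-x^2/2}\) for \(x \geq 0\) and the candidate density is \(g(x) = e^{-x}\) (an Exp(1) draw). The ratio \(f/g\) is maximised at \(x = 1\), giving \(c = \sqrt{2e/\pi}\), and

\[\frac{f(x)}{c\,g(x)} = \exp\!\left(-\frac{(x-1)^2}{2}\right),\]

so accepting when \(U \leq e^{-(X-1)^2/2}\) is equivalent to accepting when \(Y = -\ln U \geq \frac{(X-1)^2}{2}\). Attaching a random sign to the accepted half-normal value then yields a N(0,1) draw.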

As in a), create a function arstats that produces N samples using normrandar and returns their mean and standard deviation.

arstats <- function(N){
  x <- numeric()
  for(i in 1:N){
    x <- c(x, normrandar())
  }
  return(list(x = x, mean = mean(x), sd = sd(x)))
}

We now compare and evaluate the approaches implemented in parts a), b), and c). Run 10 iterations of itstats, bmstats, and arstats for N = 100, 1000, 10000, and 100000, and calculate the average means and standard deviations produced by each method at each value of N.

compare <- function(fun)
{
  results <- data.frame(method=c(), N=c(), mean=c(), sd=c(), time=c())
  N <- c(100, 1000, 10000, 100000)
  for(n in N)
  {
    m <- numeric(2)
    s <- numeric(2)
    t <- numeric(2)
    # run 2 timed iterations for each N and average the results
    for(i in 1:2)
    {
      st <- system.time({ret <- fun(n)})
      m[i] <- ret$mean
      s[i] <- ret$sd
      t[i] <- st[[3]]   # elapsed time in seconds
    }
    results <- rbind(results, data.frame(method=deparse(substitute(fun)),
                                         N=n,
                                         mean=mean(m),
                                         sd=mean(s),
                                         time=mean(t)))
  }
  return (results)
}

library(dplyr)
dfitstats<- compare(itstats)
#dfitstats[order(time),] 
#arrange(dfitstats, desc(time))
dfbmstats<- compare(bmstats)
dfarstats<- compare(arstats)

allResults<- rbind(dfitstats,dfbmstats,dfarstats)
by_N <- group_by(allResults, N)

arrange(by_N, (-time))
## Source: local data frame [12 x 5]
## Groups: N [4]
## 
##     method     N          mean        sd   time
##     <fctr> <dbl>         <dbl>     <dbl>  <dbl>
## 1  itstats 1e+05 -1.615361e-03 0.9992685 56.395
## 2  bmstats 1e+05 -1.888386e-17 0.3870603 49.140
## 3  arstats 1e+05  7.969463e-01 0.6009868 39.930
## 4  bmstats 1e+04 -1.888386e-17 0.3870603  0.715
## 5  arstats 1e+04  7.984375e-01 0.6033143  0.515
## 6  itstats 1e+04 -1.426713e-02 1.0014883  0.495
## 7  bmstats 1e+03 -1.888386e-17 0.3870603  0.050
## 8  arstats 1e+03  7.881387e-01 0.5943927  0.020
## 9  itstats 1e+03  8.133871e-03 0.9903329  0.005
## 10 bmstats 1e+02 -1.888386e-17 0.3870603  0.005
## 11 itstats 1e+02 -6.347146e-02 0.9389717  0.000
## 12 arstats 1e+02  8.120647e-01 0.5551688  0.000
library(ggplot2)
p5 <- ggplot(allResults) + geom_line(aes(x=N, y=mean, colour=method)) + 
  labs(title="Means by Approach")
p5

library(ggplot2)
p51 <- ggplot(allResults) + geom_line(aes(x=N, y=sd, colour=method)) + 
  labs(title="SD by Approach")
p51

library(ggplot2)
p52 <- ggplot(allResults) + geom_line(aes(x=N, y=time, colour=method)) + 
  labs(title="Time by Approach")
p52

From the plots and the table above, the Box-Müller implementation is the slowest for small and moderate N, with the inverse transform method fastest and accept-reject in between. At N = 100,000, however, the inverse transform implementation records the largest elapsed time, with accept-reject the smallest.

  e. Plot histograms of 1000000 samples from each method and compare.
m <- 100000
itdf <- data.frame(itstats(m))
bmdf <- data.frame(bmstats(m))
ardf <- data.frame(arstats(m))
ItStats <- data.frame(method=rep("itstats", m), r=itdf$x)
BmStats <- data.frame(method=rep("bmstats", m+1), r=bmdf$values)
ArStats <- data.frame(method=rep("arstats", m), r=ardf$x)
##########
hist(ItStats$r, main='Inverse Transform Method')

hist(BmStats$r, main='Box Muller Method')

hist(ArStats$r, main='Accept Reject Method')

################

Question 6

Use Monte Carlo integration to estimate the value of \(\pi\). To summarize the approach, consider the unit quarter circle illustrated in the figure below.

Generate N pairs of uniform random numbers (x, y), where \(x \sim U(0,1)\) and \(y \sim U(0,1)\),
and each (x, y) pair represents a point in the unit square. To obtain an estimate of \(\pi\), count the fraction of points that fall inside the unit quarter circle and multiply by 4. Note that the fraction of points that fall inside the quarter circle should tend to the ratio between the area of the unit quarter circle (i.e., \(\frac{\pi}{4}\)) and the area of the unit square (i.e., 1). We proceed step-by-step:

  a. Create a function insidecircle that takes two inputs between 0 and 1 and returns 1 if these points fall within the unit circle.
  b. Create a function estimatepi that takes a single input N, generates N pairs of uniform random numbers and uses insidecircle to produce an estimate of \(\pi\) as described above. In addition to the estimate of \(\pi\), estimatepi should also return the standard error of this estimate, and a 95% confidence interval for the estimate.
  c. Use estimatepi to estimate \(\pi\) for N = 1000 to 10000 in increments of 500 and record the estimate, its standard error and the upper and lower bounds of the 95% CI. How large must N be in order to ensure that your estimate of \(\pi\) is within 0.1 of the true value?
  d. Using the value of N you determined in part c), run estimatepi 500 times and collect 500 different estimates of \(\pi\). Produce a histogram of the estimates and note the shape of this distribution. Calculate the standard deviation of the estimates - does it match the standard error you obtained in part c)? What percentage of the estimates lies within the 95% CI you obtained in part c)?

Answer 6

Before creating the functions, let's understand the problem. We'll begin by constructing a target circle inscribed in a square: this is our dart board, and the target circle reaches all the way to the edge of the square (figure 2).

Next, we'll simulate repeatedly throwing darts at random against the board above (we'll assume that our random throws always hit the square, and are equally likely to land at any given point inside the square). We then look at the fraction of the darts that land within the circle out of all those that were thrown.

The area of a square is the length of a side squared. Thus our square, with sides of length 2 (2 x Radius of 1), has an area of 4 (2 squared). The area of a circle is Pi times the radius squared, so for a unit circle (with radius 1), the area is Pi. Therefore, the ratio of the area of our circle to the area of our square is precisely Pi/4. (That is, the circle takes up a Pi/4 portion of our dart board.)
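In the quarter-circle version used in the code below, with indicator \(I_j = 1\) when \(x_j^2 + y_j^2 \leq 1\), the estimator and its (approximate) standard error are

\[\hat{\pi} = \frac{4}{N}\sum_{j=1}^{N} I_j, \qquad \operatorname{SE}(\hat{\pi}) \approx \frac{4\, s_I}{\sqrt{N}},\]

where \(s_I\) is the sample standard deviation of the indicators, and a 95% confidence interval is \(\hat{\pi} \pm 1.96\,\operatorname{SE}(\hat{\pi})\).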

Now we can proceed with creating our functions…

insidecircle <- function(x, y){
  # returns 1 if (x, y) falls inside the unit quarter circle, 0 otherwise
  as.integer(x^2 + y^2 < 1)
}

estimatepi <- function(N){
  x <- runif(N)
  y <- runif(N)
  df <- data.frame(x = x, y = y)
  df$inside <- apply(df, 1, function(p) insidecircle(p[1], p[2]))

  # fraction of points inside the quarter circle, scaled by 4
  pi_est <- 4 * sum(df$inside) / length(df$inside)
  # sample standard deviation of the indicators, scaled by 4
  se <- sd(df$inside)
  pi_se <- 4 * se
  # half-width of the 95% confidence interval for the estimate
  ci95 <- (1.96 * pi_se) / sqrt(length(df$inside))

  return(list(pi = pi_est, standard.error = pi_se, ci95 = ci95))
}


results <- data.frame(n=c(),estimate=c(),se=c(),upper=c(),lower=c(),interval=c())
for(n in seq(1000,10000,by=500)){
  pi <- estimatepi(n)
  results <- rbind(results,c(n,pi$pi,pi$standard.error,
                             pi$pi+pi$ci95,pi$pi-pi$ci95,pi$ci95*2))
}

colnames(results) <- c("N","estimate", "se", "upper", "lower",
                       "interval")


kable(results)
N estimate se upper lower interval
1000 3.148000 1.638530 3.249557 3.046443 0.2031143
1500 3.136000 1.646606 3.219330 3.052670 0.1666595
2000 3.122000 1.656046 3.194579 3.049421 0.1451588
2500 3.150400 1.636353 3.214545 3.086255 0.1282901
3000 3.142667 1.641710 3.201414 3.083919 0.1174957
3500 3.116571 1.659535 3.171552 3.061591 0.1099609
4000 3.155000 1.632985 3.205607 3.104393 0.1012135
4500 3.149333 1.636957 3.197162 3.101505 0.0956571
5000 3.162400 1.627684 3.207517 3.117283 0.0902342
5500 3.161455 1.628344 3.204490 3.118420 0.0860698
6000 3.148000 1.637847 3.189443 3.106557 0.0828865
6500 3.122462 1.655446 3.162707 3.082216 0.0804905
7000 3.138857 1.644199 3.177375 3.100339 0.0770356
7500 3.119467 1.657456 3.156978 3.081955 0.0750235
8000 3.165500 1.625405 3.201118 3.129882 0.0712365
8500 3.111529 1.662777 3.146879 3.076180 0.0706986
9000 3.152444 1.634677 3.186217 3.118672 0.0675456
9500 3.153684 1.633797 3.186539 3.120830 0.0657086
10000 3.140400 1.643094 3.172605 3.108195 0.0644093

For part d), we first take a quick look at the sampling distribution using 500 repeated estimates (each based on N = 500 points), and then rerun the experiment with the N = 4000 chosen from the table above (where the width of the 95% CI has dropped to roughly 0.1):

pi <- c()
for(i in 1:500){
pi <- c(pi,estimatepi(500)$pi)
}
hist(pi)

n=4000
results2 <- data.frame(pi_est=c(), se=c(), lower=c(), upper=c(), n=c())

for(k in 1:500){
  resPi <-estimatepi(n)
  results2 <- rbind(results2, data.frame(pi_est=resPi$pi, se=resPi$standard.error, ci95=resPi$ci95, n=n))
}

hist(results2$pi_est)

Est_sd <- sd(results2$pi_est)

The histogram looks approximately normal.

The standard deviation of the 500 estimates is 0.0270029. This is close to the theoretical standard error of the estimator at N = 4000 (\(4 s_I/\sqrt{4000} \approx 0.026\)) and roughly half the CI half-width reported in part c) for N = 4000 (about 0.05).

What percentage of the estimates lie within the CI I calculated above?

# half-width of the 95% CI at N = 4000, taken from the results table above
ci4000 <- results$interval[results$N == 4000] / 2
new_ci <- length(pi[pi > mean(pi) - ci4000 & pi < mean(pi) + ci4000]) / length(pi)

The percentage of the estimates lying within that CI is 0.526. This is well below 95% because the 500 estimates collected above were each based on only N = 500 points, so their spread is considerably wider than the confidence interval computed for N = 4000.