library(knitr)
# LCG X_i = 11 * X_{i-1} mod 16; keep generating until the sequence returns to its first value
y <- function(x0){
  x <- (11 * x0) %% 16
  repeat {
    x <- c(x, (11 * tail(x, 1)) %% 16)
    if (tail(x, 1) == x[1]) break
  }
  return(x)
}
y<-data.frame('X'=y(1))
kable(y, caption = 'Random Value')
| X |
|---|
| 11 |
| 9 |
| 3 |
| 1 |
| 11 |
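Note that the generator returns to its seed after only four steps: \(11^2 \equiv 9\), \(11^3 \equiv 3\), and \(11^4 \equiv 1 \pmod{16}\), so the multiplier 11 has order 4 modulo 16 and this multiplicative LCG has period 4.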
Using the LCG provided below: \(X_i = (X_{i-1} + 12) \bmod 13\), plot the pairs \((U_1, U_2), (U_2, U_3), \ldots\), and observe the lattice structure obtained. Discuss what you observed.
library(ggplot2)
# LCG X_i = (X_{i-1} + 12) mod 13; generate 101 values starting from the seed x0
y2 <- function(x0){
  x <- (x0 + 12) %% 13
  for (i in 1:100) {
    x <- c(x, (tail(x, 1) + 12) %% 13)
  }
  return(x)
}
func2<- y2(0)
data <- data.frame(x=head(func2,-1), y=tail(func2, -1))
ggplot(data,aes(x=x,y=y)) + geom_point()
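The lattice structure here is extreme. Since \(X_i = (X_{i-1} + 12) \bmod 13 = (X_{i-1} - 1) \bmod 13\), each value is simply the previous value minus one (wrapping at zero), so consecutive pairs satisfy
\[ X_{i+1} = X_i - 1, \]
except at the wrap-around pair \((0, 12)\). All plotted points therefore fall on just two parallel lines of slope 1, which shows the strong serial dependence of this generator.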
#########################
# Same LCG, but stop as soon as the sequence returns to its first value (one full period)
y2 <- function(x0){
  x <- (x0 + 12) %% 13
  for (i in 1:100) {
    x <- c(x, (tail(x, 1) + 12) %% 13)
    if (tail(x, 1) == x[1]) break
  }
  return(x)
}
func22 <- data.frame('X' = y2(0))
kable(func22, caption = 'Random Value')
| X |
|---|
| 12 |
| 11 |
| 10 |
| 9 |
| 8 |
| 7 |
| 6 |
| 5 |
| 4 |
| 3 |
| 2 |
| 1 |
| 0 |
| 12 |
Implement the pseudo-random number generator:
\[X_i = 16807\,X_{i-1} \bmod (2^{31} - 1)\]
Using the seed \(X_0 = 1234567\), run the generator for 100,000 observations. Perform a chi-square goodness-of-fit test on the resulting PRNs. Use 20 equal-probability intervals and level \(\alpha = 0.05\). Now perform a runs up-and-down test with \(\alpha = 0.05\) on the observations to see if they are independent.
Using the seed \(X_0 = 1234567\), run the generator for 100,000 observations:
# Lehmer generator X_i = 16807 * X_{i-1} mod (2^31 - 1); column r holds the values scaled to (0, 1)
y3 <- function(x0, n){
  x <- x0
  r <- x0 / (2^31 - 1)
  for (i in 1:n) {
    x <- c(x, (16807 * tail(x, 1)) %% (2^31 - 1))
    r <- c(r, tail(x, 1) / (2^31 - 1))
  }
  df <- data.frame(x, r)
  return(df)
}
func3 <- y3(1234567, 100)
kable(head(func3, 10), caption = 'First 10 observations from the generator')
| x | r |
|---|---|
| 1234567 | 0.0005749 |
| 1422014746 | 0.6621772 |
| 456328559 | 0.2124945 |
| 849987676 | 0.3958064 |
| 681650688 | 0.3174183 |
| 1825340118 | 0.8499902 |
| 1687465831 | 0.7857875 |
| 1569179335 | 0.7307061 |
| 2097898185 | 0.9769100 |
| 1988278849 | 0.9258645 |
Perform a chi-square goodness-of-fit test on the resulting PRNs, using 20 equal-probability intervals and level \(\alpha = 0.05\):
For this question, our approach to the chi-square test consists of four steps: 1) state the hypotheses, 2) formulate an analysis plan, 3) analyze the sample data, and 4) interpret the results.
State the hypotheses: \(H_0\): the data are consistent with the specified (uniform) distribution. \(H_a\): the data are not consistent with the specified (uniform) distribution.
Formulate an analysis plan: For this analysis, the significance level is 0.05. Using sample data, we will conduct a chi-square goodness of fit test of the null hypothesis.
Analyze sample data:
In this case we apply the chi-square goodness-of-fit test to the sample data: we compute the degrees of freedom, the expected frequency counts, and the chi-square test statistic:
# 20 equal-probability intervals on (0, 1)
intervals <- seq(0, 1, by = 1/20)
df <- data.frame(lower = head(intervals, -1), upper = tail(intervals, -1))
library(plyr)
# observed count in each interval
df2 <- ddply(df, .variables = c("lower", "upper"), function(df){
  c(count = with(df, length(func3$r[func3$r >= lower & func3$r < upper])))
})
# chi-square contribution per interval: (observed - expected)^2 / expected, with expected = n/20
df2$chistat <- ((df2$count - length(func3$r)/20)^2) / (length(func3$r)/20)
chisquare <- sum(df2$chistat)
chisquare
## [1] 18.40594
chisquare2 <- chisq.test(df2$count)
chisquare2
##
## Chi-squared test for given probabilities
##
## data: df2$count
## X-squared = 18.406, df = 19, p-value = 0.4955
The p-value (0.4955) is greater than 0.05, so we fail to reject the null hypothesis that the distribution is uniform.
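Equivalently, we can compare the statistic with the chi-square critical value (a quick check, using 20 - 1 = 19 degrees of freedom):
qchisq(0.95, df = 19)   # about 30.14; since 18.406 < 30.14, we fail to reject H0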
Now perform a runs up-and-down test with \(\alpha\) = 0.05 on the observations to see if they are independent.
library(tseries)
# up/down indicator for the first n = 100 observations
n <- 100
s <- rep(NA, n - 1)
for (i in 1:(n - 1))
{
  s[i] <- func3$x[i] < func3$x[i + 1]
}
runs2 <- runs.test(as.factor(s))
runs2
##
## Runs Test
##
## data: as.factor(s)
## Standard Normal = 2.3348, p-value = 0.01955
## alternative hypothesis: two.sided
The p-value (0.0196) is less than 0.05, so we reject the null hypothesis of randomness; the test provides evidence against independence of the pseudo-random numbers.
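The runs.test above is tseries' Wald-Wolfowitz test applied to the up/down indicator. A minimal sketch of the classical runs up-and-down statistic, assuming the usual normal approximation \(E[R] = (2n-1)/3\) and \(\mathrm{Var}[R] = (16n-29)/90\), would be:
updown_runs_test <- function(u) {
  n <- length(u)
  direction <- sign(diff(u))              # +1 for an up step, -1 for a down step
  R <- 1 + sum(diff(direction) != 0)      # number of runs up and down
  z <- (R - (2 * n - 1) / 3) / sqrt((16 * n - 29) / 90)
  p <- 2 * pnorm(-abs(z))                 # two-sided p-value
  list(runs = R, z = z, p.value = p)
}
updown_runs_test(func3$r)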
Give inverse-transform, composition, and acceptance-rejection algorithms for generating from the following density:
\[f(x)= \begin{cases} \frac{3x^2}{2}, & -1 \leq x \leq 1 \\ 0 , & otherwise \\ \end{cases} \]
\[ F(x) = \int_{-1}^{x} \frac{3t^2}{2}\,dt = \frac{x^3 + 1}{2}, \qquad -1 \leq x \leq 1 \]
Next we set \(F(x) = y\) and solve for \(x\):
\[ \frac{x^3 + 1}{2} = y \quad\Rightarrow\quad x^3 = 2y - 1 \quad\Rightarrow\quad x = \sqrt[3]{2y - 1} \]
Hence the inverse function is \(F^{-1}(y) = \sqrt[3]{2y - 1}\).
Now let's push \(U(0,1)\) draws through the inverse CDF and check that the resulting histogram matches the target density on \([-1, 1]\):
invFunc <- function(u)
{
  # F^{-1}(u) = (2u - 1)^(1/3); sign() keeps the cube root real when 2u - 1 is negative
  x <- sign(2 * u - 1) * abs(2 * u - 1)^(1/3)
  return(x)
}
# Generate the uniform variables between 0 and 1
n<-100000
uniVals <- runif(n, 0, 1)
# use the inverse CDF to transform the uniform draws
df4 <- invFunc(uniVals)
hist(df4, main = "Inverse Transform Method")
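As a quick visual check (a sketch; the histogram must be drawn on the density scale for the comparison to make sense), we can overlay the target density \(f(x) = 3x^2/2\):
hist(df4, freq = FALSE, main = "Inverse Transform Method")    # density scale
curve(3 * x^2 / 2, from = -1, to = 1, add = TRUE)             # target density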
Acceptance-Rejection Method: Suppose that we want to generate a random variable X with distribution function F and probability density function f. The acceptance-rejection method requires a function t which majorizes the density, \(t(x) \geq f(x)\) for all x. Clearly, t(x) will not be a probability density function, because \(c = \int t(x)\,dx \geq 1\); however, if c is finite we can write \(t(x) = c\,g(x)\) for a density g, draw a candidate from g, and accept it with probability \(f(x)/t(x)\). The overall acceptance rate is \(1/c\), so the method is more efficient the closer c is to 1. Here we take g to be the Uniform(-1, 1) density, \(g(x) = 1/2\); since the maximum of f is \(3/2\) (at \(x = \pm 1\)), the smallest valid constant is \(c = 3\).
# Acceptance-rejection sampler: proposal g = Uniform(min, max), majorizing constant c
g_x <- function(f, min, max)
{
  c <- 3   # c * g(x) = 3 * (1/2) = 3/2 >= max f(x), so the bound is valid
  repeat
  {
    # draw a candidate r from the proposal g and an acceptance uniform u
    r <- runif(1, min, max)
    u <- runif(1, 0, 1)
    gx <- dunif(r, min, max)
    # accept the candidate with probability f(r) / (c * g(r))
    if(u < f(r) / (c * gx))
    {
      break
    }
  }
  return(r)
}
############
pdf<- function(x)
{
if(-1 <=x && x <= 1)
{
pdf <- (3 * x^2) / 2
}
else
{
pdf = 0
}
return (pdf)
}
#generate values
vals <- rep(NA, n)
for(i in 1:n)
{
vals[i] <- g_x(pdf, -1, 1)
}
hist(vals, main = "Accept-Reject Method ")
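Since the acceptance probability is \(1/c = 1/3\), on average about three proposals are needed per accepted value. A rough empirical check of that rate (a sketch, vectorised for speed) would be:
r <- runif(100000, -1, 1)
u <- runif(100000)
mean(u < (3 * r^2 / 2) / (3 * dunif(r, -1, 1)))   # should be close to 1/3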
Composition Method: The composition method is an important principle for facilitating and speeding up random variate generation. The basic idea is simple: to generate random variates with a given density, we first split the domain of the density into subintervals. We then select one of these at random, with probabilities given by the area below the density in the respective subintervals. Finally, we generate a random variate from the density of the selected part (by inversion) and return it as a random variate of the full distribution.
We can rewrite our \(f(x) = \frac{3x^2}{2},\ -1 \leq x \leq 1\) as a 50/50 mixture of its negative and positive halves:
\[f(x) = \frac{1}{2}\left(3x^2\, I_{[-1,0)}(x)\right) + \frac{1}{2}\left(3x^2\, I_{[0,1]}(x)\right)\] where \(I_A\) denotes the indicator function of the set \(A\), defined
as follows:
\[ I_A(x) = \begin{cases} 1, & x \in A \\ 0, & \text{otherwise} \end{cases} \]
Each component \(3x^2\), restricted to its half-interval, is a proper density with mixing weight \(\tfrac{1}{2}\). On \([0,1]\) the component CDF is \(F_{+}(x) = x^3\) with inverse \(F_{+}^{-1}(u) = u^{1/3}\); by symmetry the component on \([-1,0)\) has inverse \(-u^{1/3}\). So we pick one of the two components with probability \(\tfrac{1}{2}\) each and invert it. Now let's generate our values:
# pick a component (negative or positive half) with probability 1/2 each,
# then invert that component's CDF: +/- u^(1/3)
u <- runif(10000)
half <- sample(c(-1, 1), 10000, replace = TRUE)
vals <- half * u^(1/3)
hist(vals, main = "Composition Method")
Implement, test, and compare different methods to generate from a N(0,1) distribution
Also create a function itstats that: 1) takes one input variable N, 2) generates N samples from a N(0,1) distribution using normrandit, and 3) returns two output variables: the mean and the standard deviation of the samples.
As in a), create a function bmstats that can produce N samples using normrandbm and return their mean and the standard deviation.
As in a), create a function arstats that produces N samples using normrandar and returns their mean and standard deviation.
In addition, measure the exact CPU time required for each iteration, and calculate the average time required for each method at each value of N.
For each method, plot the average means and standard deviations against the sample size. Which of the three methods appears to be the most accurate? Also, plot the average CPU time vs. N. Which of the three methods appears to take the least time?
Since \(\mu = 0\) and \(\sigma = 1\), we have the pdf:
\[ f(x) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{x^2}{2}} \]
\[F(x)=\int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt\]
This integral has no closed form (it is related to the Gauss error function erf()). In R, the inverse of the standard normal CDF is the quantile function qnorm(), so the inverse-transform method is simply qnorm(U). Hence the function normrandit:
normrandit <- function()
{
U <- runif(1)
return (qnorm(U))
}
function itstats:
itstats <- function(N){
x <- numeric()
for(i in 1:N){
x <- c(x,normrandit())
}
return(list(x= x,mean=mean(x), sd=sd(x)))
}
stats <- (itstats(10000))
statsdf<- data.frame(stats$mean, stats$sd)
kable(statsdf, caption = 'Inverse Transform Method mean & sd')
| stats.mean | stats.sd |
|---|---|
| -0.0045068 | 1.00662 |
Box-Muller Method: b) Next, we want to generate samples using the Box-Muller algorithm. Create a function normrandbm that: 1) takes no input variables, 2) generates two uniform random numbers \(U_1, U_2 \sim U(0,1)\), and 3) returns two output variables: \(X = (-2\ln U_1)^{1/2} \cos(2 \pi U_2)\) and \(Y = (-2\ln U_1)^{1/2} \sin(2 \pi U_2)\).
library(nortest)
normrandbm <- function(){
  # Box-Muller: two independent U(0,1) draws give two independent N(0,1) draws
  u1 <- runif(1)
  u2 <- runif(1)
  x <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)
  y <- sqrt(-2 * log(u1)) * sin(2 * pi * u2)
  return(c(x, y))
}
res<- data.frame(normrandbm())
#a test for normality
#lillie.test(c(res$x,res$y))
#plot the estimation of the density
#plot(density(c(res$x,res$y)))
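Before moving on, a quick sanity check on a larger sample (a sketch using the nortest package loaded above) would be:
bm_sample <- as.vector(replicate(5000, normrandbm()))    # 5,000 pairs = 10,000 draws
lillie.test(bm_sample)                                   # Lilliefors test for normality
plot(density(bm_sample), main = "Box-Muller sample density")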
As in a), create a function bmstats that can produce N samples using normrandbm and return their mean and the standard deviation.
bmstats <- function(N){
  x <- numeric(0)
  # each call to normrandbm() yields a pair of values; keep generating until we have N
  while (length(x) < N) {
    x <- c(x, normrandbm())
  }
  x <- x[1:N]
  return(list(values = x, mean = mean(x), sd = sd(x)))
}
res<- bmstats(100)
res2<- data.frame('mean' = res$mean, 'sd'= res$sd)
kable(res2, caption = 'mean & sd Box-Muller Method')
| mean | sd |
|---|---|
| 0 | 0.3870603 |
# Acceptance-rejection for N(0,1): propose X ~ Exp(1), accept when Y >= (X - 1)^2 / 2,
# which yields a half-normal draw; a random sign then gives a standard normal
normrandar <- function(){
  repeat{
    U <- runif(2)
    X <- -log(U[1])
    Y <- -log(U[2])
    if(Y >= ((X - 1)^2)/2){
      break
    }
  }
  if (runif(1) < 0.5) X <- -X   # attach a random sign
  return(X)
}
As in a), create a function arstats that produces N samples using normrandar and returns their mean and standard deviation.
arstats <- function(N){
x <- numeric()
for(i in 1:N){
x <- c(x,normrandar())
}
return(list(x=x,mean=mean(x), sd=sd(x)))
}
We now compare and evaluate the approaches implemented in parts a), b), and c). Run 10 iterations of itstats, bmstats, and arstats for N = 100, 1,000, 10,000, and 100,000, and calculate the average means and standard deviations produced by each method at each value of N.
compare <- function(fun)
{
  results <- data.frame(method=c(), N=c(), mean=c(), sd=c(), time=c())
  N <- c(100, 1000, 10000, 100000)
  iters <- 2   # repetitions per sample size (the exercise asks for 10)
  for(n in N)
  {
    m <- rep(NA, iters)
    s <- rep(NA, iters)
    t <- rep(NA, iters)
    for(i in 1:iters)
    {
      # time each run and record the resulting mean, sd and elapsed time
      st <- system.time({ret <- fun(n)})
      m[i] <- ret$mean
      s[i] <- ret$sd
      t[i] <- st[[3]]
    }
    results <- rbind(results, data.frame(method=deparse(substitute(fun)),
                                         N=n,
                                         mean=mean(m),
                                         sd=mean(s),
                                         time=mean(t)))
  }
  return (results)
}
library(dplyr)
dfitstats<- compare(itstats)
#dfitstats[order(time),]
#arrange(dfitstats, desc(time))
dfbmstats<- compare(bmstats)
dfarstats<- compare(arstats)
allResults<- rbind(dfitstats,dfbmstats,dfarstats)
by_N <- group_by(allResults, N)
arrange(by_N, (-time))
## Source: local data frame [12 x 5]
## Groups: N [4]
##
## method N mean sd time
## <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 itstats 1e+05 -1.615361e-03 0.9992685 56.395
## 2 bmstats 1e+05 -1.888386e-17 0.3870603 49.140
## 3 arstats 1e+05 7.969463e-01 0.6009868 39.930
## 4 bmstats 1e+04 -1.888386e-17 0.3870603 0.715
## 5 arstats 1e+04 7.984375e-01 0.6033143 0.515
## 6 itstats 1e+04 -1.426713e-02 1.0014883 0.495
## 7 bmstats 1e+03 -1.888386e-17 0.3870603 0.050
## 8 arstats 1e+03 7.881387e-01 0.5943927 0.020
## 9 itstats 1e+03 8.133871e-03 0.9903329 0.005
## 10 bmstats 1e+02 -1.888386e-17 0.3870603 0.005
## 11 itstats 1e+02 -6.347146e-02 0.9389717 0.000
## 12 arstats 1e+02 8.120647e-01 0.5551688 0.000
library(ggplot2)
p5 <- ggplot(allResults) + geom_line(aes(x=N, y=mean, colour=method)) +
labs(title="Means by Approach")
p5
library(ggplot2)
p51 <- ggplot(allResults) + geom_line(aes(x=N, y=sd, colour=method)) +
labs(title="SD by Approach")
p51
library(ggplot2)
p52 <- ggplot(allResults) + geom_line(aes(x=N, y=time, colour=method)) +
labs(title="Time by Approach")
p52
From the table and plots above, the Box-Muller implementation takes the most time for N up to 10,000, with the inverse-transform method the fastest and the accept-reject method slightly slower than inverse transform but nowhere near Box-Muller. At the largest sample size (N = 100,000) the ordering shifts: the inverse-transform run was the slowest and accept-reject the fastest.
m <- 100000
itdf <- data.frame(itstats(m))
bmdf <- data.frame(bmstats(m))
ardf <- data.frame(arstats(m))
ItStats <- data.frame(method=rep("itstats", m), r=itdf$x)
BmStats <- data.frame(method=rep("bmstats", m), r=bmdf$values)
ArStats <- data.frame(method=rep("arstats", m), r=ardf$x)
##########
hist(ItStats$r, main='Inverse Transform Method')
hist(BmStats$r, main='Box-Muller Method')
hist(ArStats$r, main='Accept-Reject Method')
################
Use Monte Carlo integration to estimate the value of \(\pi\). To summarize the approach, consider the unit quarter circle illustrated in the figure below.
Generate N pairs of uniform random numbers \((x, y)\), where \(x \sim U(0,1)\) and \(y \sim U(0,1)\), so that each \((x, y)\) pair represents a point in the unit square. To obtain an estimate of \(\pi\), count the fraction of points that fall inside the unit quarter circle and multiply by 4. Note that the fraction of points that fall inside the quarter circle should tend to the ratio between the area of the unit quarter circle (i.e., \(\frac{\pi}{4}\)) and the area of the unit square (i.e., 1). We proceed step by step:
Before creating the functions, let's understand the problem. We begin by constructing a target circle inscribed in a square. This is our dart board, and the target circle reaches all the way to the edge of the square, as in figure 2.
Next, we simulate repeatedly throwing darts at random at the board above (we assume that our random throws always hit the square and are equally likely to land at any given point inside it). We then look at the fraction of the darts that land within the circle out of all those that were thrown.
The area of a square is the length of a side squared, so our square, with sides of length 2 (2 x radius of 1), has an area of 4. The area of a circle is \(\pi\) times the radius squared, so for a unit circle (radius 1) the area is \(\pi\). Therefore, the ratio of the area of our circle to the area of our square is precisely \(\pi/4\); that is, the circle takes up a \(\pi/4\) portion of our dart board.
Now we can proceed with creating our functions…
insidecircle <- function(x, y){
  # returns 1 if the point (x, y) falls inside the unit quarter circle, 0 otherwise
  as.integer(x^2 + y^2 < 1)
}
estimatepi <- function(N){
  x <- runif(N)
  y <- runif(N)
  df <- data.frame(x = x, y = y)
  df$inside <- apply(df, 1, function(p) insidecircle(p[1], p[2]))
  # fraction of points inside the quarter circle, times 4, estimates pi
  pi_est <- 4 * sum(df$inside) / length(df$inside)
  # per-observation standard deviation of 4 * indicator, and the 95% CI half-width
  pi_se <- 4 * sd(df$inside)
  ci95 <- (1.96 * pi_se) / sqrt(length(df$inside))
  return(list(pi = pi_est, standard.error = pi_se, ci95 = ci95))
}
results <- data.frame(n=c(),estimate=c(),se=c(),upper=c(),lower=c(),interval=c())
for(n in seq(1000,10000,by=500)){
pi <- estimatepi(n)
results <- rbind(results,c(n,pi$pi,pi$standard.error,
pi$pi+pi$ci95,pi$pi-pi$ci95,pi$ci95*2))
}
colnames(results) <- c("N","estimate", "se", "upper", "lower",
"interval")
kable(results)
| N | estimate | se | upper | lower | interval |
|---|---|---|---|---|---|
| 1000 | 3.148000 | 1.638530 | 3.249557 | 3.046443 | 0.2031143 |
| 1500 | 3.136000 | 1.646606 | 3.219330 | 3.052670 | 0.1666595 |
| 2000 | 3.122000 | 1.656046 | 3.194579 | 3.049421 | 0.1451588 |
| 2500 | 3.150400 | 1.636353 | 3.214545 | 3.086255 | 0.1282901 |
| 3000 | 3.142667 | 1.641710 | 3.201414 | 3.083919 | 0.1174957 |
| 3500 | 3.116571 | 1.659535 | 3.171552 | 3.061591 | 0.1099609 |
| 4000 | 3.155000 | 1.632985 | 3.205607 | 3.104393 | 0.1012135 |
| 4500 | 3.149333 | 1.636957 | 3.197162 | 3.101505 | 0.0956571 |
| 5000 | 3.162400 | 1.627684 | 3.207517 | 3.117283 | 0.0902342 |
| 5500 | 3.161455 | 1.628344 | 3.204490 | 3.118420 | 0.0860698 |
| 6000 | 3.148000 | 1.637847 | 3.189443 | 3.106557 | 0.0828865 |
| 6500 | 3.122462 | 1.655446 | 3.162707 | 3.082216 | 0.0804905 |
| 7000 | 3.138857 | 1.644199 | 3.177375 | 3.100339 | 0.0770356 |
| 7500 | 3.119467 | 1.657456 | 3.156978 | 3.081955 | 0.0750235 |
| 8000 | 3.165500 | 1.625405 | 3.201118 | 3.129882 | 0.0712365 |
| 8500 | 3.111529 | 1.662777 | 3.146879 | 3.076180 | 0.0706986 |
| 9000 | 3.152444 | 1.634677 | 3.186217 | 3.118672 | 0.0675456 |
| 9500 | 3.153684 | 1.633797 | 3.186539 | 3.120830 | 0.0657086 |
| 10000 | 3.140400 | 1.643094 | 3.172605 | 3.108195 | 0.0644093 |
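The se column matches theory: each scaled indicator \(4I\) is 4 times a Bernoulli(\(p\)) variable with \(p = \pi/4\), so
\[ \mathrm{sd}(4I) = 4\sqrt{p(1-p)} = 4\sqrt{\tfrac{\pi}{4}\left(1 - \tfrac{\pi}{4}\right)} \approx 1.642, \]
and the 95% interval width shrinks like \(2 \times 1.96 \times 1.642/\sqrt{N}\), which is why the interval column roughly halves as N quadruples.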
Now let's look at the sampling distribution of repeated estimates, first using N = 500 per estimate and then N = 4000:
pi <- c()
for(i in 1:500){
pi <- c(pi,estimatepi(500)$pi)
}
hist(pi)
n=4000
results2 <- data.frame(pi_est=c(), se=c(), lower=c(), upper=c(), n=c())
for(k in 1:500){
resPi <-estimatepi(n)
results2 <- rbind(results2, data.frame(pi_est=resPi$pi, se=resPi$standard.error, ci95=resPi$ci95, n=n))
}
hist(results2$pi_est)
Est_sd <- sd(results2$pi_est)
The histogram looks approximately normal.
The standard deviation of these estimates is 0.0270029, a little smaller than the one obtained before.
What percentage of the estimates lie within the CI I calculated above?
ci4000 <- results$interval[results$N == 4000] / 2
new_ci <- length(pi[pi > mean(pi) - ci4000 & pi < mean(pi) + ci4000]) / length(pi)
The proportion of the estimates that lie within this CI is 0.526. (These estimates were generated with N = 500 per estimate, while the interval half-width corresponds to N = 4000, so coverage well below 95% is expected.)