Ken Jinkun Xiao
March 3rd, 2015
Simulation is a statistical computing method used to evaluate the behaviour and validity of a model.
We will describe deterministic methods of generating values, which are then treated as though they are random.
Simulated random numbers are called pseudorandom numbers. R can generate random numbers that follow a given distribution, but because they are produced by a deterministic algorithm they are not truly random; this is why they are called pseudorandom numbers.
One of the simplest methods for simulating independent uniform random variables on the interval [0,1] is the multiplicative congruential random number generator.
It produces a sequence of pseudorandom numbers, \( u_0,u_1,u_2,... \) which can appear like independent uniform random variables on the interval [0,1].
Let m be a large integer, and let \( b \) be another integer which is smaller than \( m \).
The value of \( b \) is often chosen to be near the square root of \( m \).
Different values of \( b \) and \( m \) give rise to pseudorandom number generators of varying quality.
To begin, an integer \( x_0 \) is chosen between \( 1 \) and \( m \). \( x_0 \) is called the seed. We discuss strategies for choosing \( x_0 \) later.
Once the seed has been chosen, the generator proceeds as follows:
\[ x_1 = b x_0 \bmod m \]
\[ u_1 = x_1/m \]
\( u_1 \) is the first pseudorandom number, taking some value between \( 0 \) and \( 1 \). The second pseudorandom number is then obtained in the same manner:
\[ x_2 = b x_1 \bmod m \]
\[ u_2 = x_2/m \]
\( u_2 \) is another pseudorandom number.
If \( m \) and \( b \) are chosen properly and are not disclosed to the user, it is difficult to predict the value of \( u_2 \), given the value of \( u_1 \) only.
In other words, for most practical purposes \( u_2 \) is approximately independent of \( u_1 \).
The method continues according to the following formulas:
\[ x_n = b x_{n-1} \bmod m \] \[ u_n = x_n/m \]
This method produces numbers which are entirely deterministic, but to an observer who doesn't know the formula above, the numbers appear to be random and unpredictable, at least in the short term.
Take \( m=7 \) and \( b=3 \). Also, take \( x_0=2 \). Then
\[ x_1 = 3\times 2 \bmod 7 = 6, \quad u_1 = 6/7 = 0.857 \]
\[ x_2 = 3\times 6 \bmod 7 = 4, \quad u_2 = 4/7 = 0.571 \]
\[ x_3 = 3\times 4 \bmod 7 = 5, \quad u_3 = 5/7 = 0.714 \]
\[ x_4 = 3\times 5 \bmod 7 = 1, \quad u_4 = 1/7 = 0.143 \]
\[ x_5 = 3\times 1 \bmod 7 = 3, \quad u_5 = 3/7 = 0.429 \]
\[ x_6 = 3\times 3 \bmod 7 = 2, \quad u_6 = 2/7 = 0.286 \]
It should be clear that the iteration will set \( x_7 = x_1 \) and cycle \( x_i \) through the same sequence of integers, so the corresponding sequence \( u_i \) will also be cyclic.
An observer might not easily be able to predict \( u_2 \) from \( u_1 \), but since \( u_{i+6} = u_i \) for all \( i > 0 \), longer sequences are very easy to predict.
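As a quick check, a minimal R sketch (not part of the original example, just reproducing the hand calculation above) confirms that with \( b=3 \), \( m=7 \) and \( x_0=2 \) the sequence repeats after six steps:
b <- 3; m <- 7
x <- 2 # the seed x_0
for (i in 1:7) {
  x <- (b * x) %% m
  cat("x_", i, " = ", x, "\n", sep = "")
}
# the seventh value printed equals x_1 = 6, confirming a cycle of length 6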
In order to produce an unpredictable sequence, it is desirable to have a very large cycle length so that it is unlikely that any observer will ever see a whole cycle.
The cycle length cannot be any larger than m, so m would normally be taken to be very large.
Care must be taken in the choice of \( b \) and \( m \) to ensure that the cycle length is actually \( m \).
Note, for example, what happens when \( b = 171 \) and \( m = 29241 = 171^2 \). Start with \( x_0 = 3 \), say.
\[ x_1 = 171 \times 3 = 513 \]
\[ x_2 = 171 \times 513 \bmod 29241 = 0 \]
All remaining \( x_n \)'s will be 0.
To avoid this kind of problem, we should choose \( m \) so that it is not divisible by \( b \); thus, prime values of \( m \) will be preferred.
The code below produces \( 30268 \) pseudorandom numbers based on the multiplicative congruential generator
\[ x_n = 171 x_{n-1} \bmod 30269 \] \[ u_n = x_n/30269 \]
with initial seed \( x_0 = 27218 \).
random.number <- numeric(30268) # the output
# will be stored here
random.seed <- 27218
for (j in 1:30268) {
random.seed <- (171 * random.seed) %% 30269
random.number[j] <- random.seed/30269
}
The results, stored in the vector random.number, are in the range between \( 0 \) and \( 1 \). These are the pseudorandom numbers, \( u_1, u_2,..., u_{30268} \).
head(random.number)
[1] 0.7638508 0.6184876 0.7613730 0.1947867 0.3085335 0.7592256
We can check whether the generator cycled:
length(unique(random.number))
[1] 30268
The last calculation shows that this generator did not cycle before all possible numbers were computed.
The following function will produce \( n \) simulated random numbers on the interval \( [0, 1] \), using a multiplicative congruential generator:
rng <- function(n, a=171, m=30269, seed=1) {
  # multiplicative congruential generator: x[i+1] = (a * x[i]) mod m
  x <- numeric(min(m-1, n))
  x[1] <- seed
  for (i in 1:min(m-1, n)) {
    x[i+1] <- (a * x[i]) %% m
  }
  x[2:(n+1)]/m # drop the seed and rescale to the interval [0, 1]
}
rng(5)
[1] 0.005649344 0.966037861 0.192474148 0.913079388 0.136575374
There are many tests for randomness. The main goals of these tests are to ensure that the output from a random number simulator is (1) uniformly distributed on \( [0, 1] \) and (2) independent from one value to the next.
It is impossible to ensure that these two conditions will hold for a simulated sequence. It is not too hard to check the first condition, but the second condition can only be checked in an incomplete way.
A simple test for the uniform distribution can be based on the chi-square test.
Divide the interval \( [0, 1] \) into \( m \) equal subintervals. According to the uniform distribution, the probability that a uniformly distributed value would lie in any one of the subintervals is \( 1/m \).
If \( n \) numbers are simulated, we would expect \( E_i = n/m \) of them to lie in the \( i \)th subinterval.
The observed number of simulated values in the \( i \)th subinterval can be counted: \( O_i \).
The chi-square test statistic is \[ x=\sum_{i=1}^m\frac{(O_i-E_i)^2}{E_i} \]
\( x \) will be large if \( O_i \) and \( E_i \) differ a lot, i.e. if the uniform assumption does not hold. Otherwise, \( x \) is likely to be small.
If the uniform assumption holds, then the p-value for the goodness-of-fit test of uniformity is calculated as
\[ P(X \geq x) \]
where \( X \) is chi-square distributed on \( m-1 \) degrees of freedom.
There is a built-in chisq.test() function in R, but we created a simpler one, specifically designed to test random number generators: rng.chisq(). It takes arguments \( x \), the data vector (simulated from a random number generator on \( [0, 1] \)), and \( m \), the number of subintervals to use for the test.
rng.chisq <- function(x, m=10) {
  # x is output from a uniform pseudorandom number
  # generator on [0,1]
  # count how many values fall in each of the m equal subintervals
  Ob <- tabulate(trunc(m*x) + 1, nbins = m)
  Ex <- length(x)/m # expected count per subinterval under uniformity
  chisq <- sum((Ob-Ex)^2/Ex)
  pvalue <- 1-pchisq(chisq, m-1)
  list(test.statistic=chisq, p.value=pvalue, df=m-1)
}
x <- rng(1000,a=27,seed=12)
rng.chisq(x,5)
$test.statistic
[1] 6.88
$p.value
[1] 0.1423672
$df
[1] 4
The p-value is not small, so there is very little evidence against the uniformity hypothesis.
Conclusion: the simulated numbers are consistent with a uniform distribution.
Many tests have been devised to try to test independence.
The autocorrelation function can be used to test linear dependence between lagged values. Try the acf() function, and look for spikes in the graphical output. These indicate linear dependence. This is studied in detail in a time series course.
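For example, a minimal sketch (using the rng() function defined above) is to plot the autocorrelations of a simulated sequence:
x <- rng(1000)
acf(x) # spikes well outside the dashed bands would indicate linear dependence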
Unfortunately, dependence need not be linear. Nonlinear dependence is much harder to detect.
One test that seems to work well, according to Donald Knuth (a very well-known theoretical computer scientist) is the spectral test.
Knuth says that all bad generators fail this test and all not-so-bad generators pass the test.
The spectral test looks at successive overlapping subsequences of the numbers generated and identifies patterns in these subsequences.
e.g. Suppose the numbers \( .1, .3, .4, .5, .2, .8, .9 \) have been generated. The overlapping subsequences of length \( 5 \) are
\[ (.1, .3, .4, .5, .2), (.3, .4, .5, .2, .8), (.4, .5, .2, .8, .9) \]
The spectral test would consider how vectors like this look in 5-dimensional space.
The test usually considers subsequences up to length \( 8 \).
We will consider only subsequences of length \( 2 \).
\[ (.1, .3), (.3, .4), (.4, .5), (.5, .2), (.2, .8), (.8, .9) \]
These two dimensional vectors can be plotted in the plane, and it can be shown that the resulting points will lie on parallel lines.
If the parallel lines are too far apart, the generator will not be producing enough successive pairs of points in certain parts of the plane. That would be bad.
The spectral test then amounts to computing the largest distance between these parallel lines. This is actually not very easy, but we can visualize the idea using a graph called a lag plot.
A lag plot plots the coordinates of the successive overlapping pairs.
e.g. It would plot the pairs \[ (.1, .3), (.3, .4), (.4, .5), (.5, .2),(.2, .8), (.8, .9) \]
x <- rng(100,a=15, m=511,seed=12)
lag.plot(x, do.lines=FALSE)
The separation between the lines of points is large, which is bad!
x <- rng(100,a=31, m=511,seed=12)
lag.plot(x, do.lines=FALSE)
The separation between the lines of points is smaller, which is better.
The spectral test calculates distances between planes of simulated points in \( 3 \) dimensions. Again, large distances are bad.
A famous generator developed by IBM in the 1960's called RANDU was shown to have bad behaviour in 3 dimensions. It was implemented in several programming languages before this problem was discovered.
The spectral test also calculates distances between hyperplanes in higher dimensions. Beyond dimension 8, the calculations become very time-consuming.
A similar kind of operation to rng() (though using a different formula, and with a much longer cycle) is used internally by R to produce pseudorandom numbers automatically with the function runif().
runif(n, min = a, max = b)
Execution of this command produces n pseudorandom uniform numbers on the interval \( [a, b] \). The default values are \( a = 0 \) and \( b = 1 \). The seed is selected internally.
Generate 5 uniform pseudorandom numbers on the interval \( [0, 1] \), and 10 uniform such numbers on the interval \( [-3,-1] \).
runif(5)
[1] 0.7058655 0.4938127 0.2198955 0.6721048 0.9393958
runif(10, min = -3, max = -1)
[1] -2.721035 -2.772500 -1.638076 -2.823924 -1.178297 -1.612484 -2.812582
[8] -1.652116 -2.461634 -2.587444
If you execute the above code yourself, you will almost certainly obtain different results than those displayed in our output.
This is because the starting seed that you will use will be different from the one that was selected when we ran our code.
There are two different strategies for choosing the starting seed \( x_0 \).
If the goal is to make an unpredictable sequence, then a random value is desirable.
For example, the computer might determine the current time of day to the nearest millisecond, then base the starting seed on the number of milliseconds past the start of the minute.
To avoid predictability, this external randomization should only be done once, after which the formula above should be used for updates.
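For illustration only, here is one hypothetical way to derive such a seed from the system clock in R (a sketch of the idea, not what R actually does internally):
# milliseconds past the start of the current minute, reduced modulo m
ms <- round((as.numeric(Sys.time()) %% 60) * 1000)
random.seed <- ms %% 30268 + 1 # keeps the seed between 1 and m - 1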
The second strategy for choosing \( x_0 \) is to use a fixed, non-random value, e.g. \( x_0 = 1 \).
This makes the sequence of \( u_i \) values predictable and repeatable.
This would be useful when debugging a program that uses random numbers, or in other situations where repeatability is needed.
The way to do this in R is to use the set.seed() function.
set.seed(32789) # this ensures that your output will match ours
runif(5)
[1] 0.3575211 0.3537589 0.2672321 0.9969302 0.1317401
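Re-running with the same seed reproduces exactly the same output, which is the point of this strategy:
set.seed(32789)
runif(5) # gives the same five numbers as above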
Various other algorithms can be used to simulate random variables following other distributions.
A Bernoulli trial is an experiment in which there are only \( 2 \) possible outcomes. For example, a light bulb may work or not work; these are the only possibilities. Each outcome ('work' or 'not work') has a probability associated with it; the sum of these two probabilities must be \( 1 \).
Consider a student who guesses on a multiple choice test question which has \( 5 \) options; the student may guess correctly with probability \( 0.2 \) and incorrectly with probability \( 0.8 \). [The possible outcome of a guess is to either be correct or to be incorrect.]
Suppose we would like to know how well such a student would do on a multiple choice test consisting of \( 20 \) questions.
We can get an idea by using simulation. Each question corresponds to an independent Bernoulli trial with probability of success equal to \( 0.2 \).
We can simulate the correctness of the student for each question by generating an independent uniform random number. If this number is less than \( 0.2 \), we say that the student guessed correctly; otherwise, we say that the student guessed incorrectly.
This will work, because the probability that a uniform random variable is less than \( 0.2 \) is exactly \( 0.2 \), while the probability that a uniform random variable exceeds \( 0.2 \) is exactly \( 0.8 \), which is the same as the probability that the student guesses incorrectly. Thus, the uniform random number generator is simulating the student.
R can do the simulation as follows:
set.seed(23207) # use this to obtain our output
guesses <- runif(20)
correct.answers <- (guesses < 0.2)
correct.answers
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
The vector correct.answers contains the results of the simulated student's guesses; a TRUE value corresponds to a correct guess.
The total number of correct guesses can be calculated.
table(correct.answers)
correct.answers
FALSE TRUE
14 6
Our simulated student would score \( 6/20 \).
In the preceding example, we could associate the values '1' and '0' with the outcomes from a Bernoulli trial.
This defines the Bernoulli random variable: a random variable which takes the value \( 1 \) with probability \( p \), and \( 0 \) with probability \( 1 - p \). The expected value of a Bernoulli random variable is \( p \) and its theoretical variance is \( p(1 - p) \). In the above example, a student would expect to guess correctly \( 20\% \) of the time; our simulated student was a little bit lucky, obtaining a mark of \( 30\% \).
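As a quick numerical check (a sketch, not part of the original example), we can simulate many Bernoulli variates with \( p = 0.2 \) and compare the sample mean and variance with \( p \) and \( p(1-p) \):
p <- 0.2
z <- as.integer(runif(100000) < p) # Bernoulli(0.2) variates coded as 0/1
mean(z) # should be close to p = 0.2
var(z) # should be close to p * (1 - p) = 0.16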
Let \( X \) denote the sum of \( m \) independent Bernoulli random variables, each having probability \( p \). \( X \) is called a binomial random variable; it represents the number of 'successes' in \( m \) Bernoulli trials.
A binomial random variable can take values in the set \( \{0, 1, 2,...,m\} \). The probability of a binomial random variable X taking on any one of these values is governed by the binomial distribution:
\[ P(X=x)=\binom{m}{x}p^x(1-p)^{m-x},\quad x=0,1,...,m \]
These probabilities can be computed in R using the dbinom() function:
dbinom(x, size, prob)
Here, size and prob are the binomial parameters \( m \) and \( p \), respectively, while \( x \) denotes the number of 'successes'. The output from this function is the value of \( P(X = x) \).
Compute the probability of getting 4 heads in 6 tosses of a fair coin.
dbinom(x = 4, size = 6, prob = 0.5)
[1] 0.234375
Thus, \( P(X = 4) = 0.234 \), when \( X \) is a binomial random variable with \( m = 6 \) and \( p = 0.5 \).
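This value can be confirmed directly from the binomial formula:
choose(6, 4) * 0.5^4 * (1 - 0.5)^2 # same as dbinom(4, 6, 0.5), i.e. 0.234375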
Cumulative probabilities of the form \( P(X \leq x) \) can be computed using pbinom(); this function takes the same arguments as dbinom(). For example, we can calculate \( P(X \leq 4) \) where \( X \) is the number of heads obtained in 6 tosses of a fair coin as:
pbinom(4, 6, 0.5)
[1] 0.890625
The function qbinom() gives the quantiles for the binomial distribution. The \( 89 \)th percentile of the distribution of \( X \) (as defined above) is:
qbinom(0.89, 6, 0.5)
[1] 4
The expected value (or mean) of a binomial random variable is \( mp \) and the variance is \( mp(1 - p) \).
The rbinom() function can be used to generate binomial pseudorandom numbers.
rbinom(n, size, prob)
Here, size and prob are the binomial parameters, while \( n \) is the number of variates generated.
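As a quick sketch (not part of the original notes), we can compare simulated values with the theoretical mean \( mp \) and variance \( mp(1-p) \):
x <- rbinom(100000, size = 6, prob = 0.5)
mean(x) # should be close to 6 * 0.5 = 3
var(x) # should be close to 6 * 0.5 * 0.5 = 1.5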
Suppose \( 10\% \) of the windshields produced on an assembly line are defective, and suppose \( 15 \) windshields are produced each hour. Each windshield is independent of all other windshields. This process is judged to be out of control when more than \( 5 \) defective windshields are produced in any single hour.
Simulate the number of defective windshields produced for each hour over a \( 24 \)-hour period, and determine if any process should have been judged out of control at any point in that simulation run.
Since \( 15 \) windshields are produced each hour and each windshield has a \( 0.1 \) probability of being defective, independent of the state of the other windshields, the number of defectives produced in one hour is a binomial random variable with \( m = 15 \) and \( p = 0.1 \).
To simulate the number of defectives for each hour in a 24-hour period, we need to generate 24 binomial random numbers. We then identify all instances in which the number of defectives exceeds \( 5 \).
One such simulation run is:
defectives <- rbinom(24, 15, 0.1)
defectives
[1] 1 2 2 2 1 0 1 2 2 1 2 3 1 0 1 0 1 0 3 2 2 1 0 2
any(defectives > 5)
[1] FALSE
The Poisson distribution is the limit of a sequence of binomial distributions with parameters \( n \) and \( p_n \), where \( n \) is increasing to infinity, and \( p_n \) is decreasing to \( 0 \), but where the expected value (or mean) \( np_n \) converges to a constant \( \lambda \).
The variance \( np_n(1 - p_n) \) converges to this same constant. Thus, the mean and variance of a Poisson random variable are both equal to \( \lambda \). This parameter is sometimes referred to as a rate.
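This limit can be illustrated numerically (a small sketch, not from the original notes): for a large \( n \) and \( p_n = \lambda/n \), the binomial probability of observing 3 events is already very close to the Poisson probability.
lambda <- 0.5; n <- 1000
dbinom(3, size = n, prob = lambda / n) # binomial probability, approximately 0.0126
exp(-lambda) * lambda^3 / factorial(3) # Poisson limit, 0.01263606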
Poisson random variables arise in a number of different ways.
They are often used as a crude model for count data.
Examples of count data are the numbers of earthquakes in a region in a given year, or the number of individuals who arrive at a bank teller in a given hour.
The limit comes from dividing the time period into \( n \) independent intervals, on which the count is either \( 0 \) or \( 1 \).
The Poisson random variable is the total count.
The possible values that a Poisson random variable \( X \) could take are the non-negative integers \( \{0, 1, 2, ...\} \).
The probability of taking on any of these values is:
\[ P(X=x)=\frac{e^{-\lambda}\lambda^x}{x!},\quad x=0,1,2,... \]
The Poisson probabilities can be evaluated using the dpois() function.
dpois(x, lambda)
Here, lambda is the Poisson rate parameter, while \( x \) is the number of Poisson events. The output from the function is the value of \( P(X = x) \).
The average number of arrivals per minute at an automatic bank teller is \( 0.5 \). Arrivals follow a Poisson process. (Described later.) The probability of \( 3 \) arrivals in the next minute is:
dpois(x = 3, lambda = 0.5)
[1] 0.01263606
Therefore, \( P(X = 3) = 0.0126 \), if \( X \) is a Poisson random variable with mean \( 0.5 \).
Cumulative probabilities of the form \( P(X \leq x) \) can be calculated using ppois(), and Poisson quantiles can be computed using qpois().
We can generate Poisson random numbers using the rpois() function.
rpois(n, lambda)
The parameter \( n \) is the number of variates produced, and lambda is as above.
Suppose traffic accidents occur at an intersection with a mean rate of \( 3.7 \) per year. Simulate the annual number of accidents for a \( 10 \)-year period, assuming a Poisson model.
rpois(10, 3.7)
[1] 5 5 2 6 0 1 4 2 2 5
A Poisson process is a simple model of the collection of events that occur during an interval of time. A way of thinking about a Poisson process is to think of a random collection of points on a line or in the plane (or in higher dimensions, if necessary).
The homogeneous Poisson process has the following properties:
The distribution of the number of points in a set is Poisson with rate proportional to the size of the set.
The numbers of points in non-overlapping sets are indepen- dent of each other.
In particular, for a Poisson process with rate \( \lambda \) the number of points on an interval \( [0, T] \) is Poisson distributed with mean \( \lambda T \).
One way to simulate a Poisson process on the interval \( [0, T] \) is as follows.
Generate \( N \) as a Poisson pseudorandom number with parameter \( \lambda T \).
Generate \( N \) independent uniform pseudorandom numbers on the interval \( [0, T] \).
The sorted values of these uniform numbers are the points of the Poisson process.
Simulate points of a homogeneous Poisson process having rate \( 1.5 \) on the interval \( [0,10] \).
(N <- rpois(1, 1.5 * 10))
[1] 13
P <- runif(N, max = 10)
sort(P)
[1] 0.3616289 0.8223641 1.7589516 2.1767836 2.1873242 4.1602823 4.7727588
[8] 4.8855681 5.7580374 6.3623382 6.9250931 7.8025595 8.1488465
Exponential random variables are used as simple models for such things as failure times of mechanical or electronic components, or for the time it takes a server to complete service to a customer. The exponential distribution is characterized by a constant failure rate, denoted by \( \lambda \).
\( T \) has an exponential distribution with rate \( \lambda > 0 \) if
\[ P(T\leq t)=1-e^{-\lambda t} \]
for any non-negative \( t \).
The pexp() function can be used to evaluate the distribution function.
pexp(q, rate)
The output from this is the value of \( P(T \leq q) \), where \( T \) is an exponential random variable with parameter rate.
Suppose the service time at a bank teller can be modeled as an exponential random variable with rate \( 3 \) per minute.
Then the probability of a customer being served in less than \( 1 \) minute is:
pexp(1, rate = 3)
[1] 0.9502129
Thus, \( P(X \leq 1) = 0.95 \), when \( X \) is an exponential random variable with rate \( 3 \).
Differentiating the right hand side of the distribution function with respect to \( t \) gives the exponential probability density function:
\[ f(t)=\lambda e^{-\lambda t} \]
The dexp() function can be used to evaluate this. It takes the same arguments as the pexp() function. The qexp() function can be used to obtain quantiles of the exponential distribution.
The expected value of an exponential random variable is \( 1/\lambda \), and the variance is \( 1/\lambda^2 \).
A simple way to simulate exponential pseudorandom variates is based on the inversion method.
For an exponential random variable, \( F(x)=1-e^{- \lambda x} \), so \( F^{-1}(U)=-\frac{\log (1-U)}{\lambda} \).
Therefore, for any \( x \in (0,1) \), we have \[ P(F(T) \leq x) = P(T \leq F^{-1}(x)) = F(F^{-1}(x)) = x. \]
Thus, \( F(T) \) is a uniform random variable on the interval \( (0, 1) \).
Generate a uniform pseudorandom variable \( U \) on \( [0,1] \), and set \[ 1-e^{- \lambda T}=U \]
Solving this for \( T \), we have:
\[ T=-\frac{\log (1-U)}{\lambda} \]
\( T \) has an exponential distribution with rate \( \lambda \).
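A minimal sketch of this inversion method (the built-in rexp() function introduced next is what you would normally use):
U <- runif(100000)
Tsim <- -log(1 - U) / 3 # exponential variates with rate 3
mean(Tsim) # should be close to 1/3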
The R function rexp() can be used to generate \( n \) random exponential variates.
rexp(n, rate)
A bank has a single teller who is facing a lineup of \( 10 \) customers. The time for each customer to be served is exponentially distributed with rate \( 3 \) per minute. We can simulate the service times (in minutes) for the \( 10 \) customers.
servicetimes <- rexp(10, rate = 3)
servicetimes
[1] 0.3605426 0.0686455 0.1915629 0.2483414 0.8189500 0.1955231 0.9822217
[8] 1.2710204 0.3287456 0.2544072
The total time until these \( 10 \) simulated customers will complete service is around \( 4 \) minutes and \( 43 \) seconds:
sum(servicetimes)
[1] 4.71996
It can be shown that the points of a homogeneous Poisson process with rate \( \lambda \) on the line are separated by independent exponentially distributed random variables which have mean \( 1/\lambda \).
This leads to another simple way of simulating a Poisson process on the line.
Simulate the first \( 25 \) points of a Poisson process with rate \( 1.5 \), starting from 0.
X <- rexp(25, rate = 1.5)
cumsum(X)
[1] 0.6363484 1.0931507 1.1143659 1.8451037 4.9685888 5.0572653
[7] 5.0918432 8.2256760 8.4432871 8.6402185 9.2219486 9.6200494
[13] 10.0257444 10.4329056 10.6811674 12.7360499 13.2546585 15.4761621
[19] 15.8895808 16.3705789 16.8459176 16.9178202 18.2353238 18.2930150
[25] 18.6376047
A normal random variable X has a probability density function given by
\[ f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
where \( \mu \) is the expected value of \( X \), and \( \sigma^2 \) denotes the variance of \( X \).
The standard normal random variable has mean \( \mu = 0 \) and standard deviation \( \sigma=1 \).
The normal density function can be evaluated using the dnorm() function.
The distribution function can be evaluated using pnorm().
The quantiles of the normal distribution can be obtained using qnorm().
For example, the \( 95 \) th percentile of the normal distribution with mean \( 2.7 \) and standard deviation \( 3.3 \) is:
qnorm(0.95, mean = 2.7, sd = 3.3)
[1] 8.128017
Normal pseudorandom variables can be generated using the rnorm() function in R.
rnorm(n, mean, sd)
This produces \( n \) normal pseudorandom variates which have mean mean and standard deviation sd.
We can simulate \( 10 \) independent normal variates with a mean of \( -3 \) and standard deviation of \( 0.5 \) using
rnorm(10, -3, 0.5)
[1] -2.557426 -2.531531 -3.071788 -3.873954 -3.528422 -3.710907 -3.240593
[8] -3.397676 -2.469309 -3.665879
The Box-Muller transform converts pseudorandom numbers from the uniform distribution into pseudorandom numbers from the standard normal distribution.
Suppose \( U_1,U_2 \) are independent random variables following the uniform distribution on the interval \( (0,1) \). Define \( Z_0,Z_1 \) as
\[ Z_0=R\cos(\Theta)=\sqrt{-2\ln U_1}\cos (2\pi U_2) \] \[ Z_1=R\sin(\Theta)=\sqrt{-2\ln U_1}\sin (2\pi U_2) \]
Then \( Z_0,Z_1 \) are independent random variables with a standard normal distribution. Here \( R \) and \( \Theta \) are the polar coordinates of \( (Z_0, Z_1) \), with \[ R^2=-2 \ln U_1 \] \[ \Theta=2\pi U_2 \]
Assume the random numbers are \( U_1=0.2 \) and \( U_2=0.6 \). Then
\[ R^2=-2\ln U_1=3.218876 \]
\[ \Theta=2\pi U_2=3.769911 \]
Therefore, we could have \( Z_0,Z_1 \) as: \[ Z_0=R\cos (\Theta)=-1.451476 \]
\[ Z_1=R\sin (\Theta)=-1.054559 \]
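These values can be checked directly in R:
U1 <- 0.2; U2 <- 0.6
sqrt(-2 * log(U1)) * cos(2 * pi * U2) # Z_0 = -1.451476
sqrt(-2 * log(U1)) * sin(2 * pi * U2) # Z_1 = -1.054559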
GaussianNoise <- function(mu, sigma) {
  # Box-Muller transform: two independent uniforms yield two independent normals
  U1 <- runif(n = 1, min = 0, max = 1)
  U2 <- runif(n = 1, min = 0, max = 1)
  Z_1 <- sqrt(-2 * log(U1)) * cos(2 * pi * U2)
  Z_2 <- sqrt(-2 * log(U1)) * sin(2 * pi * U2)
  # rescale from the standard normal to mean mu and standard deviation sigma
  Z1 <- Z_1 * sigma + mu
  Z2 <- Z_2 * sigma + mu
  return(c(Z1, Z2))
}
We can examine the simulated values by plotting kernel density estimates based on the first 10, the first 100, and all of the simulated values:
NormRandom <- replicate(n = 1000,GaussianNoise(0,1))[1:10]
plot(density(NormRandom))
NormRandom <- replicate(n = 1000,GaussianNoise(0,1))[1:100]
plot(density(NormRandom))
NormRandom <- replicate(n = 1000,GaussianNoise(0,1))
plot(density(NormRandom))
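As a quick sanity check (not in the original notes), the mean and standard deviation of the full set of simulated values should be close to \( 0 \) and \( 1 \):
mean(NormRandom) # should be close to 0
sd(as.vector(NormRandom)) # should be close to 1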
We can simulate random numbers from certain conditional distributions by first simulating according to an unconditional distribution, and then rejecting those numbers which do not satisfy the specified condition.
Simulate \( x \) from the standard normal distribution, conditional on the event that \( 0 < x < 3 \). We will simulate from the entire normal distribution and then accept only those values which lie between \( 0 \) and \( 3 \).
We can simulate a large number of such variates as follows; the resulting histogram tracks the rescaled normal density over the interval \( (0, 3) \).
x <- rnorm(100000)
x <- x[(0 < x) & (x < 3)]
hist(x, probability=TRUE)
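To visualize how well the histogram tracks the conditional density, we can overlay the normal density rescaled by \( P(0 < X < 3) \); a minimal sketch of the overlay:
hist(x, probability = TRUE)
curve(dnorm(x) / (pnorm(3) - pnorm(0)), from = 0, to = 3, add = TRUE)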
Suppose \( g(x) \) is any function that is integrable on the interval \( [a, b] \).
The integral
\[ \int_{a}^{b}g(x)dx \]
gives the area of the region with \( a < x < b \) and \( y \) between \( 0 \) and \( g(x) \) (where negative values of \( g(x) \) count towards negative areas).
Monte Carlo integration uses simulation to obtain approximations to these integrals. It relies on the law of large numbers.
This law says that a sample mean from a large random sample will tend to be close to the expected value of the distribution being sampled.
If we can express an integral as an expected value, we can approximate it by a sample mean.
For example, let \( U_1,U_2,...,U_n \) be independent uniform random variables on the interval \( [a, b] \). These have density \( f(u)=1/(b-a) \) on that interval. Then
\[ E[g(U_i)]=\int_{a}^{b}g(u)\frac{1}{b-a}du \]
so the original integral \( \int_{a}^{b}g(x)dx \) can be approximated by \( b-a \) times a sample mean of \( g(U_i) \).
To approximate the integral \( \int_{0}^{1}x^4 dx \), use the following lines:
u <- runif(100000)
mean(u^4)
[1] 0.201374
Compare with the exact answer, \( 0.2 \), which can easily be computed.
To approximate the integral \( \int_{2}^{5}\sin(x)dx \), use the following lines:
u <- runif(100000, min = 2, max = 5)
mean(sin(u))*(5-2)
[1] -0.7032925
The true value is \( \cos(2)-\cos(5) \approx -0.700 \).
Now let \( V_1,V_2,...,V_n \) be an additional set of independent uniform random variables on the interval \( [0, 1] \), and suppose \( g(x, y) \) is now an integrable function of the two variables \( x \) and \( y \). The law of large numbers says that \[ \lim_{n\rightarrow \infty}\sum_{i=1}^{n}g(U_i,V_i)/n=\int_{0}^{1}\int_{0}^{1}g(x,y)dxdy \] with probability \( 1 \).
So we can approximate the integral \( \int_{0}^{1}\int_{0}^{1}g(x,y)dxdy \) by generating two sets of independent uniform pseudorandom variates, computing \( g(U_i, V_i) \) for each one, and taking the average.
Approximate the integral \( \int_{3}^{10}\int_{1}^{7}\sin(x-y)dxdy \) using the following:
U <- runif(100000, min = 1, max = 7)
V <- runif(100000, min = 3, max = 10)
mean(sin(U - V))*42
[1] 0.188048
The factor of \( 42 = (7-1)(10-3) \) compensates for the joint density of \( U \) and \( V \) being \( f(u, v)=1/42 \).
The uniform density is by no means the only density that can be used in Monte Carlo integration.
If the density of \( X \) is \( f(x) \), then
\[ E[g(X)/f(X)] =\int \frac{g(x)}{f(x)}f(x)dx =\int g(x)dx \]
so we can approximate the latter by sample averages of \( g(X)/f(X) \).
To approximate the integral \( \int_{1}^{\infty}\exp(-x^2)dx \), write it as \[ \int_{0}^{\infty}\exp[-(x+1)^2]dx \]
and use an exponential distribution for \( X \):
X <- rexp(100000)
mean( exp( -(X + 1)^2 ) / dexp(X) )
[1] 0.1392813
The true value of this integral is \( 0.1394 \).
Monte Carlo integration is not always successful: sometimes the ratio \( g(X)/f(X) \) varies so much that the sample mean doesn't converge.
Try to choose \( f(x) \) so this ratio is roughly constant, and avoid situations where \( g(x)/f(x) \) can be arbitrarily large.
The simulation methods discussed so far will only work for particular types of probability densities or distributions.
General purpose simulation methods can be used to draw pseudorandom samples from a wide variety of distributions.
The idea of rejection sampling was used earlier to sample from a conditional distribution: sample from a convenient distribution, and select a subsample to achieve the target distribution.
We will show how to use rejection sampling to draw a random sample from a univariate density or probability function \( g(x) \), using a sequence of two examples.
Simulate pseudorandom variates from the triangular density function.
\[ g(x)=1-|1-x|,0\leq x \leq 2 \]
If we could draw points uniformly from the triangular region below the density, the x-coordinate would be distributed with density \( g(x) \).
The right hand panel of the figure shows that the graph where the density is nonzero can be entirely contained in a rectangle of height \( 1 \) and width \( 2 \).
A subset of uniformly distributed points in the rectangle will be uniformly distributed in the triangular area under the triangular density.
Thus, a strategy for simulating values from the triangular density is:
Simulate a point \( (U_1,U_2) \) uniformly in the rectangle.
If \( (U_1,U_2) \) is located within the triangular region, accept \( U_1 \) as a pseudorandom variate; otherwise, reject it, and return to step 1.
Since the triangular density occupies half of the area of the rectangle, we would expect to sample roughly 2 uniform points from the rectangle for every point we accept from the triangular distribution.
In vectorized form, the steps are:
U1 <- runif(100000, max=2)
U2 <- runif(100000)
X <- U1[U2 < (1 - abs(1 - U1))]
The vector X will contain approximately \( 50000 \) simulated values from the triangular distribution.
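We can check how many values were actually accepted:
length(X) # roughly 50000, i.e. about half of the 100000 candidate points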
hist(X)
It is easier to check if we include an overlaid density curve:
hist(X, freq=FALSE) # total area = 1
curve(1-abs(1-x), -.5, 2.5, add=TRUE)