Processing math: 100%
  • Simulation
  • Overview of today’s lesson
  • Required R Packages
  • What is Computer Simulation
  • Key Simulation Ingredients
    • Random Numbers
    • Distributions
    • Algorithms
    • GIGO
  • Random Number Simulation
    • First lets explore the r command
    • The p Command
    • The d Command
    • Simulation using sample() function
  • Monte Carlo Simulation
    • Monte Carlo Integration
    • Monte Carlo to estimate π
    • Monte Carlo to Predict the Stock Market



Simulation

1: the act or process of simulating
2: a sham object : counterfeit
3a: the imitative representation of the functioning of one system or process by means of the functioning of another…a computer simulation of an industrial process
b: examination of a problem often not subject to direct experimentation by means of a simulating device
https://www.merriam-webster.com/dictionary/simulation



Overview of today’s lesson

  • What is Computer Simulation
  • Key Simulation Ingredients
  • Random Number Simulation in R
  • Simulation using sample() function
  • Monte Carlo Simulation


Required R Packages

library(tidyverse)
library(knitr)


What is Computer Simulation

Computer simulation is the use of a computer to represent the dynamic responses of one system by the behavior of another system modeled after it. A simulation uses a mathematical description, or model, of a real system in the form of a computer program. This model is composed of equations that duplicate the functional relationships within the real system. When the program is run, the resulting mathematical dynamics form an analog of the behavior of the real system, with the results presented in the form of data. A simulation can also take the form of a computer-graphics image that represents dynamic processes in an animated sequence.

Computer simulations are used to study the dynamic behavior of objects or systems in response to conditions that cannot be easily or safely applied in real life. For example, a nuclear blast can be described by a mathematical model that incorporates such variables as heat, velocity, and radioactive emissions. Additional mathematical equations can then be used to adjust the model to reflect changes in certain variables, such as the amount of fissionable material that produced the blast. Simulations are especially useful in enabling observers to measure and predict how the functioning of an entire system may be affected by altering individual components within that system.

The simpler simulations performed by personal computers consist mainly of business models and geometric models. The former includes spreadsheet, financial, and statistical software programs that are used in business analysis and planning. Geometric models are used for numerous applications that require simple mathematical modeling of objects, such as buildings, industrial parts, and the molecular structures of chemicals. More advanced simulations, such as those that emulate weather patterns or the behavior of macroeconomic systems, are usually performed on powerful workstations or on mainframe computers. In engineering, computer models of newly designed structures undergo simulated tests to determine their responses to stress and other physical variables. Simulations of river systems can be manipulated to determine the potential effects of dams and irrigation networks before any actual construction has taken place. Other examples of computer simulations include estimating the competitive responses of companies in a particular market and reproducing the movement and flight of space vehicles.
https://www.britannica.com/technology/computer-simulation


Key Simulation Ingredients

  • Random Numbers
  • Distributions
  • Algorithms
  • GIGO


Random Numbers

Random numbers are at the foundation of computer simulation. Random numbers are useful for a variety of purposes, such as generating data encryption keys, simulating and modeling complex phenomena and for selecting random samples from larger data sets. They have also been used aesthetically, for example in literature and music, and are of course ever popular for games and gambling. When discussing single numbers, a random number is one that is drawn from a set of possible values, each of which is equally probable, i.e., a uniform distribution. When discussing a sequence of random numbers, each number drawn must be statistically independent of the others.

With the advent of computers, programmers recognized the need for a means of introducing randomness into a computer program. However, surprising as it may seem, it is difficult to get a computer to do something by chance. A computer follows its instructions blindly and is therefore completely predictable. (A computer that doesn’t follow its instructions in this manner is broken.) There are two main approaches to generating random numbers using a computer: Pseudo-Random Number Generators (PRNGs) and True Random Number Generators (TRNGs). The approaches have quite different characteristics and each has its pros and cons.
https://www.random.org/

For the purposes of simulation and this lesson we will focus on PRNGs. By using and setting a ‘seed’, PRNGs give us the ability to reproduce a simulation. Reproducibility is important!

Distributions

“Probability distributions are fundamental to statistics, just like data structures are to computer science. They’re the place to start studying if you mean to talk like a data scientist. You can sometimes get away with simple analysis using R or scikit-learn without quite understanding distributions, just like you can manage a Java program without understanding hash functions. But it would soon end in tears, bugs, bogus results, or worse: sighs and eye-rolling from stats majors.”
https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/

Generally speaking a probability distribution is a listing of all the values the random variable can assume with their corresponding probabilities.

Relational Mapping of the Most Common Distributions


For more information about probability distributions follow this link: https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/

Algorithms

A procedure for solving a mathematical problem in a finite number of steps that frequently involves repetition of an operation; broadly: a step-by-step procedure for solving a problem or accomplishing some end especially by a computer - a search algorithm.
https://www.merriam-webster.com/dictionary/algorithm

GIGO

Garbage In - Garbage Out. Computers will process your inputted data, and summarily produce an output. Simply put, flawed data in = flawed data out.

Random Number Simulation

A simulation study typically begins with a probability model for the data and simulation of responses from this model. For several common probability distributions R provides a set of functions, sometimes called a d-p-q-r family, to evaluate the probability density function (for continuous distributions - the probability mass function for discrete distributions), the cumulative distribution function or the quantile function (inverse of the c.d.f) and for simulation of a random sample.

As shown below, the names of the functions are composed of the initial letter indicating

  • d: density function (or probability mass function)
  • p: (cumulative) probability function (values are always in the interval [0,1])
  • q: quantile function - the inverse (more-or-less) of the p function
  • r: simulation of a random sample from the distribution

and an abbreviation of the distribution name, as shown below, which also states the parameter names used for the distribution.

  • exp: (Exponential) The parameter is rate (defaults to 1). The mean of the distribution is 1/rate.
  • norm: (Normal)The most famous distribution in statistics, this is the well-known “bell-curve”. Parameters of the distribution are mean (defaults to 0) and sd (defaults to 1).
  • unif: (Uniform) Parameters are min (defaults to 0) and max (defaults to 1).
  • binom: (Binomial) Parameters are size, the number of trials, and prob, the probability of success on a single trial (no defaults).
  • geom: (Geometric) Parameter is prob, the probability of success on each independent trial. Note that the distribution is defined in terms of the number of failures before the first success.
  • pois: (Poisson) The parameter is lambda, the mean.


Use the command help(Distributions) to access more complete details, usage, and syntax for distributions in the stats package


First lets explore the r command

Example 1: Generate 25 random numbers between 0 and 10 from the uniform distribution.

runif(25, min = 0, max = 10)
##  [1] 2.52108680 5.41441143 2.21375084 8.39537641 6.63524886 9.52025284
##  [7] 3.67148977 4.95634544 8.30637858 2.40256461 2.20710543 0.07394886
## [13] 5.86388760 1.97245725 1.60879116 3.30715687 8.99187604 8.22733655
## [19] 3.90201251 3.65717878 5.21268748 3.85556689 5.11653035 7.41064149
## [25] 1.28892533



Piping your output into the round() function allows you to round numbers off as appropriate. In this example we chose 0 digits after the decimal.


Example 2: Generate 100 random numbers between 0 and 50 from the uniform distribution, round to nearest whole number.

runif(100, min = 0, max = 50) %>% round(.,digits = 0)
##   [1]  6 41 45 12 48 19 24 25 26 33 25 19 43 24 18 37 16 18 42 31 46 46  4
##  [24] 39 17 37  5 19 30 49  5 27 29 13 32 40 37  4 45 34  9 43 15 21 17 10
##  [47] 22 13 27 10 45 21 19 17 26 17 21 22  2  0 20  4 13 29 28  2 49 43 14
##  [70] 38 11  7 36  6 11 22 25 18  3  4 24  1 10 39 40 20 48  7 39 38 19  2
##  [93] 43 50 38 16  1 34 24 46


Reproducibility

Precede your number generation command withset.seed to select a seed or starting point for your random number generator. The seed is what makes your model “reproducible”, by using the same seed you ensure that you are using the same set of random numbers as before.
Example:

set.seed(283)
rnorm(10, mean = 5, sd = 2)
##  [1] 7.013740 5.400777 4.348542 5.982698 2.998060 3.666842 9.071926
##  [8] 7.349144 7.428118 8.522766
set.seed(283)
rnorm(10, mean = 5, sd = 2)
##  [1] 7.013740 5.400777 4.348542 5.982698 2.998060 3.666842 9.071926
##  [8] 7.349144 7.428118 8.522766

And without setting the seed

rnorm(10, mean = 5, sd = 2)
##  [1] 5.436304 7.014597 3.053420 6.992132 5.912376 4.104026 2.637033
##  [8] 6.535835 5.518205 2.525203

Your Turn

Generate 500 random numbers from the exponential distribution. Plot your output on a histogram.

set.seed(147)
rexp(500, rate = 1) %>% hist()


The p Command

Prefixing your function with p calls up the cumulative distribution function (CDF). Given a number or list of numbers, the CDF function allows us to compute the probability that a random number will be less that that number. In the following examples we will evaluate the normal distribution. To learn more about the normal distribution and associated syntax type the command ?Normal. Note that the default mean and standard deviation is 0,1 respectively.

pnorm(0)
## [1] 0.5
pnorm(1.645)
## [1] 0.9500151
pnorm(1.645, lower.tail = FALSE)
## [1] 0.04998491
pnorm(17, mean = 25, sd = 6)
## [1] 0.09121122


Your Turn

Compute the CDF for the number 1.21 given a normal distribution with mean = 0 and sd = 1.

pnorm(1.21)
## [1] 0.8868606


The d Command

The d command is used to calculate the probability density function (PDF) for a given point (the height of the PDF).

dnorm(0)
## [1] 0.3989423
dnorm(1.645)
## [1] 0.1031108



Putting these two functions to work allows you to do calculate the probability that a certain random variable will fall between a given range of numbers as seen in this example from http://www.statmethods.net/advgraphs/probability.html

# Children's IQ scores are normally distributed with a
# mean of 100 and a standard deviation of 15. What
# proportion of children are expected to have an IQ between
# 80 and 120?

mean=100; sd=15
lb=80; ub=120

x <- seq(-4,4,length=100)*sd + mean
hx <- dnorm(x,mean,sd)

plot(x, hx, type="n", xlab="IQ Values", ylab="",
  main="Normal Distribution", axes=FALSE)

i <- x >= lb & x <= ub
lines(x, hx)
polygon(c(lb,x[i],ub), c(0,hx[i],0), col="red")

area <- pnorm(ub, mean, sd) - pnorm(lb, mean, sd)
result <- paste("P(",lb,"< IQ <",ub,") =",
   signif(area, digits=3))
mtext(result,3)
axis(1, at=seq(40, 160, 20), pos=0) 

Simulation using sample() function

sample()takes a sample of the specified size from elements of a vector. Syntax is as follows: sample(x, size, replace, prob). You can type in the command help(sample) for detailed explanation.

For this simulation, use the sample() function to run 10000 trials using two fair dies. Determine the probability of rolling a 7.

trials <- 10000
die1 <- sample(1:6, size = trials, replace = TRUE, prob = NULL)
die2 <- sample(1:6, size = trials, replace = TRUE, prob = NULL)
outcome2 <- die1 + die2
mean(outcome2 == 7)
## [1] 0.1659

What is the probability of rolling a 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13?

sapply(2:13, function(x) mean(outcome2 == x))
##  [1] 0.0273 0.0569 0.0821 0.1070 0.1465 0.1659 0.1366 0.1105 0.0832 0.0565
## [11] 0.0275 0.0000


Your Turn

Use the sample() function to determine the probability of rolling a 7 using three fair six-sided dies.

trials <- 10000
die1 <- sample(1:6, size = trials, replace = TRUE, prob = NULL)
die2 <- sample(1:6, size = trials, replace = TRUE, prob = NULL)
die3 <- sample(1:6, size = trials, replace = TRUE, prob = NULL)
outcome3 <- die1 + die2 + die3
mean(outcome3 == 7)
## [1] 0.0722

If you had to guess what number will come up next, which number should you choose?

qplot(outcome3,
      geom="histogram",
      binwidth = 1,  
      main = "Histogram of Outcomes", 
      xlab = "Number Rolled",  
      fill=I("blue"), 
      col=I("black"), 
      alpha=I(.4),
      xlim=c(0,22))


Monte Carlo Simulation

Building on random number simulation, we can now employ Monte Carlo methods. At its heart, Monte Carlo Simulation draws on the Law of Large Numbers, in this case, repeated random sampling to obtain numerical results, lots of repeated random sampling.
Monte Carlo Simulation performs risk analysis by building models of possible results by substituting a range of values-a probability distribution-for any factor that has inherent uncertainty. It then calculates results over and over, each time using a different set of random values from the probability functions. Depending upon the number of uncertainties and the ranges specified for them, a Monte Carlo simulation could involve thousands or tens of thousands of recalculations before it is complete. Monte Carlo simulation produces distributions of possible outcome values.
http://www.palisade.com/risk/monte_carlo_simulation.asp


Monte Carlo Integration

Suppose we have an instance of a Normal distribution with a mean of 1 and a standard deviation of 10. Then we want to find the integral from 3 to 6. 631102πe(x1)22102dx

Lets run 1 million trials to integrate this function.

trials <- 1000000
norm.sims <- rnorm(trials, mean = 1, sd = 10)
sum(norm.sims >= 3 & norm.sims <= 6)/trials
## [1] 0.111764


Monte Carlo to estimate π

We can use Monte Carlo to estimate π. First, a quick review of facts.


The area of a circle is: Acircle=πr2

The area of a square drawn around that circle is: Asquare=4r2

The ratio between these two facts can be reduced to: π/4

Given this fact, if we can empirically determine the ratio of the area of the circle to the area of the square we can simply multiply this number by 4 and we’ll get our approximation of π.
To do this we can randomly sample x and y values from a unit square centered around 0. If x2+y2<=r2 then the point is in the circle. It follows then that the ratio of points inside the circle to the points outside the circle should approximate π

runs <- 100000
xs <- runif(runs,min=-0.5,max=0.5)
ys <- runif(runs,min=-0.5,max=0.5)
in.circle <- xs^2 + ys^2 <= 0.5^2
mc.pi <- (sum(in.circle)/runs)*4
plot(xs,ys,pch='.',col=ifelse(in.circle,"blue","grey")
     ,xlab='',ylab='',asp=1,
     main=paste("Monte Carlo Approximation of Pi =",mc.pi),sub="100,000 Trials")





And to reinforce the idea that sample size matters, here is the Monte Carlo approximation of π using only 100 trials.




Monte Carlo to Predict the Stock Market

A fictional stock has the following properties: On average it gains 1.001 times its opening price during the trading day, but that can vary by a standard deviation of 0.005 on any given day (this is its volatility). We can simulate a single sample path for this stock by taking the cumulative product from a Normal distribution with a mean of 1.001 and a sd of 0.005. Assuming the stock opens at $20/per share, here is a sample path for 365 days of trading.

days <- 365
changes <- rnorm(365,mean=1.001,sd=0.005)
plot(cumprod(c(20,changes)),type='l',ylab="Price",xlab="day",main="closing price (1 possible path)")



So if that is one possible path, lets apply Monte Carlo methods to simulate 100,000 possible paths. Calculation of the median price for possibilities will give us a good prediction of the closing price after 365 days of trading.

runs <- 100000
#simulates future movements and returns the closing price on day 365
generate.path <- function(){
  days <- 365
  changes <- rnorm(365,mean=1.001,sd=0.005)
  sample.path <- cumprod(c(20,changes))
  closing.price <- sample.path[days+1] #+1 because we add the opening price
}

mc.closing <- replicate(runs,generate.path())


Lower_95 Closing_Price Upper_95
24.50508 28.66963 33.52122
The possibilities for application of Monte Carlo Methods are endless. There are many more uses and applications for Monte Carlo Methods. I used this website for the examples seen above. https://www.countbayesie.com/blog/2015/3/3/6-amazing-trick-with-monte-carlo-simulations





And Just for Fun

Are you living in a computer simulation?

https://www.simulation-argument.com/simulation.html

You can’t make this stuff up…