A monkey typing randomly manages to type out a line of Shakespeare.  What is the chance that the monkey would produce this particular line?

A monkey typing randomly manages to type out a line of Shakespeare. What is the chance that the monkey would produce this particular line?

Probability and Expected Value

What is the Chance?

  • You pull three balls at random from an urn that contains 100 balls, 70 of which are red and 30 of which are black.
    • What is the chance that at least two of the balls are red?
  • Your favorite college basketball team is in the NCAA Tournament.
    • What is the chance that it will win the Championship?
  • You plan to live in Florida for the next ten years.
    • What is the chance that you will experience a Category 5 hurricane?

We are interested in finding the probability that a specified event will occur when a random process takes place.

Random Processes and Events

Random Process Event
selecting three balls at least two of them are red
March Madness your team wins it all
ten years in Florida category 5 storm hits at least once

Definition of Probability

Probability of an Event

The long-run proportion of times that the event will occur if the underlying random process is repeated many, many times.

Random Variables

Random Variable

A number whose value depends on the outcome of a chance process.

Random Variable Examples

Random Process Random Variable \(X\)
selecting three balls the number of red balls you get
March Madness 1 (your team wins) 0 (your team does not win)
ten years in Florida number of Category 5 storms that hit in that time
ten years in Florida 1 (Cat-5 hits at least once) 0 (never hits)
one year in Florida total insurance claims filed in that year
Kentucky Derby 2017 your winnings ($)

Expected Value

What can you expect \(X\) to be, on average, in the long run?

Expected Value of a Random Variable

The long-run average of the values of a random variable, where the underlying chance process is repeated many, many times.

Monte Carlo Simulation

How to Estimate a Probability

Say you want the probability of being hit by a Cat-5 storm in the next ten years.

Procedure:

n <- a really big number
for (i in 1:n) {
  live in FL ten years
  record whether or not a Cat-5 hit you (1 = hit at least once, 0 = not)
}
probability <- (number of 1s you recorded)/n
cat("The chance of getting hit is ", probability, ".", sep = "")

The bigger \(n\) is, the better your estimate of the probability should be.

How to Estimate an Expected Value

Say you want the expected number of red balls, when you pick three from the urn.

Procedure:

n <- a really big number
for (i in 1:n) {
  set up your urn (70 reds, 30 blacks)
  pick three balls at random
  record the number of reds you got
}
expected_value <- (total reds picked)/n
cat("The expected number of reds is ", expected value, ".", sep = "")

The bigger \(n\) is, the better your estimate of the expected value should be.

Crazy!

  • It takes \(10n\) years to live in FL for ten years, \(n\) times!
  • And it's expensive!
  • And you would need a time machine to repeat the process under the same conditions:
    • live in FL ten years
    • enter time machine, go back 10 years
    • live in FL another 10 years …

Same with drawing from the urn!

Monte Carlo Simulation

The solution is to make the computer model (simulate) the random process and keep track of the results.

Repetition will deliver results of each "trial" of the modeled random process.

This is called Monte Carlo simulation.

Simple Examples

Example #1: A Box

Consider a box that holds ten tickets. Three of them are labeled with the letter "a"; the rest are labeled "b".

Our random process:

We will draw one ticket from the box at random. (Each ticket is equally likely to be picked.)

Our probability question:

What is the probability of drawing an "a"-ticket?

Modeling the Box

tickets <- c(rep("a", 3), rep("b", 7))
tickets
##  [1] "a" "a" "a" "b" "b" "b" "b" "b" "b" "b"

Modeling the Random Process

We want to model (simulate) the random process of picking a ticket from the box.

We need a good function for this.

We need a good R-function, Mr. Frodo!

We need a good R-function, Mr. Frodo!

Modeling the Process: the sample() Function

Try this:

sample(tickets, size = 1)

Try it a few more times.

  • first argument is the vector to sample from
  • size says how many times to repeat the sampling

Repetition

To get an estimate of probability, we need to repeat the random process a number of times. (Let's do 20 times.)

You could always do this with a loop:

sims <- numeric(20)
for ( i in 1:20 ) {
  sims[i] <- sample(tickets, size = 1)
}
sims
##  [1] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "a" "b" "b" "b" "b" "b"
## [18] "b" "b" "b"

Repetition (no loop)

Actually, sample() can handle the repetition for you:

sims <- sample(tickets, size = 20, replace = TRUE)
sims
##  [1] "b" "b" "b" "a" "b" "b" "b" "b" "b" "b" "a" "a" "b" "b" "b" "b" "b"
## [18] "b" "a" "a"
  • replace says whether or not to replace the item you picked, after you sample

Summarize

We can summarize the results in a table with the table() function:

table(sims)
## sims
##  a  b 
##  5 15

Summarize (More Relevant)

The prop.table() function will compute proportions for us:

prop.table(table(sims))
## sims
##    a    b 
## 0.25 0.75

We figure the chance of drawing an "a"-ticket is about 0.25.

Repeat MANY Times

How about repeating more times, say 10,000 times?

sims <- sample(tickets, size = 10000, replace = TRUE)
prop.table(table(sims))
## sims
##      a      b 
## 0.2981 0.7019

With more repetitions, we get a better estimate.

Law of Large Numbers

The Law of Large Numbers says:

The more times you repeat the underlying random process, the more accurate your estimate of the desired probability (or expected value) is liable to be.

Example #2: A Game

You are about to play a game: you will flip a fair coin twice.

  • If you get Tails both times, you lose a dollar.
  • If you get exactly one Head, nothing happens.
  • If you get Heads both times, you win two dollars.

Let \(W\) be the number of dollars you will win.

\(W\) is a random variable. What is its expected value?

What Can Happen

When you flip a fair coin twice, there are four equally likely outcomes:

  • Tails and then Tails (\(W = -1\))
  • Tails and then Heads (\(W = 0\))
  • Heads and then Tails (\(W = 0\))
  • Heads and then Heads (\(W = 2\))

Hence you have:

  • The chance that \(W=-1\) is 0.25.
  • The chance that \(W=0\) is 0.50.
  • The chance that \(W=2\) is 0.25.

We Know the Distribution of W!

Distribution of a Random Variable

A statement of the probabilities for a random variable to assume its various possible values.

\(w\) \(P(W = w)\)
-1 0.25
0 0.5
2 0.25

Model and Repeat

Back to Monte Carlo simulation. sample() is still good for this!

# model the process:
w <- c(-1, 0, 2)
pw <- c(0.25, 0.50, 0.25)
# repeat and store results:
winnings <- sample(w, size = 10, replace = TRUE, prob = pw)
winnings
##  [1]  0  2 -1  0  0  0  2  0  0  0
  • parameter prob specifies the chance of getting each of the elements of the first argument w

Summarize

mean(winnings)
## [1] 0.3

This is our estimate of the expected value of \(W\).

Better With More Repetitions

How about repeating 10,000 times?

winnings <- sample(w, size = 10000, replace = TRUE, prob = pw)
mean(winnings)
## [1] 0.2456

Exactness Sometimes Possible

In a simple game like this, the expected value of \(W\) can be computed exactly, as a weighted average:

\[\sum_{w} wP(W = w) = 0.25 \times -1 + 0.50 \times 0 + 0.25 \times 2 = 0.25.\]

Exactness With R

R can compute \(\sum_{w} wP(W = w)\):

sum(w * pw)
## [1] 0.25

But usually the random process is so complex that exact computations are difficult or impossible.

Example: Chance of a Triangle

Triangle From Three Sides?

Recall the function for deciding if three side-lengths can make a triangle:

isTriangle <- function(x, y, z) {
  (x + y > z) & (x +z > y) & (y + z > x)
}

Vectorization

Recall that it worked on many triangles at once.

# sides for six triangles:
a <- c(2, 4.7, 5.2, 6, 6, 9)
b <- c(4, 1, 2.8, 6, 6, 3.5)
c <- c(5, 3.8, 12, 13, 11, 6.2)
# first triangle has sides 2, 4 and 5, etc.
isTriangle(a, b, c)
## [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE

A Random Process

  • You will take a stick that is one foot long, and
  • break it randomly in two places.
  • You will have three pieces.

Probability Question:

What is the probability that the three pieces will form a triangle?

Modeling Breaks

x <- runif(1)  # the first break
y <- runif(1)  # the second break
c(x = x, y = y)
##         x         y 
## 0.6469028 0.3942258

Make the Pieces

a <- min(x, y); b <- max(x, y)
c(left = 0, a = a, b = b, right = 1)
##      left         a         b     right 
## 0.0000000 0.3942258 0.6469028 1.0000000
piece1 <- a - 0; piece2 <- b - a ; piece3 <- 1 - b
c(piece1 = piece1, piece2 = piece2, piece3 = piece3)
##    piece1    piece2    piece3 
## 0.3942258 0.2526771 0.3530972

Do They Make a Triangle?

isTriangle(piece1, piece2, piece3)
## [1] TRUE

Repetition

runif() and isTriangle() do vectorization, so they can handle repetition. Let's repeat the random process eight times.

x <- runif(8)  # the first breaks
x
## [1] 0.64690284 0.39422576 0.61850181 0.47689114 0.13609719 0.06738439
## [7] 0.12915262 0.39311793
y <- runif(8)  # the second breaks
y
## [1] 0.002582699 0.620205954 0.764414018 0.743835758 0.826165695 0.422729083
## [7] 0.409287665 0.539692614

Get the Sides

a <- pmin(x, y)  # pairwise min gets left breaks
a
## [1] 0.002582699 0.394225758 0.618501814 0.476891136 0.136097186 0.067384386
## [7] 0.129152617 0.393117930
b <- pmax(x, y)  # pairwise max gets right breaks
b
## [1] 0.6469028 0.6202060 0.7644140 0.7438358 0.8261657 0.4227291 0.4092877
## [8] 0.5396926
side1 <- a     # leftmost sides
side2 <- b - a  # middle sides
side3 <- 1 - b # rigtmost sides

Determine Triangles

isTriangle(side1, side2, side3)
## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE

A Helper-Function

This function does the job, straight from the original breaks:

makesTriangle <- function(x, y) {
  a <- pmin(x, y)
  b <- pmax(x, y)
  side1 <- a
  side2 <- b - a
  side3 <- 1 - b
  isTriangle(x = side1, y = side2, z = side3)
}

Apply Function

makesTriangle(x, y)
## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE

x y side1 side2 side3 triangle
0.6469028 0.0025827 0.0025827 0.6443201 0.3530972 FALSE
0.3942258 0.6202060 0.3942258 0.2259802 0.3797940 TRUE
0.6185018 0.7644140 0.6185018 0.1459122 0.2355860 FALSE
0.4768911 0.7438358 0.4768911 0.2669446 0.2561642 TRUE
0.1360972 0.8261657 0.1360972 0.6900685 0.1738343 FALSE
0.0673844 0.4227291 0.0673844 0.3553447 0.5772709 FALSE
0.1291526 0.4092877 0.1291526 0.2801350 0.5907123 FALSE
0.3931179 0.5396926 0.3931179 0.1465747 0.4603074 TRUE

Summarise Results

triangle <- makesTriangle(x, y)
sum(triangle)
## [1] 3
mean(triangle)
## [1] 0.375

Encapsulate in a Function

triangleSim <- function(reps = 10000 ) {
  cut1 <- runif(reps)
  cut2 <- runif(reps)
  triangle <- makesTriangle(cut1, cut2)
  cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
}

Add a Feature

triangleSim <- function(reps = 10000, table = FALSE ) {
  cut1 <- runif(reps)
  cut2 <- runif(reps)
  triangle <- makesTriangle(cut1, cut2)
  if ( table ) {
    cat("Here is a table of the results:\n")
    print(table(triangle))
    cat("\n")
  }
  cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
}

Try It Out

triangleSim(table = TRUE)  # use default 10000 reps
## Here is a table of the results:
## triangle
## FALSE  TRUE 
##  7466  2534 
## 
## The proportion of triangles was 0.2534.

Another Feature: Setting a Seed

Let's add a seed parameter to our simulation function:

triangleSim <- function(reps = 10000, table = FALSE, seed ) {
  set.seed(seed)
  cut1 <- runif(reps)
  cut2 <- runif(reps)
  triangle <- makesTriangle(cut1, cut2)
  if ( table ) {
    cat("Here is a table of the results:\n\n")
    print(table(triangle))
    cat("\n")
  }
  cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
}

Add Still Another Feature

It would be nice to have the option to call your function without having to provide a seed.

This can be done by setting the default value of seed to NULL, and adding a check in the body of the function.

Implementing the Feature

triangleSim <- function(reps = 10000, table = FALSE, seed = NULL) {
  if ( !is.null(seed) ) {
    set.seed(seed)
  }
  cut1 <- runif(reps)
  cut2 <- runif(reps)
  triangle <- makesTriangle(cut1, cut2)
  if ( table ) {
    cat("Here is a table of the results:\n\n")
    print(table(triangle))
    cat("\n")
  }
  cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
}

Try It Out

triangleSim(seed = 3030)
## The proportion of triangles was 0.2547.
triangleSim()
## The proportion of triangles was 0.2524.

Encapsulation and Generalization

Encapsulation and Generalization

A method for developing a computer program according to which the programmer first designs a basic procedure, then encapsulates it in one or more functions, and then progressively generalizes these functions until the program possesses all desired features.

Example: Will They Connect?

Anna and Raj

Anna and Raj make a date for coffee tomorrow at the local Coffee Shop. After making the date, both of them forget the exact time they agreed to meet: they can only remember that it was to be sometime between 10 and 11am.

Each person, independently of the other, randomly picks a time between 10 and 11 to arrive.

Each person is willing to wait ten minutes for the other person to arrive.

What is the chance that they meet?

Random Process

The random process is:

From 10 to 11am, watch for Anna and Raj enter/leave the coffeeshop.

Modeling involves Anna and Raj's arrival times (in minutes after 10am):

anna <- runif(1, min = 0, max = 60)
raj <- runif(1, min = 0, max = 60)

Key Idea

Anna and Raj will connect if their arrival times are no more than 10 minutes apart.

That is, they connect if and only if

\[-10 < anna - raj < 10,\]

which is the same as

\[\vert\, anna - raj \,\vert < 10.\]

What Happens This Time

c(anna = anna, raj = raj)
##     anna      raj 
## 12.58343 28.44067
anna - raj
## [1] -15.85724
abs(anna - raj)
## [1] 15.85724

They do not connect.

Encapsulate in a Function

meetupSim <- function(reps = 10000, table = FALSE, seed = NULL) {
  if ( !is.null(seed) ) {
    set.seed(seed)
  }
  anna <- runif(reps, 0, 60); raj <- runif(reps, 0, 60)
  connect <- (abs(anna - raj) < 10)
  if ( table ) {
    cat("Here is a table of the results:\n\n")
    print(table(connect))
    cat("\n")
  }
  cat("The proportion of times they met was ", mean(connect), ".\n", sep = "")
}

Try It Out

meetupSim(table = TRUE, seed = 3939)
## Here is a table of the results:
## 
## connect
## FALSE  TRUE 
##  6955  3045 
## 
## The proportion of times they met was 0.3045.

Vectorization vs. Looping

Two Ways to Repeat

  • When we repeat our simulation of a random process, we could use a loop.

  • So far, we have used vectorization to avoid that.

  • Does it make a difference?

A Looping Meetup Simulation

meetupSim2 <- function(reps = 10000, table = FALSE, seed = NULL) {
  if ( !is.null(seed) ) {
    set.seed(seed)
  }
  connect <- numeric(reps)  # set up vector to hold results
  for ( i in 1:reps ) {     # loop to repeat random process
    anna <- runif(1, 0, 60) # get one arrival time for anna  
    raj <- runif(1, 0, 60)  # and a time for raj
    connect[i] <- (abs(anna - raj) < 10) # put result in vector
  }
  if ( table ) {
    cat("Here is a table of the results:\n\n")
    print(table(connect))
    cat("\n")
  }
  cat("The proportion of times they met was ", mean(connect), ".\n", sep = "")
}

Comparison

system.time() records time to perform a task.

system.time(meetupSim(reps = 10000, seed = 4040))
## The proportion of times they met was 0.3079.
##    user  system elapsed 
##   0.005   0.000   0.006
system.time(meetupSim2(reps = 10000, seed = 4040))
## The proportion of times they met was 0.2993.
##    user  system elapsed 
##   0.034   0.002   0.037

The Moral

Avoid looping if you can repeat with vectorization.