A monkey typing randomly manages to type out a line of Shakespeare. What is the chance that the monkey would produce this particular line?
A monkey typing randomly manages to type out a line of Shakespeare. What is the chance that the monkey would produce this particular line?
We are interested in finding the probability that a specified event will occur when a random process takes place.
| Random Process | Event |
|---|---|
| selecting three balls | at least two of them are red |
| March Madness | your team wins it all |
| ten years in Florida | category 5 storm hits at least once |
Probability of an Event
The long-run proportion of times that the event will occur if the underlying random process is repeated many, many times.
Random Variable
A number whose value depends on the outcome of a chance process.
| Random Process | Random Variable \(X\) |
|---|---|
| selecting three balls | the number of red balls you get |
| March Madness | 1 (your team wins) 0 (your team does not win) |
| ten years in Florida | number of Category 5 storms that hit in that time |
| ten years in Florida | 1 (Cat-5 hits at least once) 0 (never hits) |
| one year in Florida | total insurance claims filed in that year |
| Kentucky Derby 2017 | your winnings ($) |
What can you expect \(X\) to be, on average, in the long run?
Expected Value of a Random Variable
The long-run average of the values of a random variable, where the underlying chance process is repeated many, many times.
Say you want the probability of being hit by a Cat-5 storm in the next ten years.
Procedure:
n <- a really big number
for (i in 1:n) {
live in FL ten years
record whether or not a Cat-5 hit you (1 = hit at least once, 0 = not)
}
probability <- (number of 1s you recorded)/n
cat("The chance of getting hit is ", probability, ".", sep = "")
The bigger \(n\) is, the better your estimate of the probability should be.
Say you want the expected number of red balls, when you pick three from the urn.
Procedure:
n <- a really big number
for (i in 1:n) {
set up your urn (70 reds, 30 blacks)
pick three balls at random
record the number of reds you got
}
expected_value <- (total reds picked)/n
cat("The expected number of reds is ", expected value, ".", sep = "")
The bigger \(n\) is, the better your estimate of the expected value should be.
Same with drawing from the urn!
The solution is to make the computer model (simulate) the random process and keep track of the results.
Repetition will deliver results of each "trial" of the modeled random process.
This is called Monte Carlo simulation.
Consider a box that holds ten tickets. Three of them are labeled with the letter "a"; the rest are labeled "b".
Our random process:
We will draw one ticket from the box at random. (Each ticket is equally likely to be picked.)
Our probability question:
What is the probability of drawing an "a"-ticket?
tickets <- c(rep("a", 3), rep("b", 7))
tickets
## [1] "a" "a" "a" "b" "b" "b" "b" "b" "b" "b"
We want to model (simulate) the random process of picking a ticket from the box.
We need a good function for this.
We need a good R-function, Mr. Frodo!
sample() FunctionTry this:
sample(tickets, size = 1)
Try it a few more times.
size says how many times to repeat the samplingTo get an estimate of probability, we need to repeat the random process a number of times. (Let's do 20 times.)
You could always do this with a loop:
sims <- numeric(20)
for ( i in 1:20 ) {
sims[i] <- sample(tickets, size = 1)
}
sims
## [1] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "a" "b" "b" "b" "b" "b" ## [18] "b" "b" "b"
Actually, sample() can handle the repetition for you:
sims <- sample(tickets, size = 20, replace = TRUE) sims
## [1] "b" "b" "b" "a" "b" "b" "b" "b" "b" "b" "a" "a" "b" "b" "b" "b" "b" ## [18] "b" "a" "a"
replace says whether or not to replace the item you picked, after you sampleWe can summarize the results in a table with the table() function:
table(sims)
## sims ## a b ## 5 15
The prop.table() function will compute proportions for us:
prop.table(table(sims))
## sims ## a b ## 0.25 0.75
We figure the chance of drawing an "a"-ticket is about 0.25.
How about repeating more times, say 10,000 times?
sims <- sample(tickets, size = 10000, replace = TRUE) prop.table(table(sims))
## sims ## a b ## 0.2981 0.7019
With more repetitions, we get a better estimate.
The Law of Large Numbers says:
The more times you repeat the underlying random process, the more accurate your estimate of the desired probability (or expected value) is liable to be.
You are about to play a game: you will flip a fair coin twice.
Let \(W\) be the number of dollars you will win.
\(W\) is a random variable. What is its expected value?
When you flip a fair coin twice, there are four equally likely outcomes:
Hence you have:
Distribution of a Random Variable
A statement of the probabilities for a random variable to assume its various possible values.
| \(w\) | \(P(W = w)\) |
|---|---|
| -1 | 0.25 |
| 0 | 0.5 |
| 2 | 0.25 |
Back to Monte Carlo simulation. sample() is still good for this!
# model the process: w <- c(-1, 0, 2) pw <- c(0.25, 0.50, 0.25) # repeat and store results: winnings <- sample(w, size = 10, replace = TRUE, prob = pw) winnings
## [1] 0 2 -1 0 0 0 2 0 0 0
prob specifies the chance of getting each of the elements of the first argument wmean(winnings)
## [1] 0.3
This is our estimate of the expected value of \(W\).
How about repeating 10,000 times?
winnings <- sample(w, size = 10000, replace = TRUE, prob = pw) mean(winnings)
## [1] 0.2456
In a simple game like this, the expected value of \(W\) can be computed exactly, as a weighted average:
\[\sum_{w} wP(W = w) = 0.25 \times -1 + 0.50 \times 0 + 0.25 \times 2 = 0.25.\]
R can compute \(\sum_{w} wP(W = w)\):
sum(w * pw)
## [1] 0.25
But usually the random process is so complex that exact computations are difficult or impossible.
Recall the function for deciding if three side-lengths can make a triangle:
isTriangle <- function(x, y, z) {
(x + y > z) & (x +z > y) & (y + z > x)
}
Recall that it worked on many triangles at once.
# sides for six triangles: a <- c(2, 4.7, 5.2, 6, 6, 9) b <- c(4, 1, 2.8, 6, 6, 3.5) c <- c(5, 3.8, 12, 13, 11, 6.2) # first triangle has sides 2, 4 and 5, etc. isTriangle(a, b, c)
## [1] TRUE TRUE FALSE FALSE TRUE TRUE
Probability Question:
What is the probability that the three pieces will form a triangle?
x <- runif(1) # the first break y <- runif(1) # the second break c(x = x, y = y)
## x y ## 0.6469028 0.3942258
a <- min(x, y); b <- max(x, y) c(left = 0, a = a, b = b, right = 1)
## left a b right ## 0.0000000 0.3942258 0.6469028 1.0000000
piece1 <- a - 0; piece2 <- b - a ; piece3 <- 1 - b c(piece1 = piece1, piece2 = piece2, piece3 = piece3)
## piece1 piece2 piece3 ## 0.3942258 0.2526771 0.3530972
isTriangle(piece1, piece2, piece3)
## [1] TRUE
runif() and isTriangle() do vectorization, so they can handle repetition. Let's repeat the random process eight times.
x <- runif(8) # the first breaks x
## [1] 0.64690284 0.39422576 0.61850181 0.47689114 0.13609719 0.06738439 ## [7] 0.12915262 0.39311793
y <- runif(8) # the second breaks y
## [1] 0.002582699 0.620205954 0.764414018 0.743835758 0.826165695 0.422729083 ## [7] 0.409287665 0.539692614
a <- pmin(x, y) # pairwise min gets left breaks a
## [1] 0.002582699 0.394225758 0.618501814 0.476891136 0.136097186 0.067384386 ## [7] 0.129152617 0.393117930
b <- pmax(x, y) # pairwise max gets right breaks b
## [1] 0.6469028 0.6202060 0.7644140 0.7438358 0.8261657 0.4227291 0.4092877 ## [8] 0.5396926
side1 <- a # leftmost sides side2 <- b - a # middle sides side3 <- 1 - b # rigtmost sides
isTriangle(side1, side2, side3)
## [1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
This function does the job, straight from the original breaks:
makesTriangle <- function(x, y) {
a <- pmin(x, y)
b <- pmax(x, y)
side1 <- a
side2 <- b - a
side3 <- 1 - b
isTriangle(x = side1, y = side2, z = side3)
}
makesTriangle(x, y)
## [1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
| x | y | side1 | side2 | side3 | triangle |
|---|---|---|---|---|---|
| 0.6469028 | 0.0025827 | 0.0025827 | 0.6443201 | 0.3530972 | FALSE |
| 0.3942258 | 0.6202060 | 0.3942258 | 0.2259802 | 0.3797940 | TRUE |
| 0.6185018 | 0.7644140 | 0.6185018 | 0.1459122 | 0.2355860 | FALSE |
| 0.4768911 | 0.7438358 | 0.4768911 | 0.2669446 | 0.2561642 | TRUE |
| 0.1360972 | 0.8261657 | 0.1360972 | 0.6900685 | 0.1738343 | FALSE |
| 0.0673844 | 0.4227291 | 0.0673844 | 0.3553447 | 0.5772709 | FALSE |
| 0.1291526 | 0.4092877 | 0.1291526 | 0.2801350 | 0.5907123 | FALSE |
| 0.3931179 | 0.5396926 | 0.3931179 | 0.1465747 | 0.4603074 | TRUE |
triangle <- makesTriangle(x, y) sum(triangle)
## [1] 3
mean(triangle)
## [1] 0.375
triangleSim <- function(reps = 10000 ) {
cut1 <- runif(reps)
cut2 <- runif(reps)
triangle <- makesTriangle(cut1, cut2)
cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
}
triangleSim <- function(reps = 10000, table = FALSE ) {
cut1 <- runif(reps)
cut2 <- runif(reps)
triangle <- makesTriangle(cut1, cut2)
if ( table ) {
cat("Here is a table of the results:\n")
print(table(triangle))
cat("\n")
}
cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
}
triangleSim(table = TRUE) # use default 10000 reps
## Here is a table of the results: ## triangle ## FALSE TRUE ## 7466 2534 ## ## The proportion of triangles was 0.2534.
Let's add a seed parameter to our simulation function:
triangleSim <- function(reps = 10000, table = FALSE, seed ) {
set.seed(seed)
cut1 <- runif(reps)
cut2 <- runif(reps)
triangle <- makesTriangle(cut1, cut2)
if ( table ) {
cat("Here is a table of the results:\n\n")
print(table(triangle))
cat("\n")
}
cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
}
It would be nice to have the option to call your function without having to provide a seed.
This can be done by setting the default value of seed to NULL, and adding a check in the body of the function.
triangleSim <- function(reps = 10000, table = FALSE, seed = NULL) {
if ( !is.null(seed) ) {
set.seed(seed)
}
cut1 <- runif(reps)
cut2 <- runif(reps)
triangle <- makesTriangle(cut1, cut2)
if ( table ) {
cat("Here is a table of the results:\n\n")
print(table(triangle))
cat("\n")
}
cat("The proportion of triangles was ", mean(triangle), ".\n", sep = "")
}
triangleSim(seed = 3030)
## The proportion of triangles was 0.2547.
triangleSim()
## The proportion of triangles was 0.2524.
Encapsulation and Generalization
A method for developing a computer program according to which the programmer first designs a basic procedure, then encapsulates it in one or more functions, and then progressively generalizes these functions until the program possesses all desired features.
Anna and Raj make a date for coffee tomorrow at the local Coffee Shop. After making the date, both of them forget the exact time they agreed to meet: they can only remember that it was to be sometime between 10 and 11am.
Each person, independently of the other, randomly picks a time between 10 and 11 to arrive.
Each person is willing to wait ten minutes for the other person to arrive.
What is the chance that they meet?
The random process is:
From 10 to 11am, watch for Anna and Raj enter/leave the coffeeshop.
Modeling involves Anna and Raj's arrival times (in minutes after 10am):
anna <- runif(1, min = 0, max = 60) raj <- runif(1, min = 0, max = 60)
Anna and Raj will connect if their arrival times are no more than 10 minutes apart.
That is, they connect if and only if
\[-10 < anna - raj < 10,\]
which is the same as
\[\vert\, anna - raj \,\vert < 10.\]
c(anna = anna, raj = raj)
## anna raj ## 12.58343 28.44067
anna - raj
## [1] -15.85724
abs(anna - raj)
## [1] 15.85724
They do not connect.
meetupSim <- function(reps = 10000, table = FALSE, seed = NULL) {
if ( !is.null(seed) ) {
set.seed(seed)
}
anna <- runif(reps, 0, 60); raj <- runif(reps, 0, 60)
connect <- (abs(anna - raj) < 10)
if ( table ) {
cat("Here is a table of the results:\n\n")
print(table(connect))
cat("\n")
}
cat("The proportion of times they met was ", mean(connect), ".\n", sep = "")
}
meetupSim(table = TRUE, seed = 3939)
## Here is a table of the results: ## ## connect ## FALSE TRUE ## 6955 3045 ## ## The proportion of times they met was 0.3045.
When we repeat our simulation of a random process, we could use a loop.
So far, we have used vectorization to avoid that.
Does it make a difference?
meetupSim2 <- function(reps = 10000, table = FALSE, seed = NULL) {
if ( !is.null(seed) ) {
set.seed(seed)
}
connect <- numeric(reps) # set up vector to hold results
for ( i in 1:reps ) { # loop to repeat random process
anna <- runif(1, 0, 60) # get one arrival time for anna
raj <- runif(1, 0, 60) # and a time for raj
connect[i] <- (abs(anna - raj) < 10) # put result in vector
}
if ( table ) {
cat("Here is a table of the results:\n\n")
print(table(connect))
cat("\n")
}
cat("The proportion of times they met was ", mean(connect), ".\n", sep = "")
}
system.time() records time to perform a task.
system.time(meetupSim(reps = 10000, seed = 4040))
## The proportion of times they met was 0.3079.
## user system elapsed ## 0.005 0.000 0.006
system.time(meetupSim2(reps = 10000, seed = 4040))
## The proportion of times they met was 0.2993.
## user system elapsed ## 0.034 0.002 0.037
Avoid looping if you can repeat with vectorization.