3.1 Discrete probability

 

Monte Carlo simulations for categorical data

We have a box with 5 beads: 2 blue and 3 red.

beads <- rep(c("blue", "red"), times = c(2, 3))
beads
## [1] "blue" "blue" "red"  "red"  "red"
# picking one at random

sample(beads,1)
## [1] "blue"
# repeating the draw 10,000 times

t <- 10000
subs <- replicate(t, sample(beads, 1))

# results of the draws:

table(subs)
## subs
## blue  red 
## 3944 6056
# proportion

prop.table(table(subs))
## subs
##   blue    red 
## 0.3944 0.6056

The values are very close to the true proportions of 40% and 60%.

 

sample function

The sample function can be used directly, without replicate, to repeat the same experiment of picking 1 of the 5 beads continually under the same conditions, by sampling with replacement:

B <- 10000
events <- sample(beads, B, replace = TRUE)
prop.table(table(events))
## events
##   blue    red 
## 0.3982 0.6018

We get results very similar to those previously obtained with replicate.

 

Conditional probabilities

When events are not independent, conditional probabilities depend on what has already happened. For example, when drawing two cards without replacement:

Pr(Card 2 is King | Card 1 was a King) = 3 / 51

When two events are independent

Pr(A|B) = Pr(A)
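A quick Monte Carlo sketch (not from the original notes, base R only) checking the 3/51 result: deal two cards without replacement many times and condition on the first being a King.

deck_ranks <- rep(c("King", "Other"), times = c(4, 48))
draws <- replicate(10000, sample(deck_ranks, 2))   # each column is one deal
first_king <- draws[1, ] == "King"
mean(draws[2, first_king] == "King")   # should be close to 3/51, about 0.059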

 

Addition and multiplication rules

 

Multiplication rule

Pr(A and B) = Pr(A) * Pr(B|A)

Pr(A and B and C) = Pr(A) * Pr(B|A) * Pr(C|A and B)

 

Multiplication rule under independence

Pr(A and B and C) = Pr(A) * Pr(B) * Pr(C)

General rule for computing conditional probabilities

Pr(B|A) = Pr(A and B) / Pr(A)

 

Addition rule

Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B)
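A small worked example (an assumed illustration): the probability of drawing a King or a Heart from a 52-card deck.

# Pr(King) + Pr(Heart) - Pr(King of Hearts)
4/52 + 13/52 - 1/52   # = 16/52, about 0.308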

 

Combinations and permutations

 

paste

We use paste to create strings by joining other strings. That’s how we can build card names, for example:

number <- "Three"
suit <- "Hearts"
paste(number, suit)
## [1] "Three Hearts"

Another example

paste(letters[1:5], as.character(1:5))
## [1] "a 1" "b 2" "c 3" "d 4" "e 5"

 

expand.grid()

The function expand.grid gives us all the combinations of entries of two vectors. For example, if you have blue and black pants and white, grey, and plaid shirts, all your combinations are:

expand.grid(pants = c("blue", "black"), shirt = c("white", "grey", "plaid"))
##   pants shirt
## 1  blue white
## 2 black white
## 3  blue  grey
## 4 black  grey
## 5  blue plaid
## 6 black plaid
# Building a card deck

suits <- c("Diamonds", "Clubs", "Hearts", "Spades")
numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven", 
             "Eight", "Nine", "Ten", "Jack", "Queen", "King")
deck <- expand.grid(number = numbers, suit = suits)
deck <- paste(deck$number, deck$suit)
head(deck)
## [1] "Ace Diamonds"   "Deuce Diamonds" "Three Diamonds" "Four Diamonds" 
## [5] "Five Diamonds"  "Six Diamonds"
# probability of drawing a King from the deck

kings <- paste("King", suits)
kings
## [1] "King Diamonds" "King Clubs"    "King Hearts"   "King Spades"
mean(deck %in% kings)
## [1] 0.07692308
# 5 random phone numbers of 7 digits

library(gtools)  # provides permutations() and combinations()
all_phone_numbers <- permutations(10, 7, v = 0:9)
n <- nrow(all_phone_numbers)
index <- sample(n, 5)
all_phone_numbers[index,]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    4    5    7    0    9    1    2
## [2,]    9    8    6    5    1    0    2
## [3,]    2    3    6    9    5    4    0
## [4,]    6    7    1    0    4    5    3
## [5,]    7    1    4    8    5    6    3

To compute all possible ways we can choose two cards when the order matters, we type:

hands <- permutations(52, 2, v = deck)

# This is a matrix with two columns and 2652 rows. 

head(hands)
##      [,1]        [,2]            
## [1,] "Ace Clubs" "Ace Diamonds"  
## [2,] "Ace Clubs" "Ace Hearts"    
## [3,] "Ace Clubs" "Ace Spades"    
## [4,] "Ace Clubs" "Deuce Clubs"   
## [5,] "Ace Clubs" "Deuce Diamonds"
## [6,] "Ace Clubs" "Deuce Hearts"

With a matrix we can get the first and second cards like this (each expression returns an entire column):

first_card <- hands[,1]
second_card <- hands[,2]

Now the cases for which the first card was a King can be computed like this:

kings <- paste("King", suits)
sum(first_card %in% kings)
## [1] 204

To get the conditional probability, we compute what fraction of these have a King in the second card:

sum(first_card %in% kings & second_card %in% kings) / sum(first_card %in% kings)
## [1] 0.05882353

which is exactly 3/51, as we had already deduced. Notice that the code above is equivalent to:

mean(first_card %in% kings & second_card %in% kings) /
  mean(first_card %in% kings)
## [1] 0.05882353

which uses mean instead of sum and is an R version of:

Pr(A and B) / Pr(A)
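The blackjack computation below uses combinations() instead of permutations(), both from gtools. A quick illustration of the difference (an added sketch):

permutations(3, 2)   # order matters: both (1,2) and (2,1) appear, 6 rows
combinations(3, 2)   # order does not matter: only (1,2) appears, 3 rows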

 

Probability of a natural 21 in blackjack

aces <- paste("Ace", suits)
facecard <- c("King", "Queen", "Jack", "Ten")
facecard <- expand.grid(number = facecard, suit = suits)
facecard <- paste(facecard$number, facecard$suit)

hands <- combinations(52, 2, v=deck) # all possible hands

# probability of a natural 21 given that the ace is listed first in combinations

mean(hands[,1] %in% aces & hands[,2] %in% facecard)
## [1] 0.04826546
# probability of a natural 21 checking for both ace first and ace second

mean((hands[,1] %in% aces & hands[,2] %in% facecard)|(hands[,2] %in% aces & hands[,1] %in% facecard))
## [1] 0.04826546

 

The birthday problem

Birthdays can be represented as a number from 1 to 365 (we exclude February 29):

n <- 50
sample(1:365,n,replace=TRUE)
##  [1] 127  29 171 338 171 273 255 136 162   3  46  70 122 231 162 265 246 144  47
## [20] 229 323 313 317 254 344  86 180 199  14 150  22 261 161  37 274 146 307  79
## [39] 153  68 350  49  88 111 205 141 239 217  94   1

Monte Carlo simulation

B <- 10000

same_birthday <- function (n) {
  bdays <- sample(1:365,n,replace=TRUE)
  any(duplicated(bdays))
}

results <- replicate(B, same_birthday(50))

mean(results)
## [1] 0.9669

In a group of 50 people there is roughly a 97% probability of at least 2 people having the same birthday.

Say we want to use this knowledge to bet with friends about two people having the same birthday in a group of people. When are the chances larger than 50%? Larger than 75%?

Let’s create a look-up table. We can quickly create a function to compute this for any group size:

compute_prob <- function(n, B=10000){
  results <- replicate(B, same_birthday(n))
  mean(results)
}

Using the function sapply, we can apply compute_prob element-wise to a vector:

n <- seq(1,60)

# n is a vector of 1 to 60 (groups of 1 to 60 people)
# with sapply we can run the function for each of the elements in n

prob <- sapply(n, compute_prob)

We can now make a plot of the estimated probabilities of two people having the same birthday in a group of size n:

library(ggplot2)  # for qplot
qplot(n, prob)

With exact statistical formulas: the probability that all n birthdays are distinct is (365/365) * (364/365) * ... * ((365-n+1)/365), so the probability of at least one shared birthday is 1 minus this product:

exact_prob <- function(n){
  prob_unique <- seq(365,365-n+1)/365 
  1 - prod( prob_unique)
}
eprob <- sapply(n, exact_prob)
qplot(n, prob) + geom_line(aes(n, eprob), col = "red")

 

Monty Hall problem

We run a Monte Carlo simulation for the strategy of not switching doors:

B <- 10000
stick <- replicate(B, {
    doors <- as.character(1:3)
    prize <- sample(c("car","goat","goat"))   #random prizes
    prize_door <- doors[prize == "car"]     # prize door with car
    my_pick  <- sample(doors, 1)            #random pick
    show <- sample(doors[!doors %in% c(my_pick, prize_door)],1)   
    stick <- my_pick    
    stick == prize_door    
})
mean(stick)    
## [1] 0.3377

Monte Carlo simulation including switching doors:

B <- 10000
switch <- replicate(B, {
  doors <- as.character(1:3)
  prize <- sample(c("car","goat","goat"))   #random prizes
  prize_door <- doors[prize == "car"]       # prize door with car
  my_pick  <- sample(doors, 1)          #random pick
  show <- sample(doors[!doors %in% c(my_pick, prize_door)],1)   
  switch <- doors[!doors %in% c(my_pick,show)]    
  switch == prize_door    
})

mean(switch) 
## [1] 0.6512
mean(stick) 
## [1] 0.3377

Switching doors doubles the probability of winning the prize.

 

3.2 Continuous probability

library(tidyverse)
library(dslabs)
data(heights)
x <- heights %>% filter(sex == "Male") %>% pull(height)

We defined the empirical cumulative distribution function (eCDF) as:

F <- function(a) mean(x<=a)

which, for any value a, gives the proportion of values in the list x that are smaller than or equal to a.

If I pick one of the male students at random, what is the chance that he is taller than 70 inches? Because every student has the same chance of being picked, the answer is equivalent to the proportion of students taller than 70 inches. Using the eCDF we obtain the answer by typing:

1 - F(70)
## [1] 0.3768473

 

Theoretical continuous distributions

We say that a random quantity is normally distributed with average m and standard deviation s if its probability distribution is defined by:

F(a) = pnorm(a, m, s)

This is useful because if we are willing to use the normal approximation for, say, height, we don’t need the entire dataset to answer questions such as: what is the probability that a randomly selected student is taller than 70.5 inches? We just need the average height and standard deviation:

x <- heights$height[heights$sex=="Male"]
m <- mean(x)
s <- sd(x)
1 - pnorm(70.5, m, s)
## [1] 0.371369

 

The probability density

The probability of a single value is not defined for a continuous distribution.

The quantity with the most similar interpretation to the probability of a single value is the probability density function f(x).

The probability density f(x) is defined such that integrating f(x) over a range gives the probability of that range; in particular:

F(a) = Pr(X ≤ a) = ∫_{-∞}^{a} f(x) dx

In R, the probability density function for the normal distribution is given by dnorm(). We will see uses of dnorm() in the future.

Note that dnorm() gives the density function and pnorm() gives the distribution function, which is the integral of the density function.
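As a quick numerical check (not from the original notes, reusing m and s from above), integrating dnorm() up to a value reproduces pnorm():

a <- 70.5
integrate(function(x) dnorm(x, m, s), -Inf, a)$value   # integral of the density
pnorm(a, m, s)                                         # should match closely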

 

3.3 Random variables, sampling and CLT

 

Random variables

beads <- rep( c("red", "blue"), times = c(2,3))
X <- ifelse(sample(beads, 1) == "blue", 1, 0)

Here X is a random variable: every time we select a new bead the outcome changes randomly. See below:

ifelse(sample(beads, 1) == "blue", 1, 0)
## [1] 0
ifelse(sample(beads, 1) == "blue", 1, 0)
## [1] 0
ifelse(sample(beads, 1) == "blue", 1, 0)
## [1] 1

 

Sampling models

We are going to define a random variable S that will represent the casino’s total winnings. Let’s start by constructing the urn. A roulette wheel has 18 red pockets, 18 black pockets and 2 green ones. So playing a color in one game of roulette is equivalent to drawing from this urn:

color <- rep(c("Black", "Red", "Green"), c(18, 18, 2))

The 1,000 outcomes from 1,000 people playing are independent draws from this urn. If red comes up, the gambler wins and the casino loses a dollar, so we draw a -$1. Otherwise, the casino wins a dollar and we draw a $1. To construct our random variable S, we can use this code:

n <- 1000
X <- sample(ifelse(color == "Red", -1, 1),  n, replace = TRUE)
X[1:10]
##  [1] -1 -1  1  1  1  1  1 -1  1  1

Because we know the proportions of 1s and -1s, we can generate the draws with one line of code, without defining color:

X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))

We call this a sampling model since we are modeling the random behavior of roulette with the sampling of draws from an urn. The total winnings S is simply the sum of these 1,000 independent draws:

X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
S <- sum(X)
S
## [1] 92

 

Probability distribution of a random variable

We can estimate the distribution function for the random variable S by using a Monte Carlo simulation to generate many realizations of the random variable. With this code, we run the experiment of having 1,000 people play roulette, over and over, specifically B = 10,000 times:

n <- 1000
B <- 10000
roulette_winnings <- function(n){
  X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
  sum(X)
}
S <- replicate(B, roulette_winnings(n))

Now we can ask the following: in our simulations, how often did we get sums less than or equal to a?

This will be a very good approximation of F(a) and we can easily answer the casino’s question: how likely is it that we will lose money? We can see it is quite low:

mean(S<0)
## [1] 0.0414

 

Expected value and standard error

B <- 10^6
x <- sample(c(-1,1), B, replace = TRUE, prob=c(9/19, 10/19))
mean(x)
## [1] 0.052044

In general, if the urn has two possible outcomes, say a and b, with proportions p and (1-p) respectively, the average is:

E[X] = ap + b(1-p)

and the standard error is:

SE[X] = |b - a| * sqrt(p(1-p))

In the roulette example

2 * sqrt(90)/19
## [1] 0.998614
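As an added sanity check (a sketch reusing the one million draws x generated above), the sample standard deviation should be close to this formula value:

sd(x)   # should be close to 2 * sqrt(90)/19, about 0.9986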

The standard error tells us the typical difference between a random variable and its expectation. Since one draw is obviously the sum of just one draw, we can use the formula above to calculate that the random variable defined by one draw has an expected value of 0.05 and a standard error of about 1. This makes sense since we either get 1 or -1, with 1 slightly favored over -1.

Using the formula above, the sum of 1,000 people playing has standard error of about $32:

n <- 1000
sqrt(n) * 2 * sqrt(90)/19
## [1] 31.57895

As a result, when 1,000 people bet on red, the casino is expected to win $50 with a standard error of $32. It therefore seems like a safe bet. But we still haven’t answered the question: how likely is it to lose money? Here the CLT will help.

 

Central Limit Theorem

We previously ran this Monte Carlo simulation:

n <- 1000
B <- 10000
roulette_winnings <- function(n){
  X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
  sum(X)
}
S <- replicate(B, roulette_winnings(n))

The Central Limit Theorem (CLT) tells us that the sum S is approximated by a normal distribution. Using the formulas above, we know that the expected value and standard error are:

n * (20-18)/38 
## [1] 52.63158
sqrt(n) * 2 * sqrt(90)/19 
## [1] 31.57895

The theoretical values above match those obtained with the Monte Carlo simulation:

mean(S)
## [1] 52.8096
sd(S)
## [1] 31.9207

Using the CLT, we can skip the Monte Carlo simulation and instead compute the probability of the casino losing money using this approximation:

mu <- n * (20-18)/38
se <-  sqrt(n) * 2 * sqrt(90)/19 
pnorm(0, mu, se)
## [1] 0.04779035

which is also in very good agreement with our Monte Carlo result:

mean(S < 0)
## [1] 0.0449

 

3.4 Case study: The Big Short

set.seed(1)

# the bank loses $200,000 on each foreclosure (loan not paid back)
# each client has a 2% chance of defaulting
# we assume no profit is made from clients who pay back

n <- 1000
loss_per_foreclosure <- -200000
p <- 0.02 
defaults <- sample( c(0,1), n, prob=c(1-p, p), replace = TRUE)
sum(defaults * loss_per_foreclosure)
## [1] -4600000
# The sum above is random
# Monte Carlo for distribution of losses

B <- 10000

losses <- replicate(B,{
  defaults <- sample(c(0,1),n,replace=TRUE,prob=c(1-p,p))
  sum(defaults*loss_per_foreclosure)
})

mean(losses)
## [1] -4003200
sd(losses)
## [1] 884943
# With CLT

n*(p*loss_per_foreclosure+(1-p)*0)
## [1] -4e+06
sqrt(n)*abs(loss_per_foreclosure)*sqrt(p*(1-p))
## [1] 885437.7
# adding an amount x to each loan (paid by clients who repay) so that the
# expected loss is ZERO:
#   loss_per_foreclosure*p + x*(1-p) = 0
#   x = -loss_per_foreclosure*p / (1-p)

-loss_per_foreclosure*p/(1-p)/200000   # as an interest rate on a $200,000 loan
## [1] 0.02040816
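As a quick sanity check (an added sketch, reusing p and loss_per_foreclosure from above), with this amount x charged on each repaid loan the expected value per loan is zero:

x <- -loss_per_foreclosure*p/(1-p)   # amount charged per repaid loan
loss_per_foreclosure*p + x*(1-p)     # expected value per loan: 0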
qnorm(0.01)
## [1] -2.326348
# Quantity to make sure the bank loses money only 1 in 100 times

l <- loss_per_foreclosure
z <- qnorm(0.01)
x <- -l*( n*p - z*sqrt(n*p*(1-p)))/ ( n*(1-p) + z*sqrt(n*p*(1-p)))

x/180000   # as an interest rate on a $180,000 loan
## [1] 0.03471767
# the interest rate is about 0.035, or 3.5%

# Expected profit per loan (about $2,124)

EP <- loss_per_foreclosure*p + x*(1-p)


# Total expected profit (about $2.1 million)

TEP <- n*(l*p + x*(1-p))


# Monte Carlo simulation of the total profit

B <- 10000

profit <- replicate(B,{
  draws <- sample(c(x,l),n,replace = TRUE, prob=c(1-p,p))
  sum(draws)
})
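To summarize the simulated distribution (an added sketch, not in the original notes), we can look at the average profit and the estimated probability of losing money; by construction of x, the latter should be about 0.01:

mean(profit)       # should be close to the total expected profit TEP
mean(profit < 0)   # estimated probability of losing money, about 0.01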


# Employee suggests increasing n even if p doubles from 2% defaults to 4%

p <- 0.04

# new break-even interest rate: l*p + x*(1-p) = 0
# x = -l*p/(1-p)
# r = x / 180000

r <- -l*p/(1-p)/180000
r
## [1] 0.0462963
# r is about 0.0463
# if we set r to 5%, the expected profit per loan is:

r <- 0.05
x <- r*180000
NewEP <- l*p+(r*180000)*(1-p)
NewEP
## [1] 640
# what n do we need to keep the probability of losing money at 1%?

z <- qnorm(0.01)
n <- ceiling((z^2*(x-l)^2*p*(1-p))/(l*p + x*(1-p))^2)
n
## [1] 22163
# 22163 loans reduce the probability of losing money to 1%,
# and the total expected profit would be $14,184,320

n*(l*p+x*(1-p))
## [1] 14184320
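Where the formula for n comes from (a sketch of the derivation, using the notation above): each loan has expected profit mu = l*p + x*(1-p) and standard error sigma = |x-l| * sqrt(p(1-p)), so by the CLT the sum of n loans is approximately normal with expected value n*mu and standard error sqrt(n)*sigma. Requiring Pr(S < 0) = 0.01 means z = (0 - n*mu) / (sqrt(n)*sigma), i.e. -sqrt(n)*mu/sigma = z, and solving for n gives n = z^2 * sigma^2 / mu^2, the expression used above. A quick check (assumed variable names, reusing l, x, p, z and n):

mu_loan <- l*p + x*(1-p)                  # expected profit per loan
sigma_loan <- abs(x - l) * sqrt(p*(1-p))  # standard error per loan
pnorm(0, n*mu_loan, sqrt(n)*sigma_loan)   # should be about 0.01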
# Now we drop the wrong assumption that borrower defaults are independent.
# We keep the default probability at 4%, but assume that, with 50-50 chance,
# all default probabilities shift together up or down to somewhere between
# 0.03 and 0.05. The shift hits everybody at once, not just one person.


p <- 0.04

x <- 0.05*180000

profit <- replicate(B, {
  new_p <- 0.04 + sample(seq(-0.01, 0.01, length = 100), 1)
  draws <- sample( c(x, loss_per_foreclosure), n, prob=c(1-new_p, new_p), replace = TRUE) 
  sum(draws)
})

mean(profit)
## [1] 14025355
# the expected profit is still very large, but the probability of losing money is now about 35%

mean(profit<0)
## [1] 0.3468

 

EXERCISES

 

ACT

The ACT is a standardized college admissions test used in the United States. The four multi-part questions in this assessment all involve simulating some ACT test scores and answering probability questions about them.

For the three-year period 2016–2018, ACT standardized test scores were approximately normally distributed with a mean of 20.9 and standard deviation of 5.7. (Real ACT scores are integers between 1 and 36, but we will ignore this detail and use continuous values instead.)

First we’ll simulate an ACT test score dataset and answer some questions about it.

Set the seed to 16, then use rnorm() to generate a normal distribution of 10000 tests with a mean of 20.9 and standard deviation of 5.7. Save these values as act_scores. You’ll be using this dataset throughout these four multi-part questions.

# set seed to 16
set.seed(16)

# generate a random normal distribution of 10,000 tests
# mean 20.9 and sd of 5.7

act_scores <- rnorm(10000,mean=20.9, sd=5.7)

1a What is the mean of act_scores?

avg_act <- mean(act_scores)
avg_act
## [1] 20.84012

1b What is the standard deviation of act_scores?

sd_act <- sd(act_scores)
sd_act
## [1] 5.675237

1c A perfect score is 36 or greater (the maximum reported score is 36).

In act_scores, how many perfect scores are there out of 10,000 simulated tests?

sum(act_scores >= 36)   # equivalently, mean(act_scores >= 36)*10000
## [1] 41

1d In act_scores, what is the probability of an ACT score greater than 30?

1 - pnorm(30,avg_act,sd_act)
## [1] 0.05326283

1e In act_scores, what is the probability of an ACT score less than or equal to 10?

pnorm(10, avg_act,sd_act)
## [1] 0.02806186

2 Set x equal to the sequence of integers 1 to 36. Use dnorm to determine the value of the probability density function over x given a mean of 20.9 and standard deviation of 5.7; save the result as f_x. Plot x against f_x.

Which of the following plots is correct?

x <- 1:36
f_x <- dnorm(x, 20.9, 5.7)   # the exercise specifies mean 20.9 and sd 5.7
plot(x, f_x)

3a What is the probability of a Z-score greater than 2 (2 standard deviations above the mean)?

# scaling act_scores
z_act_scores <- scale(act_scores)

# probability of z >= 2
mean(z_act_scores>=2)
## [1] 0.0233
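Equivalently (an added check using the sample mean and standard deviation computed earlier), we can standardize by hand:

mean((act_scores - avg_act)/sd_act >= 2)   # same proportion as with scale()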

3b What ACT score value corresponds to 2 standard deviations above the mean (Z = 2)?

avg_act + 2*sd_act
## [1] 32.1906

3c A Z-score of 2 corresponds roughly to the 97.5th percentile.

Use qnorm() to determine the 97.5th percentile of normally distributed data with the mean and standard deviation observed in act_scores.

What is the 97.5th percentile of act_scores?

qnorm(0.975, avg_act, sd_act)
## [1] 31.96338

4a What is the minimum integer score such that the probability of that score or lower is at least .95?

qnorm(.95,avg_act,sd_act)
## [1] 30.17506
# rounding up, the answer is 31

4b Use qnorm() to determine the expected 95th percentile, the value for which the probability of receiving that score or lower is 0.95, given a mean score of 20.9 and standard deviation of 5.7.

What is the expected 95th percentile of ACT scores?

qnorm(.95, 20.9, 5.7)
## [1] 30.27567

4c

As discussed in the Data Visualization course, we can use quantile() to determine sample quantiles from the data.

Make a vector containing the quantiles for p <- seq(0.01, 0.99, 0.01), the 1st through 99th percentiles of the act_scores data. Save these as sample_quantiles.

In what percentile is a score of 26?

p <- seq(0.01, 0.99, 0.01)
sample_quantiles <- quantile(act_scores,p)

# to determine percentile of a score of 26
names(sample_quantiles[max(which(sample_quantiles < 26))])
## [1] "82%"
# which(sample_quantiles < 26) returns the indices of all percentiles whose quantile value is below 26
which(sample_quantiles<26)
##  1%  2%  3%  4%  5%  6%  7%  8%  9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
## 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
## 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
## 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 80% 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
## 81% 82% 
##  81  82
# max(which(sample_quantiles < 26)) isolates the percentile for a score of 26
max(which(sample_quantiles < 26))
## [1] 82
# sample_quantiles[...] returns a named numeric vector
sample_quantiles[max(which(sample_quantiles < 26))]
##      82% 
## 25.99266
# to get the "__%" label we use names() (like getting a column name)
names(sample_quantiles[max(which(sample_quantiles < 26))])
## [1] "82%"

4d Make a corresponding set of theoretical quantiles using qnorm() over the interval p <- seq(0.01, 0.99, 0.01) with mean 20.9 and standard deviation 5.7. Save these as theoretical_quantiles. Make a QQ-plot graphing sample_quantiles on the y-axis versus theoretical_quantiles on the x-axis.

# recap, sample_quantiles <- quantile(act_scores, p)
# now we calculate theoretical qs

p <- seq(0.01, 0.99, 0.01)
theoretical_quantiles <- qnorm(p, 20.9, 5.7)

# to qqplot
qqplot(theoretical_quantiles, sample_quantiles)

The eCDF of act_scores, with the mean and the 5th and 95th percentiles marked:

act_cdf <- sapply(1:36, function (x){
  mean(act_scores <= x)
})

x <- 1:36
y <- act_cdf

ggplot() +
  geom_line(aes(x, y), size = 1) +
  geom_vline(xintercept = sample_quantiles[95], linetype = "dotted", color = "green", size = 1) +
  geom_vline(xintercept = sample_quantiles[5], linetype = "dotted", color = "red", size = 1) +
  geom_vline(xintercept = avg_act, color = "black", alpha = 0.2)

 

SAT Testing

An old version of the SAT college entrance exam had a -0.25 point penalty for every incorrect answer and awarded 1 point for a correct answer. The quantitative test consisted of 44 multiple-choice questions each with 5 answer choices. Suppose a student chooses answers by guessing for all questions on the test.

1a What is the probability of guessing correctly for one question?

# each question 5 choices
p <- 1/5
p
## [1] 0.2

1b What is the expected value of points for guessing on one question?

a <- 1
b <- -0.25
mu <- a*p + b*(1-p)
mu
## [1] 0

1c What is the expected score of guessing on all 44 questions?

n <- 44

n*mu
## [1] 0

1d What is the standard error of guessing on all 44 questions?

# E[X] = n (ap + b(1-p))
# SE[x] = sqrt(n) * |b-a| * sqrt(p(1-p)) 

sigma <- sqrt(n) * abs(b-a) * sqrt(p*(1-p))
sigma
## [1] 3.316625

1e Use the Central Limit Theorem to determine the probability that a guessing student scores 8 points or higher on the test.

1 - pnorm(8, mu, sigma)
## [1] 0.007930666
# there is about a 0.8% probability of scoring 8 or higher, given that correct answers are worth 1 point and incorrect answers -0.25

1f Set the seed to 21, then run a Monte Carlo simulation of 10,000 students guessing on the test.

What is the probability that a guessing student scores 8 points or higher?

set.seed(21)
B <- 10000
n <- 44

# Monte Carlo simulation
exam_scores <- replicate(B,{
  X <- sample(c(1, -0.25), n, replace = TRUE, prob=c(p, 1-p))
  sum(X)
})

# Probability of 8 or higher
mean(exam_scores>=8)
## [1] 0.008

The SAT was recently changed to reduce the number of multiple choice options from 5 to 4 and also to eliminate the penalty for guessing.

2a What is the expected value of the test score when guessing on this new test?

# New conditions
a <- 1
b <- 0    # no penalty for incorrect answer
p <- 0.25 # there are 4 answer choices
n <- 44

# Expected value of 1 question
mu <- (a*p)+(b*(1-p))

# Expected value of test
mu*n
## [1] 11

2b Consider a range of correct answer probabilities p <- seq(0.25, 0.95, 0.05) representing a range of student skills.

What is the lowest p such that the probability of scoring over 35 exceeds 80%?

# mu = n * (a*p + b*(1-p))
# sigma = sqrt(n) * abs(b-a) * sqrt(p*(1-p))
# probability of scoring over 35 is 1 - pnorm(35, mu, sigma)

# we consider different p based on different student skills
p <- seq(0.25, 0.95, 0.05)

exp_val <- sapply(p, function(x) {
  mu <- n* (a*x + b*(1-x))
  sigma <- sqrt(n) * abs(b-a) * sqrt(x*(1-x))
  1 - pnorm(35, mu, sigma)
})

# exp_val holds the probability of scoring over 35 points for each skill level in p (0.25, 0.30, ..., 0.95)

# each p will have a different probability
plot(p, exp_val)

# what is the minimum skill level p for which the probability of scoring over 35 exceeds 80%?

p[which(exp_val > 0.8)] # will return all p that satisfy > 0.8
## [1] 0.85 0.90 0.95
min(p[which(exp_val > 0.8)])
## [1] 0.85

 

Betting on roulette

A casino offers a House Special bet on roulette, which is a bet on five pockets (00, 0, 1, 2, 3) out of 38 total pockets. The bet pays out 6 to 1. In other words, a losing bet yields -$1 and a successful bet yields $6. A gambler wants to know the chance of losing money if he places 500 bets on the roulette House Special.

3a What is the expected value of the payout for one bet?

p <- 5/38
a <- 6
b <- -1

mu <- a*p + b*(1-p)
mu
## [1] -0.07894737
# an expected loss of about 8 cents per bet

3b What is the standard error of the payout for one bet?

sigma <- abs(b-a) * sqrt(p*(1-p))
sigma
## [1] 2.366227

3c What is the expected value of the average payout over 500 bets?

n <- 500

# the expected value of the sum would be n*mu,
# but they ask for the expected value of the average, which is simply mu
mu
## [1] -0.07894737

3d What is the standard error of the average payout over 500 bets?

sigma / sqrt(n)
## [1] 0.1058209

3e What is the expected value of the sum of 500 bets?

mu500 <- n*mu
mu500
## [1] -39.47368
# an expected loss of about $39

3f What is the standard error of the sum of 500 bets?

sd500 <- sqrt(n) * abs(b-a) * sqrt(p*(1-p))
sd500
## [1] 52.91045

3g Use pnorm() with the expected value of the sum and standard error of the sum to calculate the probability of losing money over 500 bets

pnorm(0, mu500, sd500)
## [1] 0.7721805
# there is a 77% chance of losing money over 500 bets
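Equivalently (an added note), since Pr(sum < 0) = Pr(average < 0), the same probability follows from the average payout and its standard error:

pnorm(0, mu, sigma/sqrt(n))   # identical to pnorm(0, mu500, sd500)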