We have a box with 5 beads: 2 blue and 3 red.
beads <- rep(c("blue", "red"), times = c(2, 3))
beads
## [1] "blue" "blue" "red" "red" "red"
# picking one bead at random
sample(beads, 1)
## [1] "blue"
# repeating the draw 10,000 times
t <- 10000
subs <- replicate(t, sample(beads, 1))
# results of the draws:
table(subs)
## subs
## blue red
## 3944 6056
# proportion
prop.table(table(subs))
## subs
## blue red
## 0.3944 0.6056
The values are very close to the true probabilities of 40% and 60%.
The sample function can be used directly, without replicate, to repeat the same experiment of picking 1 of the 5 beads, continually, under the same conditions:
B <- 10000
events <- sample(beads, B, replace = TRUE)
prop.table(table(events))
## events
## blue red
## 0.3982 0.6018
We get results very similar to those previously obtained with replicate.
When events are not independent, conditional probabilities matter, e.g.
Pr(Card 2 is a King | Card 1 was a King) = 3/51
When two events are independent:
Pr(A | B) = Pr(A)
The general multiplication rule:
Pr(A and B) = Pr(A) * Pr(B | A)
Pr(A and B and C) = Pr(A) * Pr(B | A) * Pr(C | A and B)
which, for independent events, simplifies to:
Pr(A and B and C) = Pr(A) * Pr(B) * Pr(C)
General rule for computing conditional probabilities:
Pr(B | A) = Pr(A and B) / Pr(A)
The addition rule:
Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B)
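As a quick sanity check of the multiplication rule, we can compute the probability that the first two cards dealt are both Kings:
# Pr(first is King) * Pr(second is King | first was King)
(4/52) * (3/51)
## [1] 0.004524887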
paste()
We use paste to create strings by combining other strings. That's how we can build cards, for example:
number <- "Three"
suit <- "Hearts"
paste(number, suit)
## [1] "Three Hearts"
Another example:
paste(letters[1:5], as.character(1:5))
## [1] "a 1" "b 2" "c 3" "d 4" "e 5"
expand.grid()
The function expand.grid gives us all the combinations of entries of two vectors. For example, if you have blue and black pants and white, grey, and plaid shirts, all your combinations are:
expand.grid(pants = c("blue", "black"), shirt = c("white", "grey", "plaid"))
## pants shirt
## 1 blue white
## 2 black white
## 3 blue grey
## 4 black grey
## 5 blue plaid
## 6 black plaid
# Building a card deck
suits <- c("Diamonds", "Clubs", "Hearts", "Spades")
numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven",
"Eight", "Nine", "Ten", "Jack", "Queen", "King")
deck <- expand.grid(number = numbers, suit = suits)
deck <- paste(deck$number, deck$suit)
head(deck)
## [1] "Ace Diamonds" "Deuce Diamonds" "Three Diamonds" "Four Diamonds"
## [5] "Five Diamonds" "Six Diamonds"
# probability of kings in deck
kings <- paste("King", suits)
kings
## [1] "King Diamonds" "King Clubs" "King Hearts" "King Spades"
mean(deck %in% kings)
## [1] 0.07692308
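This is just the 4 Kings out of 52 cards:
4/52
## [1] 0.07692308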
# 5 random phone numbers of 7 digits
library(gtools) # provides permutations() and combinations()
all_phone_numbers <- permutations(10, 7, v = 0:9)
n <- nrow(all_phone_numbers)
index <- sample(n, 5)
all_phone_numbers[index,]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 4 5 7 0 9 1 2
## [2,] 9 8 6 5 1 0 2
## [3,] 2 3 6 9 5 4 0
## [4,] 6 7 1 0 4 5 3
## [5,] 7 1 4 8 5 6 3
To compute all possible ways we can choose two cards when the order matters, we type:
hands <- permutations(52, 2, v = deck)
# This is a matrix with two columns and 2652 rows.
head(hands)
## [,1] [,2]
## [1,] "Ace Clubs" "Ace Diamonds"
## [2,] "Ace Clubs" "Ace Hearts"
## [3,] "Ace Clubs" "Ace Spades"
## [4,] "Ace Clubs" "Deuce Clubs"
## [5,] "Ace Clubs" "Deuce Diamonds"
## [6,] "Ace Clubs" "Deuce Hearts"
With a matrix we can get the first and second cards like this (each line below returns an entire column):
first_card <- hands[,1]
second_card <- hands[,2]
Now the cases for which the first hand was a King can be computed like this:
kings <- paste("King", suits)
sum(first_card %in% kings)
## [1] 204
To get the conditional probability, we compute what fraction of these have a King in the second card:
sum(first_card%in%kings & second_card%in%kings) / sum(first_card%in%kings)
## [1] 0.05882353
which is exactly 3/51, as we had already deduced. Notice that the code above is equivalent to:
mean(first_card%in%kings & second_card%in%kings) /
mean(first_card%in%kings)
## [1] 0.05882353
which uses mean instead of sum and is an R version of:
Pr (A and B) / Pr (A)
Probability of a natural 21 in blackjack
aces <- paste("Ace", suits)
facecard <- c("King", "Queen", "Jack", "Ten")
facecard <- expand.grid(number = facecard, suit = suits)
facecard <- paste(facecard$number, facecard$suit)
hands <- combinations(52, 2, v=deck) # all possible hands
# probability of a natural 21 given that the ace is listed first in combinations
mean(hands[,1] %in% aces & hands[,2] %in% facecard)
## [1] 0.04826546
# probability of a natural 21 checking for both ace first and ace second
mean((hands[,1] %in% aces & hands[,2] %in% facecard)|(hands[,2] %in% aces & hands[,1] %in% facecard))
## [1] 0.04826546
Birthdays can be represented as a number from 1 to 365 (we exclude 29th Feb)
n <- 50
sample(1:365,n,replace=TRUE)
## [1] 127 29 171 338 171 273 255 136 162 3 46 70 122 231 162 265 246 144 47
## [20] 229 323 313 317 254 344 86 180 199 14 150 22 261 161 37 274 146 307 79
## [39] 153 68 350 49 88 111 205 141 239 217 94 1
Monte Carlo simulation:
B <- 10000
same_birthday <- function (n) {
bdays <- sample(1:365,n,replace=TRUE)
any(duplicated(bdays))
}
results <- replicate(B, same_birthday(50))
mean(results)
## [1] 0.9669
In a group of 50 people there is about a 97% probability that at least 2 people share a birthday.
Say we want to use this knowledge to bet with friends about two people having the same birthday in a group of people. When are the chances larger than 50%? Larger than 75%?
Let’s create a look-up table. We can quickly create a function to compute this for any group size:
compute_prob <- function(n, B=10000){
results <- replicate(B, same_birthday(n))
mean(results)
}
Using the function sapply, we can perform element-wise operations on any function:
n <- seq(1,60)
# n is a vector of 1 to 60 (groups of 1 to 60 people)
# with sapply we can run the function for each of the elements in n
prob <- sapply(n, compute_prob)
We can now make a plot of the estimated probabilities of two people having the same birthday in a group of size n:
prob <- sapply(n, compute_prob)
library(ggplot2) # needed for qplot
qplot(n, prob)
With the exact statistical formula: the probability that all n birthdays are distinct is (365/365) * (364/365) * ... * ((365 - n + 1)/365), so the probability of at least one match is 1 minus that product:
exact_prob <- function(n){
prob_unique <- seq(365,365-n+1)/365
1 - prod( prob_unique)
}
eprob <- sapply(n, exact_prob)
qplot(n, prob) + geom_line(aes(n, eprob), col = "red")
We run a Monte Carlo simulation of the Monty Hall problem, first for the strategy of sticking with the original door:
B <- 10000
stick <- replicate(B, {
doors <- as.character(1:3)
prize <- sample(c("car","goat","goat")) #random prizes
prize_door <- doors[prize == "car"] # prize door with car
my_pick <- sample(doors, 1) #random pick
show <- sample(doors[!doors %in% c(my_pick, prize_door)],1)
stick <- my_pick
stick == prize_door
})
mean(stick)
## [1] 0.3377
Monte carlo including switching doors
B <- 10000
switch <- replicate(B, {
doors <- as.character(1:3)
prize <- sample(c("car","goat","goat")) #random prizes
prize_door <- doors[prize == "car"] # prize door with car
my_pick <- sample(doors, 1) #random pick
show <- sample(doors[!doors %in% c(my_pick, prize_door)],1)
switch <- doors[!doors %in% c(my_pick,show)]
switch == prize_door
})
mean(switch)
## [1] 0.6512
mean(stick)
## [1] 0.3377
Switching doors doubles the probability of winning: sticking wins only when the initial pick was the prize door (probability 1/3), while switching wins whenever it was not (probability 2/3).
# heights comes from the dslabs package; filter and pull from dplyr
library(dslabs)
library(dplyr)
x <- heights %>% filter(sex=="Male") %>% pull(height)
We defined the empirical distribution function (eCDF) as:
F <- function(a) mean(x<=a)
which, for any value a, gives the proportion of values in the list x that are smaller than or equal to a.
If I pick one of the male students at random, what is the chance that he is taller than 70 inches? Because every student has the same chance of being picked, the answer is equivalent to the proportion of students taller than 70 inches. Using the eCDF we obtain the answer by typing:
1 - F(70)
## [1] 0.3768473
We say that a random quantity is normally distributed with average m and standard deviation s if its probability distribution is defined by:
F(a) = pnorm(a, m, s)
This is useful because if we are willing to use the normal approximation for, say, height, we don't need the entire dataset to answer questions such as: what is the probability that a randomly selected student is taller than 70.5 inches? We just need the average height and standard deviation:
x <- heights$height[heights$sex=="Male"]
m <- mean(x)
s <- sd(x)
1 - pnorm(70.5, m, s)
## [1] 0.371369
The probability of a single value is not defined for a continuous distribution.
The quantity with the most similar interpretation to the probability of a single value is the probability density function f(x).
The probability density f(x) is defined such that integrating it up to a value a gives the CDF: F(a) = Pr(X <= a) = ∫_{-∞}^{a} f(x) dx
In R, the probability density function for the normal distribution is given by dnorm(). We will see uses of dnorm() in the future.
Note that dnorm() gives the density function and pnorm() gives the distribution function, which is the integral of the density function
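As a minimal sketch of this relationship (m0 and s0 are illustrative values we assume here, not taken from the data), we can check that integrating dnorm() up to a value reproduces pnorm():
m0 <- 69; s0 <- 3 # assumed illustrative mean and SD
integrate(function(x) dnorm(x, m0, s0), lower = -Inf, upper = 70.5)$value
pnorm(70.5, m0, s0) # matches the integral above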
beads <- rep(c("red", "blue"), times = c(2, 3))
X <- ifelse(sample(beads, 1) == "blue", 1, 0)
Here X is a random variable: every time we select a new bead the outcome changes randomly. See below:
ifelse(sample(beads, 1) == "blue", 1, 0)
## [1] 0
ifelse(sample(beads, 1) == "blue", 1, 0)
## [1] 0
ifelse(sample(beads, 1) == "blue", 1, 0)
## [1] 1
We are going to define a random variable S that will represent the casino’s total winnings. Let’s start by constructing the urn. A roulette wheel has 18 red pockets, 18 black pockets and 2 green ones. So playing a color in one game of roulette is equivalent to drawing from this urn:
color <- rep(c("Black", "Red", "Green"), c(18, 18, 2))
The 1,000 outcomes from 1,000 people playing are independent draws from this urn. If red comes up, the gambler wins and the casino loses a dollar, so we draw a -$1. Otherwise, the casino wins a dollar and we draw a $1. To construct our random variable S, we can use this code:
n <- 1000
X <- sample(ifelse(color == "Red", -1, 1), n, replace = TRUE)
X[1:10]
## [1] -1 -1 1 1 1 1 1 -1 1 1
Because we know the proportions of 1s and -1s, we can generate the draws with one line of code, without defining color:
X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
We call this a sampling model since we are modeling the random behavior of roulette with the sampling of draws from an urn. The total winnings S is simply the sum of these 1,000 independent draws:
X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
S <- sum(X)
S
## [1] 92
We can estimate the distribution function for the random variable S by using a Monte Carlo simulation to generate many realizations of the random variable. With this code, we run the experiment of having 1,000 people play roulette, over and over, specifically B=10,000 times
n <- 1000
B <- 10000
roulette_winnings <- function(n){
X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
sum(X)
}
S <- replicate(B, roulette_winnings(n))
Now we can ask the following: in our simulations, how often did we get sums less than or equal to a?
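In code, this empirical estimate is simply (F_S is a name we introduce here for illustration):
F_S <- function(a) mean(S <= a) # proportion of simulated sums that are <= a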
This will be a very good approximation of F(a) and we can easily answer the casino’s question: how likely is it that we will lose money? We can see it is quite low:
mean(S<0)
## [1] 0.0414
We can check the expected value of a single draw with a large Monte Carlo simulation:
B <- 10^6
x <- sample(c(-1,1), B, replace = TRUE, prob=c(9/19, 10/19))
mean(x)
## [1] 0.052044
In general, if the urn has two possible outcomes, say a and b, with proportions p and (1-p) respectively, the average is:
E[X] = a*p + b*(1-p)
and the standard error is:
SE[X] = |b - a| * sqrt(p*(1-p))
In the roulette example, the standard error of one draw is:
2 * sqrt(90)/19
## [1] 0.998614
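And the expected value of one draw, using the same formula with a = 1 (probability 10/19) and b = -1 (probability 9/19):
1*(10/19) + (-1)*(9/19)
## [1] 0.05263158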
The standard error tells us the typical difference between a random variable and its expectation. Since one draw is obviously the sum of just one draw, we can use the formula above to calculate that the random variable defined by one draw has an expected value of 0.05 and a standard error of about 1. This makes sense since we either get 1 or -1, with 1 slightly favored over -1.
Using the formula above, the sum of 1,000 people playing has standard error of about $32:
n <- 1000
sqrt(n) * 2 * sqrt(90)/19
## [1] 31.57895
As a result, when 1,000 people bet on red, the casino is expected to win $50 with a standard error of $32. It therefore seems like a safe bet. But we still haven’t answered the question: how likely is it to lose money? Here the CLT will help.
We previously ran this Monte Carlo simulation:
n <- 1000
B <- 10000
roulette_winnings <- function(n){
X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
sum(X)
}
S <- replicate(B, roulette_winnings(n))
The Central Limit Theorem (CLT) tells us that the sum S is approximated by a normal distribution. Using the formulas above, we know that the expected value and standard error are:
n * (20-18)/38
## [1] 52.63158
sqrt(n) * 2 * sqrt(90)/19
## [1] 31.57895
The theoretical values above match those obtained with the Monte Carlo simulation:
mean(S)
## [1] 52.8096
sd(S)
## [1] 31.9207
Using the CLT, we can skip the Monte Carlo simulation and instead compute the probability of the casino losing money using this approximation:
mu <- n * (20-18)/38
se <- sqrt(n) * 2 * sqrt(90)/19
pnorm(0, mu, se)
## [1] 0.04779035
which is also in very good agreement with our Monte Carlo result:
mean(S < 0)
## [1] 0.0449
set.seed(1)
# the bank loses $200,000 on each loan that is not paid back (foreclosure)
# each client has a 2% chance of defaulting
# we assume no profits are made from paying clients
n <- 1000
loss_per_foreclosure <- -200000
p <- 0.02
defaults <- sample( c(0,1), n, prob=c(1-p, p), replace = TRUE)
sum(defaults * loss_per_foreclosure)
## [1] -4600000
# The sum above is random
# Monte Carlo for distribution of losses
B <- 10000
losses <- replicate(B,{
defaults <- sample(c(0,1),n,replace=TRUE,prob=c(1-p,p))
sum(defaults*loss_per_foreclosure)
})
mean(losses)
## [1] -4003200
sd(losses)
## [1] 884943
# With CLT
n*(p*loss_per_foreclosure+(1-p)*0)
## [1] -4e+06
sqrt(n)*abs(loss_per_foreclosure)*sqrt(p*(1-p))
## [1] 885437.7
# adding an amount x to each loan (the ones that are re-paid) so that the expected loss is ZERO
# l*p + x*(1-p) = 0, where l is the loss per foreclosure and x is the amount added
# solving for x: x = -l*p/(1-p); below it is expressed as a rate
-loss_per_foreclosure*p/(1-p)/200000
## [1] 0.02040816
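The dollar amount itself, before dividing by the loan size, is:
-loss_per_foreclosure*p/(1-p)
## [1] 4081.633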
qnorm(0.01)
## [1] -2.326348
# Find the amount x so that the bank loses money only 1 in 100 times:
# we want Pr(S < 0) = 0.01; by the CLT, S is approximately normal with
# expected value n*(l*p + x*(1-p)) and standard error (x-l)*sqrt(n*p*(1-p)),
# so we set z = qnorm(0.01) equal to (0 - expected value)/(standard error) and solve for x
l <- loss_per_foreclosure
z <- qnorm(0.01)
x <- -l*( n*p - z*sqrt(n*p*(1-p)))/ ( n*(1-p) + z*sqrt(n*p*(1-p)))
x/180000 # interest rate on a $180,000 loan
## [1] 0.03471767
# The interest rate is about 0.035, or 3.5%
# Expected profit per loan
EP <- loss_per_foreclosure*p + x*(1-p)
# Total expected profit
TEP <- n*(l*p + x*(1-p))
# Monte Carlo for total expected profit
B <- 10000
profit <- replicate(B,{
draws <- sample(c(x,l),n,replace = TRUE, prob=c(1-p,p))
sum(draws)
})
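We can summarize the simulated profits (a quick check added here; output not shown):
mean(profit) # average total profit across simulations
mean(profit < 0) # estimated probability of losing money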
# Employee suggests increasing n even if p doubles from 2% defaults to 4%
p <- 0.04
# new break-even interest rate: l*p + x*(1-p) = 0
# solving for x: x = -l*p/(1-p)
# r = x / 180000
r <- -l*p/(1-p)/180000
r
## [1] 0.0462963
# r is 0.04629
# If we set r to 5%, the expected profit per loan is
r <- 0.05
x <- r*180000
NewEP <- l*p+(r*180000)*(1-p)
NewEP
## [1] 640
# What n do we need for a probability of losing money of 1%?
# Per loan, the expected profit is mu = l*p + x*(1-p) and the SE is sigma = (x-l)*sqrt(p*(1-p));
# Pr(S < 0) = 0.01 requires z = qnorm(0.01) = -sqrt(n)*mu/sigma, which gives n = z^2*sigma^2/mu^2
z <- qnorm(0.01)
n <- ceiling((z^2*(x-l)^2*p*(1-p))/(l*p + x*(1-p))^2)
n
## [1] 22163
# 22163 loans reduce the probability of losing money to 1%,
# and total expected profit would be 14,184,320
n*(l*p+x*(1-p))
## [1] 14184320
# Previously we wrongly assumed that borrowers default independently of one another.
# Now we keep the average default probability at 4%, but assume that, with 50-50 chance,
# all default probabilities shift up or down together to somewhere between 0.03 and 0.05.
# The shift hits everybody at once, not just one person.
p <- 0.04
x <- 0.05*180000
profit <- replicate(B, {
new_p <- 0.04 + sample(seq(-0.01, 0.01, length = 100), 1)
draws <- sample( c(x, loss_per_foreclosure), n, prob=c(1-new_p, new_p), replace = TRUE)
sum(draws)
})
mean(profit)
## [1] 14025355
# The expected profit is still very large, but the probability of losing money is now about 35%
mean(profit<0)
## [1] 0.3468
The ACT is a standardized college admissions test used in the United States. The four multi-part questions in this assessment all involve simulating some ACT test scores and answering probability questions about them.
For the three year period 2016-2018, ACT standardized test scores were approximately normally distributed with a mean of 20.9 and standard deviation of 5.7. (Real ACT scores are integers between 1 and 36, but we will ignore this detail and use continuous values instead.)
First we’ll simulate an ACT test score dataset and answer some questions about it.
Set the seed to 16, then use rnorm() to generate a normal distribution of 10000 tests with a mean of 20.9 and standard deviation of 5.7. Save these values as act_scores. You’ll be using this dataset throughout these four multi-part questions.
# set seed to 16
set.seed(16)
# generate a random normal distribution of 10,000 tests
# mean 20.9 and sd of 5.7
act_scores <- rnorm(10000,mean=20.9, sd=5.7)
1a What is the mean of act_scores?
avg_act <- mean(act_scores)
avg_act
## [1] 20.84012
1b What is the standard deviation of act_scores?
sd_act <- sd(act_scores)
sd_act
## [1] 5.675237
1c A perfect score is 36 or greater (the maximum reported score is 36).
In act_scores, how many perfect scores are there out of 10,000 simulated tests?
mean(act_scores>=36)*10000
## [1] 41
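Equivalently, counting directly:
sum(act_scores >= 36)
## [1] 41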
1d In act_scores, what is the probability of an ACT score greater than 30?
1 - pnorm(30,avg_act,sd_act)
## [1] 0.05326283
1e In act_scores, what is the probability of an ACT score less than or equal to 10?
pnorm(10, avg_act,sd_act)
## [1] 0.02806186
2 Set x equal to the sequence of integers 1 to 36. Use dnorm to determine the value of the probability density function over x given a mean of 20.9 and standard deviation of 5.7; save the result as f_x. Plot x against f_x.
Which of the following plots is correct?
x <- 1:36
f_x <- dnorm(x, 20.9, 5.7) # the stated mean and sd, rather than the sample estimates
plot(x, f_x)
3a What is the probability of a Z-score greater than 2 (2 standard deviations above the mean)?
# scaling act_scores
z_act_scores <- scale(act_scores)
# probability of z >= 2
mean(z_act_scores>=2)
## [1] 0.0233
3b What ACT score value corresponds to 2 standard deviations above the mean (Z = 2)?
avg_act + 2*sd_act
## [1] 32.1906
3c A Z-score of 2 corresponds roughly to the 97.5th percentile.
Use qnorm() to determine the 97.5th percentile of normally distributed data with the mean and standard deviation observed in act_scores.
What is the 97.5th percentile of act_scores?
qnorm(0.975, avg_act, sd_act)
## [1] 31.96338
4a What is the minimum integer score such that the probability of that score or lower is at least .95?
qnorm(.95,avg_act,sd_act)
## [1] 30.17506
# The answer is 31, the smallest integer at or above 30.18
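As a quick empirical check (using the simulated act_scores from above), we can look for the smallest integer score whose cumulative proportion reaches 0.95:
cdf <- sapply(1:36, function(k) mean(act_scores <= k))
min(which(cdf >= 0.95))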
4b Use qnorm() to determine the expected 95th percentile, the value for which the probability of receiving that score or lower is 0.95, given a mean score of 20.9 and standard deviation of 5.7.
What is the expected 95th percentile of ACT scores?
qnorm(.95, 20.9, 5.7)
## [1] 30.27567
4c
As discussed in the Data Visualization course, we can use quantile() to determine sample quantiles from the data.
Make a vector containing the quantiles for p <- seq(0.01, 0.99, 0.01), the 1st through 99th percentiles of the act_scores data. Save these as sample_quantiles.
In what percentile is a score of 26?
p <- seq(0.01, 0.99, 0.01)
sample_quantiles <- quantile(act_scores,p)
# to determine percentile of a score of 26
names(sample_quantiles[max(which(sample_quantiles < 26))])
## [1] "82%"
# which(sample_quantiles < 26) returns the indices of all percentiles whose score is below 26
which(sample_quantiles<26)
## 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20%
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40%
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60%
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 80%
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 81% 82%
## 81 82
# with max(which(sample_quantiles < 26)) we isolate the highest percentile below a score of 26
max(which(sample_quantiles < 26))
## [1] 82
# sample_quantiles[...] returns a named numeric value, not a bare number
sample_quantiles[max(which(sample_quantiles < 26))]
## 82%
## 25.99266
# to extract the "82%" label we use names() (like getting an element's name)
names(sample_quantiles[max(which(sample_quantiles < 26))])
## [1] "82%"
4d Make a corresponding set of theoretical quantiles using qnorm() over the interval p <- seq(0.01, 0.99, 0.01) with mean 20.9 and standard deviation 5.7. Save these as theoretical_quantiles. Make a QQ-plot graphing sample_quantiles on the y-axis versus theoretical_quantiles on the x-axis.
# recap, sample_quantiles <- quantile(act_scores, p)
# now we calculate theoretical qs
p <- seq(0.01, 0.99, 0.01)
theoretical_quantiles <- qnorm(p, 20.9, 5.7)
# to qqplot
qqplot(theoretical_quantiles, sample_quantiles)
The eCDF of act_scores, with vertical lines at the mean and the 5% and 95% quantiles:
act_cdf <- sapply(1:36, function (x){
mean(act_scores <= x)
})
x <- 1:36
y <- act_cdf
ggplot() +
  geom_line(aes(x, y), size = 1) +
  geom_vline(xintercept = sample_quantiles[95], linetype = "dotted", color = "green", size = 1) +
  geom_vline(xintercept = sample_quantiles[5], linetype = "dotted", color = "red", size = 1) +
  geom_vline(xintercept = avg_act, color = "black", alpha = 0.2)
An old version of the SAT college entrance exam had a -0.25 point penalty for every incorrect answer and awarded 1 point for a correct answer. The quantitative test consisted of 44 multiple-choice questions each with 5 answer choices. Suppose a student chooses answers by guessing for all questions on the test.
1a What is the probability of guessing correctly for one question?
# each question 5 choices
p <- 1/5
p
## [1] 0.2
1b What is the expected value of points for guessing on one question?
a <- 1
b <- -0.25
mu <- a*p + b*(1-p)
mu
## [1] 0
1c What is the expected score of guessing on all 44 questions?
n <- 44
n*mu
## [1] 0
1d What is the standard error of guessing on all 44 questions?
# E[X] = n (ap + b(1-p))
# SE[x] = sqrt(n) * |b-a| * sqrt(p(1-p))
sigma <- sqrt(n) * abs(b-a) * sqrt(p*(1-p))
sigma
## [1] 3.316625
1e Use the Central Limit Theorem to determine the probability that a guessing student scores 8 points or higher on the test.
1 - pnorm(8, mu, sigma)
## [1] 0.007930666
# There is about a 0.8% probability of scoring 8 or higher, given correct answers are worth 1 point and incorrect answers -0.25
1f Set the seed to 21, then run a Monte Carlo simulation of 10,000 students guessing on the test.
What is the probability that a guessing student scores 8 points or higher?
set.seed(21)
B <- 10000
n <- 44
# Monte Carlo simulation
exam_scores <- replicate(B,{
X <- sample(c(1, -0.25), n, replace = TRUE, prob=c(p, 1-p))
sum(X)
})
# Probability of 8 or higher
mean(exam_scores>=8)
## [1] 0.008
The SAT was recently changed to reduce the number of multiple choice options from 5 to 4 and also to eliminate the penalty for guessing.
2a What is the expected value of the test score when guessing on this new test?
# New conditions
a <- 1
b <- 0 # no penalty for incorrect answer
p <- 0.25 # there are 4 answer choices
n <- 44
# Expected value of 1 question
mu <- (a*p)+(b*(1-p))
# Expected value of test
mu*n
## [1] 11
2b Consider a range of correct answer probabilities p <- seq(0.25, 0.95, 0.05) representing a range of student skills.
What is the lowest p such that the probability of scoring over 35 exceeds 80%?
# mu = n * (a*p + b*(1-p))
# sigma = sqrt(n) * abs(b-a) * sqrt(p*(1-p))
# probability of scoring over 35 is 1 - pnorm(35, mu, sigma)
# we consider different p based on different student skills
p <- seq(0.25, 0.95, 0.05)
exp_val <- sapply(p, function(x) {
mu <- n* (a*x + b*(1-x))
sigma <- sqrt(n) * abs(b-a) * sqrt(x*(1-x))
1 - pnorm(35, mu, sigma)
})
# exp_val holds the probability of scoring over 35 points for each skill level in the vector p (0.25, 0.30, ..., 0.95)
# each p will have a different probability
plot(p, exp_val)
# what is the minimum skill (p) required to score over 35 with at least 80% probability?
p[which(exp_val > 0.8)] # will return all p that satisfy > 0.8
## [1] 0.85 0.90 0.95
min(p[which(exp_val > 0.8)])
## [1] 0.85
A casino offers a House Special bet on roulette, which is a bet on five pockets (00, 0, 1, 2, 3) out of 38 total pockets. The bet pays out 6 to 1. In other words, a losing bet yields -$1 and a successful bet yields $6. A gambler wants to know the chance of losing money if he places 500 bets on the roulette House Special.
3a What is the expected value of the payout for one bet?
p <- 5/38
a <- 6
b <- -1
mu <- a*p + b*(1-p)
mu
## [1] -0.07894737
# losing about 8 cents per bet
3b What is the standard error of the payout for one bet?
sigma <- abs(b-a) * sqrt(p*(1-p))
sigma
## [1] 2.366227
3c What is the expected value of the average payout over 500 bets?
n <- 500
# expected value of sum would be n*mu
# they ask for the expected value of the average over 500 bets, which is just mu
mu
## [1] -0.07894737
3d What is the standard error of the average payout over 500 bets?
sigma / sqrt(n)
## [1] 0.1058209
3e What is the expected value of the sum of 500 bets?
mu500 <- n*mu
mu500
## [1] -39.47368
# an expected loss of about $39
3f What is the standard error of the sum of 500 bets?
sd500 <- sqrt(n) * abs(b-a) * sqrt(p*(1-p))
sd500
## [1] 52.91045
3g Use pnorm() with the expected value of the sum and standard error of the sum to calculate the probability of losing money over 500 bets
pnorm(0, mu500, sd500)
## [1] 0.7721805
# there is a 77% chance of losing money over 500 bets