Question 1: Problem A
• Consider the following independent random variables:
– X ∼ N(μ = 4, σ² = 10)
– Y ∼ U(a = 2, b = 8)
• Compute the probability that X > Y, i.e. Pr(X > Y).
I used Monte Carlo simulation to estimate Pr(X > Y). I repeatedly drew random samples from the given normal and uniform distributions, and approximated the probability by the number of times X > Y was observed, divided by the total number of draws.
# Simulate x and y 100,000 times
num_samp <- 100000
set.seed(4541) # for reproducibility
x <- rnorm(num_samp, mean = 4, sd = sqrt(10))
y <- runif(num_samp, min = 2, max = 8)
# Count how many times x is bigger than y (TRUE = 1)
count <- sum(x > y)
prob1 <- count / num_samp
# Print the output
cat("The probability that X > Y =", prob1)
## The probability that X > Y = 0.39161
A probability of about 0.39 is reasonable: the expected value of Y, E[Y] = (2 + 8)/2 = 5, is higher than E[X] = 4, so we expect Pr(X > Y) to be somewhat below 0.5.
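As a quick sanity check (not required by the question), Pr(X > Y) can also be computed by numerical integration, conditioning on Y = y: Pr(X > Y) = ∫ f_Y(y) Pr(X > y) dy over [2, 8]. A minimal sketch using only base R (the object name exact is just illustrative):
# Cross-check of the Monte Carlo estimate by numerical integration
# Pr(X > Y) = integral over [2, 8] of f_Y(y) * Pr(X > y) dy
exact <- integrate(function(y) dunif(y, 2, 8) * pnorm(y, 4, sqrt(10), lower.tail = FALSE),
                   lower = 2, upper = 8)
cat("Numerical integration gives Pr(X > Y) =", exact$value)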
• Use bootstrapping to derive the sampling distribution for your estimate of Pr(X > Y).
I used 1,000 bootstrap samples to derive the sampling distribution: I resampled the simulated pairs with replacement, recomputed the probability for each resample, stored the results, and calculated their mean and standard deviation.
# Initialisation
obsData <- data.frame(x, y)
# Number of bootstrap samples
NRepeat <- 1000
# Storage for the results
bootRes <- matrix(data = NA, nrow = NRepeat, ncol = 1)
# Loop across all bootstrap samples
for (i in seq(NRepeat)) {
  # Resample the (x, y) pairs with replacement
  bootData <- obsData[sample(x = num_samp, size = num_samp, replace = TRUE), ]
  # Compute P(X > Y) for this resample
  count <- sum(bootData$x > bootData$y)
  # Store the result for this sample
  bootRes[i] <- count / num_samp
}
# Calculate the mean and standard deviation of the stored results
mea <- mean(bootRes)
std <- sqrt(var(bootRes))
cat("The sampling distribution follows a normal distribution with mean of", mea, "and standard deviation of", std)
## The sampling distribution follows a normal distribution with mean of 0.3915438 and standard deviation of 0.001515522
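As an optional check of the normality claim, one could plot the bootstrap estimates as a histogram and overlay the normal density with the fitted mean and standard deviation. A sketch reusing the bootRes, mea and std objects above:
# Visualise the bootstrap sampling distribution with the fitted normal density
hist(as.vector(bootRes), breaks = 30, freq = FALSE,
     main = "Bootstrap sampling distribution of Pr(X > Y)",
     xlab = "Bootstrap estimate of Pr(X > Y)")
curve(dnorm(x, mean = mea, sd = as.numeric(std)), add = TRUE)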
• Show how the sample variance of this sampling distribution changes as a function of the number of Monte Carlo simulations.
I repeated the bootstrap above for different numbers of Monte Carlo simulations, stored the bootstrap estimates in the columns of a matrix, and calculated the variance of each column.
# Numbers of Monte Carlo simulations to test
MC <- c(500, 1000, 2000, 4000, 6000, 8000, 10000)
# Storage for the results of each bootstrap
boot <- matrix(data = NA, nrow = NRepeat, ncol = length(MC))
for (j in seq(length(MC))) {
  # Generate data for this number of Monte Carlo simulations
  x <- rnorm(MC[j], mean = 4, sd = sqrt(10))
  y <- runif(MC[j], min = 2, max = 8)
  # Group into a data frame
  obsData <- data.frame(x, y)
  # For this number of MC simulations, run the bootstrap
  for (i in seq(NRepeat)) {
    # Resample with replacement
    bootData <- obsData[sample(x = MC[j], size = MC[j], replace = TRUE), ]
    # Compute P(X > Y): count the times x is bigger than y (TRUE = 1, FALSE = 0)
    probability <- sum(bootData$x > bootData$y) / MC[j]
    # Store the result in column j
    boot[i, j] <- probability
  }
}
# Calculate the variance of each bootstrap, i.e. of each column of boot
variances <- apply(boot, 2, var)
# Plot the results
plot(MC, variances,
     main = "Sample variance vs number of Monte Carlo simulations",
     xlab = "Number of Monte Carlo simulations",
     ylab = "Variance")
The variance decreases as the number of Monte Carlo simulations increases, falling off roughly in proportion to 1/N rather than exponentially. Although we went up to 10,000 simulations, the variance is already very small by around 6,000.
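To make this scaling explicit, one could compare the bootstrap variances with the binomial approximation p(1 − p)/N, plugging in the point estimate prob1 from above. A rough sketch reusing MC and variances (the helper name theoretical is mine):
# Compare the bootstrap variances with the approximation p * (1 - p) / N
theoretical <- prob1 * (1 - prob1) / MC
plot(MC, variances,
     main = "Bootstrap variance vs p(1 - p)/N",
     xlab = "Number of Monte Carlo simulations",
     ylab = "Variance")
lines(MC, theoretical, lty = 2)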
Problem B
• Consider the following football tournament format: a team keeps playing until they accrue 7 wins or 3 losses (whichever comes first; no draws allowed). Assume a fixed win rate p ∈ [0, 1] across all rounds (they are paired at random).
• Plot how the total number of matches played (i.e. wins + losses) varies as a function of p.
set.seed(73123) # for reproducibility
# Play() simulates one tournament run: given the current numbers of wins and
# losses and a win probability p, it returns c(total games played, games won)
Play <- function(num_wins, num_loss, p) {
  # Keep playing while we have fewer than 3 losses and fewer than 7 wins
  if (num_loss < 3 && num_wins < 7) {
    # Simulate one match: rbinom with one trial returns 1 (win) or 0 (loss)
    sim <- rbinom(n = 1, size = 1, p)
    if (sim == 1) {
      # Won the game: play again by calling the function recursively
      return(Play(num_wins + 1, num_loss, p))
    } else {
      # Lost the game: play again by calling the function recursively
      return(Play(num_wins, num_loss + 1, p))
    }
  } else {
    # Once we have lost 3 or won 7, return the totals
    return(c(num_wins + num_loss, num_wins))
  }
}
# Run 10,000 tournaments for each value of p
N_runs <- 10000
# Choose p values from 0 to 1 with step size 0.1
p_vals <- seq(0, 1, 0.1)
# Storage for the results
played <- rep(0, length(p_vals))
games_won <- rep(0, length(p_vals))
for (i in seq(length(p_vals))) {
  # Run N_runs tournaments for win probability p_vals[i]
  for (j in seq(N_runs)) {
    # Accumulate the number of games played for this probability
    played[i] <- played[i] + Play(0, 0, p_vals[i])[1]
    # Accumulate the number of games won for this probability
    # (note: taken from a second, independently simulated run)
    games_won[i] <- games_won[i] + Play(0, 0, p_vals[i])[2]
  }
}
# Average number of games played for each probability
Games_Played <- played / N_runs
# Average number of games won for each probability
games_won <- games_won / N_runs
plot(p_vals, Games_Played,
     main = "Win probability vs games played",
     xlab = "Win probability",
     ylab = "Games Played")
The expected number of games played increases with p up to about p = 0.8 and then decreases towards p = 1. The maximum possible number of games in a run is 9: either 7 wins and 2 losses or 6 wins and 3 losses. If p = 1 we never lose, so every run ends after exactly 7 games. It therefore makes sense that the expected number of games peaks at a high but imperfect win rate such as 0.8: a small chance of losing adds extra games before the 7th win, without making an early exit at 3 losses likely.
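As a cross-check on the simulated curve (not part of the simulation itself), the expected number of matches can be computed exactly by enumerating how a run can end: with a 3rd loss after k = 0, ..., 6 wins, or with a 7th win after l = 0, 1, 2 losses, where the final match always triggers the stop. A minimal sketch reusing p_vals (the helper name expected_games is mine):
# Exact expected number of matches for a given win rate p
expected_games <- function(p) {
  k <- 0:6   # wins when the run ends on a 3rd loss
  l <- 0:2   # losses when the run ends on a 7th win
  sum(choose(k + 2, k) * p^k * (1 - p)^3 * (k + 3)) +
    sum(choose(6 + l, l) * p^7 * (1 - p)^l * (7 + l))
}
sapply(p_vals, expected_games)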
• Comment on the observed win rate relative to the assumed win rate p (i.e. if a team obtains 2 wins - 3 losses, the maximum likelihood point estimate for their win rate is 40%). Specifically, focus on the effect driven by the format of this tournament.
To get the observed win rate, we divide the number of games won by the total number of games played.
# Divide the number of games won by the number played
obs_rate <- games_won / Games_Played
print(obs_rate)
## [1] 0.0000000 0.1003029 0.2009899 0.2965771 0.3998948 0.4963108 0.6012624
## [8] 0.7060749 0.8001061 0.9008469 1.0000000
The observed win rates, averaged over 10,000 runs, are close to the true values. However, the tournament format introduces a lot of variability in any single run. For example, if the win probability is 0.1, there is a 0.9³ ≈ 0.73 chance of losing the first three games, which gives an observed win rate of 0. Winning one game before the third loss gives a rate of 1/4 = 0.25, and winning two gives 2/5 = 0.4. These are all likely outcomes, yet they produce very different point estimates, differing by up to 0.4. Averaging over 10,000 tournaments smooths out these differences, and we end up with an observed rate of 0.1003029.
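To illustrate this per-tournament variability, one could tabulate the observed win rates of individual tournaments at p = 0.1, reusing the Play() function above (the seed and the name single_rates below are only illustrative):
# Distribution of single-tournament observed win rates at p = 0.1
set.seed(1) # illustrative seed, not part of the original solution
single_rates <- replicate(10000, {
  res <- Play(0, 0, 0.1)
  res[2] / res[1] # wins divided by games played in this tournament
})
# Proportion of tournaments ending at each observed win rate
table(round(single_rates, 2)) / length(single_rates)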
(Parallel computing would take longer to implement than simply running the code, so I followed the advice in the lecture notes and did not use it.)