The World Series is a first-to-4-wins (equivalently, best-of-7) match-up between the champions of the American and National Leagues of Major League Baseball. In this assignment, we explain probability calculations related to the World Series and, more generally, to first-to-k-wins random variables.

Setup:

Suppose that the Braves \((B)\) and the Yankees \((Y)\) are teams competing in the World Series. Suppose that in any given game, the probability that the Braves win a single head-to-head match-up is \(P_B\) and the probability that the Yankees win a single head-to-head match-up is \(P_Y = 1 - P_B\). Assume that home-field advantage doesn’t exist.

1. Explain why the outcome of a first to k wins process is a bivariate random variable. Explain the two outcomes that comprise the outcome.

This is a bivariate random variable because each realization records two quantities jointly, and we are interested in the probability of every possible combination of their values. Here the two components are the winning team \((B\space\space\text{or}\space Y)\) and the number of games played \(k =4,5,6,7\).

Note: It is impossible for the series to last only 1, 2, or 3 games, because it cannot end until one team has won 4 games. Therefore at least 4 games must be played, and the probability is 0 for k = 1, 2, 3.

2. Derive the joint distribution of the bivariate random variable by completing a cross-table. You may do this for a specific value of P_B or for P_B in general.

First, we consider the general cross table for the winning team and the number of games played:

$$ \[\begin{array}{|l|c|c|c|c|} \hline
K & \text{Type} & B & Y & \text{Row Total} \\ \hline
k_4: k=4 & \text{Cell prob} & P(k_4\space\text{and}\space B) = P(k_4)P(B\mid k_4) & P(k_4\space\text{and}\space Y) = P(k_4)P(Y\mid k_4) & P(k_4)= P(k_4\space\text{and}\space B) + P(k_4\space\text{and}\space Y)\\
 & \text{Row prob} & P(B\mid k_4) = \frac{P(k_4\space\text{and}\space B)}{P(k_4)} & P(Y\mid k_4) = \frac{P(k_4\space\text{and}\space Y)}{P(k_4)}\\
 & \text{Column prob} & P(k_4 \mid B) = \frac{P(k_4\space\text{and}\space B)}{P(B)} & P(k_4\mid Y) = \frac{P(k_4\space\text{and}\space Y)}{P(Y)}\\ \hline
k_5: k=5 & \text{Cell prob} & P(k_5\space\text{and}\space B) = P(k_5)P(B\mid k_5) & P(k_5\space\text{and}\space Y) = P(k_5)P(Y\mid k_5) & P(k_5)= P(k_5\space\text{and}\space B) + P(k_5\space\text{and}\space Y)\\
 & \text{Row prob} & P(B\mid k_5) = \frac{P(k_5\space\text{and}\space B)}{P(k_5)} & P(Y\mid k_5) = \frac{P(k_5\space\text{and}\space Y)}{P(k_5)}\\
 & \text{Column prob} & P(k_5 \mid B) = \frac{P(k_5\space\text{and}\space B)}{P(B)} & P(k_5\mid Y) = \frac{P(k_5\space\text{and}\space Y)}{P(Y)}\\ \hline
k_6: k=6 & \text{Cell prob} & P(k_6\space\text{and}\space B) = P(k_6)P(B\mid k_6) & P(k_6\space\text{and}\space Y) = P(k_6)P(Y\mid k_6) & P(k_6)= P(k_6\space\text{and}\space B) + P(k_6\space\text{and}\space Y)\\
 & \text{Row prob} & P(B\mid k_6) = \frac{P(k_6\space\text{and}\space B)}{P(k_6)} & P(Y\mid k_6) = \frac{P(k_6\space\text{and}\space Y)}{P(k_6)}\\
 & \text{Column prob} & P(k_6 \mid B) = \frac{P(k_6\space\text{and}\space B)}{P(B)} & P(k_6\mid Y) = \frac{P(k_6\space\text{and}\space Y)}{P(Y)}\\ \hline
k_7: k=7 & \text{Cell prob} & P(k_7\space\text{and}\space B) = P(k_7)P(B\mid k_7) & P(k_7\space\text{and}\space Y) = P(k_7)P(Y\mid k_7) & P(k_7)= P(k_7\space\text{and}\space B) + P(k_7\space\text{and}\space Y)\\
 & \text{Row prob} & P(B\mid k_7) = \frac{P(k_7\space\text{and}\space B)}{P(k_7)} & P(Y\mid k_7) = \frac{P(k_7\space\text{and}\space Y)}{P(k_7)}\\
 & \text{Column prob} & P(k_7 \mid B) = \frac{P(k_7\space\text{and}\space B)}{P(B)} & P(k_7\mid Y) = \frac{P(k_7\space\text{and}\space Y)}{P(Y)}\\ \hline
\text{Column} & ---- & P(B) = P(k_4\space\text{and}\space B) + P(k_5\space\text{and}\space B) & P(Y) = P(k_4\space\text{and}\space Y) + P(k_5\space\text{and}\space Y) & 1 \\
\text{Total} & & + P(k_6\space\text{and}\space B) + P(k_7\space\text{and}\space B) & + P(k_6\space\text{and}\space Y) + P(k_7\space\text{and}\space Y) \\ \hline
\end{array}\]

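$$ The joint distribution is based on joint probability, which is defined as the probability of two events happening together. For two general events A and B, the joint probability can formally be written as \(P(A\space\text{and}\space B)\). Thus, the cell probabilities of our cross table make up the joint distribution. Each cell follows from the fact that a team wins in exactly \(k\) games when it wins 3 of the first \(k-1\) games and then wins game \(k\):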

$$ \[\begin{array}{|l|c|c|c|c|} \hline
K & \text{Type} & B & Y & \text{Row Total} \\ \hline
k_4: k=4 & \text{Cell prob} & (P_B)^4 & (P_Y)^4 & P(k_4)=(P_B)^4 + (P_Y)^4\\
 & \text{Row prob} & \frac{(P_B)^4}{P(k_4)} & \frac{(P_Y)^4}{P(k_4)}\\
 & \text{Column prob} & \frac{(P_B)^4}{P(B)} & \frac{(P_Y)^4}{P(Y)}\\ \hline
k_5: k=5 & \text{Cell prob} & [{4\choose 3}(P_B)^3(P_Y)](P_B) & [{4\choose 3}(P_Y)^3(P_B)](P_Y) & P(k_5)=[{4\choose 3}(P_B)^3(P_Y)](P_B)\\
 & \text{Row prob} & \frac{[{4\choose 3}(P_B)^3(P_Y)](P_B)}{P(k_5)} & \frac{[{4\choose 3}(P_Y)^3(P_B)](P_Y)}{P(k_5)} & + [{4\choose 3}(P_Y)^3(P_B)](P_Y)\\
 & \text{Column prob} & \frac{[{4\choose 3}(P_B)^3(P_Y)](P_B)}{P(B)} & \frac{[{4\choose 3}(P_Y)^3(P_B)](P_Y)}{P(Y)}\\ \hline
k_6: k=6 & \text{Cell prob} & [{5\choose 3}(P_B)^3(P_Y)^2](P_B) & [{5\choose 3}(P_Y)^3(P_B)^2](P_Y) & P(k_6)=[{5\choose 3}(P_B)^3(P_Y)^2](P_B)\\
 & \text{Row prob} & \frac{[{5\choose 3}(P_B)^3(P_Y)^2](P_B)}{P(k_6)} & \frac{[{5\choose 3}(P_Y)^3(P_B)^2](P_Y)}{P(k_6)} & + [{5\choose 3}(P_Y)^3(P_B)^2](P_Y)\\
 & \text{Column prob} & \frac{[{5\choose 3}(P_B)^3(P_Y)^2](P_B)}{P(B)} & \frac{[{5\choose 3}(P_Y)^3(P_B)^2](P_Y)}{P(Y)}\\ \hline
k_7: k=7 & \text{Cell prob} & [{6\choose 3}(P_B)^3(P_Y)^3](P_B) & [{6\choose 3}(P_Y)^3(P_B)^3](P_Y) & P(k_7)=[{6\choose 3}(P_B)^3(P_Y)^3](P_B) \\
 & \text{Row prob} & \frac{[{6\choose 3}(P_B)^3(P_Y)^3](P_B)}{P(k_7)} & \frac{[{6\choose 3}(P_Y)^3(P_B)^3](P_Y)}{P(k_7)} & + [{6\choose 3}(P_Y)^3(P_B)^3](P_Y)\\
 & \text{Column prob} & \frac{[{6\choose 3}(P_B)^3(P_Y)^3](P_B)}{P(B)} & \frac{[{6\choose 3}(P_Y)^3(P_B)^3](P_Y)}{P(Y)}\\ \hline
\text{Column} & ---- & P(B) = (P_B)^4 + [{4\choose 3}(P_B)^3(P_Y)](P_B) & P(Y) = (P_Y)^4 + [{4\choose 3}(P_Y)^3(P_B)](P_Y) & 1 \\
\text{Total} & & + [{5\choose 3}(P_B)^3(P_Y)^2](P_B) + [{6\choose 3}(P_B)^3(P_Y)^3](P_B) & + [{5\choose 3}(P_Y)^3(P_B)^2](P_Y) + [{6\choose 3}(P_Y)^3(P_B)^3](P_Y) \\ \hline
\end{array}\]

$$

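The “Row Total” column gives the marginal distribution of the series length \(K\), and the “Column Total” row gives the marginal distribution of the winner. For example, \(P(B)\) is the probability that the Braves win the series regardless of how many games it takes, and \(P(k_4)\) is the probability that the series ends in 4 games regardless of who wins.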
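As a rough numerical check of the table above, the sketch below fills in the cell probabilities for a single value of \(P_B\) using R's dnbinom. The value p = 0.55 and the names games and joint are illustrative choices, not part of the assignment.

```r
# Sketch: joint distribution of (games played, winner) for a first-to-4 series.
# p = 0.55 is an illustrative value; any probability strictly between 0 and 1 works.
p <- 0.55
games <- 4:7

# A team wins in exactly j games when it has j - 4 losses before its 4th win,
# so each cell is a negative binomial pmf: choose(j-1, 3) * p^4 * (1-p)^(j-4).
joint <- cbind(
  B = dnbinom(games - 4, size = 4, prob = p),
  Y = dnbinom(games - 4, size = 4, prob = 1 - p)
)
rownames(joint) <- paste0("k=", games)

joint            # cell probabilities
rowSums(joint)   # marginal distribution of the series length
colSums(joint)   # marginal distribution of the winner
sum(joint)       # all cells together sum to 1
```

Dividing each row of joint by its row sum, or each column by its column sum, gives the row and column probabilities of the cross table.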

3. What is the probability that the Braves win the World Series given that P_B=0.55? Identify this quantity in the cross-table.

The probability that the Braves win the World Series is given by the column total of B:

\[ P(B)= (P_B)^4 + \left[{4\choose 3}(P_B)^3(P_Y)\right](P_B) + \left[{5\choose 3}(P_B)^3(P_Y)^2\right](P_B)+ \left[{6\choose 3}(P_Y)^3(P_B)^3\right](P_B). \]

Now, substituting \(P_B=0.55\) yields that \(P(B)=0.6083\).
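For reference, the substitution can be written out term by term with \(P_Y = 0.45\) (each term rounded to four decimals):

\[ P(B) = (0.55)^4 + 4(0.55)^4(0.45) + 10(0.55)^4(0.45)^2 + 20(0.55)^4(0.45)^3 \approx 0.0915 + 0.1647 + 0.1853 + 0.1668 = 0.6083. \]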

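This probability can also be calculated in R. The Braves win the series exactly when they lose at most 3 games before their 4th win, which is a negative binomial CDF:

P_B <- 0.55

# P(Braves win a first-to-4 series): at most 3 losses before the 4th win
PB <- function(prob){
  pnbinom(3, 4, prob)
}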

PB(P_B)
## [1] 0.6082878

4. What is the probability that the Braves win the World Series given that \(P_B=x\)? This will be a figure (see below) with \(P_B\) on the x-axis and \(P(\text{Braves win World Series})\) on the y-axis.

library(ggplot2)

# Grid of single-game win probabilities (N) and the matching series win probabilities
out <- data.frame(N = seq(0.5, 1, 0.01), prob = NA)

for(i in 1:nrow(out)){
  out[i,"prob"] <- PB(out[i,"N"])
}
Brave_WorldSeries <- ggplot(out,aes(x=N,y=prob))+
  geom_line(linewidth = 0.75,colour = "black")+
  labs(x="Probability of the Braves winning a head-to-head matchup",y="P(Braves win World Series)",title="Probability of Winning the World Series")+
  theme(plot.title = element_text(hjust=0.5))
Brave_WorldSeries

5. Suppose one could change the World Series to be first-to-5-wins or some other first-to-k-wins series. What is the smallest k so that \(P(\text{Braves win World Series}\mid P_B=0.55)\geq 0.8\)?

P_B<-0.55

# P(Braves win a first-to-K-wins series): at most K-1 losses before the K-th win

PBfirstkgames<-function(K){
  pnbinom(K-1,K,P_B)
}

out1<-data.frame(k = seq(1,100), prob=NA) # k is the number of wins

# Save the probabilities in a data frame

for (i in 1:nrow(out1)){
  out1[i,"prob"]<-PBfirstkgames(out1[i,"k"])
}

# Iterate through rows until prob >= 0.8

for(i in 1:nrow(out1)) {
  if (out1[i,"prob"]<0.8){
    next
  }
  else{
    smallest_k <- out1[i,"k"]
    smallest_k_prob<-out1[i,"prob"]
    break
  }
}

smallest_k
## [1] 36
smallest_k_prob
## [1] 0.8017017
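
The same search can be written without explicit loops, since pnbinom is vectorized over its arguments. This is only a sketch of an alternative; the names k_grid and win_prob are illustrative, and the upper bound of 100 is the same arbitrary cap used above.

```r
# Sketch: vectorized search for the smallest k with P(Braves win first-to-k) >= 0.8.
p <- 0.55
k_grid <- 1:100                                   # candidate numbers of wins needed
win_prob <- pnbinom(k_grid - 1, k_grid, p)        # at most k-1 losses before the k-th win

smallest_k <- k_grid[which(win_prob >= 0.8)[1]]   # first k that clears the threshold (NA if none)
smallest_k
win_prob[smallest_k]
```

Because k_grid starts at 1, position i of win_prob corresponds to k = i, so win_prob[smallest_k] is the probability at the selected k.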

6. What is the smallest k so that \(P(\text{Braves win World Series}|P_B=x)\geq 0.8\)? This will be a figure (see below) with \(P_B\) on the x-axis and \(k\) on the y-axis.

library(dplyr)
library(ggplot2)

# Grid of single-game probabilities (0.51 to 1) and candidate values of k (1 to 100)
out2 <- expand.grid(P_B = seq(0.51,1,0.01), k = seq(1,100), prob=NA)

# P(Braves win a first-to-K-wins series) when the single-game win probability is P_B
PBfirstkgames<-function(K, P_B){
  pnbinom(K-1,K,P_B)
}

# Save the probabilities in a data frame
for (i in 1:nrow(out2)){
  out2[i,"prob"]<-PBfirstkgames(out2[i,"k"],out2[i,"P_B"])
}

# Keep the smallest k that reaches 0.8 for each P_B;
# values of P_B that never reach 0.8 with k <= 100 are dropped
out3 <- out2 %>% filter(prob >= 0.8) %>% arrange(P_B, k) %>% filter(!duplicated(P_B))
Shortest_series_PB <- ggplot(out3,aes(x=P_B,y=k))+
  geom_line(linewidth = 0.75,colour = "black")+
  labs(x="Probability of the Braves winning a head-to-head matchup",y="Wins needed to clinch the series (k)",title="Shortest series so that P(Braves win World Series | P_B = x)>= 0.8")+
  theme(plot.title = element_text(hjust=0.5))

Shortest_series_PB

7. Calculate \(P(P_B=0.55|\text{Braves win World Series in 7 games})\) under the assumption that either \(P_B=0.55\) or \(P_Y=0.45\). Explain your solution.

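We read the assumption as saying that the Braves' single-game win probability is one of two candidate values, 0.55 or 0.45, and we further assume that the two values are equally likely a priori (the question does not state a prior). By Bayes' rule, the posterior probability of \(P_B=0.55\) is then the likelihood of a Braves win in 7 games under \(P_B=0.55\) divided by the sum of the likelihoods under both values. Because swapping \(P_B\) and \(P_Y\) turns the Braves' cell into the Yankees' cell, this ratio has the same form as the row probability of \(B\) in row \(k_7\) of our cross table, evaluated at \(P_B=0.55\):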

\[ P(B\mid k_7) = \frac{P(k_7\space\mid\space B)P(B)}{P(k_7)}=\frac{\left[{6\choose 3}(P_B)^3(P_Y)^3\right]P_B}{[{6\choose 3}(P_B)^3(P_Y)^3](P_B)+[{6\choose 3}(P_Y)^3(P_B)^3](P_Y)} \]

Substituting \(P_B=0.55\) yields that

((choose(6,3)*((0.55)^3)*((0.45)^3))*(0.55))/((choose(6,3)*((0.55)^3)*((0.45)^3))*(0.55)+(choose(6,3)*((0.45)^3)*((0.55)^3))*(0.45))
## [1] 0.55

Therefore, \(P(P_B=0.55|\text{Braves win World Series in 7 games})=0.55\).
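
The value \(0.55\) is no coincidence. The common factor \({6\choose 3}(P_B)^3(P_Y)^3\) cancels from the numerator and the denominator, leaving

\[ \frac{\left[{6\choose 3}(P_B)^3(P_Y)^3\right]P_B}{\left[{6\choose 3}(P_B)^3(P_Y)^3\right](P_B+P_Y)} = \frac{P_B}{P_B+P_Y} = P_B, \]

since \(P_B+P_Y=1\).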

8. Write an R function which generates random draws from the bivariate distribution. Identify what the inputs are and the structure of the outputs. Explain why such a function might be helpful.

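# Draw one series outcome: returns c(winner, games), with winner 1 = Braves, 2 = Yankees.
# n should equal 2k-1 so that exactly one team reaches k wins.
rws<-function(n,k,p){
  o<-rbinom(n,1,p)                  # 1 = Braves win the game, 0 = Yankees win
  if(any(cumsum(o)==k)){
    c(1,which(cumsum(o)==k)[1])     # Braves reach k wins first
  }else{
    c(2,which(cumsum(!o)==k)[1])    # otherwise the Yankees reach k wins
  }
}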

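This function takes 3 arguments in total:

Input   Description
n       Maximum number of games (2k - 1 guarantees that exactly one team reaches k wins)
k       Stopping rule (the series ends when a team wins k times)
p       Probability that the Braves win a single head-to-head matchup

The output is a vector of length 2: the winner (1 = Braves, 2 = Yankees) and the number of games played.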

First, we let o be a sequence of n = 7 simulated games, where each entry is 1 if the Braves win that game (with probability p) and 0 otherwise. Then we use the cumsum function, which keeps a running count of the Braves' wins; cumsum(!o) does the same for the Yankees. The last number in the cumsum(o) vector tells us how many games the Braves would win if all 7 games were played. Next, we use the any() function to check whether any element of the cumsum vector equals k (the stopping rule). Lastly, we use which() to extract the first position at which the cumsum vector equals k, which is the game in which the series ends. The function returns a vector with two elements: 1 or 2 (Braves won or Yankees won, respectively) and the number of games it took. For example, take \(P_B=0.55\), a maximum of \(7\) games, and the stopping rule \(k=4\) wins:

o<-rbinom(7,1,0.55)
o
## [1] 1 0 1 1 1 1 1
cumsum(o)
## [1] 1 1 2 3 4 5 6
any(cumsum(o)==4)
## [1] TRUE
c(1,which(cumsum(o)==4)[1])
## [1] 1 5
c(2,which(cumsum(!o)==4)[1])
## [1]  2 NA

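Such a function is useful because simulated draws let us approximate the joint distribution, and check the cross table, without deriving each cell by hand. It also extends to longer series (larger n and k), where writing out the table becomes tedious.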
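
One way to use such draws is to estimate the joint distribution by simulation and compare it with the analytic cells. The sketch below is self-contained: sim_series is a hypothetical helper written for this example (it fixes n = 2k - 1), and the seed and the number of replications are arbitrary choices.

```r
# Sketch: approximate the joint distribution of (games, winner) by simulation.
# sim_series() is a hypothetical helper written for this example.
sim_series <- function(k, p) {
  n <- 2 * k - 1                       # longest possible first-to-k series
  o <- rbinom(n, 1, p)                 # 1 = Braves win the game, 0 = Yankees win
  b_done <- which(cumsum(o) == k)[1]   # game in which the Braves reach k wins (NA if never)
  y_done <- which(cumsum(!o) == k)[1]  # game in which the Yankees reach k wins (NA if never)
  if (!is.na(b_done)) c(winner = 1, games = b_done) else c(winner = 2, games = y_done)
}

set.seed(1)                                           # arbitrary seed for reproducibility
draws <- t(replicate(20000, sim_series(k = 4, p = 0.55)))

# Relative frequencies estimate the cells of the cross table
sim_table <- table(games = draws[, "games"], winner = draws[, "winner"]) / nrow(draws)
sim_table
colSums(sim_table)   # estimated P(B) and P(Y); P(B) should be close to 0.608
```

With 20000 replications the simulation error in each estimated probability is roughly 0.003 or less, so the estimates should agree with the analytic table to about two decimals.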

9. The home field advantage is the edge which a team may have when playing a game at its home stadium. For example, it is the edge the Braves may have over the Yankees when the head-to-head match-up is in Atlanta. It is the advantage the Yankees may have when the head-to-head match-up is in New York. Explain how the derivation of the distribution would change if one were to account for home field advantage. Suppose that the schedule of games is

Game 1  Game 2  Game 3  Game 4  Game 5  Game 6  Game 7
ATL     ATL     NYC     NYC     NYC     ATL     ATL

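The derivation would change because the single-game probabilities \(P_B\) and \(P_Y\) would no longer be the same in every game. We could introduce a scalar \(c\) that represents the advantage of playing at home. For example, game 1 is held in Atlanta, so \(B\) has the advantage over \(Y\) and the Braves' win probability becomes \(P_B*c\), where \(c\geq 1\) is chosen so that \(P_B*c\leq 1\); \(c=1\) corresponds to no advantage. In the New York games the Yankees' probability is scaled in the same way. Because the games are then no longer identically distributed, the cell probabilities can no longer be written as a binomial coefficient times powers of \(P_B\) and \(P_Y\). Instead, each cell is the sum, over every win-loss sequence that ends in that outcome, of the product of the game-specific probabilities. In this schedule 4 of the 7 possible games are played in Atlanta, so we would expect the Braves to be more likely to win the World Series than they would be without home-field advantage.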
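
A minimal sketch of this calculation is given below, assuming the multiplicative form of the advantage described above, applied to whichever team is at home. The function name series_prob_home and the argument c_adv (standing in for \(c\)) are hypothetical, and the value 1.1 is only an illustration.

```r
# Sketch: P(Braves win the series) with a multiplicative home-field advantage.
# Assumption: the home team's single-game win probability is multiplied by c_adv (>= 1).
series_prob_home <- function(p_b, c_adv,
                             home = c("ATL", "ATL", "NYC", "NYC", "NYC", "ATL", "ATL")) {
  p_y <- 1 - p_b
  # Braves' win probability in each game, depending on where it is played
  p_game <- ifelse(home == "ATL", p_b * c_adv, 1 - p_y * c_adv)
  stopifnot(all(p_game >= 0 & p_game <= 1))

  # Enumerate all 2^7 win/loss patterns as if every game were played;
  # the Braves win the series exactly when they win at least 4 of the 7.
  outcomes <- as.matrix(expand.grid(rep(list(0:1), length(home))))
  probs <- apply(outcomes, 1, function(w) prod(ifelse(w == 1, p_game, 1 - p_game)))
  sum(probs[rowSums(outcomes) >= 4])
}

series_prob_home(0.55, 1)     # no advantage: matches pnbinom(3, 4, 0.55)
series_prob_home(0.55, 1.1)   # illustrative 10% multiplicative home edge
```

Playing out all 7 games does not change who wins the series, which is why enumerating full-length sequences gives the same series probability as stopping at the 4th win. Recovering the individual cells of the cross table would additionally require recording the game in which the 4th win occurs.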