1 Introduction

This blog post investigates how much home field advantage, if it exists, affects the outcome of a series. The analysis uses the well-known World Series, with the Braves and the Yankees as the two teams. The intended reader is an HR manager tasked with hiring a data scientist.

1.1 Background

1.1.1 World Series

The World Series is the annual championship series of Major League Baseball (MLB) and concludes the MLB postseason. Learn more about the World Series at https://en.wikipedia.org/wiki/List_of_World_Series_champions

1.1.2 Home advantage

The home field advantage is the edge which a team may have when playing a game at its home stadium. For example, it is the edge the Braves may have over the Yankees when the head-to-head match-up is in Atlanta. It is the advantage the Yankees may have when the head-to-head match-up is in New York.

1.1.3 Probability

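We suppose that in any given game, the probability that the Braves win is P_B (written PB in the text) and the probability that the Yankees win is:

\[P_Y = 1 - P_B\]

1.1.4 Probability with home advantage

For the Braves, the per-game win probability with home advantage is:

\[\text{At Atlanta:}\quad P_{B}^{\text{home}} = 1.1\,P_B\]

\[\text{At New York:}\quad P_{B}^{\text{away}} = 1 - 1.1\,(1 - P_B)\]

1.1.5 Relative error

\[\text{relative error} = \frac{|\hat{p} - p|}{p}\]

Here \(\hat{p}\) is the estimate (for example, a simulated probability) and \(p\) is the correct value (for example, the analytic probability).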

While the absolute error measures how large the error is, the relative error measures how large the error is relative to the correct value.

Since the relative error equals the absolute error divided by the true probability, and a probability is at most 1, the relative error is always at least as large as the absolute error.

Relative error is explained in more detail at https://www.statisticshowto.com/relative-error/
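To make the two error measures concrete, here is a minimal sketch in R. The helper names `abs_error` and `rel_error` and the two probabilities are made up for illustration; they are not results from this post.

```r
# Absolute and relative error of an estimate p_hat against a true value p
abs_error <- function(p_hat, p) abs(p_hat - p)
rel_error <- function(p_hat, p) abs(p_hat - p) / p

p     <- 0.60  # hypothetical true probability
p_hat <- 0.59  # hypothetical simulated estimate

abs_error(p_hat, p)  # about 0.01
rel_error(p_hat, p)  # about 0.0167, larger than the absolute error because p < 1
```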

1.2 Questions to answer

1. Compute analytically the probability that the Braves win the World Series when the sequence of game locations is {NYC, NYC, ATL, ATL, ATL, NYC, NYC}. (The code below computes the probability for the alternative sequence of game locations. Note: The code uses data.table syntax, which may be new to you. This is intentional, as a gentle way to introduce data.table.) Calculate the probability with and without home field advantage when PB = 0.55. What is the difference in probabilities?

2. Calculate the same probabilities as the previous question by simulation.

3. What are the absolute and relative errors for your simulation in the previous question?

4. Does the difference in probabilities (with vs without home field advantage) depend on PB? (Generate a plot to answer this question.)

5. Does the difference in probabilities (with vs without home field advantage) depend on the advantage factor? (The advantage factor in PBH and PBA is the 1.1 multiplier that results in a 10% increase for the home team. Generate a plot to answer this question.)

2 Solutions

library(tidyverse)   # loads dplyr, which provides case_when()
library(data.table)  # fread() and the := syntax

Question 1:

apo <- fread("all-possible-world-series-outcomes.csv")

# hfi is the home field indicator for the Braves (1 = Atlanta, 0 = New York).
# The default is the sequence NYC, NYC, ATL, ATL, ATL, NYC, NYC.
world_series_analytics <- function(game = apo,
                                    hfi = c(0,0,1,1,1,0,0),
                                    pb = .55,
                                    advantage_multiplier = 1.1){

  game <- copy(game) # Work on a copy so the global table is not modified
  pbh <- pb*advantage_multiplier            # Braves win probability at home
  pba <- 1 - (1 - pb)*advantage_multiplier  # Braves win probability away

  # Calculate the probability of each possible outcome
  game[, p := NA_real_] # Initialize new column to store prob
  for(i in 1:nrow(game)){
    prob_game <- rep(1, 7)
    for(j in 1:7){
      p_win <- ifelse(hfi[j], pbh, pba)
      prob_game[j] <- case_when(
          game[i,j,with=FALSE] == "W" ~ p_win
        , game[i,j,with=FALSE] == "L" ~ 1 - p_win
        , TRUE ~ 1 # Game not played in this outcome
      )
    }
    game[i, p := prod(prob_game)] # Data.table syntax
  }

  # Sanity check: the outcome probabilities must sum to 1
  stopifnot(isTRUE(all.equal(game[, sum(p)], 1)))

  # Probability of overall World Series outcomes
  game[, sum(p), overall_outcome]
}
analytics_outcome <- world_series_analytics()
analytics_outcome 
##    overall_outcome       V1
## 1:               W 0.604221
## 2:               L 0.395779
analytics_outcome_non_advantage <- world_series_analytics(advantage_multiplier = 1) # no home advantage
analytics_outcome_non_advantage
##    overall_outcome        V1
## 1:               W 0.6082878
## 2:               L 0.3917122
analytics_outcome_non_advantage$V1[1] - analytics_outcome$V1[1]
## [1] 0.004066825

With home field advantage, the Braves' probability of winning the series is 0.004066825 lower than without it (0.604221 versus 0.6082878). A likely reason is that under this schedule the Braves play four away games and only three home games, so the Yankees benefit more from home field advantage than the Braves do. Even so, the difference between the two scenarios is small.
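The analytic result can also be cross-checked without the CSV file. The sketch below is not the data.table approach used above. It relies on the fact that playing out all seven games does not change which team reaches four wins first, so the series win probability equals the probability of at least four wins in seven independent games. The helper name `series_win_prob` is made up for this illustration.

```r
# Probability of winning a best-of-7 series, given each game's win probability.
# Enumerates all 2^7 win/loss patterns and sums those with at least 4 wins.
series_win_prob <- function(p_games) {
  outcomes <- as.matrix(expand.grid(rep(list(c(0, 1)), length(p_games))))
  # Probability of each pattern: p for a win, 1 - p for a loss
  probs <- apply(outcomes, 1, function(w) prod(ifelse(w == 1, p_games, 1 - p_games)))
  sum(probs[rowSums(outcomes) >= 4])
}

hfi <- c(0, 0, 1, 1, 1, 0, 0)  # NYC, NYC, ATL, ATL, ATL, NYC, NYC
pb  <- 0.55
m   <- 1.1                     # advantage multiplier

p_with    <- ifelse(hfi == 1, pb * m, 1 - (1 - pb) * m)
p_without <- rep(pb, 7)

series_win_prob(p_with)     # about 0.6042
series_win_prob(p_without)  # about 0.6083, same as pnbinom(3, 4, 0.55)
```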

Question 2:

sim_world_series <- function(hfi = c(0,0,1,1,1,0,0),
                                    pb = .55,
                                    advantage_multiplier = 1.1){
  
  pbh <- pb*advantage_multiplier
  pba <- 1 - (1 - pb)*advantage_multiplier
  
  num_win = 0
  for(i in 1:7) {
    if(hfi[i]){
      p_win = pbh
    } else{
      p_win = pba
    }
    
    game_outcome = rbinom(1, 1, p_win)
    num_win = num_win + game_outcome
    if (num_win == 4 || (i - num_win == 4)) break # Stop once either team has 4 wins
  }
  return(num_win == 4)
}

sim_result_withadv = NA
for (k in 1:10000){
  sim_result_withadv[k] = sim_world_series()
}

mean(sim_result_withadv)
## [1] 0.5938
1-mean(sim_result_withadv)
## [1] 0.4062
# Repeat the simulation without home field advantage
sim_result_withoutadv = NA
for (k in 1:10000){
  sim_result_withoutadv[k] = sim_world_series(advantage_multiplier = 1)
}

mean(sim_result_withoutadv)
## [1] 0.606
1-mean(sim_result_withoutadv)
## [1] 0.394
abs(mean(sim_result_withadv) - mean(sim_result_withoutadv))
## [1] 0.0122
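As an alternative to the loop above, the simulation can be sketched in vectorized form. This version is an assumption-laden illustration rather than the method used in this post: it simulates all seven games of every series (which gives the same series winner as stopping at four wins), and it fixes a seed, chosen arbitrarily, so the sketch is reproducible. Its estimate will therefore differ slightly from the numbers printed above.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible

n_sims <- 10000
hfi <- c(0, 0, 1, 1, 1, 0, 0)
pb  <- 0.55
m   <- 1.1
p_games <- ifelse(hfi == 1, pb * m, 1 - (1 - pb) * m)

# One row per simulated series, one column per game (1 = Braves win)
wins <- matrix(rbinom(n_sims * 7, 1, rep(p_games, each = n_sims)), nrow = n_sims)

# The Braves win the series when they win at least 4 of the 7 games
p_hat <- mean(rowSums(wins) >= 4)
p_hat
```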

The difference is still small with 10,000 trials. The probable reason is the one given in Question 1.

Question 3:

(abs_error = abs(mean(sim_result_withadv) - analytics_outcome$V1[1]))
## [1] 0.01042097
(relative_error = abs_error/analytics_outcome$V1[1])
## [1] 0.01724695
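To judge whether an absolute error of about 0.01 is reasonable for 10,000 trials, it can be compared with the standard error of a simulated proportion, sqrt(p(1 − p)/n). The sketch below plugs in the values reported above; the variable names are chosen for this illustration only.

```r
p_true <- 0.604221  # analytic probability with home advantage (reported above)
p_hat  <- 0.5938    # simulated estimate (reported above)
n_sims <- 10000

# Standard error of a proportion estimated from n_sims independent trials
se <- sqrt(p_true * (1 - p_true) / n_sims)
se  # about 0.0049

# Observed absolute error expressed in standard errors
abs_err <- abs(p_hat - p_true)
abs_err / se  # about 2.1
```

An error of roughly two standard errors is on the large side but still plausible for a single run, which suggests that more trials would be needed to pin down a difference as small as 0.004.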

Question 4:

PB_initial= seq(0.5,1,0.01)
tt <- rep(0,length(PB_initial))
for (i in 1:length(PB_initial))
{
  #print(PB_initial[i])
  tt[i]<- world_series_analytics(pb = PB_initial[i],advantage_multiplier= 1.1)[[2]][1] - pnbinom(3,4,PB_initial[i])
}
tt
##  [1] -1.571970e-02 -1.340850e-02 -1.106947e-02 -8.720791e-03 -6.380600e-03
##  [6] -4.066825e-03 -1.797034e-03  4.117187e-04  2.543040e-03  4.581343e-03
## [11]  6.511988e-03  8.321413e-03  9.997262e-03  1.152850e-02  1.290552e-02
## [16]  1.412024e-02  1.516620e-02  1.603862e-02  1.673448e-02  1.725254e-02
## [21]  1.759341e-02  1.775953e-02  1.775519e-02  1.758653e-02  1.726144e-02
## [26]  1.678957e-02  1.618222e-02  1.545225e-02  1.461395e-02  1.368290e-02
## [31]  1.267581e-02  1.161030e-02  1.050471e-02  9.377800e-03  8.248542e-03
## [36]  7.135757e-03  6.057801e-03  5.032192e-03  4.075209e-03  3.201459e-03
## [41]  2.423409e-03  1.750881e-03  1.190517e-03  7.451980e-04  4.134327e-04
## [46]  1.887047e-04  5.877912e-05  4.968852e-06  1.359733e-06  1.399227e-05
## [51] -1.110223e-16
plot(PB_initial,tt,xlab = "PB", ylab = "Difference")
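As a cross-check on the loop above, the same difference can be sketched in closed form without the CSV file. This rests on two assumptions: the Braves play three home and four away games (as in NYC, NYC, ATL, ATL, ATL, NYC, NYC), and PB is restricted to values where PB × 1.1 stays at or below 1. The names `win_prob_3h4a`, `pb_grid`, and `diff_grid` are made up for this illustration.

```r
# P(Braves win the series) with 3 home and 4 away games.
# Game order does not affect who wins the series, so we condition on the
# number of home wins x and require more than 3 - x away wins.
win_prob_3h4a <- function(pb, m = 1.1) {
  pbh <- pb * m
  pba <- 1 - (1 - pb) * m
  x <- 0:3
  sum(dbinom(x, 3, pbh) * pbinom(3 - x, 4, pba, lower.tail = FALSE))
}

pb_grid <- seq(0.5, 0.9, by = 0.01)  # keeps pb * 1.1 <= 1
diff_grid <- sapply(pb_grid, win_prob_3h4a) - pnbinom(3, 4, pb_grid)

plot(pb_grid, diff_grid, type = "l", xlab = "PB", ylab = "Difference")
abline(h = 0, lty = 2)  # dashed zero line marks where the sign changes
```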

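Yes, it does depend on PB. The difference (with minus without home field advantage) is negative at PB = 0.5, crosses zero near PB = 0.57, peaks at about 0.018 around PB = 0.71, and then shrinks toward 0 as PB approaches 1. Note that for PB above 1/1.1 ≈ 0.91 the home probability PB × 1.1 exceeds 1, so that part of the curve should be read with caution.

Question 5:

The advantage multiplier is varied from 1 up to 1/0.55 ≈ 1.82, because beyond that PBhome would exceed 1.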
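The upper limit on the multiplier follows from requiring both per-game probabilities to stay between 0 and 1. A small sketch of that bound, with `max_multiplier` as a made-up helper name:

```r
# Largest advantage multiplier m that keeps both per-game probabilities valid:
#   home: pb * m <= 1            ->  m <= 1 / pb
#   away: 1 - (1 - pb) * m >= 0  ->  m <= 1 / (1 - pb)
max_multiplier <- function(pb) {
  min(1 / pb, 1 / (1 - pb))
}

max_multiplier(0.55)  # 1 / 0.55, about 1.818
```

For PB = 0.55 the home constraint is the binding one, which is why the sequence of multipliers stops at 1/0.55.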

Ad_initial=seq(1,1/0.55,by=0.01) 
ad <- rep(0,length(Ad_initial))
for (i in 1:length(Ad_initial))
{
  #print(Ad_initial[i])
  ad[i]<- abs(world_series_analytics(pb = 0.55,advantage_multiplier= Ad_initial[i])[[2]][1] - pnbinom(3,4,0.55))
}
ad
##  [1] 5.551115e-16 4.496765e-04 8.892095e-04 1.318831e-03 1.738767e-03
##  [6] 2.149234e-03 2.550446e-03 2.942610e-03 3.325929e-03 3.700602e-03
## [11] 4.066825e-03 4.424793e-03 4.774697e-03 5.116731e-03 5.451087e-03
## [16] 5.777958e-03 6.097544e-03 6.410043e-03 6.715662e-03 7.014613e-03
## [21] 7.307117e-03 7.593403e-03 7.873711e-03 8.148296e-03 8.417427e-03
## [26] 8.681387e-03 8.940483e-03 9.195040e-03 9.445407e-03 9.691962e-03
## [31] 9.935111e-03 1.017529e-02 1.041298e-02 1.064868e-02 1.088297e-02
## [36] 1.111642e-02 1.134971e-02 1.158354e-02 1.181866e-02 1.205592e-02
## [41] 1.229622e-02 1.254051e-02 1.278986e-02 1.304540e-02 1.330835e-02
## [46] 1.358005e-02 1.386191e-02 1.415548e-02 1.446241e-02 1.478447e-02
## [51] 1.512358e-02 1.548179e-02 1.586130e-02 1.626446e-02 1.669380e-02
## [56] 1.715202e-02 1.764199e-02 1.816679e-02 1.872971e-02 1.933425e-02
## [61] 1.998412e-02 2.068330e-02 2.143600e-02 2.224670e-02 2.312014e-02
## [66] 2.406137e-02 2.507574e-02 2.616889e-02 2.734683e-02 2.861587e-02
## [71] 2.998273e-02 3.145446e-02 3.303852e-02 3.474279e-02 3.657557e-02
## [76] 3.854558e-02 4.066204e-02 4.293462e-02 4.537349e-02 4.798937e-02
## [81] 5.079347e-02 5.379760e-02
plot(Ad_initial,ad,xlab = "Advantage Factor", ylab = "Difference")

Yes, it does depend on the advantage factor. As the advantage factor increases, the absolute difference keeps increasing.