This blog post examines how much home-field advantage, if it exists, affects the outcome of a series. The data set analyzed is the widely known World Series, with the Braves and the Yankees as the two teams. The intended reader is an HR manager tasked with hiring a data scientist.
The World Series is the annual championship series of Major League Baseball (MLB) and concludes the MLB postseason. Learn more about the World Series at https://en.wikipedia.org/wiki/List_of_World_Series_champions
The home field advantage is the edge which a team may have when playing a game at its home stadium. For example, it is the edge the Braves may have over the Yankees when the head-to-head match-up is in Atlanta. It is the advantage the Yankees may have when the head-to-head match-up is in New York.
We suppose that in any given game, the probability that the Braves win is \(P_B\) (written PB in the rest of this post) and the probability that the Yankees win is:
\[P_Y = 1 - P_B\]
### Probability with home advantage
For the Braves, the probability of winning a single game with home advantage is:
\[\text{At Atlanta:}~P_{B}^{home} = P_B \times 1.1\]
\[\text{At New York:}~P_{B}^{away} = 1 - (1 - P_B) \times 1.1\]
### Relative Error
\[\text{relative error} = |\hat{p} - p| / p\]
While the absolute error gives how large the error is, the relative error gives how large the error is relative to the correct value.
Because relative error equals absolute error divided by the true probability p, and p is at most 1, the relative error is always at least as large as the absolute error.
Relative error is clearly explained at this link: https://www.statisticshowto.com/relative-error/
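As a small illustration of the two error measures, here is a minimal sketch in R. The helper name `error_summary` is hypothetical (it is not part of the original analysis), and the inputs 0.5938 and 0.604221 are the simulated and analytic values reported later in this post.

```r
# Hypothetical helper (not from the original post): absolute and relative
# error of an estimate p_hat against a reference value p.
error_summary <- function(p_hat, p) {
  abs_error <- abs(p_hat - p)
  c(absolute = abs_error, relative = abs_error / p)
}

# Simulated estimate vs. analytic value, both taken from this post
errs <- error_summary(p_hat = 0.5938, p = 0.604221)
print(errs)

# Because p is at most 1, the relative error is never the smaller of the two
print(errs[["relative"]] >= errs[["absolute"]])
```

With these inputs the relative error comes out larger than the absolute error, as the argument above suggests it must.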
1. Compute analytically the probability that the Braves win the world series when the sequence of game locations is {NYC, NYC, ATL, ATL, ATL, NYC, NYC}. (The code below computes the probability for the alternative sequence of game locations. Note: The code uses data.table syntax, which may be new to you. This is intentional, as a gentle way to introduce data.table.) Calculate the probability with and without home field advantage when PB = 0.55. What is the difference in probabilities?
2. Calculate the same probabilities as the previous question by simulation.
3. What is the absolute and relative error for your simulation in the previous question?
4. Does the difference in probabilities (with vs without home field advantage) depend on PB? (Generate a plot to answer this question.)
5. Does the difference in probabilities (with vs without home field advantage) depend on the advantage factor? (The advantage factor in PBH and PBA is the 1.1 multiplier that results in a 10% increase for the home team. Generate a plot to answer this question.)
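To make the 1.1 multiplier from the last question concrete, the sketch below computes the Braves' per-game win probability at home and away for PB = 0.55. The helper name `home_away_probs` is hypothetical and only illustrates the two formulas given earlier; it is not the code used for the analysis.

```r
# Hypothetical helper (not from the original post): per-game Braves win
# probabilities at home and away for a given advantage multiplier.
home_away_probs <- function(pb, advantage_multiplier = 1.1) {
  c(home = pb * advantage_multiplier,
    away = 1 - (1 - pb) * advantage_multiplier)
}

probs_adv  <- home_away_probs(0.55)      # roughly home = 0.605, away = 0.505
probs_none <- home_away_probs(0.55, 1)   # multiplier 1: both equal pb
print(probs_adv)
print(probs_none)

# Largest multiplier that keeps both values inside [0, 1] for pb = 0.55
max_multiplier <- min(1 / 0.55, 1 / (1 - 0.55))
print(max_multiplier)
```

The bound on the multiplier matters for the last question: beyond 1/0.55 the home probability would exceed 1.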
library(tidyverse)   # loads dplyr, which provides case_when()
library(data.table)  # provides fread() and the := syntax used below
Question 1:
apo <- fread("all-possible-world-series-outcomes.csv")
# hfi is the home-field indicator for the Braves: 1 = game in ATL, 0 = game in NYC
# Default sequence: NYC, NYC, ATL, ATL, ATL, NYC, NYC
world_series_analytics <- function(game = apo,
                                   hfi = c(0,0,1,1,1,0,0),
                                   pb = .55,
                                   advantage_multiplier = 1.1){
  pbh <- pb*advantage_multiplier            # Braves win probability at home
  pba <- 1 - (1 - pb)*advantage_multiplier  # Braves win probability away
  # Calculate the probability of each possible outcome
  game[, p := NA_real_] # Initialize new column to store prob (updated by reference)
  for(i in 1:nrow(game)){
    prob_game <- rep(1, 7)
    for(j in 1:7){
      p_win <- ifelse(hfi[j], pbh, pba)
      # "W" = Braves win game j, "L" = Braves lose, anything else = game not played
      prob_game[j] <- case_when(
        game[i,j,with=FALSE] == "W" ~ p_win
        , game[i,j,with=FALSE] == "L" ~ 1 - p_win
        , TRUE ~ 1
      )
    }
    game[i, p := prod(prob_game)] # Data.table syntax
  }
  # Sanity check: does sum(p) == 1?
  game[, sum(p)] # This is data.table notation
  # Probability of overall World Series outcomes
  game[, sum(p), overall_outcome]
}
analytics_outcome <- world_series_analytics()
analytics_outcome
## overall_outcome V1
## 1: W 0.604221
## 2: L 0.395779
analytics_outcome_non_advantage <- world_series_analytics(advantage_multiplier = 1) # no home advantage
analytics_outcome_non_advantage
## overall_outcome V1
## 1: W 0.6082878
## 2: L 0.3917122
analytics_outcome_non_advantage$V1[1] - analytics_outcome$V1[1]
## [1] 0.004066825
With home-field advantage, the Braves' probability of winning the series (0.604221) is 0.004066825 lower than without it (0.6082878). This is probably because the Braves play up to four away games and only three home games, so the Yankees benefit more from home advantage than the Braves do. The difference is small because the two effects largely offset each other.
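The no-advantage figure can also be checked in closed form. The sketch below assumes that games are independent and that every game is won with the same probability PB = 0.55; under that assumption the Braves win the series if they lose at most 3 games before their 4th win, which is a negative binomial probability.

```r
# Probability that the Braves win a best-of-seven series when every game
# is won independently with the same probability pb (no home-field effect).
pb <- 0.55

# Direct sum over k = 0..3 Braves losses before their 4th win
k <- 0:3
p_direct <- sum(choose(3 + k, k) * pb^4 * (1 - pb)^k)

# Same quantity via the negative binomial CDF
p_nbinom <- pnbinom(3, size = 4, prob = pb)

print(c(direct = p_direct, nbinom = p_nbinom))
```

Both values should agree with the 0.6082878 obtained above from the table of all possible outcomes.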
Question 2:
sim_world_series <- function(hfi = c(0,0,1,1,1,0,0),
                             pb = .55,
                             advantage_multiplier = 1.1){
  pbh <- pb*advantage_multiplier
  pba <- 1 - (1 - pb)*advantage_multiplier
  num_win <- 0
  for(i in 1:7) {
    # Pick the Braves' win probability for this game's location
    if(hfi[i]){
      p_win <- pbh
    } else{
      p_win <- pba
    }
    game_outcome <- rbinom(1, 1, p_win)
    num_win <- num_win + game_outcome
    # Stop as soon as either team has 4 wins
    if (num_win == 4 || (i - num_win == 4)) break
  }
  return(num_win == 4)
}
sim_result_withadv = NA
for (k in 1:10000){
sim_result_withadv[k] = sim_world_series()
}
mean(sim_result_withadv)
## [1] 0.5938
1-mean(sim_result_withadv)
## [1] 0.4062
# Repeat the simulation without home-field advantage
sim_result_withoutadv = NA
for (k in 1:10000){
sim_result_withoutadv[k] = sim_world_series(advantage_multiplier = 1)
}
mean(sim_result_withoutadv)
## [1] 0.606
1-mean(sim_result_withoutadv)
## [1] 0.394
abs(mean(sim_result_withadv) - mean(sim_result_withoutadv))
## [1] 0.0122
The difference is still small (0.0122) with 10000 trials. The probable reason is the one given in Question 1.
Question 3:
(abs_error = abs(mean(sim_result_withadv) - analytics_outcome$V1[1]))
## [1] 0.01042097
(relative_error = abs_error/analytics_outcome$V1[1])
## [1] 0.01724695
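One way to judge whether an absolute error of this size is plausible is to compare it with the Monte Carlo standard error. The sketch below assumes the 10000 simulated series are independent, so that the simulated win rate behaves like a binomial proportion; the inputs are the analytic probability and the absolute error reported above.

```r
# Assumption: the simulated win rate is a binomial proportion, so its
# standard error is sqrt(p * (1 - p) / n).
p <- 0.604221        # analytic probability with home advantage
n <- 10000           # number of simulated series
abs_error <- 0.01042097

se <- sqrt(p * (1 - p) / n)
print(se)               # roughly 0.005
print(abs_error / se)   # absolute error expressed in standard errors
```

Under this assumption the observed error is about two standard errors, which is on the large side but still consistent with ordinary simulation noise.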
Question 4:
PB_initial <- seq(0.5, 1, 0.01)
tt <- rep(0, length(PB_initial))
for (i in seq_along(PB_initial)) {
  # Difference: P(win series | home advantage) - P(win series | no advantage)
  tt[i] <- world_series_analytics(pb = PB_initial[i], advantage_multiplier = 1.1)[[2]][1] -
    pnbinom(3, 4, PB_initial[i])
}
tt
## [1] -1.571970e-02 -1.340850e-02 -1.106947e-02 -8.720791e-03 -6.380600e-03
## [6] -4.066825e-03 -1.797034e-03 4.117187e-04 2.543040e-03 4.581343e-03
## [11] 6.511988e-03 8.321413e-03 9.997262e-03 1.152850e-02 1.290552e-02
## [16] 1.412024e-02 1.516620e-02 1.603862e-02 1.673448e-02 1.725254e-02
## [21] 1.759341e-02 1.775953e-02 1.775519e-02 1.758653e-02 1.726144e-02
## [26] 1.678957e-02 1.618222e-02 1.545225e-02 1.461395e-02 1.368290e-02
## [31] 1.267581e-02 1.161030e-02 1.050471e-02 9.377800e-03 8.248542e-03
## [36] 7.135757e-03 6.057801e-03 5.032192e-03 4.075209e-03 3.201459e-03
## [41] 2.423409e-03 1.750881e-03 1.190517e-03 7.451980e-04 4.134327e-04
## [46] 1.887047e-04 5.877912e-05 4.968852e-06 1.359733e-06 1.399227e-05
## [51] -1.110223e-16
plot(PB_initial,tt,xlab = "PB", ylab = "Difference")
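Yes, it does depend on PB. The difference (with minus without home advantage) is negative for PB below about 0.57, rises to a peak of about 0.018 near PB = 0.71, then decreases and gets close to 0 as PB approaches 1. Note that for PB above 1/1.1 (about 0.909), PBhome = PB * 1.1 exceeds 1, so the points in that range do not correspond to valid probabilities.
Question 5:
Set the advantage multiplier to range up to 1/0.55, because PBhome cannot be greater than 1.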
Ad_initial <- seq(1, 1/0.55, by = 0.01)
ad <- rep(0, length(Ad_initial))
for (i in seq_along(Ad_initial)) {
  # Absolute difference between the with-advantage and no-advantage probabilities
  ad[i] <- abs(world_series_analytics(pb = 0.55, advantage_multiplier = Ad_initial[i])[[2]][1] -
    pnbinom(3, 4, 0.55))
}
ad
## [1] 5.551115e-16 4.496765e-04 8.892095e-04 1.318831e-03 1.738767e-03
## [6] 2.149234e-03 2.550446e-03 2.942610e-03 3.325929e-03 3.700602e-03
## [11] 4.066825e-03 4.424793e-03 4.774697e-03 5.116731e-03 5.451087e-03
## [16] 5.777958e-03 6.097544e-03 6.410043e-03 6.715662e-03 7.014613e-03
## [21] 7.307117e-03 7.593403e-03 7.873711e-03 8.148296e-03 8.417427e-03
## [26] 8.681387e-03 8.940483e-03 9.195040e-03 9.445407e-03 9.691962e-03
## [31] 9.935111e-03 1.017529e-02 1.041298e-02 1.064868e-02 1.088297e-02
## [36] 1.111642e-02 1.134971e-02 1.158354e-02 1.181866e-02 1.205592e-02
## [41] 1.229622e-02 1.254051e-02 1.278986e-02 1.304540e-02 1.330835e-02
## [46] 1.358005e-02 1.386191e-02 1.415548e-02 1.446241e-02 1.478447e-02
## [51] 1.512358e-02 1.548179e-02 1.586130e-02 1.626446e-02 1.669380e-02
## [56] 1.715202e-02 1.764199e-02 1.816679e-02 1.872971e-02 1.933425e-02
## [61] 1.998412e-02 2.068330e-02 2.143600e-02 2.224670e-02 2.312014e-02
## [66] 2.406137e-02 2.507574e-02 2.616889e-02 2.734683e-02 2.861587e-02
## [71] 2.998273e-02 3.145446e-02 3.303852e-02 3.474279e-02 3.657557e-02
## [76] 3.854558e-02 4.066204e-02 4.293462e-02 4.537349e-02 4.798937e-02
## [81] 5.079347e-02 5.379760e-02
plot(Ad_initial,ad,xlab = "Advantage Factor", ylab = "Difference")
Yes, it does depend on the advantage factor. As the advantage factor increases, the absolute difference keeps increasing.