require(tidyverse)

1 – Intro

This document introduces a framework for simulating races based on each participant's initial probability of finishing first. The model has one main input, betting odds, and outputs the distribution of finishing positions for each race participant. It is simple, fast, and easy to interpret, and it applies to NASCAR, horse racing, or really any other kind of race. It is also the base of the multilevel NASCAR DraftKings model which I will be rolling out in February. I’ve been working on this DraftKings model for the better part of a year now, and it is almost ready to go live.

2 – The Model

This model starts with a main assumption, that betting odds represent the probability that a driver comes in first. From there, we define the following:

\(R_i\) – Racer \(i\), where \(i = 1, 2, ..., n\)
\(n\) – Number of racers
\(O_i\) – Decimal betting odds of Racer \(i\), where \(i = 1, 2, ..., n\)

Decimal betting odds take values in \([1,\infty)\). Decimal odds of 2 represent 1/1 odds; in plain English, if I bet one dollar and win the bet, I profit one dollar. Decimal odds of 1.5 represent 1/2 odds, meaning I have to bet two dollars to win one dollar of profit. We assume that these odds imply a probability of each racer winning. We can convert the betting odds to probabilities with the following transformation.
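To make the payout arithmetic concrete, here is a small sketch in R. The helper decimal_to_profit() is a hypothetical name introduced only for this illustration; it is not part of any package.

```r
# Hypothetical helper: profit on a winning bet placed at decimal odds `odds`.
# Decimal odds include the returned stake, so profit = stake * (odds - 1).
decimal_to_profit <- function(odds, stake = 1) {
  stake * (odds - 1)
}

decimal_to_profit(2)               # 1/1 odds: a $1 bet profits $1
decimal_to_profit(1.5, stake = 2)  # 1/2 odds: a $2 bet profits $1
```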

\(\phi^*_i = \frac{1}{O_i}\)

Almost always, sportsbook betting odds have vigorish, or “juice”, built into them. The vigorish is the edge that the sportsbook gives itself: by decreasing every payout by a certain margin, it tilts the bet in the house’s favor. This means that the sum of the raw implied probabilities \(\phi^*_i\) is greater than 1. To fix this and eliminate “the juice”, we normalize these probabilities \(\phi^*_i\).

\(\phi_i = \frac{\phi^*_i}{\sum_{k=1}^n \phi^*_k}\)
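As a quick illustration of the two transformations above, here is a minimal sketch in R. The odds are made-up values for a four-racer field, and the raw probability is taken to be the reciprocal of the decimal odds (the usual implied-probability convention), which is an assumption of this sketch.

```r
# Hypothetical decimal odds for a four-racer field (illustrative values only)
decimal_odds <- c(A = 2.2, B = 3.5, C = 4.5, D = 8.0)

# Raw implied probabilities: assumed to be the reciprocal of the decimal odds
phi_raw <- 1 / decimal_odds

# The juice shows up as raw probabilities that sum to more than 1
overround <- sum(phi_raw) - 1

# Normalize so the win probabilities sum to exactly 1
phi <- phi_raw / sum(phi_raw)

round(phi, 3)
```

Normalizing this way spreads the juice proportionally across all racers, which is the simplest choice but not the only possible one.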

These probabilities, \(\phi_i\), now represent the probability of each racer finishing first. Next, we need the probability of each racer finishing in every possible position. We define \(\Phi_{i,j}\), a matrix whose \(i\)-th row stores the marginal distribution of racer \(i\)’s finishing position. The betting odds probabilities \(\phi_i\) fill the first column of \(\Phi_{i,j}\).

\(\Phi_{i,j} = \begin{bmatrix} P(R_1 = 1) & P(R_1 = 2) & \dots & P(R_1 = n)\\ P(R_2 = 1) & P(R_2 = 2) & \dots & P(R_2 = n)\\ \vdots & & & \\ P(R_i = 1) & P(R_i = 2) & \dots & P(R_i = n)\\ \vdots & & & \\ P(R_n = 1) & P(R_n = 2) & \dots & P(R_n = n) \end{bmatrix}\)

The \(\Phi\) matrix is a doubly stochastic matrix, meaning every row and every column sums to 1.

Example 1: Consider the case where n = 2.
\(R = \{A, B\}\)
\(\phi = \{.7, .3\}\)

\(\Phi_{i,j} = \begin{bmatrix} .7 & P(R_1 = 2)\\ .3 & P(R_2 = 2)\\ \end{bmatrix}\)

It is clear that:

\(\Phi_{i,j} = \begin{bmatrix} .7 & .3\\ .3 & .7\\ \end{bmatrix}\)

Example 2:
Consider the case where n = 3: \(R = \{A, B, C\}\)
\(\phi = \{p_1, p_2, p_3\}\)

\(\Phi_{i,j} = \begin{bmatrix} p_1 & P(R_1 = 2) & P(R_1 = 3) \\ p_2 & P(R_2 = 2) & P(R_2 = 3) \\ p_3 & P(R_3 = 2) & P(R_3 = 3) \\ \end{bmatrix}\)

\(P(R_1 = 2) = \sum_{i=1}^3 P(R_i = 1, R_1 = 2)\)

\(P(R_1 = 2) = \sum_{i=1}^3 P(R_1 = 2 | R_i = 1)P(R_i = 1)\)

The next line of equivalence depends on a key assumption.

The Assumption:

\(P(R_i = 2 | R_j = 1, \Omega) = P(R_i = 1 | \Omega \smallsetminus \{R_j = 1\})\)

The logic of this assumption is as follows. We have a set of discrete outcomes, each with the probability of a single racer coming in first, \(\phi\) (or \(\Phi_{,1}\)). If we know a specific racer comes in first, we use the remaining probabilities to decide who comes in second. Since racer “A” came in first, he cannot finish in any other position. We remove this driver from \(\phi\), our set of outcomes. We then normalize the remaining entries of \(\phi\) as follows:

\(\phi_{i | R_A = 1} = \frac{\phi_i}{\sum_{k \neq A} \phi_k} = \frac{\phi_i}{1 - \phi_A}, \quad i \neq A\)
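
As a worked instance of this renormalization, return to Example 2. Racer 1 finishes second only if racer 2 or racer 3 wins and racer 1 then "wins" among the remaining field, so under the assumption above:

\(P(R_1 = 2) = p_2 \cdot \frac{p_1}{1 - p_2} + p_3 \cdot \frac{p_1}{1 - p_3}\)

With the illustrative values \(\phi = \{.5, .3, .2\}\) (chosen here only as an example), this gives \(P(R_1 = 2) = .3 \cdot \frac{.5}{.7} + .2 \cdot \frac{.5}{.8} \approx .339\), and therefore \(P(R_1 = 3) = 1 - .5 - .339 \approx .161\).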

We can use the above to exactly fill in any marginal distribution matrix \(\Phi\). Unfortunately, at large n, this is computationally expensive, so we will use R’s sample() function to quickly generate these permutations.
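
For a small field, the exact calculation can be sketched directly. The function below is an illustrative implementation of the renormalization idea, not code from the model itself; exact_dists() and its internal helper are hypothetical names, and the input probabilities are assumed to be strictly positive. It walks every possible finishing order, so its cost grows like n!, which is why it is only practical for small n.

```r
# Exact marginal finish distribution by enumerating every finishing order.
# phi: win probabilities (assumed strictly positive and summing to 1).
exact_dists <- function(phi) {
  n <- length(phi)
  Phi <- matrix(0, nrow = n, ncol = n)  # rows = racers, columns = positions

  # Recursively fill one finishing position at a time.
  #   remaining: indices of racers not yet placed
  #   position:  the finishing position being filled
  #   prob:      probability of the partial order built so far
  place <- function(remaining, position, prob) {
    if (length(remaining) == 0) return(invisible(NULL))
    w <- phi[remaining] / sum(phi[remaining])  # renormalize over remaining racers
    for (k in seq_along(remaining)) {
      p <- prob * w[k]
      Phi[remaining[k], position] <<- Phi[remaining[k], position] + p
      place(remaining[-k], position + 1, p)
    }
  }

  place(seq_len(n), 1, 1)
  Phi
}

Phi_exact <- exact_dists(c(0.5, 0.3, 0.2))
round(Phi_exact, 3)
rowSums(Phi_exact)  # each racer finishes somewhere
colSums(Phi_exact)  # each position is filled by someone
```

The first column reproduces the input probabilities, and the row and column sums give a quick check of the doubly stochastic property.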

3 – Implementation

To get the marginal distribution matrix from the betting odds we do the following:

1. Initialize an empty sims-by-n matrix, denoted M, where sims is the number of simulations and n is the number of racers.
2. Use the sample() function to generate random weighted permutations of the racers, storing each permutation in a row of M.
3. Create an empty n-by-(n+1) marginal distribution matrix, denoted D, where the first column is the ID column.
4. Use the table() function on each column of M to get the count of each racer’s finishes in that position.
5. Insert the results of step 4 into matrix D.
6. Fix the formatting (column names, and zeros for missing counts).
7. Convert the counts in D to probabilities.

get_dists <- function(odds, n, sims) {
  # Estimate the probability of finishing 1st, 2nd, ..., nth for each racer by
  # simulating random finishing orders weighted by the win probabilities `odds`.
  # `n` must equal length(odds).

  # M: one simulated finishing order per row, stored as character racer IDs
  M <- matrix("", nrow = sims, ncol = n)
  for (i in seq_len(sims)) {
    # Weighted draw without replacement: each pick is the next finisher
    M[i, ] <- sample(1:n, n, prob = odds)
  }

  # D: one row per racer, ID column first. IDs are listed in the same
  # (character) order that table() uses, and every racer gets a row even if
  # they never finish first in any simulation.
  ids <- sort(as.character(seq_len(n)))
  D <- data.frame(Racer = ids, stringsAsFactors = FALSE)
  for (j in seq_len(n)) {
    # Count how often each racer lands in position j; unseen racers get 0
    D[[paste0("odds", j)]] <- as.vector(table(factor(M[, j], levels = ids)))
  }

  # Convert counts to probabilities
  D[, -1] <- D[, -1] / sims
  return(D)
}
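
One way to sanity-check a simulation like the one above is to run the same idea on a tiny field, where the answer is easy to reason about. The sketch below is not get_dists() itself but a compact stand-in built on the same sample() call; the probabilities, seed, and variable names are illustrative assumptions.

```r
set.seed(1)                 # fixed seed so the sketch is reproducible
phi  <- c(0.5, 0.3, 0.2)    # illustrative win probabilities
n    <- length(phi)
sims <- 20000

# Each column of `orders` is one simulated finishing order (racer indices)
orders <- replicate(sims, sample(n, n, prob = phi))

# counts[r, j]: number of simulations in which racer r finished in position j
counts <- sapply(seq_len(n), function(j) tabulate(orders[j, ], nbins = n))
sim_Phi <- counts / sims

round(sim_Phi, 3)
rowSums(sim_Phi)  # should all be 1
colSums(sim_Phi)  # should all be 1
```

The first column should land close to the input probabilities, with the gap shrinking as sims grows.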

4 – Benchmarking, Testing, and Example

In this section we test the running time of our get_dists() function using betting odds from the William Hill sportsbook (via an article from CBS Sports).

test_odds <- read_csv("TEST_ODDS_NH.csv")

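sims <- 100000
start_time <- Sys.time()
test_dists <- get_dists(test_odds$P_norm, n = length(test_odds$Racer), sims = sims)
end_time <- Sys.time()
# Ask for seconds explicitly; a bare difftime chooses its own units
elapsed <- as.numeric(difftime(end_time, start_time, units = "secs"))
paste0("The function get_dists takes ", elapsed, " seconds to run with ", sims, " simulations. ")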

## [1] "The function get_dists takes 2.26293182373047 seconds to run with 1e+05 simulations. "

# Hot fix: sort rows back into numeric racer-ID order, then restore the racer names
test_dists <- test_dists[order(as.numeric(test_dists$Racer)),]
test_dists$Racer <- test_odds$Racer

Quick, right? A partial output of the get_dists() function is shown below.

head(test_dists)[,1:8]

##               Racer   odds1   odds2   odds3   odds4   odds5   odds6   odds7
## 1     Kevin Harvick 0.24164 0.19599 0.15642 0.12275 0.09157 0.06738 0.04627
## 12     Denny Hamlin 0.12293 0.12496 0.11563 0.11109 0.10217 0.09220 0.07826
## 23       Kyle Busch 0.12187 0.12061 0.11719 0.11366 0.10159 0.09322 0.07876
## 32 Martin Truex Jr. 0.06784 0.07312 0.07476 0.07769 0.07871 0.07997 0.08025
## 33      Ryan Blaney 0.05550 0.05953 0.06418 0.06927 0.07153 0.07193 0.07344
## 34  Brad Keselowski 0.05831 0.05958 0.06404 0.06851 0.07039 0.07336 0.07406

We can now use betting odds to effectively simulate a large number of races in a short amount of time. However, there is still lots of work needed to get this model ready for use. In the next installment, we will discuss calculation of expected finish position, verification of the results, and adding a “crash prior” to account for random outcomes.

NASCAR Model Pt. 1

Ben Gramza

12/14/2020
