Introduction

While one can probabilistically model race finishing position (based on a host of input data values) the intrigue of watching a F1 race is on the random unpredictable events and how the racers and teams adapt. Should one be interested in predicting the effect of those events, a Monte Carlo (MC) simulation would be much more adept to predict the outcome of races.

Already Done Work

Using the Ergast Data and some data from FastF1 python library, as well as other sources, the dataset already has been prepared with some derivative data. In particular, data about circuit (altitude, weather, historical correlation between starting and finishing positions), constructor (car failure rate, tire failure rate), driver (crash rate, disqualification rate), pit stops (avg # pits per circuit, avg pit duration, driver/constructor avg pit speed & avg num vs field), practice information (driver & constructor number of laps, pace, relative to teammate), qualifying (position, gap to leader), etc. have been determined.

Literature

The following literature sources have been used in this data exploration.

Notes from Literature

Building the Lap Simulator

Let’s build a lap simulator! To start, we need some basic parameters.

parameters <- list(
  avg_lap_time = 94, #like COTA in Phoenix
  num_laps = 56,
  num_drivers = 20
)

If laps are modeled by deviation from a average lap (in our scenario, 94 seconds long), a race can be simulated by:

simulate_lap_time<-function(params = parameters){
  lap_time = 0
}

simulate_race<-function(params = parameters){
  driver_times = data.frame(driver = 1:params$num_drivers,
                            position = 1:params$num_drivers,
                            race_time = NA_real_)
  for (lap in 1:params$num_laps){
    for (driver in unique(driver_times$driver)){
      driver_times[driver_times$driver == driver, ]$race_time <- 
        sum(driver_times[driver_times$driver == driver, ]$race_time, simulate_lap_time(params = params), params$avg_lap_time, na.rm=TRUE)
    }
  }
  
  return(driver_times)
}

simulate_race(parameters)
##    driver position race_time
## 1       1        1      5264
## 2       2        2      5264
## 3       3        3      5264
## 4       4        4      5264
## 5       5        5      5264
## 6       6        6      5264
## 7       7        7      5264
## 8       8        8      5264
## 9       9        9      5264
## 10     10       10      5264
## 11     11       11      5264
## 12     12       12      5264
## 13     13       13      5264
## 14     14       14      5264
## 15     15       15      5264
## 16     16       16      5264
## 17     17       17      5264
## 18     18       18      5264
## 19     19       19      5264
## 20     20       20      5264

Note that here we just have driver numbers = starting grid position. Of course, since each driver is getting only the average lap time, each driver is finishing with the same time as everyone else. But we can start to add complexity to this piece by piece, rerunning the simulations each time.

One quick item to add is the grid start position to the start line time (i.e. from the grid box to the start line). This is longer for cars near the back than it is near the front of the grid, and was modeled by Heilmer et. al., 2020 as being ts = sqrt((2*(grid*8-0.8))/11.2)+0.2.

simulate_lap_time<-function(lap, grid, params = parameters){
  if(lap == 1){
    ts = sqrt((2*(grid*8-0.8))/11.2)+0.2
  } else {
    ts = 0
  }
  lap_time = 0 + ts
}

simulate_race<-function(params = parameters){
  driver_times = data.frame(driver = 1:params$num_drivers,
                            position = 1:params$num_drivers,
                            grid = 1:params$num_drivers,
                            race_time = NA_real_)
  for (lap in 1:params$num_laps){
    for (driver in unique(driver_times$driver)){
      driver_times[driver_times$driver == driver, ]$race_time <- 
        sum(driver_times[driver_times$driver == driver, ]$race_time, 
            simulate_lap_time(params = params, lap=lap, grid=driver_times[driver_times$driver == driver,]$grid), 
            params$avg_lap_time, na.rm=TRUE)
    }
  }
  
  driver_times <- driver_times %>% 
    dplyr::arrange(.data$race_time) %>% 
    dplyr::mutate("position" = 1:dplyr::n(),
                  "laps" = params$num_laps - floor((.data$race_time - head(.data$race_time, 1))/params$avg_lap_time)) %>%
    dplyr::select(c("driver", "position", "race_time", 'laps'))
  
  return(driver_times)
}

simulate_race(parameters)
##    driver position race_time laps
## 1       1        1  5265.334   56
## 2       2        2  5265.848   56
## 3       3        3  5266.235   56
## 4       4        4  5266.560   56
## 5       5        5  5266.846   56
## 6       6        6  5267.103   56
## 7       7        7  5267.340   56
## 8       8        8  5267.559   56
## 9       9        9  5267.766   56
## 10     10       10  5267.961   56
## 11     11       11  5268.146   56
## 12     12       12  5268.323   56
## 13     13       13  5268.493   56
## 14     14       14  5268.656   56
## 15     15       15  5268.814   56
## 16     16       16  5268.966   56
## 17     17       17  5269.114   56
## 18     18       18  5269.257   56
## 19     19       19  5269.396   56
## 20     20       20  5269.532   56

We can continually iterate to add more capabilities to this model. Building piece by piece allows us to validate the performance of each item. We can add more and more parameters and add capabilities like tire deg, fuel consumption, pit stops, etc.

Turning This into Monte Carlo Simulation

Of course, if we run the above code as many times as we like, our results will always be the same. But, by adding in random variation (such as small +/- to lap time based on driver skill, likelihood of pitting at any lap, likelihood of crash/retirement/safety car/etc. we can get variation in our results time with each run.

simulate_lap_time<-function(lap, grid, params = parameters){
  if(lap == 1){
    ts = sqrt((2*(grid*8-0.8))/11.2)+0.2
  } else {
    ts = 0
  }
  # ToDo: model this variation, this is just a quick example!
  tv <- runif(1, min = -1.5, max = 3+grid*.1) 
  lap_time = 0 + ts + tv
}

simulate_race(parameters)
##    driver position race_time laps
## 1       8        1  5298.345   56
## 2       2        2  5315.311   56
## 3       3        3  5316.079   56
## 4       4        4  5318.669   56
## 5       7        5  5322.132   56
## 6       6        6  5324.782   56
## 7       1        7  5329.507   56
## 8       5        8  5330.543   56
## 9       9        9  5331.115   56
## 10     12       10  5335.402   56
## 11     18       11  5337.014   56
## 12     15       12  5339.831   56
## 13     11       13  5345.138   56
## 14     10       14  5346.976   56
## 15     19       15  5347.236   56
## 16     14       16  5355.466   56
## 17     20       17  5356.757   56
## 18     13       18  5361.551   56
## 19     17       19  5370.689   56
## 20     16       20  5378.022   56
simulate_race(parameters)
##    driver position race_time laps
## 1       2        1  5293.223   56
## 2       6        2  5299.568   56
## 3       3        3  5305.696   56
## 4       1        4  5316.459   56
## 5       7        5  5321.339   56
## 6       4        6  5321.667   56
## 7       5        7  5329.739   56
## 8       8        8  5333.281   56
## 9      11        9  5341.247   56
## 10      9       10  5342.352   56
## 11     15       11  5342.906   56
## 12     13       12  5345.097   56
## 13     14       13  5345.952   56
## 14     12       14  5348.293   56
## 15     20       15  5353.430   56
## 16     10       16  5355.664   56
## 17     16       17  5358.156   56
## 18     17       18  5359.815   56
## 19     18       19  5377.292   56
## 20     19       20  5387.954   55

See how those two results are different? And that some cars have passed each other? The driver who started 8th won the first race! And, in the second, the last place car was not on the final lap - they’d been passed (by the first two finishers).

Generating Race Predictions

At this point, repeating the process 10, 100, 1000 or more times can give us estimates of what could happen in the race - maybe it’s most likely that Hamilton, Leclerc or Verstappen win in COTA, but what if everything goes wrong for everyone and Vettel picks up the win in an Aston Martin? Repeating the race time after time points toward if that is even a reasonable possibility.

Of course, we can’t model all of the variability of F1 - what were the odds of Bottas steaming up the side of the leaders in the rain taking them out, and the track dries out for the restart only to have everyone but Hamilton put on fresh tires, and Ocon takes the win? We’ll never know the odds, but it happened in 2021. We can’t capture everything, but hopefully we can make pretty good guesses.