While one can probabilistically model race finishing position (based on a host of input data values) the intrigue of watching a F1 race is on the random unpredictable events and how the racers and teams adapt. Should one be interested in predicting the effect of those events, a Monte Carlo (MC) simulation would be much more adept to predict the outcome of races.
Using the Ergast Data and some data from FastF1 python library, as well as other sources, the dataset already has been prepared with some derivative data. In particular, data about circuit (altitude, weather, historical correlation between starting and finishing positions), constructor (car failure rate, tire failure rate), driver (crash rate, disqualification rate), pit stops (avg # pits per circuit, avg pit duration, driver/constructor avg pit speed & avg num vs field), practice information (driver & constructor number of laps, pace, relative to teammate), qualifying (position, gap to leader), etc. have been determined.
The following literature sources have been used in this data exploration.
DRS Effect can be modeled after lap 2 and 2 laps after SC.
Overtaking can slow both cars, but the overtaken car more than the overtaking.
Model each driver’s performance vs. a theoretical (stable) lap time instead of
Let’s build a lap simulator! To start, we need some basic parameters.
parameters <- list(
avg_lap_time = 94, #like COTA in Phoenix
num_laps = 56,
num_drivers = 20
)
If laps are modeled by deviation from a average lap (in our scenario, 94 seconds long), a race can be simulated by:
simulate_lap_time<-function(params = parameters){
lap_time = 0
}
simulate_race<-function(params = parameters){
driver_times = data.frame(driver = 1:params$num_drivers,
position = 1:params$num_drivers,
race_time = NA_real_)
for (lap in 1:params$num_laps){
for (driver in unique(driver_times$driver)){
driver_times[driver_times$driver == driver, ]$race_time <-
sum(driver_times[driver_times$driver == driver, ]$race_time, simulate_lap_time(params = params), params$avg_lap_time, na.rm=TRUE)
}
}
return(driver_times)
}
simulate_race(parameters)
## driver position race_time
## 1 1 1 5264
## 2 2 2 5264
## 3 3 3 5264
## 4 4 4 5264
## 5 5 5 5264
## 6 6 6 5264
## 7 7 7 5264
## 8 8 8 5264
## 9 9 9 5264
## 10 10 10 5264
## 11 11 11 5264
## 12 12 12 5264
## 13 13 13 5264
## 14 14 14 5264
## 15 15 15 5264
## 16 16 16 5264
## 17 17 17 5264
## 18 18 18 5264
## 19 19 19 5264
## 20 20 20 5264
Note that here we just have driver numbers = starting grid position. Of course, since each driver is getting only the average lap time, each driver is finishing with the same time as everyone else. But we can start to add complexity to this piece by piece, rerunning the simulations each time.
One quick item to add is the grid start position to the start line
time (i.e. from the grid box to the start line). This is longer for cars
near the back than it is near the front of the grid, and was modeled by
Heilmer et. al., 2020 as being
ts = sqrt((2*(grid*8-0.8))/11.2)+0.2.
simulate_lap_time<-function(lap, grid, params = parameters){
if(lap == 1){
ts = sqrt((2*(grid*8-0.8))/11.2)+0.2
} else {
ts = 0
}
lap_time = 0 + ts
}
simulate_race<-function(params = parameters){
driver_times = data.frame(driver = 1:params$num_drivers,
position = 1:params$num_drivers,
grid = 1:params$num_drivers,
race_time = NA_real_)
for (lap in 1:params$num_laps){
for (driver in unique(driver_times$driver)){
driver_times[driver_times$driver == driver, ]$race_time <-
sum(driver_times[driver_times$driver == driver, ]$race_time,
simulate_lap_time(params = params, lap=lap, grid=driver_times[driver_times$driver == driver,]$grid),
params$avg_lap_time, na.rm=TRUE)
}
}
driver_times <- driver_times %>%
dplyr::arrange(.data$race_time) %>%
dplyr::mutate("position" = 1:dplyr::n(),
"laps" = params$num_laps - floor((.data$race_time - head(.data$race_time, 1))/params$avg_lap_time)) %>%
dplyr::select(c("driver", "position", "race_time", 'laps'))
return(driver_times)
}
simulate_race(parameters)
## driver position race_time laps
## 1 1 1 5265.334 56
## 2 2 2 5265.848 56
## 3 3 3 5266.235 56
## 4 4 4 5266.560 56
## 5 5 5 5266.846 56
## 6 6 6 5267.103 56
## 7 7 7 5267.340 56
## 8 8 8 5267.559 56
## 9 9 9 5267.766 56
## 10 10 10 5267.961 56
## 11 11 11 5268.146 56
## 12 12 12 5268.323 56
## 13 13 13 5268.493 56
## 14 14 14 5268.656 56
## 15 15 15 5268.814 56
## 16 16 16 5268.966 56
## 17 17 17 5269.114 56
## 18 18 18 5269.257 56
## 19 19 19 5269.396 56
## 20 20 20 5269.532 56
We can continually iterate to add more capabilities to this model. Building piece by piece allows us to validate the performance of each item. We can add more and more parameters and add capabilities like tire deg, fuel consumption, pit stops, etc.
Of course, if we run the above code as many times as we like, our results will always be the same. But, by adding in random variation (such as small +/- to lap time based on driver skill, likelihood of pitting at any lap, likelihood of crash/retirement/safety car/etc. we can get variation in our results time with each run.
simulate_lap_time<-function(lap, grid, params = parameters){
if(lap == 1){
ts = sqrt((2*(grid*8-0.8))/11.2)+0.2
} else {
ts = 0
}
# ToDo: model this variation, this is just a quick example!
tv <- runif(1, min = -1.5, max = 3+grid*.1)
lap_time = 0 + ts + tv
}
simulate_race(parameters)
## driver position race_time laps
## 1 8 1 5298.345 56
## 2 2 2 5315.311 56
## 3 3 3 5316.079 56
## 4 4 4 5318.669 56
## 5 7 5 5322.132 56
## 6 6 6 5324.782 56
## 7 1 7 5329.507 56
## 8 5 8 5330.543 56
## 9 9 9 5331.115 56
## 10 12 10 5335.402 56
## 11 18 11 5337.014 56
## 12 15 12 5339.831 56
## 13 11 13 5345.138 56
## 14 10 14 5346.976 56
## 15 19 15 5347.236 56
## 16 14 16 5355.466 56
## 17 20 17 5356.757 56
## 18 13 18 5361.551 56
## 19 17 19 5370.689 56
## 20 16 20 5378.022 56
simulate_race(parameters)
## driver position race_time laps
## 1 2 1 5293.223 56
## 2 6 2 5299.568 56
## 3 3 3 5305.696 56
## 4 1 4 5316.459 56
## 5 7 5 5321.339 56
## 6 4 6 5321.667 56
## 7 5 7 5329.739 56
## 8 8 8 5333.281 56
## 9 11 9 5341.247 56
## 10 9 10 5342.352 56
## 11 15 11 5342.906 56
## 12 13 12 5345.097 56
## 13 14 13 5345.952 56
## 14 12 14 5348.293 56
## 15 20 15 5353.430 56
## 16 10 16 5355.664 56
## 17 16 17 5358.156 56
## 18 17 18 5359.815 56
## 19 18 19 5377.292 56
## 20 19 20 5387.954 55
See how those two results are different? And that some cars have passed each other? The driver who started 8th won the first race! And, in the second, the last place car was not on the final lap - they’d been passed (by the first two finishers).
At this point, repeating the process 10, 100, 1000 or more times can give us estimates of what could happen in the race - maybe it’s most likely that Hamilton, Leclerc or Verstappen win in COTA, but what if everything goes wrong for everyone and Vettel picks up the win in an Aston Martin? Repeating the race time after time points toward if that is even a reasonable possibility.
Of course, we can’t model all of the variability of F1 - what were the odds of Bottas steaming up the side of the leaders in the rain taking them out, and the track dries out for the restart only to have everyone but Hamilton put on fresh tires, and Ocon takes the win? We’ll never know the odds, but it happened in 2021. We can’t capture everything, but hopefully we can make pretty good guesses.