This is Part 2 of my 2 part tutorial series. This tutorial aims to walk you through how to simulate matches in a given competition, and then simulate the final standings of the competition. We will use the Elo model created in Part 1 to predict the probability of Team.A winning or losing. We will use these predictions to simulate each game n times and tally up the results.
If you are following on from Part 1, we will be using the
vnl_elo_model and vnl_elo_ratings
dataframe.
If not, here is a link to ,
alternatively, I have provided the code below to recreate the dataframe
and model:
# Required Libraries
library(rvest) # web scraping
library(dplyr) # data wrangling
library(elo) # Elo package
# VNL 2018 URL
url_2018 <- read_html('https://en.wikipedia.org/wiki/2018_FIVB_Volleyball_Men%27s_Nations_League')
# Tables from URL
tables_2018 <- url_2018 %>%
html_table()
# Extract tables no. 11 to 30, 33, 35 and 37 to 39
tables_2018 <- tables_2018[c(11:30, 33, 35, 37:39)]
# bind all lists together into a dataframe
vnl_2018 <- as.data.frame(do.call(what = rbind,
args = tables_2018))
# Change column names to Team.A and Team.B
names(vnl_2018)[3] <- 'Team.A'
names(vnl_2018)[5] <- 'Team.B'
# Adding year
vnl_2018 <- vnl_2018 %>%
# Add year column
mutate(Year = 2018) %>%
# Relocate year column as first column to make dataframe look a bit neater
relocate(Year, .before = Date)
# select necessary columns to manipulate for Elo model
vnl_elo <- vnl_2018 %>%
select(Year, Team.A, Team.B, Score)
vnl_elo <- vnl_elo %>%
# select 1st character within Score column to represent Team.A score
mutate(Team.A_score = substr(Score, 1, 1),
# select 3rd character within Score column to represent Team.B score
Team.B_score = substr(Score, 3, 3)) %>%
# Determining if Team.A won (1) or lost (0)
mutate(Result = ifelse(Team.A_score > Team.B_score, '1', '0'))
# Elo model
vnl_elo_model <- elo.run(formula = Result ~ Team.A + Team.B,
k = 27,
initial.elos = 1500,
data = vnl_elo)
# Save as dataframe
vnl_elo_ratings <- vnl_elo_model %>%
# save as data.frame
as.data.frame() %>%
# and round numbers to 2 decimal places
mutate_if(is.numeric, round, digits = 2)
Before starting the simulations, there is some data wrangling to be
done. We need to convert the dataframe to a data.table for the
simulation to work properly, add a game no. column to reference which
game to simulate, add result of a match as a character (win, loss) and
use the elo_model to predict whether Team.A will win or
lose, and the predicted probabilites of a win or loss.
Also need to create a vector with unique team names.
# Required Libraries
library(data.table) #for using data.table class
# add game and result to dataframe
vnl_simulation <- vnl_elo_ratings %>%
# renaming team columns to match the original df (vnl_elo)
rename(Team.A = team.A,
Team.B = team.B) %>%
# changing team columns back to a character
mutate(Team.A = as.character(Team.A),
Team.B = as.character(Team.B),
# include game number ro reference simulations
game = 1:130,
# add Result column from vnl_elo to indicate win or lose
Result = vnl_elo$Result) %>%
# change Result to character
mutate(Result = ifelse(Result == 1, 'Win', 'Lose'))
# predict using Elo model + convert to data.table
vnl_simulation <- vnl_simulation %>%
# use vnl_elo_model to make predictions on all games in 2018
# predicts whether Team.A will win/lose + probability of winning
mutate(pred_win = predict(vnl_elo_model, newdata = vnl_simulation)) %>%
# predicted probability of Team.A losing
mutate(pred_lose = 1 - pred_win) %>%
ungroup() %>%
# select required variables
select(game, Team.A, Team.B, Result, pred_win, pred_lose) %>%
# simulations need to be in a data.table format
as.data.table
# Retrieve unique team names
team_names <- unique(c(vnl_simulation$Team.A, vnl_simulation$Team.B))
The first step to create simulations is creating a function that calculates an arbitrary number of points for a win (+3) or a loss (+0) for each game.
# A function that takes in a table then calculates points for each of the rounds.
# only input for this function is dataframe with required data
calculate_points <- function(df){
# create table that will store results from for loop
# points awarded to a team for a win (+3) or loss (+0)
# matches_played is simply the number of matches a given team has played
points_table <- data.table(team = team_names,
point = 0,
matches_played = 0)
# for each row in the provided dataframe/table
for (i in c(1:nrow(df))){
# get result of current iteration and convert from factor to character
result <- df[i]$Result
result <- as.character(result)
# select team for current iteration
Team.A_name <- df[i]$Team.A
# select max no. of matches played for current team
Team.A_matches_played <- max(points_table[team == Team.A_name]$matches_played)
# select point for current team and match_played iteration
Team.A_point_before <- points_table[team == Team.A_name & matches_played == Team.A_matches_played]$point
# if Team.A contains NA, +0
Team.A_point <- case_when(is.na(result) ~ Team.A_point_before + 0,
# if Team.A wins, +3
result == 'Win' ~ Team.A_point_before + 3,
# if Team.A losses, +0
result == 'Lose' ~ Team.A_point_before + 0)
# repeat same process for Team.B
Team.B_name <- df[i]$Team.B
Team.B_matches_played <- max(points_table[team == Team.B_name]$matches_played)
Team.B_point_before <- points_table[team == Team.B_name & matches_played == Team.B_matches_played]$point
Team.B_point <- case_when(is.na(result) ~ Team.B_point_before + 0,
# result is switched here before result is based on Team.A
result == 'Lose' ~ Team.B_point_before + 3,
result == 'Win' ~ Team.B_point_before + 0)
# add results for current team to points_table and +1 to matches played for Team.A
points_table <- rbindlist(list(points_table,
data.table(matches_played = Team.A_matches_played + 1,
team = Team.A_name,
point = Team.A_point)),
# fills missing columns with NA if True
fill = T)
# repeat for Team.B
points_table <- rbindlist(list(points_table,
data.table(matches_played = Team.B_matches_played + 1,
team = Team.B_name,
point = Team.B_point)),
# fills missing columns with NA if True
fill = T)
}
# return points_table to iterate through till nrows reached
return(points_table)
}
vnl_points <- calculate_points(vnl_simulation)
We will now create the function that will simulate the matches:
# df = input dataframe that we want to simulate matches for, in this case 'vnl_simulations'
# times = no. of times you want to simulate each game
simulate_matches <- function(df, times){
# empty table to output all results
output_table <- data.table()
# for each iteration of match
for (i in c(1:times)){
# for each row in dataframe
for(j in c(1:nrow(df))){
# get match_row number
match_row <- df[j]
# sample current match using probabilities of Team.A winning or losing
outcome <- sample(c("Win", "Lose"),
size = 1,
replace = TRUE,
prob = c(match_row$pred_win,
match_row$pred_lose))
# add result to sampled match and add simulation id no.
match_row[, Result:= outcome]
match_row[, simulation_id:= i]
# bind results to output table
output_table <- rbindlist(list(output_table, match_row), fill = T)
}
}
return(output_table)
}
no._simulations <- 500
vnl_simulated_matches <- simulate_matches(vnl_simulation, no._simulations)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
Now we have the simulated matches, we can simulate the standings:
# simulate standings with simulated match results
# input df will be simulated matches
simulate_standings <- function(df){
# empty table to store results in
standings_all <- data.table()
# for each simulated match in each simulation
for (i in c(1:max(df$simulation_id))){
# all games in 1 simulation (130)
one_simulation <- df[simulation_id == i]
# use function created before to calculate points for all games in given simulation id
simulated_fixtures <- calculate_points(one_simulation)
# add simulation id
simulated_fixtures[, simulation_id:= i]
# bind simulated points/fixtures to empty dataframe/table
standings_all <- rbindlist(list(standings_all,
simulated_fixtures),
fill = T)
}
return(standings_all)
}
# NOTE: BE CAREFUL WITH HOW MANY SIMULATIONS YOU DO, THIS CAN TAKE AWHILE, ESPECIALLY WITH AN OLD and/or SLOW COMPUTER
# IF YOU THINK IT WILL TAKE AWHILE, CHANGE THE 'no._simulations' VARIABLE IN THE SIMULATE_MATCHES FUNCTION WE CREATED ABOVE
vnl_simulated_standings <- simulate_standings(vnl_simulated_matches)
With the finalized simulated standings, we can filter the results to produce a simulated final standings/rankings table for the VNL 2018. For example, running 500 simulations in the table below we can see France placed 1st 33 times, Russia 24 times and Serbia 11 times…
# select each simulation where teams shared maximum amount of games played (15)
# this excludes all final_pos matches
final_pos <- vnl_simulated_standings %>%
filter(matches_played == 15)
# order each simulation by most points
final_pos <- final_pos[order(simulation_id, -point)]
# adding a column indicating where a team would of placed in a given simulation
final_pos[, standing:= c(1:.N), by = simulation_id]
With the final positions for each simulation, we can then look at how many times each team placed in each position from 1st to 16th. For example, from the table below when running 250 simulations, France placed 1st 63 times, Russia 60 times, and Serbia 27 times, etc.
# from here we can view how many times each team placed in each position
count_occurences <- table(final_pos$team, final_pos$standing)
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | 5 | 5 | 14 | 17 | 20 | 25 | 29 | 39 | 39 | 42 | 32 | 44 | 52 | 63 | 41 | 33 |
| Australia | 9 | 13 | 15 | 15 | 28 | 29 | 19 | 32 | 43 | 50 | 45 | 33 | 50 | 53 | 39 | 27 |
| Brazil | 33 | 59 | 48 | 52 | 42 | 48 | 30 | 39 | 35 | 23 | 25 | 21 | 16 | 16 | 9 | 4 |
| Bulgaria | 7 | 13 | 14 | 16 | 21 | 43 | 32 | 25 | 39 | 36 | 41 | 43 | 41 | 44 | 49 | 36 |
| Canada | 25 | 34 | 42 | 48 | 31 | 34 | 36 | 40 | 39 | 40 | 30 | 39 | 26 | 13 | 14 | 9 |
| China | 1 | 2 | 5 | 8 | 6 | 11 | 17 | 20 | 23 | 36 | 37 | 41 | 47 | 65 | 97 | 84 |
| France | 140 | 71 | 58 | 47 | 45 | 29 | 30 | 19 | 19 | 9 | 15 | 4 | 8 | 2 | 4 | 0 |
| Germany | 19 | 17 | 21 | 16 | 19 | 26 | 44 | 41 | 42 | 36 | 41 | 40 | 44 | 47 | 33 | 14 |
| Iran | 13 | 22 | 23 | 40 | 28 | 38 | 39 | 38 | 30 | 33 | 40 | 38 | 42 | 31 | 32 | 13 |
| Italy | 5 | 18 | 23 | 25 | 39 | 23 | 40 | 42 | 45 | 40 | 43 | 44 | 49 | 26 | 25 | 13 |
| Japan | 9 | 14 | 16 | 11 | 37 | 33 | 44 | 26 | 24 | 41 | 50 | 52 | 34 | 44 | 38 | 27 |
| Poland | 32 | 51 | 33 | 44 | 46 | 55 | 33 | 34 | 34 | 29 | 27 | 23 | 19 | 22 | 12 | 6 |
| Russia | 114 | 88 | 79 | 52 | 39 | 26 | 26 | 24 | 20 | 4 | 7 | 10 | 3 | 4 | 4 | 0 |
| Serbia | 46 | 47 | 61 | 46 | 47 | 33 | 48 | 33 | 31 | 30 | 21 | 16 | 13 | 11 | 11 | 6 |
| South Korea | 1 | 0 | 0 | 3 | 3 | 4 | 4 | 8 | 7 | 24 | 21 | 31 | 43 | 43 | 87 | 221 |
| United States | 41 | 46 | 48 | 60 | 49 | 43 | 29 | 40 | 30 | 27 | 25 | 21 | 13 | 16 | 5 | 7 |
Alternatively, we can convert these outcomes to a percentage, which indicated the probability of a team finishing in a given position. We can see that France and Russia had the highest probability of placing 1st with 25.2% and 24.0% respectively.
It’s worth mentioning as you run more simulations your results will become more accurate, but it will take longer.
# from here we can view how many times each team placed in each position
probabilities <- round(table(final_pos$team,
final_pos$standing) / no._simulations * 100, 2)
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | 1.0 | 1.0 | 2.8 | 3.4 | 4.0 | 5.0 | 5.8 | 7.8 | 7.8 | 8.4 | 6.4 | 8.8 | 10.4 | 12.6 | 8.2 | 6.6 |
| Australia | 1.8 | 2.6 | 3.0 | 3.0 | 5.6 | 5.8 | 3.8 | 6.4 | 8.6 | 10.0 | 9.0 | 6.6 | 10.0 | 10.6 | 7.8 | 5.4 |
| Brazil | 6.6 | 11.8 | 9.6 | 10.4 | 8.4 | 9.6 | 6.0 | 7.8 | 7.0 | 4.6 | 5.0 | 4.2 | 3.2 | 3.2 | 1.8 | 0.8 |
| Bulgaria | 1.4 | 2.6 | 2.8 | 3.2 | 4.2 | 8.6 | 6.4 | 5.0 | 7.8 | 7.2 | 8.2 | 8.6 | 8.2 | 8.8 | 9.8 | 7.2 |
| Canada | 5.0 | 6.8 | 8.4 | 9.6 | 6.2 | 6.8 | 7.2 | 8.0 | 7.8 | 8.0 | 6.0 | 7.8 | 5.2 | 2.6 | 2.8 | 1.8 |
| China | 0.2 | 0.4 | 1.0 | 1.6 | 1.2 | 2.2 | 3.4 | 4.0 | 4.6 | 7.2 | 7.4 | 8.2 | 9.4 | 13.0 | 19.4 | 16.8 |
| France | 28.0 | 14.2 | 11.6 | 9.4 | 9.0 | 5.8 | 6.0 | 3.8 | 3.8 | 1.8 | 3.0 | 0.8 | 1.6 | 0.4 | 0.8 | 0.0 |
| Germany | 3.8 | 3.4 | 4.2 | 3.2 | 3.8 | 5.2 | 8.8 | 8.2 | 8.4 | 7.2 | 8.2 | 8.0 | 8.8 | 9.4 | 6.6 | 2.8 |
| Iran | 2.6 | 4.4 | 4.6 | 8.0 | 5.6 | 7.6 | 7.8 | 7.6 | 6.0 | 6.6 | 8.0 | 7.6 | 8.4 | 6.2 | 6.4 | 2.6 |
| Italy | 1.0 | 3.6 | 4.6 | 5.0 | 7.8 | 4.6 | 8.0 | 8.4 | 9.0 | 8.0 | 8.6 | 8.8 | 9.8 | 5.2 | 5.0 | 2.6 |
| Japan | 1.8 | 2.8 | 3.2 | 2.2 | 7.4 | 6.6 | 8.8 | 5.2 | 4.8 | 8.2 | 10.0 | 10.4 | 6.8 | 8.8 | 7.6 | 5.4 |
| Poland | 6.4 | 10.2 | 6.6 | 8.8 | 9.2 | 11.0 | 6.6 | 6.8 | 6.8 | 5.8 | 5.4 | 4.6 | 3.8 | 4.4 | 2.4 | 1.2 |
| Russia | 22.8 | 17.6 | 15.8 | 10.4 | 7.8 | 5.2 | 5.2 | 4.8 | 4.0 | 0.8 | 1.4 | 2.0 | 0.6 | 0.8 | 0.8 | 0.0 |
| Serbia | 9.2 | 9.4 | 12.2 | 9.2 | 9.4 | 6.6 | 9.6 | 6.6 | 6.2 | 6.0 | 4.2 | 3.2 | 2.6 | 2.2 | 2.2 | 1.2 |
| South Korea | 0.2 | 0.0 | 0.0 | 0.6 | 0.6 | 0.8 | 0.8 | 1.6 | 1.4 | 4.8 | 4.2 | 6.2 | 8.6 | 8.6 | 17.4 | 44.2 |
| United States | 8.2 | 9.2 | 9.6 | 12.0 | 9.8 | 8.6 | 5.8 | 8.0 | 6.0 | 5.4 | 5.0 | 4.2 | 2.6 | 3.2 | 1.0 | 1.4 |
Additionally, to add some context, we can even view how our simulations compared to the actual final standings of the 2018 VNL: