Part 2 - Elo Model based Simulations

Overview

This is Part 2 of my 2 part tutorial series. This tutorial aims to walk you through how to simulate matches in a given competition, and then simulate the final standings of the competition. We will use the Elo model created in Part 1 to predict the probability of Team.A winning or losing. We will use these predictions to simulate each game n times and tally up the results.

Load data

If you are following on from Part 1, we will be using the vnl_elo_model and vnl_elo_ratings dataframe.

If not, here is a link to Part 1 , alternatively, I have provided the code below to recreate the dataframe and model:

Click to view code

# Required Libraries
library(rvest) # web scraping
library(dplyr) # data wrangling
library(elo)   # Elo package

# VNL 2018 URL
url_2018 <- read_html('https://en.wikipedia.org/wiki/2018_FIVB_Volleyball_Men%27s_Nations_League')

# Tables from URL
tables_2018 <- url_2018 %>% 
  html_table()

# Extract tables no. 11 to 30, 33, 35 and 37 to 39
tables_2018 <- tables_2018[c(11:30, 33, 35, 37:39)]

# bind all lists together into a dataframe
vnl_2018 <- as.data.frame(do.call(what = rbind, 
                                  args = tables_2018)) 

# Change column names to Team.A and Team.B
names(vnl_2018)[3] <- 'Team.A'
names(vnl_2018)[5] <- 'Team.B'

# Adding year
vnl_2018 <- vnl_2018 %>% 
  # Add year column
  mutate(Year = 2018) %>% 
  # Relocate year column as first column to make dataframe look a bit neater
  relocate(Year, .before = Date)

# select necessary columns to manipulate for Elo model
vnl_elo <- vnl_2018 %>% 
  select(Year, Team.A, Team.B, Score)

vnl_elo <- vnl_elo %>% 
  # select 1st character within Score column to represent Team.A score
  mutate(Team.A_score = substr(Score, 1, 1),
  # select 3rd character within Score column to represent Team.B score
         Team.B_score = substr(Score, 3, 3)) %>% 
  # Determining if Team.A won (1) or lost (0)
  mutate(Result = ifelse(Team.A_score > Team.B_score, '1', '0'))

# Elo model
vnl_elo_model <- elo.run(formula = Result ~ Team.A + Team.B,
        k = 27,
        initial.elos = 1500,
        data = vnl_elo)

# Save as dataframe
vnl_elo_ratings <- vnl_elo_model %>% 
  # save as data.frame
  as.data.frame() %>% 
  # and round numbers to 2 decimal places
  mutate_if(is.numeric, round, digits = 2)

Data wrangling

Before starting the simulations, there is some data wrangling to be done. We need to convert the dataframe to a data.table for the simulation to work properly, add a game no. column to reference which game to simulate, add result of a match as a character (win, loss) and use the elo_model to predict whether Team.A will win or lose, and the predicted probabilites of a win or loss.

Also need to create a vector with unique team names.

# Required Libraries
library(data.table) #for using data.table class

# add game and result to dataframe
vnl_simulation <- vnl_elo_ratings %>% 
  # renaming team columns to match the original df (vnl_elo)
  rename(Team.A = team.A,
         Team.B = team.B) %>% 
  # changing team columns back to a character
  mutate(Team.A = as.character(Team.A),
         Team.B = as.character(Team.B),
         # include game number ro reference simulations
         game = 1:130,
         # add Result column from vnl_elo to indicate win or lose
         Result = vnl_elo$Result) %>% 
  # change Result to character
  mutate(Result = ifelse(Result == 1, 'Win', 'Lose'))

# predict using Elo model + convert to data.table
vnl_simulation <- vnl_simulation %>% 
  # use vnl_elo_model to make predictions on all games in 2018
  # predicts whether Team.A will win/lose + probability of winning
  mutate(pred_win = predict(vnl_elo_model, newdata = vnl_simulation)) %>% 
  # predicted probability of Team.A losing
  mutate(pred_lose = 1 - pred_win) %>% 
  ungroup() %>% 
  # select required variables
  select(game, Team.A, Team.B, Result, pred_win, pred_lose) %>% 
  # simulations need to be in a data.table format
  as.data.table

# Retrieve unique team names
team_names <- unique(c(vnl_simulation$Team.A, vnl_simulation$Team.B))

Simulations

The first step to create simulations is creating a function that calculates an arbitrary number of points for a win (+3) or a loss (+0) for each game.

# A function that takes in a table then calculates points for each of the rounds.
# only input for this function is dataframe with required data
calculate_points <- function(df){
  
  # create table that will store results from for loop
  # points awarded to a team for a win (+3) or loss (+0)
  # matches_played is simply the number of matches a given team has played
  points_table <- data.table(team = team_names, 
                            point = 0, 
                            matches_played = 0)
  
  # for each row in the provided dataframe/table
  for (i in c(1:nrow(df))){
    
    # get result of current iteration and convert from factor to character
    result <- df[i]$Result
    result <- as.character(result)
    
    # select team for current iteration
    Team.A_name <- df[i]$Team.A
    # select max no. of matches played for current team
    Team.A_matches_played <- max(points_table[team == Team.A_name]$matches_played)
    # select point for current team and match_played iteration
    Team.A_point_before <- points_table[team == Team.A_name & matches_played == Team.A_matches_played]$point
    
    # if Team.A contains NA, +0
    Team.A_point <- case_when(is.na(result) ~ Team.A_point_before + 0,
                                 # if Team.A wins, +3
                                 result == 'Win' ~ Team.A_point_before + 3,
                                 # if Team.A losses, +0
                                 result == 'Lose' ~ Team.A_point_before + 0)
    
    # repeat same process for Team.B
    Team.B_name <- df[i]$Team.B
    Team.B_matches_played <- max(points_table[team == Team.B_name]$matches_played)
    Team.B_point_before <- points_table[team == Team.B_name & matches_played == Team.B_matches_played]$point
    
    Team.B_point <- case_when(is.na(result) ~ Team.B_point_before + 0,
                                 # result is switched here before result is based on Team.A
                                 result == 'Lose' ~ Team.B_point_before + 3,
                                 result == 'Win' ~ Team.B_point_before + 0)
    
    
    # add results for current team to points_table and +1 to matches played for Team.A
    points_table <- rbindlist(list(points_table, 
                                  data.table(matches_played = Team.A_matches_played + 1,
                                                          team = Team.A_name,
                                                          point = Team.A_point)),
                             # fills missing columns with NA if True
                             fill = T)
    
    # repeat for Team.B
    points_table <- rbindlist(list(points_table, 
                                  data.table(matches_played = Team.B_matches_played + 1,
                                                          team = Team.B_name,
                                                          point = Team.B_point)),
                             # fills missing columns with NA if True
                             fill = T)
    
  }
  # return points_table to iterate through till nrows reached 
  return(points_table)
}

vnl_points <- calculate_points(vnl_simulation)

We will now create the function that will simulate the matches:

# df = input dataframe that we want to simulate matches for, in this case 'vnl_simulations'
# times = no. of times you want to simulate each game 
simulate_matches <- function(df, times){
  
  # empty table to output all results
  output_table <- data.table()
  
  # for each iteration of match
  for (i in c(1:times)){
    # for each row in dataframe
    for(j in c(1:nrow(df))){
      
      # get match_row number
      match_row <- df[j]
      
      # sample current match using probabilities of Team.A winning or losing
      outcome <- sample(c("Win", "Lose"),
                        size = 1,
                        replace = TRUE,
                        prob = c(match_row$pred_win,
                                 match_row$pred_lose))
      
      # add result to sampled match and add simulation id no.
      match_row[, Result:= outcome]
      match_row[, simulation_id:= i]
      
      # bind results to output table
      output_table <- rbindlist(list(output_table, match_row), fill = T)
    }
  }
  return(output_table)
}

no._simulations <- 500

vnl_simulated_matches <- simulate_matches(vnl_simulation, no._simulations)

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Now we have the simulated matches, we can simulate the standings:

# simulate standings with simulated match results
# input df will be simulated matches
simulate_standings <- function(df){
  
  # empty table to store results in
  standings_all <- data.table()
  
  # for each simulated match in each simulation 
  for (i in c(1:max(df$simulation_id))){
    
    # all games in 1 simulation (130)
    one_simulation <- df[simulation_id == i]
    
    # use function created before to calculate points for all games in given simulation id
    simulated_fixtures <- calculate_points(one_simulation)
    # add simulation id
    simulated_fixtures[, simulation_id:= i]
    
    # bind simulated points/fixtures to empty dataframe/table
    standings_all <- rbindlist(list(standings_all, 
                                    simulated_fixtures), 
                               fill = T)
    
  }
  return(standings_all)
}

# NOTE: BE CAREFUL WITH HOW MANY SIMULATIONS YOU DO, THIS CAN TAKE AWHILE, ESPECIALLY WITH AN OLD and/or SLOW COMPUTER
# IF YOU THINK IT WILL TAKE AWHILE, CHANGE THE 'no._simulations' VARIABLE IN THE SIMULATE_MATCHES FUNCTION WE CREATED ABOVE
vnl_simulated_standings <- simulate_standings(vnl_simulated_matches)

With the finalized simulated standings, we can filter the results to produce a simulated final standings/rankings table for the VNL 2018. For example, running 500 simulations in the table below we can see France placed 1st 33 times, Russia 24 times and Serbia 11 times…

# select each simulation where teams shared maximum amount of games played (15)
# this excludes all final_pos matches
final_pos <- vnl_simulated_standings %>% 
  filter(matches_played == 15)

# order each simulation by most points
final_pos <- final_pos[order(simulation_id, -point)]

# adding a column indicating where a team would of placed in a given simulation  
final_pos[, standing:= c(1:.N), by = simulation_id]

With the final positions for each simulation, we can then look at how many times each team placed in each position from 1st to 16th. For example, from the table below when running 250 simulations, France placed 1st 63 times, Russia 60 times, and Serbia 27 times, etc.

# from here we can view how many times each team placed in each position
count_occurences <- table(final_pos$team, final_pos$standing)

Total Team Occurences
	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
Argentina	5	5	14	17	20	25	29	39	39	42	32	44	52	63	41	33
Australia	9	13	15	15	28	29	19	32	43	50	45	33	50	53	39	27
Brazil	33	59	48	52	42	48	30	39	35	23	25	21	16	16	9	4
Bulgaria	7	13	14	16	21	43	32	25	39	36	41	43	41	44	49	36
Canada	25	34	42	48	31	34	36	40	39	40	30	39	26	13	14	9
China	1	2	5	8	6	11	17	20	23	36	37	41	47	65	97	84
France	140	71	58	47	45	29	30	19	19	9	15	4	8	2	4	0
Germany	19	17	21	16	19	26	44	41	42	36	41	40	44	47	33	14
Iran	13	22	23	40	28	38	39	38	30	33	40	38	42	31	32	13
Italy	5	18	23	25	39	23	40	42	45	40	43	44	49	26	25	13
Japan	9	14	16	11	37	33	44	26	24	41	50	52	34	44	38	27
Poland	32	51	33	44	46	55	33	34	34	29	27	23	19	22	12	6
Russia	114	88	79	52	39	26	26	24	20	4	7	10	3	4	4	0
Serbia	46	47	61	46	47	33	48	33	31	30	21	16	13	11	11	6
South Korea	1	0	0	3	3	4	4	8	7	24	21	31	43	43	87	221
United States	41	46	48	60	49	43	29	40	30	27	25	21	13	16	5	7

Alternatively, we can convert these outcomes to a percentage, which indicated the probability of a team finishing in a given position. We can see that France and Russia had the highest probability of placing 1st with 25.2% and 24.0% respectively.

It’s worth mentioning as you run more simulations your results will become more accurate, but it will take longer.

# from here we can view how many times each team placed in each position
probabilities <- round(table(final_pos$team, 
                             final_pos$standing) / no._simulations * 100, 2)

Position Probabilities
	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
Argentina	1.0	1.0	2.8	3.4	4.0	5.0	5.8	7.8	7.8	8.4	6.4	8.8	10.4	12.6	8.2	6.6
Australia	1.8	2.6	3.0	3.0	5.6	5.8	3.8	6.4	8.6	10.0	9.0	6.6	10.0	10.6	7.8	5.4
Brazil	6.6	11.8	9.6	10.4	8.4	9.6	6.0	7.8	7.0	4.6	5.0	4.2	3.2	3.2	1.8	0.8
Bulgaria	1.4	2.6	2.8	3.2	4.2	8.6	6.4	5.0	7.8	7.2	8.2	8.6	8.2	8.8	9.8	7.2
Canada	5.0	6.8	8.4	9.6	6.2	6.8	7.2	8.0	7.8	8.0	6.0	7.8	5.2	2.6	2.8	1.8
China	0.2	0.4	1.0	1.6	1.2	2.2	3.4	4.0	4.6	7.2	7.4	8.2	9.4	13.0	19.4	16.8
France	28.0	14.2	11.6	9.4	9.0	5.8	6.0	3.8	3.8	1.8	3.0	0.8	1.6	0.4	0.8	0.0
Germany	3.8	3.4	4.2	3.2	3.8	5.2	8.8	8.2	8.4	7.2	8.2	8.0	8.8	9.4	6.6	2.8
Iran	2.6	4.4	4.6	8.0	5.6	7.6	7.8	7.6	6.0	6.6	8.0	7.6	8.4	6.2	6.4	2.6
Italy	1.0	3.6	4.6	5.0	7.8	4.6	8.0	8.4	9.0	8.0	8.6	8.8	9.8	5.2	5.0	2.6
Japan	1.8	2.8	3.2	2.2	7.4	6.6	8.8	5.2	4.8	8.2	10.0	10.4	6.8	8.8	7.6	5.4
Poland	6.4	10.2	6.6	8.8	9.2	11.0	6.6	6.8	6.8	5.8	5.4	4.6	3.8	4.4	2.4	1.2
Russia	22.8	17.6	15.8	10.4	7.8	5.2	5.2	4.8	4.0	0.8	1.4	2.0	0.6	0.8	0.8	0.0
Serbia	9.2	9.4	12.2	9.2	9.4	6.6	9.6	6.6	6.2	6.0	4.2	3.2	2.6	2.2	2.2	1.2
South Korea	0.2	0.0	0.0	0.6	0.6	0.8	0.8	1.6	1.4	4.8	4.2	6.2	8.6	8.6	17.4	44.2
United States	8.2	9.2	9.6	12.0	9.8	8.6	5.8	8.0	6.0	5.4	5.0	4.2	2.6	3.2	1.0	1.4

Additionally, to add some context, we can even view how our simulations compared to the actual final standings of the 2018 VNL:

Part 2 - Elo Model based Simulations

Liam Drowley - 21259021

2022-10-22

Overview

Load data

Data wrangling

Simulations