Overview

This is Part 2 of my 2 part tutorial series. This tutorial aims to walk you through how to simulate matches in a given competition, and then simulate the final standings of the competition. We will use the Elo model created in Part 1 to predict the probability of Team.A winning or losing. We will use these predictions to simulate each game n times and tally up the results.

Load data

If you are following on from Part 1, we will be using the vnl_elo_model and vnl_elo_ratings dataframe.

If not, here is a link to Part 1, alternatively, I have provided the code below to recreate the dataframe and model:

Click to view code
# Required Libraries
library(rvest) # web scraping
library(dplyr) # data wrangling
library(elo)   # Elo package

# VNL 2018 URL
url_2018 <- read_html('https://en.wikipedia.org/wiki/2018_FIVB_Volleyball_Men%27s_Nations_League')

# Tables from URL
tables_2018 <- url_2018 %>% 
  html_table()

# Extract tables no. 11 to 30, 33, 35 and 37 to 39
tables_2018 <- tables_2018[c(11:30, 33, 35, 37:39)]

# bind all lists together into a dataframe
vnl_2018 <- as.data.frame(do.call(what = rbind, 
                                  args = tables_2018)) 

# Change column names to Team.A and Team.B
names(vnl_2018)[3] <- 'Team.A'
names(vnl_2018)[5] <- 'Team.B'

# Adding year
vnl_2018 <- vnl_2018 %>% 
  # Add year column
  mutate(Year = 2018) %>% 
  # Relocate year column as first column to make dataframe look a bit neater
  relocate(Year, .before = Date)

# select necessary columns to manipulate for Elo model
vnl_elo <- vnl_2018 %>% 
  select(Year, Team.A, Team.B, Score)

vnl_elo <- vnl_elo %>% 
  # select 1st character within Score column to represent Team.A score
  mutate(Team.A_score = substr(Score, 1, 1),
  # select 3rd character within Score column to represent Team.B score
         Team.B_score = substr(Score, 3, 3)) %>% 
  # Determining if Team.A won (1) or lost (0)
  mutate(Result = ifelse(Team.A_score > Team.B_score, '1', '0'))

# Elo model
vnl_elo_model <- elo.run(formula = Result ~ Team.A + Team.B,
        k = 27,
        initial.elos = 1500,
        data = vnl_elo)

# Save as dataframe
vnl_elo_ratings <- vnl_elo_model %>% 
  # save as data.frame
  as.data.frame() %>% 
  # and round numbers to 2 decimal places
  mutate_if(is.numeric, round, digits = 2)

Data wrangling

Before starting the simulations, there is some data wrangling to be done. We need to convert the dataframe to a data.table for the simulation to work properly, add a game no. column to reference which game to simulate, add result of a match as a character (win, loss) and use the elo_model to predict whether Team.A will win or lose, and the predicted probabilites of a win or loss.

Also need to create a vector with unique team names.

# Required Libraries
library(data.table) #for using data.table class
# add game and result to dataframe
vnl_simulation <- vnl_elo_ratings %>% 
  # renaming team columns to match the original df (vnl_elo)
  rename(Team.A = team.A,
         Team.B = team.B) %>% 
  # changing team columns back to a character
  mutate(Team.A = as.character(Team.A),
         Team.B = as.character(Team.B),
         # include game number ro reference simulations
         game = 1:130,
         # add Result column from vnl_elo to indicate win or lose
         Result = vnl_elo$Result) %>% 
  # change Result to character
  mutate(Result = ifelse(Result == 1, 'Win', 'Lose'))

# predict using Elo model + convert to data.table
vnl_simulation <- vnl_simulation %>% 
  # use vnl_elo_model to make predictions on all games in 2018
  # predicts whether Team.A will win/lose + probability of winning
  mutate(pred_win = predict(vnl_elo_model, newdata = vnl_simulation)) %>% 
  # predicted probability of Team.A losing
  mutate(pred_lose = 1 - pred_win) %>% 
  ungroup() %>% 
  # select required variables
  select(game, Team.A, Team.B, Result, pred_win, pred_lose) %>% 
  # simulations need to be in a data.table format
  as.data.table

# Retrieve unique team names
team_names <- unique(c(vnl_simulation$Team.A, vnl_simulation$Team.B))

Simulations

The first step to create simulations is creating a function that calculates an arbitrary number of points for a win (+3) or a loss (+0) for each game.

# A function that takes in a table then calculates points for each of the rounds.
# only input for this function is dataframe with required data
calculate_points <- function(df){
  
  # create table that will store results from for loop
  # points awarded to a team for a win (+3) or loss (+0)
  # matches_played is simply the number of matches a given team has played
  points_table <- data.table(team = team_names, 
                            point = 0, 
                            matches_played = 0)
  
  # for each row in the provided dataframe/table
  for (i in c(1:nrow(df))){
    
    # get result of current iteration and convert from factor to character
    result <- df[i]$Result
    result <- as.character(result)
    
    # select team for current iteration
    Team.A_name <- df[i]$Team.A
    # select max no. of matches played for current team
    Team.A_matches_played <- max(points_table[team == Team.A_name]$matches_played)
    # select point for current team and match_played iteration
    Team.A_point_before <- points_table[team == Team.A_name & matches_played == Team.A_matches_played]$point
    
    # if Team.A contains NA, +0
    Team.A_point <- case_when(is.na(result) ~ Team.A_point_before + 0,
                                 # if Team.A wins, +3
                                 result == 'Win' ~ Team.A_point_before + 3,
                                 # if Team.A losses, +0
                                 result == 'Lose' ~ Team.A_point_before + 0)
    
    # repeat same process for Team.B
    Team.B_name <- df[i]$Team.B
    Team.B_matches_played <- max(points_table[team == Team.B_name]$matches_played)
    Team.B_point_before <- points_table[team == Team.B_name & matches_played == Team.B_matches_played]$point
    
    Team.B_point <- case_when(is.na(result) ~ Team.B_point_before + 0,
                                 # result is switched here before result is based on Team.A
                                 result == 'Lose' ~ Team.B_point_before + 3,
                                 result == 'Win' ~ Team.B_point_before + 0)
    
    
    # add results for current team to points_table and +1 to matches played for Team.A
    points_table <- rbindlist(list(points_table, 
                                  data.table(matches_played = Team.A_matches_played + 1,
                                                          team = Team.A_name,
                                                          point = Team.A_point)),
                             # fills missing columns with NA if True
                             fill = T)
    
    # repeat for Team.B
    points_table <- rbindlist(list(points_table, 
                                  data.table(matches_played = Team.B_matches_played + 1,
                                                          team = Team.B_name,
                                                          point = Team.B_point)),
                             # fills missing columns with NA if True
                             fill = T)
    
  }
  # return points_table to iterate through till nrows reached 
  return(points_table)
}

vnl_points <- calculate_points(vnl_simulation)

We will now create the function that will simulate the matches:

# df = input dataframe that we want to simulate matches for, in this case 'vnl_simulations'
# times = no. of times you want to simulate each game 
simulate_matches <- function(df, times){
  
  # empty table to output all results
  output_table <- data.table()
  
  # for each iteration of match
  for (i in c(1:times)){
    # for each row in dataframe
    for(j in c(1:nrow(df))){
      
      # get match_row number
      match_row <- df[j]
      
      # sample current match using probabilities of Team.A winning or losing
      outcome <- sample(c("Win", "Lose"),
                        size = 1,
                        replace = TRUE,
                        prob = c(match_row$pred_win,
                                 match_row$pred_lose))
      
      # add result to sampled match and add simulation id no.
      match_row[, Result:= outcome]
      match_row[, simulation_id:= i]
      
      # bind results to output table
      output_table <- rbindlist(list(output_table, match_row), fill = T)
    }
  }
  return(output_table)
}

no._simulations <- 500

vnl_simulated_matches <- simulate_matches(vnl_simulation, no._simulations)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Now we have the simulated matches, we can simulate the standings:

# simulate standings with simulated match results
# input df will be simulated matches
simulate_standings <- function(df){
  
  # empty table to store results in
  standings_all <- data.table()
  
  # for each simulated match in each simulation 
  for (i in c(1:max(df$simulation_id))){
    
    # all games in 1 simulation (130)
    one_simulation <- df[simulation_id == i]
    
    # use function created before to calculate points for all games in given simulation id
    simulated_fixtures <- calculate_points(one_simulation)
    # add simulation id
    simulated_fixtures[, simulation_id:= i]
    
    # bind simulated points/fixtures to empty dataframe/table
    standings_all <- rbindlist(list(standings_all, 
                                    simulated_fixtures), 
                               fill = T)
    
  }
  return(standings_all)
}

# NOTE: BE CAREFUL WITH HOW MANY SIMULATIONS YOU DO, THIS CAN TAKE AWHILE, ESPECIALLY WITH AN OLD and/or SLOW COMPUTER
# IF YOU THINK IT WILL TAKE AWHILE, CHANGE THE 'no._simulations' VARIABLE IN THE SIMULATE_MATCHES FUNCTION WE CREATED ABOVE
vnl_simulated_standings <- simulate_standings(vnl_simulated_matches)

With the finalized simulated standings, we can filter the results to produce a simulated final standings/rankings table for the VNL 2018. For example, running 500 simulations in the table below we can see France placed 1st 33 times, Russia 24 times and Serbia 11 times…

# select each simulation where teams shared maximum amount of games played (15)
# this excludes all final_pos matches
final_pos <- vnl_simulated_standings %>% 
  filter(matches_played == 15)

# order each simulation by most points
final_pos <- final_pos[order(simulation_id, -point)]

# adding a column indicating where a team would of placed in a given simulation  
final_pos[, standing:= c(1:.N), by = simulation_id]

With the final positions for each simulation, we can then look at how many times each team placed in each position from 1st to 16th. For example, from the table below when running 250 simulations, France placed 1st 63 times, Russia 60 times, and Serbia 27 times, etc.

# from here we can view how many times each team placed in each position
count_occurences <- table(final_pos$team, final_pos$standing)
Total Team Occurences
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Argentina 5 5 14 17 20 25 29 39 39 42 32 44 52 63 41 33
Australia 9 13 15 15 28 29 19 32 43 50 45 33 50 53 39 27
Brazil 33 59 48 52 42 48 30 39 35 23 25 21 16 16 9 4
Bulgaria 7 13 14 16 21 43 32 25 39 36 41 43 41 44 49 36
Canada 25 34 42 48 31 34 36 40 39 40 30 39 26 13 14 9
China 1 2 5 8 6 11 17 20 23 36 37 41 47 65 97 84
France 140 71 58 47 45 29 30 19 19 9 15 4 8 2 4 0
Germany 19 17 21 16 19 26 44 41 42 36 41 40 44 47 33 14
Iran 13 22 23 40 28 38 39 38 30 33 40 38 42 31 32 13
Italy 5 18 23 25 39 23 40 42 45 40 43 44 49 26 25 13
Japan 9 14 16 11 37 33 44 26 24 41 50 52 34 44 38 27
Poland 32 51 33 44 46 55 33 34 34 29 27 23 19 22 12 6
Russia 114 88 79 52 39 26 26 24 20 4 7 10 3 4 4 0
Serbia 46 47 61 46 47 33 48 33 31 30 21 16 13 11 11 6
South Korea 1 0 0 3 3 4 4 8 7 24 21 31 43 43 87 221
United States 41 46 48 60 49 43 29 40 30 27 25 21 13 16 5 7

Alternatively, we can convert these outcomes to a percentage, which indicated the probability of a team finishing in a given position. We can see that France and Russia had the highest probability of placing 1st with 25.2% and 24.0% respectively.

It’s worth mentioning as you run more simulations your results will become more accurate, but it will take longer.

# from here we can view how many times each team placed in each position
probabilities <- round(table(final_pos$team, 
                             final_pos$standing) / no._simulations * 100, 2)
Position Probabilities
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Argentina 1.0 1.0 2.8 3.4 4.0 5.0 5.8 7.8 7.8 8.4 6.4 8.8 10.4 12.6 8.2 6.6
Australia 1.8 2.6 3.0 3.0 5.6 5.8 3.8 6.4 8.6 10.0 9.0 6.6 10.0 10.6 7.8 5.4
Brazil 6.6 11.8 9.6 10.4 8.4 9.6 6.0 7.8 7.0 4.6 5.0 4.2 3.2 3.2 1.8 0.8
Bulgaria 1.4 2.6 2.8 3.2 4.2 8.6 6.4 5.0 7.8 7.2 8.2 8.6 8.2 8.8 9.8 7.2
Canada 5.0 6.8 8.4 9.6 6.2 6.8 7.2 8.0 7.8 8.0 6.0 7.8 5.2 2.6 2.8 1.8
China 0.2 0.4 1.0 1.6 1.2 2.2 3.4 4.0 4.6 7.2 7.4 8.2 9.4 13.0 19.4 16.8
France 28.0 14.2 11.6 9.4 9.0 5.8 6.0 3.8 3.8 1.8 3.0 0.8 1.6 0.4 0.8 0.0
Germany 3.8 3.4 4.2 3.2 3.8 5.2 8.8 8.2 8.4 7.2 8.2 8.0 8.8 9.4 6.6 2.8
Iran 2.6 4.4 4.6 8.0 5.6 7.6 7.8 7.6 6.0 6.6 8.0 7.6 8.4 6.2 6.4 2.6
Italy 1.0 3.6 4.6 5.0 7.8 4.6 8.0 8.4 9.0 8.0 8.6 8.8 9.8 5.2 5.0 2.6
Japan 1.8 2.8 3.2 2.2 7.4 6.6 8.8 5.2 4.8 8.2 10.0 10.4 6.8 8.8 7.6 5.4
Poland 6.4 10.2 6.6 8.8 9.2 11.0 6.6 6.8 6.8 5.8 5.4 4.6 3.8 4.4 2.4 1.2
Russia 22.8 17.6 15.8 10.4 7.8 5.2 5.2 4.8 4.0 0.8 1.4 2.0 0.6 0.8 0.8 0.0
Serbia 9.2 9.4 12.2 9.2 9.4 6.6 9.6 6.6 6.2 6.0 4.2 3.2 2.6 2.2 2.2 1.2
South Korea 0.2 0.0 0.0 0.6 0.6 0.8 0.8 1.6 1.4 4.8 4.2 6.2 8.6 8.6 17.4 44.2
United States 8.2 9.2 9.6 12.0 9.8 8.6 5.8 8.0 6.0 5.4 5.0 4.2 2.6 3.2 1.0 1.4

Additionally, to add some context, we can even view how our simulations compared to the actual final standings of the 2018 VNL: