Overview

This two-part series walks through how to scrape data from the web, build an Elo-based team-rating model, and finally simulate the final standings of a competition using that model. This first part demonstrates how to scrape volleyball match results from Wikipedia and how to build an Elo-based team rating system from those results.

Web Scraping

To begin, we need to find the data required to build an Elo model. We will get it via web scraping, as there aren’t many R packages that provide volleyball data.

I’ve chosen to use the results from the 2018 FIVB Volleyball Men’s Nations League (VNL).

Firstly, we will load in the required packages to do our web scraping and some data wrangling:

# Required Libraries
library(rvest) # web scraping
library(dplyr) # data wrangling

Secondly, we need the URL of the page that we want to scrape. The read_html function reads the HTML of the page at the provided URL and returns it as a parsed document.

For this example we will be using this URL: https://en.wikipedia.org/wiki/2018_FIVB_Volleyball_Men%27s_Nations_League

# VNL 2018
url_2018 <- read_html('https://en.wikipedia.org/wiki/2018_FIVB_Volleyball_Men%27s_Nations_League')

As we can see from the image below, the data that we want is stored in tables. The next rvest function we will use is html_table, which extracts every table on the page into a list of data frames.

[Image: example table from the Wikipedia page]

tables_2018 <- url_2018 %>% 
  html_table()

# View length of tables_2018
length(tables_2018)
## [1] 69
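To get a feel for what html_table returns before working with the full page, here is a minimal sketch on a tiny hand-written page. The HTML and the toy_tables name are made up for illustration; minimal_html is an rvest helper for building small test documents.

```r
library(rvest)

# A tiny hand-written page with two tables (illustrative, not the Wikipedia page)
page <- minimal_html('
  <table>
    <tr><th>Team</th><th>Wins</th></tr>
    <tr><td>France</td><td>3</td></tr>
  </table>
  <table>
    <tr><th>Team</th><th>Losses</th></tr>
    <tr><td>Japan</td><td>1</td></tr>
  </table>
')

# html_table() on a whole document returns one data frame per <table>
toy_tables <- html_table(page)

length(toy_tables)       # number of tables found
names(toy_tables[[1]])   # column names of the first table
```

Each element of the list can be pulled out with double brackets, for example toy_tables[[2]], which is how we can inspect individual tables in tables_2018.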

We can then explore the tables_2018 list to find the tables that hold the information we actually want (69 tables were found in total). If you go back to the example tables in the image above, you can see that the table we need is the second one, which includes the team names and match scores. This can be quite tedious, but finding a pattern in the tables makes it a lot easier. For example, all of the match tables have exactly the same number of columns (12).

# Extract tables no. 11 to 30, 33, 35 and 37 to 39
tables_2018 <- tables_2018[c(11:30, 33, 35, 37:39)]
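The column-count pattern can also be checked in code rather than by eye. The sketch below uses a made-up list (toy_tables) standing in for the scraped tables. On the real page, other tables might happen to have 12 columns too, so treat the result as a shortlist to compare against the hand-picked indices rather than a replacement for them.

```r
# Toy stand-in for the scraped list: three data frames with different widths
toy_tables <- list(
  data.frame(a = 1, b = 2),
  data.frame(matrix(0, nrow = 2, ncol = 12)),
  data.frame(matrix(0, nrow = 3, ncol = 12))
)

# Count the columns of every table, then keep the positions that have 12
n_cols <- vapply(toy_tables, ncol, integer(1))
match_idx <- which(n_cols == 12)
match_idx

# Subset the list by those positions
match_tables <- toy_tables[match_idx]
```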

We now run into the dilemma that lists are not great to work with, so we want to combine all these tables into a single dataframe, which makes viewing and manipulating the data a lot easier.

# do.call calls a function with the elements of a list as its arguments
# 'what' is the function you would like to call, and 'args' is the list of
# arguments to pass to it, so this is equivalent to rbind(table1, table2, ...)
# we want to save the output as a dataframe, so we wrap it in 'as.data.frame'
vnl_2018 <- as.data.frame(do.call(what = rbind, 
                                  args = tables_2018)) 

VNL 2018 Results

Date    Time              Score                 Set 1  Set 2  Set 3  Set 4  Set 5  Total    Report
25 May  17:00  Australia  1–3    Japan          18–25  15–25  25–23  17–25  NA     75–98    P2 Report
25 May  20:00  France     3–1    Iran           25–20  24–26  25–20  25–17  NA     99–83    P2 Report
26 May  17:00  Australia  0–3    Iran           23–25  23–25  21–25         NA     67–75    P2 Report
26 May  20:00  France     3–1    Japan          25–16  20–25  25–20  25–22  NA     95–83    P2 Report
27 May  15:00  Iran       1–3    Japan          22–25  28–30  25–23  23–25  NA     98–103   P2 Report
27 May  18:00  France     3–0    Australia      25–17  25–20  36–34         NA     86–71    P2 Report
25 May  16:00  Argentina  2–3    United States  27–25  26–24  24–26  21–25  10–15  108–115  P2 Report
25 May  19:30  China      2–3    Bulgaria       18–25  25–18  25–19  17–25  11–15  96–102   P2 Report
26 May  16:00  Bulgaria   1–3    United States  19–25  25–22  19–25  20–25         83–97    P2 Report
26 May  19:30  China      3–0    Argentina      25–22  25–21  25–18                75–61    P2 Report
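
To see what the do.call step is doing, here is a small sketch with two made-up one-row data frames (pool_1 and pool_2 are illustrative names, not objects from the scrape). Calling do.call with rbind is the same as calling rbind with every list element written out by hand.

```r
# Two toy tables with identical columns, as the scraped match tables have
pool_1 <- data.frame(Team.A = "Australia", Score = "1–3", Team.B = "Japan")
pool_2 <- data.frame(Team.A = "France", Score = "3–1", Team.B = "Iran")
toy_list <- list(pool_1, pool_2)

# do.call() passes each list element as a separate argument,
# so this is the same call as rbind(pool_1, pool_2)
stacked <- do.call(what = rbind, args = toy_list)

identical(stacked, rbind(pool_1, pool_2))
nrow(stacked)
```

This only works cleanly when every table has the same columns, which is why filtering the list down to the 12-column match tables first matters.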


Now we might just want to add a few things to this newly created dataframe and do a little data wrangling. Firstly, we can observe that the columns containing team names do not have a column name. Additionally, we may want to add a column indicating what year this competition was held in as we will be scraping additional data from future years.

# Change column names to Team.A and Team.B
names(vnl_2018)[3] <- 'Team.A'
names(vnl_2018)[5] <- 'Team.B'

vnl_2018 <- vnl_2018 %>% 
  # Add year column
  mutate(Year = 2018) %>% 
  # Relocate year column as first column to make dataframe look a bit neater
  relocate(Year, .before = Date)

Elo

Now we will go ahead and calculate the Elo rating for each team after each game they played throughout the competition. Thankfully there is a nice package, elo, that makes this a lot easier than calculating it manually (especially when the competition structure is messy).

Even still, we need to do some data wrangling to use the elo package properly:

# select necessary columns to manipulate
vnl_elo <- vnl_2018 %>% 
  select(Year, Team.A, Team.B, Score)

vnl_elo <- vnl_elo %>% 
  # take the 1st character of the Score column as Team.A's sets won (as a number)
  mutate(Team.A_score = as.integer(substr(Score, 1, 1)),
  # take the 3rd character of the Score column as Team.B's sets won (as a number)
         Team.B_score = as.integer(substr(Score, 3, 3))) %>% 
  # Determine if Team.A won (1) or lost (0), stored as a number for elo.run
  mutate(Result = ifelse(Team.A_score > Team.B_score, 1, 0))
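
Picking characters by position works here because a volleyball match score is always a single digit on each side. As an alternative sketch that does not depend on character positions, the score can be split on the separator instead. The scores vector below is made up for illustration, and the separator is assumed to be the en dash that appears in the Score column.

```r
# Toy scores in the same "A–B" form as the Score column (en dash separator)
scores <- c("1–3", "3–1", "3–0")

# Split each score on the en dash and convert both halves to integers
parts <- strsplit(scores, "–", fixed = TRUE)
team_a <- as.integer(vapply(parts, `[`, character(1), 1))
team_b <- as.integer(vapply(parts, `[`, character(1), 2))

# 1 if Team.A won the match, 0 otherwise
result <- as.numeric(team_a > team_b)
result
```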

We will use a function called elo.run, which requires four main inputs:

  • formula - gives the function the team names and the result of each game (as a numeric)
  • data - the dataframe in which to look up those values
  • k - a value that amplifies or dampens each update, determining how much Elo can be gained or lost in a given game
  • initial.elos - the Elo rating each team starts with

The k and initial.elos parameters can be tuned for greater accuracy, but for simplicity I will provide some defaults.

library(elo)
vnl_elo_model <- elo.run(formula = Result ~ Team.A + Team.B,
        k = 27,
        initial.elos = 1500,
        data = vnl_elo)
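
To make the role of k concrete, a single rating update can be worked through by hand. This is a minimal sketch of the standard Elo formulas, which is what we assume elo.run applies game by game. The helper names expected_score and update_elo are made up for this example and are not part of the elo package.

```r
# Expected score of team A against team B under the standard Elo formula
expected_score <- function(elo_a, elo_b) {
  1 / (1 + 10^((elo_b - elo_a) / 400))
}

# One rating update: both teams move by k * (actual - expected),
# in opposite directions
update_elo <- function(elo_a, elo_b, result_a, k) {
  change <- k * (result_a - expected_score(elo_a, elo_b))
  c(team_a = elo_a + change, team_b = elo_b - change)
}

# Two teams starting at 1500, team A wins, k = 27
update_elo(1500, 1500, result_a = 1, k = 27)
```

With equal ratings the expected score is 0.5, so the winner gains 27 × 0.5 = 13.5 points and the loser drops by the same amount. A larger k makes every result move the ratings further.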

We can also save the model as a dataframe to view the change in Elo for each game:

vnl_elo_ratings <- vnl_elo_model %>% 
  # save as data.frame
  as.data.frame() %>% 
  # and round numbers to 2 decimal places
  mutate(across(where(is.numeric), ~ round(.x, digits = 2)))

And finally, we can use the vnl_elo_model created above to access the final Elo ratings of each team:

final.elos(vnl_elo_model) 
##     Argentina     Australia        Brazil      Bulgaria        Canada 
##      1438.900      1450.007      1522.737      1468.489      1510.782 
##         China        France       Germany          Iran         Italy 
##      1403.962      1618.864      1490.840      1500.198      1503.100 
##         Japan        Poland        Russia        Serbia   South Korea 
##      1462.086      1516.666      1628.448      1550.325      1361.229 
## United States 
##      1573.367
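
Since the final ratings come back as a named numeric vector, ranking the teams is a one-liner. The sketch below copies a handful of the values printed above into a vector by hand (the ratings name is illustrative); with the real model, the same sort call could be applied to final.elos(vnl_elo_model) directly.

```r
# A few of the final ratings printed above, stored as a named numeric vector
ratings <- c(France = 1618.864, Russia = 1628.448, China = 1403.962,
             `South Korea` = 1361.229, `United States` = 1573.367)

# Highest-rated team first
ranked <- sort(ratings, decreasing = TRUE)
names(ranked)
```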