This is a two-part series that walks you through web scraping, building an Elo-based team-rating model, and finally simulating the final standings of a competition with that model. This first part demonstrates how to scrape volleyball match results from Wikipedia and how to build an Elo-based team rating system from those results.
To begin, we need the data required to build an Elo model. We will get it via web scraping, as there aren’t many R packages that provide volleyball data.
I’ve chosen to use the results from the 2018 FIVB Volleyball Men’s Nations League.
Firstly, we will load in the required packages to do our web scraping and some data wrangling:
# Required Libraries
library(rvest) # web scraping
library(dplyr) # data wrangling
Secondly, we need the URL of the page that we want to
scrape. The read_html function downloads the page at the
provided URL and returns its parsed HTML.
For this example we will be using this URL: https://en.wikipedia.org/wiki/2018_FIVB_Volleyball_Men%27s_Nations_League
# VNL 2018
url_2018 <- read_html('https://en.wikipedia.org/wiki/2018_FIVB_Volleyball_Men%27s_Nations_League')
As we can see from the image below, the data that we want is stored
in a table format. So the next function we will use from
rvest is the html_table function, which
extracts every table on the page into a list of data frames.
tables_2018 <- url_2018 %>%
  html_table()
# View length of tables_2018
length(tables_2018)
## [1] 69
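To see what html_table returns without touching the network, here is a minimal sketch on a hand-written HTML snippet. The snippet and its contents are made up for illustration and are not the Wikipedia markup; minimal_html is rvest's helper for building a small page from a string.

```r
library(rvest)

# A made-up page with two small tables (illustrative only)
toy_page <- minimal_html('
  <table>
    <tr><th>Team</th><th>Wins</th></tr>
    <tr><td>France</td><td>3</td></tr>
  </table>
  <table>
    <tr><th>Date</th><th>Score</th></tr>
    <tr><td>25 May</td><td>3-1</td></tr>
    <tr><td>26 May</td><td>0-3</td></tr>
  </table>
')

# Called on a whole document, html_table() returns one data frame per <table>
toy_tables <- toy_page %>% html_table()

length(toy_tables)      # number of tables found
ncol(toy_tables[[2]])   # number of columns in the second table
```

Each element of the list can then be inspected on its own with double brackets, for example toy_tables[[2]].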
We can then explore the tables_2018 list and search for
the tables with the information that we actually want (as there were 69
tables found in total). If you go back and view the example tables in
the image above, you can observe that the table we require is the 2nd
one, which includes the team names and match scores. This may be quite
tedious to do, but finding a pattern in the tables makes it a
lot easier. For example, all these tables have the exact same number of
columns (12).
# Extract tables no. 11 to 30, 33, 35 and 37 to 39
tables_2018 <- tables_2018[c(11:30, 33, 35, 37:39)]
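The column-count pattern mentioned above can be used to shortlist candidate tables instead of opening all 69 by hand. The following is a sketch on a toy list (the data frames are stand-ins, not the scraped tables). On the real page other tables may happen to have 12 columns too, so the shortlisted positions would still need a quick manual check.

```r
# Toy stand-in for the scraped list: three data frames with different widths
toy_tables <- list(
  data.frame(a = 1, b = 2),
  data.frame(matrix(0, nrow = 2, ncol = 12)),
  data.frame(matrix(0, nrow = 3, ncol = 12))
)

# Count the columns of every table, then keep the positions with exactly 12
n_cols <- vapply(toy_tables, ncol, integer(1))
match_idx <- which(n_cols == 12)
match_tables <- toy_tables[match_idx]

match_idx
```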
We now run into the dilemma that lists are not great to work with, so we want to turn all these tables into a single dataframe, which makes viewing and manipulating the data a lot easier.
# do.call calls a function with the elements of a list as its arguments
# 'what' is the function you would like to call, and 'args' is the list of
# arguments to pass to it (here, every table in tables_2018)
# and we want to save the output as a dataframe, so we use 'as.data.frame'
vnl_2018 <- as.data.frame(do.call(what = rbind,
                                  args = tables_2018))
| Date | Time |  | Score |  | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Total | Report |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 May | 17:00 | Australia | 1–3 | Japan | 18–25 | 15–25 | 25–23 | 17–25 | NA | 75–98 | P2 Report |
| 25 May | 20:00 | France | 3–1 | Iran | 25–20 | 24–26 | 25–20 | 25–17 | NA | 99–83 | P2 Report |
| 26 May | 17:00 | Australia | 0–3 | Iran | 23–25 | 23–25 | 21–25 |  | NA | 67–75 | P2 Report |
| 26 May | 20:00 | France | 3–1 | Japan | 25–16 | 20–25 | 25–20 | 25–22 | NA | 95–83 | P2 Report |
| 27 May | 15:00 | Iran | 1–3 | Japan | 22–25 | 28–30 | 25–23 | 23–25 | NA | 98–103 | P2 Report |
| 27 May | 18:00 | France | 3–0 | Australia | 25–17 | 25–20 | 36–34 |  | NA | 86–71 | P2 Report |
| 25 May | 16:00 | Argentina | 2–3 | United States | 27–25 | 26–24 | 24–26 | 21–25 | 10–15 | 108–115 | P2 Report |
| 25 May | 19:30 | China | 2–3 | Bulgaria | 18–25 | 25–18 | 25–19 | 17–25 | 11–15 | 96–102 | P2 Report |
| 26 May | 16:00 | Bulgaria | 1–3 | United States | 19–25 | 25–22 | 19–25 | 20–25 |  | 83–97 | P2 Report |
| 26 May | 19:30 | China | 3–0 | Argentina | 25–22 | 25–21 | 25–18 |  |  | 75–61 | P2 Report |
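To make the do.call step concrete, here is a small sketch with two toy tables that share the same column names (the values are illustrative stand-ins). Calling do.call with rbind and a list is equivalent to passing every element of the list to rbind directly.

```r
# Two toy tables with identical column names, standing in for the scraped match tables
pool_1 <- data.frame(Date = "25 May", Score = "1-3")
pool_2 <- data.frame(Date = c("25 May", "26 May"), Score = c("2-3", "3-0"))
toy_list <- list(pool_1, pool_2)

# do.call(rbind, toy_list) is the same as writing rbind(pool_1, pool_2)
stacked <- do.call(rbind, toy_list)

nrow(stacked)   # rows from both tables, stacked in list order
```

The advantage over writing rbind by hand is that the call does not change when the list grows from 2 tables to 25.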
Now we might just want to add a few things to this newly created
dataframe and do a little data wrangling. Firstly, we can observe that
the columns containing team names do not have a column name.
Additionally, we may want to add a column indicating what year this
competition was held in as we will be scraping additional data from
future years.
# Change column names to Team.A and Team.B
names(vnl_2018)[3] <- 'Team.A'
names(vnl_2018)[5] <- 'Team.B'
vnl_2018 <- vnl_2018 %>%
  # Add year column
  mutate(Year = 2018) %>%
  # Relocate year column as first column to make dataframe look a bit neater
  relocate(Year, .before = Date)
Now we will go ahead and calculate the Elo for each team, for each
game they played throughout the competition. Thankfully there is a nice
package, elo, that makes this a lot easier to do compared
to trying to calculate it manually (especially when the competition structure
is messy).
Even so, we need to do some data wrangling to use the
elo package properly:
# select necessary columns to manipulate
vnl_elo <- vnl_2018 %>%
  select(Year, Team.A, Team.B, Score)
vnl_elo <- vnl_elo %>%
  # select 1st character within Score column to represent Team.A score
  mutate(Team.A_score = as.integer(substr(Score, 1, 1)),
         # select 3rd character within Score column to represent Team.B score
         Team.B_score = as.integer(substr(Score, 3, 3))) %>%
  # Determine if Team.A won (1) or lost (0), stored as a number for the Elo model
  mutate(Result = ifelse(Team.A_score > Team.B_score, 1, 0))
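The substr calls above work because volleyball set counts are always a single digit. As a sketch of an alternative that does not depend on character positions, the score can be split on the en dash that Wikipedia uses as the separator. The parse_score helper below is hypothetical (it is not part of any package), and it assumes every score has the form "sets–sets".

```r
# Hypothetical helper: split a "3–1" style score on the en dash and return integers
parse_score <- function(score) {
  parts <- strsplit(score, "\u2013", fixed = TRUE)
  data.frame(
    a = as.integer(vapply(parts, `[`, character(1), 1)),
    b = as.integer(vapply(parts, `[`, character(1), 2))
  )
}

# Three scores written with the en dash (\u2013), as in the scraped table
scores <- parse_score(c("1\u20133", "3\u20130", "2\u20133"))

# 1 if the first-listed team won, 0 otherwise
result <- as.integer(scores$a > scores$b)
result
```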
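We will use a function called elo.run, which takes 4
main inputs:
- formula: the match result as a function of the two teams (here Result ~ Team.A + Team.B)
- data: the dataframe that holds those columns
- k: the K-factor, which controls how far ratings move after each game
- initial.elos: the rating that every team starts with
The k and initial.elos parameters can be tuned for greater accuracy, but for simplicity I will provide some defaults.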
library(elo)
vnl_elo_model <- elo.run(formula = Result ~ Team.A + Team.B,
                         k = 27,
                         initial.elos = 1500,
                         data = vnl_elo)
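For intuition about what k does, here is a base R sketch of a single rating update under the standard Elo formula. The function names elo_expected and elo_update are my own for illustration. I am assuming elo.run applies this same standard update, but the sketch does not rely on the package.

```r
# Expected score of team A against team B under the standard Elo formula
elo_expected <- function(elo_a, elo_b) {
  1 / (1 + 10^((elo_b - elo_a) / 400))
}

# New rating for team A after one game; result_a is 1 for a win and 0 for a loss
elo_update <- function(elo_a, elo_b, result_a, k = 27) {
  elo_a + k * (result_a - elo_expected(elo_a, elo_b))
}

# Two teams that both start at 1500: the expected score is 0.5,
# so the winner gains k / 2 points and the loser drops by the same amount
elo_update(1500, 1500, result_a = 1)  # 1513.5
elo_update(1500, 1500, result_a = 0)  # 1486.5
```

A larger k makes ratings react faster to each result, while a smaller k makes them more stable.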
We can also save the model as a dataframe to view the change in Elo for each game:
vnl_elo_ratings <- vnl_elo_model %>%
  # save as data.frame
  as.data.frame() %>%
  # and round numbers to 2 decimal places
  mutate_if(is.numeric, round, digits = 2)
And finally, we can use the vnl_elo_model created above
to access the final Elo ratings of each team:
final.elos(vnl_elo_model)
## Argentina Australia Brazil Bulgaria Canada
## 1438.900 1450.007 1522.737 1468.489 1510.782
## China France Germany Iran Italy
## 1403.962 1618.864 1490.840 1500.198 1503.100
## Japan Poland Russia Serbia South Korea
## 1462.086 1516.666 1628.448 1550.325 1361.229
## United States
## 1573.367
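The printed ratings are in alphabetical order, so a natural next step is to rank them. The sketch below copies the values printed above into a named vector so that it runs on its own. With the model object available, the same sort could be applied directly to the output of final.elos.

```r
# Final ratings copied from the output above
final_ratings <- c(
  Argentina = 1438.900, Australia = 1450.007, Brazil = 1522.737,
  Bulgaria = 1468.489, Canada = 1510.782, China = 1403.962,
  France = 1618.864, Germany = 1490.840, Iran = 1500.198,
  Italy = 1503.100, Japan = 1462.086, Poland = 1516.666,
  Russia = 1628.448, Serbia = 1550.325, `South Korea` = 1361.229,
  `United States` = 1573.367
)

# Sort from strongest to weakest
ranked <- sort(final_ratings, decreasing = TRUE)
head(names(ranked), 3)   # the three highest-rated teams

# Points only move between teams, so the average stays at the starting rating
mean(final_ratings)
```

Russia, France and the United States come out on top, and the mean of all 16 ratings is still the initial 1500.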