To build the Elo rankings, I searched the web for historical Serie A results and got my hands on several data frames covering every season from 1999-2000 to 2019-2020. I have already cleaned and tidied them a bit with SQL, so we’ll jump straight into R.
install.packages("tinytex")
Load the tidyverse and read the data into R.
library("tidyverse")
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts --------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
serie_a_2020 <- read.csv("serie_a_2019_2020.csv")
serie_a_2019 <- read.csv("serie_a_2018_2019.csv")
serie_a_2018 <- read.csv("serie_a_2017_2018.csv")
serie_a_2017 <- read.csv("serie_a_2016_2017.csv")
serie_a_2016 <- read.csv("serie_a_2015_2016.csv")
serie_a_2015 <- read.csv("serie_a_2014_2015.csv")
serie_a_2014 <- read.csv("serie_a_2013_2014.csv")
serie_a_2013 <- read.csv("serie_a_2012_2013.csv")
serie_a_2012 <- read.csv("serie_a_2011_2012.csv")
serie_a_2011 <- read.csv("serie_a_2010_2011.csv")
serie_a_2010 <- read.csv("serie_a_2009_2010.csv")
serie_a_2009 <- read.csv("serie_a_2008_2009.csv")
serie_a_2008 <- read.csv("serie_a_2007_2008.csv")
serie_a_2007 <- read.csv("serie_a_2006_2007.csv")
serie_a_2006 <- read.csv("serie_a_2005_2006.csv")
serie_a_2005 <- read.csv("serie_a_2004_2005.csv")
serie_a_2004 <- read.csv("serie_a_2003_2004.csv")
serie_a_2003 <- read.csv("serie_a_2002_2003.csv")
serie_a_2002 <- read.csv("serie_a_2001_2002.csv")
serie_a_2001 <- read.csv("serie_a_2000_2001.csv")
serie_a_2000 <- read.csv("serie_a_1999_2000.csv")
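The twenty-one read.csv calls above all follow the same naming pattern, so the file names could also be generated instead of typed out. This is only a minimal sketch in base R, assuming the files sit in the working directory; `start_years`, `season_files`, `present` and `seasons` are names made up for this example and are not used elsewhere in the post.

```r
# Season start years: 1999 stands for the 1999-2000 season, and so on.
start_years <- 1999:2019

# Build "serie_a_1999_2000.csv", ..., "serie_a_2019_2020.csv".
season_files <- sprintf("serie_a_%d_%d.csv", start_years, start_years + 1)

# Read only the files that actually exist, so the sketch does not fail
# when some (or all) of them are missing.
present <- season_files[file.exists(season_files)]
seasons <- lapply(present, read.csv)
names(seasons) <- present

print(length(season_files))  # 21 file names
```

The result is a list with one data frame per season, in chronological order, rather than twenty-one separate objects.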
Now select the variables of interest from each data frame; we’ll need them later.
serie_a_2000<- serie_a_2000 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2001<- serie_a_2001 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2002<- serie_a_2002 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2003<- serie_a_2003 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2004<- serie_a_2004 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2005<- serie_a_2005 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2006<- serie_a_2006 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2007<- serie_a_2007 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2008<- serie_a_2008 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2009<- serie_a_2009 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2010<- serie_a_2010 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2011<- serie_a_2011 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2012<- serie_a_2012 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2013<- serie_a_2013 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2014<- serie_a_2014 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2015<- serie_a_2015 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2016<- serie_a_2016 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2017<- serie_a_2017 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2018<- serie_a_2018 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2019<- serie_a_2019 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
serie_a_2020<- serie_a_2020 %>%
select(Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR)
Now we can merge the data frames into one. We can use rbind because every data frame has the same column names and the same number of columns.
serie_a_total <- rbind(serie_a_2000, serie_a_2001, serie_a_2002,
serie_a_2003,serie_a_2004, serie_a_2005, serie_a_2006,
serie_a_2007, serie_a_2008, serie_a_2009, serie_a_2010,
serie_a_2011, serie_a_2012, serie_a_2013, serie_a_2014,
serie_a_2015, serie_a_2016, serie_a_2017, serie_a_2018,
serie_a_2019, serie_a_2020)
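If the seasons are kept in a list, the select and rbind steps can be written once instead of once per season. The sketch below is an illustration on two made-up one-match seasons, not the code used for this post; `keep_cols`, `make_toy_season`, `seasons`, `trimmed` and `all_matches` are hypothetical names, and the extra columns ("Referee", "Attendance") are invented to show why the columns are trimmed before stacking.

```r
keep_cols <- c("Date", "HomeTeam", "AwayTeam",
               "FTHG", "FTAG", "FTR", "HTHG", "HTAG", "HTR")

# Stand-in for one season file: a single made-up match plus one extra column.
make_toy_season <- function(date, extra_name) {
  season <- data.frame(Date = date, HomeTeam = "TeamA", AwayTeam = "TeamB",
                       FTHG = 1, FTAG = 0, FTR = "H",
                       HTHG = 0, HTAG = 0, HTR = "D",
                       stringsAsFactors = FALSE)
  season[[extra_name]] <- NA
  season
}

seasons <- list(make_toy_season("01/09/1999", "Referee"),
                make_toy_season("01/09/2000", "Attendance"))

# Keep the same nine columns in every season, then stack them in list order.
trimmed <- lapply(seasons, function(season) season[, keep_cols])
all_matches <- do.call(rbind, trimmed)

print(dim(all_matches))
```

Because rbind stacks the data frames in the order given, the list has to be in chronological order for the Elo updates later on to run through the matches in sequence.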
Let’s rename some variables, as I still have trouble interpreting the abbreviations.
serie_a_total <- serie_a_total %>%
rename(FullTimeHomeTeamGoals= FTHG,
FullTimeAwayTeamGoals= FTAG,
FullTimeResult= FTR,
HalfTimeHomeTeamGoals= HTHG,
HalfTimeAwayTeamGoals= HTAG,
HalfTimeResult= HTR)
First, we need another data frame to store each team’s Elo rating, which we will update after every single match.
serie_a_teams <- data.frame(team = unique(c(serie_a_total$HomeTeam, serie_a_total$AwayTeam)))
OK, ready to move on! Before we begin playing with Elo ratings, we need to assign an initial Elo value to every Serie A team we have. We set this value to 1200.
serie_a_teams<- serie_a_teams %>%
mutate(elo = 1200)
For each football game played, we’ll create a variable showing who won. We’ll set its value to 1 if the home team won, 0 if the away team won, and 0.5 for a draw.
serie_a_total<- serie_a_total %>%
mutate(GameResult = if_else(FullTimeHomeTeamGoals>FullTimeAwayTeamGoals,
1,
if_else(FullTimeHomeTeamGoals == FullTimeAwayTeamGoals, 0.5, 0)))
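To check the encoding, it can help to try it on three made-up matches: a home win, a draw and an away win. The sketch below uses a base R shortcut based on the sign of the goal difference instead of the nested if_else; `toy` and `goal_diff` are names invented for this example.

```r
# Three made-up matches: home win, draw, away win.
toy <- data.frame(
  FullTimeHomeTeamGoals = c(2, 1, 0),
  FullTimeAwayTeamGoals = c(0, 1, 3)
)

# sign() gives 1, 0 or -1; shifting and halving maps that to 1, 0.5 or 0.
goal_diff <- toy$FullTimeHomeTeamGoals - toy$FullTimeAwayTeamGoals
toy$GameResult <- (sign(goal_diff) + 1) / 2

print(toy$GameResult)  # 1.0 0.5 0.0
```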
Now we load the most important package for our Elo rating system. If it is not installed yet, run install.packages("elo") first.
library(elo)
## Warning: package 'elo' was built under R version 4.0.5
Now for the difficult part: writing our program. It won’t be too difficult, and thanks to Edouard Mathieu and the instructions of the elo CRAN package it is easy to follow. We’ll loop over every single game, get the pre-match ratings of both teams, and update them according to the historical results we saved.
for (i in seq_len(nrow(serie_a_total))) {
  match <- serie_a_total[i, ]
  # Pre-match ratings of the home (A) and away (B) teams
  teamA_elo <- subset(serie_a_teams, team == match$HomeTeam)$elo
  teamB_elo <- subset(serie_a_teams, team == match$AwayTeam)$elo
  # Updated ratings given the result (1 = home win, 0.5 = draw, 0 = away win)
  new_elo <- elo.calc(wins.A = match$GameResult,
                      elo.A = teamA_elo,
                      elo.B = teamB_elo,
                      k = 32)
  teamA_new_elo <- new_elo[1, 1]
  teamB_new_elo <- new_elo[1, 2]
  # Store the new ratings; every other team keeps its current rating
  serie_a_teams <- serie_a_teams %>%
    mutate(elo = if_else(team == match$HomeTeam, teamA_new_elo,
                         if_else(team == match$AwayTeam, teamB_new_elo, elo)))
}
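To see what happens inside one update, here is the standard Elo formula written out by hand and applied to two made-up matches. This is a sketch under the assumption that elo.calc applies the usual expected-score formula with a 400-point scale; `expected_score`, `update_elo`, `ratings` and `toy_matches` are hypothetical names, and the team names are placeholders.

```r
# Expected score of team A against team B under the standard Elo formula.
expected_score <- function(elo_a, elo_b) {
  1 / (1 + 10^((elo_b - elo_a) / 400))
}

# Updated ratings of both teams after one match.
# result_a: 1 = A won, 0.5 = draw, 0 = A lost.
update_elo <- function(elo_a, elo_b, result_a, k = 32) {
  change <- k * (result_a - expected_score(elo_a, elo_b))
  c(a = elo_a + change, b = elo_b - change)
}

# Toy ratings kept in a named vector, all starting at 1200.
ratings <- c(TeamA = 1200, TeamB = 1200, TeamC = 1200)
toy_matches <- data.frame(
  home = c("TeamA", "TeamB"),
  away = c("TeamB", "TeamC"),
  result = c(1, 0.5),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(toy_matches))) {
  home <- toy_matches$home[i]
  away <- toy_matches$away[i]
  new <- update_elo(ratings[[home]], ratings[[away]], toy_matches$result[i])
  ratings[[home]] <- new[["a"]]
  ratings[[away]] <- new[["b"]]
}

print(ratings)
```

In the first toy match both teams start at 1200, so the expected score is 0.5 and the winner gains 32 * 0.5 = 16 points, which the loser gives up. The sum of all ratings stays constant, since every point won by one team is lost by the other.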
Let’s wait for R to run the code and then check it out!
serie_a_teams %>%
arrange(-elo)
## team elo
## 1 Juventus 1551.086
## 2 Inter 1464.413
## 3 Roma 1444.377
## 4 Lazio 1436.622
## 5 Atalanta 1412.872
## 6 Napoli 1369.907
## 7 Milan 1327.346
## 8 Torino 1303.022
## 9 Cagliari 1288.759
## 10 Bologna 1274.334
## 11 Parma 1255.946
## 12 Sassuolo 1242.973
## 13 Udinese 1236.299
## 14 Carpi 1232.347
## 15 Sampdoria 1221.050
## 16 Empoli 1207.038
## 17 Perugia 1206.563
## 18 Fiorentina 1199.820
## 19 Crotone 1191.961
## 20 Catania 1189.770
## 21 Lecce 1185.787
## 22 Spal 1176.548
## 23 Reggina 1173.585
## 24 Novara 1170.369
## 25 Vicenza 1168.511
## 26 Verona 1163.226
## 27 Siena 1155.915
## 28 Genoa 1150.361
## 29 Piacenza 1144.016
## 30 Brescia 1143.050
## 31 Modena 1140.293
## 32 Como 1129.224
## 33 Palermo 1128.835
## 34 Ascoli 1125.048
## 35 Chievo 1115.004
## 36 Frosinone 1112.577
## 37 Benevento 1109.516
## 38 Bari 1087.791
## 39 Treviso 1081.600
## 40 Cesena 1075.150
## 41 Livorno 1057.786
## 42 Messina 1057.052
## 43 Ancona 1044.090
## 44 Pescara 1028.919
## 45 Venezia 1019.244
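Now we can keep only the teams playing in the current 2021 Serie A season. Spezia and Salernitana are in the list but have no matches in our historical data, so they have no rating and do not appear in the result.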
serie_a_teams_2021 <- serie_a_teams %>%
filter(team %in% c("Roma", "Milan","Napoli", "Inter", "Udinese","Bologna",
"Lazio", "Fiorentina", "Sassuolo", "Atalanta", "Torino",
"Empoli", "Genoa", "Venezia", "Sampdoria", "Juventus",
"Cagliari", "Spezia","Verona", "Salernitana")) %>%
arrange(-elo)
print.data.frame(serie_a_teams_2021)
## team elo
## 1 Juventus 1551.086
## 2 Inter 1464.413
## 3 Roma 1444.377
## 4 Lazio 1436.622
## 5 Atalanta 1412.872
## 6 Napoli 1369.907
## 7 Milan 1327.346
## 8 Torino 1303.022
## 9 Cagliari 1288.759
## 10 Bologna 1274.334
## 11 Sassuolo 1242.973
## 12 Udinese 1236.299
## 13 Sampdoria 1221.050
## 14 Empoli 1207.038
## 15 Fiorentina 1199.820
## 16 Verona 1163.226
## 17 Genoa 1150.361
## 18 Venezia 1019.244
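The filter asks for twenty teams but the table has eighteen rows. A quick way to see which names found no match is setdiff. The sketch below hard-codes the eighteen rated teams from the table above to stay self-contained; `wanted`, `rated` and `missing_teams` are names made up for this example.

```r
# The twenty teams requested in the filter.
wanted <- c("Roma", "Milan", "Napoli", "Inter", "Udinese", "Bologna",
            "Lazio", "Fiorentina", "Sassuolo", "Atalanta", "Torino",
            "Empoli", "Genoa", "Venezia", "Sampdoria", "Juventus",
            "Cagliari", "Spezia", "Verona", "Salernitana")

# The eighteen teams that came back with a rating.
rated <- c("Juventus", "Inter", "Roma", "Lazio", "Atalanta", "Napoli",
           "Milan", "Torino", "Cagliari", "Bologna", "Sassuolo", "Udinese",
           "Sampdoria", "Empoli", "Fiorentina", "Verona", "Genoa", "Venezia")

# Names requested but not found among the rated teams.
missing_teams <- setdiff(wanted, rated)
print(missing_teams)  # "Spezia" "Salernitana"
```

On the real data the same check would be setdiff(wanted, serie_a_teams$team). The two missing teams could be given the initial rating of 1200 if they are needed later, but that is a modelling choice this post does not make.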
Let’s calculate the probabilities for an individual game, like the first one of the season: AS Roma vs Fiorentina. We’ll use the elo.prob function.
Roma <- subset(serie_a_teams_2021, team == "Roma")$elo
Fiorentina <- subset(serie_a_teams_2021, team == "Fiorentina")$elo
elo.prob(Roma, Fiorentina)
## [1] 0.8034168
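The same number can be reproduced by hand from the two ratings in the table, assuming elo.prob uses the standard Elo expected-score formula with a 400-point scale. The variable names below are made up for this check.

```r
# Ratings copied from the table printed above.
roma_elo <- 1444.377
fiorentina_elo <- 1199.820

# Standard Elo expected score for the first team.
roma_expected <- 1 / (1 + 10^((fiorentina_elo - roma_elo) / 400))
print(round(roma_expected, 4))  # 0.8034
```

Strictly speaking this is Roma’s expected score, where a draw counts as half a win, so it is not a pure win probability. It also ignores home advantage, since the ratings above were computed without one.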