In this report we show how to use R to process online English Premier League match data to produce a league table.
The Premier League match data can be found on the football-data.co.uk web-site. This site provides football results & statistics from more than 25 countries going back several years.
In this example, we will look at the match data for the English Premier League for 2019-2020 season. This is the current season at the time of writing, the league is suspended due to the Covid-19 crisis.
We can find the Premier League Results on the football-data.co.uk and download it using R as follows:
data.url <- "https://www.football-data.co.uk/mmz4281/1920/E0.csv"
data.file <- "tmp/matchData.csv"
data.date <- Sys.Date()
download.file(data.url, destfile = data.file, quiet = TRUE)
The file we have downloaded contains a lot of data; let us look at all the data columns:
data.matches <- read.csv(data.file, header = TRUE)
names(data.matches)
## [1] "Div" "Date" "Time" "HomeTeam" "AwayTeam"
## [6] "FTHG" "FTAG" "FTR" "HTHG" "HTAG"
## [11] "HTR" "Referee" "HS" "AS" "HST"
## [16] "AST" "HF" "AF" "HC" "AC"
## [21] "HY" "AY" "HR" "AR" "B365H"
## [26] "B365D" "B365A" "BWH" "BWD" "BWA"
## [31] "IWH" "IWD" "IWA" "PSH" "PSD"
## [36] "PSA" "WHH" "WHD" "WHA" "VCH"
## [41] "VCD" "VCA" "MaxH" "MaxD" "MaxA"
## [46] "AvgH" "AvgD" "AvgA" "B365.2.5" "B365.2.5.1"
## [51] "P.2.5" "P.2.5.1" "Max.2.5" "Max.2.5.1" "Avg.2.5"
## [56] "Avg.2.5.1" "AHh" "B365AHH" "B365AHA" "PAHH"
## [61] "PAHA" "MaxAHH" "MaxAHA" "AvgAHH" "AvgAHA"
## [66] "B365CH" "B365CD" "B365CA" "BWCH" "BWCD"
## [71] "BWCA" "IWCH" "IWCD" "IWCA" "PSCH"
## [76] "PSCD" "PSCA" "WHCH" "WHCD" "WHCA"
## [81] "VCCH" "VCCD" "VCCA" "MaxCH" "MaxCD"
## [86] "MaxCA" "AvgCH" "AvgCD" "AvgCA" "B365C.2.5"
## [91] "B365C.2.5.1" "PC.2.5" "PC.2.5.1" "MaxC.2.5" "MaxC.2.5.1"
## [96] "AvgC.2.5" "AvgC.2.5.1" "AHCh" "B365CAHH" "B365CAHA"
## [101] "PCAHH" "PCAHA" "MaxCAHH" "MaxCAHA" "AvgCAHH"
## [106] "AvgCAHA"
For this exercise, we need to use only a few columns. The Code Book for this data can be found at https://www.football-data.co.uk/notes.txt, this gives us a clue as to which variables might be useful to us. Let us look at that data now:
data.edit <- data.matches %>% select(HomeTeam, AwayTeam, FTHG, FTAG)
head(data.edit)
## HomeTeam AwayTeam FTHG FTAG
## 1 Liverpool Norwich 4 1
## 2 West Ham Man City 0 5
## 3 Bournemouth Sheffield United 1 1
## 4 Burnley Southampton 3 0
## 5 Crystal Palace Everton 0 0
## 6 Watford Brighton 0 3
Some of these data names have obvious meanings but others are not quite clear, the Code Book gives us the official column headings but I will summarise them here:
| Column | Contents |
|---|---|
| HomeTeam | The name of the team that played in their home ground. |
| AwayTeam | The name of the team that played away from home. |
| FTHG | Full Time Home Goals. The number of goals scored by the home team. |
| FTAG | Full Time Away Goals. The number of goals scored by the away team. |
The dataset we have downloaded does not show us the number of league points won by each team so we calculate these ourself. These functions indicate whether the team which scored ourScore goals won, drawn or lost the match. The calculate.points() function indicates the number of league points awarded to the team scoring ourScore.
calculate.win <- function(ourScore, theirScore){
return(ourScore > theirScore)
}
calculate.lose <- function(ourScore, theirScore){
return(ourScore < theirScore)
}
calculate.draw <- function(ourScore, theirScore){
return(ourScore == theirScore)
}
calculate.points <- function(ourScore, theirScore){
return(ifelse(ourScore < theirScore, 0, ifelse(ourScore == theirScore, 1, 3)))
}
The first step is to calculate the data we need for each team when they have played at home. This can be done by a single dplyr statement, which:
filter() function to remove records which are missing one of the team names. I’ve noticed that some of the data files from football-data.co.uk have rows of data which are entirely empty so this filter() call removes these records.HPts indicates the league points awarded to the home team. The other columns indicate the number of wins, loses and draws. # Calculate the Home data
home.data <- data.edit %>%
# Remove rows with team names missing
filter(!(is.na(HomeTeam) | (HomeTeam == "")
| is.na(AwayTeam) | (AwayTeam == ""))) %>%
# Calculate League Points for home team
mutate(HWin = calculate.win(FTHG, FTAG),
HDraw = calculate.draw(FTHG, FTAG),
HLose = calculate.lose(FTHG, FTAG),
HPts = calculate.points(FTHG, FTAG)) %>%
# Create Home Team table entry
group_by(HomeTeam) %>% summarise(HPlyd = length(HomeTeam),
HWin = sum(HWin),
HDraw = sum(HDraw),
HLose = sum(HLose),
HFor = sum(FTHG),
HAg = sum(FTAG),
HPts = sum(HPts)) %>%
rename(Team = HomeTeam)
The aggregation describes the home results of a team, it includes some new columns:
| Column | Contents |
|---|---|
| Team | The team whose data is recorded. |
| HPlyd | The number of games the team has played at home. |
| HWin | The number of games the team has won at home. |
| HDraw | The number of games the team has drawn at home. |
| HLose | The number of games the team has lost at home. |
| HFor | The number of goals scored by the team at home. |
| HAg | The number of goals conceded by the team at home. |
| HPts | The number of league points awarded to the team when playing at home. |
Let’s look at the first few lines of this data.
head(home.data)
## # A tibble: 6 x 8
## Team HPlyd HWin HDraw HLose HFor HAg HPts
## <fct> <int> <int> <int> <int> <int> <int> <dbl>
## 1 Arsenal 15 7 5 3 26 20 26
## 2 Aston Villa 13 5 2 6 18 24 17
## 3 Bournemouth 14 4 5 5 17 21 17
## 4 Brighton 14 4 6 4 17 15 18
## 5 Burnley 15 7 2 6 20 19 23
## 6 Chelsea 15 7 3 5 22 15 24
We need similar code to produce the data for the teams playing away from home:
away.data <- data.edit %>%
# Remove rows with team names missing
filter(!(is.na(HomeTeam) | (HomeTeam == "")
| is.na(AwayTeam) | (AwayTeam == ""))) %>%
# Calculate League Points for away team
mutate(AWin = calculate.win(FTAG, FTHG),
ADraw = calculate.draw(FTAG, FTHG),
ALose = calculate.lose(FTAG, FTHG),
APts = calculate.points(FTAG, FTHG)) %>%
# Create Away Team table entry
group_by(AwayTeam) %>% summarise(APlyd = length(AwayTeam),
AWin = sum(AWin),
ADraw = sum(ADraw),
ALose = sum(ALose),
AFor = sum(FTAG),
AAg = sum(FTHG),
APts = sum(APts)) %>%
rename(Team = AwayTeam)
This gives us the agrregated results of the teams when they played away from home. Here’s the first few records:
head(away.data)
## # A tibble: 6 x 8
## Team APlyd AWin ADraw ALose AFor AAg APts
## <fct> <int> <int> <int> <int> <int> <int> <dbl>
## 1 Arsenal 13 2 8 3 14 16 14
## 2 Aston Villa 15 2 2 11 16 32 8
## 3 Bournemouth 15 3 1 11 12 26 10
## 4 Brighton 15 2 5 8 15 25 11
## 5 Burnley 14 4 4 6 14 21 16
## 6 Chelsea 14 7 3 4 29 24 24
We join the home and away tables on their common, Team, column and add together the home and away columns. Finally, we select the columns we want and sort the data to show the league leaders at the top.
table.data <- inner_join(home.data, away.data, by = "Team") %>%
mutate(Plyd = HPlyd + APlyd,
Win = HWin + AWin,
Draw = HDraw + ADraw,
Lose = HLose + ALose,
GoalsFor = HFor + AFor,
GoalsAgainst = HAg + AAg,
GoalDiff = GoalsFor - GoalsAgainst,
Pts = HPts + APts) %>%
select(Team, Plyd, Win, Draw, Lose, GoalsFor, GoalsAgainst, GoalDiff, Pts) %>%
arrange(desc(Pts), GoalDiff, GoalDiff, GoalsFor)
We’ve now built the league table, here’s the first few entries:
head(table.data)
## # A tibble: 6 x 9
## Team Plyd Win Draw Lose GoalsFor GoalsAgainst GoalDiff Pts
## <fct> <int> <int> <int> <int> <int> <int> <int> <dbl>
## 1 Liverpool 29 27 1 1 66 21 45 82
## 2 Man City 28 18 3 7 68 31 37 57
## 3 Leicester 29 16 5 8 58 28 30 53
## 4 Chelsea 29 14 6 9 51 39 12 48
## 5 Man United 29 12 9 8 44 30 14 45
## 6 Sheffield United 28 11 10 7 30 25 5 43
We can display it, in a much nicer format using:
kable(table.data,
caption = paste0("English Premier League Table - ", data.date),
row.name = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)
| Team | Plyd | Win | Draw | Lose | GoalsFor | GoalsAgainst | GoalDiff | Pts | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Liverpool | 29 | 27 | 1 | 1 | 66 | 21 | 45 | 82 |
| 2 | Man City | 28 | 18 | 3 | 7 | 68 | 31 | 37 | 57 |
| 3 | Leicester | 29 | 16 | 5 | 8 | 58 | 28 | 30 | 53 |
| 4 | Chelsea | 29 | 14 | 6 | 9 | 51 | 39 | 12 | 48 |
| 5 | Man United | 29 | 12 | 9 | 8 | 44 | 30 | 14 | 45 |
| 6 | Sheffield United | 28 | 11 | 10 | 7 | 30 | 25 | 5 | 43 |
| 7 | Wolves | 29 | 10 | 13 | 6 | 41 | 34 | 7 | 43 |
| 8 | Tottenham | 29 | 11 | 8 | 10 | 47 | 40 | 7 | 41 |
| 9 | Arsenal | 28 | 9 | 13 | 6 | 40 | 36 | 4 | 40 |
| 10 | Crystal Palace | 29 | 10 | 9 | 10 | 26 | 32 | -6 | 39 |
| 11 | Burnley | 29 | 11 | 6 | 12 | 34 | 40 | -6 | 39 |
| 12 | Everton | 29 | 10 | 7 | 12 | 37 | 46 | -9 | 37 |
| 13 | Newcastle | 29 | 9 | 8 | 12 | 25 | 41 | -16 | 35 |
| 14 | Southampton | 29 | 10 | 4 | 15 | 35 | 52 | -17 | 34 |
| 15 | Brighton | 29 | 6 | 11 | 12 | 32 | 40 | -8 | 29 |
| 16 | Bournemouth | 29 | 7 | 6 | 16 | 29 | 47 | -18 | 27 |
| 17 | Watford | 29 | 6 | 9 | 14 | 27 | 44 | -17 | 27 |
| 18 | West Ham | 29 | 7 | 6 | 16 | 35 | 50 | -15 | 27 |
| 19 | Aston Villa | 28 | 7 | 4 | 17 | 34 | 56 | -22 | 25 |
| 20 | Norwich | 29 | 5 | 6 | 18 | 25 | 52 | -27 | 21 |
The dataset used in this study was obtained from:
URL: https://www.football-data.co.uk/mmz4281/1920/E0.csv
Time: 2020-04-12
This report was produced on Sun Apr 12 09:26:59 2020 in the following system environment:
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.28 kableExtra_1.1.0 dplyr_0.8.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 highr_0.8 pillar_1.4.3 compiler_3.6.2
## [5] tools_3.6.2 digest_0.6.23 evaluate_0.14 tibble_2.1.3
## [9] lifecycle_0.1.0 viridisLite_0.3.0 pkgconfig_2.0.3 rlang_0.4.4
## [13] cli_2.0.1 rstudioapi_0.11 yaml_2.2.1 xfun_0.12
## [17] stringr_1.4.0 httr_1.4.1 xml2_1.2.2 vctrs_0.2.2
## [21] hms_0.5.3 webshot_0.5.2 tidyselect_1.0.0 glue_1.3.1
## [25] R6_2.4.1 fansi_0.4.1 rmarkdown_2.1 readr_1.3.1
## [29] purrr_0.3.3 magrittr_1.5 scales_1.1.0 htmltools_0.4.0
## [33] assertthat_0.2.1 rvest_0.3.5 colorspace_1.4-1 utf8_1.1.4
## [37] stringi_1.4.4 munsell_0.5.0 crayon_1.3.4