Synopsis

In this report we show how to use R to process online English Premier League match data to produce a league table.

1 Source Dataset

The Premier League match data can be found on the football-data.co.uk web-site. This site provides football results & statistics from more than 25 countries going back several years.

In this example, we will look at the match data for the English Premier League for 2019-2020 season. This is the current season at the time of writing, the league is suspended due to the Covid-19 crisis.

2 Data Processing

2.1 Downloading the Source Dataset

We can find the Premier League Results on the football-data.co.uk and download it using R as follows:

    data.url  <- "https://www.football-data.co.uk/mmz4281/1920/E0.csv"
    data.file <- "tmp/matchData.csv"
    data.date <- Sys.Date()
    download.file(data.url, destfile = data.file, quiet = TRUE)

2.2 Examining the Source Dataset

The file we have downloaded contains a lot of data; let us look at all the data columns:

    data.matches <- read.csv(data.file, header = TRUE)
    names(data.matches)
##   [1] "Div"         "Date"        "Time"        "HomeTeam"    "AwayTeam"   
##   [6] "FTHG"        "FTAG"        "FTR"         "HTHG"        "HTAG"       
##  [11] "HTR"         "Referee"     "HS"          "AS"          "HST"        
##  [16] "AST"         "HF"          "AF"          "HC"          "AC"         
##  [21] "HY"          "AY"          "HR"          "AR"          "B365H"      
##  [26] "B365D"       "B365A"       "BWH"         "BWD"         "BWA"        
##  [31] "IWH"         "IWD"         "IWA"         "PSH"         "PSD"        
##  [36] "PSA"         "WHH"         "WHD"         "WHA"         "VCH"        
##  [41] "VCD"         "VCA"         "MaxH"        "MaxD"        "MaxA"       
##  [46] "AvgH"        "AvgD"        "AvgA"        "B365.2.5"    "B365.2.5.1" 
##  [51] "P.2.5"       "P.2.5.1"     "Max.2.5"     "Max.2.5.1"   "Avg.2.5"    
##  [56] "Avg.2.5.1"   "AHh"         "B365AHH"     "B365AHA"     "PAHH"       
##  [61] "PAHA"        "MaxAHH"      "MaxAHA"      "AvgAHH"      "AvgAHA"     
##  [66] "B365CH"      "B365CD"      "B365CA"      "BWCH"        "BWCD"       
##  [71] "BWCA"        "IWCH"        "IWCD"        "IWCA"        "PSCH"       
##  [76] "PSCD"        "PSCA"        "WHCH"        "WHCD"        "WHCA"       
##  [81] "VCCH"        "VCCD"        "VCCA"        "MaxCH"       "MaxCD"      
##  [86] "MaxCA"       "AvgCH"       "AvgCD"       "AvgCA"       "B365C.2.5"  
##  [91] "B365C.2.5.1" "PC.2.5"      "PC.2.5.1"    "MaxC.2.5"    "MaxC.2.5.1" 
##  [96] "AvgC.2.5"    "AvgC.2.5.1"  "AHCh"        "B365CAHH"    "B365CAHA"   
## [101] "PCAHH"       "PCAHA"       "MaxCAHH"     "MaxCAHA"     "AvgCAHH"    
## [106] "AvgCAHA"

For this exercise, we need to use only a few columns. The Code Book for this data can be found at https://www.football-data.co.uk/notes.txt, this gives us a clue as to which variables might be useful to us. Let us look at that data now:

    data.edit <- data.matches %>% select(HomeTeam, AwayTeam, FTHG, FTAG)
    head(data.edit)
##         HomeTeam         AwayTeam FTHG FTAG
## 1      Liverpool          Norwich    4    1
## 2       West Ham         Man City    0    5
## 3    Bournemouth Sheffield United    1    1
## 4        Burnley      Southampton    3    0
## 5 Crystal Palace          Everton    0    0
## 6        Watford         Brighton    0    3

Some of these data names have obvious meanings but others are not quite clear, the Code Book gives us the official column headings but I will summarise them here:

Column Contents
HomeTeam The name of the team that played in their home ground.
AwayTeam The name of the team that played away from home.
FTHG Full Time Home Goals. The number of goals scored by the home team.
FTAG Full Time Away Goals. The number of goals scored by the away team.

2.3 Generating the League Table

The dataset we have downloaded does not show us the number of league points won by each team so we calculate these ourself. These functions indicate whether the team which scored ourScore goals won, drawn or lost the match. The calculate.points() function indicates the number of league points awarded to the team scoring ourScore.

    calculate.win <- function(ourScore, theirScore){
        return(ourScore > theirScore)
    }

    calculate.lose <- function(ourScore, theirScore){
        return(ourScore < theirScore)
    }

    calculate.draw <- function(ourScore, theirScore){
        return(ourScore == theirScore)
    }

    calculate.points <- function(ourScore, theirScore){
        return(ifelse(ourScore < theirScore, 0, ifelse(ourScore == theirScore, 1, 3)))
    }

The first step is to calculate the data we need for each team when they have played at home. This can be done by a single dplyr statement, which:

  • uses the filter() function to remove records which are missing one of the team names. I’ve noticed that some of the data files from football-data.co.uk have rows of data which are entirely empty so this filter() call removes these records.
  • adds new columns. HPts indicates the league points awarded to the home team. The other columns indicate the number of wins, loses and draws.
  • aggregates the data for each team when it plays as home.
    # Calculate the Home data
    home.data <- data.edit %>%
        # Remove rows with team names missing
        filter(!(is.na(HomeTeam) | (HomeTeam == "") 
                 | is.na(AwayTeam) | (AwayTeam == ""))) %>%
        # Calculate League Points for home team
        mutate(HWin  = calculate.win(FTHG, FTAG),
               HDraw = calculate.draw(FTHG, FTAG),
               HLose = calculate.lose(FTHG, FTAG),
               HPts  = calculate.points(FTHG, FTAG)) %>%
        # Create Home Team table entry
        group_by(HomeTeam) %>% summarise(HPlyd = length(HomeTeam),
                                         HWin  = sum(HWin),
                                         HDraw = sum(HDraw),
                                         HLose = sum(HLose),
                                         HFor  = sum(FTHG), 
                                         HAg   = sum(FTAG),
                                         HPts  = sum(HPts)) %>%
                                         rename(Team = HomeTeam)

The aggregation describes the home results of a team, it includes some new columns:

Column Contents
Team The team whose data is recorded.
HPlyd The number of games the team has played at home.
HWin The number of games the team has won at home.
HDraw The number of games the team has drawn at home.
HLose The number of games the team has lost at home.
HFor The number of goals scored by the team at home.
HAg The number of goals conceded by the team at home.
HPts The number of league points awarded to the team when playing at home.

Let’s look at the first few lines of this data.

    head(home.data)
## # A tibble: 6 x 8
##   Team        HPlyd  HWin HDraw HLose  HFor   HAg  HPts
##   <fct>       <int> <int> <int> <int> <int> <int> <dbl>
## 1 Arsenal        15     7     5     3    26    20    26
## 2 Aston Villa    13     5     2     6    18    24    17
## 3 Bournemouth    14     4     5     5    17    21    17
## 4 Brighton       14     4     6     4    17    15    18
## 5 Burnley        15     7     2     6    20    19    23
## 6 Chelsea        15     7     3     5    22    15    24

We need similar code to produce the data for the teams playing away from home:

    away.data <- data.edit %>%
        # Remove rows with team names missing
        filter(!(is.na(HomeTeam) | (HomeTeam == "") 
                 | is.na(AwayTeam) | (AwayTeam == ""))) %>%
        # Calculate League Points for away team
        mutate(AWin  = calculate.win(FTAG, FTHG),
               ADraw = calculate.draw(FTAG, FTHG),
               ALose = calculate.lose(FTAG, FTHG),
               APts  = calculate.points(FTAG, FTHG)) %>%
        # Create Away Team table entry
        group_by(AwayTeam) %>% summarise(APlyd = length(AwayTeam),
                                         AWin  = sum(AWin),
                                         ADraw = sum(ADraw),
                                         ALose = sum(ALose),
                                         AFor  = sum(FTAG), 
                                         AAg   = sum(FTHG),
                                         APts  = sum(APts)) %>%
                                         rename(Team = AwayTeam)

This gives us the agrregated results of the teams when they played away from home. Here’s the first few records:

    head(away.data)
## # A tibble: 6 x 8
##   Team        APlyd  AWin ADraw ALose  AFor   AAg  APts
##   <fct>       <int> <int> <int> <int> <int> <int> <dbl>
## 1 Arsenal        13     2     8     3    14    16    14
## 2 Aston Villa    15     2     2    11    16    32     8
## 3 Bournemouth    15     3     1    11    12    26    10
## 4 Brighton       15     2     5     8    15    25    11
## 5 Burnley        14     4     4     6    14    21    16
## 6 Chelsea        14     7     3     4    29    24    24

We join the home and away tables on their common, Team, column and add together the home and away columns. Finally, we select the columns we want and sort the data to show the league leaders at the top.

    table.data <- inner_join(home.data, away.data, by = "Team") %>%
        mutate(Plyd         = HPlyd + APlyd, 
               Win          = HWin  + AWin,
               Draw         = HDraw + ADraw,
               Lose         = HLose + ALose,
               GoalsFor     = HFor  + AFor,
               GoalsAgainst = HAg   + AAg,
               GoalDiff     = GoalsFor - GoalsAgainst,
               Pts          = HPts  + APts) %>%
        select(Team, Plyd, Win, Draw, Lose, GoalsFor, GoalsAgainst, GoalDiff, Pts) %>%
        arrange(desc(Pts), GoalDiff, GoalDiff, GoalsFor)

3 Results

We’ve now built the league table, here’s the first few entries:

    head(table.data)
## # A tibble: 6 x 9
##   Team              Plyd   Win  Draw  Lose GoalsFor GoalsAgainst GoalDiff   Pts
##   <fct>            <int> <int> <int> <int>    <int>        <int>    <int> <dbl>
## 1 Liverpool           29    27     1     1       66           21       45    82
## 2 Man City            28    18     3     7       68           31       37    57
## 3 Leicester           29    16     5     8       58           28       30    53
## 4 Chelsea             29    14     6     9       51           39       12    48
## 5 Man United          29    12     9     8       44           30       14    45
## 6 Sheffield United    28    11    10     7       30           25        5    43

We can display it, in a much nicer format using:

    kable(table.data, 
          caption = paste0("English Premier League Table - ", data.date), 
          row.name = TRUE) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                  full_width = FALSE)
English Premier League Table - 2020-04-12
Team Plyd Win Draw Lose GoalsFor GoalsAgainst GoalDiff Pts
1 Liverpool 29 27 1 1 66 21 45 82
2 Man City 28 18 3 7 68 31 37 57
3 Leicester 29 16 5 8 58 28 30 53
4 Chelsea 29 14 6 9 51 39 12 48
5 Man United 29 12 9 8 44 30 14 45
6 Sheffield United 28 11 10 7 30 25 5 43
7 Wolves 29 10 13 6 41 34 7 43
8 Tottenham 29 11 8 10 47 40 7 41
9 Arsenal 28 9 13 6 40 36 4 40
10 Crystal Palace 29 10 9 10 26 32 -6 39
11 Burnley 29 11 6 12 34 40 -6 39
12 Everton 29 10 7 12 37 46 -9 37
13 Newcastle 29 9 8 12 25 41 -16 35
14 Southampton 29 10 4 15 35 52 -17 34
15 Brighton 29 6 11 12 32 40 -8 29
16 Bournemouth 29 7 6 16 29 47 -18 27
17 Watford 29 6 9 14 27 44 -17 27
18 West Ham 29 7 6 16 35 50 -15 27
19 Aston Villa 28 7 4 17 34 56 -22 25
20 Norwich 29 5 6 18 25 52 -27 21

Appendices

Appendix: Source Data

The dataset used in this study was obtained from:

URL: https://www.football-data.co.uk/mmz4281/1920/E0.csv
Time: 2020-04-12

Appendix: Scripting Environment

This report was produced on Sun Apr 12 09:26:59 2020 in the following system environment:

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.28       kableExtra_1.1.0 dplyr_0.8.4     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3        highr_0.8         pillar_1.4.3      compiler_3.6.2   
##  [5] tools_3.6.2       digest_0.6.23     evaluate_0.14     tibble_2.1.3     
##  [9] lifecycle_0.1.0   viridisLite_0.3.0 pkgconfig_2.0.3   rlang_0.4.4      
## [13] cli_2.0.1         rstudioapi_0.11   yaml_2.2.1        xfun_0.12        
## [17] stringr_1.4.0     httr_1.4.1        xml2_1.2.2        vctrs_0.2.2      
## [21] hms_0.5.3         webshot_0.5.2     tidyselect_1.0.0  glue_1.3.1       
## [25] R6_2.4.1          fansi_0.4.1       rmarkdown_2.1     readr_1.3.1      
## [29] purrr_0.3.3       magrittr_1.5      scales_1.1.0      htmltools_0.4.0  
## [33] assertthat_0.2.1  rvest_0.3.5       colorspace_1.4-1  utf8_1.1.4       
## [37] stringi_1.4.4     munsell_0.5.0     crayon_1.3.4