final

Published

April 17, 2026

Premier League Data Dive

Audience

Premier League fans - especially those interested in referee or betting information. This dataset contains very basic in match stats and would not be useful for players clubs or coaches.

Background Information and Issues

Data from https://www.football-data.co.uk/

Football Data compiles standard match statistics from Premier League games such as shots, goals, and results. Mainly meant for sports gamblers, the dataset has pregame odds from an abundance of sportsbooks along with referee data. The datasets are broken up by season, for this analysis I used the last year with a season of full completed results: the 24/25 season.

Every season the discussion of referee corruption and bias becomes more prevalent in Premier League spaces. Fans start to notice patterns from individual referees, but is any of this valid? Are there referees that actually favor home teams, or betting favorites? These are the main points I was curious to uncover through this analysis. With all of the betting data being made available in one place comparing the sportsbooks to see which ones are the least accurate would be advantageous to gamblers.

Objective

My goal is to figure out if referees have a bias towards home teams or betting favorites. Although it is obvious that home teams win more often than away teams and the favored team is favored for a reason, but, are their referees who favor either of these two groups disproportionately? And how can I show the different referee tendencies to prepare fans or betters before the match?

Initial EDA

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(palmerpenguins)


Attaching package: 'palmerpenguins'

The following objects are masked from 'package:datasets':

    penguins, penguins_raw

pl <- read_csv("C:/Users/bfunk/Downloads/E0.csv")

Rows: 380 Columns: 120
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr    (7): Div, Date, HomeTeam, AwayTeam, FTR, HTR, Referee
dbl  (112): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY...
time   (1): Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Some basic dataset information. From here we can see there are 120 columns to describe 380 matches.

dim(pl)

[1] 380 120

names(pl)

  [1] "Div"       "Date"      "Time"      "HomeTeam"  "AwayTeam"  "FTHG"     
  [7] "FTAG"      "FTR"       "HTHG"      "HTAG"      "HTR"       "Referee"  
 [13] "HS"        "AS"        "HST"       "AST"       "HF"        "AF"       
 [19] "HC"        "AC"        "HY"        "AY"        "HR"        "AR"       
 [25] "B365H"     "B365D"     "B365A"     "BWH"       "BWD"       "BWA"      
 [31] "BFH"       "BFD"       "BFA"       "PSH"       "PSD"       "PSA"      
 [37] "WHH"       "WHD"       "WHA"       "1XBH"      "1XBD"      "1XBA"     
 [43] "MaxH"      "MaxD"      "MaxA"      "AvgH"      "AvgD"      "AvgA"     
 [49] "BFEH"      "BFED"      "BFEA"      "B365>2.5"  "B365<2.5"  "P>2.5"    
 [55] "P<2.5"     "Max>2.5"   "Max<2.5"   "Avg>2.5"   "Avg<2.5"   "BFE>2.5"  
 [61] "BFE<2.5"   "AHh"       "B365AHH"   "B365AHA"   "PAHH"      "PAHA"     
 [67] "MaxAHH"    "MaxAHA"    "AvgAHH"    "AvgAHA"    "BFEAHH"    "BFEAHA"   
 [73] "B365CH"    "B365CD"    "B365CA"    "BWCH"      "BWCD"      "BWCA"     
 [79] "BFCH"      "BFCD"      "BFCA"      "PSCH"      "PSCD"      "PSCA"     
 [85] "WHCH"      "WHCD"      "WHCA"      "1XBCH"     "1XBCD"     "1XBCA"    
 [91] "MaxCH"     "MaxCD"     "MaxCA"     "AvgCH"     "AvgCD"     "AvgCA"    
 [97] "BFECH"     "BFECD"     "BFECA"     "B365C>2.5" "B365C<2.5" "PC>2.5"   
[103] "PC<2.5"    "MaxC>2.5"  "MaxC<2.5"  "AvgC>2.5"  "AvgC<2.5"  "BFEC>2.5" 
[109] "BFEC<2.5"  "AHCh"      "B365CAHH"  "B365CAHA"  "PCAHH"     "PCAHA"    
[115] "MaxCAHH"   "MaxCAHA"   "AvgCAHH"   "AvgCAHA"   "BFECAHH"   "BFECAHA"

head(pl)

# A tibble: 6 × 120
  Div   Date       Time   HomeTeam  AwayTeam  FTHG  FTAG FTR    HTHG  HTAG HTR  
  <chr> <chr>      <time> <chr>     <chr>    <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 E0    16/08/2024 20:00  Man Unit… Fulham       1     0 H         0     0 D    
2 E0    17/08/2024 12:30  Ipswich   Liverpo…     0     2 A         0     0 D    
3 E0    17/08/2024 15:00  Arsenal   Wolves       2     0 H         1     0 H    
4 E0    17/08/2024 15:00  Everton   Brighton     0     3 A         0     1 A    
5 E0    17/08/2024 15:00  Newcastle Southam…     1     0 H         1     0 H    
6 E0    17/08/2024 15:00  Nott'm F… Bournem…     1     1 D         1     0 H    
# ℹ 109 more variables: Referee <chr>, HS <dbl>, AS <dbl>, HST <dbl>,
#   AST <dbl>, HF <dbl>, AF <dbl>, HC <dbl>, AC <dbl>, HY <dbl>, AY <dbl>,
#   HR <dbl>, AR <dbl>, B365H <dbl>, B365D <dbl>, B365A <dbl>, BWH <dbl>,
#   BWD <dbl>, BWA <dbl>, BFH <dbl>, BFD <dbl>, BFA <dbl>, PSH <dbl>,
#   PSD <dbl>, PSA <dbl>, WHH <dbl>, WHD <dbl>, WHA <dbl>, `1XBH` <dbl>,
#   `1XBD` <dbl>, `1XBA` <dbl>, MaxH <dbl>, MaxD <dbl>, MaxA <dbl>, AvgH <dbl>,
#   AvgD <dbl>, AvgA <dbl>, BFEH <dbl>, BFED <dbl>, BFEA <dbl>, …

pl |>
  count(FTR) |>
  mutate(percent = round(n / sum(n) * 100, 1))

# A tibble: 3 × 3
  FTR       n percent
  <chr> <int>   <dbl>
1 A       132    34.7
2 D        93    24.5
3 H       155    40.8

A brief look at which side wins matches based on home pitch advantage.

ggplot(pl, aes(x = FTR)) +
  geom_bar() +
  labs(
    title = "Match Results",
    x = "Result",
    y = "Number of Matches"
  )

Focusing on more baseline referee numbers to begin to set up my investigation. Going through the raw numbers you start to see some differences but it is difficult to draw any cohesive or meaningful conclusions from here.

ref_stats <- pl |>
  mutate(TG = FTHG + FTAG) |>
  mutate(YC = HY + AY)|>
  mutate(RC = HR + AR)|>
  group_by(Referee) |>
  summarise(matches = n(),
            avg_total_goals = mean(TG),
            avg_yellow_cards = mean(YC),
            avg_red_cards = mean(RC))
ref_stats

# A tibble: 26 × 5
   Referee    matches avg_total_goals avg_yellow_cards avg_red_cards
   <chr>        <int>           <dbl>            <dbl>         <dbl>
 1 A Kitchen        2            2                2           0     
 2 A Madley        20            2.8              4.3         0.05  
 3 A Taylor        30            3.6              2.93        0.2   
 4 C Kavanagh      24            2.33             4           0.208 
 5 C Pawson        22            2.55             4.27        0.136 
 6 D Bond          14            2.93             3.93        0.0714
 7 D Coote          6            2.5              5           0     
 8 D England       19            3.37             4.16        0.211 
 9 D Webb           2            4                3.5         0     
10 J Brooks        17            2.82             5.06        0.0588
# ℹ 16 more rows

home_ref <- pl |>
  group_by(Referee) |>
  summarise(
    matches = n(),
    home_win_rate = mean(FTR == "H"),
    away_win_rate = mean(FTR == "A"),
    home_card_tilt = mean(HY - AY),
    home_foul_tilt = mean(HF - AF)
  ) |>
  filter(matches >=10) |>        
  arrange(desc(home_win_rate))
home_ref

# A tibble: 18 × 6
   Referee     matches home_win_rate away_win_rate home_card_tilt home_foul_tilt
   <chr>         <int>         <dbl>         <dbl>          <dbl>          <dbl>
 1 J Brooks         17         0.706         0.118         0.588         -0.294 
 2 T Bramall        11         0.636         0.273         0              0.636 
 3 D Bond           14         0.571         0.143        -0.357         -0.286 
 4 P Bankes         23         0.565         0.348         0.0435        -0.0870
 5 S Hooper         23         0.478         0.261         0.0870        -0.696 
 6 M Salisbury      15         0.467         0.467         0.0667        -2     
 7 S Barrott        23         0.435         0.478        -0.435          1.04  
 8 C Pawson         22         0.409         0.318        -0.364         -0.682 
 9 A Madley         20         0.4           0.25         -0.6           -1.1   
10 T Harringt…      18         0.389         0.444        -0.444         -0.944 
11 J Gillett        16         0.375         0.5          -0.5           -2.19  
12 D England        19         0.368         0.421        -0.789         -2.11  
13 C Kavanagh       24         0.333         0.25          0.25           0.292 
14 T Robinson       16         0.312         0.25          0.125         -2     
15 M Oliver         25         0.28          0.4          -0.72           0.12  
16 S Attwell        19         0.263         0.474        -0.105          1.63  
17 A Taylor         30         0.233         0.533        -0.267         -0.167 
18 R Jones          22         0.182         0.409        -0.682         -1.82

Looking at the how many matches each referee got this past season you can see who the PL trusts the most, and which ones will have a more convincing sample size. From here on out I filtered out those who refereed less than 10 matches across the 24/25 season.

ref_stats |>
  ggplot(aes(x = reorder(Referee, matches), y = matches)) +
  geom_col() +
  coord_flip() +
  labs(title = "Matches per Referee", x = "Referee", y = "Matches")

Assumptions

Prior to diving into the analysis it is important to go over some of the assumptions I made for this report.

Statistical significance is not the same as practical significance.
- While statistics are a helpful tool they are still just a tool. A small difference can be statistically significant but not practically meaningful.
- The risk is that drawing conclusions just based off p-values may not be completely accurate, however the visualizations I provide shows another angle alongside the tests.
For similar reasons non significant results may still have value.
Referees matches are comparable enough to analyze against each other.
- While each ref refereed different clubs in different conditions, in the end they should balance out - to a degree. They also are reffing under the same rules in the same league under the same season, so their environments were mostly similar.
- The risk is an outlier game can over represent a referees data, and some of the data collected me be more telling of the match reffed than the referee’s actual tendencies.
- To mitigate the risk I recognize the bias and know that I cannot for certain say any referee is corrupt or bias no matter the results due to this risk.
Red cards are insignificant for this investigation.
- Red cards are given out so infrequently that they cannot be statistically significant.
- To manage this risk I ended up scrapping red cards from my analysis.

Analysis

For my analysis I will be showing graphs and charts exploring the dataset in an attempt to solve my problem statements. This is split up into three sections.

I. Home/away

library(ggplot2)
library(scales)


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

ggplot(home_ref, aes(x = reorder(Referee, home_win_rate), y = home_win_rate)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Home Win Rate by Referee",
    x = "Referee",
    y = "Home Win Rate"
  )

The home win rate sees a wide variety of results by referees. A fan can look at this prior to a match and use it to set expectations. Aligning this with the fouls and cards swing in favor of home teams could reveal a bias.

ggplot(home_ref, aes(x = reorder(Referee, home_card_tilt), y = home_card_tilt)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Card Tilt by Referee per Match",
    x = "Referee",
    y = "More Away Yellow Cards |  More Home Yellow Cards"
  )

While introducing the tilt metrics (showing favor towards whatever end for cards/fouls/shots) we see which referees tend to favor the home or away side when handing out away cards. It is important to note that most give more cards to the away team and there is a visible difference between the referees. However given fickle sports conditions it is important to remember differing match conditions, this is to be kept in mind for all the upcoming visuals as well.

ggplot(home_ref, aes(x = reorder(Referee, home_foul_tilt), y = home_foul_tilt)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Foul Tilt by Referee per Match",
    x = "Referee",
    y = "More Away Fouls | More Home Fouls"
  )

The foul tilt graph looks similar to the card tilt graph where there is a disparity but we should view it with caution. In the end the names that favor home teams for disciplinary reasons(fouls/cards) do not align with the names that referee matches where the home team wins.

Next step was to look at referee tendencies. Here I am attempting to identify the referees that interrupt the flow of the match for fouls as appose to “letting them play”. As a fan there are definitely referees I do not want to see on the match sheet due to how strict their whistle is.

II. Referee tendencies

ref_metrics <- pl |>
  group_by(Referee) |>
  summarise(
    matches = n(),
    home_yellow_per_foul = sum(HY, na.rm = TRUE) / sum(HF, na.rm = TRUE),
    away_yellow_per_foul = sum(AY, na.rm = TRUE) / sum(AF, na.rm = TRUE),
    yellow_per_foul = sum(HY + AY, na.rm = TRUE) / sum(HF + AF, na.rm = TRUE),
    yellow_foul_tilt = home_yellow_per_foul - away_yellow_per_foul,

    shots_per_foul = sum(HS + AS, na.rm = TRUE) / sum(HF + AF, na.rm = TRUE),
    .groups = "drop"
  ) |>
  filter(matches >= 10)

ref_metrics

# A tibble: 18 × 7
   Referee     matches home_yellow_per_foul away_yellow_per_foul yellow_per_foul
   <chr>         <int>                <dbl>                <dbl>           <dbl>
 1 A Madley         20                0.167                0.202           0.185
 2 A Taylor         30                0.126                0.149           0.137
 3 C Kavanagh       24                0.199                0.181           0.190
 4 C Pawson         22                0.178                0.199           0.189
 5 D Bond           14                0.162                0.190           0.176
 6 D England        19                0.162                0.197           0.181
 7 J Brooks         17                0.257                0.198           0.227
 8 J Gillett        16                0.189                0.196           0.193
 9 M Oliver         25                0.155                0.219           0.187
10 M Salisbury      15                0.148                0.117           0.131
11 P Bankes         23                0.187                0.181           0.184
12 R Jones          22                0.154                0.184           0.170
13 S Attwell        19                0.167                0.207           0.185
14 S Barrott        23                0.188                0.251           0.218
15 S Hooper         23                0.252                0.226           0.239
16 T Bramall        11                0.155                0.165           0.16 
17 T Harringt…      18                0.164                0.189           0.177
18 T Robinson       16                0.211                0.167           0.187
# ℹ 2 more variables: yellow_foul_tilt <dbl>, shots_per_foul <dbl>

ggplot(ref_metrics, aes(x = reorder(Referee, yellow_per_foul), y = yellow_per_foul)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Yellow Cards per Foul by Referee",
    x = "Referee",
    y = "Yellow Cards / Fouls"
  )

Some of the usual suspects fly to the top. Personally, as a Spurs fan, seeing a name like Simon Hooper on the match sheet scares me as I am immediately worrying about a Romero red card. As a fan having an informed expectation for how a match will play out could enhance the experience.

ggplot(ref_metrics, aes(x = reorder(Referee, yellow_foul_tilt), y = yellow_foul_tilt)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Yellow per Foul Tilt by Referee",
    x = "Referee",
    y = "Higher Away Yellow/Foul | Higher Home Yellow/Foul"
  )

This is another graph attempting to see if the home/away bias is real. This one, however, is less advantage based. While the home team usually holds onto the ball more, naturally the away team will commit more fouls and accumulate more cards trying to get the ball back. For this reason we can expect the shift we have seen between home and away discipline. However, in this case, just measuring the rate at which refs give cards per foul, you would expect home/away to be closer. The referee names themselves do not particularly jump out when compared to the other visualizations.

ggplot(ref_metrics, aes(x = reorder(Referee, shots_per_foul), y = shots_per_foul)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Shots per Foul by Referee",
    x = "Referee",
    y = "Shots / Fouls"
  )

Like mentioned earlier this just measures the flow of the game. Some referees are more likely to let things play out allowing to more chances. You can see here how these refs called games last season.

ggplot(ref_metrics, aes(x = shots_per_foul, y = yellow_per_foul, label = Referee)) +
  geom_point(size = 3) +
  geom_text(nudge_y = 0.005, size = 2.5) +
  labs(
    title = "Shots per Foul vs Yellow Cards per Foul",
    x = "Shots / Fouls",
    y = "Yellow Cards / Fouls"
  )

The next few charts are scatterplots comparing what we have already seen. To me Bramall and Atwell jump out as a referee that let players play.

ggplot(home_ref, aes(x = home_foul_tilt, y = home_card_tilt, label = Referee)) +
  geom_point(aes(size = matches)) +
  geom_text(nudge_y = 0.07, size = 2.5) +
  labs(
    title = "Home Card Tilt vs Home Foul Tilt",
    x = "Home Foul Tilt",
    y = "Home Card Tilt"
  )

These two metrics are expected to be similar because the more a ref blows their whistle towards a team the more likely they are to give their players cards. For this reason you can see a small linear relationship.

ggplot(home_ref, aes(x = home_foul_tilt, y = home_win_rate, label = Referee)) +
  geom_point(aes(size = matches)) +
  geom_text(nudge_y = 0.05, size = 2.5) +
  labs(
    title = "Home Win Rate vs Home Foul Tilt",
    x = "Home Foul Tilt",
    y = "Home Win Rate"
  )

This graph comparing home fouls and home win rate surprisingly has a very low correlation. It is hard to say any referee is particularly biased given this graph.

ggplot(ref_metrics, aes(x = away_yellow_per_foul, y = home_yellow_per_foul, label = Referee)) +
  geom_point(aes(size = matches)) +
  geom_text(nudge_y = 0.006, size = 2) +
  scale_x_continuous(limits = c(0.10, 0.27)) +
  scale_y_continuous(limits = c(0.10, 0.27)) +
  coord_equal() +
  labs(
    title = "Home vs Away Yellow Cards per Foul",
    x = "Away Yellow Cards per Foul",
    y = "Home Yellow Cards per Foul"
  )

This scatterplot provides another way to look at foul tilt.

III. Sportsbooks

To begin this section I wanted to see if any referees skewed towards the odds on favorite or underdog. Obviously similarly to home/away refs will look like they are favoring the favorite because the better team is more likely to win and less likely to foul. I used mostly the same structure for the first part of this section.

ref_bias <- pl |>
  mutate(
    favorite_side = case_when(
      AvgCH < AvgCA ~ "H",
      AvgCA < AvgCH ~ "A",
    ),

    favorite_won = case_when(
      favorite_side == "H" & FTR == "H" ~ 1,
      favorite_side == "A" & FTR == "A" ~ 1,
      favorite_side %in% c("H", "A") ~ 0,
    ),

    favorite_yellows = if_else(favorite_side == "H", HY, AY),
    underdog_yellows = if_else(favorite_side == "H", AY, HY),
    favorite_fouls = if_else(favorite_side == "H", HF, AF),
    underdog_fouls = if_else(favorite_side == "H", AF, HF),
    favorite_shots = if_else(favorite_side == "H", HS, AS),
    underdog_shots = if_else(favorite_side == "H", AS, HS)
  ) |>
  filter(!is.na(favorite_side)) |>
  group_by(Referee) |>
  summarise(
    matches = n(),
    favorite_win_rate = mean(favorite_won),
    favorite_card_tilt = mean(favorite_yellows - underdog_yellows),
    favorite_foul_tilt = mean(favorite_fouls - underdog_fouls),
    favorite_shot_tilt = mean(favorite_shots - underdog_shots),
    favorite_yellow_per_foul = sum(favorite_yellows) / sum(favorite_fouls),
    underdog_yellow_per_foul = sum(underdog_yellows) / sum(underdog_fouls),
    yellow_foul_tilt = favorite_yellow_per_foul - underdog_yellow_per_foul,
    .groups = "drop"
  ) |>
  filter(matches >= 10) |>
  arrange(desc(favorite_win_rate))

ref_bias

# A tibble: 18 × 9
   Referee      matches favorite_win_rate favorite_card_tilt favorite_foul_tilt
   <chr>          <int>             <dbl>              <dbl>              <dbl>
 1 T Bramall         11             0.818             -0.909             -1.18 
 2 M Salisbury       15             0.8               -0.2                0.933
 3 S Barrott         23             0.739             -0.435              0.174
 4 S Hooper          23             0.696             -0.435             -0.609
 5 P Bankes          23             0.652             -0.565             -1.22 
 6 A Taylor          30             0.633             -0.333             -1.57 
 7 J Brooks          17             0.588              0.353             -0.882
 8 D Bond            14             0.571             -1.07              -0.286
 9 J Gillett         16             0.562             -0.875             -2.44 
10 S Attwell         18             0.556             -0.444              1    
11 T Harrington      18             0.556             -1                  0.167
12 A Madley          20             0.55              -0.1               -0.8  
13 D England         19             0.526             -0.579             -2.53 
14 C Pawson          22             0.455             -0.636             -1.41 
15 M Oliver          25             0.44              -0.72              -0.28 
16 C Kavanagh        24             0.417             -0.833             -1.96 
17 T Robinson        16             0.375             -0.625             -0.75 
18 R Jones           22             0.273             -1.5               -1.55 
# ℹ 4 more variables: favorite_shot_tilt <dbl>, favorite_yellow_per_foul <dbl>,
#   underdog_yellow_per_foul <dbl>, yellow_foul_tilt <dbl>

ggplot(ref_bias, aes(x = reorder(Referee, favorite_win_rate), y = favorite_win_rate)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Favorite Win Rate by Referee",
    x = "Referee",
    y = "Favorite Win Rate"
  )

Alike the home/away there is quite a large spread between how often favorites win or lose in the premier league depending on referee.

ggplot(ref_bias, aes(x = reorder(Referee, favorite_card_tilt), y = favorite_card_tilt)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Favorite Card Tilt by Referee per Game",
    x = "Referee",
    y = "More Underdog Yellows | More Favorite Yellows"
  )

The card tilt is pretty dramatic here too. Most of the names giving more cards to the underdog do not line up with the ones resulting in the favorite winning.

ggplot(ref_bias, aes(x = reorder(Referee, favorite_foul_tilt), y = favorite_foul_tilt)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Favorite Foul Tilt by Referee per Match",
    x = "Referee",
    y = "More Underdog Fouls - More Fouls"
  )

The foul tilt looks like most of the tilt graphs we have seen. There is a disparity but it is not enough to draw any conclusions from.

ggplot(ref_bias, aes(x = reorder(Referee, yellow_foul_tilt), y = yellow_foul_tilt)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Favorite Yellow per Foul Tilt by Referee",
    x = "Referee",
    y = "Higher Underdog Yellow/Foul | Higher Favorite Yellow/Foul"
  )

Yellows per foul do favor the favorite for almost all of the referees. There could be an unrealized bias from the refs but it is not enough to say for sure, and the spread is definitely not large enough to point to corruption. It could also be true that the underdogs are more likely to commit risky challenges due to desperation, but it is hard to say for sure what the reasoning behind this is.

ggplot(ref_bias, aes(x = favorite_foul_tilt, y = favorite_card_tilt, label = Referee)) +
  geom_point(aes(size = matches)) +
  geom_text(nudge_y = 0.07, size = 3) +
  labs(
    title = "Favorite Foul Tilt vs Favorite Card Tilt",
    x = "Favorite Foul Tilt",
    y = "Favorite Card Tilt"
  )

Another way to look at two of the recent bar charts.

The next few graphs here are looking at the differences between how accurate the five sportsbooks predicting match results from the dataset.

book_acc <- bind_rows(
  pl |> transmute(book = "Bet365", H = B365CH, D = B365CD, A = B365CA, FTR),
  pl |> transmute(book = "bwin", H = BWCH, D = BWCD, A = BWCA,   FTR),
  pl |> transmute(book = "Betfair", H = BFCH, D = BFCD, A = BFCA,   FTR),
  pl |> transmute(book = "Pinnacle", H = PSCH, D = PSCD, A = PSCA,   FTR),
  pl |> transmute(book = "William Hill", H = WHCH, D = WHCD, A = WHCA,   FTR),
  pl |> transmute(book = "1xBet", H = `1XBCH`, D = `1XBCD`, A = `1XBCA`, FTR)
) |>
  mutate(
    predicted = case_when(
      H < D & H < A ~ "H",
      D < H & D < A ~ "D",
      A < H & A < D ~ "A",
    ),
    correct = predicted == FTR
  ) |>
  filter(!is.na(predicted)) |>
  group_by(book) |>
  summarise(
    matches = n(),
    accuracy = mean(correct),
    .groups = "drop"
  ) |>
  arrange(desc(accuracy))

book_acc

# A tibble: 6 × 3
  book         matches accuracy
  <chr>          <int>    <dbl>
1 Bet365           378    0.556
2 Betfair          378    0.556
3 Pinnacle         379    0.554
4 1xBet            377    0.546
5 bwin             237    0.532
6 William Hill     289    0.529

ggplot(book_acc, aes(x = reorder(book, accuracy), y = accuracy)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Sportsbook Accuracy vs Full-Time Result",
    x = "Sportsbook",
    y = "Accuracy"
  )

ggplot(book_acc, aes(x = accuracy, y = reorder(book, accuracy))) +
  geom_point(size = 3) +
  geom_text(aes(label = percent(accuracy, accuracy = 0.1)), nudge_x = 0.006) +
  scale_x_continuous(labels = percent_format(), limits = c(min(book_acc$accuracy) - 0.01, max(book_acc$accuracy) + 0.02)) +
  labs(
    title = "Sportsbook Accuracy on Closing Favorite Pick",
    x = "Accuracy",
    y = "Sportsbook"
  )

book_acc_diff <- book_acc |>
  mutate(diff_from_avg = accuracy - mean(accuracy))

ggplot(book_acc_diff, aes(x = reorder(book, diff_from_avg), y = diff_from_avg)) +
  geom_col() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_flip() +
  scale_y_continuous(labels = percent_format(accuracy = 0.1)) +
  labs(
    title = "Accuracy Relative to Sportsbook Average",
    x = "Sportsbook",
    y = "Difference from Average Accuracy"
  )

book_acc_ci <- book_acc |>
  mutate(
    se = sqrt((accuracy * (1 - accuracy)) / matches),
    lower = accuracy - 1.96 * se,
    upper = accuracy + 1.96 * se
  )

ggplot(book_acc_ci, aes(x = accuracy, y = reorder(book, accuracy))) +
  geom_point(size = 3) +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.15) +
  scale_x_continuous(labels = percent_format()) +
  labs(
    title = "Sportsbook Accuracy with 95% Confidence Intervals",
    x = "Accuracy",
    y = "Sportsbook"
  )

Warning: `geom_errorbarh()` was deprecated in ggplot2 4.0.0.
ℹ Please use the `orientation` argument of `geom_errorbar()` instead.

`height` was translated to `width`.

book_preds <- bind_rows(
  pl |> transmute(book = "Bet365", H = B365CH,  D = B365CD,  A = B365CA,  FTR),
  pl |> transmute(book = "bwin", H = BWCH,    D = BWCD,    A = BWCA,    FTR),
  pl |> transmute(book = "Betfair", H = BFCH,    D = BFCD,    A = BFCA,    FTR),
  pl |> transmute(book = "Pinnacle" ,H = PSCH,    D = PSCD,    A = PSCA,    FTR),
  pl |> transmute(book = "William Hill", H = WHCH,    D = WHCD,    A = WHCA,    FTR),
  pl |> transmute(book = "1xBet", H = `1XBCH`, D = `1XBCD`, A = `1XBCA`, FTR)
) |>
  mutate(
    predicted = case_when(
      H < D & H < A ~ "H",
      D < H & D < A ~ "D",
      A < H & A < D ~ "A",
    ),
    correct = predicted == FTR
  ) |>
  filter(!is.na(predicted))

book_outcome_acc <- book_preds |>
  group_by(book, predicted) |>
  summarise(
    picks = n(),
    accuracy = mean(correct),
    .groups = "drop"
  )

ggplot(book_outcome_acc, aes(x = book, y = accuracy, fill = predicted)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Accuracy by Predicted Result",
    x = "Sportsbook",
    y = "Accuracy",
    fill = "Predicted"
  )

All of these visuals tell the same story. The accuracy scores are pretty similar no matter how you look at it. As expected the house usually wins. If you are trying to pick a book based on how often you are wrong you are better off forgetting it and comparing matchday odds by match.

book_pick_mix <- book_preds |>
  group_by(book, predicted) |>
  summarise(n = n(), .groups = "drop") |>
  group_by(book) |>
  mutate(rate = n / sum(n))

ggplot(book_pick_mix, aes(x = book, y = rate, fill = predicted)) +
  geom_col() +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "What Each Sportsbook Picks Most Often",
    x = "Sportsbook",
    y = "Share of Picks",
    fill = "Predicted"
  )

This visual shows how much each sportsbook values home advantage. There is a slight difference but it is not much.

Statistical testing

chisq.test(table(pl$Referee, pl$FTR == "H"))

Warning in chisq.test(table(pl$Referee, pl$FTR == "H")): Chi-squared
approximation may be incorrect


    Pearson's Chi-squared test

data:  table(pl$Referee, pl$FTR == "H")
X-squared = 34.837, df = 25, p-value = 0.0912

ref_match <- pl |>
  mutate(
    favorite_side = case_when(
      AvgCH < AvgCA ~ "H",
      AvgCA < AvgCH ~ "A",
    ),
    favorite_won = case_when(
      favorite_side == "H" & FTR == "H" ~ 1,
      favorite_side == "A" & FTR == "A" ~ 1,
      favorite_side %in% c("H", "A") ~ 0,
    ),
    favorite_yellows = if_else(favorite_side == "H", HY, AY),
    underdog_yellows = if_else(favorite_side == "H", AY, HY),
    favorite_fouls = if_else(favorite_side == "H", HF, AF),
    underdog_fouls = if_else(favorite_side == "H", AF, HF),
    favorite_shots = if_else(favorite_side == "H", HS, AS),
    underdog_shots = if_else(favorite_side == "H", AS, HS),

    card_tilt = favorite_yellows - underdog_yellows,
    foul_tilt = favorite_fouls - underdog_fouls,
    shot_tilt = favorite_shots - underdog_shots
  ) |>
  filter(!is.na(favorite_side), !is.na(Referee))

H0: The favorite’s win rate is unaffected by who is refereeing.

H1: The favorite’s win rate is associated with who is refereeing.

chisq.test(table(ref_match$Referee, ref_match$favorite_won))

Warning in chisq.test(table(ref_match$Referee, ref_match$favorite_won)):
Chi-squared approximation may be incorrect


    Pearson's Chi-squared test

data:  table(ref_match$Referee, ref_match$favorite_won)
X-squared = 34.265, df = 25, p-value = 0.1024

With a p-value of .1024 we fail to reject the null hypothesis and do not have enough evidence to say referee is associated with the favorite winning a game or not.

These are two logistical models. Model 0 being every match has the same win rate regardless of referee and uses no predictors while model 1 does use referees as a predictor creating a binomial between referees affecting win rate, or referees not mattering.

m0 <- glm(favorite_won ~ 1, data = ref_match, family = binomial)
m1 <- glm(favorite_won ~ Referee, data = ref_match, family = binomial)

anova(m0, m1, test = "Chisq")

Analysis of Deviance Table

Model 1: favorite_won ~ 1
Model 2: favorite_won ~ Referee
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1       378     520.52                       
2       353     483.39 25   37.122  0.05621 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With model 1’s deviance being slightly lower it did perform better. The p-value here is .05621, so if we are using a traditional .05 standard it will be rejected and win rate does not significantly differ by referee. However, I do believe this is a much more nuanced and less black and white topic. Since the p-value is so close to .05 I will say the referee assigned the a match does have suggestive evidence that the favored team wins or loses. This does not mean that the refs are biased or corrupt there are plenty of other potential explanations.

H0: Referees average the same amount card tilt.

H1: At least one referee has a different mean card tilt

summary(aov(card_tilt ~ Referee, data = ref_match))

             Df Sum Sq Mean Sq F value Pr(>F)
Referee      25   72.5   2.902   1.054  0.395
Residuals   353  971.6   2.752

We fail to reject the null hypothesis with a p-value of .395 and do not have evidence to say that any of the referee’s card giving tendencies differ from the rest significantly.

H0: Referees average the same amount foul tilt.

H1: At least one referee has a different mean foul tilt

summary(aov(foul_tilt ~ Referee, data = ref_match))

             Df Sum Sq Mean Sq F value Pr(>F)
Referee      25    472   18.87    0.81  0.729
Residuals   353   8220   23.29

We fail to reject the null hypothesis with a p-value of .729 and do not have evidence to say that any of the referee’s foul calling tendencies differ from the rest significantly.

H0: Referees average the same amount shot tilt.

H1: At least one referee has a different mean shot tilt

summary(aov(shot_tilt ~ Referee, data = ref_match))

             Df Sum Sq Mean Sq F value Pr(>F)
Referee      25   1456   58.24   0.905  0.598
Residuals   353  22709   64.33

We fail to reject the null hypothesis with a p-value of .598 and do not have evidence to say that any of the referee’s affect mean shot tilt significantly.

Conclusions

There is little statistical evidence that there is a difference between how individual referees affect matches.
There is however a difference between how each referee called their games this past season, and it would be useful to know this information as a fan.
Using the charts you can predict how a referee is going to call a game. This can be used to prepare yourself as a fan, or for betting purposes.
Referees like Simon Hooper are card happy, and referee tendencies seem to be mixed when it comes to bias.
Betting with William Hill gives the better a slight advantage since they are the most inaccurate out of the 5 measured, however, the difference is slim.

Additional links

Video presentation

Slide deck