EPL Project

Author

Mick Rathbone

Introduction

The goal of this project is to find trends in data from the 2021-22 English Premier League season. I will analyze potential home bias among the referees in this season in three different ways. After this analysis, I will look at Yelp reviews of the best team of the season and see whether the reviews reflect the positive season that this team had.

Data Dictionary

First, though, it is important to outline the variables used in this project. The following variables are used in the analysis and were created in a dataset by another user on Kaggle:

HomeTeam: The home team for each match

AwayTeam: The away team for each match

Referee: The center official for each match

FTHG: Full Time Home Goals

FTAG: Full Time Away Goals

FTR: Full Time Result (Home win (H), Draw (D), Away win (A))

HTR: Half Time Result (Home win (H), Draw (D), Away win (A))

HF: Fouls committed by the home team

AF: Fouls committed by the away team

HY: Home Yellow cards

AY: Away Yellow cards

Referee Analysis

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.3
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: 'rvest'


The following object is masked from 'package:readr':

    guess_encoding


Loading required package: timechange


Attaching package: 'lubridate'


The following objects are masked from 'package:base':

    date, intersect, setdiff, union



Attaching package: 'textdata'


The following object is masked from 'package:httr':

    cache_info


Rows: 380 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): Date, HomeTeam, AwayTeam, FTR, HTR, Referee
dbl (16): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Do certain referees lead to more home team goals?

# Creates a home and away goal ratio by referee
soccer_data %>%
  # Excludes three referees from the comparison
  filter(!Referee %in% c("J Brooks", "M Salisbury", "T Harrington")) %>%
  group_by(Referee) %>%
  summarize(HomeAwayRatio = sum(FTHG) / sum(FTAG)) %>%
  arrange(desc(HomeAwayRatio))
# A tibble: 19 × 2
   Referee     HomeAwayRatio
   <chr>               <dbl>
 1 S Attwell           2.41 
 2 A Madley            2.12 
 3 J Moss              1.70 
 4 M Oliver            1.70 
 5 P Bankes            1.7  
 6 D England           1.38 
 7 M Atkinson          1.32 
 8 A Taylor            1.19 
 9 J Gillett           1.13 
10 C Pawson            1.13 
11 P Tierney           1.13 
12 M Dean              1.03 
13 G Scott             0.944
14 S Hooper            0.895
15 C Kavanagh          0.867
16 D Coote             0.864
17 K Friend            0.677
18 A Marriner          0.575
19 R Jones             0.565
# Creates a home and away goal ratio for all games
soccer_data %>%
  summarize(HomeAwayRatio = sum(FTHG) / sum(FTAG))
# A tibble: 1 × 1
  HomeAwayRatio
          <dbl>
1          1.16

The question of referee bias can be approached in many ways; the first measure used here is the home and away goal ratio. Across all EPL matches in the dataset the ratio is 1.16, meaning that for every away goal there are 1.16 home goals. Any referee with a ratio above this has therefore officiated games in which the home team was more productive than average. While there are quite a few referees with ratios over 1.16, five are at roughly 1.7 or higher and two are over 2.0. These two, Andy Madley and Stuart Attwell, both reffed 15+ games, which lowers the possibility that a few outlier matches inflated their numbers.
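Because a ratio built from only a few matches can swing widely, it can help to report the number of matches next to each ratio. The following is a minimal sketch of one way to do that with dplyr. The `toy_matches` data frame, the referee names, and the goal values are made up for illustration and are not taken from the Kaggle dataset.

```r
library(dplyr)

# Hypothetical toy data (not from the real dataset): one row per match
toy_matches <- tibble(
  Referee = c("Ref A", "Ref A", "Ref A", "Ref B", "Ref B"),
  FTHG    = c(2, 1, 3, 0, 1),
  FTAG    = c(1, 1, 0, 2, 1)
)

# Home/away goal ratio per referee, with the match count alongside it
ratio_by_ref <- toy_matches %>%
  group_by(Referee) %>%
  summarize(
    Matches       = n(),
    HomeAwayRatio = sum(FTHG) / sum(FTAG)
  ) %>%
  arrange(desc(HomeAwayRatio))

ratio_by_ref
```

With the match count in the same table, a reader can judge at a glance whether a high ratio rests on many games or only a handful.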

Do certain referees require more fouls for a home player to get a yellow card compared to an away player?

# Shows how many more fouls are necessary for a home player to get a yellow than an away player by Referee
soccer_data %>%
  # Excludes three referees from the comparison
  filter(!Referee %in% c("J Brooks", "M Salisbury", "T Harrington")) %>%
  group_by(Referee) %>%
  # Home fouls per home yellow minus away fouls per away yellow
  summarize(FoulsPerCard = (sum(HF) / sum(HY)) - (sum(AF) / sum(AY))) %>%
  arrange(desc(FoulsPerCard))
# A tibble: 19 × 2
   Referee     FoulsPerCard
   <chr>              <dbl>
 1 R Jones           3.13  
 2 J Gillett         2.18  
 3 A Madley          2.18  
 4 S Attwell         1.34  
 5 C Pawson          1.33  
 6 A Marriner        1.23  
 7 D England         0.971 
 8 C Kavanagh        0.640 
 9 A Taylor          0.622 
10 P Bankes          0.243 
11 S Hooper         -0.0400
12 J Moss           -0.0690
13 D Coote          -0.286 
14 K Friend         -0.388 
15 M Dean           -0.671 
16 M Oliver         -0.807 
17 G Scott          -0.833 
18 P Tierney        -1.15  
19 M Atkinson       -1.31  

Another way to check for bias is to look at how many more fouls it takes for a home player to get a yellow card than for an away player. The table shows, for each referee, home fouls per home yellow card minus away fouls per away yellow card, so a positive value means home teams committed more fouls per card than away teams. For the top six referees, home teams needed at least one more foul per yellow card than away teams. Andy Madley and Jarred Gillett required over 2 extra fouls, and Robert Jones more than 3 extra per card. This is a large gap and could be cause for concern if it persisted over multiple seasons. This is also the second time that both Madley and Attwell appear in the top 5 for bias.
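To make the measure easier to read, the two halves of the difference can be kept as separate columns before subtracting. This is a small sketch under assumed inputs: `toy_cards`, the referee names, and all foul and card counts are hypothetical, and the helper columns `HomeFoulsPerCard` and `AwayFoulsPerCard` do not appear in the original analysis.

```r
library(dplyr)

# Hypothetical toy data (not from the real dataset): one row per match
toy_cards <- tibble(
  Referee = c("Ref A", "Ref A", "Ref B", "Ref B"),
  HF      = c(12, 8, 10, 10),  # home fouls
  HY      = c(2, 2, 3, 2),     # home yellow cards
  AF      = c(9, 9, 12, 8),    # away fouls
  AY      = c(3, 3, 2, 2)      # away yellow cards
)

card_gap <- toy_cards %>%
  group_by(Referee) %>%
  summarize(
    HomeFoulsPerCard = sum(HF) / sum(HY),
    AwayFoulsPerCard = sum(AF) / sum(AY),
    # Positive values: home teams commit more fouls per yellow card
    FoulsPerCard     = HomeFoulsPerCard - AwayFoulsPerCard
  )

card_gap
```

In this toy data, Ref A gives home teams 5 fouls per card against 3 for away teams (a gap of 2), while Ref B comes out at 4 against 5 (a gap of -1). Note that a referee with zero yellow cards on one side would produce a division by zero, so such cases would need handling in a real dataset.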

Do certain referees officiate a significantly higher number of home wins compared to draws or away wins?

# Plots the number of half-time results (away lead, draw, home lead) by referee

soccer_data %>%
  filter(!Referee %in% c("J Brooks", "M Salisbury", "T Harrington")) %>%
  ggplot(aes(x = HTR)) +
  geom_bar() +
  facet_wrap(~Referee) +
  labs(title = "Total Results (at Half-Time) by Referee",
       x = "Winner (Away, Draw, Home)",
       y = "Count of Matches")

# Plots the number of full-time results (away win, draw, home win) by referee
soccer_data %>%
  filter(!Referee %in% c("J Brooks", "M Salisbury", "T Harrington")) %>%
  ggplot(aes(x = FTR)) +
  geom_bar() +
  facet_wrap(~Referee) +
  labs(title = "Total Results (at Full-Time) by Referee",
       x = "Winner (Away, Draw, Home)",
       y = "Count of Matches")

The final way I used this dataset to look at refereeing bias was by examining the number of home wins compared to draws and away wins. Using a facet wrap, the two visualizations above show the distribution of results by referee, with the first showing results at halftime and the second showing results at the end of the game. While the majority of these graphs show no clear relationship between a referee and the result, a few show clear trends. Kevin Friend officiated noticeably more away wins than home wins, a pattern that is not often seen. Jon Moss, Madley, and Attwell show the opposite, with results skewed clearly toward the home team. This is now the third time that Attwell and Madley have stood out with a home-leaning pattern.
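Raw counts in the bar charts are hard to compare when referees officiated different numbers of matches, so one option is to convert them to shares per referee. The sketch below illustrates the idea with dplyr. The `toy_results` data and referee names are hypothetical, and the `Share` column is an addition for illustration, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data (not from the real dataset): one row per match
toy_results <- tibble(
  Referee = c(rep("Ref A", 4), rep("Ref B", 5)),
  FTR     = c("H", "H", "D", "A", "A", "A", "H", "D", "A")
)

# Share of each full-time result within each referee's matches
result_share <- toy_results %>%
  count(Referee, FTR, name = "Matches") %>%
  group_by(Referee) %>%
  mutate(Share = Matches / sum(Matches)) %>%
  ungroup()

result_share
```

The same `Share` values could be plotted with `geom_col()` in place of the counts, which would put every facet on a common 0 to 1 scale.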

Sentiment Analysis

# Loads in the review data that was scraped from Yelp
ManCity_reviews <- 
  read_csv("ManCityReviews.csv")
Rows: 30 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): reviewer_location, review_content
dbl  (1): review_rating
date (1): review_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Loads in the bing lexicon
bing <- 
  get_sentiments("bing")

# Splits each review into one word per row and removes common stop words
tidy_MC <- 
  ManCity_reviews %>%
  unnest_tokens(word, review_content) %>%
  anti_join(stop_words)
Joining with `by = join_by(word)`
# Makes a table showing the total number of positive and negative words, as well as the sentiment score
tidy_MC %>%
  inner_join(bing) %>% 
  group_by(sentiment) %>% 
  summarize(n = n()) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
# A tibble: 1 × 3
  negative positive sentiment
     <dbl>    <dbl>     <dbl>
1       53      158       105

In the 2021-22 EPL season, Manchester City won the league while scoring the most goals (99) and conceding the joint-fewest (26). In principle, this should lead to a positive fan experience, as the team was highly successful and rarely lost. The table above shows results from 30 Yelp reviews by people who attended games in the last five years, a period in which Man City have been just as dominant as they were in the single season covered by the dataset. The reviews contain a total of 158 positive words and 53 negative words, giving an overall sentiment score of 105. Dividing this by 30 gives an average net sentiment score of 3.5 per review, which matches expectations, since this is a positive score for the best team in the league over this time period.
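The average of 3.5 comes from dividing the overall score by the number of reviews (105 / 30). A per-review breakdown would show whether that average is spread evenly or driven by a few very positive reviews. The following is a rough sketch of how such a breakdown might look. The `toy_reviews` text, the `review_id` column, and the four-word `toy_lexicon` are all hypothetical stand-ins (the real analysis uses the scraped Yelp data and the bing lexicon).

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical toy reviews (not from the scraped Yelp data)
toy_reviews <- tibble(
  review_id      = 1:3,
  review_content = c("Great stadium, amazing atmosphere",
                     "Terrible queues but great match",
                     "Bad seats, bad food")
)

# Hypothetical mini lexicon standing in for the bing lexicon
toy_lexicon <- tibble(
  word      = c("great", "amazing", "terrible", "bad"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Net sentiment (positive minus negative words) for each review
per_review <- toy_reviews %>%
  unnest_tokens(word, review_content) %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(review_id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(score = positive - negative)

per_review
```

One design point to be aware of: a review containing no lexicon words drops out after `inner_join()`, so averaging `score` over the remaining rows can differ from dividing the total by the number of reviews, which treats such reviews as scoring zero.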

Conclusion

After examining the EPL dataset and the referees who officiated in the 2021-22 season, there appeared to be two referees who consistently showed a home-leaning pattern: Andy Madley and Stuart Attwell. While this is only one season, both of these referees took part in a substantial number of matches, so a few outlier games are unlikely to have skewed their data for any one question, let alone all three. Future research could examine these two further, using both other seasons and other variables to reach a firmer conclusion. The sentiment analysis looked at Manchester City, the team that the data showed was the best that year. Using the bing lexicon and 30 reviews from Yelp, the reviews came out net positive, which is consistent with a positive season for the team.