The Data

Join me as I explore the match data from the 2021/2022 English Premier League season and highlight interesting findings from the underlying data. This dataset is made available by Evan Gower and can be accessed here.

It contains all 380 Premier League games, and the match data available for each game is listed below.

Preparing the environment

Loading the necessary packages
library(dplyr)
library(tidyr)
library(forcats)
library(readr)
library(ggplot2)
library(janitor)
library(lubridate)


Importing the dataset

The dataset is imported into the R environment
prem_matches <- read.csv("soccer21-22.csv")
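
A quick sanity check right after importing can catch surprises early. The sketch below is only an illustration: it reads two rows typed inline (a subset of columns, with the full date strings assumed to be in day/month/year form) so that it runs without the CSV file, and `sample_csv` and `sample_matches` are names I made up for the example.

```r
# Two rows typed inline so the snippet runs without the CSV file.
# Assumption: dates are stored as dd/mm/yyyy strings; only six of the
# 22 columns are included here.
sample_csv <- "Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
13/08/2021,Brentford,Arsenal,2,0,H
14/08/2021,Man United,Leeds,5,1,H"

sample_matches <- read.csv(text = sample_csv, stringsAsFactors = FALSE)

# Dimensions, column types and missing values per column
dim(sample_matches)
sapply(sample_matches, class)
colSums(is.na(sample_matches))
```

Running the same three checks on `prem_matches` should report 380 rows and 22 columns if the import went as expected.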


Exploring the dataset

A glimpse into the structure of the data
tibble(prem_matches)
## # A tibble: 380 x 22
##    Date  HomeT~1 AwayT~2  FTHG  FTAG FTR    HTHG  HTAG HTR   Referee    HS    AS
##    <chr> <chr>   <chr>   <int> <int> <chr> <int> <int> <chr> <chr>   <int> <int>
##  1 13/0~ Brentf~ Arsenal     2     0 H         1     0 H     M Oliv~     8    22
##  2 14/0~ Man Un~ Leeds       5     1 H         1     0 H     P Tier~    16    10
##  3 14/0~ Burnley Bright~     1     2 A         1     0 H     D Coote    14    14
##  4 14/0~ Chelsea Crysta~     3     0 H         2     0 H     J Moss     13     4
##  5 14/0~ Everton Southa~     3     1 H         0     1 A     A Madl~    14     6
##  6 14/0~ Leices~ Wolves      1     0 H         1     0 H     C Paws~     9    17
##  7 14/0~ Watford Aston ~     3     2 H         2     0 H     M Dean     13    11
##  8 14/0~ Norwich Liverp~     0     3 A         0     1 A     A Marr~    14    19
##  9 15/0~ Newcas~ West H~     2     4 A         2     1 H     M Atki~    17     8
## 10 15/0~ Totten~ Man Ci~     1     0 H         0     0 D     A Tayl~    13    18
## # ... with 370 more rows, 10 more variables: HST <int>, AST <int>, HF <int>,
## #   AF <int>, HC <int>, AC <int>, HY <int>, AY <int>, HR <int>, AR <int>, and
## #   abbreviated variable names 1: HomeTeam, 2: AwayTeam
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
colnames(prem_matches)
##  [1] "Date"     "HomeTeam" "AwayTeam" "FTHG"     "FTAG"     "FTR"     
##  [7] "HTHG"     "HTAG"     "HTR"      "Referee"  "HS"       "AS"      
## [13] "HST"      "AST"      "HF"       "AF"       "HC"       "AC"      
## [19] "HY"       "AY"       "HR"       "AR"


Checking and removing duplicate entries in the dataset

prem_matches <- distinct(prem_matches)
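
Since the tibble shows 380 rows both before and after this step, it looks like no duplicates were present. To see how many rows `distinct()` would drop before actually dropping them, something like the following could be used. This is a minimal sketch on a toy data frame (`toy` and its rows are made up for illustration).

```r
library(dplyr)

# Toy data: the third row is an exact copy of the first
toy <- data.frame(
  HomeTeam = c("Brentford", "Man United", "Brentford"),
  AwayTeam = c("Arsenal", "Leeds", "Arsenal"),
  FTHG = c(2L, 5L, 2L),
  FTAG = c(0L, 1L, 0L)
)

# duplicated() flags every repeat of an earlier row
n_duplicates <- sum(duplicated(toy))

# distinct() keeps the first occurrence of each unique row
toy_unique <- distinct(toy)

n_duplicates
nrow(toy_unique)
```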


Changing the date field from character to date format

prem_matches$Date <- dmy(prem_matches$Date)
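
The choice of `dmy()` matters because the dates are written day first. As a small illustration (the date strings below are typed by hand and assumed to match the file's format), a month-first parser fails on the same input:

```r
library(lubridate)

# Hand-typed examples in day/month/year order (format assumed)
raw_dates <- c("13/08/2021", "14/08/2021", "22/05/2022")

parsed <- dmy(raw_dates)
class(parsed)   # "Date"
parsed

# mdy() reads the first number as the month, and there is no month 13,
# so parsing fails and NA is returned (with a warning)
wrong_order <- suppressWarnings(mdy("13/08/2021"))
is.na(wrong_order)
```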


Cleaning and replacing the column names in the dataset

Changing all column names to lower-case snake case
prem_matches <- clean_names(prem_matches)
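
To make the effect of `clean_names()` visible, here is a small sketch on a one-row toy data frame that uses a few of the original column names (the row itself is only for illustration):

```r
library(janitor)

# One toy row with a few of the original column names
toy <- data.frame(Date = "13/08/2021", HomeTeam = "Brentford", FTHG = 2L, HS = 8L)

names(toy)               # names before cleaning
cleaned <- clean_names(toy)
names(cleaned)           # "date" "home_team" "fthg" "hs"
```

Mixed-case names such as `HomeTeam` are split into words, while all-caps abbreviations such as `FTHG` are simply lower-cased, which is why they still need renaming by hand.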


Replacing abbreviations in the column names to make them more readable
prem_matches_cleaned <- prem_matches %>%
  select(date, home_team, away_team, ft_home_goals = fthg, ft_away_goals = ftag,
         ft_result = ftr, ht_home_goals = hthg, ht_away_goals = htag, ht_results = htr,
         referee, home_shots = hs, away_shots = as, home_shots_on_target = hst,
         away_shots_on_target = ast, home_fouls = hf, away_fouls = af, home_corners = hc,
         away_corners = ac, home_yellows = hy, away_yellows = ay, home_reds = hr, away_reds = ar)
tibble(prem_matches_cleaned)
## # A tibble: 380 x 22
##    date       home_team  away_~1 ft_ho~2 ft_aw~3 ft_re~4 ht_ho~5 ht_aw~6 ht_re~7
##    <date>     <chr>      <chr>     <int>   <int> <chr>     <int>   <int> <chr>  
##  1 2021-08-13 Brentford  Arsenal       2       0 H             1       0 H      
##  2 2021-08-14 Man United Leeds         5       1 H             1       0 H      
##  3 2021-08-14 Burnley    Bright~       1       2 A             1       0 H      
##  4 2021-08-14 Chelsea    Crysta~       3       0 H             2       0 H      
##  5 2021-08-14 Everton    Southa~       3       1 H             0       1 A      
##  6 2021-08-14 Leicester  Wolves        1       0 H             1       0 H      
##  7 2021-08-14 Watford    Aston ~       3       2 H             2       0 H      
##  8 2021-08-14 Norwich    Liverp~       0       3 A             0       1 A      
##  9 2021-08-15 Newcastle  West H~       2       4 A             2       1 H      
## 10 2021-08-15 Tottenham  Man Ci~       1       0 H             0       0 D      
## # ... with 370 more rows, 13 more variables: referee <chr>, home_shots <int>,
## #   away_shots <int>, home_shots_on_target <int>, away_shots_on_target <int>,
## #   home_fouls <int>, away_fouls <int>, home_corners <int>, away_corners <int>,
## #   home_yellows <int>, away_yellows <int>, home_reds <int>, away_reds <int>,
## #   and abbreviated variable names 1: away_team, 2: ft_home_goals,
## #   3: ft_away_goals, 4: ft_result, 5: ht_home_goals, 6: ht_away_goals,
## #   7: ht_results
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names


Combining home and away stats by matches

# transmute() keeps one row per match and only the columns listed here
prem_matches_combined <- prem_matches %>%
  transmute(date, home_team, away_team, goals_scored = fthg + ftag, result = ftr,
            referee, total_shots = hs + as, shots_on_target = hst + ast,
            fouls_committed = hf + af, corner_count = hc + ac,
            bookings = hy + ay + hr + ar, yellow_cards = hy + ay, red_cards = hr + ar)
tibble(prem_matches_combined)
## # A tibble: 380 x 13
##    date       home_team  away_t~1 goals~2 result referee total~3 shots~4 fouls~5
##    <date>     <chr>      <chr>      <int> <chr>  <chr>     <int>   <int>   <int>
##  1 2021-08-13 Brentford  Arsenal        2 H      M Oliv~      30       7      20
##  2 2021-08-14 Man United Leeds          6 H      P Tier~      26      11      20
##  3 2021-08-14 Burnley    Brighton       3 A      D Coote      28      11      17
##  4 2021-08-14 Chelsea    Crystal~       3 H      J Moss       17       7      26
##  5 2021-08-14 Everton    Southam~       4 H      A Madl~      20       9      28
##  6 2021-08-14 Leicester  Wolves         1 H      C Paws~      26       8      16
##  7 2021-08-14 Watford    Aston V~       5 H      M Dean       24       9      31
##  8 2021-08-14 Norwich    Liverpo~       3 A      A Marr~      33      11      18
##  9 2021-08-15 Newcastle  West Ham       6 A      M Atki~      25      12       7
## 10 2021-08-15 Tottenham  Man City       1 H      A Tayl~      31       7      19
## # ... with 370 more rows, 4 more variables: corner_count <int>, bookings <int>,
## #   yellow_cards <int>, red_cards <int>, and abbreviated variable names
## #   1: away_team, 2: goals_scored, 3: total_shots, 4: shots_on_target,
## #   5: fouls_committed
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names


Exploratory Data Analysis

How many goals were scored throughout the season?

prem_matches_combined %>% 
  summarise(total_goals = sum(goals_scored),total_shots = sum(total_shots),
            total_shots_on_target = sum(shots_on_target),
            conversion_rate = total_goals/total_shots*100)
##   total_goals total_shots total_shots_on_target conversion_rate
## 1        1071        9722                  3352        11.01625

Across the 380 fixtures played, 1,071 goals were scored from a total of 9,722 shots at goal. That is an average of 2.82 goals per game and a goal conversion rate of 11%. The average of 2.82 is the joint-highest goals-per-game figure in the history of the Premier League, level with the 2018/19 season.
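
The averages quoted here follow directly from the season totals in the output. A short worked calculation, using only those totals and the 380 fixtures (the variable names are mine), looks like this:

```r
# Season totals taken from the summary output
total_goals     <- 1071
total_shots     <- 9722
shots_on_target <- 3352
games           <- 380

goals_per_game  <- total_goals / games                  # goals per fixture
conversion_rate <- total_goals / total_shots * 100      # % of shots that were goals
on_target_rate  <- shots_on_target / total_shots * 100  # % of shots on target

round(goals_per_game, 2)    # 2.82
round(conversion_rate, 2)   # 11.02
round(on_target_rate, 1)    # 34.5
```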

ggplot(data = prem_matches_combined, aes(x = date, y = goals_scored)) +
  geom_col(width = 4, fill = "#38003c") +
  scale_x_date(date_labels="%b",date_breaks  ="1 month") +
  xlab("Month") + ylab("Goals Scored") + labs(title = "Goals by Month")

It is interesting to note that across the 10 months of football and 38 game weeks, the most goals in a single game week were recorded on the final day of the season.
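
The chart draws one bar per match date, so totals per day or per month have to be read off by eye. One way to get the numbers directly is to aggregate before plotting. The sketch below assumes a data frame with `date` and `goals_scored` columns like `prem_matches_combined`, but runs on made-up toy rows. Note that the dataset has no game week column, so match dates stand in for game weeks here.

```r
library(dplyr)
library(lubridate)

# Toy rows for illustration only
toy <- data.frame(
  date = as.Date(c("2021-08-14", "2021-08-14", "2021-08-15", "2021-09-11")),
  goals_scored = c(6L, 3L, 6L, 1L)
)

# Goals per match day, highest first
goals_by_day <- toy %>%
  group_by(date) %>%
  summarise(goals = sum(goals_scored), .groups = "drop") %>%
  arrange(desc(goals))

# Goals per calendar month: floor_date() maps each date to the first of its month
goals_by_month <- toy %>%
  mutate(month = floor_date(date, unit = "month")) %>%
  group_by(month) %>%
  summarise(goals = sum(goals_scored), matches = n(), .groups = "drop")

goals_by_day
goals_by_month
```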


How did the referees fare with the games?

prem_matches_combined %>%
  summarise(referees_used = n_distinct(referee),
            total_fouls = sum(fouls_committed),
            total_bookings = sum(bookings),
            total_yellow_cards = sum(yellow_cards),
            total_red_cards = sum(red_cards))
##   referees_used total_fouls total_bookings total_yellow_cards total_red_cards
## 1            22        7681           1334               1291              43

There were 22 different referees officiating across the Premier League season, and a whopping 7,681 fouls were awarded. The data does not indicate whether this figure includes offside violations and handballs. In total, 1,334 bookings were handed out, the majority of them yellow cards (1,291 yellow against 43 red). The dataset does not indicate whether the red cards also include dismissals for a second yellow card.

ggplot(data = prem_matches_combined, aes(x = referee)) + 
  geom_bar(fill = "#38003c", color = "#38003c") + 
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Referee") + ylab("Games Officiated") + 
  labs(title = "Games Officiated by Referee")

Anthony Taylor, Paul Tierney and Craig Pawson officiated the most games during the 2021/22 season.
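
The same ranking can be read from a table instead of the chart. A minimal sketch with `count()`, using placeholder referee names rather than real data:

```r
library(dplyr)

# Placeholder referees; one row per match officiated
toy <- data.frame(
  referee = c("Ref A", "Ref B", "Ref A", "Ref C", "Ref A", "Ref B")
)

# count() tallies rows per referee; sort = TRUE puts the busiest first
games_by_referee <- toy %>%
  count(referee, name = "games", sort = TRUE)

games_by_referee
```

Applied to `prem_matches_combined`, the first few rows of such a table should correspond to the tallest bars in the chart.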

ggplot(data = prem_matches_combined, aes(x = referee, y = bookings)) + 
  geom_col(fill = "#38003c", color = "#38003c") + 
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Referee") + ylab("Cards Issued") + 
  labs(title = "Bookings Issued by Referees")

As expected, the referees who officiated the most matches also issued the most bookings throughout the season.
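
Because raw totals favour referees with more games, a per-game rate gives a fairer comparison. The following is a rough sketch of that normalisation, assuming `referee` and `bookings` columns as in `prem_matches_combined`. The rows and referee names are placeholders.

```r
library(dplyr)

# Placeholder data: one row per match
toy <- data.frame(
  referee  = c("Ref A", "Ref A", "Ref A", "Ref B", "Ref B"),
  bookings = c(3L, 2L, 4L, 5L, 5L)
)

bookings_rate <- toy %>%
  group_by(referee) %>%
  summarise(games = n(),
            total_bookings = sum(bookings),
            bookings_per_game = total_bookings / games,
            .groups = "drop") %>%
  arrange(desc(bookings_per_game))

bookings_rate
```

In this toy example Ref B has fewer games but the higher rate, which is exactly the pattern a totals-only chart would hide.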

ggplot(data = prem_matches_combined, aes(x = referee, y = red_cards)) + 
  geom_col(fill = "red", color = "#38003c") + 
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Referee") + ylab("Card Count") + 
  labs(title = "Red Cards Issued by Referees")

Despite not dishing out the most bookings, John Moss and Michael Oliver handed out the most red cards.


Wins, Losses and Draws

wins <- prem_matches_cleaned %>%
  mutate(ft_result = case_when(
    ft_result == "H" ~ home_team,
    ft_result == "A" ~ away_team,
    ft_result == "D" ~ "Draw"))
wins <- wins %>% filter(ft_result != "Draw")

ggplot(data = wins, aes(x = ft_result)) +
  geom_bar(fill = "#00ff85", color = "#38003c") +
  theme(axis.text.x = element_text(angle = 90)) + 
  xlab("Teams") + ylab("Wins") + labs(title = "Total Games Won by each Football Team")

Manchester City and Liverpool won the most games, with both teams recording over 25 wins. The closest teams were Arsenal, Chelsea and Tottenham. At the opposite end were Burnley, Leeds, Norwich, Southampton and Watford, none of which could muster 10 wins throughout the season.

losses <- prem_matches_cleaned %>%
  mutate(ft_result = case_when(
    ft_result == "H" ~ away_team,
    ft_result == "A" ~ home_team,
    ft_result == "D" ~ "Draw"))
losses <- losses %>% filter(ft_result != "Draw")

ggplot(data = losses, aes(x = ft_result)) +
  geom_bar(fill = "#e90052", color = "#38003c") +
  theme(axis.text.x = element_text(angle = 90)) + 
  xlab("Teams") + ylab("Losses") + labs(title = "Total Games Lost by each Football Team")

When you do not win games, you are likely to lose them: Norwich and Watford each recorded over 25 losses, with Everton third on 21 games lost. Manchester City and Liverpool, by contrast, each lost fewer than 5 games, which is very impressive.

# Keep only drawn matches, then stack home and away teams into one `values`
# column so that each draw counts once for both teams involved
draws <- prem_matches_cleaned %>%
  filter(ft_result == "D") %>%
  select(home_team, away_team) %>%
  pivot_longer(everything(), names_to = "ind", values_to = "values")

ggplot(data = draws, aes(x = values)) +
  geom_bar(fill = "#ffffff", color = "#38003c") +
  theme(axis.text.x = element_text(angle = 90)) + 
  xlab("Teams") + ylab("Draws") + labs(title = "Total Games Drawn by each Football Team")

The 2021/22 Premier League also recorded 88 drawn games across the 380 fixtures, meaning 23% of the games played ended in a draw. Brighton and Crystal Palace share the most draws with 15 each, followed by Burnley and Southampton with 13 and 11 draws respectively.
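
The three charts in this section can also be summarised in a single wins/draws/losses table. This is one possible approach rather than the method used for the charts: reshape to one row per team per match, label each row as a win, draw or loss, then count. It assumes the column names of `prem_matches_cleaned`, and the teams and results below are invented toy data.

```r
library(dplyr)
library(tidyr)

# Toy fixtures for illustration only
toy <- data.frame(
  home_team = c("Team A", "Team B", "Team C", "Team A"),
  away_team = c("Team B", "Team C", "Team A", "Team C"),
  ft_result = c("H", "D", "A", "D")
)

results_table <- toy %>%
  # One row per team per match; `venue` records home or away
  pivot_longer(c(home_team, away_team), names_to = "venue", values_to = "team") %>%
  mutate(outcome = case_when(
    ft_result == "D" ~ "draw",
    (ft_result == "H" & venue == "home_team") |
      (ft_result == "A" & venue == "away_team") ~ "win",
    TRUE ~ "loss"
  )) %>%
  count(team, outcome) %>%
  # Spread outcomes into columns; teams with no rows for an outcome get 0
  pivot_wider(names_from = outcome, values_from = n, values_fill = 0L)

results_table
```

Each drawn match contributes two rows to the draw column, one per team, which is why 88 drawn games would add up to 176 team draws across the table.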


Thank you for taking the time to go through my analysis. Any feedback or advice is welcome.