Premier League study

Data Science Task
1. For each season, which team(s) committed the most fouls?
2. How does the number of yellow cards this season compare to previous seasons?
3. Which referees would you describe as the most lenient? Why?
A little further investigation due to curiosity.
A different (more complete) approach to find the top 6 lineant referees

Loading data

glimpse(df)

## Observations: 6,926
## Variables: 14
## $ X                 <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 11...
## $ home_team         <fct> Charlton, Chelsea, Coventry, Derby, Leeds, L...
## $ away_team         <fct> Man City, West Ham, Middlesbrough, Southampt...
## $ date              <fct> 2000-08-19, 2000-08-19, 2000-08-19, 2000-08-...
## $ home_goals        <int> 4, 4, 1, 2, 2, 0, 1, 1, 3, 2, 2, 1, 1, 2, 3,...
## $ away_goals        <int> 0, 2, 3, 2, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,...
## $ referee           <fct> Rob Harris, Graham Barber, Barry Knight, And...
## $ home_fouls        <int> 13, 19, 15, 11, 21, 12, 8, 10, 14, 7, 25, 10...
## $ away_fouls        <int> 12, 14, 21, 13, 20, 12, 8, 21, 13, 13, 20, 7...
## $ home_yellow_cards <int> 1, 1, 5, 1, 1, 2, 1, 3, 0, 0, 2, 1, 2, 0, 2,...
## $ away_yellow_cards <int> 2, 2, 3, 1, 3, 3, 1, 1, 0, 1, 4, 4, 1, 1, 1,...
## $ home_red_cards    <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ away_red_cards    <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 1,...
## $ season            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...

First “To do” list of things:

Delete ‘x’ attribute. Not needed.
Format data attributes porperly.
- date: As date, not factor.
- season: As factor, not integer.

Selecting all attributes but ‘x’

df = df[,2:ncol(df)]
names(df)

##  [1] "home_team"         "away_team"         "date"             
##  [4] "home_goals"        "away_goals"        "referee"          
##  [7] "home_fouls"        "away_fouls"        "home_yellow_cards"
## [10] "away_yellow_cards" "home_red_cards"    "away_red_cards"   
## [13] "season"

Formatting data

df$date = as_date(df$date)
df$season = factor(df$season)

glimpse(df[c("date", "season")])

## Observations: 6,926
## Variables: 2
## $ date   <date> 2000-08-19, 2000-08-19, 2000-08-19, 2000-08-19, 2000-0...
## $ season <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

Data Science Task

1. For each season, which team(s) committed the most fouls?

Let’s take a look at the Top 3 teams that commited the most fouls for each season.

home = df %>%
  select(home_team, home_fouls, season) %>%
  group_by(season, home_team) %>%
  summarise(home_fouls = sum(home_fouls))

away = df %>%
  select(away_team, away_fouls, season) %>%
  group_by(season, away_team) %>%
  summarise(away_fouls = sum(away_fouls))

most_fouls = data.frame(season = home$season,
                        team = home$home_team,
                        total_fouls = home$home_fouls + away$away_fouls,
                        home_fouls = home$home_fouls,
                        away_fouls = away$away_fouls) %>%
  arrange(season, desc(total_fouls)) 
# %>%  group_by(season) %>%  slice(1:3)
  
most_fouls_top3 = by(most_fouls, most_fouls["season"], head, n = 3)
most_fouls_top3

## season: 1
##   season       team total_fouls home_fouls away_fouls
## 1      1 Sunderland         618        286        332
## 2      1      Leeds         589        300        289
## 3      1   Man City         589        284        305
## -------------------------------------------------------- 
## season: 102
##    season      team total_fouls home_fouls away_fouls
## 21    102     Leeds         594        274        320
## 22    102 Blackburn         566        270        296
## 23    102   Arsenal         561        259        302
## -------------------------------------------------------- 
## season: 203
##    season       team total_fouls home_fouls away_fouls
## 41    203    Everton         475        236        239
## 42    203      Leeds         467        221        246
## 43    203 Sunderland         439        250        189
## -------------------------------------------------------- 
## season: 304
##    season        team total_fouls home_fouls away_fouls
## 61    304 Aston Villa         498        240        258
## 62    304   Tottenham         498        233        265
## 63    304  Portsmouth         497        209        288
## -------------------------------------------------------- 
## season: 405
##    season        team total_fouls home_fouls away_fouls
## 81    405 Aston Villa         522        222        300
## 82    405   Blackburn         517        276        241
## 83    405  Birmingham         475        230        245
## -------------------------------------------------------- 
## season: 506
##     season       team total_fouls home_fouls away_fouls
## 101    506  Newcastle         600        310        290
## 102    506  Blackburn         568        300        268
## 103    506 Sunderland         538        257        281
## -------------------------------------------------------- 
## season: 607
##     season     team total_fouls home_fouls away_fouls
## 121    607  Watford         562        265        297
## 122    607 West Ham         534        280        254
## 123    607 Man City         533        274        259
## -------------------------------------------------------- 
## season: 708
##     season          team total_fouls home_fouls away_fouls
## 141    708     Blackburn         531        277        254
## 142    708 Middlesbrough         523        244        279
## 143    708      West Ham         518        257        261
## -------------------------------------------------------- 
## season: 809
##     season      team total_fouls home_fouls away_fouls
## 161    809      Hull         533        256        277
## 162    809 Blackburn         501        240        261
## 163    809    Bolton         497        256        241
## -------------------------------------------------------- 
## season: 910
##     season      team total_fouls home_fouls away_fouls
## 181    910 Blackburn         526        261        265
## 182    910      Hull         520        272        248
## 183    910    Bolton         519        242        277
## -------------------------------------------------------- 
## season: 1011
##     season    team total_fouls home_fouls away_fouls
## 201   1011  Bolton         480        235        245
## 202   1011   Wigan         461        225        236
## 203   1011 Everton         459        230        229
## -------------------------------------------------------- 
## season: 1112
##     season    team total_fouls home_fouls away_fouls
## 221   1112 Everton         449        225        224
## 222   1112   Wigan         447        235        212
## 223   1112  Bolton         429        223        206
## -------------------------------------------------------- 
## season: 1213
##     season     team total_fouls home_fouls away_fouls
## 241   1213 West Ham         445        217        228
## 242   1213 Man City         443        206        237
## 243   1213    Stoke         443        213        230
## -------------------------------------------------------- 
## season: 1314
##     season        team total_fouls home_fouls away_fouls
## 261   1314       Stoke         485        236        249
## 262   1314 Aston Villa         484        234        250
## 263   1314  Sunderland         444        236        208
## -------------------------------------------------------- 
## season: 1415
##     season           team total_fouls home_fouls away_fouls
## 281   1415 Crystal Palace         527        270        257
## 282   1415          Stoke         486        240        246
## 283   1415    Southampton         468        232        236
## -------------------------------------------------------- 
## season: 1516
##     season           team total_fouls home_fouls away_fouls
## 301   1516     Man United         472        230        242
## 302   1516 Crystal Palace         465        215        250
## 303   1516        Watford         460        210        250
## -------------------------------------------------------- 
## season: 1617
##     season          team total_fouls home_fouls away_fouls
## 321   1617       Watford         517        273        244
## 322   1617    Man United         497        249        248
## 323   1617 Middlesbrough         478        210        268
## -------------------------------------------------------- 
## season: 1718
##     season      team total_fouls home_fouls away_fouls
## 341   1718   Everton         459        234        225
## 342   1718   Watford         441        224        217
## 343   1718 West Brom         440        230        210
## -------------------------------------------------------- 
## season: 1819
##     season       team total_fouls home_fouls away_fouls
## 361   1819   Brighton         315        139        176
## 362   1819  Newcastle         284        144        140
## 363   1819 Man United         282        135        147

2. How does the number of yellow cards this season compare to previous seasons?

What does this tell us about how football has changed?

We want to compare the amount of yellow cards considering the same period of time, as the actual season has not finished yet.

season_1819 = df %>%
  filter(season == 1819)
range(as.yearmon(season_1819$date))

## [1] "Aug 2018" "Jan 2019"

August 2018 - January 2019 is the actual period of time we want to use to compare yellow cards between seasons.

Let’s visualize the total amount of yellow cards per season using a barplot. In order to do that, we will first filter the data by months, then group by season and adding the ttoal amount of yellow cards.

names_months = c("August", "September", "October", "November", "December", "January")

comparing_seasons_cards = df %>%
  mutate(total_yellow_cards = home_yellow_cards + away_yellow_cards,
         year_mon = months(date)) %>%
  select(season, year_mon, total_yellow_cards) %>%
  filter(year_mon %in% names_months) %>%
  group_by(season) %>%
  summarise(total_yellow_cards = sum(total_yellow_cards))


cards_compare_plot = ggplot(data = comparing_seasons_cards,
                            mapping = aes(x = season, y = total_yellow_cards)) +
  geom_bar(stat = "identity", width = 0.8, fill = "steelblue2") +
  geom_text(aes(label = total_yellow_cards),
            vjust = 1.6,
            color = "white",
            size = 3.5) +
  xlab("Season") +
  ylab(NULL)

cards_compare_plot

As we can see, this actual season is on average compared with the other ones, for this specific period of time, so we cannot say that football has suffered from any kind of change just looking at the amount of yellow cards.

For seasons 203, 304 and 405, there are some missing registers as seen below. There should be 380 played matches every season. This explains the little gap in the barplot above.

matches_380 = df %>%
  select(season) %>%
  group_by(season) %>%
  summarise(number_of_matches = n())

ggplot(data = matches_380,
       mapping = aes(x = season, y = number_of_matches)) +
  geom_bar(stat = "identity", width = 0.8, fill = "steelblue") +
  geom_text(aes(label = number_of_matches),
            vjust = 1.6,
            color = "white",
            size = 3.5) +
  ylab("Number of matches played") +
  xlab("Season")

3. Which referees would you describe as the most lenient? Why?

The most simple and easy approach is to take a look at the total amount of fouls registered by each referee, in all seasons.

lenient_referees = df %>%
  transmute(referee, total_fouls = home_fouls + away_fouls) %>%
  group_by(referee) %>%
  summarise(total_fouls = sum(total_fouls),
            n_matches_participated = n()) %>%
  arrange(total_fouls)

kable(head(lenient_referees))

referee	total_fouls	n_matches_participated
Ian Harris	22	1
M. L. Dean	23	1
St Bennett	24	1
P. Durkin	25	1
I Williamson	26	1
N Yates	27	1

Those are the most lenient referees (using this approach), but they have participated only in 1 match. These takes us to the next idea:

dens = density(lenient_referees$n_matches_participated)
plot(dens, main = "Density of participation in matches")
polygon(dens, col = "steelblue2", border = "darkblue")

Looking at the density graph above, we can see that most of the referees in the database participated in less than 100 matches. Maybe it would be interesting to create bins / cutoffs for a minimum of, for example, 114 matches (3 seasons).

Doing so, we would get as the most lenient referees the ones shown below:

lenient_referees_114 = lenient_referees %>%
  filter(n_matches_participated > 114) %>%
  head()

kable(lenient_referees_114)

referee	total_fouls	n_matches_participated
N Swarbrick	2762	132
C Pawson	2793	125
P Walton	3588	168
G Poll	3608	130
L Probert	3720	175
R Styles	3736	169

A little further investigation due to curiosity.

plot(x = lenient_referees$n_matches_participated,
     y = lenient_referees$total_fouls,
     type = "p", cex = 2, pch = 20, col = "steelblue2",
     main = "Relation between number of matches participated\nand amount of registered fouls",
     xlab = "Matches",
     ylab = "Fouls")
abline(lm(lenient_referees$total_fouls ~ lenient_referees$n_matches_participated),
       col = "red")

mat = as.matrix(lenient_referees[ ,c("total_fouls", "n_matches_participated")])
kable(cor(mat), caption = "Pearson correlation")

Pearson correlation
	total_fouls	n_matches_participated
total_fouls	1.0000000	0.9963647
n_matches_participated	0.9963647	1.0000000

Wanted to see how the number of fouls increased with the number of matches in which a referee has participated. Looks like there is almost a perfect linear correlation between both variables, which a bit weird in my opinion, as I would expect a high correlation between both of them, but 0.99 is too high. This means we could predict the (almost exact) amount of fouls a referee would register depending in how many matches he has participated, and that is just not real. I would say the data has been manipulated somehow, but that is beyond the scope of this report.

A different (more complete) approach to find the top 6 lineant referees

Instead of using the least amount of registered fouls approach, we could use 3 different approaches:

Ratio of yellow+red cards divided by amount of fouls
Ratio of yellow+red cards divided by amount of matches participated.

The first ratio is biased by the amount of matches participated. For example, two referees with the same ratio but the firt one participating in 5 matches, while the other one in 600.

The second approach is biased because it does not take in account the amount of fouls per match.

The third approach and the one that will be shown below (although the first and second one are calculated too) should be the most complete, as it takes in account the number of matches, number of fouls and number of cards. The formula is ((cards/fouls)/matches)*1000. This ratio resuts to be very small. Multiplying it by 1000 allows for a better understanding of a “ranking parameter”.

referees = df %>%
  select(season, date, referee, home_fouls, away_fouls, home_yellow_cards, home_red_cards) %>%
  mutate(total_fouls = home_fouls + away_fouls,
         total_cards = home_yellow_cards + home_red_cards)


ratio_cards = referees %>%
  select(referee, total_cards, total_fouls) %>%
  group_by(referee) %>%
  summarise(total_cards = sum(total_cards),
            
            total_fouls = sum(total_fouls),
            
            cards_x_foul = formattable(x = (total_cards/total_fouls)*50,
                                       digits = 1,
                                       format = "f"),
            
            n_matches_participated = n(),
            
            cards_x_match = formattable(x = total_cards/n_matches_participated,
                                     digits = 1,
                                     format = "f"),
            
            cards_xfouls_xmatches = formattable(x = ((total_cards/total_fouls)/n_matches_participated)*1000,
                                     digits = 1,
                                     format = "f"))

kable(ratio_cards[order(ratio_cards$cards_xfouls_xmatches, decreasing = F)[1:6], ])

referee	total_fouls	n_matches_participated
A Madley	33	2
Ian Harris	22	1
J.T. Winter	32	1
P. Durkin	25	1
P.A. Durkin	29	1
R Beeby	55	2

This time the ratio stays 0.0 because the amount of cards shown are 0. It means that, when calculating 0/‘something else’, the ratio stays 0. Let’s just filter the data by at least 5 cards and 114 matches, in order to get something more realistic.

most_lineant_approach3 = ratio_cards %>%
  select(referee, total_cards, total_fouls, cards_xfouls_xmatches, n_matches_participated) %>%
  filter(total_cards > 5 & n_matches_participated > 114) %>%
  arrange(cards_xfouls_xmatches)

kable(head(most_lineant_approach3))

referee	total_cards	total_fouls	cards_xfouls_xmatches	n_matches_participated
M Halsey	185	4690	0.2	217
M Dean	792	9789	0.2	431
M Atkinson	574	8330	0.2	365
P Dowd	449	7439	0.2	295
H Webb	430	7021	0.2	294
C Foy	303	5711	0.2	254

This time, we get the most lineant referees as the ones shown in the table.