Premier League study
Premier League study
- Data Science Task
- 1. For each season, which team(s) committed the most fouls?
- 2. How does the number of yellow cards this season compare to previous seasons?
- 3. Which referees would you describe as the most lenient? Why?
- A little further investigation due to curiosity.
- A different (more complete) approach to find the top 6 lineant referees
Loading data
glimpse(df)## Observations: 6,926
## Variables: 14
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 11...
## $ home_team <fct> Charlton, Chelsea, Coventry, Derby, Leeds, L...
## $ away_team <fct> Man City, West Ham, Middlesbrough, Southampt...
## $ date <fct> 2000-08-19, 2000-08-19, 2000-08-19, 2000-08-...
## $ home_goals <int> 4, 4, 1, 2, 2, 0, 1, 1, 3, 2, 2, 1, 1, 2, 3,...
## $ away_goals <int> 0, 2, 3, 2, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,...
## $ referee <fct> Rob Harris, Graham Barber, Barry Knight, And...
## $ home_fouls <int> 13, 19, 15, 11, 21, 12, 8, 10, 14, 7, 25, 10...
## $ away_fouls <int> 12, 14, 21, 13, 20, 12, 8, 21, 13, 13, 20, 7...
## $ home_yellow_cards <int> 1, 1, 5, 1, 1, 2, 1, 3, 0, 0, 2, 1, 2, 0, 2,...
## $ away_yellow_cards <int> 2, 2, 3, 1, 3, 3, 1, 1, 0, 1, 4, 4, 1, 1, 1,...
## $ home_red_cards <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ away_red_cards <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 1,...
## $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
First “To do” list of things:
- Delete ‘x’ attribute. Not needed.
- Format data attributes porperly.
- date: As date, not factor.
- season: As factor, not integer.
Selecting all attributes but ‘x’
df = df[,2:ncol(df)]
names(df)## [1] "home_team" "away_team" "date"
## [4] "home_goals" "away_goals" "referee"
## [7] "home_fouls" "away_fouls" "home_yellow_cards"
## [10] "away_yellow_cards" "home_red_cards" "away_red_cards"
## [13] "season"
Formatting data
df$date = as_date(df$date)
df$season = factor(df$season)
glimpse(df[c("date", "season")])## Observations: 6,926
## Variables: 2
## $ date <date> 2000-08-19, 2000-08-19, 2000-08-19, 2000-08-19, 2000-0...
## $ season <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
Data Science Task
1. For each season, which team(s) committed the most fouls?
Let’s take a look at the Top 3 teams that commited the most fouls for each season.
home = df %>%
select(home_team, home_fouls, season) %>%
group_by(season, home_team) %>%
summarise(home_fouls = sum(home_fouls))
away = df %>%
select(away_team, away_fouls, season) %>%
group_by(season, away_team) %>%
summarise(away_fouls = sum(away_fouls))
most_fouls = data.frame(season = home$season,
team = home$home_team,
total_fouls = home$home_fouls + away$away_fouls,
home_fouls = home$home_fouls,
away_fouls = away$away_fouls) %>%
arrange(season, desc(total_fouls))
# %>% group_by(season) %>% slice(1:3)
most_fouls_top3 = by(most_fouls, most_fouls["season"], head, n = 3)
most_fouls_top3## season: 1
## season team total_fouls home_fouls away_fouls
## 1 1 Sunderland 618 286 332
## 2 1 Leeds 589 300 289
## 3 1 Man City 589 284 305
## --------------------------------------------------------
## season: 102
## season team total_fouls home_fouls away_fouls
## 21 102 Leeds 594 274 320
## 22 102 Blackburn 566 270 296
## 23 102 Arsenal 561 259 302
## --------------------------------------------------------
## season: 203
## season team total_fouls home_fouls away_fouls
## 41 203 Everton 475 236 239
## 42 203 Leeds 467 221 246
## 43 203 Sunderland 439 250 189
## --------------------------------------------------------
## season: 304
## season team total_fouls home_fouls away_fouls
## 61 304 Aston Villa 498 240 258
## 62 304 Tottenham 498 233 265
## 63 304 Portsmouth 497 209 288
## --------------------------------------------------------
## season: 405
## season team total_fouls home_fouls away_fouls
## 81 405 Aston Villa 522 222 300
## 82 405 Blackburn 517 276 241
## 83 405 Birmingham 475 230 245
## --------------------------------------------------------
## season: 506
## season team total_fouls home_fouls away_fouls
## 101 506 Newcastle 600 310 290
## 102 506 Blackburn 568 300 268
## 103 506 Sunderland 538 257 281
## --------------------------------------------------------
## season: 607
## season team total_fouls home_fouls away_fouls
## 121 607 Watford 562 265 297
## 122 607 West Ham 534 280 254
## 123 607 Man City 533 274 259
## --------------------------------------------------------
## season: 708
## season team total_fouls home_fouls away_fouls
## 141 708 Blackburn 531 277 254
## 142 708 Middlesbrough 523 244 279
## 143 708 West Ham 518 257 261
## --------------------------------------------------------
## season: 809
## season team total_fouls home_fouls away_fouls
## 161 809 Hull 533 256 277
## 162 809 Blackburn 501 240 261
## 163 809 Bolton 497 256 241
## --------------------------------------------------------
## season: 910
## season team total_fouls home_fouls away_fouls
## 181 910 Blackburn 526 261 265
## 182 910 Hull 520 272 248
## 183 910 Bolton 519 242 277
## --------------------------------------------------------
## season: 1011
## season team total_fouls home_fouls away_fouls
## 201 1011 Bolton 480 235 245
## 202 1011 Wigan 461 225 236
## 203 1011 Everton 459 230 229
## --------------------------------------------------------
## season: 1112
## season team total_fouls home_fouls away_fouls
## 221 1112 Everton 449 225 224
## 222 1112 Wigan 447 235 212
## 223 1112 Bolton 429 223 206
## --------------------------------------------------------
## season: 1213
## season team total_fouls home_fouls away_fouls
## 241 1213 West Ham 445 217 228
## 242 1213 Man City 443 206 237
## 243 1213 Stoke 443 213 230
## --------------------------------------------------------
## season: 1314
## season team total_fouls home_fouls away_fouls
## 261 1314 Stoke 485 236 249
## 262 1314 Aston Villa 484 234 250
## 263 1314 Sunderland 444 236 208
## --------------------------------------------------------
## season: 1415
## season team total_fouls home_fouls away_fouls
## 281 1415 Crystal Palace 527 270 257
## 282 1415 Stoke 486 240 246
## 283 1415 Southampton 468 232 236
## --------------------------------------------------------
## season: 1516
## season team total_fouls home_fouls away_fouls
## 301 1516 Man United 472 230 242
## 302 1516 Crystal Palace 465 215 250
## 303 1516 Watford 460 210 250
## --------------------------------------------------------
## season: 1617
## season team total_fouls home_fouls away_fouls
## 321 1617 Watford 517 273 244
## 322 1617 Man United 497 249 248
## 323 1617 Middlesbrough 478 210 268
## --------------------------------------------------------
## season: 1718
## season team total_fouls home_fouls away_fouls
## 341 1718 Everton 459 234 225
## 342 1718 Watford 441 224 217
## 343 1718 West Brom 440 230 210
## --------------------------------------------------------
## season: 1819
## season team total_fouls home_fouls away_fouls
## 361 1819 Brighton 315 139 176
## 362 1819 Newcastle 284 144 140
## 363 1819 Man United 282 135 147
2. How does the number of yellow cards this season compare to previous seasons?
What does this tell us about how football has changed?
We want to compare the amount of yellow cards considering the same period of time, as the actual season has not finished yet.
season_1819 = df %>%
filter(season == 1819)
range(as.yearmon(season_1819$date))## [1] "Aug 2018" "Jan 2019"
August 2018 - January 2019 is the actual period of time we want to use to compare yellow cards between seasons.
Let’s visualize the total amount of yellow cards per season using a barplot. In order to do that, we will first filter the data by months, then group by season and adding the ttoal amount of yellow cards.
names_months = c("August", "September", "October", "November", "December", "January")
comparing_seasons_cards = df %>%
mutate(total_yellow_cards = home_yellow_cards + away_yellow_cards,
year_mon = months(date)) %>%
select(season, year_mon, total_yellow_cards) %>%
filter(year_mon %in% names_months) %>%
group_by(season) %>%
summarise(total_yellow_cards = sum(total_yellow_cards))
cards_compare_plot = ggplot(data = comparing_seasons_cards,
mapping = aes(x = season, y = total_yellow_cards)) +
geom_bar(stat = "identity", width = 0.8, fill = "steelblue2") +
geom_text(aes(label = total_yellow_cards),
vjust = 1.6,
color = "white",
size = 3.5) +
xlab("Season") +
ylab(NULL)
cards_compare_plotAs we can see, this actual season is on average compared with the other ones, for this specific period of time, so we cannot say that football has suffered from any kind of change just looking at the amount of yellow cards.
For seasons 203, 304 and 405, there are some missing registers as seen below. There should be 380 played matches every season. This explains the little gap in the barplot above.
matches_380 = df %>%
select(season) %>%
group_by(season) %>%
summarise(number_of_matches = n())
ggplot(data = matches_380,
mapping = aes(x = season, y = number_of_matches)) +
geom_bar(stat = "identity", width = 0.8, fill = "steelblue") +
geom_text(aes(label = number_of_matches),
vjust = 1.6,
color = "white",
size = 3.5) +
ylab("Number of matches played") +
xlab("Season")3. Which referees would you describe as the most lenient? Why?
The most simple and easy approach is to take a look at the total amount of fouls registered by each referee, in all seasons.
lenient_referees = df %>%
transmute(referee, total_fouls = home_fouls + away_fouls) %>%
group_by(referee) %>%
summarise(total_fouls = sum(total_fouls),
n_matches_participated = n()) %>%
arrange(total_fouls)
kable(head(lenient_referees))| referee | total_fouls | n_matches_participated |
|---|---|---|
| Ian Harris | 22 | 1 |
| M. L. Dean | 23 | 1 |
| St Bennett | 24 | 1 |
| P. Durkin | 25 | 1 |
| I Williamson | 26 | 1 |
| N Yates | 27 | 1 |
Those are the most lenient referees (using this approach), but they have participated only in 1 match. These takes us to the next idea:
dens = density(lenient_referees$n_matches_participated)
plot(dens, main = "Density of participation in matches")
polygon(dens, col = "steelblue2", border = "darkblue")Looking at the density graph above, we can see that most of the referees in the database participated in less than 100 matches. Maybe it would be interesting to create bins / cutoffs for a minimum of, for example, 114 matches (3 seasons).
Doing so, we would get as the most lenient referees the ones shown below:
lenient_referees_114 = lenient_referees %>%
filter(n_matches_participated > 114) %>%
head()
kable(lenient_referees_114)| referee | total_fouls | n_matches_participated |
|---|---|---|
| N Swarbrick | 2762 | 132 |
| C Pawson | 2793 | 125 |
| P Walton | 3588 | 168 |
| G Poll | 3608 | 130 |
| L Probert | 3720 | 175 |
| R Styles | 3736 | 169 |
A little further investigation due to curiosity.
plot(x = lenient_referees$n_matches_participated,
y = lenient_referees$total_fouls,
type = "p", cex = 2, pch = 20, col = "steelblue2",
main = "Relation between number of matches participated\nand amount of registered fouls",
xlab = "Matches",
ylab = "Fouls")
abline(lm(lenient_referees$total_fouls ~ lenient_referees$n_matches_participated),
col = "red")mat = as.matrix(lenient_referees[ ,c("total_fouls", "n_matches_participated")])
kable(cor(mat), caption = "Pearson correlation")| total_fouls | n_matches_participated | |
|---|---|---|
| total_fouls | 1.0000000 | 0.9963647 |
| n_matches_participated | 0.9963647 | 1.0000000 |
Wanted to see how the number of fouls increased with the number of matches in which a referee has participated. Looks like there is almost a perfect linear correlation between both variables, which a bit weird in my opinion, as I would expect a high correlation between both of them, but 0.99 is too high. This means we could predict the (almost exact) amount of fouls a referee would register depending in how many matches he has participated, and that is just not real. I would say the data has been manipulated somehow, but that is beyond the scope of this report.
A different (more complete) approach to find the top 6 lineant referees
Instead of using the least amount of registered fouls approach, we could use 3 different approaches:
- Ratio of
yellow+red cardsdivided byamount of fouls - Ratio of
yellow+red cardsdivided by amount ofmatches participated.
The first ratio is biased by the amount of matches participated. For example, two referees with the same ratio but the firt one participating in 5 matches, while the other one in 600.
The second approach is biased because it does not take in account the amount of fouls per match.
The third approach and the one that will be shown below (although the first and second one are calculated too) should be the most complete, as it takes in account the number of matches, number of fouls and number of cards. The formula is ((cards/fouls)/matches)*1000. This ratio resuts to be very small. Multiplying it by 1000 allows for a better understanding of a “ranking parameter”.
referees = df %>%
select(season, date, referee, home_fouls, away_fouls, home_yellow_cards, home_red_cards) %>%
mutate(total_fouls = home_fouls + away_fouls,
total_cards = home_yellow_cards + home_red_cards)
ratio_cards = referees %>%
select(referee, total_cards, total_fouls) %>%
group_by(referee) %>%
summarise(total_cards = sum(total_cards),
total_fouls = sum(total_fouls),
cards_x_foul = formattable(x = (total_cards/total_fouls)*50,
digits = 1,
format = "f"),
n_matches_participated = n(),
cards_x_match = formattable(x = total_cards/n_matches_participated,
digits = 1,
format = "f"),
cards_xfouls_xmatches = formattable(x = ((total_cards/total_fouls)/n_matches_participated)*1000,
digits = 1,
format = "f"))
kable(ratio_cards[order(ratio_cards$cards_xfouls_xmatches, decreasing = F)[1:6], ])| referee | total_cards | total_fouls | cards_x_foul | n_matches_participated | cards_x_match | cards_xfouls_xmatches |
|---|---|---|---|---|---|---|
| A Madley | 0 | 33 | 0.0 | 2 | 0.0 | 0.0 |
| Ian Harris | 0 | 22 | 0.0 | 1 | 0.0 | 0.0 |
| J.T. Winter | 0 | 32 | 0.0 | 1 | 0.0 | 0.0 |
| P. Durkin | 0 | 25 | 0.0 | 1 | 0.0 | 0.0 |
| P.A. Durkin | 0 | 29 | 0.0 | 1 | 0.0 | 0.0 |
| R Beeby | 0 | 55 | 0.0 | 2 | 0.0 | 0.0 |
This time the ratio stays 0.0 because the amount of cards shown are 0. It means that, when calculating 0/‘something else’, the ratio stays 0. Let’s just filter the data by at least 5 cards and 114 matches, in order to get something more realistic.
most_lineant_approach3 = ratio_cards %>%
select(referee, total_cards, total_fouls, cards_xfouls_xmatches, n_matches_participated) %>%
filter(total_cards > 5 & n_matches_participated > 114) %>%
arrange(cards_xfouls_xmatches)
kable(head(most_lineant_approach3))| referee | total_cards | total_fouls | cards_xfouls_xmatches | n_matches_participated |
|---|---|---|---|---|
| M Halsey | 185 | 4690 | 0.2 | 217 |
| M Dean | 792 | 9789 | 0.2 | 431 |
| M Atkinson | 574 | 8330 | 0.2 | 365 |
| P Dowd | 449 | 7439 | 0.2 | 295 |
| H Webb | 430 | 7021 | 0.2 | 294 |
| C Foy | 303 | 5711 | 0.2 | 254 |
This time, we get the most lineant referees as the ones shown in the table.