Footballomics: Estimating League Disparity Performance with a Point-Rank Gini Index

Rank Gini coefficient

Last time we talked about Footballomics (the analysis of football data), we discussed a marked disparity in how a certain Premier League club (Liverpool FC) performs against top competitors as opposed to lesser opponents. In that post we saw that LFC were doing remarkably well against better teams, whereas most clubs do better against inferior teams (as expected) and a few do equally well against all (PL leaders Chelsea are a notable example). The observed disparity prompted me to try to summarise this trend, and its fluctuation among clubs, with a single value. The difference between top10/bottom10 or even top6/bottom6 may not be very useful, since the margins depend on how skewed the point distribution is in a given league. Some leagues are tight, while others (e.g. the Spanish or the German, not to mention the Greek) are one- or two-horse races.

A concept that may be useful here is the Gini coefficient. For the uninitiated, the Gini coefficient is a measure of statistical dispersion. First introduced by Corrado Gini in the early 20th century, it earned significant attention around the turn of the 21st, since it can be used to describe distribution disparity, as it repeatedly has in the case of income distributions. In a nutshell, the Gini index tells you whether a certain quantity is distributed evenly or in a highly skewed manner. If all N citizens of country X share its GDP equally, each earning GDP/N, the Gini is 0, while in the (much more likely) case that one person earns all the GDP, leaving 0 to everybody else, the Gini index is 1. Real-life Gini coefficients for income mostly range between 0.30 and 0.70.
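To make the two extremes concrete, here is a minimal sketch in R. The helper name gini_coef is made up for this illustration (it is not from any package) and uses the usual discrete formula on a sorted vector. Note that for a finite population of n people the "one person takes all" case gives (n-1)/n, which only approaches 1 as n grows.

```r
# Hypothetical helper (not from a package): discrete Gini coefficient
# of a non-negative numeric vector, via cumulative shares (Lorenz curve).
gini_coef <- function(x) {
  x <- sort(x)                      # order from poorest to richest
  n <- length(x)
  lorenz <- cumsum(x) / sum(x)      # cumulative share of the total
  (n + 1 - 2 * sum(lorenz)) / n
}

equal_incomes <- rep(100, 10)         # everyone earns the same
one_takes_all <- c(rep(0, 9), 1000)   # one person earns everything

gini_equal <- gini_coef(equal_incomes)   # 0
gini_one   <- gini_coef(one_takes_all)   # 0.9, i.e. (n-1)/n for n = 10
```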

The Question: How can we apply a Gini coefficient in the case of football league performance?

We will start from where we left off last time, by creating a table of points earned for each PL club as one moves across the table from top to bottom. You can go back to the previous Footballomics post or simply get and run the code from this link.

source("Fomics_01_code.R")
head(rankedpoints)

##                   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## Chelsea           0 3 3 1 3 3 3 3 6  3  3  6  4  3  6  3  4  6  3  3
## Tottenham         3 0 4 1 0 1 4 4 6  1  6  3  3  3  1  3  3  3  6  4
## Manchester City   0 1 0 1 3 3 1 3 4  6  1  6  6  3  0  3  6  3  1  6
## Liverpool         4 4 4 0 2 6 3 3 3  0  1  1  3  3  3  3  3  3  3  4
## Manchester United 0 3 0 2 0 1 1 3 2  4  3  4  1  3  6  3  3  4  6  3
## Arsenal           3 1 0 0 1 0 0 3 3  4  3  3  6  3  1  3  6  6  1  3

We first need to figure out what the “income” will be in this case. The “commodity” we measure is surely the points, but we need a good way to define the disparity that we already know exists at this level. A first, naive approximation would look at the percentage of points collected by each club over all available points. The total “available” points is the sum of our rankedpoints table:

sum(rankedpoints)

## [1] 785

of which the percentages earned by each team are:

par(mar=c(5,10,2,3));
barplot(rev(rowSums(rankedpoints)/sum(rankedpoints)), horiz=T, names=rev(rownames(rankedpoints)), las=1)

In order to better see the disparity, we can imagine a 100% “egalitarian” (also: 1000% boring) Premier League table where each club earns an equal amount of points. This would equal the sum divided by the number of teams:

eqpoint=sum(rankedpoints)/20
eqpoint

## [1] 39.25

In this way, 39.25 is the number of points each club would earn in an egalitarian PL. If we were to plot the cumulative distribution of earned points:

plot(1:20, cumsum(rep(eqpoint,20))/sum(rankedpoints), type="l", col="dark grey", lwd=3, pch=16,  las=1, ylab="Cumulative % points earned", xlab="classification (low-to-high)")
lines(1:20, cumsum(rev(rowSums(rankedpoints)/sum(rankedpoints))), col="blue", lwd=3, type="l")

What the plot above shows is the expected “egalitarian PL” (grey line) and the actual PL (blue), where obviously some clubs get a greater share of the total points. The League’s Gini index can be calculated from the area between the two curves, normalized by the number of clubs, which gives a value of:

expected<-cumsum(rep(eqpoint,20))/sum(rankedpoints)
real<-(cumsum(rev(rowSums(rankedpoints)/sum(rankedpoints))))
gini<-(sum(expected-real))/20
gini

## [1] 0.09786624

A Gini Index of ~0.098 is far from “unequal” but then again, nobody expected the income distribution of Lesotho in a league of 20 millionaire clubs.
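A note on scale: the quantity computed above is the mean vertical gap between the two curves, which works out to exactly half of the conventional discrete Gini coefficient (the textbook version normalises by the area under the equality line). The sketch below checks this on made-up season totals for five hypothetical clubs; the variable names are mine.

```r
# Toy season totals for five hypothetical clubs, sorted low-to-high
points <- c(20, 35, 40, 55, 90)
n <- length(points)

expected <- (1:n) / n                     # egalitarian cumulative share
real     <- cumsum(points) / sum(points)  # actual cumulative share

mean_gap      <- sum(expected - real) / n      # the quantity used in this post
standard_gini <- (n + 1 - 2 * sum(real)) / n   # conventional discrete Gini

c(mean_gap = mean_gap, standard_gini = standard_gini)
```

So, assuming the rows of rankedpoints are ordered by total points, the league value of ~0.098 corresponds to a conventional Gini of roughly 0.196. Since the factor of two is the same for everybody, comparisons between clubs or leagues are unaffected.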

Gini disparity per club

We are still far, though, from our initial goal of calculating a Gini-like disparity index per club instead of one for the whole league. (Actually, the League Gini would still be a good benchmark for league interest: how close the title and relegation races may be, how tight the table is, etc., but all that is a different story.) We could use a similar approach on a per-club basis and compare the cumulative percentage of earned points, as we move from top to bottom, to an “expected” distribution in which the points earned against each position equal N/20 (N being the club’s total points tally). We could then apply the Gini calculation we saw above directly, like this:

#for Liverpool (position=4)
i<-4;
avpoints<-PLtable$PTotal[i]/20
expected<-cumsum(rep(avpoints,20))/sum(rankedpoints[i,])
real<-cumsum((c(rankedpoints[i,])))/sum(rankedpoints[i,])
plot(1:20, expected, type="l", col="dark grey", lwd=3, pch=16,  las=1, ylab="Cumulative % points earned", xlab="classification (high-to-low)")
lines(1:20, real, col="dark red", lwd=3, type="l")

Liverpool’s plot defies the norm in the sense that it lies above the expected line for most of the first part of the table (notice that in the per-club analysis the x-axis denotes the actual classification, with Chelsea at 1, Spurs at 2, etc.). This view does not tell us much that we didn’t already know, but it gives us a way to summarise the disparity with one value.

gini<-(sum(expected-real))/20
gini

## [1] -0.00625

We see that this modified Gini index is negative for Liverpool because of this special property of theirs: having earned more points from the “rich” only to give them back to the poor further down the table (the “Robin Hoods” of the League). In this sense, even with a distorted top-bottom pattern, Liverpool are very close to their “egalitarian” expectation. But how do the rest of the teams fare in this respect? We can calculate this modified Gini value for all clubs with a simple loop over the code above:

gini<-vector(mode="numeric", length=20)
for (i in 1:20){
  avpoints<-PLtable$PTotal[i]/20
  expected<-cumsum(rep(avpoints,20))/sum(rankedpoints[i,])
  real<-cumsum((c(rankedpoints[i,])))/sum(rankedpoints[i,])
  gini[i]<-(sum(expected-real))/20
}
gini<-data.frame(Team=PLtable$Team, Gini=gini)
gini

##                 Team        Gini
## 1            Chelsea  0.06268116
## 2          Tottenham  0.06313559
## 3    Manchester City  0.09956140
## 4          Liverpool -0.00625000
## 5  Manchester United  0.12884615
## 6            Arsenal  0.11700000
## 7            Everton  0.08900000
## 8          West Brom  0.15174419
## 9         Stoke City  0.22222222
## 10       Bournemouth  0.04772727
## 11       Southampton  0.15075758
## 12          West Ham  0.20075758
## 13           Burnley  0.11562500
## 14           Watford  0.08790323
## 15         Leicester  0.06000000
## 16    Crystal Palace  0.15178571
## 17           Swansea  0.11574074
## 18         Hull City  0.06666667
## 19     Middlesbrough  0.13863636
## 20        Sunderland  0.11750000
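As a cross-check, the same per-club value can be computed without PLtable, straight from the six rows printed by head(rankedpoints) earlier. This is only a sketch: the helper name club_gap is made up, and it assumes that each club's PTotal equals its row sum, in which case the flat expectation is simply i/20.

```r
# First six rows of rankedpoints, copied from the head() output above
top6 <- rbind(
  Chelsea             = c(0,3,3,1,3,3,3,3,6,3,3,6,4,3,6,3,4,6,3,3),
  Tottenham           = c(3,0,4,1,0,1,4,4,6,1,6,3,3,3,1,3,3,3,6,4),
  "Manchester City"   = c(0,1,0,1,3,3,1,3,4,6,1,6,6,3,0,3,6,3,1,6),
  Liverpool           = c(4,4,4,0,2,6,3,3,3,0,1,1,3,3,3,3,3,3,3,4),
  "Manchester United" = c(0,3,0,2,0,1,1,3,2,4,3,4,1,3,6,3,3,4,6,3),
  Arsenal             = c(3,1,0,0,1,0,0,3,3,4,3,3,6,3,1,3,6,6,1,3)
)

# Hypothetical helper: mean gap between the flat expectation and
# the club's actual cumulative share of points, from top to bottom
club_gap <- function(pts) {
  n <- length(pts)
  expected <- (1:n) / n
  real <- cumsum(pts) / sum(pts)
  sum(expected - real) / n
}

gap <- apply(top6, 1, club_gap)   # one value per row (club)
round(gap, 4)
```

The values agree with the first six rows of the table above, with Liverpool the only negative one.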

We thus recapitulate the results of our previous analysis, with Liverpool being the team that fares best against the top clubs and worst against the bottom ones, and with Man Utd and Arsenal doing better against the bottom ones. Let’s take a look at the Point-Rank Gini plots for three sides of interest:

par(mfrow=c(1,3))
# table position, panel title and line colour for each club of interest
pos<-c(1,4,5)
titles<-c("Chelsea","Liverpool","Man Utd")
cols<-c("blue","dark red","red")
for (k in 1:3){
  i<-pos[k]
  avpoints<-PLtable$PTotal[i]/20
  expected<-cumsum(rep(avpoints,20))/sum(rankedpoints[i,])
  real<-cumsum((c(rankedpoints[i,])))/sum(rankedpoints[i,])
  plot(1:20, expected, type="l", col="dark grey", lwd=3, pch=16,  las=1, ylab="Cumulative % points earned", xlab="classification (high-to-low)", main=titles[k], cex.main=1.5)
  lines(1:20, real, col=cols[k], lwd=3, type="l")
}

The differences between the three are more than notable.

Point-Rank Gini Index: A realistic Gini index per club

In the analysis described above we made the simplistic assumption that each club is expected to earn an equal number of points from every other club. This is of course not true, as it is harder to get points from Man City than from Sunderland (at least in the strict sense of club quality, setting aside the increased need for points when you are about to be relegated). We can account for that by using our table of “points on offer” by each club to correct the “expected” points. For example, Liverpool have earned 56 points, which broken down per club (again not accounting for the fact that they have not yet played everyone twice) gives an average of 2.8 points. But it would be oversimplifying to use this figure for all opponents when we can scale it by the percentage of points each club gives away. For instance, of the total points given away in the league (the “wealth”), Chelsea have given away only 1.52%, while Sunderland account for more than 7.5%. We can use these values to scale the expected points against each opponent. We simply need to modify our expected points vector, which in the case of Liverpool looks like this:

scale.factor<-colSums(rankedpoints)/sum(rankedpoints)
i<-4
expected<-cumsum(scale.factor*PLtable$PTotal[i])/PLtable$PTotal[i]
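One thing worth noticing about this formula is that the club's own total cancels out, so the scaled expected curve is the same for every club: it is simply the cumulative share of points conceded by each league position. The sketch below illustrates this on a made-up 4-club table (the numbers and the ptotal vector, standing in for PLtable$PTotal, are assumptions for illustration only).

```r
# Toy 4-club table (hypothetical numbers): rows = club earning the points,
# columns = opponent by league position; the diagonal is 0
toy <- matrix(c(0, 3, 4, 6,
                1, 0, 3, 4,
                1, 3, 0, 6,
                0, 1, 0, 0), nrow = 4, byrow = TRUE)

scale.factor <- colSums(toy) / sum(toy)   # share of all points conceded by each position
ptotal <- rowSums(toy)                    # stands in for PLtable$PTotal

i <- 2
expected <- cumsum(scale.factor * ptotal[i]) / ptotal[i]   # formula from the post
real     <- cumsum(toy[i, ]) / sum(toy[i, ])               # club's actual cumulative share

all.equal(expected, cumsum(scale.factor))   # TRUE: ptotal[i] cancels out
```

What differs from club to club is therefore only the real curve, which is compared against this common, league-wide expectation.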

Let’s use these improved expected points to create Gini index plots for the three clubs we saw above:

The situation is even more pronounced here. Liverpool overperform in the first half of the table, earning many more points than expected, then fall back towards the expected curve quickly after position 8 and keep converging steadily all the way to the bottom. The case for leaders Chelsea is rather different: they stay very close to the expected curve throughout the table (consistency), while a club like Mourinho’s Man Utd is better at “closing deals” with lesser teams, lying consistently below the expected value until the end of the table. We can use this improved scheme to rank teams according to their differential potential against top/bottom league teams:
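The code for this ranking step is not reproduced here, so what follows is only a minimal sketch of how it might look, assuming the same mean-gap summary as before but with the scaled expectation. It runs on a made-up 4-club table rather than the real data, and the name pr_gini is mine.

```r
# Toy 4-club table (hypothetical numbers, not the real PL data)
toy <- matrix(c(0, 3, 4, 6,
                1, 0, 3, 4,
                1, 3, 0, 6,
                0, 1, 0, 0), nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D"), as.character(1:4)))

scale.factor <- colSums(toy) / sum(toy)
expected <- cumsum(scale.factor)   # scaled expectation, common to all clubs

# Mean gap between expected and actual cumulative share, per club
pr_gini <- apply(toy, 1, function(pts) {
  real <- cumsum(pts) / sum(pts)
  sum(expected - real) / length(pts)
})

# Most negative first: clubs that do relatively better against the top
ranking <- sort(pr_gini)
names(ranking)
```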

In the ranking for the actual PL data we see that, besides Liverpool, Bournemouth and Leicester do rather well in this context, while on the other hand Crystal Palace, West Ham and Stoke in particular do rather badly. Notice again how this ranking is not directly related to the clubs’ classification in the table. In fact, the two are pretty much uncorrelated:

cor.test(gini[,2], order(PLtable$PTotal), method="kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  gini[, 2] and order(PLtable$PTotal)
## T = 76, p-value = 0.2333
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##  tau 
## -0.2

Here we have used a rank correlation coefficient (Kendall’s tau).
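A small caveat on the call above: order() returns the permutation that sorts a vector, not the rank of each element. The two coincide when the vector is already sorted, as a league table without ties is, but not in general. A toy illustration with made-up numbers:

```r
pts_sorted   <- c(30, 20, 10)   # already in table order (descending)
pts_unsorted <- c(30, 10, 20)

order(pts_sorted);   rank(pts_sorted)     # both give 3 2 1
order(pts_unsorted); rank(pts_unsorted)   # 2 3 1 versus 3 1 2

# Kendall's tau on a tiny example: 5 concordant pairs, 1 discordant, out of 6
tau <- cor(c(1, 2, 3, 4), c(1, 3, 2, 4), method = "kendall")
tau   # (5 - 1) / 6
```

If PLtable were not sorted by points, or contained ties, rank(PLtable$PTotal) would be the safer choice.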

Clustering of PL clubs according to Point-Rank Gini Index

As a last step, we can try to see which clubs are similar to each other in the way they deal with opposition, based on the opposition’s rank in the league. The Point-Rank Gini Index itself can be informative, but it is better to look at the difference between expected and real across the classification for each team. We can do this with a heatmap, like last time, which also clusters the clubs according to the shape of their performance.

scale.factor<-colSums(rankedpoints)/sum(rankedpoints)
ginimat<-matrix(0, nrow=20, ncol=20)
for (i in 1:20){
  expected<-cumsum(scale.factor*PLtable$PTotal[i])/PLtable$PTotal[i]
  real<-cumsum((c(rankedpoints[i,])))/sum(rankedpoints[i,])
  ginimat[i,]<-(expected-real)
}
ginimat<-as.data.frame(ginimat, row.names = as.character(PLtable$Team))

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

Again, Liverpool are marked outliers, while Stoke and West Ham lie at the other end of the spectrum, earning their place in the table by being consistent against lesser opposition but consistently failing against top sides. Notice how the clubs are roughly split into two halves. One (bottom of the tree) overperforms against better opposition and includes the Cherries, Leicester, Hull, Watford and Everton alongside consistent clubs Spurs and Chelsea and, to a lesser extent, Man City. The other half does better against teams at the bottom of the table and includes many mid-table clubs (Stoke, West Ham, West Brom and the Saints) alongside underachievers Man Utd and Arsenal.
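For readers who want to see the "two halves" idea in isolation, here is a minimal sketch of the clustering step using only base R. It is an illustration on made-up expected-minus-real curves for four hypothetical clubs, not the heatmap code used for the figure. It relies on hclust() and dist() with their defaults (complete linkage on Euclidean distance), which is also how base R's heatmap() orders its rows by default.

```r
# Toy expected-minus-real curves for four hypothetical clubs (rows);
# negative values = ahead of expectation early on (better against the top)
gaps <- rbind(
  A = c(-0.10, -0.15, -0.08, 0),
  B = c(-0.09, -0.12, -0.07, 0),
  C = c( 0.12,  0.15,  0.10, 0),
  D = c( 0.10,  0.18,  0.09, 0)
)

tree   <- hclust(dist(gaps))    # Euclidean distance, complete linkage (defaults)
groups <- cutree(tree, k = 2)   # cut the tree into two halves
groups
```

Clubs A and B (ahead of expectation against the top) land in one group, C and D in the other, mirroring the split described above.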