Being in science combines a number of rewarding activities that make the daily working routine fullfilling in many ways. You get to learn new things everyday, you (sometimes) even understand how things work (or even better how nature works) and you get to interact, through teaching, with young people that are full of contageous optimism and aspiration. Last but not least, you have the freedom to make your own working schedule and, more often than in other jobs, find time to apply what you learn in things you were always curious about.
And I am, I ’ve always been, curious (nay! crazy) about football in all aspects of it. Playing it on the street, and whatching it on TV, talking, thinking and dreaming about it. Over the years, job, family and age have caught up with me and thus I have now grown into a more mature way of appreciating the “beautiful game”, from worshiping players to admiring managers and from chanting on the stands to reading about football tactics. My professional involvement with data analysis and statistics, has also lead to my developing of a more “quantitative” approach about football and in this sense I have always wanted to try to use some of the simple (or not simple) principles of my everyday work routine, (which include making sense of data related to biological problems) to more “mundane” questions regarding football. In this, my first ever, attempt to analyze football data, I took the opportunity (OK, I took advantage) of teaching a (hopefully) interesting graduate class on “R for Bioinformatics” at the University of Crete’s, Medical School. After having introduced the basic concepts of R to the students I thought of giving them an example of how we can use it to attack simple questions based on data. And since they are (or will soon be) fed up with biological problems I thought of giving them a different kind of a puzzle, which brings us to:
Being a big (OK, huge) fan of Liverpool Football Club in the post-90s era can be exhilarating and frustrating at the same time. You get to experience glorious moments like the Miracle in Instabul or last year’s come-back against Borussia Dortmund, but you also get to see them fail miserably on league after league campaign due to unexpected losses to “lesser” teams like Crystal Palace in 2014. This year in particular, this trend of being imperious in big games, only to lose nerve against sides like Burnley, Bournemouth or Swansea has been more apparent than ever. Liverpool are doing very well when playing other title contenders and spectacularly sink when they find themselves against tough-to-crack defenses. I am, of course, not the first to address this issue, brought up by former managers and former players-turned-pundits. The question, though, when it comes to punchlines such as “Liverpool sink against lesser sides” is how well they are founded on real data and this is exactly the question I posed to my (patient) students. What they had to do was to test whether Liverpool indeed performed worse than expected against teams at the end of the table, the word “expected” being the key here.
All data analyses have to start with data, so first of all I got the full list of the 2016-2017 Premier League’s results up to March 19th, 2017 (you can get it yourselves from this link) and gave it to the students with a clear question:
“Find a way (or, even better, a number of different ways) to test the hypothesis according to which Liverpool is performing better against the top-clubs than against bottom ones”.
They did rather well, even though expectedly they were not so happy with their assignment. I will try to discuss some of their approaches in a next post but below I will give you my little take on it. Bits of code and statistical output is implemented in R.
We start by reading the raw data in a variable. This looks rough at first so we need to organize them a bit
pl<-read.delim("PL_16_17_Results_19Mar17.tsv", header=T, sep="\t")
head(pl)
## Home Away H A
## 1 Stoke City Middlesbrough 2 0
## 2 Tottenham Stoke City 4 0
## 3 Crystal Palace Middlesbrough 1 0
## 4 Everton Sunderland 2 0
## 5 Swansea Leicester 2 0
## 6 Liverpool Tottenham 2 0
We first check if all 20 teams of the Premier League are included in the results. We can simply check that through a tabulation of each of the two Home/Away lists
table(pl$Home)
##
## Arsenal Bournemouth Burnley Chelsea
## 13 15 14 13
## Crystal Palace Everton Hull City Leicester
## 14 15 14 14
## Liverpool Manchester City Manchester United Middlesbrough
## 14 14 14 14
## Southampton Stoke City Sunderland Swansea
## 12 15 15 14
## Tottenham Watford West Brom West Ham
## 15 14 15 15
length(table(pl$Home))
## [1] 20
So we have all 20 teams in our dataset (the numbers in the table shows how many games each has played at home and is unequal since the League is still ongoing). The first thing that needs to be done is to create the Classification Table from the full list of results. Remember that each team gets 3 points for a win, 1 point for a draw and 0 points for a loss. In case you know nothing about football (highly unlikely when you ’re in one of my classes), wins, draws and losses are directly deduced from the scoreline which is denoted in the pl dataframe in the $H and $A variables, which correspond to the goals scored by the home and the away team. Let’s create an empty table to hold the points of each team and let’s count the home and away points separately.
data.frame(Team=levels(pl$Home), PHome=rep(0,20), PAway=rep(0,20), PTotal=rep(0,20))->PLtable
head(PLtable)
## Team PHome PAway PTotal
## 1 Arsenal 0 0 0
## 2 Bournemouth 0 0 0
## 3 Burnley 0 0 0
## 4 Chelsea 0 0 0
## 5 Crystal Palace 0 0 0
## 6 Everton 0 0 0
now we need to populate the table. We will traverse the results table pl keeping account of which team is playing home or away and aggregating points accordingly. Below is the code to do that:
for(i in 1:length(pl[,1])){
home<-which(PLtable$Team==as.character(pl[i,1])) #home team
away<-which(PLtable$Team==as.character(pl[i,2])) #away team
if (pl$H[i]-pl$A[i]>0) { #home team wins
PLtable$PTotal[home]=PLtable$PTotal[home]+3
PLtable$PHome[home]=PLtable$PHome[home]+3
}
if (pl$H[i]-pl$A[i]==0) { #draw
PLtable$PTotal[home]=PLtable$PTotal[home]+1
PLtable$PTotal[away]=PLtable$PTotal[away]+1
PLtable$PHome[home]=PLtable$PHome[home]+1
PLtable$PAway[away]=PLtable$PAway[away]+1
}
if (pl$H[i]-pl$A[i]<0) { #away team wins
PLtable$PTotal[away]=PLtable$PTotal[away]+3
PLtable$PAway[away]=PLtable$PAway[away]+3
}
}
a bit long but straight-forward. Now let’s rank the teams in order of $PTotal (total points) and check if table looks OK.
PLtable[order(PLtable$PTotal, decreasing=T), ]->PLtable
PLtable
## Team PHome PAway PTotal
## 4 Chelsea 36 33 69
## 17 Tottenham 41 18 59
## 10 Manchester City 27 30 57
## 9 Liverpool 33 23 56
## 11 Manchester United 25 27 52
## 1 Arsenal 29 21 50
## 6 Everton 34 16 50
## 19 West Brom 29 14 43
## 14 Stoke City 23 13 36
## 2 Bournemouth 24 9 33
## 13 Southampton 18 15 33
## 20 West Ham 18 15 33
## 3 Burnley 29 3 32
## 18 Watford 19 12 31
## 8 Leicester 24 6 30
## 5 Crystal Palace 13 15 28
## 16 Swansea 17 10 27
## 7 Hull City 19 5 24
## 12 Middlesbrough 13 9 22
## 15 Sunderland 13 7 20
It certainly does not look “OK” with Liverpool at 4th, 13 points behind leaders Chelsea but it is sadly correct. Notice how the sum of $PHome and $PAway amounts to the total tally. We have up to now managed to recreate the table, which is no big deal, as we could have obtained it from many sources directly. The key is that we need the table to test whether Liverpool, or for that matter, any other team, does better or worse against top or bottom teams. In the following we will address this question through a simple statistical test and then with more advanced metrics.
To test our hypothesis, we can run a simple test on the total points amassed against teams at the two edges of the classification table. That is we will compare the points coming from the results of Liverpool against the teams in positions 11-20 to the points they got when playing teams in the top 10 (excluding themselves of course). But in order to keep on top of things lets break it down a bit and create a table where we store the points Liverpool has earned playing each of the teams in the league, with their index being their rank in the table. For simplicity (and easier comparisons) we ’ll make this a 20-value vector even though one value will be empty (the value corresponding to Liverpool’s rank).
liverpool<-vector(mode="numeric", length=20)
Now lets populate this table through a combination of the two tables. For each team in the classification we will go back to the results table and check its results against Liverpool. We need to know which one is Liverpool in the tables so we start with:
ourteam<-which(PLtable$Team=="Liverpool") #find Liverpool on the table
ourteam
## [1] 4
from then on the code that builds the table is:
for(i in 1:20) {
which((pl$Home==PLtable$Team[ourteam] & pl$Away==PLtable$Team[i]))->x;
which((pl$Away==PLtable$Team[ourteam] & pl$Home==PLtable$Team[i]))->y;
if (length(x)==0) {liverpool[i]=liverpool[i]+0}
if (length(y)==0) {liverpool[i]=liverpool[i]+0}
if (length(x)==1){
if(pl$H[x]-pl$A[x]>0) {liverpool[i]=liverpool[i]+3}
if(pl$H[x]-pl$A[x]==0) {liverpool[i]=liverpool[i]+1}
}
if (length(y)==1){
if(pl$H[y]-pl$A[y]==0) {liverpool[i]=liverpool[i]+1}
if(pl$H[y]-pl$A[y]<0) {liverpool[i]=liverpool[i]+3}
}
}
If it seems a bit complicated it’s because it can’t be easily done otherwise. The table is not complete as the League is not over yet and some of the results are not there. We need to account for that by checking the existence of each combination with length(x) and length(y). But you can easily check that the vector is created correctly:
liverpool
## [1] 4 4 4 0 2 6 3 3 3 0 1 1 3 3 3 3 3 3 3 4
Now we have all the points Liverpool has earned, broken down per team staring with aspiring league winners Chelsea at [1] all the way down to struggling Sunderland at the bottom [20]. Liverpool have won 0 points from themselves [4th] and Bournemouth (10th) whom they ’ve played only once (a miserable and frustrating last minute loss). The vector is correct. Now let’s use it properly. The first thing we can do is create the two sums we discussed earlier.
sum(liverpool[1:10])
## [1] 29
sum(liverpool[11:20])
## [1] 27
Indeed Liverpool have earned 2 more points playing with the top-10 than from playing the bottom-10. Is this enough to support our hypothesis? Of course not. We need to be more rigorous. We can break it down a bit:
sum(liverpool[1:6])/5
## [1] 4
sum(liverpool[7:14])/8
## [1] 2.125
sum(liverpool[15:20])/6
## [1] 3.166667
here we calculated average earned points by dividing the sums over teams played for three slices of the table (notice we divide the top 6 by 5 since Liverpool belong to that group). Although the over-performance with the top teams persists this can be biased by the fixture schedule, since some teams may have played more games against top, middle or/and bottom teams by now. Even though we cannot do away with this bias we can try to compare distributions instead of sums, that is we can look at the samples of points from the different slices of the table
boxplot(liverpool[c(1:3,5:10)],liverpool[11:20])
which doesn’t really say much. In fact if we try to address this difference statistically with a simple t.test() we get:
t.test(liverpool[c(1:3,5:10)],liverpool[11:20])
##
## Welch Two Sample t-test
##
## data: liverpool[c(1:3, 5:10)] and liverpool[11:20]
## t = 0.83689, df = 12.527, p-value = 0.4183
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8310566 1.8755010
## sample estimates:
## mean of x mean of y
## 3.222222 2.700000
We see that the two means are different, the top one being slightly higher but that the difference is not significant. So Liverpool do not do significantly better with top teams compared to how they do with bottom ones. Is this correct? Well, not so. Remember the original question is whether they are doing better than expected. It is obvious that the games get tougher as you play better teams and thus you are expected to get worse results. Thus the null hypothesis in our analysis should not be that Liverpool is doing equally well with top/bottom teams and this is exactly what the t.test() is checking. It is going to be trickier to answer our original question as we would need to devise a test that would take into account the expected performance as the team you face is higher in the league. But instead of going into complicated modeling we can simply use benchmarking and compare Liverpool to their peers, by checking the same performance indicators among the rest of top Premier League clubs.
We can apply the exact same strategy we did for Liverpool to create a vector of earned points for each of the teams in the table. We simply loop over the teams in the classification in the same way we did for one club, once we have first created a 20x20 data.frame to hold all points:
rankedpoints<-matrix(0, nrow=20, ncol=20)
as.data.frame(rankedpoints)->rankedpoints
colnames(rankedpoints)<-1:20
rownames(rankedpoints)<-PLtable$Team
Now we will call the same code for each team separately making sure the data frame is populated correctly. We only need to adjust the code so that it populates a two-dimensional [i,j] data frame instead of a one-dimensional vector [i]
for(i in 1:20){
ourteam<-which(PLtable$Team==PLtable$Team[i]) # loop over teams in order of classification
for(j in 1:20) {
which((pl$Home==PLtable$Team[ourteam] & pl$Away==PLtable$Team[j]))->x;
which((pl$Away==PLtable$Team[ourteam] & pl$Home==PLtable$Team[j]))->y;
if (length(x)==0) {rankedpoints[i,j]=rankedpoints[i,j]+0}
if (length(y)==0) {rankedpoints[i,j]=rankedpoints[i,j]+0}
if (length(x)==1){
if(pl$H[x]-pl$A[x]>0) {rankedpoints[i,j]=rankedpoints[i,j]+3}
if(pl$H[x]-pl$A[x]==0) {rankedpoints[i,j]=rankedpoints[i,j]+1}
}
if (length(y)==1){
if(pl$H[y]-pl$A[y]==0) {rankedpoints[i,j]=rankedpoints[i,j]+1}
if(pl$H[y]-pl$A[y]<0) {rankedpoints[i,j]=rankedpoints[i,j]+3}
}
}
}
rankedpoints now has everything we need to benchmark teams against each other. See how the diagonal elements are as expected zero.
rankedpoints
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## Chelsea 0 3 3 1 3 3 3 3 6 3 3 6 4 3 6 3 4 6 3 3
## Tottenham 3 0 4 1 0 1 4 4 6 1 6 3 3 3 1 3 3 3 6 4
## Manchester City 0 1 0 1 3 3 1 3 4 6 1 6 6 3 0 3 6 3 1 6
## Liverpool 4 4 4 0 2 6 3 3 3 0 1 1 3 3 3 3 3 3 3 4
## Manchester United 0 3 0 2 0 1 1 3 2 4 3 4 1 3 6 3 3 4 6 3
## Arsenal 3 1 0 0 1 0 0 3 3 4 3 3 6 3 1 3 6 6 1 3
## Everton 0 1 4 0 1 3 0 6 4 3 3 3 0 0 3 4 1 4 4 6
## West Brom 0 1 0 0 0 3 0 0 4 3 3 4 3 3 3 3 3 4 2 4
## Stoke City 0 0 1 0 2 0 1 1 0 0 1 1 3 6 1 3 3 3 4 6
## Bournemouth 0 1 0 3 1 1 3 3 3 0 0 3 0 2 3 1 6 3 0 0
## Southampton 0 0 1 1 0 0 3 0 1 3 0 3 3 4 4 0 3 0 3 4
## West Ham 0 0 0 1 1 0 0 1 1 3 3 0 3 1 0 6 3 3 4 3
## Burnley 1 0 0 3 1 0 3 0 0 3 3 0 0 3 3 3 0 2 3 4
## Watford 0 0 0 0 3 3 3 0 0 2 1 4 3 0 3 1 1 3 4 0
## Leicester 0 1 3 3 0 1 0 0 1 0 1 6 3 0 0 3 3 3 2 0
## Crystal Palace 0 0 0 0 0 0 1 3 3 4 3 0 0 4 0 0 0 1 6 3
## Swansea 1 0 0 3 0 0 1 0 0 0 3 0 6 1 3 6 0 0 0 3
## Hull City 0 0 0 3 1 0 1 1 0 3 3 0 2 0 3 1 6 0 0 0
## Middlesbrough 0 0 1 0 0 1 1 2 1 3 0 1 0 1 2 0 3 3 0 3
## Sunderland 0 1 0 1 0 0 0 1 0 3 1 0 1 3 3 3 0 3 0 0
The first thing we wanted to do was to calculate the points tally between top and bottom clubs. Lets do it at once for all of them by applying a function on the rankedpoints table. We’ll just split the table in two parts, the first 10 columns corresponding to points earned from the top-10 and the last 10 corresponding to the bottom-10 and calculate the difference of top-bottom:
topvbottom<-apply(rankedpoints[,1:10], 1, sum)-apply(rankedpoints[,11:20], 1, sum)
par(mar=c(2,10,5,5));barplot(topvbottom, horiz=T, las=1, main="points difference from top10-bottom10")
So, what do we have here? Liverpool is the only team with a positive top10-bottom10 score. All of the rest have negative values which is expected as it should be easier to beat a bottom team than a title contender. Leaders Chelsea are getting 13 more points from bottom clubs than from the ones at the top and in this respect they do worse than relegation candidates such as Sunderland and Middlesbrough. The only team that goes close to Liverpool in doing equally well against top and bottom sides is their city rivals Everton. Among the top-6 Chelsea, Spurs and City have similar performance while Arsenal and Man Utd are among the most one sided clubs, easily beating lesser sides but failing to gain points when they play their top peers. The situation is even more extreme when one omits the middle of the table and focuses on the top-bottom6:
topvbottom<-apply(rankedpoints[,1:6], 1, sum)-apply(rankedpoints[,15:20], 1, sum)
cols<-c("blue","white","skyblue","dark red","red","firebrick3")
par(mar=c(2,10,5,5));barplot(topvbottom[1:6], horiz=T, las=1, col=cols, main="points difference from top6-bottom6")
Liverpool’s status as “Slayer of Dragons, prey to minnows” is consolidated. In fact in a mini-league of only the top 10 clubs the classification would look like this:
minileague<-apply(rankedpoints[1:10,1:10], 1, sum)
minileague[order(minileague, decreasing=T)]
## Liverpool Chelsea Tottenham Manchester City
## 29 28 24 22
## Everton Manchester United Arsenal Bournemouth
## 22 16 15 15
## West Brom Stoke City
## 11 5
With the Reds top of the table, followed by Chelsea, Spurs, City and Everton and with Man Utd and Arsenal having a points tally similar to the Cherries. Last, if we want to be more rigorous in our calculation we can apply a t.test() approach in the way we did before:
pvals<-apply(rankedpoints, 1, function(x) t.test(x[1:10],x[11:20])$p.value)
pointRatio<-apply(rankedpoints, 1, function(x) sum(x[1:10])/sum(x[11:20]))
plot(log2(pointRatio),-log10(pvals), xlim=c(-3,1), ylim=c(0,3), type="n", las=1)
abline(h=-log10(0.05), lty=3, lwd=2)
abline(v=0, lty=1, lwd=1)
for(i in 1:20){text(log2(pointRatio[i]), -log10(pvals[i]), PLtable$Team[i], cex=0.8)}
Here we have created a composite plot with a logged-ratio of points earned against top/bottom teams opposed to the significance of this ratio based on a simple t.test. You may see that compared to Liverpool all teams do worse against the top10 clubs and for some of the side this difference is significant based on a t.test (horizontal line at p=0.05).
The analysis we have performed so far gives us an idea of a concept that will be related to the disparity with which teams perform against rivals based on their position in the table. Had we wanted to see the performance of all teams at once we could ask for the rankedpoints table to be plotted in the form of a heatmap() where we would see the accumulation of points as we move from the top to the bottom of the table for each team. In order to do that we will have to create a cumulative sum table of the ranked points first:
cumsumpoints<-t(apply(rankedpoints,1, cumsum))
head(cumsumpoints)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## Chelsea 0 3 6 7 10 13 16 19 25 28 31 37 41 44 50 53 57 63 66
## Tottenham 3 3 7 8 8 9 13 17 23 24 30 33 36 39 40 43 46 49 55
## Manchester City 0 1 1 2 5 8 9 12 16 22 23 29 35 38 38 41 47 50 51
## Liverpool 4 8 12 12 14 20 23 26 29 29 30 31 34 37 40 43 46 49 52
## Manchester United 0 3 3 5 5 6 7 10 12 16 19 23 24 27 33 36 39 43 49
## Arsenal 3 4 4 4 5 5 5 8 11 15 18 21 27 30 31 34 40 46 47
## 20
## Chelsea 69
## Tottenham 59
## Manchester City 57
## Liverpool 56
## Manchester United 52
## Arsenal 50
and then plot this new matrix as a heatmap with increasing points from blue to red
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
cols<-colorRampPalette(c("steelblue4","white","firebrick4"))
heatmap.2(as.matrix(t(apply(rankedpoints,1, cumsum))), Colv=NA, labRow = rownames(rankedpoints), scale="none", trace="none", hclustfun = function(x) hclust(x,method = 'ward.D2'), xlab="Rank in the table", mar=c(5,8), col=cols)
## Warning in heatmap.2(as.matrix(t(apply(rankedpoints, 1, cumsum))), Colv
## = NA, : Discrepancy: Colv is FALSE, while dendrogram is `both'. Omitting
## column dendogram.
Notice how Chelsea’s bar redens once we go past the middle of the table and becomes dark red in the final 5th of the table, an indication that Chelsea get their advantage over their rivals by being consistent with teams at the bottom and a verification of the fact that being consistent and ruthless against the lesser teams is what distinguishes the good teams from the champions. What is also interesting is the clustering of the teams in this view. The top4 create a cluster on their own but that is not based only to their classification as it is also on the way they deal with sides across the table. Notice how mid-table clubs such as Bournemouth(10th) and Stoke(8th) or West Ham(12th) and Southampton(11th) fall into different clusters precisely because of their different “attitude” in seeing off lesser teams. Liverpool’s position as an outlier in this sense can be better seen if we plot the same heatmap scaled across columns, thus giving us a better idea of how each club does against each position.
heatmap.2(as.matrix(t(apply(rankedpoints,1, cumsum))), Colv=NA, labRow = rownames(rankedpoints), scale="column", trace="none", hclustfun = function(x) hclust(x,method = 'ward.D2'), xlab="Rank in the table", mar=c(5,8), col=cols)
## Warning in heatmap.2(as.matrix(t(apply(rankedpoints, 1, cumsum))), Colv
## = NA, : Discrepancy: Colv is FALSE, while dendrogram is `both'. Omitting
## column dendogram.
So Liverpool are good against the top, Spurs against the mid-table, Chelsea against the bottom. A clear division of labour!
*Let me know what you think of this analysis in the comments and stay tuned for a more conceptualized study of this rank inequality in a next post.