I’m a big soccer(football) fan and have become fascinated with the history of the English Premier League (EPL). I also love stats and wanted to practice some of my newly acquired data science/R programming expertise by looking at EPL historical table data.
To start with, I wrote an R script to scrape Wikipedia for historical data and assemble into a dataframe for analysis. If you are interested, you can find the code on my Github repository.
We’ll begin by loading libraries followed by the data:
library(plyr)
library(dplyr)
library(ggplot2)
megatable <- read.csv("EPL_historical_table.csv", stringsAsFactors = FALSE)
Let’s start by looking at a summary of our data:
summary(megatable)
## Pos Team Pld W
## Min. : 1.00 Length:486 Min. :38.00 Min. : 1.00
## 1st Qu.: 6.00 Class :character 1st Qu.:38.00 1st Qu.:10.00
## Median :11.00 Mode :character Median :38.00 Median :13.00
## Mean :10.64 Mean :38.54 Mean :14.14
## 3rd Qu.:16.00 3rd Qu.:38.00 3rd Qu.:17.00
## Max. :22.00 Max. :42.00 Max. :29.00
## D L GF GA
## Min. : 3.00 Min. : 0.00 Min. : 20.00 Min. : 15.00
## 1st Qu.: 8.00 1st Qu.:10.00 1st Qu.: 41.00 1st Qu.: 43.00
## Median :10.00 Median :15.00 Median : 48.00 Median : 51.00
## Mean :10.26 Mean :14.14 Mean : 50.83 Mean : 50.83
## 3rd Qu.:12.00 3rd Qu.:18.00 3rd Qu.: 59.00 3rd Qu.: 58.75
## Max. :18.00 Max. :29.00 Max. :103.00 Max. :100.00
## GD Pts Year
## Length:486 Min. : 11.00 Min. :1992
## Class :character 1st Qu.: 42.00 1st Qu.:1997
## Mode :character Median : 50.00 Median :2003
## Mean : 53.74 Mean :2003
## 3rd Qu.: 63.00 3rd Qu.:2009
## Max. :393.00 Max. :2015
First thing that jumped out at me is the “Number of Games Played” indicated by the Pld column. The modern EPL I was familiar with features 20 teams with each team playing each other team both home and away. So I expected to see a uniform value of 38 games played here. After a little further research on Wikipedia, I learned that for the first 3 seasons, the EPL featured 22 teams which explains why we see a max of 42 games played in the summary.
Another column that begs examination is Pts. The summary shows the max value as 393 which is clearly erroneous.
Let’s try and isolate the problem records with a simple dplyr filter:
filter(megatable, Pts > 100)
## Pos Team Pld W D L GF GA GD Pts Year
## 1 19 Middlesbrough 38 10 12 16 51 60 −9 393 1996
## 2 20 Portsmouth 38 7 7 24 34 66 −32 191 2009
OK, so what’s going on here. Well after a bit of examination of the Wikipedia tables, I discovered that these entries had a subscript associated with them that was incorrectly being interpreted as part of the value.
So we have a bit more cleanup to perform.
Since only two records are affected and we know the bad values only appear once in the dataframe, I’m going to do use a simple text replace to fix the values:
megatable$Pts <- sub("393", "39", megatable$Pts)
megatable$Pts <- sub("191", "19", megatable$Pts)
megatable$Pts <- as.numeric(megatable$Pts)
Now let’s confirm the values have been fixed:
summary(megatable$Pts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 42.00 50.00 52.66 63.00 95.00
OK that did it! Now let’s continue on with our analysis.
Next I’d like to look at the distribution of past champions so let’s subset our data and build a simple histogram to visualize:
champs <- filter(megatable, Pos == 1)
#This function will arrange our histogram to show clubs in order of total EPL titles won
reorder_size <- function(x) {
factor(x, levels = names(sort(table(x), decreasing = TRUE)))
}
#Plot EPL titles by club
ggplot(data = champs, aes(reorder_size(Team))) +
geom_bar(width = 0.6, fill = c("red","midnightblue","firebrick","skyblue","black","midnightblue")) +
xlab("") +
ylab("Titles") +
ggtitle("Premier League Titles by Club")
No surprise to anyone who follows the EPL, that it has been dominated by Manchester United under Sir Alex Ferguson with 13 titles.
Here’s a breakdown of Titles by Club:
champs %>% count(Team, sort = TRUE)
## Source: local data frame [6 x 2]
##
## Team n
## (chr) (int)
## 1 Manchester United 13
## 2 Chelsea 4
## 3 Arsenal 3
## 4 Manchester City 2
## 5 Blackburn Rovers 1
## 6 Leicester City 1
We can see only a few select clubs have had the privilege of winning the Premier League. Only 4 clubs have won multiple EPL titles. Clubs like Chelsea and Manchester City have enjoyed recent success thanks to significant investment from their wealthy owners. Blackburn Rovers, the other lone winning club, was relegated in 2011 and have yet to regain promotion.
Let’s see if we can compare champions across the different EPL seasons to gauge their relative strength. To start we can look at champions that won the league with fewest and most points respectively.
filter(champs, Pts %in% range(Pts)) %>% knitr::kable()
Pos | Team | Pld | W | D | L | GF | GA | GD | Pts | Year |
---|---|---|---|---|---|---|---|---|---|---|
1 | Manchester United | 38 | 21 | 12 | 5 | 76 | 44 | +32 | 75 | 1996 |
1 | Chelsea | 38 | 29 | 8 | 1 | 72 | 15 | +57 | 95 | 2004 |
How else can we compare champions across seasons given the limited stats available in the historical league tables?
One way we can go about this is to look at the relative strengths of each champion’s closest competitors. We’ll limit our comparison to the top 4 finishers each season calculating the Avg Pts and Goal-differential per season making use of the handy ddply()
function from the plyr
package.
champs$T4Avg <- ddply(filter(megatable, Pos < 5), .(Year), summarise, Avg=round(mean(Pts)))[,2]
champs$GDAvg <- ddply(filter(megatable, Pos < 5), .(Year), summarise, Avg=round(mean(GF-GA)))[,2]
Now we can calculate a relative Strength metric for each champion by computing the sum of the differences between their Points totals and Goal differential respectively with their competitors.
mutate(champs, STR=(Pts-T4Avg)+((GF-GA)-GDAvg)) %>%
select(-GD, -Pos) %>%
arrange(desc(STR)) %>%
top_n(10) %>%
knitr::kable()
## Selecting by STR
Team | Pld | W | D | L | GF | GA | Pts | Year | T4Avg | GDAvg | STR |
---|---|---|---|---|---|---|---|---|---|---|---|
Manchester United | 38 | 28 | 7 | 3 | 97 | 45 | 91 | 1999 | 75 | 30 | 38 |
Chelsea | 38 | 29 | 8 | 1 | 72 | 15 | 95 | 2004 | 79 | 35 | 38 |
Manchester City | 38 | 28 | 5 | 5 | 93 | 29 | 89 | 2011 | 79 | 42 | 32 |
Manchester United | 38 | 28 | 5 | 5 | 83 | 27 | 89 | 2006 | 77 | 38 | 30 |
Chelsea | 38 | 27 | 5 | 6 | 103 | 32 | 86 | 2009 | 79 | 49 | 29 |
Arsenal | 38 | 26 | 12 | 0 | 73 | 26 | 90 | 2003 | 76 | 33 | 28 |
Manchester United | 42 | 24 | 12 | 6 | 67 | 31 | 84 | 1992 | 75 | 18 | 27 |
Manchester United | 38 | 24 | 8 | 6 | 79 | 31 | 80 | 2000 | 72 | 32 | 24 |
Chelsea | 38 | 29 | 4 | 5 | 72 | 22 | 91 | 2005 | 81 | 39 | 21 |
Manchester City | 38 | 27 | 5 | 6 | 102 | 37 | 86 | 2013 | 83 | 47 | 21 |
The top two clearly stand out with a tie between the famous treble-winning Manchester United team and Jose Mourinho’s 2004 team which finished with a record 95 points. The 2004 Chelsea team would probably get the nod for the overall top spot based on their superior Points total and Goal Differential.
The 2011 Manchester City team which finished 3rd in our ranking is actually an interesting case because they won the league dramatically on the final match-day with a late winner by Sergio Aguero versus Queens Park Rangers. They won their championship on goal-differential after finishing even on points with Manchester United. That season City and United were so far ahead of their competitors that it afforded City a high Strength metric even though they actually won the league by the slimmest margin in EPL history.
Also interesting to see Arsenal’s iconic Invincibles squad was ranked outside the top 5 largely due to the high number of games they drew that season.
Another measure of success in the EPL is top 4 finishes that award a place in the prestigious and lucrative Champion’s League competition. Let’s see which teams have achieved the most top 4 finishes.
First let’s subset our data based on table position within the top 4 and start from the year the EPL switched to 20 teams.
topfour <- megatable %>% filter(Pos < 5 & Year > 1994)
Now for a quick visualization to identify the most successful clubs.
ggplot(data = topfour, aes(reorder_size(Team))) +
geom_bar(width = 0.75, fill = "midnightblue") +
xlab("") +
ylab("Top 4 Finishes") +
theme(axis.text.x = element_text(angle=90))
No surprise to see Arsenal and Manchester United at the top. Liverpool have enjoyed 12 Top 4 finishes but have never reached the summit much to the chagrin of their supporters. Clubs like Newcastle United and Aston Villa have enjoyed success in the distant past but are no longer top 4 contenders.
Let’s turn our attention to the bottom of the table and see which clubs have suffered relegation most often. First, we’ll subset our data to those clubs who have suffered relegation.
rlgtd <- megatable %>% filter(Year > 1994, Pos > 17)
Now we can use our handy dplyr
syntax again to determine which clubs have been sent down most often
rlgtd %>% count(Team, sort = TRUE) %>% top_n(5)
## Selecting by n
## Source: local data frame [6 x 2]
##
## Team n
## (chr) (int)
## 1 Birmingham City 3
## 2 Bolton Wanderers 3
## 3 Norwich City 3
## 4 Queens Park Rangers 3
## 5 Sunderland 3
## 6 West Bromwich Albion 3
Regaining promotion back to the EPL is extremely difficult. Let’s see if we can quantify how difficult by determining what percentage of teams that are relegated regain promotion at the earliest opportunity.
Let’s start by writing an R function to determine if a team regained promotion or not after being relegated
redemption <- function(team, year) {
next_year = as.integer(year) + 2
if (team %in% megatable[megatable$Year == next_year, "Team"]) {
promoted = 1
}
else {
promoted = 0
}
return(promoted)
}
Our function takes two parameters, team
and year
, indicating the club and the year they were relegated. The earliest a team could find itself back in the EPL after relegation would be 2 years from the year they were relegated. So we caclulate this year and determine the clubs participating in the EPL for that season and test whether the club specified in the team
variable is in that subset.
Pretty straightforward!
Now, we need to go through our rlgtd
dataframe and apply this function to each entry. FOR
loops in R have a bad reputation and are poorly optimized. Fortunately R has a handy class of apply
functions to replace FOR loops.
For this particular case, we’ll use the mapply
function which allows us to pass multiple variables to our redemption()
using the data in our rlgtd
dataframe
redeemed <- mapply(redemption, rlgtd[,"Team"], rlgtd[, "Year"])
This gives us a vector called redeemed
with either a 0 or 1 for each relegated club, with a 1 indicating they regained promotion that the next opportunity.
Now we can calculated the percentage of clubs that gained promotion immediately following relegation
round(sum(redeemed)/nrow(rlgtd)*100, 2)
## [1] 23.81
So we can see that only 24% of clubs that were relegated immediately regained promotion. It is indeed quite a challenge to regain promotion and thanks to R, we now know how difficult!
The EPL is the most popular soccer league in the world and a big part of what makes the competition so exciting are the goals. Soccer is a low-scoring game making each goal an exciting and memorable moment in the matches. Let’s see if we can spot any trends in number of goals scored between the seasons.
We’ll start by computing the sums of goals-scored per team across each season.
totgoals <- filter(megatable, Year > 1994) %>% group_by(Year) %>% summarise(Tot = sum(GF))
From this summary, we can see the average number of goals per season is 1004 with a standard deviation of 43.
Let’s visualize the data with a histogram to see if we can spot any trends. The dashed line indicates the average number of total goals per EPL season.
ggplot(data = totgoals, aes(x = as.numeric(Year), y = Tot)) +
geom_bar(stat = "identity", fill = "orange") +
geom_hline(yintercept = round(mean(totgoals$Tot)), linetype = 2) +
xlab("Year") +
ylab("Total Goals") +
theme(axis.text.x = element_text(angle = 90, hjust = 0.9)) +
scale_x_continuous(breaks = seq(1995,2015)) +
ggtitle("Total Goals Scored per Season")
We can see that overall the total number of goals scored has not varied that much across EPL seasons. The seasons from 2009-2013 have been characterized by slightly above-average number of goals scored. The most recent 2014 season showed a slight drop in total goals though I’m not sure if that is attributable to anything in particular. It will be interesting to see if we start to see a trend of increasing goals scored with the influx of better attacking players owing partly to the growing affluence of the EPL.
The story of the 2016 EPL season was clearly the unprecedented title-winners, Leicester City. For a club that barely escaped relegation the previous season, it was an almost unthinkable turn of events. It’s really difficult to put Leicester’s achievement in context for those used to following American sports with their playoff systems that allow for underdogs to prevail more often.
While Leicester deserve an enormous amount of credit for their achievement, it has been suggested that their success is owed partly to the unexpected decline of the perenial contenders.
Let’s see if we can quantify this. First we’ll start by selecting the clubs we want to focus our analysis on.
teams <- c("Leicester City",
"Arsenal",
"Manchester United",
"Chelsea",
"Liverpool",
"Southampton",
"Tottenham Hotspur",
"West Ham United")
Notice, we left out Manchester City but we will explain why in a bit.
We’ll compare their points total and goals-scored to their historical averages.
teamperf <- megatable %>%
filter(Year > 1994, Team %in% teams) %>%
group_by(Team) %>%
summarise(ptsavg=mean(Pts), gfavg=mean(GF)) %>%
arrange(Team)
OK now we’ll come back to Manchester City. To get a more realistic picture of their recent performance, we will compare against only their seasonal averages since the club takeover by the Abu Dhabi group.
mcavg <- megatable %>%
filter(Year > 2010, Team == "Manchester City") %>%
group_by(Team) %>%
summarise(ptsavg=mean(Pts), gfavg=mean(GF))
Next, we’ll add their data to the dataframe with data for the other clubs.
teamperf <- bind_rows(teamperf, mcavg)
teamperf <- teamperf[order(teamperf$Team),]
teams <- append(teams, "Manchester City")
Now we get a dataframe for our teams of interest for the most recent EPL campaign and append their historical averages as new columns.
epl15 <- megatable %>%
filter(Year == 2015, Team %in% teams) %>%
select(Year, Team, Pts, GF) %>%
arrange(Team)
epl15$PtsAvg <- round(teamperf$ptsavg, 2)
epl15$GFAvg <- round(teamperf$gfavg, 2)
Finally we’ll calculate the percent-change in points and goals-scored and show our results.
epl15 <- epl15 %>%
mutate(PtsDiff=Pts-PtsAvg, GFDiff=GF-GFAvg) %>%
arrange(desc(PtsDiff))
epl15 %>%
mutate(PtsPctChng=(Pts-PtsAvg)/Pts*100, GFPctChng=(GF-GFAvg)/GF*100) %>%
arrange(desc(PtsPctChng)) %>%
knitr::kable()
Year | Team | Pts | GF | PtsAvg | GFAvg | PtsDiff | GFDiff | PtsPctChng | GFPctChng |
---|---|---|---|---|---|---|---|---|---|
2015 | Leicester City | 81 | 68 | 48.33 | 47.00 | 32.67 | 21.00 | 40.333333 | 30.882353 |
2015 | Southampton | 63 | 59 | 47.14 | 46.43 | 15.86 | 12.57 | 25.174603 | 21.305085 |
2015 | West Ham United | 62 | 65 | 47.61 | 45.89 | 14.39 | 19.11 | 23.209677 | 29.400000 |
2015 | Tottenham Hotspur | 70 | 69 | 56.90 | 54.29 | 13.10 | 14.71 | 18.714286 | 21.318841 |
2015 | Arsenal | 71 | 65 | 74.71 | 70.29 | -3.71 | -5.29 | -5.225352 | -8.138462 |
2015 | Liverpool | 60 | 63 | 67.05 | 63.67 | -7.05 | -0.67 | -11.750000 | -1.063492 |
2015 | Manchester City | 66 | 71 | 79.60 | 83.00 | -13.60 | -12.00 | -20.606061 | -16.901408 |
2015 | Manchester United | 66 | 49 | 80.38 | 75.14 | -14.38 | -26.14 | -21.787879 | -53.346939 |
2015 | Chelsea | 50 | 59 | 73.10 | 67.14 | -23.10 | -8.14 | -46.200000 | -13.796610 |
As we expected, Leicester City were the biggest over-performers but Southampton, West Ham United and Tottenham also showed considerable improvement as well. Chelsea stand out among the underperformers with their meek title-defense which resulted in the early sacking of Jose Mourinho. The Van Gaal era at United ended with an FA Cup trophy but outside the Champions League spots and resulted in his sacking.
It is unlikely we will see the top clubs struggle the way they did in 2015-16 season again making a repeat of Leicester’s triumph difficult to see, but they have given hope to the smaller clubs for the future!
I’m a huge Arsenal fan so let’s see what EPL historical table can tell us about how the club has performed to date. We already know that Arsenal has won 3 EPL titles and have enjoyed a league-best, 17 top 4 finishes.
Let’s look at the other end and see which years they performed poorly in.
afc <- filter(megatable, Team == "Arsenal")
filter(afc, Pos > 4)
## Pos Team Pld W D L GF GA GD Pts Year
## 1 10 Arsenal 42 15 11 16 40 38 +2 56 1992
## 2 12 Arsenal 42 13 12 17 52 49 +3 51 1994
## 3 5 Arsenal 38 17 12 9 49 32 +17 63 1995
So we see Arsenal have only finished outside the top 4 on three separate occasions. Their worst performances came early in EPL before Arsene Wenger arrived to take over from Bruce Rioch.
Arsenal have been known for their attractive, attacking style of football under Wenger so let’s see how their number of goals scored has varied through their time in the EPL.
ggplot(data = afc, aes(x = as.numeric(Year), y = GF)) +
geom_line(size = 2.0, alpha = 0.7, color = "firebrick") +
geom_point(size = 1.0) +
xlab("Year") +
ylab("Goals Scored") +
theme(axis.text.x = element_text(angle = 90, hjust = 0.9)) +
scale_x_continuous(breaks = seq(1992,2015)) +
ggtitle("Arsenal Goals Scored per Season")
We can see an immediate increase in goals following Wenger’s arrival in 1997 leading to their peak in the 2004-05 season. Statistically, that may have been their best season but sadly they finished a distant second to a rampant Chelsea team under Jose Mourinho in his first season at Stamford Bridge.
There was a sharp drop-off in goals following the 2005 season stemming from squad turnover due to the financial restraints imposed by the move from Highbury to the Emirates stadium. Arsenal fans have lamented the lack of proven goal-scorers like Henry and Van Persie in recent seasons. The additions of top-players like Ozil and Alexis have gone some way towards remedying that but the team is still a ways away from the offensive potency of early-2000s Wenger era.
To make things a bit more interesting, looks look at a points comparison between Arsenal and their north-London rivals, Tottenham Hotspur.
nld <- megatable %>% filter(Team %in% c("Arsenal", "Tottenham Hotspur"))
ggplot(data = nld, aes(x = as.numeric(Year), y = Pts)) +
geom_line(size = 2.0, alpha = 0.7, aes(color = Team)) +
geom_point(data = subset(nld, Pos == 1), size = 2.0, alpha = 0.7) +
scale_color_manual(values = c("firebrick","midnightblue")) +
xlab("Year") +
ylab("Points") +
ggtitle("Arsenal vs. Tottenham EPL Comparison")
Arsenal supporters everywhere take great delight in the fact that the club has not finished behind Spurs since 1994. They actually celebrate the day when it becomes mathematically impossible for Spurs to finish above Arsenal in the table known as St. Totteringham’s Day. I feared the streak would end last season but thanks to a spectacular collapse by Spurs, Arsenal finished above them yet again.
We can see Arsenal enjoyed greatest ascendancy in their local rivalry during the early 2000’s. The points on the graph indicate Arsenal’s title-winning years which coincide with biggest points-differentials much to the chagrin of Spurs’ fans no doubt!
This concludes my analysis of EPL historical table. I found it to be a beneficial learning experience and improved my fluency with R and visualization using ggplot2. My next project will be delve into some more analysis of Arsenal this time using Python and Jupyter notebooks.