This dataset has been created using English Premier League match history data from http://www.football-data.co.uk and manager appointment data from https://en.wikipedia.org/wiki/List_of_Premier_League_managers
The dataset consists of 9664 observations, each corresponding to a Premier League match since the competition’s inception in 1993. Each season was run as a simulation, match by match, capturing a snapshot of the situation for both teams at the time across 192 variables (e.g. position in table, away goals scored, length of winning streak, duration of manager tenure).
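As an illustration, the raw match history could be assembled along these lines. This is a minimal sketch: the file names are assumptions, while FTHG, FTAG and FTR (full-time home goals, away goals and result) are football-data.co.uk’s own column names.

```r
# Sketch: bind one CSV per season into a single match-history data frame.
# File names are hypothetical; the columns kept are the site's own.
files <- sprintf("E0_%d.csv", 1993:2017)
matches <- do.call(rbind, lapply(files, function(f) {
  df <- read.csv(f, stringsAsFactors = FALSE)
  df[, c("Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR")]
}))
```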
The objective of this study is to investigate various research questions as detailed below.
This research is exploratory in nature, with no claims towards statistical inference being made.
The data set has also been used to predict Premier League fixtures using an array of models.
N.B. For the first two seasons of the Premier League there were 22 teams, reduced to 20 thereafter.
1: The Honeymoon Period
a) How long is the honeymoon period for new managers and how significant is the effect?
b) Does it depend on the time of the season at which a change of manager occurs, e.g. before or after Christmas?
c) Is there a relationship between points per game for the outgoing manager and the magnitude of the honeymoon period effect?
2: Results towards end of season
a) Do a team’s results change if they are not fighting for anything at the end of the season (against an opponent that is), compared to two teams that aren’t fighting or two teams that are?
b) Do teams tend to win their last home game of the season as a send off for the fans?
3: To stay up, is it better to be attack-minded or defence-minded?
Comparing teams that play open football vs those that do not.
4: If teams underperform or overperform one season, how do they fare the following year?
5: How long do winning streaks usually last?
a) How long do runs tend to last?
b) Does the likelihood of extending a run change depending on how many games the run stands at?
6: Bookmaker accuracy
a) How do the bookmakers’ predictions fare over the season? Are they better at predicting near the end?
b) How successful are bookmakers at identifying draws?
Research question 1: Honeymoon Period
Firstly, how many matches in the dataset have featured managers who have joined the club that season?
## [1] "Games featuring new manager: 46.9 %"
**1a) How strong is the honeymoon period effect for new managers and how long does it last?**
## Home Managers Away Managers
## Average points per game all managers 1.32 1.35
## Average ppg established managers 1.37 1.39
## Average ppg new managers 1.19 1.22
At first glance, teams with new managers earn fewer points per game, but this is unsurprising, since struggling teams are generally the ones appointing new managers. A better indication is the difference a new manager makes to the points per game and goal difference per game compared to their predecessor.
This plot shows the change in average points per game (compared to the previous manager) for all matches played under new managers. (Coloured by team name).
The high initial variance is due to managers winning or losing their first game in charge (giving an initial ppg of 3 or 0 within the first week of their reign). Once this effect subsides, however, we see that the majority of the data fall above the red line, indicating that new managers do indeed have a positive impact.
Trails of points of the same colour represent games played by a team in the same season.
We can simplify the plot by looking at only the last game of the season, thereby restricting the data to one observation per season for teams with new managers.
The plot confirms that the majority of new managers improve the team’s performance. The high variance to the left is for the same reason as described previously - the skew of a first win or loss. The fanning out of the plot to the right is caused by interim managers and their successors. e.g. Kevin MacDonald won 2 points per game for Aston Villa at the start of the 2010-11 season, but Gerard Houllier managed only 1.15 ppg thereafter. Conversely, Ruud Gullit took 1 point from 5 games in 1999 before Bobby Robson took over.
## [1] "Percentage of teams with improved points per game under new manager: 72.7%"
## [1] "Average points per game improvement under new manager: 0.27"
A 0.27 improvement in points per game would account for over 10 points in a season. This could easily make the difference in terms of staying up or qualification for Europe.
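These headline figures could be derived roughly as follows, assuming a hypothetical data frame `new_mgr_seasons` with one row per team-season that saw a mid-season appointment, where `ppg_new` and `ppg_old` hold points per game under the incoming and outgoing managers:

```r
# Share of new managers who improved on their predecessor, and the mean gain.
improvement <- new_mgr_seasons$ppg_new - new_mgr_seasons$ppg_old
paste0("Percentage of teams with improved points per game under new manager: ",
       round(100 * mean(improvement > 0), 1), "%")
paste("Average points per game improvement under new manager:",
      round(mean(improvement), 2))
```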
The so-called honeymoon period, however, refers to the short-term improvement in a team’s performance.
How does this vary over time?
The first few games represent a period of high variance, where individual results have more bearing. Beyond this the improvement settles at around 0.3 ppg. After about 13 games (perhaps once the manager has had a chance to impose their ideas on the team) we see that the results improve steadily, from an average of 0.3 ppg up to almost 0.5.
After about 25 games in charge the data become more erratic owing to the fewer observations available. (Not many teams sack their managers within the first 8 games, which they would need to do for the new manager to have overseen 30+ games in a season). This is indicated by the red points on the plot and shown in detail in the frequency table below.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 142 142 139 140 136 137 130 124 122 121 115 110 105 97 95 83 81 70
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 69 64 58 53 50 41 38 31 27 23 20 14 11 8 5 3 2 1
1b) When is it best to replace a manager?
Does the honeymoon period depend on the time of the season a change of manager occurs?
We can compare the fate of teams who replace their managers early in the season with those that do so later, using Christmas as a convenient demarcation point.
## [1] "Mean points per game improvement for all manager changes: 0.33"
Points per game improvement based on appointment before/after Christmas
## after before
## 0.19 0.38
We see a substantial difference in the performance of new managers relative to their predecessors, depending on when the appointment is made: appointments before Christmas improve on average by twice as much as those made after.
One reason for this is that a poorly performing manager at the start of the season is likely to have a very low ppg, which is easily improved upon by a successor. Beyond that however, the conclusion seems to be that it is less effective for a team to change their manager as a panic measure in the second half of the season.
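A minimal sketch of the Christmas split, assuming hypothetical columns `appointed` (appointment date) and `season_start_year` in the same `new_mgr_seasons` data frame as above:

```r
# Classify each appointment relative to Christmas of that season, then
# average the points-per-game improvement within each group.
xmas   <- as.Date(paste0(new_mgr_seasons$season_start_year, "-12-25"))
timing <- ifelse(as.Date(new_mgr_seasons$appointed) < xmas, "before", "after")
round(tapply(new_mgr_seasons$ppg_new - new_mgr_seasons$ppg_old, timing, mean), 2)
```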
The red points cease after a certain amount of time, since a manager appointed after Christmas has fewer days left until the end of the season.
We can also see that the red points are slightly lower on average than the blue ones.
1c) Is there a relationship between points per game for the outgoing manager and the magnitude of the honeymoon period effect?
In this plot, colour represents the category of team, where ‘Upper’ corresponds to a top 4 team (last season), and ‘Lower’ to a team that finished in the bottom half of the table.
Observed features of the plot:
- There is a clear negative correlation, since the fewer points per game a team has, the easier it is to improve on that.
- The higher-level teams (blue) do not appear on the left side of the plot, since they have a higher bar for success, removing managers even when they are still winning points.
- Vertical lines represent data for the same club, since the previous manager’s ppg is constant, while the new manager’s will shift.
- Diagonal banding is due to the discrete ppg values that can be obtained by a new manager after 1 or 2 games (0, 0.5, 1, 1.5, 2, 3)
- The band of uppermost points all correspond to maximum ppg (3) for the new manager (old manager ppg + difference between old and new). The lack of vertical clustering for these points demonstrates that they represent new managers winning their first game or two in charge.
Research question 2: Results towards end of season
a) Do a team’s results change if they are not fighting for anything at the end of the season (against an opponent that is), compared to two teams that aren’t fighting or two teams that are?
We consider those matches where one team is ‘still fighting’ (be it for the title, Champions League qualification, or survival) against a team which has nothing but pride (and final-standing Premier League royalties) to play for.
N.B. Factors such as pride for an already-relegated team not to finish rock bottom, the desire to finish in the top half of the table, and the desire to qualify for the Europa League have not been considered.
This chart is best analysed by comparing the ratios of results with those in the first column, since this represents the vast majority of all matches.
We can see that when the away team has nothing to fight for (column 2), the home team wins more games, particularly at the expense of the draws.
Conversely, when the home team has nothing to fight for (column 3), the proportion of drawn games remains similar, but the away wins increase significantly.
When neither team is particularly motivated (column 4), we can see that these effects cancel each other out and the proportions are similar to column 1.
There are 10 final home games each season (11 for first two Premier League seasons), giving a total of 252 games.
The results are shown in the right-hand column below.
##
## FALSE TRUE
## A 2596 61
## D 2483 63
## H 4333 128
The relative proportions are shown below
##
## FALSE TRUE
## A 0.28 0.24
## D 0.26 0.25
## H 0.46 0.51
We can see that the proportion of home wins increases from 46% to 51%, mostly at the expense of away wins.
Research question 3: Conditions for relegation
a) Attacking vs defensive style of play.
Are attack-minded teams more likely to stay up, or vice versa?
An attack-minded team should score more but also concede more. Adding up a team’s goals scored and goals conceded should encapsulate this concept: an attack-minded team should have a high total, while a defence-minded team should have a low total (for teams in a similar league position).
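A minimal sketch of the metric, assuming a hypothetical `season_table` data frame with one row per team per season and goals for/against in `GF` and `GA`:

```r
# "Openness": total goals a team's matches contained over the season.
season_table$openness <- season_table$GF + season_table$GA
# The scored-to-conceded ratio used later in this section.
season_table$ratio <- season_table$GF / season_table$GA
```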
Plot of final league position against total goals scored and conceded.
The U-shape to this curve is understandable, since the best teams will score a lot of goals (upper left area) and the worst teams will concede lots (upper right).
Interestingly, however, we see an approximately even spread above and below the curve for any given league position. This means that it is just as possible to place very highly in the league with a very attacking style as it is with a more defensive style. The same is true for relegation.
N.B. This plot has a ‘jitter’ function applied to reveal location of overlapping data points.
This plot demonstrates a more intuitive truth - that the ratio of goals scored to goals conceded determines how the team fares.
N.B. The outlier in the top left corresponds to Manchester City’s unparalleled dominance in 2017-18
This plot shows the relationship between goals scored and conceded for teams that narrowly avoided relegation (that were in a dog fight) and those that were relegated.
The black line corresponds to goals scored=goals conceded, so almost all of these teams fell below a ratio of 1.
Although it’s clear that the closer to the upper-left portion of the graph a team is, the more likely they will be relegated, the two sets of data points are highly overlapping.
Plotting the ratio instead, it is clear that no team has been relegated when the ratio of goals scored to goals conceded was better than 3:4.
## [1] "Average ratio for teams that just missed survival: 0.62"
## [1] "Average ratio for teams that just survived: 0.67"
Research question 4: If teams dramatically underperform or overperform one season, how do they fare the following year?
This plot shows the changing fortunes of three clubs.
Arsenal are extremely consistent, with their league position rarely changing by more than 2 places year on year.
Everton show more volatility, giving the plot a cardiogram-like character.
Both have a tendency to self-correct (what goes up must come down).
Leicester are extremely unusual, showing consecutive high increases in back to back seasons (earning them a league title in 2015-16) followed by a dramatic plummet of 10 places the following year.
By considering three consecutive seasons we can plot the change in position between seasons 1 (‘penultimate’) and 2 (‘last’) against the change between seasons 2 and 3 (‘current’).
We can see a loose general correlation which shows that teams who finish higher or lower than the previous season will, to some extent, revert to previous form the following year.
## [1] "This has a correlation of: -0.29"
Some teams (like Leicester 2015-16) improved in back to back years.
## season team penultimate pos_this_last_diff
## 1710 s2005_06 Tottenham 5 5
## 212 s2007_08 Aston Villa 5 5
## 1120 s2011_12 Newcastle 9 7
## 1418 s2013_14 Southampton 7 6
## 720 s2015_16 Leicester 7 13
## 221 s2016_17 Bournemouth 5 6
## 422 s2017_18 Burnley 6 9
This plot shows the fortunes of those teams that improved by over 7 places the year before. The majority fall below the line, showing they do not improve in back to back seasons, and this occurs in a loosely proportional way (i.e. the higher you rise, the harder you fall).
What of the opposite scenario? How do teams that fall significantly one year, bounce back the year after?
(N.B. This plot excludes teams that were relegated in the first season, and is therefore not as representative.)
We see that the majority of the points are above the line, showing that teams tend to bounce back from a poor season.
However, the corresponding correlation - that the harder you fall, the higher you bounce back - is not observed. This is partly because teams that are relegated in the penultimate season do not feature in the following season’s data, but could also be due to the pitfalls of having a poor season - a cycle of changing managers, the best players cherry-picked by other clubs, an inability to attract better players, and so on.
This plot demonstrates the above findings - that the higher a team rises in one season, the more they fall in the following one. This effect is especially true for teams that rise 10 places or more. Reasons for this could be rising expectations and pressure on players, the effect of playing extra games in Europe with a thin squad, or top players being bought away by bigger clubs.
The reverse effect is not observed, with teams bouncing back well from poor seasons. As described, the relegation of teams that drop into the bottom three masks whether they do in fact recover to have better seasons.
Below are the two teams who ascended most dramatically in the 2017-18 season.
## team pos pos_this_last_diff
## 422 Burnley 7 9
## 1322 Newcastle 10 11
Based on the graph above, in 2018-19, Burnley would be expected (on average) to drop 2-3 places to 9th/10th, with Newcastle dropping 3-4 to 13th/14th.
Research question 5: Winning streaks
How long do streaks tend to last and how does the probability of the run extending change as the streak length increases?
Below are the number of instances of each length of winning streak.
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 12399 4215 1491 635 279 140 67 40 24 14 9 5
## 12 13 14 15 16 17 18
## 3 2 1 1 1 1 1
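Frequency tables like this can be built with run-length encoding. Here is a minimal sketch for a single team-season of results in match order (“W”/“D”/“L”; the helper name is hypothetical):

```r
# Lengths of the maximal winning runs within one sequence of results.
win_streaks <- function(results) {
  runs <- rle(results == "W")
  runs$lengths[runs$values]   # keep only the runs of consecutive wins
}
win_streaks(c("W", "W", "L", "W", "D", "W", "W", "W"))
## [1] 2 1 3
```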
The plot shows an exponential decay as the streaks come to an end.
Game by game, how does the conversion rate (the proportion of streaks that extend) change as the streak grows?
(i.e. when are streaks likely to end).
In the above plot, the points represent the number of instances of each streak length, as per the previous plot. The overlaid line is the conversion rate for each streak length.
The drop-off from one streak length to the next is steep: in 25 years, only 140 teams have put together 5 wins in a row, and only one team has gone beyond 13 consecutive wins (18, by Manchester City in 2017-18). (N.B. This does not take into account streaks that continue between seasons.)
The conversion rate for a streak increases as time goes on however. The rationale here is that any team can pull off back to back wins against weak opposition, but for a streak to extend beyond four or five games, the team is probably very good and therefore more likely than not to win the next game as well (thus extending the streak further).
From the above plot, once teams reach a winning streak of 5, it becomes more likely than not that they will win each match thereafter.
The flattening out of the line to the right is due to there only being one instance of a team reaching that point, hence the conversion rate will be 1.0 for as long as they can keep extending the record.
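As a cross-check on the plot, the conversion rate can be computed directly from the frequency table above. A sketch using the published counts for streak lengths 1 to 18:

```r
# Streak-length counts from the table above (lengths 1 to 18).
counts  <- c(4215, 1491, 635, 279, 140, 67, 40, 24, 14, 9,
             5, 3, 2, 1, 1, 1, 1, 1)
reached <- rev(cumsum(rev(counts)))               # streaks reaching >= k wins
conversion <- reached[-1] / reached[-length(reached)]
round(conversion[1:6], 2)   # conversion at 5 wins is ~0.55, i.e. above 50%
```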
We can plot similar graphs for other types of streak:
- Losing
- Unbeaten (wins and draws)
- Non-winning (losses and draws)
All four plots finish with the longest streak of their kind being extended until it finishes (hence 100% and then 0%). Winning and losing streaks are far shorter (max 18 and 14) than their corresponding unbeaten and non-winning streaks (max 37 and 31). The likelihood of extending a winning or losing streak is accordingly lower, initially under 50%, whereas the initial likelihood of extending an unbeaten or non-winning streak is more like 60%.
On the lower-left plot, we see a marked downturn, telling us that of the 16 losing streaks of 8 games, 15 ended in the 9th game.
On the lower-right plot, we see a flat section halfway along. This tells us that all 5 teams who failed to win in 17 games had to then wait at least another 3 games before finally winning!
Research question 6: Bookmaker success
6a) How do the bookmakers’ predictions fare over the season? Are they better at predicting as the season goes on?
Rather than use a single bookmaker (or compare different ones), we will instead consider the average probability across a range of bookmaker data, with the overround removed so that the sum of probabilities for each game is 1. (N.B. The overround is essentially the same thing as the ‘house advantage’ in casinos. It is set so that, under random betting patterns, whatever the results of the sporting events, the bookmaker will on average profit by a set percentage - the overround.)
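Removing the overround amounts to rescaling each match’s implied probabilities so they sum to 1, as in this sketch (the odds shown are illustrative):

```r
# Convert decimal odds to implied probabilities, then normalise away the
# overround so the three outcome probabilities sum to exactly 1.
fair_probs <- function(home_odds, draw_odds, away_odds) {
  p <- c(H = 1 / home_odds, D = 1 / draw_odds, A = 1 / away_odds)
  p / sum(p)
}
fair_probs(2.10, 3.40, 3.75)   # raw probabilities sum to ~1.04 before rescaling
```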
Below is a summary of bookmaker ‘predictions’ (the shortest-odds or highest-probability outcome) for every Premier League game with bookmaker data.
##
## A H
## 1859 4981
Bookmakers never predict a draw as the most likely outcome, so what is the best they can do by only predicting home or away wins?
The table below shows the proportion of results across all Premier League matches.
##
## A D H
## 0.280 0.256 0.464
Draws account for 25.6% of all Premier League results, so the best bookmakers can do by only predicting away wins or home wins is 74.4%.
How do they measure up against that ceiling?
Below is a table showing the proportion of games where the result matches the bookmakers’ most likely outcome.
##
## FALSE TRUE
## 0.466 0.534
Bookmakers get 53.4% of results correct from a possible 74.4%
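A minimal sketch of that headline figure, assuming a hypothetical `pred` column holding the bookmakers’ highest-probability outcome alongside the `FTR` full-time result column from football-data.co.uk:

```r
# Proportion of matches where the result matched the bookmakers' favourite.
round(prop.table(table(matches$pred == matches$FTR)), 3)
```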
Considering only the home and away wins:
##
## FALSE TRUE
## 0.282 0.718
Of games which are not drawn, the bookmakers still get 28.2% wrong.
To analyse their performance through the season, we can take an average for each number of games played by the home team. This is a rough proxy for time.
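A sketch of that aggregation, assuming hypothetical columns `correct` (did the favourite win?) and `home_games_played`:

```r
# Mean prediction accuracy at each stage of the season, with a linear fit.
acc   <- tapply(matches$correct, matches$home_games_played, mean)
games <- as.numeric(names(acc))
plot(games, acc, xlab = "Home team games played", ylab = "Prediction accuracy")
abline(lm(acc ~ games), col = "blue")
```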
We see from the linear model fit that the predictions do become more accurate through the season, as would be expected once the quality of each team is better understood. There are some huge swings in variance, however: almost 65% accuracy after 30 games but only 43% two games later.
Predictions at 30 games
##
## FALSE TRUE
## 61 110
Predictions at 32 games
##
## FALSE TRUE
## 104 78
This is surprising as the sample sizes are not that small.
The variance makes the shape of the graph extremely sensitive to how the time series is formulated. For example, we could instead aggregate by calendar month.
Again, we see a gradual increase (from 51% to 56%) with large variance - poor prediction in November (<51%), high in March (>60%).
6b) How successful are bookmakers at ‘predicting’ draws? Unlike in many sports (tennis, American football, basketball), draws are a possible and regular occurrence in football. Although bookmakers do not assign the highest probability to draws, assigning a higher probability to them should suggest that the draw is more likely than in other fixtures.
To begin with, what odds do bookmakers assign to draws on average, and how does that compare for the games that are drawn?
## [1] "Average bookmaker probability of a draw 0.259"
## [1] "Average bookmaker probability of a draw for games that were in fact drawn 0.267"
## [1] "Average bookmaker probability of a draw for games that were not drawn 0.256"
On average, bookmakers assign a probability of a draw only 1.1 percentage points higher for games that are drawn than for games that are not.
Looking at it the other way, when bookmakers assign relatively high probabilities to draws, what are the observed results?
First, we explore the range of probabilities assigned.
For simplicity, we can assign the bookmakers’ predictions to three groups. The highest probabilities are in the right tail (p > 0.295), while the lowest probabilities are in the left tail (p < 0.255). Those in between are medium probability (note this is where the most commonly assigned probabilities tend to lie).
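The banding can be sketched with `cut()`, using the 0.255 and 0.295 cut points quoted above (`p_draw` is an assumed column holding the overround-free draw probability):

```r
# Band each match by the bookmakers' draw probability, cross-tabulate against
# the actual result, then take the share of draws within each band.
matches$draw_band <- cut(matches$p_draw,
                         breaks = c(0, 0.255, 0.295, 1),
                         labels = c("low", "medium", "high"))
table(matches$FTR, matches$draw_band)
round(prop.table(table(matches$draw_band, matches$FTR), margin = 1)[, "D"], 3)
```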
Below are the actual results of the games, split by the group into which the bookmakers placed the probability of a draw.
##
## high low medium
## A 78 454 1381
## D 94 367 1290
## H 110 1171 1895
We can then calculate the accuracy of predictions based on the games that were in fact drawn.
## high medium low
## 0.333 0.283 0.184
These accuracies are in the expected sequence, but even when the bookmakers place the probability of a draw as being low, 18% of games still result in draws.
Assigned probabilities of draws for games that were draws.
This plot and the previous one are nearly identical distributions, but if the bookmakers were able to accurately assign higher probabilities to games that were ultimately drawn, the second plot should have higher density to the right and significantly lower density on the left tail. Neither of these is observed.
Many of these findings warrant further investigation, with robust statistical inference applied.
It would also be interesting to widen the research by running identical studies on other professional leagues such as La Liga, Serie A and the Bundesliga.