Home-field advantage has been researched extensively across many sports. It's clear that it exists, that it is more prevalent in some sports than in others, and that it has been declining over time. In soccer, this decline appears tied to the fall in total goals scored per game. Oliver Roeder and I discussed this in this FiveThirtyEight piece. I also chatted about home-field advantage across more soccer leagues in this NPR interview.
Given this historical trend, the results of Burnley Football Club in England’s Premier League appear to be extraordinary. At the time of writing (Jan 17 2017) Burnley have played 21 games and have 26 points - but only 1 point came away from home. Just how extraordinary is this?
Fortunately (or unfortunately, depending on your perspective) I have collected every single English soccer league game result in history and put them into an R package, engsoccerdata. And if you're going to go to that trouble, you may as well make use of it.
Answering this question turns out to be a nice exercise in using the tidyverse suite of packages (dplyr, tidyr and ggplot2 in particular). So, I wrote up how I did it as a mini-tutorial. If you want to flex your tidyverse skills, there are a couple of exercises at the bottom.
The engsoccerdata package contains several datasets. We'll use england, which contains every result from the 1888/89 season to the 2015/16 season. This package is on CRAN. If you use the latest version on GitHub, you can also use the england_current() function to automatically bring in data from the current (2016/17) season.
We use rbind to bind these two datasets together and then only keep the top-tier results. Next, we use the homeaway() function that is in engsoccerdata to reshape the dataset to list the home and away results of every team in the same dataframe.
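library(tidyverse)
library(engsoccerdata)

# Combine historical results with the current season, keep the top tier,
# and reshape so each game appears once per team (a home row and an away row).
df <- rbind(england, england_current()) %>%
  filter(tier == 1) %>%
  homeaway()

# Store dates as Date objects so they sort chronologically.
df$Date <- as.Date(as.character(df$Date))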
head(df)
        Date Season            team                     opp gf ga division tier venue
1 1888-09-08   1888 Accrington F.C.                 Everton  1  2        1    1  away
2 1888-09-15   1888 Accrington F.C.        Blackburn Rovers  5  5        1    1  away
3 1888-09-22   1888 Accrington F.C.            Derby County  1  1        1    1  away
4 1888-09-29   1888 Accrington F.C.              Stoke City  4  2        1    1  away
5 1888-10-06   1888 Accrington F.C. Wolverhampton Wanderers  4  4        1    1  home
6 1888-10-13   1888 Accrington F.C.            Derby County  6  2        1    1  home
tail(df)
            Date Season                    team             opp gf ga division tier venue
95475 2012-04-11   2011 Wolverhampton Wanderers         Arsenal  0  3        1    1  home
95476 2012-04-14   2011 Wolverhampton Wanderers      Sunderland  0  0        1    1  away
95477 2012-04-22   2011 Wolverhampton Wanderers Manchester City  0  2        1    1  home
95478 2012-04-28   2011 Wolverhampton Wanderers    Swansea City  4  4        1    1  away
95479 2012-05-06   2011 Wolverhampton Wanderers         Everton  0  0        1    1  home
95480 2012-05-13   2011 Wolverhampton Wanderers  Wigan Athletic  2  3        1    1  away
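To make the reshaping step concrete, here is a minimal sketch of the same idea on toy data. It is not the package's implementation of homeaway(); the teams, scores and the input column names (home, visitor, hgoal, vgoal) are assumptions made up for illustration. Each match is turned into two rows, one from each team's point of view:

```r
library(dplyr)

# Toy results, one row per match (hypothetical teams and scores).
matches <- data.frame(
  Date    = as.Date(c("2000-08-01", "2000-08-08")),
  home    = c("Team A", "Team B"),
  visitor = c("Team B", "Team C"),
  hgoal   = c(2, 0),
  vgoal   = c(1, 0),
  stringsAsFactors = FALSE
)

# The match as seen by the home team.
home_view <- matches %>%
  select(Date, team = home, opp = visitor, gf = hgoal, ga = vgoal) %>%
  mutate(venue = "home")

# The same match as seen by the away team: goals for and against swap.
away_view <- matches %>%
  select(Date, team = visitor, opp = home, gf = vgoal, ga = hgoal) %>%
  mutate(venue = "away")

# Stack both views: every match now contributes two rows.
long <- bind_rows(home_view, away_view) %>%
  arrange(team, Date)

print(long)
```

The result has the same team/opp/gf/ga/venue layout as the output above, which is what makes per-team summaries straightforward later on.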
Now we do the bulk of the data analysis. We first want to add a column that has the 'game number' for each team in each season - i.e. is it the 1st, 2nd, 3rd, 4th game etc. We do this because we want to know whether any team, through its first 21 games of a season, has had a more extreme home record in terms of points than 2016/17 Burnley.
After we keep all games for each team up to their 21st game, we calculate the home and away points for each team in each season up to that point. Here, we assume 3 points for a win so that all seasons can be compared (English league football awarded 2 points for a win until 1981). The next task is to reshape the data so each row is a team/Season combination and the away and home points are in separate columns. We use spread from tidyr. It's then trivial to calculate the total points and the percent of points gained at home by each team in each season. Finally, we ungroup and arrange the dataframe to show us the top 10 results.
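df1 <- df %>%
  group_by(team, Season) %>%
  arrange(Date) %>%                       # chronological order, so row_number() counts games
  mutate(gameno = row_number()) %>%       # game number within each team-season
  filter(gameno <= 21) %>%                # keep each team's first 21 games
  group_by(team, venue, Season) %>%
  summarise(totalpts = 3 * sum(gf > ga) + sum(gf == ga)) %>%  # 3 per win, 1 per draw
  spread(venue, totalpts) %>%             # one row per team-season: away and home columns
  mutate(totalpts = away + home, pcthome = home / totalpts) %>%
  ungroup() %>%
  arrange(-pcthome)                       # most home-dependent records first
head(df1, 10)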
# A tibble: 10 × 6
                       team Season  away  home totalpts   pcthome
                     <fctr>  <dbl> <dbl> <dbl>    <dbl>     <dbl>
1              Derby County   1926     0    25       25 1.0000000
2                 Liverpool   1953     0    20       20 1.0000000
3         Manchester United   1930     0     8        8 1.0000000
4         Manchester United   1936     0    17       17 1.0000000
5       Queens Park Rangers   2014     0    19       19 1.0000000
6                   Burnley   2016     1    25       26 0.9615385
7         Nottingham Forest   1895     1    25       26 0.9615385
8                Stoke City   1893     1    25       26 0.9615385
9   Wolverhampton Wanderers   1894     1    24       25 0.9600000
10             Notts County   1901     1    22       23 0.9565217
Interestingly, Burnley in 2016/17 are up there - but they aren't the most extreme. Derby County in 1926/27 had 25 points from 21 games (counting 3 points for a win), with all of them coming at home. They did win their next away game though, and ended the season with 3 away wins and 3 away draws.
I checked this using this code: england %>% filter(Season==1926) %>% homeaway() %>% filter(team=="Derby County").
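For readers who want to see the game-numbering and points logic in isolation, here is a minimal sketch on toy data. The team, dates and scores are invented for illustration, and the cut-off is 3 games rather than 21 to keep the example small. It uses pivot_wider(), which tidyr now recommends in place of spread(); that is a swap made for this sketch, not what the analysis above ran:

```r
library(dplyr)
library(tidyr)

# Toy games for one hypothetical team in one season.
toy <- data.frame(
  Date   = as.Date(c("2000-08-01", "2000-08-08", "2000-08-15", "2000-08-22")),
  Season = 2000,
  team   = "Team A",
  venue  = c("home", "away", "home", "away"),
  gf     = c(2, 0, 1, 3),
  ga     = c(0, 1, 1, 1),
  stringsAsFactors = FALSE
)

n_games <- 3  # illustrative cut-off; the post uses 21

toy_pts <- toy %>%
  group_by(team, Season) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(gameno = row_number()) %>%       # 1st, 2nd, 3rd ... game of the season
  filter(gameno <= n_games) %>%           # drops the 4th game (an away win)
  group_by(team, Season, venue) %>%
  summarise(totalpts = 3 * sum(gf > ga) + sum(gf == ga), .groups = "drop") %>%
  pivot_wider(names_from = venue, values_from = totalpts,
              values_fill = list(totalpts = 0)) %>%
  mutate(totalpts = away + home, pcthome = home / totalpts)

print(toy_pts)
```

In the first three games the toy team has a home win (3 points), an away loss (0) and a home draw (1), so home is 4, away is 0 and pcthome is 1. The fourth game, an away win, falls outside the cut-off and is ignored, which mirrors how Derby County's later away results do not affect their 21-game figure.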
To get an idea of how exceptional all of these home records are, let’s make a histogram of “percent points at home” across all teams in all seasons:
ggplot(df1, aes(pcthome)) +
  geom_histogram(binwidth = 0.01, fill = "dodgerblue", color = "blue4") +
  geom_vline(xintercept = 0.5, color = "red", lty = 2) +
  xlab("Percent") +
  ggtitle("Percent of total points coming at home after 21 games")
This is pretty much what we expected: the majority of teams get most of their points at home. However, there are some intriguing teams with woeful home records - who are they? We can find out by sorting on pcthome in ascending order instead.
df1 %>%
  arrange(pcthome) %>%
  head(10)
# A tibble: 10 × 6
               team Season  away  home totalpts   pcthome
             <fctr>  <dbl> <dbl> <dbl>    <dbl>     <dbl>
1    Crystal Palace   1997    19     4       23 0.1739130
2  Bolton Wanderers   2011    12     4       16 0.2500000
3        Sunderland   1981    12     5       17 0.2941176
4   West Ham United   2002    11     5       16 0.3125000
5           Chelsea   1986    13     6       19 0.3157895
6       Aston Villa   2013    15     8       23 0.3478261
7           Chelsea   1965    22    12       34 0.3529412
8      Norwich City   1993    22    12       34 0.3529412
9        Portsmouth   2007    22    12       34 0.3529412
10  West Ham United   1988    11     6       17 0.3529412
The most notable thing about this table is how recent these seasons are: although the data go back to 1888, none of the ten is earlier than the 1965 season.
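One quick way to see this is to tally the ten seasons by decade. A minimal sketch, with the Season values copied by hand from the table above (the data frame name worst_home is just illustrative; on the real data you would start from df1 instead):

```r
library(dplyr)

# Seasons of the ten lowest pcthome values, copied from the table above.
worst_home <- data.frame(
  Season = c(1997, 2011, 1981, 2002, 1986, 2013, 1965, 1993, 2007, 1988)
)

# Round each season down to its decade and count seasons per decade.
by_decade <- worst_home %>%
  mutate(decade = floor(Season / 10) * 10) %>%
  count(decade)

print(by_decade)
```

Every decade before the 1960s is absent from the tally, and eight of the ten seasons fall in the 1980s or later.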
EXERCISES:
If you want to try out your data munging skills, try these:
Which teams in which years have the most extreme home vs away records in other leagues such as Spain, France, Italy, Holland, Germany? What about other tiers of the English league?
Which teams in which years have the most skewed home record over a whole season?
To get in touch with me please use twitter or email: jc3181 AT columbia DOT edu.