Breakpoint analysis of English Soccer Data

James P. Curley - 2 Dec 2014

jc3181 AT columbia DOT edu

Here I am going to use a new open-source R package developed by twitter called BreakoutDetection to analyze changes in scoring patterns over time in English soccer. It is available on GitHub here.

Install required packages

Install the two packages you need from GitHub. Make sure you have the devtools package loaded:

library(devtools)
install_github('jalapic/engsoccerdata', username = "jalapic")
install_github("twitter/BreakoutDetection")

Get data into a neat format

Load required packages.

library(engsoccerdata)
library(BreakoutDetection)
library(dplyr)
library(ggplot2)

Use dplyr to quickly create a summary dataframe of the goals-per-game in each season of the engsoccerdata2 dataset. This contains all English professional league soccer results from 1888-2014 (>180,000 complete games). For simplicity, we will only consider matches occurring in the top division.

gpg <- engsoccerdata2 %>%
  filter(tier==1)%>%
  group_by(Season) %>%
  summarise(gpg=mean(totgoal))

head(gpg)

## Source: local data frame [6 x 2]
## 
##   Season      gpg
## 1   1888 4.439394
## 2   1889 4.628788
## 3   1890 4.196970
## 4   1891 4.269231
## 5   1892 3.900000
## 6   1893 3.912500

Let’s make a nice graph of these data. This is the same chart discussed in this article at the 538 blog.

To do this, I also need to add some rows for missing years in the time-series when the league was not in operation (due to the World Wars)

#add missing years to dataframe
gpgNA <- data.frame(Season = setdiff(1888:2013, gpg$Season), gpg = NA)  #create a df we can append to the gpg df 
gpg1 <- rbind(gpg, gpgNA) #this ensures that the war years are gaps when doing a line graph


g1 <- ggplot(gpg1, aes(Season,gpg)) +
geom_point(color="red", size=2.5) + 
       geom_line(aes(group=1), lwd=1, color="firebrick3") +
         scale_x_continuous(breaks=c(seq(1880,2020,10))) +
       ggtitle("Goals per game in England's top flight: 1888-2014") + 
       ylab("GPG") + xlab("") +
  theme(
    panel.grid.major.x = element_line(color="gray85"),
    panel.grid.major.y = element_line(color="gray85"),
    axis.ticks.x = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid.minor = element_blank(),
    plot.background  = element_rect(color = "ghostwhite"),
    panel.background = element_blank(),
    plot.title = element_text(hjust=0,vjust=1)
    )

g1

Finding breakpoints

The function for detecting breakpoints in the BreakoutDetection package is called breakout. The algorithm employed is essentially looking for transitions between two steady states. These may be sudden shifts or may be ramping ups or ramping downs. This algorithm is referred to as E-Divisive with Medians (EDM) and is utilizing permutation tests to assess for changes in medians between steady states. For more information see the help associated with the package.

To use this function is very simple. There are however several optional arguments that one can adjust to alter the sensitivty of the analysis. The only one that I have thus far altered is called min.size. This refers to the minimum number of observations between change points. I think a decade is reasonable for these data, so I have set this to 10.

Finally, it appears to me that this type of analysis cannot cope with missing data in the middle of the time-series. Therefore, I am considering just the time-series in the ‘gpg’ dataframe as if it were a continuous sample.

Here is the code:

res <- breakout(gpg$gpg, min.size=10, method='multi', plot=TRUE)

res$plot  # this is the plot produced by the package

res$loc   # this is the observation number at which breakpoints are detected, they refer to the years: 1898, 1923, 1933, 1950, 1961 & 1974

## [1] 11 32 42 53 64 77

To add this to our ggplot produced figure, I am simply going to add dotted vertical lines at the appropriate dates:

g2 <- g1 + geom_vline(xintercept = c(1898, 1923, 1933,  1950, 1961, 1974), color="indianred3", lwd=1, lty=2)
g2

Next, I wanted to plot the means on top of this graph, so I calculated the means for each interval:

m1 <- gpg %>% filter(Season<1898) %>% summarise(meangpg=mean(gpg)) #3.87
m2 <- gpg %>% filter(Season >= 1899 & Season<1923) %>% summarise(meangpg=mean(gpg)) #2.92
m3 <- gpg %>% filter(Season >= 1923 & Season<1933) %>% summarise(meangpg=mean(gpg)) #3.49
m4 <- gpg %>% filter(Season >= 1933 & Season<1950) %>% summarise(meangpg=mean(gpg)) #3.09
m5 <- gpg %>% filter(Season >= 1950 & Season<1961) %>% summarise(meangpg=mean(gpg)) #3.44
m6 <- gpg %>% filter(Season >= 1961 & Season<1974) %>% summarise(meangpg=mean(gpg)) #2.90
m7 <- gpg %>% filter(Season >= 1974) %>% summarise(meangpg=mean(gpg)) #2.64

This can be put into a dataframe. This dataframe will serve as the input to the geom_rect command to produce a nice output of the differences in means in different time periods. To assist with this, I’m also adding an ‘id’ variable to the dataframe which is equivalent to a unique identifier for each period. I’m also going to make the ‘ymin’ and ‘ymax’ slightly wider than the actual mean so that the geom rectangles become visible.

mydf<-data.frame(
  id = 1:7,
  xmin = c(1881, 1898, 1923, 1933,  1950, 1961, 1974),
  xmax = c(1898, 1923, 1933,  1950, 1961, 1974, 2013),
  ymin  = as.numeric(unlist(c(m1,m2,m3,m4,m5,m6,m7))),
  ymax  = as.numeric(unlist(c(m1,m2,m3,m4,m5,m6,m7)))
)

mydf[,4] <- mydf[,4]-.05
mydf[,5] <- mydf[,5]+.05

mydf

##   id xmin xmax     ymin     ymax
## 1  1 1881 1898 3.817605 3.917605
## 2  2 1898 1923 2.872406 2.972406
## 3  3 1923 1933 3.437446 3.537446
## 4  4 1933 1950 3.044845 3.144845
## 5  5 1950 1961 3.394510 3.494510
## 6  6 1961 1974 2.849600 2.949600
## 7  7 1974 2013 2.587147 2.687147

And this is how we plot it….

g2 + geom_rect(data=mydf, inherit.aes=FALSE,
          aes(xmin=xmin,xmax=xmax,ymin=ymin,ymax=ymax,
              group=id), fill="dodgerblue", alpha=0.5)

Summary

This is a very quick and dirty look at one potential method for analyzing change-points in these kinds of data. The data here show quite abrupt changes and several of them. The ones identified by the BreakoutDetection package are probably those that we would expect. The most interesting one is probably the one identified at 1923. The offside law was changed in 1925 to enable more scoring. The other revealing one is at 1974. Soccer tactics were evolving throughout the 1960s with the adoption of ‘method football’ and more focus on defense. This shift is picked up by this analysis at the 1974 breakpoint.

If I was interested in taking this further, I’d probably like to get to understand the optional arguments in the breakout function more thoroughly. Also, this method may not be the most appropriate for this relatively short time series (that also contains some missing data), so it might be worth investigating one other interesting change-point analysis package - bcpa or Behavioral Change Point Analysis. It was fun to get to know this new BreakpointDetection package all the same.

Any questions or comments, please email me at jc3181 AT columbia DOT edu