James P. Curley - 2 Dec 2014

jc3181 AT columbia DOT edu

   

In this quick tutorial, I shall give a short illustrated example of the sort of question that can be answered using some of the data contained within my engsoccerdata package. I picked the following question more or less at random, but it serves our purpose well…

 

Let’s say we wanted to know for every season what proportion of teams who got through to the quarter-finals (last eight teams) also did so the following season? This would be one crude and basic way of assessing the ‘competitiveness’ of the competition.

 


How do we do this ?

If you haven’t already installed the engsoccerdata package, then do that:

library(devtools)
install_github('jalapic/engsoccerdata', username = "jalapic")

 

Next, load the packages we will need and take a look at the facup dataset from engsoccerdata. Because there are 19 variables and they don’t all fit nicely across the screen, I shall display them in 2 chunks:

#these are the packages that we will need to have loaded for our data munging and graphing:
library(engsoccerdata)
library(dplyr)
library(tidyr)
library(ggplot2)
library(zoo)

 

facup[1:5, 1:14]   #first 5 rows, variables 1-14
##         date Season                home             visitor  FT hgoal
## 1 1871-11-11   1871              Barnes       Civil Service 2-0     2
## 2 1871-11-11   1871             Hitchin Crystal Palace 1861 0-0     0
## 3 1871-11-11   1871   Maidenhead United              Marlow 2-0     2
## 4 1871-11-11   1871          Upton Park      Clapham Rovers 0-3     0
## 5 1871-12-16   1871 Crystal Palace 1861   Maidenhead United 3-0     3
##   vgoal round     tie  aet  pen pens hp vp
## 1     0     1 initial <NA> <NA> <NA> NA NA
## 2     0     1 initial <NA> <NA> <NA> NA NA
## 3     0     1 initial <NA> <NA> <NA> NA NA
## 4     3     1 initial <NA> <NA> <NA> NA NA
## 5     0     2 initial <NA> <NA> <NA> NA NA

These first rows give the date of the match, the season (e.g. 1871 refers to 1871/72), the home team, the visiting team, & the final score (regardless of whether this was after 90 minutes or extra time). The next two variables refer to the number of goals scored by the teams in the home and visitor variables respectively.
   

The next column refers to the round of the match. This can be 1 to 6, ‘s’ for semi, ‘f’ for final or ‘3pp’ for third-place playoff (which only occurred 1970/1-1973/4). The ‘tie’ variable refers to whether the game is an initial tie (first meeting in that round), a replay or something else (a bye, walkover or void match). The next variable ‘aet’ is whether extra time was played. The final four variables here refer to penalty shoot-outs but that isn’t important right now.  

facup[1:5, 15:19]  #first 5 rows, variables 15-19
##                    Venue attendance nonmatch
## 1                   <NA>      1,200     <NA>
## 2     Top Field, Hitchin        750     <NA>
## 3  York Road, Maidenhead      1,287     <NA>
## 4  West Ham Park, London      1,500     <NA>
## 5 Crystal Palace, London       <NA>     <NA>
##                            notes neutral
## 1                           <NA>    <NA>
## 2 No replay, both teams progress    <NA>
## 3                           <NA>    <NA>
## 4                           <NA>    <NA>
## 5                           <NA>    <NA>

These variables refer to the venue, attendance and other notes and details regarding the circumstances of the individual game. This isn’t important either right now.
 

Because we are interested in the last eight teams each season, we should take a look at how many teams are in each round in each year. To do this, I will only look at initial ties. This will produce quite a lot of output, but it’s probably worth reproducing all of it for this purpose:

facup.i <- facup %>% filter(tie=="initial")  
table(facup.i$round, facup.i$Season)
##      
##       1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883
##   1      6    7   14   14   16   18   21   21   26   31   36   39   48
##   2      5    3    7    7    8    9   11   11   13   13   16   20   24
##   3      2    2    3    4    4    5    5    5    5    6    7   10   11
##   3pp    0    0    0    0    0    0    0    0    0    0    0    0    0
##   4      0    1    0    0    0    2    3    3    5    6    7    7    8
##   5      0    0    0    0    0    0    0    0    2    3    3    4    4
##   6      0    0    0    0    0    0    0    0    0    0    0    0    0
##   f      1    1    1    1    1    1    1    1    1    1    1    1    1
##   s      3    1    2    2    2    1    1    1    1    1    2    2    2
##      
##       1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896
##   1     55   65   62   73   16   16   13   16   16   16   16   16   16
##   2     27   32   31   35    8    8    8    8    8    8    8    8    8
##   3     13   15   16   18    4    4    4    4    4    4    4    4    4
##   3pp    0    0    0    0    0    0    0    0    0    0    0    0    0
##   4      9    2    3    7    0    0    0    0    0    0    0    0    0
##   5      1    8    8    8    0    0    0    0    0    0    0    0    0
##   6      4    4    4    4    0    0    0    0    0    0    0    0    0
##   f      1    1    1    1    1    1    1    1    1    1    1    1    1
##   s      2    2    2    2    2    2    2    2    2    2    2    2    2
##      
##       1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909
##   1     16   16   16   16   16   16   16   16   32   32   32   32   32
##   2      8    8    8    8    8    8    8    8   16   16   16   16   16
##   3      4    4    4    4    4    4    4    4    8    8    8    8    8
##   3pp    0    0    0    0    0    0    0    0    0    0    0    0    0
##   4      0    0    0    0    0    0    0    0    4    4    4    4    4
##   5      0    0    0    0    0    0    0    0    0    0    0    0    0
##   6      0    0    0    0    0    0    0    0    0    0    0    0    0
##   f      1    1    1    1    1    1    1    1    1    1    1    1    1
##   s      2    2    2    2    2    2    2    2    2    2    2    2    2
##      
##       1910 1911 1912 1913 1914 1919 1920 1921 1922 1923 1924 1925 1926
##   1     32   32   32   32   32   32   32   32   32   32   32   38   38
##   2     16   16   16   16   16   16   16   16   16   16   16   19   19
##   3      8    8    8    8    8    8    8    8    8    8    8   32   32
##   3pp    0    0    0    0    0    0    0    0    0    0    0    0    0
##   4      4    4    4    4    4    4    4    4    4    4    4   16   16
##   5      0    0    0    0    0    0    0    0    0    0    0    8    8
##   6      0    0    0    0    0    0    0    0    0    0    0    4    4
##   f      1    1    1    1    1    1    1    1    1    1    1    1    1
##   s      2    2    2    2    2    2    2    2    2    2    2    2    2
##      
##       1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1945
##   1     34   34   34   34   34   34   34   34   34   34   34   34    0
##   2     17   17   17   17   17   17   17   17   17   17   17   17    0
##   3     32   32   32   32   32   32   32   32   32   32   32   32    0
##   3pp    0    0    0    0    0    0    0    0    0    0    0    0    0
##   4     16   16   16   16   16   16   16   16   16   16   16   16    0
##   5      8    8    8    8    8    8    8    8    8    8    8    8    0
##   6      4    4    4    4    4    4    4    4    4    4    4    4    0
##   f      1    1    1    1    1    1    1    1    1    1    1    1    1
##   s      2    2    2    2    2    2    2    2    2    2    2    2    2
##      
##       1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958
##   1     34   34   34   34   34   40   40   40   40   40   40   40   40
##   2     17   17   17   17   17   20   20   20   20   20   20   20   20
##   3     32   32   32   32   32   32   32   32   32   32   32   32   32
##   3pp    0    0    0    0    0    0    0    0    0    0    0    0    0
##   4     16   16   16   16   16   16   16   16   16   16   16   16   16
##   5      8    8    8    8    8    8    8    8    8    8    8    8    8
##   6      4    4    4    4    4    4    4    4    4    4    4    4    4
##   f      1    1    1    1    1    1    1    1    1    1    1    1    1
##   s      2    2    2    2    2    2    2    2    2    2    2    2    2
##      
##       1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971
##   1     40   40   40   40   40   40   40   40   40   40   40   40   40
##   2     20   20   20   20   20   20   20   20   20   20   20   20   20
##   3     32   32   32   32   32   32   32   32   32   32   32   32   32
##   3pp    0    0    0    0    0    0    0    0    0    0    0    1    1
##   4     16   16   16   16   16   16   16   16   16   16   16   16   16
##   5      8    8    8    8    8    8    8    8    8    8    8    8    8
##   6      4    4    4    4    4    4    4    4    4    4    4    4    4
##   f      1    1    1    1    1    1    1    1    1    1    1    1    1
##   s      2    2    2    2    2    2    2    2    2    2    2    2    2
##      
##       1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984
##   1     40   40   40   40   40   40   40   40   40   40   40   40   40
##   2     20   20   20   20   20   20   20   20   20   20   20   20   20
##   3     32   32   32   32   32   32   32   32   32   32   32   32   32
##   3pp    1    1    0    0    0    0    0    0    0    0    0    0    0
##   4     16   16   16   16   16   16   16   16   16   16   16   16   16
##   5      8    8    8    8    8    8    8    8    8    8    8    8    8
##   6      4    4    4    4    4    4    4    4    4    4    4    4    4
##   f      1    1    1    1    1    1    1    1    1    1    1    1    1
##   s      2    2    2    2    2    2    2    2    2    2    2    2    2
##      
##       1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
##   1     40   40   40   40   40   40   40   40   40   39   40   40   40
##   2     20   20   20   20   20   20   20   20   20   20   20   20   20
##   3     32   32   32   32   32   32   32   32   32   32   32   32   32
##   3pp    0    0    0    0    0    0    0    0    0    0    0    0    0
##   4     16   16   16   16   16   16   16   16   16   16   16   16   16
##   5      8    8    8    8    8    8    8    8    8    8    8    8    8
##   6      4    4    4    4    4    4    4    4    4    4    4    4    4
##   f      1    1    1    1    1    1    1    1    1    1    1    1    1
##   s      2    2    2    2    2    2    2    2    2    2    2    2    2
##      
##       1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
##   1     40   40   40   40   40   40   40   40   40   40   40   40   40
##   2     20   20   20   20   20   20   20   20   20   20   20   20   20
##   3     32   32   32   32   32   32   32   32   32   32   32   32   32
##   3pp    0    0    0    0    0    0    0    0    0    0    0    0    0
##   4     16   16   16   16   16   16   16   16   16   16   16   16   16
##   5      8    8    8    8    8    8    8    8    8    8    8    8    8
##   6      4    4    4    4    4    4    4    4    4    4    4    4    4
##   f      1    1    1    1    1    1    1    1    1    1    1    1    1
##   s      2    2    2    2    2    2    2    2    2    2    2    2    2
##      
##       2011 2012 2013
##   1     40   40   40
##   2     20   20   20
##   3     32   32   32
##   3pp    0    0    0
##   4     16   16   16
##   5      8    8    8
##   6      4    4    4
##   f      1    1    1
##   s      2    2    2

Going backwards in time, it’s pretty clear that the format is quite orderly from 1946/47 to 2013/14 and that the 6th round has 4 ties (8 teams) in it for each season during this stretch. In 1945 there was a different format (we’ll return to that in a minute). The 6th round also consisted of 4 ties from 1925/26 - 1938/39. From 1905/06 - 1924/25 it was the 4th round that had 4 ties. From 1888/89 to 1904/05 it was the 3rd round. From 1885/86 - 1887/88 it was the 6th round again with 4 ties. I won’t bother in this tutorial with looking prior to the 1885/86 season as the format was very unwieldy earlier than this.

  In the 1945/46 season, teams played each other in 2 legs for every round prior to the final. Therefore, we need to look at this season separately, and it is clear that round 6 has 4 ties/8 teams:

facup.45 <- facup %>% filter(Season==1945)  
table(facup.45$round, facup.45$tie)
##    
##     initial leg1 leg2 replay
##   1       0   34   34      0
##   2       0   17   17      0
##   3       0   32   32      2
##   4       0   16   16      1
##   5       0    8    8      0
##   6       0    4    4      0
##   f       1    0    0      0
##   s       2    0    0      1

   

Next we shall use dplyr and tidyr to collate all the rounds/seasons we need to answer our question. Each series of commands filters the seasons/rounds/ties of interest before creating a neat, two-variable df consisting of ‘season’ and ‘team’:

#1946/47 - 2013/14
df1<-facup %>% 
  filter(Season<=2013 & Season>=1946) %>%
  filter(tie=="initial" & round=="6") %>%
  select(Season,home,visitor) %>%
  gather(key,team,2:3) %>%
  select(Season,team) %>%
  arrange(Season,team)


#1945/46
df2<-facup %>% 
  filter(Season==1945) %>%
  filter(tie=="leg1" & round=="6") %>%
  select(Season,home,visitor) %>%
  gather(key,team,2:3) %>%
  select(Season,team) %>%
  arrange(Season,team)


#1925/26 - 1938/39
df3<-facup %>% 
  filter(Season<=1938 & Season>=1925) %>%
  filter(tie=="initial" & round=="6") %>%
  select(Season,home,visitor) %>%
  gather(key,team,2:3) %>%
  select(Season,team) %>%
  arrange(Season,team)


#1905/06 - 1924/25
df4<-facup %>% 
  filter(Season<=1924 & Season>=1905) %>%
  filter(tie=="initial" & round=="4") %>%
  select(Season,home,visitor) %>%
  gather(key,team,2:3) %>%
  select(Season,team) %>%
  arrange(Season,team)


#1888/89 - 1904/05
df5<-facup %>% 
  filter(Season<=1904 & Season>=1888) %>%
  filter(tie=="initial" & round=="3") %>%
  select(Season,home,visitor) %>%
  gather(key,team,2:3) %>%
  select(Season,team) %>%
  arrange(Season,team)


#1885/86 - 1887/88
df6<-facup %>% 
  filter(Season<=1887 & Season>=1885) %>%
  filter(tie=="initial" & round=="6") %>%
  select(Season,home,visitor) %>%
  gather(key,team,2:3) %>%
  select(Season,team) %>%
  arrange(Season,team)

To give an example of what each of these looks like, here are the first 16 rows of ‘df1’. You can see here that ‘Preston North End’ were the only team to appear in both 1946/47 and 1947/48:

head(df1,16)
##    Season                team
## 1    1946     Birmingham City
## 2    1946             Burnley
## 3    1946   Charlton Athletic
## 4    1946           Liverpool
## 5    1946       Middlesbrough
## 6    1946    Newcastle United
## 7    1946   Preston North End
## 8    1946    Sheffield United
## 9    1947           Blackpool
## 10   1947        Derby County
## 11   1947              Fulham
## 12   1947   Manchester United
## 13   1947   Preston North End
## 14   1947 Queens Park Rangers
## 15   1947         Southampton
## 16   1947   Tottenham Hotspur

    Now we simply bind all of these together and we have our dataframe of ‘last 8’ teams for 1885/86 - 2013/14…

last8 <- rbind(df6,df5,df4,df3,df2,df1)
head(last8)
##   Season             team
## 1   1885  Birmingham City
## 2   1885 Blackburn Rovers
## 3   1885        Brentwood
## 4   1885 Old Westminsters
## 5   1885           Redcar
## 6   1885      South Shore

    Now for the slightly tricky bit. We need to work out the difference in the number of teams from one season to the next. Fortunately this is made easy by using the setdiff function. I’m going to run a for loop for this, there are probably better ways, but this method is quite easy to deconstruct to illustrate what is going on. Here, I will build up the steps:

First create a list of each set of 8 teams using split on season:

last8.l <- split(last8, last8$Season)

For example, this is the third season in our data - 1886/87

last8.l[[3]] 
##    Season                 team
## 17   1887     Blackburn Rovers
## 18   1887      Crewe Alexandra
## 19   1887       Derby Junction
## 20   1887        Middlesbrough
## 21   1887      Old Carthusians
## 22   1887    Preston North End
## 23   1887  Sheffield Wednesday
## 24   1887 West Bromwich Albion
last8.l[[3]][,2] #this just returns a vector of teams, i.e. the 2nd variable
## [1] "Blackburn Rovers"     "Crewe Alexandra"      "Derby Junction"      
## [4] "Middlesbrough"        "Old Carthusians"      "Preston North End"   
## [7] "Sheffield Wednesday"  "West Bromwich Albion"
last8.l[[4]][,2] #this returns a vector of teams from the following season
## [1] "Aston Villa"             "Birmingham St George's" 
## [3] "Blackburn Rovers"        "Chatham"                
## [5] "Preston North End"       "Sheffield Wednesday"    
## [7] "West Bromwich Albion"    "Wolverhampton Wanderers"

Now get the difference between these lists:

setdiff(last8.l[[4]][,2] , last8.l[[3]][,2])
## [1] "Aston Villa"             "Birmingham St George's" 
## [3] "Chatham"                 "Wolverhampton Wanderers"

Although we can obviously see that 4 teams (Aston Villa, Birmingham St George’s, Chatham, Wolverhampton Wanderers) were in the last 8 in 1887/88 that were not in 1886/87, we can automate this process using length:

length(setdiff(last8.l[[4]][,2] , last8.l[[3]][,2])) #which means that 8-4 = 4 teams were the same
## [1] 4

 

Let’s put all of that in a loop. We start the loop at the 2nd season as obviously we cannot start at the first as it has no comparison (for the purposes of this tutorial):

results <- NULL #create empty vector

for (i in 2:length(last8.l)){
differences <- length(setdiff(last8.l[[i]][,2] , last8.l[[i-1]][,2]))
results[[i]] <- 8 - differences
}

 

The results show the number of teams in the last 8 who were also in the season before:

results
##   [1] NA  2  3  4  4  5  6  4  2  4  5  2  4  5  4  3  1  3  2  2  4  4  1
##  [24]  2  1  2  3  2  3  1  2  3  4  1  0  3  0  2  1  3  4  0  1  1  1  3
##  [47]  1  2  4  2  0  2  1  2  2  2  1  1  2  0  4  2  2  1  3  4  3  1  3
##  [70]  2  2  4  4  4  2  1  3  2  0  0  0  3  1  2  4  3  1  0  2  1  4  2
##  [93]  3  3  2  1  2  0  0  2  2  2  0  3  3  1  2  2  2  2  1  3  3  3  2
## [116]  3  2  2  3

 

We can calculate the number of times that every team that has reached the last eight did not the previous year.

sum(results==0, na.rm=T) 
## [1] 12

 

 

We could also do a simple histogram plot

hist(results, col="lightblue", breaks=c(0,1,2,3,4,5,6,7,8), main="Number of teams reaching conseuctive quarter-finals", xlab="Number of teams") 

  Let’s make that into a dataframe and add season as a varaiable too:

mydf <- data.frame(Season = names(last8.l), sameteams = results)
head(mydf)
##   Season sameteams
## 1   1885        NA
## 2   1886         2
## 3   1887         3
## 4   1888         4
## 5   1889         4
## 6   1890         5

  It’s pretty clear from the above data that the pattern of the number of same teams making it into the last 8 is quite constant from year to year. To detect patterns, it might be worthwhile to add a moving average - say of every 5 years.

rollapply(mydf$sameteams, width=5, FUN=mean, fill=NA, align=c("right"))  #from library(zoo)
##   [1]  NA  NA  NA  NA  NA 3.6 4.4 4.6 4.2 4.2 4.2 3.4 3.4 4.0 4.0 3.6 3.4
##  [18] 3.2 2.6 2.2 2.4 3.0 2.6 2.6 2.4 2.0 1.8 2.0 2.2 2.2 2.2 2.2 2.6 2.2
##  [35] 2.0 2.2 1.6 1.2 1.2 1.8 2.0 2.0 1.8 1.8 1.4 1.2 1.4 1.6 2.2 2.4 1.8
##  [52] 2.0 1.8 1.4 1.4 1.8 1.6 1.6 1.6 1.2 1.6 1.8 2.0 1.8 2.4 2.4 2.6 2.4
##  [69] 2.8 2.6 2.2 2.4 3.0 3.2 3.2 3.0 2.8 2.4 1.6 1.2 1.0 1.0 0.8 1.2 2.0
##  [86] 2.6 2.2 2.0 2.0 1.4 1.6 1.8 2.4 2.6 2.8 2.2 2.2 1.6 1.0 1.0 1.2 1.2
## [103] 1.2 1.8 2.0 1.8 1.8 2.2 2.0 1.8 1.8 2.0 2.2 2.4 2.4 2.8 2.6 2.4 2.4
mydf$movavg <- rollapply(mydf$sameteams, width=5, FUN=mean, fill=NA, align=c("right"))  #add into mydf

head(mydf, 10)
##    Season sameteams movavg
## 1    1885        NA     NA
## 2    1886         2     NA
## 3    1887         3     NA
## 4    1888         4     NA
## 5    1889         4     NA
## 6    1890         5    3.6
## 7    1891         6    4.4
## 8    1892         4    4.6
## 9    1893         2    4.2
## 10   1894         4    4.2

  Before we graph it, I want to add some years so that gaps will appear along the x-axis for the two World Wars.

setdiff(1885:2013,names(last8.l)) #get those years not included
##  [1] 1915 1916 1917 1918 1939 1940 1941 1942 1943 1944
mydfNA <- data.frame(Season = setdiff(1885:2013, names(last8.l)), sameteams = NA, movavg = NA)  #create a df we can append to mydf 

mydf$Season <- as.numeric(as.character(mydf$Season)) #need to make sure that season is a number not a factor


mydf <- rbind(mydf, mydfNA) #this ensures that the war years are gaps when doing a line graph

 

Now let’s plot:

ggplot(mydf, aes(Season, movavg)) + 
       geom_point(color="navy", size=2.5) + 
       geom_line(aes(group=1), lwd=1, color="dodgerblue") +
         scale_x_continuous(breaks=c(seq(1880,2020,10))) +
       ggtitle("Total teams in last 8 of the FA Cup in consecutive seasons: 1885-2014") + 
       ylab("Total teams") +
  theme(
    panel.grid.major.x = element_line(color="gray55"),
    panel.grid.major.y = element_line(color="gray55"),
    axis.ticks.x = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid.minor = element_blank(),
    plot.background  = element_rect(color = "ghostwhite"),
    panel.background = element_blank(),
    plot.title = element_text(hjust=0,vjust=1)
    )

  My first impression of this plot is that there does appear to be a decline in the number of the same teams appearing in the last 8 from 1885 (averaging 4 of the same teams, season to season) to less than 2 teams repeating in the mid 1920s and mid 1930s. If increased parity is equivalent to a more exciting competition, then it appears that the mid 1970s and mid 1990s were very exciting as very few teams repeated getting into the last 8 in consecutive seasons. There also seems to be an increasing trend towards less parity / increased competitiveness since the beginning of the 21st century, which may be tailing off in the last few seasons. However, the ups-and-downs seen in the second half of this chart could easily be due to random noise rather than any meaningful trends.

This is just a quick example of how easy it is to turn all of this data into something visually appealing that can stimulate further research questions and ideas.

If you’re interested in this and want to learn more or discuss, please get in touch with me - jc3181 AT columbia DOT edu