jc3181 AT columbia DOT edu

 

In this vignette I shall explore changes in English soccer team performances from one season to the next. In particular, I will focus on the top flight and ask questions such as - which teams increased their goals per game from one season to the next the most? Which teams dropped the most points per game from one season to the next?

We can do this very easily using my R package engsoccerdata that contains the date and result of every league game ever played.

Throughout the guide I will try to explain what the code is doing as much as possible, either in the text or with #annotations# in the code chunks. If you have any questions, please email me.

 


Getting started

First install my engsoccerdata package from GitHub if you haven’t already. Make sure you have the devtools package loaded:

library(devtools)
install_github("jalapic/engsoccerdata")

 

Now load the required packages. In addition to engsoccerdata, we are also using dplyr to flexibly restructure our data and ggplot2 and gridExtra for visualizing it.

library(engsoccerdata)
library(dplyr)
library(ggplot2)
library(gridExtra)

 

Restructuring data

The dataset to use is engsoccerdata2 - this contains all league results up to the end of the 2013/14 season. Each season in this dataset is referred to by the year that the season began in. i.e. 2013 refers to the 2013/14 season.

The first thing to do is to use filter from dplyr to only keep top flight (tier=1) games and to remove those games that took place in the truncated 1939 season. We also use as.Date to ensure that the ‘Date’ variable is in R date format.

 

Throughout this guide, I shall also wrap some of the dplyr chains in tbl_df() as this will truncate the output so that huge dataframes aren’t shown in their entirety.

 

df <- tbl_df(engsoccerdata2 %>% filter(tier==1 & Season!=1939))
df$Date <- as.Date(df$Date, format="%Y-%m-%d")   

df
## Source: local data frame [46,770 x 12]
## 
##          Date Season            home              visitor  FT hgoal vgoal
## 1  1888-12-15   1888 Accrington F.C.          Aston Villa 1-1     1     1
## 2  1889-01-19   1888 Accrington F.C.     Blackburn Rovers 0-2     0     2
## 3  1889-03-23   1888 Accrington F.C.     Bolton Wanderers 2-3     2     3
## 4  1888-12-01   1888 Accrington F.C.              Burnley 5-1     5     1
## 5  1888-10-13   1888 Accrington F.C.         Derby County 6-2     6     2
## 6  1888-12-29   1888 Accrington F.C.              Everton 3-1     3     1
## 7  1889-01-26   1888 Accrington F.C.         Notts County 1-2     1     2
## 8  1888-10-20   1888 Accrington F.C.    Preston North End 0-0     0     0
## 9  1889-04-20   1888 Accrington F.C.           Stoke City 2-0     2     0
## 10 1888-11-24   1888 Accrington F.C. West Bromwich Albion 2-1     2     1
## ..        ...    ...             ...                  ... ...   ...   ...
## Variables not shown: division (chr), tier (int), totgoal (int), goaldif
##   (int), result (chr)

The engsoccerdata2 dataset contains a separate row for every match. For this analysis we need to have two rows for every match - one for each team taking part in each match. The reason for this is so we can calculate each team’s records within each season more easily.

Here, I make one dataframe for all ‘home’ games and one for all ‘away’ games and then bind them together. You will also note that I have added new variables - GF (goals for), GA (goals against), GD (goal difference), result (Win, Loss, Draw) and venue (home or away). After binding, I keep only the date, season, team, opponent, GF, GA and GD of each game; result is recalculated in the next step.

 

dfhome <- df %>% mutate(team = home,
                        opp = visitor,
                        GF=as.numeric(as.character(hgoal)),
                        GA=as.numeric(as.character(vgoal)),
                        GD = GF-GA,
                        result=ifelse(GD>0, "W", ifelse(GD<0, "L", "D")),
                        venue="home") 


dfaway <- df %>% mutate(team = visitor,
                        opp = home, 
                        GF=as.numeric(as.character(vgoal)),
                        GA=as.numeric(as.character(hgoal)),
                        GD = GF-GA,
                        result=ifelse(GD>0, "W", ifelse(GD<0, "L", "D")),
                        venue="away") 


dfboth <- rbind(dfhome,dfaway) %>% select(Date, Season, team, opp, GF, GA, GD)

dfboth
## Source: local data frame [93,540 x 7]
## 
##          Date Season            team                  opp GF GA GD
## 1  1888-12-15   1888 Accrington F.C.          Aston Villa  1  1  0
## 2  1889-01-19   1888 Accrington F.C.     Blackburn Rovers  0  2 -2
## 3  1889-03-23   1888 Accrington F.C.     Bolton Wanderers  2  3 -1
## 4  1888-12-01   1888 Accrington F.C.              Burnley  5  1  4
## 5  1888-10-13   1888 Accrington F.C.         Derby County  6  2  4
## 6  1888-12-29   1888 Accrington F.C.              Everton  3  1  2
## 7  1889-01-26   1888 Accrington F.C.         Notts County  1  2 -1
## 8  1888-10-20   1888 Accrington F.C.    Preston North End  0  0  0
## 9  1889-04-20   1888 Accrington F.C.           Stoke City  2  0  2
## 10 1888-11-24   1888 Accrington F.C. West Bromwich Albion  2  1  1
## ..        ...    ...             ...                  ... .. .. ..
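
As a minimal sketch of the same reshaping idea on made-up data (the team names and scores below are hypothetical, and transmute/bind_rows are used here for brevity rather than the mutate/rbind/select combination above), each match becomes two rows, one per team:

```r
library(dplyr)

# Hypothetical two-match dataset using the same column names as engsoccerdata2
matches <- data.frame(
  home    = c("Team A", "Team B"),
  visitor = c("Team B", "Team A"),
  hgoal   = c(2, 0),
  vgoal   = c(1, 0),
  stringsAsFactors = FALSE
)

# One row per match from the home side's point of view ...
home_rows <- matches %>%
  transmute(team = home, opp = visitor, GF = hgoal, GA = vgoal, venue = "home")

# ... and one row per match from the away side's point of view
away_rows <- matches %>%
  transmute(team = visitor, opp = home, GF = vgoal, GA = hgoal, venue = "away")

# Stack the two views and add goal difference
team_rows <- bind_rows(home_rows, away_rows) %>%
  mutate(GD = GF - GA)

team_rows
```

A quick sanity check on this kind of table is that it has exactly twice as many rows as the match table and that the goal differences sum to zero, since every goal for one team is a goal against another.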

 

Calculate cumulative totals

The next step is to calculate the cumulative goals for, goals against, goal difference and points for each game within a season for each team. To enable comparisons across seasons, I am assuming 3 points for a win throughout and I am not factoring in points penalties incurred. I also calculate, for each game number in a season, the cumulative goals for per game, goals against per game, goal difference per game and points per game. Different seasons have different total numbers of games, so these per-game values are what enable season to season comparisons.
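
To see why the per-game values matter, here is a small arithmetic sketch using two Stoke City season totals that appear in the tables later in this guide (points at 3 per win). The data frame name totals is just for illustration:

```r
# Season totals taken from the Stoke City table later in this guide
totals <- data.frame(
  Season = c(1888, 1984),
  games  = c(22, 42),
  Cumpts = c(16, 17)
)

# The raw points totals look similar, but the seasons differ in length,
# so divide by the number of games played
totals$pts.pg <- totals$Cumpts / totals$games
totals
```

The raw totals of 16 and 17 points are almost identical, but per game the 22-game 1888/89 season (about 0.73) is far better than the 42-game 1984/85 season (about 0.40).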

To do all of this, the main thing to do is to group together the ‘Season’ and ‘team’ variables using group_by in dplyr. This means that everything we do after this function is done independently for each team/Season combination.

To create new variables we use mutate in dplyr. To calculate the game number within each season for each team, we use dense_rank, and to calculate cumulative totals we use base-r’s cumsum. Before we calculate cumulative totals, we need to make sure that the dataframe is organized in ascending order of ‘gameno’ - so we use arrange to ensure this.

To get the ‘per-game’ values, we simply divide these cumulative totals by ‘gameno’. The ‘per-game’ variables have the suffix ‘.pg’. Lastly, we only want to keep certain variables and we use select to do this.

 

mydf <- 
  dfboth %>%
  group_by(Season, team) %>%
  mutate(result = ifelse(GD>0, "W", ifelse(GD<0, "L", "D")),
         pts = ifelse(GD>0, 3, ifelse(GD<0, 0, 1)),
         gameno = dense_rank(Date)) %>%
  arrange(Season,team, gameno) %>%
  mutate(Cumpts = cumsum(pts),
         CumGF = cumsum(GF),
         CumGA = cumsum(GA),
         CumGD = cumsum(GD),
         pts.pg = Cumpts/gameno,
         GF.pg = CumGF/gameno,
         GA.pg = CumGA/gameno,
         GD.pg = CumGD/gameno
         ) %>%
  select(Season, team, gameno, Cumpts, CumGF, CumGA, CumGD, pts.pg, GF.pg, GA.pg, GD.pg)
                       
mydf #this has 93,540 rows/observations
## Source: local data frame [93,540 x 11]
## Groups: Season, team
## 
##    Season            team gameno Cumpts CumGF CumGA CumGD    pts.pg
## 1    1888 Accrington F.C.      1      0     1     2    -1 0.0000000
## 2    1888 Accrington F.C.      2      1     6     7    -1 0.5000000
## 3    1888 Accrington F.C.      3      2     7     8    -1 0.6666667
## 4    1888 Accrington F.C.      4      5    11    10     1 1.2500000
## 5    1888 Accrington F.C.      5      6    15    14     1 1.2000000
## 6    1888 Accrington F.C.      6      9    21    16     5 1.5000000
## 7    1888 Accrington F.C.      7     10    21    16     5 1.4285714
## 8    1888 Accrington F.C.      8     10    24    20     4 1.2500000
## 9    1888 Accrington F.C.      9     11    26    22     4 1.2222222
## 10   1888 Accrington F.C.     10     12    29    25     4 1.2000000
## ..    ...             ...    ...    ...   ...   ...   ...       ...
## Variables not shown: GF.pg (dbl), GA.pg (dbl), GD.pg (dbl)
tail(mydf) #just to show you the last few observations in the data
## Source: local data frame [6 x 11]
## Groups: Season, team
## 
##   Season            team gameno Cumpts CumGF CumGA CumGD   pts.pg    GF.pg
## 1   2013 West Ham United     33     37    37    44    -7 1.121212 1.121212
## 2   2013 West Ham United     34     37    38    47    -9 1.088235 1.117647
## 3   2013 West Ham United     35     37    38    48   -10 1.057143 1.085714
## 4   2013 West Ham United     36     37    38    49   -11 1.027778 1.055556
## 5   2013 West Ham United     37     40    40    49    -9 1.081081 1.081081
## 6   2013 West Ham United     38     40    40    51   -11 1.052632 1.052632
## Variables not shown: GA.pg (dbl), GD.pg (dbl)
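
The grouping, ranking and cumulative-sum steps can be checked on a toy dataset small enough to verify by hand. This is only a sketch: the teams, dates and goal differences are made up, and the rows are deliberately out of date order to show why arrange is needed before cumsum.

```r
library(dplyr)

# Hypothetical results for two teams in one season, not in date order
games <- data.frame(
  Season = 2013,
  team   = c("Team A", "Team A", "Team A", "Team B", "Team B", "Team B"),
  Date   = as.Date(c("2013-08-31", "2013-08-17", "2013-08-24",
                     "2013-08-24", "2013-08-17", "2013-08-31")),
  GD     = c(0, 2, -1, 1, -2, 0),
  stringsAsFactors = FALSE
)

toy <- games %>%
  group_by(Season, team) %>%
  mutate(pts    = ifelse(GD > 0, 3, ifelse(GD < 0, 0, 1)),  # 3 points for a win
         gameno = dense_rank(Date)) %>%                     # game number within team/season
  arrange(Season, team, gameno) %>%                         # order before accumulating
  mutate(Cumpts = cumsum(pts),
         pts.pg = Cumpts / gameno) %>%
  ungroup()

toy
```

Team A wins, loses and then draws, so its running points total is 3, 3, 4; Team B loses, wins and then draws, giving 0, 3, 4.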

   

Get final standings data

What we really want to do in this example is to compare each team’s performance from one season to the next. This requires us to only keep the cumulative data for the last match of each team in each season. To do this, it’s a simple procedure to combine max from base-r and filter from dplyr to get the data from the highest gameno for each grouped team/Season combination (remember the data is still grouped - the group_by performed earlier is still a property of the dataframe).

 

mydf.final <- mydf %>% filter(gameno == max(gameno)) 

mydf.final  #this shows the final standings of teams in the first season - 1888/89
## Source: local data frame [2,363 x 11]
## Groups: Season, team
## 
##    Season              team gameno Cumpts CumGF CumGA CumGD    pts.pg
## 1    1888   Accrington F.C.     22     26    48    48     0 1.1818182
## 2    1888       Aston Villa     22     41    61    43    18 1.8636364
## 3    1888  Blackburn Rovers     22     36    66    45    21 1.6363636
## 4    1888  Bolton Wanderers     22     32    63    59     4 1.4545455
## 5    1888           Burnley     22     24    42    62   -20 1.0909091
## 6    1888      Derby County     22     23    41    61   -20 1.0454545
## 7    1888           Everton     22     29    35    46   -11 1.3181818
## 8    1888      Notts County     22     17    40    73   -33 0.7727273
## 9    1888 Preston North End     22     58    74    15    59 2.6363636
## 10   1888        Stoke City     22     16    26    51   -25 0.7272727
## ..    ...               ...    ...    ...   ...   ...   ...       ...
## Variables not shown: GF.pg (dbl), GA.pg (dbl), GD.pg (dbl)
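
As a small illustration of how filter behaves on grouped data (a sketch with hypothetical values, not the real dataset), max(gameno) is evaluated separately within each Season/team group, so seasons of different lengths each keep their own last row:

```r
library(dplyr)

# Hypothetical running totals: a 3-game season followed by a 2-game season
running <- data.frame(
  Season = c(1888, 1888, 1888, 1889, 1889),
  team   = "Team A",
  gameno = c(1, 2, 3, 1, 2),
  Cumpts = c(3, 3, 4, 0, 1),
  stringsAsFactors = FALSE
)

final <- running %>%
  group_by(Season, team) %>%
  filter(gameno == max(gameno)) %>%  # max() is computed per group
  ungroup()

final
```

Without the grouping, max(gameno) would be 3 for the whole table and the 1889 season would be dropped entirely.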

 

If we use ungroup() from dplyr, we can now also arrange the data by team. Here are the first few rows. As you can see, Accrington F.C.’s last hurrah in the top tier of English soccer was 1892/93, and Arsenal’s first season was 1904/05.

mydf.final <- mydf.final %>% ungroup() %>%  arrange(team,Season)
mydf.final
## Source: local data frame [2,363 x 11]
## 
##    Season            team gameno Cumpts CumGF CumGA CumGD    pts.pg
## 1    1888 Accrington F.C.     22     26    48    48     0 1.1818182
## 2    1889 Accrington F.C.     22     33    53    56    -3 1.5000000
## 3    1890 Accrington F.C.     22     22    28    50   -22 1.0000000
## 4    1891 Accrington F.C.     26     28    40    78   -38 1.0769231
## 5    1892 Accrington F.C.     30     29    57    81   -24 0.9666667
## 6    1904         Arsenal     34     45    36    40    -4 1.3235294
## 7    1905         Arsenal     38     52    62    64    -2 1.3684211
## 8    1906         Arsenal     38     64    66    59     7 1.6842105
## 9    1907         Arsenal     38     48    51    63   -12 1.2631579
## 10   1908         Arsenal     38     52    52    49     3 1.3684211
## ..    ...             ...    ...    ...   ...   ...   ...       ...
## Variables not shown: GF.pg (dbl), GA.pg (dbl), GD.pg (dbl)

 

Worked Example - Stoke City

 

Just as an example, here are all the end of season records for Stoke City in the top tier:

mydf.final %>% filter(team=="Stoke City")
## Source: local data frame [58 x 11]
## 
##    Season       team gameno Cumpts CumGF CumGA CumGD    pts.pg     GF.pg
## 1    1888 Stoke City     22     16    26    51   -25 0.7272727 1.1818182
## 2    1889 Stoke City     22     13    27    69   -42 0.5909091 1.2272727
## 3    1891 Stoke City     26     19    38    61   -23 0.7307692 1.4615385
## 4    1892 Stoke City     30     41    58    48    10 1.3666667 1.9333333
## 5    1893 Stoke City     30     42    65    79   -14 1.4000000 2.1666667
## 6    1894 Stoke City     30     33    50    67   -17 1.1000000 1.6666667
## 7    1895 Stoke City     30     45    56    47     9 1.5000000 1.8666667
## 8    1896 Stoke City     30     36    48    59   -11 1.2000000 1.6000000
## 9    1897 Stoke City     30     32    35    55   -20 1.0666667 1.1666667
## 10   1898 Stoke City     34     46    47    52    -5 1.3529412 1.3823529
## 11   1899 Stoke City     34     47    37    45    -8 1.3823529 1.0882353
## 12   1900 Stoke City     34     38    46    57   -11 1.1176471 1.3529412
## 13   1901 Stoke City     34     42    45    55   -10 1.2352941 1.3235294
## 14   1902 Stoke City     34     52    46    38     8 1.5294118 1.3529412
## 15   1903 Stoke City     34     37    54    57    -3 1.0882353 1.5882353
## 16   1904 Stoke City     34     43    40    58   -18 1.2647059 1.1764706
## 17   1905 Stoke City     38     55    54    55    -1 1.4473684 1.4210526
## 18   1906 Stoke City     38     34    41    64   -23 0.8947368 1.0789474
## 19   1922 Stoke City     42     40    47    67   -20 0.9523810 1.1190476
## 20   1933 Stoke City     42     56    58    71   -13 1.3333333 1.3809524
## 21   1934 Stoke City     42     60    71    70     1 1.4285714 1.6904762
## 22   1935 Stoke City     42     67    57    57     0 1.5952381 1.3571429
## 23   1936 Stoke City     42     57    72    57    15 1.3571429 1.7142857
## 24   1937 Stoke City     42     51    58    59    -1 1.2142857 1.3809524
## 25   1938 Stoke City     42     63    71    68     3 1.5000000 1.6904762
## 26   1946 Stoke City     42     79    90    53    37 1.8809524 2.1428571
## 27   1947 Stoke City     42     52    41    55   -14 1.2380952 0.9761905
## 28   1948 Stoke City     42     57    66    68    -2 1.3571429 1.5714286
## 29   1949 Stoke City     42     45    45    75   -30 1.0714286 1.0714286
## 30   1950 Stoke City     42     53    50    59    -9 1.2619048 1.1904762
## 31   1951 Stoke City     42     43    49    88   -39 1.0238095 1.1666667
## 32   1952 Stoke City     42     46    53    66   -13 1.0952381 1.2619048
## 33   1963 Stoke City     42     52    77    78    -1 1.2380952 1.8333333
## 34   1964 Stoke City     42     58    67    66     1 1.3809524 1.5952381
## 35   1965 Stoke City     42     57    65    64     1 1.3571429 1.5476190
## 36   1966 Stoke City     42     58    63    58     5 1.3809524 1.5000000
## 37   1967 Stoke City     42     49    50    73   -23 1.1666667 1.1904762
## 38   1968 Stoke City     42     42    40    63   -23 1.0000000 0.9523810
## 39   1969 Stoke City     42     60    56    52     4 1.4285714 1.3333333
## 40   1970 Stoke City     42     49    44    48    -4 1.1666667 1.0476190
## 41   1971 Stoke City     42     45    39    56   -17 1.0714286 0.9285714
## 42   1972 Stoke City     42     52    61    56     5 1.2380952 1.4523810
## 43   1973 Stoke City     42     61    54    42    12 1.4523810 1.2857143
## 44   1974 Stoke City     42     66    64    48    16 1.5714286 1.5238095
## 45   1975 Stoke City     42     56    48    50    -2 1.3333333 1.1428571
## 46   1976 Stoke City     42     44    28    51   -23 1.0476190 0.6666667
## 47   1979 Stoke City     42     49    44    58   -14 1.1666667 1.0476190
## 48   1980 Stoke City     42     54    51    60    -9 1.2857143 1.2142857
## 49   1981 Stoke City     42     44    44    63   -19 1.0476190 1.0476190
## 50   1982 Stoke City     42     57    53    64   -11 1.3571429 1.2619048
## 51   1983 Stoke City     42     50    44    63   -19 1.1904762 1.0476190
## 52   1984 Stoke City     42     17    24    91   -67 0.4047619 0.5714286
## 53   2008 Stoke City     38     45    38    55   -17 1.1842105 1.0000000
## 54   2009 Stoke City     38     47    34    48   -14 1.2368421 0.8947368
## 55   2010 Stoke City     38     46    46    48    -2 1.2105263 1.2105263
## 56   2011 Stoke City     38     45    36    53   -17 1.1842105 0.9473684
## 57   2012 Stoke City     38     42    34    45   -11 1.1052632 0.8947368
## 58   2013 Stoke City     38     50    45    52    -7 1.3157895 1.1842105
## Variables not shown: GA.pg (dbl), GD.pg (dbl)

 

What you will notice here is that Stoke were founder members of the top tier in 1888/89 but have been in and out of the top division.

We are only interested in looking at performances from one season to the next that both occurred in the top tier. We don’t want to compare non-consecutive appearances in the top tier - e.g. there’s no need to compare Stoke’s record in 1984/85 to 2008/09, but we do want to compare their performance from 2008/09 to 2009/10 to 2010/11 etc. Incidentally, I am also comparing their record from 1938/39 to 1946/47 as these are the two complete seasons either side of the war.

To do this, what we need to do for each team is to re-insert the seasons in which top tier soccer was played but the team was not in the top tier - this is done in the next section.

 

Re-inserting non top-tier seasons

 

We can get every team’s individual record by using base-R’s split - this splits every team’s data into a separate dataframe, with all of them being stored in a list.

mydf.final.split <- split(mydf.final, mydf.final$team)

 

Showing a summary of the dataframes returned by split would produce too much output for this guide. Each team’s dataframe is essentially equivalent to the example shown above for Stoke City. In fact, we can extract any team’s data just by adding e.g. $`Stoke City` to the end of the split call.

I will do this with Stoke’s data as a worked example of what we are about to do to every team’s data.

stoke <- split(mydf.final, mydf.final$team)$`Stoke City`
head(stoke) 
## Source: local data frame [6 x 11]
## 
##   Season       team gameno Cumpts CumGF CumGA CumGD    pts.pg    GF.pg
## 1   1888 Stoke City     22     16    26    51   -25 0.7272727 1.181818
## 2   1889 Stoke City     22     13    27    69   -42 0.5909091 1.227273
## 3   1891 Stoke City     26     19    38    61   -23 0.7307692 1.461538
## 4   1892 Stoke City     30     41    58    48    10 1.3666667 1.933333
## 5   1893 Stoke City     30     42    65    79   -14 1.4000000 2.166667
## 6   1894 Stoke City     30     33    50    67   -17 1.1000000 1.666667
## Variables not shown: GA.pg (dbl), GD.pg (dbl)
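
For readers unfamiliar with split, here is a minimal base-R sketch on made-up data (team names and values are hypothetical). It returns a named list with one dataframe per team, and names containing spaces can be reached with backticks or with [[ ]]:

```r
# Hypothetical end-of-season table for two teams
finals <- data.frame(
  Season = c(1888, 1889, 1888),
  team   = c("Team A", "Team A", "Team B"),
  pts.pg = c(1.2, 1.5, 0.7),
  stringsAsFactors = FALSE
)

# One dataframe per team, stored in a list named by team
by_team <- split(finals, finals$team)

names(by_team)        # the list elements are named after the teams
by_team$`Team B`      # backticks are needed because the name contains a space
by_team[["Team A"]]   # equivalent access using [[ ]]
```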

 

The seasons that Stoke City were in the top flight are in the following variable:

stoke$Season
##  [1] 1888 1889 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902
## [15] 1903 1904 1905 1906 1922 1933 1934 1935 1936 1937 1938 1946 1947 1948
## [29] 1949 1950 1951 1952 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972
## [43] 1973 1974 1975 1976 1979 1980 1981 1982 1983 1984 2008 2009 2010 2011
## [57] 2012 2013

 

Every season ever played in the top tier can be obtained using unique from base-r. I just use sort here to make sure the data are returned in numerical order:

sort(unique(df$Season)) #missing seasons between 1888-2013 are due to world wars
##   [1] 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901
##  [15] 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1919
##  [29] 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933
##  [43] 1934 1935 1936 1937 1938 1946 1947 1948 1949 1950 1951 1952 1953 1954
##  [57] 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968
##  [71] 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982
##  [85] 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996
##  [99] 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
## [113] 2011 2012 2013

 

We can calculate the seasons that Stoke were not in the top tier, but that top tier football took place, by using base-r’s setdiff - sort is used again just to ensure numerical order:

sort(setdiff(unique(mydf.final$Season), stoke$Season)) #seasons Stoke City not in top-flight but top tier soccer took place
##  [1] 1890 1907 1908 1909 1910 1911 1912 1913 1914 1919 1920 1921 1923 1924
## [15] 1925 1926 1927 1928 1929 1930 1931 1932 1953 1954 1955 1956 1957 1958
## [29] 1959 1960 1961 1962 1977 1978 1985 1986 1987 1988 1989 1990 1991 1992
## [43] 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
## [57] 2007
xtra.seasons <- setdiff(unique(mydf.final$Season), stoke$Season) #storing these seasons in a vector

 

Now to add in these blank seasons to Stoke’s dataframe. This is done by binding Stoke’s data to a new data.frame that has all the missing seasons in the first variable, the team name in the second variable, and every other variable is named the same as in the Stoke dataframe but contains NAs.

stoke <-rbind(stoke, 
          data.frame(Season=xtra.seasons,
                     team=unique(stoke$team),
                     gameno=NA,
                     Cumpts=NA,
                     CumGF=NA,
                     CumGA=NA,
                     CumGD=NA,
                     pts.pg=NA,
                     GF.pg=NA,
                     GA.pg=NA,
                     GD.pg=NA) 
) 

stoke %>% arrange(Season) # in season order 
## Source: local data frame [115 x 11]
## 
##    Season       team gameno Cumpts CumGF CumGA CumGD    pts.pg    GF.pg
## 1    1888 Stoke City     22     16    26    51   -25 0.7272727 1.181818
## 2    1889 Stoke City     22     13    27    69   -42 0.5909091 1.227273
## 3    1890 Stoke City     NA     NA    NA    NA    NA        NA       NA
## 4    1891 Stoke City     26     19    38    61   -23 0.7307692 1.461538
## 5    1892 Stoke City     30     41    58    48    10 1.3666667 1.933333
## 6    1893 Stoke City     30     42    65    79   -14 1.4000000 2.166667
## 7    1894 Stoke City     30     33    50    67   -17 1.1000000 1.666667
## 8    1895 Stoke City     30     45    56    47     9 1.5000000 1.866667
## 9    1896 Stoke City     30     36    48    59   -11 1.2000000 1.600000
## 10   1897 Stoke City     30     32    35    55   -20 1.0666667 1.166667
## ..    ...        ...    ...    ...   ...   ...   ...       ...      ...
## Variables not shown: GA.pg (dbl), GD.pg (dbl)
tail(stoke %>% arrange(Season), 10) #last 10 rows of Stoke's dataframe
## Source: local data frame [10 x 11]
## 
##    Season       team gameno Cumpts CumGF CumGA CumGD   pts.pg     GF.pg
## 1    2004 Stoke City     NA     NA    NA    NA    NA       NA        NA
## 2    2005 Stoke City     NA     NA    NA    NA    NA       NA        NA
## 3    2006 Stoke City     NA     NA    NA    NA    NA       NA        NA
## 4    2007 Stoke City     NA     NA    NA    NA    NA       NA        NA
## 5    2008 Stoke City     38     45    38    55   -17 1.184211 1.0000000
## 6    2009 Stoke City     38     47    34    48   -14 1.236842 0.8947368
## 7    2010 Stoke City     38     46    46    48    -2 1.210526 1.2105263
## 8    2011 Stoke City     38     45    36    53   -17 1.184211 0.9473684
## 9    2012 Stoke City     38     42    34    45   -11 1.105263 0.8947368
## 10   2013 Stoke City     38     50    45    52    -7 1.315789 1.1842105
## Variables not shown: GA.pg (dbl), GD.pg (dbl)

Note how seasons 1890/91 and 2004/05-2007/08 contain only NA values, as Stoke weren’t in the top-tier during these seasons.
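
The same padding idea can be sketched on a toy example. The data below are hypothetical, and the sketch swaps in base-R’s merge with all.x = TRUE as an alternative to the rbind approach used above, so treat it as one possible variant rather than the method of this guide:

```r
# Hypothetical data: the league ran in five seasons, the team played in three
all_seasons <- c(1888, 1889, 1890, 1891, 1892)
team_df <- data.frame(
  Season = c(1888, 1889, 1891),
  team   = "Team A",
  pts.pg = c(0.73, 0.59, 0.73),
  stringsAsFactors = FALSE
)

# Seasons the league ran without this team
missing_seasons <- sort(setdiff(all_seasons, team_df$Season))

# Join the team's rows onto the full season list;
# all.x = TRUE keeps every season and fills unmatched ones with NA
padded <- merge(data.frame(Season = all_seasons), team_df,
                by = "Season", all.x = TRUE)
padded$team <- "Team A"  # restore the team name on the NA rows

padded
```

Either way, the result has one row per top tier season, with NA in the performance columns for the seasons the team missed.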

 

Brief respite: some visualizations

OK - that’s a lot of data munging. Just as a quick respite, here are some visualizations of the Stoke data just to get an idea of what we are collecting.

Here, we will look at season by season changes in points per game, goals scored per game, goals against per game and goal difference per game.

I won’t go too much into the ggplot2 code. Possibly the only slightly unusual thing is the geom_line(aes(group=1)) part, which treats all points as a single series. The line is not joined across the seasons with missing (NA) values that we re-inserted above:

stoke$Season <- as.numeric(stoke$Season) #for graphing purposes make sure Season variable is numeric

g1 <- ggplot(stoke, aes(Season, pts.pg)) + 
  geom_point(color="black", size=2) + 
  geom_line(aes(group=1), color="red") +
  theme_bw() +
  scale_x_continuous(breaks=seq(1888, 2014, by=63))

g2 <- ggplot(stoke, aes(Season, GF.pg)) + 
  geom_point(color="black", size=2) + 
  geom_line(aes(group=1), color="red") +
  theme_bw() +
  scale_x_continuous(breaks=seq(1888, 2014, by=63))

g3 <- ggplot(stoke, aes(Season, GA.pg)) + 
  geom_point(color="black", size=2) + 
  geom_line(aes(group=1), color="red") +
  theme_bw() +
  scale_x_continuous(breaks=seq(1888, 2014, by=63))

g4 <- ggplot(stoke, aes(Season, GD.pg)) + 
  geom_point(color="black", size=2) + 
  geom_line(aes(group=1), color="red") +
  theme_bw() +
  scale_x_continuous(breaks=seq(1888, 2014, by=63))

grid.arrange(g1,g2,g3,g4, ncol=2)
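
To see the effect of the NA rows on the line in isolation, here is a minimal sketch with made-up values (the dataframe toy and its numbers are hypothetical). geom_line breaks the line wherever an NA occurs in the middle of the series:

```r
library(ggplot2)

# Hypothetical per-season values; the two NA rows stand for re-inserted seasons
toy <- data.frame(
  Season = 2000:2006,
  pts.pg = c(1.2, 1.4, NA, NA, 1.1, 1.3, 1.0)
)

# Same layers as the plots above, on the toy data
p <- ggplot(toy, aes(Season, pts.pg)) +
  geom_point(color = "black", size = 2) +
  geom_line(aes(group = 1), color = "red") +
  theme_bw()

p  # the line is drawn as two segments: 2000-2001 and 2004-2006
```

If the NA rows were dropped from the data instead, the line would run straight from 2001 to 2004, which is exactly what re-inserting the missing seasons avoids.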

Just out of interest - we can see that Stoke had one outstanding season in terms of points per game and one really terrible one. We can find these using filter:

stoke %>% filter(pts.pg == max(pts.pg, na.rm=TRUE))
## Source: local data frame [1 x 11]
## 
##   Season       team gameno Cumpts CumGF CumGA CumGD   pts.pg    GF.pg
## 1   1946 Stoke City     42     79    90    53    37 1.880952 2.142857
## Variables not shown: GA.pg (dbl), GD.pg (dbl)
stoke %>% filter(pts.pg == min(pts.pg, na.rm=TRUE))
## Source: local data frame [1 x 11]
## 
##   Season       team gameno Cumpts CumGF CumGA CumGD    pts.pg     GF.pg
## 1   1984 Stoke City     42     17    24    91   -67 0.4047619 0.5714286
## Variables not shown: GA.pg (dbl), GD.pg (dbl)

 

In 1946/47, Stoke averaged 1.88 points per game (assuming 3 points for a win). They finished in 4th position that year.

In 1984/85, Stoke only averaged 0.40 points per game - they finished rock bottom that season.

 

Calculating differences from one season to the next

This is the main purpose of this guide - and it is incredibly simple to do using lag from dplyr (which masks the time-series lag from base-r’s stats package once dplyr is loaded). This function takes a vector (or a variable in a dataframe) and returns it shifted along by one position, i.e. in our situation, it returns the previous season’s value. If the previous season has an NA, it returns an NA.

For example, let’s look at Stoke City’s points per game. We will select just the season and points per game variables, then create a new variable which is a lagged points per game and then another variable that is the difference.

Note that it’s important for the dataframe to be organized in Season order for lag to work correctly.  

stoke1 <- stoke %>% 
             select(Season,pts.pg) %>% 
             arrange(Season) %>%
             mutate(pts.pglag = lag(pts.pg), pts.pgDIF = pts.pg - pts.pglag)

as.data.frame(stoke1) 
##     Season    pts.pg pts.pglag   pts.pgDIF
## 1     1888 0.7272727        NA          NA
## 2     1889 0.5909091 0.7272727 -0.13636364
## 3     1890        NA 0.5909091          NA
## 4     1891 0.7307692        NA          NA
## 5     1892 1.3666667 0.7307692  0.63589744
## 6     1893 1.4000000 1.3666667  0.03333333
## 7     1894 1.1000000 1.4000000 -0.30000000
## 8     1895 1.5000000 1.1000000  0.40000000
## 9     1896 1.2000000 1.5000000 -0.30000000
## 10    1897 1.0666667 1.2000000 -0.13333333
## 11    1898 1.3529412 1.0666667  0.28627451
## 12    1899 1.3823529 1.3529412  0.02941176
## 13    1900 1.1176471 1.3823529 -0.26470588
## 14    1901 1.2352941 1.1176471  0.11764706
## 15    1902 1.5294118 1.2352941  0.29411765
## 16    1903 1.0882353 1.5294118 -0.44117647
## 17    1904 1.2647059 1.0882353  0.17647059
## 18    1905 1.4473684 1.2647059  0.18266254
## 19    1906 0.8947368 1.4473684 -0.55263158
## 20    1907        NA 0.8947368          NA
## 21    1908        NA        NA          NA
## 22    1909        NA        NA          NA
## 23    1910        NA        NA          NA
## 24    1911        NA        NA          NA
## 25    1912        NA        NA          NA
## 26    1913        NA        NA          NA
## 27    1914        NA        NA          NA
## 28    1919        NA        NA          NA
## 29    1920        NA        NA          NA
## 30    1921        NA        NA          NA
## 31    1922 0.9523810        NA          NA
## 32    1923        NA 0.9523810          NA
## 33    1924        NA        NA          NA
## 34    1925        NA        NA          NA
## 35    1926        NA        NA          NA
## 36    1927        NA        NA          NA
## 37    1928        NA        NA          NA
## 38    1929        NA        NA          NA
## 39    1930        NA        NA          NA
## 40    1931        NA        NA          NA
## 41    1932        NA        NA          NA
## 42    1933 1.3333333        NA          NA
## 43    1934 1.4285714 1.3333333  0.09523810
## 44    1935 1.5952381 1.4285714  0.16666667
## 45    1936 1.3571429 1.5952381 -0.23809524
## 46    1937 1.2142857 1.3571429 -0.14285714
## 47    1938 1.5000000 1.2142857  0.28571429
## 48    1946 1.8809524 1.5000000  0.38095238
## 49    1947 1.2380952 1.8809524 -0.64285714
## 50    1948 1.3571429 1.2380952  0.11904762
## 51    1949 1.0714286 1.3571429 -0.28571429
## 52    1950 1.2619048 1.0714286  0.19047619
## 53    1951 1.0238095 1.2619048 -0.23809524
## 54    1952 1.0952381 1.0238095  0.07142857
## 55    1953        NA 1.0952381          NA
## 56    1954        NA        NA          NA
## 57    1955        NA        NA          NA
## 58    1956        NA        NA          NA
## 59    1957        NA        NA          NA
## 60    1958        NA        NA          NA
## 61    1959        NA        NA          NA
## 62    1960        NA        NA          NA
## 63    1961        NA        NA          NA
## 64    1962        NA        NA          NA
## 65    1963 1.2380952        NA          NA
## 66    1964 1.3809524 1.2380952  0.14285714
## 67    1965 1.3571429 1.3809524 -0.02380952
## 68    1966 1.3809524 1.3571429  0.02380952
## 69    1967 1.1666667 1.3809524 -0.21428571
## 70    1968 1.0000000 1.1666667 -0.16666667
## 71    1969 1.4285714 1.0000000  0.42857143
## 72    1970 1.1666667 1.4285714 -0.26190476
## 73    1971 1.0714286 1.1666667 -0.09523810
## 74    1972 1.2380952 1.0714286  0.16666667
## 75    1973 1.4523810 1.2380952  0.21428571
## 76    1974 1.5714286 1.4523810  0.11904762
## 77    1975 1.3333333 1.5714286 -0.23809524
## 78    1976 1.0476190 1.3333333 -0.28571429
## 79    1977        NA 1.0476190          NA
## 80    1978        NA        NA          NA
## 81    1979 1.1666667        NA          NA
## 82    1980 1.2857143 1.1666667  0.11904762
## 83    1981 1.0476190 1.2857143 -0.23809524
## 84    1982 1.3571429 1.0476190  0.30952381
## 85    1983 1.1904762 1.3571429 -0.16666667
## 86    1984 0.4047619 1.1904762 -0.78571429
## 87    1985        NA 0.4047619          NA
## 88    1986        NA        NA          NA
## 89    1987        NA        NA          NA
## 90    1988        NA        NA          NA
## 91    1989        NA        NA          NA
## 92    1990        NA        NA          NA
## 93    1991        NA        NA          NA
## 94    1992        NA        NA          NA
## 95    1993        NA        NA          NA
## 96    1994        NA        NA          NA
## 97    1995        NA        NA          NA
## 98    1996        NA        NA          NA
## 99    1997        NA        NA          NA
## 100   1998        NA        NA          NA
## 101   1999        NA        NA          NA
## 102   2000        NA        NA          NA
## 103   2001        NA        NA          NA
## 104   2002        NA        NA          NA
## 105   2003        NA        NA          NA
## 106   2004        NA        NA          NA
## 107   2005        NA        NA          NA
## 108   2006        NA        NA          NA
## 109   2007        NA        NA          NA
## 110   2008 1.1842105        NA          NA
## 111   2009 1.2368421 1.1842105  0.05263158
## 112   2010 1.2105263 1.2368421 -0.02631579
## 113   2011 1.1842105 1.2105263 -0.02631579
## 114   2012 1.1052632 1.1842105 -0.07894737
## 115   2013 1.3157895 1.1052632  0.21052632

 

Continuing the Stoke City worked example, we can compute this lagged difference for points, GF, GA and GD per game and visualize the results:

stoke2 <- stoke %>% 
             arrange(Season) %>%
             mutate(pts.pglag = lag(pts.pg), pts.pgDIF = pts.pg - pts.pglag,
                    GF.pglag = lag(GF.pg), GF.pgDIF = GF.pg - GF.pglag,
                    GA.pglag = lag(GA.pg), GA.pgDIF = GA.pg - GA.pglag,
                    GD.pglag = lag(GD.pg), GD.pgDIF = GD.pg - GD.pglag
                    )

gg1 <- ggplot(stoke2, aes(Season, pts.pgDIF)) + 
  geom_point(color="black", size=2) + 
  geom_line(aes(group=1), color="red") +
  theme_bw() +
  scale_x_continuous(breaks=seq(1888, 2014, by=63))

gg2 <- ggplot(stoke2, aes(Season, GF.pgDIF)) + 
  geom_point(color="black", size=2) + 
  geom_line(aes(group=1), color="red") +
  theme_bw() +
  scale_x_continuous(breaks=seq(1888, 2014, by=63))

gg3 <- ggplot(stoke2, aes(Season, GA.pgDIF)) + 
  geom_point(color="black", size=2) + 
  geom_line(aes(group=1), color="red") +
  theme_bw() +
  scale_x_continuous(breaks=seq(1888, 2014, by=63))

gg4 <- ggplot(stoke2, aes(Season, GD.pgDIF)) + 
  geom_point(color="black", size=2) + 
  geom_line(aes(group=1), color="red") +
  theme_bw() +
  scale_x_continuous(breaks=seq(1888, 2014, by=63))

grid.arrange(gg1,gg2,gg3,gg4, ncol=2)

There are some interesting changes in Stoke’s goals scored per game (GF) in the late 1940s…

stoke2 %>% 
      filter(Season >= 1945 & Season <= 1952) %>% 
      select(Season, GF.pg, GF.pglag, GF.pgDIF)
## Source: local data frame [7 x 4]
## 
##   Season     GF.pg  GF.pglag    GF.pgDIF
## 1   1946 2.1428571 1.6904762  0.45238095
## 2   1947 0.9761905 2.1428571 -1.16666667
## 3   1948 1.5714286 0.9761905  0.59523810
## 4   1949 1.0714286 1.5714286 -0.50000000
## 5   1950 1.1904762 1.0714286  0.11904762
## 6   1951 1.1666667 1.1904762 -0.02380952
## 7   1952 1.2619048 1.1666667  0.09523810

As we saw earlier, Stoke finished 4th in 1946/47 and scored 90 goals in 42 games. The following season they scored only 41 goals in 42 games and finished 15th, though they rebounded in 1948/49 and scored 66 goals in 42 games. It’s also interesting that in their relegation season of 1984/85 they actually scored a similar number of goals to the previous season; they just conceded far more.
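
The goal totals quoted above can be recovered from the goals-per-game column. This is only a quick arithmetic check: the three values are copied from the table, and the 42-game season length is taken from the text.

```r
# Goals-per-game values copied from the table above (Seasons 1946, 1947, 1948)
gf_pg <- c(`1946` = 2.1428571, `1947` = 0.9761905, `1948` = 1.5714286)

# Season length as stated in the text
games <- 42

# Multiply back up to season totals; round() removes the truncation error
goals <- round(gf_pg * games)
goals
## 1946 1947 1948 
##   90   41   66
```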

 

Repeatability of performance visualizations

One final set of visualizations in the worked example of Stoke City: scatterplots to look at repeatability from one season to the next. Here, we plot one season’s points per game, GF per game, GA per game and GD per game against the next season’s. Seasons that did not have a preceding or succeeding season in the top tier are not included.

#visualizing top-flight points per game repeatability for Stoke City
ggg1 <- ggplot(stoke2, aes(pts.pglag, pts.pg)) + 
  geom_point(size=2, color="red") + 
  stat_smooth(method="lm", se=FALSE, lwd=1, color="firebrick") +
  theme_bw() +
  xlab("Preceding Season") + ylab("Succeeding Season") + ggtitle("Points per game")

ggg2 <- ggplot(stoke2, aes(GF.pglag, GF.pg)) + 
  geom_point(size=2, color="red") + 
  stat_smooth(method="lm", se=FALSE, lwd=1, color="firebrick") +
  theme_bw() +
  xlab("Preceding Season") + ylab("Succeeding Season") + ggtitle("GF per game")

ggg3 <- ggplot(stoke2, aes(GA.pglag, GA.pg)) + 
  geom_point(size=2, color="red") + 
  stat_smooth(method="lm", se=FALSE, lwd=1, color="firebrick") +
  theme_bw() +
  xlab("Preceding Season") + ylab("Succeeding Season") + ggtitle("GA per game")

ggg4 <- ggplot(stoke2, aes(GD.pglag, GD.pg)) + 
  geom_point(size=2, color="red") + 
  stat_smooth(method="lm", se=FALSE, lwd=1, color="firebrick") +
  theme_bw() +
  xlab("Preceding Season") + ylab("Succeeding Season") + ggtitle("GD per game")

grid.arrange(ggg1,ggg2,ggg3,ggg4, ncol=2)
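
If you want a single number to go with each scatterplot, one option (my suggestion here, not something the plots above compute) is the Pearson correlation between the lagged and current values. The sketch below uses made-up numbers and base R only, so it runs without any of the objects created earlier; the name toy and its values are purely illustrative.

```r
# Made-up points-per-game values for consecutive seasons (NOT real data)
toy <- data.frame(
  Season = 2001:2006,
  pts.pg = c(1.0, 1.2, NA, 1.5, 1.4, 1.6)
)

# Base-R stand-in for dplyr's lag(): shift the values down by one row
toy$pts.pglag <- c(NA, head(toy$pts.pg, -1))

# Number of season pairs where both the preceding and current values exist
n_pairs <- sum(complete.cases(toy[, c("pts.pglag", "pts.pg")]))

# Pearson correlation over those complete pairs only
r <- cor(toy$pts.pglag, toy$pts.pg, use = "complete.obs")
```

With real data you would swap toy for stoke2; the use = "complete.obs" argument drops the seasons that have no top-tier neighbour.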

   

All teams

 

To look at season by season trends for all teams, we first need to write a generic function. This is essentially the code applied above to Stoke City, written generically.

The function takes two arguments. The first, mydf.final, is the filtered dataframe that we produced above, containing all the seasons in which top tier football took place. The second, x, is an individual team’s dataframe, one element of the list that we stored above after running the split function.

fun1 <- function(mydf.final, x){

x <- as.data.frame(x)

xtra.seasons <- setdiff(unique(mydf.final$Season), x$Season) #top tier seasons this team did not play in, stored in a vector

x <-rbind(x, 
          data.frame(Season=xtra.seasons,
                     team=unique(x$team),
                     gameno=NA,
                     Cumpts=NA,
                     CumGF=NA,
                     CumGA=NA,
                     CumGD=NA,
                     pts.pg=NA,
                     GF.pg=NA,
                     GA.pg=NA,
                     GD.pg=NA) 
) 

x$Season <- as.numeric(x$Season) #for graphing purposes

x <- x %>% 
  arrange(Season) %>% 
  mutate(pts.pglag = lag(pts.pg),
         GF.pglag = lag(GF.pg),
         GA.pglag = lag(GA.pg),
         GD.pglag = lag(GD.pg),
         pts.pgDIF=pts.pg - lag(pts.pg),
         GF.pgDIF=GF.pg - lag(GF.pg),
         GA.pgDIF=GA.pg - lag(GA.pg),
         GD.pgDIF=GD.pg - lag(GD.pg)         
         ) 

return(x)
}
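
Why bother padding each team with NA rows for the seasons it missed? Because lag() simply looks one row up, so without padding a team's return to the top tier would be compared with its last top-tier season, however long ago that was. Here is a minimal sketch with made-up numbers, using a base-R stand-in for lag() (the helper lag1 is hypothetical, not part of any package):

```r
# Made-up data: a team in the top flight in 2001, 2002 and 2004 (NOT real data)
team <- data.frame(Season = c(2001, 2002, 2004),
                   pts.pg = c(1.0, 1.3, 1.1))
all.seasons <- 2001:2004

# Hypothetical helper: base-R stand-in for dplyr's lag()
lag1 <- function(v) c(NA, head(v, -1))

# Without padding: 2004 is wrongly compared with 2002
unpadded <- team
unpadded$pts.pgDIF <- unpadded$pts.pg - lag1(unpadded$pts.pg)

# With padding: add an NA row for each missing season, sort, then lag
missing.seasons <- setdiff(all.seasons, team$Season)
padded <- rbind(team, data.frame(Season = missing.seasons, pts.pg = NA))
padded <- padded[order(padded$Season), ]
padded$pts.pgDIF <- padded$pts.pg - lag1(padded$pts.pg)

padded
```

In the padded version only 2002 gets a difference; 2004 is NA because the team has no 2003 value to compare against.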

   

Apply the above function to all teams using lapply and then convert to one large dataframe using do.call and rbind:

mydf.all <- lapply(mydf.final.split, function(x) fun1(mydf.final, x))
alldf    <- do.call("rbind", mydf.all)

sum(!is.na(alldf$pts.pgDIF)) #this is the total number of observations containing data in points per game = 2088
## [1] 2088
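
If the split, lapply, do.call pattern is new to you, here it is on a tiny made-up dataset. The function add.dif is a hypothetical, cut-down stand-in for fun1, and the two teams and their numbers are invented for illustration.

```r
# Made-up data for two hypothetical teams (NOT real results)
toy <- data.frame(team   = c("A", "A", "B", "B"),
                  Season = c(2001, 2002, 2001, 2002),
                  pts.pg = c(1.0, 1.4, 2.0, 1.5))

# split() returns a named list with one data frame per team
toy.split <- split(toy, toy$team)

# Per-team function: season-to-season change in points per game
add.dif <- function(x) {
  x <- x[order(x$Season), ]
  x$pts.pgDIF <- x$pts.pg - c(NA, head(x$pts.pg, -1))
  x
}

# Apply to every team, then stack the results back into one data frame
toy.all <- do.call("rbind", lapply(toy.split, add.dif))

sum(!is.na(toy.all$pts.pgDIF))
## [1] 2
```

Because the lag is taken inside each team's own data frame, team B's first season is never compared with team A's last.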

 

This alldf contains all the data we need. Let’s visualize it…

 

Change in points per game - all teams

 

Just as an illustration, here we rank the points per game difference between two seasons (all 2088 observations) and plot them.

tbl_df(alldf %>% 
  filter(!is.na(pts.pgDIF)) %>%
  mutate(rank = rank(pts.pgDIF)) %>% 
  select(team,Season,pts.pgDIF,rank) %>%
  arrange(rank)
  )
## Source: local data frame [2,088 x 4]
## 
##                    team Season  pts.pgDIF rank
## 1               Everton   1970 -1.0952381  1.0
## 2           Aston Villa   1900 -0.9411765  2.0
## 3      Newcastle United   1977 -0.9285714  3.5
## 4      Sheffield United   1975 -0.9285714  3.5
## 5  West Bromwich Albion   1890 -0.8636364  5.0
## 6      Sheffield United   1898 -0.8490196  6.0
## 7               Arsenal   1912 -0.8421053  7.0
## 8  West Bromwich Albion   1920 -0.8333333  8.0
## 9   Sheffield Wednesday   1919 -0.8120301  9.0
## 10         Ipswich Town   2001 -0.7894737 10.0
## ..                  ...    ...        ...  ...

 

Straight away we can see that Everton’s 1970/71 season was an absolute disaster compared to their previous season. In 1969/70, Everton won the league but bombed to 14th the following season.

tbl_df(alldf %>% 
  filter(!is.na(pts.pgDIF)) %>%
  mutate(rank = rank(pts.pgDIF)) %>% 
  select(team,Season,pts.pgDIF,rank) %>%
  arrange(desc(rank))
  )
## Source: local data frame [2,088 x 4]
## 
##                team Season pts.pgDIF   rank
## 1           Arsenal   1930 0.9761905 2088.0
## 2           Arsenal   1970 0.9523810 2087.0
## 3      Derby County   1895 0.9333333 2086.0
## 4        Sunderland   1891 0.8321678 2085.0
## 5  Sheffield United   1899 0.8235294 2084.0
## 6       Aston Villa   1989 0.7894737 2083.0
## 7   Manchester City   1967 0.7857143 2081.5
## 8   West Ham United   1985 0.7857143 2081.5
## 9        Sunderland   1897 0.7666667 2080.0
## 10          Everton   1938 0.7380952 2079.0
## ..              ...    ...       ...    ...

 

The two biggest gains were both by Arsenal. In 1970/71, the season Everton finished 14th, Arsenal finished as champions after being only 12th the year before. In 1930/31, they were champions after being 14th the year before.
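
A note on the fractional ranks in the tables above (3.5 for Newcastle United and Sheffield United, 2081.5 for Manchester City and West Ham United): by default rank() gives tied values the average of the ranks they span. A small check, using the five largest drops copied from the first table:

```r
# The five largest points-per-game drops, copied from the first table above
drops <- c(-1.0952381, -0.9411765, -0.9285714, -0.9285714, -0.8636364)

# Default tie handling: ties.method = "average"
rank.avg <- rank(drops)
rank.avg
## [1] 1.0 2.0 3.5 3.5 5.0

# Alternative: tied values share the lowest rank they span
rank.min <- rank(drops, ties.method = "min")
rank.min
## [1] 1 2 3 3 5
```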

 

Here is a plot of all teams by rank order:

tmp1 <- alldf %>% 
  filter(!is.na(pts.pgDIF)) %>%
  mutate(rank = rank(pts.pgDIF)) %>% 
  select(team,Season,pts.pgDIF,rank)

ggplot(tmp1, aes(rank, pts.pgDIF)) + geom_point() + theme_bw() 

  Another way of visualizing this data is to look across time:

ggplot(tmp1, aes(Season, pts.pgDIF)) + geom_point() + theme_bw() +
  ylab("Points per game change season to season")

With some extra edits, this gives the following plot:

     


Change in goals scored - all teams

tbl_df(alldf %>% 
  filter(!is.na(GF.pgDIF)) %>%
  mutate(rank = rank(GF.pgDIF)) %>% 
  select(team,Season,GF.pgDIF,rank) %>%
  arrange(rank)
  )
## Source: local data frame [2,088 x 4]
## 
##                    team Season   GF.pgDIF rank
## 1     Preston North End   1890 -1.2272727    1
## 2  West Bromwich Albion   1920 -1.1904762    2
## 3      Blackburn Rovers   1890 -1.1818182    3
## 4            Stoke City   1947 -1.1666667    4
## 5       Accrington F.C.   1890 -1.1363636    5
## 6               Arsenal   1933 -1.0238095    6
## 7      Newcastle United   1997 -1.0000000    7
## 8           Aston Villa   1892 -0.9897436    8
## 9               Everton   1891 -0.9790210    9
## 10              Arsenal   1992 -0.9761905   10
## ..                  ...    ...        ...  ...

 

Nothing too exciting here - lots of teams from long ago seem to have dramatic changes in goals scored per game - but then again, this is when scoring was higher and so you’d expect more fluctuations as a result.
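
One way to probe that hunch would be to compare the spread of the season-to-season changes by era. The sketch below only shows the mechanics: the numbers are invented, so the result says nothing about real football, and the 1945 cut-off is an arbitrary choice of mine.

```r
# Made-up season-to-season changes in goals per game (NOT real data)
toy <- data.frame(
  Season   = c(1890, 1891, 1892, 1990, 1991, 1992),
  GF.pgDIF = c(-1.2, 1.3, -0.9, 0.2, -0.3, 0.1)
)

# Hypothetical era split at 1945; any other cut-off could be used
toy$era <- ifelse(toy$Season < 1945, "pre-1945", "post-1945")

# Standard deviation of the changes within each era
spread <- tapply(toy$GF.pgDIF, toy$era, sd)
spread
```

With the real data you would replace toy with alldf after filtering out the rows where GF.pgDIF is NA.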

 

tbl_df(alldf %>% 
  filter(!is.na(GF.pgDIF)) %>%
  mutate(rank = rank(GF.pgDIF)) %>% 
  select(team,Season,GF.pgDIF,rank) %>%
  arrange(desc(rank))
  )
## Source: local data frame [2,088 x 4]
## 
##                    team Season  GF.pgDIF rank
## 1           Aston Villa   1891 1.3776224 2088
## 2               Everton   1889 1.3636364 2087
## 3            Sunderland   1891 1.2587413 2086
## 4  West Bromwich Albion   1919 1.1867168 2085
## 5               Arsenal   1930 1.1666667 2084
## 6      Sheffield United   1925 1.1190476 2083
## 7               Everton   1984 1.0476190 2082
## 8     Tottenham Hotspur   1956 1.0238095 2081
## 9       Manchester City   1967 1.0238095 2080
## 10              Arsenal   1925 0.9761905 2079
## ..                  ...    ...       ...  ...

 

The same comment applies here, though Man City won the league in 1967/68 after only finishing 15th the year before and Everton won in 1984/85. Spurs finished 2nd in 1956/57.

 

Across time, we can see that goal scoring fluctuations haven’t actually been that much greater in the early years. There are also some interesting and quite dramatic recent increases and decreases in goal scoring:

tmp2 <- alldf %>% filter(!is.na(GF.pgDIF)) 

ggplot(tmp2, aes(Season, GF.pgDIF)) + geom_point() + theme_bw() +
  ylab("Goals scored per game change season to season")

With some extra edits, this gives the following plot:

   

Any questions or comments, please email me.