In this vignette I shall explore changes in English soccer team performances from one season to the next. In particular, I will focus on the top flight and ask questions such as - which teams increased their goals per game from one season to the next the most? Which teams dropped the most points per game from one season to the next?
We can do this very easily using my R package engsoccerdata that contains the date and result of every league game ever played.
Throughout the guide I will try to explain what the code is doing as much as possible, either in the text or with #annotations# in the code chunks - any questions please email me.
First install my engsoccerdata package from GitHub if you haven’t already. Make sure you have the devtools package loaded:
library(devtools)
install_github("jalapic/engsoccerdata") #the 'user/repo' string is all that is needed
Now load the required packages. In addition to engsoccerdata, we are also using dplyr to flexibly restructure our data and ggplot2 and gridExtra for visualizing it.
library(engsoccerdata)
library(dplyr)
library(ggplot2)
library(gridExtra)
The dataset to use is engsoccerdata2 - this contains all league results up to the end of the 2013/14 season. Each season in this dataset is referred to by the year that the season began in. i.e. 2013 refers to the 2013/14 season.
The first thing to do is to use filter from dplyr to only keep top flight (tier=1) games and to remove those games that took place in the truncated 1939 season. We also use as.Date to ensure that the ‘Date’ variable is in R date format.
Throughout this guide, I shall also wrap some of the dplyr chains in tbl_df() as this will truncate the output so that huge dataframes aren’t shown in their entirety.
df <- tbl_df(engsoccerdata2 %>% filter(tier==1 & Season!=1939))
df$Date <- as.Date(df$Date, format="%Y-%m-%d")
df
## Source: local data frame [46,770 x 12]
##
## Date Season home visitor FT hgoal vgoal
## 1 1888-12-15 1888 Accrington F.C. Aston Villa 1-1 1 1
## 2 1889-01-19 1888 Accrington F.C. Blackburn Rovers 0-2 0 2
## 3 1889-03-23 1888 Accrington F.C. Bolton Wanderers 2-3 2 3
## 4 1888-12-01 1888 Accrington F.C. Burnley 5-1 5 1
## 5 1888-10-13 1888 Accrington F.C. Derby County 6-2 6 2
## 6 1888-12-29 1888 Accrington F.C. Everton 3-1 3 1
## 7 1889-01-26 1888 Accrington F.C. Notts County 1-2 1 2
## 8 1888-10-20 1888 Accrington F.C. Preston North End 0-0 0 0
## 9 1889-04-20 1888 Accrington F.C. Stoke City 2-0 2 0
## 10 1888-11-24 1888 Accrington F.C. West Bromwich Albion 2-1 2 1
## .. ... ... ... ... ... ... ...
## Variables not shown: division (chr), tier (int), totgoal (int), goaldif
## (int), result (chr)
The engsoccerdata2 dataset contains a separate row for every match. For this analysis we need to have two rows for every match - one for each team taking part in each match. The reason for this is so we can calculate each team’s records within each season more easily.
Here, I make one dataframe for all ‘home’ games and one for all ‘away’ games and then bind them together. You will also note that I have added new variables - GF (goals for), GA (goals against), GD (goal difference), result (Win, Loss, Draw) and venue (home or away). After binding, I only keep the date, season, team, opponent, GF, GA and GD of each game; result and venue are dropped by the select, and result is recalculated in the next step.
dfhome <- df %>% mutate(team = home,
opp = visitor,
GF=as.numeric(as.character(hgoal)),
GA=as.numeric(as.character(vgoal)),
GD = GF-GA,
result=ifelse(GD>0, "W", ifelse(GD<0, "L", "D")),
venue="home")
dfaway <- df %>% mutate(team = visitor,
opp = home,
GF=as.numeric(as.character(vgoal)),
GA=as.numeric(as.character(hgoal)),
GD = GF-GA,
result=ifelse(GD>0, "W", ifelse(GD<0, "L", "D")),
venue="away")
dfboth <- rbind(dfhome,dfaway) %>% select(Date, Season, team, opp, GF, GA, GD)
dfboth
## Source: local data frame [93,540 x 7]
##
## Date Season team opp GF GA GD
## 1 1888-12-15 1888 Accrington F.C. Aston Villa 1 1 0
## 2 1889-01-19 1888 Accrington F.C. Blackburn Rovers 0 2 -2
## 3 1889-03-23 1888 Accrington F.C. Bolton Wanderers 2 3 -1
## 4 1888-12-01 1888 Accrington F.C. Burnley 5 1 4
## 5 1888-10-13 1888 Accrington F.C. Derby County 6 2 4
## 6 1888-12-29 1888 Accrington F.C. Everton 3 1 2
## 7 1889-01-26 1888 Accrington F.C. Notts County 1 2 -1
## 8 1888-10-20 1888 Accrington F.C. Preston North End 0 0 0
## 9 1889-04-20 1888 Accrington F.C. Stoke City 2 0 2
## 10 1888-11-24 1888 Accrington F.C. West Bromwich Albion 2 1 1
## .. ... ... ... ... .. .. ..
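To make the reshaping concrete, here is a minimal sketch on a made-up two-match dataframe. The team names and scores are hypothetical and are not taken from engsoccerdata2; only the column names mirror the ones used above.

```r
library(dplyr)

# Two made-up matches (hypothetical teams and scores, for illustration only)
toy <- data.frame(Season = c(2013, 2013),
                  home = c("Team A", "Team B"),
                  visitor = c("Team B", "Team A"),
                  hgoal = c(2, 0),
                  vgoal = c(1, 0),
                  stringsAsFactors = FALSE)

# One row per match from the home side's point of view ...
toy_home <- toy %>% mutate(team = home, opp = visitor, GF = hgoal, GA = vgoal)
# ... and one row per match from the visitor's point of view
toy_away <- toy %>% mutate(team = visitor, opp = home, GF = vgoal, GA = hgoal)

toy_both <- rbind(toy_home, toy_away) %>%
  mutate(GD = GF - GA) %>%
  select(Season, team, opp, GF, GA, GD)

toy_both  # 4 rows: each match now appears once for each team
```

Because every match is counted once from each side, the GD column should always sum to zero, which is a handy sanity check on the real data too.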
The next step is to calculate the cumulative goals for, goals against, goal difference and points after each game within a season for each team. To enable comparisons across seasons, I am assuming 3 points for a win throughout and I am not factoring in any points penalties incurred. I also calculate, for each game number in a season, the cumulative goals for per game, goals against per game, goal difference per game and points per game. The reason for this is that different seasons have different total numbers of games, so per-game values enable season to season comparisons.
To do all of this, the main thing to do is to group together the ‘Season’ and ‘team’ variables using group_by in dplyr. This means that everything we do after this function is done independently for each team/Season combination.
To create new variables we use mutate in dplyr. To calculate the game number within each season for each team, we use dense_rank on the date, and to calculate cumulative totals we use base R’s cumsum. Before we calculate cumulative totals, we need to make sure that the dataframe is organized in ascending order of ‘gameno’ - so we use arrange to ensure this.
To get the ‘per-game’ values, we simply divide these cumulative totals by ‘gameno’. The ‘per-game’ variables have the suffix ‘.pg’. Lastly, we only want to keep certain variables and we use select to do this.
mydf <-
dfboth %>%
group_by(Season, team) %>%
mutate(result = ifelse(GD>0, "W", ifelse(GD<0, "L", "D")),
pts = ifelse(GD>0, 3, ifelse(GD<0, 0, 1)),
gameno = dense_rank(Date)) %>%
arrange(Season,team, gameno) %>%
mutate(Cumpts = cumsum(pts),
CumGF = cumsum(GF),
CumGA = cumsum(GA),
CumGD = cumsum(GD),
pts.pg = Cumpts/gameno,
GF.pg = CumGF/gameno,
GA.pg = CumGA/gameno,
GD.pg = CumGD/gameno
) %>%
select(Season, team, gameno, Cumpts, CumGF, CumGA, CumGD, pts.pg, GF.pg, GA.pg, GD.pg)
mydf #this has 93,540 rows/observations
## Source: local data frame [93,540 x 11]
## Groups: Season, team
##
## Season team gameno Cumpts CumGF CumGA CumGD pts.pg
## 1 1888 Accrington F.C. 1 0 1 2 -1 0.0000000
## 2 1888 Accrington F.C. 2 1 6 7 -1 0.5000000
## 3 1888 Accrington F.C. 3 2 7 8 -1 0.6666667
## 4 1888 Accrington F.C. 4 5 11 10 1 1.2500000
## 5 1888 Accrington F.C. 5 6 15 14 1 1.2000000
## 6 1888 Accrington F.C. 6 9 21 16 5 1.5000000
## 7 1888 Accrington F.C. 7 10 21 16 5 1.4285714
## 8 1888 Accrington F.C. 8 10 24 20 4 1.2500000
## 9 1888 Accrington F.C. 9 11 26 22 4 1.2222222
## 10 1888 Accrington F.C. 10 12 29 25 4 1.2000000
## .. ... ... ... ... ... ... ... ...
## Variables not shown: GF.pg (dbl), GA.pg (dbl), GD.pg (dbl)
tail(mydf) #just to show you the last few observations in the data
## Source: local data frame [6 x 11]
## Groups: Season, team
##
## Season team gameno Cumpts CumGF CumGA CumGD pts.pg GF.pg
## 1 2013 West Ham United 33 37 37 44 -7 1.121212 1.121212
## 2 2013 West Ham United 34 37 38 47 -9 1.088235 1.117647
## 3 2013 West Ham United 35 37 38 48 -10 1.057143 1.085714
## 4 2013 West Ham United 36 37 38 49 -11 1.027778 1.055556
## 5 2013 West Ham United 37 40 40 49 -9 1.081081 1.081081
## 6 2013 West Ham United 38 40 40 51 -11 1.052632 1.052632
## Variables not shown: GA.pg (dbl), GD.pg (dbl)
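As a small check on the logic, the same chain can be run on three made-up games for one hypothetical team. The dates and scores below are invented, and the rows are deliberately out of date order to show why arrange comes before cumsum.

```r
library(dplyr)

# Hypothetical results for one team in one season (dates and scores made up)
toy <- data.frame(Season = 2013,
                  team = "Team A",
                  Date = as.Date(c("2013-08-31", "2013-08-17", "2013-08-24")),
                  GF = c(0, 2, 1),
                  GA = c(1, 0, 1),
                  stringsAsFactors = FALSE)

toy_cum <- toy %>%
  group_by(Season, team) %>%
  mutate(GD = GF - GA,
         pts = ifelse(GD > 0, 3, ifelse(GD < 0, 0, 1)),  # 3 for a win, 1 for a draw
         gameno = dense_rank(Date)) %>%                  # 1 = earliest game
  arrange(Season, team, gameno) %>%                      # cumsum needs game order
  mutate(Cumpts = cumsum(pts),
         pts.pg = Cumpts / gameno) %>%
  ungroup()

toy_cum %>% select(gameno, pts, Cumpts, pts.pg)
```

The win, draw and loss (in date order) give cumulative points of 3, 4 and 4, so points per game falls from 3 to 2 to 4/3.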
What we really want to do in this example is to compare each team’s performance from one season to the next. This requires us to only keep the cumulative data for the last match of each team in each season. To do this, it’s a simple procedure to combine max from base-r and filter from dplyr to get the data from the highest gameno for each grouped team/Season combination (remember the data is still grouped - the group_by performed earlier is still a property of the dataframe).
mydf.final <- mydf %>% filter(gameno == max(gameno))
mydf.final #this shows the final standings of teams in the first season - 1888/89
## Source: local data frame [2,363 x 11]
## Groups: Season, team
##
## Season team gameno Cumpts CumGF CumGA CumGD pts.pg
## 1 1888 Accrington F.C. 22 26 48 48 0 1.1818182
## 2 1888 Aston Villa 22 41 61 43 18 1.8636364
## 3 1888 Blackburn Rovers 22 36 66 45 21 1.6363636
## 4 1888 Bolton Wanderers 22 32 63 59 4 1.4545455
## 5 1888 Burnley 22 24 42 62 -20 1.0909091
## 6 1888 Derby County 22 23 41 61 -20 1.0454545
## 7 1888 Everton 22 29 35 46 -11 1.3181818
## 8 1888 Notts County 22 17 40 73 -33 0.7727273
## 9 1888 Preston North End 22 58 74 15 59 2.6363636
## 10 1888 Stoke City 22 16 26 51 -25 0.7272727
## .. ... ... ... ... ... ... ... ...
## Variables not shown: GF.pg (dbl), GA.pg (dbl), GD.pg (dbl)
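The grouped filter is worth a quick illustration, since max(gameno) is evaluated within each group rather than over the whole dataframe. This is a sketch on invented values for two hypothetical teams that have played different numbers of games.

```r
library(dplyr)

# Hypothetical cumulative points for two teams (values made up)
toy <- data.frame(team = c("Team A", "Team A", "Team A", "Team B", "Team B"),
                  gameno = c(1, 2, 3, 1, 2),
                  Cumpts = c(3, 4, 4, 0, 3),
                  stringsAsFactors = FALSE)

toy_final <- toy %>%
  group_by(team) %>%
  filter(gameno == max(gameno)) %>%  # max() is evaluated separately per team
  ungroup()

toy_final  # Team A's row for game 3 and Team B's row for game 2
```

Without the group_by, only Team A's third game would be kept, because 3 is the overall maximum.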
If we use ungroup() from dplyr, we can now also arrange the data by team. Here are the first few rows. As you can see, Accrington F.C.’s last hurrah in the top tier of English soccer was 1892/93, and Arsenal’s first season was 1904/05.
mydf.final <- mydf.final %>% ungroup() %>% arrange(team,Season)
mydf.final
## Source: local data frame [2,363 x 11]
##
## Season team gameno Cumpts CumGF CumGA CumGD pts.pg
## 1 1888 Accrington F.C. 22 26 48 48 0 1.1818182
## 2 1889 Accrington F.C. 22 33 53 56 -3 1.5000000
## 3 1890 Accrington F.C. 22 22 28 50 -22 1.0000000
## 4 1891 Accrington F.C. 26 28 40 78 -38 1.0769231
## 5 1892 Accrington F.C. 30 29 57 81 -24 0.9666667
## 6 1904 Arsenal 34 45 36 40 -4 1.3235294
## 7 1905 Arsenal 38 52 62 64 -2 1.3684211
## 8 1906 Arsenal 38 64 66 59 7 1.6842105
## 9 1907 Arsenal 38 48 51 63 -12 1.2631579
## 10 1908 Arsenal 38 52 52 49 3 1.3684211
## .. ... ... ... ... ... ... ... ...
## Variables not shown: GF.pg (dbl), GA.pg (dbl), GD.pg (dbl)
Just as an example, here are all the end of season records for Stoke City in the top tier:
mydf.final %>% filter(team=="Stoke City")
## Source: local data frame [58 x 11]
##
## Season team gameno Cumpts CumGF CumGA CumGD pts.pg GF.pg
## 1 1888 Stoke City 22 16 26 51 -25 0.7272727 1.1818182
## 2 1889 Stoke City 22 13 27 69 -42 0.5909091 1.2272727
## 3 1891 Stoke City 26 19 38 61 -23 0.7307692 1.4615385
## 4 1892 Stoke City 30 41 58 48 10 1.3666667 1.9333333
## 5 1893 Stoke City 30 42 65 79 -14 1.4000000 2.1666667
## 6 1894 Stoke City 30 33 50 67 -17 1.1000000 1.6666667
## 7 1895 Stoke City 30 45 56 47 9 1.5000000 1.8666667
## 8 1896 Stoke City 30 36 48 59 -11 1.2000000 1.6000000
## 9 1897 Stoke City 30 32 35 55 -20 1.0666667 1.1666667
## 10 1898 Stoke City 34 46 47 52 -5 1.3529412 1.3823529
## 11 1899 Stoke City 34 47 37 45 -8 1.3823529 1.0882353
## 12 1900 Stoke City 34 38 46 57 -11 1.1176471 1.3529412
## 13 1901 Stoke City 34 42 45 55 -10 1.2352941 1.3235294
## 14 1902 Stoke City 34 52 46 38 8 1.5294118 1.3529412
## 15 1903 Stoke City 34 37 54 57 -3 1.0882353 1.5882353
## 16 1904 Stoke City 34 43 40 58 -18 1.2647059 1.1764706
## 17 1905 Stoke City 38 55 54 55 -1 1.4473684 1.4210526
## 18 1906 Stoke City 38 34 41 64 -23 0.8947368 1.0789474
## 19 1922 Stoke City 42 40 47 67 -20 0.9523810 1.1190476
## 20 1933 Stoke City 42 56 58 71 -13 1.3333333 1.3809524
## 21 1934 Stoke City 42 60 71 70 1 1.4285714 1.6904762
## 22 1935 Stoke City 42 67 57 57 0 1.5952381 1.3571429
## 23 1936 Stoke City 42 57 72 57 15 1.3571429 1.7142857
## 24 1937 Stoke City 42 51 58 59 -1 1.2142857 1.3809524
## 25 1938 Stoke City 42 63 71 68 3 1.5000000 1.6904762
## 26 1946 Stoke City 42 79 90 53 37 1.8809524 2.1428571
## 27 1947 Stoke City 42 52 41 55 -14 1.2380952 0.9761905
## 28 1948 Stoke City 42 57 66 68 -2 1.3571429 1.5714286
## 29 1949 Stoke City 42 45 45 75 -30 1.0714286 1.0714286
## 30 1950 Stoke City 42 53 50 59 -9 1.2619048 1.1904762
## 31 1951 Stoke City 42 43 49 88 -39 1.0238095 1.1666667
## 32 1952 Stoke City 42 46 53 66 -13 1.0952381 1.2619048
## 33 1963 Stoke City 42 52 77 78 -1 1.2380952 1.8333333
## 34 1964 Stoke City 42 58 67 66 1 1.3809524 1.5952381
## 35 1965 Stoke City 42 57 65 64 1 1.3571429 1.5476190
## 36 1966 Stoke City 42 58 63 58 5 1.3809524 1.5000000
## 37 1967 Stoke City 42 49 50 73 -23 1.1666667 1.1904762
## 38 1968 Stoke City 42 42 40 63 -23 1.0000000 0.9523810
## 39 1969 Stoke City 42 60 56 52 4 1.4285714 1.3333333
## 40 1970 Stoke City 42 49 44 48 -4 1.1666667 1.0476190
## 41 1971 Stoke City 42 45 39 56 -17 1.0714286 0.9285714
## 42 1972 Stoke City 42 52 61 56 5 1.2380952 1.4523810
## 43 1973 Stoke City 42 61 54 42 12 1.4523810 1.2857143
## 44 1974 Stoke City 42 66 64 48 16 1.5714286 1.5238095
## 45 1975 Stoke City 42 56 48 50 -2 1.3333333 1.1428571
## 46 1976 Stoke City 42 44 28 51 -23 1.0476190 0.6666667
## 47 1979 Stoke City 42 49 44 58 -14 1.1666667 1.0476190
## 48 1980 Stoke City 42 54 51 60 -9 1.2857143 1.2142857
## 49 1981 Stoke City 42 44 44 63 -19 1.0476190 1.0476190
## 50 1982 Stoke City 42 57 53 64 -11 1.3571429 1.2619048
## 51 1983 Stoke City 42 50 44 63 -19 1.1904762 1.0476190
## 52 1984 Stoke City 42 17 24 91 -67 0.4047619 0.5714286
## 53 2008 Stoke City 38 45 38 55 -17 1.1842105 1.0000000
## 54 2009 Stoke City 38 47 34 48 -14 1.2368421 0.8947368
## 55 2010 Stoke City 38 46 46 48 -2 1.2105263 1.2105263
## 56 2011 Stoke City 38 45 36 53 -17 1.1842105 0.9473684
## 57 2012 Stoke City 38 42 34 45 -11 1.1052632 0.8947368
## 58 2013 Stoke City 38 50 45 52 -7 1.3157895 1.1842105
## Variables not shown: GA.pg (dbl), GD.pg (dbl)
What you will notice here is that Stoke were founder members of the top tier in 1888/89 but have been in and out of the top division.
We are only interested in comparing performances from one season to the next when both seasons were spent in the top tier. We don’t want to compare non-consecutive appearances in the top tier - e.g. there’s no need to compare Stoke’s record in 1984/85 to 2008/09, but we do want to compare their performance from 2008/09 to 2009/10 to 2010/11 etc. Incidentally, I am also comparing their record from 1938/39 to 1946/47 as these are the two complete seasons either side of the war.
To do this, what we need to do for each team is to re-insert the Seasons that they were not in the top tier, but that there was top tier soccer - this is done in the next section.
We can get every team’s individual record by using base-R’s split - this splits every team’s data into a separate dataframe, with all of them being stored in a list.
mydf.final.split <- split(mydf.final, mydf.final$team)
Showing a summary of the dataframes returned by split would produce too much output for this guide. Each team’s dataframe is essentially equivalent to the example shown above for Stoke City. In fact, we can save any team’s data just by adding e.g. $'Stoke City' to the end of the split function.
I will do this with Stoke’s data as a worked example of what we are about to do to every team’s data.
stoke <- split(mydf.final, mydf.final$team)$`Stoke City`
head(stoke)
## Source: local data frame [6 x 11]
##
## Season team gameno Cumpts CumGF CumGA CumGD pts.pg GF.pg
## 1 1888 Stoke City 22 16 26 51 -25 0.7272727 1.181818
## 2 1889 Stoke City 22 13 27 69 -42 0.5909091 1.227273
## 3 1891 Stoke City 26 19 38 61 -23 0.7307692 1.461538
## 4 1892 Stoke City 30 41 58 48 10 1.3666667 1.933333
## 5 1893 Stoke City 30 42 65 79 -14 1.4000000 2.166667
## 6 1894 Stoke City 30 33 50 67 -17 1.1000000 1.666667
## Variables not shown: GA.pg (dbl), GD.pg (dbl)
The seasons that Stoke City were in the top flight are in the following variable:
stoke$Season
## [1] 1888 1889 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902
## [15] 1903 1904 1905 1906 1922 1933 1934 1935 1936 1937 1938 1946 1947 1948
## [29] 1949 1950 1951 1952 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972
## [43] 1973 1974 1975 1976 1979 1980 1981 1982 1983 1984 2008 2009 2010 2011
## [57] 2012 2013
Every season in which top tier games were played can be obtained using unique from base R. I just use sort here to make sure the seasons are returned in numerical order:
sort(unique(df$Season)) #missing seasons between 1888-2013 are due to world wars
## [1] 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901
## [15] 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1919
## [29] 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933
## [43] 1934 1935 1936 1937 1938 1946 1947 1948 1949 1950 1951 1952 1953 1954
## [57] 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968
## [71] 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982
## [85] 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996
## [99] 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
## [113] 2011 2012 2013
We can calculate the seasons that Stoke were not in the top tier, but in which top tier football took place, by using base R’s setdiff - sort is used again just to ensure numerical order:
sort(setdiff(unique(mydf.final$Season), stoke$Season)) #seasons Stoke City not in top-flight but top tier soccer took place
## [1] 1890 1907 1908 1909 1910 1911 1912 1913 1914 1919 1920 1921 1923 1924
## [15] 1925 1926 1927 1928 1929 1930 1931 1932 1953 1954 1955 1956 1957 1958
## [29] 1959 1960 1961 1962 1977 1978 1985 1986 1987 1988 1989 1990 1991 1992
## [43] 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
## [57] 2007
xtra.seasons <- setdiff(unique(mydf.final$Season), stoke$Season) #storing these seasons in a vector
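One detail of setdiff that is easy to trip over is that the order of its arguments matters. A tiny sketch with made-up season vectors (not the real data) shows the difference:

```r
# Made-up season vectors for illustration
all_seasons  <- c(1888, 1889, 1890, 1891, 1892)
team_seasons <- c(1888, 1889, 1891)

# Seasons that were played but that the team missed
setdiff(all_seasons, team_seasons)

# The order of the arguments matters: nothing in team_seasons is missing
# from all_seasons, so this returns an empty vector
setdiff(team_seasons, all_seasons)
```

The first call returns 1890 and 1892; the second returns a vector of length zero.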
Now to add in these blank seasons to Stoke’s dataframe. This is done by binding Stoke’s data to a new data.frame that has all the missing seasons in the first variable, the team name in the second variable, and every other variable is named the same as in the Stoke dataframe but contains NAs.
stoke <-rbind(stoke,
data.frame(Season=xtra.seasons,
team=unique(stoke$team),
gameno=NA,
Cumpts=NA,
CumGF=NA,
CumGA=NA,
CumGD=NA,
pts.pg=NA,
GF.pg=NA,
GA.pg=NA,
GD.pg=NA)
)
stoke %>% arrange(Season) # in season order
## Source: local data frame [115 x 11]
##
## Season team gameno Cumpts CumGF CumGA CumGD pts.pg GF.pg
## 1 1888 Stoke City 22 16 26 51 -25 0.7272727 1.181818
## 2 1889 Stoke City 22 13 27 69 -42 0.5909091 1.227273
## 3 1890 Stoke City NA NA NA NA NA NA NA
## 4 1891 Stoke City 26 19 38 61 -23 0.7307692 1.461538
## 5 1892 Stoke City 30 41 58 48 10 1.3666667 1.933333
## 6 1893 Stoke City 30 42 65 79 -14 1.4000000 2.166667
## 7 1894 Stoke City 30 33 50 67 -17 1.1000000 1.666667
## 8 1895 Stoke City 30 45 56 47 9 1.5000000 1.866667
## 9 1896 Stoke City 30 36 48 59 -11 1.2000000 1.600000
## 10 1897 Stoke City 30 32 35 55 -20 1.0666667 1.166667
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: GA.pg (dbl), GD.pg (dbl)
tail(stoke %>% arrange(Season), 10) #last 10 rows of Stoke's dataframe
## Source: local data frame [10 x 11]
##
## Season team gameno Cumpts CumGF CumGA CumGD pts.pg GF.pg
## 1 2004 Stoke City NA NA NA NA NA NA NA
## 2 2005 Stoke City NA NA NA NA NA NA NA
## 3 2006 Stoke City NA NA NA NA NA NA NA
## 4 2007 Stoke City NA NA NA NA NA NA NA
## 5 2008 Stoke City 38 45 38 55 -17 1.184211 1.0000000
## 6 2009 Stoke City 38 47 34 48 -14 1.236842 0.8947368
## 7 2010 Stoke City 38 46 46 48 -2 1.210526 1.2105263
## 8 2011 Stoke City 38 45 36 53 -17 1.184211 0.9473684
## 9 2012 Stoke City 38 42 34 45 -11 1.105263 0.8947368
## 10 2013 Stoke City 38 50 45 52 -7 1.315789 1.1842105
## Variables not shown: GA.pg (dbl), GD.pg (dbl)
Note how seasons 1890/91 and 2004/05-2007/08 contain only NAs, as Stoke weren’t in the top tier during these seasons.
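As an aside, the same padding could probably be done with a join instead of building the NA rows by hand. This is only an alternative sketch, not the approach used in this guide, and the seasons and values below are made up.

```r
library(dplyr)

# All seasons in which top-tier games were played (made-up subset)
all_seasons <- c(1888, 1889, 1890, 1891)

# Hypothetical team that missed the 1890 season
toy <- data.frame(Season = c(1888, 1889, 1891),
                  team = "Team A",
                  pts.pg = c(0.7, 0.6, 0.7),
                  stringsAsFactors = FALSE)

# Start from the full list of seasons and join the team's records onto it;
# seasons without a match get NA in every joined column
toy_full <- data.frame(Season = all_seasons) %>%
  left_join(toy, by = "Season") %>%
  mutate(team = "Team A") %>%   # refill the team name lost in the NA rows
  arrange(Season)

toy_full
```

The appeal of the join is that the NA columns never need to be listed by name, at the cost of having to refill the team name afterwards.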
OK - that’s a lot of data munging. Just as a quick respite, here are some visualizations of the Stoke data just to get an idea of what we are collecting.
Here, we will look at season by season changes in points per game, goals scored per game, goals against per game and goal difference per game.
I won’t go too much into the ggplot2 code. Possibly the only slightly unusual thing is the geom_line(aes(group=1))... part which enables us to plot lines that do not join data points across missing (NA) values:
stoke$Season <- as.numeric(stoke$Season) #for graphing purposes make sure Season variable is numeric
g1 <- ggplot(stoke, aes(Season, pts.pg)) +
geom_point(color="black", size=2) +
geom_line(aes(group=1), color="red") +
theme_bw() +
scale_x_continuous(breaks=seq(1888, 2014, by=63))
g2 <- ggplot(stoke, aes(Season, GF.pg)) +
geom_point(color="black", size=2) +
geom_line(aes(group=1), color="red") +
theme_bw() +
scale_x_continuous(breaks=seq(1888, 2014, by=63))
g3 <- ggplot(stoke, aes(Season, GA.pg)) +
geom_point(color="black", size=2) +
geom_line(aes(group=1), color="red") +
theme_bw() +
scale_x_continuous(breaks=seq(1888, 2014, by=63))
g4 <- ggplot(stoke, aes(Season, GD.pg)) +
geom_point(color="black", size=2) +
geom_line(aes(group=1), color="red") +
theme_bw() +
scale_x_continuous(breaks=seq(1888, 2014, by=63))
grid.arrange(g1,g2,g3,g4, ncol=2)
Just out of interest - we can see that Stoke had one outstanding season in terms of points per game and one really terrible one. We can find these using filter:
stoke %>% filter(pts.pg == max(pts.pg, na.rm=T))
## Source: local data frame [1 x 11]
##
## Season team gameno Cumpts CumGF CumGA CumGD pts.pg GF.pg
## 1 1946 Stoke City 42 79 90 53 37 1.880952 2.142857
## Variables not shown: GA.pg (dbl), GD.pg (dbl)
stoke %>% filter(pts.pg == min(pts.pg, na.rm=T))
## Source: local data frame [1 x 11]
##
## Season team gameno Cumpts CumGF CumGA CumGD pts.pg GF.pg
## 1 1984 Stoke City 42 17 24 91 -67 0.4047619 0.5714286
## Variables not shown: GA.pg (dbl), GD.pg (dbl)
In 1946/47, Stoke averaged 1.88 points per game (assuming 3 points for a win). They finished in 4th position that year.
In 1984/85, Stoke only averaged 0.40 points per game - they finished rock bottom that season.
Calculating the change from one season to the next is the main purpose of this guide - and it is incredibly simple to do using lag. Because dplyr is loaded, this is dplyr’s lag, which masks the time-series lag in base R’s stats package. This function takes a vector (or a variable in a dataframe) and returns it shifted by one position, i.e. in our situation each row receives the previous season’s value. If the previous season has an NA, it will return an NA.
For example, let’s look at Stoke City’s points per game. We will select just the season and points per game variables, then create a new variable which is a lagged points per game and then another variable that is the difference.
Note that it’s important for the dataframe to be organized in Season order for lag to work correctly.
stoke1 <- stoke %>%
select(Season,pts.pg) %>%
arrange(Season) %>%
mutate(pts.pglag = lag(pts.pg), pts.pgDIF = pts.pg - pts.pglag)
as.data.frame(stoke1)
## Season pts.pg pts.pglag pts.pgDIF
## 1 1888 0.7272727 NA NA
## 2 1889 0.5909091 0.7272727 -0.13636364
## 3 1890 NA 0.5909091 NA
## 4 1891 0.7307692 NA NA
## 5 1892 1.3666667 0.7307692 0.63589744
## 6 1893 1.4000000 1.3666667 0.03333333
## 7 1894 1.1000000 1.4000000 -0.30000000
## 8 1895 1.5000000 1.1000000 0.40000000
## 9 1896 1.2000000 1.5000000 -0.30000000
## 10 1897 1.0666667 1.2000000 -0.13333333
## 11 1898 1.3529412 1.0666667 0.28627451
## 12 1899 1.3823529 1.3529412 0.02941176
## 13 1900 1.1176471 1.3823529 -0.26470588
## 14 1901 1.2352941 1.1176471 0.11764706
## 15 1902 1.5294118 1.2352941 0.29411765
## 16 1903 1.0882353 1.5294118 -0.44117647
## 17 1904 1.2647059 1.0882353 0.17647059
## 18 1905 1.4473684 1.2647059 0.18266254
## 19 1906 0.8947368 1.4473684 -0.55263158
## 20 1907 NA 0.8947368 NA
## 21 1908 NA NA NA
## 22 1909 NA NA NA
## 23 1910 NA NA NA
## 24 1911 NA NA NA
## 25 1912 NA NA NA
## 26 1913 NA NA NA
## 27 1914 NA NA NA
## 28 1919 NA NA NA
## 29 1920 NA NA NA
## 30 1921 NA NA NA
## 31 1922 0.9523810 NA NA
## 32 1923 NA 0.9523810 NA
## 33 1924 NA NA NA
## 34 1925 NA NA NA
## 35 1926 NA NA NA
## 36 1927 NA NA NA
## 37 1928 NA NA NA
## 38 1929 NA NA NA
## 39 1930 NA NA NA
## 40 1931 NA NA NA
## 41 1932 NA NA NA
## 42 1933 1.3333333 NA NA
## 43 1934 1.4285714 1.3333333 0.09523810
## 44 1935 1.5952381 1.4285714 0.16666667
## 45 1936 1.3571429 1.5952381 -0.23809524
## 46 1937 1.2142857 1.3571429 -0.14285714
## 47 1938 1.5000000 1.2142857 0.28571429
## 48 1946 1.8809524 1.5000000 0.38095238
## 49 1947 1.2380952 1.8809524 -0.64285714
## 50 1948 1.3571429 1.2380952 0.11904762
## 51 1949 1.0714286 1.3571429 -0.28571429
## 52 1950 1.2619048 1.0714286 0.19047619
## 53 1951 1.0238095 1.2619048 -0.23809524
## 54 1952 1.0952381 1.0238095 0.07142857
## 55 1953 NA 1.0952381 NA
## 56 1954 NA NA NA
## 57 1955 NA NA NA
## 58 1956 NA NA NA
## 59 1957 NA NA NA
## 60 1958 NA NA NA
## 61 1959 NA NA NA
## 62 1960 NA NA NA
## 63 1961 NA NA NA
## 64 1962 NA NA NA
## 65 1963 1.2380952 NA NA
## 66 1964 1.3809524 1.2380952 0.14285714
## 67 1965 1.3571429 1.3809524 -0.02380952
## 68 1966 1.3809524 1.3571429 0.02380952
## 69 1967 1.1666667 1.3809524 -0.21428571
## 70 1968 1.0000000 1.1666667 -0.16666667
## 71 1969 1.4285714 1.0000000 0.42857143
## 72 1970 1.1666667 1.4285714 -0.26190476
## 73 1971 1.0714286 1.1666667 -0.09523810
## 74 1972 1.2380952 1.0714286 0.16666667
## 75 1973 1.4523810 1.2380952 0.21428571
## 76 1974 1.5714286 1.4523810 0.11904762
## 77 1975 1.3333333 1.5714286 -0.23809524
## 78 1976 1.0476190 1.3333333 -0.28571429
## 79 1977 NA 1.0476190 NA
## 80 1978 NA NA NA
## 81 1979 1.1666667 NA NA
## 82 1980 1.2857143 1.1666667 0.11904762
## 83 1981 1.0476190 1.2857143 -0.23809524
## 84 1982 1.3571429 1.0476190 0.30952381
## 85 1983 1.1904762 1.3571429 -0.16666667
## 86 1984 0.4047619 1.1904762 -0.78571429
## 87 1985 NA 0.4047619 NA
## 88 1986 NA NA NA
## 89 1987 NA NA NA
## 90 1988 NA NA NA
## 91 1989 NA NA NA
## 92 1990 NA NA NA
## 93 1991 NA NA NA
## 94 1992 NA NA NA
## 95 1993 NA NA NA
## 96 1994 NA NA NA
## 97 1995 NA NA NA
## 98 1996 NA NA NA
## 99 1997 NA NA NA
## 100 1998 NA NA NA
## 101 1999 NA NA NA
## 102 2000 NA NA NA
## 103 2001 NA NA NA
## 104 2002 NA NA NA
## 105 2003 NA NA NA
## 106 2004 NA NA NA
## 107 2005 NA NA NA
## 108 2006 NA NA NA
## 109 2007 NA NA NA
## 110 2008 1.1842105 NA NA
## 111 2009 1.2368421 1.1842105 0.05263158
## 112 2010 1.2105263 1.2368421 -0.02631579
## 113 2011 1.1842105 1.2105263 -0.02631579
## 114 2012 1.1052632 1.1842105 -0.07894737
## 115 2013 1.3157895 1.1052632 0.21052632
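The behaviour of lag around missing seasons can be seen on a short made-up vector. The values below are invented; the only assumption is that dplyr is installed, and the function is called as dplyr::lag to make clear which lag is meant.

```r
library(dplyr)

# Made-up points-per-game values; NA marks a season outside the top tier
pts.pg <- c(1.0, 1.5, NA, 1.2, 1.4)

prev   <- dplyr::lag(pts.pg)  # shift everything down by one position
change <- pts.pg - prev       # NA whenever either season is missing

data.frame(pts.pg, prev, change)
```

Only the second and fifth rows have a value in both columns, so only those two rows get a difference. This is exactly why the NA seasons were inserted: they stop a team's return to the top tier from being compared with its last season before relegation.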
To keep with the Stoke City worked example, we can do this lagged difference for points, GF, GA, and GD and visualize:
stoke2 <- stoke %>%
arrange(Season) %>%
mutate(pts.pglag = lag(pts.pg), pts.pgDIF = pts.pg - pts.pglag,
GF.pglag = lag(GF.pg), GF.pgDIF = GF.pg - GF.pglag,
GA.pglag = lag(GA.pg), GA.pgDIF = GA.pg - GA.pglag,
GD.pglag = lag(GD.pg), GD.pgDIF = GD.pg - GD.pglag #difference from last season's GD per game
)
gg1 <- ggplot(stoke2, aes(Season, pts.pgDIF)) +
geom_point(color="black", size=2) +
geom_line(aes(group=1), color="red") +
theme_bw() +
scale_x_continuous(breaks=seq(1888, 2014, by=63))
gg2 <- ggplot(stoke2, aes(Season, GF.pgDIF)) +
geom_point(color="black", size=2) +
geom_line(aes(group=1), color="red") +
theme_bw() +
scale_x_continuous(breaks=seq(1888, 2014, by=63))
gg3 <- ggplot(stoke2, aes(Season, GA.pgDIF)) +
geom_point(color="black", size=2) +
geom_line(aes(group=1), color="red") +
theme_bw() +
scale_x_continuous(breaks=seq(1888, 2014, by=63))
gg4 <- ggplot(stoke2, aes(Season, GD.pgDIF)) +
geom_point(color="black", size=2) +
geom_line(aes(group=1), color="red") +
theme_bw() +
scale_x_continuous(breaks=seq(1888, 2014, by=63))
grid.arrange(gg1,gg2,gg3,gg4, ncol=2)
There are some interesting changes in Stoke’s Goals For in the late 1940s…
stoke2 %>%
filter(Season >= 1945 & Season <= 1952) %>%
select(Season, GF.pg, GF.pglag, GF.pgDIF)
## Source: local data frame [7 x 4]
##
## Season GF.pg GF.pglag GF.pgDIF
## 1 1946 2.1428571 1.6904762 0.45238095
## 2 1947 0.9761905 2.1428571 -1.16666667
## 3 1948 1.5714286 0.9761905 0.59523810
## 4 1949 1.0714286 1.5714286 -0.50000000
## 5 1950 1.1904762 1.0714286 0.11904762
## 6 1951 1.1666667 1.1904762 -0.02380952
## 7 1952 1.2619048 1.1666667 0.09523810
As we saw earlier, Stoke finished in 4th in 1946/47 and scored 90 goals in 42 games. The following season they only scored 41 goals in 42 games and finished in 15th, though they rebounded in 1948/49 and scored 66 goals in 42 games. It’s also interesting that in their relegation season of 1984/85 their goals scored fell from 44 to 24 compared with the previous season, while their goals conceded rose from 63 to 91.
One final set of visualizations in the worked example of Stoke City: scatterplots to look at repeatability from one season to the next. Here, we just plot one season’s points per game, GF per game, GA per game and GD per game against the next season’s. Seasons that did not have a preceding or succeeding season in the top tier are not included.
#visualizing top-flight points per game repeatability for Stoke City
ggg1 <- ggplot(stoke2, aes(pts.pglag, pts.pg)) +
geom_point(size=2, color="red") +
stat_smooth(method="lm",se=F, lwd=1, color="firebrick") +
theme_bw() +
xlab("Preceding Season") + ylab("Succeeding Season") + ggtitle("Points per game")
ggg2 <- ggplot(stoke2, aes(GF.pglag, GF.pg)) +
geom_point(size=2, color="red") +
stat_smooth(method="lm",se=F, lwd=1, color="firebrick") +
theme_bw() +
xlab("Preceding Season") + ylab("Succeeding Season") + ggtitle("GF per game")
ggg3 <- ggplot(stoke2, aes(GA.pglag, GA.pg)) +
geom_point(size=2, color="red") +
stat_smooth(method="lm",se=F, lwd=1, color="firebrick") +
theme_bw() +
xlab("Preceding Season") + ylab("Succeeding Season") + ggtitle("GA per game")
ggg4 <- ggplot(stoke2, aes(GD.pglag, GD.pg)) +
geom_point(size=2, color="red") +
stat_smooth(method="lm",se=F, lwd=1, color="firebrick") +
theme_bw() +
xlab("Preceding Season") + ylab("Succeeding Season") + ggtitle("GD per game")
grid.arrange(ggg1,ggg2,ggg3,ggg4, ncol=2)
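If a single number is wanted alongside the scatterplots, one option (an addition here, not something calculated in this guide) is the correlation between a season's value and the previous season's value, using only the pairs where both are present. The sketch below uses invented values rather than Stoke's data.

```r
# Made-up per-game values; NA marks a season with no top-tier record
this_season <- c(1.2, 1.5, NA, 1.1, 1.4, 1.3)
prev_season <- c(NA, 1.2, 1.5, NA, 1.1, 1.4)

# Only pairs where both values are present can contribute
complete <- !is.na(this_season) & !is.na(prev_season)
sum(complete)

# Correlation over those complete pairs only
r <- cor(prev_season, this_season, use = "complete.obs")
r
```

Here three of the six seasons have a usable pair, which mirrors how the scatterplots silently drop seasons without a preceding top-tier season.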
To look at season by season trends for all teams, we first need to write a generic function. This is essentially using the above code applied to Stoke City but written generically.
This function takes two arguments. The first, mydf.final, is the filtered dataframe that we produced above, which contains all the seasons in which top tier football took place. The second, x, is an individual team’s dataframe, as stored above by the split function.
fun1 <- function(mydf.final, x){
x <- as.data.frame(x)
xtra.seasons <- setdiff(unique(mydf.final$Season), x$Season) #storing these seasons in a vector
x <-rbind(x,
data.frame(Season=xtra.seasons,
team=unique(x$team),
gameno=NA,
Cumpts=NA,
CumGF=NA,
CumGA=NA,
CumGD=NA,
pts.pg=NA,
GF.pg=NA,
GA.pg=NA,
GD.pg=NA)
)
x$Season <- as.numeric(x$Season) #for graphing purposes
x <- x %>%
arrange(Season) %>%
mutate(pts.pglag = lag(pts.pg),
GF.pglag = lag(GF.pg),
GA.pglag = lag(GA.pg),
GD.pglag = lag(GD.pg),
pts.pgDIF=pts.pg - lag(pts.pg),
GF.pgDIF=GF.pg - lag(GF.pg),
GA.pgDIF=GA.pg - lag(GA.pg),
GD.pgDIF=GD.pg - lag(GD.pg)
)
return(x)
}
Apply the above function to all teams using lapply and then convert to one large dataframe using do.call and rbind:
mydf.all <- lapply(mydf.final.split, function(x) fun1(mydf.final, x))
alldf <- do.call("rbind", mydf.all)
sum(!is.na(alldf$pts.pgDIF)) #this is the total number of observations containing data in points per game = 2088
## [1] 2088
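The split, lapply and do.call pattern may be easier to follow on a tiny example. In this sketch, add_change is a hypothetical, cut-down stand-in for fun1 (it skips the step that inserts the missing seasons), and the teams and values are made up.

```r
library(dplyr)

# Hypothetical end-of-season values for two teams (numbers made up)
toy <- data.frame(team = c("Team A", "Team A", "Team B", "Team B"),
                  Season = c(2012, 2013, 2012, 2013),
                  pts.pg = c(1.0, 1.5, 2.0, 1.8),
                  stringsAsFactors = FALSE)

# A small stand-in for fun1: add the season-to-season change for one team
add_change <- function(x) {
  x %>%
    arrange(Season) %>%
    mutate(pts.pgDIF = pts.pg - dplyr::lag(pts.pg))
}

toy_split <- split(toy, toy$team)           # list with one dataframe per team
toy_list  <- lapply(toy_split, add_change)  # apply the function to each team
toy_all   <- do.call("rbind", toy_list)     # stack the results back together

toy_all
```

Because the function runs on one team at a time, the first season of each team gets an NA rather than being compared with the last season of the team above it.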
This alldf contains all the data we need. Let’s visualize it…
Just as an illustration, here we rank the points per game difference between two seasons (all 2088 observations) and plot them.
tbl_df(alldf %>%
filter(!is.na(pts.pgDIF)) %>%
mutate(rank = rank(pts.pgDIF)) %>%
select(team,Season,pts.pgDIF,rank) %>%
arrange(rank)
)
## Source: local data frame [2,088 x 4]
##
## team Season pts.pgDIF rank
## 1 Everton 1970 -1.0952381 1.0
## 2 Aston Villa 1900 -0.9411765 2.0
## 3 Newcastle United 1977 -0.9285714 3.5
## 4 Sheffield United 1975 -0.9285714 3.5
## 5 West Bromwich Albion 1890 -0.8636364 5.0
## 6 Sheffield United 1898 -0.8490196 6.0
## 7 Arsenal 1912 -0.8421053 7.0
## 8 West Bromwich Albion 1920 -0.8333333 8.0
## 9 Sheffield Wednesday 1919 -0.8120301 9.0
## 10 Ipswich Town 2001 -0.7894737 10.0
## .. ... ... ... ...
Straight away we can see that Everton’s 1970/71 season was an absolute disaster compared to their previous season. In 1969/70, Everton won the league but bombed to 14th the following season.
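The rank of 3.5 shared by Newcastle United and Sheffield United in the table above comes from how base R's rank handles ties by default. A minimal sketch with made-up values:

```r
# Made-up season-to-season changes, with one tie
change <- c(-1.10, -0.93, -0.93, -0.86)

# Default: tied values share the average of the positions they occupy
rank_avg <- rank(change)
rank_avg

# Alternative: tied values share the lowest of those positions
rank_min <- rank(change, ties.method = "min")
rank_min
```

The two tied values would occupy positions 2 and 3, so the default gives both 2.5, while ties.method = "min" gives both 2.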
tbl_df(alldf %>%
filter(!is.na(pts.pgDIF)) %>%
mutate(rank = rank(pts.pgDIF)) %>%
select(team,Season,pts.pgDIF,rank) %>%
arrange(desc(rank))
)
## Source: local data frame [2,088 x 4]
##
## team Season pts.pgDIF rank
## 1 Arsenal 1930 0.9761905 2088.0
## 2 Arsenal 1970 0.9523810 2087.0
## 3 Derby County 1895 0.9333333 2086.0
## 4 Sunderland 1891 0.8321678 2085.0
## 5 Sheffield United 1899 0.8235294 2084.0
## 6 Aston Villa 1989 0.7894737 2083.0
## 7 Manchester City 1967 0.7857143 2081.5
## 8 West Ham United 1985 0.7857143 2081.5
## 9 Sunderland 1897 0.7666667 2080.0
## 10 Everton 1938 0.7380952 2079.0
## .. ... ... ... ...
The two biggest gains were both by Arsenal. In 1970/71, the year Everton finished 14th, they finished champions after being only 12th the year before. In 1930/31, they were champions after being 14th the year before.
Here is a plot of all teams by rank order:
tmp1 <- alldf %>%
filter(!is.na(pts.pgDIF)) %>%
mutate(rank = rank(pts.pgDIF)) %>%
select(team,Season,pts.pgDIF,rank)
ggplot(tmp1, aes(rank, pts.pgDIF)) + geom_point() + theme_bw()
Another way of visualizing this data is to look across time:
ggplot(tmp1, aes(Season, pts.pgDIF)) + geom_point() + theme_bw() +
ylab("Points per game change season to season")
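With some extra edits, this gives a more polished figure: [figure: points per game change season to season, by Season]
We can rank the season to season changes in goals scored per game in the same way, starting with the biggest decreases: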
tbl_df(alldf %>%
filter(!is.na(GF.pgDIF)) %>%
mutate(rank = rank(GF.pgDIF)) %>%
select(team,Season,GF.pgDIF,rank) %>%
arrange(rank)
)
## Source: local data frame [2,088 x 4]
##
## team Season GF.pgDIF rank
## 1 Preston North End 1890 -1.2272727 1
## 2 West Bromwich Albion 1920 -1.1904762 2
## 3 Blackburn Rovers 1890 -1.1818182 3
## 4 Stoke City 1947 -1.1666667 4
## 5 Accrington F.C. 1890 -1.1363636 5
## 6 Arsenal 1933 -1.0238095 6
## 7 Newcastle United 1997 -1.0000000 7
## 8 Aston Villa 1892 -0.9897436 8
## 9 Everton 1891 -0.9790210 9
## 10 Arsenal 1992 -0.9761905 10
## .. ... ... ... ...
Nothing too exciting here - lots of teams from long ago seem to have dramatic changes in goals scored per game - but then again, this is when scoring was higher and so you’d expect more fluctuations as a result.
tbl_df(alldf %>%
filter(!is.na(GF.pgDIF)) %>%
mutate(rank = rank(GF.pgDIF)) %>%
select(team,Season,GF.pgDIF,rank) %>%
arrange(desc(rank))
)
## Source: local data frame [2,088 x 4]
##
## team Season GF.pgDIF rank
## 1 Aston Villa 1891 1.3776224 2088
## 2 Everton 1889 1.3636364 2087
## 3 Sunderland 1891 1.2587413 2086
## 4 West Bromwich Albion 1919 1.1867168 2085
## 5 Arsenal 1930 1.1666667 2084
## 6 Sheffield United 1925 1.1190476 2083
## 7 Everton 1984 1.0476190 2082
## 8 Tottenham Hotspur 1956 1.0238095 2081
## 9 Manchester City 1967 1.0238095 2080
## 10 Arsenal 1925 0.9761905 2079
## .. ... ... ... ...
Same as the above comment, though Man City won the league in 1967/68 after only finishing 15th the year before and Everton won in 1984/85. Spurs finished 2nd in 1956/57.
Across time we can see that goal scoring fluctuations haven’t actually been that much greater in the early years. There are also some quite dramatic recent increases and decreases in goal scoring:
tmp2 <- alldf %>% filter(!is.na(GF.pgDIF))
ggplot(tmp2, aes(Season, GF.pgDIF)) + geom_point() + theme_bw() +
ylab("Goals scored per game change season to season")
With some extra edits, this gives a more polished figure: [figure: goals scored per game change season to season, by Season]
Any questions or comments, please email me.