I have recently uploaded by engsoccerdata package to CRAN. It contains complete historical soccer results of major European leagues, the FA Cup and Champions League. For English soccer, it has every soccer game ever played in the top four divisions. That is 192,004 matches.
Recently Prof Antony Unwin of the University of Augsburg mailed me with an interesting question. He had been using my engsoccerdata for some examples and had been interested that Southampton had improved their league position for each of the last six seasons. Using my data he found three teams that had improved 8 years in a row. I thought that I would have a look at this myself as an example of what sort of questions can be quickly addressed using the data in engsoccerdata.
Question: How many teams have had runs of six seasons or more of successively improving their league position?
data.table for its speed.
library(engsoccerdata)
library(data.table)
library(ggplot2)
library(ggrepel)
maketable_eng function which accounts for the tie-breaking procedures and points for a win for each division in each season.
## Use the 'maketable_eng' function to make English tables based on rules for each season/division
#split Seasons by divisions/tiers into list
engseasons <- split(england, list(england$Season,england$division,england$tier))
engseasons <- Filter(function(x) dim(x)[1] > 0, engseasons) #drop empty dataframes (e.g no division 4 in 1900)
#get all combintions of divisions and seasons in data
seasontier <- as.data.frame(matrix(unlist(lapply(names(engseasons), function(x) strsplit(x, split="\\."))),ncol=3,byrow=T))
seasontier$V1 <- as.numeric(as.character(seasontier$V1))
seasontier$V2 <- as.character(seasontier$V2)
seasontier$V3 <- as.numeric(as.character(seasontier$V3))
#go through each division for each year and make a table
engtables <- list()
for(i in 1:nrow(seasontier)){
tmp <- maketable_eng(engseasons[[i]], Season=seasontier$V1[i], tier=seasontier$V3[i], division=seasontier$V2[i])
tmp$Season = seasontier$V1[i]
tmp$division = seasontier$V2[i]
tmp$tier=seasontier$V3[i]
engtables[[i]]<-tmp
}
#using data.table - Get position in pyramid for each team for each season
DT <- rbindlist(engtables)
DT$Pos <- as.numeric(DT$Pos)
DT <- DT[order(DT$Season,DT$tier,DT$Pos,DT$Pts,DT$gd),]
DT[, pyramidpos := seq_len(.N), by = Season]
DT
## team GP W D L gf ga gd Pts Pos Season
## 1: Preston North End 22 18 4 0 74 15 4.933333 40 1 1888
## 2: Aston Villa 22 12 5 5 61 43 1.418605 29 2 1888
## 3: Wolverhampton Wanderers 22 12 4 6 50 37 1.351351 28 3 1888
## 4: Blackburn Rovers 22 10 6 6 66 45 1.466667 26 4 1888
## 5: Bolton Wanderers 22 10 2 10 63 59 1.067797 22 5 1888
## ---
## 9000: Crawley Town 46 13 8 25 45 78 -33.000000 47 20 2015
## 9001: Morecambe 46 12 10 24 69 91 -22.000000 46 21 2015
## 9002: Newport County 46 10 13 23 43 64 -21.000000 43 22 2015
## 9003: Dagenham and Redbridge 46 8 10 28 46 81 -35.000000 34 23 2015
## 9004: York City 46 7 13 26 51 87 -36.000000 34 24 2015
## division tier pyramidpos
## 1: 1 1 1
## 2: 1 1 2
## 3: 1 1 3
## 4: 1 1 4
## 5: 1 1 5
## ---
## 9000: 4 4 88
## 9001: 4 4 89
## 9002: 4 4 90
## 9003: 4 4 91
## 9004: 4 4 92
The ‘gd’ variable above contains the tie-breaker for each season. In the 1888-89 season it was goal average (hence the decimal places) whereas in the 2015-16 season it was goal difference.
rle and keep those of six or more.
#calculate if pyramid position is higher or lower than previous year
setkey(DT,team)
DT[,posdiff:=c(NA,diff(pyramidpos)),by=team]
# this method of calculating difference in league positions has one draw back
# if teams get relegated from div4 and come back e.g. York City were relegated in 2003 and came back in 2012.
# it considers 2003 and 2012 to be successive seasons.
# we could add in NA seasons, but for this it is quicker to check manually our results at the end for these issues
DT[,updown:=posdiff/abs(posdiff),by=team]
DT[, updown := ifelse(is.na(updown), 0, updown)] #updown reads 0 - same or initial position, -1 = higher position, +1 = lower pos.
DT[, better := ifelse(updown==-1, 1, 0)] #if did better than previous year make 1, otherwise a 0.
DT[, rleLength := {rr <- rle(better); rep(rr$length, rr$length)}, by = 'team'] #calculate runs (better or not better)
#Find those with equal to or more than 6 seasons improvement in a row:
setkey(DT, better, rleLength)
DT1 <- DT[list(1, unique(rleLength[rleLength>=6])),nomatch=0]
#want to create unique id for each run ... (issue that Luton, Swansea and Southampton appear twice)
DT1[, id:=rleid(team)]
DT1$id <- ifelse(DT1$team=="Southampton" & DT1$Season>=2010, 26, DT1$id) #only Southampton is consecutive in the dt.
DT1 #check
## team GP W D L gf ga gd Pts Pos Season division
## 1: Bury 46 24 12 10 76 50 26.0000000 84 4 1984 4
## 2: Bury 46 12 13 21 63 67 -4.0000000 49 20 1985 3
## 3: Bury 46 14 13 19 54 60 -6.0000000 55 16 1986 3
## 4: Bury 46 15 14 17 58 57 1.0000000 59 14 1987 3
## 5: Bury 46 16 13 17 55 67 -12.0000000 61 13 1988 3
## ---
## 167: Oldham Athletic 46 25 12 9 83 47 1.7659574 62 1 1973 3
## 168: Oldham Athletic 42 10 15 17 40 48 0.8333333 35 18 1974 2
## 169: Oldham Athletic 42 13 12 17 57 68 0.8382353 38 17 1975 2
## 170: Oldham Athletic 42 14 10 18 52 64 -12.0000000 38 13 1976 2
## 171: Oldham Athletic 42 13 16 13 54 58 -4.0000000 42 8 1977 2
## tier pyramidpos posdiff updown better rleLength id
## 1: 4 72 -11 -1 1 6 1
## 2: 3 64 -8 -1 1 6 1
## 3: 3 60 -4 -1 1 6 1
## 4: 3 58 -2 -1 1 6 1
## 5: 3 57 -1 -1 1 6 1
## ---
## 167: 3 45 -3 -1 1 8 25
## 168: 2 40 -5 -1 1 8 25
## 169: 2 39 -1 -1 1 8 25
## 170: 2 35 -4 -1 1 8 25
## 171: 2 30 -5 -1 1 8 25
#Birmingham's run jumps from 1914 Season to 1919 Season, but that's ok
#Burnley's run jumps from 1914 Season to 1919 Season, but that's ok
#Huddersfield's run jumps from 1914 Season to 1919 Season, but that's ok
# To track improvement, we also want the Season before data
dt0 = DT1[DT1[, .I[Season == min(Season)], by = id]$V1]
dt0 = dt0[order(id),]
dt0[,Season:=Season-1] #this works as no Season is immediately after a break e.g. world war
dt0res=list()
for(i in 1:nrow(dt0)){ dt0res[[i]] = DT[Season==dt0$Season[i] & team==dt0$team[i],] }
DT0 <- rbindlist(dt0res)
DT0$id<-1:26
DT1 <- rbind(DT0,DT1)
setorder(DT1, id)
DT1[,rleLengthx := max(rleLength), by = 'id'] #update length value
### #add identifying variable for each team + season range
DT1[,teamseason := paste(team,paste(min(Season),max(Season+1),sep="-")), by = 'id']
#new teamseason variable based on magnitude of improvement
DT1[,improvement:=max(pyramidpos)-min(pyramidpos),by='teamseason']
mylevels <- unique(DT1[order(improvement),teamseason])
DT1$teamseason1 <- factor(DT1$teamseason,levels=mylevels)
geom_step to delineate each division for each year. I also add labels using ggrepel and want to add them at the top of each line. If we wanted to make this graph look a lot more professional it would be worth manually adjusting the position of the text labels, but this is ok for exploration. If the figure was increased in size it would look even better with much less overlap of labels.
## Show rise by year
# subset DT to get last Season of each team-season combination
DT1[,teamseason2 := paste(team,paste(min(Season),max(Season+1),sep="-"),sep="\n"), by = 'id'] #want variable to be separated by \n
DT1mins <- DT1[DT1[, .I[pyramidpos == min(pyramidpos)], by = teamseason1]$V1]
#get highest pyramid position by tier by Season
DTp <- DT[DT[, .I[pyramidpos == max(pyramidpos)], by=list(Season,tier)]$V1]
DTp <- DTp[order(tier,Season),] #1958 went to four divions
DTp$tier1 <- ifelse(DTp$Season>=1958 & DTp$tier==3, 5, DTp$tier) #making another tier to avoid a line joining tiers 3 and 4 vertically in 1958
ggplot(DT1, aes(Season,pyramidpos,group=teamseason,col=rleLengthx)) +
geom_path(lwd=1) +
scale_y_reverse() +
geom_point(size=3,fill="white",pch=21) +
geom_text_repel(
data=DT1mins,
aes(Season, pyramidpos, fill = factor(id), label = teamseason2),
size=3,
nudge_y =5,
nudge_x =0.1
) +
geom_step(data=DTp, aes(x=Season,y=pyramidpos,group=tier1),color="gray68") +
theme_minimal() +
theme(plot.title = element_text(margin=margin(0,0,20,0)),
legend.position="none",
panel.grid.major.y = element_blank()
) +
ylab("Position in Football Pyramid")+
ggtitle("Improving League Position for each of at least 6 Seasons")
The next plot is to show which teams had the biggest jumps in league position over their runs. Antony was right, there are only 3 teams (Gillingham, Oldham Athletic and Luton Town) that have improved their league position for 8 straight years. Three teams (Southampton, Luton Town and Swansea City) have had runs of improving for six seasons twice.
### Sort Plot by Improvement
ggplot(as.data.frame(DT1), aes(teamseason1,pyramidpos,color=factor(rleLengthx))) +
geom_path(lwd=1) +
geom_point(size=2,fill="white",pch=21) +
coord_flip() +
scale_y_reverse() +
theme_minimal() +
theme(plot.title = element_text(margin=margin(0,0,20,0),hjust=-1.2)) +
scale_color_manual(values=c("red","blue","black"),name="Consecutive \nSeasons") +
ylab("Position in Football Pyramid") +
xlab("") +
ggtitle("League Places Risen in Consecutive Seasons")
I hope that this was interesting or useful and helped with using data.table, ggplot2 or ggrepel. Any questions or comments please get in touch at jc3181 AT columbia DOT edu or on https://twitter.com/jalapic.