I have recently uploaded by engsoccerdata package to CRAN. It contains complete historical soccer results of major European leagues, the FA Cup and Champions League. For English soccer, it has every soccer game ever played in the top four divisions. That is 192,004 matches.

 

Recently Prof Antony Unwin of the University of Augsburg mailed me with an interesting question. He had been using my engsoccerdata for some examples and had been interested that Southampton had improved their league position for each of the last six seasons. Using my data he found three teams that had improved 8 years in a row. I thought that I would have a look at this myself as an example of what sort of questions can be quickly addressed using the data in engsoccerdata.

 

Question: How many teams have had runs of six seasons or more of successively improving their league position?

 

  1. First load the packages. I will use data.table for its speed.

 

library(engsoccerdata)
library(data.table)
library(ggplot2)
library(ggrepel)

 

  1. Get position in football pyramid for each team in each season. This is done by calculating the league position for each team using my maketable_eng function which accounts for the tie-breaking procedures and points for a win for each division in each season.

 

## Use the 'maketable_eng' function to make English tables based on rules for each season/division

#split Seasons by divisions/tiers into list
engseasons <- split(england, list(england$Season,england$division,england$tier))
engseasons <- Filter(function(x) dim(x)[1] > 0, engseasons) #drop empty dataframes (e.g no division 4 in 1900)

#get all combintions of divisions and seasons in data
seasontier <- as.data.frame(matrix(unlist(lapply(names(engseasons), function(x) strsplit(x, split="\\."))),ncol=3,byrow=T))
seasontier$V1 <- as.numeric(as.character(seasontier$V1))
seasontier$V2 <- as.character(seasontier$V2)
seasontier$V3 <- as.numeric(as.character(seasontier$V3))

#go through each division for each year and make a table
engtables <- list()
for(i in 1:nrow(seasontier)){
  tmp <- maketable_eng(engseasons[[i]], Season=seasontier$V1[i], tier=seasontier$V3[i], division=seasontier$V2[i])
  tmp$Season = seasontier$V1[i]
  tmp$division = seasontier$V2[i]
  tmp$tier=seasontier$V3[i]
  engtables[[i]]<-tmp
}

#using data.table - Get position in pyramid for each team for each season
DT <- rbindlist(engtables)
DT$Pos <- as.numeric(DT$Pos)
DT <- DT[order(DT$Season,DT$tier,DT$Pos,DT$Pts,DT$gd),]
DT[, pyramidpos := seq_len(.N), by = Season]
DT
##                          team GP  W  D  L gf ga         gd Pts Pos Season
##    1:       Preston North End 22 18  4  0 74 15   4.933333  40   1   1888
##    2:             Aston Villa 22 12  5  5 61 43   1.418605  29   2   1888
##    3: Wolverhampton Wanderers 22 12  4  6 50 37   1.351351  28   3   1888
##    4:        Blackburn Rovers 22 10  6  6 66 45   1.466667  26   4   1888
##    5:        Bolton Wanderers 22 10  2 10 63 59   1.067797  22   5   1888
##   ---                                                                    
## 9000:            Crawley Town 46 13  8 25 45 78 -33.000000  47  20   2015
## 9001:               Morecambe 46 12 10 24 69 91 -22.000000  46  21   2015
## 9002:          Newport County 46 10 13 23 43 64 -21.000000  43  22   2015
## 9003:  Dagenham and Redbridge 46  8 10 28 46 81 -35.000000  34  23   2015
## 9004:               York City 46  7 13 26 51 87 -36.000000  34  24   2015
##       division tier pyramidpos
##    1:        1    1          1
##    2:        1    1          2
##    3:        1    1          3
##    4:        1    1          4
##    5:        1    1          5
##   ---                         
## 9000:        4    4         88
## 9001:        4    4         89
## 9002:        4    4         90
## 9003:        4    4         91
## 9004:        4    4         92

 

The ‘gd’ variable above contains the tie-breaker for each season. In the 1888-89 season it was goal average (hence the decimal places) whereas in the 2015-16 season it was goal difference.

 

  1. Calculate if the pyramid position of each team is higher or lower than the previous season. Then calculate the runs of improved performances using rle and keep those of six or more.

 

#calculate if pyramid position is higher or lower than previous year
setkey(DT,team) 
DT[,posdiff:=c(NA,diff(pyramidpos)),by=team]  

# this method of calculating difference in league positions has one draw back
# if teams get relegated from div4 and come back e.g. York City were relegated in 2003 and came back in 2012.
# it considers 2003 and 2012 to be successive seasons.
# we could add in NA seasons, but for this it is quicker to check manually our results at the end for these issues

DT[,updown:=posdiff/abs(posdiff),by=team]  
DT[, updown := ifelse(is.na(updown), 0, updown)]  #updown reads 0 - same or initial position, -1 = higher position, +1 = lower pos.
DT[, better := ifelse(updown==-1, 1, 0)]  #if did better than previous year make 1, otherwise a 0.
DT[, rleLength := {rr <- rle(better); rep(rr$length, rr$length)}, by = 'team']  #calculate  runs (better or not better)

#Find those with equal to or more than 6 seasons improvement in a row:
setkey(DT, better, rleLength)
DT1 <- DT[list(1, unique(rleLength[rleLength>=6])),nomatch=0]

#want to create unique id for each run ... (issue that Luton, Swansea and Southampton appear twice)
DT1[, id:=rleid(team)]
DT1$id <- ifelse(DT1$team=="Southampton" & DT1$Season>=2010, 26, DT1$id) #only Southampton is consecutive in the dt.

DT1 #check
##                 team GP  W  D  L gf ga          gd Pts Pos Season division
##   1:            Bury 46 24 12 10 76 50  26.0000000  84   4   1984        4
##   2:            Bury 46 12 13 21 63 67  -4.0000000  49  20   1985        3
##   3:            Bury 46 14 13 19 54 60  -6.0000000  55  16   1986        3
##   4:            Bury 46 15 14 17 58 57   1.0000000  59  14   1987        3
##   5:            Bury 46 16 13 17 55 67 -12.0000000  61  13   1988        3
##  ---                                                                      
## 167: Oldham Athletic 46 25 12  9 83 47   1.7659574  62   1   1973        3
## 168: Oldham Athletic 42 10 15 17 40 48   0.8333333  35  18   1974        2
## 169: Oldham Athletic 42 13 12 17 57 68   0.8382353  38  17   1975        2
## 170: Oldham Athletic 42 14 10 18 52 64 -12.0000000  38  13   1976        2
## 171: Oldham Athletic 42 13 16 13 54 58  -4.0000000  42   8   1977        2
##      tier pyramidpos posdiff updown better rleLength id
##   1:    4         72     -11     -1      1         6  1
##   2:    3         64      -8     -1      1         6  1
##   3:    3         60      -4     -1      1         6  1
##   4:    3         58      -2     -1      1         6  1
##   5:    3         57      -1     -1      1         6  1
##  ---                                                   
## 167:    3         45      -3     -1      1         8 25
## 168:    2         40      -5     -1      1         8 25
## 169:    2         39      -1     -1      1         8 25
## 170:    2         35      -4     -1      1         8 25
## 171:    2         30      -5     -1      1         8 25
#Birmingham's run jumps from 1914 Season to 1919 Season, but that's ok 
#Burnley's run jumps from 1914 Season to 1919 Season, but that's ok 
#Huddersfield's run jumps from 1914 Season to 1919 Season, but that's ok 

 

  1. For graphs we actually want to plot from the season prior to improvement - so we get that here for each team. We also add a variable that identifies each team and the years that they improved uniquely. The levels are also set by how great an improvement each team made.

 

# To track improvement, we also want the Season before data
dt0 = DT1[DT1[, .I[Season == min(Season)], by = id]$V1]
dt0  = dt0[order(id),]
dt0[,Season:=Season-1] #this works as no Season is immediately after a break e.g. world war

dt0res=list()
for(i in 1:nrow(dt0)){ dt0res[[i]] = DT[Season==dt0$Season[i] & team==dt0$team[i],] }
DT0 <- rbindlist(dt0res)
DT0$id<-1:26
DT1 <- rbind(DT0,DT1)
setorder(DT1, id)
DT1[,rleLengthx := max(rleLength), by = 'id'] #update length value


### #add identifying variable for each team + season range
DT1[,teamseason := paste(team,paste(min(Season),max(Season+1),sep="-")), by = 'id'] 


#new teamseason variable based on magnitude of improvement
DT1[,improvement:=max(pyramidpos)-min(pyramidpos),by='teamseason']
mylevels <- unique(DT1[order(improvement),teamseason])  
DT1$teamseason1 <- factor(DT1$teamseason,levels=mylevels)

 

  1. Plotting. The first graph is to plot each team’s improvement by Season. To do this we simply plot Season against pyramid position. I also add some lines using geom_step to delineate each division for each year. I also add labels using ggrepel and want to add them at the top of each line. If we wanted to make this graph look a lot more professional it would be worth manually adjusting the position of the text labels, but this is ok for exploration. If the figure was increased in size it would look even better with much less overlap of labels.

 

## Show rise by year

# subset DT to get last Season of each team-season combination
DT1[,teamseason2 := paste(team,paste(min(Season),max(Season+1),sep="-"),sep="\n"), by = 'id'] #want variable to be separated by \n
DT1mins <- DT1[DT1[, .I[pyramidpos == min(pyramidpos)], by = teamseason1]$V1]

#get highest pyramid position by tier by Season
DTp <- DT[DT[, .I[pyramidpos == max(pyramidpos)], by=list(Season,tier)]$V1]
DTp <- DTp[order(tier,Season),]  #1958 went to four divions
DTp$tier1 <- ifelse(DTp$Season>=1958 & DTp$tier==3, 5, DTp$tier) #making another tier to avoid a line joining tiers 3 and 4 vertically in 1958

 

 

ggplot(DT1, aes(Season,pyramidpos,group=teamseason,col=rleLengthx)) + 
  geom_path(lwd=1) + 
  scale_y_reverse() +
  geom_point(size=3,fill="white",pch=21) +
  geom_text_repel(
    data=DT1mins,
    aes(Season, pyramidpos, fill = factor(id), label = teamseason2),
    size=3,
    nudge_y =5, 
    nudge_x =0.1 
  ) +
  geom_step(data=DTp, aes(x=Season,y=pyramidpos,group=tier1),color="gray68") +
  theme_minimal() +
  theme(plot.title = element_text(margin=margin(0,0,20,0)),
        legend.position="none",
        panel.grid.major.y = element_blank()
        ) +
  ylab("Position in Football Pyramid")+
  ggtitle("Improving League Position for each of at least 6 Seasons")

 

The next plot is to show which teams had the biggest jumps in league position over their runs. Antony was right, there are only 3 teams (Gillingham, Oldham Athletic and Luton Town) that have improved their league position for 8 straight years. Three teams (Southampton, Luton Town and Swansea City) have had runs of improving for six seasons twice.

 

### Sort Plot by Improvement
ggplot(as.data.frame(DT1), aes(teamseason1,pyramidpos,color=factor(rleLengthx))) + 
  geom_path(lwd=1) + 
  geom_point(size=2,fill="white",pch=21) +
  coord_flip() + 
  scale_y_reverse() + 
  theme_minimal() +
  theme(plot.title = element_text(margin=margin(0,0,20,0),hjust=-1.2)) +
  scale_color_manual(values=c("red","blue","black"),name="Consecutive \nSeasons") +
  ylab("Position in Football Pyramid") + 
  xlab("") +
  ggtitle("League Places Risen in Consecutive Seasons")

 

I hope that this was interesting or useful and helped with using data.table, ggplot2 or ggrepel. Any questions or comments please get in touch at jc3181 AT columbia DOT edu or on https://twitter.com/jalapic.