Synopsis

How good are football academies really? The media narrative is usually negative: We are told that, with the influx of TV money into the English game, young players get “too much too soon” and have no ambition. In comparison to their European team-mates, they lack technical ability. And with the influx of foreign players into the national leagues, they have no future. Here I looked at the careers of English footballers born in the 1960s, 1970s and 1980s. Specifically, I was interested in patterns of drop out and mobility between different leagues of the English football pyramid. I find that, no matter at which league level they start their careers, many 19-21-year-old footballers drop out of football altogether. Additionally, there is significant downward mobility - English youngsters who do stay in the game tend to move down the leagues. This is also true for players starting out at top clubs (e.g. Arsenal, Manchester United, Liverpool and Chelsea). However, for players who start out at Level 4, the distribution of appearances across the leagues becomes normal, suggesting the training of youngsters at Level 4 matches their potential. I think these results are interesting because they show that there is significant mobility between the leagues. Therefore, players, fans and the media should not judge a player’s career by the club or league level he starts out at.

See also my blog at: http://wp.me/p78lre-cp

Prerequisites

library(XML) 
library(stringr)
library(lubridate)
library(data.table)
library(gdata)
library(ggplot2)

Data Processing

Player data

The player data was scraped from Soccer Base. The following parameters are extracted from the relevant html pages and stored in a players database:

  • firstname
  • lastname
  • nationality
  • birthdate
  • height
  • weight
  • clubs (all the clubs, including the current club if applicable)
  • seasonstarts (the start … )
  • seasonends ( … and the end of the season spent at each club)
  • leagueapps (the total number of appearances at each club)

The code is shown here:

# get player data from soccerbase: loop through player IDs and extract:
# table players: player_id, name, birthdate, nationality
# table player_history: (player_id) x (club, season)
getPlayerData <- function() {
    # pre-allocate space for data-tables
    n <- 89906 # player id goes up to 89906; empirically determined on the website
    players <- data.table(playerID=as.integer(rep(NA,n)), firstname=as.character(rep(NA,n)), lastname=as.character(rep(NA,n)), nationality=as.character(rep(NA,n)), birthdate=as.Date.POSIXct(rep(0,n)), height=as.numeric(rep(NA,n)), weight=as.numeric(rep(NA,n)))
    playingcareers <- data.table(playerID=as.integer(rep(NA,5*n)), clubs=as.character(rep(NA,5*n)), seasonstarts=as.Date.POSIXct(rep(0,5*n)), seasonends=as.Date.POSIXct(rep(0,5*n)), leagueapps=as.integer(rep(NA,5*n)))
    players$birthdate <- NA
    playingcareers$seasonstarts <- NA
    playingcareers$seasonends <- NA
    index <- 1
    indexP <- 1
    
    for (id in 1:n){
        # information about nationality, height, weight etc
        tryCatch({
            # load website; player id goes up to 89906
            url <- paste("http://www.soccerbase.com/players/player.sd?player_id=",id,sep="") 
            html <- htmlTreeParse(url,useInternalNodes=TRUE)
            title <- strsplit(xpathSApply(html,"//title",xmlValue)," \\| ")
            
            # extract from title: name and current club
            player_id <- as.integer(id)
            name <- strsplit(title[[1]][1]," ")[[1]]
            firstname <- name[1]
            lastname <- name[length(name)]
            currentclub <- title[[1]][3]
            # extract birthdate
            birthdate <- as.Date.POSIXct(dmy(sub("\\)","",sub(".*Born","",
                strsplit(xpathSApply(html,"//table[@class='clubInfo']/tr/td",xmlValue)[2],"\n ")[[1]][3])))) },
            error=function(cond) {
                print(id)
                birthdate <- as.Date.POSIXct(dmy("01-01-1910"))
                birthdate <- NA
                message(cond)
            })
        if (is.na(birthdate) | birthdate < as.Date.POSIXct(ymd("1950-01-01"))) { next() }
        
        tryCatch({
            # nationality
            nationality <- 
                strsplit(xpathSApply(html,"//table[@class='clubInfo']/tr/td",xmlValue)[6],"\n ")[[1]][2]
            # height
            height <- 
                as.numeric(sub("m\\)","",sub(".*\\(","",(strsplit(xpathSApply(html,"//table[@class='clubInfo']/tr/td",xmlValue)[3],"\n ")[[1]][2]))))
            # weight
            weight <- 
                as.numeric(sub("kg\\)","",sub( ".*\\(","",strsplit(xpathSApply(html,"//table[@class='clubInfo']/tr/td",xmlValue)[4],"\n ")[[1]][2])))
            
            # information about clubs
            clubs <- sub(" $","",sapply(strsplit(xpathSApply(html,"//table[@class='table right career']/tbody/tr/td[1]",xmlValue),"\n "),function(x) { x[2] }))
            # seasonstarts
            seasonstarts <- 
                as.Date.POSIXct(dmy(sub(" $","",sapply(strsplit(xpathSApply(html,"//table[@class='table right career']/tbody/tr/td[2]",xmlValue),"\n "),function(x) { x[2] }))))
            # seasonends
            seasonends <- 
                as.Date.POSIXct(dmy(sub(" $","",sapply(strsplit(xpathSApply(html,"//table[@class='table right career']/tbody/tr/td[3]",xmlValue),"\n "),function(x) { x[2] }))))
            # total league apps at each club
            leagueapps <- 
                sapply(strsplit(xpathSApply(html,"//table[@class='table right career']/tbody/tr/td[5]",xmlValue)," "),function(x) { as.integer(x[1]) })
            
            # remove the last two datapoints (these are not clubs)
            last <- length(clubs) - 2
            # add data for current season
            clubs <- append(clubs[1:last], currentclub, after = 0)
            seasonstarts <- append(seasonstarts[1:last], as.Date.POSIXct(ymd("2015/07/01")),after = 0)
            if (is.na(seasonends[2])) {
                seasonends[2] <- as.Date.POSIXct(ymd("2015/06/30"))
            }
            seasonends <- append(seasonends[1:last],as.Date.POSIXct(ymd("2016/06/30")),after = 0)
            leagueapps <- append(leagueapps[1:last], NA,after = 0)
            if (tolower(currentclub) == "no club") {
                seasonstarts[1] <- NA
                seasonends[1] <- NA
                leagueapps[1] <- NA
            }
            
            # put data in tables
            set(players, i=as.integer(index),1L:7L, 
                list(player_id,firstname,lastname,nationality,birthdate,height,weight))
            for (j in seq(from = 0, to = length(leagueapps)-1, by=1)) {
                set(playingcareers, i=as.integer((indexP+j)),1L:5L, list(player_id, clubs[j+1],
                                                                         seasonstarts[j+1],
                                                                         seasonends[j+1], 
                                                                         leagueapps[j+1]))
            }
            
            index <- index+1
            indexP <- indexP+length(leagueapps)},
            error=function(cond) {
                print(id)
                message(cond)
            })
    }
    
    # write tables to files
    write.table(players,"players.txt",row.names=FALSE)
    write.table(playingcareers,"playingcareers.txt",row.names=FALSE)
}

League data

The league data was scraped from Soccer Base. The following parameters are extracted from the relevant html pages and stored in a leagues database:

  • league (name)
  • seasonstart
  • seasonend
  • clubs (one string containing the names of all clubs in the league that season)

The code is shown here:

# information about leagues those clubs have been in
# http://www.soccerbase.com/tournaments/tournament.sd?tourn_id=1487 # id goes up to 1487
# get seasonstart, seasonends, clubs and league positions
getLeagueData <- function() {
    # pre-allocate space for data-tables
    n <- 1487
    leagues <- data.table(league = as.character(rep(NA,n)),
                          seasonstart = as.Date.POSIXct(rep(0,n)),
                          seasonend = as.Date.POSIXct(rep(0,n)),
                          clubs = as.character(rep(NA,n)))
    leagues$seasonstart <- NA
    leagues$seasonend <- NA
    index <- 1
    
    for (id in 1:n){
        # load tournament
        url <- paste("http://www.soccerbase.com/tournaments/tournament.sd?tourn_id=",id,sep="")
        html <- htmlTreeParse(url,useInternalNodes=TRUE)
        title <- strsplit(xpathSApply(html,"//title",xmlValue)," \\| ")
        
        # extract data from html
        tryCatch(
            {
                # league name
                league <- sub(" Betting","",title[[1]][1])
                # season start
                seasonstart <- as.Date.POSIXct(ymd(paste(strsplit(title[[1]][2],"/")[[1]][1],"07/01",sep="/")))
                # season end
                seasonend <- as.Date.POSIXct(ymd(paste(strsplit(title[[1]][2],"/")[[1]][2],"06/30",sep="/")))
                # clubs
                clubs <- str_trim(sapply(strsplit(xpathSApply(html,"//table[@class='table']/tbody/tr/td[@class='bull']",xmlValue),"\n "), function(x){x[2]}), side="right")
                clubs <- paste(clubs,sep ="",collapse=" | ")
                
                # add data to table
                set(leagues,i=as.integer(index),1L:4L,list(league,seasonstart,seasonend,clubs))
                index <- index+1
            },
            error=function(cond) {
                print(id)
                message(cond)
            }
        )  
    }
    
    # write tables to files
    write.table(leagues,"leagues.txt",row.names=FALSE)
}

Career data

The player data was re-organised in the following way. For each player, rows were added so that each season could be listed on a separate row. Because I only know the total number of apps per club (sometimes comprising several seasons spent at the club), this variable was changed to average league apps per season at the club. This information is stored in a new careers table:

  • playerID
  • clubs
  • seasonstarts
  • seasonends
  • leagueapps (mean league apps per season at the club)

The code is shown here:

# make table careers based on playingcareers
# for each player, add rows, so that each season has one row (fix by start and end dates)
# and change league apps to average league apps / season at the club
separateLeagues <- function() {
    playingcareers <- read.table("playingcareers.txt",stringsAsFactors = FALSE, header = TRUE)
    players <- read.table("players.txt",stringsAsFactors = FALSE, header = TRUE)
    # convert dates to dates
    playingcareers$seasonstarts <- as.Date.POSIXct(ymd(playingcareers$seasonstarts))
    playingcareers$seasonends <- as.Date.POSIXct(ymd(playingcareers$seasonends))
    
    careers <- data.frame(playingcareers)
    careers <- rbind(careers,careers)
    careers[!is.na(careers)] <- NA
    p <-1
    for (i in seq_along(players$playerID)){
        # find rows for that player in playing careers
        rows <- playingcareers$playerID==players$playerID[i]
        tempcareers <- playingcareers[rows,]
        tempseasons <- tempcareers[1,]
        y <- 1L
        # expand seasons so they are continuous
        for (j in seq(from=1, to=sum(rows), by=1)){
            # if time spent at that club is more than one year, add seasons
            seasonsspent <- interval(tempcareers$seasonstarts[j],tempcareers$seasonends[j])
            if (!is.na(seasonsspent) & as.period(seasonsspent) > period(1,units = "year")) {
                if (month(tempcareers$seasonstarts[j]) >= 1 & month(tempcareers$seasonstarts[j]) < 7){
                    seasonstartsvect <- 
                        seq(from = tempcareers$seasonstarts[j],to = tempcareers$seasonends[j], by="year")
                    month(seasonstartsvect[2:length(seasonstartsvect)]) <- 7
                    day(seasonstartsvect[2:length(seasonstartsvect)]) <- 1
                    seasonstartsvect[2:length(seasonstartsvect)] <-
                        seasonstartsvect[2:length(seasonstartsvect)]-period(1,units = "year")
                    seasonendsvect <- seasonstartsvect+years(1)-days(1)
                    month(seasonendsvect[1]) <- 6
                    day(seasonendsvect[1]) <- 30
                    seasonendsvect[1] <- seasonstartsvect[1]
                    year(seasonendsvect[1]) <- year(years(year(seasonendsvect[1])))
                    month(seasonendsvect[1]) <- 6
                    day(seasonendsvect[1]) <- 30
                } else if (month(tempcareers$seasonstarts[j]) >= 7 & month(tempcareers$seasonstarts[j]) <= 12){
                    seasonstartsvect <- 
                        seq(from = tempcareers$seasonstarts[j],to = tempcareers$seasonends[j], by="year")
                    month(seasonstartsvect[2:length(seasonstartsvect)]) <- 7
                    day(seasonstartsvect[2:length(seasonstartsvect)]) <- 1
                    seasonstartsvect[2:length(seasonstartsvect)] <-
                        seasonstartsvect[2:length(seasonstartsvect)]
                    seasonendsvect <- seasonstartsvect+years(1)-days(1)
                    month(seasonendsvect[1]) <- 6
                    day(seasonendsvect[1]) <- 30
                    seasonendsvect[1] <- seasonstartsvect[1]
                    year(seasonendsvect[1]) <- year(years(year(seasonendsvect[1])))
                    month(seasonendsvect[1]) <- 6
                    day(seasonendsvect[1]) <- 30
                }
                seasonendsvect[length(seasonendsvect)] <- tempcareers$seasonends[j]
                
                c <- !is.na(seasonstartsvect) & is.na(seasonendsvect)
                if (sum(c)>0) {
                    if (month(seasonstartsvect[c]) >= 1 & month(seasonstartsvect[c]) < 7){
                        seasonendsvect[c] <- seasonstartsvect[c]
                        year(seasonendsvect[c]) <- year(years(year(seasonendsvect[c])))
                        month(seasonendsvect[c]) <- 6
                        day(seasonendsvect[c]) <- 30
                    }
                    else if (month(seasonstartsvect[c]) >= 7 & month(seasonstartsvect[c]) <= 12){
                        seasonendsvect[c] <- seasonstartsvect[c]
                        year(seasonendsvect[c]) <- year(years(year(seasonendsvect[c]))+period(1,"year"))
                        month(seasonendsvect[c]) <- 6
                        day(seasonendsvect[c]) <- 30
                    }
                }
                mm <- seasonstartsvect>=seasonendsvect
                seasonstartsvect <- seasonstartsvect[!mm]
                seasonendsvect <- seasonendsvect[!mm]
                
                indices <- seq(from=y,to=y+length(seasonstartsvect)-1,by=1)
                tempseasons[indices,] <- tempcareers[j,]
                tempseasons$seasonstarts[indices] <- seasonstartsvect
                tempseasons$seasonends[indices] <- seasonendsvect
                # average the leagueapps per season
                leagueappsvect <-
                    rep(tempcareers$leagueapps[j]/length(seasonstartsvect),length(seasonstartsvect))
                tempseasons$leagueapps[indices] <- leagueappsvect
                y <- y+length(seasonstartsvect)
            } else {
                tempseasons[y,] <- tempcareers[j,]
                y <- y+1
            }
        }
        tempseasons <- tempseasons[order(tempseasons$seasonstarts,tempseasons$seasonends),]
        careers[seq(from = p, to = p+nrow(tempseasons)-1,by=1),] <- tempseasons
        p <- p+nrow(tempseasons)
    }
    row.names(careers) <- seq_along(careers[,1])
    careers <- careers[complete.cases(careers[,1]),]
    careers$seasonends[Vectorize(isTRUE)(careers$seasonstarts > careers$seasonends)] <-
        careers$seasonends[Vectorize(isTRUE)(careers$seasonstarts > careers$seasonends)]+period(1,"year")
    write.table(careers,"careers.txt",row.names=FALSE)
}

Then the following columns were added to careers:

  • playernationality
  • playerage (the age at the start of each season)
  • playerseason (numbering the seasons for each player from 1 to the current season; this variable is not used in the current analysis, but at the time I thought it may be useful at a later point)

The code is shown here:

# then add columns to table careers: 
# - nationality of the player
# - age at the start of each season
# - season number (1st, 2nd etc) for that player

addPlayerDetails <- function(){
    players <- read.table("players.txt",stringsAsFactors = FALSE, header = TRUE)
    careers <- read.table("careers.txt",stringsAsFactors = FALSE, header = TRUE)
    careers$seasonstarts <- as.Date.POSIXct(ymd(careers$seasonstarts))
    careers$seasonends <- as.Date.POSIXct(ymd(careers$seasonends))
    players$birthdate <- as.Date.POSIXct(ymd(players$birthdate))
    # add the columns
    tempmerge <- merge(careers, players, by.x = "playerID", by.y = "playerID")
    careers$playernationality <- tempmerge$nationality
    careers$playerage <- year(as.period(interval(tempmerge$birthdate,tempmerge$seasonstarts)))
    careers$playerseason <- 
        unlist(tapply(careers$playerage, careers$playerID, function(x){x-min(x, na.rm=TRUE)+1}),
               use.names = FALSE)
    # save
    write.table(careers,"careers.txt",row.names=FALSE)
}

Then the following columns were added to careers:

  • leagues (the league the club was in that season)
  • leaguelevel (the level of the league on the football pyramid; the top level is 1)
  • leaguenationality (the nationality of the league (i.e. the country it is in))

The code is shown here:

addLeagueDetails <- function() {
    # - league the club was in that season
    # - league level (1st, 2nd, etc) on the national football pyramid
    # - nationality of the league
    careers <- read.table("careers.txt",stringsAsFactors = FALSE, header = TRUE)
    careers$seasonstarts <- as.Date.POSIXct(ymd(careers$seasonstarts))
    careers$seasonends <- as.Date.POSIXct(ymd(careers$seasonends))
    careers$leagues <- as.character(NA)
    leagues <- read.table("leagues.txt",stringsAsFactors = FALSE, header = TRUE)
    leagues$seasonstart <- as.Date.POSIXct(ymd(leagues$seasonstart))
    leagues$seasonend <- as.Date.POSIXct(ymd(leagues$seasonend))
    leagues <- leagues[leagues$seasonstart>"1950-01-01",]
    leagues <- leagues[complete.cases(leagues[,1]),]
    
    count <- 0
    noleaguecount <- 0
    for (i in seq_along(careers$clubs)) { #
        club <-careers$clubs[i]
        # find leagues in the same season
        if (is.na(careers$seasonstarts[i])) { next() }
        if (month(careers$seasonstarts[i]) >= 1 & month(careers$seasonstarts[i]) < 7){
            indices <- year(careers$seasonstarts[i])==year(leagues$seasonend)}
        else if (month(careers$seasonstarts[i]) >= 7 & month(careers$seasonstarts[i]) <= 12){
            indices <- year(careers$seasonstarts[i])==year(leagues$seasonstart)}
        leaguetemp <- leagues[indices,]
        
        # check which of those leagues the club was in
        club <- str_trim(gsub("\\(.*\\)","",club), side="right")
        stringclub <- str_sub(club,1,7)
        index <- 
            unlist(lapply(leaguetemp$clubs, function(x){j <- grepl(stringclub, x); j}),
                   use.names = FALSE)
        if (sum(index)==1) { 
            careers$leagues[i] <- leaguetemp$league[index]
        } else if (sum(index)==0) { 
            message("no league found")
            careers$leagues[i] <- ""
            noleaguecount <- noleaguecount+1
        } else {
            message("more than one league")
            index <- 
                unlist(lapply(leaguetemp$clubs, function(x){j <- grepl(str_c("\\| ",club," \\|"), x); j}),
                       use.names = FALSE)
            if (sum(index)==1) { 
                message("..fixed")
                careers$leagues[i] <- leaguetemp$league[index]
            } else {
                sum(index)
                count <- count+1
            }
        }
    }
    
    # add league nationality
    careers$leaguenationality[grepl("English|Football|League|Premier|National|Conference|Southern|South|Northern|North|Championship",careers$leagues)] <- "England"
    careers$leaguenationality[grepl("Scottish",careers$leagues)] <- "Scotland"
    careers$leaguenationality[grepl("Welsh",careers$leagues)] <- "Wales"
    careers$leaguenationality[grepl("Italian",careers$leagues)] <- "Italy"
    careers$leaguenationality[grepl("Spanish",careers$leagues)] <- "Spain"
    careers$leaguenationality[grepl("German",careers$leagues)] <- "Germany"
    careers$leaguenationality[grepl("Russian",careers$leagues)] <- "Russia"
    careers$leaguenationality[grepl("Austrian",careers$leagues)] <- "Austria"
    careers$leaguenationality[grepl("Hungarian",careers$leagues)] <- "Hungary"
    careers$leaguenationality[grepl("French",careers$leagues)] <- "France"
    careers$leaguenationality[grepl("Polish",careers$leagues)] <- "Poland"
    careers$leaguenationality[grepl("Belgian",careers$leagues)] <- "Belgium"
    careers$leaguenationality[grepl("Danish",careers$leagues)] <- "Denmark"
    careers$leaguenationality[grepl("Ukrainian",careers$leagues)] <- "Ukraine"
    careers$leaguenationality[grepl("Romanian",careers$leagues)] <- "Romania"
    careers$leaguenationality[grepl("Greek",careers$leagues)] <- "Greece"
    careers$leaguenationality[grepl("Serbian",careers$leagues)] <- "Serbia"
    careers$leaguenationality[grepl("Australian",careers$leagues)] <- "Australia"
    careers$leaguenationality[grepl("Dutch",careers$leagues)] <- "Netherlands"
    careers$leaguenationality[grepl("Portuguese",careers$leagues)] <- "Portugal"
    careers$leaguenationality[grepl("Czech",careers$leagues)] <- "Czech Republic"
    careers$leaguenationality[grepl("Turkish",careers$leagues)] <- "Turkey"
    careers$leaguenationality[grepl("Irish",careers$leagues)] <- "Ireland"
    careers$leaguenationality[grepl("Bulgarian",careers$leagues)] <- "Bulgaria"
    careers$leaguenationality[grepl("Swiss",careers$leagues)] <- "Switzerland"
    careers$leaguenationality[grepl("Israeli",careers$leagues)] <- "Israel"
    careers$leaguenationality[grepl("Slovenian",careers$leagues)] <- "Slovenia"
    
    # add league level
    # if there is only one, it is probably the top level
    uniqueLeagueLevels <- 0
    listLeagues <- tapply(careers$leagues,careers$leaguenationality,unique)
    for (ll in 1:length(listLeagues)){
        if (length(listLeagues[[ll]])==1) {
            uniqueLeagueLevels[ll] <- listLeagues[[ll]]
        }
    }
    
    # for the others, manually set the level: make tables for different countries with the league system
    englishleagues <- data.frame(
        from = c("1950","1992","1950","1992","2004","1950","1992","2004","1950","1992","2004","1950","1950","1950"),
        level = c(1,1,2,2,2,3,3,3,4,4,4,5,6,7),
        league = c("English Division 1|Football League First Division","Premier League|Premiership",
                   "Football League Second Division|English Division 2|", "Football League First Division|","Football League First Division|Championship|",
                   "Football League Third Division|English Division 3|","Football League Second Division|","Football League Second Division|League One|Football League One|",
                   "Football League Fourth Division|English Division 4|","Football League Third Division|","Football League Third Division|League Two|Football League Two|",
                   "National League|Nationwide Conference|Conference Premier|Blue Sq Premier|Blue Sq Bet Premier|Football Conference|",
                   "Football Conference South|Football Conference North|Conference South|Blue Sq South|Blue Sq North|GM Vauxhall Conference|",
                   "Ryman Premier|Evo-Stik Southern Premier|Evo-Stik Northern Premier|Southern Premier|Unibond Premier|")
    )
    
    scottishleagues <- data.frame(level = c(1,2,3,4),league = c("Scottish Premiership|Scottish Premier Division|Scottish Division One|Scottish Premier League|Scottish Division 1 \\(old\\)","Scottish Championship|Scottish Division Two","Scottish League One|Scottish Division 3","Scottish League Two|Scottish Division 4"))
    spanishleagues <- data.frame(level = c(1,2,2),league =c("Spanish Primera Liga","Spanish Segunda Division","Spanish Seguda Division"))
    
    careers$leaguelevel <- NA
    for (l in seq(1,length(careers$leagues),1)){
        stringleague <- str_trim(gsub("\\(.*\\)","",careers$leagues[l]), side="right")
        stringleague <- str_c(stringleague,"\\|")
        ix <- grepl(careers$leagues[l],uniqueLeagueLevels)
        if (sum(ix,na.rm=T)>0){ careers$leaguelevel[l] <- 1; next() }
        ix <- grepl(careers$leagues[l],scottishleagues$league)
        if (sum(ix,na.rm=T)>0){ careers$leaguelevel[l] <- scottishleagues$level[ix]; next() }
        ix <- grepl(careers$leagues[l],spanishleagues$league)
        if (sum(ix,na.rm=T)>0){ careers$leaguelevel[l] <- spanishleagues$level[ix]; next() }
        ix <- grepl(stringleague,englishleagues$league)
        if (sum(ix,na.rm=T)>0){
            if (sum(ix,na.rm=T)==1) { careers$leaguelevel[l] <- englishleagues$level[ix]; next() }
            season <- year(careers$seasonstarts[l])
            if (season >= 2004) {
                iy <- englishleagues$from=="2004"
            } else if (season >= 1992 & season < 2004){
                iy <- englishleagues$from=="1992"
            } else if (season >= 1950 & season < 1992){
                iy <- englishleagues$from=="1950"
            }
            careers$leaguelevel[l] <- englishleagues$level[ix & iy]
        }
    }
    careers$leaguelevel[is.na(careers$leaguenationality)] <- NA
    write.table(careers,"careers.txt",row.names=FALSE)
}

Cleaning and imputing missing data

After some exploratory analysis, I decided to focus on the careers of English players in English leagues, because for foreign leagues, only information at level 1 is available. For English leagues, there is information about levels 1-7 of the English league pyramid.

I first subset the careers table to include only the information about English leagues. This table is called careersEngland.

careers <- read.table("careers.txt",stringsAsFactors = FALSE, header = TRUE)
careers$seasonstarts <- as.Date.POSIXct(ymd(careers$seasonstarts))
careers$seasonends <- as.Date.POSIXct(ymd(careers$seasonends))
players <- read.table("players.txt",stringsAsFactors = FALSE, header = TRUE)
players$birthdate <- as.Date.POSIXct(ymd(players$birthdate))

# create temporary tables: England
# first fill in some missing values for league nationality (almost no English
# players abroad, so set to England for English players)
careers$leaguenationality[is.na(careers$leaguenationality) & careers$playernationality == "England"] <- "England"
careersEngland <- careers[Vectorize(isTRUE)(careers$leaguenationality=="England"),]
careersEngland$leaguelevel <- factor(careersEngland$leaguelevel)

Then I created a new column in the careersEngland table which categorises players by birthdate. I decided to look at players born in the 1950s, 1960s, 1970s, 1980s, and 1990s. Added column:

  • yeargroup (born in the 1950s, 1960s, 1970s, 1980s, 1990s)
tempmerge <- merge(careersEngland,players)
# make a new table with factors for when players were born: 1960-1970,1970-1980,1980-1990
careersEngland[(year(tempmerge$birthdate)>=1950 & year(tempmerge$birthdate)<1960),"yeargroup"]  <- 5 
careersEngland[(year(tempmerge$birthdate)>=1960 & year(tempmerge$birthdate)<1970),"yeargroup"]  <- 4 
careersEngland[(year(tempmerge$birthdate)>=1970 & year(tempmerge$birthdate)<1980),"yeargroup"]  <- 3 
careersEngland[(year(tempmerge$birthdate)>=1980 & year(tempmerge$birthdate)<1990),"yeargroup"]  <- 2 
careersEngland[(year(tempmerge$birthdate)>=1990 & year(tempmerge$birthdate)<2000),"yeargroup"]  <- 1 
careersEngland$yeargroup <- factor(careersEngland$yeargroup,labels = c("born in the 1990s","born in the 1980s","born in the 1970s","born in the 1960s","born in the 1950s"))
# exclude players with uncategorised birthdates
careersEngland <- careersEngland[complete.cases(careersEngland$yeargroup),]

Then I added the following columns for later analysis:

  • initialLevel (the level of the league the player was in at the start of his career)
  • initialclub (the club the player was registered with at the start of his career)
# make a column with total number of apps
careersEngland$totalapps <- unlist(tapply(careersEngland$leagueapps, careersEngland$playerID, function(x){ rep(sum(x,na.rm=TRUE),length(x)) }), use.names = FALSE)

# initial league level
careersEngland$initialLevel[Vectorize(isTRUE)(careersEngland$playerage<20)] <- unlist(tapply(careersEngland$leaguelevel[Vectorize(isTRUE)(careersEngland$playerage<20)], careersEngland$playerID[Vectorize(isTRUE)(careersEngland$playerage<20)], function(x){ rep(x[1],length(x)) }), use.names = FALSE)
careersEngland$initialLevel <- unlist(tapply(careersEngland$leaguelevel, careersEngland$playerID, function(x){ rep(x[1],length(x)) }), use.names = FALSE)
careersEngland$initialLevel <- factor(careersEngland$initialLevel)

# initial club
careersEngland$initialclub[Vectorize(isTRUE)(careersEngland$playerage<20)] <- unlist(tapply(careersEngland$clubs[Vectorize(isTRUE)(careersEngland$playerage<20)], careersEngland$playerID[Vectorize(isTRUE)(careersEngland$playerage<20)], function(x){ rep(x[1],length(x)) }), use.names = FALSE)
careersEngland$initialclub <- unlist(tapply(careersEngland$clubs, careersEngland$playerID, function(x){ rep(x[1],length(x)) }), use.names = FALSE)
careersEngland$initialclub <- factor(careersEngland$initialclub)

Data Analysis

First, I decided to look at the number of players who keep playing football at any league level as they grow older, i.e. the number of players still playing football at different ages. This is shown for different initial league levels and yeargroup. It is another way of looking at how many players drop out, and at what ages. Here is the code for the plot:

# number of players of different ages for different levels and yeargroups
t <- aggregate(careersEngland$playerID, by=list(yeargroup = careersEngland$yeargroup,
                                                   age = careersEngland$playerage,
                                                   nationality = careersEngland$playernationality=="England",
                                                   initialLevel = careersEngland$initialLevel), 
          length)
# subset for levels 1-3 and yeargroups 2-4
t <- subset(x = t, yeargroup=="born in the 1960s" | yeargroup=="born in the 1970s" | yeargroup=="born in the 1980s")
t <- subset(x = t, initialLevel=="1" | initialLevel=="2" | initialLevel=="3")
t <- t[t$age > 15 & t$age < 45,]
tEnglish <- t[t$nationality,]
# plot
g <- ggplot(data = tEnglish, aes(x = age, y=x, fill = initialLevel))
g + geom_bar(stat = "identity", position = "stack") + facet_grid(yeargroup ~ initialLevel) +
        ggtitle("Staying Power") + ylab("Number of players") + xlab("Player age (years)")

This figure shows that, regardless of the league level at which players start their career, many players drop out of English football between the ages of 19-21. Almost no English players go abroad, so this means that they drop out of the game completely. This is a comparatively recent effect (i.e. it is not observed for players born in the 1960s). Players who stay in the game also retire a few years earlier than they used to.

Then I looked at the players who stay in the game. Do they play football in the league they start out in or do they move down (or up) the leagues? This is shown in the next plot - the mean percentage of league appearances at each league level, for players who started out at level 1, 2, 3, 4 or 5. Because most players have a large majority of their appearances at mainly one league level, this percentage also (roughly) corresponds to the percentage of players who make a career at a given league level.

# histogram: apps per league level over career; only for players who have more than 10 league apps
# for players who started out at 1st league level, 2nd level, aso
t1 <- aggregate(careersEngland$leagueapps, by=list(id = careersEngland$playerID,
                                                yeargroup = careersEngland$yeargroup,
                                                nationality = careersEngland$playernationality=="England",
                                                initialLevel = careersEngland$initialLevel,
                                                leaguelevel = careersEngland$leaguelevel,
                                                apps10 = careersEngland$totalapps > 10), 
               sum, na.rm = TRUE)
t1 <- t1[order(t1$id),]
t1 <- subset(x = t1, yeargroup=="born in the 1960s" | yeargroup=="born in the 1970s" | yeargroup=="born in the 1980s")
t1 <- subset(x = t1, initialLevel=="1" | initialLevel=="2" | initialLevel=="3" | initialLevel=="4" | initialLevel=="5")
t1 <- t1[t1$nationality & t1$apps10,]

# put in missing values: 0s for missing league levels for each player
ids <- unique(t1$id)
t <- NULL
l <- seq(from=1,to=7,by=1)
temp <- data.frame("leaguelevel" = l)
for (i in seq_along(ids)) {
        temp$id <- rep(ids[i], 7)
        temp$yeargroup <- rep( t1$yeargroup[t1$id==ids[i]][1], 7)
        temp$initialLevel <- rep(t1$initialLevel[t1$id==ids[i]][1], 7)
        temp$leagueapps <- rep(0,7)
        temp$leagueapps[t1$leaguelevel[t1$id==ids[i]]] <- t1$x[t1$id==ids[i]]
        t <- rbind(t, temp)
}

# calculate mean percentage of apps per level
t$totalapps <- unlist(tapply(t$leagueapps, t$id, function(x) { rep(sum(x,na.rm = TRUE), length(x))} ))
t$percperleague <- t$leagueapps / t$totalapps

t <- aggregate(t$percperleague, by = list(yeargroup = t$yeargroup,
                                            initialLevel = t$initialLevel,
                                            leaguelevel = t$leaguelevel), 
                mean, na.rm = TRUE)
# subset for levels 1-3 and yeargroups 2-4
# plot
g <- ggplot(data = t, aes(leaguelevel,x, fill = initialLevel))
g + geom_bar(stat = "identity", position = "stack") + facet_grid(yeargroup ~ initialLevel)+
        ggtitle("Career progression from different leagues") + ylab("Mean percentage of appearances") + xlab("League level")

This figure shows that there is a lot of mobility within the English league system. Most young footballers who start out at league level 1 end up playing at levels 2-5. This is also observed for players who start out at levels 2 and 3. For players trained at levels 4 and 5, the distributions become more normal, i.e. most players who start out at that level, stay at that level.

This displacement of active footballers down the leagues may be caused by an increase in the percentage of foreign players in the top English leagues over the years, although it has to be said that at levels 2-4, the level of English players may actually have improved a bit over time (more of these players are now playing in higher leagues).

Finally, I wanted to look at career progression at a few clubs who supposedly have good academies: Arsenal, Manchester United, Chelsea, Liverpool and Southampton. The sample size for these plots is obviously much smaller than for whole leagues (i.e. we are now looking at a few 10s of players instead of a few 100s of players). Here is the code for the plot:

# career progression at individual clubs
t1 <- aggregate(careersEngland$leagueapps, by=list(id = careersEngland$playerID,
                                                   yeargroup = careersEngland$yeargroup,
                                                   nationality = careersEngland$playernationality=="England",
                                                   initialClub = careersEngland$initialclub,
                                                   leaguelevel = careersEngland$leaguelevel,
                                                   apps10 = careersEngland$totalapps>=10), 
                sum, na.rm = TRUE)
t1 <- t1[order(t1$id),]
t1 <- subset(x = t1, yeargroup=="born in the 1960s" | yeargroup=="born in the 1970s" | yeargroup=="born in the 1980s")
t1 <- subset(x = t1, initialClub=="Arsenal" | initialClub=="Man Utd" | initialClub=="Liverpool" | initialClub=="Chelsea" | initialClub=="Southampton")
t1 <- t1[t1$nationality & t1$apps10,]

# put in missing values: 0s for missing league levels for each player
ids <- unique(t1$id)
t <- NULL
l <- seq(from=1,to=7,by=1)
temp <- data.frame("leaguelevel" = l)
for (i in seq_along(ids)) {
        temp$id <- rep(ids[i], 7)
        temp$yeargroup <- rep( t1$yeargroup[t1$id==ids[i]][1], 7)
        temp$initialClub <- rep(t1$initialClub[t1$id==ids[i]][1], 7)
        temp$leagueapps <- rep(0,7)
        temp$leagueapps[t1$leaguelevel[t1$id==ids[i]]] <- t1$x[t1$id==ids[i]]
        t <- rbind(t, temp)
}

# calculate mean percentage of apps per level
t$totalapps <- unlist(tapply(t$leagueapps, t$id, function(x) { rep(sum(x,na.rm = TRUE), length(x))} ))
t$percperleague <- t$leagueapps / t$totalapps

t <- aggregate(t$percperleague, by = list(yeargroup = t$yeargroup,
                                          initialClub = t$initialClub,
                                          leaguelevel = t$leaguelevel), 
               mean, na.rm = TRUE)
# subset for levels 1-3 and yeargroups 2-4
# plot
g <- ggplot(data = t, aes(leaguelevel,x, fill = initialClub))
g + geom_bar(stat = "identity", position = "stack") + facet_grid(yeargroup ~ initialClub)+
        ggtitle("Career progression from individual clubs") + ylab("Mean percentage of appearances") + xlab("League level")

There are only small differences between clubs. Liverpool’s academy is probably the worst of these. Manchester United most recently have brought through surprisingly few level 1 players as well. Arsenal’s, Chelsea’s and Southampton’s academies are fairly equal regarding quality of recent output (players born in the 1980s). All clubs produced significantly more level 1 players in the past (i.e. players born in the 1960s).

I think that top level clubs probably take in many youngsters who don’t really have the potential to succeed at the parent club - after all, very few young players are that talented, yet you need enough players to make up the numbers. I personally don’t think that this is a problem, because many clubs at the top level have a good record of getting their graduates playing in the lower leagues.

Conclusion

No matter at which league level English footballers start their career, many drop out of football altogether between the age of 19 and 21. Those youngsters who do stay in the game tend to move down the leagues. This is also true for players starting out at top clubs (e.g. Arsenal, Manchester United, Chelsea and so on). However, for players who start out at Level 4 of the league pyramid, the distribution of appearances of players across the leagues becomes normal, suggesting Level 4 training matches Level 4 potential. I think these results are interesting because they show that players, fans and the media should not judge a player’s career by the club or league level he starts out at. I don’t think that is emphasised enough - a downward move is usually seen as failure, but actually it is completely normal at levels 1-3 and making a career at the same level is the exception.