How good are football academies really? The media narrative is usually negative: We are told that, with the influx of TV money into the English game, young players get “too much too soon” and have no ambition. In comparison to their European team-mates, they lack technical ability. And with the influx of foreign players into the national leagues, they have no future. Here I looked at the careers of English footballers born in the 1960s, 1970s and 1980s. Specifically, I was interested in patterns of drop out and mobility between different leagues of the English football pyramid. I find that, no matter at which league level they start their careers, many 19-21-year-old footballers drop out of football altogether. Additionally, there is significant downward mobility - English youngsters who do stay in the game tend to move down the leagues. This is also true for players starting out at top clubs (e.g. Arsenal, Manchester United, Liverpool and Chelsea). However, for players who start out at Level 4, the distribution of appearances across the leagues becomes normal, suggesting the training of youngsters at Level 4 matches their potential. I think these results are interesting because they show that there is significant mobility between the leagues. Therefore, players, fans and the media should not judge a player’s career by the club or league level he starts out at.
See also my blog at: http://wp.me/p78lre-cp
library(XML)
library(stringr)
library(lubridate)
library(data.table)
library(gdata)
library(ggplot2)
The player data was scraped from Soccer Base. The following parameters are extracted from the relevant html pages and stored in a players database:
The code is shown here:
# get player data from soccerbase: loop through player IDs and extract:
# table players: player_id, name, birthdate, nationality
# table player_history: (player_id) x (club, season)
getPlayerData <- function() {
# pre-allocate space for data-tables
n <- 89906 # player id goes up to 89906; empirically determined on the website
players <- data.table(playerID=as.integer(rep(NA,n)), firstname=as.character(rep(NA,n)), lastname=as.character(rep(NA,n)), nationality=as.character(rep(NA,n)), birthdate=as.Date.POSIXct(rep(0,n)), height=as.numeric(rep(NA,n)), weight=as.numeric(rep(NA,n)))
playingcareers <- data.table(playerID=as.integer(rep(NA,5*n)), clubs=as.character(rep(NA,5*n)), seasonstarts=as.Date.POSIXct(rep(0,5*n)), seasonends=as.Date.POSIXct(rep(0,5*n)), leagueapps=as.integer(rep(NA,5*n)))
players$birthdate <- NA
playingcareers$seasonstarts <- NA
playingcareers$seasonends <- NA
index <- 1
indexP <- 1
for (id in 1:n){
# information about nationality, height, weight etc
tryCatch({
# load website; player id goes up to 89906
url <- paste("http://www.soccerbase.com/players/player.sd?player_id=",id,sep="")
html <- htmlTreeParse(url,useInternalNodes=TRUE)
title <- strsplit(xpathSApply(html,"//title",xmlValue)," \\| ")
# extract from title: name and current club
player_id <- as.integer(id)
name <- strsplit(title[[1]][1]," ")[[1]]
firstname <- name[1]
lastname <- name[length(name)]
currentclub <- title[[1]][3]
# extract birthdate
birthdate <- as.Date.POSIXct(dmy(sub("\\)","",sub(".*Born","",
strsplit(xpathSApply(html,"//table[@class='clubInfo']/tr/td",xmlValue)[2],"\n ")[[1]][3])))) },
error=function(cond) {
print(id)
birthdate <- as.Date.POSIXct(dmy("01-01-1910"))
birthdate <- NA
message(cond)
})
if (is.na(birthdate) | birthdate < as.Date.POSIXct(ymd("1950-01-01"))) { next() }
tryCatch({
# nationality
nationality <-
strsplit(xpathSApply(html,"//table[@class='clubInfo']/tr/td",xmlValue)[6],"\n ")[[1]][2]
# height
height <-
as.numeric(sub("m\\)","",sub(".*\\(","",(strsplit(xpathSApply(html,"//table[@class='clubInfo']/tr/td",xmlValue)[3],"\n ")[[1]][2]))))
# weight
weight <-
as.numeric(sub("kg\\)","",sub( ".*\\(","",strsplit(xpathSApply(html,"//table[@class='clubInfo']/tr/td",xmlValue)[4],"\n ")[[1]][2])))
# information about clubs
clubs <- sub(" $","",sapply(strsplit(xpathSApply(html,"//table[@class='table right career']/tbody/tr/td[1]",xmlValue),"\n "),function(x) { x[2] }))
# seasonstarts
seasonstarts <-
as.Date.POSIXct(dmy(sub(" $","",sapply(strsplit(xpathSApply(html,"//table[@class='table right career']/tbody/tr/td[2]",xmlValue),"\n "),function(x) { x[2] }))))
# seasonends
seasonends <-
as.Date.POSIXct(dmy(sub(" $","",sapply(strsplit(xpathSApply(html,"//table[@class='table right career']/tbody/tr/td[3]",xmlValue),"\n "),function(x) { x[2] }))))
# total league apps at each club
leagueapps <-
sapply(strsplit(xpathSApply(html,"//table[@class='table right career']/tbody/tr/td[5]",xmlValue)," "),function(x) { as.integer(x[1]) })
# remove the last two datapoints (these are not clubs)
last <- length(clubs) - 2
# add data for current season
clubs <- append(clubs[1:last], currentclub, after = 0)
seasonstarts <- append(seasonstarts[1:last], as.Date.POSIXct(ymd("2015/07/01")),after = 0)
if (is.na(seasonends[2])) {
seasonends[2] <- as.Date.POSIXct(ymd("2015/06/30"))
}
seasonends <- append(seasonends[1:last],as.Date.POSIXct(ymd("2016/06/30")),after = 0)
leagueapps <- append(leagueapps[1:last], NA,after = 0)
if (tolower(currentclub) == "no club") {
seasonstarts[1] <- NA
seasonends[1] <- NA
leagueapps[1] <- NA
}
# put data in tables
set(players, i=as.integer(index),1L:7L,
list(player_id,firstname,lastname,nationality,birthdate,height,weight))
for (j in seq(from = 0, to = length(leagueapps)-1, by=1)) {
set(playingcareers, i=as.integer((indexP+j)),1L:5L, list(player_id, clubs[j+1],
seasonstarts[j+1],
seasonends[j+1],
leagueapps[j+1]))
}
index <- index+1
indexP <- indexP+length(leagueapps)},
error=function(cond) {
print(id)
message(cond)
})
}
# write tables to files
write.table(players,"players.txt",row.names=FALSE)
write.table(playingcareers,"playingcareers.txt",row.names=FALSE)
}
The league data was scraped from Soccer Base. The following parameters are extracted from the relevant html pages and stored in a leagues database:
The code is shown here:
# information about leagues those clubs have been in
# http://www.soccerbase.com/tournaments/tournament.sd?tourn_id=1487 # id goes up to 1487
# get seasonstart, seasonends, clubs and league positions
getLeagueData <- function() {
# pre-allocate space for data-tables
n <- 1487
leagues <- data.table(league = as.character(rep(NA,n)),
seasonstart = as.Date.POSIXct(rep(0,n)),
seasonend = as.Date.POSIXct(rep(0,n)),
clubs = as.character(rep(NA,n)))
leagues$seasonstart <- NA
leagues$seasonend <- NA
index <- 1
for (id in 1:n){
# load tournament
url <- paste("http://www.soccerbase.com/tournaments/tournament.sd?tourn_id=",id,sep="")
html <- htmlTreeParse(url,useInternalNodes=TRUE)
title <- strsplit(xpathSApply(html,"//title",xmlValue)," \\| ")
# extract data from html
tryCatch(
{
# league name
league <- sub(" Betting","",title[[1]][1])
# season start
seasonstart <- as.Date.POSIXct(ymd(paste(strsplit(title[[1]][2],"/")[[1]][1],"07/01",sep="/")))
# season end
seasonend <- as.Date.POSIXct(ymd(paste(strsplit(title[[1]][2],"/")[[1]][2],"06/30",sep="/")))
# clubs
clubs <- str_trim(sapply(strsplit(xpathSApply(html,"//table[@class='table']/tbody/tr/td[@class='bull']",xmlValue),"\n "), function(x){x[2]}), side="right")
clubs <- paste(clubs,sep ="",collapse=" | ")
# add data to table
set(leagues,i=as.integer(index),1L:4L,list(league,seasonstart,seasonend,clubs))
index <- index+1
},
error=function(cond) {
print(id)
message(cond)
}
)
}
# write tables to files
write.table(leagues,"leagues.txt",row.names=FALSE)
}
The player data was re-organised in the following way. For each player, rows were added so that each season could be listed on a separate row. Because I only know the total number of apps per club (sometimes comprising several seasons spent at the club), this variable was changed to average league apps per season at the club. This information is stored in a new careers table:
The code is shown here:
# make table careers based on playingcareers
# for each player, add rows, so that each season has one row (fix by start and end dates)
# and change league apps to average league apps / season at the club
separateLeagues <- function() {
playingcareers <- read.table("playingcareers.txt",stringsAsFactors = FALSE, header = TRUE)
players <- read.table("players.txt",stringsAsFactors = FALSE, header = TRUE)
# convert dates to dates
playingcareers$seasonstarts <- as.Date.POSIXct(ymd(playingcareers$seasonstarts))
playingcareers$seasonends <- as.Date.POSIXct(ymd(playingcareers$seasonends))
careers <- data.frame(playingcareers)
careers <- rbind(careers,careers)
careers[!is.na(careers)] <- NA
p <-1
for (i in seq_along(players$playerID)){
# find rows for that player in playing careers
rows <- playingcareers$playerID==players$playerID[i]
tempcareers <- playingcareers[rows,]
tempseasons <- tempcareers[1,]
y <- 1L
# expand seasons so they are continuous
for (j in seq(from=1, to=sum(rows), by=1)){
# if time spent at that club is more than one year, add seasons
seasonsspent <- interval(tempcareers$seasonstarts[j],tempcareers$seasonends[j])
if (!is.na(seasonsspent) & as.period(seasonsspent) > period(1,units = "year")) {
if (month(tempcareers$seasonstarts[j]) >= 1 & month(tempcareers$seasonstarts[j]) < 7){
seasonstartsvect <-
seq(from = tempcareers$seasonstarts[j],to = tempcareers$seasonends[j], by="year")
month(seasonstartsvect[2:length(seasonstartsvect)]) <- 7
day(seasonstartsvect[2:length(seasonstartsvect)]) <- 1
seasonstartsvect[2:length(seasonstartsvect)] <-
seasonstartsvect[2:length(seasonstartsvect)]-period(1,units = "year")
seasonendsvect <- seasonstartsvect+years(1)-days(1)
month(seasonendsvect[1]) <- 6
day(seasonendsvect[1]) <- 30
seasonendsvect[1] <- seasonstartsvect[1]
year(seasonendsvect[1]) <- year(years(year(seasonendsvect[1])))
month(seasonendsvect[1]) <- 6
day(seasonendsvect[1]) <- 30
} else if (month(tempcareers$seasonstarts[j]) >= 7 & month(tempcareers$seasonstarts[j]) <= 12){
seasonstartsvect <-
seq(from = tempcareers$seasonstarts[j],to = tempcareers$seasonends[j], by="year")
month(seasonstartsvect[2:length(seasonstartsvect)]) <- 7
day(seasonstartsvect[2:length(seasonstartsvect)]) <- 1
seasonstartsvect[2:length(seasonstartsvect)] <-
seasonstartsvect[2:length(seasonstartsvect)]
seasonendsvect <- seasonstartsvect+years(1)-days(1)
month(seasonendsvect[1]) <- 6
day(seasonendsvect[1]) <- 30
seasonendsvect[1] <- seasonstartsvect[1]
year(seasonendsvect[1]) <- year(years(year(seasonendsvect[1])))
month(seasonendsvect[1]) <- 6
day(seasonendsvect[1]) <- 30
}
seasonendsvect[length(seasonendsvect)] <- tempcareers$seasonends[j]
c <- !is.na(seasonstartsvect) & is.na(seasonendsvect)
if (sum(c)>0) {
if (month(seasonstartsvect[c]) >= 1 & month(seasonstartsvect[c]) < 7){
seasonendsvect[c] <- seasonstartsvect[c]
year(seasonendsvect[c]) <- year(years(year(seasonendsvect[c])))
month(seasonendsvect[c]) <- 6
day(seasonendsvect[c]) <- 30
}
else if (month(seasonstartsvect[c]) >= 7 & month(seasonstartsvect[c]) <= 12){
seasonendsvect[c] <- seasonstartsvect[c]
year(seasonendsvect[c]) <- year(years(year(seasonendsvect[c]))+period(1,"year"))
month(seasonendsvect[c]) <- 6
day(seasonendsvect[c]) <- 30
}
}
mm <- seasonstartsvect>=seasonendsvect
seasonstartsvect <- seasonstartsvect[!mm]
seasonendsvect <- seasonendsvect[!mm]
indices <- seq(from=y,to=y+length(seasonstartsvect)-1,by=1)
tempseasons[indices,] <- tempcareers[j,]
tempseasons$seasonstarts[indices] <- seasonstartsvect
tempseasons$seasonends[indices] <- seasonendsvect
# average the leagueapps per season
leagueappsvect <-
rep(tempcareers$leagueapps[j]/length(seasonstartsvect),length(seasonstartsvect))
tempseasons$leagueapps[indices] <- leagueappsvect
y <- y+length(seasonstartsvect)
} else {
tempseasons[y,] <- tempcareers[j,]
y <- y+1
}
}
tempseasons <- tempseasons[order(tempseasons$seasonstarts,tempseasons$seasonends),]
careers[seq(from = p, to = p+nrow(tempseasons)-1,by=1),] <- tempseasons
p <- p+nrow(tempseasons)
}
row.names(careers) <- seq_along(careers[,1])
careers <- careers[complete.cases(careers[,1]),]
careers$seasonends[Vectorize(isTRUE)(careers$seasonstarts > careers$seasonends)] <-
careers$seasonends[Vectorize(isTRUE)(careers$seasonstarts > careers$seasonends)]+period(1,"year")
write.table(careers,"careers.txt",row.names=FALSE)
}
Then the following columns were added to careers:
The code is shown here:
# then add columns to table careers:
# - nationality of the player
# - age at the start of each season
# - season number (1st, 2nd etc) for that player
addPlayerDetails <- function(){
players <- read.table("players.txt",stringsAsFactors = FALSE, header = TRUE)
careers <- read.table("careers.txt",stringsAsFactors = FALSE, header = TRUE)
careers$seasonstarts <- as.Date.POSIXct(ymd(careers$seasonstarts))
careers$seasonends <- as.Date.POSIXct(ymd(careers$seasonends))
players$birthdate <- as.Date.POSIXct(ymd(players$birthdate))
# add the columns
tempmerge <- merge(careers, players, by.x = "playerID", by.y = "playerID")
careers$playernationality <- tempmerge$nationality
careers$playerage <- year(as.period(interval(tempmerge$birthdate,tempmerge$seasonstarts)))
careers$playerseason <-
unlist(tapply(careers$playerage, careers$playerID, function(x){x-min(x, na.rm=TRUE)+1}),
use.names = FALSE)
# save
write.table(careers,"careers.txt",row.names=FALSE)
}
Then the following columns were added to careers:
The code is shown here:
addLeagueDetails <- function() {
# - league the club was in that season
# - league level (1st, 2nd, etc) on the national football pyramid
# - nationality of the league
careers <- read.table("careers.txt",stringsAsFactors = FALSE, header = TRUE)
careers$seasonstarts <- as.Date.POSIXct(ymd(careers$seasonstarts))
careers$seasonends <- as.Date.POSIXct(ymd(careers$seasonends))
careers$leagues <- as.character(NA)
leagues <- read.table("leagues.txt",stringsAsFactors = FALSE, header = TRUE)
leagues$seasonstart <- as.Date.POSIXct(ymd(leagues$seasonstart))
leagues$seasonend <- as.Date.POSIXct(ymd(leagues$seasonend))
leagues <- leagues[leagues$seasonstart>"1950-01-01",]
leagues <- leagues[complete.cases(leagues[,1]),]
count <- 0
noleaguecount <- 0
for (i in seq_along(careers$clubs)) { #
club <-careers$clubs[i]
# find leagues in the same season
if (is.na(careers$seasonstarts[i])) { next() }
if (month(careers$seasonstarts[i]) >= 1 & month(careers$seasonstarts[i]) < 7){
indices <- year(careers$seasonstarts[i])==year(leagues$seasonend)}
else if (month(careers$seasonstarts[i]) >= 7 & month(careers$seasonstarts[i]) <= 12){
indices <- year(careers$seasonstarts[i])==year(leagues$seasonstart)}
leaguetemp <- leagues[indices,]
# check which of those leagues the club was in
club <- str_trim(gsub("\\(.*\\)","",club), side="right")
stringclub <- str_sub(club,1,7)
index <-
unlist(lapply(leaguetemp$clubs, function(x){j <- grepl(stringclub, x); j}),
use.names = FALSE)
if (sum(index)==1) {
careers$leagues[i] <- leaguetemp$league[index]
} else if (sum(index)==0) {
message("no league found")
careers$leagues[i] <- ""
noleaguecount <- noleaguecount+1
} else {
message("more than one league")
index <-
unlist(lapply(leaguetemp$clubs, function(x){j <- grepl(str_c("\\| ",club," \\|"), x); j}),
use.names = FALSE)
if (sum(index)==1) {
message("..fixed")
careers$leagues[i] <- leaguetemp$league[index]
} else {
sum(index)
count <- count+1
}
}
}
# add league nationality
careers$leaguenationality[grepl("English|Football|League|Premier|National|Conference|Southern|South|Northern|North|Championship",careers$leagues)] <- "England"
careers$leaguenationality[grepl("Scottish",careers$leagues)] <- "Scotland"
careers$leaguenationality[grepl("Welsh",careers$leagues)] <- "Wales"
careers$leaguenationality[grepl("Italian",careers$leagues)] <- "Italy"
careers$leaguenationality[grepl("Spanish",careers$leagues)] <- "Spain"
careers$leaguenationality[grepl("German",careers$leagues)] <- "Germany"
careers$leaguenationality[grepl("Russian",careers$leagues)] <- "Russia"
careers$leaguenationality[grepl("Austrian",careers$leagues)] <- "Austria"
careers$leaguenationality[grepl("Hungarian",careers$leagues)] <- "Hungary"
careers$leaguenationality[grepl("French",careers$leagues)] <- "France"
careers$leaguenationality[grepl("Polish",careers$leagues)] <- "Poland"
careers$leaguenationality[grepl("Belgian",careers$leagues)] <- "Belgium"
careers$leaguenationality[grepl("Danish",careers$leagues)] <- "Denmark"
careers$leaguenationality[grepl("Ukrainian",careers$leagues)] <- "Ukraine"
careers$leaguenationality[grepl("Romanian",careers$leagues)] <- "Romania"
careers$leaguenationality[grepl("Greek",careers$leagues)] <- "Greece"
careers$leaguenationality[grepl("Serbian",careers$leagues)] <- "Serbia"
careers$leaguenationality[grepl("Australian",careers$leagues)] <- "Australia"
careers$leaguenationality[grepl("Dutch",careers$leagues)] <- "Netherlands"
careers$leaguenationality[grepl("Portuguese",careers$leagues)] <- "Portugal"
careers$leaguenationality[grepl("Czech",careers$leagues)] <- "Czech Republic"
careers$leaguenationality[grepl("Turkish",careers$leagues)] <- "Turkey"
careers$leaguenationality[grepl("Irish",careers$leagues)] <- "Ireland"
careers$leaguenationality[grepl("Bulgarian",careers$leagues)] <- "Bulgaria"
careers$leaguenationality[grepl("Swiss",careers$leagues)] <- "Switzerland"
careers$leaguenationality[grepl("Israeli",careers$leagues)] <- "Israel"
careers$leaguenationality[grepl("Slovenian",careers$leagues)] <- "Slovenia"
# add league level
# if there is only one, it is probably the top level
uniqueLeagueLevels <- 0
listLeagues <- tapply(careers$leagues,careers$leaguenationality,unique)
for (ll in 1:length(listLeagues)){
if (length(listLeagues[[ll]])==1) {
uniqueLeagueLevels[ll] <- listLeagues[[ll]]
}
}
# for the others, manually set the level: make tables for different countries with the league system
englishleagues <- data.frame(
from = c("1950","1992","1950","1992","2004","1950","1992","2004","1950","1992","2004","1950","1950","1950"),
level = c(1,1,2,2,2,3,3,3,4,4,4,5,6,7),
league = c("English Division 1|Football League First Division","Premier League|Premiership",
"Football League Second Division|English Division 2|", "Football League First Division|","Football League First Division|Championship|",
"Football League Third Division|English Division 3|","Football League Second Division|","Football League Second Division|League One|Football League One|",
"Football League Fourth Division|English Division 4|","Football League Third Division|","Football League Third Division|League Two|Football League Two|",
"National League|Nationwide Conference|Conference Premier|Blue Sq Premier|Blue Sq Bet Premier|Football Conference|",
"Football Conference South|Football Conference North|Conference South|Blue Sq South|Blue Sq North|GM Vauxhall Conference|",
"Ryman Premier|Evo-Stik Southern Premier|Evo-Stik Northern Premier|Southern Premier|Unibond Premier|")
)
scottishleagues <- data.frame(level = c(1,2,3,4),league = c("Scottish Premiership|Scottish Premier Division|Scottish Division One|Scottish Premier League|Scottish Division 1 \\(old\\)","Scottish Championship|Scottish Division Two","Scottish League One|Scottish Division 3","Scottish League Two|Scottish Division 4"))
spanishleagues <- data.frame(level = c(1,2,2),league =c("Spanish Primera Liga","Spanish Segunda Division","Spanish Seguda Division"))
careers$leaguelevel <- NA
for (l in seq(1,length(careers$leagues),1)){
stringleague <- str_trim(gsub("\\(.*\\)","",careers$leagues[l]), side="right")
stringleague <- str_c(stringleague,"\\|")
ix <- grepl(careers$leagues[l],uniqueLeagueLevels)
if (sum(ix,na.rm=T)>0){ careers$leaguelevel[l] <- 1; next() }
ix <- grepl(careers$leagues[l],scottishleagues$league)
if (sum(ix,na.rm=T)>0){ careers$leaguelevel[l] <- scottishleagues$level[ix]; next() }
ix <- grepl(careers$leagues[l],spanishleagues$league)
if (sum(ix,na.rm=T)>0){ careers$leaguelevel[l] <- spanishleagues$level[ix]; next() }
ix <- grepl(stringleague,englishleagues$league)
if (sum(ix,na.rm=T)>0){
if (sum(ix,na.rm=T)==1) { careers$leaguelevel[l] <- englishleagues$level[ix]; next() }
season <- year(careers$seasonstarts[l])
if (season >= 2004) {
iy <- englishleagues$from=="2004"
} else if (season >= 1992 & season < 2004){
iy <- englishleagues$from=="1992"
} else if (season >= 1950 & season < 1992){
iy <- englishleagues$from=="1950"
}
careers$leaguelevel[l] <- englishleagues$level[ix & iy]
}
}
careers$leaguelevel[is.na(careers$leaguenationality)] <- NA
write.table(careers,"careers.txt",row.names=FALSE)
}
After some exploratory analysis, I decided to focus on the careers of English players in English leagues, because for foreign leagues, only information at level 1 is available. For English leagues, there is information about levels 1-7 of the English league pyramid.
I first subset the careers table to include only the information about English leagues. This table is called careersEngland.
careers <- read.table("careers.txt",stringsAsFactors = FALSE, header = TRUE)
careers$seasonstarts <- as.Date.POSIXct(ymd(careers$seasonstarts))
careers$seasonends <- as.Date.POSIXct(ymd(careers$seasonends))
players <- read.table("players.txt",stringsAsFactors = FALSE, header = TRUE)
players$birthdate <- as.Date.POSIXct(ymd(players$birthdate))
# create temporary tables: England
# first fill in some missing values for league nationality (almost no English
# players abroad, so set to England for English players)
careers$leaguenationality[is.na(careers$leaguenationality) & careers$playernationality == "England"] <- "England"
careersEngland <- careers[Vectorize(isTRUE)(careers$leaguenationality=="England"),]
careersEngland$leaguelevel <- factor(careersEngland$leaguelevel)
Then I created a new column in the careersEngland table which categorises players by birthdate. I decided to look at players born in the 1950s, 1960s, 1970s, 1980s, and 1990s. Added column:
tempmerge <- merge(careersEngland,players)
# make a new table with factors for when players were born: 1960-1970,1970-1980,1980-1990
careersEngland[(year(tempmerge$birthdate)>=1950 & year(tempmerge$birthdate)<1960),"yeargroup"] <- 5
careersEngland[(year(tempmerge$birthdate)>=1960 & year(tempmerge$birthdate)<1970),"yeargroup"] <- 4
careersEngland[(year(tempmerge$birthdate)>=1970 & year(tempmerge$birthdate)<1980),"yeargroup"] <- 3
careersEngland[(year(tempmerge$birthdate)>=1980 & year(tempmerge$birthdate)<1990),"yeargroup"] <- 2
careersEngland[(year(tempmerge$birthdate)>=1990 & year(tempmerge$birthdate)<2000),"yeargroup"] <- 1
careersEngland$yeargroup <- factor(careersEngland$yeargroup,labels = c("born in the 1990s","born in the 1980s","born in the 1970s","born in the 1960s","born in the 1950s"))
# exclude players with uncategorised birthdates
careersEngland <- careersEngland[complete.cases(careersEngland$yeargroup),]
Then I added the following columns for later analysis:
# make a column with total number of apps
careersEngland$totalapps <- unlist(tapply(careersEngland$leagueapps, careersEngland$playerID, function(x){ rep(sum(x,na.rm=TRUE),length(x)) }), use.names = FALSE)
# initial league level
careersEngland$initialLevel[Vectorize(isTRUE)(careersEngland$playerage<20)] <- unlist(tapply(careersEngland$leaguelevel[Vectorize(isTRUE)(careersEngland$playerage<20)], careersEngland$playerID[Vectorize(isTRUE)(careersEngland$playerage<20)], function(x){ rep(x[1],length(x)) }), use.names = FALSE)
careersEngland$initialLevel <- unlist(tapply(careersEngland$leaguelevel, careersEngland$playerID, function(x){ rep(x[1],length(x)) }), use.names = FALSE)
careersEngland$initialLevel <- factor(careersEngland$initialLevel)
# initial club
careersEngland$initialclub[Vectorize(isTRUE)(careersEngland$playerage<20)] <- unlist(tapply(careersEngland$clubs[Vectorize(isTRUE)(careersEngland$playerage<20)], careersEngland$playerID[Vectorize(isTRUE)(careersEngland$playerage<20)], function(x){ rep(x[1],length(x)) }), use.names = FALSE)
careersEngland$initialclub <- unlist(tapply(careersEngland$clubs, careersEngland$playerID, function(x){ rep(x[1],length(x)) }), use.names = FALSE)
careersEngland$initialclub <- factor(careersEngland$initialclub)
First, I decided to look at the number of players who keep playing football at any league level as they grow older, i.e. the number of players still playing football at different ages. This is shown for different initial league levels and yeargroup. It is another way of looking at how many players drop out, and at what ages. Here is the code for the plot:
# number of players of different ages for different levels and yeargroups
t <- aggregate(careersEngland$playerID, by=list(yeargroup = careersEngland$yeargroup,
age = careersEngland$playerage,
nationality = careersEngland$playernationality=="England",
initialLevel = careersEngland$initialLevel),
length)
# subset for levels 1-3 and yeargroups 2-4
t <- subset(x = t, yeargroup=="born in the 1960s" | yeargroup=="born in the 1970s" | yeargroup=="born in the 1980s")
t <- subset(x = t, initialLevel=="1" | initialLevel=="2" | initialLevel=="3")
t <- t[t$age > 15 & t$age < 45,]
tEnglish <- t[t$nationality,]
# plot
g <- ggplot(data = tEnglish, aes(x = age, y=x, fill = initialLevel))
g + geom_bar(stat = "identity", position = "stack") + facet_grid(yeargroup ~ initialLevel) +
ggtitle("Staying Power") + ylab("Number of players") + xlab("Player age (years)")
This figure shows that, regardless of the league level at which players start their career, many players drop out of English football between the ages of 19-21. Almost no English players go abroad, so this means that they drop out of the game completely. This is a comparatively recent effect (i.e. it is not observed for players born in the 1960s). Players who stay in the game also retire a few years earlier than they used to.
Then I looked at the players who stay in the game. Do they play football in the league they start out in or do they move down (or up) the leagues? This is shown in the next plot - the mean percentage of league appearances at each league level, for players who started out at level 1, 2, 3, 4 or 5. Because most players have a large majority of their appearances at mainly one league level, this percentage also (roughly) corresponds to the percentage of players who make a career at a given league level.
# histogram: apps per league level over career; only for players who have more than 10 league apps
# for players who started out at 1st league level, 2nd level, aso
t1 <- aggregate(careersEngland$leagueapps, by=list(id = careersEngland$playerID,
yeargroup = careersEngland$yeargroup,
nationality = careersEngland$playernationality=="England",
initialLevel = careersEngland$initialLevel,
leaguelevel = careersEngland$leaguelevel,
apps10 = careersEngland$totalapps > 10),
sum, na.rm = TRUE)
t1 <- t1[order(t1$id),]
t1 <- subset(x = t1, yeargroup=="born in the 1960s" | yeargroup=="born in the 1970s" | yeargroup=="born in the 1980s")
t1 <- subset(x = t1, initialLevel=="1" | initialLevel=="2" | initialLevel=="3" | initialLevel=="4" | initialLevel=="5")
t1 <- t1[t1$nationality & t1$apps10,]
# put in missing values: 0s for missing league levels for each player
ids <- unique(t1$id)
t <- NULL
l <- seq(from=1,to=7,by=1)
temp <- data.frame("leaguelevel" = l)
for (i in seq_along(ids)) {
temp$id <- rep(ids[i], 7)
temp$yeargroup <- rep( t1$yeargroup[t1$id==ids[i]][1], 7)
temp$initialLevel <- rep(t1$initialLevel[t1$id==ids[i]][1], 7)
temp$leagueapps <- rep(0,7)
temp$leagueapps[t1$leaguelevel[t1$id==ids[i]]] <- t1$x[t1$id==ids[i]]
t <- rbind(t, temp)
}
# calculate mean percentage of apps per level
t$totalapps <- unlist(tapply(t$leagueapps, t$id, function(x) { rep(sum(x,na.rm = TRUE), length(x))} ))
t$percperleague <- t$leagueapps / t$totalapps
t <- aggregate(t$percperleague, by = list(yeargroup = t$yeargroup,
initialLevel = t$initialLevel,
leaguelevel = t$leaguelevel),
mean, na.rm = TRUE)
# subset for levels 1-3 and yeargroups 2-4
# plot
g <- ggplot(data = t, aes(leaguelevel,x, fill = initialLevel))
g + geom_bar(stat = "identity", position = "stack") + facet_grid(yeargroup ~ initialLevel)+
ggtitle("Career progression from different leagues") + ylab("Mean percentage of appearances") + xlab("League level")
This figure shows that there is a lot of mobility within the English league system. Most young footballers who start out at league level 1 end up playing at levels 2-5. This is also observed for players who start out at levels 2 and 3. For players trained at levels 4 and 5, the distributions become more normal, i.e. most players who start out at that level, stay at that level.
This displacement of active footballers down the leagues may be caused by an increase in the percentage of foreign players in the top English leagues over the years, although it has to be said that at levels 2-4, the level of English players may actually have improved a bit over time (more of these players are now playing in higher leagues).
Finally, I wanted to look at career progression at a few clubs who supposedly have good academies: Arsenal, Manchester United, Chelsea, Liverpool and Southampton. The sample size for these plots is obviously much smaller than for whole leagues (i.e. we are now looking at a few 10s of players instead of a few 100s of players). Here is the code for the plot:
# career progression at individual clubs
t1 <- aggregate(careersEngland$leagueapps, by=list(id = careersEngland$playerID,
yeargroup = careersEngland$yeargroup,
nationality = careersEngland$playernationality=="England",
initialClub = careersEngland$initialclub,
leaguelevel = careersEngland$leaguelevel,
apps10 = careersEngland$totalapps>=10),
sum, na.rm = TRUE)
t1 <- t1[order(t1$id),]
t1 <- subset(x = t1, yeargroup=="born in the 1960s" | yeargroup=="born in the 1970s" | yeargroup=="born in the 1980s")
t1 <- subset(x = t1, initialClub=="Arsenal" | initialClub=="Man Utd" | initialClub=="Liverpool" | initialClub=="Chelsea" | initialClub=="Southampton")
t1 <- t1[t1$nationality & t1$apps10,]
# put in missing values: 0s for missing league levels for each player
ids <- unique(t1$id)
t <- NULL
l <- seq(from=1,to=7,by=1)
temp <- data.frame("leaguelevel" = l)
for (i in seq_along(ids)) {
temp$id <- rep(ids[i], 7)
temp$yeargroup <- rep( t1$yeargroup[t1$id==ids[i]][1], 7)
temp$initialClub <- rep(t1$initialClub[t1$id==ids[i]][1], 7)
temp$leagueapps <- rep(0,7)
temp$leagueapps[t1$leaguelevel[t1$id==ids[i]]] <- t1$x[t1$id==ids[i]]
t <- rbind(t, temp)
}
# calculate mean percentage of apps per level
t$totalapps <- unlist(tapply(t$leagueapps, t$id, function(x) { rep(sum(x,na.rm = TRUE), length(x))} ))
t$percperleague <- t$leagueapps / t$totalapps
t <- aggregate(t$percperleague, by = list(yeargroup = t$yeargroup,
initialClub = t$initialClub,
leaguelevel = t$leaguelevel),
mean, na.rm = TRUE)
# subset for levels 1-3 and yeargroups 2-4
# plot
g <- ggplot(data = t, aes(leaguelevel,x, fill = initialClub))
g + geom_bar(stat = "identity", position = "stack") + facet_grid(yeargroup ~ initialClub)+
ggtitle("Career progression from individual clubs") + ylab("Mean percentage of appearances") + xlab("League level")
There are only small differences between clubs. Liverpool’s academy is probably the worst of these. Manchester United most recently have brought through surprisingly few level 1 players as well. Arsenal’s, Chelsea’s and Southampton’s academies are fairly equal regarding quality of recent output (players born in the 1980s). All clubs produced significantly more level 1 players in the past (i.e. players born in the 1960s).
I think that top level clubs probably take in many youngsters who don’t really have the potential to succeed at the parent club - after all, very few young players are that talented, yet you need enough players to make up the numbers. I personally don’t think that this is a problem, because many clubs at the top level have a good record of getting their graduates playing in the lower leagues.
No matter at which league level English footballers start their career, many drop out of football altogether between the age of 19 and 21. Those youngsters who do stay in the game tend to move down the leagues. This is also true for players starting out at top clubs (e.g. Arsenal, Manchester United, Chelsea and so on). However, for players who start out at Level 4 of the league pyramid, the distribution of appearances of players across the leagues becomes normal, suggesting Level 4 training matches Level 4 potential. I think these results are interesting because they show that players, fans and the media should not judge a player’s career by the club or league level he starts out at. I don’t think that is emphasised enough - a downward move is usually seen as failure, but actually it is completely normal at levels 1-3 and making a career at the same level is the exception.