Relative Age Effect in Australian professional football

An investigation into whether the Relative Age Effect was present in the 2019/2020 Australian football (soccer) season

Douglas James Kors S3869417

Last updated: 25 October, 2020

Introduction

The Relative Age Effect (RAE) was first researched in a sporting context by Barnsley, Thompson and Barnsley (1985). Research on the RAE has found that athletes with a greater relative age, those born in the first three months of the year, are more likely to be identified as talented and selected in sporting teams because of their likely physical advantages over athletes born in the later months of the year (Helsen, van Winckel & Williams, 2005).

It has recently been explored and found in a range of sports around the world, such as Danish Handball (Wrang, Rossing, Diernæs, Hansen, Dalgaard-Hansen & Karbing, 2018), Norwegian Handball (Bjørndal, Luteberget, Till & Holm, 2018), Polish Youth Football (Rubajczyk & Rokita, 2018), Spanish Basketball (López de Subijana & Lorenzo, 2018), Ukrainian Swimmers (Atar, Özen & Koç, 2019) and Russian Ice Hockey (Bezuglov, Shvets, Lyubushkina, Lazarev, Valova, Zholinsky & Waśkiewicz, 2020).

In football (soccer), many recent studies have shown the RAE to be prevalent throughout European nations, where it has been studied across both Birth Months and Birth Quartiles (Kelly, Wilson, Gough, Knapman, Morgan, Cole, Jackson & Williams, 2019; Romann & Fuchslocher, 2013; Salinero, Pérez, Burillo & Lesma, 2013; Yagüe, de la Rubia, Sánchez-Molina, Maroto-Izquierdo & Molinero, 2018).

However, there is limited published research into whether or not the RAE is present in Australian professional football.

Problem Statement

The aim of this investigation is to observe whether the RAE is present among professional football players in the Australian professional football league (A-League).

Data collected from the website Transfermarkt.com has been preprocessed into a sample that allows the Birth Month and Birth Quartile of professional footballers to be analysed, summarised and visualised.

The Chi-square Goodness of Fit Test has been used to determine whether the distribution of Birth Month and Birth Quartiles for professional football players in Australia follows the normal birth distribution of Australians.

Additional analysis on player positions and minutes played will also be used to compare groups.

Data - Description

This dataset is a sample of Australian professional soccer players who appeared in a matchday squad in the 2019/2020 A-League Season. The data is based on the HTML data tables taken from the squad details and stats pages on each team's Transfermarkt.com page, located at https://www.transfermarkt.com/a-league/startseite/wettbewerb/AUS1.

Although this dataset contains additional variables, only the 13 listed below are relevant for the purposes of this analysis. These variables are:
Team [Character]: Name of team
TeamCode [Character]: Unique Team ID number
Name [Character]: Name of player
Position [Factor]: Categorical variable with 16 factors (Goalkeeper, Right-Back, Centre-Back, Defender, Left-Back, Defensive Midfield, Central Midfield, Midfielder, Attacking Midfield, Second Striker, Right Winger, Right Midfield, Left Midfield, Left Winger, Centre-Forward, Forward)
Date of Birth [Date]: Date of birth of player
Height [Integer]: Numerical variable, height of player (cm)
Foot [Factor]: Categorical variable with 4 factors (left, both, right, unknown)
Age [Integer]: Numerical variable, age of player
Squad_Selections [Integer]: Numerical value, number of matchday squads selected in during season
Games_Played [Integer]: Numerical value, number of games played in season
Goals [Integer]: Numerical value, number of goals in season
Assists [Integer]: Numerical value, number of assists in season
Minutes_Played [Integer]: Numerical value, number of minutes played in season

The remaining variables, superfluous to this analysis, were dropped after reading the file.

Data - Collecting Team Codes and Team Name

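# Packages for scraping (rvest) and reshaping (tidyr, dplyr)
library(rvest)
library(tidyr)
library(dplyr)

# Scrape each team's profile link and split the href into its path segments
teamcodes <- read_html("https://www.transfermarkt.com/a-league/startseite/wettbewerb/AUS1/plus/?saison_id=2019") %>%
  html_nodes(".hide-for-pad .vereinprofil_tooltip") %>%
  html_attr("href") %>%
  data.frame() %>%
  separate(., ., into = c("blank","team","startseite","verein","teamcode","saison","year"), sep = "/") %>%
  select(-1,-3,-4,-6,-7)   # keep only the team slug and the team code
head(teamcodes, n = 10)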
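
To make the splitting step concrete, the sketch below parses a single link offline with base R instead of tidyr. The href value and its team code `12345` are hypothetical placeholders chosen to match the seven-part pattern that `separate()` assumes; they are not taken from the site.

```r
# Hypothetical href in the shape the scraper expects (the team code is a placeholder)
href <- "/sydney-fc/startseite/verein/12345/saison_id/2019"

# Splitting on "/" gives an empty first piece because the path starts with "/"
parts <- strsplit(href, "/", fixed = TRUE)[[1]]
names(parts) <- c("blank", "team", "startseite", "verein", "teamcode", "saison", "year")

# Keep only the two pieces used later: the team slug and the team code
team_info <- parts[c("team", "teamcode")]
print(team_info)
```

The empty leading piece is why the first name passed to `into` is "blank".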

Data - Collecting Squad Details

# For every team, scrape the squad details table and tag each row with the team name
squadlist <- apply(teamcodes, 1, function(x){
  teamcode <- x['teamcode']
  teamname <- x['team']
  starturl <- "https://www.transfermarkt.com/sydney-fc/kader/verein/"
  endurl <- "/plus/1/galerie/0?saison_id=2019"
  theurl <- paste0(starturl, teamcode, endurl)
  squad <- read_html(theurl) %>%
    html_nodes(.,"table") %>%
    html_table(.,fill = TRUE)  %>% .[2] %>%   # the second table on the page holds the squad
    data.frame() %>%
    select(-1,-2,-3,-7,-8,-11,-12,-13,-14) %>%
    drop_na() %>%
    cbind(.,teamname)
})
# Stack the per-team tables into one data frame
squad <- do.call(rbind.data.frame, squadlist)
names(squad) <- c("Name", "Position","DOB","Height","Foot","Team")
head(squad, n=6)
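
The scraper above hard-codes the `sydney-fc` slug in every URL and varies only the team code, which evidently was enough for the site to return each team's squad page. As a hedged alternative, the sketch below builds the URL from each team's own slug. `build_squad_url` is a hypothetical helper, and the team code is a placeholder rather than a real Transfermarkt ID.

```r
# Hypothetical helper: assemble a squad-page URL from a team slug and team code,
# following the URL pattern used in the scraper above
build_squad_url <- function(team, teamcode, season = 2019) {
  sprintf("https://www.transfermarkt.com/%s/kader/verein/%s/plus/1/galerie/0?saison_id=%d",
          team, teamcode, season)
}

# Placeholder team code, for illustration only
url <- build_squad_url("sydney-fc", "12345")
print(url)
```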

Data - Collecting Squad Statistics

The same process and functions used to scrape the Squad Details were then used for the Player Statistics, which come from a different URL and provide a different dataset.

# Get the stats data: one table per team, tagged with the team name
statslist <- apply(teamcodes, 1, function(x){
  teamcode <- x['teamcode']
  teamname <- x['team']
  starturl <- "https://www.transfermarkt.com/sydney-fc/leistungsdaten/verein/"
  endurl <- "/plus/1?reldata=%262019"
  theurl <- paste0(starturl, teamcode, endurl)

  stats <- read_html(theurl) %>%
    html_nodes(.,"table") %>%
    html_table(.,fill = TRUE)  %>% .[2] %>%
    data.frame() %>%
    select(-1,-2,-3,-7,-12,-13,-14,-15,-16,-17,-19,-20,-21,-22,-23,-24,-25,-26,-27) %>%
    drop_na() %>%
    cbind(.,teamname)
})

# Stack the per-team tables into one data frame
stats <- do.call(rbind.data.frame, statslist)
names(stats) <- c("Name", "Position","Age", "Squad_Selections", "Games_Played","Goals","Assists","Minutes_Played","Team")
head(stats, n=8)

Data - Merging and Preprocessing

library(magrittr)   # %<>% assignment pipe
library(stringr)
library(lubridate)

# Strip punctuation from player names so the two tables join cleanly
stats$Name %<>% str_replace_all("[:punct:]", "") %>% str_trim()
squad$Name %<>% str_replace_all("[:punct:]", "") %>% str_trim()
team_data <- stats %>% left_join(squad, by = c("Name","Position","Team"))
# Collapse the 16 scraped positions into 10 position groups
team_data$Position %<>% factor(
  levels = c("Goalkeeper","Right-Back","Centre-Back","Defender","Left-Back","Defensive Midfield","Central Midfield","Midfielder","Attacking Midfield","Second Striker","Right Winger","Right Midfield","Left Midfield","Left Winger","Centre-Forward","Forward"),
  labels = c("GK","RB","CB","CB","LB","CDM","CM","CM","CAM","CAM","RW","RW","LW","LW","FW","FW"))
team_data$Foot %<>% factor(levels = c("left","both","right"))
# Columns 3-7 (Age to Assists): punctuation becomes "0"; columns 8-12: punctuation is dropped
team_data[,3:7] %<>% mutate_all(~ str_replace_all(.x, "[:punct:]", "0"))
team_data[,8:12] %<>% mutate_all(~ str_replace_all(.x, "[:punct:]", ""))
team_data[,3:8] <- sapply(team_data[,3:8],as.numeric)
team_data$Height %<>% str_replace_all(., "m","") %>% as.numeric()
# Drop the last three characters of the date string, then parse as month-day-year
team_data$DOB %<>% str_sub(end = -4) %>% mdy()
head(team_data, n=8)
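
The two punctuation rules above treat columns differently, and applying the wrong rule to a column would corrupt it. The sketch below shows why on toy values, using base R `gsub` rather than stringr. The raw formats (a lone `-` for zero, `1.380'` for minutes) are assumptions inferred from the cleaning code, not confirmed samples of the scraped data.

```r
# Assumed raw values: "-" stands for zero; minutes carry a thousands dot and a trailing apostrophe
raw_goals   <- c("3", "-", "1")
raw_minutes <- c("1.380'", "670'", "-")

# Rule 1: replace punctuation with "0", so a lone "-" becomes 0
goals <- as.numeric(gsub("[[:punct:]]", "0", raw_goals))

# Rule 2: drop punctuation, so "1.380'" becomes 1380; a lone "-" becomes "" and then NA
minutes <- as.numeric(gsub("[[:punct:]]", "", raw_minutes))

print(goals)
print(minutes)
```

Applying rule 1 to the minutes would turn `1.380'` into `103800`, which is why the minutes column needs the second rule even though it leaves an NA behind.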

Data - Missing Data and Outliers

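library(Hmisc)   # impute()

team <- team_data
# Missing values in columns 3-8 (Age to Minutes_Played) become 0; missing foot becomes "unknown"
team[,c(3:8)] %<>% impute(., fun = 0)
team$Foot %<>% impute(., fun = "unknown")
# Missing heights are replaced with the mean height
team$Height[is.na(team$Height)] <- mean(team$Height, na.rm = TRUE)
# List implausible ages and heights for inspection
team %>% filter(., Age < 15 | Age > 45 | Height < 160 | Height > 210) %>% select(Name, Age, Height, Minutes_Played)
# Heights below 50 are treated as data-entry errors and replaced with the mean
team$Height[which(team$Height <50)] <- mean(team$Height, na.rm = TRUE)
head(team)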
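
One detail worth noting in the code above: the mean used to replace missing and implausible heights is computed while the implausible values are still in the column, which pulls the mean down slightly. A minimal sketch of an alternative order, on made-up heights, marks implausible values as missing first and imputes afterwards. The threshold of 50 is taken from the code above; the data are hypothetical.

```r
# Made-up heights in cm: one missing value and one implausible entry
height <- c(190, NA, 181, 2, 183)

# Step 1: treat implausible heights (below 50) as missing
height[!is.na(height) & height < 50] <- NA

# Step 2: impute all missing heights with the mean of the plausible values
plausible_mean <- mean(height, na.rm = TRUE)
height[is.na(height)] <- plausible_mean

print(round(height, 1))
```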

Data - Creating Bins

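library(infotheo)   # discretize()

# Split the date of birth into year, month and day columns
team %<>% separate(DOB, into =  c("BirthYear","BirthMonth","BirthDay"))
team[,10:12] <- sapply(team[,10:12],as.numeric)
# Bin birth months into four equal-width groups (birth quartiles)
BirthMonthQuartile <- discretize(team$BirthMonth, disc = "equalwidth", nbins=4)
names(BirthMonthQuartile) <- c("BirthMonthQuartile")
team <- cbind(team, BirthMonthQuartile)
str(team)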
## 'data.frame':    339 obs. of  15 variables:
##  $ Name              : chr  "Tom GloverT Glover" "Dean BouzanisD Bouzanis" "Joe GauciJoe Gauci" "Jack HendryJ Hendry" ...
##  $ Position          : Factor w/ 10 levels "GK","RB","CB",..: 1 1 1 3 3 2 4 3 3 4 ...
##  $ Age               : num  21 28 18 24 28 20 24 27 26 30 ...
##  $ Squad_Selections  : num  25 25 11 2 15 29 30 27 31 25 ...
##  $ Games_Played      : num  15 15 0 2 9 23 24 26 28 24 ...
##  $ Goals             : num  0 0 0 0 0 1 1 1 1 0 ...
##  $ Assists           : num  0 0 0 0 1 0 1 0 1 2 ...
##  $ Minutes_Played    : num  1380 1350 0 180 670 ...
##  $ Team              : chr  "melbourneheartfc" "melbourneheartfc" "melbourneheartfc" "melbourneheartfc" ...
##  $ BirthYear         : num  1997 1990 2000 1995 1991 ...
##  $ BirthMonth        : num  12 10 7 5 4 6 4 3 3 10 ...
##  $ BirthDay          : num  24 2 4 7 2 13 10 15 23 13 ...
##  $ Height            : num  190 189 181 192 183 ...
##  $ Foot              : 'impute' chr  "unknown" "right" "unknown" "right" ...
##   ..- attr(*, "imputed")= int  1 3 8 11 12 13 19 20 21 22 ...
##  $ BirthMonthQuartile: int  4 4 3 2 2 2 2 1 1 4 ...
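
Equal-width binning of values ranging from 1 to 12 into four bins puts the cut points at 3.75, 6.5 and 9.25, so each bin holds three consecutive months, provided both January and December occur in the data. Under that assumption, the same quartiles can be sketched with plain arithmetic that does not depend on the observed range:

```r
# Birth months 1-12 mapped to quartiles: Jan-Mar = 1, Apr-Jun = 2, Jul-Sep = 3, Oct-Dec = 4
birth_month <- 1:12
birth_quartile <- ceiling(birth_month / 3)

print(rbind(birth_month, birth_quartile))
```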

Analysis - Birth Month Frequencies

The BirthMonth variable is presented below in a frequency distribution and bar chart.
The data is displayed as percentages, 100 × f/n, where f = the count or frequency of a value, and n = sample size.

MonthCol = c("#D7191C","#D7191C","#D7191C","#FDAE61","#FDAE61","#FDAE61","#ABDDA4","#ABDDA4","#ABDDA4","#2B83BA","#2B83BA","#2B83BA")
BirthMonth <- team$BirthMonth %>% table() %>% prop.table()*100
knitr::kable(round(t(BirthMonth),2))
    1     2      3     4     5     6     7     8     9    10    11    12
13.57  8.26  11.21  9.73  6.19  8.55  7.67  7.96  6.78  8.26  5.31  6.49
BirthMonth %>% barplot(main = "Birth Month - Percentage",ylab="Percent", ylim=c(0,15), col = MonthCol)
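
As a worked instance of the f/n calculation, January accounts for 46 of the 339 players in the sample (the count appears in the hypothesis-testing output), which gives the first entry of the table:

\[\frac{f}{n} \times 100 = \frac{46}{339} \times 100 \approx 13.57\]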

Analysis - Birth Quartile Frequencies

The BirthQuartile variable is presented below in a frequency distribution and bar chart.
The data is displayed as percentages, 100 × f/n, where f = the count or frequency of a value, and n = sample size.

BirthQuartile <- team$BirthMonthQuartile %>% table() %>% prop.table()*100
knitr::kable(round(t(BirthQuartile),2))
    1      2      3      4
33.04  24.48  22.42  20.06
library(RColorBrewer)   # brewer.pal()
BirthQuartile %>% barplot(main = "Birth Quartile - Percentage",ylab="Percent", ylim=c(0,35), col = (brewer.pal(4,"Spectral")))

Analysis - Birth Quartile by Position Frequencies

The BirthQuartile variable is presented below in a bar chart, grouped by Position. The data is displayed as percentages within each position, 100 × f/n, where f = the count or frequency of a value, and n = the number of players in that position.

positiontable <- table(team$BirthMonthQuartile,team$Position) %>% prop.table(margin = 2)*100

positiontable %>% barplot(main = "Birth Quartile by Position", ylab="Percent",
                    ylim=c(0,50), legend=rownames(positiontable), beside=TRUE,
                    col = (brewer.pal(4,"Spectral")),
                    args.legend=c(x = "topright", horiz=TRUE, title="Birth Quartile"),
                    xlab="Position")
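
Because `prop.table(margin = 2)` normalises within each column, the percentages for each position sum to 100. Bars are therefore comparable within a position, but positions with few players can show large swings. A toy illustration with made-up players:

```r
# Made-up players: birth quartile and position (not taken from the dataset)
toy <- data.frame(
  quartile = c(1, 1, 2, 4, 1, 2, 3, 4),
  position = c("GK", "GK", "GK", "GK", "CDM", "CDM", "CDM", "CDM")
)

# Rows are quartiles, columns are positions
counts <- table(toy$quartile, toy$position)

# margin = 2 divides each column by its own total, giving within-position percentages
pct <- prop.table(counts, margin = 2) * 100

print(pct)
print(colSums(pct))
```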

Analysis - Summary Statistics

The Minutes_Played variable, grouped by BirthMonthQuartile and by BirthMonth, is summarised in tables that include the min, Q1, median, Q3, max, mean, standard deviation, n and missing value count.

team %>% group_by(BirthMonthQuartile) %>% summarise(Min = min(Minutes_Played,na.rm = TRUE),
                                         Q1 = quantile(Minutes_Played,probs = .25,na.rm = TRUE),
                                         Median = median(Minutes_Played, na.rm = TRUE),
                                         Q3 = quantile(Minutes_Played,probs = .75,na.rm = TRUE),
                                         Max = max(Minutes_Played,na.rm = TRUE),
                                         Mean = mean(Minutes_Played, na.rm = TRUE),
                                         SD = sd(Minutes_Played, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Minutes_Played)))
team %>% group_by(BirthMonth) %>% summarise(Min = min(Minutes_Played,na.rm = TRUE),
                                         Q1 = quantile(Minutes_Played,probs = .25,na.rm = TRUE),
                                         Median = median(Minutes_Played, na.rm = TRUE),
                                         Q3 = quantile(Minutes_Played,probs = .75,na.rm = TRUE),
                                         Max = max(Minutes_Played,na.rm = TRUE),
                                         Mean = mean(Minutes_Played, na.rm = TRUE),
                                         SD = sd(Minutes_Played, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Minutes_Played)))

Analysis - Minutes Played

Box plots have been used to depict the distribution of Minutes Played by both the Birth Quartile and Birth Month variables.

par(mfrow=c(1,2))
team %>% boxplot(Minutes_Played ~ BirthMonthQuartile,data = ., main="Minutes Played by Birth Quartile", 
                     ylab="Minutes Played", xlab="Birth Quartile", col = (brewer.pal(4,"Spectral")))

team %>% boxplot(Minutes_Played ~ BirthMonth,data = ., main="Minutes Played by Birth Month", 
                     ylab="Minutes Played", xlab="Birth Month", col = MonthCol)

Hypothesis Testing - Birth Quartiles

H0: The population distribution of professional footballers in Australia by birth quartile is 25% Q1, 25% Q2, 25% Q3, 25% Q4.
HA: The population distribution of professional footballers in Australia by birth quartile is NOT 25% Q1, 25% Q2, 25% Q3, 25% Q4.

Chi-Square Goodness of Fit Test \[\chi^2 = \sum \frac{(\mathrm{Obs} - \mathrm{Exp})^2}{\mathrm{Exp}}\]

table(team$BirthMonthQuartile) %>% prop.table()
## 
##         1         2         3         4 
## 0.3303835 0.2448378 0.2241888 0.2005900
population_prop <- c(0.25,0.25,0.25,0.25)
chi1<-chisq.test(table(team$BirthMonthQuartile), p = population_prop)
chi1
## 
##  Chi-squared test for given probabilities
## 
## data:  table(team$BirthMonthQuartile)
## X-squared = 13.012, df = 3, p-value = 0.004611
chi1$observed
## 
##   1   2   3   4 
## 112  83  76  68
chi1$expected
##     1     2     3     4 
## 84.75 84.75 84.75 84.75
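
As a cross-check on the `chisq.test` output, the statistic can be rebuilt by hand from the observed counts and the 25% null proportions. This is only a sketch of the formula above applied to the reported counts; it does not rerun the scrape.

```r
# Observed counts per birth quartile, copied from the output above
observed <- c(Q1 = 112, Q2 = 83, Q3 = 76, Q4 = 68)

# Under H0 every quartile is expected to hold 25% of the 339 players
expected <- sum(observed) * rep(0.25, 4)

# Chi-square statistic: sum of (Obs - Exp)^2 / Exp
chi_sq <- sum((observed - expected)^2 / expected)

# Upper-tail p-value with (number of categories - 1) degrees of freedom
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

print(round(chi_sq, 3))
print(signif(p_value, 4))
```

Most of the statistic comes from the first quartile, whose count of 112 sits well above the expected 84.75.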

Hypothesis Testing - Birth Months

H0: The population distribution of professional footballers in Australia by birth month is 8% Jan, 8% Feb, 9% Mar, 8% Apr, 8% May, 8% Jun, 9% Jul, 8% Aug, 9% Sep, 9% Oct, 8% Nov, 8% Dec.
HA: The population distribution of professional footballers in Australia by birth month is NOT 8% Jan, 8% Feb, 9% Mar, 8% Apr, 8% May, 8% Jun, 9% Jul, 8% Aug, 9% Sep, 9% Oct, 8% Nov, 8% Dec.

Chi-Square Goodness of Fit Test \[\chi^2 = \sum \frac{(\mathrm{Obs} - \mathrm{Exp})^2}{\mathrm{Exp}}\]

table(team$BirthMonth) %>% prop.table()
## 
##          1          2          3          4          5          6          7 
## 0.13569322 0.08259587 0.11209440 0.09734513 0.06194690 0.08554572 0.07669617 
##          8          9         10         11         12 
## 0.07964602 0.06784661 0.08259587 0.05309735 0.06489676
monthpopulation_prop <- c(0.08,0.08,0.09,0.08,0.08,0.08,0.09,0.08,0.09,0.09,0.08,0.08)
monthchi<-chisq.test(table(team$BirthMonth), p = monthpopulation_prop)
monthchi
## 
##  Chi-squared test for given probabilities
## 
## data:  table(team$BirthMonth)
## X-squared = 24.553, df = 11, p-value = 0.01059
monthchi$observed
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 46 28 38 33 21 29 26 27 23 28 18 22
monthchi$expected
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 27.12 27.12 30.51 27.12 27.12 27.12 30.51 27.12 30.51 30.51 27.12 27.12
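
To see which months drive the result, each month's term in the chi-square sum can be computed separately. The sketch below uses only the observed counts and null proportions reported above; the month labels are added for readability.

```r
# Observed counts and hypothesised proportions, copied from the output above
observed <- c(Jan = 46, Feb = 28, Mar = 38, Apr = 33, May = 21, Jun = 29,
              Jul = 26, Aug = 27, Sep = 23, Oct = 28, Nov = 18, Dec = 22)
prop <- c(0.08, 0.08, 0.09, 0.08, 0.08, 0.08, 0.09, 0.08, 0.09, 0.09, 0.08, 0.08)
expected <- sum(observed) * prop

# Each month's term in the chi-square sum
contribution <- (observed - expected)^2 / expected
print(round(sort(contribution, decreasing = TRUE), 2))

# Share of the total statistic that comes from January
jan_share <- contribution[["Jan"]] / sum(contribution)
print(round(jan_share, 2))
```

In this sample January alone contributes a little over half of the statistic, followed by November, where fewer players were born than expected.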

Discussion

The RAE has been found across several sports from the 1980s to 2020, but there is limited research on professional football in Australia. This study sought to determine whether there was statistically significant evidence to reject the null hypothesis that the RAE was not present, that is, that the population of players follows the normal birth distribution.

Birth Quartile: A Chi-square goodness of fit test was used to determine whether the distribution of professional footballers in Australia followed the normal birth quartile distribution of 25% Q1, 25% Q2, 25% Q3 and 25% Q4. The test was statistically significant, χ2 = 13.012, df = 3, p < 0.01. This suggests that the distribution of professional footballers in Australia does not follow the normal birth quartile distribution.

Birth Month: A Chi-square goodness of fit test was used to determine whether the distribution of professional footballers in Australia followed the normal birth month distribution of 8% Jan, 8% Feb, 9% Mar, 8% Apr, 8% May, 8% Jun, 9% Jul, 8% Aug, 9% Sep, 9% Oct, 8% Nov, 8% Dec. The test was statistically significant, χ2 = 24.553, df = 11, p < 0.05 (p = 0.011). This suggests that the distribution of professional footballers in Australia does not follow the normal birth month distribution.

These results support similar studies throughout professional football in Europe (Kelly et al., 2019; Romann & Fuchslocher, 2013; Salinero et al., 2013).

An interesting observation was seen in the distribution by position: the GK position appears to follow the normal birth distribution, whilst the CDM position is heavily skewed to the right. Further investigation should be made into whether these positions have different methods of talent identification and youth development pathways that could explain the difference.

Additionally, we note that whilst the frequency distribution of players skews to the right, highlighting the RAE, this does not occur for the distribution of Minutes Played. So whilst professional footballers born in the first three months of the year may have an advantage in being identified as talented athletes and selected in a team's squad, this does not necessarily carry over to selection for games. Further investigation across other competitions and sports may provide insights into whether the RAE occurs in playing minutes or only in squad selection.

References

Atar, Ö., Özen, G., & Koç, H. (2019). Analysis of relative age effect in muscular strength of adolescent swimmers. Pedagogika, Psikhologii͡a︡ i Medico-Biologicheskie Problemy Fizicheskogo Vospitanii͡a︡ i Sporta, 23(5), 214–218. https://doi.org/10.15561/18189172.2019.0501

Barnsley, R.H., Thompson, A.H., & Barnsley, P.E. (1985). Hockey success and birthdate: the relative age effect. Canadian Association for Health, Physical Education and Recreation, 51(8), 23-80.

Bezuglov, E., Shvets, E., Lyubushkina, A., Lazarev, A., Valova, Y., Zholinsky, A., & Waśkiewicz, Z. (2020). Relative Age Effect in Russian Elite Hockey. Journal of Strength and Conditioning Research, 34(9), 2522–2527. https://doi.org/10.1519/JSC.0000000000003687

Bjørndal, C.T., Luteberget, L.S., Till, K., & Holm, S. (2018). The relative age effect in selection to international team matches in Norwegian handball. PloS One, 13(12), e0209288. https://doi.org/10.1371/journal.pone.0209288

Helsen, W.F., van Winckel, J., & Williams, A.M. (2005). The relative age effect in youth soccer across Europe. Journal of Sports Sciences, 23(6), 629–636. https://doi.org/10.1080/02640410400021310

Kelly, A.L., Wilson, M.R., Gough, L.A., Knapman, H., Morgan, P., Cole, M., Jackson, D.T., & Williams, C.A. (2019). A longitudinal investigation into the relative age effect in an English professional football club: exploring the ‘underdog hypothesis’. Science and Medicine in Football, 4(2), 1–8. https://doi.org/10.1080/24733938.2019.1694169

López de Subijana, C., & Lorenzo, J. (2018). Relative Age Effect and Long-Term Success in the Spanish Soccer and Basketball National Teams. Journal of Human Kinetics, 65(1), 197–204. https://doi.org/10.2478/hukin-2018-0027

Romann, M., & Fuchslocher, J. (2013). Relative age effects in Swiss junior soccer and their relationship with playing position. European Journal of Sport Science, 13(4), 356–363. https://doi.org/10.1080/17461391.2011.635699

Rubajczyk, K., & Rokita, A. (2018). The Relative Age Effect in Poland's Elite Youth Soccer Players. Journal of Human Kinetics, 64(1), 265–273. https://doi.org/10.1515/hukin-2017-0200

Salinero, J.J., Pérez, B., Burillo, P., & Lesma, M.L. (2013). Relative age effect in European professional football. Analysis by position. Journal of Human Sport and Exercise, 8(4), 966–973. https://doi.org/10.4100/jhse.2013.84.07

Wrang, C.M., Rossing, N.N., Diernæs, R.M., Hansen, C.G., Dalgaard-Hansen, C., & Karbing, D.S. (2018). Relative Age Effect and the Re-Selection of Danish Male Handball Players for National Teams. Journal of Human Kinetics, 63(1), 33–41. https://doi.org/10.2478/hukin-2018-0004

Yagüe, J.M., de la Rubia, A., Sánchez-Molina, J., Maroto-Izquierdo, S., & Molinero, O. (2018). The Relative Age Effect in the 10 Best Leagues of Male Professional Football of the Union of European Football Associations (UEFA). Journal of Sports Science & Medicine, 17(3), 409–416.