The 2016 UEFA European Championship and the CONCACAF/CONMEBOL Copa América Centenario both start this month. Each country picks a squad of 23 players for the tournament. This number always makes me think of the Birthday Paradox, because 23 is the smallest group size for which it is more likely than not (about 50.7%) that at least two people share a birthday.
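The 23-person figure can be checked directly. Below is a minimal sketch, assuming 365 equally likely birthdays and ignoring leap years; `p_shared` is a helper name introduced here for illustration, while `qbirthday` ships with base R's stats package.

```r
# Probability that at least two of n people share a birthday,
# assuming `days` equally likely birthdays (an idealisation).
p_shared <- function(n, days = 365) {
  1 - prod((days - seq_len(n) + 1) / days)
}

p_shared(22)  # just under one half
p_shared(23)  # just over one half

# Smallest group size with at least a 50% chance of a shared birthday
stats::qbirthday(0.5)
```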

The probability that at least two people share a birthday, as a function of group size, is shown below (image from Wikipedia):

 

There are 24 teams taking part in Euro 2016 and 16 playing in the Copa América Centenario, making 40 teams in total. Each has 23 players. If you had to guess, how many teams would you think have at least two players who share a birthday? You would probably guess 20 teams (50% of 40 teams). Could you make an argument for more or fewer than this?

Let’s take a look at the data to answer this question. Fortunately, all the birth dates of these squads are on Wikipedia and we can quickly get the information:

 

library(dplyr)
library(rvest)

eurourl = "https://en.wikipedia.org/wiki/UEFA_Euro_2016_squads"
amerurl = "https://en.wikipedia.org/wiki/Copa_Am%C3%A9rica_Centenario_squads"

eurosquads <- read_html(eurourl, encoding = "UTF-8") %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table') %>%
  html_table(fill=T)

amersquads <- read_html(amerurl, encoding = "UTF-8") %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table') %>%
  html_table(fill=T)


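# Pull the ISO birth date out of column 4 of one squad table
fun1 = function(df){
j <- df[,4]
j <- j[j != ""]   # drop empty cells
# keep the first parenthesised chunk of each cell, e.g. "1986-12-26"
unlist(lapply(regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl=TRUE)),function(x) x[[1]]))
}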

eurodobs <- lapply(eurosquads[1:24], fun1)
amerdobs <- lapply(amersquads[1:16], fun1)

names(eurodobs) = c('France','Romania','Albania','Switzerland','England','Russia','Wales','Slovakia','Germany','Ukraine','Poland','Northern Ireland','Spain','Czech Republic','Turkey','Croatia','Belgium','Italy','Republic of Ireland','Sweden','Portugal','Iceland','Austria','Hungary')

names(amerdobs)=c("Colombia","Costa Rica","Paraguay","United States","Brazil","Ecuador","Haiti","Peru","Jamaica","Mexico","Uruguay","Venezuela","Argentina","Bolivia","Chile","Panama")

eurodf = data.frame(dob = do.call('rbind', (lapply(eurodobs,cbind))))
amerdf = data.frame(dob = do.call('rbind', (lapply(amerdobs,cbind))))
eurodf$team = rep(names(eurodobs), lengths(eurodobs))  # one label per player, without hard-coding 23
amerdf$team = rep(names(amerdobs), lengths(amerdobs))

df = rbind(eurodf,amerdf)
df$name = c(unlist(lapply(eurosquads[1:24], function(x) x[,3])),unlist(lapply(amersquads[1:16], function(x) x[,3])))
df$name = ifelse(df$team=="Romania"|df$team=="Iceland", df$name, iconv(df$name, "UTF-8", "LATIN2")) 

df$dob = as.Date(df$dob,format="%Y-%m-%d")
df$month = lubridate::month(df$dob)
df$day = lubridate::day(df$dob)

head(df)
##          dob   team                  name month day
## 1 1986-12-26 France Hugo Lloris (captain)    12  26
## 2 1983-10-31 France     Christophe Jallet    10  31
## 3 1981-05-15 France          Patrice Evra     5  15
## 4 1985-12-27 France             Adil Rami    12  27
## 5 1991-03-29 France          N'Golo Kanté     3  29
## 6 1986-01-14 France          Yohan Cabaye     1  14
tail(df)
##            dob   team                   name month day
## 915 1985-03-10 Panama       Ricardo Buitrago     3  10
## 916 1987-12-18 Panama       Alberto Quintero    12  18
## 917 1990-02-10 Panama           Aníbal Godoy     2  10
## 918 1983-08-02 Panama      Amílcar Henríquez     8   2
## 919 1985-08-14 Panama          José Calderón     8  14
## 920 1981-02-24 Panama Felipe Baloy (Captain)     2  24
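To see what the regular expression in `fun1` is doing, it can be tried on a single cell. This is a small sketch; the `cell` string is an assumed example of how a date-of-birth cell might look once scraped, not a value copied from the tables.

```r
# Assumed example of a scraped date-of-birth cell
cell <- "(1986-12-26)26 December 1986 (aged 29)"

# Lookbehind for "(" and lookahead for ")" with a lazy match in between,
# so each match is the text inside one pair of parentheses.
matches <- regmatches(cell, gregexpr("(?<=\\().*?(?=\\))", cell, perl = TRUE))[[1]]

matches               # every parenthesised chunk in the cell
dob <- matches[1]     # the first chunk is the ISO date
as.Date(dob, format = "%Y-%m-%d")
```

Taking only the first match is what lets the later `as.Date` call work without further cleaning.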

 

You can see from the above that it’s fairly easy to get the names and birth dates of every player. I didn’t bother too much with cleaning up the encoding of the various characters in players’ names or removing the “(captain)” label that identifies the team captains - they are readable enough for now.

Next we can find within each team whether there are any matching birthdays (ignoring year of birth). There are various ways of doing this, but I am just going to make a new variable containing month and day of birth and then use dplyr to count how often each month-day combination occurs within each team. We will then keep only the combinations that occur more than once and join the players’ names back in:

 

df = df %>% mutate(monthday = paste(month,day)) 

bdaymatches = df %>% group_by(team,monthday) %>% tally %>% filter(n>1) %>% data.frame()
bdaymatches <- bdaymatches %>% left_join(df, by = c("team", "monthday"))

bdaymatches %>% select(1,3,4,5)
##                   team n        dob                         name
## 1              Albania 2 1977-09-25                  Orges Shehi
## 2              Albania 2 1993-09-25                 Arlind Ajeti
## 3            Argentina 2 1987-02-10            Facundo Roncaglia
## 4            Argentina 2 1986-02-10                Nahuel Guzmán
## 5              Belgium 2 1991-06-28              Kevin De Bruyne
## 6              Belgium 2 1995-06-28                Jason Denayer
## 7              Bolivia 2 1983-04-22               Nelson Cabrera
## 8              Bolivia 2 1986-04-22               Wálter Veizaga
## 9              Bolivia 2 1988-05-10              Jhasmani Campos
## 10             Bolivia 2 1984-05-10      Martin Smedberg-Dalence
## 11              Brazil 2 1987-06-12                          Gil
## 12              Brazil 2 1992-06-12            Philippe Coutinho
## 13              Brazil 2 1985-08-09                  Filipe Luís
## 14              Brazil 2 1988-08-09                      Willian
## 15            Colombia 2 1986-09-30              Cristián Zapata
## 16            Colombia 2 1978-09-30              Róbinson Zapata
## 17             Croatia 2 1994-05-06                Mateo Kovačić
## 18             Croatia 2 1995-05-06                  Marko Pjaca
## 19      Czech Republic 2 1989-03-29                 Tomáš Vaclík
## 20      Czech Republic 2 1988-03-29                  Marek Suchý
## 21             England 2 1990-05-28                  Kyle Walker
## 22             England 2 1994-05-28                  John Stones
## 23              France 2 1985-12-05          André-Pierre Gignac
## 24              France 2 1995-12-05              Anthony Martial
## 25              France 2 1991-03-29                 N'Golo Kanté
## 26              France 2 1987-03-29                Dimitri Payet
## 27             Hungary 2 1994-05-06                Barnabás Bese
## 28             Hungary 2 1990-05-06                Péter Gulácsi
## 29             Iceland 2 1986-06-19           Ragnar Sigurðsson
## 30             Iceland 2 1989-06-19         Ögmundur Kristinsson
## 31    Northern Ireland 2 1984-07-12             Michael McGovern
## 32    Northern Ireland 2 1991-07-12               Shane Ferguson
## 33            Paraguay 2 1977-06-30                 Justo Villar
## 34            Paraguay 2 1990-06-30                Dario Lezcano
## 35              Poland 2 1990-04-18            Wojciech Szczęsny
## 36              Poland 2 1985-04-18             Łukasz Fabiański
## 37            Portugal 2 1990-10-01                Anthony Lopes
## 38            Portugal 2 1983-10-01                       Eliseu
## 39            Portugal 3 1983-12-22                   José Fonte
## 40            Portugal 3 1993-12-22            Raphaël Guerreiro
## 41            Portugal 3 1987-12-22                         Éder
## 42 Republic of Ireland 2 1986-04-04                Aiden McGeady
## 43 Republic of Ireland 2 1986-04-04                Stephen Quinn
## 44              Russia 2 1987-01-27               Roman Shishkin
## 45              Russia 2 1987-01-27              Denis Glushakov
## 46              Russia 2 1982-06-20           Aleksei Berezutski
## 47              Russia 2 1982-06-20            Vasili Berezutski
## 48            Slovakia 2 1982-12-05                    Ján Mucha
## 49            Slovakia 2 1994-12-05                  Ondrej Duda
## 50               Spain 2 1992-01-08                         Koke
## 51               Spain 2 1986-01-08                  David Silva
## 52              Sweden 2 1990-01-08                  Robin Olsen
## 53              Sweden 2 1992-01-08              Patrik Carlgren
## 54              Sweden 2 1981-10-03             Andreas Isaksson
## 55              Sweden 2 1981-10-03 Zlatan Ibrahimović (captain)
## 56              Turkey 2 1991-02-24                   Semih Kaya
## 57              Turkey 2 1992-02-24                  Yunus Malli
## 58              Turkey 2 1983-03-23                  Hakan Balta
## 59              Turkey 2 1995-03-23                   Ozan Tufan
## 60       United States 2 1993-01-28                  John Brooks
## 61       United States 2 1983-01-28            Chris Wondolowski
## 62             Uruguay 2 1984-12-02               Carlos Sánchez
## 63             Uruguay 2 1990-12-02               Gastón Ramírez
## 64           Venezuela 2 1984-05-31          Oswaldo Vizcarrondo
## 65           Venezuela 2 1997-05-31          Adalberto Penaranda
## 66               Wales 2 1989-01-23                James Chester
## 67               Wales 2 1987-01-23                   Joe Ledley
## 68               Wales 2 1984-08-23    Ashley Williams (captain)
## 69               Wales 2 1983-08-23                James Collins

 

Here are all the players and teams - this equates to 26/40 teams having players with the same birthday. There are actually 33 different dates where two teammates share a birthday and one date where three teammates share a birthday - José Fonte, Raphaël Guerreiro and Éder of Portugal.

 

unique(bdaymatches$team) #26/40
##  [1] "Albania"             "Argentina"           "Belgium"            
##  [4] "Bolivia"             "Brazil"              "Colombia"           
##  [7] "Croatia"             "Czech Republic"      "England"            
## [10] "France"              "Hungary"             "Iceland"            
## [13] "Northern Ireland"    "Paraguay"            "Poland"             
## [16] "Portugal"            "Republic of Ireland" "Russia"             
## [19] "Slovakia"            "Spain"               "Sweden"             
## [22] "Turkey"              "United States"       "Uruguay"            
## [25] "Venezuela"           "Wales"

 

26/40 is not a huge deviation from the expected 20, but it is interesting that more than 20 teams have at least two players sharing a birthday, and that there are 34 separate instances of a birthdate being shared by teammates.
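How unusual is 26 out of 40? A rough sketch, assuming each team independently has the textbook probability of a shared birthday (uniform birthdays over 365 days), treats the number of such teams as binomial. The variable names are introduced here for illustration.

```r
# Probability that a 23-player squad contains a shared birthday,
# assuming 365 equally likely birthdays.
p_team <- 1 - prod((365:343) / 365)

n_teams <- 40

# Expected number of teams with a shared birthday
expected_teams <- n_teams * p_team

# Chance of seeing 26 or more such teams under the binomial model
p_26_or_more <- 1 - pbinom(25, size = n_teams, prob = p_team)

expected_teams
p_26_or_more
```

Under these assumptions the expected count is a little over 20, and the tail probability gives a sense of how far into the upper tail 26 sits.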

So what reasons could there be for this?

Looking at the data, there is one obvious reason - there is one set of twins in the Russian team - Aleksei and Vasili Berezutski!

Another thing pops out too - a lot of these players have birthdays early in the year. In fact, 46 of the 69 players were born in June or earlier:

 

table(lubridate::month(bdaymatches$dob))
## 
##  1  2  3  4  5  6  7  8  9 10 12 
## 10  4  6  6 10 10  2  4  4  4  9
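Because teammates who share a birthday necessarily share a month, the 69 players are not independent observations. Counting shared dates instead gives 23 of the 34 instances falling in January to June. A hedged sketch of an exact binomial test follows, assuming dates would otherwise fall in the first half of the year with probability 181/365; the sample is small, so the result is only indicative.

```r
shared_dates_total   <- 34   # 33 pairs plus one triple
shared_dates_jan_jun <- 23   # 46 players in Jan-Jun, all of them in pairs

# Share of the (non-leap) year covered by January to June
p_first_half <- 181 / 365

res <- binom.test(shared_dates_jan_jun, shared_dates_total, p = p_first_half)

res$estimate   # observed share of shared dates in Jan-Jun
res$p.value    # two-sided p-value against p_first_half
```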

 

It’s a well-known phenomenon that professional soccer players are more likely to be born early in the year. The major reason for this is that talent scouts often pick the bigger kids at younger ages, and these tend to be the ones born earlier.
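A skew like this matters for the birthday problem, because any departure from uniform birthdays makes shared birthdays more likely. The simulation below is only a sketch: the 2:1 weighting of the first half of the year is a made-up assumption chosen to illustrate the effect, not an estimate from the data, and `share_prob` is a helper introduced here.

```r
set.seed(2016)

# Estimate the chance that a squad has at least one shared birthday,
# given relative weights for each day of the year.
share_prob <- function(day_weights, squad_size = 23, n_sims = 20000) {
  hits <- replicate(n_sims, {
    days <- sample(seq_along(day_weights), squad_size,
                   replace = TRUE, prob = day_weights)
    anyDuplicated(days) > 0
  })
  mean(hits)
}

uniform_weights <- rep(1, 365)
# Hypothetical skew: days in Jan-Jun are twice as likely as days in Jul-Dec
skewed_weights  <- c(rep(2, 181), rep(1, 184))

p_uniform <- share_prob(uniform_weights)
p_skewed  <- share_prob(skewed_weights)

c(uniform = p_uniform, skewed = p_skewed)
```

With a skew of this size the simulated probability rises by a few percentage points, which is one argument for guessing more than 20 teams.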

Here is the distribution of birthdates by month for all 40 teams:

 

library(ggplot2)
ggplot(df,aes(x=month)) + geom_histogram(binwidth=1) + theme_minimal() + scale_x_continuous(breaks=1:12,labels=month.abb) +
  xlab("")+ylab("Frequency") + ggtitle("Birth Dates by Month of UEFA & Copa America 2016 players")

 

We can look in a bit more detail at each team. First for the UEFA teams:

 

df %>% filter(team %in% names(eurodobs)) %>%
ggplot(.,aes(x=month)) + geom_histogram(binwidth=1) + theme_minimal() + scale_x_continuous(breaks=1:12) +
  xlab("")+ylab("Frequency") + ggtitle("Birth Dates by Month of UEFA 2016 players") +
  facet_wrap(~team)

 

...and here it is for the Copa América teams:

 

df %>% filter(team %in% names(amerdobs)) %>%
ggplot(.,aes(x=month)) + geom_histogram(binwidth=1) + theme_minimal() + scale_x_continuous(breaks=1:12) +
  xlab("")+ylab("Frequency") + ggtitle("Birth Dates by Month of Copa America 2016 players") +
  facet_wrap(~team)

 

Many teams have fairly uniform distributions, but others (Argentina! Bolivia! Austria!) have really skewed distributions.
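One rough way to put a number on "skewed" is the share of each squad born in January to June. Here is a minimal sketch on made-up data: the `toy` data frame and its team names are hypothetical, and with the real data the same call would use `df$month` and `df$team`.

```r
# Hypothetical toy data: two invented teams with six players each
toy <- data.frame(
  team  = rep(c("Team A", "Team B"), each = 6),
  month = c(1, 2, 2, 3, 5, 11,
            1, 4, 7, 8, 10, 12)
)

# Proportion of each team born in the first half of the year
first_half_share <- tapply(toy$month <= 6, toy$team, mean)

# Most skewed teams first
sort(first_half_share, decreasing = TRUE)
```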

Any questions - please contact me - jc3181 AT columbia DOT edu