The 2016 UEFA European and CONCACAF/CONMEBOL Championships start this month. Each country picks a squad of 23 players for the tournament. This number always makes me think of the Birthday Paradox, because 23 is the smallest group size for which there is a better than 50% chance that at least two people share a birthday.
The likelihood that at least two people in a group share a birthday, as a function of group size, is shown below (image from Wikipedia):
There are 24 teams taking part in Euro 2016 and 16 playing in the Copa América Centenario, making 40 teams in total. Each has 23 players. If you had to guess, how many teams would you think have two players who share a birthday? You would probably guess 20 teams (50% of 40 teams). Could you make an argument for more or fewer than this?
Let’s take a look at the data to answer this question. Fortunately, all the birthdates of these teams are on Wikipedia and we can quickly get the information:
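The same curve can be computed directly. This is a minimal sketch that assumes 365 equally likely birthdays (no leap years, no twins); `p_shared` and `group_sizes` are names made up for illustration, while `pbirthday()` is the built-in function from R's stats package.

```r
# Probability that at least two of n people share a birthday,
# assuming 365 equally likely birthdays (an idealisation).
p_shared <- function(n) {
  # 1 - P(all n birthdays are different)
  1 - prod((365 - seq_len(n) + 1) / 365)
}

group_sizes <- c(10, 20, 23, 30, 40, 50)
round(sapply(group_sizes, p_shared), 3)

# Base R's stats package has the same calculation built in.
pbirthday(23)
```

For a group of 23 this comes out at about 0.507, which is where the "50%" figure comes from; for 22 people it is still below one half.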
library(dplyr)
library(rvest)
eurourl = "https://en.wikipedia.org/wiki/UEFA_Euro_2016_squads"
amerurl = "https://en.wikipedia.org/wiki/Copa_Am%C3%A9rica_Centenario_squads"
eurosquads <- read_html(eurourl, encoding = "UTF-8") %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table') %>%
  html_table(fill=TRUE)
amersquads <- read_html(amerurl, encoding = "UTF-8") %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table') %>%
  html_table(fill=TRUE)
# Pull the birthdate string out of column 4 of one squad table
fun1 = function(df){
  j <- df[,4]
  j <- j[j != ""]  # drop empty cells
  # keep the first bracketed string in each cell (the YYYY-MM-DD birthdate)
  unlist(lapply(regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl=TRUE)),function(x) x[[1]]))
}
eurodobs <- lapply(eurosquads[1:24], fun1)
amerdobs <- lapply(amersquads[1:16], fun1)
names(eurodobs) = c('France','Romania','Albania','Switzerland','England','Russia','Wales','Slovakia','Germany','Ukraine','Poland','Northern Ireland','Spain','Czech Republic','Turkey','Croatia','Belgium','Italy','Republic of Ireland','Sweden','Portugal','Iceland','Austria','Hungary')
names(amerdobs)=c("Colombia","Costa Rica","Paraguay","United States","Brazil","Ecuador","Haiti","Peru","Jamaica","Mexico","Uruguay","Venezuela","Argentina","Bolivia","Chile","Panama")
eurodf = data.frame(dob = do.call('rbind', (lapply(eurodobs,cbind))))
amerdf = data.frame(dob = do.call('rbind', (lapply(amerdobs,cbind))))
eurodf$team = rep(names(eurodobs),each=23)
amerdf$team = rep(names(amerdobs),each=23)
df = rbind(eurodf,amerdf)
df$name = c(unlist(lapply(eurosquads[1:24], function(x) x[,3])),unlist(lapply(amersquads[1:16], function(x) x[,3])))
df$name = ifelse(df$team=="Romania"|df$team=="Iceland", df$name, iconv(df$name, "UTF-8", "LATIN2"))
df$dob = as.Date(df$dob,format="%Y-%m-%d")
df$month = lubridate::month(df$dob)
df$day = lubridate::day(df$dob)
head(df)
## dob team name month day
## 1 1986-12-26 France Hugo Lloris (captain) 12 26
## 2 1983-10-31 France Christophe Jallet 10 31
## 3 1981-05-15 France Patrice Evra 5 15
## 4 1985-12-27 France Adil Rami 12 27
## 5 1991-03-29 France N'Golo Kanté 3 29
## 6 1986-01-14 France Yohan Cabaye 1 14
tail(df)
## dob team name month day
## 915 1985-03-10 Panama Ricardo Buitrago 3 10
## 916 1987-12-18 Panama Alberto Quintero 12 18
## 917 1990-02-10 Panama Aníbal Godoy 2 10
## 918 1983-08-02 Panama Amílcar Henríquez 8 2
## 919 1985-08-14 Panama José Calderón 8 14
## 920 1981-02-24 Panama Felipe Baloy (Captain) 2 24
You can see from the above that it’s fairly easy to get the names and birth dates of every player. I didn’t bother too much with cleaning up the encoding of the various characters in players’ names or with removing the “(captain)” that identifies the team captains - they are readable enough for now.
Next we can find within each team whether there are any matching birthdates (ignoring year of birth). There are various ways of doing this, but I am just going to make a new variable containing month and day of birth and then use dplyr to count how many players on each team have each birthdate. We will then keep only the birthdates that occur more than once within a team and join the names of the players back in:
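To make the extraction step less opaque, here is a small sketch of what the regular expression in `fun1` does to a single cell. The cell text is hypothetical, modelled on how the date-of-birth column appears to look once the table has been flattened to text (an assumption, not scraped data), and `extract_dob` is a name made up for this example.

```r
# Hypothetical cell contents (an assumption about the flattened table text).
cell <- c("(1986-12-26)26 December 1986 (aged 29)",
          "",
          "(1983-10-31)31 October 1983 (aged 32)")

extract_dob <- function(x) {
  x <- x[x != ""]  # drop empty cells
  # lazily match whatever sits between "(" and the next ")"
  hits <- gregexpr("(?<=\\().*?(?=\\))", x, perl = TRUE)
  # each cell yields two bracketed strings; keep the first one
  vapply(regmatches(x, hits), function(m) m[[1]], character(1))
}

extract_dob(cell)
as.Date(extract_dob(cell))
```

The lookbehind and lookahead keep the parentheses themselves out of the match, so the result can go straight into `as.Date()`.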
df = df %>% mutate(monthday = paste(month,day))
bdaymatches = df %>% group_by(team,monthday) %>% tally() %>% filter(n>1) %>% data.frame()
# join on the grouping columns explicitly to bring back dob and name
bdaymatches <- bdaymatches %>% left_join(df, by = c("team","monthday"))
bdaymatches %>% select(1,3,4,5)
## team n dob name
## 1 Albania 2 1977-09-25 Orges Shehi
## 2 Albania 2 1993-09-25 Arlind Ajeti
## 3 Argentina 2 1987-02-10 Facundo Roncaglia
## 4 Argentina 2 1986-02-10 Nahuel Guzmán
## 5 Belgium 2 1991-06-28 Kevin De Bruyne
## 6 Belgium 2 1995-06-28 Jason Denayer
## 7 Bolivia 2 1983-04-22 Nelson Cabrera
## 8 Bolivia 2 1986-04-22 Wálter Veizaga
## 9 Bolivia 2 1988-05-10 Jhasmani Campos
## 10 Bolivia 2 1984-05-10 Martin Smedberg-Dalence
## 11 Brazil 2 1987-06-12 Gil
## 12 Brazil 2 1992-06-12 Philippe Coutinho
## 13 Brazil 2 1985-08-09 Filipe Luís
## 14 Brazil 2 1988-08-09 Willian
## 15 Colombia 2 1986-09-30 Cristián Zapata
## 16 Colombia 2 1978-09-30 Róbinson Zapata
## 17 Croatia 2 1994-05-06 Mateo Kovaèiæ
## 18 Croatia 2 1995-05-06 Marko Pjaca
## 19 Czech Republic 2 1989-03-29 Tomá¹ Vaclík
## 20 Czech Republic 2 1988-03-29 Marek Suchý
## 21 England 2 1990-05-28 Kyle Walker
## 22 England 2 1994-05-28 John Stones
## 23 France 2 1985-12-05 André-Pierre Gignac
## 24 France 2 1995-12-05 Anthony Martial
## 25 France 2 1991-03-29 N'Golo Kanté
## 26 France 2 1987-03-29 Dimitri Payet
## 27 Hungary 2 1994-05-06 Barnabás Bese
## 28 Hungary 2 1990-05-06 Péter Gulácsi
## 29 Iceland 2 1986-06-19 Ragnar Sigurðsson
## 30 Iceland 2 1989-06-19 Ã<U+0096>gmundur Kristinsson
## 31 Northern Ireland 2 1984-07-12 Michael McGovern
## 32 Northern Ireland 2 1991-07-12 Shane Ferguson
## 33 Paraguay 2 1977-06-30 Justo Villar
## 34 Paraguay 2 1990-06-30 Dario Lezcano
## 35 Poland 2 1990-04-18 Wojciech Szczêsny
## 36 Poland 2 1985-04-18 £ukasz Fabiañski
## 37 Portugal 2 1990-10-01 Anthony Lopes
## 38 Portugal 2 1983-10-01 Eliseu
## 39 Portugal 3 1983-12-22 José Fonte
## 40 Portugal 3 1993-12-22 Raphaël Guerreiro
## 41 Portugal 3 1987-12-22 Éder
## 42 Republic of Ireland 2 1986-04-04 Aiden McGeady
## 43 Republic of Ireland 2 1986-04-04 Stephen Quinn
## 44 Russia 2 1987-01-27 Roman Shishkin
## 45 Russia 2 1987-01-27 Denis Glushakov
## 46 Russia 2 1982-06-20 Aleksei Berezutski
## 47 Russia 2 1982-06-20 Vasili Berezutski
## 48 Slovakia 2 1982-12-05 Ján Mucha
## 49 Slovakia 2 1994-12-05 Ondrej Duda
## 50 Spain 2 1992-01-08 Koke
## 51 Spain 2 1986-01-08 David Silva
## 52 Sweden 2 1990-01-08 Robin Olsen
## 53 Sweden 2 1992-01-08 Patrik Carlgren
## 54 Sweden 2 1981-10-03 Andreas Isaksson
## 55 Sweden 2 1981-10-03 Zlatan Ibrahimoviæ (captain)
## 56 Turkey 2 1991-02-24 Semih Kaya
## 57 Turkey 2 1992-02-24 Yunus Malli
## 58 Turkey 2 1983-03-23 Hakan Balta
## 59 Turkey 2 1995-03-23 Ozan Tufan
## 60 United States 2 1993-01-28 John Brooks
## 61 United States 2 1983-01-28 Chris Wondolowski
## 62 Uruguay 2 1984-12-02 Carlos Sánchez
## 63 Uruguay 2 1990-12-02 Gastón Ramírez
## 64 Venezuela 2 1984-05-31 Oswaldo Vizcarrondo
## 65 Venezuela 2 1997-05-31 Adalberto Penaranda
## 66 Wales 2 1989-01-23 James Chester
## 67 Wales 2 1987-01-23 Joe Ledley
## 68 Wales 2 1984-08-23 Ashley Williams (captain)
## 69 Wales 2 1983-08-23 James Collins
The table above lists every player who shares a birthday with a teammate. This equates to 26 of the 40 teams having players with the same birthday. There are actually 33 different dates where two teammates share a birthday and one date where three teammates share a birthday - José Fonte, Raphaël Guerreiro and Éder of Portugal were all born on 22 December.
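The group, count and filter steps used above can also be written in base R. The sketch below runs on a made-up six-player data frame (the teams, names and dates are all hypothetical) and is only meant to show the idea, not to replace the dplyr version.

```r
# Hypothetical toy data: two teams of three players each.
toy <- data.frame(
  team = c("A", "A", "A", "B", "B", "B"),
  name = c("p1", "p2", "p3", "p4", "p5", "p6"),
  dob  = as.Date(c("1990-05-06", "1994-05-06", "1991-07-12",
                   "1988-01-08", "1989-03-29", "1992-10-01"))
)

# Month and day only, so that the year of birth is ignored.
toy$monthday <- format(toy$dob, "%m-%d")

# For every row, count the players on the same team with the same monthday.
toy$n <- ave(seq_len(nrow(toy)), toy$team, toy$monthday, FUN = length)

# Keep only players who share a birthday with a teammate.
shared <- toy[toy$n > 1, ]
shared
```

Here only p1 and p2 of team A remain, because they are the only teammates born on the same day of the year.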
unique(bdaymatches$team) #26/40
## [1] "Albania" "Argentina" "Belgium"
## [4] "Bolivia" "Brazil" "Colombia"
## [7] "Croatia" "Czech Republic" "England"
## [10] "France" "Hungary" "Iceland"
## [13] "Northern Ireland" "Paraguay" "Poland"
## [16] "Portugal" "Republic of Ireland" "Russia"
## [19] "Slovakia" "Spain" "Sweden"
## [22] "Turkey" "United States" "Uruguay"
## [25] "Venezuela" "Wales"
26/40 is not a huge deviation from the expected 20, but it is interesting that more than half of the teams have two individuals sharing a birthday and that there are 34 separate instances of one birthdate being shared by teammates.
So what reasons could there be for this?
Looking at the data, there is one obvious contributor - there is one set of twins in the Russian team - Aleksei and Vasili Berezutski! (Russia would still count without them, though, as Roman Shishkin and Denis Glushakov also share a birthday.)
Something else stands out too - a lot of these players were born early in the year. In fact, 46 of the 69 players in the table were born in June or earlier:
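One way to put a number on "not a huge deviation" is to treat the 40 squads as independent trials, each with the textbook probability of a shared birthday. This is a rough sketch under that assumption (uniform birthdays, independent teams); the variable names are made up for illustration.

```r
# Chance that one 23-player squad contains a shared birthday,
# assuming 365 equally likely birthdays.
p_team  <- pbirthday(23)
n_teams <- 40

# Expected number of teams with a shared birthday, and its spread.
expected_teams <- n_teams * p_team
sd_teams <- sqrt(n_teams * p_team * (1 - p_team))

# Probability of seeing 26 or more such teams under this simple model.
p_26_or_more <- pbinom(25, size = n_teams, prob = p_team, lower.tail = FALSE)

c(expected = expected_teams, sd = sd_teams, p_26_or_more = p_26_or_more)
```

The expected count is a little over 20, so the baseline guess of 20 teams is about right under these assumptions; the tail probability shows how surprising 26 would be if birthdays really were uniform.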
table(lubridate::month(bdaymatches$dob))
##
## 1 2 3 4 5 6 7 8 9 10 12
## 10 4 6 6 10 10 2 4 4 4 9
It’s a well-known phenomenon, often called the relative age effect, that professional soccer players are more likely to be born early in the year. The main reason is that talent scouts often pick the bigger kids at younger ages, and those kids tend to be the ones born earlier in the year.
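A skew like this matters for the birthday question, because any departure from uniform birthdays makes matches more likely. The simulation below is only a sketch: the 2:1 weighting towards the first half of the year is made up and deliberately exaggerated, not estimated from the squad data, and `has_shared` is a hypothetical helper.

```r
set.seed(2016)  # arbitrary seed so the sketch is reproducible

# TRUE if any birthday is repeated within one simulated squad.
has_shared <- function(squad_size, day_weights) {
  days <- sample(365, squad_size, replace = TRUE, prob = day_weights)
  anyDuplicated(days) > 0
}

# Every day equally likely.
uniform_weights <- rep(1, 365)
# Made-up skew: days in the first half of the year are twice as likely.
skewed_weights <- rep(c(2, 1), times = c(182, 183))

n_sims <- 50000
p_uniform <- mean(replicate(n_sims, has_shared(23, uniform_weights)))
p_skewed  <- mean(replicate(n_sims, has_shared(23, skewed_weights)))

c(uniform = p_uniform, skewed = p_skewed)
```

The uniform case should land close to the textbook 50.7%, and the skewed case somewhat higher, which is the direction needed to explain more than 20 teams with a shared birthday.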
Here is the distribution of birthdates by month for all 40 teams:
library(ggplot2)
ggplot(df,aes(x=month)) + geom_histogram() + theme_minimal() + scale_x_continuous(breaks=1:12,labels=month.abb) +
xlab("")+ylab("Frequency") + ggtitle("Birth Dates by Month of UEFA & Copa America 2016 players")
We can look in a bit more detail at each team. First for the UEFA teams:
df %>% filter(team %in% names(eurodobs)) %>%
ggplot(.,aes(x=month)) + geom_histogram() + theme_minimal() + scale_x_continuous(breaks=1:12) +
xlab("")+ylab("Frequency") + ggtitle("Birth Dates by Month of UEFA 2016 players") +
facet_wrap(~team)
...and here it is for the Copa America teams:
df %>% filter(team %in% names(amerdobs)) %>%
ggplot(.,aes(x=month)) + geom_histogram() + theme_minimal() + scale_x_continuous(breaks=1:12) +
xlab("")+ylab("Frequency") + ggtitle("Birth Dates by Month of Copa America 2016 players") +
facet_wrap(~team)
Many teams have fairly uniform distributions, but others (Argentina! Bolivia! Austria!) have really skewed distributions.
Any questions - please contact me - jc3181 AT columbia DOT edu