Introduction

While watching a particular game in the ongoing Euro 2021 competition, a discussion between commentators about the average age of a particular team really got me curious. I immediately began to wonder what the age distribution of players on teams at the competition was like. I decided to go online and search for data on player ages, which in-turn led to an entire data analysis project.

In this project, I’ll be analyzing data such as age, number of caps, number of goals, etc for players at the Euro 2021 competition.

Scraping The Data

I decided to try something new this time by web scraping the data, as opposed to just downloading a ready made dataset. This was my first time trying web scraping so I’m not able to give detailed explanations on the web scraping process. However 2 helpful resources that do just that are the UC Business Analytics R Programming Guide and Dataslice.

To scrape the data, I made use of 2 popular R libraries, tidyverse and rvest.

# Load libraries
library(tidyverse)
library(rvest)
library(patchwork)

Next I scraped wikipedia to get data about participating countries using the following steps:

Step 1: Scrape List of Participating Countries

Using functions from the rvest library, I extracted a list of h3 headers from the web page, then filtered for the first 24 h3 headers which were the names of participating countries.

# Wikipedia page url
url <- "https://en.wikipedia.org/wiki/UEFA_Euro_2020_squads"

# Covert url into html object
page <- read_html(url)

# Extract list of participating countries
countries_list <- page %>% 
  html_nodes("h3") %>% 
  html_text()

countries_list <- countries_list[1:24]
countries_list
##  [1] "Italy"           "Switzerland"     "Turkey"          "Wales"          
##  [5] "Belgium"         "Denmark"         "Finland"         "Russia"         
##  [9] "Austria"         "Netherlands"     "North Macedonia" "Ukraine"        
## [13] "Croatia"         "Czech Republic"  "England"         "Scotland"       
## [17] "Poland"          "Slovakia"        "Spain"           "Sweden"         
## [21] "France"          "Germany"         "Hungary"         "Portugal"

Step 2: Scrape Names of Team Coaches

On the web page, a new paragraph starts right after the h3 headers. Some of these paragraphs (html tag “p”) included the names of team coaches.

I extracted all paragraphs on the webpage, added them to a dataframe, filtered for all rows that included the string “Manager”, then finally removed the string “Manager:” using stringr, thus leaving only the names of the coaches.

# Get names of coaches
coach_names <- page %>% 
  html_nodes("p") %>% 
  html_text() %>% 
  str_split("\n") %>% 
  unlist() %>% 
  tibble() %>% 
  rename(coach = ".") %>% 
  filter(grepl("Manager", coach)) %>% 
  mutate(coach = coach %>% str_remove_all("Manager: "),
         coach = coach %>% str_trim(side = "both"))

coach_names %>% head()
## # A tibble: 6 x 1
##   coach            
##   <chr>            
## 1 Roberto Mancini  
## 2 Vladimir Petković
## 3 Şenol Güneş      
## 4 Rob Page         
## 5 Roberto Martínez 
## 6 Kasper Hjulmand

Step 3: Scrape Data Tables for Participating Countries

On the webpage, there are 24 data tables which I scraped next. I did so by first extracting the html tag “table”, then once again filtering for the first 24 tables into a list. I show the code below and display the first table in the list.

# Get data tables
table_list <- html_nodes(page, "table")

table_list <- table_list %>% 
  .[1:24] %>% 
  html_table(fill = TRUE)

table_list[[1]] %>% head()
## # A tibble: 6 x 7
##     No. Pos.  Player           `Date of birth (age)`      Caps Goals Club       
##   <int> <chr> <chr>            <chr>                     <int> <int> <chr>      
## 1     1 1GK   Salvatore Sirigu (1987-01-12)12 January 1…    26     0 Torino     
## 2     2 2DF   Giovanni Di Lor… (1993-08-04)4 August 199…     7     0 Napoli     
## 3     3 2DF   Giorgio Chielli… (1984-08-14)14 August 19…   107     8 Juventus   
## 4     4 2DF   Leonardo Spinaz… (1993-03-25)25 March 199…    14     0 Roma       
## 5     5 3MF   Manuel Locatelli (1998-01-08)8 January 19…    10     1 Sassuolo   
## 6     6 3MF   Marco Verratti   (1992-11-05)5 November 1…    40     3 Paris Sain…

Step 4: Clean! Clean!! Clean!!!

The first thing I did here was to create a dataframe of the countries and coaches.

# Dataframe of country & coach
country_coach_tbl <- tibble(country = countries_list) %>% 
  bind_cols(coach_names)

country_coach_tbl %>% head()
## # A tibble: 6 x 2
##   country     coach            
##   <chr>       <chr>            
## 1 Italy       Roberto Mancini  
## 2 Switzerland Vladimir Petković
## 3 Turkey      Şenol Güneş      
## 4 Wales       Rob Page         
## 5 Belgium     Roberto Martínez 
## 6 Denmark     Kasper Hjulmand

The second thing I did here was create a custom function to merge the country and coach columns from above to the corresponding data table for that country. Next I applied that function each of the 24 data tables, while also during a rbind to merge all 24 tables.

# Custom function to merge data table with corresponding country and coach
func_add_country_coach <- function(table_list_no = 1, coach_country_no = 1){
  
  table_list[[table_list_no]] %>% 
    mutate(country  = country_coach_tbl[coach_country_no,]$country,
           coach    = country_coach_tbl[coach_country_no,]$coach)
}

# Apply custom function from above to all 24 tables while binding rows
euros_tbl <- tibble(
  rbind(
  func_add_country_coach(1, 1), func_add_country_coach(2, 2),
  func_add_country_coach(3, 3), func_add_country_coach(4, 4),
  func_add_country_coach(5, 5), func_add_country_coach(6, 6),
  func_add_country_coach(7, 7), func_add_country_coach(8, 8),
  func_add_country_coach(9, 9), func_add_country_coach(10, 10),
  func_add_country_coach(11, 11), func_add_country_coach(12, 12),
  func_add_country_coach(13, 13), func_add_country_coach(14, 14),
  func_add_country_coach(15, 15), func_add_country_coach(16, 16),
  func_add_country_coach(17, 17), func_add_country_coach(18, 18),
  func_add_country_coach(19, 19), func_add_country_coach(20, 20),
  func_add_country_coach(21, 21), func_add_country_coach(22, 22),
  func_add_country_coach(23, 23), func_add_country_coach(24, 24)
  
  )
)

euros_tbl %>% head()
## # A tibble: 6 x 9
##     No. Pos.  Player     `Date of birth (age…  Caps Goals Club    country coach 
##   <int> <chr> <chr>      <chr>                <int> <int> <chr>   <chr>   <chr> 
## 1     1 1GK   Salvatore… (1987-01-12)12 Janu…    26     0 Torino  Italy   Rober…
## 2     2 2DF   Giovanni … (1993-08-04)4 Augus…     7     0 Napoli  Italy   Rober…
## 3     3 2DF   Giorgio C… (1984-08-14)14 Augu…   107     8 Juvent… Italy   Rober…
## 4     4 2DF   Leonardo … (1993-03-25)25 Marc…    14     0 Roma    Italy   Rober…
## 5     5 3MF   Manuel Lo… (1998-01-08)8 Janua…    10     1 Sassuo… Italy   Rober…
## 6     6 3MF   Marco Ver… (1992-11-05)5 Novem…    40     3 Paris … Italy   Rober…

Finally I did some additional cleaning/formatting including extracting the players age, added a column to indicate the players position, added another column to indicate if a player is the team captain or not.

remove <- c("captain|\\(|\\)")

# Extract age
final_euros_tbl <- euros_tbl %>% 
    setNames(names(.) %>% str_to_lower()) %>% 
    rename(dob = `date of birth (age)`) %>% 
    separate(col = dob, into = c("col1", "col2", "col3", "col4", "col5"), sep = " ") %>% 
    select(-c("col1", "col2", "col3", "col4")) %>% 
    rename(age = col5) %>% 
    mutate(age = age %>% str_remove_all("\\)") %>% as.numeric) %>% 
    
    # Add position column
    mutate(position = case_when(
    str_detect(pos., "GK") ~ "Goalkeepers",
    str_detect(pos., "DF") ~ "Defenders",
    str_detect(pos., "MF") ~ "Midfielders",
    str_detect(pos., "FW") ~ "Forwards",
    )) %>% 
    mutate(position = position %>% fct_relevel("Goalkeepers", "Defenders", "Midfielders", "Forwards")) %>% 
    
    # Add captain indicator column
    mutate(captain = case_when(
        str_detect(player, "(captain)") ~ "Yes",
        TRUE ~ "No"
    )) %>% 
    
    # Remove the string "(captain)" from the "player" column
    mutate(player = player %>% str_remove_all(remove)) %>% 
    mutate(player_country = str_glue("{player} ({country})"))

final_euros_tbl %>% head()
## # A tibble: 6 x 12
##     no. pos.  player     age  caps goals club    country coach  position captain
##   <int> <chr> <chr>    <dbl> <int> <int> <chr>   <chr>   <chr>  <fct>    <chr>  
## 1     1 1GK   "Salvat…    34    26     0 Torino  Italy   Rober… Goalkee… No     
## 2     2 2DF   "Giovan…    27     7     0 Napoli  Italy   Rober… Defende… No     
## 3     3 2DF   "Giorgi…    36   107     8 Juvent… Italy   Rober… Defende… Yes    
## 4     4 2DF   "Leonar…    28    14     0 Roma    Italy   Rober… Defende… No     
## 5     5 3MF   "Manuel…    23    10     1 Sassuo… Italy   Rober… Midfiel… No     
## 6     6 3MF   "Marco …    28    40     3 Paris … Italy   Rober… Midfiel… No     
## # … with 1 more variable: player_country <glue>

At this stage, I’m done with all the cleaning and formatting and have a clean dataset to analyze. Before I proceed to any analysis, I checked to see if there were any NA values.

# Check For NAs
final_euros_tbl %>% 
    sapply(function(x) sum(is.na(x)))
##            no.           pos.         player            age           caps 
##              0              0              0              0              0 
##          goals           club        country          coach       position 
##              0              0              0              0              0 
##        captain player_country 
##              0              0

Great! No NA values.


Analyzing The Data

I had several questions I was curious about such as:

I then proceeded to get the answers to these questions from the data and visualize using barcharts.

What players have the most caps for their country?

Note: caps refers to the number of international games a player has played for he’s country.

# Function to create bar charts
func_plot_bar_chart <- function(data, x_var, y_var, label_var){
    
    x_var_expr <- rlang::enquo(x_var)
    y_var_expr <- rlang::enquo(y_var)
    label_var_expr <- rlang::enquo(label_var)
    
    data %>% 
    ggplot(aes(!!x_var_expr, !!y_var_expr))+
    geom_col(fill = "#669cd1", width = 0.7)+
    geom_text(aes(label = !!label_var_expr), hjust = 1, size = 3.5)+
    theme_minimal()+
    theme(axis.text.x = element_text(size = 10, colour = "black"),
                   axis.text.y = element_text(size = 10, color = "black"))
}

# Top 10 capped players
final_euros_tbl %>% 
    arrange(desc(caps)) %>% 
    slice_max(order_by = caps, n = 10) %>% 
    mutate(player_country = player_country %>% fct_reorder(caps)) %>% 
    func_plot_bar_chart(x_var = caps, y_var = player_country, label_var = caps)+
    labs(title = "Top 10 Capped Players", y = "", x = "\nNo. of Caps")


Who are the highest capped players under 30?

# Top 10 capped players under 30
final_euros_tbl %>% 
    filter(age < 30) %>% 
    arrange(desc(caps)) %>% 
    slice_max(order_by = caps, n = 10) %>% 
    mutate(player_country = player_country %>% fct_reorder(caps)) %>% 
    func_plot_bar_chart(x_var = caps, y_var = player_country, label_var = caps)+
    labs(title = "Top 10 Capped Players Under 30", y = "", x = "\nNo. of Caps")


Who are the highest capped players under 25?

final_euros_tbl %>% 
    filter(age < 25) %>% 
    arrange(desc(caps)) %>% 
    slice_max(order_by = caps, n = 10) %>% 
    mutate(player_country = player_country %>% fct_reorder(caps)) %>% 
    func_plot_bar_chart(x_var = caps, y_var = player_country, label_var = caps)+
    labs(title = "Top 10 Capped Players Under 25", y = "", x = "\nNo. of Caps")


What players have the most goals for their country?

# Top 10 goal scorers
final_euros_tbl %>% 
    arrange(desc(goals)) %>% 
    slice_max(order_by = goals, n = 10) %>% 
    mutate(player_country = player_country %>% fct_reorder(goals)) %>% 
    func_plot_bar_chart(x_var = goals, y_var = player_country, label_var = goals)+
    labs(title = "Top 10 Goal Scorers", y = "", x = "\nNo. of Goals")

Understandably, players (forwards especially) who have a high number of caps will also have a higher number of international goals than players who may just be early or mid-way through their international careers. Thus I wanted to look at the ratio of goals to caps for players with at least 5 international caps, 10 international caps, 50 international caps, and 100 international caps.

# Add caps bin to dataset 
goals_caps_tbl <- final_euros_tbl%>% 
    mutate(caps_bin = case_when(
        caps >= 5  & caps <= 10  ~ "5 - 10 caps",
        caps >= 10 & caps <= 50  ~ "11 - 50 caps",
        caps >= 50 & caps <= 100 ~ "51 - 100 caps",
        caps >= 100 ~ "Above 100 caps",
        TRUE ~ "Exclude")) %>% 
    mutate(caps_bin = caps_bin %>% fct_relevel("5 - 10 caps", "11 - 50 caps",
                                               "51 - 100 caps", "Above 100 caps",
                                               "Exclude")) %>% 
    filter(caps_bin != "Exclude") %>% 
    mutate(goal_to_cap_ratio = goals / caps) %>% 
    mutate(label_text = str_glue("{goals} | {caps} | {goal_to_cap_ratio %>% scales::percent(accuracy = 1)}"))

# Function to plot goals to caps ratio
func_plot_goal_cap_ratio <- function(data){
    
    data %>% 
        mutate(goal_to_cap_ratio = goals / caps) %>%
        arrange(desc(goal_to_cap_ratio)) %>% 
        mutate(player_country = player_country %>% fct_reorder(goal_to_cap_ratio)) %>% 
        slice_max(order_by = goal_to_cap_ratio, n = 10) 
}

# Goals to caps plot: 5 - 10 caps
goals_caps_tbl %>% 
    filter(caps_bin == "5 - 10 caps") %>% 
    func_plot_goal_cap_ratio() %>% 
    func_plot_bar_chart(x_var = goal_to_cap_ratio, y_var = player_country, label_var = label_text)+
    scale_x_continuous(labels = scales::percent_format(accuracy = 1))+
    labs(title = "Top 10 Players: Goals to Caps Ratio (5 - 10 Caps)", subtitle = "Goals | Caps | Ratio",
         x = "\nGoals to Caps Ratio", y = "")

# Goals to caps plot: 10 - 50 caps
goals_caps_tbl %>% 
    filter(caps_bin == "11 - 50 caps") %>% 
    func_plot_goal_cap_ratio() %>% 
    func_plot_bar_chart(x_var = goal_to_cap_ratio, y_var = player_country, label_var = label_text)+
    scale_x_continuous(labels = scales::percent_format(accuracy = 1))+
    labs(title = "Top 10 Players: Goals to Caps Ratio (11 - 50 Caps)", subtitle = "Goals | Caps | Ratio",
         x = "\nGoals to Caps Ratio", y = "")

# Goals to caps plot: 50 - 100 caps
goals_caps_tbl %>% 
    filter(caps_bin == "51 - 100 caps") %>% 
    func_plot_goal_cap_ratio() %>% 
    func_plot_bar_chart(x_var = goal_to_cap_ratio, y_var = player_country, label_var = label_text)+
    scale_x_continuous(labels = scales::percent_format(accuracy = 1))+
    labs(title = "Top 10 Players: Goals to Caps Ratio (51 - 100 Caps)", subtitle = "Goals | Caps | Ratio",
         x = "\nGoals to Caps Ratio", y = "")

# Goals to caps plot: above 100 caps
goals_caps_tbl %>% 
    filter(caps_bin == "Above 100 caps") %>% 
    func_plot_goal_cap_ratio() %>% 
    func_plot_bar_chart(x_var = goal_to_cap_ratio, y_var = player_country, label_var = label_text)+
    scale_x_continuous(labels = scales::percent_format(accuracy = 1))+
    labs(title = "Top 10 Players: Goals to Caps Ratio (Above 100 caps)", subtitle = "Goals | Caps | Ratio",
         x = "\nGoals to Caps Ratio", y = "")


Who are the oldest and youngest players in the tournament?

# Oldest players
final_euros_tbl %>% 
    arrange(desc(age)) %>% 
    slice_max(order_by = age, n = 10) %>% 
    mutate(player_country = player_country %>% fct_reorder(age)) %>% 
    func_plot_bar_chart(x_var = age, y_var = player_country, label_var = age)+
    labs(title = "10 Oldest Players", x = "\nAge", y = "",
         caption = "More than 10 value returned when multiple players share the same age")

# Youngest players
final_euros_tbl %>% 
    arrange(age) %>% 
    slice_min(order_by = age, n = 10) %>% 
    mutate(player_country = player_country %>% fct_reorder(age)) %>% 
    func_plot_bar_chart(x_var = age, y_var = player_country, label_var = age)+
    labs(title = "10 Youngest Players", x = "\nAge", y = "",
         caption = "More than 10 value returned when multiple players share the same age")


What is the age distribution of players like?

# Age distribution (all)
p1 <- final_euros_tbl %>% 
    ggplot(aes(age))+
    geom_boxplot(fill = "#669cd1")+
    theme_minimal()+
    labs(title = "Age Distribution",
         subtitle = str_glue("Min Age: {min(final_euros_tbl$age)} | Average Age: {round(mean(final_euros_tbl$age),0)} | Max Age: {max(final_euros_tbl$age)}"))

# Age distribution by position
p2 <- final_euros_tbl %>% 
    ggplot(aes(age))+
    geom_boxplot(fill = "#669cd1")+
    theme_minimal()+
    facet_wrap(~ position)+
    labs(title = "Age Distribution by Position", x = "Age")

p1 / p2


What clubs are most represented at the tournament?

# Clubs with the highest representation at the tournament
final_euros_tbl %>% 
    group_by(club) %>% 
    count() %>% 
    arrange(desc(n)) %>% 
    ungroup() %>% 
    slice(1:10) %>% 
    mutate(club = club %>% fct_reorder(n)) %>% 
    func_plot_bar_chart(x_var = n, y_var = club, label_var = n)+
    labs(title = "Top 10 Clubs Represented", x = "No. of Players", y = "")


What Countries have the most players at a single club?

# Countries with the most players at a single club
final_euros_tbl %>% 
    group_by(country, club) %>% 
    count() %>% 
    arrange(desc(n)) %>% 
    ungroup() %>% 
    filter(n >= 5) %>% 
    mutate(country_club = str_glue("{country} ({club})")) %>% 
    mutate(country_club = country_club %>% fct_reorder(n)) %>%  
    func_plot_bar_chart(x_var = n, y_var = country_club, label_var = n)+
    labs(title = "Countries With the Most Players at a Single Club", x = "No of Players", y = "")


Wrapping Up

In this project, I scrapped Euro 2021 data from wikipedia, then proceeded to answer some questions around player stats, using bar charts and boxplots to visualize the data.