1 Introduction

This tutorial will step through the process of creating functions to analyse player match stats. These functions can be useful to analyse player performance or identify any trends in their data.

The first part of the tutorial will cover the scraping and cleaning of the AFL data - using the fitzRoy package to scrape the data, and tidyverse/dplyr for cleaning.

The second part will cover how to build visualisations with ggplot2, and how to wrap them in functions so they can be reused.

2 Scraping and Cleaning Data

2.1 Scraping with the fitzRoy package

The fitzRoy package provides easy access to several AFL data sources. The data that can be scraped includes match results, player stats, fixtures and more.

For this example, we will scrape player match stats for the 2022 season from the AFL website/API.

# Load the fitzRoy package
library(fitzRoy)

# Scraping data for the 2022 Season, from the AFL Website
playerStats <- fetch_player_stats_afl(season = 2022) # Assign scraped data to object

head(playerStats, 5) # View first 5 rows
## # A tibble: 5 × 93
##   provi…¹ utcSt…² status compS…³ round…⁴ round…⁵ venue…⁶ home.…⁷ away.…⁸ playe…⁹
##   <chr>   <chr>   <chr>  <chr>   <chr>     <int> <chr>   <chr>   <chr>     <int>
## 1 CD_M20… 2022-0… CONCL… Premie… Round 1       1 MCG     Melbou… Wester…      12
## 2 CD_M20… 2022-0… CONCL… Premie… Round 1       1 MCG     Melbou… Wester…      17
## 3 CD_M20… 2022-0… CONCL… Premie… Round 1       1 MCG     Melbou… Wester…      10
## 4 CD_M20… 2022-0… CONCL… Premie… Round 1       1 MCG     Melbou… Wester…      50
## 5 CD_M20… 2022-0… CONCL… Premie… Round 1       1 MCG     Melbou… Wester…      31
## # … with 83 more variables: player.photoURL <chr>,
## #   player.player.position <chr>, player.player.player.playerId <chr>,
## #   player.player.player.captain <lgl>,
## #   player.player.player.playerJumperNumber <int>,
## #   player.player.player.givenName <chr>, player.player.player.surname <chr>,
## #   teamId <chr>, gamesPlayed <lgl>, timeOnGroundPercentage <dbl>, goals <dbl>,
## #   behinds <dbl>, superGoals <lgl>, kicks <dbl>, handballs <dbl>, …

Other sources that can be scraped include footywire.com, afltables.com and Fryzigg. We are using stats from the AFL website because they include more advanced stats (e.g. kick-ins).

# Example 
fetch_player_stats_afltables()
fetch_player_stats_footywire()
fetch_player_stats_fryzigg() 
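Scraping a full season takes a little while and contacts the source every time the code runs, so it can help to keep a local copy. The following is a minimal sketch of that idea, not something fitzRoy provides: `cached_fetch()` and `fake_fetch()` are hypothetical helpers made up for illustration, and `fake_fetch()` stands in for a real scrape so the example runs offline. `saveRDS()` and `readRDS()` are base R.

```r
# Hypothetical helper (not part of fitzRoy): run `fetch_fn` once and
# reuse the saved copy on later calls.
cached_fetch <- function(path, fetch_fn) {
  if (file.exists(path)) {
    return(readRDS(path))   # Reuse the copy saved earlier
  }
  data <- fetch_fn()        # Only scrape when no saved copy exists
  saveRDS(data, path)
  data
}

# Stand-in for a real scrape, so the sketch runs without a connection
fake_fetch <- function() {
  data.frame(player = c("A", "B"), disposals = c(20, 15))
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached_fetch(cache_file, fake_fetch)   # "Scrapes" and saves
second <- cached_fetch(cache_file, fake_fetch)   # Reads the saved file
```

With fitzRoy, the same helper could be called as `cached_fetch("playerStats2022.rds", function() fetch_player_stats_afl(season = 2022))`, where the file name is an arbitrary choice.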

2.2 Cleaning with the tidyverse and dplyr

As we can see in the output of the first code chunk, the scraped data contains over 90 variables. We will be building functions for player disposals and player kick-ins, so let’s clean up the data to include only the relevant stats. We will load the tidyverse, which contains the dplyr package for data wrangling as well as ggplot2, which we will use later.

colnames(playerStats) # Use this to explore the available variables
# Load the tidyverse
library(tidyverse)

playerStats <- playerStats %>% 
  # Selecting appropriate variables
  dplyr::select(
    c(round.roundNumber,   # Round Number
      team.name,   # Player's team
      player.givenName,   # Player given name
      player.surname,   # Player Surname
      teamStatus,   # Home/Away Indicator
      disposals,   # Player disposals
      extendedStats.kickins,   # Player Kick-ins taken
      extendedStats.kickinsPlayon,   # Player Kick-ins played-on from
      timeOnGroundPercentage)  # Time spent on ground
  ) %>% 
  # Filter out subs who did not play a significant amount of time
  dplyr::filter(
    timeOnGroundPercentage > 5
  ) %>% 
  # Combine player first name and surname to one column
  tidyr::unite(
    col = 'Player',   # Name of new column
    player.givenName:player.surname, 
    sep = " "
  )

head(playerStats, 5)
## # A tibble: 5 × 8
##   round.roundNumber team.name Player     teamS…¹ dispo…² exten…³ exten…⁴ timeO…⁵
##               <int> <chr>     <chr>      <chr>     <dbl>   <dbl>   <dbl>   <dbl>
## 1                 1 Melbourne Toby Bedf… home          9       0       0      55
## 2                 1 Melbourne Jake Bowey home          9       1       1      73
## 3                 1 Melbourne Angus Bra… home         23       0       0      83
## 4                 1 Melbourne Ben Brown  home         13       0       0      86
## 5                 1 Melbourne Bayley Fr… home          9       0       0      81
## # … with abbreviated variable names ¹​teamStatus, ²​disposals,
## #   ³​extendedStats.kickins, ⁴​extendedStats.kickinsPlayon,
## #   ⁵​timeOnGroundPercentage

The dataset is much smaller now, so much easier to view and handle. Let’s move on to visualisations.
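Before moving on, it can help to see the three cleaning steps in isolation. The sketch below applies the same `select()`, `filter()` and `unite()` calls to a small made-up table; the rows and the `toy`/`toyClean` names are invented for illustration and are not real AFL data.

```r
library(dplyr)
library(tidyr)

# Made-up rows for illustration only (not real AFL data)
toy <- tibble(
  player.givenName       = c("Alex", "Sam", "Jo"),
  player.surname         = c("Smith", "Jones", "Brown"),
  disposals              = c(25, 2, 14),
  goals                  = c(1, 0, 3),
  timeOnGroundPercentage = c(82, 4, 77)
)

toyClean <- toy %>%
  # Keep only the columns we need; `goals` is dropped
  select(player.givenName, player.surname, disposals, timeOnGroundPercentage) %>%
  # Drop Sam Jones, who spent only 4% of the match on the ground
  filter(timeOnGroundPercentage > 5) %>%
  # Merge the two name columns into a single `Player` column
  unite(col = "Player", player.givenName:player.surname, sep = " ")

toyClean
```

The result keeps two rows and three columns (`Player`, `disposals`, `timeOnGroundPercentage`), which is the same shape of transformation applied to `playerStats` above.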

3 Creating Visualisations

3.1 First Graph

The graphs we create will be built using the ggplot2 package, which is part of the tidyverse. ggplot2 offers a wide catalogue of chart types for our visualisations. For our analyses, we will be using bar charts. Let’s build a graph to analyse Patrick Cripps’ disposals across the season.

playerStats %>% 
  filter(Player == 'Patrick Cripps') %>% 
  # Call the ggplot function to begin our visualisation
  ggplot(
    aes(x = round.roundNumber, y = disposals)    # Choosing our mapping aesthetics
  ) +
  # Add our bars
  geom_bar(stat = 'identity')

Now we have a bar chart for Patrick Cripps’ disposals in each game of 2022. It looks a bit plain so let’s pretty it up.
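A note on `geom_bar(stat = 'identity')`: by default `geom_bar()` counts rows, and `stat = 'identity'` tells it to use the y values as given. ggplot2 also provides `geom_col()`, which is shorthand for the same thing. The sketch below compares the two on made-up numbers (the `toy` data is invented for illustration).

```r
library(ggplot2)

# Made-up disposal counts for illustration only
toy <- data.frame(round = 1:4, disposals = c(28, 31, 19, 24))

barIdentity <- ggplot(toy, aes(x = round, y = disposals)) +
  geom_bar(stat = "identity")

barCol <- ggplot(toy, aes(x = round, y = disposals)) +
  geom_col()

# layer_data() returns the values ggplot2 computed for a layer;
# both versions produce bars of the same height
heightsIdentity <- layer_data(barIdentity, 1)$y
heightsCol      <- layer_data(barCol, 1)$y
```

Either form can be used in the rest of this tutorial; the code keeps `geom_bar(stat = 'identity')` throughout for consistency.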

3.2 Second Graph

We’ll change the colour scheme so that it’s nicer to look at, add appropriate titles, and change the axis intervals so the information is easier to read. Let’s also highlight how often he had at least 20 disposals.

playerStats %>% 
  filter(Player == 'Patrick Cripps') %>% 
  mutate(
    # Assigning colours for when he is over 20 or under 20 disposals
    OverUnder = ifelse(disposals < 20,    # "if disposals is under 20;
                       '#a61c00',         # make bars red
                       '#459e1e')         # otherwise, make them green
    ) %>% 
  ggplot(aes(x = round.roundNumber, y = disposals)) + 
  geom_bar(stat = 'identity',
           aes(fill = OverUnder)) +   # Fill bars based on threshold colour assigned above
  scale_fill_identity() +   # Use the colours stored in OverUnder directly as the fill
  
  # adding label into bars
  geom_text(aes(label = disposals),
            vjust = 2,   # moving text position
            color = "white",
            fontface = "bold") +  # changing text style
  
  # Titles
  labs(title = 'Patrick Cripps Disposals', # Title
       subtitle = 'Season 2022', # subtitle
       x = 'Round', # x-axis title
       y = 'Disposals') +  # y-axis title
  
  # Add a line to visualise threshold
  geom_hline(yintercept = 20,
             linetype = "dashed",
             size = 1.4, 
             colour = "blue") +
  
  # Axis scaling/intervals
  scale_y_continuous(breaks = seq(0, max(playerStats$disposals), by = 5)) +  # setting axis range and breakpoints
  scale_x_continuous(breaks = seq(1, max(playerStats$round.roundNumber)), # Setting the graph at the maximum possible round, to capture missed matches if player missed recent games
                     limits = c(0, max(playerStats$round.roundNumber) + 1), 
                     expand = c(0, 0)) +

  # Changing overall graph aesthetics
  theme(
    # increase space around chart
        plot.margin = margin(0, 0.5, 0.5, 0.5,"cm"),
        
    # changing font of texts
        plot.title = element_text(color = '#FFFFFF', size = 24, margin = margin(b = .5, t = .75, unit = "cm")),
        plot.subtitle = element_text(color = '#FFFFFF', size = 12),
        axis.title.x = element_text(color = '#FFFFFF', size = 16, margin = margin(t = 1, unit = 'cm')),
        axis.title.y = element_text(color = '#FFFFFF', size = 16, margin = margin(r = 1, unit = 'cm')),
        axis.text = element_text(color = '#FFFFFF', size = 12),
        
    # making the chart 'dark mode'
        panel.background = element_rect(fill = "#2a2929"),
        plot.background = element_rect(fill = "#2a2929"),
        
    # removing or changing the gridlines 
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_line(color = '#b3b3b3'),
        panel.grid.minor.y = element_blank()) 
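The colouring shows at a glance which games reached the threshold, and the same question can be answered with a number. The following is a minimal sketch on made-up data (the `toy` table and the `hitRate` name are invented for illustration); it mirrors the colouring rule above, where a game with exactly 20 disposals counts as reaching the threshold.

```r
library(dplyr)

# Made-up disposal counts for illustration only
toy <- tibble(
  Player    = "Example Player",
  round     = 1:5,
  disposals = c(28, 19, 20, 31, 15)
)

threshold <- 20

hitRate <- toy %>%
  summarise(
    games        = n(),                            # Games played
    gamesReached = sum(disposals >= threshold),    # Games drawn as green bars
    proportion   = mean(disposals >= threshold)    # Share of games at or above threshold
  )

hitRate
```

On the real data, the same `summarise()` call could follow `filter(Player == 'Patrick Cripps')` in place of the plotting code.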

The graph not only looks much nicer now, but the important information is also much clearer. However, if we want to analyse a different player or a different threshold, we would need to trawl through the code again to find the right lines to change - assuming we don’t forget what needs to be changed. And if we wanted to analyse multiple players at once, we would need to copy many, many lines of code. Therefore, let’s create a function so that it is easy to repeat our graph over and over while changing only its parameters.

4 Creating Functions

4.1 Disposals Function

To create a function, we can wrap the above graph code in a function definition and replace the hard-coded values we want to vary (the player name and the disposal threshold) with function arguments.

round <- max(playerStats$round.roundNumber) # Defining latest round of data

playerDisposals <- # Name of our function
  function(player, threshold){   #Defining our inputs that will be used i.e. a player name, and disposal threshold
    
    # Insert code from last section, replacing any variables with our new inputs
playerStats %>% 
  filter(Player == player) %>% # Change to our player input, instead of 'Patrick Cripps'
  mutate(OverUnder = ifelse(disposals < threshold, '#a61c00', '#459e1e')) %>% # Change to our threshold input, instead of '20'
  ggplot(aes(x = round.roundNumber, y = disposals)) + 
  geom_bar(stat = 'identity', aes(fill = OverUnder)) +
  scale_fill_identity() + 
  geom_text(aes(label = disposals), vjust = 2, color = "white", fontface = "bold") + 
  labs(title = paste0(player, ' Disposals'), # Changing our title to be dynamic, using the paste0 function
       subtitle = 'Season 2022',
       x = 'Round', y = 'Disposals') + 
      geom_hline(yintercept = threshold, # Change y-int to be threshold input
             linetype = "dashed",
             size = 1.4, 
             colour = "blue") +
  scale_y_continuous(breaks = seq(0, max(playerStats$disposals), by = 5)) +
  scale_x_continuous(breaks = seq(1, round), # Change to our round object, defined outside of the function
                     limits = c(0, round + 1), 
                     expand = c(0, 0)) +
  theme(
        plot.margin = margin(0, 0.5, 0.5, 0.5,"cm"),
        plot.title = element_text(color = '#FFFFFF', size = 24, margin = margin(b = .5, t = .75, unit = "cm")),
        plot.subtitle = element_text(color = '#FFFFFF', size = 12),
        axis.title.x = element_text(color = '#FFFFFF', size = 16, margin = margin(t = 1, unit = 'cm')),
        axis.title.y = element_text(color = '#FFFFFF', size = 16, margin = margin(r = 1, unit = 'cm')),
        axis.text = element_text(color = '#FFFFFF', size = 12),
        panel.background = element_rect(fill = "#2a2929"),
        plot.background = element_rect(fill = "#2a2929"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_line(color = '#b3b3b3'),
        panel.grid.minor.y = element_blank()) 
  
    
  }

Why is there no graph now? Defining a function only stores the code; nothing is drawn until we call it. Let’s look at Lachie Neale, Touk Miller and Clayton Oliver.

4.1.1 L.Neale

playerDisposals("Lachie Neale", 20)

4.1.2 T.Miller

playerDisposals("Touk Miller", 20)

4.1.3 C.Oliver

playerDisposals("Clayton Oliver", 20)

Three graphs are now produced, using just as many lines of code. Without a function, we would have needed to copy the original graph code three times, creating a messy wall of text that is much harder to work with - not to mention increasing the chances of breaking our code in the process.
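When the list of players grows, even the individual calls can be generated. The following is a minimal sketch using base R’s `lapply()`; `toyDisposals()` is a simplified, hypothetical stand-in for `playerDisposals()` and the `toy` data is made up, so that the example runs on its own.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(
  Player    = rep(c("Player A", "Player B"), each = 3),
  round     = rep(1:3, times = 2),
  disposals = c(22, 18, 30, 15, 25, 21)
)

# Simplified stand-in for playerDisposals(), so the sketch is self-contained
toyDisposals <- function(player, threshold) {
  ggplot(subset(toy, Player == player), aes(x = round, y = disposals)) +
    geom_col() +
    geom_hline(yintercept = threshold, linetype = "dashed") +
    labs(title = paste0(player, " Disposals"))
}

players <- c("Player A", "Player B")

# One plot per player, stored in a list named by player
plots <- lapply(players, toyDisposals, threshold = 20)
names(plots) <- players

plots[["Player A"]]   # Display a single plot from the list
```

The same pattern should work with the real function, e.g. `lapply(c("Lachie Neale", "Touk Miller"), playerDisposals, threshold = 20)`.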

With this framework, we can analyse almost any recorded stat by simply changing what we scrape and which variable we map to our y-axis. For our final function, let’s analyse a player’s kick-ins when a particular teammate is or isn’t also playing.
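One way to make the y-axis variable itself an input is to pass the column name as a string and look it up with the `.data` pronoun inside `aes()`. This is a sketch under assumptions rather than part of the functions in this tutorial: `toyStatPlot()` is a hypothetical helper and the `toy` data is made up.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(
  round     = 1:3,
  disposals = c(22, 18, 30),
  kickins   = c(4, 0, 6)
)

# Hypothetical helper: `stat` is the name of the column to plot, as a string
toyStatPlot <- function(data, stat) {
  ggplot(data, aes(x = round, y = .data[[stat]])) +   # Look up the column by name
    geom_col() +
    labs(x = "Round", y = stat)
}

disposalPlot <- toyStatPlot(toy, "disposals")
kickinPlot   <- toyStatPlot(toy, "kickins")
```

The kick-in function in the next section takes the simpler route of hard-coding `extendedStats.kickins`, which is easier to read when only one stat is needed.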

4.2 Kick-in Function

In general, when building a function, we should first write and test the code outside of a function, as we did previously with the graphs. For simplicity, the following example skips that initial build and shows only the finished function.

withoutPlayerKickIns <- # Naming our function
  function(targetPlayer, otherPlayer){ # Define our inputs
    
  # Creating an indicator when the 'otherPlayer' was in the game
  other <- playerStats %>% 
    filter(Player == otherPlayer) %>% 
    select(round.roundNumber) %>% 
    mutate(
      otherPlayerPlayed = 'Y' 
    )
  
  # Joining the indicator with the target player
  playerStats <- playerStats %>% 
    filter(Player == targetPlayer) %>% 
    select(round.roundNumber, extendedStats.kickins) %>% 
    left_join(other, by = "round.roundNumber") # Left join the previously created dataframe by round number
  
  ### Dataframe now contains targetPlayer matches, with an indicator if the otherPlayer was present
  
  # Replacing NAs
  playerStats$otherPlayerPlayed[is.na(playerStats$otherPlayerPlayed)] <-  'N'
  
  # Setting graphs limit
  maxKI <- max(playerStats$extendedStats.kickins)
  
playerStats %>%
    
    #Graph framework remains almost the same as above
  ggplot(aes(x = round.roundNumber, y = extendedStats.kickins)) + 
    geom_bar(stat = 'identity', aes(fill = otherPlayerPlayed)) +
    scale_fill_manual(values = c("Y" = '#a61c00',
                                 "N" = '#459e1e'),
                      labels = c("Y" = "With", "N" = "Without")) + # Defining colours and legend based on whether otherPlayer was present; labels are named so each one is matched to the right value
    geom_text(aes(label = extendedStats.kickins), vjust = 2, color = "#FFFFFF", fontface = "bold") + 
    labs(title = paste0(targetPlayer, ' Kick ins taken With/out ', otherPlayer), # Dynamic title using both player names
         subtitle = 'Season 2022',
         x = 'Round', y = 'Kick ins') + 
    
    # adding mean line for targetPlayer's averages with/out otherPlayer
    geom_hline(yintercept = 
                 mean(playerStats$extendedStats.kickins[playerStats$otherPlayerPlayed == 'N']), linetype = 'dashed', color = '#459e1e') + 
    geom_hline(yintercept = 
                 mean(playerStats$extendedStats.kickins[playerStats$otherPlayerPlayed == 'Y']), linetype = 'dashed', color = '#a61c00') + 
    scale_y_continuous(breaks = seq(0, maxKI, by = 2)) +
    scale_x_continuous(breaks = seq(1, round), 
                       limits = c(0, round + 1), 
                       expand = c(0, 0)) +
    theme(
          plot.margin = margin(0, 0.5, 0.5, 0.5,"cm"),
          plot.title = element_text(color = '#FFFFFF', size = 18, margin = margin(b = .5, t = .75, unit = "cm")),
          plot.subtitle = element_text(color = '#FFFFFF', size = 11),
          axis.title.x = element_text(color = '#FFFFFF', size = 14, margin = margin(t = 1, unit = 'cm')),
          axis.title.y = element_text(color = '#FFFFFF', size = 14, margin = margin(r = 1, unit = 'cm')),
          axis.text = element_text(color = '#FFFFFF', size = 10),
          panel.background = element_rect(fill = "#2a2929"),
          plot.background = element_rect(fill = "#2a2929"),
          panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank(),
          panel.grid.major.y = element_line(color = '#b3b3b3'),
          panel.grid.minor.y = element_blank(),
          # legend box theming
          legend.position = 'bottom',
          legend.title = element_blank()) 
  
    
  }

Now to run the function. We will investigate how Caleb Daniel’s absence from the team influenced how many kick-ins Bailey Dale took.

withoutPlayerKickIns('Bailey Dale', "Caleb Daniel")
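The core of the function is the indicator join: rounds where the other player appears are flagged 'Y', and rounds with no match after the left join become 'N'. The sketch below isolates that step on made-up rounds (the `target`, `other`, `joined` and `averages` names are invented for illustration). It uses `dplyr::coalesce()` to fill the missing values, which is an alternative to the `is.na()` replacement used in the function.

```r
library(dplyr)

# Made-up rounds for illustration only
target <- tibble(round = 1:5, kickins = c(2, 6, 3, 7, 1))        # Target player's games
other  <- tibble(round = c(1L, 3L, 5L), otherPlayerPlayed = "Y") # Rounds the other player played

joined <- target %>%
  left_join(other, by = "round") %>%                              # Unmatched rounds get NA
  mutate(otherPlayerPlayed = coalesce(otherPlayerPlayed, "N"))    # Replace NA with "N"

# Average kick-ins with and without the other player
averages <- joined %>%
  group_by(otherPlayerPlayed) %>%
  summarise(meanKickins = mean(kickins))

averages
```

These group averages are the same values the two dashed mean lines in the graph are drawn at.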

If we want to rerun the code with different players, it is as simple as calling the function with different inputs. Functions make repeating code much easier and cleaner, and should be utilised when any piece of code needs to be used more than once.
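One practical caveat when calling these functions with different inputs: a misspelt player name leaves no rows to plot. A small guard at the top of a function can catch this early. The following is a sketch only; `checkPlayer()` is a hypothetical helper, not something defined earlier in this tutorial, and the `toy` data is made up.

```r
# Hypothetical helper: stop early if `player` has no rows in `data`
checkPlayer <- function(data, player) {
  if (!player %in% data$Player) {
    stop("No rows found for player '", player, "'. Check the spelling.",
         call. = FALSE)
  }
  invisible(TRUE)
}

# Made-up data for illustration only
toy <- data.frame(Player = c("Alex Smith", "Jo Brown"), disposals = c(25, 14))

checkPlayer(toy, "Alex Smith")   # Passes silently

# A misspelt name raises an error, captured here so the script keeps running
result <- tryCatch(
  checkPlayer(toy, "Alx Smith"),
  error = function(e) conditionMessage(e)
)
result
```

Inside `playerDisposals()`, a call such as `checkPlayer(playerStats, player)` could be placed on the first line of the function body.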

5 Final Words

As mentioned previously, the framework covered above is flexible enough to suit the visualisation and analysis of almost any recorded stat. These visualisations make it easy to answer questions such as “How often is a player over 20 disposals?” or “How does one player affect another’s kick-ins?”.