This tutorial will step through the process of creating functions to analyse player match stats. These functions can be useful to analyse player performance or identify any trends in their data.
The first part of the tutorial covers scraping and cleaning the AFL data, using the fitzRoy package to scrape the data and tidyverse/dplyr to clean it.
The second part covers how to build functions and visualisations, with the visualisations created using ggplot2.
The fitzRoy package provides easy access to several data sources and APIs from which we can scrape our data.
The data which can be scraped includes match results, player stats, fixtures and more.
For this example, we will scrape player match stat data from the AFL website/API for the 2022 season.
# Load the fitzRoy package
library(fitzRoy)
# Scraping data for the 2022 Season, from the AFL Website
playerStats <- fetch_player_stats_afl(season = 2022) # Assign scraped data to object
head(playerStats, 5) # View first 5 rows
## # A tibble: 5 × 93
## provi…¹ utcSt…² status compS…³ round…⁴ round…⁵ venue…⁶ home.…⁷ away.…⁸ playe…⁹
## <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 CD_M20… 2022-0… CONCL… Premie… Round 1 1 MCG Melbou… Wester… 12
## 2 CD_M20… 2022-0… CONCL… Premie… Round 1 1 MCG Melbou… Wester… 17
## 3 CD_M20… 2022-0… CONCL… Premie… Round 1 1 MCG Melbou… Wester… 10
## 4 CD_M20… 2022-0… CONCL… Premie… Round 1 1 MCG Melbou… Wester… 50
## 5 CD_M20… 2022-0… CONCL… Premie… Round 1 1 MCG Melbou… Wester… 31
## # … with 83 more variables: player.photoURL <chr>,
## # player.player.position <chr>, player.player.player.playerId <chr>,
## # player.player.player.captain <lgl>,
## # player.player.player.playerJumperNumber <int>,
## # player.player.player.givenName <chr>, player.player.player.surname <chr>,
## # teamId <chr>, gamesPlayed <lgl>, timeOnGroundPercentage <dbl>, goals <dbl>,
## # behinds <dbl>, superGoals <lgl>, kicks <dbl>, handballs <dbl>, …
Other data sources that can be scraped include footywire.com, afltables.com and Fryzigg. We are using stats from the AFL website as they include more advanced stats (e.g. kick-ins).
# Example
fetch_player_stats_afltables()
fetch_player_stats_footywire()
fetch_player_stats_fryzigg()
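As a rough sketch of how the source could be switched with a single argument, the wrapper below assumes fitzRoy's generic `fetch_player_stats()` function and its `source` argument. The wrapper name `fetchSeasonStats` is hypothetical and is not part of fitzRoy; check the package documentation for the source names your installed version accepts.

```r
# Hypothetical wrapper: fetch one season of player stats from a chosen source.
# Assumes fitzRoy::fetch_player_stats() accepts `season` and `source` arguments.
fetchSeasonStats <- function(season, source = "AFL") {
  fitzRoy::fetch_player_stats(season = season, source = source)
}

# Not run here, because each call downloads data:
# statsAfl       <- fetchSeasonStats(2022)
# statsFootywire <- fetchSeasonStats(2022, source = "footywire")
```

Note that column names differ between sources, so the cleaning code that follows would probably need adjusting for anything other than the AFL website data.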
As we can see in the first code chunk, there are over 90 variables in the scraped data. We will be building functions for player disposals and player kick-ins, so let’s clean up the data to include only the relevant stats. We will load the tidyverse package, which includes dplyr for data wrangling as well as ggplot2, which we will use later.
colnames(playerStats) # Use this to explore the available variables
# Load the tidyverse
library(tidyverse)
playerStats <- playerStats %>%
  # Selecting appropriate variables
  dplyr::select(
    c(round.roundNumber,           # Round Number
      team.name,                   # Player's team
      player.givenName,            # Player Name
      player.surname,              # Player Surname
      teamStatus,                  # Home/Away Indicator
      disposals,                   # Player disposals
      extendedStats.kickins,       # Player Kick-ins taken
      extendedStats.kickinsPlayon, # Player Kick-ins played-on from
      timeOnGroundPercentage)      # Time spent on ground
  ) %>%
  # Filter out subs who did not play a significant amount of time
  dplyr::filter(
    timeOnGroundPercentage > 5
  ) %>%
  # Combine player first name and surname to one column
  tidyr::unite(
    col = 'Player', # Name of new column
    player.givenName:player.surname,
    sep = " "
  )
head(playerStats, 5)
## # A tibble: 5 × 8
## round.roundNumber team.name Player teamS…¹ dispo…² exten…³ exten…⁴ timeO…⁵
## <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Melbourne Toby Bedf… home 9 0 0 55
## 2 1 Melbourne Jake Bowey home 9 1 1 73
## 3 1 Melbourne Angus Bra… home 23 0 0 83
## 4 1 Melbourne Ben Brown home 13 0 0 86
## 5 1 Melbourne Bayley Fr… home 9 0 0 81
## # … with abbreviated variable names ¹teamStatus, ²disposals,
## # ³extendedStats.kickins, ⁴extendedStats.kickinsPlayon,
## # ⁵timeOnGroundPercentage
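To see what the filter and unite steps do in isolation, here is a minimal sketch on a small made-up data frame. The object `toyStats` and the players in it are hypothetical; only the column names mirror the scraped data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the scraped playerStats object
toyStats <- data.frame(
  round.roundNumber      = c(1L, 1L, 1L),
  player.givenName       = c("Alex", "Sam", "Jo"),
  player.surname         = c("Smith", "Jones", "Brown"),
  disposals              = c(25, 12, 0),
  timeOnGroundPercentage = c(85, 70, 2)
)

toyClean <- toyStats %>%
  # Drop the player with 5% or less time on ground
  filter(timeOnGroundPercentage > 5) %>%
  # Merge the two name columns into one; the originals are removed by default
  unite(col = "Player", player.givenName:player.surname, sep = " ")

toyClean
```

The third player is removed by the filter, and the two name columns are replaced by a single `Player` column, which is the column the later functions filter on.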
The dataset is much smaller now, so much easier to view and handle. Let’s move on to visualisations.
The graphs we create will be built using the ggplot2 package contained in the tidyverse. ggplot2 offers a catalogue of chart types we can use for our visualisations. For our analyses, we will be using bar charts. Let’s build a graph to analyse Patrick Cripps’ disposals across the season.
playerStats %>%
  filter(Player == 'Patrick Cripps') %>%
  # Call the ggplot function to begin our visualisation
  ggplot(
    aes(x = round.roundNumber, y = disposals) # Choosing our mapping aesthetics
  ) +
  # Add our bars
  geom_bar(stat = 'identity')
Now we have a bar chart for Patrick Cripps’ disposals in each game of 2022. It looks a bit plain so let’s pretty it up.
We’ll change the colour scheme so that it’s nicer to look at, add appropriate titles and change the axis intervals so the information is easier to understand. Let’s also highlight how often he had 20 or more disposals.
playerStats %>%
  filter(Player == 'Patrick Cripps') %>%
  mutate(
    # Assigning colours for when he is under 20 disposals, or at 20 and over
    OverUnder = ifelse(disposals < 20, # if disposals is under 20,
                       '#a61c00',      # make bars red;
                       '#459e1e')      # otherwise, make them green
  ) %>%
  ggplot(aes(x = round.roundNumber, y = disposals)) +
  geom_bar(stat = 'identity',
           aes(fill = OverUnder)) + # Fill bars based on threshold colour assigned above
  scale_fill_identity() + # Use the colour codes stored in OverUnder directly as the bar colours
  # Adding label into bars
  geom_text(aes(label = disposals),
            vjust = 2, # moving text position
            color = "white",
            fontface = "bold") + # changing text style
  # Titles
  labs(title = 'Patrick Cripps Disposals', # Title
       subtitle = 'Season 2022', # subtitle
       x = 'Round', # x-axis title
       y = 'Disposals') + # y-axis title
  # Add a line to visualise threshold
  geom_hline(yintercept = 20,
             linetype = "dashed",
             size = 1.4,
             colour = "blue") +
  # Axis scaling/intervals
  scale_y_continuous(breaks = seq(0, max(playerStats$disposals), by = 5)) + # setting axis range and breakpoints
  # Extend the x-axis to the latest round in the data, so that any recent games the player missed still appear as gaps
  scale_x_continuous(breaks = seq(1, max(playerStats$round.roundNumber)),
                     limits = c(0, max(playerStats$round.roundNumber) + 1),
                     expand = c(0, 0)) +
  # Changing overall graph aesthetics
  theme(
    # increase space around chart
    plot.margin = margin(0, 0.5, 0.5, 0.5, "cm"),
    # changing font of texts
    plot.title = element_text(color = '#FFFFFF', size = 24, margin = margin(b = .5, t = .75, unit = "cm")),
    plot.subtitle = element_text(color = '#FFFFFF', size = 12),
    axis.title.x = element_text(color = '#FFFFFF', size = 16, margin = margin(t = 1, unit = 'cm')),
    axis.title.y = element_text(color = '#FFFFFF', size = 16, margin = margin(r = 1, unit = 'cm')),
    axis.text = element_text(color = '#FFFFFF', size = 12),
    # making the chart 'dark mode'
    panel.background = element_rect(fill = "#2a2929"),
    plot.background = element_rect(fill = "#2a2929"),
    # removing or changing the gridlines
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.major.y = element_line(color = '#b3b3b3'),
    panel.grid.minor.y = element_blank())
The graph not only looks much nicer now, but the important information is also much clearer to see. However, if we want to analyse a different player or a different threshold, we would need to trawl through the lines of code again to try to find the right code to change - assuming we don’t forget what needs to be changed. Or, if we wanted to analyse multiple players at once, we would need to copy many, many lines of code. Therefore, let’s create a function so that it is easy to repeat our graph over and over, whilst easily changing parameters.
To create a function, we can simply wrap the above graph code in a function definition, replacing the values we want to change with the function’s inputs.
round <- max(playerStats$round.roundNumber) # Defining latest round of data
playerDisposals <- # Name of our function
  function(player, threshold){ # Defining our inputs that will be used i.e. a player name, and disposal threshold
    # Insert code from last section, replacing any variables with our new inputs
    playerStats %>%
      filter(Player == player) %>% # Change to our player input, instead of 'Patrick Cripps'
      mutate(OverUnder = ifelse(disposals < threshold, '#a61c00', '#459e1e')) %>% # Change to our threshold input, instead of '20'
      ggplot(aes(x = round.roundNumber, y = disposals)) +
      geom_bar(stat = 'identity', aes(fill = OverUnder)) +
      scale_fill_identity() +
      geom_text(aes(label = disposals), vjust = 2, color = "white", fontface = "bold") +
      labs(title = paste0(player, ' Disposals'), # Changing our title to be dynamic, using the paste0 function
           subtitle = 'Season 2022',
           x = 'Round', y = 'Disposals') +
      geom_hline(yintercept = threshold, # Change y-int to be threshold input
                 linetype = "dashed",
                 size = 1.4,
                 colour = "blue") +
      scale_y_continuous(breaks = seq(0, max(playerStats$disposals), by = 5)) +
      scale_x_continuous(breaks = seq(1, round), # Change to our round object, defined outside of the function
                         limits = c(0, round + 1),
                         expand = c(0, 0)) +
      theme(
        plot.margin = margin(0, 0.5, 0.5, 0.5, "cm"),
        plot.title = element_text(color = '#FFFFFF', size = 24, margin = margin(b = .5, t = .75, unit = "cm")),
        plot.subtitle = element_text(color = '#FFFFFF', size = 12),
        axis.title.x = element_text(color = '#FFFFFF', size = 16, margin = margin(t = 1, unit = 'cm')),
        axis.title.y = element_text(color = '#FFFFFF', size = 16, margin = margin(r = 1, unit = 'cm')),
        axis.text = element_text(color = '#FFFFFF', size = 12),
        panel.background = element_rect(fill = "#2a2929"),
        plot.background = element_rect(fill = "#2a2929"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_line(color = '#b3b3b3'),
        panel.grid.minor.y = element_blank())
  }
Why is there no graph now? Running the code above only defines the function; to produce a graph, we need to call it. Let’s look at Lachie Neale, Touk Miller and Clayton Oliver.
playerDisposals("Lachie Neale", 20)
playerDisposals("Touk Miller", 20)
playerDisposals("Clayton Oliver", 20)
Three graphs are now produced, using just three lines of code. Without a function, we would have needed to copy the original graph code three times, creating a messy wall of text that is much harder to work with - not to mention increasing the chances of breaking our code in the process.
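When the list of players grows, the calls themselves can be generated with `lapply()`. The sketch below is one possible approach on made-up data: `toyStats`, `checkPlayer` and the player names are hypothetical, and a simple summary stands in for the plotting function so that the example runs without any scraped data.

```r
library(dplyr)

# Hypothetical toy data in the same shape as the cleaned playerStats
toyStats <- data.frame(
  Player            = c("Alex Smith", "Alex Smith", "Sam Jones", "Sam Jones"),
  round.roundNumber = c(1, 2, 1, 2),
  disposals         = c(25, 18, 30, 22)
)

# Hypothetical helper: stop early with a clear message if a name has no rows,
# rather than silently returning an empty result for a misspelt player
checkPlayer <- function(data, player) {
  if (!player %in% data$Player) {
    stop("No rows found for player: ", player)
  }
  invisible(TRUE)
}

players <- c("Alex Smith", "Sam Jones")

# Apply the same steps to each name in turn
results <- lapply(players, function(p) {
  checkPlayer(toyStats, p)
  toyStats %>%
    filter(Player == p) %>%
    summarise(games = n(), avgDisposals = mean(disposals))
})
names(results) <- players

results
```

With the function from this tutorial, the equivalent call would be `lapply(c("Lachie Neale", "Touk Miller", "Clayton Oliver"), playerDisposals, threshold = 20)`, which returns a list of plots.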
With this framework, we can analyse almost any recorded stat by simply changing what we scrape and what we map to our y-axis. For our final function, let’s analyse a player’s kick-ins when a particular teammate is or isn’t also playing.
In general, when building a function, we should build our code outside of a function first as we did previously with the graphs, in order to test that our code is functional. For the following example, the initial build won’t be shown, just the function code for simplicity.
withoutPlayerKickIns <- # Naming our function
  function(targetPlayer, otherPlayer){ # Define our inputs
    # Creating an indicator for when the 'otherPlayer' was in the game
    other <- playerStats %>%
      filter(Player == otherPlayer) %>%
      select(round.roundNumber) %>%
      mutate(
        otherPlayerPlayed = 'Y'
      )
    # Joining the indicator with the target player
    playerStats <- playerStats %>%
      filter(Player == targetPlayer) %>%
      select(round.roundNumber, extendedStats.kickins) %>%
      left_join(other, by = "round.roundNumber") # Left join the previously created dataframe by round number
    ### Dataframe now contains targetPlayer matches, with an indicator if the otherPlayer was present
    # Replacing NAs
    playerStats$otherPlayerPlayed[is.na(playerStats$otherPlayerPlayed)] <- 'N'
    # Setting the graph's limit
    maxKI <- max(playerStats$extendedStats.kickins)
    playerStats %>%
      # Graph framework remains almost the same as above
      ggplot(aes(x = round.roundNumber, y = extendedStats.kickins)) +
      geom_bar(stat = 'identity', aes(fill = otherPlayerPlayed)) +
      # Defining colours and legend based on whether otherPlayer was present
      scale_fill_manual(values = c("Y" = '#a61c00',
                                   "N" = '#459e1e'),
                        breaks = c("Y", "N"), # Set the order explicitly so "With" labels Y and "Without" labels N
                        labels = c("With", "Without")) +
      geom_text(aes(label = extendedStats.kickins), vjust = 2, color = "#FFFFFF", fontface = "bold") +
      labs(title = paste0(targetPlayer, ' Kick ins taken With/out ', otherPlayer), # Dynamic title using both player names
           subtitle = 'Season 2022',
           x = 'Round', y = 'Kick ins') +
      # Adding mean lines for targetPlayer's averages with/out otherPlayer
      geom_hline(yintercept =
                   mean(playerStats$extendedStats.kickins[playerStats$otherPlayerPlayed == 'N']), linetype = 'dashed', color = '#459e1e') +
      geom_hline(yintercept =
                   mean(playerStats$extendedStats.kickins[playerStats$otherPlayerPlayed == 'Y']), linetype = 'dashed', color = '#a61c00') +
      scale_y_continuous(breaks = seq(0, maxKI, by = 2)) +
      scale_x_continuous(breaks = seq(1, round),
                         limits = c(0, round + 1),
                         expand = c(0, 0)) +
      theme(
        plot.margin = margin(0, 0.5, 0.5, 0.5, "cm"),
        plot.title = element_text(color = '#FFFFFF', size = 18, margin = margin(b = .5, t = .75, unit = "cm")),
        plot.subtitle = element_text(color = '#FFFFFF', size = 11),
        axis.title.x = element_text(color = '#FFFFFF', size = 14, margin = margin(t = 1, unit = 'cm')),
        axis.title.y = element_text(color = '#FFFFFF', size = 14, margin = margin(r = 1, unit = 'cm')),
        axis.text = element_text(color = '#FFFFFF', size = 10),
        panel.background = element_rect(fill = "#2a2929"),
        plot.background = element_rect(fill = "#2a2929"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_line(color = '#b3b3b3'),
        panel.grid.minor.y = element_blank(),
        # legend box theming
        legend.position = 'bottom',
        legend.title = element_blank())
  }
Now to run the function. We will investigate how Caleb Daniel’s absence from the team influenced Bailey Dale’s kick-in frequency.
withoutPlayerKickIns('Bailey Dale', "Caleb Daniel")
If we want to rerun the code with different players, it is as simple as calling the function with different inputs. Functions make repeating code much easier and cleaner, and should be utilised when any piece of code needs to be used more than once.
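To make the join step inside `withoutPlayerKickIns` easier to follow, here is a minimal sketch of the same logic on made-up data. The `toyStats` object and the "Target" and "Other" players are hypothetical, and `coalesce()` is used here as one alternative way to replace the missing indicator values.

```r
library(dplyr)

# Hypothetical toy data: "Other" missed round 2
toyStats <- data.frame(
  Player                = c("Target", "Target", "Target", "Other", "Other"),
  round.roundNumber     = c(1, 2, 3, 1, 3),
  extendedStats.kickins = c(2, 6, 3, 4, 5)
)

# One row per round the other player appeared in, flagged with 'Y'
other <- toyStats %>%
  filter(Player == "Other") %>%
  select(round.roundNumber) %>%
  mutate(otherPlayerPlayed = "Y")

# Keep every round of the target player; rounds with no match get NA, then 'N'
joined <- toyStats %>%
  filter(Player == "Target") %>%
  select(round.roundNumber, extendedStats.kickins) %>%
  left_join(other, by = "round.roundNumber") %>%
  mutate(otherPlayerPlayed = coalesce(otherPlayerPlayed, "N"))

# The two averages that the dashed lines on the chart represent
averages <- joined %>%
  group_by(otherPlayerPlayed) %>%
  summarise(meanKickIns = mean(extendedStats.kickins))

averages
```

A left join is used because it keeps all of the target player's games, including those the other player missed, which is exactly the comparison the chart is built to show.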
As mentioned previously, the framework covered above is flexible enough to suit the visualisation and analysis of almost any recorded stat. The visualisation techniques make it easy to answer questions such as “How often is a player over 20 disposals?” or “How does one player affect another’s kick-ins?”.
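The first of those questions can also be answered with a number rather than a chart. The sketch below is a hypothetical helper, `hitRate`, run on made-up data; it counts a game as a hit when disposals are at or above the threshold, matching the green bars in the charts above.

```r
library(dplyr)

# Hypothetical toy data in the same shape as the cleaned playerStats
toyStats <- data.frame(
  Player    = c(rep("Alex Smith", 4), rep("Sam Jones", 4)),
  disposals = c(25, 18, 20, 31, 12, 22, 19, 15)
)

# Hypothetical helper: how often did a player reach the threshold?
hitRate <- function(data, player, threshold) {
  data %>%
    filter(Player == player) %>%
    summarise(
      games     = n(),                          # games played
      gamesOver = sum(disposals >= threshold),  # games at or above the threshold
      rate      = mean(disposals >= threshold)  # proportion of games
    )
}

hitRate(toyStats, "Alex Smith", 20)
```

Passing the data frame in as an argument, rather than reading a global object as the plotting functions do, is a design choice that makes the helper easier to test on small examples like this one.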