Stats Overload!?

Have you ever found yourself ‘drowned in data’ during an AFL broadcast? Just want to watch your team play, but keep getting bombarded with ‘contested possession differentials’ and ‘inside 50 efficiency’ numbers? OK, we get it! But what does it all mean?

Turns out, those in the know (coaches, commentators, players, agents, Fantasy Football hacks!) use this data to see who is “winning” a game, because the scoreboard doesn’t always tell the whole story.

This tutorial will help you find, make sense of, and put into practice key football statistics, and test whether there is any relationship between those statistics and the final score margin of a match during the 2022 AFL season.

Ok, so where are we getting this data?

Within R, we will use the fitzRoy package to access all the data we need. The two functions used below pull their data from the AFL Tables website. First install the package with install.packages("fitzRoy"), then load it with the following code:

library(fitzRoy)

Now we want two data sets: results_data will give us the margin of each game in the season, while stats_data will give us the individual player statistics for each game, which we will wrangle into something a little more digestible (more on this later):

results_data <- fitzRoy::fetch_results_afltables(season = 2022)
stats_data <- fitzRoy::get_afltables_stats(start_date = "2022-01-01", end_date = "2022-10-10")

Once we have manipulated these two data sets to obtain the game margins and the key match statistics, we can merge them (unfortunately we can’t get everything we need from just one data set!) and begin to make sense of which statistical categories are really worth keeping an eye on.

Step 1. Getting Results & Margin data (using results_data)

Firstly, we need a data set that shows each match and its margin. We also need to create a uniform ‘match-id’ that we can use in both data sets to eventually merge them.

The following code chunk will obtain all of the above:

# Creating a unique match-id for each game using home team, away team and game date
library(tidyverse) # load in this package in order to use dplyr and ggplot
results_data <- results_data %>%
  mutate(match_id = 
           paste(Home.Team, Away.Team, Date, sep = "-"))

# The only data we need to retain is the teams, match-id and margin
results_data <- results_data %>%
  select(Home.Team, Away.Team, Margin, match_id)
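To make the match-id idea concrete, here is a minimal base R sketch on a two-row toy data frame. The name toy_results and its dates are made up for illustration; only the column names mirror the ones used above. It shows what paste() produces and one way to confirm that every id is unique before relying on it as a merge key.

```r
# Toy stand-in for results_data (hypothetical rows, for illustration only)
toy_results <- data.frame(
  Home.Team = c("Melbourne", "Carlton"),
  Away.Team = c("Footscray", "Richmond"),
  Date      = as.Date(c("2022-03-16", "2022-03-17"))
)

# Same construction as in the tutorial: home team, away team and date
toy_results$match_id <- paste(toy_results$Home.Team,
                              toy_results$Away.Team,
                              toy_results$Date,
                              sep = "-")

toy_results$match_id

# anyDuplicated() returns 0 when every id is unique
anyDuplicated(toy_results$match_id)
```

The same anyDuplicated() check could be run on results_data$match_id; a non-zero result would suggest the id needs another component.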

Step 2. Getting Statistical data (using stats_data)

Now that we have the margin for each game, we need to obtain the statistical differentials for each game. fitzRoy returns statistical data game-by-game and player-by-player, so we need to collate the data at game level and obtain the differential per team. For example, if Collingwood is at home against Geelong and has 45 inside 50s to Geelong’s 21, the inside 50 differential would be +24.
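The Collingwood and Geelong example can be written out as a small base R sketch. The data frame toy_totals is hypothetical and simply holds the two team totals from the example; the differential is taken as home minus away, which is the convention this tutorial uses.

```r
# Hypothetical team-level totals for a single match
toy_totals <- data.frame(
  Location    = c("A", "B"),                 # A = home, B = away
  Playing.for = c("Collingwood", "Geelong"),
  total_I50   = c(45, 21)
)

home_I50 <- toy_totals$total_I50[toy_totals$Location == "A"]
away_I50 <- toy_totals$total_I50[toy_totals$Location == "B"]

# Differential from the home team's point of view
diff_I50 <- home_I50 - away_I50
diff_I50  # 24
```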

This code chunk is a little longer, so stick with me:

# First, we need to ensure all teams have the same name. The Bulldogs and Giants have different 
# names in each data set, so we need to make them the same
stats_data$Home.team[stats_data$Home.team=="Greater Western Sydney"] <- "GWS"
stats_data$Away.team[stats_data$Away.team=="Greater Western Sydney"] <- "GWS"
stats_data$Playing.for[stats_data$Playing.for=="Greater Western Sydney"] <- "GWS"

stats_data$Home.team[stats_data$Home.team=="Western Bulldogs"] <- "Footscray"
stats_data$Away.team[stats_data$Away.team=="Western Bulldogs"] <- "Footscray"
stats_data$Playing.for[stats_data$Playing.for=="Western Bulldogs"] <- "Footscray"

# Let's create a unique match-id (same format as Step 1) to make it easy to merge the df's
stats_data <- stats_data %>%
  mutate(match_id = 
           paste(Home.team, Away.team, Date, sep = "-"))

# Labelling the home team as "A" and away as "B", this is important when obtaining our 
# differentials, and making sure they're in the right order
stats_data <- stats_data %>%
  mutate(Location = ifelse(Playing.for == Home.team, "A", "B"))

# Getting match data with selected statistics grouped by team/match
# Note: you can select as many or as few statistical categories as you like. For the purposes 
# of this tutorial I have chosen the 7 below
stats_data <- stats_data %>% 
  group_by(match_id, Location, Playing.for) %>% 
  summarise(total_clear=sum(Clearances), total_contposs=sum(Contested.Possessions),
            total_marks=sum(Marks), total_1p=sum(One.Percenters), total_tackles=sum(Tackles),
            total_uncontposs=sum(Uncontested.Possessions), total_I50=sum(Inside.50s))

# Grouping at a match level to get the differentials per game
stats_data <- stats_data %>% group_by(match_id) %>%
  summarise(diff_clear=diff(total_clear), diff_contposs=diff(total_contposs),
            diff_marks=diff(total_marks), diff_1p=diff(total_1p), diff_tackles=diff(total_tackles),
            diff_uncontposs=diff(total_uncontposs), diff_I50=diff(total_I50))

# diff() returns the second row minus the first (away "B" minus home "A"), so we flip the
# sign of every differential to express it as home minus away, matching Margin
stats_data <- stats_data %>%
  mutate(across(starts_with("diff_"), ~ .x * -1))
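The multiplication by -1 is easy to miss, so here is a minimal base R sketch of why it is needed. The toy data frame and its numbers are made up; it assumes, as in the chunk above, that the home row ("A") comes before the away row ("B") within each match.

```r
# Hypothetical team totals for two matches, home row first in each match
toy <- data.frame(
  match_id  = c("m1", "m1", "m2", "m2"),
  Location  = c("A", "B", "A", "B"),
  total_I50 = c(45, 21, 38, 52)
)

# diff() computes later minus earlier, i.e. away (B) minus home (A)
away_minus_home <- tapply(toy$total_I50, toy$match_id, diff)
away_minus_home

# Flipping the sign gives home minus away, the same direction as Margin
home_minus_away <- away_minus_home * -1
home_minus_away
```

In match m1 the home side had 24 more inside 50s, so the raw diff() value is -24 and the flipped value is +24.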

Step 3. Merging the data and assessing correlations

Now we have all the data we need, both margins and statistical differentials for each match of the season, we can merge the data and see if there are any correlations between key categories and the winning margin:

# Merging dataframes
complete_data <- stats_data %>% left_join(results_data, by="match_id")

# Creating a df with just margin and statistical metrics in order to test correlation
complete_data_metrics <- complete_data[,c(2:8,11)]

# Utilising psych package to see a correlation matrix
library(psych)
pairs.panels(complete_data_metrics)
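Because the merge depends on match_id strings matching exactly, it can be worth confirming that no match was left without a margin. The sketch below uses base R and two small made-up data frames (toy_stats and toy_results, with invented values) in which one team name was deliberately left inconsistent.

```r
# Hypothetical data: the second id differs because of an unharmonised team name
toy_stats <- data.frame(
  match_id = c("GWS-Sydney-2022-03-19", "Footscray-Carlton-2022-03-24"),
  diff_I50 = c(5, -3)
)
toy_results <- data.frame(
  match_id = c("GWS-Sydney-2022-03-19", "Western Bulldogs-Carlton-2022-03-24"),
  Margin   = c(20, -12)
)

# Ids present in the stats data but missing from the results data
unmatched <- setdiff(toy_stats$match_id, toy_results$match_id)
unmatched

# Base R equivalent of a left join; unmatched rows get NA for Margin
merged <- merge(toy_stats, toy_results, by = "match_id", all.x = TRUE)
n_missing <- sum(is.na(merged$Margin))
n_missing
```

On the real data, sum(is.na(complete_data$Margin)) performs the same check; anything above zero points to a team name or date that differs between the two sources.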

pairs.panels() will display the following correlation matrix. As we can see, there is a relatively strong positive correlation between contested possession differential and margin (0.58), and between inside 50 differential and margin (0.71). We can also see the statistics that have minimal correlation with margin, like tackles (0.12) and one-percenters (0.15).

Correlation Matrix comparing key stat categories and margin
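If you would rather read the correlations as numbers than from a plot, base R's cor() returns the same kind of matrix. The sketch below runs on a small made-up data frame (toy, with invented values), so its coefficients will not match the figures quoted above.

```r
# Hypothetical differentials and margins for five matches
toy <- data.frame(
  diff_I50     = c(-20, -10, 0, 10, 20),
  diff_tackles = c(5, -8, 2, -3, 6),
  Margin       = c(-35, -12, 3, 18, 40)
)

# Pearson correlation between every pair of columns
cor_matrix <- cor(toy)
round(cor_matrix, 2)

# Correlation of each statistic with Margin, strongest first
sort(cor_matrix[, "Margin"], decreasing = TRUE)
```

Applied to the tutorial's data, cor(complete_data_metrics) should give the numeric version of the matrix that pairs.panels() draws.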

Step 4. Visualising the Relationship

Now that we have the margins and key statistical categories, we can visualise what this looks like for each game of the season via a ggplot visualisation. We can plot margin against any of our statistical categories:

# Plotting margin against contested possessions 
complete_data %>%
  ggplot(aes(diff_contposs, Margin)) +
  geom_point(alpha = 0.9, col = "red", size = 2.5) +
  geom_smooth(method = 'lm') +
  geom_hline(yintercept = 0, color = "black") +
  geom_vline(xintercept = 0, color = "black") +
  theme_bw()

Showing the positive relationship between contested possession differential and winning margin

Step 5. Regression Analysis

Finally, we want to see whether one or more variables, used together, can predict the margin. For this, we can use a linear regression model. In the example below we use both contested possessions and inside 50s, as each has a relatively strong correlation with margin while not being too strongly correlated with each other (0.55):

# Building a linear regression model
model <- lm(Margin ~ diff_contposs + diff_I50, 
             data = complete_data)

# Getting a summary of the fitted model
summary(model) # adjusted R^2 of about 55%

# Plotting the model diagnostics to check the regression assumptions
par(mfrow = c(2, 2))
plot(model)

Using the above model, we can see that contested possession differential and inside 50 differential together go some way towards predicting the margin, explaining roughly 55% of its variance (adjusted R^2).
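To show how a model like this could be used for an actual prediction, here is a minimal sketch with base R's lm() and predict(). It is fitted on six made-up matches (toy), so the coefficients are purely illustrative, and new_match is a hypothetical game rather than a real one.

```r
# Hypothetical differentials and margins for six matches
toy <- data.frame(
  diff_contposs = c(-15, -6, 2, 9, 14, -3),
  diff_I50      = c(-18, -4, -1, 12, 20, 5),
  Margin        = c(-40, -10, 5, 25, 48, 2)
)

# Same model formula as in the tutorial, fitted on the toy data
toy_model <- lm(Margin ~ diff_contposs + diff_I50, data = toy)
coef(toy_model)

# A hypothetical match: home side wins contested possessions by 10 and inside 50s by 8
new_match <- data.frame(diff_contposs = 10, diff_I50 = 8)

# Point prediction plus a 95% prediction interval (columns fit, lwr, upr)
prediction <- predict(toy_model, newdata = new_match, interval = "prediction")
prediction
```

Swapping toy_model for the tutorial's model would give a predicted margin for any pair of differentials. The width of the interval is a useful reminder of how much these two statistics leave unexplained.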

Now it all makes sense!

Feel free to adapt the above steps using other statistics and other models. Hopefully this gives you a greater understanding of which stats matter most when predicting the final margin of an AFL match, based on data from the 2022 season.