The Round 11 match of the 2020 AFL season between Port Adelaide & Richmond has been widely praised as the “best game of the season”. A fast-finishing Port Adelaide prevailed by 21 points. Despite the close result, match reports suggest the game was statistically dominated by Port Adelaide

Consider ESPN’s summary

Port Adelaide were

Despite these discrepancies, the game was close for most of the day, with Richmond even leading at 3/4 time

This had me wondering: how important are commonly measured AFL statistics to the end result?

To better understand this, I have aggregated all AFL match-day statistics for season 2019, & looked at the relationship between game-day stats & the eventual winning margin of the game

Specifically I have focused upon:

“differential statistics” - what was the difference between the winning & losing side for metric x, y or z?

Below I have outlined the programming to do this

The Data

AFLTables & Footywire are undoubtedly the best independent online resources for footy-related statistics

R users are fortunate that the vast depth of statistics from these websites is easily accessible via the “fitzRoy” package

Let’s read in fitzRoy, and all other packages we will use for analysis

# Data extraction 
library(devtools)
library(fitzRoy)
# Data cleaning 
library(snakecase)
library(tidyr)
library(dplyr)
library(reshape)
library(knitr)
# Data visualisation
library(kableExtra)
library(corrplot)
library(taucharts)
# Data analysis 
library(ClustOfVar)
library(cluster)

fitzRoy provides a straightforward function to read in data from Footywire (“get_footywire_stats”). The only required input is which matches to include in the data extract, specified via the website’s match IDs

The range of match IDs has been limited to the 207 games contested in the 2019 season. Match IDs can be identified via the web URL for each game on Footywire

I encountered some buggy behaviour when reading in every required match in a single call, but found splitting the extract into two batches worked fine

The match-data from footywire takes ~ 10-15 minutes to read, so patience is required :)

# Extract (split into two batches to avoid the buggy behaviour noted above)
footywire1 <- get_footywire_stats(ids = 9721:9875) # works
footywire2 <- get_footywire_stats(ids = 9876:9927) # works
# Bind into a single data-frame 
Season2019 <- rbind(footywire1, footywire2)
# lower case, under_score all column titles
names(Season2019) <- to_snake_case(names(Season2019))

The structure of this data-extract is 207 games x 44 selected players for each game, which produces 9108 rows of data

Each row reports the statistics of an individual player in every game

This affords great flexibility to explore both match-level and player-level statistics

date season round venue player team opposition status match_id cp up ed de cm ga_15 mi_5 one_percenters bo ccl scl si mg to itc t_5 tog k hb d m g b t ho ga_35 i_50 cl cg r_50 ff fa af sc
2019-03-21 2019 Round 1 MCG Patrick Cripps Carlton Richmond Home 9721 21 11 26 81.2 0 0 0 0 1 4 3 5 263 3 4 0 89 10 22 32 1 0 0 6 0 0 2 7 3 2 3 1 101 126
2019-03-21 2019 Round 1 MCG Marc Murphy Carlton Richmond Home 9721 6 23 21 72.4 0 0 0 0 0 0 1 7 530 5 3 0 87 16 13 29 4 1 0 1 0 0 5 1 1 4 1 0 97 91
2019-03-21 2019 Round 1 MCG Kade Simpson Carlton Richmond Home 9721 5 19 21 77.8 0 0 0 1 0 1 1 2 462 2 6 0 84 15 12 27 6 0 0 1 0 0 1 2 1 5 1 0 92 83
2019-03-21 2019 Round 1 MCG Dale Thomas Carlton Richmond Home 9721 6 17 23 85.2 0 1 0 4 0 0 0 8 434 4 6 0 78 15 12 27 3 1 0 2 0 1 3 0 4 5 1 1 90 93
2019-03-21 2019 Round 1 MCG Nic Newman Carlton Richmond Home 9721 5 17 22 84.6 0 0 0 3 0 0 1 4 584 2 6 0 84 21 5 26 9 1 0 2 0 0 2 1 2 12 1 0 115 134
2019-03-21 2019 Round 1 MCG Edward Curnow Carlton Richmond Home 9721 7 18 18 72.0 0 1 2 1 0 0 2 8 303 6 0 2 82 13 12 25 12 0 1 3 0 1 3 2 4 1 1 0 113 98
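
As a quick sanity check on those dimensions (a short sketch using the object created above):

# 207 matches x 44 selected players = 9108 player-rows
dim(Season2019)                       # expect 9108 rows
length(unique(Season2019$match_id))   # expect 207 matches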

Re-shape & aggregate

Two key steps are required to transform this data for analysis:

  • aggregate the player-level statistics into team-level totals for each match
  • re-shape the data so that each match is captured in a single row

We will use several “tidyverse” functions to complete the first step - aggregating our player statistics up to match level

# tally up all key statistics per game, per team 
Season2019_Ag <- Season2019 %>%
  dplyr::select(season, 
         round,
         date,
         team, 
         opposition, 
         status,
         cp, # CONTESTED POSSESSION
         up, # UNCONTESTED POSSESSION 
         de, # DISPOSAL EFFICIENCY 
         one_percenters, # ONE PERCENTERS 
         mg, # METRES GAINED
         to, # TURNOVER 
         k, # KICKS
         hb, # HANDBALLS
         d, # DISPOSALS
         m, # MARK
         i_50, # INSIDE 50
         cl, # CLEARANCE
         cg, # CLANGERS
         r_50, # REBOUND 50 
         ff, # FREES FOR 
         fa, # FREES AGAINST
         cm, # CONTESTED MARKS
         ga_15, # GOAL ASSISTS
         bo, # BOUNCES
         ccl, # CENTRE CLEARANCES
         scl, # STOPPAGE CLEARANCE 
         itc, # INTERCEPTS
         si, # SCORE INVOLVEMENTS 
         t_5, # TACKLES INSIDE 50
         match_id) %>% 
  group_by(match_id, 
           team, 
           opposition, 
           status,
           season, 
           round,
           date) %>%
  summarise(
    CP = sum(cp),
    UP = sum(up),
    DE = round(mean(de),1),
    OP = sum(one_percenters),
    MG = round(sum(mg),1),
    TO = sum(to),
    K = sum(k),
    HB = sum(hb),
    D = sum(d),
    M = sum(m),
    I50 = sum(i_50),
    CL = sum(cl),
    CG = sum(cg),
    R50 = sum(r_50),
    FF = sum(ff),
    FA = sum(fa),
    CM = sum(cm),
    GA = sum(ga_15),
    BO = sum(bo),
    CCL = sum(ccl),
    SCL = sum(scl),
    ITC = sum(itc),
    SI = sum(si),
    T5 = sum(t_5)
  )
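
A quick check of the aggregation (a sketch): with one row per team per match, we expect 414 rows

nrow(Season2019_Ag)   # 207 matches x 2 teams = expect 414 rows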

However, we have one further issue to address: Data for each match is split across two rows - one for the home team, the other for the away team. The required end-state is for data from both teams to be summarized in a single row

The tidyr function spread can almost solve this issue. However, it only accepts one “value” argument, so we are not able to pivot over multiple variables at once

Fortunately, clever user danr from the RStudio Community wrote a function augmenting the spread command to do exactly this. We will implement it below:

# remove opposition 
Season2019_Ag$opposition <- NULL

# Function to spread across multiple value columns 
myspread <- function(df, key, value) {
  # quote the key column
  keyq <- rlang::enquo(key)
  # quote the value columns so they can be spliced into gather()
  valueq <- rlang::enquo(value)
  s <- rlang::quos(!!valueq)
  df %>% gather(variable, value, !!!s) %>%
    unite(temp, !!keyq, variable) %>%
    spread(temp, value)
}

# spread - so each row reflects a single game 
Season2019_Ag <- Season2019_Ag %>%
  myspread(key = status, value = c(team, CP, UP, DE, OP, MG, TO, K, HB, D, M,
                                   I50, CL, CG, R50, FF, FA, CM, GA, BO, CCL,
                                   SCL, ITC, SI, T5))
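
As an aside, newer versions of tidyr can pivot multiple value columns directly via pivot_wider(); a sketch of the equivalent call (not run here), with names_glue reproducing the “Home_CP” / “Away_CP” style column names created by myspread():

# Season2019_Ag <- Season2019_Ag %>%
#   ungroup() %>%
#   pivot_wider(names_from = status,
#               values_from = c(team, CP, UP, DE, OP, MG, TO, K, HB, D, M,
#                               I50, CL, CG, R50, FF, FA, CM, GA, BO, CCL,
#                               SCL, ITC, SI, T5),
#               names_glue = "{status}_{.value}")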

& after some final tidying …

# create common key
Season2019_Ag$JOIN_ID <- paste(Season2019_Ag$date, "-", 
                               Season2019_Ag$Home_team, "-", 
                               Season2019_Ag$Away_team)

# Re-order Columns
Season2019_Ag <- Season2019_Ag[,c(2,3,4,1,55,52,27,30:51,53:54,5:26,28:29)]

# Ensure data-set is actually a dataframe
Season2019_Ag <- as.data.frame(Season2019_Ag)

We now have an aggregated, pivoted, tidied data-frame. Each game is captured as a unique row, and player statistics have been aggregated into team statistics.

To be sure the aggregation worked, I spot-checked a number of random matches against official match-day statistics. This confirmed the results of this analysis were consistent with official statistics

However, an important feature is missing - the outcome of the game. We can’t get too far with these statistics if the winning team is unknown

The match result could in theory be determined from this data-set - it would require calculating overall team scores from individual players’ goals & behinds
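
A sketch of how that could look (the Scores object & the 6-points-per-goal / 1-per-behind arithmetic are my own illustration, not part of the workflow that follows):

# derive team totals from the player-level goal (g) & behind (b) columns
Scores <- Season2019 %>%
  group_by(match_id, team) %>%
  summarise(score = sum(g) * 6 + sum(b))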

However I decided an easier method would be to source this data directly from AFL Tables, again accessible via the fitzRoy package

# read in data from AFL tables
AT_2019 <- get_afltables_stats(start_date = '2019-01-01',
                                 end_date = '2019-12-31')

# replace dots with underscores, all lower case
names(AT_2019) <- to_snake_case(names(AT_2019))

In order to combine the two data-sets, a common key is required. By combining three columns which exist in both data-sets (date, home team & away team), I was able to create my own common key

# create a common key 
AT_2019$JOIN_ID <- paste(AT_2019$date, "-", 
                         AT_2019$home_team, "-", 
                         AT_2019$away_team)

# Select scores - the only columns we would like to keep from this table 
AT_2019 <- dplyr::select(AT_2019, 
                  JOIN_ID, 
                  home_score, 
                  away_score)

# keep one row per match (the AFL Tables extract is player-level)
AT_2019 <- AT_2019 %>% distinct(JOIN_ID, .keep_all = TRUE)

# Convert to dataframe
AT_2019 <- as.data.frame(AT_2019)

After some further tidying (error correction & harmonizing team names), the two data-sets are ready to be joined by the calculated JOIN_ID column

# Ensure join columns are comparable 
AT_2019$JOIN_ID <- to_snake_case(AT_2019$JOIN_ID)
Season2019_Ag$JOIN_ID <- as.character(Season2019_Ag$JOIN_ID)
Season2019_Ag$JOIN_ID <- to_snake_case(Season2019_Ag$JOIN_ID)

# Other corrections
AT_2019$JOIN_ID <- gsub("greater_western_sydney", "gws", AT_2019$JOIN_ID)
AT_2019$JOIN_ID <- gsub("brisbane_lions", "brisbane", AT_2019$JOIN_ID)

# Correct error in AFL tables listing Geelong as home-side in the 2019 Preliminary Final
AT_2019$JOIN_ID <- gsub("2019_09_20_geelong_richmond", "2019_09_20_richmond_geelong", AT_2019$JOIN_ID)

# Join Footywire & AFL tables 
Season2019_Ag <- left_join(Season2019_Ag, AT_2019, by = "JOIN_ID")

& some final tidying to re-order our variables, remove redundant variables, & convert statistics to numeric format

# Re-order columns 
Season2019_Ag <- Season2019_Ag[,c(1:7, 56:57, 8:55)]

# Remove IDs - no longer required
Season2019_Ag$match_id <- NULL
Season2019_Ag$JOIN_ID <- NULL

# need to change all stats columns from character to numeric 
cols = c(8:55)    
Season2019_Ag[,cols] = apply(Season2019_Ag[,cols], 2, function(x) as.numeric(as.character(x)))
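
For reference, the same conversion can be written with newer dplyr (a sketch, not run here):

# Season2019_Ag <- Season2019_Ag %>%
#   mutate(across(8:55, ~ as.numeric(as.character(.x))))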

Our data model is nearly complete with match-statistics for both home & away teams, & the total score for each team

The next step is to calculate the statistical “differentials” between the winning & losing teams

An issue is that the winning score could come from either the home or away score column. As such, a simple subtraction of these fields won’t work.

The necessary work-around is to split our data in two: “winning home team” & “winning away team”, then recombine them into an overall “winning differentials” data-frame

Let’s start with winning home team:

# "HOME" Winners
# Calculate a "winning margin" score - ultimately we want to see how different
# statistics are related to this outcome variable 
Season2019_Ag$margin <- Season2019_Ag$home_score - Season2019_Ag$away_score

# Subset only "home" team wins 
WinnersHome <- filter(Season2019_Ag, margin >= 1)

# Create differential columns 
WinnersHome$BO_Diff <- WinnersHome$Home_BO - WinnersHome$Away_BO
WinnersHome$CCL_Diff <- WinnersHome$Home_CCL - WinnersHome$Away_CCL
WinnersHome$CG_Diff <- WinnersHome$Home_CG - WinnersHome$Away_CG
WinnersHome$CL_Diff <- WinnersHome$Home_CL - WinnersHome$Away_CL
WinnersHome$CM_Diff <- WinnersHome$Home_CM - WinnersHome$Away_CM
WinnersHome$CP_Diff <- WinnersHome$Home_CP - WinnersHome$Away_CP
WinnersHome$D_Diff <- WinnersHome$Home_D - WinnersHome$Away_D
WinnersHome$DE_Diff <- WinnersHome$Home_DE - WinnersHome$Away_DE
WinnersHome$FA_Diff <- WinnersHome$Home_FA - WinnersHome$Away_FA
WinnersHome$FF_Diff <- WinnersHome$Home_FF - WinnersHome$Away_FF
WinnersHome$GA_Diff <- WinnersHome$Home_GA - WinnersHome$Away_GA
WinnersHome$HB_Diff <- WinnersHome$Home_HB - WinnersHome$Away_HB
WinnersHome$I50_Diff <- WinnersHome$Home_I50 - WinnersHome$Away_I50
WinnersHome$ITC_Diff <- WinnersHome$Home_ITC - WinnersHome$Away_ITC
WinnersHome$K_Diff <- WinnersHome$Home_K - WinnersHome$Away_K
WinnersHome$M_Diff <- WinnersHome$Home_M - WinnersHome$Away_M
WinnersHome$MG_Diff <- WinnersHome$Home_MG - WinnersHome$Away_MG
WinnersHome$OP_Diff <- WinnersHome$Home_OP - WinnersHome$Away_OP
WinnersHome$R50_Diff <- WinnersHome$Home_R50 - WinnersHome$Away_R50
WinnersHome$SCL_Diff <- WinnersHome$Home_SCL - WinnersHome$Away_SCL
WinnersHome$SI_Diff <- WinnersHome$Home_SI - WinnersHome$Away_SI
WinnersHome$TO_Diff <- WinnersHome$Home_TO - WinnersHome$Away_TO
WinnersHome$UP_Diff <- WinnersHome$Home_UP - WinnersHome$Away_UP
WinnersHome$T5_Diff <- WinnersHome$Home_T5 - WinnersHome$Away_T5

# Designate these games as "home" team winning 
WinnersHome$Winner <- "HOME"
WinnersHome$WinningTeam <- WinnersHome$Home_team

cols = c(1:7,82,56:81) 
WinnersHome <- WinnersHome[, cols]

Repeat this step for “winning away team”

# "AWAY" Winners
# Subset only "away" team wins 
WinnersAway <- filter(Season2019_Ag, margin <= -1)

# Create differential columns 
WinnersAway$BO_Diff <- WinnersAway$Away_BO - WinnersAway$Home_BO
WinnersAway$CCL_Diff <- WinnersAway$Away_CCL - WinnersAway$Home_CCL
WinnersAway$CG_Diff <- WinnersAway$Away_CG - WinnersAway$Home_CG
WinnersAway$CL_Diff <- WinnersAway$Away_CL - WinnersAway$Home_CL
WinnersAway$CM_Diff <- WinnersAway$Away_CM - WinnersAway$Home_CM
WinnersAway$CP_Diff <- WinnersAway$Away_CP - WinnersAway$Home_CP
WinnersAway$D_Diff <- WinnersAway$Away_D - WinnersAway$Home_D
WinnersAway$DE_Diff <- WinnersAway$Away_DE - WinnersAway$Home_DE
WinnersAway$FA_Diff <- WinnersAway$Away_FA - WinnersAway$Home_FA
WinnersAway$FF_Diff <- WinnersAway$Away_FF - WinnersAway$Home_FF
WinnersAway$GA_Diff <- WinnersAway$Away_GA - WinnersAway$Home_GA
WinnersAway$HB_Diff <- WinnersAway$Away_HB - WinnersAway$Home_HB
WinnersAway$I50_Diff <- WinnersAway$Away_I50 - WinnersAway$Home_I50
WinnersAway$ITC_Diff <- WinnersAway$Away_ITC - WinnersAway$Home_ITC
WinnersAway$K_Diff <- WinnersAway$Away_K - WinnersAway$Home_K
WinnersAway$M_Diff <- WinnersAway$Away_M - WinnersAway$Home_M
WinnersAway$MG_Diff <- WinnersAway$Away_MG - WinnersAway$Home_MG
WinnersAway$OP_Diff <- WinnersAway$Away_OP - WinnersAway$Home_OP
WinnersAway$R50_Diff <- WinnersAway$Away_R50 - WinnersAway$Home_R50
WinnersAway$SCL_Diff <- WinnersAway$Away_SCL - WinnersAway$Home_SCL
WinnersAway$SI_Diff <- WinnersAway$Away_SI - WinnersAway$Home_SI
WinnersAway$TO_Diff <- WinnersAway$Away_TO - WinnersAway$Home_TO
WinnersAway$UP_Diff <- WinnersAway$Away_UP - WinnersAway$Home_UP
WinnersAway$T5_Diff <- WinnersAway$Away_T5 - WinnersAway$Home_T5

# Designate these games as "away" team winning 
WinnersAway$Winner <- "AWAY"

# Record the winning team
WinnersAway$WinningTeam <- WinnersAway$Away_team

cols = c(1:7,82,56:81) 
WinnersAway <- WinnersAway[, cols]

# Change margin to absolute
WinnersAway$margin <- abs(WinnersAway$margin)

Now re-combine these two data-sets into our final data-frame of “winning differentials”

# Bind together 
Matches_2019 <- rbind(WinnersHome, WinnersAway)

# Alternative version with only numeric columns 
cols = c(9:33)
Matches_2019_Numeric <- Matches_2019[, cols]
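
As an aside, the same differentials could be built in one pass by using the sign of the margin to orient each difference as winner-minus-loser; a sketch (not run here):

# stats <- c("BO", "CCL", "CG", "CL", "CM", "CP", "D", "DE", "FA", "FF",
#            "GA", "HB", "I50", "ITC", "K", "M", "MG", "OP", "R50", "SCL",
#            "SI", "TO", "UP", "T5")
# sgn <- sign(Season2019_Ag$margin)  # +1 home win, -1 away win (draws excluded as above)
# for (s in stats) {
#   Season2019_Ag[[paste0(s, "_Diff")]] <-
#     sgn * (Season2019_Ag[[paste0("Home_", s)]] -
#            Season2019_Ag[[paste0("Away_", s)]])
# }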

Binding the two subsets together produces the final data-model

Each row reflects a single match of the 2019 season, & includes the match details, the result (winning team & margin), & the differential for each statistic:

season round date Home_team Away_team home_score away_score WinningTeam margin BO_Diff CCL_Diff CG_Diff CL_Diff CM_Diff CP_Diff D_Diff DE_Diff FA_Diff FF_Diff GA_Diff HB_Diff I50_Diff ITC_Diff K_Diff M_Diff MG_Diff OP_Diff R50_Diff SCL_Diff SI_Diff TO_Diff UP_Diff T5_Diff Winner
2019 Round 1 2019-03-23 Western Bulldogs Sydney 82 65 Western Bulldogs 17 5 1 -5 9 -3 26 5 -5.8 0 0 -2 -5 21 3 10 -26 466 -3 -19 8 5 -4 -17 6 HOME
2019 Round 1 2019-03-23 Brisbane West Coast 102 58 Brisbane 44 0 2 2 8 -1 29 88 2.6 8 -8 6 54 8 2 34 15 847 -28 -1 6 63 -3 60 5 HOME
2019 Round 1 2019-03-24 St Kilda Gold Coast 85 84 St Kilda 1 7 5 18 6 -1 9 33 1.4 11 -11 1 3 7 0 30 5 4 18 -6 1 11 0 16 -1 HOME
2019 Round 1 2019-03-24 GWS Essendon 112 40 GWS 72 -5 6 -8 -3 4 41 62 4.0 -11 11 6 19 3 10 43 26 916 9 10 -9 63 -10 35 2 HOME
2019 Round 1 2019-03-24 Fremantle North Melbourne 141 59 Fremantle 82 -2 6 -8 13 10 27 -9 -1.5 4 -4 9 -35 21 11 26 7 1197 -18 -7 7 73 -11 -35 -3 HOME
2019 Round 2 2019-03-30 Port Adelaide Carlton 88 72 Port Adelaide 16 5 6 -6 24 -4 11 94 6.5 2 -3 1 61 17 -4 33 32 468 6 -13 18 24 1 91 7 HOME

Prioritise Statistics of Interest

It’s worth taking a moment to review the included statistics. Many metrics are very similar, so we would expect collinearity between them. Based on little more than intuition, I have opted to drop the metrics below:

Matches_2019_Numeric$FA_Diff <- NULL # FA a mirror of FF 
Matches_2019_Numeric$CCL_Diff <- NULL # already have clearance data
Matches_2019_Numeric$SCL_Diff <- NULL # already have clearance data
Matches_2019_Numeric$D_Diff <- NULL # already have kick & handball data
Matches_2019_Numeric$M_Diff <- NULL # selected contested marking instead
Matches_2019_Numeric$TO_Diff <- NULL # turnovers largely captured by clangers & intercepts
Matches_2019_Numeric$SI_Diff <- NULL # in effect, a component of margin
Matches_2019_Numeric$GA_Diff <- NULL # in effect, a component of margin

Visualise & Explore

Correlation Matrices

The R package Corrplot is one of the easiest, & most engaging ways to summarize the relationships between variables via a matrix of correlations

Let’s first create a matrix of correlations

# create matrix of correlations
M <- cor(Matches_2019_Numeric)
# round data to 2 decimal places
M <- round(M, 2)

& now generate two correlation plots for the data

The first plot uses coloured squares to summarize the correlations between variables, ranging from navy (perfect positive correlation) to maroon (perfect negative correlation). The strength of the correlation is also depicted by the size of the square

The second plot summarises the same information, but provides the actual R values instead of colored squares

With corrplot you can pass arguments to adjust aesthetics such as text size. These adjustments are particularly important when there are many variables and/or long variable names

CM1 <- 
corrplot(M,
         method = "square", 
         type = "upper",
         tl.col = "black",  # text label colour
         tl.cex = 0.6,      # text label size
         cl.cex = 0.6       # colour-legend text size
         )

CM2 <- 
corrplot(M,
         method = "number", 
         type = "upper",
         tl.col = "black",    # text label colour
         # tl.srt = 45,       # text label rotation
         tl.cex = 0.6,        # text label size
         cl.cex = 0.6,        # colour-legend text size
         number.cex = .6      # correlation coefficient text size
         )

In the Figures above, the top horizontal row of the grid is of most interest. This row depicts each statistical differential’s correlation with the winning margin

The immediate stand-out along this row is MG_Diff : Metres Gained Differential

What this part of the grid shows is that the larger the winning team’s metres-gained advantage, the greater the winning margin. This is far-and-away the strongest correlation of any statistic, & can be characterized as positive & strong (R = 0.82).

Metres Gained has become one of the most fashionable statistics in AFL. An in-depth examination of what it means & does not mean can be read here.

It should be acknowledged the gurus at Champion Data have delved into this statistic in much greater depth. My characterization here is simplistic. Much of the nuance which is absent here is covered in this article (for example, delineating effective metres gained)

Comparatively, the next strongest relationships were only moderate: the winning side’s disposal efficiency differential (R = 0.47), & its kick differential (R = 0.42).
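
To read these values straight from the correlation matrix, the margin row can be pulled out & sorted (a quick sketch; margin’s correlation with itself, 1, will appear first):

# correlations of each differential with the winning margin, largest first
sort(M["margin", ], decreasing = TRUE)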

We can look more closely at these relationships via a series of interactive scatterplots:

Scatterplots

Winning-Margin x Metres-Gained-Differential

# winning margin vs metres-gained differential
Matches_2019 %>% 
  dplyr::select(Home_team, Away_team, WinningTeam, margin, MG_Diff) %>% 
  tauchart() %>% 
  tau_point("margin", "MG_Diff") %>% 
  tau_tooltip() 

Examining the scatterplot of metres gained & winning margin, a few further insights can be gleaned:

  • there are very few games where the winning team loses the metres-gained differential (only 25 out of 207; see the quick check after this list)
  • The highest winning margin for the year for a team conceding the metres-gained statistic was 24 points (Geelong defeating North Melbourne, with a very small -9 metres gained). In other words, if a team does not win the metres gained statistic, they are very unlikely to win by a large margin
  • The largest amount of metres gained conceded for a winning team was -461 metres, with Fremantle defeating Sydney despite clearly losing the territory battle
  • Fremantle were also involved in the only clear outlying game, when they lost to West Coast by 91 points, but conceded only 559 metres (an abnormally small difference when compared to other losses of that margin)
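
As flagged above, the first of these points can be checked directly (a quick sketch):

# games where the winning team lost the metres-gained count
sum(Matches_2019$MG_Diff < 0)   # should return the 25 quoted above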

Winning-Margin x Disposal-Efficiency-Differential

Matches_2019 %>% 
  dplyr::select(Home_team, Away_team, WinningTeam, margin, DE_Diff) %>% 
  tauchart() %>% 
  tau_point("margin", "DE_Diff") %>% 
  tau_tooltip() 

One interesting observation from this plot is that the poorest disposal efficiency differential recorded by a winning team was -8.7%, by Collingwood in a one-point defeat of West Coast

Winning-Margin x Number-of-Kicks-Differential

# winning margin vs kick differential
Matches_2019 %>% 
  dplyr::select(Home_team, Away_team, WinningTeam, margin, K_Diff) %>% 
  tauchart() %>% 
  tau_point("margin", "K_Diff") %>% 
  tau_tooltip() 

There are four potential outlier matches in this plot, all involving Richmond. In three of these outliers, Richmond won comfortably despite recording more than 40 fewer kicks than their opposition (wins against Sydney, St Kilda & Brisbane).

Inversely, Richmond were defeated by Collingwood in Round 2 by 44 points while conceding the equal-largest kick differential of any game (107). Other matches with similar kick differentials resulted in much greater winning margins

Clustering - k-Means

“Clustering” is a technique used to explore sub-groups of observations within a data set. In this data set, an “observation” is equivalent to a match of football

For this data, the goal of clustering is to explore whether we can group games together by their statistical-similarities. For example, we might hypothesize that there exists a “group” of games with extreme-values (i.e. a high winning margin, accompanied by large differences in the number of kicks, inside 50s, metres gained etc)

To do this, we will implement k-means clustering, probably the most common partitioning algorithm. In layman’s terms, k-means ‘groups’ observations (i.e. matches) such that the observations within each group are as similar as possible; each cluster is summarised by its ‘centroid’ (the mean value of the points assigned to it). The ‘principal components’ used later for plotting are derived as linear combinations of the variables in the data-set

In k-means clustering, ‘k’ denotes the number of clusters we will separate the data into. This is pre-specified by the analyst, & for this data I have opted for k = 3. In other words, the end result of this analysis will be three “clusters” of AFL matches from the 2019 season

For more information on k-means/principal components analysis, this towards data science article is an easily digestible summary

To begin, let’s construct a dissimilarity matrix using Gower’s distance. This matrix is a necessary input for calculating our k-means

d <- 
  Matches_2019_Numeric %>% 
  daisy(metric = "gower")

Calculating the k-means solution is very straightforward. Although there are many tuning parameters, for the sake of simplicity I have kept the defaults. The argument ‘3’ specifies that the data will be separated into three groups

kfit <- kmeans(d, 3)
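
Before plotting, it’s worth a quick glance at how many matches land in each cluster:

# number of matches assigned to each of the three clusters
table(kfit$cluster)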

With k-means calculated, I have generated a scatterplot depicting the first two principal components of the analysis, with the three k-means groups overlaid

The underlying principal components analysis produces more than two components; the scatterplot shows only the first two, since visualising more would require multidimensional techniques. A more thorough analysis could explore these additional components & optimise how many are retained

clusplot(as.matrix(d), 
         kfit$cluster, 
         color = T, 
         shade = F, 
         labels = 1, 
         lines = 0,
         cex = 0.7,
         cex.txt = 0.8,
         cex.axis = 0.8,
         cex.lab = 0.8,
         cex.main = 1,
         lwd = 0.8,
        main = '2D PCA-plot of k-means clustering')

The scatterplot depicts three slightly distinct, but ultimately overlapping, groups of matches. It’s worth remembering that k-means imposes groupings on the data regardless of whether legitimately distinct groups actually exist

In data-sets with clear, definable groups, we would anticipate seeing much cleaner separation between the clusters, as is the case when performing k-means on the iris dataset

But this analysis is exploratory, without any true a priori hypotheses, or frankly, any reason to anticipate clearly definable groupings. With these caveats in mind, a result of 62% variance explained by the first two principal components isn’t too bad

A final step we can take is to join the cluster assignments back to the initial data-set, then run summary statistics on each of our variables to look more closely at group differences

# attach each match's cluster assignment
Matches_2019_Numeric[,"Cluster"] <- kfit$cluster

# mean of every statistic within each cluster
MatchGroup <- Matches_2019_Numeric %>% 
  group_by(Cluster) %>% 
  summarise_all(mean) %>% 
  mutate_if(is.numeric, round, 1)
Cluster margin BO_Diff CG_Diff CL_Diff CM_Diff CP_Diff DE_Diff FF_Diff HB_Diff I50_Diff ITC_Diff K_Diff MG_Diff OP_Diff R50_Diff UP_Diff T5_Diff
1 60.9 0.8 -4.2 5.5 3.3 20.3 5.8 0.6 37.2 18.8 6.6 40.2 951.3 4.3 -8.7 62.0 4.3
2 23.2 1.5 -1.6 2.2 0.7 8.7 1.5 0.2 10.6 7.2 2.8 19.3 382.0 -0.7 -3.3 22.2 1.4
3 19.4 1.9 1.7 -2.9 0.5 -4.3 1.2 -2.0 -19.0 -3.2 2.6 -3.7 135.6 2.7 6.4 -21.4 -0.9

If we treat “winning margin” as an anchor for how the games differ, we can see cluster 1 is clearly a “big win” group, whilst clusters 2 & 3 are “small win” groups

Unsurprisingly, the “big win” cluster is associated with statistical domination. Its differentials are stark, & much larger than those of clusters 2 & 3. We can imagine these matches as the type of game where one team dominates from beginning to end, controlling the ball & the scoreboard

Perhaps cluster 3 is the most interesting. This cluster includes matches that on average were won by 19.4 points, but were also associated with a number of statistics that the winning team conceded: clearances, contested possessions, free kicks, handballs, inside 50s, kicks, uncontested possessions & tackles inside 50

In cluster 3, even metres gained was only modestly better for the winning team (+136 on average)

Clearly, this sub-group of games would make for an interesting analysis to explore what metrics were associated with winning. Cluster 3 matches may be those where despite one team controlling the game, they fail to apply any scoreboard pressure, keeping the opposition in the game before conceding a flurry of goals in quick succession. I need not remind Geelong supporters of the 2008 Grand Final!