The Round 11 match of the 2020 AFL season between Port Adelaide & Richmond has been widely praised as the “best game of the season”. A fast-finishing Port Adelaide prevailed by 21 points. Despite the game being close, match reports suggest it was statistically dominated by Port Adelaide
Consider ESPN's summary
Port Adelaide were
- +43 in contested possessions
- +31 for inside 50s
- +80 for total possessions
Despite these discrepancies, the game was close for most of the day, with Richmond even leading at three-quarter time
This had me wondering: how important are commonly measured AFL statistics to the end result?
To better understand this, I have aggregated all AFL match-day statistics for season 2019, & looked at the relationship between game-day stats & the eventual winning margin of the game
Specifically I have focused upon:
“differential statistics” - what was the difference between the winning & losing side for metric x, y or z?
Below I have outlined the programming to do this
The Data
AFLTables & Footywire are undoubtedly the best independent online resources for footy-related statistics
R users are fortunate that the vast depth of statistics from these websites is easily accessible via the “fitzRoy” package
Let’s read in fitzRoy, and all other packages we will use for analysis
# Data extraction
library(devtools)
library(fitzRoy)
# Data cleaning
library(snakecase)
library(tidyr)
library(dplyr)
library(reshape)
library(knitr)
# Data visualisation
library(kableExtra)
library(corrplot)
library(taucharts)
# Data analysis
library(ClustOfVar)
library(cluster)

fitzRoy provides a straightforward function to read in data from Footywire (“get_footywire_stats”). The only required input is which matches to include in the data-extract, specified via the website's match IDs
The range of “match_ids” has been limited to the 207 games contested in the 2019 season. Match_ids can be identified via the web-URL for each game on FootyWire
I encountered some buggy-behaviour reading in every required match in a single line of code, but found splitting it out worked OK
The match-data from footywire takes ~ 10-15 minutes to read, so patience is required :)
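The splitting workaround can be generalised to any number of batches. The sketch below is only an illustration, not the code used for this post: `fetch_in_chunks` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `get_footywire_stats` so the example runs without contacting Footywire.

```r
# Hypothetical helper: fetch match ids in small batches, then bind the results.
# `fetch_fun` is any function that takes a vector of ids and returns a data frame.
fetch_in_chunks <- function(ids, fetch_fun, chunk_size = 50) {
  # Assign each id to a batch number: 1, 1, ..., 2, 2, ...
  batches <- split(ids, ceiling(seq_along(ids) / chunk_size))
  # Fetch each batch separately, then stack the data frames
  do.call(rbind, lapply(batches, fetch_fun))
}

# Stand-in for get_footywire_stats(): one row per id (illustration only)
fake_fetch <- function(ids) data.frame(match_id = ids, rows_read = 1)

all_games <- fetch_in_chunks(9721:9927, fake_fetch, chunk_size = 50)
nrow(all_games)  # one row per requested id
```

With the real package, `fetch_fun` could be something like `function(ids) get_footywire_stats(ids = ids)`. The chunk size of 50 is an arbitrary choice for the sketch.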
# Extract
footywire2 <- get_footywire_stats(ids = 9876:9927) # works
footywire1 <- get_footywire_stats(ids = 9721:9875) # works
# Bind into a single data-frame
Season2019 <- rbind(footywire1, footywire2)
# lower case, under_score all column titles
names(Season2019) <- to_snake_case(names(Season2019))

The structure of this data-extract is 207 games x 44 selected players for each game, which produces 9108 rows of data
Each row reports the statistics of an individual player in a single game
This affords great flexibility to explore both match-level and player-level statistics
| date | season | round | venue | player | team | opposition | status | match_id | cp | up | ed | de | cm | ga_15 | mi_5 | one_percenters | bo | ccl | scl | si | mg | to | itc | t_5 | tog | k | hb | d | m | g | b | t | ho | ga_35 | i_50 | cl | cg | r_50 | ff | fa | af | sc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-03-21 | 2019 | Round 1 | MCG | Patrick Cripps | Carlton | Richmond | Home | 9721 | 21 | 11 | 26 | 81.2 | 0 | 0 | 0 | 0 | 1 | 4 | 3 | 5 | 263 | 3 | 4 | 0 | 89 | 10 | 22 | 32 | 1 | 0 | 0 | 6 | 0 | 0 | 2 | 7 | 3 | 2 | 3 | 1 | 101 | 126 |
| 2019-03-21 | 2019 | Round 1 | MCG | Marc Murphy | Carlton | Richmond | Home | 9721 | 6 | 23 | 21 | 72.4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7 | 530 | 5 | 3 | 0 | 87 | 16 | 13 | 29 | 4 | 1 | 0 | 1 | 0 | 0 | 5 | 1 | 1 | 4 | 1 | 0 | 97 | 91 |
| 2019-03-21 | 2019 | Round 1 | MCG | Kade Simpson | Carlton | Richmond | Home | 9721 | 5 | 19 | 21 | 77.8 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 2 | 462 | 2 | 6 | 0 | 84 | 15 | 12 | 27 | 6 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 1 | 5 | 1 | 0 | 92 | 83 |
| 2019-03-21 | 2019 | Round 1 | MCG | Dale Thomas | Carlton | Richmond | Home | 9721 | 6 | 17 | 23 | 85.2 | 0 | 1 | 0 | 4 | 0 | 0 | 0 | 8 | 434 | 4 | 6 | 0 | 78 | 15 | 12 | 27 | 3 | 1 | 0 | 2 | 0 | 1 | 3 | 0 | 4 | 5 | 1 | 1 | 90 | 93 |
| 2019-03-21 | 2019 | Round 1 | MCG | Nic Newman | Carlton | Richmond | Home | 9721 | 5 | 17 | 22 | 84.6 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 4 | 584 | 2 | 6 | 0 | 84 | 21 | 5 | 26 | 9 | 1 | 0 | 2 | 0 | 0 | 2 | 1 | 2 | 12 | 1 | 0 | 115 | 134 |
| 2019-03-21 | 2019 | Round 1 | MCG | Edward Curnow | Carlton | Richmond | Home | 9721 | 7 | 18 | 18 | 72.0 | 0 | 1 | 2 | 1 | 0 | 0 | 2 | 8 | 303 | 6 | 0 | 2 | 82 | 13 | 12 | 25 | 12 | 0 | 1 | 3 | 0 | 1 | 3 | 2 | 4 | 1 | 1 | 0 | 113 | 98 |
Re-shape & aggregate
Two key steps required to transform this data for analysis are:
- Aggregate the data so each row reflects match-level statistics rather than player level, &
- Pivot the data so every row reflects a unique game of the season
We will use several “tidyverse” functions to first aggregate our match-level statistics
# tally up all key statistics per game, per team
Season2019_Ag <- Season2019 %>%
dplyr::select(season,
round,
date,
team,
opposition,
status,
cp, # CONTESTED POSSESSION
up, # UNCONTESTED POSSESSION
de, # DISPOSAL EFFICIENCY
one_percenters, # ONE PERCENTERS
mg, # METRES GAINED
to, # TURNOVER
k, # KICKS
hb, # HANDBALLS
d, # DISPOSALS
m, # MARK
i_50, # INSIDE 50
cl, # CLEARANCE
cg, # CLANGERS
r_50, # REBOUND 50
ff, # FREES FOR
fa, # FREES AGAINST
cm, # CONTESTED MARKS
ga_15, # GOAL ASSISTS
bo, # BOUNCES
ccl, # CENTRE CLEARANCES
scl, # STOPPAGE CLEARANCE
itc, # INTERCEPTS
si, # SCORE INVOLVEMENTS
t_5, # TACKLES INSIDE 50
match_id) %>%
group_by(match_id,
team,
opposition,
status,
season,
round,
date) %>%
summarise(
CP = sum(cp),
UP = sum(up),
DE = round(mean(de),1),
OP = sum(one_percenters),
MG = round(sum(mg),1),
TO = sum(to),
K = sum(k),
HB = sum(hb),
D = sum(d),
M = sum(m),
I50 = sum(i_50),
CL = sum(cl),
CG = sum(cg),
R50 = sum(r_50),
FF = sum(ff),
FA = sum(fa),
CM = sum(cm),
GA = sum(ga_15),
BO = sum(bo),
CCL = sum(ccl),
SCL = sum(scl),
ITC = sum(itc),
SI = sum(si),
T5 = sum(t_5),
)

However, we have one further issue to address: data for each match is split across two rows - one for the home team, the other for the away team. The required end-state is for data from both teams to be summarized in a single row
The tidyr function spread can almost solve this issue. However, it only accepts one “value” argument, so we are not able to pivot over multiple variables
Fortunately, clever user danr from the RStudio Community wrote a function augmenting the spread command to be able to do this. We will implement this below:
# remove opposition
Season2019_Ag$opposition <- NULL
# Function to spread across multiple values
myspread <- function(df, key, value) {
# quote key
keyq <- rlang::enquo(key)
# break value vector into quotes
valueq <- rlang::enquo(value)
s <- rlang::quos(!!valueq)
df %>% gather(variable, value, !!!s) %>%
unite(temp, !!keyq, variable) %>%
spread(temp, value)
}
# spread - so each row reflects a single game
Season2019_Ag <- Season2019_Ag %>%
myspread(key = status, value = c(team, CP, UP, DE, OP, MG, TO, K, HB, D, M,
I50, CL, CG, R50, FF, FA, CM, GA, BO, CCL,
SCL, ITC, SI, T5))

& after some final tidying …
# create common key
Season2019_Ag$JOIN_ID <- paste(Season2019_Ag$date, "-",
Season2019_Ag$Home_team, "-",
Season2019_Ag$Away_team)
# Re-order Columns
Season2019_Ag <- Season2019_Ag[,c(2,3,4,1,55,52,27,30:51,53:54,5:26,28:29)]
# Ensure data-set is actually a dataframe
Season2019_Ag <- as.data.frame(Season2019_Ag)

We now have an aggregated, pivoted, tidied data-frame. Each game is captured as a unique row, and player statistics have been aggregated into team statistics.
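As an aside, newer versions of tidyr (1.0.0 onwards) include `pivot_wider`, which accepts several value columns at once and so could replace the custom spread helper. The sketch below uses made-up toy numbers, not the real 2019 data, and `team_stats` / `match_stats` are names invented for the illustration.

```r
library(tidyr)

# Toy team-level data: two matches, two rows each (made-up numbers)
team_stats <- data.frame(
  match_id = c(1, 1, 2, 2),
  status   = c("Home", "Away", "Home", "Away"),
  team     = c("Team A", "Team B", "Team C", "Team D"),
  CP       = c(140, 155, 150, 132),
  I50      = c(45, 58, 61, 47)
)

# One row per match; names_glue builds Home_CP, Away_CP, ... to mirror
# the column naming produced by the spread helper
match_stats <- pivot_wider(
  team_stats,
  id_cols     = match_id,
  names_from  = status,
  values_from = c(team, CP, I50),
  names_glue  = "{status}_{.value}"
)

names(match_stats)
```

One practical difference: because the gather step in the spread helper stacks team names and statistics in a single column, every statistic comes out as character. `pivot_wider` keeps each column's original type, so the numeric statistics should stay numeric.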
To be sure the aggregation worked, I spot-checked a number of random matches against official match-day statistics. This confirmed the results of this analysis were consistent with official statistics
However an important feature is missing - the outcome of the game. We can’t get too far with these statistics if the winning team is unknown
The match result could in theory be determined from this data-set. It would require calculating the overall team scores from individual players' goals & behinds
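For illustration, that calculation might look like the sketch below. The numbers are made up, and `players` / `team_scores` are names invented for the example; only the column names `g` (goals) and `b` (behinds) follow the Footywire extract.

```r
# Toy player-level rows (made-up numbers); g = goals, b = behinds
players <- data.frame(
  match_id = c(1, 1, 1, 1),
  team     = c("Team A", "Team A", "Team B", "Team B"),
  g        = c(3, 2, 4, 4),
  b        = c(1, 2, 0, 3)
)

# A goal is worth 6 points, a behind 1 point
players$points <- 6 * players$g + players$b

# Sum player points to team level
team_scores <- aggregate(points ~ match_id + team, data = players, FUN = sum)
team_scores
```

One caveat, as far as I am aware: rushed behinds are not credited to any individual player, so totals built this way can fall short of the official team score.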
However I decided an easier method would be to source this data directly from AFL-tables, accessible again from the FitzRoy package
# read in data from AFL tables
AT_2019 <- get_afltables_stats(start_date = '2019-01-01',
end_date = '2019-12-31')
# replace dots with underscores, all lower case
names(AT_2019) <- to_snake_case(names(AT_2019))

In order to combine the two data-sets, a common key is required. By combining three columns which exist in both data-sets (date, home team & away team), I was able to create my own common key
# create a common key
AT_2019$JOIN_ID <- paste(AT_2019$date, "-",
AT_2019$home_team, "-",
AT_2019$away_team)
# Select scores - the only columns we would like to keep from this table
AT_2019 <- dplyr::select(AT_2019,
JOIN_ID,
home_score,
away_score)
# remove duplicate rows (AFL Tables data has one row per player, so each game repeats)
AT_2019 <- AT_2019 %>% distinct(JOIN_ID, .keep_all = TRUE)
# Convert to dataframe
AT_2019 <- as.data.frame(AT_2019)

After some further tidying (error correction & harmonizing team names), the two data-sets are ready to be joined by the calculated JOIN_ID column
# Ensure join columns are comparable
AT_2019$JOIN_ID <- to_snake_case(AT_2019$JOIN_ID)
Season2019_Ag$JOIN_ID <- as.character(Season2019_Ag$JOIN_ID)
Season2019_Ag$JOIN_ID <- to_snake_case(Season2019_Ag$JOIN_ID)
# Other corrections
AT_2019$JOIN_ID <- gsub("greater_western_sydney", "gws", AT_2019$JOIN_ID)
AT_2019$JOIN_ID <- gsub("brisbane_lions", "brisbane", AT_2019$JOIN_ID)
# Correct error in AFL tables listing Geelong as home-side in the 2019 Preliminary Final
AT_2019$JOIN_ID <- gsub("2019_09_20_geelong_richmond", "2019_09_20_richmond_geelong", AT_2019$JOIN_ID)
# Join Footywire & AFL tables
Season2019_Ag <- left_join(Season2019_Ag, AT_2019, by = "JOIN_ID")

& some final tidying to re-order our variables, remove redundant variables, & convert statistics to numeric format
# Re-order columns
Season2019_Ag <- Season2019_Ag[,c(1:7, 56:57, 8:55)]
# Remove IDs - no longer required
Season2019_Ag$match_id <- NULL
Season2019_Ag$JOIN_ID <- NULL
# need to change all stats columns from character to numeric
cols = c(8:55)
Season2019_Ag[,cols] = apply(Season2019_Ag[,cols], 2, function(x) as.numeric(as.character(x)))

Our data model is nearly complete with match-statistics for both home & away teams, & the total score for each team
The next step is to calculate the statistical “differentials” between the winning & losing teams
An issue is the winning-score could come from either the home or away score columns. As such a simple subtraction of these fields won’t work.
The necessary work-around is to split our data in two: “winning home team” & “winning away team”, then recombine them into an overall “winning differentials” data-frame
Let’s start with winning home team:
# "HOME" Winners
# Calculate a "winning margin" score - ultimately we want to see how different
# statistics are related to this outcome variable
Season2019_Ag$margin <- Season2019_Ag$home_score - Season2019_Ag$away_score
# Subset only "home" team wins
WinnersHome <- filter(Season2019_Ag, margin >= 1)
# Create differential columns
WinnersHome$BO_Diff <- WinnersHome$Home_BO - WinnersHome$Away_BO
WinnersHome$CCL_Diff <- WinnersHome$Home_CCL - WinnersHome$Away_CCL
WinnersHome$CG_Diff <- WinnersHome$Home_CG - WinnersHome$Away_CG
WinnersHome$CL_Diff <- WinnersHome$Home_CL - WinnersHome$Away_CL
WinnersHome$CM_Diff <- WinnersHome$Home_CM - WinnersHome$Away_CM
WinnersHome$CP_Diff <- WinnersHome$Home_CP - WinnersHome$Away_CP
WinnersHome$D_Diff <- WinnersHome$Home_D - WinnersHome$Away_D
WinnersHome$DE_Diff <- WinnersHome$Home_DE - WinnersHome$Away_DE
WinnersHome$FA_Diff <- WinnersHome$Home_FA - WinnersHome$Away_FA
WinnersHome$FF_Diff <- WinnersHome$Home_FF - WinnersHome$Away_FF
WinnersHome$GA_Diff <- WinnersHome$Home_GA - WinnersHome$Away_GA
WinnersHome$HB_Diff <- WinnersHome$Home_HB - WinnersHome$Away_HB
WinnersHome$I50_Diff <- WinnersHome$Home_I50 - WinnersHome$Away_I50
WinnersHome$ITC_Diff <- WinnersHome$Home_ITC - WinnersHome$Away_ITC
WinnersHome$K_Diff <- WinnersHome$Home_K - WinnersHome$Away_K
WinnersHome$M_Diff <- WinnersHome$Home_M - WinnersHome$Away_M
WinnersHome$MG_Diff <- WinnersHome$Home_MG - WinnersHome$Away_MG
WinnersHome$OP_Diff <- WinnersHome$Home_OP - WinnersHome$Away_OP
WinnersHome$R50_Diff <- WinnersHome$Home_R50 - WinnersHome$Away_R50
WinnersHome$SCL_Diff <- WinnersHome$Home_SCL - WinnersHome$Away_SCL
WinnersHome$SI_Diff <- WinnersHome$Home_SI - WinnersHome$Away_SI
WinnersHome$TO_Diff <- WinnersHome$Home_TO - WinnersHome$Away_TO
WinnersHome$UP_Diff <- WinnersHome$Home_UP - WinnersHome$Away_UP
WinnersHome$T5_Diff <- WinnersHome$Home_T5 - WinnersHome$Away_T5
# Designate these games as "home" team winning
WinnersHome$Winner <- "HOME"
WinnersHome$WinningTeam <- WinnersHome$Home_team
cols = c(1:7,82,56:81)
WinnersHome <- WinnersHome[, cols]

Repeat this step for “winning away team”
# "AWAY" Winners
# Subset only "away" team wins
WinnersAway <- filter(Season2019_Ag, margin <= -1)
# Create differential columns
WinnersAway$BO_Diff <- WinnersAway$Away_BO - WinnersAway$Home_BO
WinnersAway$CCL_Diff <- WinnersAway$Away_CCL - WinnersAway$Home_CCL
WinnersAway$CG_Diff <- WinnersAway$Away_CG - WinnersAway$Home_CG
WinnersAway$CL_Diff <- WinnersAway$Away_CL - WinnersAway$Home_CL
WinnersAway$CM_Diff <- WinnersAway$Away_CM - WinnersAway$Home_CM
WinnersAway$CP_Diff <- WinnersAway$Away_CP - WinnersAway$Home_CP
WinnersAway$D_Diff <- WinnersAway$Away_D - WinnersAway$Home_D
WinnersAway$DE_Diff <- WinnersAway$Away_DE - WinnersAway$Home_DE
WinnersAway$FA_Diff <- WinnersAway$Away_FA - WinnersAway$Home_FA
WinnersAway$FF_Diff <- WinnersAway$Away_FF - WinnersAway$Home_FF
WinnersAway$GA_Diff <- WinnersAway$Away_GA - WinnersAway$Home_GA
WinnersAway$HB_Diff <- WinnersAway$Away_HB - WinnersAway$Home_HB
WinnersAway$I50_Diff <- WinnersAway$Away_I50 - WinnersAway$Home_I50
WinnersAway$ITC_Diff <- WinnersAway$Away_ITC - WinnersAway$Home_ITC
WinnersAway$K_Diff <- WinnersAway$Away_K - WinnersAway$Home_K
WinnersAway$M_Diff <- WinnersAway$Away_M - WinnersAway$Home_M
WinnersAway$MG_Diff <- WinnersAway$Away_MG - WinnersAway$Home_MG
WinnersAway$OP_Diff <- WinnersAway$Away_OP - WinnersAway$Home_OP
WinnersAway$R50_Diff <- WinnersAway$Away_R50 - WinnersAway$Home_R50
WinnersAway$SCL_Diff <- WinnersAway$Away_SCL - WinnersAway$Home_SCL
WinnersAway$SI_Diff <- WinnersAway$Away_SI - WinnersAway$Home_SI
WinnersAway$TO_Diff <- WinnersAway$Away_TO - WinnersAway$Home_TO
WinnersAway$UP_Diff <- WinnersAway$Away_UP - WinnersAway$Home_UP
WinnersAway$T5_Diff <- WinnersAway$Away_T5 - WinnersAway$Home_T5
# Designate these games as "away" team winning
WinnersAway$Winner <- "AWAY"
# Record the name of the winning (away) team
WinnersAway$WinningTeam <- WinnersAway$Away_team
cols = c(1:7,82,56:81)
WinnersAway <- WinnersAway[, cols]
# Change margin to absolute
WinnersAway$margin <- abs(WinnersAway$margin)

Now re-combine these two data-sets into our final data-frame of “winning differentials”
# Bind together
Matches_2019 <- rbind(WinnersHome, WinnersAway)
# Alternative version with only numeric columns
cols = c(9:33)
Matches_2019_Numeric <- Matches_2019[, cols]

Which results in the final data-model
Each row reflects a single match of the 2019 season, & includes:
- the teams involved in each match
- the date of the match
- which team won
- the margin of the win
- 24 “statistical differences” for the winning team
| season | round | date | Home_team | Away_team | home_score | away_score | WinningTeam | margin | BO_Diff | CCL_Diff | CG_Diff | CL_Diff | CM_Diff | CP_Diff | D_Diff | DE_Diff | FA_Diff | FF_Diff | GA_Diff | HB_Diff | I50_Diff | ITC_Diff | K_Diff | M_Diff | MG_Diff | OP_Diff | R50_Diff | SCL_Diff | SI_Diff | TO_Diff | UP_Diff | T5_Diff | Winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019 | Round 1 | 2019-03-23 | Western Bulldogs | Sydney | 82 | 65 | Western Bulldogs | 17 | 5 | 1 | -5 | 9 | -3 | 26 | 5 | -5.8 | 0 | 0 | -2 | -5 | 21 | 3 | 10 | -26 | 466 | -3 | -19 | 8 | 5 | -4 | -17 | 6 | HOME |
| 2019 | Round 1 | 2019-03-23 | Brisbane | West Coast | 102 | 58 | Brisbane | 44 | 0 | 2 | 2 | 8 | -1 | 29 | 88 | 2.6 | 8 | -8 | 6 | 54 | 8 | 2 | 34 | 15 | 847 | -28 | -1 | 6 | 63 | -3 | 60 | 5 | HOME |
| 2019 | Round 1 | 2019-03-24 | St Kilda | Gold Coast | 85 | 84 | St Kilda | 1 | 7 | 5 | 18 | 6 | -1 | 9 | 33 | 1.4 | 11 | -11 | 1 | 3 | 7 | 0 | 30 | 5 | 4 | 18 | -6 | 1 | 11 | 0 | 16 | -1 | HOME |
| 2019 | Round 1 | 2019-03-24 | GWS | Essendon | 112 | 40 | GWS | 72 | -5 | 6 | -8 | -3 | 4 | 41 | 62 | 4.0 | -11 | 11 | 6 | 19 | 3 | 10 | 43 | 26 | 916 | 9 | 10 | -9 | 63 | -10 | 35 | 2 | HOME |
| 2019 | Round 1 | 2019-03-24 | Fremantle | North Melbourne | 141 | 59 | Fremantle | 82 | -2 | 6 | -8 | 13 | 10 | 27 | -9 | -1.5 | 4 | -4 | 9 | -35 | 21 | 11 | 26 | 7 | 1197 | -18 | -7 | 7 | 73 | -11 | -35 | -3 | HOME |
| 2019 | Round 2 | 2019-03-30 | Port Adelaide | Carlton | 88 | 72 | Port Adelaide | 16 | 5 | 6 | -6 | 24 | -4 | 11 | 94 | 6.5 | 2 | -3 | 1 | 61 | 17 | -4 | 33 | 32 | 468 | 6 | -13 | 18 | 24 | 1 | 91 | 7 | HOME |
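The home/away split used to build this table could also be collapsed into a single loop. The sketch below is an alternative under stated assumptions rather than the code behind the table: it uses made-up toy numbers, only two statistics, and relies on `sign(margin)` to flip home-minus-away into winner-minus-loser.

```r
# Toy match-level data (made-up numbers), one row per game
games <- data.frame(
  home_score = c(82, 60),
  away_score = c(65, 90),
  Home_CP = c(150, 130),   Away_CP = c(124, 141),
  Home_MG = c(5900, 5200), Away_MG = c(5434, 5650)
)

games$margin <- games$home_score - games$away_score
# +1 when the home side won, -1 when the away side won
flip <- sign(games$margin)

for (stat in c("CP", "MG")) {
  home <- games[[paste0("Home_", stat)]]
  away <- games[[paste0("Away_", stat)]]
  # Home minus away, flipped so the value is always winner minus loser
  games[[paste0(stat, "_Diff")]] <- flip * (home - away)
}

# Drop drawn games (margin == 0), then report the margin as a positive number
games <- games[games$margin != 0, ]
games$margin <- abs(games$margin)
games[, c("margin", "CP_Diff", "MG_Diff")]
```

Extending the vector of statistic names to all 24 codes would cover the full set of differentials without repeating each line twice.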
Prioritise Statistics of Interest
It’s worth taking a moment to review the included statistics. Many metrics are very similar, so we would expect collinearity between them. Based on little more than intuition, I have opted to delete the below metrics:
- Removed “Frees-Against”, as it is a perfect mirror of “Frees-For” - differences here are only really useful at an individual player level rather than in aggregate
- Removed “centre clearances” & “stoppage clearances” in favor of keeping “total clearances” for this analysis
- Removed “total disposals” in favor of keeping its constituent elements - total kicks & handballs
- Removed “total marks” in favor of keeping total contested marks (a subset of total marks). I figured contested marking would be more closely associated with winning margin
- Removed “turn-overs” in favor of keeping intercept possessions as these two measures are unsurprisingly highly correlated
- Removed “Score Involvements” & “Goal Assists”. These metrics are clearly a function of the score margin, & as such would be uninformative to any analysis
Matches_2019_Numeric$FA_Diff <- NULL # FA a mirror of FF
Matches_2019_Numeric$CCL_Diff <- NULL # already have clearance data
Matches_2019_Numeric$SCL_Diff <- NULL # already have clearance data
Matches_2019_Numeric$D_Diff <- NULL # already have kick & handball data
Matches_2019_Numeric$M_Diff <- NULL # selected contested marking instead
Matches_2019_Numeric$TO_Diff <- NULL # selected intercept possessions instead
Matches_2019_Numeric$SI_Diff <- NULL # in effect, a component of margin
Matches_2019_Numeric$GA_Diff <- NULL # in effect, a component of margin

Visualise & Explore
Correlation Matrices
The R package corrplot is one of the easiest & most engaging ways to summarize the relationships between variables via a matrix of correlations
Let’s first create a matrix of correlations
# create matrix of correlations
M <- cor(Matches_2019_Numeric)
# round data to 2 decimal places
M <- round(M, 2)

& now generate two correlation plots for the data
The first plot uses coloured squares to summarize the correlations between variables, ranging from navy (perfect positive correlation) to maroon (perfect negative correlation). The strength of the correlation is also depicted by the size of the square
The second plot summarises the same information, but provides the actual R values instead of colored squares
With corrplot you can pass arguments to augment aesthetics such as text-size. These text-editing steps are important to improve the aesthetic of the plot(s), particularly when there are many variables, and/or long variable names
CM1 <-
corrplot(M,
method = "square",
type = "upper",
tl.col= "black",
tl.cex = 0.6, # Text label size
cl.cex = 0.6 # Colour-legend label size
)

CM2 <-
corrplot(M,
method = "number",
type = "upper",
tl.col="black",
# tl.srt=45,
tl.cex = 0.6, # Text label size
cl.cex = 0.6, # Colour-legend label size
number.cex = .6 # Size of the printed correlation coefficients
)

In the above figures, the top horizontal row of the grid is of most interest. This row depicts each statistical differential's correlation with the winning margin
The immediate stand-out along this row is MG_Diff : Metres Gained Differential
What this part of the grid shows is the larger the number of metres gained by the winning team, the greater the winning margin. This is far-and-away the largest correlation of all statistics, which we can characterize as positive & strong (R = 0.82).
Metres Gained has become one of the most fashionable statistics in AFL. An in-depth examination of what it means & does not mean can be read here.
It should be acknowledged the gurus at Champion Data have delved into this statistic in much greater depth. My characterization here is simplistic. Much of the nuance which is absent here is covered in this article (for example, delineating effective metres gained)
By comparison, the next most important metrics were only moderate in strength. These were the disposal efficiency differential of the winning side (R = 0.47), and how many more kicks the winning side had (R = 0.42).
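Rather than reading the ranking off the plot, the “margin” row of the correlation matrix can be pulled out and sorted. The sketch below runs on simulated toy data (not the real 2019 matches), and `toy`, `M_toy` and `ranked` are names invented for the illustration.

```r
# Toy stand-in for Matches_2019_Numeric (simulated, not real AFL data)
set.seed(1)
n <- 50
toy <- data.frame(MG_Diff = rnorm(n, 400, 300),
                  K_Diff  = rnorm(n, 15, 20),
                  FF_Diff = rnorm(n, 0, 5))
# Build a margin driven mostly by MG_Diff so the ranking has a clear leader
toy$margin <- 0.05 * toy$MG_Diff + 0.2 * toy$K_Diff + rnorm(n, 0, 5)

M_toy <- round(cor(toy), 2)
# Correlations with margin, excluding margin's correlation with itself
margin_cor <- M_toy["margin", colnames(M_toy) != "margin"]
# Rank by absolute strength, strongest first
ranked <- margin_cor[order(abs(margin_cor), decreasing = TRUE)]
ranked
```

Sorting on the absolute value keeps strong negative correlations near the top of the list as well.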
We can look more closely at these relationships via a series of interactive scatterplots:
Scatterplots
Winning-Margin x Metres-Gained-Differential
# Scatterplot: winning margin vs metres-gained differential
Matches_2019 %>%
dplyr::select(Home_team, Away_team, WinningTeam, margin, MG_Diff) %>%
tauchart() %>%
tau_point("margin", "MG_Diff") %>%
tau_tooltip()

Examining the scatterplot of metres gained & winning margin, a few further insights can be gleaned:
- there are very few games where the winning team loses the metres-gained differential (only 25 out of 207)
- The highest winning margin for the year for a team conceding the metres-gained statistic was 24 points (Geelong defeating North Melbourne, with a very small -9 metres gained). In other words, if a team does not win the metres gained statistic, they are very unlikely to win by a large margin
- The largest amount of metres gained conceded for a winning team was -461 metres, with Fremantle defeating Sydney despite clearly losing the territory battle
- Fremantle were also involved in the only clear outlying game, when they lost to West Coast by 91 points, but conceded only 559 metres (an abnormally small difference when compared to other losses of that margin)
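Counts and extremes like those above can be checked with a few lines of subsetting. The sketch below uses a made-up four-row table rather than the real data, and all object names are invented for the illustration.

```r
# Toy stand-in for Matches_2019 (made-up numbers and team names)
toy <- data.frame(
  WinningTeam = c("Team A", "Team B", "Team C", "Team D"),
  margin      = c(24, 5, 40, 12),
  MG_Diff     = c(-9, -461, 820, 310)
)

# Games where the winner lost the metres-gained count
lost_mg <- toy[toy$MG_Diff < 0, ]
n_lost <- nrow(lost_mg)

# Largest winning margin among those games
max_margin_when_lost <- max(lost_mg$margin)

# Game with the biggest metres-gained deficit for a winner
worst_deficit <- lost_mg[which.min(lost_mg$MG_Diff), ]
```

Swapping `MG_Diff` for any other differential column gives the same checks for that statistic.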
Winning-Margin x Disposal-Efficiency-Differential
Matches_2019 %>%
dplyr::select(Home_team, Away_team, WinningTeam, margin, DE_Diff) %>%
tauchart() %>%
tau_point("margin", "DE_Diff") %>%
tau_tooltip()

One interesting observation between winning margin and disposal efficiency differential is that the poorest differential observed for a winning team was -8.7%, recorded by Collingwood in a one-point defeat of West Coast
Winning-Margin x Number-of-Kicks-Differential
# Scatterplot: winning margin vs kicks differential
Matches_2019 %>%
dplyr::select(Home_team, Away_team, WinningTeam, margin, K_Diff) %>%
tauchart() %>%
tau_point("margin", "K_Diff") %>%
tau_tooltip()

There are four potential outlier matches in this plot, all involving Richmond. In three of these, Richmond won comfortably despite recording more than 40 fewer kicks than their opposition (wins against Sydney, St Kilda & Brisbane).
Conversely, Richmond were defeated by Collingwood in Round 2 by 44 points, conceding the equal-largest kick differential of any game (+107). Other matches with similar kick-differentials resulted in much greater winning margins
Clustering - k-Means
“Clustering” is a technique used to explore sub-groups of observations within a data set. In this data set, an “observation” is equivalent to a match of football
For this data, the goal of clustering is to explore whether we can group games together by their statistical-similarities. For example, we might hypothesize that there exists a “group” of games with extreme-values (i.e. a high winning margin, accompanied by large differences in the number of kicks, inside 50s, metres gained etc)
To do this, we will implement k-means clustering, probably the most common algorithm for partitioning. In layman's terms, k-means clustering ‘groups’ observations (i.e. matches) such that the observations within the group are as similar as possible. Specifically, clusters are formed via observations that demonstrate high intra-class similarity, measured by the cluster's ‘centroid’ (the mean value of points assigned to the cluster). ‘Principal components’ are derived as combinations of the variables in the data-set
In k-means clustering, ‘k’ denotes the number of clusters we will separate the data into. This is pre-specified by the analyst, & for this data I have opted for k = 3. In other words, the end result of this analysis will be three “clusters” of AFL matches from the 2019 season
For more information on k-means/principal components analysis, this towards data science article is an easily digestible summary
To begin, let's construct a dissimilarity matrix, utilizing Gower's distance. This matrix is a necessary step to calculate our k-means
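A minimal sketch of this step, assuming the `daisy` function from the cluster package (loaded earlier) is used. The object name `d` matches the later plotting call, but the toy data and the exact arguments are my assumptions.

```r
library(cluster)

# Toy stand-in for Matches_2019_Numeric (simulated, not real AFL data)
set.seed(42)
toy <- data.frame(margin  = round(runif(30, 1, 90)),
                  MG_Diff = rnorm(30, 400, 300),
                  K_Diff  = rnorm(30, 15, 20))

# Pairwise Gower dissimilarity between matches; each variable is
# range-scaled, so every dissimilarity falls between 0 and 1
d <- daisy(toy, metric = "gower")

summary(as.vector(d))
```

Range-scaling matters here because metres gained is measured in hundreds while most other differentials are in single or double digits.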
Calculating k-means is very straight-forward. Although there are many tuning parameters, for the sake of simplicity, I have kept the defaults. The argument ‘3’ is specifying the data will be separated into three groups
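A minimal sketch of the call, assuming base R's `kmeans` is run on the dissimilarity matrix with three centres. The names `d` and `kfit` match the later plotting call; the toy data, the seeds and the exact form of the call are my assumptions.

```r
library(cluster)

# Toy stand-in (simulated): three loose groups of 20 "matches"
set.seed(42)
toy <- data.frame(
  margin  = c(rnorm(20, 60, 8),    rnorm(20, 25, 8),    rnorm(20, 15, 8)),
  MG_Diff = c(rnorm(20, 950, 100), rnorm(20, 400, 100), rnorm(20, 100, 100))
)
d <- daisy(toy, metric = "gower")

# k-means on the rows of the dissimilarity matrix, k = 3, other arguments
# left at their defaults. Results depend on the random starting centres,
# hence the seed.
set.seed(2019)
kfit <- kmeans(as.matrix(d), 3)

table(kfit$cluster)  # number of matches assigned to each cluster
```

Cluster numbers are arbitrary labels: re-running without a fixed seed can return the same groups under different numbers.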
With k-means calculated, I have generated a scatterplot depicting the first two principal components of the analysis. Within this scatterplot, the three k-mean groups are depicted
The principal components analysis behind this plot will have produced more than two components. The scatterplot only shows the first two, as multidimensional data would require multidimensional visualization techniques. More thorough analysis can explore these additional components and optimize the number of components
clusplot(as.matrix(d),
kfit$cluster,
color = T,
shade = F,
labels = 1,
lines = 0,
cex = 0.7,
cex.txt = 0.8,
cex.axis = 0.8,
cex.lab = 0.8,
cex.main = 1,
lwd = 0.8,
main = '2D PCA-plot of k-means clustering')

The scatterplot depicts three slightly distinct, but ultimately overlapping groups of matches. Let’s remember that regardless of whether legitimately distinct groups exist, this k-means analysis imposes groupings upon the data
In data-sets with clear, definable groups, we would expect to see some complete separation between the clusters, as is the case when performing k-means on the iris dataset
But this analysis is exploratory, without any true a priori hypotheses, or frankly, any reason to anticipate clearly definable groupings. With these caveats in mind, a result of 62% variance explained by the first two principal components isn’t too bad
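The share of variance carried by each component can also be computed directly. The sketch below uses `prcomp` on simulated toy data as a stand-in; the figure printed on the cluster plot may be calculated slightly differently, so treat this as an approximation of the idea rather than a reproduction of the 62%.

```r
# Toy stand-in (simulated, not real AFL data)
set.seed(7)
toy <- data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40), x4 = rnorm(40))
toy$x2 <- toy$x1 + 0.3 * toy$x2   # make two columns correlated

pca <- prcomp(toy, scale. = TRUE)

# Proportion of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Cumulative share: the second entry is the two-component figure
cum_explained <- cumsum(var_explained)
round(cum_explained, 2)
```

The more strongly the original variables are correlated, the more variance the first few components absorb.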
A final step we can take is to join our grouping demarcations back to the initial data-set, then run summary statistics on each of our variables to look more closely at group differences
# Attach each match's cluster label (assumes kfit holds the k-means fit, in the same row order)
Matches_2019_Numeric$Cluster <- kfit$cluster
MatchGroup <- Matches_2019_Numeric %>%
group_by(Cluster) %>%
summarise_all(mean) %>%
mutate_if(is.numeric, round, 1)

| Cluster | margin | BO_Diff | CG_Diff | CL_Diff | CM_Diff | CP_Diff | DE_Diff | FF_Diff | HB_Diff | I50_Diff | ITC_Diff | K_Diff | MG_Diff | OP_Diff | R50_Diff | UP_Diff | T5_Diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60.9 | 0.8 | -4.2 | 5.5 | 3.3 | 20.3 | 5.8 | 0.6 | 37.2 | 18.8 | 6.6 | 40.2 | 951.3 | 4.3 | -8.7 | 62.0 | 4.3 |
| 2 | 23.2 | 1.5 | -1.6 | 2.2 | 0.7 | 8.7 | 1.5 | 0.2 | 10.6 | 7.2 | 2.8 | 19.3 | 382.0 | -0.7 | -3.3 | 22.2 | 1.4 |
| 3 | 19.4 | 1.9 | 1.7 | -2.9 | 0.5 | -4.3 | 1.2 | -2.0 | -19.0 | -3.2 | 2.6 | -3.7 | 135.6 | 2.7 | 6.4 | -21.4 | -0.9 |
If we treat “winning margin” as an anchor for how the games are different, we can see group 1 is clearly a “big win” group, whilst groups 2 & 3 are “small win” groups
Unsurprisingly, the “big win” group is associated with statistical domination. The differentials for almost all statistics are stark, & much higher than clusters 2 & 3. We can imagine these matches as the type of game where one team dominates from beginning to end, controlling the ball & the scoreboard
Perhaps Cluster 3 is the most interesting. This cluster includes matches that on average were won by 19.4 points, but were also associated with a number of statistics that the winning team conceded:
- fewer clearances
- fewer contested possessions
- fewer frees for
- fewer handballs
- fewer inside 50s
- fewer kicks
- fewer uncontested possessions
- fewer tackles inside 50
- more clangers
In Cluster 3, even metres gained was only marginally better for the winning team (+135.6, against +382.0 & +951.3 in the other clusters)
Clearly, this sub-group of games would make for an interesting analysis to explore what metrics were associated with winning. Cluster 3 matches may be those where despite one team controlling the game, they fail to apply any scoreboard pressure, keeping the opposition in the game before conceding a flurry of goals in quick succession. I need not remind Geelong supporters of the 2008 Grand Final!