The stability of a metric is important in sport for making knowledgeable decisions about future skill. Furthermore, having a stable metric can increase confidence in the metric of predicting future performance.
For this analysis we will be calculating the stability for the variables clearances and marks.
The equation for stability is 1 - (mean[(Xspm - Xpm)^2 - BV[Xspm]]/mean[(Xspm - Xm)^2 - BV[Xspm]])
Load the library tidyverse() and load the data set with the player metrics that you want to assess by using read.csv().
#Load Library
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#Load Data
afl_data <- read.csv("AT_data_afl.csv")
Using mutate() calculate the total minutes of play for each player each game.
afl_data <- afl_data %>%
mutate(Mins = (time_on_ground_percentage/100)*80)
Again using mutate() and group_by() to calculate the players total minutes and how many seasons each player has played. Aswell as changing season and player_id to a character useing as.character().
#Cleaning the data
afl_data <- afl_data %>%
group_by(player_id) %>%
mutate(totalmin = sum(Mins)) %>%
mutate(season = as.character(season)) %>%
mutate(nseason = n_distinct(season)) %>%
mutate(player_id = as.character(player_id))
This step involves summarising the data to only include the variables needed in the analysis, which includes clearances and marks.
#Create a new data Frame
afl_data1 <- afl_data %>%
group_by(player_id,season) %>%
summarise(player_id = player_id[1],
season = season,
player_team = player_team[1],
marks = mean(marks),
clearances = mean(clearances),
totalmin = totalmin,
nseason = nseason[1])
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'player_id', 'season'. You can override
## using the `.groups` argument.
Xpm is the calculation of each metric for all player over all seasons.
#Calculating Xpm
avg.Xpm <- afl_data1 %>%
group_by(player_id) %>%
summarise(Xpm.player_id = player_id[1],
Xpm.season = season,
Xpm.player_team = player_team[1],
Xpm.marks = mean(marks),
Xpm.clearances = mean(clearances))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'player_id'. You can override using the
## `.groups` argument.
Xspm is the average value of every player across every season
#Calculating Xspm
Xspm <- afl_data1 %>%
group_by(player_id,season) %>%
summarise(Xspm.player_id = player_id[1],
Xspm.season = season,
Xspm.player_team = player_team[1],
Xspm.marks = mean(marks),
Xspm.clearances = mean(clearances))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'player_id', 'season'. You can override
## using the `.groups` argument.
To calculate the square root of Xspm - Xpm we first need to use the left_join() finction to combine the two data frames above. Then we use the mutate() function to calculate this equation (Xspm - Xpm)^2 for our variables
sqXspm_Xpm <- left_join(Xspm, avg.Xpm, by = "player_id") %>%
mutate(sq.marks = (Xspm.marks - Xpm.marks)^2,
sq.clearances = (Xspm.clearances - Xpm.clearances)^2)
## Warning in left_join(Xspm, avg.Xpm, by = "player_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
The bootstrap method uses random sampling with replacement on a data set and therefore creating a new version of the dataset. This method increase accuracy of the model and estimates the size of confidence intervals.
#bootstrapping the metrics
#Create a dataframe with every player's stats
aflplayerseason <- afl_data[c('season','player_id')] %>%
unique()
nboots = 10 # number of bootstraps (replays) we are going to do
replyafl = list() # empty list to store results
for (i in 1:nboots){
show(i)
# randomly re-sample at the team-game level
resamp_aflplayerseason <- aflplayerseason %>%
group_by(player_id, season) %>%
sample_frac(size = 1, replace = T)
# join the player stats
new_alfplayerseason <- left_join(resamp_aflplayerseason,
afl_data,
by = c("season", "player_id"))
#summary stats for this reply
newseason_statsafl <- new_alfplayerseason %>%
group_by(player_id) %>%
summarise(player_id = player_id[1],
season = season,
player_team = player_team[1],
marks = mean(marks),
clearances = mean(clearances),
totalmin = totalmin,
nseason = nseason[1])
# store results in the list
replyafl[[i]] <- newseason_statsafl
}
## [1] 1
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'player_id'. You can override using the
## `.groups` argument.
## [1] 2
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'player_id'. You can override using the
## `.groups` argument.
## [1] 3
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'player_id'. You can override using the
## `.groups` argument.
## [1] 4
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'player_id'. You can override using the
## `.groups` argument.
## [1] 5
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'player_id'. You can override using the
## `.groups` argument.
## [1] 6
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'player_id'. You can override using the
## `.groups` argument.
## [1] 7
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'player_id'. You can override using the
## `.groups` argument.
## [1] 8
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'player_id'. You can override using the
## `.groups` argument.
## [1] 9
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'player_id'. You can override using the
## `.groups` argument.
## [1] 10
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'player_id'. You can override using the
## `.groups` argument.
# bind all the re-samples into 1 dataframe
resamp_resultsafl <- bind_rows(replyafl)
Now we are going to work out the bootstrapped variances for each metric
#bootstrapped variances
bootstrap_varsafl <- resamp_resultsafl %>%
group_by(player_id) %>%
summarise(player_id = player_id[1],
season = season[1],
player_team = player_team[1],
var.marks = var(marks),
var.clearances = var(clearances))
To calculate the numerator we use this equation mean[(Xspm - Xpm)^2 - BV[Xspm]], which we have every calculated every metric we need.
#Calculating the numerator for Marks
smn.marks <- (sqXspm_Xpm$sq.marks - bootstrap_varsafl$var.marks)
## Warning in sqXspm_Xpm$sq.marks - bootstrap_varsafl$var.marks: longer object
## length is not a multiple of shorter object length
#Need to average the numerator
smn.marks <- mean(smn.marks)
#Calculating the numerator for Clearances
smn.clearances <- (sqXspm_Xpm$sq.clearances - bootstrap_varsafl$var.clearances)
## Warning in sqXspm_Xpm$sq.clearances - bootstrap_varsafl$var.clearances: longer
## object length is not a multiple of shorter object length
#Need to average the numerator
smn.clearances <- mean(smn.clearances)
For Part A we need to calculate the metric X.m which is the average over all players and seasons
#Calculating X.m
X.m <- afl_data %>%
summarise(marks = mean(marks),
clearances = mean(clearances))
Now we have all the metrics we need to work out the Denominator equation which is mean[(Xspm - Xm)^2 - BV[Xspm]]
#Calculating the denominator for Marks
smd.marks <-(Xspm$Xspm.marks - X.m$marks)^2 - bootstrap_varsafl$var.marks
## Warning in Xspm$Xspm.marks - X.m$marks: longer object length is not a multiple
## of shorter object length
## Warning in (Xspm$Xspm.marks - X.m$marks)^2 - bootstrap_varsafl$var.marks:
## longer object length is not a multiple of shorter object length
#Need to average the denominator
smd.marks <- mean(smd.marks)
#Calculating the denominator for Clearances
smd.clearances <- (Xspm$Xspm.clearances - X.m$clearances)^2 - bootstrap_varsafl$var.clearances
## Warning in Xspm$Xspm.clearances - X.m$clearances: longer object length is not a
## multiple of shorter object length
## Warning in (Xspm$Xspm.clearances - X.m$clearances)^2 -
## bootstrap_varsafl$var.clearances: longer object length is not a multiple of
## shorter object length
#Need to average the denominator
smd.clearances <- mean(smd.clearances)
Finally after working out the denominator and the numerator for the equation we can workout the stability of the metrics. We use this equation: 1 - ([(Xspm - Xpm)^2 - BV[Xspm]]/[(Xspm - Xm)^2 - BV[Xspm]]) For this we do not need to add the mean of the denominator and the numerator because we have already did this in a step above
#Stability of Marks
sm.mark <- 1-(smn.marks/smd.marks)
#To see the Mark stability value
sm.mark
## [1] 0.8464899
#Stability of Clearances
sm.clearances <- 1-(smn.clearances/smd.clearances)
#To see the Mark stability value
sm.clearances
## [1] 0.88976