As of this writing, the 2019 NCAA® women’s Division I Final Four® tournament is about to begin. There’s a high degree of interest here in Oregon, because the Oregon Ducks are competing against Baylor, Notre Dame and UConn for the championship. So I’ve wrangled some data and put together an archetypal analysis.
Archetypal analysis (Eugster 2012) is a statistical technique for analyzing athletes’ performance. It operates as follows:
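In broad strokes, archetypal analysis describes each player as a weighted mix of a few extreme "archetype" profiles, where the weights (the alphas) are non-negative and sum to one. The following is a minimal base R sketch of that mixing step only; the archetype names, statistics and numbers are made up for illustration and are not taken from the data or from dfstools.

```r
# Hypothetical archetype profiles (rows) over three made-up statistics.
archetypes <- rbind(
  scorer    = c(points = 20, rebounds = 4,  assists = 3),
  rebounder = c(points = 8,  rebounds = 12, assists = 1),
  bench     = c(points = 0,  rebounds = 0,  assists = 0)
)

# Hypothetical alpha weights for one player: non-negative, summing to one.
alphas <- c(scorer = 0.5, rebounder = 0.3, bench = 0.2)
stopifnot(all(alphas >= 0), abs(sum(alphas) - 1) < 1e-9)

# The player's approximated profile is the alpha-weighted mix of the archetypes.
approx_profile <- drop(alphas %*% archetypes)
print(approx_profile)
```

Fitting the archetypes and the alphas from real data is the hard part, and that is what the package calls later in this post take care of.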
We create the input data as follows:
library(dplyr) # get the pipe operator
# the minutes played are given in the form "800:45" (minutes:seconds)
.parse_minutes <- function(text_minutes) {
  # split into exactly two columns: minutes and seconds
  items <- stringr::str_split_fixed(text_minutes, ":", 2)
  return(as.numeric(items[, 1]) + as.numeric(items[, 2]) / 60.0)
}
# raw data - just filter out some NAs
raw <- readr::read_delim(
  "~/Downloads/division_1_womens.tsv",
  "\t", escape_double = FALSE,
  trim_ws = TRUE
) %>%
  dplyr::filter(
    !is.na(minutes_played),
    !is.na(games_played),
    !is.na(class_year),
    !is.na(position)
  )
# clean - some fields have NA where there really should be a zero
cleaned <- raw %>%
  tidyr::replace_na(list(
    field_goals_made = 0,
    three_point_field_goals = 0,
    free_throws = 0,
    offensive_rebounds = 0,
    defensive_rebounds = 0,
    assists = 0,
    turnovers = 0,
    steals = 0,
    blocks = 0
  )) %>%
  dplyr::mutate(
    two_point_field_goals = field_goals_made - three_point_field_goals,
    total_minutes = .parse_minutes(minutes_played)
  ) %>%
  dplyr::select(
    player_name,
    team_name,
    class_year,
    position,
    height,
    games_played,
    total_minutes,
    two_point_field_goals,
    three_point_field_goals,
    free_throws,
    offensive_rebounds,
    defensive_rebounds,
    assists,
    turnovers,
    steals,
    blocks
  )
# there are duplicate player names in this data set, so we add the team name
# in parentheses
cleaned$player_name <- paste0(cleaned$player_name, " (", cleaned$team_name, ")")
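As a quick sanity check on the minutes conversion, here is a base R stand-in for the parsing helper. The function name `parse_minutes_base` is hypothetical and the inputs are made up; it only assumes the "minutes:seconds" format described in the comment above.

```r
# Base R stand-in for the minutes parser (hypothetical name, illustration only).
parse_minutes_base <- function(text_minutes) {
  # Split each string at the colon into its minutes and seconds parts.
  parts <- strsplit(text_minutes, ":", fixed = TRUE)
  # Convert to decimal minutes: minutes + seconds / 60.
  vapply(
    parts,
    function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60,
    numeric(1)
  )
}

decimal_minutes <- parse_minutes_base(c("800:45", "12:30"))
print(decimal_minutes)
```

So "800:45" becomes 800.75 decimal minutes, which is the unit the `total_minutes` column carries into the analysis.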
Normally we would search for the number of archetypes to use, typically three to seven for basketball. However, for simplicity we will use the default, three. This has some advantages in interpretation and visualization:
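With three archetypes, each player's three weights sum to one, so a player can be drawn as a single point inside a triangle (a ternary plot). The sketch below shows one common way to compute those coordinates in base R; the function name `ternary_xy` and the corner layout are assumptions made for illustration, not something dfstools is known to provide.

```r
# Map three non-negative weights onto 2-D coordinates inside an equilateral
# triangle with corners at (0, 0), (1, 0) and (0.5, sqrt(3) / 2).
# Hypothetical helper for illustration only.
ternary_xy <- function(a, b, c) {
  total <- a + b + c # normalise in case of rounding error
  data.frame(
    x = (b + c / 2) / total,
    y = (sqrt(3) / 2) * c / total
  )
}

# A pure "a" player sits on the first corner; an even mix sits at the centre.
corners_and_centre <- ternary_xy(
  a = c(1, 0, 0, 1 / 3),
  b = c(0, 1, 0, 1 / 3),
  c = c(0, 0, 1, 1 / 3)
)
print(corners_and_centre)
```

With four or more archetypes the weights no longer fit in a flat triangle, which is part of why three is convenient for plotting.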
We use the dfstools package (Borasky 2019) to do the calculations.
player_totals <- cleaned %>% dplyr::select(player_name, total_minutes:blocks)
player_labels <- cleaned %>% dplyr::select(player_name:height)
archetype_models <- dfstools::compute_archetypes(player_totals, player_labels)
## Joining, by = "player_name"
player_alphas <- archetype_models[["player_alphas"]] %>% dplyr::arrange(Bench)
player_alphas[, 6:8] <- round(player_alphas[, 6:8], digits = 3)
DT::datatable(player_alphas)
Notes:
I’ve broken out the teams in the Final Four for exploration below.
### Baylor
Baylor <- player_alphas %>% dplyr::filter(team_name == "Baylor")
DT::datatable(Baylor)
### Notre Dame
NotreDame <- player_alphas %>% dplyr::filter(team_name == "Notre Dame")
DT::datatable(NotreDame)
### UConn
UConn <- player_alphas %>% dplyr::filter(team_name == "UConn")
DT::datatable(UConn)
### Oregon
Oregon <- player_alphas %>% dplyr::filter(team_name == "Oregon")
DT::datatable(Oregon)
To wrap up, let’s look at the totals of archetypal ratings for the teams.
# sum the two non-bench archetype columns for each team
column_sums <- dplyr::bind_rows(
  colSums(Baylor[, 6:7]),
  colSums(NotreDame[, 6:7]),
  colSums(UConn[, 6:7]),
  colSums(Oregon[, 6:7])
)
column_sums <- dplyr::bind_cols(
  tibble::enframe(c("Baylor", "Notre Dame", "UConn", "Oregon"),
    name = NULL, value = "Team"
  ),
  column_sums
)
# total strength is the sum of the two archetype columns
column_sums$Total <- column_sums[[2]] + column_sums[[3]]
column_sums <- column_sums %>% dplyr::arrange(dplyr::desc(Total))
DT::datatable(column_sums)
What this says is that Baylor has the equivalent of 1.866 Ciera Dillards and 3.286 Teaira McCowans, etc. The totals give the overall strength of the teams. It appears that Notre Dame is strongest overall, with Baylor being best at the rim and Oregon being best in three-point shooting.
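To make the arithmetic behind those totals concrete, here is a small base R sketch that sums per-player weights by team. The teams, players, column names and numbers are all made up for illustration; only the summing logic mirrors what the table above does.

```r
# Made-up alpha weights for five hypothetical players on two hypothetical teams.
toy_alphas <- data.frame(
  team    = c("A", "A", "A", "B", "B"),
  shooter = c(0.6, 0.1, 0.2, 0.5, 0.3),
  rim     = c(0.3, 0.7, 0.1, 0.2, 0.4)
)

# Sum each archetype column within a team, then add the two sums together.
team_totals <- aggregate(cbind(shooter, rim) ~ team, data = toy_alphas, FUN = sum)
team_totals$total <- team_totals$shooter + team_totals$rim

# Strongest team first.
team_totals <- team_totals[order(-team_totals$total), ]
print(team_totals)
```

Here team A's three players add up to 0.9 "shooters" and 1.1 "rim" players, in the same sense that the real table counts equivalent archetype players per team.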
Of course, coaching and strategy can even things up, and three-point shooting tends to add more value than rim protection in modern basketball. This promises to be an exciting tournament. And #GoDucks!
Borasky, M. Edward (Ed). 2017. “Archetypal Ballers and Ternary Plots.” https://rpubs.com/znmeb/pdxdataviz20170209.
Borasky, M. Edward (Ed). 2019. dfstools: Tidy Data Analytics from Sports Data APIs. https://znmeb.github.io/dfstools.
Eugster, Manuel J. A. 2012. “Performance Profiles Based on Archetypal Athletes.” International Journal of Performance Analysis in Sport 12 (1): 166–87. http://epub.ub.uni-muenchen.de/12336/.