Purpose. This report analyzes NBA player availability and on-court movement trends. We assemble long‑horizon datasets, visualize key patterns, and generate short‑term forecasts using classical time‑series models. The goal is to provide a clear, reproducible workflow you can extend for deeper team or player analyses as more detailed data become available.

Setup

Download Dataset 2015–2025 Player Totals (ZIP)

Download Team Data (ZIP)

We load the libraries used for data prep, plotting, tables, and forecasting. Knitr options keep the output clean for publishing.

knitr::opts_chunk$set(
  echo = TRUE, message = FALSE, warning = FALSE,
  fig.width = 9, fig.height = 5, fig.align = "center"
)

library(readxl)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(ggrepel)
library(purrr)
library(tidyverse)
library(lubridate)
library(timetk)
library(fpp3)         # tsibble, fable, feasts
library(patchwork)
library(gt)
library(scales)

Plotting Utilities & Style Guide

We define a minimal presentation theme, a consistent color palette, and a few utilities for annotations used throughout the charts.

# Slide-style theme
theme_deck <- theme_minimal(base_size = 18) +
  theme(
    plot.title      = element_text(face = "bold", size = 22, margin = margin(b = 6)),
    plot.subtitle   = element_text(size = 15, margin = margin(b = 10)),
    axis.title.x    = element_text(face = "bold", margin = margin(t = 6)),
    axis.title.y    = element_text(face = "bold", margin = margin(r = 6)),
    axis.text.x     = element_text(size = 14),
    axis.text.y     = element_text(size = 14),
    panel.grid.minor= element_blank(),
    panel.grid.major= element_line(color = "#e9eef2"),
    plot.caption    = element_text(size = 12, color = "#6b7b8a"),
    plot.background = element_rect(fill = "white", color = NA),
    legend.position = "bottom",
    legend.title    = element_blank(),
    legend.text     = element_text(size = 14),
    strip.text      = element_text(face = "bold", size = 16)
  )

# Blog-style theme
theme_blog <- theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, margin = margin(b = 10)),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    panel.grid.major = element_line(color = "#f1f1f1"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "#ffffff", color = NA),
    plot.background = element_rect(fill = "#ffffff", color = "white", linewidth = 1),
    plot.margin = margin(10, 10, 10, 10)
  )

# Model colors for forecast plots
model_cols <- c(
  "ARIMA"   = "#0B7285",
  "ETS"     = "#B8590C",
  "TSLM"    = "#5C7CFA",
  "lm"      = "#5C7CFA",
  "Prophet" = "#2F9E44",
  "ARIMA Boost" = "#AE3EC9",
  "RW" = "red"
)

# Helpers
label_last <- function(df, x, y, nudge_y = 0, digits = 3) {
  # Keep the row with the largest x and format y as a text label
  df |> dplyr::slice_max({{x}}, n = 1, with_ties = FALSE) |>
    dplyr::mutate(lbl = paste0(scales::number({{y}}, accuracy = 10^-digits)))
}

ts_years <- function(dates) {
  rng <- range(dates, na.rm = TRUE)
  y0 <- lubridate::year(rng[1]); y1 <- lubridate::year(rng[2])
  if (y0 == y1) paste("in", y0) else paste("from", y0, "to", y1)
}

shade_future <- function(x_start, x_end) {
  annotate("rect", xmin = x_start, xmax = x_end,
           ymin = -Inf, ymax = Inf, fill = "grey80", alpha = 0.2)
}

vline_future <- function(x) {
  geom_vline(xintercept = x, linetype = "dashed", linewidth = 0.7, color = "grey35")
}

nba_palette <- c(
  "2014" = "#99E3C3",
  "2015" = "#50E3C2",
  "2016" = "#44C7C1",
  "2017" = "#38AAC0",
  "2018" = "#2C8EBE",
  "2019" = "#2172BD",
  "2020" = "#1656BB",
  "2021" = "#133FB0",
  "2022" = "#1136A0",
  "2023" = "#0E2D91",
  "2024" = "#0B2481",
  "2025" = "#081a5b"
)

This helps us display our visualizations clearly and makes the results easier for readers to interpret.

Player Totals (2000–2025): Ingestion & Availability Metric

We ingest Player Totals (one file per season) and construct an availability metric from the top‑minutes cohort.

# Years and file pattern
years <- 2000:2025

# Read & combine yearly totals
df_list <- list()
for (year in years) {
  file_name <- paste0("data/Dataset 20152025 Player Totals/Player_Total_", year, ".xlsx")
  df <- readxl::read_excel(file_name)
  df$Year <- year
  df_list[[as.character(year)]] <- df
}
final_df <- dplyr::bind_rows(df_list)

# Total games per season (account for shortened seasons)
season_games <- function(year) {
  if (year == 2020) return(72)
  if (year == 2012) return(66)
  else return(82)
}

# Compute availability for top 250 by minutes
top250_df <- final_df %>%
  dplyr::group_by(Year) %>%
  dplyr::slice_max(order_by = MP, n = 250, with_ties = FALSE) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(
    # One schedule length per row; vapply guarantees a numeric vector
    Total_Games_Season = vapply(Year, season_games, numeric(1)),
    Games_Missed = Total_Games_Season - G
  )

avg_games_missed_by_year <- top250_df %>%
  dplyr::group_by(Year) %>%
  dplyr::summarise(Avg_Games_Missed = mean(Games_Missed, na.rm = TRUE))

It is important that we analyze only a subset of players from the total NBA landscape. If we include every player, we add too much noise from those who miss games because they are not good enough to play in all of them. If we limit the cohort too far, we reduce coverage and fail to capture top-level players who miss games. A cohort of 250 was found to be a reasonable balance.
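One way to sanity-check the cohort size is to recompute the metric for several cut-offs and see how quickly it moves. The sketch below does this in base R on simulated single-season data; the `toy` data frame, the simulated relationship between minutes and games, and the `avg_missed_top_n` helper are all illustrative assumptions, not the report's data or code.

```r
# Illustrative only: simulated single-season data, not the report's dataset.
set.seed(42)
n_players <- 500
toy <- data.frame(
  MP = round(runif(n_players, 50, 2900))   # minutes played
)
# Assumption for the simulation: low-minute players appear in fewer games.
toy$G <- pmin(82, pmax(1, round(toy$MP / 35 + rnorm(n_players, 0, 6))))

# Average games missed among the top `n` players by minutes (82-game season).
avg_missed_top_n <- function(df, n, season_games = 82) {
  top <- df[order(-df$MP), ][seq_len(min(n, nrow(df))), ]
  mean(season_games - top$G)
}

cutoffs <- c(100, 250, 400, 500)
sensitivity <- sapply(cutoffs, function(n) avg_missed_top_n(toy, n))
names(sensitivity) <- cutoffs
print(round(sensitivity, 1))
```

In the simulation the average climbs as the cut-off widens, because fringe players are added to the cohort. Running the same loop on the real `final_df` would show where the real metric starts to be dominated by that effect.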

Visual: Games Missed Long‑Run Trend

We visualize the average games missed per season, highlighting the COVID window and overlaying a trend fitted to non‑COVID seasons.

# Exclude 2021–2022 for a clean trend line
filtered_data <- avg_games_missed_by_year %>%
  dplyr::filter(Year < 2021 | Year > 2022)

plotInjuries <- ggplot(avg_games_missed_by_year, aes(x = Year, y = Avg_Games_Missed)) +
  # COVID window
  annotate("rect", xmin = 2020.3, xmax = 2022.7, ymin = -Inf, ymax = Inf, alpha = 0.1, fill = "red") +
  geom_vline(xintercept = 2020.3, color = "red", linetype = "dashed", linewidth = 0.5) +
  geom_vline(xintercept = 2022.7, color = "red", linetype = "dashed", linewidth = 0.5) +
  # Series
  geom_line(color = "#2C3E50", linewidth = 1, alpha = 0.8) +
  geom_point(size = 3.5, color = "#2980B9") +
  # Trend (excl COVID)
  geom_smooth(data = filtered_data, method = "loess", se = FALSE,
              color = "#ffbb6f", linetype = "solid", linewidth = 1, alpha = 0.5) +
  labs(
    title = "NBA Availability Decrease",
    subtitle = "Top 250 by minutes played show rising games missed in recent seasons",
    caption = "Source: Basketball-Reference | Plot by @BeyondLines__",
    x = "Season", y = "Average Games Missed"
  ) +
  scale_x_continuous(breaks = seq(2000, 2025, by = 1)) +
  scale_y_continuous(limits = c(5, 23.5)) +
  annotate("text", x = 2020.3, y = 20, label = "COVID Impact",
           hjust = 1.1, vjust = 4, size = 3.5, color = "red", fontface = "italic") +
  annotate("text", x = 2000, y = 10, label = "Trend (excl. COVID)",
           hjust = 0, vjust = 0, size = 3.5, color = "#ffbb6f", fontface = "bold") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    panel.grid.major = element_line(color = "#f1f1f1"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "#ffffff", color = NA),
    plot.background = element_rect(fill = "#ffffff", color = "white", linewidth = 1),
    plot.margin = margin(10, 10, 10, 10)
  )

plotInjuries

As we can see, the average number of games missed increased slightly from 2000 to 2019. It skyrocketed in 2020 due to COVID and the accompanying health and safety protocols (highlighted in red). As the restrictions eased the value came down, but it has increased again in the most recent season and currently shows no sign of slowing down post-COVID.

Team Movement & Speed Data (2014–2025): Ingestion & Merge

We read four team‑level datasets and merge them by Team and Season, constructing derived features where helpful.

# File path and sheet names
file_path <- "data/Team Data/10yrNBAdata.xlsx"
sheets <- c("10yrAge", "10yrSpeed", "10yrAdvanced", "10yrDrives")
data_list <- list()

# Clean any trailing/misc characters from team names
clean_team_names <- function(team_name) gsub("\\*", "", team_name)

for (sheet in sheets) {
  data <- readxl::read_excel(file_path, sheet = sheet)
  data$Team <- clean_team_names(data$Team)
  data_list[[sheet]] <- data
}

age_data      <- data_list[["10yrAge"]]
speed_data    <- data_list[["10yrSpeed"]]
advanced_data <- data_list[["10yrAdvanced"]]
drives_data   <- data_list[["10yrDrives"]]

# Merge and tidy
merged_data <- purrr::reduce(
  list(age_data, speed_data, advanced_data, drives_data),
  ~ merge(.x, .y, by = c("Team", "Season"), all = TRUE)
)
colnames(merged_data) <- make.names(colnames(merged_data), unique = TRUE)

merged_data <- merged_data %>%
  dplyr::mutate(
    Season_Team = paste(Season, Team, sep = " - "),
    Season_Normalized = as.numeric(as.factor(Season)),
    Dist.Per.Possession = Dist..Miles / PACE
  )

# League average distance per possession
league_avg_data <- merged_data %>%
  dplyr::group_by(Season) %>%
  dplyr::summarise(Avg.Dist.Per.Possession = mean(Dist.Per.Possession, na.rm = TRUE))

Using data from the NBA stats page on player movement, we can capture, at the surface level that is available to the public, how NBA teams move across the court. To investigate the impact of this movement we capture drives per game, possessions per game, distance covered, and distance covered per possession.
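As a worked example of the derived feature, the sketch below divides distance by pace for three hypothetical team-seasons and then averages across teams, mirroring the `Dist.Per.Possession` and league-average steps. The team rows and their values are made up for illustration, and we assume both inputs are per-game figures.

```r
# Hypothetical team-season rows; values are made up for illustration.
teams <- data.frame(
  Team        = c("A", "B", "C"),
  Season      = c(2024, 2024, 2024),
  Dist..Miles = c(18.1, 18.6, 17.9),   # team distance covered (assumed per game)
  PACE        = c(98.5, 101.2, 97.0)   # possessions per game
)

# Distance covered per possession, in miles
teams$Dist.Per.Possession <- teams$Dist..Miles / teams$PACE

# League average for the season: simple mean across teams
league_avg <- mean(teams$Dist.Per.Possession)

print(round(teams$Dist.Per.Possession, 4))
print(round(league_avg, 4))
```

Normalizing by pace separates "more running because there are more possessions" from "more running within each possession", which is the quantity tracked over time in the league-average chart.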

Visuals: Pace, Distance, Drives

We explore associations among pace, total distance, and drives per game.

# Label helper for simple lm overlays: returns a plotmath string with slope and R^2
lm_eqn <- function(df, x, y) {
  formula <- as.formula(paste(y, "~", x))
  model <- lm(formula, data = df)
  eq <- substitute(
    italic(Slope) == b * "," ~~ italic(R)^2 ~ "=" ~ r2,
    list(
      # unname() drops the coefficient name so it does not leak into the label
      b = format(unname(coef(model)[2]), digits = 2),
      r2 = format(summary(model)$r.squared, digits = 3)
    )
  )
  as.character(as.expression(eq))
}

# League average trend
cor_val <- cor(league_avg_data$Season, league_avg_data$Avg.Dist.Per.Possession, method = "spearman")

league_avg_plot <- ggplot(league_avg_data, aes(x = Season, y = Avg.Dist.Per.Possession)) +
  geom_line(color = "#2C3E50", linewidth = 0.8, alpha = 0.75) +
  geom_point(color = "#2980B9", size = 3) +
  geom_text_repel(aes(label = round(Avg.Dist.Per.Possession, 3)),
                  size = 4, color = "black", nudge_y = 0.0005, max.overlaps = 20) +
  geom_smooth(method = "gam", se = TRUE, color = "#ffbb6f", linetype = "solid", linewidth = 1) +
  scale_x_continuous(breaks = 2014:2025) +
  annotate("text", x = 2014, y = 0.18,
           label = paste0("Spearman ρ = ", round(cor_val, 2)),
           hjust = 0, size = 5) +
  labs(
    title = "League Average Distance Covered per Possession",
    subtitle = "Miles traveled per possession from 2014 to 2025 NBA regular seasons",
    x = "Season",
    y = "Distance per Possession (Miles)",
    caption = "Source: www.nba.com/stats | Plot by @BeyondLines__"
  ) +
  theme_blog

# PACE vs Distance
plot1 <- ggplot(merged_data, aes(x = PACE, y = Dist..Miles, color = Season)) +
  geom_point(alpha = 0.75) +
  geom_smooth(method = "lm", se = TRUE, color = "#2980B9", linewidth = 2) +
  scale_color_viridis_c(option = "A", direction = -1) +
  labs(
    title = "Possessions vs Distance Covered",
    subtitle = "Correlation between pace and total distance run",
    x = "PACE (Possessions/Game)",
    y = "Distance (Miles)",
    caption = "Source: www.nba.com/stats | Plot by @BeyondLines__"
  ) +
  theme_blog +
  annotate("text", x = max(merged_data$PACE, na.rm = TRUE) - 1,
           y = min(merged_data$Dist..Miles, na.rm = TRUE) + 0.5,
           label = lm_eqn(merged_data, "PACE", "Dist..Miles"),
           parse = TRUE, hjust = 1, size = 4)

# Drives vs PACE
plot2 <- ggplot(merged_data, aes(x = DRIVES, y = PACE, color = Season)) +
  geom_point(alpha = 0.75) +
  geom_smooth(method = "lm", se = TRUE, color = "#2980B9", linewidth = 2) +
  scale_color_viridis_c(option = "A", direction = -1) +
  labs(
    title = "Drives vs Possessions per Game",
    subtitle = "Are more drives associated with a faster pace?",
    x = "Drives per Game",
    y = "PACE",
    caption = "Source: www.nba.com/stats | Plot by @BeyondLines__"
  ) +
  theme_blog +
  annotate("text", x = max(merged_data$DRIVES, na.rm = TRUE) - 1,
           y = min(merged_data$PACE, na.rm = TRUE) + 0.5,
           label = lm_eqn(merged_data, "DRIVES", "PACE"),
           parse = TRUE, hjust = 1, size = 4)

plot1

plot2

league_avg_plot

Possessions vs Distance Covered: Over time we can see the color gradient of the team-season points shift toward the top-right quadrant. This indicates that by 2025 teams are not only covering more total distance in games but also playing at a faster pace. Players are being asked to play harder and faster.

Drives vs Possessions: Distance is not always associated with hard play; depending on the circumstances it can be low-effort running. However, drives and pace show the same gradient shift into the top-right quadrant over the last 10 years, which supports the view that players are pushing themselves harder and harder.

League Average Distance Covered per Possession: The distance covered per possession increases over time. This view adds time-series context: the intensity of play, and with it the total physical tax players put on their bodies, is rising. How much longer will this upward trend continue, and will we see more games missed by stars as a result of pushing themselves so hard?
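The Spearman ρ annotated on the league-average chart measures how consistently the series rises with the season, without assuming a straight line. The sketch below shows that it is simply the Pearson correlation of the ranks. The `dpp` values are made up for illustration and are not the report's data.

```r
season <- 2014:2025
# Made-up values with a mostly increasing pattern
dpp <- c(0.1790, 0.1795, 0.1801, 0.1798, 0.1810, 0.1815,
         0.1812, 0.1822, 0.1828, 0.1831, 0.1836, 0.1840)

# Built-in Spearman correlation
rho_builtin <- cor(season, dpp, method = "spearman")

# Equivalent: Pearson correlation computed on the ranks
rho_ranks <- cor(rank(season), rank(dpp))

print(round(c(builtin = rho_builtin, via_ranks = rho_ranks), 3))
```

Because only the ordering matters, a couple of small season-to-season dips lower ρ slightly but do not change the conclusion of a strong monotone increase.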

Forecasts: Games Missed (Top‑Minutes Cohort)

We build a univariate time series of Average Games Missed and forecast the next five seasons with ARIMA, TSLM (trend), and Random Walk with drift.

avg_games_missed_by_year_ts <-  avg_games_missed_by_year %>% 
  as_tsibble(index = Year)

gm_last <- avg_games_missed_by_year %>%
  dplyr::slice_max(Year, n = 1, with_ties = FALSE)

gamesmissed_base <- autoplot(avg_games_missed_by_year_ts) +
  geom_point(size = 3, color = "#0B7285") +
  geom_line(linewidth = 1.2, color = "#0B7285") +
  geom_text(data = gm_last,
            aes(x = Year, y = Avg_Games_Missed,
                label = paste0("Last: ",
                               scales::number(Avg_Games_Missed, accuracy = 0.1))),
            nudge_x = -1.2, nudge_y = 1, size = 5, hjust = 0) +
  scale_x_continuous(breaks = scales::pretty_breaks(8), expand = expansion(mult = c(0.01, 0.06))) +
  scale_y_continuous(breaks = scales::pretty_breaks(6)) +
  labs(
    title = "Average Games Missed Per Season",
    subtitle = "Top 250 by minutes played each year",
    x = "Season", y = "AVG Games Missed",
    caption = "Source: Basketball-Reference  |  @BeyondLines__"
  ) +
  coord_cartesian(clip = "off") +
  theme_deck

# ACF/PACF
gamesmissed_acf <- ACF(avg_games_missed_by_year_ts, Avg_Games_Missed) |> autoplot() +
  labs(title = "ACF", x = "Lag", y = "Correlation") +
  coord_cartesian(ylim = c(-1,1)) + theme_deck +
  theme(panel.grid.major.x = element_blank())

gamesmissed_pacf <- PACF(avg_games_missed_by_year_ts, Avg_Games_Missed) |> autoplot() +
  labs(title = "PACF", x = "Lag", y = "Partial correlation") +
  coord_cartesian(ylim = c(-1,1)) + theme_deck +
  theme(panel.grid.major.x = element_blank())

gamesmissed_acf_pacf <- (gamesmissed_acf / gamesmissed_pacf) +
  patchwork::plot_annotation(title = "Games missed: ACF and PACF") &
  theme(plot.title = element_text(face = "bold", size = 22))

# Impute COVID gap with linear interpolation for modeling
avg_games_missed_by_year_ts_less <- avg_games_missed_by_year_ts %>%
  as_tibble() %>%
  arrange(Year) %>%
  mutate(
    is_imputed = Year %in% 2021:2022,
    Avg_Games_Missed = approx(
      x = Year[!is_imputed],
      y = Avg_Games_Missed[!is_imputed],
      xout = Year
    )$y
  ) %>%
  as_tsibble(index = Year)

avg_games_missed_by_year_inter <- avg_games_missed_by_year |>
  arrange(Year) |>
  mutate(
    is_imputed = Year %in% 2021:2022,
    Avg_Games_Missed_interp = approx(
      x = Year[!is_imputed],
      y = Avg_Games_Missed[!is_imputed],
      xout = Year
    )$y
  )

gamesmissed_inter <- ggplot(avg_games_missed_by_year_inter,
                            aes(Year, Avg_Games_Missed_interp)) +
  annotate("rect", xmin = 2020.3, xmax = 2022.7, ymin = -Inf, ymax = Inf,
           fill = "red", alpha = 0.06) +
  geom_line(linewidth = 1.2, color = "#0B7285") +
  geom_point(aes(shape = is_imputed), size = 3.2, stroke = 1.1, color = "#0B7285", fill = "white") +
  scale_shape_manual(values = c(`FALSE` = 16, `TRUE` = 21)) +
  labs(
    title = "Games missed with linear interpolation for COVID gap",
    subtitle = "Open markers indicate imputed seasons",
    x = "Season", y = "Average games missed"
  ) +
  theme_deck

# Fit
gm_fits <- avg_games_missed_by_year_ts_less %>%
  model(
    ARIMA = ARIMA(Avg_Games_Missed),
    TSLM  = TSLM(Avg_Games_Missed ~ trend()),
    RW    = RW(Avg_Games_Missed ~ drift())
  )

gamesmissed_accuracy <- accuracy(gm_fits)
gm_fc_tbl <- forecast(gm_fits, h = "5 years")

gamesmissed_forecast <- autoplot(gm_fc_tbl, avg_games_missed_by_year_ts, level = 80) +
  geom_text(data = gm_last,
            aes(x = Year, y = Avg_Games_Missed,
                label = paste0("Last: ",
                               scales::number(Avg_Games_Missed, accuracy = 0.01))),
            nudge_x = -2.2, nudge_y = 0.01, size = 3.3, hjust = 0) +
  shade_future(max(avg_games_missed_by_year_ts$Year)+1, max(avg_games_missed_by_year_ts$Year) + 5) +
  vline_future(max(avg_games_missed_by_year_ts$Year)+1) +
  scale_color_manual(values = model_cols[c("ARIMA","TSLM","RW")]) +
  scale_fill_manual(values  = scales::alpha(model_cols[c("ARIMA","TSLM","RW")], 0.18)) +
  labs(color = "Model", fill = "Model",
       title = "Forecast: Average Games Missed",
       subtitle = paste0("Forecast horizon 5 seasons, starting at ", max(avg_games_missed_by_year_ts$Year)+1),
       x = "Season", y = "AVG Games Missed") +
  theme_deck +
  geom_point(
    data = gm_fc_tbl,
    aes(x = Year, y = .mean, color = .model),
    size = 1.5,
    inherit.aes = FALSE
  )

# Labels for first/last forecast steps
gm_fc_labels <- gm_fc_tbl %>%
  dplyr::group_by(.model) %>%
  dplyr::slice_max(Year, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(lbl = scales::number(.mean, accuracy = 0.1))

gm_fc_labels_min <- gm_fc_tbl %>%
  dplyr::group_by(.model) %>%
  dplyr::slice_min(Year, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(lbl = scales::number(.mean, accuracy = 0.1))

gamesmissed_forecast <- gamesmissed_forecast +
  ggrepel::geom_text_repel(
    data = gm_fc_labels,
    aes(x = Year, y = .mean, label = lbl, color = .model),
    size = 3.3, nudge_x = 0.25, box.padding = 0.2, point.padding = 0.15,
    show.legend = FALSE, inherit.aes = FALSE, seed = 123
  ) +
  ggrepel::geom_text_repel(
    data = gm_fc_labels_min,
    aes(x = Year, y = .mean, label = lbl, color = .model),
    size = 3.3, nudge_x = -0.25, box.padding = 0.2, point.padding = 0.15,
    show.legend = FALSE, inherit.aes = FALSE, seed = 123
  ) +
  coord_cartesian(clip = "off") +
  scale_x_continuous(expand = expansion(mult = c(0.01, 0.08)))

# Print components
gamesmissed_acf_pacf

gamesmissed_inter

gamesmissed_forecast

ACF: small but significant positive spikes at lags 1–2, then it drops inside the bands and stays near zero. This says games missed is correlated with the last one to two seasons but there is no long-memory or seasonality in the annual data.

PACF: a clear spike at lag 1 and a smaller one at lag 2, then nothing systematic. That pattern points to a short AR structure, roughly AR(1)–AR(2) at most.
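To make the ACF reading concrete, the sketch below computes the lag-1 autocorrelation by hand, compares it with `stats::acf()`, and derives the approximate 95% white-noise band that spikes are judged against. The series is made up for illustration; with roughly a dozen annual points the band is wide, so only large spikes count as significant.

```r
# Made-up annual series for illustration only.
y <- c(10.1, 10.6, 10.4, 11.2, 11.5, 11.3, 12.0, 12.6, 12.4, 13.1, 13.5, 13.9)
n <- length(y)

# Lag-1 autocorrelation computed by hand
yc <- y - mean(y)
r1_manual <- sum(yc[-1] * yc[-n]) / sum(yc^2)

# Same quantity from stats::acf(); the first element is lag 0, so drop it
acf_vals <- as.numeric(acf(y, lag.max = 3, plot = FALSE)$acf)[-1]

# Approximate 95% band under the white-noise assumption
band <- qnorm(0.975) / sqrt(n)

print(round(c(manual = r1_manual, builtin = acf_vals[1], band = band), 3))
```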

COVID seasons are structural outliers: shortened schedules, health-and-safety protocols, and postponements inflated “games missed” for reasons unrelated to baseline injury risk. Interpolating those seasons smooths the shock so the models learn the underlying availability trend, not a one-off pandemic artifact, while keeping season-to-season comparisons fair.
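The interpolation step can be illustrated in isolation. The sketch below applies the same `approx()` pattern to a made-up series in which 2021 and 2022 are inflated; the values are illustrative assumptions, not the report's data.

```r
year   <- 2017:2025
# Made-up series with an inflated 2021-2022 window
missed <- c(12.0, 12.4, 12.9, 13.5, 21.0, 18.0, 14.4, 14.8, 15.3)

# Flag the seasons to replace, then interpolate linearly across the gap
is_imputed <- year %in% 2021:2022
missed_interp <- approx(
  x    = year[!is_imputed],
  y    = missed[!is_imputed],
  xout = year
)$y

print(data.frame(year, missed, missed_interp, is_imputed))
```

The two flagged seasons are replaced by points on the straight line between the 2020 and 2023 values (about 13.8 and 14.1 here), while every other season is left unchanged.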

ARIMA is the headline forecast. It has the lowest MAE and RMSE (≈0.77 and 0.94) and the best MAPE (≈7.7%), which matches the short AR structure we saw in the ACF/PACF.

TSLM is a solid trend baseline with slightly higher error. Keep it to show the long-run linear view.

Random Walk (RW) has the highest error, so treat it as a sanity check that continues from the latest level.
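For intuition about the random-walk benchmark, its point forecast with drift is the last observed value plus the average historical change per season, extended h steps ahead. The sketch below works this out in base R on a made-up series; it illustrates the textbook formula and is not a re-implementation of the fitted `RW()` model.

```r
# Made-up history for illustration only.
y <- c(12.0, 12.4, 12.9, 13.5, 13.8, 14.1, 14.4, 14.8, 15.3)
n <- length(y)

# Drift: average change per season, equal to (last - first) / (n - 1)
drift <- (y[n] - y[1]) / (n - 1)

# Point forecasts for the next five seasons
h <- 1:5
rw_drift_fc <- y[n] + h * drift

print(round(rw_drift_fc, 2))
```

Because the forecast depends only on the first and last observations, it ignores everything in between, which is why it serves as a sanity check rather than a headline model.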

We use ARIMA for the main projection, cross-checked against a linear trend and a random-walk continuation. All three agree on direction, with ARIMA most accurate on history.

Forecasts: Distance per Possession (League Average)

We repeat the univariate workflow for distance per possession and compare multiple models (ARIMA, ETS, TSLM, RW).

league_avg_dist_pp_ts <- league_avg_data %>% 
  as_tsibble(index = Season)

# Label last observed
dpp_last <- league_avg_data %>%
  dplyr::slice_max(Season, n = 1, with_ties = FALSE)

dpp_base <- autoplot(league_avg_dist_pp_ts) +
  geom_point(size = 3, color = "#0B7285") +
  geom_line(linewidth = 1.2, color = "#0B7285") +
  geom_text(data = dpp_last,
            aes(x = Season, y = Avg.Dist.Per.Possession,
                label = paste0("Last: ",
                               scales::number(Avg.Dist.Per.Possession, accuracy = 0.001))),
            nudge_x = -1.2, nudge_y = 0.0008, size = 5, hjust = 0) +
  scale_x_continuous(breaks = scales::pretty_breaks(8), expand = expansion(mult = c(0.01, 0.06))) +
  scale_y_continuous(labels = \(x) scales::number(x, accuracy = 0.001)) +
  labs(
    title = "League Average Distance Per Possession",
    subtitle = "Regular seasons 2014 to 2025",
    x = "Season", y = "DPP (Miles)",
    caption = "Source: nba.com/stats  |  @BeyondLines__"
  ) +
  coord_cartesian(clip = "off") +
  theme_deck

# ACF/PACF
dpp_acf  <- ACF(league_avg_dist_pp_ts, Avg.Dist.Per.Possession) |> autoplot() +
  labs(title = "ACF", x = "Lag", y = "Correlation") +
  coord_cartesian(ylim = c(-1, 1)) + theme_deck +
  theme(panel.grid.major.x = element_blank())

dpp_pacf <- PACF(league_avg_dist_pp_ts, Avg.Dist.Per.Possession) |> autoplot() +
  labs(title = "PACF", x = "Lag", y = "Partial correlation") +
  coord_cartesian(ylim = c(-1, 1)) + theme_deck +
  theme(panel.grid.major.x = element_blank())

dpp_acf_pacf <- (dpp_acf / dpp_pacf) +
  patchwork::plot_annotation(title = "Distance per possession: ACF and PACF") &
  theme(plot.title = element_text(face = "bold", size = 22))

# STL decomposition
dpp_decom <- league_avg_dist_pp_ts |>
  model(STL(Avg.Dist.Per.Possession)) |>
  components() |>
  autoplot() +
  labs(
    title = "STL decomposition of distance per possession",
    x = "Season", y = NULL
  ) +
  scale_x_continuous(breaks = scales::pretty_breaks(6)) +
  theme_deck

# Fit models
fits <- league_avg_dist_pp_ts %>%
  model(
    ARIMA = ARIMA(Avg.Dist.Per.Possession),
    ETS   = ETS(Avg.Dist.Per.Possession),
    TSLM  = TSLM(Avg.Dist.Per.Possession ~ trend()),
    RW    = RW(Avg.Dist.Per.Possession ~ drift())
  )

# Accuracy & forecasts
dpp_accuracy <- accuracy(fits)
dpp_fc_tbl   <- forecast(fits, h = "5 years")

dpp_fc_start <- max(league_avg_dist_pp_ts$Season, na.rm = TRUE)
dpp_fc_end   <- dpp_fc_start + 5

dpp_forecast <- autoplot(dpp_fc_tbl, league_avg_dist_pp_ts, level = 80) +
  geom_text(data = dpp_last,
            aes(x = Season, y = Avg.Dist.Per.Possession,
                label = paste0("Last: ",
                               scales::number(Avg.Dist.Per.Possession, accuracy = 0.001))),
            nudge_x = -1.2, nudge_y = 0.0005, size = 3.3, hjust = 0) +
  shade_future(dpp_fc_start+1, dpp_fc_end) +
  vline_future(dpp_fc_start+1) +
  scale_color_manual(values = model_cols[c("ARIMA","ETS","TSLM","RW")]) +
  scale_fill_manual(values  = scales::alpha(model_cols[c("ARIMA","ETS","TSLM","RW")], 0.18)) +
  labs(color = "Model", fill = "Model",
       title = "Forecast: Distance Per Possession",
       subtitle = paste0("Forecast horizon 5 seasons, starting at ", dpp_fc_start + 1),
       x = "Season", y = "Dist Per Poss (Miles)") +
  theme_deck +
  geom_point(
    data = dpp_fc_tbl,
    aes(x = Season, y = .mean, color = .model),
    size = 1.5,
    inherit.aes = FALSE
  )

# Labels for first/last forecast steps
dpp_fc_labels <- dpp_fc_tbl %>%
  dplyr::group_by(.model) %>%
  dplyr::slice_max(Season, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(lbl = scales::number(.mean, accuracy = 0.001))

dpp_fc_labels_min <- dpp_fc_tbl %>%
  dplyr::group_by(.model) %>%
  dplyr::slice_min(Season, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(lbl = scales::number(.mean, accuracy = 0.001))

dpp_forecast <- dpp_forecast +
  ggrepel::geom_text_repel(
    data = dpp_fc_labels,
    aes(x = Season, y = .mean, label = lbl, color = .model),
    size = 3.3, nudge_x = 0.25, box.padding = 0.2, point.padding = 0.15,
    show.legend = FALSE, inherit.aes = FALSE, seed = 123
  ) +
  ggrepel::geom_text_repel(
    data = dpp_fc_labels_min,
    aes(x = Season, y = .mean, label = lbl, color = .model),
    size = 3.3, nudge_x = -0.25, box.padding = 0.2, point.padding = 0.15,
    show.legend = FALSE, inherit.aes = FALSE, seed = 123
  ) +
  coord_cartesian(clip = "off") +
  scale_x_continuous(expand = expansion(mult = c(0.01, 0.08)))

# Print components
dpp_base

dpp_acf_pacf

dpp_decom

dpp_forecast

ACF shows one clear positive spike at lag 1, then values sit inside the bands with no repeating pattern.

PACF also has a single significant spike at lag 1 and little else.

TSLM has the lowest training error by a small margin (MAE 0.001, RMSE 0.002, essentially tied with ETS at three decimals) and produces a modest upward path that matches the diagnostics for a short-memory series with a linear trend. ETS is a conservative baseline that largely holds the level near 0.184, useful as a “flat” scenario.

RW with drift gives the highest trajectory and can serve as an upper-bound continuation from the last value.

ARIMA dips below the 2025 level in the fan chart, which is inconsistent with the simple ACF/PACF read and likely reflects over-differencing or parameter noise.

The errors are tiny in absolute terms and nearly indistinguishable across models, the history is short (annual 2014–2025), and there is no out-of-sample test for this series, so treat these as directional guides with uncertainty rather than precise point predictions.
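One way to add an out-of-sample check on a series this short is a rolling-origin evaluation: fit on the first k seasons, forecast season k + 1, then grow the window. The sketch below does this in base R for a linear trend and a random walk with drift. The series, the `min_train` window of 6, and the choice of these two models are illustrative assumptions, not the report's fitted models.

```r
# Made-up annual series standing in for league-average distance per possession.
y <- c(0.1790, 0.1795, 0.1801, 0.1798, 0.1810, 0.1815,
       0.1812, 0.1822, 0.1828, 0.1831, 0.1836, 0.1840)
n <- length(y)
min_train <- 6   # assumption: smallest training window worth fitting

origins <- min_train:(n - 1)
errors <- t(sapply(origins, function(k) {
  train <- data.frame(time_idx = seq_len(k), value = y[seq_len(k)])
  # Linear trend: one-step-ahead forecast from lm()
  trend_fc <- predict(lm(value ~ time_idx, data = train),
                      newdata = data.frame(time_idx = k + 1))
  # Random walk with drift: last value plus average historical change
  rw_fc <- y[k] + (y[k] - y[1]) / (k - 1)
  c(trend = unname(trend_fc) - y[k + 1], rw_drift = rw_fc - y[k + 1])
}))

# Out-of-sample mean absolute error per model
oos_mae <- colMeans(abs(errors))
print(signif(oos_mae, 3))
```

With only a handful of forecast origins the resulting errors are noisy, but they are at least computed on seasons the model did not see, unlike the training accuracy reported here.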

Accuracy Tables

We format model accuracy side‑by‑side using gt tables.

# accuracy() reports MAPE in percent units (7.7 means 7.7%), so always rescale
# by 100 before formatting. Guessing the unit from the magnitude would misprint
# a sub-1.5% error such as 0.81 as "81.0%".
fmt_pct_auto <- function(x, acc = 0.1) {
  if (is.null(x) || all(is.na(x))) return(x)
  scales::percent(x / 100, accuracy = acc)
}

make_gt <- function(.acc, title) {
  df <- .acc
  if (".model_desc" %in% names(df) && !"Model" %in% names(df)) df <- df %>% dplyr::rename(Model = .model_desc)
  if (".model"      %in% names(df) && !"Model" %in% names(df)) df <- df %>% dplyr::rename(Model = .model)
  if (".type"       %in% names(df) && !"Type"  %in% names(df)) df <- df %>% dplyr::rename(Type  = .type)
  if ("me"          %in% names(df) && !"ME"    %in% names(df)) df <- df %>% dplyr::rename(ME    = me)
  if ("rmse"        %in% names(df) && !"RMSE"  %in% names(df)) df <- df %>% dplyr::rename(RMSE  = rmse)
  if ("mae"         %in% names(df) && !"MAE"   %in% names(df)) df <- df %>% dplyr::rename(MAE   = mae)
  if ("mape"        %in% names(df) && !"MAPE"  %in% names(df)) df <- df %>% dplyr::rename(MAPE  = mape)
  if ("smape"       %in% names(df) && !"sMAPE" %in% names(df)) df <- df %>% dplyr::rename(sMAPE = smape)
  if ("rsq"         %in% names(df) && !"R2"    %in% names(df)) df <- df %>% dplyr::rename(R2    = rsq)

  keep <- intersect(c("Model","Type","MAE","RMSE","MAPE","sMAPE","R2"), names(df))
  df <- df %>% dplyr::select(dplyr::all_of(keep))

  best_rmse <- if ("RMSE" %in% names(df)) min(df$RMSE, na.rm = TRUE) else NA_real_
  best_mae  <- if ("MAE"  %in% names(df)) min(df$MAE,  na.rm = TRUE) else NA_real_
  best_r2   <- if ("R2"   %in% names(df)) max(df$R2,   na.rm = TRUE) else NA_real_

  df %>%
    dplyr::mutate(
      MAPE  = if ("MAPE"  %in% names(.)) fmt_pct_auto(MAPE)  else NULL,
      sMAPE = if ("sMAPE" %in% names(.)) fmt_pct_auto(sMAPE) else NULL,
      R2    = if ("R2"    %in% names(.)) sprintf("%.3f", R2) else NULL
    ) %>%
    gt() %>%
    tab_header(title = title) %>%
    { if ("MAE" %in% names(df)) fmt_number(., columns = MAE, decimals = 3) else . } %>%
    { if ("RMSE" %in% names(df)) fmt_number(., columns = RMSE, decimals = 3) else . } %>%
    cols_align(align = "right", columns = everything()) %>%
    { if (!is.na(best_rmse)) tab_style(.,
                                       list(cell_fill(color = "#e1f3d8"), cell_text(weight = "bold")),
                                       locations = cells_body(columns = RMSE, rows = RMSE == best_rmse)) else . } %>%
    { if (!is.na(best_mae)) tab_style(.,
                                      list(cell_fill(color = "#e1f3d8"), cell_text(weight = "bold")),
                                      locations = cells_body(columns = MAE, rows = MAE == best_mae)) else . } %>%
    { if (!is.na(best_r2) && is.finite(best_r2)) tab_style(.,
                                                           list(cell_fill(color = "#d9ebff"), cell_text(weight = "bold")),
                                                           locations = cells_body(columns = R2, rows = R2 == sprintf("%.3f", best_r2))) else . } %>%
    tab_options(table.font.size = px(13), data_row.padding = px(6))
}

# Build and print tables
gt_dpp   <- make_gt(dpp_accuracy,         "Distance per possession: training accuracy")
gt_games <- make_gt(gamesmissed_accuracy, "Games missed: training accuracy")

gt_dpp

Distance per possession: training accuracy
Model	Type	MAE	RMSE	MAPE
ARIMA	Training	0.002	0.002	0.9%
ETS	Training	0.001	0.002	0.8%
TSLM	Training	0.001	0.002	0.8%
RW	Training	0.002	0.002	0.9%

gt_games

Games missed: training accuracy
Model	Type	MAE	RMSE	MAPE
ARIMA	Training	0.772	0.937	7.7%
TSLM	Training	0.929	1.138	9.1%
RW	Training	1.011	1.188	10.2%
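For readers who want to check the table columns by hand, the three metrics are simple functions of the residuals. The sketch below computes them in base R on made-up actual and fitted values, which are illustrative assumptions rather than output from the models in this report. MAPE is expressed in percent units here, so a value of 1.5 means 1.5%.

```r
# Made-up actual and fitted values for illustration only.
actual <- c(12.0, 12.4, 12.9, 13.5, 13.8)
fitted <- c(12.2, 12.3, 13.1, 13.2, 14.0)
resid  <- actual - fitted

mae  <- mean(abs(resid))                  # mean absolute error
rmse <- sqrt(mean(resid^2))               # root mean squared error
mape <- mean(abs(resid / actual)) * 100   # mean absolute percentage error, in percent

print(round(c(MAE = mae, RMSE = rmse, MAPE = mape), 3))
```

RMSE is never smaller than MAE because squaring weights large misses more heavily. MAPE rescales each error by the level of the series, which makes it comparable between games missed (level around 10 to 15) and distance per possession (level around 0.18).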

What we’d take forward (accuracy vs. interpretability)

Distance per Possession (DPP): Errors are very similar across ARIMA/ETS/TSLM/RW. For communication and policy (“how fast is the game getting?”), we would lead with TSLM (trend): it is transparent (a single trend coefficient), stable, and, per the forecast chart, projects a gradual uptick to ~0.185–0.187 miles/possession over five seasons. Keep ETS as a secondary check for a data-driven level/smoothing model (similar accuracy, still simple).

Average Games Missed: ARIMA is the accuracy leader (lowest MAE/RMSE), with TSLM close behind but more interpretable. In practice: use ARIMA for the headline forecast (it projects a modest rise toward the mid-15s) and TSLM as the narrative companion that explains the underlying upward trend; RW remains a baseline/benchmark only.

Why this matters for the NBA story

The models collectively suggest movement per possession is inching up while games missed are drifting higher, even after interpolating over the COVID shock. That supports the working hypothesis in this report: more continuous, high-tempo movement loads (plus the strategic rest culture it encourages) are consistent with the creeping unavailability we observe.

This isn’t a causal claim, but the alignment is practical: if the league continues to reward relentless pace/space and drive volume, player-load management (sports science, travel, schedule density, substitution patterns) becomes a first-order lever for keeping stars on the floor. Our recommended modeling set (ARIMA + TSLM) gives teams and league offices both accuracy and explainability to track these trends season by season and stress-test policy changes.

NBA Workloads, Availability & Movement: Trends and Forecasts

Alexander Martin

2025-10-25