Unexpected NBA Player Performance Analysis

Author

Lucas Tetrault

Audience

This analysis is intended for NBA Commissioner Adam Silver and his team, along with any interested front offices around the league. It could also be useful for sports analysts, broadcasters, or anyone else who creates content for sports media outlets such as ESPN, Bleacher Report (B/R), or Barstool.


Background & Objective

The games where a player greatly exceeds expectations are the performances that NBA fans remember the most. Metrics like Game Score (GmSc) summarize a player’s overall contribution in a game. The main objective of this project is to determine how unexpected player performances vary across teams and over time, identifying key factors that contribute to high-impact performances along the way.


library(tidyverse)
library(ggthemes)
library(ggplot2)
library(ggrepel)
library(tsibble)
library(pwr)
nba <- read.csv("nba.csv")

Data Description

This analysis uses the nba.csv dataset, which contains each NBA player’s single most unexpected performance from the 1985–2022 seasons. The data was sourced from Basketball Reference and compiled for The Greatest Unexpected NBA Performances project (February 2023). Each row represents one player and their standout game, meaning the dataset is focused specifically on peak, outlier performances rather than full career histories.

Key variables include traditional box score statistics such as points (PTS), rebounds (TRB), assists (AST), steals (STL), and blocks (BLK), along with Game Score (GmSc) which summarizes overall performance. The dataset also includes contextual information such as team (Tm), opponent (Opp), season and whether the game occurred in the playoffs. Most importantly, it introduces measures of unexpectedness including a moving Z-score (GmScMovingZ) which compares a performance to a player’s typical level, and a secondary Game Score (GmSc2) to evaluate how much a player’s top performance differs from their next-best game.
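For readers unfamiliar with Game Score, the sketch below shows how it is computed from a single box score line, using John Hollinger's formula as listed in the Basketball Reference glossary. The function name `game_score` and the example stat line are hypothetical. Several inputs (field goals, free throws, fouls, turnovers, and the offensive/defensive rebound split) are assumed to be available from a full box score and are not among the variables described above.

```r
# Hollinger's Game Score for one player in one game.
# Argument names are illustrative, not column names from nba.csv.
game_score <- function(pts, fg, fga, ft, fta, orb, drb, stl, ast, blk, pf, tov) {
  pts + 0.4 * fg - 0.7 * fga - 0.4 * (fta - ft) +
    0.7 * orb + 0.3 * drb + stl + 0.7 * ast + 0.7 * blk -
    0.4 * pf - tov
}

# Hypothetical stat line, made up for illustration
example_gmsc <- game_score(pts = 30, fg = 10, fga = 20, ft = 8, fta = 10,
                           orb = 2, drb = 6, stl = 1, ast = 5, blk = 1,
                           pf = 3, tov = 2)
print(example_gmsc)
```

Points enter the formula with a weight of 1, while missed shots and turnovers subtract from the total, so a high-scoring game on poor efficiency can still produce a modest Game Score.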

The data is limited to players whose careers began in or after the 1984–85 season, when Game Score first became available. Unexpectedness is calculated using a centered moving average over a full season of games before and after each performance, allowing for a consistent comparison across different eras.
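As a rough illustration of the idea behind GmScMovingZ, the sketch below computes a centered moving z-score for a sequence of one player's Game Scores. It is not the procedure used to build the dataset: the function name `moving_z`, the default half-window of 82 games, and the choice to exclude the game itself from its own window are all assumptions made for this example.

```r
# Centered moving z-score: compare each game with the games around it.
# half_window = number of games considered on each side (assumed value).
moving_z <- function(gmsc, half_window = 82) {
  n <- length(gmsc)
  vapply(seq_len(n), function(i) {
    idx <- max(1, i - half_window):min(n, i + half_window)
    idx <- idx[idx != i]                      # assumption: leave out the game itself
    (gmsc[i] - mean(gmsc[idx])) / sd(gmsc[idx])
  }, numeric(1))
}

# Hypothetical sequence with one standout game in the middle
toy_gmsc <- c(10, 12, 9, 11, 40, 10, 12, 8, 11)
toy_z <- moving_z(toy_gmsc, half_window = 4)
print(round(toy_z, 2))
```

The standout game receives by far the largest z-score because it is measured against the player's own surrounding games rather than against the league as a whole.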

Initial Exploration

summary(nba[, c("GmSc", "PTS", "TRB", "AST", "GmScMovingZ")])
      GmSc            PTS             TRB             AST       
 Min.   : 6.40   Min.   : 4.00   Min.   : 0.00   Min.   : 0.00  
 1st Qu.:18.90   1st Qu.:19.00   1st Qu.: 4.00   1st Qu.: 1.00  
 Median :24.10   Median :24.00   Median : 7.00   Median : 3.00  
 Mean   :25.14   Mean   :26.06   Mean   : 7.37   Mean   : 3.74  
 3rd Qu.:30.10   3rd Qu.:32.00   3rd Qu.:10.00   3rd Qu.: 5.00  
 Max.   :64.60   Max.   :81.00   Max.   :29.00   Max.   :22.00  
  GmScMovingZ   
 Min.   :2.170  
 1st Qu.:3.240  
 Median :3.630  
 Mean   :3.691  
 3rd Qu.:4.050  
 Max.   :6.750  
nba |>
  group_by(Playoffs) |>
  summarize(
    avg_pts = mean(PTS, na.rm = TRUE),
    avg_gmsc = mean(GmSc, na.rm = TRUE)
    )
# A tibble: 2 × 3
  Playoffs avg_pts avg_gmsc
  <chr>      <dbl>    <dbl>
1 false       25.9     25.0
2 true        32.9     30.3
team_colors <- c(
  "ATL" = "#E03A3E",  # Hawks
  "BOS" = "#007A33",  # Celtics
  "BRK" = "#000000",  # Nets
  "CHA" = "#1D1160",  # Hornets
  "CHI" = "#CE1141",  # Bulls
  "CLE" = "#6F263D",  # Cavaliers
  "DAL" = "#00538C",  # Mavericks
  "DEN" = "#0E2240",  # Nuggets
  "DET" = "#C8102E",  # Pistons
  "GSW" = "#1D428A",  # Warriors
  "HOU" = "#CE1141",  # Rockets
  "IND" = "#002D62",  # Pacers
  "LAC" = "#C8102E",  # Clippers
  "LAL" = "#552583",  # Lakers
  "MEM" = "#5D76A9",  # Grizzlies
  "MIA" = "#98002E",  # Heat
  "MIL" = "#00471B",  # Bucks
  "MIN" = "#0C2340",  # Timberwolves
  "NOP" = "#0C2340",  # Pelicans
  "NYK" = "#006BB6",  # Knicks
  "OKC" = "#007AC1",  # Thunder
  "ORL" = "#0077C0",  # Magic
  "PHI" = "#006BB6",  # 76ers
  "PHO" = "#E56020",  # Suns
  "POR" = "#E03A3E",  # Trail Blazers
  "SAC" = "#5A2D81",  # Kings
  "SAS" = "#C4CED4",  # Spurs
  "TOR" = "#CE1141",  # Raptors
  "UTA" = "#002B5C",  # Jazz
  "WAS" = "#002B5C",  # Wizards
  "WSB" = "#002B5C",  # Bullets
  "NJN" = "#000000",  # Nets (New Jersey)
  "SEA" = "#00471B"   # SuperSonics
)

ggplot(nba, aes(x = GmSc, y = reorder(Tm, GmSc), fill = Tm)) +
  geom_boxplot() +
  scale_fill_manual(values = team_colors) +
  theme_fivethirtyeight() +
  labs(
    title = "Game Score Distribution by Team",
    x = "Game Score",
    y = "Team",
    fill = "Team"
  ) +
  theme(legend.position = "none")

Initial EDA shows that Game Score varies widely across teams. Some teams show a greater spread of high Game Score performances than others, and a few have games that appear as outliers. Standout performances from teams like the Bulls, Lakers, and Rockets make sense given players like Michael Jordan, Kobe Bryant, and Hakeem Olajuwon, who are regarded as some of the greatest NBA players of all time and are known for historic performances on the court. Teams like the Phoenix Suns and Golden State Warriors are known for very effective offenses with consistently high scoring outputs that put their players in a position to succeed, likely because of their pace of play and 3-point shooting from players with high shot-making ability.

Assumptions

Game Score is a valid measure of overall player performance

This is important because the entire analysis is built around Game Score as the primary metric. If GmSc does not fully capture a player’s overall impact, then the results may be biased toward certain player types or styles of play.

Observations are independent

This is critical for statistical models like regression and hypothesis testing. If performances are actually related, then results may overstate significance and make relationships appear stronger than they really are.

Dataset represents a mix of regular season and playoff games

This matters when comparing contexts. Because playoff games are much less frequent, the dataset may be unbalanced, which could influence conclusions about differences between regular season and playoff performance.

Mitigating Risks

Avoid overinterpreting aggregated data like totals or averages

Aggregates can sometimes hide variability and outliers. We should try to avoid making broad claims that do not reflect individual differences or extreme performances.

Acknowledge missing variables such as minutes played or pace of play

These factors might heavily influence performance but are not included in the dataset. Recognizing this limitation helps avoid misattributing their effects to the variables that we do observe.

Use multiple models for more thorough analysis

Combining various visualizations, regression models and hypothesis testing helps validate findings across methods and reduces reliance on any single approach, leading to more robust conclusions.

Analysis & Support

nba <- read.csv("nba.csv")

nba |>
  ggplot(aes(x = GmSc, y = GmScMovingZ)) +
  geom_point(size = 2, alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Game Score vs Statistical Unexpectedness",
    x = "Game Score",
    y = "Game Score Moving Z"
  )

This scatterplot explores the relationship between a player's Game Score in a given game and how statistically unexpected that performance is. Most of the extremely high Game Scores do not have extreme z-scores, which suggests they usually come from NBA superstars for whom such games are less surprising. Some moderate Game Scores have extremely high z-scores, which likely represent ordinary players who far exceeded their usual level in that game. This shows that unexpectedness is not equivalent to raw performance: the true statistical outliers would be difficult to identify if only GmSc were taken into account.

nba |>
  ggplot(aes(x = Playoffs, y = GmScMovingZ)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Unexpectedness in Playoffs vs Regular Season",
    x = "Playoffs",
    y = "Game Score Moving Z"
  )

This boxplot compares the distributions of regular season and playoff performances by how unexpected they are, using the moving z-scores. The regular season values span a larger range and include more outliers, which is expected given how many more regular season games are played, while the playoff distribution looks fairly similar on a smaller scale. That similarity suggests that unexpected performances can be just as extreme in higher-stakes, more competitive games. One risk is that if playoff games have different variance patterns, the moving-window approach could bias the z-scores.

nba$Playoffs <- ifelse(nba$Playoffs == "true", 1, 0)
model_log <- glm(Playoffs ~ PTS + TRB + AST, data = nba, family = "binomial") 
summary(model_log)

Call:
glm(formula = Playoffs ~ PTS + TRB + AST, family = "binomial", 
    data = nba)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.03843    0.48283 -10.435  < 2e-16 ***
PTS          0.05410    0.01193   4.536 5.73e-06 ***
TRB         -0.00334    0.03308  -0.101    0.920    
AST         -0.01348    0.04703  -0.287    0.774    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 437.25  on 1702  degrees of freedom
Residual deviance: 418.37  on 1699  degrees of freedom
AIC: 426.37

Number of Fisher Scoring iterations: 6
nba |>
  ggplot(aes(x = PTS, y = Playoffs)) +
  geom_jitter(width = 0, height = 0.1, alpha = 0.5) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, color = "red") +
  labs(
    title = "Logistic Fit for Playoff Probability",
    x = "Points",
    y = "Probability of Playoff Game"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

The logistic regression results show that points scored is the only statistically significant predictor of whether a game is a playoff game (p < 0.001). The positive coefficient for PTS (0.054) indicates that each additional point is associated with higher log-odds that the performance occurred in a playoff game. The 95% confidence interval for PTS does not include zero, reinforcing that this relationship is statistically meaningful. In contrast, total rebounds and assists are not statistically significant: their p-values are very high and their confidence intervals both include zero. This suggests that these variables do not have a clear or reliable relationship with playoff status in this dataset. Overall, the model provides some evidence that scoring differs between playoff and regular season games, but it also suggests that other aspects of player performance may not be strong indicators of playoff context. The small drop from null deviance (437.25) to residual deviance (418.37) also indicates that the model improves only modestly on an intercept-only baseline.
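The interval and deviance statements above can be checked directly from the printed model summary. The sketch below is one way to do that with base R, using only numbers copied from the output. It uses a Wald approximation for the confidence interval, which can differ slightly from the profile-likelihood interval that `confint()` would report for the fitted model, and the variable names are chosen for this example.

```r
# Values copied from the PTS row of the model summary
est <- 0.05410
se  <- 0.01193

# Wald 95% confidence interval on the log-odds scale
ci_log_odds <- est + c(-1, 1) * qnorm(0.975) * se

# Exponentiate to get the odds ratio per additional point and its interval
odds_ratio <- exp(est)
ci_odds    <- exp(ci_log_odds)

# Likelihood-ratio test of the fitted model against the intercept-only model
lr_stat <- 437.25 - 418.37   # null deviance minus residual deviance
lr_df   <- 1702 - 1699       # three predictors added
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared for a binary outcome
pseudo_r2 <- 1 - 418.37 / 437.25

print(round(c(odds_ratio = odds_ratio, lower = ci_odds[1], upper = ci_odds[2]), 3))
print(c(lr_p = lr_p, pseudo_r2 = pseudo_r2))
```

Read this way, each additional point multiplies the odds of the performance being a playoff game by roughly 1.06, while a pseudo R-squared of about 0.04 is consistent with a model that is statistically better than the baseline but explains little of the variation.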

#aggregate game score by year
nba_year <- nba |>
  group_by(Year) |>
  summarize(AvgGmSc = mean(GmSc, na.rm = TRUE))

#convert to tsibble
nba_ts <- nba_year |> as_tsibble(index = Year)

#smoothed time series plot
ggplot(nba_year, aes(x = Year, y = AvgGmSc)) +
  geom_line(color = "gray70") +
  geom_smooth(span = 0.5, color = "purple") +
  theme_clean() +
  labs(
    title = "Average Game Score Over Time"
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

This time series plot shows how the average Game Score changes from year to year. There are some dramatic spikes and dips, especially in the earlier years, so the yearly average has not been very stable. This could be due to a number of factors, including small per-year sample sizes or the fact that all of the data comes from unexpected performances. There is a clear upward trend in average Game Score from the early 2000s to the present day. If that trend continues, average Game Score would keep rising, which could reflect changes in the NBA such as faster pace or better scoring efficiency. Note, however, that the loess smoother only describes the observed data and is not a forecast. Smoothing is helpful here because it reduces noise and highlights the underlying pattern without the drastic year-to-year changes shown in light grey. The relationship is mostly positive, apart from a period between 1990 and 2000 when the average declined before climbing back up and increasing consistently since.
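Because sample size is raised above as one possible explanation for the year-to-year swings, a quick check is to count how many performances feed each yearly average. The sketch below uses a small made-up data frame (`toy`) in place of `nba`. Only the column names `Year` and `GmSc` come from the analysis, and the values are invented for illustration.

```r
# Hypothetical stand-in for the nba data frame
toy <- data.frame(
  Year = c(1990, 1990, 1990, 1991, 1992, 1992),
  GmSc = c(20.1, 25.3, 30.2, 41.0, 22.4, 26.6)
)

# Number of performances and average Game Score per year
n_games <- tapply(toy$GmSc, toy$Year, length)
avg     <- tapply(toy$GmSc, toy$Year, mean)

yearly <- data.frame(
  Year    = as.integer(names(n_games)),
  n       = as.vector(n_games),
  AvgGmSc = as.vector(avg)
)
print(yearly)
```

Years with very few rows (1991 in this toy example) produce averages driven by a single game, which is one way spikes like those in the plot can arise.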

Hypothesis Testing

Do unexpected playoff performances have a higher average Game Score than unexpected regular season performances?

Null and Alternative

H0: μPlayoffs = μRegular

HA: μPlayoffs > μRegular

Test Design

Alpha = 0.05 (standard/moderate)

Power = 0.8 (80% chance of detecting meaningful difference)

Minimum Effect Size = 2 (less than 2 GmSc points has limited practical meaning in NBA performance terms)

#calculate sample size
pwr.t.test(
  d = -2 / sd(nba$GmSc, na.rm = TRUE),
  power = 0.80,
  sig.level = 0.05,
  type = "two.sample",
  alternative = "less"
)

     Two-sample t test power calculation 

              n = 222.3259
              d = -0.2361939
      sig.level = 0.05
          power = 0.8
    alternative = less

NOTE: n is number in *each* group
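As a sanity check on the required sample size, the same design values can be run through the textbook normal approximation for a one-sided, two-sample comparison, n = 2(z_{1-α} + z_{power})² / d². This is a simplified sketch rather than the exact noncentral t calculation that `pwr.t.test` performs, so it comes out slightly smaller than the n reported above. The variable names are chosen for this example.

```r
alpha <- 0.05
power <- 0.80
d     <- 0.2361939   # magnitude of the effect size reported above: 2 / sd(GmSc)

# Normal-approximation sample size per group (one-sided test)
n_approx <- 2 * (qnorm(1 - alpha) + qnorm(power))^2 / d^2
print(n_approx)
print(ceiling(n_approx))
```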
#run test
t.test(
  GmSc ~ Playoffs,
  data = nba,
  alternative = "less"
)

    Welch Two Sample t-test

data:  GmSc by Playoffs
t = -3.6885, df = 49.023, p-value = 0.0002825
alternative hypothesis: true difference in means between group 0 and group 1 is less than 0
95 percent confidence interval:
      -Inf -2.870964
sample estimates:
mean in group 0 mean in group 1 
       24.98882        30.25208 
nba$Playoffs <- factor(nba$Playoffs, levels = c(0,1), labels = c("Regular Season", "Playoffs"))

ggplot(nba, aes(x = Playoffs, y = GmSc, fill = Playoffs)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Playoffs vs Regular Season Game Score Distribution",
    x = "Game Type",
    y = "Game Score"
  ) +
  theme_minimal()

The mean Game Score was 24.99 in the regular season, compared to 30.25 in the playoffs. The Welch t-test indicated a statistically significant difference (t = -3.69, p ≈ 0.0003). At α = 0.05, we reject the null hypothesis and conclude that unexpected playoff performances have a higher average Game Score than unexpected regular season performances. The one-sided 95% confidence interval indicates that playoff Game Scores exceed regular season Game Scores by at least 2.87 points on average, which is above the minimum effect size of 2 set in the test design. This could suggest that players raise their level of play during playoff games because of factors such as higher stakes, stars getting more playing time, and a more competitive environment overall. However, the groups are heavily unbalanced: the Welch degrees of freedom of about 49 imply that the playoff group contains at most about 50 games, well short of the roughly 223 per group called for by the power calculation, so the playoff estimate is less precise and more sensitive to outliers.

Linear Regression

Does the number of points scored in a game significantly influence a player’s Game Score?

Model Equation

GmSc = β0 + β1(PTS) + ϵ

Intercept (β0): The predicted Game Score when a player scores zero points

Slope (β1): The average increase in Game Score for each additional point scored

reg_model <- lm(GmSc ~ PTS, data = nba)
summary(reg_model)

Call:
lm(formula = GmSc ~ PTS, data = nba)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.4427 -2.2393 -0.2399  2.0597 15.0110 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.585814   0.227045    24.6   <2e-16 ***
PTS         0.750196   0.008101    92.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.446 on 1701 degrees of freedom
Multiple R-squared:  0.8345,    Adjusted R-squared:  0.8344 
F-statistic:  8575 on 1 and 1701 DF,  p-value: < 2.2e-16
ggplot(nba, aes(x = PTS, y = GmSc)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Points vs. Game Score",
    x = "PTS",
    y = "Game Score"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

For every additional point scored by an NBA player in one game, the Game Score increases by about 0.75 on average. The intercept (5.586) represents the predicted Game Score when a player scores 0 points. The coefficient for PTS is statistically significant with a very small p-value indicating that there is a strong relationship between points scored and Game Score. The R-squared value shows that about 83.45% of the variability in Game Score can be explained by points scored alone. This makes sense because scoring is a major component of the Game Score formula. Players who score more points generally have higher Game Scores, which aligns with how Game Score is designed to measure overall performance. While there are many other stats that contribute to the metric, it still makes sense that this relationship is as strong as it is.
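To make the fitted line concrete, the sketch below plugs a few point totals into the estimated equation, using the coefficients and residual standard error copied from the regression output. The helper name `predict_gmsc` is hypothetical, and the band of plus or minus two residual standard errors is only a rough guide because it ignores uncertainty in the coefficients themselves.

```r
# Estimates copied from summary(reg_model)
b0    <- 5.585814   # intercept
b1    <- 0.750196   # slope for PTS
sigma <- 3.446      # residual standard error

# Predicted Game Score for a given number of points
predict_gmsc <- function(pts) b0 + b1 * pts

pts_grid <- c(0, 20, 30, 50)
fit      <- predict_gmsc(pts_grid)

# Rough 95% band: fitted value +/- 2 residual standard errors
pred_table <- data.frame(
  PTS   = pts_grid,
  fit   = fit,
  lower = fit - 2 * sigma,
  upper = fit + 2 * sigma
)
print(pred_table)
```

With the fitted model object available, `predict(reg_model, newdata = data.frame(PTS = 30), interval = "prediction")` would give the corresponding exact prediction interval.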

Conclusion

This analysis suggests that the most impactful NBA performances are not simply the highest-scoring ones, but those in which a player who does not typically stand out on the stat sheet far exceeds expectations. While Game Score effectively captures overall production, combining it with measures of unexpectedness like moving Z-scores provides a more complete and meaningful evaluation. For stakeholders like Adam Silver, team front offices, and media outlets such as ESPN, this approach offers a clearer way to identify standout performances, uncover undervalued players, and improve how player impact is analyzed and communicated.

Additional metrics like Scoring Efficiency and Game Score Differential could be helpful to explore as well. Scoring Efficiency measures how much overall impact a player has per point scored and helps answer how much a player contributes beyond scoring. A high value means the player contributes in multiple ways beyond scoring, while a low value indicates that the player relies heavily on scoring and contributes less in other areas. Game Score Differential measures the gap in Game Score between a player's most unexpected performance and their next-best game. A high value means the top performance was significantly better than any other game they have played, while a low value means they have had multiple performances of that caliber during their career. These metrics could be used alongside other variables when ESPN or other sports media outlets compare players or performances. They could help analysts back up their arguments when discussing top performances, and help front offices decide which players to keep or target.
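The two metrics described above could be derived from columns already in the dataset. The sketch below shows one possible definition on a small made-up data frame: Scoring Efficiency as GmSc divided by PTS, and Game Score Differential as GmSc minus GmSc2. These exact formulas, the new column names, and the player rows are assumptions for illustration, since the analysis does not define them precisely.

```r
# Hypothetical rows using the dataset's column names
toy <- data.frame(
  Player = c("Player A", "Player B"),
  PTS    = c(40, 20),
  GmSc   = c(32.0, 25.0),
  GmSc2  = c(30.5, 15.0)
)

# Assumed definitions of the two proposed metrics
toy$ScoringEfficiency <- toy$GmSc / toy$PTS     # Game Score per point scored
toy$GmScDifferential  <- toy$GmSc - toy$GmSc2   # gap to the next-best game
print(toy)
```

In this toy example, Player B scores fewer points but has the higher Scoring Efficiency and a much larger differential, which is the kind of role-player outlier these metrics are meant to surface. A game with zero points would need special handling before dividing by PTS.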

Overall, I think integrating these metrics into scouting reports, broadcast analysis and player evaluation models could lead to better and more informed decisions or arguments. Teams could better identify high-impact role players, data analysts could provide more nuanced insights and media narratives could shift toward highlighting performances that are not only impressive but actually statistically meaningful. Expanding the model to include contextual factors like minutes played, pace and usage rate would further improve accuracy and make sure that performance is evaluated within the full context of the game.
