Load Dataset

library(tidyverse)  # provides ggplot2, dplyr::count(), and tidyr::complete() used below
nba <- read.csv("nba.csv")

Unclear Columns (Before Documentation)

The first column, labeled ‘bbrID’, could be confusing before knowing that most of the data is pulled from basketball-reference.com. Without reading the documentation, one might not realize that these encoded values identify real NBA players.

The ‘GmSc’ column could also be misleading, since GM usually stands for General Manager in NBA terms. Even knowing that it abbreviates Game Score, one might assume that the values are just points rather than recognizing how complex the statistic is.

Still Unclear (After Documentation)

One thing that remains uncertain even after going through the documentation is exactly how the moving average window is applied at career boundaries. How are rookies handled? What about players with shorter careers? Are the playoffs included?
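To make these questions concrete, here is a minimal sketch of one way a moving z-score could be computed. The definition is an assumption of mine, not the documented method: each game is compared with the mean and standard deviation of the previous `window` games. The function name `trailing_z` and the default window of 10 are hypothetical.

```r
# Assumed definition, NOT taken from the dataset documentation:
# z-score of each game against the mean and sd of the previous `window` games.
trailing_z <- function(x, window = 10) {
  z <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i > window) {
      past <- x[(i - window):(i - 1)]   # the `window` games before game i
      s <- sd(past)
      # Leave NA when the window has no spread (z-score undefined)
      if (!is.na(s) && s > 0) z[i] <- (x[i] - mean(past)) / s
    }
  }
  z
}
```

Under this assumed definition, the first `window` games of a career get no z-score at all, which is exactly where rookies and short careers would fall. Whether playoff games enter the window is a separate choice that the sketch does not address.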

Visualization 1 (GmSc vs Unexpectedness Scatterplot)

nba |>
  ggplot(aes(x = GmSc, y = GmScMovingZ)) +
  geom_point(size = 2, alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Game Score vs Statistical Unexpectedness",
    x = "Game Score",
    y = "Game Score Moving Z"
  )

This scatterplot explores the relationship between a player’s Game Score on a given night and how statistically unexpected that performance was. Most of the very high Game Scores do not have extreme z-scores, which suggests they usually come from NBA superstars for whom such games are closer to normal. Some moderate Game Scores have extremely high z-scores, which likely represent ordinary players who greatly exceeded their usual level. This indicates that unexpectedness is not equivalent to raw performance: the true statistical outliers would be difficult to identify if only GmSc were taken into account.
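One way to check this reading of the plot is to pull out the rows where a modest Game Score coincides with a very large z-score. The following is a minimal sketch in base R. The helper name `surprising_modest_games` and the cutoffs (below-median GmSc, top 5% of GmScMovingZ) are illustrative choices of mine, not values from the documentation.

```r
# Hypothetical helper: rows with a below-median GmSc but a top-5% GmScMovingZ.
surprising_modest_games <- function(df, z_prob = 0.95) {
  gmsc_cut <- median(df$GmSc, na.rm = TRUE)
  z_cut <- quantile(df$GmScMovingZ, probs = z_prob, na.rm = TRUE, names = FALSE)
  keep <- !is.na(df$GmSc) & !is.na(df$GmScMovingZ) &
    df$GmSc < gmsc_cut & df$GmScMovingZ >= z_cut
  out <- df[keep, , drop = FALSE]
  # Most unexpected performances first
  out[order(-out$GmScMovingZ), , drop = FALSE]
}

# Usage (not run here): surprising_modest_games(nba)
```

If the interpretation above holds, the rows returned should mostly belong to players who are not usually high scorers.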

Visualization 2 (Playoffs vs Regular Season Unexpectedness)

nba |>
  ggplot(aes(x = Playoffs, y = GmScMovingZ)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Unexpectedness in Playoffs vs Regular Season",
    x = "Playoffs",
    y = "Game Score Moving Z"
  )

This boxplot compares the distributions of GmScMovingZ for regular season and playoff performances. The regular season shows a wider range and more outliers, which is expected given how many more regular season games are played. The playoff distribution looks fairly similar, just on a smaller scale. That similarity suggests that unexpected performances are about as likely in higher-stakes games, and possibly more so, although the plot alone does not establish this. One risk is that if playoff games have different variance patterns, the moving window approach could bias the z-scores.
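Because the two groups differ so much in size, it helps to put numbers next to the boxplot. Here is a small sketch that reports the count, median, and interquartile range of GmScMovingZ for each Playoffs value. The helper name `summarise_by_playoffs` is hypothetical, and it is written in base R only so that it runs without extra packages.

```r
# Hypothetical helper: per-group count, median, and IQR of GmScMovingZ.
summarise_by_playoffs <- function(df) {
  groups <- split(df$GmScMovingZ, df$Playoffs)
  data.frame(
    Playoffs = names(groups),
    n        = vapply(groups, function(z) sum(!is.na(z)), integer(1)),
    median_z = vapply(groups, median, numeric(1), na.rm = TRUE),
    iqr_z    = vapply(groups, IQR, numeric(1), na.rm = TRUE),
    row.names = NULL
  )
}

# Usage (not run here): summarise_by_playoffs(nba)
```

Comparing medians and IQRs is less sensitive to group size than comparing the full range or the number of outlier points.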

Categorical Column Checks

Date: While there are no explicitly missing rows in this dataset, there are plenty of implicitly missing rows, since an unexpected performance in the NBA does not occur every day.

Season/Playoffs: There are also no explicitly missing rows in either of these columns. There could be implicitly missing rows for players who have had multiple unexpected performances, since this dataset only includes their highest one. There might also be empty groups for seasons in which no unexpected performances happened during the playoffs. I will check for those below:

nba |> 
  count(Season, Playoffs) |> 
  complete(Season, Playoffs, fill = list(n = 0))
## # A tibble: 76 × 3
##    Season  Playoffs     n
##    <chr>   <chr>    <int>
##  1 1984-85 false        5
##  2 1984-85 true         0
##  3 1985-86 false       15
##  4 1985-86 true         0
##  5 1986-87 false       14
##  6 1986-87 true         1
##  7 1987-88 false       22
##  8 1987-88 true         0
##  9 1988-89 false       21
## 10 1988-89 true         0
## # ℹ 66 more rows

Since some seasons have no playoff performances recorded, the comparison of “unexpectedness in playoffs vs regular season” could be misleading: the playoff group draws on fewer seasons than the regular season group.
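To see which seasons are affected, the empty groups can be listed by name. This is a minimal sketch in base R. The helper name `seasons_without_playoffs` is hypothetical, and it assumes that Playoffs is stored as "true"/"false" text or as a logical column.

```r
# Hypothetical helper: seasons that have no row with Playoffs equal to true.
seasons_without_playoffs <- function(df) {
  # Works whether Playoffs was read as character ("true") or logical (TRUE)
  is_playoff <- tolower(as.character(df$Playoffs)) == "true"
  playoff_rows <- tapply(is_playoff, df$Season, sum, na.rm = TRUE)
  sort(names(playoff_rows)[as.vector(playoff_rows) == 0])
}

# Usage (not run here): seasons_without_playoffs(nba)
```

One option is to rerun the boxplot on only the seasons that appear in both groups and see whether the comparison changes.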

Continuous Column Outliers (GmScMovingZ)

hist(nba$GmScMovingZ)

quantile(nba$GmScMovingZ, probs = c(0.01, 0.99), na.rm = TRUE)
##     1%    99% 
## 2.4502 5.5398

I would define performances with a GmScMovingZ below 2.5 or above 5.5 as outliers. The histogram shows a significant drop-off in performances outside that range, and these values roughly match the 1st and 99th percentiles of z-scores in the dataset (2.45 and 5.54).

It is important to note that the range of moving z-scores in this dataset is very different from what a typical normal distribution would produce. Every performance here is technically an outlier already, since each one is considered unexpected compared with the rest of that player’s games. This would probably confuse someone analyzing the data without reading the documentation first. The dataset is not just about the best performances; it is about statistical surprise. If you skipped the documentation and went straight into the dataset, your analysis would likely rest on the wrong interpretation. One question I have is how exactly this will affect my work going forward. What is the best way to approach a dataset full of outliers like this?
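The percentile rule can be applied directly rather than with rounded cutoffs. Here is a minimal sketch, where the helper name `flag_z_outliers` is hypothetical and the 1% and 99% probabilities mirror the `quantile()` call above.

```r
# Hypothetical helper: TRUE for values outside the 1st-99th percentile range.
flag_z_outliers <- function(z, probs = c(0.01, 0.99)) {
  cutoffs <- quantile(z, probs = probs, na.rm = TRUE, names = FALSE)
  # Missing values are never flagged
  !is.na(z) & (z < cutoffs[1] | z > cutoffs[2])
}

# Usage (not run here):
# nba$z_outlier <- flag_z_outliers(nba$GmScMovingZ)
# table(nba$z_outlier)
```

By construction this flags roughly 2% of rows, so it marks the extremes of this particular dataset rather than testing whether any given game is unusual in an absolute sense.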