library(tidyverse)
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load Dataset

nba <- read.csv("nba.csv")
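
A quick structural check can confirm the load worked as expected. This is a sketch; it assumes the file contains the columns used later in this report (PTS, GmSc, Tm, Playoffs).

# Inspect row count and column types. read.csv() returns a base data frame,
# which is why later count() output prints in data.frame style rather than as a tibble.
glimpse(nba)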

Group 1 (Game Type)

df_game_type <- nba |>
  group_by(Playoffs) |>
  summarize(
    games = n(),
    avg_pts = mean(PTS, na.rm = TRUE),
    avg_gmsc = mean(GmSc, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(
    # prob = each group's share of all rows; rarity_flag marks the least likely group
    prob = games / sum(games),
    rarity_flag = if_else(prob == min(prob), "Rare", "Common")
  )
print(df_game_type)
## # A tibble: 2 × 6
##   Playoffs games avg_pts avg_gmsc   prob rarity_flag
##   <chr>    <int>   <dbl>    <dbl>  <dbl> <chr>      
## 1 false     1655    25.9     25.0 0.972  Common     
## 2 true        48    32.9     30.3 0.0282 Rare

Playoff games form the smallest group, so the probability that a randomly selected row comes from a playoff game is low (about 2.8%). This reflects the structure of an NBA season: playoff games are limited in number, and only qualifying teams play in them. A row selected at random is therefore far more likely to be a regular-season game than a playoff game.

Hypothesis: Because playoff games are less common and more intense, with starters logging heavier minutes, average performance metrics such as Game Score will be higher in the playoff group.
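
As a rough check of this hypothesis, the full Game Score distributions can be compared rather than just the group means. This is a sketch that reuses the GmSc and Playoffs columns summarized above.

nba |>
  ggplot(aes(x = factor(Playoffs), y = GmSc)) +
  geom_boxplot() +
  labs(
    title = "Game Score by Game Type",
    x = "Playoffs",
    y = "Game Score"
  )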

Visualization 1

df_game_type |>
  ggplot(aes(x = factor(Playoffs), y = games, fill = rarity_flag)) +
  geom_col() +
  labs(
    title = "Distribution of Games by Game Type",
    x = "Game Type",
    y = "Number of Games"
  )

Group 2 (Conference)

nba <- nba |>
  mutate(
    Conference = case_when(
      Tm %in% c(
        "ATL","BOS","BRK","CHA","CHH","CHI","CHO","CLE","DET","IND",
        "MIA","MIL","NJN","NYK","ORL","PHI","TOR","WAS","WSB"
      ) ~ "East",

      Tm %in% c(
        "DAL","DEN","GSW","HOU","LAC","LAL","MEM","MIN","NOH","NOK",
        "NOP","OKC","PHO","POR","SAC","SAS","SEA","UTA","VAN"
      ) ~ "West",

      TRUE ~ "Other"
    )
  )
df_conference <- nba |>
  mutate(
    Conference = factor(Conference, levels = c("East", "West", "Other"))
  ) |>
  # group_by() drops empty factor levels by default, so "Other" only appears if it has rows
  group_by(Conference) |>
  summarize(
    games = n(),
    avg_pts = mean(PTS, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(
    prob = games / sum(games),
    rarity_flag = if_else(prob == min(prob), "Rare", "Common")
  )
print(df_conference)
## # A tibble: 2 × 5
##   Conference games avg_pts  prob rarity_flag
##   <fct>      <int>   <dbl> <dbl> <chr>      
## 1 East         837    25.8 0.491 Rare       
## 2 West         866    26.3 0.509 Common

First, I had to assign each team to a conference, either East or West. Fifteen teams belong to each conference in the modern NBA, with only a few historical franchises falling into the “Other” category. A randomly selected game is therefore very unlikely to involve a team outside the standard East/West classification.

Hypothesis: The number of performances from players on Eastern and Western Conference teams will be fairly similar, while the “Other” category, which corresponds to much older franchises, will be extremely small or empty.
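
To check whether any rows actually fall into “Other”, here is a quick sketch using only the Conference column created above:

# Any team abbreviations not covered by the East/West mapping
nba |>
  filter(Conference == "Other") |>
  distinct(Tm)

# Row counts per conference label
nba |>
  count(Conference)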

Visualization 2

df_conference |>
  ggplot(aes(x = Conference, y = games, fill = rarity_flag)) +
  geom_col() +
  scale_x_discrete(drop = FALSE) +
  labs(
    title = "Game Counts by Conference (Including Other)",
    x = "Conference",
    y = "Number of Games"
  )

Group 3 (Scoring)

df_scoring_bins <- nba |>
  mutate(
    # cut() uses right-closed intervals by default, so a 20-point game lands in
    # "11–20"; a 0-point game would become NA (none appear in this dataset).
    pts_bin = cut(
      PTS,
      breaks = c(0, 10, 20, 30, 40, Inf),
      labels = c("0–10", "11–20", "21–30", "31–40", "40+")
    )
  ) |>
  group_by(pts_bin) |>
  summarize(
    games = n(),
    avg_gmsc = mean(GmSc, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(
    prob = games / sum(games),
    rarity_flag = if_else(prob == min(prob), "Rare", "Common")
  )
print(df_scoring_bins)
## # A tibble: 5 × 5
##   pts_bin games avg_gmsc   prob rarity_flag
##   <fct>   <int>    <dbl>  <dbl> <chr>      
## 1 0–10       50     12.7 0.0294 Rare       
## 2 11–20     494     17.7 0.290  Common     
## 3 21–30     664     24.4 0.390  Common     
## 4 31–40     333     31.7 0.196  Common     
## 5 40+       162     41.2 0.0951 Common

Usually, 40+ point performances would be the rarest, since they are the most difficult and require the most skill. But because this is a collection of unexpected NBA player performances, high-scoring outputs are disproportionately represented. That is why a player falling into the lowest points bin actually has the lowest probability in this dataset.

Hypothesis: Both extremely high and extremely low scoring outputs occur only when many other factors align, or when the player is simply having an unexplainable night, so the bins will shrink as you move away from the middle, roughly like a normal distribution.
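
One way to eyeball whether the shape is really bell-like is to plot the raw points column rather than the bins. This is a sketch assuming PTS is the per-game points total used above.

nba |>
  ggplot(aes(x = PTS)) +
  geom_histogram(binwidth = 5) +
  labs(
    title = "Distribution of Points Scored per Game",
    x = "Points",
    y = "Number of Games"
  )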

Visualization 3

df_scoring_bins |>
  ggplot(aes(x = pts_bin, y = games, fill = rarity_flag)) +
  geom_col() +
  labs(
    title = "Distribution of Games by Scoring Range",
    x = "Points Scored",
    y = "Number of Games"
  )

Combination Analysis (Playoffs/Conference)

df_combo <- nba |>
  count(Conference, Playoffs) |>
  mutate(
    prob = n / sum(n),
    rarity_flag = if_else(prob == min(prob), "Rare", "Common")
  )
print(df_combo)
##   Conference Playoffs   n        prob rarity_flag
## 1       East    false 820 0.481503230      Common
## 2       East     true  17 0.009982384        Rare
## 3       West    false 835 0.490311216      Common
## 4       West     true  31 0.018203171      Common

Both conferences have many more unexpected performances taking place in the regular season than in the playoffs, due to the sheer number of regular-season games in an NBA season compared to the postseason. The West has slightly more unexpected performances in total than the East, so it makes sense that it would also have more of them occurring in the playoffs.
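
To separate "more games overall" from "more playoff games relative to games played", the playoff share can be computed within each conference. This is a sketch built from df_combo as printed above, where Playoffs is stored as the strings "true" and "false".

df_combo |>
  group_by(Conference) |>
  mutate(prob_given_conf = n / sum(n)) |>   # P(Playoffs | Conference)
  ungroup() |>
  filter(Playoffs == "true")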

Visualization 4 (Combo)

df_combo |>
  ggplot(aes(x = Conference, y = n, fill = factor(Playoffs))) +
  geom_col(position = "dodge") +
  labs(
    title = "Performances by Conference and Game Type",
    x = "Conference",
    y = "Number of Games",
    fill = "Playoffs"
  )

Conclusion

Small groups in this dataset correspond to events that have a low probability rather than any data errors. Playoff games and extremely high or low scoring performances are rare by design, but they may carry immense analytical importance.

Future Questions

Do rare groups exhibit higher variance in performance metrics? Are anomalies driven more by player skill or by game context? How much does the rarity of these groups change over time? Is the Western Conference historically better than the East?
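
A starting point for the first question would be to compare spread rather than averages. This is a sketch reusing the Playoffs grouping and GmSc column from above.

nba |>
  group_by(Playoffs) |>
  summarize(
    games = n(),
    sd_gmsc = sd(GmSc, na.rm = TRUE),
    .groups = "drop"
  )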