library(tidyverse)
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load Dataset
nba <- read.csv("nba.csv")
Group 1 (Game Type)
df_game_type <- nba |>
group_by(Playoffs) |>
summarize(
games = n(),
avg_pts = mean(PTS, na.rm = TRUE),
avg_gmsc = mean(GmSc, na.rm = TRUE),
.groups = "drop"
) |>
mutate(
prob = games / sum(games),
rarity_flag = if_else(prob == min(prob), "Rare", "Common")
)
print(df_game_type)
## # A tibble: 2 × 6
## Playoffs games avg_pts avg_gmsc prob rarity_flag
## <chr> <int> <dbl> <dbl> <dbl> <chr>
## 1 false 1655 25.9 25.0 0.972 Common
## 2 true 48 32.9 30.3 0.0282 Rare
Playoff games form the smallest group: only 48 of the 1,703 rows (about 2.8%) come from the playoffs. This reflects the structure of an NBA season, since the postseason is short and only qualifying teams play in it. A row selected at random is therefore far more likely to come from a regular season game than from a playoff game.
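As a quick check, the probabilities can be recomputed from the printed counts alone. The sketch below hard-codes the two counts from the df_game_type output; the names `games`, `prob`, and `odds_regular` are my own and are not part of the analysis code.

```r
# Counts copied from the printed df_game_type table
games <- c(regular = 1655, playoff = 48)

# Probability that a randomly selected row falls in each group
prob <- games / sum(games)
round(prob, 4)

# Number of regular season rows per playoff row
odds_regular <- unname(games["regular"] / games["playoff"])
round(odds_regular, 1)
```

In other words, there are roughly 34 regular season rows for every playoff row in this dataset.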
Hypothesis: Playoff games are less common, but average performance metrics such as Game Score will be higher in them because of the greater intensity and heavier minutes played.
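The summary table shows a higher average Game Score in the playoffs, but with only 48 playoff rows it is worth checking whether the gap is larger than noise. One possible way to do that is a Welch two-sample t-test, sketched below. The helper `compare_gmsc()` is hypothetical, and the `toy` data frame contains made-up values used only so the example runs on its own; it is not the NBA data and its result says nothing about the real dataset.

```r
# Hypothetical helper: Welch two-sample t-test of Game Score by game type.
# t.test() with a formula uses var.equal = FALSE (Welch) by default.
compare_gmsc <- function(data) {
  t.test(GmSc ~ Playoffs, data = data)
}

# Made-up illustration values, NOT taken from nba.csv
toy <- data.frame(
  Playoffs = c("false", "false", "false", "false", "true", "true", "true"),
  GmSc     = c(22.1, 27.4, 24.8, 25.6, 29.9, 31.2, 30.0)
)

result <- compare_gmsc(toy)
result$estimate  # group means
result$p.value   # evidence against "no difference in means"
```

On the real data the call would be `compare_gmsc(nba)`, assuming `Playoffs` has exactly two distinct values, as the printed table suggests.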
Visualization 1
df_game_type |>
ggplot(aes(x = factor(Playoffs), y = games, fill = rarity_flag)) +
geom_col() +
labs(
title = "Distribution of Games by Game Type",
x = "Game Type",
y = "Number of Games"
)
Group 2 (Conference)
nba <- nba |>
mutate(
Conference = case_when(
Tm %in% c(
"ATL","BOS","BRK","CHA","CHH","CHI","CHO","CLE","DET","IND",
"MIA","MIL","NJN","NYK","ORL","PHI","TOR","WAS","WSB"
) ~ "East",
Tm %in% c(
"DAL","DEN","GSW","HOU","LAC","LAL","MEM","MIN","NOH","NOK",
"NOP","OKC","PHO","POR","SAC","SAS","SEA","UTA","VAN"
) ~ "West",
TRUE ~ "Other"
)
)
df_conference <- nba |>
mutate(
Conference = factor(Conference, levels = c("East", "West", "Other"))
) |>
group_by(Conference) |>
summarize(
games = n(),
avg_pts = mean(PTS, na.rm = TRUE),
.groups = "drop"
) |>
mutate(
prob = games / sum(games),
rarity_flag = if_else(prob == min(prob), "Rare", "Common")
)
print(df_conference)
## # A tibble: 2 × 5
## Conference games avg_pts prob rarity_flag
## <fct> <int> <dbl> <dbl> <chr>
## 1 East 837 25.8 0.491 Rare
## 2 West 866 26.3 0.509 Common
First, I assigned each team to either the Eastern or the Western Conference. The modern NBA has 15 teams in each conference, and any team code that is not on either list falls into the “Other” category. No rows in this dataset end up in “Other”, so the split is nearly even: 837 East (49.1%) and 866 West (50.9%). The “Rare” flag on the East therefore only marks the smaller of two almost equal groups.
Hypothesis: The number of performances from players on Eastern and Western Conference teams will be fairly similar, while the “Other” category, which would capture much older teams, will be extremely small if it contains any rows at all.
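The printed table has only two rows because dplyr drops empty factor levels when grouping by default, so an “Other” level with zero rows simply disappears. A minimal sketch of how to keep the empty level visible is below. The `toy` data frame is a made-up stand-in so the example runs on its own.

```r
library(dplyr)

# Made-up stand-in: three rows, none of them in the "Other" level
toy <- data.frame(
  Conference = factor(c("East", "West", "West"),
                      levels = c("East", "West", "Other"))
)

# Default behaviour: the empty "Other" level is dropped from the result
dropped <- toy |> count(Conference)

# .drop = FALSE keeps empty factor levels with a count of zero
with_empty <- toy |> count(Conference, .drop = FALSE)

dropped
with_empty
```

Keeping the zero-count row makes it explicit that “Other” was checked and found empty, rather than never considered.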
Visualization 2
df_conference |>
ggplot(aes(x = Conference, y = games, fill = rarity_flag)) +
geom_col() +
scale_x_discrete(drop = FALSE) +
labs(
title = "Game Counts by Conference (Including Other)",
x = "Conference",
y = "Number of Games"
)
Group 3 (Scoring)
df_scoring_bins <- nba |>
mutate(
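pts_bin = cut(
PTS,
breaks = c(0, 10, 20, 30, 40, Inf),
labels = c("0–10", "11–20", "21–30", "31–40", "40+"),
# Bins are right-closed, so "40+" holds PTS > 40; include.lowest keeps PTS == 0 in "0–10"
include.lowest = TRUE
)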
) |>
group_by(pts_bin) |>
summarize(
games = n(),
avg_gmsc = mean(GmSc, na.rm = TRUE),
.groups = "drop"
) |>
mutate(
prob = games / sum(games),
rarity_flag = if_else(prob == min(prob), "Rare", "Common")
)
print(df_scoring_bins)
## # A tibble: 5 × 5
## pts_bin games avg_gmsc prob rarity_flag
## <fct> <int> <dbl> <dbl> <chr>
## 1 0–10 50 12.7 0.0294 Rare
## 2 11–20 494 17.7 0.290 Common
## 3 21–30 664 24.4 0.390 Common
## 4 31–40 333 31.7 0.196 Common
## 5 40+ 162 41.2 0.0951 Common
Usually, 40+ point performances would be the rarest, since they are the most difficult and require the most skill. But because this dataset is a collection of unexpected NBA player performances, it contains a disproportionate number of high-scoring games. This is why the lowest points bin (0–10, with 50 games, about 2.9%) actually has the lowest probability in this dataset.
Hypothesis: Both extremely high and extremely low scoring outputs occur only when many other factors align or the player is simply having an unexplainable night, so the bins will get smaller as you move away from the middle, similar to a normal distribution.
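How `cut()` treats values that sit exactly on a break affects how these bins should be read. The sketch below uses a few hand-picked boundary values (not rows from the dataset) with the same breaks and labels. By default the intervals are right-closed and the lowest break is excluded, so a score of exactly 0 becomes `NA` and a score of exactly 40 lands in "31–40" rather than "40+".

```r
# Hand-picked boundary values, chosen only to probe the bin edges
pts    <- c(0, 10, 11, 40, 41)
breaks <- c(0, 10, 20, 30, 40, Inf)
labels <- c("0–10", "11–20", "21–30", "31–40", "40+")

# Default: intervals are (0,10], (10,20], ..., (40,Inf]; 0 itself is excluded
default_bins <- cut(pts, breaks = breaks, labels = labels)

# include.lowest = TRUE closes the first interval on the left: [0,10]
closed_bins <- cut(pts, breaks = breaks, labels = labels, include.lowest = TRUE)

data.frame(pts, default_bins, closed_bins)
```

The five bin counts in the printed table sum to 1,703, the same total as the game-type table, so no row appears to have been lost to `NA` in this dataset.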
Visualization 3
df_scoring_bins |>
ggplot(aes(x = pts_bin, y = games, fill = rarity_flag)) +
geom_col() +
labs(
title = "Distribution of Games by Scoring Range",
x = "Points Scored",
y = "Number of Games"
)
Combination Analysis (Playoffs/Conference)
df_combo <- nba |>
count(Conference, Playoffs) |>
mutate(
prob = n / sum(n),
rarity_flag = if_else(prob == min(prob), "Rare", "Common")
)
print(df_combo)
## Conference Playoffs n prob rarity_flag
## 1 East false 820 0.481503230 Common
## 2 East true 17 0.009982384 Rare
## 3 West false 835 0.490311216 Common
## 4 West true 31 0.018203171 Common
Both conferences have far more unexpected performances taking place in the regular season than in the playoffs, owing to the sheer number of regular season games in an NBA season compared to the postseason. The West has slightly more unexpected performances in total than the East (866 vs. 837), so it makes sense that it would have more of them in the playoffs as well (31 vs. 17).
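The joint probabilities above divide by all 1,703 rows, which hides how each conference splits internally. A rough sketch of the within-conference playoff share is below. It hard-codes the four counts from the printed df_combo output, and the names `combo`, `share_within_conf`, and `playoff_share` are my own.

```r
library(dplyr)

# Counts copied from the printed df_combo table
combo <- data.frame(
  Conference = c("East", "East", "West", "West"),
  Playoffs   = c("false", "true", "false", "true"),
  n          = c(820, 17, 835, 31)
)

# Share of each conference's rows that come from each game type
playoff_share <- combo |>
  group_by(Conference) |>
  mutate(share_within_conf = n / sum(n)) |>
  ungroup() |>
  filter(Playoffs == "true")

playoff_share
```

This gives a playoff share of about 2.0% for the East and about 3.6% for the West, so the West's playoff lead is proportionally larger than its small lead in total rows.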
Visualization 4 (Combo)
df_combo |>
ggplot(aes(x = Conference, y = n, fill = factor(Playoffs))) +
geom_col(position = "dodge") +
labs(
title = "Performances by Conference and Game Type",
x = "Conference",
y = "Number of Games",
fill = "Playoffs"
)
Conclusion
Small groups in this dataset correspond to events that have a low probability rather than any data errors. Playoff games and extremely high or low scoring performances are rare by design, but they may carry immense analytical importance.
Future Questions
Do rare groups exhibit higher variance in performance metrics? Are anomalies driven more by player skill or by game context? How much does the rarity of these groups change over time? Is the Western Conference historically better than the East?