library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

NBA Dataset

Loading in the data:
NBA_Data <- read_csv("NBA Dataset for Submission.csv")
## Rows: 46977 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): season_type, game_id, team_abbreviation_home, team_name_home, tea...
## dbl  (65): season_id, season, team_id_home, team_id_away, fgm_home, fga_home...
## date  (1): game_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Filtering out unnecessary games:
NBA_Data <- NBA_Data |>
  filter(season_type != 'All-Star',
         season_type != 'All Star',
         season_type != 'Pre Season')

Data frame 1: Average Attendance

group1 <- NBA_Data |>
  group_by(team_name_home) |>
  summarise(
    avg_attendance = mean(attendance, na.rm = TRUE),
    games = n()
  ) |>
  arrange(desc(games))

Findings

This data frame summarizes the number of home games and the average attendance for each team. Because the rows are sorted by descending game count, the teams with the fewest recorded home games appear at the bottom.

Probability Interpretation

If a row is selected at random from the data set, the probability of selecting a game from team X is:

\[ P(\mathrm{team\ X}) = \frac{\mathrm{games\ for\ team\ X}}{\mathrm{total\ games}} \]

Teams in the smallest group have the lowest probability of being selected. These teams are tagged as “rare_home_team”.
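
The selection probability and the tag can be sketched in a few lines. This is only an illustration on an invented `games_by_team` tibble, not the real data. The cutoff `n_rare` (how many of the smallest groups count as rare) and the label `common_home_team` are assumptions made for the example.

```r
library(dplyr)

# Toy stand-in for group1 (hypothetical counts, not taken from the data set)
games_by_team <- tibble(
  team_name_home = c("Team A", "Team B", "Team C", "Team D"),
  games = c(50, 30, 15, 5)
)

n_rare <- 2  # assumption: the two smallest groups are treated as rare

tagged <- games_by_team |>
  mutate(
    # P(team X) = games for team X / total games
    prob = games / sum(games),
    # min_rank() gives rank 1 to the smallest game count
    tag = if_else(min_rank(games) <= n_rare,
                  "rare_home_team", "common_home_team")
  )

tagged
```

The same `mutate()` call could be applied to `group1` directly, since it already has a `games` column.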

Interpretation in Context

A team appearing infrequently may indicate:

  • Missing seasons, games, or data in general

  • A team relocation or change in name

Hypothesis

Some teams are counted disproportionately because of data aggregation issues: when a franchise changes its name or relocates, its games are split across more than one team name.
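
One way to probe this hypothesis is to group on a franchise-level key instead of the team name. The sketch below assumes that `team_id_home` stays constant when a franchise renames or relocates, which has not been verified for this data set. The rows are invented for illustration.

```r
library(dplyr)

# Invented rows: one franchise under two names, plus one stable franchise
toy <- tibble(
  team_id_home   = c(1, 1, 1, 2, 2),
  team_name_home = c("Old City Name", "New City Name", "New City Name",
                     "Stable Team", "Stable Team"),
  attendance     = c(8000, 15000, 17000, 18000, NA)
)

by_franchise <- toy |>
  group_by(team_id_home) |>
  summarise(
    names_used     = n_distinct(team_name_home),  # > 1 flags a rename or move
    avg_attendance = mean(attendance, na.rm = TRUE),
    games          = n()
  )

by_franchise
```

If the assumption holds, franchises with `names_used` greater than 1 are the ones whose counts are split in the name-based summary.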

Visualization

ggplot(group1, aes(x = reorder(team_name_home, avg_attendance),
                   y = avg_attendance)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Average Home Attendance by Team",
       x = "Team",
       y = "Average Attendance")

As the visualization shows, a team like the Clippers, who moved from San Diego to Los Angeles, is “penalized” because their attendance figures are split among three different “teams” in the data. The visualization also suggests why the Clippers chose to move: attendance was poor in San Diego and improved after the move north.

Data frame 2: Home Wins and Points Scored

group2 <- NBA_Data |>
  group_by(wl_home) |>
  summarise(
    avg_points = mean(pts_home, na.rm = TRUE),
    games = n()
  ) |>
  arrange(desc(games))

Findings

This data frame shows that home teams averaged more points in wins than in losses.

Probability Interpretation

Since home losses are less common than home wins in the data set, the probability of selecting a home loss at random is:

\[ P(\mathrm{home\ loss}) = \frac{\mathrm{number\ of\ home\ losses}}{\mathrm{total\ games}} \]

This implies that a randomly selected game is less likely to be a home loss than a home win.
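
The proportion in the formula can be computed directly from the outcome column. This is a minimal sketch on invented outcomes. Dropping rows with a missing `wl_home` before dividing is an assumption of the example, not something the report does.

```r
library(dplyr)

# Invented outcomes; "W"/"L" mirror the wl_home coding
toy <- tibble(wl_home = c("W", "W", "L", "W", "L", NA))

outcome_probs <- toy |>
  filter(!is.na(wl_home)) |>          # assumption: ignore games with no result
  count(wl_home, name = "games") |>
  mutate(prob = games / sum(games))   # share of all remaining games

outcome_probs
```

In this toy example the "L" row has the smaller `prob`, which is the pattern the smaller "L" group in `group2` reflects.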

Interpretation in Context

A smaller “L” group suggests:

  • Home teams win more often than away teams.

Hypothesis

Home teams score more points and win more often because elements of home‑court advantage, such as improved shooting efficiency and reduced turnovers, can affect play.

Visualization

ggplot(NBA_Data, aes(x = wl_home, y = pts_home, fill = wl_home)) +
  geom_boxplot() +
  labs(title = "Home Points Scored in Wins vs Losses",
       x = "Home Win/Loss",
       y = "Points Scored")

Data frame 3: Attendance bins and Fastbreak points

group3 <- NBA_Data |>
  filter(!is.na(attendance),
         pts_fb_home != 0,
         pts_fb_away != 0) |>
  mutate(attendance_bin = cut(attendance, breaks = 4),
         total_transition = pts_fb_home + pts_fb_away) |>
  group_by(attendance_bin) |>
  summarise(
    avg_fastbreak = mean(total_transition, na.rm = TRUE),
    games = n()
  ) |>
  arrange(desc(games))

Findings

Games are kept only if attendance is recorded and both the home and away teams registered nonzero fast break points. Attendance is then divided into four equal-width bins; the bin with the fewest games represents the least common attendance range.
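
To make the binning concrete, the sketch below shows how `cut()` with `breaks = 4` behaves on a handful of invented attendance figures. The quartile-based alternative is not used in this report and is included only for contrast.

```r
# Invented attendance figures
attendance <- c(2000, 9000, 15000, 16000, 17000, 18000, 19000, 20000)

# breaks = 4 splits the observed range into four intervals of equal width,
# so the number of games per bin can differ a lot
equal_width <- cut(attendance, breaks = 4)
table(equal_width)

# Alternative (not used in the report): quartile bins with roughly equal counts
equal_count <- cut(attendance,
                   breaks = quantile(attendance, probs = seq(0, 1, 0.25)),
                   include.lowest = TRUE)
table(equal_count)
```

With equal-width bins, a few very low attendance values stretch the range and leave most games in the top bin, which is why one bin can end up much rarer than the others.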

Probability Interpretation

\[ P(\mathrm{attendance\ bin}) = \frac{\mathrm{games\ in\ bin}}{\mathrm{total\ games}} \]

The smallest bin has the lowest probability of being selected.

Interpretation in Context

A rare attendance bin may indicate:

  • Certain arenas draw larger crowds

  • Some games attract fewer fans

  • COVID-era games were capacity restricted, which can affect attendance

Hypothesis

Fast break scoring may correlate with crowd size, since high-energy environments could influence the pace of play and the nerves of players.
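
A game-level correlation is one possible way to check this hypothesis without relying on bins. The sketch below uses invented games with the same column names as the data set. The choice of Spearman's rank correlation is an assumption of the example, picked because it does not require a linear relationship.

```r
library(dplyr)

# Invented games; columns mirror those used to build group3
toy <- tibble(
  attendance  = c(12000, 15000, 17000, 18000, 19500, 20000),
  pts_fb_home = c(10, 12, 9, 14, 16, 15),
  pts_fb_away = c(8, 11, 10, 12, 13, 14)
)

fb_cor <- toy |>
  mutate(total_transition = pts_fb_home + pts_fb_away) |>
  summarise(
    # Rank-based correlation between crowd size and combined fast break points
    rho = cor(attendance, total_transition, method = "spearman")
  )

fb_cor
```

A value of `rho` near zero on the real data would count against the hypothesis, while a clearly positive value would be consistent with it.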

Visualization

ggplot(group3, aes(x = attendance_bin, y = avg_fastbreak)) +
  geom_col(fill = "darkorange") +
  labs(title = "Average Fastbreak Points by Attendance Bin",
       x = "Attendance Range",
       y = "Avg Fastbreak Points")

The best-attended games show the highest average fast break points, but the second-highest average is in the second-lowest attendance bin, so I am not sure the hypothesis holds water just yet.

Two Categorical Variable Combos

All possible combinations of Teams and Wins/Losses at Home

all_combos <- expand.grid(
  team_abbreviation_home = unique(NBA_Data$team_abbreviation_home),
  wl_home = unique(NBA_Data$wl_home)
)

Observed Combos

observed <- NBA_Data |>
  count(team_abbreviation_home, wl_home)

Missing Combos

missing <- anti_join(all_combos, observed)
## Joining with `by = join_by(team_abbreviation_home, wl_home)`
missing
##   team_abbreviation_home wl_home
## 1                    KCK       W

It looks as though a Kansas City Kings home win is the only combination missing from this data frame. The team did in fact win games, but it also moved to Sacramento, so it is possible that the relocation affected the data.
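
An alternative way to surface missing combinations is to keep them in the count table with an explicit zero. The sketch below uses `tidyr::complete()` on invented rows. The team codes `AAA` and `BBB` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Invented games: team BBB has no recorded home win
toy <- tibble(
  team_abbreviation_home = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  wl_home                = c("W", "L", "W", "L", "L")
)

combo_counts <- toy |>
  count(team_abbreviation_home, wl_home) |>
  # Add every team x outcome pair that never occurs, with a count of 0
  complete(team_abbreviation_home, wl_home, fill = list(n = 0L))

# Combinations that never occur in the data
filter(combo_counts, n == 0)
```

Compared with the `expand.grid()` and `anti_join()` approach, this keeps observed and unobserved combinations in one table, which can make later plotting simpler.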

Most and Least Common Combos

observed |>
  arrange(n)
## # A tibble: 83 × 3
##    team_abbreviation_home wl_home     n
##    <chr>                  <chr>   <int>
##  1 SDC                    W           1
##  2 KCK                    L           4
##  3 SDC                    L           5
##  4 NOK                    L          34
##  5 NOK                    W          48
##  6 VAN                    W          66
##  7 UTH                    L         107
##  8 NOH                    L         140
##  9 SAN                    L         153
## 10 VAN                    L         164
## # ℹ 73 more rows

The San Diego Clippers, like the Kansas City Kings, were not a particularly talented bunch, but they were also affected by the relocation issue. The Lakers, by contrast, have not moved from Los Angeles in decades and have had years of sustained success there.
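
Since `arrange(n)` only prints the least common combinations, the most common ones can be pulled out separately. The sketch below uses `slice_max()` and `slice_min()` on an invented count table shaped like `observed`. The team codes and counts are hypothetical.

```r
library(dplyr)

# Invented counts in the same shape as `observed`
observed_toy <- tibble(
  team_abbreviation_home = c("AAA", "AAA", "BBB", "BBB", "CCC"),
  wl_home = c("W", "L", "W", "L", "L"),
  n = c(900L, 500L, 700L, 650L, 4L)
)

# order_by = n refers to the count column; n = 2 is the number of rows to keep
most_common  <- observed_toy |> slice_max(order_by = n, n = 2)
least_common <- observed_toy |> slice_min(order_by = n, n = 2)

most_common
least_common
```

Applied to `observed`, the same two calls would show both ends of the distribution side by side.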

Visualization

NBA_Data |>
  count(team_abbreviation_home, wl_home) |>
  ggplot(aes(x = n, y = reorder(team_abbreviation_home, n), color = wl_home)) +
  geom_point(size = 3) +
  labs(title = "Dot Plot of Team × Win/Loss Frequencies",
       x = "Count",
       y = "Team")