library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
NBA_Data <- read_csv("NBA Dataset for Submission.csv")
## Rows: 46977 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): season_type, game_id, team_abbreviation_home, team_name_home, tea...
## dbl (65): season_id, season, team_id_home, team_id_away, fgm_home, fga_home...
## date (1): game_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NBA_Data <- NBA_Data |>
  filter(season_type != 'All-Star',
         season_type != 'All Star',
         season_type != 'Pre Season')
group1 <- NBA_Data |>
  group_by(team_name_home) |>
  summarise(
    avg_attendance = mean(attendance, na.rm = TRUE),
    games = n()
  ) |>
  arrange(desc(games))
This data frame summarizes the number of home games and the average attendance for each team. Because it is sorted by game count in descending order, the teams at the bottom are those with the fewest home games recorded in the data set.
If a row is selected at random from the data set, the probability of selecting a game from team X is:
\[ P(\mathrm{team\ X}) = \frac{\mathrm{games\ for\ team\ X}}{\mathrm{total\ games}} \]
Teams in the smallest group have the lowest probability of being selected. These teams are tagged as “rare_home_team”.
A team appearing infrequently may indicate:
Missing seasons, games, or data in general
A team relocation or change in name
Teams are disproportionately counted because of data aggregation issues like changes in names or team relocation.
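The selection probability and the “rare_home_team” tag described above can be sketched as follows. This is a minimal illustration, not the report's actual tagging step: `toy_group1`, `p_select`, and `tag` are hypothetical names, the team names and counts are made up, and "rare" is assumed to mean the team or teams with the minimum game count.

```r
library(dplyr)

# Toy stand-in for `group1`; team names and counts are made up for illustration.
toy_group1 <- tibble(
  team_name_home = c("Team A", "Team B", "Team C"),
  games          = c(50, 45, 5)
)

tagged <- toy_group1 |>
  mutate(
    # P(team X) = games for team X / total games
    p_select = games / sum(games),
    # Assumption: "rare" means the team(s) with the fewest home games
    tag = if_else(games == min(games), "rare_home_team", NA_character_)
  )

tagged
```

Applied to the real `group1`, the same `mutate()` call would add both columns without changing the existing summary.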
ggplot(group1, aes(x = reorder(team_name_home, avg_attendance),
                   y = avg_attendance)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Average Home Attendance by Team",
       x = "Team",
       y = "Average Attendance")
As the visualization shows, a team like the Clippers, which moved from San Diego to Los Angeles, is “penalized” because its attendance figures are split among three different “teams” in the data. The chart also suggests one reason for the move: attendance was poor in San Diego and improved after the move north.
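One possible way to undo the split is to fold the separate team names into a single franchise label before summarising. The sketch below assumes a hypothetical `franchise` column and uses made-up attendance and game counts; only the three Clippers names are matched, and any other relocated team would need its own rule.

```r
library(dplyr)

# Toy per-name summary shaped like `group1`; all numbers are made up.
toy_group1 <- tibble(
  team_name_home = c("San Diego Clippers", "Los Angeles Clippers",
                     "LA Clippers", "Boston Celtics"),
  avg_attendance = c(5000, 15000, 17000, 16000),
  games          = c(10, 60, 30, 100)
)

by_franchise <- toy_group1 |>
  mutate(
    # Hypothetical grouping column: fold the three Clippers names into one label
    franchise = if_else(grepl("Clippers", team_name_home),
                        "Clippers", team_name_home)
  ) |>
  group_by(franchise) |>
  summarise(
    # Weight each name's average by its game count so larger samples count more
    # (approximate if some games have missing attendance)
    avg_attendance = weighted.mean(avg_attendance, w = games),
    games = sum(games)
  )

by_franchise
```

Weighting by `games` is only an approximation when some attendance values are missing; recomputing the mean from the raw rows after adding the label avoids that issue.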
group2 <- NBA_Data |>
  group_by(wl_home) |>
  summarise(
    avg_points = mean(pts_home, na.rm = TRUE),
    games = n()
  ) |>
  arrange(desc(games))
This data frame shows that home teams averaged more points in wins than in losses.
There are fewer home losses than home wins in the data set. The probability of selecting a home loss at random is:
\[ P(\mathrm{home\ loss}) = \frac{\mathrm{number\ of\ home\ losses}}{\mathrm{total\ games}} \]
Because losses make up the smaller group, a home loss is less likely than a home win to be selected at random.
A smaller “L” group suggests:
Home teams score more points and win more often because elements of home‑court advantage, such as improved shooting efficiency and reduced turnovers, can affect play.
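As a quick check on the probability above, the share of each outcome can be computed directly from the outcome column. This is a sketch on made-up data: `toy_games` stands in for `NBA_Data`, and the 6-to-4 split is invented purely to show the calculation.

```r
library(dplyr)

# Toy outcomes standing in for NBA_Data$wl_home; the 6:4 split is made up.
toy_games <- tibble(
  wl_home = c("W", "W", "L", "W", "L", "W", "W", "L", "W", "L")
)

outcome_probs <- toy_games |>
  count(wl_home, name = "games") |>
  # P(outcome) = games with that outcome / total games
  mutate(p_select = games / sum(games))

outcome_probs
```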
ggplot(NBA_Data, aes(x = wl_home, y = pts_home, fill = wl_home)) +
  geom_boxplot() +
  labs(title = "Home Points Scored in Wins vs Losses",
       x = "Home Win/Loss",
       y = "Points Scored")
group3 <- NBA_Data |>
  filter(!is.na(attendance),
         pts_fb_home != 0,
         pts_fb_away != 0) |>
  mutate(attendance_bin = cut(attendance, breaks = 4),
         total_transition = pts_fb_home + pts_fb_away) |>
  group_by(attendance_bin) |>
  summarise(
    avg_fastbreak = mean(total_transition, na.rm = TRUE),
    games = n()
  ) |>
  arrange(desc(games))
Games are kept only if attendance is recorded and both the home and away teams registered nonzero fast break points. Attendance is then divided into four equal-width bins; the bin containing the fewest games represents the least common attendance range.
\[ P(\mathrm{attendance\ bin}) = \frac{\mathrm{games\ in\ bin}}{\mathrm{total\ games}} \]
The smallest bin has the lowest probability of being selected.
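To make the binning concrete, here is a small base R sketch of how `cut(..., breaks = 4)` behaves. The attendance values are made up and chosen only to show that equal-width bins can hold very different numbers of games, including none at all.

```r
# Made-up attendance values, chosen only to show how the bins fall
attendance <- c(2000, 4000, 15000, 16000, 17000, 18000, 18500, 19000, 20000)

# breaks = 4 splits the observed range into four intervals of equal width
# (cut() pads the outer limits slightly so the extremes are included)
attendance_bin <- cut(attendance, breaks = 4)

bin_counts <- table(attendance_bin)     # games in each bin
bin_probs  <- prop.table(bin_counts)    # P(bin) = games in bin / total games

bin_counts
bin_probs
```

Note that `table()` keeps empty bins, whereas `group_by()` followed by `summarise()` drops them by default, so an attendance range with no games would not appear in a summary like `group3`.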
A rare attendance bin may indicate:
Certain arenas draw larger crowds
Some games attract fewer fans
COVID-era games were capacity-restricted, which can affect attendance
Fast break scoring may correlate with crowd size, since high-energy environments could influence the pace of play and players' nerves.
ggplot(group3, aes(x = attendance_bin, y = avg_fastbreak)) +
  geom_col(fill = "darkorange") +
  labs(title = "Average Fastbreak Points by Attendance Bin",
       x = "Attendance Range",
       y = "Avg Fastbreak Points")
The highest average of fast break points appears in the best-attended games, but the second-highest average is in the second-lowest attendance bin, so I am not sure the hypothesis holds water just yet.
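One way to probe the hypothesis without relying on four bins is a rank correlation computed on the individual games. The sketch below is an assumption about how that could be done, not part of the original analysis: `fastbreak_attendance_cor` is a hypothetical helper, and the toy rows are constructed to be perfectly monotone purely to show the call.

```r
# Hypothetical helper: Spearman rank correlation between attendance and
# combined fast break points, skipping rows with missing values
fastbreak_attendance_cor <- function(games) {
  cor(games$attendance,
      games$pts_fb_home + games$pts_fb_away,
      method = "spearman",
      use = "complete.obs")
}

# Made-up rows, deliberately monotone, only to demonstrate the call
toy_games <- data.frame(
  attendance  = c(12000, 15000, 18000, 20000, NA),
  pts_fb_home = c(8, 10, 12, 14, 9),
  pts_fb_away = c(6, 9, 11, 15, 7)
)

rho <- fastbreak_attendance_cor(toy_games)
rho
```

A value near zero on the real data would support the doubt expressed above, while a clearly positive value would favor the crowd-size hypothesis.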
all_combos <- expand.grid(
  team_abbreviation_home = unique(NBA_Data$team_abbreviation_home),
  wl_home = unique(NBA_Data$wl_home)
)
observed <- NBA_Data |>
  count(team_abbreviation_home, wl_home)
missing <- anti_join(all_combos, observed)
## Joining with `by = join_by(team_abbreviation_home, wl_home)`
missing
## team_abbreviation_home wl_home
## 1 KCK W
It looks as though a Kansas City Kings home win is the only combination missing from this data frame. The team did in fact win games, but it also moved to Sacramento, so it is possible that the relocation affected the data.
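An alternative to building the grid and anti-joining is to make the missing combinations explicit as zero counts with `tidyr::complete()`. This is a sketch on made-up data: `toy_observed` mimics the shape of `observed`, and the abbreviations and counts are invented.

```r
library(dplyr)
library(tidyr)

# Toy counts shaped like `observed`; abbreviations and numbers are made up.
toy_observed <- tibble(
  team_abbreviation_home = c("AAA", "AAA", "BBB"),
  wl_home                = c("W", "L", "L"),
  n                      = c(3L, 2L, 4L)
)

# complete() adds every team x outcome pair that is absent and fills n with 0
with_zeros <- toy_observed |>
  complete(team_abbreviation_home, wl_home, fill = list(n = 0L))

# Rows with n == 0 are the combinations that never occur in the data
never_seen <- filter(with_zeros, n == 0)
never_seen
```

Keeping the zero rows in the table means that absent combinations show up alongside the rare ones when the counts are sorted or plotted.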
observed |>
  arrange(n)
## # A tibble: 83 × 3
## team_abbreviation_home wl_home n
## <chr> <chr> <int>
## 1 SDC W 1
## 2 KCK L 4
## 3 SDC L 5
## 4 NOK L 34
## 5 NOK W 48
## 6 VAN W 66
## 7 UTH L 107
## 8 NOH L 140
## 9 SAN L 153
## 10 VAN L 164
## # ℹ 73 more rows
The San Diego Clippers, like the Kansas City Kings, were not a particularly talented bunch, but their low counts also reflect the relocation issue. The Lakers, by contrast, have not moved from Los Angeles in decades and have had years of sustained success there.
NBA_Data |>
  count(team_abbreviation_home, wl_home) |>
  ggplot(aes(x = n, y = reorder(team_abbreviation_home, n), color = wl_home)) +
  geom_point(size = 3) +
  labs(title = "Dot Plot of Team × Win/Loss Frequencies",
       x = "Count",
       y = "Team")