library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

NBA Dataset

NBA_Data <- read_csv("NBA Dataset for Submission.csv")
## Rows: 46977 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): season_type, game_id, team_abbreviation_home, team_name_home, tea...
## dbl  (65): season_id, season, team_id_home, team_id_away, fgm_home, fga_home...
## date  (1): game_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Numerical Summaries

How many different types of games are there, and what are they?

unique(NBA_Data$season_type)
## [1] "Playoffs"       "All-Star"       "All Star"       "Regular Season"
## [5] "Pre Season"
summary(NBA_Data$season_type)
##    Length     Class      Mode 
##     46977 character character
For the rest of the analysis, All-Star and preseason games are filtered out because they do not feature “legitimate” competition among teams’ everyday starters.
NBA_Data <- NBA_Data |>
  filter(season_type != 'All-Star',
         season_type != 'All Star',
         season_type != 'Pre Season')
That eliminated roughly 1,600 of the original 46,977 records, leaving about 45,400.
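The summary() call above only reports length and class because season_type is a character column. A minimal sketch of tallying the game types and checking how many rows a filter drops, using a small made-up tibble (toy_games, excluded, and the row values are illustrative assumptions, not taken from the NBA data):

```r
library(dplyr)

# Toy data: made-up rows standing in for NBA_Data (illustrative only)
toy_games <- tibble(
  season_type = c("Playoffs", "All-Star", "All Star",
                  "Regular Season", "Regular Season", "Pre Season")
)

# Tally games per type; count() works on character columns, unlike summary()
type_counts <- toy_games |> count(season_type, sort = TRUE)
type_counts

# Keep only competitive games and report how many rows were dropped
excluded <- c("All-Star", "All Star", "Pre Season")
toy_filtered <- toy_games |> filter(!season_type %in% excluded)
rows_dropped <- nrow(toy_games) - nrow(toy_filtered)
rows_dropped
```

Run against the real data, the same pattern would give the exact number of removed records rather than an estimate.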

Average Home Scoring Margin

NBA_Data <- NBA_Data |>
  mutate(home_score_margin = pts_home - pts_away)

summary(NBA_Data$home_score_margin)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -58.000  -6.000   4.000   3.367  12.000  73.000
Home teams outscore away teams by an average of 3.4 points per game, with a median margin of 4 points. This alone does not prove a home court advantage, but it raises interesting questions about its impact.
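One way to check whether an average margin like this is distinguishable from zero is a one-sample t-test. A minimal sketch on simulated margins (the simulated mean, spread, and sample size are assumptions chosen for illustration, not estimates from the data set):

```r
set.seed(42)

# Simulated home margins: assumed mean 3.4 and sd 13, for illustration only
sim_margin <- rnorm(n = 1000, mean = 3.4, sd = 13)

# One-sample t-test of H0: mean home margin = 0
margin_test <- t.test(sim_margin, mu = 0)
margin_test$estimate   # sample mean of the simulated margins
margin_test$conf.int   # 95% confidence interval for the mean
margin_test$p.value    # p-value for H0
```

On the real data the equivalent call would be t.test(NBA_Data$home_score_margin, mu = 0). A small p-value would indicate the average margin is unlikely to be zero, but it would not by itself explain why home teams score more.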

Lingering Questions

1. What impact does home court advantage have on NBA games?

2. How has the introduction of the three-point line changed the game of basketball?

3. Has there been an increase in transition points or up-tempo game play in the last 5-10 years?

Home Court Evaluation

How strong is the advantage for teams winning at home? Is there a sizable difference?

NBA_Data |>
  select(wl_home) |>
  table()
## wl_home
##     L     W 
## 17927 27471
The home team wins about 60.5% of its home games (27,471 of 45,398).
NBA_Data |>
  filter(season_type == "Playoffs") |>
  select(wl_home) |>
  table()
## wl_home
##    L    W 
##  997 1737
That share rises to about 63.5% in the playoffs (1,737 of 2,734).
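The win percentages can be computed directly from the two table() outputs above, and the playoff and regular-season rates can be compared with a two-sample proportion test. This is a sketch, not part of the original analysis. It assumes that only "Playoffs" and "Regular Season" games remain after the earlier filter, so the regular-season counts are the overall counts minus the playoff counts:

```r
# Win/loss counts copied from the table() outputs above
all_games <- c(L = 17927, W = 27471)
playoffs  <- c(L = 997,   W = 1737)

# Home win percentage = wins / total games
win_pct_all      <- all_games[["W"]] / sum(all_games)
win_pct_playoffs <- playoffs[["W"]] / sum(playoffs)
round(c(all = win_pct_all, playoffs = win_pct_playoffs), 3)

# Assumption: the non-playoff games are all regular-season games
regular <- all_games - playoffs

# Two-sample test of equal home win proportions (playoffs vs regular season)
rate_test <- prop.test(
  x = c(playoffs[["W"]], regular[["W"]]),
  n = c(sum(playoffs), sum(regular))
)
rate_test$p.value
```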
Is there a correlation between attendance and winning?
NBA_Data2 <- NBA_Data |>
  filter(!is.na(attendance))

NBA_Data2 <- NBA_Data2 |>
  mutate(win_binary = ifelse(wl_home == "W", 1, 0))

cor_val <- cor(NBA_Data2$attendance, NBA_Data2$win_binary, use = "complete.obs")

cor_val
## [1] 0.02911234
ggplot(NBA_Data2, aes(x = wl_home, y = attendance, fill = wl_home)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.15, alpha = 0.4) +
  labs(
    title = "Attendance vs Home Team Result",
    subtitle = paste("Point-biserial correlation:", round(cor_val, 3)),
    x = "Home Team Result (W/L)",
    y = "Attendance"
  ) +
  theme_minimal()

Not according to this data. A correlation of about 0.03 indicates almost no linear relationship between attendance and home wins.
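A correlation coefficient on its own does not say whether it differs meaningfully from zero. cor.test() reports the estimate together with a p-value and confidence interval. A minimal sketch on simulated data (the attendance distribution, win probability, and sample size are assumptions for illustration only):

```r
set.seed(1)

# Simulated data for illustration only: attendance generated independently of the result
n <- 500
sim <- data.frame(
  attendance = round(rnorm(n, mean = 17000, sd = 2500)),
  win_binary = rbinom(n, size = 1, prob = 0.6)
)

# Pearson correlation with a 0/1 variable is the point-biserial correlation
pb_test <- cor.test(sim$attendance, sim$win_binary)
pb_test$estimate  # correlation coefficient
pb_test$conf.int  # 95% confidence interval
pb_test$p.value   # test of H0: correlation = 0
```

On the real data the call would be cor.test(NBA_Data2$attendance, NBA_Data2$win_binary). With tens of thousands of games, even a very weak correlation can be statistically significant, so the size of the estimate matters more than the p-value.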

Changes Due to the Three Point Line

I think the league has made more threes as the years have gone on, but let’s see.

NBA_Data |>
  mutate(total_threes = fg3m_home + fg3m_away) |>
  group_by(season) |>
  summarise(total_threes = sum(total_threes)) |>
  ggplot(aes(x = as.numeric(substr(season, 1, 4)), y = total_threes)) +
  geom_line(linewidth = 1.2, color = "steelblue") +
  geom_point(color = "steelblue") +
  labs(
    title = "Total Threes by Season",
    x = "Season",
    y = "Total Threes"
  ) +
  theme_minimal()

This graph supports my hypothesis, but it also reminds me of an important detail I had forgotten: the lockout that shortened the 2011-12 season, which cut the number of games significantly.
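Because season totals depend on how many games were played, a shortened season lowers the total even if teams make just as many threes per game. Dividing by the number of games makes seasons comparable. A minimal sketch on made-up rows (toy_games and its values are illustrative assumptions; only the column names fg3m_home, fg3m_away, and season come from the data set):

```r
library(dplyr)

# Toy data: made-up games, with fewer games in the second season
toy_games <- tibble(
  season    = c(2010, 2010, 2010, 2011, 2011),
  fg3m_home = c(6, 8, 7, 9, 7),
  fg3m_away = c(5, 7, 6, 8, 8)
)

threes_by_season <- toy_games |>
  mutate(total_threes = fg3m_home + fg3m_away) |>
  group_by(season) |>
  summarise(
    games           = n(),                   # games played that season
    season_threes   = sum(total_threes),     # raw total, sensitive to schedule length
    threes_per_game = season_threes / games  # comparable across seasons
  )
threes_by_season
```

In this toy example the shorter season has a lower total but a higher per-game average, which is the pattern a shortened season can hide in a plot of raw totals.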

Transition Points in the Last 5 Seasons

NBA_Data |>
  filter(
    !is.na(pts_fb_home),
    !is.na(pts_fb_away),
    as.numeric(substr(season, 1, 4)) >= 2017
  ) |>
  mutate(total_transition = pts_fb_home + pts_fb_away) |>
  group_by(season) |>
  summarise(total_transition = sum(total_transition)) |>
  ggplot(aes(x = as.numeric(substr(season, 1, 4)), y = total_transition)) +
  geom_line(linewidth = 1.2, color = "steelblue") +
  geom_point(color = "steelblue") +
  labs(
    title = "Total Fastbreak Points by Season",
    x = "Season",
    y = "Total Transition Points"
  ) +
  theme_minimal()

The dip in fast-break/transition points that coincides with the global pandemic is fascinating. I wonder whether the reduced number of games or the NBA’s bubble format affected the speed of play, even though neither seemed to affect the three-point volume discussed above.
I want to explore the dip further in later analyses to see what other statistical measures or game states could have been affected over this range of seasons.